I’ve seen plenty of discussion about pitch classification recently, both on Twitter and in online forums. In their AL Wild Card game chat on FanGraphs, former Astros executive Kevin Goldstein, ZiPS creator Dan Szymborski, and Roster Resource writer Jon Becker expressed their dissatisfaction with current pitch classification systems.
Becker made a good point: classifying pitches can be a real challenge.
Pitch classification is one of the few areas of baseball where, no matter what story the data tells, if the pitcher has a differing opinion, the pitcher is correct. Fans might see on the scoreboard that the pitcher threw a slider, while the data suggests it was a true cutter.
This is due to the way Major League Baseball classifies their pitchers’ pitches.
Rather than zooming in on a pitcher’s hand and having a league employee memorize grips, each stadium runs a neural network model that tracks the pitch’s metadata. The system takes several features into account and instantly displays the pitch type on the stadium’s scoreboard. The network is trained on each pitcher’s recognized pitches, and based on that training data, it assigns a pitch type to the delivery.
My biggest pet peeve about the system is the possibility of a disagreement between the pitcher and the system. If the movement, velocity, spin rate, and efficiency all classify the pitch as one thing but the pitcher disagrees, the data doesn’t matter.
That’s not to discount the incredible job Major League Baseball does in classifying these pitches. They’re doing everything right: they’re calling pitches the way the pitchers want them called. But, given the pitch’s data, some pitches may end up misclassified.
I wanted to check whether the pitches from the 2021 season were truly classified correctly.
I downloaded every pitch from the 2021 season (50-pitch minimum) along with its average spin rate, average pitch velocity, average horizontal break, average vertical break, and spin efficiency. I then sorted them into seven pitch types: four-seam fastball, sinker, cutter, slider, curveball, changeup, and splitter. I ended up with 1,690 pitches, the majority of them, predictably, fastballs.
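The filtering step above can be sketched in pandas. This is a minimal illustration, not the actual pipeline: the column names and the tiny inline DataFrame are stand-ins for the real Baseball Savant export, which would be loaded with `pd.read_csv`.

```python
import pandas as pd

# Stand-in for the Baseball Savant download; column names are hypothetical.
df = pd.DataFrame({
    "pitch_type": ["Slider", "Cutter", "4-Seam Fastball", "Eephus"],
    "pitch_count": [120, 45, 800, 60],
    "avg_spin_rate": [2450, 2300, 2250, 1400],
    "avg_velocity": [85.1, 89.3, 94.7, 62.0],
    "avg_horizontal_break": [4.2, 2.1, 7.5, 10.0],
    "avg_vertical_break": [34.0, 28.5, 14.2, 60.0],
    "spin_efficiency": [0.35, 0.55, 0.92, 0.80],
})

PITCH_TYPES = ["4-Seam Fastball", "Sinker", "Cutter", "Slider",
               "Curveball", "Changeup", "Splitter"]
FEATURES = ["avg_spin_rate", "avg_velocity", "avg_horizontal_break",
            "avg_vertical_break", "spin_efficiency"]

# Keep only the seven pitch types and apply the 50-pitch minimum.
df = df[df["pitch_type"].isin(PITCH_TYPES) & (df["pitch_count"] >= 50)]
X, y = df[FEATURES], df["pitch_type"]
```

Here the 45-pitch cutter falls below the minimum and the eephus isn’t one of the seven types, so both rows drop out.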
I then ran each pitch through two machine learning algorithms that, given the five aforementioned variables, decided if the pitch was correctly classified as such.
I decided to use two different models, a logistic regression model and a Naive Bayes model, to determine which would work best here. Both are supervised machine learning models; each was trained on 80% of the data and tested on the remaining 20%.
While other analysts might reach for a different classifier, such as a KNN or K-Means model, based on previous work experience I felt more comfortable with logistic regression. It’s one of the most popular machine learning models in Python: given a set of independent variables, it predicts a categorical dependent variable. For my alternate model, I pivoted to Naive Bayes, which works well even when the training data is small, a helpful feature considering some training groups have limited data to work with.
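The two-model setup can be sketched with scikit-learn. This is an assumption-laden sketch, not the author’s actual code: synthetic data stands in for the five pitch features, and the binary label represents “does this pitch belong to its listed type?”

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: five features, binary label (1 = correctly labeled pitch).
X, y = make_classification(n_samples=500, n_features=5, n_informative=4,
                           n_redundant=1, random_state=0)

# The 80/20 train/test split described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Logistic regression: predicts the categorical label from the features.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Gaussian Naive Bayes: tends to hold up when training data is limited.
nb = GaussianNB().fit(X_train, y_train)

print(f"logistic regression accuracy: {logreg.score(X_test, y_test):.3f}")
print(f"naive bayes accuracy:         {nb.score(X_test, y_test):.3f}")
```

`GaussianNB` is one reasonable choice for continuous features like spin rate and break; the original post doesn’t specify which Naive Bayes variant was used.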
Below are the accuracy, precision, recall, and F1 of each pitch type, color-coded by column.
For those confused about the metrics:
- The accuracy of each model is the sum of the true positives and true negatives divided by the total number of samples in the dataset. Accuracy isn’t the best metric on its own, so I wanted to show several other metrics as well.
- Precision is the number of true positives divided by the sum of the true positives and false positives. I’m reporting the precision for both the 0 group (the negative class) and the 1 group (the positive class).
- Recall measures the ability to correctly find the positive samples: the true positives divided by the sum of the true positives and false negatives.
- The F1 score is the harmonic mean of precision and recall: multiply the product of precision and recall by 2, then divide by the sum of precision and recall.
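The four definitions above can be checked on a toy example with scikit-learn’s built-in metric functions (the labels here are made up purely to make the arithmetic easy to follow):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy labels: 1 = "pitch belongs to this type", 0 = "it does not".
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Confusion counts: TP=3, FN=1, TN=4, FP=2.
print(accuracy_score(y_true, y_pred))   # (3 + 4) / 10 = 0.7
print(precision_score(y_true, y_pred))  # 3 / (3 + 2) = 0.6
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2 * (0.6 * 0.75) / (0.6 + 0.75) ≈ 0.667
```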
Lower accuracy on sliders and cutters was one area of ambiguity I expected from the models. Based on my previous research, cutters and sliders can be difficult to differentiate: a slider’s movement profile and velocity can be extremely similar to a cutter’s, despite the slider label.
The Mets, in particular, have a group of pitchers who throw a cutter/slider hybrid: high-velocity sliders that move in a unique way.
This scatter plot shows the movement profile of the Mets’ sliders, compared to the movement profile of other teams’ sliders.
The group of small orange dots on the lower left side are the Mets’ sliders. For a further in-depth analysis: check out my previous work on the subject here.
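A scatter plot like the one described can be produced with matplotlib. This is a sketch with fabricated movement numbers, not the real Mets data; the axis choices (horizontal vs. vertical break) match the features used elsewhere in the post.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Fabricated slider-movement clusters purely for illustration.
rng = np.random.default_rng(1)
league = rng.normal(loc=[4.0, 36.0], scale=[2.0, 4.0], size=(60, 2))
mets = rng.normal(loc=[1.5, 28.0], scale=[0.8, 2.0], size=(10, 2))

fig, ax = plt.subplots()
ax.scatter(league[:, 0], league[:, 1], label="Other teams' sliders")
ax.scatter(mets[:, 0], mets[:, 1], s=12, label="Mets' sliders")
ax.set_xlabel("Horizontal break (in.)")
ax.set_ylabel("Vertical break (in.)")
ax.legend()
fig.savefig("slider_movement.png")
```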
While the hybrid works well for the Mets on the field, the ambiguity between their sliders and cutters caused the accuracy metrics for those specific pitches to decrease.
Classifying sinkers looked to be an issue for both of my models. Baseball Savant, my main data source, doesn’t separate out two-seamers in its CSV downloads: it classifies sinkers and two-seamers interchangeably. As a result, the sinker group (which was included in my model) can contain both sinkers and two-seamers, two pitches whose data can differ significantly.
I also noticed that fastballs and splitters were two pitches the models were split on, granted to a small degree. As with the sinkers above, the raw fastball data may carry a slight inaccuracy because of the source data.
Splitters have similar movement profiles to changeups, a reason why I think their accuracy scores were lower. Current Twins Analyst Ethan Moore previously touched on this topic, discussing how the behaviors of both pitches can confuse classification systems.
Unfortunately, there’s no specific way to know exactly why these pitches were misclassified; these machine learning models don’t let me check which pitch type would’ve fit better.
Additionally, there are certainly limitations to these models that I’d like to address. Each model was only run for one iteration; the metrics might change a small amount across additional runs, since the training and testing data are randomly selected. Some pitch types, like splitters, also have little training data because of the 50-pitch minimum; using a Naive Bayes model was my attempt to control for the limited dataset.
That being said, both models certainly performed better than expected. Going in, I expected the average classification to be around 85% correct, but I’m glad both models posted average accuracy scores significantly higher.
If you have any questions/comments/tips on this post, feel free to let me know!
Github link to all the relevant code and data.
Picture Credit: Sports Illustrated