While baseball has always been a game of numbers, the growth and availability of large datasets in recent years has let both public analysts and teams draw conclusions about the game that were previously out of reach.
As the data has grown and become more accessible, a branch of machine learning called deep learning has become increasingly popular in baseball, as it has in many other fields.
Deep learning refers to the use of deep neural networks that learn patterns in data and then use those patterns to make predictions. Feedforward neural networks (see image below) take inputs (A, B, C, D) and feed them through each of the hidden layers (red, green and blue) in turn until they reach the output layer.
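To make the idea of "feeding inputs through the layers" concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, the random weights and the input values are all made up for illustration; this is not the network used later in the post, just the general mechanism.

```python
import math
import random

def relu(values):
    # Element-wise ReLU: negative values become zero.
    return [max(0.0, v) for v in values]

def softmax(values):
    # Turn raw output scores into probabilities that sum to 1.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    # weights[j] holds the incoming weights of unit j in this layer.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def make_layer(n_in, n_out, rng):
    # Hypothetical random weights; a trained network would learn these.
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(inputs, layers):
    # Pass the activations through every hidden layer, then the output layer.
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    return softmax(dense(activations, out_weights, out_biases))

rng = random.Random(0)
sizes = [4, 5, 5, 5, 3]  # 4 inputs (A, B, C, D), 3 hidden layers, 3 output classes
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
probabilities = forward([0.2, -1.0, 0.5, 1.5], layers)
print(probabilities)
```

The output is one probability per class. With untrained random weights the numbers are meaningless, which is exactly why training is needed.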
Feedforward neural networks for supervised learning tasks are trained with a very powerful method known as backpropagation. A full explanation of backpropagation and the calculus involved is very interesting, but it is beyond the scope of this post. It is helpful, however, to know that backpropagation works out how much each weight in the network contributed to the model's error, so that the weights can be adjusted to reduce that error during training.
As a result, neural networks give us extremely powerful tools for modeling complex relationships in data. While these methods have become extremely popular over the last few years, I think we are only beginning to scratch the surface with them, and their popularity will only continue to grow as more data further improves their performance.
As I’ve begun to appreciate both the flexibility offered by neural networks and their success in a multitude of industries, I’ve spent a lot of time considering the many different ways in which they can be applied to baseball and our understanding of it. While neural networks can be applied to supervised, semi-supervised and even unsupervised learning tasks, I have decided to use them here to build a model for predicting which pitch a pitcher is likely to throw next. This would be extremely useful for teams, since a batter who knows which pitch is likely to come next has a huge advantage, and I know many MLB teams are already using machine learning for similar tasks.
For this article, I will be building my neural networks using data provided by Baseball Savant from the 2020 season. While all of my formal education in deep learning has been in Python, I will be using R here to remain consistent with my previous baseball work and because it provides helpful packages for building and assessing machine learning classification models. These packages also allow the use of TensorFlow and typically seem to cope well with class imbalance. Class imbalance occurs when the classes in a dataset are not equally represented, and it is very likely to be an issue here since a pitcher will almost never throw his pitches in equal proportions.
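As a rough illustration of class imbalance, the sketch below counts how often each class appears and derives inverse-frequency class weights, a common heuristic for making rare classes count more during training. The labels are toy data invented for the example, not Bieber's actual pitch mix, and the weighting scheme is an assumption rather than what the R packages do internally.

```python
from collections import Counter

def class_weights(labels):
    # Weight each class by n_samples / (n_classes * class_count),
    # so that rarer classes receive larger weights.
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Toy, made-up labels: 6 fastballs, 3 breaking balls, 1 off-speed pitch.
toy_labels = ["fastball"] * 6 + ["breaking"] * 3 + ["offspeed"] * 1

weights = class_weights(toy_labels)
for label, weight in sorted(weights.items()):
    print(f"{label:9s} weight={weight:.3f}")
```

A perfectly balanced dataset would give every class a weight of 1; the further a weight sits from 1, the more imbalanced that class is.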
For this post, I will be focusing on Shane Bieber since he was near the top of the leaderboard for total pitches thrown last season, with 1238. For simplicity, I will break his pitches down into 3 categories (fastball, breaking and off-speed). Bieber's total distribution of pitches in each category for the 2020 season is shown below:
To build my neural network, I will use the following input features: batter handedness, current base-out state, current score of the game, previous pitch thrown by Bieber, the count, and the length of the current at-bat. All of these should provide useful information when attempting to predict which pitch is likely to be thrown next. I will randomly split my data and use 80% to train the model (training set) and the remaining 20% to test its performance (test set). My initial model will use the Leaky ReLU activation function in each of the neural network's six hidden layers. Leaky ReLU differs slightly from regular ReLU in that it passes a small, non-zero output for negative inputs instead of zero, but both offer benefits over more traditional activation functions such as sigmoid and tanh, which flatten out for large inputs and can therefore slow down learning.
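The two ideas in this paragraph, the activation functions and the random 80/20 split, can be sketched in a few lines of Python. The slope `alpha=0.01` for Leaky ReLU and the seed are assumptions chosen for illustration (the post does not state them), and `train_test_split` here is a hypothetical helper, not the R function used for the actual model.

```python
import random

def relu(x):
    # Regular ReLU: negative inputs are clipped to zero.
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: negative inputs keep a small slope (alpha) instead of zero.
    return x if x > 0 else alpha * x

def train_test_split(rows, test_fraction=0.2, seed=42):
    # Shuffle a copy of the rows, then hold out the first test_fraction of them.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

for x in (-2.0, 0.0, 3.0):
    print(f"x={x:5.1f}  relu={relu(x):5.2f}  leaky_relu={leaky_relu(x):5.2f}")

# Row indices stand in for the 1238 pitches; real rows would hold the features.
train_rows, test_rows = train_test_split(range(1238))
print(len(train_rows), len(test_rows))
```

The exact number of held-out pitches depends on how the split is drawn and rounded, so it will not necessarily match the test-set size reported for the real model.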
One of the easiest and most common ways to evaluate a supervised classification model is with a confusion matrix. A confusion matrix shows the number of correct and incorrect predictions the model made on the test set, broken down by class. It also allows for quick computation of statistics that can be used to judge how the model did, such as overall accuracy, sensitivity (true positive rate) and specificity (true negative rate).
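For readers who have not built one before, here is a minimal sketch of how a confusion matrix is tallied from predicted and actual labels. The six labels are toy values made up for the example. Rows are predictions and columns are the reference (actual) classes, matching the layout used in this post.

```python
def confusion_matrix(predicted, actual, classes):
    # matrix[i][j] counts cases predicted as classes[i] whose true class is classes[j].
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[index[p]][index[a]] += 1
    return matrix

# Toy, made-up labels (0 = breaking, 1 = fastball, 2 = off-speed).
actual = [0, 0, 1, 1, 1, 2]
predicted = [0, 1, 1, 1, 0, 2]

matrix = confusion_matrix(predicted, actual, classes=[0, 1, 2])
for row in matrix:
    print(row)

# Correct predictions sit on the diagonal.
correct = sum(matrix[k][k] for k in range(len(matrix)))
accuracy = correct / len(actual)
print(f"accuracy = {accuracy:.3f}")
```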
My model will make predictions as either 0’s (breaking balls), 1’s (fastballs) or 2’s (off-speed pitches). Using the confusion matrix below and the reference columns, it can easily be computed that out of the 251 pitches (~20% of 1238) in the test set, there were 94 (50 + 40 + 4) total breaking balls, 140 (34 + 101 + 5) total fastballs and 17 (2 + 11 + 4) off-speed pitches, or roughly 37%, 56% and 7% of the test set. This shows us that the test set is a good representation of the overall dataset.
The confusion matrix can also be used to see the predictions made by my model. Using the prediction rows, we can determine that my model predicts 86 (50 + 34 + 2) breaking balls, 152 (40 + 101 + 11) fastballs, and only 13 (4 + 5 + 4) off-speed pitches. If you were given only these totals, you might think the model performed fine and is ready to be deployed. Looking at the individual cells of the confusion matrix, however, we can see that many of the predictions were unfortunately incorrect.
Judging by the above, it is clear my model performed best at predicting fastballs: out of the 140 total fastballs in my test set, 101 (about 72%) were predicted correctly. The model did worse, however, at predicting breaking pitches (50 out of 94, about 53%) and much worse at predicting off-speed pitches (4 out of 17, about 24%).
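The totals and per-class rates quoted above can be reproduced from the nine cell counts given in the text. The sketch below rebuilds the matrix from those counts (rows are predictions, columns are reference classes) and computes one-vs-rest sensitivity and specificity for each class. The function name `per_class_rates` is just a hypothetical helper for this example.

```python
CLASSES = ["breaking", "fastball", "off-speed"]

# Cell counts as quoted in the text: rows = predicted class, columns = actual class.
MATRIX = [
    [50, 34, 2],    # predicted breaking
    [40, 101, 11],  # predicted fastball
    [4, 5, 4],      # predicted off-speed
]

def per_class_rates(matrix):
    # One-vs-rest sensitivity and specificity for every class.
    total = sum(sum(row) for row in matrix)
    rates = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        predicted_k = sum(matrix[k])             # row total
        actual_k = sum(row[k] for row in matrix)  # column total
        fp = predicted_k - tp
        fn = actual_k - tp
        tn = total - tp - fp - fn
        rates.append({
            "actual": actual_k,
            "predicted": predicted_k,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        })
    return rates

rates = dict(zip(CLASSES, per_class_rates(MATRIX)))
for name, r in rates.items():
    print(f"{name:10s} actual={r['actual']:3d} predicted={r['predicted']:3d} "
          f"sensitivity={r['sensitivity']:.3f} specificity={r['specificity']:.3f}")
```

Sensitivity answers "of the pitches that really were this type, how many did the model catch?", which is the comparison made in the paragraph above.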
In terms of accuracy, the 61.75% overall accuracy of this model (155 correct predictions out of 251) is not very impressive and not good enough for the model to be deployed in a real-life setting. It is very likely that, given all the information used as inputs, people with an understanding of the game of baseball could do better than my model at predicting which pitch is likely to be thrown next.
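One way to put the accuracy figure in context is to compare it with a naive baseline that always guesses the most common class. The sketch below does that using the cell counts quoted earlier; the baseline is my own illustrative comparison and not something computed in the original analysis.

```python
# Cell counts as quoted in the text: rows = predicted class, columns = actual class.
MATRIX = [
    [50, 34, 2],    # predicted breaking
    [40, 101, 11],  # predicted fastball
    [4, 5, 4],      # predicted off-speed
]

total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[k][k] for k in range(len(MATRIX)))
accuracy = correct / total

# Column totals give the true number of pitches of each type in the test set.
actual_totals = [sum(row[k] for row in MATRIX) for k in range(len(MATRIX))]

# Baseline: always predict the most common pitch type (fastball here).
baseline_accuracy = max(actual_totals) / total

print(f"model accuracy:           {accuracy:.2%}")
print(f"always-fastball baseline: {baseline_accuracy:.2%}")
```

On these counts the model scores 61.75% against 55.78% for the always-fastball guess, so most of its accuracy comes from the fact that fastballs are simply the most common pitch.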
Improving Model Performance
While overall accuracy isn’t the only metric that can be used to judge the performance of classification models, it seems very likely that a model with much better performance than this could be developed. This might be accomplished by adjusting some of the model's hyperparameters, or it may be that training the model requires a larger dataset than the shortened 2020 season provided.
Are you able to provide a link to your dataset? I want to try adding this https://roadtolarissa.com/oracle/. Based on some of your thoughts on pitch tunneling and sequencing, I like how you incorporated previous pitch.
60% is pretty good, you might get a call from the Astros soon enough