When it comes to two-pitch sequencing, one of the most well-founded strategies is pitch tunneling. The idea is that if you throw two consecutive pitches that follow a similar path up to the point where the batter has to decide whether to swing, and the paths then diverge as the balls cross the plate, the result should be more swings and misses, or at least a batter whose perception of the pitch is thrown off. The effectiveness of this approach is shown well in the heat map below from Jon Roegele’s article, “The Effects of Pitch Sequencing,” where he shows that the second pitch of a well-tunneled two-pitch sequence has a significantly higher rate of swinging strikes.
One cool finding in Roegele’s article was that the band of high-SwStr% pitches is nonlinear: the closer together two consecutive pitches are at the batter’s decision point, the less distance is needed between them by the time they reach the plate.
But I think more aspects of a pitch’s flight need to be looked at to make the tunneling strategy rigorous. One thing that initially comes to mind is the inconsistency of the 33 ft decision point. If two pitches are coming in at different velocities, that will affect the point in each pitch’s flight at which the batter has to make a decision. Check out this still frame from BaseballCloud’s PitchR showing the difference in distance to home plate of two different pitch types after being released at the same time.
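To put a rough number on that idea, here is a minimal sketch of how far apart two pitches are at the same instant after release. It assumes constant speed (no drag), a 55 ft release distance, and example velocities of 95 and 85 mph; none of those values come from the data in this post, and `distance_to_plate` is a made-up helper name.

```python
# Conversion factor: 1 mph = 5280 ft / 3600 s
FT_PER_SEC_PER_MPH = 5280 / 3600


def distance_to_plate(release_distance_ft, velo_mph, seconds_after_release):
    """Distance from home plate at a given time, assuming constant speed (no drag)."""
    travelled_ft = velo_mph * FT_PER_SEC_PER_MPH * seconds_after_release
    return max(release_distance_ft - travelled_ft, 0.0)


RELEASE_FT = 55.0   # assumed release distance from the plate
ELAPSED_S = 0.175   # assumed time after release

fastball = distance_to_plate(RELEASE_FT, 95.0, ELAPSED_S)
changeup = distance_to_plate(RELEASE_FT, 85.0, ELAPSED_S)
gap = changeup - fastball  # how much farther from the plate the slower pitch is

print(f"fastball: {fastball:.2f} ft, changeup: {changeup:.2f} ft, gap: {gap:.2f} ft")
```

Under these assumptions the slower pitch is about 2.6 ft farther from the plate at the same moment, so a fixed 33 ft checkpoint corresponds to a different amount of reaction time for each pitch.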
In an attempt to see what other factors contribute to a successful two-pitch sequence, I got some 2019 MLB pitch data from Baseball Savant and started digging around. I first wanted to see how many instances I could find of very well-tunneled two-pitch sequences that ended unfavorably for the pitcher, using the criteria: distance between pitches at the decision point < 0.1 ft and distance at the plate > 2 ft. There were a lot. One good example is this Joey Gallo AB from April 19, 2019. Reymin Guduan gets Gallo on a check-swing strike on the first pitch and tunnels the next pitch very well, but Gallo barrels it up and clobbers a deep homer.
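The tunneling criteria above can be sketched as a simple filter. This is only an illustration: the dictionary keys `decision_xz` and `plate_xz`, the helper names, and the sample coordinates are all made up, and it assumes each pitch's location is available as an (x, z) pair in feet at both checkpoints.

```python
import math


def gap_ft(point_a, point_b):
    """Straight-line distance in feet between two (x, z) locations."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


def is_well_tunneled(first, second, max_decision_gap=0.1, min_plate_gap=2.0):
    """True when the pitches are close at the decision point but far apart at the plate."""
    decision_gap = gap_ft(first["decision_xz"], second["decision_xz"])
    plate_gap = gap_ft(first["plate_xz"], second["plate_xz"])
    return decision_gap < max_decision_gap and plate_gap > min_plate_gap


# Made-up locations, not real Statcast rows
pitch_1 = {"decision_xz": (0.50, 4.00), "plate_xz": (0.80, 3.20)}
pitch_2 = {"decision_xz": (0.55, 4.05), "plate_xz": (-0.70, 1.60)}
pitch_3 = {"decision_xz": (1.20, 4.60), "plate_xz": (-0.70, 1.60)}

print(is_well_tunneled(pitch_1, pitch_2))  # close early, far apart late
print(is_well_tunneled(pitch_1, pitch_3))  # already far apart at the decision point
```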
There must be a way to explain or show statistically why this well-tunneled sequence and many others like it fell short. To do this, I made a series of logistic regression models that predict a swinging strike on the second pitch of a two-pitch sequence. The first model I looked at regressed a binary swinging-strike indicator on the distance between two consecutive pitches at the decision point and the distance between the pitches at home plate. I wanted to check that the p-values of the predictors showed they are in fact associated with a swinging strike. As you can see below, the p-values (the numbers under “Pr(>|t|)”) are very small, which indicates a statistically significant relationship between a well-tunneled pitch and a swinging strike outcome.
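For readers who have not worked with logistic regression, here is a minimal sketch of how such a model turns the two distance measurements into a whiff probability. The intercept and coefficients below are placeholders chosen only for illustration, not the fitted values from this model, and `whiff_probability` is a hypothetical helper.

```python
import math


def whiff_probability(intercept, coefs, features):
    """Logistic model: probability = 1 / (1 + exp(-(intercept + sum(coef * x))))."""
    log_odds = intercept + sum(coefs[name] * features[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder coefficients (NOT the fitted values from the post)
placeholder_coefs = {"distance_decision": -1.5, "distance_plate": 0.4}

# One made-up sequence: tight at the decision point, spread out at the plate
sequence = {"distance_decision": 0.07, "distance_plate": 2.19}

p = whiff_probability(-2.0, placeholder_coefs, sequence)
print(f"predicted whiff probability: {p:.3f}")
```

The fitted model does the same arithmetic with its estimated coefficients; the p-values only say whether each coefficient is distinguishable from zero.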
Great. Now what else is there to look at when it comes to the relationship between two consecutive pitches? I created as many difference measurements as possible between the two pitches in a sequence, such as difference in velocity, difference in vertical break, and so on. Then I plugged them into a logistic model with a swinging strike as the outcome. I also included interaction variables between the difference metrics and the distance measurements between pitches at home plate and at the decision point, in case the effect of one variable on a swinging strike outcome depends on the value of another. The idea here is to build a big model where most predictors are not going to be statistically significant and then use model diagnostics to boil it down and see precisely which aspects of a two-pitch sequence are really important in achieving a swinging strike outcome. This is the model summary of the big logistic regression:
Now, using an algorithm called backward stepwise elimination, I can minimize the model’s AIC, which is a single-number estimator of prediction error used for model comparison. Backward stepwise elimination methodically removes one predictor at a time, based on which removal yields the smallest AIC, and stops once the AIC won’t go any lower. After running through this process and manually removing any predictors with p-values indicating statistical insignificance, this is the final logistic regression model for predicting a whiff based on the characteristics of a two-pitch sequence.
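The elimination loop itself is short. Below is a minimal sketch of it, not the exact routine used for this post: `aic_of` stands in for "refit the logistic model on these predictors and return its AIC", and here it is replaced by a toy function so the example runs on its own. The predictor names `spin_diff` and `release_diff` and all the gain values are made up.

```python
def backward_stepwise(predictors, aic_of):
    """Drop one predictor at a time while doing so lowers the AIC."""
    current = list(predictors)
    best_aic = aic_of(current)
    while current:
        # Score every model that has exactly one predictor removed
        candidates = [
            (aic_of([p for p in current if p != drop]), drop) for drop in current
        ]
        candidate_aic, drop = min(candidates)
        if candidate_aic >= best_aic:
            break  # no single removal improves the AIC, so stop
        current.remove(drop)
        best_aic = candidate_aic
    return current, best_aic


# Toy stand-in for model fitting: AIC = 2k - 2 * log-likelihood, where each
# predictor adds a fixed, made-up amount of log-likelihood.
TOY_GAIN = {
    "velo_diff": 40.0,
    "distance_plate:distance_decision": 25.0,
    "spin_diff": 0.3,
    "release_diff": 0.1,
}


def toy_aic(predictors):
    return 2 * len(predictors) - 2 * sum(TOY_GAIN[p] for p in predictors)


kept, final_aic = backward_stepwise(list(TOY_GAIN), toy_aic)
print(kept, final_aic)
```

In the toy setup the two weak predictors are dropped because each costs more in the AIC penalty than it adds in fit. With real data, `aic_of` would be whatever refits the model.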
Now that we’ve got a simple and cohesive model, let’s look a bit deeper. The pseudo-R-squared for this model is 0.6174. A pseudo-R-squared is not literally the share of variance explained the way R-squared is in a linear regression, but it can be read loosely the same way: the variables in the model account for a large part of what separates two-pitch sequences that end in a swinging strike from those that don’t. I think this is pretty good for our purposes and shows that this model isn’t just taking a shot in the dark.
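There are several pseudo-R-squared definitions, and I have not specified which one is reported here. As an illustration only, the sketch below computes McFadden's version, which compares the model's log-likelihood to that of an intercept-only model. The outcomes and predicted probabilities are made up.

```python
import math


def log_likelihood(outcomes, probs):
    """Bernoulli log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(outcomes, probs)
    )


def mcfadden_r2(outcomes, model_probs):
    """McFadden's pseudo-R-squared: 1 - LL(model) / LL(intercept-only)."""
    base_rate = sum(outcomes) / len(outcomes)  # intercept-only prediction
    ll_null = log_likelihood(outcomes, [base_rate] * len(outcomes))
    ll_model = log_likelihood(outcomes, model_probs)
    return 1.0 - ll_model / ll_null


# Made-up whiff outcomes (1 = swinging strike) and model predictions
whiffs = [1, 0, 1, 0]
predicted = [0.9, 0.1, 0.9, 0.1]

print(round(mcfadden_r2(whiffs, predicted), 3))
```

A model that predicts no better than the overall whiff rate scores 0 on this measure, and one that predicts every outcome with near certainty approaches 1.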
The first thing I want to draw attention to about the predictors in the model is that the distance between consecutive pitches at the decision point and the distance at home plate were both dropped as standalone terms, and only the interaction variable “distance_plate:distance_decision” remains. This makes sense: if two consecutive pitches are very far apart as they cross the plate but not well tunneled 33 ft out, that’s just wild pitching, not strategic sequencing, and the interaction variable picks up that mutual relationship.
Another interesting result is that the model coefficient corresponding to “velo_diff” is negative. If the first pitch in a sequence has a higher velocity than the second, then “velo_diff” will be negative. Because the coefficient is multiplied by the value of “velo_diff”, a negative coefficient times a negative “velo_diff” gives a positive contribution, which increases the predicted probability of a swinging strike; a positive “velo_diff” does the opposite. Therefore, the model favors the second pitch in a sequence being off-speed or slower.
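The sign logic is easy to check with a couple of lines of arithmetic. This sketch assumes “velo_diff” is the second pitch's velocity minus the first's, which matches the description above, and uses a placeholder coefficient of -0.05 rather than the fitted value. It also ignores the “velo_diff:distance_decision” interaction term.

```python
def velo_term(coef, first_mph, second_mph):
    """Contribution of velo_diff to the log-odds of a whiff (main effect only)."""
    velo_diff = second_mph - first_mph  # negative when the second pitch is slower
    return coef * velo_diff


PLACEHOLDER_COEF = -0.05  # assumed value; only the negative sign matters here

fast_then_slow = velo_term(PLACEHOLDER_COEF, first_mph=95.0, second_mph=85.0)
slow_then_fast = velo_term(PLACEHOLDER_COEF, first_mph=85.0, second_mph=95.0)

print(fast_then_slow)  # positive: raises the predicted whiff probability
print(slow_then_fast)  # negative: lowers it
```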
The last note I want to make on this final model is that the only other interaction variable included is “velo_diff:distance_decision”, which demonstrates the relationship I alluded to earlier with the visualization from PitchR. How well tunneled a pitch sequence is doesn’t depend only on the location of two consecutive pitches 33 ft before home plate. The difference in velocity of the two pitches plays a role in determining where precisely that decision point is and, subsequently, in the effectiveness of the tunneling. I think as more data recording devices shift to reporting the location of the ball in flight in terms of time after release rather than at an arbitrary distance from home plate, we will be able to do much more thorough analysis of sequencing and tunneling.
To close out this post, I want to circle back to the Joey Gallo AB. If I plug that two-pitch sequence into the initial logistic regression model, which has only the distance between the two pitches at the decision point and at home plate, I get a predicted probability of 0.5066 of a swinging strike. As you know, Gallo did not swing and miss; he actually put the ball into orbit. If I plug the same sequence into the final model, which incorporates slightly more aspects of the two consecutive pitches, I get a predicted whiff probability of 0.4176. This just goes to show that tunneling two pitches well will only get you so far and there is always more strategy that can be involved when it comes to crafting pitch sequences. I look forward to seeing how pitch sequencing will evolve as we get access to more data sources, do new research, and try new things.