In the field of statistics, we use the law of large numbers to explain that as the size of a sample increases, the sample average tends to fall closer to the mean of the population as a whole. A proof of this law has existed since the publication of Jacob Bernoulli’s Ars Conjectandi in the early 18th century.
While a lot of what happens on a baseball field can be attributed to skill, there are many situations where the law of large numbers applies. One of the most prominent is batting average on balls in play, more commonly referred to as BABIP.
BABIP has long been discussed by the baseball analytics community and is a simple measure of how often the balls a player hits into play fall for hits. If you’re reading this article, I’m sure you are familiar with the concept, but if not, FanGraphs has a great primer on it.
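For readers who want the arithmetic spelled out, here is a minimal sketch of the commonly cited BABIP formula, (H - HR) / (AB - K - HR + SF). The function name `babip` and the stat line below are made up for illustration and do not belong to any real player.

```python
def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    # Home runs and strikeouts are removed because the ball never
    # reaches the fielders; sacrifice flies are added back because
    # they are balls in play that do not count as at-bats.
    balls_in_play = at_bats - strikeouts - home_runs + sac_flies
    if balls_in_play <= 0:
        raise ValueError("no balls in play for this stat line")
    return (hits - home_runs) / balls_in_play


# Hypothetical stat line, invented for illustration only.
example = babip(hits=150, home_runs=20, at_bats=550, strikeouts=120, sac_flies=5)
print(round(example, 3))  # 130 hits on 415 balls in play -> 0.313
```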
In Major League Baseball, league average BABIP will typically be around .300 for a given season, though it has fallen a little below that mark in each of the last few seasons.
[Table: league average BABIP by season]
While it is undeniable that certain skills, such as consistently hitting for a high exit velocity or having a faster average sprint speed, contribute to a player’s ability to hit for a higher BABIP, it has long been known that there is some element of luck involved. This is especially the case when we are dealing with a small number of balls in play, as the shortened 2020 season provided. The distribution of BABIPs for qualified hitters across Major League Baseball typically follows something close to a normal, or Gaussian, distribution with a mean of around .300.
Players who hit for a BABIP well above or below the league average mark over a small number of plate appearances are likely to experience some convergence towards the mean. That is, given more plate appearances and balls hit into play, their overall BABIP is more likely to end up closer to the league average mark of about .300 than it is to remain at an extreme value. Even so, there are many examples of players who have the ability to consistently hit for a BABIP much higher (or lower) than league average.
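To make the small-sample point concrete, here is a rough simulation sketch. It assumes, as a deliberate simplification, that every ball in play is an independent trial with the same .300 chance of falling for a hit; real hitters are not that uniform. The function names, sample sizes, and random seed are my own choices for illustration.

```python
import math
import random


def simulate_babip(true_rate, balls_in_play, rng):
    """Simulate one hitter-season: each ball in play is a hit with probability true_rate."""
    hits = sum(1 for _ in range(balls_in_play) if rng.random() < true_rate)
    return hits / balls_in_play


def babip_std_error(true_rate, balls_in_play):
    """Binomial standard error of an observed rate: sqrt(p * (1 - p) / n)."""
    return math.sqrt(true_rate * (1 - true_rate) / balls_in_play)


TRUE_RATE = 0.300        # assumed "true talent" BABIP for every simulated hitter
SEASONS_PER_SIZE = 300   # simulated hitter-seasons per sample size
rng = random.Random(0)   # fixed seed so reruns give the same numbers

results = {}
for n in (100, 500, 3000):
    samples = [simulate_babip(TRUE_RATE, n, rng) for _ in range(SEASONS_PER_SIZE)]
    results[n] = (min(samples), max(samples))
    print(
        f"{n:>5} balls in play: observed BABIP from {min(samples):.3f} "
        f"to {max(samples):.3f}, theoretical std. error {babip_std_error(TRUE_RATE, n):.3f}"
    )
```

Under these assumptions the standard error at 100 balls in play is roughly .046, so identical .300 hitters can easily look like .250 or .350 hitters over a short stretch, while the range tightens considerably as balls in play accumulate.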
One example of a player who benefited greatly from BABIP luck last season was Tigers shortstop Willi Castro. Among all players with a minimum of 100 plate appearances last season, Castro led Major League Baseball with a .448 BABIP. The next highest BABIP was the .412 mark posted by Mets outfielder Michael Conforto.
Castro’s BABIP was so high that no other player in the 21st century with more than 100 plate appearances in a season has come remotely close to matching it.
If you take all qualified hitters since Major League Baseball’s expansion prior to the 1961 season and exclude 2020, only five hitters during that time period have finished a season with a BABIP higher than .400. The highest BABIP from any qualified hitter over that time period was the .408 mark posted by Rod Carew during the 1977 season.
While Castro does possess some skills (84th percentile in average sprint speed last season) that might enable him to maintain a higher-than-league-average BABIP over the long run, a .448 BABIP is unheard of, and no one has ever been able to maintain anything close to that over a large number of plate appearances. While he was about 50% better than league average offensively last year, it is very likely his offensive production will fall quite a bit this season as a result of not getting as lucky on the balls he hits into play. It should be noted that while it is still incredibly early, he is off to a 6-for-25 start with a .316 BABIP this season as of this writing on April 8th.
Using a dataset provided by Baseball Savant that includes all hitters with at least 100 plate appearances in each of the last three seasons, I set out to better understand what actually appears to contribute to a player’s ability to hit for a higher BABIP. The features in my dataset include each hitter’s average exit velocity, average launch angle, average sprint speed, average home to first time, and age in that season, as well as the average starting location of all the fielders while the hitter was up to bat. Each feature’s Pearson correlation coefficient with BABIP (what we typically mean when discussing correlation coefficients) can be easily calculated and plotted, with red bars indicating a negative Pearson correlation coefficient and blue bars indicating a positive one. Remember, correlation does not necessarily mean causation.
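As a rough illustration of what that calculation involves, here is a minimal pure-Python sketch of the Pearson correlation coefficient. It is not the code used to produce the chart; the function name `pearson_r` is my own, and the sprint speed and BABIP values are made up for demonstration rather than taken from the Baseball Savant dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations; the 1/n factors
    # cancel between numerator and denominator, so they are omitted.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five hypothetical hitters (not real data).
sprint_speed_ft_per_s = [25.5, 26.8, 27.4, 28.1, 29.0]
season_babip = [0.280, 0.295, 0.301, 0.315, 0.330]

r = pearson_r(sprint_speed_ft_per_s, season_babip)
print(f"Pearson r between sprint speed and BABIP: {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. The sign is what determines whether a feature would get a blue or a red bar.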
While launch angle did show a negative correlation with BABIP, the overall effect of launch angle on BABIP is going to depend largely on a number of other factors. Out of all the batted ball types, line drives are still generally going to lead to the highest BABIPs overall.
Looking at the rest of this graph, most of the other elements are about what one would have expected. Two things immediately stood out: average sprint speed (expressed in ft/s on Baseball Savant) showed a stronger correlation with BABIP than average home to first time, and players with higher swing rates typically hit for higher BABIPs in my three-year sample.
All statistics and data courtesy of FanGraphs and Baseball Savant.