At the bottom of Yu Darvish’s Baseball Savant page is a wavy mess: distributions of different colors bobbing up and down, showing the uncertainty in speed that batters face when they step into the box against one of the best pitchers in MLB. It may not be the most practically helpful chart for a pitcher, but it was a visual I had seen a hundred times on Baseball Savant, and I thought it would be fun to recreate.
The main components of this visual are the colorful density plots, one per pitch type. The higher the curve on the y-axis, the more likely Darvish is to throw the given pitch at that speed (shown on the x-axis). The more difficult component proved to be the subtle dotted line. It represents the sum of the individual pitch-type density plots below it, and getting it right required an exploration into the world of smoothing.
Smoothing, as the name suggests, is the process of removing noise from data to better reveal the trends within it. To illustrate the concept, let’s look at some FanGraphs heat maps of Cody Bellinger against fastballs in 2019.
Leaving results raw or relatively unsmoothed can lead to patchy, confusing graphs. One extra-base hit skews the area immediately around it, leaving a lot of work for whoever is interpreting the graph. On the other end of the spectrum, oversmoothing risks losing potential insight. For example, a well-smoothed graph shows that Bellinger had a hole low and away on fastballs. When the smoothing radius is too large, however, the positive results surrounding the hole shroud the weakness. Undersmoothing hurts interpretability; oversmoothing loses information.
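To make the trade-off concrete, here is a minimal sketch of the same idea in one dimension. It is written in Python purely for illustration and is not FanGraphs' or Baseball Savant's code. It assumes a Gaussian kernel density estimate, and the speeds, the three bandwidth values, and the helper names `gaussian_kde` and `count_peaks` are all made up for the example. A tiny bandwidth leaves a spike at nearly every pitch, a moderate one shows the two clusters, and a very large one blurs them into a single hump.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def count_peaks(values):
    """Count strict local maxima in a list of numbers."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )

# Made-up pitch speeds in mph: a slow cluster and a fast cluster.
speeds = [74.0, 75.5, 76.0, 77.5, 78.0, 90.5, 91.0, 92.5, 93.0, 94.5, 95.5]
grid = [70.0 + 0.1 * i for i in range(301)]  # 70.0 to 100.0 mph

# Hypothetical bandwidths: too small, about right, too large.
peaks = {}
for bandwidth in (0.2, 1.5, 12.0):
    density = gaussian_kde(speeds, bandwidth)
    curve = [density(x) for x in grid]
    peaks[bandwidth] = count_peaks(curve)
    print(f"bandwidth={bandwidth}: {peaks[bandwidth]} peak(s)")
```

Counting peaks is only a crude stand-in for eyeballing the graph, but it captures both failure modes: too many bumps on one end, too few on the other.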
Back to the velocity density chart: R automatically chooses how much to smooth a density plot based on the distribution of the data points it is given. Combining the pitch types into one group therefore changes the smoothing, and the dotted line simply does not align well with the individual pitch curves. Take Clayton Kershaw’s distribution with default smoothing.
The dotted line consistently fails to reach the peaks of each pitch type while simultaneously leaving the valleys too high. When the pitch types are combined, R is oversmoothing. Fortunately, the smoothing is customizable with the “adjust” input. For this Kershaw example, I found adjust = 0.35 to be the optimal level.
Now the peaks and valleys align better. This is what the graph should look like. Where it gets complicated is when you switch the pitcher. The 0.35 setting was optimal for Clayton Kershaw, but that is one very specific case: he threw 888 pitches in 2020, well balanced among three pitch types. For other pitchers, 0.35 might not be the right level of smoothing.
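Why would the right value change from pitcher to pitcher? A rough sketch, assuming the chart relies on R's default bandwidth rule (`bw.nrd0`, which the R documentation describes as 0.9 times the smaller of the standard deviation and IQR/1.34, times n to the power of -1/5) and that “adjust” simply multiplies that bandwidth. The Python below is my own approximation of that rule rather than R's code: it skips R's fallbacks for degenerate data, the speeds are invented, and `rule_of_thumb_bandwidth` is a hypothetical helper name.

```python
import statistics

def rule_of_thumb_bandwidth(samples, adjust=1.0):
    """Approximate default bandwidth: adjust * 0.9 * min(sd, IQR / 1.34) * n ** -0.2."""
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    return adjust * 0.9 * spread * len(samples) ** -0.2

# Made-up speeds in mph for two pitch types.
curveballs = [73.5, 74.0, 74.5, 75.0, 75.5, 76.0, 76.5, 77.0]
fastballs = [90.5, 91.0, 91.5, 92.0, 92.5, 93.0, 93.5, 94.0]
pooled = curveballs + fastballs

bw_curve = rule_of_thumb_bandwidth(curveballs)
bw_fast = rule_of_thumb_bandwidth(fastballs)
bw_pooled = rule_of_thumb_bandwidth(pooled)
bw_pooled_adjusted = rule_of_thumb_bandwidth(pooled, adjust=0.35)

print(f"curveballs only: {bw_curve:.2f}")
print(f"fastballs only:  {bw_fast:.2f}")
print(f"pooled:          {bw_pooled:.2f}")
print(f"pooled, adjust=0.35: {bw_pooled_adjusted:.2f}")
```

Because the pooled speeds span both clusters, their spread (and so the bandwidth) is much larger than for either pitch type alone, which is the oversmoothing seen in the dotted line. The rule also depends on the number of pitches and on how the pitch mix is spread out, so a multiplier tuned for one pitcher has no reason to carry over to another.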
In 2020, Jake McGee threw 332 pitches, fewer than Kershaw, and a higher percentage of fastballs than anyone (97%). With the smaller sample and heavy dose of one pitch, Kershaw’s setting results in severe undersmoothing for McGee. The dotted line jumps up and down, hurting both the interpretability and the aesthetics of the graph. For McGee, 0.9 was the optimal setting to remove the spikes while keeping peaks and valleys aligned.
With a solid understanding of what over-smoothing and under-smoothing look like on density plots, I figured I could collect a sample of pitchers and model what I deemed to be the optimal setting of the “Adjust” variable. That was the plan, until I ran into Ian Kennedy (see below). While less pronounced than the Kershaw and McGee examples, Kennedy’s optimal graph actually features both under-smoothing and over-smoothing attributes in the same graph.
The shortcomings may be subtle, but they send a clear message: not every pitcher has a single “adjust” level that works. In certain cases, like that of Ian Kennedy, the ranges of reasonable smoothing for two segments of the x-axis do not overlap. If the answer is not fine-tuning the smoothing, what’s the workaround?
Referencing a helpful Stack Overflow post, I was able to recreate each pitch type’s individual density curve and then take their sum for the dotted line. In terms of lines of code, this solution is less efficient than an equation for the optimal “adjust” value. However, considering the time it would take to build a good equation only for it to run into a Kennedy-style predicament, simply recreating and summing the density functions makes far more sense. Behold the glorious 2020 Ian Kennedy velocity density chart!
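For anyone curious about the mechanics, the summing step can be sketched as follows. This is an illustrative Python version, not the R code from the Stack Overflow post or from my chart. The speeds, the shared bandwidth of 0.7, and the helper name `gaussian_kde` are made up, and scaling each curve by its pitch type's share of all pitches is my assumption about how to make the curves add up to a single distribution.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Made-up speeds in mph, grouped by pitch type.
pitches = {
    "curveball": [73.5, 74.5, 75.0, 76.0, 77.0],
    "fastball": [90.5, 91.0, 91.5, 92.0, 92.0, 92.5, 93.0, 93.0, 93.5, 94.0],
}
bandwidth = 0.7  # hypothetical; each pitch type is smoothed on its own
total = sum(len(samples) for samples in pitches.values())

step = 0.1
grid = [65.0 + step * i for i in range(401)]  # shared x-axis, 65 to 105 mph

# One curve per pitch type, evaluated on the shared grid and scaled by usage.
scaled = {}
for name, samples in pitches.items():
    density = gaussian_kde(samples, bandwidth)
    share = len(samples) / total
    scaled[name] = [share * density(x) for x in grid]

# The dotted line is the point-by-point sum of the colored curves.
dotted = [sum(curve[i] for curve in scaled.values()) for i in range(len(grid))]

area = sum(dotted) * step  # Riemann-sum check that the total is still a density
print(f"area under dotted line: {area:.3f}")
```

Since the dotted line is built from the very curves drawn beneath it, it reaches their peaks and valleys by construction, and no pitcher-specific tuning is needed.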
Hopefully this wade into the ocean of smoothing has brainwashed you into appreciating the subtle complexity of what lies before your eyes. After all, the nuance of the dotted line was enough to stump my R professor.
The process of fine-tuning this graphic does prompt a reflection on the role of data analysts. The goal is to efficiently communicate a lot of information with little effort from the reader. With this in mind, attention to detail is everything. If done correctly, the graph should speak for itself. Nowhere on Baseball Savant does it say that the dotted line represents the sum of the density plots below it. That detail was self-evident thanks to good design. Spending the time to fiddle with the minor details could be the difference between the visual clicking and the reader giving up.
In conclusion, here’s my rendition of Yu Darvish’s spectacularly colorful velocity density chart: