Identifying Overlooked Position Players with K-Means Clustering

Baseball analytics has long focused on helping teams gain a competitive advantage by finding players their competitors have overlooked for one reason or another. While this began with very basic statistics such as on-base percentage, the methods have become increasingly advanced in recent decades, and Major League teams now pour large amounts of resources into analytics and R&D departments.

In this article, an attempt will be made to identify some position players who may be overlooked. To do this, data from Baseball Savant through the completion of games on May 23rd, 2021 will be used, and all players with fewer than 200 at-bats since the start of the 2020 season will be filtered out.

Players will be clustered with those whose performance has been most similar to theirs since the start of the 2020 season. While different methods exist for clustering, K-Means is among the most frequently used due to its simplicity and speed. K-Means groups data into k groups (clusters) based on similar features, with the goal of minimizing the distance between each point and the center (centroid) of its cluster. It is an unsupervised machine learning method, meaning no information is given beforehand about which cluster each player should be placed into.
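
The goal described above can be written out explicitly. This is the standard textbook form of the K-Means objective, and the notation is introduced here for illustration rather than taken from the original analysis: x_i is one player's vector of scaled statistics, C_j is the set of players assigned to cluster j, and mu_j is that cluster's centroid.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The algorithm alternates between assigning each player to the nearest centroid and recomputing each centroid as the mean of its members. Neither step can increase the sum, so the procedure converges, although possibly to a local rather than a global minimum.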

For clustering tasks, the data will need to be scaled so that the variables with larger values in the dataset don’t have more influence on the final results than the other variables. One common method is z-score standardization, which converts each numerical variable to z-scores by subtracting that variable's mean and dividing by its standard deviation. The other popular method is min-max normalization, which rescales each variable to lie between 0 and 1, with the minimum value at 0 and the maximum value at 1.
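
As a minimal sketch of the two scaling methods, the snippet below applies each to a single column using only the Python standard library. The values are made up for illustration, and dividing by the population standard deviation is an assumption (some tools divide by the sample standard deviation instead).

```python
from statistics import mean, pstdev


def z_score(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale a column linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Made-up exit velocities (mph) for four hypothetical players.
exit_velocity = [86.0, 88.0, 90.0, 92.0]

scaled_ev = z_score(exit_velocity)
normalized_ev = min_max(exit_velocity)

print(scaled_ev)
print(normalized_ev)
```

Both functions divide by zero if every value in the column is identical, so a constant column would need to be dropped or handled separately.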

Before building the K-Means clustering algorithm, the number of clusters to group the data into will need to be chosen. Choosing it is one of the more difficult parts of these algorithms. One of the most common methods is the elbow method (others frequently used in research papers include the silhouette method and the gap statistic method).

In the elbow method, the number of clusters is graphed on the x-axis and the total within-cluster sum of squares on the y-axis. Adding a cluster always lowers this total, but after a certain point the reduction is too small to justify the extra cluster. In the case of this dataset, that point appears to be k = 3, given that the total within-cluster sum of squares does not decrease nearly as rapidly beyond it.
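
The elbow computation can be sketched in plain Python. This is not the code behind the original analysis: the toy points (three tight groups of four) are made up, the function names are hypothetical, and the deterministic farthest-first initialization is an assumption chosen so the output is reproducible. Library implementations usually use randomized initialization with several restarts.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Farthest-first start: begin at the first point, then repeatedly add
    the point farthest from all centroids chosen so far (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def total_within_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: three tight groups of four points each (made up for illustration).
points = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
    for dx in (-1, 1)
    for dy in (-1, 1)
]

wcss = {k: total_within_ss(points, *kmeans(points, k)) for k in range(1, 7)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On this toy data the total drops sharply up to k = 3 and only slightly afterwards, which is the "elbow" shape to look for when the values are plotted.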

Now that the optimal number of clusters has been determined, the model can be built and visualized.

From the above, it is clear that there are three very distinct clusters and that the players are fairly evenly distributed across the clusters.

Cluster #	Size
1	69
2	66
3	84

To explore further, the average value of each variable can be calculated for each of the three clusters. Doing so reveals some very noticeable trends.

Cluster	Age	BA	SLG	wOBA	xwOBA	Exit Velocity (mph)	Launch Angle (degrees)	Whiff%	Swing%	Sprint Speed (ft/s)
1	30.29	.218	.400	.306	.320	89.11	15.41	28.35%	45.6%	26.59
2	27.98	.258	.383	.310	.309	86.88	8.88	22.8%	47.89%	27.66
3	29.4	.277	.496	.364	.363	89.95	13.01	25.1%	46.47%	26.98

Mean Values for Each Cluster
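
A table of cluster means like the one above can be produced by grouping players by their assigned cluster and averaging each column. The sketch below uses made-up rows for hypothetical players and only two variables. It assumes the averages are taken over the original, unscaled values so that they stay in readable units.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up rows for hypothetical players: (cluster label, age, sprint speed in ft/s).
rows = [
    (1, 31, 26.0),
    (1, 29, 27.0),
    (2, 26, 28.0),
    (2, 28, 27.5),
    (3, 30, 27.0),
]


def cluster_means(rows):
    """Return {cluster label: [mean of each feature column]}."""
    groups = defaultdict(list)
    for label, *features in rows:
        groups[label].append(features)
    return {
        label: [mean(col) for col in zip(*members)]
        for label, members in sorted(groups.items())
    }


means = cluster_means(rows)
sizes = Counter(label for label, *_ in rows)

for label, values in means.items():
    print(label, sizes[label], values)
```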

It seems likely that many of the best hitters in the league will be located in cluster 3. Also, it appears many older and slower players are located in cluster 1 while cluster 2 contains younger, faster players.

Here is a link to the spreadsheet where you can view each of the three clusters. Going through it confirms many of the intuitions developed from the table above, though there were still plenty of surprises (keep in mind that this is based solely on performance since the start of the 2020 season). Many of the league's best position players (Mike Trout, Ronald Acuna Jr., Fernando Tatis Jr., etc.) can be found in cluster 3, while players such as Albert Pujols and Miguel Cabrera are found in cluster 1.

While this algorithm clearly isn’t perfect, perhaps it provides some intuition about players who are undervalued or overvalued. Another possible interpretation is that it could provide insights into players we can expect to perform closer to their normal talent level (i.e., the cluster they should be in) over the coming months.

All data provided by Baseball Savant

This article was inspired by this blog post by former BaseballCloud Intern Jonah Simon.