First, something to get out of the way: I do not hate your baseball team. I did not create these predictions. The numbers did. I pulled team stats from 2008-2020, analyzed them, and then ran thousands of models using the statistics most correlated with wins. Some of these statistics included K/9, ERA, FIP, WAR, WPA, wRC+, OPS, and REW. Their correlations with wins, whether negative or positive, had a magnitude between 0.3 and 0.5. (For scale, the correlation of a variable with itself is 1.) The strongest correlation of all the statistics I looked at belonged to ‘Losses.’ Yes, pitchers’ losses had the strongest correlation with team wins. It happened to be a negative correlation, which can be just as helpful in a model as a positive one. Losses, however, were too strongly negatively correlated with wins to be used: a loss is essentially just a game the team did not win, so it gives the answer away. For an example of a negatively correlated statistic that was used, let’s look at runs given up by your pitchers.
As you can see on the scatterplot, there is a negative correlation between runs given up by pitchers and wins. Every point on this scatterplot is one team’s ‘Wins’ and ‘Runs Allowed’ for a single season from 2008 through 2019. (2020 wasn’t used because the season was shortened by COVID.) Why is there a negative correlation? The more runs your pitchers give up, the more likely you are to lose (and therefore not win). Let’s take a look at the relationship between WAR and Wins.
As you can see on the scatterplot, there is a positive correlation between WAR and wins. WAR is a stat that measures a player’s value in all facets of the game by estimating how many more wins he’s worth than a replacement-level player at his same position. It is a simple statistic to read, but it is highly correlated with wins. Front offices and even fans of the game often use this stat to gauge a player’s value to the team. A high WAR is a good WAR.
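To make the idea of positive and negative correlation concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The numbers are made up for illustration and are not the 2008-2019 data behind the scatterplots, and the names `pearson_r`, `runs_allowed`, and `team_war` are my own placeholders. Because the toy data is so tidy, its correlations come out much stronger than the 0.3 to 0.5 range I saw in the real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-movement around the means, then each series' own spread.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up team seasons for illustration only (NOT real data).
wins = [103, 97, 90, 84, 78, 71, 64, 57]
runs_allowed = [610, 640, 700, 690, 760, 790, 820, 880]
team_war = [52.0, 48.5, 41.0, 43.5, 33.0, 30.5, 24.0, 19.5]

r_runs_allowed = pearson_r(runs_allowed, wins)  # negative: more runs allowed, fewer wins
r_war = pearson_r(team_war, wins)               # positive: more WAR, more wins
r_self = pearson_r(wins, wins)                  # a variable against itself

print(f"Runs Allowed vs Wins: {r_runs_allowed:.2f}")
print(f"WAR vs Wins:          {r_war:.2f}")
print(f"Wins vs Wins:         {r_self:.2f}")
```

The sign tells you the direction of the relationship and the magnitude tells you how tight it is, which is why a strongly negative stat can be as useful to a model as a strongly positive one.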
To see how the model would perform before predicting the future (2021), I first had to see how its predictions compared to the actual wins from a completed season. I used 2019 and not 2020 because 2020 was a COVID-shortened season full of canceled games and weirdness that statistics can’t necessarily account for.
‘W./P’ is the number of wins the team had in 2019. ‘Projected_Wins’ is how many wins my model projected for the team. Finally, ‘Projected_-_Actual’ is the difference between the projection and the actual number of wins, which shows how close the projection got.
The average difference between the model’s projections and the actual wins was 5.3 wins. Most teams’ projections were actually closer than that, but outliers like the Marlins, Tigers, and Twins pulled the average up.
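For anyone curious how an average like that behaves, here is a small sketch. It assumes the average is taken over the absolute difference between projected and actual wins, and it uses hypothetical teams and numbers, not the real 2019 table. The point is only to show how one badly missed team drags the mean above what a typical team looks like.

```python
from statistics import mean, median

# Hypothetical (actual_wins, projected_wins) pairs; NOT the real 2019 results.
results = {
    "Team A": (95, 93),
    "Team B": (88, 90),
    "Team C": (81, 80),
    "Team D": (76, 79),
    "Team E": (57, 75),  # a badly missed outlier
}

# Absolute miss per team, ignoring whether the model was too high or too low.
abs_errors = {
    team: abs(projected - actual)
    for team, (actual, projected) in results.items()
}

mean_abs_error = mean(abs_errors.values())
median_abs_error = median(abs_errors.values())

print(f"Mean absolute difference:   {mean_abs_error:.1f} wins")
print(f"Median absolute difference: {median_abs_error:.1f} wins")
```

In this toy example four of the five teams are within three wins, yet the single outlier pushes the mean well above the median. That is the same effect the outliers had on my 5.3 average.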
Finally, let’s get to the predictions. To reiterate, I do not hate your team. I promise.
Though the model showed strong predictive power, there are improvements to note going forward. The dataset was small, so adding more data in the future should help improve the model’s performance. The model also did not take offseason acquisitions into consideration. That is something I’d love to add in the future.
A lot of really talented players who could realistically alter the wins predicted above changed teams this offseason. Just to name a few:
Trevor Bauer –> Dodgers
George Springer –> Blue Jays
Francisco Lindor –> Mets
Charlie Morton –> Braves
Nolan Arenado –> Cardinals
Feel free to add a couple wins or losses to any team you please.