The Numbers Behind the MLB

Students: AD, CD, BM (TF: Kevin Rader)

Abstract

This project measures the effects of various baseball statistics on the win percentage of the teams in MLB. Data was collected from espn.com for the 2010 regular season for all 30 MLB teams. First, simple linear regression models were run in Stata for each of the variables. Next, a multiple regression model was used as the basis for our model. The final model, after performing a step-wise regression with significance set at a p-value of .05 or less, reflects that the significant determining variables are batting average, strikeouts, quality starts, and errors. The signs of the coefficients were as expected; however, which variables proved significant, and the fact that payroll, home runs, and league were not among them, was relatively surprising. Lastly, the researchers discuss the implications of these results and possible strategies that MLB teams could employ to increase their regular-season win percentages in the future.

Introduction

The motivation for this team of researchers was their personal interest in sports. Baseball was the sport of choice because of its heavy reliance on statistics. When one thinks of baseball, it's all about the numbers: runs, strikeouts, payroll, etc., so our group decided to boil down the numbers and find out which ones really matter. After all, the final number anyone really cares about is win percentage; that's the number that gets a team to the playoffs. For this reason, our team selected a variety of variables and ran a multiple regression on win percentage to determine which variables were significant indicators of it. The MLB consists of 30 teams, and data was collected from espn.com for each team during the 2010 regular season.
The independent variables considered were league (National or American), strikeouts, quality starts, home runs, payroll, errors, and batting average. The dependent variable in the model is win percentage, measured as an actual percentage ranging from 0 to 100. The variable league was included as a dummy variable coded 0 or 1: 1 represented the National League, and 0 the American League. This variable allowed us to test whether win percentages differed between the two leagues. Strikeouts and quality starts fall under the category of pitching statistics. The researchers were curious whether there was a correlation between pitching and overall wins, and these variables seemed best suited for the regression. ERA was not chosen as a pitching statistic because there is an obvious correlation between ERA and win percentage; rather, the group was interested in determining the effects of statistics less directly related to win percentage. Because these statistics are good qualities of a pitcher, the researchers expected a positive correlation between these pitching statistics and win percentage. The offensive statistics included home runs and team batting average. As with ERA, runs scored were not incorporated into the model because the correlation between runs and wins is too direct. Batting average was converted into a more output-friendly format as a percentage falling between 0 and 100; the traditional batting average was multiplied by 100. The team thought that home runs would bring an interesting perspective to the model, considering that some teams are known for big hitters while others score by other means.
Since offense is crucial for a successful team, one would hope to see positive coefficients on these two offensive variables. In conjunction with the pitching statistics, the variable errors represents the defensive statistic for each team. Baseball is usually thought of in terms of hitting and pitching, so it will be interesting to see whether defensive statistics, such as errors, play a role in determining win percentage. Since the goal of baseball is not only to maximize runs scored but also to minimize runs allowed, the team expected a negative correlation between errors and win percentage. Lastly, the researchers wanted to look at payroll and its effect on win percentage. Individual player earnings receive a great deal of media attention; looking at the data, it is striking that a single player on one team (Alex Rodriguez) makes almost as much money as the entire roster of another (the Pittsburgh Pirates). There are many outliers in this category, however, so the group did not expect a very strong correlation.

Methods

The researchers collected data from various pages of espn.com. The data was organized into an Excel chart, making sure to properly line up each statistic with the correct team. The data consists of 30 observations for each variable, since there are only 30 teams in the MLB. From there, the team exported the data into Stata. First, the group ran a simple linear regression of win percentage on each of the x variables. These simple linear regressions allowed the group members to get a sense of the relationship between each variable and win percentage. After determining the correlation for each variable on its own, the team ran a multiple linear regression to study the effects of each variable when the remaining variables were also taken into consideration. A p-value of .05 or less was used to determine the significance of any given variable.
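The simple linear regressions described above can be reproduced outside of Stata. A minimal sketch in Python computes the least-squares intercept and slope in closed form; note that the data values below are hypothetical placeholders, not the actual 2010 figures.

```python
# Closed-form simple linear regression, as used for each x variable.
# The data below are hypothetical placeholders, not the 2010 values.

def simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical team errors vs. win percentage
errors = [85, 90, 95, 100, 110, 120]
winpct = [58.0, 55.5, 53.0, 51.0, 47.0, 42.5]
b0, b1 = simple_ols(errors, winpct)
print(round(b0, 2), round(b1, 3))  # slope is negative, as expected for errors
```

The same function applied to the real team-level data would reproduce the slopes in the appendix output.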
Next, a step-wise regression was performed to eliminate variables that were not significant, leaving the team with the final model for the project. This final model recognized four variables as significant indicators of win percentage, in addition to the constant. While it is questionable to include more than three variables given the limited number of observations in the data, the team feels that the model is well suited to the goals of this project. With the final model determined, it was necessary to verify that all assumptions held, so the team ran several tests in Stata: hettest for heteroskedasticity, ovtest for nonlinearity, the Shapiro-Wilk test for normality of the residuals, and a check for collinearity.

Results

The results of the project were interesting. After running the step-wise multiple regression, the data reflect that the only significant variables in determining win percentage are batting average, errors, strikeouts, and quality starts. With the lowest p-value at 0.001, batting average was the most significant variable. The coefficient for batting average was 2.602, indicating that, with all other variables held constant, a 1-percentage-point increase in batting average corresponds to a 2.602-point increase in win percentage. This strong correlation makes sense, since wins are directly related to runs scored, and runs are directly related to hits and batting average. The coefficients and p-values for the intercept and all significant variables are shown below.
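The backward-elimination logic of Stata's sw, pr(.05) prefix can be sketched in a few lines. This Python sketch is a deliberate simplification: fit_pvalues is a stand-in that reuses the full-model p-values from the appendix rather than refitting the regression at each step, so, unlike the real procedure (where the p-value of qs falls below .05 once other variables are removed), it drops qs as well.

```python
# Backward stepwise elimination: repeatedly drop the least significant
# variable (largest p-value) until all remaining p-values are < alpha.
# p-values below are the full-model values from the paper's Stata output.

FULL_MODEL_PVALUES = {
    "nl1al0": 0.290, "k": 0.018, "qs": 0.068, "hr": 0.384,
    "payrollmillions": 0.944, "errors": 0.007, "bafinal": 0.009,
}

def fit_pvalues(variables):
    # Stand-in: a real implementation would refit the model on `variables`
    # and return fresh p-values; here we simply reuse the full-model ones.
    return {v: FULL_MODEL_PVALUES[v] for v in variables}

def backward_stepwise(variables, alpha=0.05):
    vars_in = list(variables)
    while vars_in:
        pvals = fit_pvalues(vars_in)
        worst = max(pvals, key=pvals.get)
        if pvals[worst] < alpha:
            break
        vars_in.remove(worst)
    return vars_in

# With frozen p-values this also drops qs; Stata's refitting retains it.
print(backward_stepwise(FULL_MODEL_PVALUES))
```

Because the p-values change whenever a variable is removed, refitting at each step matters: it is exactly what kept quality starts (p = 0.043 after refitting) in the paper's final model.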
Variable              Coefficient   p-value
intercept               -45.326      0.060
batting average (%)       2.602      0.001
errors                   -0.149      0.002
strikeouts                0.023      0.007
quality starts            0.199      0.043

The final step-wise regression model is shown below.

. sw, pr(.05): regress winpct nl1al0 k qs hr payrollmillions errors bafinal
                      begin with full model
p = 0.9440 >= 0.0500  removing payrollmillions
p = 0.3730 >= 0.0500  removing hr
p = 0.1694 >= 0.0500  removing nl1al0

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  4,    25) =   17.66
       Model |  988.622277     4  247.155569           Prob > F      =  0.0000
    Residual |  349.856341    25  13.9942536           R-squared     =  0.7386
-------------+------------------------------           Adj R-squared =  0.6968
       Total |  1338.47862    29  46.1544351           Root MSE      =  3.7409

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     bafinal |   2.601685   .6946007     3.75   0.001     1.171128    4.032242
           k |   .0229994   .0078448     2.93   0.007     .0068427    .0391562
          qs |   .1987509    .092977     2.14   0.043     .0072612    .3902407
      errors |  -.1485824   .0441469    -3.37   0.002    -.2395046   -.0576601
       _cons |  -45.32566   22.96426    -1.97   0.060    -92.62143    1.970117

As shown in the Stata output, the R² for this model is 0.7386, with an adjusted R² of 0.6968, indicating that the model is a relatively strong predictor of win percentage. Additionally, it is important to note that the adjusted R² is greater in the step-wise final model than in the multiple linear regression with all variables included, confirming that it is the better model. The results of the simple linear regressions for each x variable independently demonstrated nothing too surprising. Although the most significant variable in the final model appears to be batting average, the simple linear regression results indicate that errors is the most significant variable on its own. For this reason, the scatter plots for the simple linear regressions of both variables are shown below.
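The fitted equation above can be used directly for prediction. A minimal sketch, where the team statistics plugged in are hypothetical rather than taken from the 2010 data:

```python
# Predicted win percentage from the final step-wise model:
# winpct = -45.326 + 2.602*BA(%) + 0.023*K + 0.199*QS - 0.149*errors

def predict_winpct(ba_pct, strikeouts, quality_starts, errors):
    return (-45.326
            + 2.602 * ba_pct
            + 0.023 * strikeouts
            + 0.199 * quality_starts
            - 0.149 * errors)

# Hypothetical team: .260 team BA (26.0 on the 0-100 scale used in the
# model), 1100 strikeouts, 90 quality starts, 100 errors.
print(round(predict_winpct(26.0, 1100, 90, 100), 1))  # -> 50.6
```

This also shows why the scaling choices matter: because batting average and win percentage are both on 0-to-100 scales, the coefficient 2.602 reads directly as points of win percentage per point of batting average.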
Conclusions and Discussion

When analyzing the results of the final multiple regression model, the coefficients appear as one would predict. Good qualities of a team (batting average, strikeouts, and quality starts) all have positive coefficients, and the variable errors has a negative coefficient. Surprisingly, the payroll of a team was not a significant indicator of win percentage. While this may surprise the everyday baseball fan, the team was wary of this variable due to its large number of outliers. So, what do these results mean for the future of baseball? Because batting average and errors were determined to be the most significant predictors, perhaps teams should target their payrolls toward players who will improve these team statistics; for example, they should focus on offensive players with high batting averages who also play good defense and commit few errors. To a lesser extent, teams should also recruit pitchers who have a history of quality starts and high strikeout totals. The dummy variable league proved not to be significant, indicating that win percentage does not differ between the American League and the National League when all other variables are held constant. In other words, the significant variables are equally significant in each league, and league itself has no direct effect on win percentage. Home runs were also removed from the final model. This is surprising, because home runs are directly associated with runs scored, and runs are a strong indicator of win percentage. On the other hand, this could reflect the number of runs scored from hits other than home runs, in addition to walks and errors. The removal of this variable suggests that getting more hits overall matters more than hitting more home runs. Interestingly, the simple linear regression of win percentage on payroll indicated that payroll was needed in the model.
However, in the multiple regression model, payroll was the least significant variable and was therefore the first removed. The group has concluded that this discrepancy reflects the poor allocation of teams' payrolls. For example, rather than using their payroll to sign players as described above, a team may sign big-name players with relatively poor statistics in light of their salaries. Essentially, these players are paid more than they are worth simply to draw fan interest.
Lastly, it is important to mention that the tests run in the final sections of the project verify that the assumptions of linearity, normality of the residuals, etc. all held. This means that the final model requires no transformations before being used as a predictive model. Some of these issues were handled by our initial transformations of the data onto comparable scales. For example, win percentage and batting average were both converted to percentages in the range 0 to 100, so that the output could be more easily read in terms of a 1-unit increase in x and its effect on y. The major weakness of this project is that our data was limited to 30 observations. This limit exists because there are only 30 teams in the MLB, and we wanted to include data from only a single season, since strategies have changed throughout the history of baseball. Additionally, the team members analyzed only the regular season, not the playoffs. For example, the Giants won the World Series, and they tied for the third-highest number of quality starts and had the highest number of strikeouts in our data. This means that these two variables may be more significant for the playoffs than indicated, but because our data was limited to the regular season, those effects were not captured. In conclusion, if a team's goal is to increase its win percentage in the regular season, our model is a strong indicator of the significant variables. However, because of the structure of the MLB playoff system, the team with the highest win percentage does not necessarily win the World Series. Therefore, a different study should be conducted to identify the most significant variables for post-season success.

References

Stata Software. Gould, Bill. 2010. Version 11.0.
MLB Statistics. Elias Sports Bureau. http://espn.go.com/mlb/stats/team/_/stat/batting/year/2010/seasontype/2

Appendix

Simple Linear Regressions:
. regress winpct nl1al0

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    0.07
       Model |  3.22438048     1  3.22438048           Prob > F      =  0.7967
    Residual |  1335.25424    28  47.6876513           R-squared     =  0.0024
-------------+------------------------------           Adj R-squared = -0.0332
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.9056

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nl1al0 |  -.6571428     2.5272    -0.26   0.797    -5.833877    4.519591
       _cons |   50.35714   1.845606    27.28   0.000     46.57659    54.13769

. regress winpct k
      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =   13.49
       Model |   435.19831     1   435.19831           Prob > F      =  0.0010
    Residual |  903.280307    28   32.260011           R-squared     =  0.3251
-------------+------------------------------           Adj R-squared =  0.3010
       Total |  1338.47862    29  46.1544351           Root MSE      =  5.6798

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           k |    .039477   .0107481     3.67   0.001     .0174605    .0614936
       _cons |   4.863383   12.33451     0.39   0.696    -20.40272    30.12949

. regress winpct qs

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    8.58
       Model |  313.893106     1  313.893106           Prob > F      =  0.0067
    Residual |  1024.58551    28  36.5923397           R-squared     =  0.2345
-------------+------------------------------           Adj R-squared =  0.2072
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.0492

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          qs |   .3770548   .1287386     2.93   0.007     .1133458    .6407638
       _cons |   17.55482   11.13501     1.58   0.126    -5.254207    40.36384

. regress winpct hr

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    6.60
       Model |  255.200288     1  255.200288           Prob > F      =  0.0158
    Residual |  1083.27833    28  38.6885118           R-squared     =  0.1907
-------------+------------------------------           Adj R-squared =  0.1618
       Total |  1338.47862    29  46.1544351           Root MSE      =    6.22

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          hr |   .0885053   .0344604     2.57   0.016     .0179165    .1590941
       _cons |    36.3975   5.419175     6.72   0.000     25.29682    47.49818

. regress winpct payrollmillions

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    4.34
       Model |   179.55983     1   179.55983           Prob > F      =  0.0465
    Residual |  1158.91879    28  41.3899567           R-squared     =  0.1342
-------------+------------------------------           Adj R-squared =  0.1032
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.4335

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
payrollmil~s |   .0641494   .0307989     2.08   0.047     .0010607     .127238
       _cons |   44.24926    3.00341    14.73   0.000     38.09705    50.40147
. regress winpct errors

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =   18.81
       Model |  537.819897     1  537.819897           Prob > F      =  0.0002
    Residual |   800.65872    28  28.5949543           R-squared     =  0.4018
-------------+------------------------------           Adj R-squared =  0.3805
       Total |  1338.47862    29  46.1544351           Root MSE      =  5.3474

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      errors |  -.2459617   .0567145    -4.34   0.000     -.362136   -.1297874
       _cons |    74.8488   5.810765    12.88   0.000     62.94599    86.75161

. regress winpct bafinal

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    7.50
       Model |  282.700955     1  282.700955           Prob > F      =  0.0106
    Residual |  1055.77766    28  37.7063451           R-squared     =  0.2112
-------------+------------------------------           Adj R-squared =  0.1830
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.1405

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     bafinal |    3.00539   1.097601     2.74   0.011     .7570566    5.253723
       _cons |  -27.31199   28.25985    -0.97   0.342    -85.19967    30.57569

. corr winpct errors
(obs=30)

             |   winpct   errors
-------------+------------------
      winpct |   1.0000
      errors |  -0.6339   1.0000
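A useful cross-check on the output above: for a regression with a single predictor, R-squared equals the square of the Pearson correlation. The correlation between winpct and errors is reported as -0.6339 and the R-squared of that regression as 0.4018, which agree to rounding:

```python
# For simple linear regression, R-squared = (Pearson r)^2.
# Both values are taken from the Stata output above (rounded).
r = -0.6339          # corr winpct errors
r_squared = 0.4018   # R-squared from "regress winpct errors"
print(round(r ** 2, 4))  # -> 0.4018
assert abs(r ** 2 - r_squared) < 5e-4
```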
Multiple Linear Regression

. regress winpct nl1al0 k qs hr payrollmillions errors bafinal

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  7,    22) =   10.36
       Model |  1026.88282     7  146.697545           Prob > F      =  0.0000
    Residual |  311.595802    22  14.1634456           R-squared     =  0.7672
-------------+------------------------------           Adj R-squared =  0.6931
       Total |  1338.47862    29  46.1544351           Root MSE      =  3.7634

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nl1al0 |  -1.882327   1.737358    -1.08   0.290    -5.485388    1.720733
           k |   .0260625    .010205     2.55   0.018     .0048986    .0472263
          qs |   .2003392   .1042446     1.92   0.068    -.0158508    .4165292
          hr |   .0225859   .0254233     0.89   0.384    -.0301387    .0753105
payrollmil~s |  -.0014683   .0206652    -0.07   0.944    -.0443253    .0413886
      errors |  -.1353628   .0457943    -2.96   0.007    -.2303343   -.0403912
     bafinal |   2.195859   .7720799     2.84   0.009     .5946634    3.797055
       _cons |   -42.1969   24.26719    -1.74   0.096    -92.52397     8.13017

Step-Wise Multiple Linear Regression

. sw, pr(.05): regress winpct nl1al0 k qs hr payrollmillions errors bafinal
                      begin with full model
p = 0.9440 >= 0.0500  removing payrollmillions
p = 0.3730 >= 0.0500  removing hr
p = 0.1694 >= 0.0500  removing nl1al0

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  4,    25) =   17.66
       Model |  988.622277     4  247.155569           Prob > F      =  0.0000
    Residual |  349.856341    25  13.9942536           R-squared     =  0.7386
-------------+------------------------------           Adj R-squared =  0.6968
       Total |  1338.47862    29  46.1544351           Root MSE      =  3.7409

      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     bafinal |   2.601685   .6946007     3.75   0.001     1.171128    4.032242
           k |   .0229994   .0078448     2.93   0.007     .0068427    .0391562
          qs |   .1987509    .092977     2.14   0.043     .0072612    .3902407
      errors |  -.1485824   .0441469    -3.37   0.002    -.2395046   -.0576601
       _cons |  -45.32566   22.96426    -1.97   0.060    -92.62143    1.970117

Test for Heteroskedasticity

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
       Ho: Constant variance
       Variables: fitted values of winpct
       chi2(1)     =     0.07
       Prob > chi2 =   0.7852
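The adjusted R-squared values reported in these outputs follow the usual formula, adj R² = 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. A quick check against both models:

```python
# Adjusted R-squared penalizes R-squared for the number of predictors k.
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Full model: R^2 = 0.7672, n = 30 teams, k = 7 predictors
print(round(adj_r2(0.7672, 30, 7), 4))  # -> 0.6931, matching the output
# Step-wise model: R^2 = 0.7386, n = 30, k = 4
print(round(adj_r2(0.7386, 30, 4), 4))  # -> 0.6968, matching the output
```

This makes the model-comparison argument concrete: the step-wise model has a lower raw R² (0.7386 vs. 0.7672) but a higher adjusted R², because it achieves nearly the same fit with three fewer predictors.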
Test for Non-Linearity

. ovtest

Ramsey RESET test using powers of the fitted values of winpct
       Ho: model has no omitted variables
                F(3, 22) =      0.57
                Prob > F =      0.6398

Test for Normality of Noise

. predict res
(option xb assumed; fitted values)

. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V          z     Prob>z
-------------+--------------------------------------------------
         res |     30    0.95509      1.427      0.736    0.23089

Test for Collinearity

. corr k qs errors bafinal
(obs=30)

             |        k       qs   errors  bafinal
-------------+------------------------------------
           k |   1.0000
          qs |   0.3971   1.0000
      errors |  -0.2728  -0.3699   1.0000
     bafinal |   0.0809  -0.1120  -0.1657   1.0000

The final model passes all tests checking the initial assumptions of the project.
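The collinearity check above is simply a matrix of pairwise Pearson correlations among the predictors; values near ±1 would signal a problem. A sketch of the underlying computation on two short hypothetical vectors (not the 2010 data), chosen here to be strongly correlated for illustration:

```python
import math

# Pearson correlation between two variables; values near +/-1 would
# signal collinearity. The vectors below are hypothetical illustrations.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

k  = [1100, 1050, 1200, 980, 1150]   # hypothetical team strikeouts
qs = [88, 85, 95, 80, 90]            # hypothetical quality starts
print(round(pearson(k, qs), 3))      # near 1: these would be collinear
```

By contrast, the largest off-diagonal correlation in the actual matrix above is only 0.3971 (between k and qs), which is why collinearity was not a concern for the final model.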