The Numbers Behind the MLB
Anonymous Students: AD, CD, BM (TF: Kevin Rader)




Abstract

This project measures the effects of various baseball statistics on the win percentage of the teams in MLB. Data was collected from espn.com for the 2010 regular season for all 30 MLB teams. First, simple linear regression models were run in Stata for each of the variables. Next, a multiple regression model was used as the basis for our model. The final model, after performing a step-wise regression with significance set at a p-value of .05 or less, shows that the significant determining variables are batting average, strike outs, quality starts, and errors. The results of the regression were not surprising in that the coefficients were positive or negative as expected; however, which variables proved to be significant, and in particular that payroll, home runs, and league were not significant, was relatively surprising. Lastly, the researchers discuss the implications of these results and possible strategies that MLB teams could employ to increase their regular-season win percentages in the future.

Introduction

The motivation for this team of researchers was their personal interest in sports. Baseball was the sport of choice because of its heavy reliance on statistics. When one thinks of baseball, it's all about the numbers: runs, strikeouts, payroll, etc., so our group decided to boil down the numbers and find out which numbers really matter. After all, the final number anyone really cares about is win percentage; that's the number that's going to get you to the playoffs. For this reason, our team selected a variety of variables and ran a multiple regression for win percentage to determine which variables were significant indicators of win percentage. The MLB consists of 30 teams. Data was collected from espn.com for each team during the 2010 regular season.
The independent variables considered were league (National or American), strike outs, quality starts, home runs, payroll, errors, and batting average. The dependent variable in the model is win percentage, measured as an actual percentage ranging from 0 to 100.

The variable league was included as a dummy variable, coded 0 or 1: 1 represented the National League, and 0 therefore represented the American League. This variable allowed us to test whether there was a difference in win percentages between teams in the two leagues. Strike outs and quality starts fall under the category of pitching statistics. The researchers were curious whether there was a correlation between pitching and overall wins, and these variables seemed best suited for the regression. ERA was not chosen as the pitching statistic to consider because there is an obvious correlation between ERA and win percentage; rather, the group was interested in determining the effects of statistics less directly related to win percentage. Because these statistics are good qualities of a pitcher, the researchers expected a positive correlation between these pitching statistics and win percentage.

The offensive statistics included home runs and team batting average. As with ERA, runs scored were not incorporated into the model because the correlation between runs and wins is too direct. Batting average was put into a more output-friendly format as a percentage falling between 0 and 100; the traditional batting average was multiplied by 100. The team thought that home runs would bring an interesting perspective to the model, considering the range of teams known for big hitters and those that score by other methods.

Since offense is crucial for a successful team, one would hope to see positive coefficients in front of these two offensive variables. In conjunction with the pitching statistics, the variable errors represents the defensive statistic for each team. The initial thought of baseball is all about hitting and pitching, so it will be interesting to see whether defensive statistics, such as errors, play a role in determining win percentage for any given team. Since the goal of baseball is not only to maximize runs scored but also to minimize runs scored by the other team, the team would expect to see a negative correlation between errors and win percentage. Lastly, the researchers wanted to look at payroll and its effect on win percentage. One hears so much about individual player earnings in the media; after looking at the data, it is striking that a single player on one team (Alex Rodriguez) makes almost as much money as the entire payroll of another (the Pittsburgh Pirates). There are a lot of outliers in this category, however, so the group was not expecting a very strong correlation.

Methods

The researchers collected data from various pages of espn.com. The data was organized into an Excel chart, making sure to properly line up each statistic with the correct team. The data consists of 30 observations for each variable, since there are only 30 teams in the MLB. From there, the team imported the data into Stata. First, the group ran a simple linear regression of win percentage on each of the x variables. These simple linear regressions allowed the group members to get a sense of the relationship between each variable and win percentage. After determining the correlation between each variable on its own, the team ran a multiple linear regression to study the effects of each variable when the remaining variables were also taken into consideration. A p-value of .05 or less was used to determine the significance of any given variable.
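Each of the simple linear regressions described above fits the ordinary closed-form least-squares line. As a minimal sketch of what that computation does (written in Python rather than Stata, and using made-up numbers, not the project's actual MLB data):

```python
# Closed-form simple linear regression: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x).
# Illustrative only -- the values below are invented, not the 2010 MLB data.

def simple_ols(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)                      # sum of squares of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # cross products
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical (errors, win percentage) pairs for four teams:
errors = [80.0, 90.0, 100.0, 110.0]
winpct = [55.0, 52.0, 49.0, 46.0]
b0, b1 = simple_ols(errors, winpct)
print(b0, b1)  # perfectly linear toy data: intercept 79.0, slope -0.3
```

Stata's `regress` additionally reports standard errors, t-statistics, and confidence intervals for these two coefficients, as shown in the Appendix output.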
Next, a step-wise regression was performed to eliminate variables that were not significant, leaving the team with the final model for the project. This final model recognized four variables as significant indicators of win percentage, in addition to the constant. While it is questionable to include more than three variables given the limited number of observations in the data, the team feels that the model is well suited to the goals of this project. With the final model determined, it was necessary to verify that all assumptions held, so the team ran several tests in Stata: the hettest for heteroskedasticity, the ovtest for nonlinearity, the Shapiro-Wilk test for normality of the residuals, and a test for collinearity.

Results

The results of the project were interesting. After running the step-wise multiple regression, the data reflects that the only significant variables in determining win percentage are batting average, errors, strike outs, and quality starts. With the lowest p-value, 0.001, batting average was the most significant variable. The coefficient for batting average was 2.602, indicating that, when all other variables are held constant, a 1% increase in batting average results in a 2.602% increase in win percentage for any given team. This strong correlation makes sense, since wins are directly related to runs scored, and runs are directly related to hits and batting average. The coefficients and p-values for the intercept and all significant variables are shown below.

Variable              Coefficient   p-value
intercept               -45.326      0.060
batting average (%)       2.602      0.001
errors                   -0.149      0.002
strike outs               0.023      0.007
quality starts            0.199      0.043

The final step-wise regression model is shown below.

. sw, pr(.05): regress winpct nl1al0 k qs hr payrollmillions errors bafinal
                      begin with full model
p = 0.9440 >= 0.0500  removing payrollmillions
p = 0.3730 >= 0.0500  removing hr
p = 0.1694 >= 0.0500  removing nl1al0

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  4,    25) =   17.66
       Model |  988.622277     4  247.155569           Prob > F      =  0.0000
    Residual |  349.856341    25  13.9942536           R-squared     =  0.7386
-------------+------------------------------           Adj R-squared =  0.6968
       Total |  1338.47862    29  46.1544351           Root MSE      =  3.7409

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     bafinal |   2.601685   .6946007     3.75   0.001     1.171128    4.032242
           k |   .0229994   .0078448     2.93   0.007     .0068427    .0391562
          qs |   .1987509    .092977     2.14   0.043     .0072612    .3902407
      errors |  -.1485824   .0441469    -3.37   0.002    -.2395046   -.0576601
       _cons |  -45.32566   22.96426    -1.97   0.060    -92.62143    1.970117
------------------------------------------------------------------------------

As shown in the Stata output, the R-squared for this model is 0.7386, with an adjusted R-squared of 0.6968, indicating that the model is a relatively strong predictor of win percentage. It is also worth noting that the adjusted R-squared is greater in the step-wise final model than in the multiple linear regression with all variables included, suggesting that the reduced model is the better one. The results of the simple linear regressions for each x variable on its own showed nothing too surprising. Although the most significant variable in the final model is batting average, the simple linear regression results indicate that errors is the most significant variable in isolation. For this reason, the scatter plots for the simple linear regressions of both variables are shown below.
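As a quick arithmetic check on the reported fit statistics, the adjusted R-squared follows directly from the ordinary R-squared, the number of observations (n = 30 teams), and the number of predictors; this short sketch uses only the numbers reported in the Stata output:

```python
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n is the number of observations and k the number of predictors.
# The inputs below are taken from the paper's Stata output (n = 30).

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Final step-wise model: R^2 = 0.7386 with k = 4 predictors.
print(round(adjusted_r2(0.7386, 30, 4), 4))  # -> 0.6968, matching the output

# Full multiple regression: R^2 = 0.7672 with k = 7 predictors.
print(round(adjusted_r2(0.7672, 30, 7), 4))  # -> 0.6931, matching the output
```

The check also illustrates why the step-wise model wins on adjusted R-squared despite its lower raw R-squared: the penalty term (n - 1)/(n - k - 1) shrinks as predictors are dropped.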

Conclusions and Discussion

When analyzing the results of the final multiple regression model, the coefficients appear as one would predict: good qualities of a team (batting average, strike outs, and quality starts) all have positive coefficients, and the variable errors carries a negative coefficient. Surprisingly, the payroll of a team was not a significant indicator of win percentage. While this may be surprising to the everyday baseball fan, the team was wary of this variable due to the large number of outliers.

So, what do these results mean for the future of baseball? Because batting average and errors were determined to be the most significant predictors, perhaps teams should target their payrolls toward players who will improve these team statistics. For example, they should focus on offensive players with a high batting average who also play good defense and commit few errors. To a lesser extent, teams should also recruit pitchers who have a history of quality starts and a high number of strike outs. The dummy variable league proved not to be significant, indicating that win percentage for any given team does not differ between the American League and the National League when all other variables are held constant. In other words, the significant variables are equally significant in each league, and therefore league does not have a direct effect on win percentage. Home runs were also removed from the final model. This is surprising because home runs are directly associated with runs scored, and runs are a strong indicator of win percentage. On the other hand, this could reflect the number of runs scored by hits other than home runs, in addition to walks and errors. The removal of this variable suggests that getting more hits overall is more important than getting more home runs. Interestingly, the simple linear regression of win percentage on payroll indicated that payroll belonged in the model.
However, in the multiple regression model, payroll was the least significant variable and therefore was the first removed. The group has concluded that this discrepancy reflects the poor allocation of teams' payrolls. For example, rather than using their payroll to sign players as described above, a team may sign big-name players whose statistics are relatively poor in light of their salaries. Essentially, these players are paid more than they're worth, just to draw more fan interest.
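To see how the final model would be used in practice, here is a minimal sketch that plugs season totals into the fitted equation. The coefficients are the rounded values from the Results table; the input statistics are a hypothetical team invented for illustration, not any actual 2010 club:

```python
# Predicted win percentage from the final step-wise model:
#   winpct = -45.326 + 2.602*ba + 0.023*k + 0.199*qs - 0.149*errors
# Coefficients are rounded values from the Results table; the inputs below
# describe a hypothetical team, not real 2010 data.

def predicted_winpct(ba_pct, strikeouts, quality_starts, errors):
    return (-45.326
            + 2.602 * ba_pct          # team batting average as a percentage, e.g. 26.0
            + 0.023 * strikeouts      # pitching strike outs for the season
            + 0.199 * quality_starts  # quality starts for the season
            - 0.149 * errors)         # fielding errors for the season

print(round(predicted_winpct(26.0, 1100, 90, 100), 1))  # -> 50.6
```

A hypothetical team batting .260 with 1100 strike outs, 90 quality starts, and 100 errors would thus be predicted to win roughly half its games, which is in line with the league-average feel of those inputs.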

Lastly, it is important to mention that the tests run in the final section of the project verify that the assumptions of linearity, normality of residuals, etc. all hold. This means that the final model requires no transformations before being used as a predictive model. Some of these issues were taken care of by our initial transformations of the data values onto similar scales. For example, win percentage and batting average were both converted to actual percentages in the range of 0 to 100. This was done so that the output could be read more easily in terms of a 1-unit increase in x and its effect on y.

The major weakness of this project is that our data was limited to 30 observations. This was unavoidable because there are only 30 teams in the MLB, and we wanted to include data from only a single season, since strategies have changed throughout the history of baseball. Additionally, the team members analyzed only the regular season, not the playoffs. For example, the Giants won the World Series, and they tied for the third-highest number of quality starts and had the highest number of strike outs within our data. This means that these two variables may be more significant for the playoffs than indicated, but because our data was limited to the regular season, these effects were not taken into account.

In conclusion, if a team's goal is to increase its win percentage in the regular season, our model is a strong indicator of the significant variables. However, because of the structure of the MLB playoff system, the team with the highest win percentage does not necessarily win the World Series. Therefore, a different study should be conducted to analyze the most significant variables in determining success in the post-season.

References

Stata Software. Gould, Bill. 2010. Version 11.0.
MLB Statistics. Elias Sports Bureau. http://espn.go.com/mlb/stats/team/_/stat/batting/year/2010/seasontype/2

Appendix

Simple Linear Regressions:
. regress winpct nl1al0

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    0.07
       Model |  3.22438048     1  3.22438048           Prob > F      =  0.7967
    Residual |  1335.25424    28  47.6876513           R-squared     =  0.0024
-------------+------------------------------           Adj R-squared = -0.0332
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.9056

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nl1al0 |  -.6571428     2.5272    -0.26   0.797    -5.833877    4.519591
       _cons |   50.35714   1.845606    27.28   0.000     46.57659    54.13769
------------------------------------------------------------------------------

. regress winpct k

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =   13.49
       Model |   435.19831     1   435.19831           Prob > F      =  0.0010
    Residual |  903.280307    28   32.260011           R-squared     =  0.3251
-------------+------------------------------           Adj R-squared =  0.3010
       Total |  1338.47862    29  46.1544351           Root MSE      =  5.6798

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           k |    .039477   .0107481     3.67   0.001     .0174605    .0614936
       _cons |   4.863383   12.33451     0.39   0.696    -20.40272    30.12949
------------------------------------------------------------------------------

. regress winpct qs

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    8.58
       Model |  313.893106     1  313.893106           Prob > F      =  0.0067
    Residual |  1024.58551    28  36.5923397           R-squared     =  0.2345
-------------+------------------------------           Adj R-squared =  0.2072
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.0492

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          qs |   .3770548   .1287386     2.93   0.007     .1133458    .6407638
       _cons |   17.55482   11.13501     1.58   0.126    -5.254207    40.36384
------------------------------------------------------------------------------

. regress winpct hr

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    6.60
       Model |  255.200288     1  255.200288           Prob > F      =  0.0158
    Residual |  1083.27833    28  38.6885118           R-squared     =  0.1907
-------------+------------------------------           Adj R-squared =  0.1618
       Total |  1338.47862    29  46.1544351           Root MSE      =    6.22

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          hr |   .0885053   .0344604     2.57   0.016     .0179165    .1590941
       _cons |    36.3975   5.419175     6.72   0.000     25.29682    47.49818
------------------------------------------------------------------------------

. regress winpct payrollmillions

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    4.34
       Model |   179.55983     1   179.55983           Prob > F      =  0.0465
    Residual |  1158.91879    28  41.3899567           R-squared     =  0.1342
-------------+------------------------------           Adj R-squared =  0.1032
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.4335

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
payrollmil~s |   .0641494   .0307989     2.08   0.047     .0010607     .127238
       _cons |   44.24926    3.00341    14.73   0.000     38.09705    50.40147
------------------------------------------------------------------------------

. regress winpct errors

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =   18.81
       Model |  537.819897     1  537.819897           Prob > F      =  0.0002
    Residual |   800.65872    28  28.5949543           R-squared     =  0.4018
-------------+------------------------------           Adj R-squared =  0.3805
       Total |  1338.47862    29  46.1544351           Root MSE      =  5.3474

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      errors |  -.2459617   .0567145    -4.34   0.000     -.362136   -.1297874
       _cons |    74.8488   5.810765    12.88   0.000     62.94599    86.75161
------------------------------------------------------------------------------

. regress winpct bafinal

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  1,    28) =    7.50
       Model |  282.700955     1  282.700955           Prob > F      =  0.0106
    Residual |  1055.77766    28  37.7063451           R-squared     =  0.2112
-------------+------------------------------           Adj R-squared =  0.1830
       Total |  1338.47862    29  46.1544351           Root MSE      =  6.1405

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     bafinal |    3.00539   1.097601     2.74   0.011     .7570566    5.253723
       _cons |  -27.31199   28.25985    -0.97   0.342    -85.19967    30.57569
------------------------------------------------------------------------------

. corr winpct errors
(obs=30)

             |   winpct   errors
-------------+------------------
      winpct |   1.0000
      errors |  -0.6339   1.0000

Multiple Linear Regression

. regress winpct nl1al0 k qs hr payrollmillions errors bafinal

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  7,    22) =   10.36
       Model |  1026.88282     7  146.697545           Prob > F      =  0.0000
    Residual |  311.595802    22  14.1634456           R-squared     =  0.7672
-------------+------------------------------           Adj R-squared =  0.6931
       Total |  1338.47862    29  46.1544351           Root MSE      =  3.7634

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      nl1al0 |  -1.882327   1.737358    -1.08   0.290    -5.485388    1.720733
           k |   .0260625    .010205     2.55   0.018     .0048986    .0472263
          qs |   .2003392   .1042446     1.92   0.068    -.0158508    .4165292
          hr |   .0225859   .0254233     0.89   0.384    -.0301387    .0753105
payrollmil~s |  -.0014683   .0206652    -0.07   0.944    -.0443253    .0413886
      errors |  -.1353628   .0457943    -2.96   0.007    -.2303343   -.0403912
     bafinal |   2.195859   .7720799     2.84   0.009     .5946634    3.797055
       _cons |   -42.1969   24.26719    -1.74   0.096    -92.52397     8.13017
------------------------------------------------------------------------------

Step-Wise Multiple Linear Regression

. sw, pr(.05): regress winpct nl1al0 k qs hr payrollmillions errors bafinal
                      begin with full model
p = 0.9440 >= 0.0500  removing payrollmillions
p = 0.3730 >= 0.0500  removing hr
p = 0.1694 >= 0.0500  removing nl1al0

      Source |       SS       df       MS              Number of obs =      30
-------------+------------------------------           F(  4,    25) =   17.66
       Model |  988.622277     4  247.155569           Prob > F      =  0.0000
    Residual |  349.856341    25  13.9942536           R-squared     =  0.7386
-------------+------------------------------           Adj R-squared =  0.6968
       Total |  1338.47862    29  46.1544351           Root MSE      =  3.7409

------------------------------------------------------------------------------
      winpct |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     bafinal |   2.601685   .6946007     3.75   0.001     1.171128    4.032242
           k |   .0229994   .0078448     2.93   0.007     .0068427    .0391562
          qs |   .1987509    .092977     2.14   0.043     .0072612    .3902407
      errors |  -.1485824   .0441469    -3.37   0.002    -.2395046   -.0576601
       _cons |  -45.32566   22.96426    -1.97   0.060    -92.62143    1.970117
------------------------------------------------------------------------------

Test for Heteroskedasticity

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of winpct

         chi2(1)      =     0.07
         Prob > chi2  =   0.7852

Test for Non-Linearity

. ovtest

Ramsey RESET test using powers of the fitted values of winpct
       Ho: model has no omitted variables
                 F(3, 22) =      0.57
                 Prob > F =      0.6398

Test for Normality of Noise

. predict res
(option xb assumed; fitted values)

. swilk res

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V          z     Prob>z
-------------+--------------------------------------------------
         res |     30    0.95509      1.427      0.736    0.23089

Test for Collinearity

. corr k qs errors bafinal
(obs=30)

             |        k       qs   errors  bafinal
-------------+------------------------------------
           k |   1.0000
          qs |   0.3971   1.0000
      errors |  -0.2728  -0.3699   1.0000
     bafinal |   0.0809  -0.1120  -0.1657   1.0000

The final model passes all tests checking the initial assumptions of the project.
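The pairwise values in the collinearity check above are ordinary Pearson correlation coefficients. As a minimal pure-Python sketch of that computation (using invented toy vectors, not the project's data):

```python
# Pearson correlation: r = cov(x, y) / (sd(x) * sd(y)).
# The lists below are invented toy data, not the 2010 MLB statistics.
import math

def pearson_r(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # cross products
    sxx = sum((a - x_bar) ** 2 for a in x)                      # sum of squares of x
    syy = sum((b - y_bar) ** 2 for b in y)                      # sum of squares of y
    return sxy / math.sqrt(sxx * syy)

print(round(pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]), 4))  # -> 1.0
print(round(pearson_r([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0]), 4))  # -> -1.0
```

Values near zero in the matrix above, such as the 0.0809 between k and bafinal, are what justify the paper's conclusion that the final predictors are not collinear.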