The Numbers Behind the MLB
Anonymous Students: AD, CD, BM (TF: Kevin Rader)

Abstract

This project measures the effects of various baseball statistics on the win percentage of the teams in Major League Baseball (MLB). Data were collected from espn.com for the 2010 regular season for all 30 MLB teams. First, simple linear regression models were run in Stata for each of the variables. Next, a multiple regression model was used as the basis for our model. The final model, after a step-wise regression with significance set at a p-value of .05 or less, indicates that the significant variables are batting average, strike outs, quality starts, and errors. The signs of the coefficients were as expected; however, which variables proved significant, and the fact that payroll, home runs, and league did not, was relatively surprising. Lastly, the researchers discuss the implications of these results and possible strategies that MLB teams could employ to increase their regular season win percentages in the future.

Introduction

The motivation for this team of researchers was their personal interest in sports. Baseball was the sport of choice because of its heavy reliance on statistics. When one thinks of baseball, it's all about the numbers: runs, strike outs, payroll, and so on. Our group therefore decided to boil down the numbers and find out which ones really matter. After all, the final number anyone really cares about is win percentage. That's the number that gets a team to the playoffs. For this reason, our team selected a variety of variables and ran a multiple regression on win percentage to determine which variables were significant indicators of win percentage.

The MLB consists of 30 teams. Data were collected from espn.com for each team for the 2010 regular season.
The independent variables considered were league (National or American), strike outs, quality starts, home runs, payroll, errors, and batting average. The dependent variable in the model is win percentage, measured as an actual percentage ranging from 0 to 100. League was included as a dummy variable: 1 represents the National League and 0 represents the American League. This variable allowed us to test whether win percentages differ between the two leagues.

Strike outs and quality starts are pitching statistics. The researchers were curious whether pitching is correlated with overall wins, and these variables seemed best suited to the regression. ERA was not chosen as a pitching statistic because there is an obvious correlation between ERA and win percentage; the group was instead interested in the effects of statistics less directly related to win percentage. Because strike outs and quality starts reflect good pitching, the researchers expected a positive correlation between these statistics and win percentage.

The offensive statistics were home runs and team batting average. As with ERA, runs scored were not included in the model because runs are too directly correlated with wins. Batting average was converted to a percentage between 0 and 100 by multiplying the traditional batting average by 100. The team thought that home runs would bring an interesting perspective to the model, given the range between teams known for big hitters and teams that score by other means.
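
The variable encoding described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' procedure: the names winpct, nl1al0, and bafinal are taken from the Stata output in this paper, while the function encode_team, the class TeamRow, the use of raw wins and losses as inputs, and the example team are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TeamRow:
    """One team's row in the units the model uses (hypothetical container)."""
    winpct: float   # win percentage on a 0-100 scale
    nl1al0: int     # league dummy: 1 = National League, 0 = American League
    bafinal: float  # team batting average on a 0-100 scale


def encode_team(wins: int, losses: int, league: str, batting_avg: float) -> TeamRow:
    """Convert raw team statistics into the scaled model variables."""
    if league not in ("NL", "AL"):
        raise ValueError("league must be 'NL' or 'AL'")
    return TeamRow(
        winpct=100.0 * wins / (wins + losses),
        nl1al0=1 if league == "NL" else 0,
        bafinal=100.0 * batting_avg,  # e.g. .260 becomes 26.0
    )


# Hypothetical team, not taken from the 2010 data set.
example = encode_team(wins=81, losses=81, league="NL", batting_avg=0.260)
print(example)
```

Putting win percentage and batting average on the same 0 to 100 scale means a coefficient can be read as percentage points of win percentage per percentage point of batting average.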

Since offense is crucial for a successful team, one would expect positive coefficients on these two offensive variables.

Alongside the pitching statistics, the variable errors represents each team's defense. Baseball is usually thought of in terms of hitting and pitching, so it will be interesting to see whether a defensive statistic such as errors plays a role in determining win percentage. Since the goal of baseball is not only to maximize runs scored but also to minimize runs scored by the other team, the team expected a negative correlation between errors and win percentage.

Lastly, the researchers wanted to look at the effect of payroll on win percentage. Individual player earnings receive a great deal of media attention. After looking at the data, it is striking that a single player on one team (Alex Rodriguez) makes almost as much money as the entire roster of another (the Pittsburgh Pirates). There are many outliers in this category, however, so the group did not expect a very strong correlation.

Methods

The researchers collected data from various pages of espn.com. The data were organized in an Excel spreadsheet, with care taken to match each statistic to the correct team. The data consist of 30 observations for each variable, since there are only 30 teams in the MLB. From there, the team imported the data into Stata. First, the group ran a simple linear regression of win percentage on each of the x variables. These simple linear regressions gave the group a sense of the relationship between each variable and win percentage. After examining each variable on its own, the team ran a multiple linear regression to study the effect of each variable when the remaining variables were also taken into consideration. A p-value of .05 or less was used to determine the significance of any given variable.
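
For readers who want to see what a single-variable fit involves, the following is a minimal Python sketch of ordinary least squares with one predictor. It is not the Stata routine the group used; the function name simple_ols and the five data points are hypothetical and chosen only to make the arithmetic easy to follow. The F statistic it returns has (1, n - 2) degrees of freedom, which corresponds to the F(1, 28) reported by Stata for 30 teams.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared, f_stat); f_stat has
    (1, n - 2) degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have equal length of at least 3")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / ss_tot
    f_stat = (ss_tot - ss_res) / (ss_res / (n - 2)) if ss_res > 0 else float("inf")
    return slope, intercept, r_squared, f_stat


# Hypothetical data: team errors and win percentage for five made-up teams.
errors = [80, 90, 100, 110, 120]
winpct = [58, 54, 50, 47, 41]
slope, intercept, r_squared, f_stat = simple_ols(errors, winpct)
print(f"slope={slope:.3f} intercept={intercept:.1f} R2={r_squared:.3f} F={f_stat:.1f}")
```

A negative slope in a fit like this would match the expectation stated earlier that more errors go with a lower win percentage.
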
Next, a step-wise regression was performed to eliminate variables that were not significant. This left the team with the final model for the project. The final model identified four variables, in addition to the constant, as significant indicators of win percentage. While it is questionable to include more than three variables given the limited number of observations, the team feels that the model is well suited to the goals of this project. With the final model determined, it was necessary to check the model assumptions, so the team ran several tests in Stata: hettest for heteroskedasticity, ovtest for nonlinearity, the Shapiro-Wilk test for normality of the residuals, and a correlation check for collinearity.

Results

After running the step-wise multiple regression, the only significant variables in determining win percentage were batting average, errors, strike outs, and quality starts. With the lowest p-value, 0.001, batting average was the most significant variable. The coefficient for batting average was 2.602, indicating that, when all other variables are held constant, a 1 percentage point increase in batting average is associated with a 2.602 percentage point increase in win percentage for any given team. This strong relationship makes sense, since wins are directly related to runs scored, and runs are directly related to hits and batting average. The coefficients and p-values for the intercept and all significant variables are shown below.

Variable              Coefficient   p-value
intercept             [missing]     [missing]
batting average (%)   2.602         0.001
errors                [missing]     [missing]
strike outs           [missing]     [missing]
quality starts        [missing]     [missing]

The final step-wise regression model is shown below.

. sw, pr(.05): regress winpct nl1al0 k qs hr payrollmillions errors bafinal
begin with full model
p = [missing] >= 0.0500  removing payrollmillions
p = [missing] >= 0.0500  removing hr
p = [missing] >= 0.0500  removing nl1al0
F(4, 25), Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
Coefficient rows: bafinal, k, qs, errors, _cons [values missing]

As shown in the Stata output, the R-squared for this model is [missing], with an adjusted R-squared of [missing], indicating that the model is a relatively strong predictor of win percentage. Additionally, the adjusted R-squared is greater in the final step-wise model than in the multiple linear regression with all variables included, which supports the reduced model as the better one.

The simple linear regressions for each x variable showed nothing too surprising. Although the most significant variable in the final model is batting average, the simple linear regression results indicate that errors is the most significant variable on its own. For this reason, the scatter plots for the simple linear regressions of both variables are shown below.

[Scatter plots: win percentage versus batting average; win percentage versus errors]
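
As a rough illustration of what the batting average coefficient of 2.602 means in games, the sketch below converts a change in batting average into a predicted change in wins. The coefficient comes from the Results; the 162-game regular season is a known fact about MLB; the function names and the one-point scenario are hypothetical. It assumes all other variables are held constant and describes an association in the 2010 data, not a guaranteed effect.

```python
BA_COEFFICIENT = 2.602   # win-percentage points per batting-average point (from Results)
GAMES_PER_SEASON = 162   # length of an MLB regular season


def predicted_winpct_change(delta_ba_points: float) -> float:
    """Predicted change in win percentage (0-100 scale) for a change in
    batting average expressed on the same 0-100 scale (1.0 = 10 points of BA)."""
    return BA_COEFFICIENT * delta_ba_points


def winpct_to_wins(winpct_change: float) -> float:
    """Convert a change in win percentage into wins over a full season."""
    return winpct_change / 100.0 * GAMES_PER_SEASON


# Hypothetical scenario: team batting average rises from .250 to .260,
# which is 1.0 on the 0-100 scale used in the model.
change = predicted_winpct_change(1.0)
print(f"{change:.3f} points of win percentage, about {winpct_to_wins(change):.1f} wins")
```

A full percentage point of team batting average is a large shift, so smaller inputs to predicted_winpct_change are likely the more realistic ones.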

Conclusions and Discussion

In the final multiple regression model, the coefficients have the signs one would predict. Good qualities of a team (batting average, strike outs, and quality starts) all have positive coefficients, and errors has a negative coefficient. Surprisingly, a team's payroll was not a significant indicator of win percentage. While this may surprise the everyday baseball fan, the team was wary of this variable because of the large number of outliers.

So, what do these results mean for the future of baseball? Because batting average and errors were found to be the most significant predictors, perhaps teams should direct their payrolls toward players who will improve these team statistics. For example, they should focus on position players who hit for a high batting average and play good defense while committing few errors. To a lesser extent, teams should also recruit pitchers who have a history of quality starts and high strike out totals.

The dummy variable league was not significant, indicating that win percentage for any given team does not differ between the American League and the National League when all other variables are held constant. In other words, the significant variables matter equally in each league, and league itself does not have a direct effect on win percentage.

Home runs were also removed from the final model. This is surprising because home runs are directly associated with runs scored, and runs are a strong indicator of win percentage. On the other hand, this could reflect the number of runs scored on hits other than home runs, as well as on walks and errors. The removal of this variable suggests that getting more hits overall is more important than hitting more home runs.

Interestingly, the simple linear regression of win percentage on payroll indicated that payroll was significant on its own.
However, in the multiple regression model, payroll was the least significant variable and was therefore the first removed. The group concluded that this discrepancy reflects poor allocation of teams' payrolls. For example, rather than using payroll to sign players as described above, a team may sign big-name players whose statistics are relatively poor for their salary. Essentially, these players are paid more than they're worth just to draw more fan interest.

Lastly, it is important to mention that the tests run in the final sections of the project indicate that our assumptions of linearity, normality of residuals, and so on were all satisfied. This means that the final model requires no transformations before being used as a predictive model. Some of these issues were taken care of by our initial transformations of data values into similar ranges. For example, win percentage and batting average were both converted to actual percentages in the range of 0 to 100. This was done so that the output could be read more easily in terms of a 1 unit increase in x and its effect on y.

The major weakness of this project is that our data were limited to 30 observations. The sample was limited because there are only 30 teams in the MLB, and we wanted to include data from a single season only, since strategies have changed throughout the history of baseball. Additionally, the team members analyzed only the regular season, not the playoffs. For example, the Giants won the World Series, and they tied for the third highest number of quality starts and had the highest number of strike outs in our data. This suggests that these two variables may be more significant for the playoffs than our results indicate, but because our data were limited to the regular season, these effects were not taken into account.

In conclusion, if a team's goal is to increase its win percentage in the regular season, our model is a strong indicator of the significant variables. However, because of the structure of the MLB playoff system, the team with the highest win percentage does not necessarily win the World Series. Therefore, a different study should be conducted to analyze the most significant variables in determining success in the post-season.

References

Stata Software. Gould, Bill. Version [missing].
MLB Statistics. Elias Sports Bureau.

Appendix

Simple Linear Regressions:

. regress winpct nl1al0
  F(1, 28) = 0.07
  Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: nl1al0, _cons [values missing]

. regress winpct k
  F(1, 28), Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: k, _cons [values missing]

. regress winpct qs
  F(1, 28) = 8.58
  Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: qs, _cons [values missing]

. regress winpct hr
  F(1, 28) = 6.60
  Root MSE = 6.22
  Prob > F, R-squared, Adj R-squared: [missing]
  Coefficient rows: hr, _cons [values missing]

. regress winpct payrollmillions
  F(1, 28) = 4.34
  Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: payrollmil~s, _cons [values missing]

. regress winpct errors
  F(1, 28), Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: errors, _cons [values missing]

. regress winpct bafinal
  F(1, 28) = 7.50
  Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: bafinal, _cons [values missing]

. corr winpct errors
(obs=30)
  Correlation matrix of winpct and errors [values missing]

Multiple Linear Regression

. regress winpct nl1al0 k qs hr payrollmillions errors bafinal
  F(7, 22), Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: nl1al0, k, qs, hr, payrollmil~s, errors, bafinal, _cons [values missing]

Step-Wise Multiple Linear Regression

. sw, pr(.05): regress winpct nl1al0 k qs hr payrollmillions errors bafinal
begin with full model
p = [missing] >= 0.0500  removing payrollmillions
p = [missing] >= 0.0500  removing hr
p = [missing] >= 0.0500  removing nl1al0
  F(4, 25), Prob > F, R-squared, Adj R-squared, Root MSE: [missing]
  Coefficient rows: bafinal, k, qs, errors, _cons [values missing]

Test for Heteroskedasticity

. hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
  Ho: Constant variance
  Variables: fitted values of winpct
  chi2(1) = 0.07
  Prob > chi2 = [missing]

Test for Non-Linearity

. ovtest
Ramsey RESET test using powers of the fitted values of winpct
  Ho: model has no omitted variables
  F(3, 22) = 0.57
  Prob > F = [missing]

Test for Normality of Noise

. predict res
(option xb assumed; fitted values)
. swilk res
Shapiro-Wilk W test for normal data
  Variable   Obs   W   V   z   Prob>z
  res        [values missing]

Test for Collinearity

. corr k qs errors bafinal
(obs=30)
  Correlation matrix of k, qs, errors, and bafinal [values missing]

The final model passes all tests checking the initial assumptions of the project.
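
The collinearity check above rests on pairwise correlations among the four predictors in the final model. The Python sketch below shows how such a correlation matrix can be computed by hand. It is an illustration under stated assumptions, not a reproduction of the Stata output: the variable names k, qs, errors, and bafinal come from the paper, while the functions pearson_r and correlation_matrix and all of the numbers are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("each variable must vary")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations for named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values for five made-up teams, not the 2010 data.
teams = {
    "k": [1050, 1120, 1180, 1240, 1300],
    "qs": [70, 85, 78, 92, 88],
    "errors": [110, 95, 102, 84, 90],
    "bafinal": [25.1, 26.4, 25.8, 26.0, 27.2],
}
matrix = correlation_matrix(teams)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Off-diagonal correlations close to 1 or -1 would suggest that two predictors carry overlapping information, which is the situation a collinearity check is meant to catch.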


More information

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2

ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2 University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages

More information

Interaction effects between continuous variables (Optional)

Interaction effects between continuous variables (Optional) Interaction effects between continuous variables (Optional) Richard Williams, University of Notre Dame, http://www.nd.edu/~rwilliam/ Last revised February 0, 05 This is a very brief overview of this somewhat

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

STA 4163 Lecture 10: Practice Problems

STA 4163 Lecture 10: Practice Problems STA 463 Lecture 0: Practice Problems Problem.0: A study was conducted to determine whether a student's final grade in STA406 is linearly related to his or her performance on the MATH ability test before

More information

Regression Analysis. Data Calculations Output

Regression Analysis. Data Calculations Output Regression Analysis In an attempt to find answers to questions such as those posed above, empirical labour economists use a useful tool called regression analysis. Regression analysis is essentially a

More information

Basic Statistical and Modeling Procedures Using SAS

Basic Statistical and Modeling Procedures Using SAS Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom

More information

Quick Stata Guide by Liz Foster

Quick Stata Guide by Liz Foster by Liz Foster Table of Contents Part 1: 1 describe 1 generate 1 regress 3 scatter 4 sort 5 summarize 5 table 6 tabulate 8 test 10 ttest 11 Part 2: Prefixes and Notes 14 by var: 14 capture 14 use of the

More information

FIRM SPECIFIC FACTORS THAT DETERMINE INSURANCE COMPANIES PERFORMANCE IN ETHIOPIA

FIRM SPECIFIC FACTORS THAT DETERMINE INSURANCE COMPANIES PERFORMANCE IN ETHIOPIA FIRM SPECIFIC FACTORS THAT DETERMINE INSURANCE COMPANIES PERFORMANCE IN ETHIOPIA Daniel Mehari, MSc Arba Minch University, Arba Minch, Ethiopia Tilahun Aemiro, Msc Bahir Dar University, Bahir Dar, Ethiopia

More information

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors.

Failure to take the sampling scheme into account can lead to inaccurate point estimates and/or flawed estimates of the standard errors. Analyzing Complex Survey Data: Some key issues to be aware of Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 24, 2015 Rather than repeat material that is

More information

The way to measure individual productivity in

The way to measure individual productivity in Which Baseball Statistic Is the Most Important When Determining Team I. INTRODUCTION The way to measure individual productivity in any working environment is to track individual performance in a working

More information

PREDICTING THE MATCH OUTCOME IN ONE DAY INTERNATIONAL CRICKET MATCHES, WHILE THE GAME IS IN PROGRESS

PREDICTING THE MATCH OUTCOME IN ONE DAY INTERNATIONAL CRICKET MATCHES, WHILE THE GAME IS IN PROGRESS PREDICTING THE MATCH OUTCOME IN ONE DAY INTERNATIONAL CRICKET MATCHES, WHILE THE GAME IS IN PROGRESS Bailey M. 1 and Clarke S. 2 1 Department of Epidemiology & Preventive Medicine, Monash University, Australia

More information

Using Minitab for Regression Analysis: An extended example

Using Minitab for Regression Analysis: An extended example Using Minitab for Regression Analysis: An extended example The following example uses data from another text on fertilizer application and crop yield, and is intended to show how Minitab can be used to

More information

ANDREW BAILEY v. THE OAKLAND ATHLETICS

ANDREW BAILEY v. THE OAKLAND ATHLETICS ANDREW BAILEY v. THE OAKLAND ATHLETICS REPRESENTATION FOR: THE OAKLAND ATHLETICS TEAM 36 Table of Contents Introduction... 1 Hierarchy of Pitcher in Major League Baseball... 2 Quality of Mr. Bailey s Contribution

More information

A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing Sector

A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing Sector Journal of Modern Accounting and Auditing, ISSN 1548-6583 November 2013, Vol. 9, No. 11, 1519-1525 D DAVID PUBLISHING A Panel Data Analysis of Corporate Attributes and Stock Prices for Indian Manufacturing

More information

An Analysis of the Undergraduate Tuition Increases at the University of Minnesota Duluth

An Analysis of the Undergraduate Tuition Increases at the University of Minnesota Duluth Proceedings of the National Conference On Undergraduate Research (NCUR) 2012 Weber State University March 29-31, 2012 An Analysis of the Undergraduate Tuition Increases at the University of Minnesota Duluth

More information

Discussion Section 4 ECON 139/239 2010 Summer Term II

Discussion Section 4 ECON 139/239 2010 Summer Term II Discussion Section 4 ECON 139/239 2010 Summer Term II 1. Let s use the CollegeDistance.csv data again. (a) An education advocacy group argues that, on average, a person s educational attainment would increase

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Chapter 1 Review 1. As part of survey of college students a researcher is interested in the variable class standing. She records a 1 if the student is a freshman, a 2 if the student

More information

Getting Correct Results from PROC REG

Getting Correct Results from PROC REG Getting Correct Results from PROC REG Nathaniel Derby, Statis Pro Data Analytics, Seattle, WA ABSTRACT PROC REG, SAS s implementation of linear regression, is often used to fit a line without checking

More information

Solution Let us regress percentage of games versus total payroll.

Solution Let us regress percentage of games versus total payroll. Assignment 3, MATH 2560, Due November 16th Question 1: all graphs and calculations have to be done using the computer The following table gives the 1999 payroll (rounded to the nearest million dolars)

More information

Maximizing Precision of Hit Predictions in Baseball

Maximizing Precision of Hit Predictions in Baseball Maximizing Precision of Hit Predictions in Baseball Jason Clavelli clavelli@stanford.edu Joel Gottsegen joeligy@stanford.edu December 13, 2013 Introduction In recent years, there has been increasing interest

More information

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices: Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:

More information

A Guide to Baseball Scorekeeping

A Guide to Baseball Scorekeeping A Guide to Baseball Scorekeeping Keeping score for Claremont Little League Spring 2015 Randy Swift rjswift@cpp.edu Paul Dickson opens his book, The Joy of Keeping Score: The baseball world is divided into

More information

BEAVER COUNTY FASTPITCH RULES FOR 2013 SEASON

BEAVER COUNTY FASTPITCH RULES FOR 2013 SEASON BEAVER COUNTY FASTPITCH RULES FOR 2013 SEASON The League Website is www.eteamz.com/bcgfpl this will be updated as league schedule of events are completed, team schedules and scores are submitted. Basic

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format:

Lab 5 Linear Regression with Within-subject Correlation. Goals: Data: Use the pig data which is in wide format: Lab 5 Linear Regression with Within-subject Correlation Goals: Data: Fit linear regression models that account for within-subject correlation using Stata. Compare weighted least square, GEE, and random

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

Directions for using SPSS

Directions for using SPSS Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...

More information

Regression analysis in practice with GRETL

Regression analysis in practice with GRETL Regression analysis in practice with GRETL Prerequisites You will need the GNU econometrics software GRETL installed on your computer (http://gretl.sourceforge.net/), together with the sample files that

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Statistical Predictors of March Madness: An Examination of the NCAA Men s Basketball Championship

Statistical Predictors of March Madness: An Examination of the NCAA Men s Basketball Championship Wright 1 Statistical Predictors of March Madness: An Examination of the NCAA Men s Basketball Championship Chris Wright Pomona College Economics Department April 30, 2012 Wright 2 1. Introduction 1.1 History

More information

The Value of Major League Baseball Players An Empirical Analysis of the Baseball Labor Market

The Value of Major League Baseball Players An Empirical Analysis of the Baseball Labor Market HAVERFORD COLLEGE The Value of Major League Baseball Players An Empirical Analysis of the Baseball Labor Market Chris Maurice 4/29/2010 The goal of this thesis was to test the efficiency of the baseball

More information

The Effects of Atmospheric Conditions on Pitchers

The Effects of Atmospheric Conditions on Pitchers The Effects of Atmospheric Conditions on Rodney Paul Syracuse University Matt Filippi Syracuse University Greg Ackerman Syracuse University Zack Albright Syracuse University Andrew Weinbach Coastal Carolina

More information

Team Success and Personnel Allocation under the National Football League Salary Cap John Haugen

Team Success and Personnel Allocation under the National Football League Salary Cap John Haugen Team Success and Personnel Allocation under the National Football League Salary Cap 56 Introduction T especially interesting market in which to study labor economics. The salary cap rule of the NFL that

More information

TULANE BASEBALL ARBITRATION COMPETITION BRIEF FOR THE TEXAS RANGERS. Team 20

TULANE BASEBALL ARBITRATION COMPETITION BRIEF FOR THE TEXAS RANGERS. Team 20 TULANE BASEBALL ARBITRATION COMPETITION BRIEF FOR THE TEXAS RANGERS Team 20 1 Table of Contents I. Introduction...3 II. Nelson Cruz s Contribution during Last Season Favors the Rangers $4.7 Million Offer...4

More information

Figure 1. Figure 2. Figure 3. Figure 4. normality such as non-parametric procedures.

Figure 1. Figure 2. Figure 3. Figure 4. normality such as non-parametric procedures. Introduction to Building a Linear Regression Model Leslie A. Christensen The Goodyear Tire & Rubber Company, Akron Ohio Abstract This paper will explain the steps necessary to build a linear regression

More information

Using JMP with a Specific

Using JMP with a Specific 1 Using JMP with a Specific Example of Regression Ying Liu 10/21/ 2009 Objectives 2 Exploratory data analysis Simple liner regression Polynomial regression How to fit a multiple regression model How to

More information

Determinants of Demand for Cable TV Services in the Era of Internet Communication Technologies

Determinants of Demand for Cable TV Services in the Era of Internet Communication Technologies Pace University DigitalCommons@Pace Honors College Theses Pforzheimer Honors College 2015 Determinants of Demand for Cable TV Services in the Era of Internet Communication Technologies Michael Gorodetsky

More information

BIOL 933 Lab 6 Fall 2015. Data Transformation

BIOL 933 Lab 6 Fall 2015. Data Transformation BIOL 933 Lab 6 Fall 2015 Data Transformation Transformations in R General overview Log transformation Power transformation The pitfalls of interpreting interactions in transformed data Transformations

More information

Examining if High-Team Payroll Leads to High-Team Performance in Baseball: A Statistical Study. Nicholas Lambrianou 13'

Examining if High-Team Payroll Leads to High-Team Performance in Baseball: A Statistical Study. Nicholas Lambrianou 13' Examining if High-Team Payroll Leads to High-Team Performance in Baseball: A Statistical Study Nicholas Lambrianou 13' B.S. In Mathematics with Minors in English and Economics Dr. Nickolas Kintos Thesis

More information

EXPECTANCY THEORY AND MAJOR LEAGUE BASEBALL PLAYER COMPENSATION EDWARD J. LEONARD

EXPECTANCY THEORY AND MAJOR LEAGUE BASEBALL PLAYER COMPENSATION EDWARD J. LEONARD EXPECTANCY THEORY AND MAJOR LEAGUE BASEBALL PLAYER COMPENSATION by EDWARD J. LEONARD A thesis submitted is partial fulfillment of the requirements for the Honors in the Major Program in Management in the

More information

Indian School of Business Forecasting Sales for Dairy Products

Indian School of Business Forecasting Sales for Dairy Products Indian School of Business Forecasting Sales for Dairy Products Contents EXECUTIVE SUMMARY... 3 Data Analysis... 3 Forecast Horizon:... 4 Forecasting Models:... 4 Fresh milk - AmulTaaza (500 ml)... 4 Dahi/

More information

GLM I An Introduction to Generalized Linear Models

GLM I An Introduction to Generalized Linear Models GLM I An Introduction to Generalized Linear Models CAS Ratemaking and Product Management Seminar March 2009 Presented by: Tanya D. Havlicek, Actuarial Assistant 0 ANTITRUST Notice The Casualty Actuarial

More information

Addressing Alternative. Multiple Regression. 17.871 Spring 2012

Addressing Alternative. Multiple Regression. 17.871 Spring 2012 Addressing Alternative Explanations: Multiple Regression 17.871 Spring 2012 1 Did Clinton hurt Gore example Did Clinton hurt Gore in the 2000 election? Treatment is not liking Bill Clinton 2 Bivariate

More information

MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal

MGT 267 PROJECT. Forecasting the United States Retail Sales of the Pharmacies and Drug Stores. Done by: Shunwei Wang & Mohammad Zainal MGT 267 PROJECT Forecasting the United States Retail Sales of the Pharmacies and Drug Stores Done by: Shunwei Wang & Mohammad Zainal Dec. 2002 The retail sale (Million) ABSTRACT The present study aims

More information

c 2015, Jeffrey S. Simonoff 1

c 2015, Jeffrey S. Simonoff 1 Modeling Lowe s sales Forecasting sales is obviously of crucial importance to businesses. Revenue streams are random, of course, but in some industries general economic factors would be expected to have

More information

Data Management Summative MDM 4U1 Alex Bouma June 14, 2007. Sporting Cities Major League Locations

Data Management Summative MDM 4U1 Alex Bouma June 14, 2007. Sporting Cities Major League Locations Data Management Summative MDM 4U1 Alex Bouma June 14, 2007 Sporting Cities Major League Locations Table of Contents Title Page 1 Table of Contents...2 Introduction.3 Background.3 Results 5-11 Future Work...11

More information

Lecture 15. Endogeneity & Instrumental Variable Estimation

Lecture 15. Endogeneity & Instrumental Variable Estimation Lecture 15. Endogeneity & Instrumental Variable Estimation Saw that measurement error (on right hand side) means that OLS will be biased (biased toward zero) Potential solution to endogeneity instrumental

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

More information