A Predictive Model for NFL Rookie Quarterback Fantasy Football Points



Steve Bronder and Alex Polinsky
Duquesne University Economics Department

Abstract

This analysis designs a model that predicts NFL rookie quarterbacks' fantasy football points. Theoretical models for judging the quality of football players are limited. We theorize that a player's skill can be isolated by assessing his fantasy football points. The model predicts, within a margin of error, how well a quarterback will perform in his rookie year based on his college career statistics. Data were gathered for quarterbacks in the NFL drafts from 2009 to 2012. Our linear model uses college career statistics to predict fantasy football scores. We find that the percentage of pass completions, average yards per game, and rushing touchdowns are significant predictors of a rookie quarterback's NFL fantasy points. The model does not suffer from heteroskedasticity, but it does have a large constant and a large standard error for the constant.

1 Background

Statistical analysis in baseball has become so popular that there is now an Oscar-nominated movie and a best-selling book about it. Moneyball is the story of how the Oakland Athletics' general manager, facing a tight budget and the loss of his star players, used statistical analysis to sign undervalued players. Despite their small budget, the Oakland Athletics made the playoffs that year. Since then, almost every baseball team in the United States has adopted sabermetrics, the statistical analysis of baseball.

While sabermetrics has come to dominate baseball, other sports such as football have been wary of bringing statistical analysis into team decision-making. Most decisions football teams make, whether which players to draft or what play to run on fourth down, are made heuristically by the team's manager or head coach. Before each season, teams go through a seven-round draft in which they choose NCAA athletes to join their rosters. Estimating how well a rookie will perform in the NFL is very difficult because, unlike baseball, where each at-bat takes place between only the pitcher and the batter, there are twenty-two players on the field affecting every detail of the play. Each player can change the outcome of a play dramatically, which makes the statistics for any individual player difficult to assess. The challenge of measuring a single player's value is what sparked our interest.

2 Model

In this model, we regress NFL rookie quarterbacks' fantasy points on the players' college performance. Our data consist of metrics such as rushing touchdowns, passing touchdowns, passing completion percentage, and rushing yards. Using fantasy points as our dependent variable allowed us to remove some of the omitted variable bias that is inherent in our data.
When fantasy points are calculated, it does not matter whether yards were gained, for example, on first down or fourth down.

We use a standard linear regression, which is appropriate because we are only interested in how fantasy points change with our explanatory variables. Unlike in baseball, there is little theory in football on which to base a model. We therefore run a stepwise regression and use the Bayesian information criterion (BIC) to select the model. The BIC measures the relative quality of a model by weighing goodness of fit against complexity.

Two problems we may run into are heteroskedasticity and omitted variable bias. Omitted variable bias is likely to be prominent because of the nature of the sport. Unlike baseball, which is an individual sport hidden inside a team sport, football is purely a team sport. A quarterback on a very good team will tend to have a better fantasy score than a quarterback on a worse team because, for example, he may have better linemen blocking for him. It may not be possible to correct for omitted variable bias, as that type of data is unavailable; this may be a good direction for future research. To test for heteroskedasticity we use the Breusch-Pagan test. If we find heteroskedasticity, we will either use robust standard errors to correct for it or respecify the model.

3 Regression Specification

First, we used the R package leaps to perform forward and backward stepwise regressions. When fitting models, it is always possible to increase $R^2$ by adding variables, but this can lead to overfitting. The BIC addresses this problem by including a penalty term for the number of parameters in the model; the lower a model's BIC, the more efficient the model. Figure 1 shows the BICs from our stepwise regressions. A variable whose shading moves from white to black became increasingly significant as other variables were dropped.

Figure 1: BIC Plot for Stepwise Regression on College Metrics

Figure 1 suggests that the most efficient model is the following equation:

$$\mathrm{FTSPTS} = \beta + \mathrm{AY.A}(X_1) + \mathrm{TD.2}(X_2) \tag{1}$$
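The way the BIC trades fit against the number of parameters can be illustrated with a minimal sketch. It assumes the common Gaussian form, BIC = n ln(RSS/n) + k ln(n), which may differ by a constant or a rescaling from what the leaps package reports. The residual sums of squares, sample size, and parameter counts are hypothetical values chosen for illustration, not results from our data.

```python
import math


def bic(rss: float, n: int, k: int) -> float:
    """Gaussian BIC up to an additive constant: n * ln(RSS / n) + k * ln(n)."""
    if rss <= 0 or n <= 0:
        raise ValueError("rss and n must be positive")
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical values for illustration only (not taken from the paper's data).
n_obs = 30
small_model = bic(rss=1000.0, n=n_obs, k=3)  # fewer parameters, slightly worse fit
large_model = bic(rss=980.0, n=n_obs, k=6)   # more parameters, slightly better fit

# The larger model lowers the RSS, but not by enough to offset its penalty.
preferred = "small" if small_model < large_model else "large"
print(f"BIC small={small_model:.2f}, large={large_model:.2f}, preferred={preferred}")
```

In this toy comparison the three extra parameters reduce the residual sum of squares only slightly, so the smaller model has the lower BIC and would be preferred.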

In Equation 1, FTSPTS, AY.A, TD.2, and β are rookie quarterback fantasy points, average yards per game, rushing touchdowns, and a constant, respectively. While this model was the most efficient, a Breusch-Pagan test for heteroskedasticity forced us to reject the null hypothesis of homoskedasticity at the .01 level, so we concluded that the model suffered from heteroskedasticity. With this knowledge, we returned to Figure 1 and tried the following model:

$$\mathrm{FTSPTS} = \beta + \mathrm{Pct}(X_1) + \mathrm{Y.A}(X_2) + \mathrm{TD.2}(X_3) \tag{2}$$

Pct and Y.A are the quarterback's pass completion percentage and average yards per completion over his college career, respectively. The Breusch-Pagan test for this model returned a p-value of .1063. Because this p-value exceeds the conventional significance levels, we fail to reject the null hypothesis and conclude that this model does not suffer from heteroskedasticity. Figure 2 plots the residuals against leverage for Equation 2.

Figure 2: Residuals versus Leverage for Equation 2

Figure 2 shows excessive leverage for Nick Foles and Cam Newton. This means that the predictor values for these two observations lie far from the average predictor values. These two observations are why we come close to rejecting the null hypothesis of the Breusch-Pagan test. The Cook's distance in the graph shows us what type of data points we should look for to best decrease the chance of our model suffering from heteroskedasticity. Finding more star players like Cam Newton and Nick Foles would decrease the chance of heteroskedasticity.

4 Results Analysis

Below are the outputs for Equation 2.

Table 1: Regression Output for Equation 2

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)  -783.9780     222.5539     -3.52     0.0015 **
Pct             7.4641       4.1502      1.80     0.0829 .
Y.A            52.6316      22.2688      2.36     0.0253 *
TD.2           13.0181       6.2696      2.08     0.0472 *

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For every one-percentage-point increase in pass completion percentage, fantasy points increase by 7.46. For every one-yard increase in average yards per completion, fantasy points increase by 52.63. Each additional rushing touchdown is associated with a 13.02 increase in fantasy points. All variables are found to be significant: pass completion percentage at the .1 level, average yards per completion and rushing touchdowns at the .05 level, and the intercept at the .01 level.

One thing worth noting is our large negative intercept and its large standard error. This may be due in part to the small number of observations we were able to obtain; collecting more data would be one way to address this and increase the predictive power of our model. Removing the constant led to a model that suffered from heteroskedasticity and in which the independent variables were no longer significant.

To test for multicollinearity we computed variance inflation factors (VIFs). All variables had a VIF of less than 5, so we concluded that our model does not suffer from multicollinearity. Because the model suffers from neither heteroskedasticity nor multicollinearity, we can establish confidence intervals for our parameters. Table 2 contains the confidence intervals at the .95 level.

Table 2: 95 Percent Confidence Intervals for Equation 2

                 2.5 %     97.5 %
(Intercept)  -1239.86    -328.10
Pct             -1.04      15.97
TD.2             0.18      25.86
Y.A              7.02      98.25
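The t values and confidence intervals reported above can be cross-checked from the estimates and standard errors alone, as the following sketch shows. The critical value is an assumption: 2.0484 is the two-sided 95 percent t quantile for 28 residual degrees of freedom, a figure backed out from the reported intervals rather than stated in the text.

```python
# Estimates and standard errors as reported in the regression output for Equation 2.
COEFS = {
    "(Intercept)": (-783.9780, 222.5539),
    "Pct": (7.4641, 4.1502),
    "Y.A": (52.6316, 22.2688),
    "TD.2": (13.0181, 6.2696),
}

# Assumption: two-sided 95% t critical value for 28 residual degrees of freedom.
# The degrees of freedom are inferred from the reported intervals, not stated in the text.
T_CRIT = 2.0484


def t_value(estimate: float, std_error: float) -> float:
    """t statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_error


def conf_int(estimate: float, std_error: float, t_crit: float = T_CRIT):
    """Symmetric interval: estimate plus or minus t_crit standard errors."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


summary = {}
for name, (est, se) in COEFS.items():
    low, high = conf_int(est, se)
    summary[name] = (round(t_value(est, se), 2), round(low, 2), round(high, 2))
    print(f"{name:<12} t = {summary[name][0]:6.2f}   95% CI = [{low:9.2f}, {high:9.2f}]")
```

Under that assumed critical value, the computed t values and interval endpoints agree with the reported ones to two decimal places. The interval for Pct includes zero, which is consistent with that variable being significant only at the .1 level.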

5 Conclusion

In conclusion, our model was fit on a small sample and showed no evidence of multicollinearity or heteroskedasticity. We narrowed our choice of independent variables to limit possible bias. Our R-squared of 0.436 indicates that the model explains roughly 44 percent of the variation in rookie fantasy points. However, the large standard error of our constant implies that collecting more data would substantially increase the predictive power of our model. We tested how predictive our model is on the player Matthew Ryan. The prediction was only fifteen fantasy points away from his actual total, which offers some support for the model. While this is just one observation, it does fall within the model's confidence intervals. Collecting more data and back-testing against this model would help us identify the strength of the linear regression and its prediction intervals.
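Applying the fitted Equation 2 to a new quarterback amounts to plugging his college statistics into the estimated coefficients. The sketch below illustrates this under two assumptions: that Pct is measured in percentage points (0 to 100), and that the inputs shown are hypothetical values that do not correspond to any player in our data.

```python
# Coefficient estimates as reported in the regression output for Equation 2.
INTERCEPT = -783.9780
B_PCT = 7.4641    # pass completion percentage (assumed to be on a 0-100 scale)
B_YA = 52.6316    # Y.A
B_TD2 = 13.0181   # rushing touchdowns


def predict_fantasy_points(pct: float, y_a: float, td2: float) -> float:
    """Point prediction from Equation 2; carries no measure of uncertainty."""
    return INTERCEPT + B_PCT * pct + B_YA * y_a + B_TD2 * td2


# Hypothetical college career line, for illustration only.
example = predict_fantasy_points(pct=60.0, y_a=8.0, td2=10.0)
print(f"Predicted rookie fantasy points: {example:.1f}")
```

A point prediction like this should be read alongside the wide confidence intervals on the coefficients, since small changes in the estimates move the prediction substantially.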