Numerical Algorithms for Predicting Sports Results

Similar documents
Predicting outcome of soccer matches using machine learning

Predicting sports events from past results

How To Bet On An Nfl Football Game With A Machine Learning Program

Numerical algorithms for predicting sports results Jack Blundell Computer Science (with Industry) 2008/2009

Beating the NCAA Football Point Spread

Football Match Winner Prediction

Predicting Margin of Victory in NFL Games: Machine Learning vs. the Las Vegas Line

Predicting Market Value of Soccer Players Using Linear Modeling Techniques

Home Bias in the NFL Pointspread Market. Matt Cundith Department of Economics California State University, Sacramento

International Statistical Institute, 56th Session, 2007: Phil Everson

Sports Forecasting. H.O. Stekler. RPF Working Paper No August 13, 2007

Relational Learning for Football-Related Predictions

Forecasting Accuracy and Line Changes in the NFL and College Football Betting Markets

Point Spreads. Sports Data Mining. Changneng Xu. Topic : Point Spreads

Beating the MLB Moneyline

Predicting the NFL Using Twitter

The Fibonacci Strategy Revisited: Can You Really Make Money by Betting on Soccer Draws?

A Test for Inherent Characteristic Bias in Betting Markets ABSTRACT. Keywords: Betting, Market, NFL, Efficiency, Bias, Home, Underdog

Does NFL Spread Betting Obey the E cient Market Hypothesis?

Pick Me a Winner An Examination of the Accuracy of the Point-Spread in Predicting the Winner of an NFL Game

SPORTS FORECASTING. There have been an enormous number of studies involving various aspects of sports. We

Predicting the NFL Using Twitter. Shiladitya Sinha, Chris Dyer, Kevin Gimpel, Noah Smith

Data mining and football results prediction

EXAMINING NCAA/NFL MARKET EFFICIENCY

Making Sense of the Mayhem: Machine Learning and March Madness

ISSUES IN SPORTS FORECASTING

Testing Efficiency in the Major League of Baseball Sports Betting Market.

Beating the Book: Are There Patterns in NFL Betting Lines?

THE DETERMINANTS OF SCORING IN NFL GAMES AND BEATING THE SPREAD

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

A Contrarian Approach to the Sports Betting Marketplace

Testing Market Efficiency in a Fixed Odds Betting Market

How to Win at the Track

Predicting Soccer Match Results in the English Premier League

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Baseball Pay and Performance

Basketball Market Efficiency and the Big Dog Bias. Ladd Kochman* and Randy Goodwin*

Clustering Connectionist and Statistical Language Processing

Prediction on Soccer Matches using Multi- Layer Perceptron

A Hybrid Prediction System for American NFL Results

The Performance of Betting Lines for Predicting the Outcome of NFL Games

Using ELO ratings for match result prediction in association football

Getting Even More Out of Ensemble Selection

You will see a giant is emerging

Soccer Analytics. Predicting the outcome of soccer matches. Research Paper Business Analytics. December Nivard van Wijk

Sports betting odds: A source for empirical Bayes

Sam Schaefer April 2011

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION

Applied Mathematical Sciences, Vol. 7, 2013, no. 112, HIKARI Ltd,

Volume 30, Issue 4. Market Efficiency and the NHL totals betting market: Is there an under bias?

The NCAA Basketball Betting Market: Tests of the Balanced Book and Levitt Hypotheses

The 7 Premier League trends the bookies don t want YOU to know...

Data quality in Accounting Information Systems

Fair Bets and Profitability in College Football Gambling

Financial Trading System using Combination of Textual and Numerical Data

The Determinants of Scoring in NFL Games and Beating the Over/Under Line. C. Barry Pfitzner*, Steven D. Lang*, and Tracy D.

How To Predict The Outcome Of The Ncaa Basketball Tournament

WIN AT ANY COST? How should sports teams spend their m oney to win more games?

Advanced Ensemble Strategies for Polynomial Models

Herd Behavior and Underdogs in the NFL

NAPCS Product List for NAICS 7132: Gambling Industries

Testing the Efficiency of Sports Betting Markets: An Examination of National Football League and English Premier League Betting

Statistical techniques for betting on football markets.

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

THE FAVOURITE-LONGSHOT BIAS AND MARKET EFFICIENCY IN UK FOOTBALL BETTING

Published in 2003 by High Stakes Publishing, 21 Great Ormond Street, London, WC1N 3JB Copyright Joseph Buchdahl

SPORTS ECONOMICS B

EFFICIENCY IN BETTING MARKETS: EVIDENCE FROM ENGLISH FOOTBALL

Using Past Performance to Predict NFL Outcomes: A Chartist Approach

Football Trading Portfolio: Winning Made Easy Football Trading Portfolio - Review - > Click Here GET IT NOW

Glossary of bet types

How To Predict Web Site Visits

Efficiency of sports betting markets

Experiments in Web Page Classification for Semantic Web

SECRET BETTING CLUB FINK TANK FREE SYSTEM GUIDE

Prices, Point Spreads and Profits: Evidence from the National Football League

Commercial-In-Confidence. Report to Football1x2.com. Football Frenzy

LIVE BETTING ULTRA RULES

Football Match Result Predictor Website

POINT SPREAD SHADING AND BEHAVIORAL BIASES IN NBA BETTING MARKETS. by Brad R. Humphreys *

DEVELOPING A MODEL THAT REFLECTS OUTCOMES OF TENNIS MATCHES

Mining an Online Auctions Data Warehouse

How To Predict Seed In A Tournament

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

Testing the Efficiency of Sports Betting Markets

the sporting index group

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Statistical Football Predictions and Betting Tips

Perspectives on Data Mining

Analyzing Information Efficiency in the Betting Market for Association Football League Winners

not possible or was possible at a high cost for collecting the data.

the sporting index group

Sports betting: A source for empirical Bayes

UZH Business Working Paper Series (ISSN )

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Prediction of Stock Performance Using Analytical Techniques

Transcription:

Numerical Algorithms for Predicting Sports Results by Jack David Blundell, 1 School of Computing, Faculty of Engineering ABSTRACT Numerical models can help predict the outcome of sporting events. The features within these models rely on data associated with the competitors. Here, a logistic regression model was used to predict the outcome of American Football matches. As well as using data such as the two teams previous results, novel features such as stadia size and the distance the away team had to travel were incorporated. This model was seen to perform better than two baselines: (i) simply predicting the home team or (ii) using the previous result between the teams. By adding more informative features to the model, a prediction accuracy of 65% was produced. These predictions matched that of bookmakers showing that the information enclosed within the model could be used to predict games to a highly successful rate. Keywords: American Football; bookmakers; logistic model; machine learning; regression; sporting predictions 1 email address for correspondence: jackblundell@hotmail.co.uk

Numerical Algorithms for Predicting Sports Results INTRODUCTION The aim of this work was to see if numerical data could be used to predict the outcome of American Football matches. This was done through a logistic regression model. Informative features were required to model each football match so that a successful prediction could be made. Using these features, the model would produce a binary output representative of whether the home or away team was more likely to win. During the background research, it was noticed that few authors had attempted to predict the outcome of sporting events using the logistic regression approach. Therefore, this project aimed to see if this method was as successful as the alternatives used in other papers. As well as this evaluation, I compared the predictions obtained from my model with that of the bookmakers. It was shown that my model faired no worse than these betting forecasts showing that the model was a good predictor of American Football games. DATA American football The sport that was to be used in this project needed to have a caveat that match outcomes are rarely tied. This would make predicting matches easier through simply selecting the home or away team to win. American Football was therefore chosen as ties are infrequent within the sport. Games I used the 8063 American Football matches played within the NFL between 1970 and 2006. The results were downloaded from http://www.pro-football-reference.com. As 45 of these were tied matches, they were ignored in this project, leaving a final 8018 matches. STADIUM INFORMATION I obtained the stadium that each team played at within the relevant 37 years from http://en.wikipedia.org/wiki/chronology_of_home_stadiums_for_current_national_fo otball_league_teams. This enabled me to store the size of each stadium and using an online city-to-city distance tool (http://www.geobytes.com/citydistancetool.htm), I was able to get the distances between each of these stadia. BOOKMAKER SPREADS To assess the accuracy of the final model, as well as comparing it to the final outcome it was evaluated against the forecasts put forward by the bookmakers for each game. This data was retrieved from Prof. Philip Gray who had carried out similar work in the field of American Football predictions (Gray & Gray 1997). METHODOLOGY Baselines To get an idea how successful the numerical model was, I used simple prediction methods to act as system baselines. The first relied on the theory that the home team within a sporting event has a certain advantage over the away team (Vergin & Sosik 1999, Stefani 1980, Harville 1980). Therefore, this basic predictor chose the home team to win within every match (HOME).

The other baseline algorithm used previous results between the two competing teams to achieve a prediction. After analysing the accuracy of differing amounts of data, it was found that the forecasts became less accurate when older results were used. Thus, the most accurate predictions came from just using the previous year s results between the teams (PREV_RES). When tested on all of the matches between 1970 and 2006, the unsupervised HOME baseline achieved an accuracy of 57.8% and the supervised PREV_RES obtained 58% accuracy. Using the McNemar test (Dwyer 1991) to find the statistical significance at the 0.05 level, they both were found to be significantly better than a random prediction approach and therefore good baselines for the complex model. Logistic model A logistic regression model was used to improve on the 58% accuracy found by the baselines. The output of a logistic model is a binary result (1/0) so as tied games were ignored, this made the approach suitable to forecast either a home or away win. The logistic model encompasses features (attributes) that represent data relevant to the result that is trying to be predicted. Then during the training of the model, the relationship between each feature and the result is assessed to see if it has a strong or weak correlation with the end result. This relationship is then used when trying to predict the outcomes within the testing phase. The next step was to choose the model s features. These needed to represent information such as the overall strength of each team, the recent form of each team, etc. In order to obtain a set of features that achieved the most accurate representation of an American Football game, an iterative prototype approach was used. This meant potential match attributes could be analysed and either kept or ignored depending on their performance. Features The features used within my model centred on work carried out within soccer predictions where these features were encompassed in an ordered-probit model (Goddard & Asimakopoulos 2004). Here, the authors implemented novel features within their model to predict the outcome of soccer games. For instance, they looked at the two competing teams recent form. The decision was made that American Football and soccer have similar traits and therefore these features could be transferred across the two sports. See Table 1 for a list of features used. Feature Description awaycapacity The stadium capacity of the away team awayrecentawayresultn The result in the n th recent away game for the away team awayrecenthomeresultn The result in the n th recent home game for the away team awaywinratio_1 The away team s win ratio for last year awaywinratio_2 The away team s win ratio for 2 years previous distance The distance the away team had to travel to play the match homecapacity The stadium capacity of the home team homerecentawayresultn The result in the n th recent away game for the home team homerecenthomeresultn The result in the n th recent home game for the home team homewinratio 1 The home team s win ratio for last year homewinratio 2 The home team s win ratio for 2 years previous lastyearresult The result of last year s corresponding game between the two teams Table 1. Original Model Feature Set

Here we can see that the recent form of the two teams is captured through the recentawayresultn and recenthomeresultn features. The win ratios of the previous two years represent how successful the teams have been overall in their matches during recent seasons. The capacities of the two stadia try to capture the number of fans a team has, and how this affects the chances of them beating the other team. The distance feature shows how the number of miles an away team has to travel can influence the probability of them becoming victorious. Lastly, the lastyearresult feature borrows from the PREV_RES algorithm in that it looks to see if predictions can be aided through previous encounters between the teams. The features associated with each match were extracted from the list of results and the stadium data. Then the features were placed into vectors with the result of the match appended at the end. Then using the machine learning software WEKA (Witten & Frank 2005), the model was trained using 20 years of NFL matches and tested on the remaining eight years. The logistic model reached an accuracy of 62%, which proved to be significantly better than the two baselines described previously. This showed that the features incorporated in the model, when used together held more predictive qualities than simply picking the home team or relying on last year s results between the teams. Attempts were then made to try to build on this model and improve its forecasting ability. The first included adding power rankings to the feature space. Power rankings are a way in the media to represent a team s current strength based on current form, players injured, etc (Bouiler & Stekler 2003). The power rankings at the end of each season were obtained and used within match vectors for the following season. However, this extended feature set decreased the accuracy of the model. Upon further inspection, it was deduced that the rankings should be used and updated on a week-by-week basis. Thus when last year s power rankings were used to represent a team s strength in the following season, the predictive qualities of the rankings were lost. A further attempt was made to improve the system by altering the feature set from the original model. One feature represented the recent home and away games for the two teams. Here, the result of the recent game is stored, i.e. home/away/tie. This is more suited to soccer as the differences in score are rarely large. So to make the model more suited to American Football, the differences in scores were stored rather than simply just the result. Although this slightly improved on the 62% achieved in the original model, the difference was not found to statistically significant. See Table 2 for the list of features used in the final model.

Feature awaycapacity awayrecentawayscdifn awayrecenthomescdifn awaywinratio_1 awaywinratio_2 distance homecapacity homerecentawayscdifn Description The stadium capacity of the away team The score difference in the n th recent away game for the away team The score difference in the n th recent home game for the away team The away team s win ratio for last year The away team s win ratio for 2 years previous The distance the away team had to travel to play the match The stadium capacity of the home team The score difference in the n th recent away game for the home team homerecenthomescdifn The score difference in the n th recent home game for the home team homewinratio 1 The home team s win ratio for last year homewinratio 2 The home team s win ratio for 2 years previous lastyearresult The result of last year s corresponding game between the two teams Table 2. Final Model Feature Set EVALUATION The model shown here was found to have more predictive qualities than simply choosing the home team. Furthermore, the model is more accurate than just allowing the results of last year s encounters to decide the forecast. The final model was then evaluated against the predications of the bookmakers. This was based on the view that the bookmakers are generally the most accurate forecasters when compared with statistical systems or expert predictions (Harville 1980, Bouiler & Stekler 2003). The model achieved an accuracy of 65.2% whereas the bookmakers obtained a rate of 67.4%. Having said this, the difference in accuracies was not found to be statistically significant. This means that the forecasts made by the model developed in this work were no worse than that of the bookmakers. When compared to the accuracies of other models used within similar research, the system described here is not as accurate as some of the more complex models. However, during the analysis of these more complex models the respective authors found that their models were less accurate when compared to bookmaker s predictions. For example, Stefani s least square model obtained an accuracy of 68.4% but was subsequently overshadowed by the bookmaker s accuracy of 71% on the same matches (Stefani 1980). The system can also be seen to be more accurate than subjective predictions of an expert found within one piece of research. Here, it was discovered the New York Times Editor had a forecasting accuracy of 59.7% (Bouiler and Stekler 2003). Further analysis was carried out to assess the informative qualities of each feature. Known as feature ablation, this involved taking out each of the features one at a time to create a new temporary model. This temporary model was then used to predict the results of each game and if it was more accurate than the original then it can be concluded that the feature removed is a damaging one. Whereas a feature that decreases the accuracy is proposed to be an important feature. Although the feature ablation studies were not wholly conclusive, it could be seen that a team s win ratios from the previous years are very informative and thus could be extended within the model. Furthermore, the model uses each team s previous nine home games to represent their current form. The studies suggested that cutting this down to around

four or five would be more suitable. These would be interesting avenues to pursue if the work of this project were extended. RELATED WORK As mentioned within this work, the features within the model described here were adapted from that of the model illustrated by Goddard and Asimakopoulos (2004). Used to predict the outcome of English soccer matches, they found that their ordered-probit model could compete with the forecasts made by the bookmakers. With regard to American Football modelling, one piece of research used an Ordinary Linear Squares approach to forecast NFL results (Stefani 1980). However, this approach fell short when compared to the forecasts made by the bookmakers. Moreover, Stefani used this same model to predict the outcome of soccer matches and basketball games. This allowed him to achieve similar prediction accuracies compared to the American Football version. This suggests that most sports can be interchangeable when representing them within a model. Harville s approach relied on a much more complex system of linear models to predict the winner of an American Football match (Harville 1980). He obtained a prediction accuracy of 70.3% which in turn was overshadowed by the corresponding bookmakers forecast rate of 72.1%. CONCLUSION I have shown that American Football matches can be accurately modelled through the use of features within a regression model. Furthermore, I have found that a simple logistic model can achieve just as accurate forecasts compared to some more complex alternatives. This is shown by the model s ability to provide forecasts that match those of the bookmakers. ACKNOWLEDGEMENTS I would like to thank my supervisor Dr. Katja Markert for all her time, help and support throughout the project. Also, I want to acknowledge Dr. Andy Bulpitt for his important comments within the development of the project. Lastly, I would like to thank Prof. Philip Gray for his time and effort providing bookmaker s data which enabled me to make huge strides in the evaluation of my work.

REFERENCES Boulier, B.L. and Stekler, H.O. (2003), Predicting the outcomes of national football league games, International Journal of Forecasting, 19(2), 257 270. Dwyer, A.J. (1991), Matchmaking and mcnemar in the comparison of diagnostic modalities Radiology, 178(2), 328. Goddard, J. and Asimakopoulos, I. (2004), Forecasting football results and the efficiency of fixed-odds betting, Journal of Forecasting, 23 (1), 51 66. Gray, P.K. and Gray, S.F. (1997), Testing market efficiency: Evidence from the NFL sports betting market, Journal of Finance, 52(4), 1725 1737. Harville, D. (1980), Predictions for national football league games via linear-model methodology, Journal of the American Statistical Association, 75(371), 516 524. Stefani, R.T. (1980), Improved least squares football, basketball, and soccer predictions, Systems, Man and Cybernetics, IEEE Transactions on, 10(2), 116 123. Vergin, R.C. and Sosik, J.J. (1999), No place like home: an examination of the home field advantage in gambling strategies in NFL football, Journal of Economics and Business, 51(1), 21 31. Witten, I.H. and Frank, E. (2005), Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.