Game ON! Predicting English Premier League Match Outcomes




Aditya Srinivas Timmaraju (adityast@stanford.edu), Aditya Palnitkar (aditpal@stanford.edu), Vikesh Khanna (vikesh@stanford.edu)

Abstract

Among the different club-based soccer leagues in the world, the English Premier League (EPL), broadcast in 212 territories to 643 million homes, is the most widely followed. In this report, we attempt to predict match outcomes in the EPL. One of the key challenges of this problem is the high incidence of drawn games in the EPL. We identify desirable characteristics for features that are relevant to this problem, and we draw parallels between our choice of features and those used in state-of-the-art video search and retrieval algorithms. We demonstrate that our methods offer superior performance compared to existing methods, soccer pundits and the betting markets. We also share a few insights gained from this exploration.

1 Introduction

The most popular soccer league in the world is the English Premier League (EPL), which is based in England. It was watched by an estimated 4.7 billion people in the 2010-11 season, so it does not come as a surprise that the TV rights are valued at around £1 billion per season. The EPL has 20 teams contesting for the title. The bottom three teams are relegated each season and replaced by the best-performing teams from the lower leagues. Each team plays every other team twice, once at home and once away, so there are a total of 380 (= 2 × (20 choose 2)) games per season. A season runs from August to May of the following year.

1.1 Challenges

One of the things that makes predicting outcomes tricky is the significant incidence of draws (neither team wins, when both score the same number of goals) compared to other sports. The most popular games, viz. tennis, cricket, baseball and American football, either have no draws or very few of them. Consider this: out of the 380 games in the 2012-13 season, 166 were won by the home team, 106 by the away team, and 108 were drawn! To put these numbers in perspective, we calculate the season's entropy. We regard home wins (1), away wins (2) and drawn games (3) as three separate classes and compute the fraction of each of these outcomes:

p_1 = 166/380, p_2 = 106/380, p_3 = 108/380

Since we have a 3-way classification, we use the base-3 logarithm:

Entropy = -(p_1 log_3(p_1) + p_2 log_3(p_2) + p_3 log_3(p_3)) = 0.9789

An entropy of 1 would correspond to a perfectly random setting (p_1 = p_2 = p_3 = 1/3). Considering this, a random prediction would give an expected accuracy of 33.33%.
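To make the entropy calculation concrete, here is a minimal Python sketch that reproduces the number above from the 2012-13 outcome counts (the variable names are ours, not from the original report):

```python
import math

# Outcome counts for the 2012-13 EPL season quoted above.
counts = {"home_win": 166, "away_win": 106, "draw": 108}
total = sum(counts.values())  # 380 games = 2 * C(20, 2)

probs = [c / total for c in counts.values()]  # p_1, p_2, p_3

# Base-3 entropy, since this is a 3-way classification.
entropy = -sum(p * math.log(p, 3) for p in probs)
print(round(entropy, 4))  # ~0.9789; the maximum possible value is 1.0
```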
Another naive, but marginally better, approach would be to always predict a home win, which would deliver 44-45% accuracy (roughly the average fraction of home wins in the EPL). These are, however, not very good numbers.

1.2 Prior Work

In [1], a Bayesian dynamic model is built, but it has many parameters and the authors do not justify their chosen values, which may not transfer well to other test sets. Also, out of the 48 EPL games they bet on, they won only 15 [1]. Their primary focus is on computing pseudo-likelihood values for betting on match outcomes. In [2], the authors specifically choose to focus on predicting the match outcomes of a single team, Tottenham Hotspur. Their model also depends on the presence of particular players and is thus, by their own admission, not scalable.

There is no standard way of reporting results for this problem. Most authors do not report prediction accuracy directly; some specify precision/recall, and others specify the geometric mean of the predicted probabilities of the correct classes, called the Pseudo-likelihood Statistic (PLS). We report prediction accuracy and precision/recall as well as PLS. We compare the performance of our algorithms on all these fronts and show a marked improvement.

The remainder of the report is organized as follows. Section 2 discusses the choice of features, which we feel is a key step. Section 3 describes the approaches we have taken towards addressing the problem. Section 4 describes our results. Throughout the report, we use the words match and game interchangeably. By convention, in A vs. B, A is the home team and B is the away team.

2 Selection of Features

We first identify the most relevant performance metrics: Goals, Corners and Shots on Target. Goals are an obvious choice, as they determine which team wins. As for corners, a higher number of corners indicates a team playing well and forcing the opponent to concede corner kicks. Shots on Target counts the number of shots that were on target (including those saved by the keeper). We consciously did not pick possession percentage, because some teams simply have a more possession-oriented style of play than others, and a higher possession percentage does not necessarily imply better performance. Our performance metric vector for a game is therefore a 3-element vector of the form (Goals scored, Corners, Shots on Target).

We identify the following desirable characteristics for a feature. It should:

- Incorporate a notion of the competing nature (between teams) of the problem
- Be reflective of the recent form of a particular team
- Manifest the home advantage factor
- Capture the improving/declining trend in the relative performance of the competing teams (i.e., has team A been on the rise more than team B? Has team B slumped in form more than team A?)

Keeping these characteristics in mind, we propose the following choice of features (we arrived at the above list progressively, so the first choice below does not capture all of these properties).

2.1 KPP (k-Past Performances)

In KPP, we use the performance of a team in its last k games to determine our prediction. For the game A vs. B, we first take the average of each performance metric of team A over its last k games to arrive at P_A. If G = [g_1, g_2, g_3, ..., g_k] is a vector containing the number of goals scored in each of the last k games, we have

g_avg = (g_1 + g_2 + ... + g_k) / k

Similarly, we arrive at c_avg and st_avg for corners and shots on target respectively. Thus, we have P_A = [g_avg; c_avg; st_avg]. In the same way, we arrive at P_B. We then take the ordered difference P_A - P_B, a 3-element vector, as our feature. The ordered difference is chosen to inherently encode which team is playing at home and which away.
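As an illustration of the KPP construction, here is a minimal sketch of our own (the function and variable names are hypothetical, not from the report) that averages the last k performance-metric rows for each team and forms the ordered difference:

```python
import numpy as np

def kpp_feature(last_k_home, last_k_away):
    """KPP feature for the game A vs. B, with A at home.

    Each argument is an array of shape (k, 3): one row per past game,
    columns = (goals scored, corners, shots on target).
    """
    p_a = last_k_home.mean(axis=0)  # P_A = [g_avg; c_avg; st_avg]
    p_b = last_k_away.mean(axis=0)  # P_B for the away team
    return p_a - p_b                # ordered difference encodes home/away

# Made-up past performances for a k = 3 example.
team_a = np.array([[2, 7, 5], [1, 4, 3], [3, 8, 6]])
team_b = np.array([[0, 3, 2], [1, 5, 4], [1, 2, 3]])
print(kpp_feature(team_a, team_b))  # 3-element KPP feature vector
```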
2.2 TGKPP (Temporal Gradient k-Past Performances)

In TGKPP, as in KPP, we first arrive at the 3-element vector P_A - P_B. Now consider the performance metric vector (say, goals scored) for team A over the past k games, denoted G = [g_1, g_2, g_3, ..., g_k]. We apply a temporal differencing operator to this vector, which is essentially a convolution with a filter of the form diff = [1; -1]:

g_dA = convolution(G, diff) = (g_2 - g_1, g_3 - g_2, ..., g_k - g_{k-1})

We do the same for the other performance metrics to compute c_dA and st_dA, and we repeat the process for team B to arrive at g_dB, c_dB and st_dB. Then we compute

P_diff = [µ(g_dA) - µ(g_dB); µ(c_dA) - µ(c_dB); µ(st_dA) - µ(st_dB)]

where µ(·) denotes the mean of the elements of a vector. Thus, our final feature vector is a 6-element vector of the form [P_A - P_B; P_diff]. This feature is similar to the video fingerprint in [7], where spatial and temporal gradients are applied to blocks in a video frame.

We obtained our data from http://www.football-data.co.uk/.
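Continuing the sketch above, the temporal-gradient part can be prototyped as follows (again our own illustrative code with hypothetical names; np.diff plays the role of the convolution with [1; -1]):

```python
import numpy as np

def tgkpp_feature(last_k_home, last_k_away):
    """TGKPP feature [P_A - P_B ; P_diff] as a 6-element vector.

    Each argument is an array of shape (k, 3) in chronological order,
    columns = (goals scored, corners, shots on target).
    """
    # Static part: the KPP ordered difference P_A - P_B.
    static_diff = last_k_home.mean(axis=0) - last_k_away.mean(axis=0)

    # Temporal gradient: row i becomes game[i+1] - game[i],
    # i.e. the convolution with the filter [1; -1] described above.
    grad_home = np.diff(last_k_home, axis=0)
    grad_away = np.diff(last_k_away, axis=0)
    p_diff = grad_home.mean(axis=0) - grad_away.mean(axis=0)

    return np.concatenate([static_diff, p_diff])

# Usage mirrors kpp_feature above: tgkpp_feature(team_a, team_b).
```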
3 Approaches Taken

We first considered Naive Bayes, but found that independence is a very strong assumption for this problem. We also did not try Gaussian Discriminant Analysis, because our data is nowhere close to normally distributed (owing to space constraints, we omit the figures showing this). We instead used multiple variations of Multinomial Logistic Regression and Support Vector Machines.

3.1 Approach 1

Our first approach used Multinomial Logistic Regression (MLR), since there are more than two possible outcomes. In this approach, during the training phase we only considered the performance metrics derived from the current match, rather than averages over the last k matches. During testing, to predict the outcome of team A vs. team B, we construct the feature vector using KPP.

3.2 Approach 2

In the second approach, we train in the same way we later test. More precisely, in the training phase too, instead of using the performance metric vector of the current match as the feature vector, we use KPP. The trained parameters then encode our beliefs about the result of a match based on performance over the last k matches. In this approach, we also used TGKPP in place of KPP and evaluated it.
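As a rough illustration of the Approach 2 training/testing setup (not the authors' actual code; scikit-learn is our assumed tool, and random stand-in data replaces the real feature matrices), the two classifier families could be fit like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# X_*: KPP or TGKPP feature vectors; y_*: 0 = home win, 1 = away win, 2 = draw.
# Random stand-in data here; in practice these come from football-data.co.uk.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(380, 6)), rng.integers(0, 3, size=380)
X_test, y_test = rng.normal(size=(81, 6)), rng.integers(0, 3, size=81)

# Logistic regression fit on the three outcome classes (multinomial handling
# is the scikit-learn default with the lbfgs solver).
mlr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Support vector machine with an RBF kernel.
svm = SVC(kernel="rbf").fit(X_train, y_train)

print("MLR accuracy:    ", mlr.score(X_test, y_test))
print("RBF-SVM accuracy:", svm.score(X_test, y_test))
```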
3.3 Approach 3: Using Teamwise Models

All the approaches so far find a global set of parameters, independent of the competing teams: given the past k performances of the home team and the away team, the model is agnostic to the identity of the actual teams playing. However, we felt we might be missing team-specific trends this way, so in this approach we trained a different model for each team. This placed a limitation on the data we could use to train and test: we could no longer combine data from two different seasons, because team form varies between seasons and major players are traded between teams. Due to the limited data and the increased noise induced by the finer granularity of the model, we ended up with lower accuracy (an average of 47%).

4 Results and Discussion

We train all our models on the 2012-13 season and test on the 2013-14 (current) season. Table 1 contains the prediction accuracy (in %) obtained using MLR with Approach 1. Figure 1 shows the learning curves for the two algorithms, MLR and RBF-SVM. Our definition of k imposes a restriction on the starting point for testing: for k = 4, 5, 6, 7 we test on 81, 71, 61 and 51 games respectively.

Table 1: MLR with Approach 1

Season  | k value | Accuracy (%)
2013-14 | 3       | 34.43
2013-14 | 4       | 32.79

Table 2 contains the prediction accuracy (in %) obtained using Approach 2 with MLR and RBF-SVM. The last column of Table 2 contains the accuracy obtained using only the Home win and Away win classes, disregarding draws: if C is the confusion matrix, we drop the 3rd row and 3rd column to obtain C_trunc and compute

2-class accuracy = trace(C_trunc) / Σ_{i,j} (C_trunc)_{i,j}

We computed this because we felt it offers a fair metric for comparison against accuracy in other sports that do not have the concept of drawn games.

Table 2: Prediction Accuracy (%): KPP & TGKPP

Season  | k | MLR (KPP) | RBF-SVM (KPP) | MLR (TGKPP) | RBF-SVM (TGKPP) | 2-Class (RBF-SVM)
2013-14 | 4 | 58.21     | 46.91         | 54.32       | 55.56           | 76.79
2013-14 | 5 | 56.86     | 59.15         | 59.15       | 54.93           | 76.17
2013-14 | 6 | 57.38     | 54.10         | 59.02       | 54.10           | 76.74
2013-14 | 7 | 60.78     | 58.82         | 56.86       | 66.67           | 83.78

Since our problem is a 3-way classification, there is no positive/negative class demarcation, so when computing precision and recall for, say, the Home wins class, we regard both other classes as negatives. Table 3 lists the precision and recall values. While the precision on drawn games is fair, it is evident that our method misses badly on recall for draws. Under-estimating draws is a problem for the existing methods too.

Figure 1: Accuracy vs. training set size (for k = 7): (a) MLR, (b) RBF-SVM.
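To make the confusion-matrix computations concrete, here is a short sketch of our own that derives the 2-class accuracy and per-class precision/recall from a 3x3 confusion matrix whose rows are actual classes and columns are predicted classes:

```python
import numpy as np

# Rows = actual (home win, away win, draw); columns = predicted.
# These are the k = 7 TGKPP counts reported in Table 4 below.
C = np.array([[25, 3, 1],
              [3, 6, 0],
              [7, 3, 3]])

# 2-class accuracy: drop the draw row and column, then trace / total.
C_trunc = C[:2, :2]
two_class_acc = np.trace(C_trunc) / C_trunc.sum()  # 31/37 ~ 0.8378, cf. Table 2

# Per-class precision and recall, treating the other two classes as negatives.
precision = np.diag(C) / C.sum(axis=0)
recall = np.diag(C) / C.sum(axis=1)

print(two_class_acc, precision, recall)
```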

Table 3: Precision and Recall (%) with RBF-SVM (TGKPP, k = 7)

          | Home win | Away win | Drawn game
Precision | 71.43    | 50.00    | 54.10
Recall    | 86.21    | 66.67    | 23.08

Table 4: Confusion Matrix (TGKPP, k = 7)

                   | Predicted home wins | Predicted home losses | Predicted draws
Actual home wins   | 25                  | 3                     | 1
Actual home losses | 3                   | 6                     | 0
Actual draws       | 7                   | 3                     | 3

In [1], the authors propose a metric computed as the geometric mean of the predicted probabilities of the actual outcomes, which they call the Pseudo-likelihood Statistic (PLS). They obtain a PLS of 0.357 for the EPL. In [3], the best reported PLS is 0.36007. In comparison, our average PLS is 0.4015, with a maximum of 0.4613 for k = 4 using MLR and TGKPP. In [4], the authors report their results in the form of a confusion matrix, but they do not predict drawn games at all; our method offers superior performance to [4] as well. We also considerably outperform the probabilistic model in [5], which reaches an accuracy of 52.1% for 2011-12 and 48.15% for 2012-11. In comparison to our best accuracy of 66.67%, the accuracy of expert pundits such as Mark Lawrenson of the BBC is 52.6% [6]. We also outperformed the betting markets, whose accuracy averaged 55.3% last season [6], suggesting possible avenues for making money!

The results we obtained, while interesting and significant in their own right, also offered insights into the match data we analysed. Although our results with Approach 3 were not encouraging, we found a striking trend. Since we have two team-specific models for each fixture, we face a choice: if A and B are the competing teams, should we use θ_A or θ_B for prediction, or always the home team's parameters? We observed that always using θ_away for prediction does better than using θ_home. In hindsight, we may be tempted to conclude that θ_away does better because there is more variability in away performances, which carries more information; but a priori we might just as well have expected home performances to be more consistent and hence θ_home to be the better predictor.

5 References

[1] Rue, Havard, and Oyvind Salvesen, "Prediction and retrospective analysis of soccer matches in a league," Journal of the Royal Statistical Society: Series D (The Statistician) 49.3 (2000): 399-418.
[2] Joseph, A., Norman E. Fenton, and Martin Neil, "Predicting football results using Bayesian nets and other machine learning techniques," Knowledge-Based Systems 19.7 (2006): 544-553.
[3] Goddard, John, "Regression models for forecasting goals and match results in association football," International Journal of Forecasting 21.2 (2005): 331-340.
[4] Crowder, M., Dixon, M., Ledford, A., and Robinson, M., "Dynamic modelling and prediction of English Football League matches for betting," Journal of the Royal Statistical Society: Series D (The Statistician) 51 (2002): 157-168. doi: 10.1111/1467-9884.00308.
[5] Constantinou, Anthony C., Norman E. Fenton, and Martin Neil, "pi-football: A Bayesian network model for forecasting Association Football match outcomes," Knowledge-Based Systems (2012).
[6] www.pinnaclesports.com/online-betting-articles/09-2013/lawrenson-vs-pinnacle-sports.aspx
[7] Oostveen, Job, Ton Kalker, and Jaap Haitsma, "Feature extraction and a database strategy for video fingerprinting," Recent Advances in Visual Information Systems (2002): 67-81.