The Market for English Premier League (EPL) Odds Guanhao Feng, Nicholas G. Polson, Jianeng Xu arxiv:1604.03614v1 [stat.ap] 12 Apr 2016 Booth School of Business, University of Chicago April 14, 2016 Abstract We develop a probabilistic model to track and forecast real-time odds for English Premier League (EPL) football games. We show how a difference in Poisson processes (a.k.a. Skellam process) provides a dynamic probabilistic model for the evolution of scores. Ex ante, we show how to calibrate expected goal scoring rates using marketbased information on win, lose, draw odds. As the game evolves, we use the current score and our Skellam process to calculate the matrix of final score odds. This enables us to identify real time online betting opportunities relative to our model s predictions. We illustrate our methodology the EPL game between Everton and West Ham and Sunderland and Leicester City. Both games illustrate the flexibility of our model and how odds can change quickly as the score progresses. Finally, we conclude with directions for future research. Key words: Market Implied Prediction, Odds, Skellam Process, Soccer Betting, English Premier League. Nicholas G. Polson is Professor of Econometrics and Statistics at the Booth School of Business, University of Chicago. Guanhao Feng is at Booth School of Business, University of Chicago. Jianeng Xu is at Department of Statistics, University of Chicago. 1
1 Introduction 1.1 The betting market for the EPL Gambling on football is a global industry worth anywhere between $700 billion and $1 trillion a year 1. Spread betting, particularly fixed-odds betting, on the outcome of soccer matches is rapidly growing in popularity and odds are set via online real time betting market (Betfair, Bet365 etc.). 2 Traditional bookmakers, such as Ladbrokes, also offer odds on various outcomes of a match. For example, bets can be placed on the final outcomes (win, lose, draw) as well as goals scored at half-time or full-time. A key feature is that the odds are updated in real time and as such there is great interest in developing probability models for the evolution of the games score. Stern (1994) and Polson and Stern (2015) propose a Brownian motion model for the difference in teams scores and also show how the market s based information can be used to calculate the implied volatility of a game. We build on this approach and develop a model that is tailor-made for the discrete evolution of the scores of a EPL game. Specifically, we develop a probability model based on the difference of Poisson processes (a.k.a. Skellam process 3 ). Various probability models have been proposed to predict the outcome of soccer matches motivated by the demand for assessing betting opportunities. For example, Dixon and Coles (1997) use a bivariate Poisson process which has been extended in a number of ways; see, for example, Karlis and Ntzoufras (2009). Another line of research, asks whether betting markets are efficient and, if not, how to exploit potential inefficiencies in the betting market.? examine the hypothesis that sentimental bettors act like noise traders and can affect the path of prices in football betting markets. Futhermore, Fitt (2009) applies the efficient portfolio theory to analyzes the mis-pricing of cross-sectional odds. Online 1 See Football Betting - the Global Gambling Industry worth Billions. BBC Sport. 2 Fractional odds are used in UK, while money-line odds are favored by American bookmakers. Fractional odds of 2:1 ( two-to-one ) would imply that the bettor stands to make a $200 profit on a $100 stake. 3 See Barndorff-Nielsen and Shephard (2012) for an introduction to Skellam processes. 2
betting of soccer spread bets requires bookmakers to dynamically alter the market odds to prevent arbitrage during the course of a match and Fitt et al. (2005) models the value of online soccer spread by modeling goals and corners as Poisson processes. 1.2 Connections with Existing Work Early models of the number of goals scored by each team (see, e.g. Lee (1997)) use independent Poisson processes. Later models incorporate a correlation between the two scores and model the number of goals scored by each team using bivariate Poisson models (see Maher (1982), Dixon and Coles (1997), and Karlis and Ntzoufras (2009)). Our approach follows Stern (1994), and instead of modeling the number of goals and the correlation between sores directly, we will model the score difference (a.k.a. margin of victory). Soccer scores by their very nature are discrete and not that frequent. Hence they are not adequately modeled by a continuous-time stochastic process. Instead we adapt the Poisson process and model the difference in scores (a.k.a. Skellam distribution). In addition, we define a similar implied volatility measure (see Polson and Stern (2015)) from our discrete dynamic process. The rest of the paper proceeds as follows. Section 2 presents our Skellam process model for tracking the difference in goals scored. We then show how an odds matrix of all the combinations of scores can be computed using Skellam s cumulative distribution function. We also calculate a dynamic, time-varying, implied prediction using real-time online trading market data for the odds of any score and hence win, lose and draw outcomes. Section 3 illustrates our methodology using two EPL game in 2015-2016 between Sunderland and Leicester and Everton and West Ham. Finally, Section 4 discusses extensions of our basic model and concludes with directions for future research. 3
2 Skellam Process for EPL scores Let the outcome between the two teams A and B be modeled as a difference, N(t) = N A (t) N B (t). Here we can interpret N(t) as the lead of home team A over away team B and N A (t) and N B (t) denotes the scores of two teams at time point t (0 t 1). Negative values of N(t) therefore indicate that team A is behind. We assume that the game begins at time zero with N(0) = 0 and ends at time one with N(1) representing the final score difference (positive if A wins and negative if B wins). For our analysis we develop a probabilistic specification of the distribution of N(1) and, more generally, given N(t) = l where l is the current lead, as the game evolves. Given this probabilistic model, we can determine an implied prediction of the outcome for the whole game. For example, ex ante P(N(1) > 0) will provide the odds of team A winning. Half-time scores will be available from the distribution of N( 2 1 ) and as the game progresses we can calculate P(N(1) > 0 N(t) = l) where l is the current goal difference. 2.1 Implied Score Prediction from EPL Odds The Skellam distribution is derived as the difference between two independent Poisson variables (see Skellam (1946), Sellers (2012), and Alzaid and Omair (2010)). We now show how it can be used to model the point spread distribution in those sports with equal scored points. Lemma 1 of Karlis and Ntzoufras (2009) shows that Skellam distribution is not restricted to Poisson difference, and can be extended to the difference of distributions which have a specific trivariate latent variable structure. We begin with a model which specifies the lead of home team A over away team B at time t, N(t) = N A (t) N B (t) where N A (t) Poisson(λ A t) and N B (t) Poisson(λ B t), as a Skellam random variable, (N(t) λ A, λ B ) Skellam(λ A t, λ B t). 4
The parameters λ A and λ B denote the strength of two teams and repeat the expected scoring rates of two teams, respectively. If we view scores of team A and team B at time t as two independent Poisson process, λ A and λ B are expected values of two processes at t = 1. We can use this model to calculate the probability of any specific score difference P(N(1) = x) at the end of the game. To derive the winning odds, we use the law of total probability, the probability mass function of a Skellam random variable is the convolution of two Poisson distributions: P(N(t) = x λ A, λ B ) = = P(N B (t) = N A (t) x N A (t) = k, λ B ) P(N A (t) = k λ A ) k=0 {e λ Bt (λ Bt) k x (k x)! }{e λ At (λ At) k } k! k=max{0,x} = e (λ A+λ B )t = e (λ A+λ B )t k=max{0,x} ( λa λ B (λ B t) k x (λ A t) k (k x)!k! ) x/2 I x (2 λ A λ B t) where I r (x) is the modified Bessel function of the first kind (For full details, see Alzaid and Omair (2010)), thus has series representation ( x ) r (x I r (x) = 2 2 /4) k k!γ(r + k + 1). k=0 Besides the probability of home team A winning can be easily calculated using the cumulative distribution function, P(N(1) > 0 λ A, λ B ) = P(N(t) = x λ A, λ B ). x=1 In practice, we use an upper bound on the number of possible goals since the probability of an extreme score difference is always negligible. Unlike the Brownian motion model for the evolution of the outcome in a sports game (Stern (1994), Polson and Stern (2015)), the 5
probability of a draw in our setting is not zero. Instead, P(N(1) = 0) is always a positive number which depends on the sum and product of two parameters λ A and λ B and thus the odds of a draw are non-zero. Hence, two evenly matched teams with large λ s are more likely to achieve a draw, compared with two evenly matched teams with small λ s. We are also interested in the conditional probability of winning as the game progresses. At time t (0 t 1), let N A (t) and N B (t) denote the numbers of goals already scored by two teams. In other words, the current lead at time t is l = N A (t) N B (t) and so N(t) = l = N A (t) N B (t). With the property of Poisson process, the model updates the conditional distribution of the final score difference (N(1) N(t) = l) by noting that N(1) = N(t) + (N(1) N(t)). By properties of a Poisson process, N(t) and (N(1) N(t)) are independent. Specifically, conditioning on N(t) = l, we have the identity N(1) = l + Skellam(λ A (1 t), λ B (1 t)). From the above expression, we are now in a position to find the conditional distribution (N(1) = x N(t) = l) for every time point t of the game given the current socre. Simply put, we have the time homogeneously condition P(N(1) = x λ A, λ B, N(t) = l) = P(N(1 t) = x l λ A, λ B ), where λ A, λ B, l are either given by market expectations or are known at time t. This equation is easy to verify noting that N(1) N(t) d = N(1 t) and N(1) N(t) = x l. There are two conditional probabilities of interest. First, the chances of home team A 6
winning is P(Home team A wins λ A, λ B, N(t) = l) = P(l + Skellam(λ B (1 t), λ B (1 t)) > 0) Second, the conditional probability of a draw at time t is = P(N(1 t) > l λ A, λ B ) ( ) x/2 = e (λ A+λ B )(1 t) λa I λ x (2 λ A λ B λ A (1 t)). x> l B P(Draw λ A, λ B, N(t) = l) = P(l + Skellam(λ A (1 t), λ B (1 t)) = 0) = P(N(1 t) = l λ A, λ B ) ( ) l/2 = e (λ A+λ B )(1 t) λa I l (2 λ A λ B (1 t)). λ B By symmetry, the conditional probability at time t of home team A losing is simply 1 P(Win N(t) = l) P(Draw N(t) = l). Figure 1 is a discrete version of Figure 1 in Polson and Stern (2015), and illustrates how the different aspects of the model that we have discussed can be visualized with an simulation example for an EPL game between Everton and West Ham (March 5th, 2016). Section 3 provides with more data details. The outcome probability of first half and updated second half are given in the left two panels. The top right panel illustrates a simulation-based approach to visualizing how the model works in the dynamic evolution of score difference. In the bottom left panel, from half-time onwards, we also simulate a set of possible Monte Carlo paths to the end of the game. This illustrates the discrete nature of our Skellam process. 7
Probability of Score Difference Before 1st Half Game Simulations Before 1st Half Probability (%) 0 5 10 15 20 25 Draw = 19.50% West Ham Wins = 23.03% Everton Wins = 57.47% Score Difference 6 5 4 3 2 1 0 1 2 3 4 5 6 7 8 5 4 3 2 1 0 1 2 3 4 5 Score Difference 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Time Probability of Score Difference Before 2nd Half Game Simulations Before 2nd Half Probability (%) 0 5 10 15 20 25 30 35 Draw = 23.18% West Ham Wins = 15.74% Everton Wins = 61.08% Score Difference 5 4 3 2 1 0 1 2 3 4 5 5 4 3 2 1 0 1 2 3 4 5 Score Difference 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Time Figure 1: The Skellam process model illustrated by winning margin and game simulations. The top left panel shows the outcome distribution using odds data before the game starts. Each bar represents the probability of a specific final score difference, with it s color corresponding to the result of win/lose/draw. Score differences larger than 5 or smaller than -5 are not shown. The top right panel shows a set of simulated Skellam process paths for the game outcome. The bottom row has the two figures updated using odds data available at half-time. 2.2 Model Calibration The goal of our analysis is to show that you can use initial market-based odds of (win, lose, draw) to calibrate our parameters λ A and λ B. Suppose that the odds ratios for 8
all possible final score outcomes are given by a bookmaker. For example, suppose that we observe that the odds of final score ending with 2-1 is 3/1. In this case, the bookmaker pays out 3 times the amount staked by the bettor if the final outcome is indeed 2-1. We convert this odds ratio to the implied probability of final score being 2-1, using the identity P(N(1) > 0) = 1 1 + odds(team A winning). The implied probability makes the expected winning amount of a bet equals to 0. In this case, p = 1/(1 + 3) = 1/4 and the expected winning amount is µ = 1 (1 1/4) + 3 1/4 = 0. We denote the odds ratio as odds(2, 1) = 3 and finally get an odds matrix (O) ij. Its element o ij = odds(i 1, j 1), i, j = 1, 2, 3... for all possible combinations of final scores. The sum of resulting probabilities is larger than 1. This phenomenon is standard in betting markets (see Dixon and Coles (1997) and Polson and Stern (2015)). The excess probability corresponds to a quantity known as the market vig. For example, if the sum of all the implied probabilities is 1.1, then the expected profit of the bookmaker is 10%. To account for this phenomenon, we scale down the probabilities to make sure that the resulting sum equals to 1 before estimation. To determine the parameters λ A and λ B for the remaining game N(1) N(t), the odds ratios from a bookmaker should be calibrated by N A (t) and N B (t). For example, if N A (0.5) = 1, N B (0.5) = 0 and odds(2, 1) = 3 at half time, these observations actually says that the odds for the second half score being 1-1 is 3. The calibrated odds for N(1) N(t) is calculated using the original odds as well as the current scores and given by odds (x, y) = odds(x + N A (t), y + N B (t)). At time t (0 t 1), we calculate the implied conditional probabilities of score 9
differences using odds information P(N(1) = k N(t) = l) = P(N(1) N(t) = k l) = c i j=k l 1 1 + odds (i, j) where c = k P(N(1) = k N(t) = l) is a scale factor, l = N A (t) N B (t), i, j 0 and k = 0, ±1, ±2.... Again the property of Poisson distribution makes it easy to derive the moments of a Skellam random variable with parameters λ A and λ B. The unconditional mean is given by E(X) = λ A λ B and the variance is V(X) = λ A + λ B. Therefore the conditional moments in our case is given by E(N(1) N(t) = l) = l + (λ A λ B )(1 t) V(N(1) N(t) = l) = (λ A + λ B )(1 t) The above implied probabilities don t necessarily ensure that Ê(N(1) N(t) = l) l ˆV(N(1) N(t) = l), which may lead to negative MM estimates of λ s. To address this issue, we calibrate parameters by solving the following optimization problem ( ˆλ A, ˆλ B ) t = arg min λ A,λ B {D 2 V + D2 E } + γ{(λ A) 2 + (λ B ) 2, } where we define D V = ˆV(N(1) N(t) = l) 1 t (λ A + λ B ), D E = Ê(N(1) N(t) = l) l 1 t (λ A λ B ). In addition, Ê and ˆV are calculated using implied conditional probabilities. We pick γ to be a large penalty in order to stabilize our estimates across many goals. 10
2.3 Implied Volatility Following Polson and Stern (2015), we can define a discrete version of the implied volatility of the games outcome as simply σ IV,t = (λ A + λ B )(1 t) The market produces information about λ A and λ B and therefore for σ IV. Any goal scored is a discrete Poisson shock to the expected score difference (Skellam process) between the teams. An equivalent model used in option pricing is the jump model of Merton (1976). In our application, we illustrate the path of implied volatility throughout the course of the game. We now turn to an empirical illustration of our methodology. 3 Applications 3.1 Everton vs West Ham We collect the real time online betting odds data from ladbrokes.com for an EPL game between Everton and West Ham. Table 1 shows how the raw data of odds right before the game. We need to transform odds to probability. For example, for the outcome 0-0, 11/1 is equivalent to a probability of 1/12. Then we can calculate the marginal probability of every score difference from -4 to 5 of the game. Table 2 shows our model implied probability for outcome of score differences before the game. Different from independent Poisson modeling in Dixon and Coles (1997), our model is more flexible with the correlation between two teams. The trade-off of flexibility is that we only know the probability of score difference instead of the exact scores. Finally we can plot these probability paths in Figure 2 to examine the behavior of the two teams and predict the final result. By recording real time online betting data 11
Home \Away 0 1 2 3 4 5 0 11/1 12/1 28/1 66/1 200/1 450/1 1 13/2 6/1 14/1 40/1 100/1 350/1 2 7/1 7/1 14/1 40/1 125/1 225/1 3 11/1 11/1 20/1 50/1 125/1 275/1 4 22/1 22/1 40/1 100/1 250/1 500/1 5 50/0 50/1 90/1 150/1 400/1 6 100/1 100/1 200/1 250/1 7 250/1 275/1 375/1 8 325/1 475/1 Table 1: Original Odds Data from Ladbrokes Score difference -4-3 -2-1 0 1 2 3 4 5 Probability(%) estimate 0.78 2.50 6.47 13.02 19.50 21.08 16.96 10.61 5.37 2.27 Table 2: Empirical estimates of score difference probability (Everton - West Ham) for every 10 minutes, we can show the evolution of outcome prediction. 4 Probability of win/draw/loss jump for important events in the game: goals scoring and red card penalty. In such a dramatic game between Everton and West Ham, the winning probability of Everton arrives almost 90% until the goal in 78th minutes of West Ham. The last-gasp goal of West Ham officially kills the game and reverses the probability. Moreover, the implied volatility path is the visualization of the conditional variation of score difference. There was a jump of implied volatility when Everton lost a player by a red card penalty. 3.2 Sunderland vs Leicester City Leicester City started the 2015-2016 season as a 5000-1 underdog to win the premier league. Hence, we take a game from April 9th, 2016 near the end of the season in Figure 3. 4 In the example, we ignore the overtime of both 1st half and 2nd half of the game. 12
2 100 Everton vs. West Ham (SAT, 05 MAR 2016) Goal:E Goal:W 1.2 60 Red Card:E 0.6 40 0.8 1 50 Half Time 0.4 0.2 20 0 10 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Time Implied Volatility Everton Wins West Ham Wins Draw Figure 2: The betting data of EPL game between Everton and West Ham is from ladbrokes.com. Market implied probabilities (expressed as percentages) for 3 different results (Everton wins, West Ham wins and draw) are marked by 3 different colors, which vary dynamically as the game proceeds. The black solid line shows the evolution of the implied volatility. The dashed line shows important events in the game, such as goals and red cards. 5 goals in this game are: 13 Everton, 56 Everton, 78 West Ham, 81 West Ham and 90 West Ham. 4 Discussion In this paper, we develop a model that allows for discreteness of goals scored in a football game. Our Skellam model is also valid for low-scoring sports such as baseball, hockey or soccer which are categorized by a series of discrete scoring events. Our model has the advantage of not considering correlation between goals scored by both teams and the disadvantage of ignoring the sum of goals. On the other hand, for high-scoring sports 13 Implied Volatility 1.4 70 80 Goal:W 30 Probability (%) 1.8 Goal:E 1.6 90 Goal:W
1 0.8 0.6 Goal:L 0 0 10 0.2 Goal:L 0.4 Half Time 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Leicester City Wins Draw 0.9 1.0 Time Implied Volatility Sunderland Wins Figure 3: 2 goals in this game are: 66 Leicester City and 95 Leicester City. such as basketball, the Brownian motion adopted by Stern (1994) is more applicable. In addition, Rosenfeld (2012) provides an extension of the model to address concerns of nonnormality and uses a logistic distribution to estimate the relative contribution of the lead and the remaining advantage. One avenue for future research, is to extend the Skellam assumption to allow for the jumpiness for NFL football which is somewhere in between these two extremes (see Glickman and Stern (2012), Polson and Stern (2015) and Rosenfeld (2012) for examples.) Another area of future research, is studying the index betting. For example, a soccer games includes total goals scored in match and margin of superiority (see Jackson (1994)). The latter is the score difference in our model and so the Skellam process directly applies. 14 Implied Volatility 1.2 60 50 40 20 30 Probability (%) 70 1.4 80 1.6 90 1.8 2 100 Sunderland vs Leicester City (SUN, 10 APR 2016)
Based on our forecasting model, we can test the inefficiency of EPL sports betting from a statistical arbitrage viewpoint. For example, Dixon and Pope (2004) presents a detailed comparison of odds set by different bookmakers in relation to the Poisson model predictions. An early study of hot hand of market belief by Camerer (1989) finds that extreme underdog teams that during a long losing streak are under-priced by the market. Also, Golec and Tamarkin (1991) test the NFL and college betting markets and find bets on underdogs or home teams win more often than bets on favorites or visiting teams. Gray and Gray (1997) examine the in-sample and out-of-sample performance of different NFL betting strategies by the probit model. They also find the strategy of betting on home-team underdogs averages returns of over 4 percent, in excess of commissions. In summary, a Skellam process appears to fit the dynamics of EPL football betting very well and produces a natural lens to view these market efficiency questions. 15
References Alzaid, A. A. and Omair, M. A. (2010). On the Poisson Difference Distribution Inference and Applications. Bulletin of the Malaysian Mathematical Sciences Society. Barndorff-Nielsen, O. E. and Shephard, N. (2012). Basics of Lévy processes. Draft Chapter from a book by the authors on Lévy Driven Volatility Models. Camerer, C. F. (1989). Does the Basketball Market Believe in thehot Hand,? The American Economic Review. Dixon, M. J. and Coles, S. G. (1997). Modelling Association Football Scores and Inefficiencies in the Football Betting Market. Journal of the Royal Statistical Society. Series C (Applied Statistics), 46(2):265 280. Dixon, M. J. and Pope, P. F. (2004). The Value of Statistical Forecasts in the UK Association Football Betting Market. International Journal of Forecasting, 20(4):697 711. Fitt, A. D. (2009). Markowitz Portfolio Theory for Soccer Spread Betting. IMA Journal of Management Mathematics, 20(2):167 184. Fitt, A. D., Howls, C. J., and Kabelka, M. (2005). Valuation of Soccer Spread Bets. Journal of the Operational Research Society, 57(8):975 985. Glickman, M. E. and Stern, H. S. (2012). A State-Space Model for National Football League Scores. Journal of the American Statistical Association, 93(441):25 35. Golec, J. and Tamarkin, M. (1991). The Degree of Inefficiency in the Football Betting Market: Statistical Tests. Journal of financial economics, 30(2):311 323. Gray, P. K. and Gray, S. F. (1997). Testing Market Efficiency: Evidence From The NFL Sports Betting Market. The Journal of Finance, 52(4):1725 1737. Jackson, D. A. (1994). Index Betting on Sports. The Statistician, 43(2):309. 16
Karlis, D. and Ntzoufras, I. (2009). Bayesian Modelling of Football Outcomes: Using the Skellam s Distribution for the Goal Difference. IMA Journal of Management Mathematics, 20(2):133 145. Lee, A. J. (1997). Modeling Scores in the Premier League: Is Manchester United Reallythe Best? Chance, 10(1):15 19. Maher, M. J. (1982). Modelling Association Football Scores. Statistica Neerlandica, 36(3):109 118. Merton, R. C. (1976). Option Pricing when Underlying Stock Returns are Discontinuous. Journal of financial economics, 3(1-2):125 144. Polson, N. G. and Stern, H. S. (2015). The Implied Volatility of a Sports Game. Journal of Quantitative Analysis in Sports, 11(2):145 153. Rosenfeld, J. W. (2012). An In-game Win Probability Model of the NBA. Sellers, K. F. (2012). A Distribution Describing Differences in Count Data Containing Common Dispersion Levels. 7(3):35 46. Skellam, J. G. (1946). The Frequency Distribution of the Difference Between Two Poisson Variates Belonging to Different Populations. Journal of the Royal Statistical Society, 109(3). Stern, H. S. (1994). A Brownian Motion Model for the Progress of Sports Scores. Journal of the American Statistical Association, 89(427):1128. 17