Numerical algorithms for predicting sports results Jack Blundell Computer Science (with Industry) 2008/2009
1 Numerical algorithms for predicting sports results Jack Blundell Computer Science (with Industry) 2008/2009 The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism. (Signature of student)
Summary

This project looks at how numerical data can be used to predict the outcome of sporting events. More specifically, it details specially created algorithms which make use of this data in order to predict the outcome of American Football games. The report presents a critical analysis of these algorithms against the actual match results. The algorithms range from simplistic single-feature predictors to complex statistical models. Furthermore, predictions made by the betting market are used to gauge the accuracy of the project's most accurate model. The report also includes a literature review describing previous numerical models that have been used to predict the outcome of sporting events.
Acknowledgements

Firstly, I would like to thank my supervisor Dr. Katja Markert for all her time, help and support throughout the project. I also want to acknowledge Dr. Andy Bulpitt for his valuable comments when marking the mid-project report and during the progress meeting. Furthermore, I wish to thank fellow student Lee Junior Tunnicliffe for proofreading my report, which was very much appreciated. Lastly, I would like to thank Kaj David for all his conversations about Football Manager after I had sworn not to touch it this year.
Contents

1 Introduction
   Introduction
   Project Aim
   Objectives
   Minimum Requirements
   Deliverables
   Potential Extensions

2 Project Planning
   Methodology
   Original Schedule
   Revised Schedule
   Choice of Programming Language
   Project Evaluation
      Statistical Difference Tests
      McNemar Test
      Betting Line

3 Background Reading
   American Football
      The National Football League
      Rules
      Spread Betting
      Power Scores
   Text Mining
      Predictive Opinions
      Judgment Opinions
   Numerical Analysis
      Numerical Models for Predicting Sporting Results
      Models Within American Football
      Models Within Other Sports
      Expert Opinions Within Sports
      Regression Analysis
      Logistic Regression
      Maximum Likelihood Estimation
   Machine Learning Software
      WEKA
   Summary of Reading

4 Prototypes
   Data Collection
   Prototype 1 (HOME)
      Prototype Summary
   Prototype 2 (PREV RES)
      Design
      Implementation
      Evaluation
      Prototype Summary
   Prototype 3 (Goddard & Asimakopoulos Model)
      Design
      Implementation
         Feature Extraction
         Training and Testing Set Creation
         WEKA Vector Convector
         Data Analysis Using WEKA
      Evaluation
      Prototype Summary
   Prototype 4 (Inclusion of Ranking Features)
      Jeff Sagarin's Power Ratings
      Football Outsiders
      Design
      Implementation
      Evaluation
      Prototype Summary
   Prototype 5
      Implementation
      Evaluation
      Prototype Summary
   Evaluation Against Betting Market

5 Evaluation
   Quantitative Evaluation
      Overall Prototype Evaluation
      Usefulness of Features
         Feature Ablation
         Ranking Of Features
   Qualitative Evaluation
   Project Evaluation
      Objective and Minimum Requirements Evaluation
      Project Extensions
      Schedule Evaluation
      Methodology Evaluation

6 Conclusion
   Conclusion
   Further Work

Bibliography

A Personal Reflection
B Project Schedule
C PREV RES Algorithm
D Feature Ablation Results
Chapter 1

Introduction

1.1 Introduction

Interest in sport has reached phenomenal heights over recent years with the help of satellite television and sports channels such as Sky Sports and Setanta. A Sky digital customer, for example, has access to over 25 sports channels [3], and combined with the vast amount of sporting information on the Internet, it has never been easier to become interested in professional sports. As a result, the art of (successfully) predicting the outcome of sporting events has become more sought-after. There are multiple ways in which fans can predict the outcomes of such events, for example using prediction websites (e.g. I Know The Score, which allows soccer fans to guess the score of English Premier League matches) that are solely for fun between friends and other fans. Alternatively, fans can make bets with a bookmaker either on the high street or online. A glimpse of this sizeable gambling interest can be seen with the English bookmaker Ladbrokes, which has over 2,000 shops in the UK alone and in 2008 announced pre-tax profits of £344 million [8]. Needless to say, a lot of money can be made if a person can make accurate predictions about sporting events. This project used numerical data within differing statistical models to produce predictions for American Football matches. This data included historical information about the competitors and their recent results, as well as other novel information that was expected to help achieve the best predictor possible.

1.2 Project Aim

The aim of this project was to develop algorithms that used numerical data for predicting sports results. In other words, it analysed how successful predictions can be when using numerical data alone, without the use of subjective information such as opinions or contextual data. This investigation was carried out within the domain of American Football.

1.3 Objectives

The objectives of this project were:

- To understand what information can help predict the outcome of an American Football match.
- To understand the different ways in which this information can be used to model a match.
- To create a model in order to predict the outcomes of American Football matches.
- To discover how successful this model can be when relying on numerical data alone.

1.4 Minimum Requirements

The minimum requirements of this project were:

- Development and implementation of existing sports prediction algorithms to apply to American Football.
- Development and implementation of enhancements to existing algorithms: integration of novel features.
- Feature ablation studies to identify the most useful existing and novel features.
- Critical analysis of existing and enhanced numerical algorithms by comparison to actual match results.

1.5 Deliverables

One of the deliverables produced by the project was a statistical model that used numerical data to predict the outcome of American Football matches. The other deliverable was this detailed report on the execution and findings of said model.

1.6 Potential Extensions

A number of potential enhancements were identified from the outset that could be applied to the project:

- To include subjective opinions from professional experts in predicting the outcome of a match.
- To compare the final model with the predictions of the betting market.
- To see how betting patterns could be analysed and used to increase the prediction accuracy.
- To see how successful the model is using data from a sporting domain other than American Football (e.g. ice hockey).
10 Chapter 2 Project Planning Originally, the project was going to utilise not only numerical data, but also use NLP (Natural Language Processing) to analyse the opinions of professional experts to achieve an accurate prediction. This was going to extend work carried out by McKinlay in which he used textual analysis to extract predictions made by American Football fans within Internet forums. However, he found that predictions made by individual fans were laced with bias. In short, he concluded that fans were poor predictors of sporting events [39]. In light of this, I intended to assess predictions made by professional experts and then incorporate the numerical data to improve those predictions. By Christmas, I had carried out background research on textual mining (and some research on numerical prediction models). Unfortunately, after prolonged searching, I was unable to find the expert data needed to proceed with the original project. Therefore, I decided that the numerical side of the project was interesting and detailed enough to be concentrated on solely. 2.1 Methodology The methodology for the project (in conjunction with the project schedule) was hugely important to ensure structure was kept, deadlines met and also to see that the direction of the project was maintained. After research was carried out to see which type of methodology would suit the current project, it was clear that a prototype approach was appropriate. This was due to the problem being more akin to a research investigation than a software engineering based project. This project s prototype approach was therefore centred around using numerical techniques (based on research discovered) to predict the outcome of a group of matches. Then analysing how each technique performed and seeing where its weakness lay with a view to rectifying this weakness in the next prototype. Hughes and Cotterell state that prototypes are working models of one or more aspects of the projected system that are constructed and tested quickly in order to evaluate assumptions [33]. Hence, this ability with prototyping to try ideas out and evaluate them quickly and at little cost to the project benefited this investigation immensely. Alternative models like waterfall and V-process do not allow for such iteration thus were not appropriate. It is also claimed that using a prototype is preferential when there is a need to learn about an area of uncertainty from the developer s point of view [33]. This was clearly advantageous within this project as I had little previous experience with developing predictive 3
11 algorithms. Prototypes can be split-up into two categories, throw-away and evolutionary. Throw-away indicates that test systems are developed and thrown away with a view to developing the true system [33]. Evolutionary describes when a prototype is developed and modified until it is finally at the stage where it can become the proposed solution [33]. An evolutionary technique would be advantageous if a new prototype was simply the previous prototype but with features either added or removed. However, some prototypes held no predictive qualities at all and thus were discarded from the next iteration. Therefore a combination of the two was seen to be the best approach. Having said this, there are risks associated with prototyping. These risks include the possibility of poor standards being carried out within the project, meaning the developer is more inclined to use programming hacks [33]. This would hinder the program s consistency and flexibility therefore these hacks were avoided. Furthermore, the code that was created for analysing one prototype was nonspecific and flexible so that it could be used in future prototypes (thus saving time). According to [33], to be considered a genuine prototype approach, the following must be executed: Specify what is hoped to be learnt from the prototype. Plan how the prototype is to be evaluated. Report on what has actually been learned. Therefore it is my aim to clearly carry out these points within this project report. 2.2 Original Schedule Figure B.1 indicates the initial project schedule. The first thing the reader will note is the grey block within December and January. This represents my exam period which I decided to concentrate exclusively on and thus needed to manage my project so that it did not suffer because of this predetermined neglect. The basis of the original plan was that all of the background reading be carried out within the first term. The majority of which would cover the textual analysis aspect of the project with a smaller amount of numerical prediction theory. This would then be used to formulate the design of a textual-based algorithm after Christmas. In terms of the numerical prototypes, the first two would be simple baseline algorithms that required no background reading in terms of statistical analysis. These would rely on common patterns associated with sporting outcomes (e.g. choose the home team to win). Then, numerical prototype 3 s design would depend on initial research carried out on complex predictive models along with the findings from 4
12 the first two prototypes. The implementation of this prototype would start before Christmas and be finished after the exam period. After assessing both numerical prototype 3 and the textual prototype, I would then combine the numerical and textual prototypes into one prototype which would then be evaluated. The best prototype from all of the work previously carried out would then be compared to the betting line. 2.3 Revised Schedule In Figure B.2, we see that the original plan was changed. Clearly, the textual prototype was discarded, however the background reading is still present as this had already been carried out. Also, as my project had become solely focused on the numerical algorithms, this meant more time was needed to carry out further numerical research. The stage Review of first three prototypes represents me looking at how these numerical prototypes had faired and how to proceed from that point. Moreover, as I had no idea how successful these prototypes would be or what where their weaknesses were, the amount of further prototypes required was unknown. Therefore, a period of a month was allocated to build upon the first three iterations without a predefined final number of prototypes in mind. This would depend on how successful each was and what knowledge had been gathered within the background reading. Clearly, by me setting no upper or lower limit on the amount of prototypes needed, there was potential for a certain lack of discipline. However, as I was not aware of how successful each prototype would be, this was essential. The important decision was to allow enough time in which to develop these future prototypes. The final prototype would be tested at the end of March against betting odds to reach an understanding of how accurate it really was. One other point to note is the large period devoted to Write up and Reflection. Due to the lengthy process I found when writing the mid-project report, I decided to stretch the process out whilst doing other parts of the project so it would not be rushed within the space of a couple of weeks Choice of Programming Language There were many languages considered to help implement this project, one of which was Python, a powerful dynamic programming language that is used in a wide variety of application domains [12]. It 1 Clearly this process was very selective as some parts of the report could not have been written if they had not been carried out at that point 5
has clear, readable syntax and does not restrict the user to Object-Orientated programming. For these reasons, it is recognised as being very efficient, as generally only a relatively small number of lines are needed to write simple programs. This is especially true when spidering 2 [39]; however, spidering was not seen to play a large part within the project. Java was another language that was considered. Java is a programming language originally developed by Sun Microsystems which relies on the Object-Orientated paradigm and is used by 6.5 million developers worldwide [9]. My knowledge of Python is far from extensive and therefore understanding the language would have become another aspect of the project. The learning curve would have benefited my programming experience in the long run. However, as the importance of the project was significant, I decided to use Java: I am far more accomplished with this programming language, and overcoming obstacles within the project was expected to be easier if I already had knowledge of the language used.

2.5 Project Evaluation

The prediction accuracy was used to assess the success of each prototype. This was the number of correctly predicted games divided by the total number of predicted games (expressed as a percentage) within the specified test set:

Prototype accuracy = (No. of correctly predicted matches / Total no. of predicted matches) × 100%

Statistical Difference Tests

Statistical significance assesses how likely it is that a result could have been obtained by chance alone, i.e. under the null hypothesis (which states that any observed result is attained purely through chance) [42]. Thus, a measure was needed to judge how likely it was that the difference between two prototype accuracies was obtained through chance. Dietterich put forward the use of the McNemar test whilst analysing five such measures [21]. When comparing these tests 3, he found that, because the other tests require the training and test sets to undergo forms of cross-validation, the McNemar test is ideal when dealing with expensive algorithms [21]. It was hard to say how time-consuming my algorithms would be, but I envisaged using text file parsing, which can be computationally expensive. Not only this, but Dietterich found that the McNemar test was one of the most reliable significance measures within the set of five. This was shown when Type I error tests were undertaken. A Type I error is when a statistical significance measure declares two algorithms as being different when they are in fact similar. The McNemar test was found to have a consistently low error rate when tested for Type I errors [21]. On consideration of this, I decided to use the McNemar test.

McNemar Test

The McNemar test makes use of matched binary pairs taken from the output of two distinct predictive models (A and B), where each pair relates to correct and incorrect forecasts [24]. These pairs are compared with each other and placed into four categories:

- A was correct, B was correct (a)
- A was correct, B was incorrect (b)
- A was incorrect, B was correct (c)
- A was incorrect, B was incorrect (d)

The totals of the two discordant results (i.e. where A and B achieved different results) are then placed into the following McNemar formula [24]:

χ² = (b − c)² / (b + c)

This χ² value represents the McNemar statistic which, when referred to a table of the χ² distribution [35], will reveal the level of significance between the two models. The p value extracted from the table thus represents how likely it is that the difference in accuracy between the two models was achieved through chance. For the purpose of this project, we shall generally use the p < 0.05 level of significance (there is less than a 5% chance the differences are down to chance). If a significance value below this 0.05 threshold is achieved then we can reject the null hypothesis.

Betting Line

The final prototype will be placed against the bookmakers' predictions. As discussed at further length within my background reading, I found that the bookmakers more often than not achieve the best forecasting results when it comes to predicting sporting events. Therefore, when my final prototype was completed, it was planned to compare its forecasts with those of the bookmakers. Hence, this betting data needed to be obtained. During my background reading I came across a paper [29] where this kind of data was used, so I contacted one of the authors. The author, Prof. Philip K. Gray, sent me the betting spreads, which accounted for over 3,000 NFL matches played between 1976 and 4

2 Browsing the Web for data in an automated manner
3 The five measures assessed were: the test for the difference of two proportions, two paired t tests (one based on random train/test splits, one based on 10-fold cross-validation), McNemar and the 5x2cv test
4 Spread data was not available for every game within this time period
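As an illustration of the calculation described above, the following minimal Java sketch computes the McNemar statistic from the matched predictions of two prototypes. The class and method names are illustrative, and the hard-coded 3.841 threshold is simply the χ² critical value for p < 0.05 with one degree of freedom; this is not code from the project itself.

```java
/**
 * Minimal illustration of the McNemar test described above.
 * Given per-match correctness flags for two prototypes A and B,
 * it counts the discordant pairs (b and c) and computes
 * chi^2 = (b - c)^2 / (b + c).
 */
public class McNemarExample {

    /** Returns the McNemar chi-squared statistic for two prediction runs. */
    public static double mcnemar(boolean[] aCorrect, boolean[] bCorrect) {
        int b = 0; // A correct, B incorrect
        int c = 0; // A incorrect, B correct
        for (int i = 0; i < aCorrect.length; i++) {
            if (aCorrect[i] && !bCorrect[i]) b++;
            if (!aCorrect[i] && bCorrect[i]) c++;
        }
        if (b + c == 0) return 0.0; // no disagreements, no evidence of a difference
        return Math.pow(b - c, 2) / (double) (b + c);
    }

    public static void main(String[] args) {
        // Toy data: per-match correctness of prototypes A and B on six games.
        boolean[] a = {true, true, false, true, false, true};
        boolean[] b = {true, false, false, false, false, true};
        double chi2 = mcnemar(a, b);
        // 3.841 is the chi-squared critical value for p < 0.05 with 1 degree of freedom.
        System.out.printf("chi^2 = %.3f, significant at p < 0.05: %b%n", chi2, chi2 > 3.841);
    }
}
```

In practice the two boolean arrays would be filled by running both prototypes over the same test set of matches and recording, for each game, whether each prototype's forecast matched the actual result.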
15 Chapter 3 Background Reading The background research carried out within this project is dissected into three areas. The first section explains why American Football was chosen as the sporting domain as well as some information on the sport. Section 3.2 describes researched work within the field of textual analysis which would have been used to analyse expert predictions. Lastly, Section 3.3 covers work analysing how numerical statistics have been used to predict the outcomes of sporting events. 3.1 American Football An important decision had to be made regarding the sport to be used within this project. This sport needed to adhere to requirements that would ensure that the project had a fair chance of being implemented successfully. These requirements included: Extensive history of results - Ensures that the sport has been played for enough years so that data would reach back far enough to allow for extensive analysis. Home advantage - It would be useful to assess how home advantage affects the outcome of a match. Therefore, every event had have a home competitor associated with it. Low frequency of ties - To predict the outcome to be a draw would have been be difficult, thus a sport where tied games are rare was preferable. Regular seasonal fixtures - If the sport involved some sort of league structure with games being played every year during set months then the extraction and analysis of data within the project would be benefited. This lead to American Football being considered. The sport fulfils all of the requirements above and therefore I decided that it would be a suitable choice. I investigated the scoring system of the sport as this knowledge was required when considering the potential results a game might have. Furthermore, a quick look into the match and league structure may aid in the designing of algorithms The National Football League The NFL is the professional league of American Football. It currently has 32 teams which are split up into two 16-team conferences which are in turn split up into four 4-team divisions. Each team plays a 16-game season encompassing the majority of teams in their division but a team can also play teams 8
16 outside their division or conference 1. The winners in each of the 8 divisions and the 2 best runners-up from each conference proceed to a knock-out tournament known as the Playoffs which culminates in the final championship game (the Super Bowl) [11]. The NFL is an American institution and is known worldwide. To grasp just how popular it is, a worldwide television audience of million tuned-in to watch the 2008 Super Bowl between New York Giants and the New England Patriots [14] Rules Each team has 11 players and at any one time, one side s group of 11 will be the offensive team (i.e. the team in possession of the football) and the opponents use their defensive players. The aim of the offensive team is to get into the area past the goal line (known as the end zone ) with the ball. They can do this through what are known as plays which can involve either running with the ball or passing the ball by throwing it (or a combination of the two). The play stops when the defensive team tackle the player with the ball or the ball goes out of play [10]. The offensive team start with first down and have four plays to get tackled 10 yards further up the pitch (to where they started from first down ). If they fail then the roles of the teams are reversed, the offensive team become the defensive team (and vice versa). If they do however gain 10 yards then they reach another first down and have four more plays to achieve another 10 yards and so on. Each match is split up into four 15 minute quarters. If the teams are level after this initial 60 minutes, then a 15 minute overtime period is played where the first team to score wins. If the teams are still level then the game is drawn, but as the point scoring system suggests, draws are very rare. Points can be scored in the following way: [10] Touchdown - Worth 6 points, and is achieved by carrying (or receiving) the ball into the end zone. Conversion - After a touchdown, the scoring team can either kick a conversion for 1 point, or attempt a more complex 2 point passed conversion. Field Goal - For 3 points, a team can kick the ball through the posts and over the crossbar. Safety - 2 points can be scored by tackling an offensive player in his own end zone. 1 This assignment of games is complex and will not be recognised in this project as the prediction algorithms will not take into account a team s division or conference 9
17 3.1.3 Spread Betting Spread betting is a form of gambling in American Football whereby the bookmaker specifies the margin by which he thinks the favourite team will win by. It is then up to the bettor whether he thinks this favoured team will win by more than the specified margin, in which case he bets for the favourite. Otherwise, if the bettor thinks the underdog will win (or fail to lose by that margin) then he will bet for the team viewed as the underdogs. If the favourites win by the exact margin the bookmaker specified, the bet is regard as a push [46] and the stake is returned. For example, the bookmaker putting a 7+ spread on Atlanta over Dallas signifies Atlanta are the favourites to win by more than 7 points. Imagine the bettor places a bet on Atlanta, then if: Atlanta win by more than 7 points - The bettor wins the bet. Atlanta win by less than 7 points - The bettor loses the bet. Atlanta win by 7 points exactly - A push occurs and the bettor gets his stake returned. Dallas win or the match is drawn - The bettor loses the bet. This form of betting incurs a weighted advantage towards the bookmakers in the form of commission. This means that for a gambler to achieve a positive return on their bets, they must have an average prediction accuracy of 52.4% [29] Power Scores Power scores are a popular way within the media to represent a football team s current strength. These are carried out by various newspapers and media sources, each using their own statistical methods. Although fairly secretive, these rankings are assumed to include information such as previous results, the strength of the opponents played and overall defensive/offensive capabilities [18, 44]. Every week, each team is assigned a value representative of their current overall quality, these values are then used to sort the teams into power rankings. The rankings can then be used by gamblers to assess whether one team will be victorious over another in an up and coming game. These power rankings are held in high regard within the footballing community. As there is no NFL-style Playoff system in college football, power rankings are used to determine which college teams performed better throughout the season in order to compile a final table. These rankings are compiled using the most accurate ranking systems from the media. Among the systems that are used is one created by Jeff Sagarin who publishes his figures within the national newspaper USA Today [18]. Clearly, such data could be used to predict the outcome of an American Football game. Thus, after some initial 10
18 research I found that Jeff Sagarin s archived team ratings are available from 1998 onwards [16]. 3.2 Text Mining Text mining is the process of analysing text to extract information for a particular purpose [48] and research was carried out into how this can be done. Generally speaking, text can take the form of fact or opinion and work was performed to see if the two could be separated. Yu and Hatzivassiloglou used a Naive Bayes classifier to determine whether a document is factual (e.g. news story) or opinionated (e.g. editorial) and found that it was harder to classify sentences than it was to identify documents in this way [49]. The main focus of my research was opinion-based as they form the basis of expert predictions. Kim and Hovy define two types of opinions, predictive and judgmental. Predictive opinions express a person s opinion about the future of an event whereas judgment opinions express positive or negative attitude towards a topic [36]. A predictive opinion would be I think that Miami will lose tomorrow and an example of a judgmental opinion is I think Miami Dolphins are awful. Furthermore, each predictive and judgmental opinion can have sentiment attached to it. The sentiment of a sentence is the feeling (positive, negative or neutral) which is implicitly stored within [36]. Sentiment can also be referred to as polarity, where the polarity of a sentence is either positive or negative [49] Predictive Opinions As the project was originally going to analyse expert predictions, the analysis of these predictive opinions was very important. With regard to this, Kim and Hovy developed a system called Crystal by which they explored the use of generalized lexical features within posts on a Canadian electoral forum [36]. They used supervised learning on these features to predict which party would become victorious within a certain riding 2 based on the forum predictions [36]. This was done through a SVM (Support Vector Machine) approach using varied feature combinations. They found that using a combination of uni, bi and tri-grams, they could successfully decipher the predicted party within the message with an accuracy of 73%. Furthermore, they could predict the result of a riding at an accuracy of over 80%. The Crystal system has been influential in this area of study as similar techniques have been used but in different domains [39, 17]. In [17], Baker took the Crystal idea and created a system whereby forum posts were taken from a UK election site and used to predict the outcome of an upcoming election (CrystalUK). He used a combination of uni, bi and tri-grams as his feature set and using a SVM 2 Canadian equivalent to constituency 11
19 approach, achieved a higher constituency prediction accuracy when compared to the original. However, the system s message prediction accuracy fell 4% short of the standard set by [36] with an accuracy of 69%. The system was then extended (Crystal2.0) to use pronoun resolution 3 as a feature within classification. Crystal2.0 also took into account the SVM classification strength of the sentences, rather than just whether they were postive or negative. Although Baker claimed this created a more robust system capable of generalising to smaller data sets, the system did not improve on CrystalUK [17]. More relevant to this current project, Crystal has also been used within an American Football context [39]. McKinlay used the ideas produced by [36] to analyse fan s predictions within forum posts [39]. As with [36] and [17], an n-gram feature combination was implemented to classify the data. As well as the SVM approached used above, McKinlay also looked into using a rule-based classifier called RIPPER 4. This system reached a message prediction accuracy similar to that of [17] and [36] but only reached a 52% accuracy when predicting the outcome of a match. He suggested that this poor accuracy (marginally above a random naive approach) was dependent on the accuracy of the fans predictions as opposed to the quality of the system. This theory was the original project s basis of analysing professional expert opinions and whether they were any better than the fans forecasts Judgment Opinions Film reviews appear to be a good source from which judgement analysis can be formed [45, 40]. In [40], Pang, Lee and Vaithyanathan found that when using a variety of standard machine learning techniques, sentiment classification is much harder to achieve than simply classifying the topic of the text. They suggested this is down to the reviewing author sometimes employing a thwarted expectations narrative in which he/she will use positively orientated words, only to come to the conclusion that the film is poor and vice versa [40]. One example of this is The film should be great, the cast is good, the director is experienced but it turns out to be a film to miss. These conclusions would have been important during the analysis of expert opinions as they could have contained such thwarted expectations. This is supported within [45] where it is claimed that the whole is not equal to the sum of the parts. Here, Turney uses a simple unsupervised learning algorithm to classify reviews for films, banks, cars and travel destinations as either recommended or not recommended. Sentiment classification here 3 Where pronouns are replaced with the previously mentioned candidate or party, thus increases the frequency of that candidate/party within the post 4 Where both Ripper and SVM were found to produce similar results 12
20 performed poorly as some film reviews include words with negative connotation such as blood and evil. Although these words do not invoke a positive reaction, they could be used to give a good review of a horror film thus leading to misclassification. However, he found that banks and cars were generally easier to classify and concluded that the whole is the sum of the parts in these contexts [45] 5. Sentiment mining has been carried out under varying detail: document, sentence and word-level. Revisiting [45], Turney used a number of techniques to classify a film review as recommended (or not). He used a Part-Of-Speech (POS) tagger to extract phrases containing adjectives or adverbs within the film reviews. He then used an Information Retrieval tool to see the similarity of these phrases with the words excellent and poor. This managed to achieve an overall review classification of 75% [45]. This was improved within [40] using differing methods, where an accuracy of 83% was achieved. They used Naive Bayes, Maximum Entropy and an SVM (Support Vector Machine) classifier on a standard bag-of-words framework on each film review. Then using a variety of features including combinations of unigrams, adjectives and a POS tag attached to each word, each of the three classifiers were carried out. These all reached accuracies between 73% and 83% but it was the SVM when used with just unigrams that achieved the highest classification rate [40]. One interesting feature that was used within this study was looking at the position of the word within the document. This worked on the basis that the summary of a review is usually found at the end of the document and although this feature did not improve the accuracy a great deal, this is worth considering within a textual algorithm. As well as looking at fact/opinion classification, Yu and Hatzivassiloglou investigated sentiment classification at sentence-level [49]. The authors tried to distinguish between positive, negative and neutral sentences within Wall Street Journal articles. They used the hypothesis that positive words coincide more often than just by chance (and the same for negative words) [49]. Using this, they applied a seed set of semantically-orientated words to calculate a modified log-likelihood ratio for each word within a sentence. This ratio represented the word s sentiment, thus an average of the ratio was used to classify the sentence s sentiment (i.e. the amount of positive and negative words in a sentence determining its polarity). They found that using a combination of adjectives, adverbs and verbs from the seed set yielded the best results [49]. More detailed studies have classified the sentiment of individual words or phrases, more specifically the sentiment of words found in subjective expressions [47]. Wilson, Wiebe and Hoffmann define a subjective expression as any word or phrase used to express an opinion, evaluation, stance, etc [47]. 5 Turney found that travel destination reviews were somewhere in between the two extremes 13
21 The authors used a lexicon of over 8,000 single-word subjective clues to find the polarity of expressions within a corpus. This was helped by distinguishing the difference between a clue s prior and contextual polarity, where prior indicates the polarity of the word on its own and contextual polarity indicates the sentiment of the word within a specific phrase. Here is how a prior polarity and contextual polarity could differ within expert opinions: Prior polarity - ridiculous [negative]. Contextual polarity - The Dolphins wide receivers have a ridiculous amount of pace [positive]. Using a simple classifier where the prior polarity was used to predict the contextual polarity of a clue, they reached an accuracy of only 48% noting that a lot of words with non-neutral prior priority appeared in phrases with neutral contextual polarity. Thus they devised two classifiers, one to firstly classify a clue s context polarity as neutral or polar, then a second to decide the clue s actual polarity (either positive, negative or both). The first classifier was done using 28 different features to separate all the neutral and polar words. Subsequently, clues that were classed as polar were then used within the smaller second classifier (10- feature) to decipher polarity. The 28-feature classifier managed to distinguish between polar and neutral words/phrases with 75% accuracy. However, the accuracy of the second classifier achieved an accuracy of 65% concluding that the task of classifying between positive and negative words is more difficult than classifying between polar and neutral [47]. 3.3 Numerical Analysis Numerical Models for Predicting Sporting Results Various numerical research has been carried in a sporting context to see how statistics can be used to formulate a prediction to an outcome. Other than solely looking at research regarding American Football, I decided to investigate similar studies using English football (soccer). This was due to the popularity of the sport and thus the popularity of studies using that context. This can be justified by the fact that there are a number of the papers regarding American Football which cited research into modelling soccer matches [18, 43] and vice versa [27, 25]. After some initial research, I noticed that the motive of many authors to create such models centred on trying to beat the gambling market. The authors cited different reasons for this approach. Many saw this as a good way of assessing their proposed predictive model as the gambling market is generally considered the most accurate source of prediction [18, 31, 25]. Forrest, Goddard and Simmons conclude 14
22 that this is due to the financial incentive associated with a bookmaker s prediction when compared to that of statistical systems or experts [25]. The most common reason amongst authors for comparing their proposed systems to gambling data was to find whether inefficiencies occur in the betting market [46, 29, 28, 27, 22]. A market is efficient if no betting strategy exists whereby on average that strategy yields significantly positive returns [29]. Alternately, some authors take the view of creating a model simply for the purpose of betting (and presumably to make money!) [41]. Some studies were carried out to see how accurate expert predictions regarding sporting events are and how these compared to the forecasts of the numerical models [43, 18]. Originally I researched these with a view to compare them with that of my textual experts findings. However, this research was useful to assess the actual level of expertise and knowledge that are involved in these predictions and to see if just by using historical data, these predictions could be bettered through statistical modelling Models Within American Football A lot of research I came across involved models that did not simply attempt to predict the winner of a match but tried to beat the spread (i.e successfully predict the margin by which a team wins). Stefani highlighted the difference in predicting the winner of a game and predicting the margin of victory. He claimed that good forecasters for the winners of NFL matches were around the 70 percent mark whereas to achieve a good profit by trying to beat the spread, an accuracy of well over 54% is needed. He concluded that if both are compared to a random approach (50%) and not many systems are superior to this spread boundary then the latter task is clearly harder [44]. One model attempted to beat the spread using an Ordinary Least Squares (OLS) approach to highlight certain biases made by the bookmakers between [28]. They used three independent variables, home, favourites and the spread within the model. Each NFL match was then modelled from the focus of one of the competing teams (the team was chosen at random). For example, if a team were playing away and were favourites then the home value would be 0, favourite would be 1 and the spread would be the amount by which bookmaker thought the team would win by. The dependent variable for each match was the difference between the actual margin of victory/defeat and the predicted spread (i.e. positive/negative if the team beat/lost to the spread, zero if a push was found). Using this model, Golec and Tamarkin found that biases against underdog teams and more specifically underdog teams playing at home were present within the betting market. They used these theories to predict the games between and found that when betting on home teams that were underdogs, a winning percentage of 55.6% could be achieved [28]. This is above the profit boundary of 52.4% 15
23 indicating that this could be a useful strategy for predicting matches, thus showing inefficiencies within the market. However, these tests were carried out on the same data which formulated the theory thus further tests on matches outside of the dataset would be needed (in my opinion) to confirm this. Another piece of research was carried out in the same vein as Golec and Tamarkin s work but using a differing technique. In [29], Gray and Gray viewed the OLS approach used within [28] as flawed. They deduced that the OLS system gave more weighting to games where the victorious team beat the spread by a large amount. This approach is not desirable when trying to beat spread betting, as no matter how well the team beat the spread, the bet is still regarded as a win. Subsequently, they preferred to use a discrete-choice probit model where the dependent variable represented whether the team beat the spread or not (rather than how much they beat/lost to the spread). Each match was then modelled from the home team s perspective using multiple variables. One variable represented the winning percentage (in terms of beating the spread) for the two teams in the current season. Also, a variable representing how many times the teams have beaten the spread in the last four games was used along with one indicating whether the home team was the favourite or not. When processing this model on their dataset (matches between ), the weight associated with the favourite variable had a negative coefficient meaning that this parameter had poor correlation to a team beating the spread [29]. This was found to be significantly different to zero therefore reaffirmed the findings within [28] that home-underdogs are reasonably likely to beat the spread. They also concluded that teams that have not performed well in recent games (relative to the spread) are more likely to beat the spread than if they had recently performed well surmising that the bookmakers overreact to the recent form of a team [29]. Using the matches within the dataset, the probit model achieved an accuracy of 54.46% when predicting whether a team would beat a given spread and 56.01% when tested on held back data (both of these accuracies beating a home-underdog approach). The model was improved by taking into account the probit probability of each modelled match and only betting where the probability of the team beating the spread was over This reached a significant success rate of 56.42% using in-sample matches and an even greater accuracy using the outof-sample data (however this was found to be statistically insignificant). Therefore, the method used within [29], which ignored the magnitude of the spread and included recent form of each team produced more accurate predictions than those based on Golec and Tamarkin s theory (home-underdog). Bouiler and Stekler also favoured a probit model approach within their study of simply predicting the winner of NFL games [18]. Rather than using recent form or progress over the course of the season, 16
24 they investigated whether power rankings could be used in order to predict the outcome. Initial tests were carried out over matches played between 1994 and 2000 to see if choosing the higher ranked team would provide an accurate strategy. This resulted in predicting the correct result 60.8% of the time. The accuracy was then compared with the predictions of a New York Times sport editor (59.7%), predicting the home team (61.1%) and the predictions of the betting market (65.8%) [18]. As this probit model did not even beat simply choosing the home team, another probit model was used to assess the probability of a team winning depending on the actual power ranking difference. It was found that as the magnitude of the differing rankings increased, the probability of the higher ranked team winning also increased. For example, if a team was power ranked 3rd they would have a greater chance of beating a team ranked 14th than a team ranked 5th. Thus by predicting the winner in all games with a probit probability of 0.5 or higher achieved a forecast accuracy second only to the betting market (beating home prediction and that of the editor) [18]. These findings concluded that the betting market was the best predictor out of the approaches covered, although power scores do hold information that can be used to achieve an acceptable prediction success rate. Another instance of research used a form of ratings to assess the accuracy of predicting the margin of victory as well as the match winner [44]. Here, Stefani used an OLS approach (similar to that of Golec and Tamarkin) in which the margin of the victory was predicted for each match. He based these predictions upon his own ratings for the teams involved. The ratings were based on the margin of victory from previous games which also took into account the strength of the opponents (i.e. the rating of the opponent at that point in time). These ratings were used in conjunction with a constant representing the home advantage. This constant was calculated by subtracting the average number of home points away from the average number of away points within the dataset 6. As the model relies on the concurrent updating of ratings, it was tested week-by-week during the NFL seasons to see how many winning teams could be successfully predicted. These tests were then compared to the accuracy of the betting line for those games. When predicting the winner, the least squares (OLS) model achieved an accuracy of 68.4% with the home constant which saw a 2 percent dropped when this constant was omitted from the model (highlighting the NFL home advantage). However, even with the home advantage considered this still fell short of the betting line which reached a success rate of 71% [44]. This further lays claim to the fact that betting lines are generally superior in predicting matches compared to statistical models. 6 This home constant was found to be around 2 points for the NFL 17
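To make the rating-difference idea above more concrete, the following is a simplified sketch of that style of prediction rule: a home-advantage constant is estimated from historical scores and added to the difference between the two teams' ratings. This is my own illustration rather than Stefani's actual model, and all names and figures in it are made up.

```java
/**
 * Simplified sketch of a rating-difference predictor with a home-advantage
 * constant, loosely in the spirit of the least-squares models discussed above.
 * All names and data here are illustrative, not taken from [44].
 */
public class RatingMarginSketch {

    /** Home advantage estimated as mean home score minus mean away score. */
    public static double homeConstant(int[] homePoints, int[] awayPoints) {
        double homeSum = 0, awaySum = 0;
        for (int i = 0; i < homePoints.length; i++) {
            homeSum += homePoints[i];
            awaySum += awayPoints[i];
        }
        return (homeSum - awaySum) / homePoints.length;
    }

    /** Predicted margin of victory; positive favours the home team. */
    public static double predictedMargin(double homeRating, double awayRating, double homeAdvantage) {
        return (homeRating - awayRating) + homeAdvantage;
    }

    public static void main(String[] args) {
        int[] home = {24, 13, 31};
        int[] away = {17, 20, 21};
        double h = homeConstant(home, away); // about +3.3 points on this toy data
        // A home team rated 88.0 against an away team rated 84.5:
        System.out.printf("Predicted margin: %.1f points%n", predictedMargin(88.0, 84.5, h));
    }
}
```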
25 Stefani then proceeded to see how the model faired when trying to predict the actual margin of victory. A successful prediction was seen to be if the predicted outcome was in the same direction as the actual outcome and above the margin of victory. The results were categorised in terms of how many points the model s predicted outcome was away from the betting spread for that game (1-2, 3-4, etc). This saw that if the model was only 1 or 2 points away from the betting line then it could accurately predict margin of victories at 58.4% [44]. In [31], Harville used a mix of complex linear-models to predict the outcome of American Football games between 1971 and The system relied on the differences of team s yearly characteristics. These characteristics were the number of points scored/conceded by that team in relation to an evaluated average team within that year. These are similar to the ratings used within [44] as they represent the difference in strength of the two teams. The proposed model also takes into account the home advantage in a similar vein to the one within [44]. The optimal values for the model s parameters were then found through a maximum likelihood procedure. The model achieved an accuracy of 70.3% which only slightly fell short of the predictions made by the betting line during that time period [31]. He acknowledges a flaw within the model described as it does not currently constrict incidents of teams scoring large amounts of points against weaker teams (e.g. 54-0) known as running up the score [30]. These outcomes in his model will receive higher weighting and he claims that this has an adverse effect on the modelling process and should be somehow restricted [31]. Harville states that the key to successfully predicting sporting events is related to the rating or ranking of teams (e.g. his characteristics or power scores in [18]) Models Within Other Sports One statistical model assessed the efficiency of the soccer betting market by predicting the outcome of English football matches [27]. Here, Goddard and Asimakopolous used an ordered probit regression model to assess whether weak-form inefficiencies occur within the betting market. That is, whether all historical data (previous results, etc) relevant to a specific match is contained within the odds, otherwise the market is found to be weak-form inefficient. As well as implementing soccer equivalents of some of the features seen in Section (i.e. recent performances or season win ration), the probit model also incorporated a number of novel features. Some of these included the distance between the two teams and whether the match is significant in terms of promotion or relegation for either team. The model was estimated using 15 years worth of previous data and tested on English football games between These games would then be examined against odds from 5 separate bookmakers. 18
26 To compare this system to the bookmaker s odds, a separate model was needed to represent these odds. This was done through a simple linear model by regressing the result of a match (home, away or draw) against the implicit bookmaker s probabilities attained from the odds for that match. They then added the probability of the result from the probit model to the bookmaker s model in the form of a variable. The authors surmised that if information held within the probit model was already present in the bookmaker s odds then this variable should be insignificant when the linear model was re-estimated. In other words, the information used within the probit model predictions should not help the bookmaker s model in reaching the prediction of a match. However, it was found that the variable was significant at 0.01 level concluding that the model they proposed does contain information not enclosed within the bookmaker s odds. The authors determined that this showed the soccer betting market was weak-form inefficient. This was backed up by claims they could get a non-negative return on betting when using the model s highest result probability for each match and taking the best odds for that result [27]. One interesting use of team strengths was undertaken by Rue and Salvesen in which they catered for the psychological effect in relation to the difference between the qualities of the competing teams. Their work was carried out using a Bayesian model for home and away teams whereby the goals scored by one team relied on the difference between their attacking strengths and the opposition s defensive strengths. The psychological variable was attained by calculating the difference between the two team s collective strengths (i.e. home attack + home defense - away attack - away defense). This is based on the assumption that if a team is far superior to the other team, then complacency could set in for the better quality team thus giving the perceived weaker team an advantage [41]. Since the magnitudes of victory within previous matches were taken into account, the authors imposed a restriction on the magnitude of goals scored in previous games to 5. For example, if a previous game ended 7-0 then this would be recorded as 5-0 as they deduce that goals past this mark are not informative within the model development. This underlines the claim made by Harville that teams should not be allowed to run up the score [31]. When tested on the second half of the season (using the first half of that season to collect information about the team s respective strengths) it reached a similar prediction performance when compared to the bookmakers odds available for those games [41]. Away from soccer, Hu and Zidek looked into forecasting basketball games [32]. More specifically they used a Weighted Likelihood approach in modelling NBA Playoff games to predict a winner. Firstly, they define two types of historical data which can be used to predict the outcome of a game. One of these is direct information which refers to data that is only relative to matches between the two teams 19
27 (i.e. the most recent results between the teams) [32]. The other type is called relevant information, which is simply attributed to all other data that can be used to predict the outcome [32]. In this study, the authors used all the games played within a season to predict the outcome of the endof-season Playoff games 7. Here, the direct data refers to the results of games involving the two teams within that season. All the other games played by each team within the year were used as relevant information. Therefore, if Chicago Bulls play at home to Orlando Magic in a Playoff match, the model s direct data refers to the results between the two teams during that season where the Bulls were at home. The relevant results are where Chicago Bulls played at home in that season against other teams along with the results of Orlando Magic s games when they played away. Only the results of these games are used and not the magnitude of victory/defeat. It is also noted that this approximation does not take into account the strength of the opposition within these previous results. They do however try to combat this in modifying the model by removing the weaker teams from the relevant results (only recognising teams that had won over 50 out of the 80+ games) which provided better results than the original version. This model, whereby the weaker teams were excluded proved successful in predicting the winner of the 1997/1998 Chicago Bulls and Utah Jazz Playoff match [32]. Lastly, Stefani proved that models can be transferred across different sports. He did this by applying his least squares approach (mentioned previously within Section ) to basketball and soccer [44]. Initially, he used the model to predict college basketball games between 1972 and 1974 with an accuracy of just short of 70%. This was bettered when he adapted the system to the 1974 World Cup when an accuracy of 74% was gained in predicting the results throughout the tournament [44]. This suggests that consistent forecast rates can be achieved when a model is used within a new sporting domain Expert Opinions Within Sports Song, Bouiler and Stekler looked into the NFL predictions of statistical models and experts on a mass scale [43]. They collected the predictions of 70 experts (from the national media) and 32 numerical models (from various football research) to get forecasts for American Football games between 2000 and They found that on average, the expert and system accuracies were relatively similar when attempting to predict the winner achieving 62.2% and 61.7% respectively 8. However, yet again these 7 Each basketball team plays around 80 games in a season and therefore encompasses more information than a soccer or NFL season 8 Result were not found to be statistically significant 20
forecasts were overshadowed by the accuracy of the bookmakers' predictions. It was also noted that the dispersion of accuracies was much higher amongst the experts than the statistical systems, with experts achieving both the best and worst accuracies. This shows us that forecasting success is much more consistent when dealing with numerical systems than with experts [43]. This is backed up by the research of Bouiler and Stekler (mentioned in Section ) which showed that the New York Times editor's predictions were worse than simply choosing the home team within NFL games [18]. Next, they used the forecasters mentioned above to predict each game's margin of victory and attempt to beat the betting spread. This found that both sets of approaches were (on average) short of the required profit margin of 52.4%. In summary, the authors decided that there was not enough statistically significant evidence to separate the accuracies of experts and numerical systems [43].

Regression Analysis

If the solution to the project's problem was to use a model similar to those mentioned above, then, due to the number of features and the vast amount of data plugged into the system, regression analysis was required to evaluate the project's prototypes. Chatterjee and Hadi define regression analysis as a set of data analytic techniques that examine the interrelationships among a set of given variables [20]. These techniques could be utilised to inspect the features during model training and to use the interrelationships between the features to predict the outcome of unseen matches. The basis of most statistical modelling involves transforming each observation (sample of data) into an equation whereby a result value (the dependent variable) is equal to a weighted sum of the model's features (independent variables). Each feature has a weight attached to it, where the weight represents the correlation between that feature and the model's result value. This equation can be seen here [34]:

y = Σ_{x=1}^{n} w_x f_x = w · f    (1)

where y is our dependent variable, the f_x are the model's n features and the w_x are the features' associated weights. Thus, using the known variable values (both dependent and independent) of the training observations, the coefficients of the weights are found through an estimation method. Then, during the testing of the model, these weights are used in conjunction with the independent variable values of each test observation to predict a dependent variable value [34]. With respect to the problem in hand, clearly each observation is an NFL match where the dependent variable represents the outcome of that game. Each independent variable represents an item of data relative to the match
that could be used to predict the outcome. One way in which this process can be carried out is linear regression, where the value of the dependent variable is a real number. With respect to this project, linear regression could have been used to predict the magnitude of victory for NFL games, which would give an implicit predicted winner for each game (i.e. a positive value would be a home win, a negative value an away win). However, this would involve extra processing of the model that could be avoided if an alternative regression method which produces a binary result value was used instead. This is where logistic regression was considered.

Logistic Regression

Logistic regression is a mathematical modeling approach that can describe the relationship between several independent variables and a binary dependent variable [37]. As the solution to the problem simply needs to predict whether team A or team B wins, logistic regression seems better suited to analysing the prototypes within this project. The technique of logistic regression is based around the concept of odds. In other words, within the testing of our NFL games, the model will assess the game data and attain two probabilities: one represents the likelihood of the outcome being a home win and the other represents an away win. The former is then divided by the latter in order to obtain the odds ratio [37]. For example, if a game has a 0.25 probability that the home team will win and 0.75 that the away team will win, then the odds ratio would be 0.25/0.75 (a third). In other words, the probability of the home team winning is one-third the probability of the away team winning (or, in bookmakers' terms, 3-1 for the home team to win). Thus we have the equation:

\ln\left(\frac{P(y=1 \mid x)}{1 - P(y=1 \mid x)}\right) = w \cdot f    (2)

The left-hand side of (2) is the logit function, the natural logarithm of the odds [34]. An equation is now needed to obtain the probability of y being true; using algebraic manipulation on equation (2) we form 9:

P(y=1 \mid x) = \frac{1}{1 + e^{-w \cdot f}}    (3)

Equation (3) is called the logistic function and maps values of w \cdot f from negative to positive infinity onto the interval between 0 and 1 (which will be utilised to attain the outcome probabilities). For instance, in the example above, w \cdot f = \ln(0.25/0.75) \approx -1.10, and substituting this into (3) recovers the home win probability of 0.25.

Maximum Likelihood Estimation

Clearly, some way was needed to obtain the coefficients of the weights within the logistic model. One of the more common approaches is known as Maximum Likelihood (ML) estimation [37].

9 This will not be detailed here; for more information see Jurafsky and Martin [34]
This is the process of training the weights to achieve the highest probability of each observed y value [34]. Kleinbaum and Klein define the likelihood function as the likelihood of observing the data that have been collected [37]. That is, the (log) probability of the observed outcomes produced by using certain coefficients for the respective weights w within the model [34]:

L(w) = \sum_{i} \log P(y^{(i)} \mid x^{(i)})    (4)

where i ranges over all of the matches within the training data. Therefore, during the training of the model, we need to find the optimal weights \hat{w} that produce the highest probability for the outcomes within the training data [34]:

\hat{w} = \arg\max_{w} \sum_{i} \log P(y^{(i)} \mid x^{(i)})    (5)

3.4 Machine Learning Software

If one of the regression techniques was to be used in this project then software would be needed to carry it out. This software should be able to process vectors of features, where each vector represents one NFL match. It must be able to make use of the feature data encompassed within each training vector and form feature weights to produce predictions for the unseen test vectors.

WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a collection of machine learning algorithms for use in a data-mining context [48]. It is open source under the GNU General Public License and also incorporates an API which allows WEKA functionality to be used within a Java program. Given this, the training and testing of prototypes could be incorporated into the Java programs which create said prototypes. The WEKA algorithms include a Logistic regression class which implements a ridge estimator in conjunction with the ML estimation described above. This ridge implementation restricts the weights within the ML process, which aids logistic models where there are considerably more independent variables than data observations [19]. As I used NFL match data encompassing more than 20 years, this scenario did not apply to this project and thus the ridge estimator was not needed. The format of WEKA's input is the ARFF (Attribute-Relation File Format) file, an ASCII text file that describes a list of instances relating to a set of attributes [48]. Aside from the advantage of having Java capabilities, this software was also used within McKinlay's successful body of work [39], thus I decided to use it to analyse the prototypes that were created.
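To give a concrete picture of how the WEKA API described above can be driven from Java, the sketch below trains a Logistic classifier on one ARFF training file and measures its accuracy on a test file. It is a minimal illustration only: the file names, and the assumption that the match outcome is the final nominal {0,1} attribute, are mine rather than taken from the project's actual code.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.Logistic;
    import weka.core.Instances;

    public class LogisticSketch {
        public static void main(String[] args) throws Exception {
            // Load training and test vectors from ARFF files (file names are illustrative)
            Instances train = new Instances(new BufferedReader(new FileReader("train_1979-1998.arff")));
            Instances test = new Instances(new BufferedReader(new FileReader("test_1999.arff")));

            // The match outcome is assumed to be the last attribute, declared as a nominal {0,1} class
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build WEKA's logistic regression classifier on the training vectors
            Logistic model = new Logistic();
            model.buildClassifier(train);

            // Evaluate the trained weights on the unseen test vectors
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(model, test);
            System.out.println("Prediction accuracy: " + eval.pctCorrect() + "%");
        }
    }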
31 3.5 Summary of Reading The focus of this project was to be a logistic regression model whereby each match was represented through multiple variables and estimated using Maximum Likelihood. I did not come across any research involving a logistic model to predict the outcome of sporting events so I decided to use this approach and see if it could reach the predictive accuracies achieved by the alternatives mentioned (e.g. linear, OLS, etc). The independent variables would be based on historical data or other novel features and be used to predict the binary dependent variable representative of the match outcome. The need for the model to produce this binary value was the main reason to use a logistic approach. The logistic model can be seen to be just as effective as other more complex equivalents. Duh et al. surmised that the logistic function was no worse in performance when compared to a Neural Network approach and also claim the logistic function as being simpler and computationally less expensive [23]. This simplicity would aid the project with regard to understanding and analysing the nature of each feature. Moreover, eliminating the extra computational time associated with using a more complex model would give me more time to work on prototypes. In terms of the data that was to be used within the model, I did not include friendly or Playoffs games. Harville concluded that as they are not competitive in the literal sense, friendly matches are hard to predict and have little predictive quality anyway [31]. Also Vergin and Sosik found that Playoff games are very unpredictable and veer away from regular-season conventions [46], thus will also not be considered here. Furthermore, as mentioned previously, tied games would not be used within this project (in both training and testing). This statistical model will make use of the most sensible and most useful features with the view to attaining the highest prediction accuracy possible. These features will take inspiration from the models researched within Section 3.3 and from the evaluation of previous prototypes. I also intended to implement a couple of baseline prototypes which relied on a single variable. Having said this, it was still my aim within these early prototypes to achieve an accuracy above the random prediction approach of 50%. The objective then was to assess these basic ideas and if they showed to have some predictive ability, then they would be incorporated into the more complex logistic model. 24
32 Chapter 4 Prototypes 4.1 Data Collection To carry out this numerical analysis, data was needed in the form of previous American Football games dating back at least 10 years. Obviously, these results had to be accurate and ideally collected from the same source to ensure consistency. Research into this led to the discovery of one such website 1. An example of the website data can be seen in Figure 4.1. Although the records stored here actually went back as far as 1920, the decision was made to only use results from 1970 onwards. This was based on reasoning that in terms of storage and algorithm run time, going back further than this date would not be beneficial as 37 years worth of data would suffice in training and testing. Furthermore, before 1970, a league named the AFL (American Football League) existed which rivalled the NFL. This meant that professional teams were split across the two leagues up until the AFL and the NFL merged in As each NFL season was situated on a different web page, the next step was to create a program in Java that would spider through the website. Built specifically for this website s HTML, it iterated through the different pages and printed all the matches to a text file (footballresults.txt). My specially created application recorded the date of the match, the home team, the away team and their respective scores 2. Also printed to the file was the outcome of the game (home team, away team or tie). The section of footballresults.txt that relates to the input within Figure 4.1 can be seen in Figure 4.2. As you can see from Table 4.1, the number of draws between 1970 and 2006 were very minimal which backs up the claim made in Section 3.1 that NFL have very few tied matches. To highlight the contrast with soccer, this home:away:draw ratio of approximately 46:34:0.5 can be compared to that of a ratio of soccer matches 46:27:27 [22]. 4.2 Prototype 1 (HOME) The first prototype was a very simple predictor which as the name suggests crudely selects that the home team will always win. This is based upon a widely-held view that within sports (especially team As NFL teams are franchised, some have undergone various name changes over the years, thus all names were converted to their most recent franchise name 25
Figure 4.1: Pro-Football-Reference.com Data for Start of 1975 NFL Season

Table 4.1: Statistics from the collected match data between 1970 and 2006
  Matches     8063
  Home Wins   4631
  Away Wins   3387
  Draws         45

sports) the home team has some advantage over the opposition [46, 18, 31, 44]. Clearly, this advantage is not easily measurable, and if the away team is much stronger than the home side then the advantage will only go so far in helping the home team. However, due to various factors, such as the influence of the home crowd outnumbering the away fans and the travelling involved for the away team, it is a sensible place to start when determining the winner of a football game. What is more, research has shown that this naive method can outperform both statistical models and expert opinions [18]. The algorithm (HOME Predictor) which I created simply picks the home team to win in every match. This simple unsupervised approach performs fairly well and improves on a naive random approach by over 7% (as shown within Table 4.2). In terms of significance against a random approach 3, these results were found to be substantially past the 0.05 threshold I had chosen from the outset.

3 Prediction relied on choosing the team whose name came first alphabetically, incidentally attaining 48.1% accuracy
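As a minimal sketch of the HOME baseline, the following reads the collected results file and reports how often always picking the home side is correct. The comma-separated layout (match id, date, home team, away team, home score, away score, winner) is assumed from the description of Figure 4.2; the real footballresults.txt may use a different separator or field order.

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class HomePredictorSketch {
        public static void main(String[] args) throws Exception {
            int homeWins = 0;
            int awayWins = 0;   // drawn games are skipped, as in the report
            BufferedReader in = new BufferedReader(new FileReader("footballresults.txt"));
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed layout: id,date,homeTeam,awayTeam,homeScore,awayScore,winner
                String[] fields = line.split(",");
                int homeScore = Integer.parseInt(fields[4].trim());
                int awayScore = Integer.parseInt(fields[5].trim());
                if (homeScore > awayScore) homeWins++;
                else if (awayScore > homeScore) awayWins++;
            }
            in.close();
            // Always predicting the home side is correct exactly when the home team won
            double accuracy = 100.0 * homeWins / (homeWins + awayWins);
            System.out.printf("HOME baseline accuracy: %.1f%%%n", accuracy);
        }
    }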
34 Figure 4.2: Text Output from the Data Collection Program. Displaying Match Id, Date, Home Team, Away Team, Home Score, Away Score and the Winner Prototype Summary Referring back to the prototype approached recommended by Hughes and Cotterell [33]: I hoped to learn how accurate choosing the home team is within American Football. This was to be evaluated by the accuracy when tested on all games and was to be compared to a random approach. It was shown that predicting the home team produced a significantly better accuracy than choosing the winning team at random. 4.3 Prototype 2 (PREV RES) After seeing that the home advantage is a clear one in American Football, I tried to see if previous results between the two teams could improve on the first prototype. This built upon the theory of direct data proposed by Hu and Zidek [32] mentioned in the background reading. This supervised approach should improve on the simplistic approach seen in Prototype 1 as I am using historical data to predict the winner Design This prototype took each game and used the previous encounters between the two teams to predict the outcome. Firstly, I needed to know how much direct data to use. Thus, each match used 1, 3, 5 and 27
35 10 years worth of previous meetings to yield four separate predictions and attain the optimal number of years to use. Therefore, I split the processing up into running one algorithm four times, each using a differing parameter (the number of years to go back). Each of these iterations would then be compared against each other and ultimately against Prototype Implementation I created a program (PREV RES Predictor) which iterated through the results text file (footballresults.txt), taking each match and creating a PrevRes class using the two teams and match year as arguments. This class contained a method which took the number of years in which to visit and iterated through the results to get previous matches 4 between the two teams within that designated period. This process was called 4 times for each match (using 1,3,5 and 10 years as arguments). Then the aggregation of the number of wins for each team was found, with the higher of the two aggregations becoming the predicted winner of the match 5. To increase efficiency of the algorithm, rather than constantly iterating through the results text file, the results were stored in a result matrix. The result matrix simply involved storing all the match-ups that had happened since 1970 and recording win tallies between the two teams for each year since then. The matrix is a Java HashMap containing all the different combinations of teams (teama vs teamb) that have played each other since Then within each match-up in the HashMap is another HashMap containing tallies (number of teama wins, number of teamb wins and the number of draws) for each year since An fragment of this matrix can be seen within Table 4.3. Prototype 2 s algorithm using the matrix can be seen in Algorithm 1 (Appendix C). Here we see that it only iterates the list of matches within footballresults.txt once, this is much more efficient than doing so every time a result needs to be found Evaluation Table 4.4 details the breakdown of how each year-set of previous results faired. This tells us that although previous results are a useful piece of information, the further back the meetings go, the less accurate the feature becomes. This is shown as 10 years worth of previous results attains a prediction of 55% whereas only using last season s results yields an accuracy of 58%. Although this 58% marginally beats the first prototype (Table 4.2), I failed to reject the null hypothesis as they were not significant. 4 If a previous meeting ended in a draw, the encounter was ignored 5 If the two teams won the same amounts of games then the algorithm would fall back onto Prototype 1 algorithm 28
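The result matrix just described can be pictured as a pair of nested maps. The sketch below is an illustrative reconstruction of that structure; the key format ("teamA vs teamB"), the method names and the outcome encoding are assumptions rather than the project's actual code.

    import java.util.HashMap;

    public class ResultMatrixSketch {
        // Outer key: "teamA vs teamB" match-up; inner key: season year;
        // value: tallies of {team A wins, team B wins, ties} for that year.
        private final HashMap<String, HashMap<Integer, int[]>> matrix =
                new HashMap<String, HashMap<Integer, int[]>>();

        // outcome: 0 = team A win, 1 = team B win, 2 = tie
        public void addResult(String teamA, String teamB, int year, int outcome) {
            String key = teamA + " vs " + teamB;
            if (!matrix.containsKey(key)) {
                matrix.put(key, new HashMap<Integer, int[]>());
            }
            HashMap<Integer, int[]> byYear = matrix.get(key);
            if (!byYear.containsKey(year)) {
                byYear.put(year, new int[3]);
            }
            byYear.get(year)[outcome]++;
        }

        // Aggregate the tallies for the n seasons before 'year'
        public int[] tallyForPreviousYears(String teamA, String teamB, int year, int nYears) {
            int[] totals = new int[3];
            HashMap<Integer, int[]> byYear = matrix.get(teamA + " vs " + teamB);
            if (byYear == null) return totals;
            for (int y = year - nYears; y < year; y++) {
                int[] t = byYear.get(y);
                if (t != null) {
                    totals[0] += t[0];
                    totals[1] += t[1];
                    totals[2] += t[2];
                }
            }
            return totals;
        }
    }

A prediction for a given match then only needs to sum the pre-aggregated tallies for the requested number of previous seasons, rather than re-reading footballresults.txt every time a result is needed.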
36 This chronological decrease in accuracy is because information about a certain team will be more accurate the nearer it is to the present year. Take for example if Miami Dolphins beat the Dallas Cowboys at home 10 years ago, many factors may have changed since then. It s likely that no players that played in that match still play for either team, the Dolphins might play their home games in a different stadium and so on. Therefore that game is not an accurate representation of the two team s current situation and thus it is not a good piece of data to predict the outcome of a present day game Prototype Summary Referring back to the prototype approached recommended by Hughes and Cotterell [33]: I hoped to learn how accurate predictions were solely reliant on previous results between the two competing teams. Whether using more/less years of results affected this accuracy and furthermore to discover whether this data held more predictive qualities than just choosing the home team. This was to be evaluated through the accuracy tested on all games when using 1, 3, 5 and 10 years of previous results. Comparing these 4 iterations and contrasting the most accurate with Prototype 1. I learnt that the more data used in this scenario, the less accurate the predictor became. This also showed that using last year s results equalled that of predicting the home team to win. Table 4.2: Prototype 1 & 2 Tested On 8018 Matches between Prototype Accuracy(%) 1 (HOME) (PREV RES) 58.0 Table 4.3: Part of Result Matrix Showing (Team A Wins-Team B Wins-Ties) Tallies From Various Years Team A vs Team B Atlanta Falcons Miami Dolphins Dallas Cowboys Pittsburgh Steelers Prototype 3 (Goddard & Asimakopoulos Model) After assessing the results from the previous two prototypes, the accuracies were higher than a random prediction approach but the question was asked, can more complex models improve on these accura- 29
37 Table 4.4: PREV RES Results Using Differing Amounts of Data No. of Years Accuracy (%) cies? This can be seen from a least squares approach which achieved 68.4% for predicting a winner in an NFL match [44]. Harville achieved even higher than this when he implemented his mixed-linear model to carry out the same task (70%) [31]. It was clear that multiple features would need to be incorporated into one statistical model to predict the outcome of matches with a high success rate. This supervised approach (more complex than Prototype 2) would therefore be look at previous results to assess relationships between these features and the outcome to predict the outcome in other matches. Taken from research within Chapter 3, the statistical model put forward by Goddard and Asimakopoulos within [27] was considered. As well as the authors claiming that the features within this model held information that could be used to successfully predict soccer matches, they found that they could even help the bookmakers make more accurate predictions. Although the model within [27] is used to represent the outcome of an English soccer match, I felt that the features within the model are transferable to most team sports. This was down to soccer and American Football matches sharing a lot of common traits. Both involve two contesting teams consisting of a certain number of players trying to score the most points within a set amount of time. The teams from both sports play matches home and away on a frequent basis within a structured league where a team s home games are played at a regular venue. Furthermore, as mentioned previously, the research that I came across within the field of soccer and NFL involved references to the other. Thus, I looked into the features of the Goddard and Asimakopoulos model Design Research within [27] showed how the outcome of an English football game could be represented through features within an ordered probit regression model. Although a probit approach is not identical to the logistic model, the ideas and theories of the independent variables can still be use within this project. These variables were then adapted to fit the model of American Football. The features that Prototype 3 30
38 used to predict an American Football match between home team i and away team j in year k were: The win ratios for 2 years previous to k (for i, j) - This looked at the percentage of games a team won within a season, carried out for the two seasons previous to year k which produces two percentages. This gave some indication to how successful a team has been against all other teams within previous seasons. The m th recent home results (for i, j) - This used the results of the previous m home games of both teams. [27] found that 9 was the optimal value of m within soccer. This value was initially used within this model, however at that stage it was not known whether this value was optimal. The idea of analysing recent games can also be seen in [29], whereby they used the last four games in assessing how teams faired against the spread thus emphasising the predictive quality of a team s form. The n th recent away results (for i, j) - This used the results of the previous n away results of both teams (similar to recent home games, it was found 4 was the optimal value of n [27] and therefore this value was used). The geographical distance between cities/towns of i, j - This was related to how far the away team would have to travel to play the match. This was based on the hypothesis that the further a team has to travel, the least likely they are to win. When modeling soccer matches to compare with expert opinions, Forrest and Simmons used distance as one of their features, highlighting that it could be a factor in predicting the outcome [26]. The capacity of the stadium in which i and j played their home games during year k - Within [27], a feature was present which was a residual for a certain team based on average home attendance and their final league position. This was referred to as the big team effect. Thus if a team has a above-average home attendance and finished high in the table then it is likely to be a big team. It works on the premise that bigger teams are more likely to beat smaller teams. As average attendances that dated back more than a year were difficult to find and the fact that NFL teams are split into small separate tables, this feature was hard to replicate. Hence, I decided to use the size of stadia that the teams i and j played in within year k. In other words, a team playing in a large stadium will only have such a stadium if they have enough fans to fill it. Thus if a team has a lot of fans then it is reasonable to assume they are a successful (or big ) team. This feature will also give an indication of how the size of the home team s stadium effects the opposition performance (building on work by Vergin and Sosik [46]). 31
39 The result of the corresponding fixture last year between i and j - This looked at whether the teams played each other in the year previous to k with i playing at home and j playing away. If the teams played two games within this scenario in year k-1 then the more recent result was used. This also made use of the information found within Prototype 2 by limiting this feature to just one previous year Implementation A program (VectorModelCreator Goddard) was created which used the results text file to created vectors representative of each match involving all of the features mentioned in the previous subsection Feature Extraction I created a class called Match which, at the point of construction took in information such as the match year, the two team names, the winner, etc. Thus, VectorModelCreator Goddard took each game and created a new instance of the Match class using information attained from the text file. This Match class then allowed access to methods that were used to acquire the features. The values were written to a text file in the form of a vector with the year of the match at the start and each feature being separated by a comma. Throughout the process of calculating the various features, if it was found that one feature did not have a value (for example, if the two teams did not play each other the year before) then I implemented WEKA s missing value function. This missing value (stored in the vector as? ) uses a mean value within the data set for that feature. The features mentioned within Section were implemented by the Match class in the form of different methods: Win ratios - This made use of the result matrix again. The method took a year and a team name as parameters and iterated through all of the match-ups within the matrix to check if one of the teams within the match-up was the team in question. If so, then the result tally from the specified year was extracted. Subsequently, the tally values were obtained and used to accumulate all the wins, losses and draws involving that team within that year. When all the match-ups were iterated through, the win ratio was achieved by the number of wins divided by the total number of games involving that team within that year. This process was then carried for all the years previous to the match year (up to 1970) for team i and j. All the previous win ratios were recorded as this would allow scope for the number of previous win ratios to be increased. The process of cutting 32
40 all these win ratios to just the previous two years was carried out within the training and test set creation (Section ). Recent home or away games - This algorithm could not make use of the result matrix as I needed x recent home/away games. To store the date of the matches within the result matrix would complicate the storage of matches and contradict the purpose of the matrix itself. Therefore it was decided to take the results text file and create a new version in which it would be reversed. This involved writing a separate Java program which started from the end of the original results text file and iterated backwards writing the same information to a new file (reversefootballresults.txt). This enabled an algorithm to iterate through this file until the match in question was found, then it carried on searching the file, extracting home/away results (1 for win, 0.5 for draw, 0 for loss) involving that team until it reached the limit specified by m and n respectively. Distance between teams - I represented the geographic distances in miles between two teams at any point between This may not seem too difficult within some sports as when teams move stadium, they do not move more than 20 or so miles away from their current location. However, as American Football teams are franchised, it has been known for teams to move over 500 miles to a new location [15]. Thus for each match-up since 1970, data was needed to represent the distance between the two teams at that point in time. Firstly, I recorded which cities/towns each team had played in between This involved using a web page which listed all the NFL teams (past and present) and where each of the franchises had been based over the years [1]. This was used to get a list of cities/towns which could have be involved within a match between 1970 and The distances between these places were now needed. One distance matrix collated by an NFL stadium website provided a lot of distances between NFL-hosting cities [4]. The rest were obtained from an on line tool which calculated distances between two cities [2]. These two sources enabled me to manually enter the distances between all cities within a text file. To cut down on time, if it was found that two disparate cities were less than 25 miles apart then they would be clustered into one city (for example Miami Gardens and Miami are both referred to as Miami). As some travelling distances reached well over 500 miles, anything less than 25 miles was viewed as nominal. A further note about this process is that some teams have been known to play their home games at another stadium (sometimes in a different city) for one match per season. Accounting for these small number of matches would have been more costly 33
41 in time than the value extracted from carrying it out thus they were ignored. Once the text file with all the distances had been created, I created a Java class called Stadium- Calculator. This took three parameters, the two teams playing each other and the year in which it was played. Then a method within the class would firstly discover the two cities in which the two teams resided during that year. Subsequently, the method would parse through the distance text file and find the value associated with the two cities 6. Stadium capacities - Again, this feature suffered from the same difficulty with franchises moving from city to city and from stadium to stadium. Fortunately, the web page which was used within the distance feature extraction also contained stadia information for each franchise (including capacities) [1]. To try to ensure that this data was accurate, I checked 5 current capacities from the respective NFL team s official website. As I could not check all the figures, I felt the 5 confirmations were sufficient. Regarding the issue of teams playing at random stadiums at irregular points in time, the collection of this data was also carried out manually to ensure correctness. To try to eliminate any further processing time, I decided to place the capacities into multiple if statements rather than have the Java program read through another text file. This involved going through the web page and for each team, assessing what stadium capacity they had at a certain point in time. This information was then represented in a method within the StadiumCalculator class which when given a team and a certain would search through the if statements until the capacity was found. Last year s meeting - This took advantage of the reversed results file in extracting not only the corresponding fixture last year but the most recent (if there was more than one). Therefore within the Match class, this method searched through the reverse results text file (starting at the year previous to the match year) and found the first occurrence of the two teams playing each other (where the home and away teams are the same). If the home team won, 1 was returned, 0 for an away win and 0.5 for the tie. The result - Finally the vector was ended with the outcome of the match. If the home team won the match then 1 was appended to the vector whereas if the away team were victorious, 0 was used 7. 6 If the two teams played in the same city then the method would simply return zero 7 As previously mentioned in Section 4.2, ties were not counted within the training or testing 34
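To make the vector format concrete, the sketch below serialises one match as a line with the year prefix, comma-separated feature values and WEKA's '?' marker for a missing value. The method name, the bracket mark-up and the example values are illustrative assumptions; only the general layout (year first, features comma-separated, result last) follows the description above.

    import java.util.ArrayList;
    import java.util.List;

    public class VectorWriterSketch {
        // Builds one line such as:  1999,[0.563,0.438,?,1]
        static String toVectorLine(int year, List<Double> features, int result) {
            StringBuilder sb = new StringBuilder();
            sb.append(year).append(",[");
            for (Double value : features) {
                // WEKA's '?' symbol marks a missing value, e.g. no corresponding fixture last year
                sb.append(value == null ? "?" : value.toString()).append(",");
            }
            sb.append(result).append("]");   // 1 = home win, 0 = away win
            return sb.toString();
        }

        public static void main(String[] args) {
            List<Double> features = new ArrayList<Double>();
            features.add(0.563);   // e.g. home team's win ratio from the previous season (illustrative)
            features.add(0.438);   // e.g. away team's win ratio from the previous season (illustrative)
            features.add(null);    // e.g. the two teams did not meet in the previous year
            System.out.println(toVectorLine(1999, features, 1));
        }
    }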
42 Training and Testing Set Creation Once each match had been represented as a feature vector, the vectors needed to be placed into training and testing sets. Thus, an optimal number of training years was needed. As the data spanned between 1970 and 2006, I set an upper limit of 20 training years because some years would need to be held back for testing. Therefore, I assessed different training years (every even number between 2 and 20) where each set was tested on the same years. Testing started with matches within the year 1999 using the previous 2 years training data ( ), using the previous 4 years ( ) and so on upto 20 years ( ). This same process was then carried out for test years upto The reason for starting testing at 1999 was that the training data in the early 1970s had some incomplete data (e.g. previous win ratios). Furthermore, I had already thought about introducing Sagarin s ratings within a future prototype (which holds data from the end of 1998 onwards). Having said this, 8 years of testing was plentiful, especially when compared to some of the small test periods found within my background reading. An example of how 20 years worth of training data was tested can be seen in Table 4.5. This set creation involved me writing a program which took the text file containing all the vectors and using the year at the start of each vector, placed the relevant vectors into appropriate training and testing text files. This process also involved the trimming of all the previous win ratios to just the two previous to the year of training/test file currently being constructed. This involved using the focused year as an index within the vector and taking the two previous indices WEKA Vector Convector To be able to be processed by WEKA s software, each training/test file had to be represented in the format required for an ARFF file. Therefore within the Training and Testing Set Creation program, each vector was modified and placed into the correct format. Firstly, this involved removing the mark-up around the vector (i.e. the year and [ ] ). Another proviso with the ARFF file format is that each feature needs to be explicitly declared at the start of the file. When this process was completed, the program outputted 8 training ARFF files for each training set (2,4..20) and 8 test ARFF files Data Analysis Using WEKA As mentioned previously, WEKA offers a Java library which enables the use of its classes and functions. Therefore, it would be more efficient to run the training and testing through a Java program rather than through the conventional Explorer GUI. Thus, I created a Java class named WekaWrapper where an 35
43 Table 4.5: During Optimisation of Prototype 3, Example of How The Model Was Tested When Trained with 20 Years of Match Data Iteration Training Years Test Year instance of the class could be created for each prototype. A method then took each of the 8 training files and built 8 separate logistical classifiers (using the Logistic class provided). These classifiers were then evaluated on their corresponding test files. This process was carried out for each set of differing training years (2, 4, 6, etc.). These results can be seen within Table 4.6. Table 4.6: Accuracy of Different Training Years in Prototype 3 No. of Training Years Accuracy(%) Evaluation As we can see from Table 4.6, Prototype 3 when trained on 20 years of data, achieved the highest accuracy of 61.8%. This eased an initial worry with using a large amount of training data that maybe trends within the NFL had changed over the years (e.g. maybe home advantage had become less/more 36
44 significant). Although using 4 or 6 years worth of training data attained similar success rates, I decided to use 20 years worth of match data henceforth. This was due to the fact that I had access to enough data to sufficiently test matches using 20 years of training data. The accuracy of the prototype (trained on 20 years) can now be compared to the accuracies found within the first two prototypes. The first two prototypes were originally tested on all data, but to accurately compare these with Prototype 3, they were re-assessed only using matches from 1999 onwards. This lead to revised accuracies of 57% and 57.2% for Prototype 1 and 2 respectively. This suggests that the Goddard and Asimakopoulos model can be used help predict the outcomes of American Football games with a higher accuracy than simply picking the home team or using last year s result. Prototype 3 s difference in accuracies with the first two prototypes were seen to be highly significant, both at the level (far past the designated 0.05 threshold) Prototype Summary Referring back to the prototype approached recommended by Hughes and Cotterell [33]: I hoped to learn whether a complex logistic model could out perform simple prediction baselines. This was to be evaluated by the accuracy tested on games between (using the optimum number of training years) and comparing with Prototypes 1 and 2. I showed that a logistical model of novel numerical features achieved significantly superior predictions than relying on the home team or the previous result between the competing teams. 4.5 Prototype 4 (Inclusion of Ranking Features) After assessing the first 3 prototypes, it was clear that the model within Prototype 3 held high predictive qualities. Therefore, the decision was taken to build upon Prototype 3 to see if additional features could improve the accuracy of the model. As mentioned within the Background Reading Chapter, power scores/ratings/ranking can be used to enable accurate predictions of NFL games. This was mainly shown within [18] which found that they competed well with NFL betting spreads. Furthermore, Harville claimed that the basis for successful sporting forecasting is through a type of rating system [31]. With this in mind, I decided to search for NFL rankings Jeff Sagarin s Power Ratings As previously mentioned in Section 3.1.4, initial research found power ratings calculated by Jeff Sagarin [16]. Sagarin is a well-respected statistician who also creates similar stats for other North American 37
45 sports such as ice hockey, basketball, etc. He currently works for USA Today providing these power rankings every week and has done so for some time [18]. His ratings are held in such regard that they are used in helping to decide the final league positions within American college football [18]. I was not able to find how these ratings are calculated but they are widely used and therefore must have an indication on the strengths/weaknesses of a team, thus could be used to achieve a reasonable forecast of a match outcome. A section of the website containing ratings for the 1999 NFL season can be seen in Figure Football Outsiders Although Sagarin s rating had promise in terms of their forecasting ability, I decided to get a second set to see if these helped, or even held better prediction qualities than Sagarin s. Hence, I came across Football Outsiders which is a website based on analysing American Football through numerical statistics [5]. They recently entered into a partnership with the huge American sports broadcast network ESPN [6] which signifies that they (like Sagarin s ratings) are highly thought of. Unlike Sagarin, Football Outsiders give a fairly detailed description of how they attain their figures. Their ratings centre around a figure called DVOA (Defense-Adjusted Value Over Average). This breaks down each play within the NFL season to see how much success each player achieves compared to the league average. They claim their statistics are better than official NFL statistics 8 as they take into account the importance of each play unlike the official records [7]. Imagine that a team are at third-down and 4 yards away from making the next down. If the quarterback makes that 4 yard pass it has much more significance than if it was a 4 yard pass on first down. Football Outsiders make note of this importance whereas the official NFL statistics treat both as simply a 4 yard pass. This DVOA is thought of as a team efficiency statistic, hence it can be used in a similar vein to power scores/ratings Design Both ratings from Sagarin and Football Outsiders (now referred to as FO) needed to be incorporated within each match vector for the home and away team. The first step was to acquire the ratings from the respective websites. As previously mentioned, Sagarin s stats are archived from 1998 onwards whereas FO held data reaching back to Now, in the case of both of these ratings, only the final statistics of 8 The NFL league keeps records of statistics throughout the season such as how many yards each quarterback has completed with his passing 9 To find out in more detail how DVOA is calculated, see 38
46 each year were available (i.e. the ratings after the end of the final week of each season) whereas these statistics are supposed to be used (and updated) on a weekly basis. Thus, when placing these ratings in a match vector played in 2003 for example, I used the final rating of the team within year 2002 to help predict the outcome. I felt this should not be a problem because if a team finishes stronger in one season then that momentum can be carried over to the next season. Furthermore, in soccer (specifically the English Premier League) the same teams generally are found to be successful. This is shown by the same teams usually finishing in the top 8 every year. Thus, I based this prototype on the theory if an NFL team was successful in one year it will more than likely be successful in the year following. Figure 4.3: Jeff Sagarin Ratings for the 1999 NFL season Implementation I created two web spiders for each website which would automatically obtain the ratings for each team within a certain season. Similar to the data collection spider (found in Section 4.1), these programs were specific to the HTML found within each website. Both of these programs parsed this HTML to get the relevant ratings from the different years, where each year was located on a different web page within both two sites. An example of the output from the Sagarin spider can be seen in Figure 4.4. These ratings were stored in two separate text files meaning they could simply be searched through to find the rating for a team within a certain year. This functionality was added in the form of two new methods within the Match class (one for each rating). So for each of the competing teams within a match, the program simply called the two methods to extract the Sagarin and FO rating for the year 39
47 Figure 4.4: Text Output From The Sagarin Rating Collector Program. Displaying The Year, Ranking, Team Name and Sagarin Rating previous. These figures were then added to the match vector along with Prototype s 3 features. As I had data from 1998 onwards for Sagarin s stats and 1995 onwards for FO, I was forced to use a smaller test space than the one used in Prototype 3. This meant that the prototype (still being trained on the previous 20 years of data) was tested on matches between This gave the rating features enough time to be assigned appropriate weightings during logistic training. Table 4.7: Prototypes Tested On 924 Matches between Prototype Accuracy(%) 3 (Goddard) (Goddard with Rankings) (Just Rankings) (Just Sagarin) (Just Football Outsiders) Evaluation The non-specific nature of the WekaWrapper class I created, allowed me to simply plug-in in the ARRF files outputted from Prototype 4 s training and test set creator to assess its accuracy. However, as 40
48 differing test years were used here, I had to re-evaluate Prototype 3 using the last four years of data in order to accurately compare. The results from these tests can be seen from Table 4.7. This shows that the ratings that were introduced into the Goddard and Asimakopoulos model had a detrimental effect on the prototype s forecasting ability, however these accuracies were not found to be significantly different. The first thing was to assess whether this lack of improvement was due to Prototype 4 being trained on a large amount of data that held no rating information. For example, a training set encompassing would only have FO training data within match vectors between and Sagarin s between Match vectors between would have missing (mean) values where the ratings are stored. This could have an adverse effect on the training of the prototype. Consequently, I tested Prototypes 3 and 4 on the same years as above but using the previous 4 rather than 20 years of data as training. The choice of 4 years was based on the fact this was found to be one of the more accurate year-sets during the optimal testing of Prototype 3 (Table 4.6). This would eliminate the vast amounts of mean values used within the training of Prototype 4. However, this lead to both prototypes becoming less accurate than when trained on 20 years and Prototype 3 still being a better forecaster than Prototype 4. Further investigation was required, so I ran a model with just the ratings on their own (without the features from the Goddard model). Thus using 20 years of training data, this was tested on the same 4 years as above ( ). As shown in Table 4.7, this prototype (4.1) achieved an accuracy of 58.2% and was found to be significantly different to both Prototype 3 (at the 0.2 level) and Prototype 4 (at the 0.10 level). Furthermore, although this ratings model was found to be more accurate than Prototype 1 and 2 on these test years, it was not significant to either of them. This suggests the ratings that were added to Prototype 3 are a poor indicator as to which team will be victorious over another. I decided to carry out additional analysis to see if one set of ratings outperformed the other with a view to removing one set. Two more temporary prototypes, 4.2 and 4.3 were then created using just Sagarin s ratings and just FO stats respectively. I established that Prototype 4.2 achieved an accuracy of 58.1% and Prototype 4.3 attained 58.0% (see Table 4.7). Although Prototypes 4.1 and 4.2 were not found to be statistically different from each other, we can probably hypothesise that FO team efficiency ratings are no better than Sagarin s ratings as FO have 3 more years worth of data within the training of the models and still achieves a similar accuracy. However, as we can see from this, there is minimal difference between using the ratings individually and using them together. 41
49 I came to the conclusion that these rankings (as mentioned before) are supposed to be used on a weekly basis to predict the winner of a match within the week. The ratings are updated after said match and then used to predict the next game and so on. I only had data that represented the team s strength at the end of each season and clearly this is not a good indicator of how well a team will perform in the following season. A reason for this was highlighted by Koning s analysis of competition within sports. He claims that the NFL draft system is very important in keeping a competitive edge between teams within American Football [38]. The draft is the process of teams picking the best up-and-coming college footballers in the off-season. This involves the worst teams from the previous season getting first pick and the best teams picking last [11]. This will affect the way Prototype 4 is carried out as a team that is successful in one season has no guarantee of success in the next season due to this draft system. This along with other factors such as teams getting a new coach, teams being unable to maintain last season s form, etc. This can be shown by comparing the number of disparate winners of NFL s Super Bowl with that of the English Premier League (EPL). The number of differing champions since 1995 in American Football is 11 [13] whereas there have only been 3 within the EPL during that time Prototype Summary Referring back to the prototype approached recommended by Hughes and Cotterell [33]: I hoped to learn whether adding power ratings to Prototype 3 could improve its accuracy. This was to be evaluated by the accuracy tested on games between and comparing to Prototype 3 s accuracy within these years. I showed that the power ratings did not improve Prototype 3 as they must be used on a weekly basis to have predictive qualities. 4.6 Prototype 5 I concluded that as rating data was not available for each week within the test years, the next prototype should not incorporate the rating features. Therefore my next task was to find other features that would improve the accuracy of the model that was found within Prototype 3. A lot of the research that I covered within Section made use of the score differences involved in recent games rather than just using the results [22, 41, 31, 44]. Goddard and Asimakopoulos surmised that only using the results made the model simpler and made it more suited to soccer as most victories lie 42
between 1 or 2 goal deficits anyway [27]. However, NFL games can have a score difference of anything between 0 and 40+. Subsequently, I decided to replace the recent results (win/lose/draw) from Prototype 3 with the actual difference in score within those matches. The magnitude of victory/loss for a team is more informative than simply whether they won or lost. Imagine a scenario where two teams have each won their past 5 games, where team A won each game 30-0 and team B won their 5 matches 5-0. Although the current model would see no difference between these two sets of recent games, a sensible prediction would be to choose team A to win. This is because, although both are in good form, team A looks to have won their matches with greater ease (suggesting they are the stronger team).

Implementation

This meant I needed to add another method to the Match class. This method was similar to the one which calculated the recent results, except this time it was required to store the magnitude of victory/loss for that team (where 0 indicated a draw). Also, as stated, the ratings were removed from the match vectors, which was done whilst sorting said vectors into training and test sets.

Evaluation

The lack of ratings enabled me to expand the test data back to 8 years (1999-2006). Therefore, the structure by which this prototype was trained and tested is the same as found in Table 4.5. During the testing of this prototype, an accuracy of 62.5% was averaged over the 8 test years. This improved on the accuracy of Prototype 3; however, the difference was not found to be significant. In a final attempt to improve the model, I placed a restriction on the magnitude of victory within the recent game features. This is in reference to work found within Chapter 3 by various authors [30, 31, 41]. By not allowing teams to run up the score, these authors state that more accurate data is extracted from previous games, as points/goals scored past a certain point are not informative. After looking through the various score differences within Prototype 5's match vectors, I looked for a sensible threshold which would bound around 10% of the score differences. This led me to converting all recent game score differences which were over 25 to 25. After re-testing, unfortunately I failed to find a significant difference in performance between this restricted version and Prototype 5. Thus, although the restriction on score differences has been considered to be more informative, I was unable to prove this and did not consider it henceforth. In conclusion, although Prototype 5 was not found to be significantly better than Prototype 3, I feel recording the differences in scores over the actual result is more suited to American
51 Football Prototype Summary Referring back to the prototype approached recommended by Hughes and Cotterell [33]: Discover whether score difference in recent matches is more informative than just the result. This was to be evaluated by testing on games between , comparing with Prototype 3. I could not prove this to be the case but decided this was a more suitable model when dealing with NFL predictions. 4.7 Evaluation Against Betting Market I have already mentioned within my background research that a lot of authors compared their models to that of the bookmakers predictions. As shown above, differing sets of training and test data can obtain differing accuracies, e.g. some test years achieve better forecasts than others. Therefore, testing against predictions made by bookmakers will give an idea of how accurate Prototype 5 10 actually is. Within the data obtained from Prof. Gray (see Section 2.5.2), matches had a spread attached which held the prediction associated with that match. As the prototype needed 20 years to train and the betting data only went up to the end of the 1994 season, only games between 1990 and 1994 could be tested. Thus, these 5 years were tested using the previous 20 years as training data. Table 4.8 shows that the betting line again is the superior predictor when it comes to sporting predictions. Having said that, the results here are not statistically significant to the pre-determined 0.05 level 11. In other words, even though the betting line was a slightly better forecaster, it cannot be rejected that this higher performance occurred through chance. Table 4.8: Prototype 5 & Betting Line Tested On 995 Matches between Forecaster Accuracy(%) Prototype Betting Line Without the score difference restriction 11 It was only significant at the 0.10 level 44
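As a concrete illustration of the kind of significance comparison used above, the sketch below applies the McNemar test (introduced in Section 2.5.1) to the per-match correctness of two forecasters over the same games. The continuity-corrected form of the statistic and the toy input arrays are assumptions of this sketch, not the report's exact calculation.

    public class McNemarSketch {
        // correctA[i] / correctB[i]: whether forecaster A / B predicted match i correctly
        static double mcnemarStatistic(boolean[] correctA, boolean[] correctB) {
            int onlyA = 0;   // matches only forecaster A got right
            int onlyB = 0;   // matches only forecaster B got right
            for (int i = 0; i < correctA.length; i++) {
                if (correctA[i] && !correctB[i]) onlyA++;
                if (!correctA[i] && correctB[i]) onlyB++;
            }
            if (onlyA + onlyB == 0) return 0.0;
            double diff = Math.abs(onlyA - onlyB) - 1;        // continuity correction
            return (diff * diff) / (onlyA + onlyB);           // chi-squared, 1 degree of freedom
        }

        public static void main(String[] args) {
            boolean[] a = {true, true, false, true, false, true};
            boolean[] b = {true, false, false, true, true, false};
            double statistic = mcnemarStatistic(a, b);
            // 3.841 is the chi-squared critical value for 1 degree of freedom at the 0.05 level
            System.out.println("Statistic: " + statistic + ", significant at 0.05: " + (statistic > 3.841));
        }
    }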
52 Chapter 5 Evaluation 5.1 Quantitative Evaluation Overall Prototype Evaluation Features which were proposed to predict the winner of a soccer game [27] were the cornerstone of this project s work in predicting the winner of American Football games. These features when placed into a logistic regression model gave substantially more accurate predictions than simply choosing the home team or relying on previous results between teams. Subsequent prototypes were extensions and modifications of this model. However, the analysis of these prototypes showed that the main predictive qualities were from the features suggested within [27]. The final prototype, which included the score difference of a team s recent games attained an accuracy of 65.2% during matches played between 1990 and This was beaten by the bookmaker s predictions, although these results were not found to be significant. This prototype accuracy was close to that achieved by Stefani s least square model which achieved 68.4% but this was bettered by the bookmaker s prediction accuracy of 71% on the same matches. When using a much more complex linear model to forecast NFL games, Harville produced a superior performance than Prototype 5 by correctly forecasting 70.3% of the NFL matches within his test set [31]. However, this again was beaten when compared to the 72.1% achieved by the betting line. In conclusion, although the accuracies of two alternate numerical models were found to be higher than this project s prediction success rate, they were beaten by their respective betting lines. However, this project s model was found to be no worse when compared to the predictions of the bookmakers. My model can also be seen to be more accurate than American Football expert s predictions. Within [43], the authors recorded the average expert prediction for NFL winners as being 62.2%. Furthermore, Bouiler and Stekler found the New York Times Editor to have an even poorer accuracy of 59.7% [18]. The fact that my numerical system does not hold idiosyncratic opinions like experts do, will aid the model in making more accurate unbiased forecasts. 45
53 5.1.2 Usefulness of Features Feature Ablation I investigated the usefulness of each feature within the final prototype. This highlighted areas within the model that could extend this report s work. This investigation was done through feature ablation, which can be executed in two different ways. If we assume the number of features is f, then the first approach is to test the prototype f times, each time taking out a single feature. This is done for each feature and will determine how the model fares without that attribute. The alternate method within ablation studies is to run the prototype f times but this time using only one feature (again done for each feature). I carried out both approaches, however the latter technique produced few conclusions due to each feature being uninformative on their own. I will only discuss results using the first feature ablation approach. I carried out this process for Prototype 5 1 on matches between 1999 and 2006 using the 8 iterations of training seen in Table 4.5. For each iteration, this entailed training and testing the model f times, each time removing a different feature and obtaining f average accuracies for each version of the model. These accuracies were then compared with Prototype 5 s accurate of 62.5% found in Section A damaging feature is one which when taken out leaves the model more accurate than the original. One strategy is to find all the damaging features and remove them from the model to re-assess its accuracy. Although no features were found to be damaging, results from this process can be viewed in Table D.1 2. Here we see the lowest accuracies such as the score difference in the home team s 5 th recent home game, when removed from the model achieving % accuracy. These variables are seen to be the most valuable to the overall model. Whereas the features displayed at the bottom of Table D.1 represent features which are closer to the accuracy of the original model suggesting they are the more redundant attributes (e.g. the score difference in the home team s 3 rd recent home game). However, it should be noted that only 3 values were found to be statistically significant from that of Prototype 5. Here we can confirm one theory from within the project being that last year s result between the teams is a useful feature in predicting an NFL match (Section 4.3). This is shown by the accuracy decreasing (by admittedly a small amount) and this decrease being statistically significant. These results also suggest that the amount of recent home game data could be bound in future prototypes. This is due to the fact that the 6 th recent home game for the away team and the 7 th recent 1 Without the score difference restriction 2 Feature key seen in Table D.3 46
54 home game for the home team are some of the more damaging features. This suggests that if the number was bound to maybe 4 or 5 games instead of Goddard s original proposal of 9, this may improve the accuracy of the model. This could be implemented into a future prototype, however if this was to be done, further study would be needed to find the optimal number of games to use. Overall though, the differences in accuracies are fairly minimal indicating that maybe the model does not contain any outstanding or redundant features, thus I investigated further Ranking Of Features Weka allows another way of assessing how effective each model s feature is, the Attribute Selection option. This was carried out using the Ranker search class in conjunction with the InfoGainAttributeEval evaluator. This evaluates the worth of a feature by the information that is gained with regard to the result of the match, these features are then ranked in order. The features within the training data ( ) for Prototype 5 were ranked by this process and can be seen in Table D.2. Here, we see the win ratios of the home and away team being an important feature within the model with the previous year ratio (e.g. homewinratio 1) being more important than the ratio for the year before that (homewinratio 2). Furthermore, it can be seen that the recent home games are more informative when they are nearer to the current match (homerecentawayscdif1). It is when we move down the table, we see the n th recent games increase (homerecenthomescdif9). This backs up suggestions made in the previous subsection that the recent home game features could be bound to a lower number. This method slightly contradicts what was found within the previous subsection as last year s result is shown to not be a very valuable feature. Although the informative quantity was found to be statistically significant, it was discovered to be one of the least valuable features in the analysis. Thus, the variable s value to the model is somewhat inconclusive. Overall, I have shown that the win ratios are very important when deciding the outcome of a match, with the home ratio being a larger factor than the away team s equivalent. Maybe the number of previous win ratios could be extended in conjunction with a bounding of recent home games within a future prototype. The features regarding the stadia capacities and the travelling distance were suggested to be worthless within the feature rankings. Further analysis would be required to confirm this 3. 3 I also analysed the weight coefficients obtained by the logistic training, however the results were peculiar and thus inconclusive 47
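For reference, the information-gain ranking described above can be reproduced through WEKA's attribute selection API. The sketch below is illustrative only; the training file name is an assumption rather than the project's actual file.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;

    public class FeatureRankingSketch {
        public static void main(String[] args) throws Exception {
            Instances train = new Instances(new BufferedReader(new FileReader("prototype5_train.arff")));
            train.setClassIndex(train.numAttributes() - 1);     // the match outcome is the class

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval()); // worth = information gained about the result
            selector.setSearch(new Ranker());                   // order the features by that worth
            selector.SelectAttributes(train);

            // Each row holds {attribute index, information gain}, highest gain first
            double[][] ranked = selector.rankedAttributes();
            for (double[] row : ranked) {
                System.out.printf("%-35s %.4f%n", train.attribute((int) row[0]).name(), row[1]);
            }
        }
    }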
5.2 Qualitative Evaluation

5.2.1 Project Evaluation

Originally, the project was to utilise Natural Language Processing to extract expert predictions with a view to helping the forecasts produced by the numerical model. The required data was not found and thus this aspect of the project had to be abandoned. I could have created a web crawler which traverses the Internet in order to find this data. However, this would have taken time to develop and weeks, maybe even months, to execute and find the relevant text. Having said this, as discussed in Section , these expert opinions may not have added any predictive quality to the project's current model.

With relation to the soccer model proposed by Goddard and Asimakopoulos, an option could have been to compare the findings in [27] with those of my model. This would have analysed the differences in predictability between soccer and American Football. However, the main priority of this project was to analyse the predictive quality of the model in the domain of American Football, and this was done through the comparison with the NFL betting market. On the other hand, if the project were extended, this comparison could be carried out to compare the two sports (although it should be noted that the model in [27] used a different evaluation system than the one seen in this project).

5.2.2 Objective and Minimum Requirements Evaluation

Referring back to the original objectives found within Section 1.3, we see that I have carried out each of these with a certain degree of success. I have shown an understanding of which information can be used to predict the outcome of an American Football match; this was seen within the features that were chosen to create Prototype 3. I looked into how different techniques have been used to model a match within my Background Reading chapter and decided to use a regression model to utilise Prototype 3's features (thus creating a model used to predict a match). Finally, by comparing this model to the bookmaker's predictions, I have analysed how successful the chosen approach was.

This report shows how I fulfilled all of my minimum requirements (Section 1.4). I developed and implemented an existing sports prediction algorithm [27] within Prototype 3, which was then enhanced by the two following prototypes. Furthermore, Section 5.1.2 details feature ablation studies highlighting the most useful features within the model. Lastly, I gave prediction accuracies for all prototypes throughout the project, showing critical analysis of each algorithm.
5.2.3 Project Extensions

Outside of the minimum requirements detailed within Section 1.4, I also created baseline algorithms (Section 4.2 & Section 4.3) against which the more complex prototypes could be compared. Moreover, I compared predictions made by the betting market against my most accurate prototype in order to assess it definitively.

5.2.4 Schedule Evaluation

Clearly, the alteration to the project affected its performance. The need to revise the original schedule to such a degree meant I had less time to develop the numerical model. During the schedule revision, I underestimated the amount of reading needed for the numerical analysis. The "Background reading - numerical algorithms" task was carried out up to around the end of February. This left only 2 or 3 weeks to build further prototypes on top of the first three, around 2 weeks less than scheduled for. Although this hindered me, the key point in the original/revised schedule was starting the design and implementation of Prototype 3 during the initial stages of background reading. If this had been started any later, it would have further decreased the amount of work carried out on further prototypes.

5.2.5 Methodology Evaluation

The prototype approach was vital in allowing me to add and remove different features from within the Goddard & Asimakopoulos base. This can be seen specifically within the testing of the ranking features (Section 4.5.3). Here, I needed to quickly test whether the ranking attributes were useful on their own. Prototyping allowed me to do this without affecting the project's structure.

Technologies such as Java and WEKA allowed me to implement this project efficiently, without major setbacks. The HashMap class within Java's library enabled me to create the results matrix (Section 4.3.2), which aided in result exploration. Furthermore, the Object-Oriented architecture helped me to represent each match and to use the methods stored within the Match class to create the vectors. Although Python may have been more efficient in the parsing of text files and web-spidering, the standard Java classes allowed me to carry out these processes with just as much ease and flexibility.

The choice of WEKA was justified throughout the project. The Java compatibility enabled me to quickly build classifiers using the training data and test them. This structure allowed me to efficiently plug in the corresponding ARFF files to assess each prototype. Furthermore, the Remove class within WEKA was used to quickly remove the relevant features on the fly within an ARFF file rather than re-creating the file itself. This was a great help during the feature ablation studies.
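As an illustration of the kind of ablation loop described above, the following is a minimal sketch using WEKA's Remove filter, Logistic classifier and Evaluation class. The file names and class name are placeholders and this is not the project's actual code; it simply shows one train/test iteration in which each feature is dropped in turn.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FeatureAblation {

    // Load an ARFF file and mark the last attribute (the match result) as the class.
    static Instances load(String path) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader(path)));
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names for one training/testing iteration.
        Instances train = load("prototype5_train.arff");
        Instances test = load("prototype5_test.arff");

        // Drop one feature at a time (the class attribute is never removed)
        // and record the accuracy of the logistic model without it.
        for (int i = 0; i < train.numAttributes() - 1; i++) {
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(i + 1)); // Remove uses 1-based indices
            remove.setInputFormat(train);

            Instances ablatedTrain = Filter.useFilter(train, remove);
            Instances ablatedTest = Filter.useFilter(test, remove);

            Logistic model = new Logistic();
            model.buildClassifier(ablatedTrain);

            Evaluation eval = new Evaluation(ablatedTrain);
            eval.evaluateModel(model, ablatedTest);

            System.out.printf("without %-30s accuracy = %.2f%%%n",
                    train.attribute(i).name(), eval.pctCorrect());
        }
    }
}

Because the filter rewrites the feature set in memory, no extra ARFF files need to be generated for each ablated version of the model.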
Chapter 6

Conclusion

6.1 Conclusion

The problem set by this project was to see whether numerical data could be used to accurately predict the outcome of matches within American Football. This was approached using a logistic regression model incorporating information based on previous results, e.g. the recent form of the competing teams. The model also included novel features such as the distance between the two teams and the two teams' stadium capacities. This model was seen to hold more predictive quality than simply choosing the home team or relying on previous results between the two teams. The model was modified to include the score difference within both teams' recent games and, although this was not found to significantly improve the accuracy of the original model, it produced a more informative system for predicting an NFL game. Ultimately, this modified regression model was seen to compete with predictions made by the betting line, where the system's forecasts were found to be no worse than the bookmaker's.

6.2 Further Work

- By assessing the probability produced by the logistic model for each match, one could analyse whether these probabilities could be used to improve the forecasts, or whether these probabilities match the size of the betting line spread (i.e. large probabilities map to large betting spreads); a sketch of how these probabilities can be obtained follows this list.
- A different regression technique could be used to evaluate the project's model. This could either be an attempt to improve the match winner predictions using another binary-output technique (e.g. Maximum Entropy), or a model which produces a continuous result representing the predicted spread of a match, which could then be compared with the actual margin of victory. One such approach is Ordinary Least Squares, as used by Golec and Tamarkin [28].
- If a web crawler program were created and professional expert opinions could be found, then this subjective data could be utilised in order to obtain a superior prediction for an NFL match.
- The numerical model could be applied to another professional sport, e.g. ice hockey. This would be justified by work carried out by Stefani [44], in which he used his model on a number of differing sports and attained consistent accuracies.
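As a starting point for the first suggestion above, the per-match probabilities are already exposed by WEKA's Logistic classifier. The following is a minimal sketch (file names and class name are placeholders, not code from the project) assuming the match result is the final attribute of each ARFF file.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.functions.Logistic;
import weka.core.Instances;

public class MatchProbabilities {
    public static void main(String[] args) throws Exception {
        // Placeholder ARFF files; the class attribute (home win / away win) is assumed to be last.
        Instances train = new Instances(new BufferedReader(new FileReader("prototype5_train.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("prototype5_test.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Logistic model = new Logistic();
        model.buildClassifier(train);

        // distributionForInstance() returns the probability of each class value;
        // these could then be compared against the size of the betting line spread.
        for (int i = 0; i < test.numInstances(); i++) {
            double[] dist = model.distributionForInstance(test.instance(i));
            System.out.printf("match %d: P(%s) = %.3f%n",
                    i, test.classAttribute().value(0), dist[0]);
        }
    }
}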
Bibliography

[1] Chronology of home stadiums for current national football league teams.
[2] City distance tool.
[3] Digital tv from sky.
[4] Directions.pdf.
[5] Football outsiders: Football analysis and nfl stats for the moneyball era.
[6] Football outsiders: Football analysis and nfl stats for the moneyball era.
[7] Football outsiders: Football analysis and nfl stats for the moneyball era.
[8] Ladbrokes profits jump on punter's losing streak.
[9] Learn about java technology.
[10] Nfl's beginner's guide to football.
[11] Nfluk.com - about the game - rookie faqs.
[12] Python programming language official website.
[13] Super bowl 43 super bowl history.
[14] Super bowl xlii tackles record 97.5 million viewers.
[15] This day in history 1984: Baltimore colts move to indianapolis.
[16] Usatoday.com.
[17] T. Baker. Building a system to recognise predictive opinion in online forum posts. Final Year Project, University of Leeds.
[18] B.L. Boulier and H.O. Stekler. Predicting the outcomes of national football league games. International Journal of Forecasting, 19(2).
[19] S. Le Cessie and J.C. Van Houwelingen. Ridge estimators in logistic regression. Applied Statistics.
[20] S. Chatterjee and A.S. Hadi. Regression analysis by example. Wiley-Interscience.
[21] T.G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10.
[22] M.J. Dixon and S.G. Coles. Modelling association football scores and inefficiencies in the football betting market. Applied Statistics, 46(2).
[23] M.S. Duh, A.M. Walker, M. Pagano, and K. Kronlund. Prediction and cross-validation of neural networks versus logistic regression: using hepatic disorders as an example. American Journal of Epidemiology, 147(4).
[24] A.J. Dwyer. Matchmaking and mcnemar in the comparison of diagnostic modalities. Radiology, 178(2):328.
[25] D. Forrest, J. Goddard, and R. Simmons. Odds-setters as forecasters: The case of english football. International Journal of Forecasting, 21(3).
[26] D. Forrest and R. Simmons. Forecasting sport: the behaviour and performance of football tipsters. International Journal of Forecasting, 16(3).
[27] J. Goddard and I. Asimakopoulos. Forecasting football results and the efficiency of fixed-odds betting. Journal of Forecasting, 23(1):51-66.
[28] J. Golec and M. Tamarkin. The degree of inefficiency in the football betting market: Statistical tests. Journal of Financial Economics, 30(2), December.
[29] P.K. Gray and S.F. Gray. Testing market efficiency: Evidence from the nfl sports betting market. Journal of Finance, 52(4), September.
[30] D. Harville. The use of linear-model methodology to rate high school or college football teams. Journal of the American Statistical Association, 72(358).
[31] D. Harville. Predictions for national football league games via linear-model methodology. Journal of the American Statistical Association, 75(371).
[32] F. Hu and J.V. Zidek. Forecasting nba basketball playoff outcomes using the weighted likelihood. Lecture Notes-Monograph Series, 45.
[33] R. Hughes and M. Cotterell. Software project management. McGraw-Hill Higher Education.
[34] D. Jurafsky, J.H. Martin, A. Kehler, K. Vander Linden, and N. Ward. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. MIT Press.
[35] G.K. Kanji. 100 Statistical Tests. Sage Publications, London; Newbury Park, Calif.
[36] S.M. Kim and E. Hovy. Crystal: Analyzing predictive opinions on the web. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
[37] D.G. Klienbaum and M. Klein. Logistic Regression: A Self-Learning Text. Springer.
[38] R.H. Koning. Balance in competition in dutch soccer. The Statistician, 49(3).
[39] A. McKinlay. A system for predicting sports results from natural language. Final Year Project, University of Leeds.
[40] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86.
[41] H. Rue and O. Salvesen. Prediction and retrospective analysis of soccer matches in a league. The Statistician, 49(3).
[42] J.P. Shaver. What statistical significance testing is, and what it is not. Journal of Experimental Education, 61.
[43] C.U. Song, B.L. Boulier, and H.O. Stekler. The comparative accuracy of judgmental and model forecasts of american football games. International Journal of Forecasting, 23(3).
[44] R.T. Stefani. Improved least squares football, basketball, and soccer predictions. IEEE Transactions on Systems, Man and Cybernetics, 10(2), February.
[45] P. Turney. Thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
[46] R.C. Vergin and J.J. Sosik. No place like home: an examination of the home field advantage in gambling strategies in nfl football. Journal of Economics and Business, 51(1):21-31.
[47] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Morristown, NJ, USA.
[48] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
[49] H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of EMNLP-03.
Appendix A

Personal Reflection

The main thing for me was to choose a project which I would enjoy right the way through, and this was definitely the case here. This would be my first suggestion to anyone carrying out their final year project in future years: choose a project that you will enjoy. Otherwise, interest will be lost and the student will be unable to put their full devotion into the work.

In terms of data collection, clearly one alteration I would make would be to check whether the expert data was available before starting the textual background reading. Although I found this research interesting, it was ultimately pointless and ate into my project's time. If this research had not been carried out, then I would have finished my numerical background reading sooner and would have had more time to spend on developing more prototypes. This is highlighted within the project's Further Work (Section 6.2). Whilst writing this, I realised how many ways this project could have been extended and analysed with more time. Thus, two more pieces of advice from me would be that students embarking on a data collection-based project like this one should gather all data first before firmly deciding on a project, and that, when the minimum requirements have been decided, they should try to make time within their schedule for the project's potential extensions.

One key suggestion I would give to somebody would be to always attend the meetings with their supervisor and always have questions ready, writing down the responses during the meeting. Although this should not need to be suggested, if these meetings are recorded on paper then it is a good way of keeping account of what was mentioned during the various stages of the project. Some people suggested keeping a diary; however, I felt this was not needed within my project as I had recorded all the meetings through this process.

Whilst searching for the betting line data, I had reached a dead-end similar to when I looked for the textual opinions. This led to emailing the authors of papers where this data was mentioned. After writing to Prof. Philip K. Gray, I was allowed access to the data I had been looking for (which was vital in the analysis of my project). I dealt with this correspondence in a polite and professional manner, which enabled me to get a quick response from him, and ultimately he was kind enough to send me the data. If I had been impolite or informal, he may not have even responded. Thus, another tip for future work would be to hold formal correspondence with any third parties within the project.

As mentioned within Section 2.3, the process of writing both reports (mid-project and final report)
is a lengthy one. This may seem like an obvious statement to somebody who has not carried out a report of this size. However, the routine of writing a report is very much an iterative one. This involves writing initial content, checking through it, realising some areas are irrelevant and that others are missing. Furthermore, this is before one has considered proofreading the work. Therefore, it is my suggestion to any prospective third-year student to start this writing process during the implementation of other work so that enough time is allocated to complete the reports.

Moreover, with regards to the mid-project report, some of my peers did not pay enough attention to this. Personally, I felt this was a big opportunity to get vital feedback on the direction of your project at Christmas. It can be easy for people to think that because it does not count directly towards the overall project mark it is not worth spending much time on. However, the comments within my mid-project report advised me to be more detailed about how a regression approach is implemented, which I feel I have rectified within this final report. Additionally, I was told to make sure that it is explicitly stated where I have created scripts and programs within the project rather than using another developer's code. Hopefully, due to the mid-project assessment comments, I have dealt with this here. Lastly, relating to the mid-project report, if nothing else it gives the student the opportunity to get a lot of the final report writing out of the way. Although I have adjusted parts, I used the mid-project report as a basis for writing my final report.

In summary, one word I would use to summarise this project is addictive. On countless occasions, I wanted to push a certain prototype further or utilise another technique to try to improve the accuracy. All in all, I feel I was fairly disciplined with regards to this. However, I could easily see how somebody could become engulfed by this urge to improve the model. Therefore, if someone were to carry out a similar research investigation to this one, they would need to define strict deadlines, as I did, and stick to them to ensure that all aspects of the project are fulfilled.
Appendix B

Project Schedule

Figure B.1: Original Project Schedule
Figure B.2: Revised Project Schedule
Appendix C

PREV RES Algorithm

Algorithm 1 PREV RES algorithm
for all match in footballresults.txt do
    if match.result != Tie then
        teamatally = 0
        teambtally = 0
        get resulttallies for match.teama vs match.teamb from result matrix
        years = years to go back
        yeartoken = match.year - years
        while yeartoken < match.year do
            teamatally += number of teama wins in yeartoken from resulttallies
            teambtally += number of teamb wins in yeartoken from resulttallies
            yeartoken++
        end while
        if teamatally > teambtally then
            predict teama to win
        else
            if teamatally < teambtally then
                predict teamb to win
            else
                predict teama [Home team]
            end if
        end if
    end if
end for
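For readers more comfortable with the project's implementation language, the following is a minimal Java sketch of the same logic. The ResultTallies class is a hypothetical stand-in for the HashMap-based results matrix described in Section 4.3.2, and none of the names come from the project's code.

import java.util.HashMap;
import java.util.Map;

// Hypothetical record of head-to-head wins per season for one pairing of teams.
class ResultTallies {
    final Map<Integer, Integer> teamAWins = new HashMap<>(); // season year -> wins for team A
    final Map<Integer, Integer> teamBWins = new HashMap<>(); // season year -> wins for team B
}

public class PrevRes {
    /**
     * Predicts the winner of a match between teamA (home) and teamB (away) by
     * tallying head-to-head wins over the previous `years` seasons.
     * Returns "teamA" or "teamB"; the home team is predicted when the tallies are equal.
     */
    static String predict(ResultTallies tallies, int matchYear, int years) {
        int teamATally = 0;
        int teamBTally = 0;
        for (int year = matchYear - years; year < matchYear; year++) {
            teamATally += tallies.teamAWins.getOrDefault(year, 0);
            teamBTally += tallies.teamBWins.getOrDefault(year, 0);
        }
        if (teamATally > teamBTally) {
            return "teamA";
        } else if (teamATally < teamBTally) {
            return "teamB";
        }
        return "teamA"; // tied tallies: fall back to the home team
    }
}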
Appendix D

Feature Ablation Results

The following two tables display the results of the feature analysis carried out on the project's Prototype 5. The first (Table D.1) displays the feature ablation studies within the model. The second (Table D.2) shows how useful each feature was within the final model (Prototype 5). A key for the features can be seen in Table D.3.
Table D.1: Accuracies of Prototype 5 During Feature Ablation Studies
(Removed Feature / Accuracy (%), one row per feature; rows are ordered from the features whose removal most reduced accuracy to those whose removal made the least difference)
* Statistically significant
Table D.2: Weight Coefficients Within The Logistic Model Used in Prototype 5
(Feature / Weight Coefficient, one row per feature; awaycapacity, awayrecenthomescdif8, homecapacity and distance each have a weight coefficient of 0)
* Statistically significant
Table D.3: Prototype 5 Feature Key

Feature                  Description
awaycapacity             The stadium capacity of the away team
awayrecentawayscdifn     The score difference in the nth recent away game for the away team
awayrecenthomescdifn     The score difference in the nth recent home game for the away team
awaywinratio 1           The away team's win ratio for last year
awaywinratio 2           The away team's win ratio for 2 years previous
distance                 The distance the away team had to travel to play the match
homecapacity             The stadium capacity of the home team
homerecentawayscdifn     The score difference in the nth recent away game for the home team
homerecenthomescdifn     The score difference in the nth recent home game for the home team
homewinratio 1           The home team's win ratio for last year
homewinratio 2           The home team's win ratio for 2 years previous
lastyearresult           The result of last year's corresponding game between the two teams