Auto-play: A Data Mining Approach to ODI Cricket Simulation and Prediction
|
|
|
- Octavia Montgomery
- 9 years ago
- Views:
Transcription
1 Auto-play: A Data Mining Approach to ODI Cricket Simulation and Prediction Vignesh Veppur Sankaranarayanan, Junaed Sattar and Laks V. S. Lakshmanan Department of Computer Science University of British Columbia Vancouver, B.C. Canada V6T 1Z4 {vsvicky,junaed,laks}@cs.ubc.ca Abstract Cricket is a popular sport played by 16 countries, is the second most watched sport in the world after soccer, and enjoys a multi-million dollar industry. There is tremendous interest in simulating cricket and more importantly in predicting the outcome of games, particularly in their one-day international format. The complex rules governing the game, along with the numerous natural parameters affecting the outcome of a cricket match present significant challenges for accurate prediction. Multiple diverse parameters, including but not limited to cricketing skills and performances, match venues and even weather conditions can significantly affect the outcome of a game. The sheer number of parameters, along with their interdependence and variance create a non-trivial challenge to create an accurate quantitative model of a game. Unlike other sports such as basketball and baseball which are well researched from a sports analytics perspective, for cricket, these tasks have yet to be investigated in depth. In this paper, we build a prediction system that takes in historical match data as well as the instantaneous state of a match, and predicts future match events culminating in a victory or loss. We model the game using a subset of match parameters, using a combination of linear regression and nearest-neighbor clustering algorithms. We describe our model and algorithms and finally present quantitative results, demonstrating the performance of our algorithms in predicting the number of runs scored, one of the most important determinants of match outcome. Keywords Sports prediction, analytics, ridge regression, attribute bagging, nearest neighbors 1 Introduction Primarily played in the member countries of the Commonwealth, cricket has grown in following across all continents. It has the second largest viewership by population for any sport, next only to soccer, and generates an extremely passionate following among the supporters. There is huge commercial interest in strategic planning for ensuring victory and in game outcome prediction. This has motivated thorough and methodical analysis of individual and team performance, as well as prediction of future games, across all formats of the game. Currently, team strategists rely on a combination of personal experience, team constitution and seat of the pants cricketing sense for making instantaneous strategic decisions. Inherently, the methodology employed by human experts is to extract and leverage important information from both past and current game statistics. However, to our knowledge, the underlying science behind this has not been clearly articulated. One of the key problems that needs to be solved in formulating strategies is predicting the outcome of a game. Our focus in this paper is to address the problem of accurately modeling game progression towards match outcome prediction. We learn a model for one-day format games by mining existing game data. In principle, our approach is applicable towards modeling any format of the game; however, we choose to focus our testing and evaluation on the most popular format, namely one-day international (ODI). By using a combination of supervised and unsupervised learning algorithms, our approach learns a number of features from a one-day cricket dataset which consists of complete records of all games played in a 19-month period between January 211 and July 212. Along with these learned historical features of the game, our model also incorporates instantaneous match state data, such as runs scored, wickets lost etc., as game progresses, to predict future states of an on-going match. By using a weighted combination of both historical and instantaneous features, our approach is thus able to simulate and predict game progression before and during a match. We motivate
2 the problem of game modeling and outcome prediction in Section 2. Along with a brief introduction to cricket, Section 3 presents the problem formulation, with details on feature modeling. In Section 4, we present our algorithm for predicting the game progression and outcome, with results discussed in Section 5 2 Related Work 2.1 Data Mining in Other Sports The problem of match outcome prediction has been studied extensively in the context of basketball and soccer. Bhandari et al. [4] developed the Advanced Scout system for discovering interesting patterns from basketball games, which has is now used by the NBA teams. More recently, Schultz [12] studies how to determine types and combination of players most relevant to winning matches. In soccer, Luckner et al. [11] predict the outcome of FIFA World Cup 26 matches using live Prediction Markets. In baseball, Gartheepan et al. [7] built a data driven model that helps in deciding when to pull a starting pitcher. These works are developed with a sport specific intuition which would render them inapplicable to the sport of cricket. 2.2 Academic Interest in Cricket One of the earliest and pioneering works in cricket was by Duckworth and Lewis [6] where they introduce the Duckworth- Lewis or D-L method, which allows fair adjustment of scores in proportion to the time lost due to match interruption (often due to adverse weather conditions such as rain, poor visibility etc.). This proposal has been adopted by the International Cricket Council (ICC) as a means to reset targets in matches where time is lost due to match interruptions. The method proposed in [6], and subsequently adapted by [14], for capturing the resources of a team during the progression of a match has found independent use in subsequent work in cricket modeling and mining [14][2]. Lewis [1], Lemmer [9], Alsopp and Clarke [1], and Beaudoin [3] develop new performance measures to rate teams and to find the most valuable players. Raj and Padma [15] analyze the Indian cricket team s One-Day International (ODI) match data and mine association rules from a set of features, namely toss, home or away game, batting first or second and game outcome. Kaluarachchi and Varde [8] employ both association rules and naive Bayes classifier and analyze the factors contributing to a win, also taking day/day-night game into account. Both approaches use a very limited subset of high-level features to analyze the factors contributing to victory. Furthermore, they do not address score prediction, nor the progression of the game. Bailey and Clarke [2] use historical match data and predict the total score of an innings using linear regression. As data of a match in progress streams in, the prediction model is updated. Using this, they analyze betting 1 market s sensitivity to the ups and downs of the game. Swartz et al. [17] use Markov Chain Monte Carlo methods to simulate ball by ball outcome of a match using a Bayesian Latent variable model. Based on the features of current batsman, bowler, and game situation (number of wickets lost and number of balls bowled), they estimate the outcome of the next ball. This model suffers from severe sparsity as noted by the authors themselves: the likelihood of a given batsman having previously faced a given bowler in previous games in the dataset is low. While both [17] and [2] have built match simulators for ODI cricket, their models rely on games played over 1 years ago. ODI cricket has since undergone a number of major rule modifications. Important examples include powerplays, free hit after an illegal ball delivery, and the use of two new balls (as opposed to just one) in an innings. These changes significantly affect the team strategies, and essentially render old models a poor fit. Our focus is on the modern and current form of ODI cricket, incorporating all recent changes to the game with support for accommodating future rule modifications. 3 Game Modeling 3.1 Overview of ODI Cricket Rules and Objectives We provide a brief overview of ODI cricket and review its basic rules as they pertain to game modeling and score prediction. We also introduce several basic notations and terminologies used in the rest of the paper. Toss: Similar to a number of other sports, an ODI cricket match starts with a toss. The team that wins the toss can choose to bat first or can ask the opponents to bat first. This decision is important and takes into account the nature of the playing field, weather conditions, and relative strengths and weaknesses of the two teams. Objective: In a game between T eam A and T eam B, suppose T eam A wins the toss and chooses to bat first. The period during which T eam A bats is called innings 1, in which T eam A has 5 overs to score as many runs as possible, while T eam B tries to minimize the scoring by getting T eam A s batsmen out (more commonly referred to as taking wickets). Scoring can also be restricted by T eam B, by bowling balls that are difficult to score off and by flawless fielding, where 1 There is a vibrant betting market associated with cricket. See, e.g.,
3 fielders stop hits by batsmen of T eam A to deny them opportunities to score runs. Innings 1 comes to an end when T eam A loses all its wickets or finishes its quota of 5 overs, whichever happens first. Let Score A denote the number of runs accumulated by T eam A at this point.when T eam B comes in to bat in innings 2, it has the exact same number of 5 overs to play, with the goal of scoring at least Score A +1 runs; innings 2 ends when Score B, the number of runs scored by T eam B, exceeds Score A, or when T eam B finishes its quota of 5 overs or loses all its wickets, whichever happens first. T eam B is deemed the winner in the first case, and T eam A wins otherwise. A third possibility is a tie when Score A and Score B are equal at the end of the game. 2 Scoring: Teams can accumulate runs in two ways. One way of scoring is to power-hit the ball outside the playing area. Four runs are awarded if the ball touches the ground before rolling past the boundary of the playing area. If the ball lands directly outside the playing area, six runs are awarded. Borrowing a term from baseball, game, for convenience, we collectively term runs scored this way as home runs. Home runs yield greater reward in terms of runs scored, but the batsmen have to take risks to hit them, which increases their chance of getting out. The other way of scoring is to hit the ball within the playing area and for the two batsmen to run and exchange their positions. In the mean time, the opponent players try to collect the ball to minimize the number of exchanges. Runs are awarded based on the number of times the batsmen exchange their positions before the ball is returned to one of the positions. There is theoretically no bound on the number of exchanges possible in a given ball but this value typically lies in the range 1 3 runs. This way of scoring has a lower risk of the batsman getting out but yields a lower number of runs. We term these non-home runs. Runs are awarded to the batting team when the bowler commits a foul while delivering the ball. Runs conceded this way are usually small and are accounted for by non-home runs in our model. Dismissal: There are eleven ways for a batsman to lose his wicket, commonly referred to as getting out or dismissed. The common ways to get dismissed are being bowled, caught by opponents, run out and Leg-Before- Wicket (abbreviated as LBW). In our model, we do not distinguish between the different forms of dismissal. Target score: The number of runs accumulated by T eam A at the end of innings 1 is Score A. Score A +1 run is set as the Target that the team batting second tries to achieve or exceed in innings 2. 2 Currently, there are no tie-breakers in ODI the format, possibly because ties are extremely rare. Resources: Overs and Wickets are collectively termed as resource. The batting team consumes the overs to accumulate runs and loses wickets in the process. A batting team has 5 overs and 1 wickets at their disposal at the start of an innings. This resource continually decreases as the game progresses. Segment: The batting period of a team is called an innings and it lasts till they run out of one of the resources. We split the 5-over window into 1 segments of 5 overs each, denoted S i, 1 i 1. For a team T, Ri T and Wi T denote the the number of runs scored and the number of wickets lost in segment S i, respectively. The total number of runs scored by team T at the end of their innings is given by R T eoi = 1 i=1 RT i. We drop the superscript T when the team is clear from the context. Below, we formalize the problem addressed in this paper. 3.2 Problem Formulation The main problem we tackle in this paper is given the instantaneous match data up to a certain point in the game, predict the progression of the remainder of the game, and in particular, predict the winner. Before we formalize this, we define a match state at segment n, n < 1, as the pair of numbers consisting of the number of runs scored and the number of wickets lost so far, by the batting team. Notice that given a match state, the resources remaining at the batting team s disposal can be easily calculated: the number of balls remaining is (1 n) 5 6 and the number of wickets remaining is 1 (#wickets lost so far). More precisely, given a match state associated with segment n, namely (R known = n i=1 R i, W known = n i=1 W i), predict the number of runs ˆR i for the remaining segments i, n + 1 i 1. Using these predictions, the total predicted score at the end of the innings can be obtained as (3.1) ˆReoi = R known + n i=n+1 If an innings has not commenced, as a special case, n =, R known = and W known =. We follow this segmented prediction approach to predict ˆR eoi for both innings 1 and innings 2. T eam A is predicted to be the winner if ˆR A eoi > ˆR B eoi. T eam B is predicted to be the winner if ˆR A eoi < ˆR B eoi. 3.3 Sub-Problem We break down the problem of predicting the number of runs in the next segment S n+1, given the match state up to segment S n, into two subproblems, by recognizing that home runs and non-home runs are strategized and scored by different means by the batsmen. We have found from our analysis and exploration that the number of runs can be ˆR i
4 predicted more accurately if we learn separate models for predicting the home runs HR n+1 and non-home runs NHR n+1. More precisely, for any segment i, R i = HR i + NHR i and ˆR i = HR ˆ i + NHR ˆ i, where ˆX is the predicted value of X. While it may seem counter-intuitive to use two different classes of techniques to predict the overall total score, this decision was driven by observing the inherent nature of the game itself, and has eventually been justified by our experimental results. In a given game, the number of non-home run scoring balls greatly outnumber the home run scoring balls. A linearregression based approach to predict non-home runs thus runs into the problem of data sparsity. Attribute bagging, on the other hand, enables our system to find matches that have similar home-run scoring patterns, given the set of match features, and thus avoids the sparsity issue altogether. Our experiments have shown (see Section 5) much degraded performance when using ridge regression for HR prediction, with the MAE for ˆR eoi increasing from 16.5 runs to 29.4 runs. Prediction of HR ˆ i and NHR ˆ i is accomplished using two sets of features historical features and the instantaneous features, described next. Of these, historical features are critical for predicting runs for the first segment, since by definition, no instantaneous match data is available before the first segment. 3.4 Historical Features Our model consists of 6 historical features for each team in the dataset. They are mined from data across all matches played by a given team. The historical features of a team are as follows: (1) Average runs scored (by the team) in an innings; (2) Average number of wickets lost in an innings; (3) Frequency of being all-out; 3 (4) Average runs conceded in an innings; (5) Average number of opponent wickets taken in an innings; (6) Frequency of getting opposition all-out. In what follows, we will use N to denote the total number of matches in the training dataset. Recall, n denotes the segment up to which match state is known. The first feature is calculated by dividing the total runs scored by the given team across the number of matches it played. N i=1 (3.2) AverageScore = (Runs scored in match i) N The subsequent five features are self-explanatory and are calculated similarly to (3.2). Out of the 6 features, the first three represent the team s batting ability, while the last three represent the team s bowling ability. 3 That is, losing all 1 wickets in an innings within 5 overs. 3.5 Instantaneous features In addition to the features mined from past game data, i.e., the historical features, we incorporate several instantaneous match features in our prediction model. What has happened in the game so far is an important indicator for predicting game outcome. We extract the following instantaneous features from the dataset. 1. Home or Away: This is a binary feature describing if the batting team is playing in its home ground. If the match is played in a neutral venue, this feature carries no weight for both teams. 2. Powerplay: Powerplay is a restriction on the number of fielders that could be placed by the bowling team outside a certain range from the batsmen (usually 3 yards, approx meters). This restriction enables the batsmen to hit the balls aggressively and try and score home runs, with a relatively reduced risk of getting out. The first 1 overs of the game are mandatory powerplays, with two more instances of powerplay periods arbitrarily chosen by the batting and bowling team each, to occur at any point in the game up to the 45 th over. For any segment, the powerplay can occupy between and 5 overs of the segment. Consequently, the value of this feature ranges from to 1 in increments of Target: The goal of the team batting second is to achieve the T arget Score, (= Score A +1 runs). This used as a feature 4. Batsmen performance features: For any given segment S n, we identify four performance indicators for each of the two currently playing batsmen. They are batsman-cluster (to be described in section 3.6), #runs scored, #balls faced, and #home runs hit till segment S n Game snapshot: This feature is a pair of game state variables, namely current score and #wickets (i.e., #batsmen) left. Instantaneous features 4 and 5 are explained in detail in Sections 3.6 and Batsmen Clustering In our dataset, there are more than 2 players who have faced at least one ball. Given data corresponding to 125 matches, learning the features for each of the 2 individual players is fraught with extreme sparsity. To give an example, given a currently playing batsman b and a current bowler l, the probability that b has faced l in earlier matches can be quite low. Even when b has faced l before, the number of such matches can be too small to learn any useful signals from, for purposes of prediction. To quantify, if in a dataset of M matches, the average number of matches played by player b is m b, and by player l is m l (where M m b and M m l ), even
5 assuming independence, the probability that b and l played together is m b M m l M. To overcome this sparsity, we cluster the batsmen according to their batting skills, using the following four features: (1) Batting Average; (2) Strike Rate; (3) Home-run hitting ability; and (4) Milestone reaching ability. The first two features are standard metrics used to report batsmen stats in cricket. Although they are used to express a batsman s quality, they do not quite capture his skill as observed by cricket experts and proved by [16] and [1]. Hence for batsman clustering, we use Features 3 and 4 that capture the quality of batsmen more accurately. Batting Average for a batsman is the ratio of the total number of runs he has scored across all matches, over the number of times he has gotten out. Strike Rate is the average number of runs scored per 1 balls, again calculated across all matches played. Both Batting Average and Strike Rate are standard player statistics used in cricket. We measure the ability of a batsman to frequently hit home runs using (3.3) N i=1 HR-hittingAbility = #home runs hit in match i N i=1 balls faced in match i Scoring fifty runs or a hundred runs (commonly referred to as half-century and century) are considered batting milestones in cricket. Players who consistently and frequently reach these milestones are considered to be of very high caliber. To capture this, we define a metric called milestone reaching ability (MRA) as follows: (3.4) # of 5 & 1 run scores in N matches played MRA = N MRA is thus a good indication of batsman quality. Using the above four statistics, we cluster the batsmen into 5 clusters using the k-nearest neighbor clustering. We chose 5 clusters based on the intuition that a team consists of opening batsmen, middle-order batsmen, allrounders, wicket-keeper, and tail-enders, having different batting capabilities. 3.7 Game Snapshot Recall that the problem is, given the match state data up to segment n < 1, i.e., runs scored R i and wickets lost W i in segment i, 1 i n, we need to predict the number of runs for segment n + 1. To facilitate this, we aggregate all of the information in segments S 1 to S n 1 and retain the information in segment S n separately. More precisely, we set R 1:n 1 = n 1 i=1 R i and W 1:n 1 = n 1 i=1 W i. We then incorporate the instantaneous features R 1:n 1, W 1:n 1, R n, W n in our model. Since our score prediction is done separately for home and non-home runs, we use HR 1:n 1 = n 1 i=1 HR i and NHR 1:n 1 = n 1 i=1 NHR i and use these features instead of R 1:n 1, and predict the number of home runs and non-home runs for segment n. For example, to predict the runs in segment S 6 (overs 26 to 3), runs scored and wickets lost in segments S 1 to S 4 are aggregated. Runs and wickets in segment S 5 are retained as such. This approach provides the game information till segment S n 1 and the game information in segment S n separately to the model. This provides a broader snapshot of match state and also gives more importance to the immediately preceding segment. Our learning algorithm, described in the next section, makes use of the aforementioned historical and instantaneous features up to a given segment to predict scores for subsequent segments and uses that to predict the overall score ˆR eoi. As a special case, when n =, the algorithm relies on historical features alone to make its predictions. 4 Algorithm 4.1 Home-Run Prediction Model Using the historical and non-historical features discussed above, we predict the number of home runs HR ˆ i for a segment S i, using attribute bagging ensemble method [5] with nearest-neighbor clustering. Here, we choose random subsets of features for n classifiers with l features each and aggregate the overall results. Different sets of features corresponding to the previous states are chosen randomly and their nearest neighbors are identified from history, thereby leveraging the Markovian nature of segments. Number of features for every classifier is set to be the root value of the total number of features. The number of classifiers is experimentally determined. The intuition behind using nearest-neighbor algorithm is that information from similar match situations can be borrowed from the training dataset. We use Spearman s distance metric, that uses rank correlation to identify the top neighbor. The Spearman distance is a measure of pairwise linear correlation between ranked variables. Suppose a sequence of values of two variables u = (u 1,..., u m ) and v = (v 1,..., v m ) are rank-ordered, then Spearman correlation coefficient is defined as: (4.5) ρ = m i (u i ū)(v i v) m i (u i ū) 2 m i (v i v) 2 It is the same as Pearson correlation coefficient except ranks are used in place of observed values. Game features are ranked in the training and test dataset separately. The distance between a match in the
6 test dataset and one in the training dataset is the dot product of the (ranked) feature vectors of the matches. After running n number of classifiers with l features each, we pick the top 5 neighbors based on frequency counts and average the home run hits. This number of neighbors, 5, has been determined experimentally. 4.2 Non-Home-Run Prediction Using the same historical and instantaneous features, non-home runs of segment S i, NHR ˆ i is predicted by means of Ridge Regression [13]. The iterative algorithm to predict the runs R i of future segments and consequently, R eoi of the whole innings is given in Algorithm 1, where we use the following notation: NHR ˆ i, predicted non-home runs in segment S i HR ˆ i, predicted home runs in segment S i ˆR i, predicted total runs in segment S i ˆR eoi, predicted end of innings score Θ historical features n segment number till which match information is available R i, Runs scored in segment S i where 1 i n n instantaneous features till segment S i. Θ is the set of historical features given in Section 3.4, which remain constant through the iterations since they are learned just once from the historical match data; n is the segment number till which, match state data, also called instantaneous features, n are available. At the start of the algorithm, features Θ and n are fed as input to the algorithm which proceeds iteratively to predict ˆR i, for every segment i, n+1 i 1. Algorithm 1: Prediction of Future Segments runs ˆRi & End of Innings Score ˆR eoi Input : Θ, n, n Output: R i for every n + 1 i 1, ˆR eoi 1 for i n + 1 i 1 do 2 for j 1 j L do 3 Γ j RandomSubspace(Θ, i 1 ) 4 Φ j NearestNeighbor(Γ j ) 5 end 6 HR i MajorityV oting(φ 1:L ) 7 NHR i RidgeRegression(Θ, i 1 ) 8 ˆRi HRi + NHR ˆ i 9 i Update( i 1, ˆRi) 1 end 11 ˆR eoi n i=1 R i + 1 ˆRi i=n+1 Using Θ and i 1 (instantaneous features of previous segment), home runs HR ˆ i of segment i is predicted using Attribute bagging algorithm as explained in section 4.1(Line 2-6). For every classifier j in L (Number of Classifiers), a random subspace of features (Γ j ) is chosen from the overall feature space (Line 3). The nearest neighbor based on this subspace of features (Γ j ) is found out to be Φ j (Line 4). Based on majority voting among the chosen neighbors (Φ 1:L ), the closest neighbor from the raining set is found and home-run information is borrowed. (Line 6) Using the same features Θ and i 1, non-home runs NHR ˆ i are predicted using Ridge Regression as mentioned earlier in this section (line 7). HR ˆ i and NHR ˆ i are added together to give ˆRi, the predicted number of runs scored in segment S i (Line 8). It is to be noted that two of the instantaneous features (Section 3.5) namely Home/Away and Batsmen Cluster do not change through the course of prediction while Game Snapshot features are constantly updated based on predictions of the previous iteration as explained in section 3.7. Number of runs for segment i, ˆRi, predicted in this iteration, is added to the Game Snapshot features of i 1, thereby modifying it to i (Line 9). This i is used to predict the HR ˆ i+1 and NHR ˆ i+1 in the next iteration. The cumulative sum of all known runs R i, 1 i n and all predicted runs ˆRi, n < i 1, is the predicted End-of-Innings Score, ˆReoi (Line 11). 4.3 Cold-Start If prediction is initiated without any match information, i.e., before the start of the actual innings, then n =, and the algorithm starts prediction from S 1 through S 1. In the first iteration when i = 1, no instantaneous features are available. As the opening pair of batsmen are known before the start of a game, their cluster ID numbers (from the feature Batsmen cluster) are used. In order predict ˆR1 accurately, we leverage the match venue information and use it as a feature along with the other historical features Θ. The match venue is incorporated as a multinomial feature called Venue Class, which classifies the location of the venue into one of the following four continent clusters: (i) Australia/New Zealand; (ii) Indian Subcontinent; (iii) The British Isles; and (iv) The Caribbean/South Africa. This classification is significant since the pitch (i.e., the area where the ball is bowled and pitched) and weather conditions across these region clusters are known to be substantially different, for the purposes of the game. Using the venue class as a feature is based on the intuition that a team tends to start differently in different venue conditions. 5 Experiments 5.1 Dataset Our dataset consists of 125 complete matches played between January 211 and July
7 212 among the 9 full-time ICC teams of Australia, Bangladesh, England, India, Sri Lanka, Pakistan, South Africa, New Zealand and the West Indies who have played more than 2 matches each excluding all raininterrupted and rain-abandoned games. We split the dataset into training and test set with 1 and 25 matches respectively. We perform ten-fold cross validation on the 1 training matches and and present our results by testing on the remaining 25 matches. We crawled the data from where ball-by-ball data on all the matches are available publicly. Team and batsmen statistics for mining historical features and batsmen clustering are also queried from their publicly-accessible statistics databases. Since ball-by-ball commentary data consists of transcription of the human interpretation of actual events, there are occasional missing values and errors. They were fixed either manually or by automated consistency checks with the end-of-over summary, scorecards, partnership information and other relevant game data. After the required data is gathered, they were aggregated and rolled-up to 5-over levels (since a segment is a collection of 5 overs) without loss of necessary information. Once the data is available in the processed form, running time for the model to learn the parameters across all the segments and testing by 1-fold cross-validation takes less than 5 seconds on a 4-core, 2.66 GHz machine with 8 GB RAM, running OpenSuse The data is gathered from publicly available sources, as mentioned above. Owing to limited space, we combine and present the innings 1 and innings 2 prediction performances. 5.2 Non-Home Run Prediction Performance Prediction of NHR ˆ eoi, which is the sum of individual NHR ˆ i is shown in Figures 1 and 2. Figure 1 shows a scatter plot between the predicted and actual total non-home runs for both innings 1 and innings 2 combined. and demonstrates good agreement between the predicted and actual Non-Home Runs. Figure 2 shows the total non-home run prediction error distribution across all the matches. The figure on the left gives the Probability Density Function (PDF ) and the one on the right gives the Cumulative Distribution Function (CDF ). It can be seen that for more than 55% of the matches, the error margin is less than or equal to 1 runs in both innings. The median number of non-home runs in an innings in our dataset is 125. It is evident that our approach is very accurate for the majority of matches in the dataset, and performs poorly on only a small percentage of the games. 5.3 Home Run Prediction Performance We experimented with a number of distance metrics (namely, Jaccard, Hamming and Cosine measures) and compared them with the performance of Spearman distance metric., and the Spearman metric was observed to perform best (i.e., have the lowest minimum average error). We report the performance of both nearest neighbor and attribute bagging with nearest neighbor technique and expectedly, attribute bagging performs better than nearest neighbors. Figure 3 shows a scatter plot between predicted and actual total home runs. We have slightly worse Actual non home runs Predcited non home runs Figure 1: Total non-home runs scatter plot Number of Matches PDF % of Matches CDF Absolute Error in Predicted Non Home Runs Absolute Error in Predicted Non Home Runs Figure 2: PDF and CDF of Total non-home run prediction error Number of Matches % of Matches Actual home runs Predcited home runs Figure 3: Total home runs scatter plot 5 PDF Attribute Bagging Absolute Error in Predicted Home Runs PDF Nearest Neighbors Absolute Error in Predicted Home Runs % of Matches CDF.1 Attribute Bagging Nearest Neighbors Absolute Error in Predicted Home Runs Figure 4: PDF and CDF of Total home run prediction error
8 agreement between predicted and actual, compared to non-home run prediction (Figure 1). Figure 4 shows that the error margin for the top 55% of the matches are less or equal to 2 runs. As mentioned in section 3, home runs are awarded either 4 or 6 runs based on where the ball lands. Hence, a single mis-prediction can induce a maximum error of 6 runs. It is also a more difficult problem to predict the number of runs scored through home runs, with the uncertainty arising from the very nature of the game as described in section 3. This is reflected in Figure 3 and 4. Figure 4 also shows that attribute bagging with nearest neighbors performs better than plain nearest neighbor algorithm. 5.4 End-of-Innings Run Prediction Performance Figure 5 shows the scatter plot for ˆR eoi for every match in the dataset. Figure 6 shows the total score error distribution across the all the matches in the dataset. It can be observed that, for 5% of matches, prediction error has a maximum of 16 runs in Attribute bagging method, while for nearest neighbor method, it is close to 3 runs. 8% of the matches fall under the same prediction error ceiling, in attribute bagging method. 5.5 Runs in a Segment, ˆR i As match data streams in, we update the model (in five-over intervals) with ground truth and make ˆR i prediction for the next segment. Figure 7 shows the mean absolute error values for ˆR i scores across segments S i, given match state data till segment S i 1. MAE i for both innings 1 and innings 2 lies within the range of 4 and 12 runs for all the seg- Number of Matches Number of Matches Actual end of innings Score Predcited end of innings Score Figure 5: R eoi scatter plot PDF Attribute Bagging Error in end of innings score prediction PDF Nearest Neighbors Error in end of innings score prediction % of Matches CDF.1 Attribute Bagging Nearest Neighbors Absolute error in end of innings score prediction Figure 6: PDF and CDF of R eoi prediction error ments, with errors increasing towards the later segments during the innings. Generally, until the middle overs (i.e., up to over 35), teams are focusing on building a good foundation and consolidating their run scoring efforts. On the other hand, in the last 2 or 3 segments (from overs 35 to 5), it is common for the batsmen to try to hit most of the deliveries for home runs to maximize total runs; subsequently, a large chunk of the total score is accumulated in these last three segments. In doing so, batsmen take high risk and subsequently may get dismissed. Hence the match could turn in favor of any of the two teams with more or less equal probability. Because of such unpredictable nature of the game during these segments, it is difficult to estimate ˆR i. Accordingly, the performance of our algorithm suffers somewhat in these segments, as demonstrated by the plots. 5.6 Performance Comparison with Baseline Model Bailey et al. [2] propose a model that predicts the ˆR eoi of a game in progress which is used to analyze the sensitivity of betting markets. Although addressing a different requirement, their framework allows making ˆR eoi predictions at the end of each innings. In Figure 8, we demonstrate the accuracy of our model considering their model as a baseline. At the end of each segment, ˆR eoi is calculated and compared with the actual R eoi obtained at the end of innings from match data. As shown in the plot, both our model and that of Bailey et al. make better predictions, as more segments from the match in progress are input to the model. However, MAE for our model is significantly better for all the segments. It can be observed that our model significantly outperforms the baseline for both the innings. We used our framework for predicting ˆR eoi for both innings, to predict the game winner. We found that the accuracy of this prediction is between 68% and Mean Absolute Error (MAE) Innings 1 Innings Overs Figure 7: Mean absolute error for Runs R i across each segments S i, or, 5-over intervals for innings 1 and innings 2. Since the first and fore-most prediction R 1 for i = 1 gives the runs scored at the end of over number 5, the plots start from over 5.
9 Figure 8: Mean absolute error in ˆR eoi prediction for innings 1 (top) and innings 2 (bottom) for both [2] and our model. 7%, which is robust regardless of the number of known segments. To our knowledge, this is the highest winner prediction accuracy reported in for ODI cricket. 6 Conclusion and Future work The main goal of this paper is to learn a model for predicting game progression and outcome in one-day cricket. We developed separate models for home runs and non-home runs using historical features as well as instantaneous match features from past games that we identified. Ridge Regression and attribute bagging algorithms are used on the features to incrementally predict the runs scored in the innings. We demonstrated the quality and accuracy of our predictions with an extensive set of experiments on real ODI cricket data. In addition to predicting runs for future segments, our winner prediction accuracy is by far the highest reported in ODI cricket mining literature. While our technique is significantly more accurate than the state-of-the-art, we are currently working to further reduce the prediction error. Furthermore, to make the prediction engine functionally complete, we intend to predict fall of wickets, overcoming the challenges presented from data sparsity. Finally, we aim to leverage bowler s features (in addition to the batsmen s) to improve the prediction accuracy even further. References [1] P. E. Allsopp and S. R. Clarke. Rating teams and analysing outcomes in one-day and test cricket. Journal of the Royal Statistical Society. Series A (Statistics in Society), 167(4):pp , 24. [2] M. Bailey and S. R. Clarke. Predicting the match outcome in one-day international cricket matches, while the game is in progress. Journal of sports Science and Medicine, 5(4):48 487, 26. [3] D. Beaudoin. The best batsmen and bowlers in one-day cricket. PhD thesis, Simon Fraser University, 23. [4] I. Bhandari, E. Colet, and J. Parker. Advanced Scout: Data mining and knowledge discovery in NBA data. Data Mining and Knowledge Discovery, 1(1): , [5] R. Bryll, R. Gutierrez-Osuna, and F. Quek. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36(6): , June 23. [6] F. C. Duckworth and A. J. Lewis. A fair method for resetting the target in interrupted one-day cricket matches. The Journal of the Operational Research Society, 49(3):pp , [7] G. Gartheeban and J. Guttag. A data-driven method for in-game decision making in mlb: when to pull a starting pitcher. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD 13, pages , New York, NY, USA, 213. ACM. [8] A. Kaluarachchi and A. Varde. CricAI: A classification based tool to predict the outcome in ODI cricket. In 5th International Conference on Information and Automation for Sustainability, pages , 21. [9] H. H. Lemmer. An analysis of players performances in the first cricket Twenty2 World Cup series. South African Journal for Research in Sport, Physical Education and Recreation, 3(2):71 77, 28. [1] A. J. Lewis. Towards fairer measures of player performance in one-day cricket. The Journal of the Operational Research Society, 56(7):pp , 25. [11] S. Luckner, J. Schröder, and C. Slamka. On the forecast accuracy of sports prediction markets. In Negotiation, Auctions, and Market Engineering, International Seminar, Dagstuhl Castle, volume 2, pages , 28. [12] D. Lutz. A cluster analysis of NBA players. In MIT Sloan Sports Analytics Conference, 212. [13] D. W. Marquardt. Ridge regression in practice. The American Statistician, 29(1):3 2, February [14] I. G. McHale and M. Asif. A modified Duckworth- Lewis method for adjusting targets in interrupted limited overs cricket. European Journal of Operational Research, 225(2): , March 213. [15] K. Raj and P. Padma. Application of association rule mining: A case study on team India. In International Conference on Computer Communication and Informatics (ICCCI), pages 1 6, 213. [16] T. B. Swartz, P. S. Gill, D. Beaudoin, and B. M. desilva. Optimal batting orders in one-day cricket. Comput. Oper. Res., 33(7): , July 26. [17] T. B. Swartz, P. S. Gill, and S. Muthukumarana. Modelling and simulation for one-day cricket. Canadian Journal of Statistics, 37(2):143 16, 29.
PREDICTING THE MATCH OUTCOME IN ONE DAY INTERNATIONAL CRICKET MATCHES, WHILE THE GAME IS IN PROGRESS
PREDICTING THE MATCH OUTCOME IN ONE DAY INTERNATIONAL CRICKET MATCHES, WHILE THE GAME IS IN PROGRESS Bailey M. 1 and Clarke S. 2 1 Department of Epidemiology & Preventive Medicine, Monash University, Australia
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
Football Match Winner Prediction
Football Match Winner Prediction Kushal Gevaria 1, Harshal Sanghavi 2, Saurabh Vaidya 3, Prof. Khushali Deulkar 4 Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-19-B &
Beating the MLB Moneyline
Beating the MLB Moneyline Leland Chen [email protected] Andrew He [email protected] 1 Abstract Sports forecasting is a challenging task that has similarities to stock market prediction, requiring time-series
Numerical Algorithms for Predicting Sports Results
Numerical Algorithms for Predicting Sports Results by Jack David Blundell, 1 School of Computing, Faculty of Engineering ABSTRACT Numerical models can help predict the outcome of sporting events. The features
Beating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
How To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
Cross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
STATISTICAL MODELLING IN LIMITED OVERS INTERNATIONAL CRICKET. Muhammad ASIF
STATISTICAL MODELLING IN LIMITED OVERS INTERNATIONAL CRICKET Muhammad ASIF Ph.D. Thesis 2013 i STATISTICAL MODELLING IN LIMITED OVERS INTERNATIONAL CRICKET Muhammad ASIF Centre for Sports Business, Salford
ECB Physical and Learning Disability County Cricket
ECB Physical and Learning Disability County Cricket Introduction In 2015 we saw the introduction of two new competitions for cricketers with physical and / or learning disabilities. The D40League was taken
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Combining player statistics to predict outcomes of tennis matches
IMA Journal of Management Mathematics (2005) 16, 113 120 doi:10.1093/imaman/dpi001 Combining player statistics to predict outcomes of tennis matches TRISTAN BARNETT AND STEPHEN R. CLARKE School of Mathematical
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
Relational Learning for Football-Related Predictions
Relational Learning for Football-Related Predictions Jan Van Haaren and Guy Van den Broeck [email protected], [email protected] Department of Computer Science Katholieke Universiteit
We employed reinforcement learning, with a goal of maximizing the expected value. Our bot learns to play better by repeated training against itself.
Date: 12/14/07 Project Members: Elizabeth Lingg Alec Go Bharadwaj Srinivasan Title: Machine Learning Applied to Texas Hold 'Em Poker Introduction Part I For the first part of our project, we created a
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Performance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
Simbabet. Tip Code League Time MBS Home 1 X 2 Away 1234 UEFA 16:45 1 Barcelona 1.25 5.00 7.25 Arsenal
Sport Betting Sports betting is a form of gaming which is based on sports (Soccer, basketball, cricket etc.). The challenge lies in predicting a win or a tie for a home game, or a win for an away game
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
SPORTSBOOK TERMS & CONDITIONS
SPORTSBOOK TERMS & CONDITIONS DATE OF ISSUE: 01.04.2014 1. General Sports Rules 1.1 Every user himself defines the bets, except the limitation that is set by the earnings limits according to the calculation
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Predicting outcome of soccer matches using machine learning
Saint-Petersburg State University Mathematics and Mechanics Faculty Albina Yezus Predicting outcome of soccer matches using machine learning Term paper Scientific adviser: Alexander Igoshkin, Yandex Mobile
AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
Soccer Analytics. Predicting the outcome of soccer matches. Research Paper Business Analytics. December 2012. Nivard van Wijk
Soccer Analytics Predicting the outcome of soccer matches Research Paper Business Analytics December 2012 Nivard van Wijk Supervisor: prof. dr. R.D. van der Mei VU University Amsterdam Faculty of Sciences
How To Bet On An Nfl Football Game With A Machine Learning Program
Beating the NFL Football Point Spread Kevin Gimpel [email protected] 1 Introduction Sports betting features a unique market structure that, while rather different from financial markets, still boasts
PlayNow Sports Betting Game Conditions
1. RULES AND REGULATIONS PlayNow Sports Betting The following are subject to the PlayNow Sports Betting Rules and Regulations. 2. DEFINITIONS (a) (b) (c) (d) (e) (f) (g) (h) (i) (j) (k) (l) All-in play
How To Predict The Outcome Of The Ncaa Basketball Tournament
A Machine Learning Approach to March Madness Jared Forsyth, Andrew Wilde CS 478, Winter 2014 Department of Computer Science Brigham Young University Abstract The aim of this experiment was to learn which
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
TOTOBET.COM SPORTS RULES TABLE OF CONTENTS NO. SUBJECT PAGE WELCOME TO THE WINNING SITE.
TABLE OF CONTENTS NO. SUBJECT PAGE 1. GENERAL 2 2. AMERICAN FOOTBALL 4 3. AUSTRALIAN RULES FOOTBALL 4 4. BANDY 4 5. BASEBALL 5 6. BASKETBALL 5 7. BOXING 5 8. CRICKET 6 9. CYCLING 7 10. DARTS 7 11. FLOORBALL
Rafael Witten Yuze Huang Haithem Turki. Playing Strong Poker. 1. Why Poker?
Rafael Witten Yuze Huang Haithem Turki Playing Strong Poker 1. Why Poker? Chess, checkers and Othello have been conquered by machine learning - chess computers are vastly superior to humans and checkers
SYNTASA DATA SCIENCE SERVICES
SYNTASA DATA SCIENCE SERVICES A 3 : Advanced Attribution Analysis A Data Science Approach Joseph A. Marr, Ph.D. Oscar O. Olmedo, Ph.D. Kirk D. Borne, Ph.D. February 11, 2015 The content and the concepts
Prediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
International Statistical Institute, 56th Session, 2007: Phil Everson
Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: [email protected] 1. Introduction
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Unsupervised Outlier Detection in Time Series Data
Unsupervised Outlier Detection in Time Series Data Zakia Ferdousi and Akira Maeda Graduate School of Science and Engineering, Ritsumeikan University Department of Media Technology, College of Information
An Exploration into the Relationship of MLB Player Salary and Performance
1 Nicholas Dorsey Applied Econometrics 4/30/2015 An Exploration into the Relationship of MLB Player Salary and Performance Statement of the Problem: When considering professionals sports leagues, the common
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Using Adaptive Random Trees (ART) for optimal scorecard segmentation
A FAIR ISAAC WHITE PAPER Using Adaptive Random Trees (ART) for optimal scorecard segmentation By Chris Ralph Analytic Science Director April 2006 Summary Segmented systems of models are widely recognized
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS
SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS Carlos Andre Reis Pinheiro 1 and Markus Helfert 2 1 School of Computing, Dublin City University, Dublin, Ireland
Rating Systems for Fixed Odds Football Match Prediction
Football-Data 2003 1 Rating Systems for Fixed Odds Football Match Prediction What is a Rating System? A rating system provides a quantitative measure of the superiority of one football team over their
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Soccer Spreads. TOTAL GOALS: This is a market where we would predict the total number of goals scored by both teams in a match.
Soccer Spreads SUPREMACY: This is a market for predicting a team's dominance over their opposition. We will predict how many goals a team will beat the other by. For example, we might make Man Utd favourites
Better decision making under uncertain conditions using Monte Carlo Simulation
IBM Software Business Analytics IBM SPSS Statistics Better decision making under uncertain conditions using Monte Carlo Simulation Monte Carlo simulation and risk analysis techniques in IBM SPSS Statistics
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India [email protected] 2 School of Computer
Baseball Pay and Performance
Baseball Pay and Performance MIS 480/580 October 22, 2015 Mikhail Averbukh Scott Brown Brian Chase ABSTRACT Major League Baseball (MLB) is the only professional sport in the United States that is a legal
Big Data: a new era for Statistics
Big Data: a new era for Statistics Richard J. Samworth Abstract Richard Samworth (1996) is a Professor of Statistics in the University s Statistical Laboratory, and has been a Fellow of St John s since
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
ElegantJ BI. White Paper. The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis
ElegantJ BI White Paper The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis Integrated Business Intelligence and Reporting for Performance Management, Operational
Royal London Women s One-Day Championship
Royal London Women s One-Day Championship Competition Rules 1 Title The title of the competition will be the Royal London Women s One-Day Championship 2 Management Please refer to Generic Rule 1. 3 Entry
Predicting sports events from past results
Predicting sports events from past results Towards effective betting on football Douwe Buursma University of Twente P.O. Box 217, 7500AE Enschede The Netherlands [email protected] ABSTRACT
Maximizing Precision of Hit Predictions in Baseball
Maximizing Precision of Hit Predictions in Baseball Jason Clavelli [email protected] Joel Gottsegen [email protected] December 13, 2013 Introduction In recent years, there has been increasing interest
SPORTS, MARKETS AND RULES
SPORTS, MARKETS AND RULES 1. GENERAL... 2 2. AMERICAN FOOTBALL... 3 3. AUSTRALIAN RULES FOOTBALL... 3 4. BANDY... 3 5. BASEBALL... 4 6. BASKETBALL... 4 7. BOXING... 4 8. CRICKET... 4 9. CYCLING... 5 10.
Analysing Questionnaires using Minitab (for SPSS queries contact -) [email protected]
Analysing Questionnaires using Minitab (for SPSS queries contact -) [email protected] Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
A Game Theoretical Framework for Adversarial Learning
A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Predicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Local classification and local likelihoods
Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Classification of Learners Using Linear Regression
Proceedings of the Federated Conference on Computer Science and Information Systems pp. 717 721 ISBN 978-83-60810-22-4 Classification of Learners Using Linear Regression Marian Cristian Mihăescu Software
Texas Hold em. From highest to lowest, the possible five card hands in poker are ranked as follows:
Texas Hold em Poker is one of the most popular card games, especially among betting games. While poker is played in a multitude of variations, Texas Hold em is the version played most often at casinos
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
Live Betting Extra Rules
Live Betting Extra Rules Rules for General Main Live Lines The following rules apply for Live Betting: Markets do not include overtime unless otherwise stated. Lines are offered at the provider s discretion.
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
The Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
Three Methods for ediscovery Document Prioritization:
Three Methods for ediscovery Document Prioritization: Comparing and Contrasting Keyword Search with Concept Based and Support Vector Based "Technology Assisted Review-Predictive Coding" Platforms Tom Groom,
JetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
Invited Applications Paper
Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK [email protected] [email protected]
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Cross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
Predict the Popularity of YouTube Videos Using Early View Data
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
An Autonomous Agent for Supply Chain Management
In Gedas Adomavicius and Alok Gupta, editors, Handbooks in Information Systems Series: Business Computing, Emerald Group, 2009. An Autonomous Agent for Supply Chain Management David Pardoe, Peter Stone
Monte Carlo Simulations for Patient Recruitment: A Better Way to Forecast Enrollment
Monte Carlo Simulations for Patient Recruitment: A Better Way to Forecast Enrollment Introduction The clinical phases of drug development represent the eagerly awaited period where, after several years
1 LAW 1 THE PLAYERS 2 LAW 2 - SUBSTITUTES AND RUNNERS, BATSMAN OR FIELDER LEAVING THE FIELD, BATSMAN RETIRING, BATSMAN COMMENCING INNINGS
STANDARD ONE-DAY INTERNATIONAL MATCH PLAYING CONDITIONS These playing conditions are applicable to all ODI matches from 1st October 2014 and supersede the previous version dated 1st October 2013. Included
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Random forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
Leveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
