Wesley Pasfield Northwestern University

March Madness Prediction Using Big Data Technology

Abstract - This paper explores leveraging big data technologies to create probabilistic predictive models for the NCAA Men's Basketball Tournament (March Madness). Historically, many approaches have been used to predict March Madness, ranging from picking teams based on mascot to heavily quantitative predictive models. Building a consistently successful model has proven extremely challenging due to the variability in each individual tournament. This paper attempts to identify the key regular season characteristics of teams that dictate success in the NCAA tournament, while leveraging big data technologies through the Apache Hadoop ecosystem. Specifically, this paper discusses creating probabilistic predictions using three different algorithms: logistic regression through the generalized linear model (GLM), a random forest classifier, and gradient boosting machines (GBM) with a Bernoulli distribution, all using the R scripting functionality for the open source machine learning engine H2O on top of a single-node Apache Hadoop cluster. The data used in this analysis came from both the data science competition website Kaggle.com and Ken Pomeroy's college basketball analytics website KenPom.com.

Keywords: Generalized Linear Model, Gradient Boosting Machines

I. INTRODUCTION

The NCAA basketball March Madness tournament is notorious for upsets and unpredictability. Many people fill out brackets trying to predict the outcome of the tournament, but most have to rip up their bracket after the first few rounds. Increasingly, efforts have been made to create systems that better predict the outcome of NCAA tournament games.
In 2015, HP sponsored a competition on the data science competition website Kaggle, offering prizes for teams and individuals that build a model which minimizes the predictive binomial deviance in predicting the outcome of NCAA tournament games through two stages of submissions: the first evaluates all possible matchups in the previous four tournaments, and the second evaluates the upcoming 2015 tournament. Historic data is provided in the form of game-by-game regular season and NCAA tournament data dating back to 2003, with the option of integrating external data as well. This paper examines the use of machine learning techniques through big data technologies to create these predictions. Everything discussed in this paper is based on the first submission.

Big Data Technologies Discussion

Big data technologies have disrupted organizations across industries, and in recent years have begun gaining widespread traction. In its early stages the Hadoop ecosystem, which facilitates massive data processing and evaluation through parallel computing, was seen as a necessity largely for internet-based companies that collected huge amounts of data and needed to extract and evaluate that data (Facebook, Amazon, Yahoo, etc.). In recent years, however, Hadoop and other big data technologies have gained adoption across more traditional industries. Big data is no longer viewed as a niche operation; instead, it has become a necessity to remain competitive across numerous industries.

Hadoop

Hadoop is an open source framework for writing and running distributed applications that process massive quantities of data (Lam, 4). One of the major advantages distributed processing has over traditional relational database management systems is horizontal scalability. Relational databases scale vertically, typically restricting storage to one machine, which becomes problematic when data grows beyond the capability of that machine. Distributed processing scales horizontally on a cluster of commodity machines, with the cluster size determined by the user based on need. Another advantage of Hadoop is fault tolerance: data is replicated across multiple nodes spread out across different servers, so if one node goes down the data remains intact. In addition, Hadoop is built for unstructured data, as it does not require a predefined schema in advance of loading data, unlike relational databases. In summary, Hadoop is highly scalable, fault tolerant, and built for semi-structured or unstructured data, making it an excellent complement to relational databases, which are still preferable for smaller datasets with a defined schema.

Hadoop Ecosystem:

When Hadoop was new, the only way to use it was to write MapReduce code, which is difficult to program and requires Java knowledge (Lam, 212). Programs like Pig and Hive have opened up Hadoop to people who do not know how to program MapReduce by offering an easier-to-understand format. Pig is a procedural language that allows users to supply high-level data transformations and to control the number of reduce tasks using the PARALLEL clause (Lam, 231). Hive is a declarative language that is closely modeled on SQL and, like Pig, allows users to access the Hadoop ecosystem without having to deal with the complexities of MapReduce code (Lam, 247). These programs have opened up the Hadoop ecosystem to the masses, as most business analysts and programmers have some familiarity with either procedural or declarative languages, and thus are able to use either Pig or Hive.
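To make the MapReduce programming model described above more concrete, the following is a minimal single-process sketch in Python rather than Hadoop's Java API. It is illustrative only: the record fields (wteam, lteam, wscore, lscore) and all function names are assumptions introduced here, not taken from the paper or from Hadoop. It computes each team's average point differential with explicit map, shuffle, and reduce steps.

```python
from collections import defaultdict


def map_game(game):
    """Map step: emit (team, point differential) pairs for one game record."""
    margin = game["wscore"] - game["lscore"]
    yield game["wteam"], margin    # the winner gains the margin
    yield game["lteam"], -margin   # the loser gives it up


def shuffle(pairs):
    """Shuffle step: group all emitted values by key (team)."""
    grouped = defaultdict(list)
    for team, margin in pairs:
        grouped[team].append(margin)
    return grouped


def reduce_team(team, margins):
    """Reduce step: collapse one team's margins into an average."""
    return team, sum(margins) / len(margins)


def run_job(games):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for game in games for pair in map_game(game))
    return dict(reduce_team(team, margins) for team, margins in shuffle(pairs).items())


# Hypothetical game records, not real results.
games = [
    {"wteam": "A", "lteam": "B", "wscore": 70, "lscore": 60},
    {"wteam": "B", "lteam": "A", "wscore": 65, "lscore": 61},
]
averages = run_job(games)
```

On a real cluster the map and reduce steps run in parallel across nodes and the framework performs the shuffle. In Hive, an aggregation of this kind is typically written as a single GROUP BY query, which is the convenience the paragraph above describes.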
Machine Learning with Hadoop:

There are also programs that allow for machine learning and data mining on top of the Hadoop platform, such as Apache Mahout and H2O. In this project, models were built by running R on top of H2O; the details of the models and the H2O environment are discussed at length in later sections of this paper.

II. DATA COLLECTION, MANIPULATION & EXPLORATION DETAILS

Prior to modeling, numerous transformations and additions were made to the dataset. First, information was obtained from the college basketball website KenPom.com, which contains efficiency statistics for college basketball teams extending back to [year omitted in source]. These efficiency statistics include:

- Adjusted Offense: offensive efficiency relative to the number of possessions each team has. This takes scoring efficiency into account, rather than just points per game, since teams play at different speeds and get different numbers of opportunities per game.
- Adjusted Defense: defensive efficiency relative to the number of possessions each team has. The same explanation as for offense applies.
- Adjusted Tempo: measures the speed at which teams play. The purpose of including it in the model is to see whether faster or slower teams have an advantage.
- SOS Difference: strength of schedule, which measures the quality of competition for each team (KenPom.com, 2).

All of these statistics should be extremely valuable in distilling insights about the teams, and are complementary to the data provided by Kaggle, which was also transformed as detailed below:

- Location: whether the winning team was playing at home, away, or on a neutral court.
- Offensive Rebound Differential: average offensive rebound differential per game for a team, calculated the same way as average point differential.
- Defensive Rebound Differential: average defensive rebound differential per game for a team, calculated the same way as the others in this section.
- Turnover Differential: average turnover differential per game for a team, calculated the same way as the others in this section.
- Three Point Reliance: percentage of a team's points scored from three-pointers.
- True Shooting Percentage: true shooting percentage for a team. True shooting percentage is a more accurate version of shooting percentage, as it weights shots based on their value (e.g., three-pointers are more valuable).
- True Shooting Percentage Allowed: true shooting percentage allowed by a team. Same as above, but measuring what the team allows rather than what it shoots.
- Block Differential: average difference in blocks per game for a team vs. its opponents.

Dataset Summary:

Each instance in the dataset contains all of this information for the winning and losing team in each NCAA regular season and tournament game since [year omitted in source]. Game-by-game statistics were provided, but all of the statistics listed above were rolled into season averages for each team for use in the tournament predictor. This allows for predictive evaluation of teams; using game-by-game statistics as predictors is not helpful, since it is impossible to know a game's statistics before the game is played. All factors are expected to have a positive relationship with winning apart from adjusted defense, true shooting percentage allowed, and turnover differential, where in all three cases the goal is to have lower figures than opponents. The objective of this competition is to minimize log loss, so a binary win/loss indicator was created; however, a continuous outcome variable could also be created using the margin of victory for each team in each game.

Modeling Details

The individual game predictors for this study were season averages for each team. The model must be run on season averages because, as discussed, in-game statistics are not available prior to the game.

Hive Analysis:

Prior to modeling, data was uploaded into Hive to get a better understanding of how the variables relate to the outcome variable. Prior to evaluation, all factors were converted to z-scores based on yearly averages for flexibility in evaluation and modeling. In each game evaluation, the difference in z-scores between the winning and losing team is the indicator for each variable. The above chart shows the average z-score for teams that won vs.
teams that lost, and the code above it is what was used to generate the report. Teams that won are represented by the binary variable in the newmm.binarymargin column: 1 indicates a win, 0 indicates a loss. Unsurprisingly, teams that have an above-average offensive efficiency rate and a below-average defensive efficiency rate (allowing fewer points per possession) win more frequently. Tempo is very close to zero, indicating there isn't much of a relationship between pace of play and winning in a vacuum, while strength of schedule has a positive relationship, which makes sense because better teams play harder schedules.

Let's take a look at the remaining statistics, all from the provided Kaggle database:

All of these results are also intuitive: teams that shoot better and force their opponents into worse shooting percentages win more often. Defensive rebounding and true shooting had the highest z-scores associated with winning teams. The difference between offensive and defensive rebounding is particularly interesting, as it appears defensive rebounding is much more important.

Finally, an analysis was run to see how these figures look when taking margin of victory into account. Presumably, numbers would be more pronounced in either direction because margin of victory takes into account the disparity between the two teams, rather than just a binary win or loss.
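The z-score features used throughout this Hive analysis can be sketched as follows. This is a minimal Python illustration, not the author's Hive query: the field names (season, team, adj_offense), the function names, and the use of the population standard deviation are all assumptions made here. Each statistic is standardized within its season, and a game's indicator is the difference between the two teams' z-scores.

```python
from collections import defaultdict
from statistics import mean, pstdev


def season_z_scores(team_stats, stat):
    """Return {(season, team): z-score of `stat` within that season}."""
    values_by_season = defaultdict(list)
    for row in team_stats:
        values_by_season[row["season"]].append(row[stat])

    # Mean and population standard deviation per season (an assumed choice).
    moments = {
        season: (mean(values), pstdev(values))
        for season, values in values_by_season.items()
    }

    z_scores = {}
    for row in team_stats:
        mu, sigma = moments[row["season"]]
        # Guard against a season in which every team has the same value.
        z = (row[stat] - mu) / sigma if sigma > 0 else 0.0
        z_scores[(row["season"], row["team"])] = z
    return z_scores


def game_feature(z_scores, season, winner, loser):
    """Winner-minus-loser z-score difference for one game."""
    return z_scores[(season, winner)] - z_scores[(season, loser)]


# Hypothetical season averages for three teams in one season.
team_stats = [
    {"season": 2014, "team": "A", "adj_offense": 100.0},
    {"season": 2014, "team": "B", "adj_offense": 110.0},
    {"season": 2014, "team": "C", "adj_offense": 120.0},
]
z = season_z_scores(team_stats, "adj_offense")
diff = game_feature(z, 2014, winner="C", loser="A")
```

Standardizing within each season keeps a statistic comparable across years even if league-wide scoring or pace drifts, which matches the stated motivation of using yearly averages.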

The margin-of-victory chart validates the importance of the adjusted offensive and defensive efficiency statistics. As the margin of victory increases (far left column), the adjusted offensive efficiency z-score increases, while the adjusted defensive efficiency z-score decreases. There are minor inconsistencies, but in general there is clearly a linear trend for both variables. That is outlined by the graph below, which was generated using ggplot2 in R after exporting the Hive results to CSV.

The higher margins have larger variation because there are few instances of the higher margins; however, the trends for both adjusted offensive efficiency and adjusted defensive efficiency are clear, which is promising in terms of these variables being strong indicators of the outcome in the models.

As seen in the code below, this pull was filtered to only include double-digit victories because there is less randomness when the margin of victory increases. The purpose of this chart is to confirm that the expected trends exist, so looking at the cases where randomness is less of a factor is useful.

Using the same premise, true shooting percentage was evaluated relative to true shooting percentage allowed:

In this case, the same logical relationship emerges, but not as strongly as seen with overall offensive and defensive efficiency. This again makes sense because shooting is only a portion of a team's performance on both offense and defense, so overall offense and defense have a stronger relationship with the outcome than true shooting percentage and true shooting percentage allowed in isolation.

III. MODEL BUILDING & EVALUATION

Modeling was completed using the H2O integration with R running on top of the Hadoop cluster installed on a Cloudera virtual machine. To do this, H2O and R (plus RStudio) must be installed on the virtual machine, the H2O package must be installed within R, and the H2O connection must be configured with the below code. The great thing about this integration is that R acts as the interface for H2O, giving users with a background in R the ability to work in R while retaining the power of distributed computing that H2O provides. While the interface may be R, the objects, math, and computing are all handled by H2O, and to reach the full power of H2O it must be run on a server, or else the environment will become unstable (Lang, 1). H2O must be running prior to initiating the H2O package within R, or the package will not be able to load.

After configuration, training and testing files were imported into the R environment using the below code (summary used to ensure the data looks in order):

Logistic Regression:

After this step, a logistic regression model was built using the H2O GLM function, including all independent variables in the file and 75% of the instances from the regular season file as the training set (Fu, Aiello, Rao, Wang, Kraljevic, Maj 46). Below is the code, as well as the results from the model:
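The author's H2O code and output are not preserved in this text. As a stand-in, the following plain-Python sketch shows the general technique: a logistic regression fit by batch gradient descent, scored with log loss (the competition metric the paper also calls predictive binomial deviance). It is not the H2O GLM call. The two features (offensive and defensive z-score differences), the toy data, and all function names and hyperparameters are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Predicted probability that the first team wins."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on mean log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(weights, bias, x) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# Hypothetical (offense z-diff, defense z-diff) pairs; label 1 = first team won.
rows = [(1.2, -0.8), (-1.2, 0.8), (0.5, -0.3), (-0.5, 0.3), (0.9, 0.1), (-0.9, -0.1)]
labels = [1, 0, 1, 0, 1, 0]
weights, bias = fit_logistic(rows, labels)
probs = [predict_proba(weights, bias, x) for x in rows]
score = log_loss(labels, probs)
```

On this toy data the offense weight comes out positive and the defense weight negative, the same sign pattern the paper reports for its fitted coefficients. A model that always predicts 0.5 scores log(2), about 0.693, so any useful model should score below that.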

Logistic Regression Evaluation:

These results show strong overall performance (as seen by the high AUC); however, the model seems to over-predict victories relative to losses, as seen by the low threshold figure as well as the disparity of false to true predictions. The coefficients are logical based on the data exploration done using Hive, as offense is highly positive and defense is highly negative. It does appear that many factors are unimportant, so there is potential to rerun the model with fewer variables, but in the interest of time that was not done in this analysis. Running the model on the test set produced very similar results to training.

The above chart shows the tendency to predict teams to win. Notice that there are more missed upsets than correct predictions when the margin of victory is very low (1-4); a negative margin of victory with a missed upset indicates a team that lost and was projected to win.

Next, the model was run against a different test set: historic NCAA tournament results. Major differences between performance on this test set and on the regular season test set would indicate that there are significant differences between regular season and postseason play in predicting the outcomes of games. Notable differences between the regular season and the tournament are that all tournament games are played on neutral courts, the quality of each team is better in the tournament, and there are far fewer observations in the tournament file. Here are the results from the tournament test set:

The results look very similar to the regular season test set: a tendency to over-predict wins relative to losses but, in general, fairly accurate.

Finally, predictions were run on the submission dataset. The code is below; there is no outcome evaluation possible because the outcomes are not provided:

Unfortunately, the submission was uploaded after the competition deadline due to time zone confusion; however, it would have ranked 22nd out of 347 as scored on the predictive binomial deviance.

Random Forest Classifier:

Next, a random forest classifier was built and its performance was compared to the original logistic regression model (Fu et al. 83). The random forest model is shown below. The regular season test set and the tourney test set were both used to validate the model, and overall results were inferior to the logistic regression model despite trying different parameters.
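The author's H2O random forest code and output are not preserved in this text. The sketch below is a small plain-Python stand-in showing the general technique, not the H2O implementation: each tree is grown on a bootstrap sample, considers a random subset of features at every split, and the forest averages the trees' leaf win rates into a probability. The toy data, function names, and defaults (n_trees, max_depth, seed) are assumptions made for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels, features):
    """Return the (feature, threshold) with lowest weighted Gini, or None."""
    best, best_score = None, float("inf")
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth, max_depth, n_features, rng):
    """Grow one tree; a leaf stores the fraction of wins among its rows."""
    win_rate = sum(labels) / len(labels)
    if depth >= max_depth or win_rate in (0.0, 1.0):
        return win_rate
    # Random feature subset at each split is what makes the forest "random".
    features = rng.sample(range(len(rows[0])), n_features)
    split = best_split(rows, labels, features)
    if split is None:
        return win_rate
    f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    grow = lambda part: build_tree(
        [r for r, _ in part], [y for _, y in part], depth + 1, max_depth, n_features, rng
    )
    return (f, threshold, grow(left), grow(right))


def predict_tree(node, row):
    """Walk from the root to a leaf and return that leaf's win rate."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def fit_forest(rows, labels, n_trees=25, max_depth=3, seed=0):
    """Fit n_trees trees, each on a bootstrap sample of the training rows."""
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(build_tree([rows[i] for i in sample], [labels[i] for i in sample],
                                 0, max_depth, n_features, rng))
    return forest


def predict_proba(forest, row):
    """Average the trees' leaf win rates into one probability."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Hypothetical grid of two z-score differences; label 1 when their sum is positive.
rows = [(float(a), float(b)) for a in range(-3, 4) for b in range(-3, 4) if a + b != 0]
labels = [1 if a + b > 0 else 0 for a, b in rows]
forest = fit_forest(rows, labels)
```

Averaging leaf win rates yields probabilities rather than hard class votes, which is what a log loss style metric needs. Tree depth is the main lever on overfitting in a sketch like this.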

Unsurprisingly, the random forest better predicted the regular season dataset (testfile), which is much larger and more representative of the training set than the tournament test set (tourney). The submission prediction was run, and its predictive binomial deviance was also worse than that of the logistic regression model, most likely due to overfitting (too many trees or too much tree depth).

GBM Model:

The final model run was the h2o.gbm model (Fu et al. 39). GBM stands for gradient boosting machine; in this example it builds gradient boosted classification trees, creating numerous decision trees in sequence, each fit to the residuals of the trees before it. The results are as follows:

Ultimately, the GBM predictor was outperformed by the original logistic regression model, but only slightly. Surprisingly, the GBM tourney test set had a higher AUC than the regular season test set, unlike the random forest model. With more time and effort, the GBM model has the potential to improve significantly, and it is ultimately the most promising algorithmic solution.

IV. CONCLUSION

Overall, the power of big data technologies is expansive, and the ability to run complex and sophisticated analyses on massive datasets represents a tremendous opportunity globally. With very little manipulation, the H2O machine learning engine was able to produce an extremely competitive and actionable model in a very short period of time. H2O is an extremely powerful technology that brings the modeling capabilities of R to a much greater data scale. Currently, the algorithms available in H2O are not as expansive as the library in R; however, the gap will continue to narrow as H2O evolves as a platform. In conclusion, big data technologies are no longer restricted to just a few individuals and organizations. Languages like Pig and Hive, and open source platforms like Cloudera and H2O, have opened up distributed computing to the masses, and as the Hadoop ecosystem continues to gain traction and represent a competitive advantage in the marketplace based on insight extraction, the importance of big data will continue to grow.

REFERENCES

Fu, Anqi, Spencer Aiello, Ariel Rao, Amy Wang, Tom Kraljevic, and Peter Maj. H2O R Interface. February 2, <project.org/web/packages/h2o/h2o.pdf>.
Lang, Irene. Running a GLM Model in H2O + R (notes from the hands-on meetup, Sept. 26). September 27, <0xdata.com>.
Pomeroy, Ken. Ratings Glossary. June 8, <kenpom.com>.
Lam, Chuck. Hadoop in Action. Stamford, CT: Manning Publications Co. Print.