Predict NBA Game Pace ECE 539 final project Fa Wang

Transcription

1 Predict NBA Game Pace ECE 539 final project Fa Wang Abstract Long time ago, people still believed that having all the best players on the court could guarantee winning of a game or even a champion title. Recently, more historical data proved that a winning team might not have a highest score on offense and lowest score on defense. To evaluate a quality of a team, an index of pace value is introduced. The goal of our project is to predict pace value of a NBA game by analyzing a large amount of historical data on the Internet. To do that, we first extract key features from play-by-play data and game stats data with map reduce pattern. For sake of large amount of historical play-by-play data, we take advantage of HBase[1] to store and process our data. Different from any traditional standalone Machine Learning theory designs, the paper would be more focusing on applying map-reduce pattern along with Machine Learning method to solve big data problem. By examining the final result s mean-square error (MSE), we can judge how far our prediction is away from the real value as to evaluate our model s correctness and accuracy. I. INTRODUCTION To predict NBA game score has been a heated topic for a long time. In recent year, people turn to a possession based score model, which has proved to provide more accurate statistical result. Later on, people introduced a new index to evaluate performance of a team for a certain game. The index is called Pace value. [2] According to basketball-reference, pace factor is an estimate of the number of possessions a team uses per game. Understanding opponent's pace value allows coach to better preparing a game as pace factor is closely related to score of a game, the formula of Score is given as: Score = ASPM * Pace. ASPM[3] denotes points / possession and Pace is the metric for possession. That means an accurate prediction of pace value will result in a precise final score. In our paper, we are trying to merge play-by-play (PBP) data and each game s stats data to predict future game s pace value. The challenge of predicting game's pace is obvious. Different

2 styles of defense and offense strategies may directly influence the pace of a game, especially during clutch time when a game comes down to a few processions; different teams may use the clock differently to defeat their opponents. Also an older team may be more likely to spend longer time on each offensive procession than a younger team. Factors of pace value may even depend on players' health and attitude. As a result, Pace factor is nonlinear and difficult to predict with a small set of data. II. MOTIVATION To precisely predict a future NBA game s result is a tough but appealing task to both NBA fans and non-fans all the time. Recent research shows that a precise game pace value is the key factor, so our team decide to use this sub-topic to take a trial. With a precise pace evaluation of the team, we can apply it to all of its players so that each player can be correctly tagged with a pace capability. Faster team may choose faster player so that our assessment can be taken as a toolbox for team to select desired player. Also NBA game bet is another appealing target for our research. III. RELATED WORK There are many researches on the topic of pace value. The paper [4] examines the optimality of the shooting decisions of National Basketball Association (NBA) players using a rich dataset of 1.4 million offensive possessions. The decision to shoot is a complex problem that involves weighing the continuation value of the possession and the outside option of a teammate shooting. To apply our abstract model to the data we make assumptions about the distribution of potential shots. In line with dynamic efficiency, we find that the cut threshold declines monotonically with time remaining on the shot clock at approximately the correct rate. Most line-ups show strong adherence to allocative efficiency. We link departures in optimality to line-up experience, player salary and overall ability. Paper [5] also talks about the how the teams trade of the value of controlling the length of the games. The author uses the 859 games played by the 30 teams in the NBA during regular season to get the result of the leading team tried to decrease the pace but the trailed team tried to increase the pace when the game is tend to over. The author

3 also concluded that the leading team got fewer points in the last few minutes of the game comparing to the 3rd quarter. But whether the player s decision would increase or decrease the pace, given the strategy of their opponent should be done more investigation. Bigtable Bigtable[6] data structure is a key component of the project and main stream of nowadays big data analytic. A paper Bigtable: A Distributed Storage System for Structured Data has a thorough overview of the design and usage of the data structure, from index key variables, data compression, implement to Google Earth example. The reason we consider Bigtable is a big help for our project is the defined data structure is sparse and distributed through all clusters. The majority operations for our project are mapping and reducing, the Bigtable with indexed by row key, column key and timestamp key can help us retrieving data across multiple clusters efficiently. HBase. Performance Management huge amount of data is a nightmare to engineers, it requires zero fault-tolerance to the system and the program. Processing a large data efficiently and safely is also a key part of the project. MySQL has been a reliable ACID storing platform for the last decent until the introduction of big data analytic. A relative paper Solving Big Data Challenges for Enterprise Application compares the workload performance of MySQL and HBase.[7]. The result of their experiment shows that MySQL has a very good scan throughput for a single node, but it doesn't scale with the number of nodes as data size increase. On the other hand, HBase could be scaled to obtain a linear increase in throughput with the number of nodes. Another reason for using the HBase is due to increase the speed of program execution, in the traditional way, data is copied and run on local machine. The paper Map Reduce: Simplified Data Processing on Large Clusters [8] talked about the locality on large clusters environment. The Map Reduce master takes the location information into account and schedules map task near a replica task location.

4 IV. DESIGN Fg1. Program Diagram The design of the program has three main phases. According to the design diagram above, first data source retrieval phase, a large set of HTML play by play files would be interpreted and stored into a HBase, alongside with Game status data source. Second, Features generation phase, map-reduce would be used to combine two data sources in the HBase and create 20 features for computation. Third, data training phase, since the data size has been reduced in phase 2, we could use the support vector regression (SVR) algorithm to train the data set in an old fashion way. Data source retrieval There are two types of data source used in this paper. The first data source is called PBP (Play By Play) data which is retrieval from the NBA website. The PBP data records possession of ball, shot attempt and possession turnover at certain moment of each NBA game. The PBP data used in this paper consists of 4000 games where each game's PBP record is presented in a HTML format on the NBA website. The total PBP data size is about 2.76GB, each of PBP record has two tables that are useful to our project, one is the abbreviation of the home team and the away team, the other one is the detailed of PBP data.

5 To reduce size and improve accuracy, a JAVA preprocessor program is created to translate the raw data and eliminate noise from the website. After preprocessing, 5000 PBP records are inserted into HBase table as html format string for features generation phase. Game status data is another data source which contains more static information about each team's condition, such as pace value of the current game, and how long has the home team be resting before the current game. Parts of Game status data can be computed after map-reduce process during the features generation phase. gameid Column <html>...</html> Features generation With 5000 html file entries store in the HBase, now we could implement the map-reduce function. The input to a mapper function is a specification with a pair of gameid and column value in the HBase table. In the mapper function, each html file would be translated into home team attack time and home team defense time. Table2 is a sample of PBP html record, to compute home team attack time, we can subtract Time value from row one to row two. And similar process can be done for computing the defense time. After Mapper function, a pair of gameid and a value features (set of attack time and defense time) would output to the reduce function. Time New York Score Indiana 11: L. Stephenson makes 2-pt shot from 1 ft 11:15.0 Shumpert makes pt shot from 23 ft(assist by C Anthony) 10: Personal block foul by I. Shumper

6 The reduce function is called once for each gameid. In our case, there has no real assignment for the reduce phase, only task for the reduce function is to output data from the mapper function directly into a database alongside with the pace value of the current game. The output of mapper phase has extracted features from the raw data and reduced the data size heavily. With a set of base features for each gameid entry, more features could be computed during post map reduce phase. Average home attack time for the last 5 games could be computed by iterating through the last 5 records. Average of game pace value could be computed in a similar fashion, etc. A complete half of features list is shown in Table3, second half of features will be same except that it's for away team. Here is list: Features Description gameid ID to identify each game. HomeTeam Home team name Avg1 Home Average of home defend time team's defend time for all the games before the current game. Avg2 Home Average of home defend time team's defend time for all the games this season. Avg1 Home Average of home offense time team's offense time for all the games before the current game. Avg2 Home Average of home offense time team's offense

7 Avg1 Home Pace Avg2 Home Pace Std1 Var Home Pace Std2 Var Home Pace Avg3 Home Pace Avg5 Home Pace time for all the games this season. Average of home team's pace value for all the games before the current game. Average of home team's pace value for all the games this season. Standard variance of home team's pace value for all the games before the current game. Standard variance of home team's pace value for all the games this season. Average of home team's pace value for last 3 games before the current game. Average of home team's pace value for last 5 games before the current

8 Avg7 Home Pace game. Average of home team's pace value for last 7 games before the current game. The first step for calculating features is reading data from Hbase and stores it into a container. Because features are calculated according to Game ID and team name, we need to search data and get it when we use it. Hash Map is selected as the container because searching is really fast if the key is known in a Hash Map. Three Hash Maps are needed. The first one is used to store the imported data. Game Id is the key, then store all the other data such as average attack time, defend time, actual day since start, rest day and actual pace into an array and the array is the value of the Hash Map. The most important thing for the first Hash Map is that team name should also be an element in the array because we need to calculate features not only according to Game Id, but also team name. The first value of the array is the home s name and the first value of the second half of the array is the away s name. The second Hash Map is used to store intermediate data, team names are considered as keys, store other data such as the sum of average attack time before the day considered and the sum of the average defend time till this day considered, the number of game the team attended and so on. The third Hash Map is used to store the final result. And Game Id is considered as the key, all the features are stored in an array and the array is considered as the value of the Hash Map. The second step is the Calculating. The average attack time is calculated by dividing the sum of attack time by the number of games it attended. For average defend time and average pace value is using similar way. For variance, it s calculated by average the square of the difference between each pace and average pace. For the rest day of each team, there are eight forms. Eight features are added to each team. The team s corresponding feature will be 1 and other seven features will be 0. All of the data is stored into the third Hash Map directly. Game Id is keys and array stores these features are values.

9 Data training We decided to use Support Vector Regression (SVR) as our main learning method for this project. As what we discussed in the previous session, the pace value for each game is difficult predict, because there are multiple non-linear factors, such as player health condition, team chemistry, and coach's game plan. These factors are unpredictable. Therefore Support Vector Regression (SVR) is the primary algorithm for the data training phase. To understand this model, let's start looking at the most elementary algorithm linear regression which is to minimize the quadratic cost function(1). In linear regression, we can solve the equation by finding the optimal vector w. There is another problem of the equation (1) even with a set of linear training data. The problem is called over-fitting which appears when the function performs perfectly well only with the training set but poorly with unknown set of data. To avoid this problem, a penalized term of w is introduced to the equation (2). With the penalized form of linear equation, now we could develop a non-linear form of equation by adding basic function to move existing vectors into a higher dimension. The idea is very simple, if we could not predict a pace value accurate with an existing features, we should simply add more features! Basic function is a mapping function from finite vectors to finite/infinite vectors. Before creating basic function, we could represent equation (2) to (3). The basic function is applied to convert xi to Bi = B(xi).

10 with some algebra tricks. The final form of w is derived. Noted that B is in form of vector with feature sizes. Now substitute 4 to the original equation, we get the form 5 which is the final form of the equation. We know that inner product of basic function can be evaluated by a kernel function. In fact, the kernel function that we used in this project is called Radial Basis function which move data from finite dimension to infinite dimension. The way to find optimal parameters for the algorithm is through quadratic optimization. The complete derivation and optimization of the algorithm can be found in the paper by Alex J. Smola[10] We slightly change our feature file by appending a certain number ahead as required by the SVR input format. Then we split the features file into two parts, one from year 2009 to 2012, one of year The former one is used as training data and the latter one is used as testing data. The training data is used to predict the outcome and the testing data is used to evaluate the estimate. We use mean-square error to evaluate our prediction. That is, the variance of the error between estimated result and real result. Until now, the best team in the world does it in 14.2, so we set our goal to be 16 or less. As a matter of fact, we follow the principle: the less the better. Then, we have to set the SVR core and mode to build the most likely model for our data. As our data of features do not fall into 2 or certain category, we use regression core instead of classification core. As for the mode, there are 2 main choices, one is linear fit and the other is Gaussian fit. We choose to test both of them and compare the outcome. V. RESULTS Since choosing representative features is the key part to achieve a good result, the first step of our experiment is to filter a set of features that best describe the target game. We consult industry expert to generate tens of candidate features that may contribute to our system. Then we draw visual graph like in Fg2 to see which of them may be more important.

11 Fg2. Feature Selection From charts shown above, we can easily find out that pace is highly related to features like attach/defense time but less sensitive to pace standard deviation. Since extra feature will not have too much bad influence on our system, so we don t go deep in proving the reliability or correctness of those chosen features. We just let the final prediction result to tell which kind of features is positive meanwhile others are not. Prediction Accuracy We ve run through bunch of combination of features to perform the final prediction. We take use of MSE (Min Square Error) to reflect the accuracy of our prediction. MSE is a classic metric to evaluate the average error of the system.

12 MSE Linear Kernal Gaussian Kernal 15.4 Feature Set 1 Feature Set 2 Feature Set 3 Fg3. MSE result Seen from Fg3, we can figure out that with more complex combination of features, the MSE can be reduced to a certain degree. Pace is with about 100, so 16 MSE means that the average error is about 4%. VI. Future Work Evaluating pace value of a NBA game is not an easy task in general. Despite with the current play-by-play data we have collected from the website, reaching the real meaningful and analytical data that truly describe a game is still challenging. The play-by-play data is an abstract description of a game at certain time frame; we could not know exact context and motivation of the each play. Another limitation may be related to a team reasons strategic. In our current model, we treat each game equally. In reality, to get a better place and shape in post-season, some teams may have different game plans throughout a season. For example, some teams may slow down their paces after securing their playoff positions. Players, coaches and training staffs may also different from one year to another. As a result, historical data may not be impacted to the current team's pace value. However, these more dynamic factors are not taking into account of the project. In future, new parameter could be introduced to characterize this kind of impacts. From a technical point of view, we may

13 require more data to improve the prediction accuracy. However, from paper by Alex J. Smola[10], we realize that SVR is a very expansive operation which may take quadratic time to solve a optimization equation. When data size reaches the bottleneck level, training phase in our design would be a time consuming process. One way to solve this problem is to add another layer of map reduce phase to highly compress the raw data, also we could migrate the training phase into another map reduce task to solve the Newtonian equation. VII. Conclusion We have successfully accomplished the objective that we proposed in the early stage. Each design phases truly allows us to practice and develop skills to deal with big data. Using Map Reduce and HBase to compress and extract features from 2.1GB data down to a few MB is really a key component to our project. It would be very inefficient to train and analyze a huge data size with a sophisticated algorithm like SVM. Furthermore, studying big data is really a topic with focusing on revealing favorable characteristic data from tons of data, but not just a huge data with dummy noise. Lastly, with that in mind, our key features describe the behavior of pace value of each game as it shows in our result. Acknowledgment I would like to acknowledge the help and support of my friends: Jiawei Dong, Cai Qi. References [1] Shoji Nishimura: MD-HBase: A Scalable Multi-dimensional Data [2] NBA Basketball reference: CA, May [3] Historical NBA ASPM: DStats/2013/nba-stats/historical-nba-aspm-and-hall-rating-released/ CA, May [4] Matt Goldman and Justin M. Rao. Tick-Tock Shot Clock: Optimal Stopping in the NBA.

14 Fall [5] Tim Xin: The Value of Pace In the NBA, Spring 2012 [6] Fay Chang: Bigtable: A Distributed Storage System for Structured Data, [7] Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier: Distributed Semantic Web Data Management in HBase and MySQL Cluster [8] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. CA,2004 [9] Hao Helen Zhang and Marc Gento: Compactly Supported Radial Basis Function Kernels. [10] Alex J. Smola and Bernhard Scholkopf: A Tutorial on Support Vector Regression.September 30, 2003