Data Mining in Sports Analytics
Salford Systems
Dan Steinberg, Mikhail Golovnya
Data Mining Defined
Data mining is the search for patterns in data using modern, highly automated, computer-intensive methods.
Data mining may be best defined as the use of a specific class of tools (data mining methods) in the analysis of data.
The literature often refers to finding hidden information in data.
Salford Systems, 2012
Uses of Data Mining
[Figure: diagram labeled "DATA MINING"]
Long Live the King (Your Data)
The analyst asks the right questions but makes no assumptions.
The success of data mining depends solely on the quality of the available data.
Garbage in, garbage out.
The Essence of Machine Learning
In a nutshell: use historical data to gain insights and/or make predictions on new data.
Data in Sports Analytics
Any game is the ultimate and unambiguous source of quality data.
This is very different from data availability and quality in other areas of research.
However, there is no universal agreement on the best way to organize and summarize the results in numeric form.
A large number of game statistics are available.
Common sense and the rules of the game are at the core.
There are heated debates on which stats best describe the potential for a future win.
Baseball Stats
Available from many sources, including the Internet.
Player level: performance summarized over a season, post season, and entire career.
Team level: wins and losses.
Game level: the most detailed.
Baseball Databases
Widely known public database.
Gathers baseball stats all the way back to 1871.
We will use parts of it to illustrate the potential of data mining.
Typical DM Problem
Focus on 2010 versus 2011 regular season performance in both leagues.
We have access to the player stats for the entire season, organized in a flat table.
Define a measure of overall player success simply as the player's team winning its division.
Thus 6 of the 30 participating teams in 2010 are declared a success.
Question: which of the player stats were associated with the team winning the division?
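The target definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the field names (`player`, `team`, `WIN_DIV`), the toy rows, and the set of winning teams are all hypothetical and are not taken from the database used in the slides.

```python
# Hypothetical flat table: one row per player-season.
# Field names and values are illustrative assumptions, not real data.
players = [
    {"player": "A", "team": "T1", "HR": 31, "SO": 140},
    {"player": "B", "team": "T2", "HR": 12, "SO": 80},
    {"player": "C", "team": "T1", "HR": 5, "SO": 45},
]

# Hypothetical set of division winners (the slides have 6 of 30 teams in 2010).
division_winners = {"T1"}


def add_target(rows, winners, target="WIN_DIV"):
    """Return copies of the rows with a binary target:
    1 if the player's team won its division, 0 otherwise."""
    return [{**row, target: int(row["team"] in winners)} for row in rows]


labeled = add_target(players, division_winners)
for row in labeled:
    print(row["player"], row["team"], row["WIN_DIV"])
```

Every player on a winning team gets the same label, so the target describes the team outcome rather than individual merit; that is the simplification the slide makes deliberately.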
Core Stats

Batting Stats
Name  Description
AB    At Bats
R     Runs
H     Hits
2B    Doubles
3B    Triples
HR    Home Runs
RBI   Runs Batted In
SB    Stolen Bases
CS    Caught Stealing
BB    Base on Balls
SO    Strikeouts
SF    Sacrifice Flies
HBP   Hit by Pitch

Derived Stats
Name  Description            Formula
AVG   Batting Average        H/AB
TB    Total Bases            B1 + 2×2B + 3×3B + 4×HR
SLG   Slugging               TB/AB
OBP   On Base Percentage     (H+BB+HBP)/(AB+BB+SF+HBP)
OPS   On Base Plus Slugging  OBP + SLG

Many more exist.
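The derived-stat formulas in the table translate directly into code. The sketch below assumes that B1 denotes singles, computed as H - 2B - 3B - HR, and that a rate with a zero denominator is reported as 0.0; the function name and the sample stat line are hypothetical.

```python
def derived_batting(s):
    """Compute AVG, TB, SLG, OBP and OPS from a dict of core batting stats."""
    # Assumption: B1 (singles) is hits minus extra-base hits.
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]

    ab = s["AB"]
    obp_den = ab + s["BB"] + s["SF"] + s["HBP"]

    # Assumption: report 0.0 when a denominator is zero.
    avg = s["H"] / ab if ab else 0.0
    slg = tb / ab if ab else 0.0
    obp = (s["H"] + s["BB"] + s["HBP"]) / obp_den if obp_den else 0.0

    return {"AVG": avg, "TB": tb, "SLG": slg, "OBP": obp, "OPS": obp + slg}


# Made-up season line, for illustration only.
season = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20,
          "BB": 50, "SF": 5, "HBP": 5}
stats = derived_batting(season)
print({k: round(v, 3) for k, v in stats.items()})
```

Because every derived stat is a deterministic function of the core stats, a flexible model fed only the core stats can in principle rediscover such combinations on its own.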
Conventional Statistical Approaches
This is how the problem is usually attacked.
Each dot represents a single batter record for the whole 2010 season.
1245 records overall, 16 core stats.
Batters on winning teams are marked in red.
No obvious insights!
Unique Personalities
Starting with CART in 1984, laid the foundation for tree-based modeling techniques.
Conduct a deep look into all available data.
Point out the most relevant variables and features.
Automatically identify optimal transformations.
Capable of extracting complex patterns going way beyond the
TreeNet Model on Core Stats
Key Findings
Six core batter stats were identified as most predictive.
15-20% of the total variation can be directly associated with the batter stats.
The single-variable plots show the non-linear nature of many of the relationships.
Fine plot irregularities should be ignored.
Striking result: in the 2010 season, HR above 30 is associated with losing the division!
The 2011 season looks fine.
Proceed by digging into the pair-wise contribution plots.
Surprise: 2010 HR Leads to Division Loss!
Comments on Batting
3D dependency plots further highlight the rather unusual HR finding for the 2010 season.
It is well known that batters aiming for a home run have a higher number of strikeouts.
This is supported by both graphs.
However, in the 2010 regular season the HR-centered approach led to defeat!
Compare with Conventional Plot
This plot shows two performance stats plotted directly from the data table.
Note the difficulty of visually discerning the identified HR x SO pattern.
Pitching Stats
Similar to batting stats.
A large number of derived stats exist.
Core Stats

Pitching Stats
Name    Description
W       Wins
L       Losses
H       Hits Allowed
BFP     Batters Faced
R       Runs Allowed
HR      Home Runs Allowed
WP      Wild Pitches
IPOUTS  Outs Pitched
SHO     Shutouts
BB      Base on Balls
SO      Strikeouts
ER      Earned Runs
HBP     Batters Hit by Pitch

Derived Stats
Name  Description                    Formula
ERA   Earned Run Average             9×ER/IP (IP = innings pitched)
DICE  Defense Independent Component  3.0 + (13×HR + 3×(BB+HBP) - 2×SO)/IP
FIP   Fielding Independent Pitching  3.1 + (13×HR + 3×BB - 2×SO)/IP
dERA  Defense Independent ERA        10-line algorithm
CERA  Component ERA                  Long convoluted equation

Many more exist.
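The three closed-form pitching stats above can be computed as follows. This is a sketch that assumes innings pitched is IPOUTS/3 (three outs per inning) and uses the constants exactly as they appear in the table; the function name and the sample stat line are hypothetical, and dERA and CERA are left out because the slides give no formula for them.

```python
def derived_pitching(s):
    """Compute ERA, DICE and FIP from a dict of core pitching stats.

    Returns None when the pitcher recorded no outs, since all three
    stats divide by innings pitched.
    """
    ip = s["IPOUTS"] / 3.0  # assumption: three outs per inning pitched
    if ip == 0:
        return None

    era = 9 * s["ER"] / ip
    dice = 3.0 + (13 * s["HR"] + 3 * (s["BB"] + s["HBP"]) - 2 * s["SO"]) / ip
    fip = 3.1 + (13 * s["HR"] + 3 * s["BB"] - 2 * s["SO"]) / ip
    return {"IP": ip, "ERA": era, "DICE": dice, "FIP": fip}


# Made-up season line, for illustration only.
season = {"IPOUTS": 600, "ER": 80, "HR": 20, "BB": 60, "HBP": 5, "SO": 180}
stats = derived_pitching(season)
print({k: round(v, 3) for k, v in stats.items()})
```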
Modeling Steps
Started by feeding in the complete set of 26 available pitching stats for 2010 season performance.
Using top variable elimination followed by bottom variable elimination, reduced the list to only 7 important stats.
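The slides do not spell out the elimination procedure, so the following is only a sketch of one plausible reading: top elimination drops the highest-ranked variable, bottom elimination drops the lowest-ranked one, and the ranking is recomputed after every removal. The importance measure here (absolute Pearson correlation with the target) is a stand-in chosen to keep the example self-contained; it is not the model-based importance a tool like TreeNet would report. All names and data are hypothetical.

```python
from statistics import mean


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return abs(sxy) / ((sxx * syy) ** 0.5) if sxx and syy else 0.0


def rank_features(data, target, features):
    """Stand-in importance ranking, most important first."""
    return sorted(features, key=lambda f: abs_corr(data[f], target), reverse=True)


def shave(data, target, features, n_top=0, n_bottom=0):
    """Drop n_top highest-ranked, then n_bottom lowest-ranked variables,
    re-ranking after each removal. Returns the survivors, ranked."""
    kept = list(features)
    for _ in range(n_top):
        kept.remove(rank_features(data, target, kept)[0])
    for _ in range(n_bottom):
        kept.remove(rank_features(data, target, kept)[-1])
    return rank_features(data, target, kept)


# Toy columns: SUSPECT duplicates the target, NOISE is weakly related.
target = [0, 0, 1, 1, 0, 1]
data = {
    "SUSPECT": [0, 0, 1, 1, 0, 1],
    "SO": [1, 2, 5, 6, 2, 5],
    "WP": [3, 1, 2, 4, 2, 3],
    "NOISE": [1, 2, 1, 2, 2, 1],
}
selected = shave(data, target, list(data), n_top=1, n_bottom=1)
print(selected)
```

In practice the number of variables to drop at each end would be chosen by watching model performance on held-out data rather than being fixed in advance as it is here.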
One-Variable Contributions
The 2011 season agrees with what is normally expected.
The 2010 season surprises with higher HR values working in favor of pitchers!
This is further supported by the SO dependency.
One-Variable Contributions
BB and R stats agree with expectations in both seasons.
The 2010 season surprises with higher WP values working in favor of pitchers!
This could be related to the previous findings.
Two-Variable Contributions
2010 surprise: keep the strikeouts high and the base on balls low to win the division!
Two-Variable Contributions
2010 surprise: more wild pitches, more home runs allowed, more strikeouts => the division is won!
Compare with Conventional Plot
The conventional plot for the 2010 season IGNORES the other dimensions, which effectively project on top of each other.
As a result, the plot is cluttered, making it difficult to see any pattern.
In contrast, the TreeNet (TN) dependence plot shows the contribution of the given pair AFTER the influence of the other dimensions has been eliminated.
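One standard way to isolate a pair's contribution from the other dimensions is partial dependence: fix the pair at chosen values for every row, let the remaining variables keep their observed values, and average the model's predictions. Whether the TN dependence plot is computed exactly this way is an assumption; the sketch below only illustrates the general idea with a hypothetical toy model and made-up rows.

```python
def partial_dependence(predict, rows, fixed):
    """Average prediction over all rows with the variables in `fixed`
    overridden; every other variable keeps its observed value."""
    preds = [predict({**row, **fixed}) for row in rows]
    return sum(preds) / len(preds)


def toy_model(row):
    """Hypothetical fitted model, used only to make the example runnable."""
    return 0.1 * row["SO"] - 0.2 * row["BB"] + 0.05 * row["HR"]


# Made-up pitcher rows; HR is the "other dimension" that gets averaged out.
rows = [
    {"SO": 5, "BB": 1, "HR": 10},
    {"SO": 7, "BB": 3, "HR": 30},
]

# Two-variable dependence surface for the (SO, BB) pair on a small grid.
surface = {
    (so, bb): partial_dependence(toy_model, rows, {"SO": so, "BB": bb})
    for so in (0, 10)
    for bb in (0, 5)
}
for key in sorted(surface):
    print(key, round(surface[key], 3))
```

A raw scatter plot of SO against BB would mix in whatever HR happens to be for each point, whereas each cell of this surface averages over the observed HR values.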
Compare with Conventional Regression
These plots show the results of running conventional linear regression (LR) on the pitching data.
While the anomalous HR effect is present, the model fails to identify the fine local nature of the phenomenon.
LR does not provide
What Have We Learnt
It appears that in the 2010 regular season the home-run-driven strategy did not work!
At least, this is what the data tells us; further understanding will require experts in the field.
Core stats have good explanatory potential once put into a true multivariate modeling framework.
Conventional statistics approaches do not have
Modern data mining helps identify realized patterns and allows a quick and efficient check of the usefulness of the various performance measures available to a manager or researcher.
Data Mining Mythology
NEVER FALL FOR THESE
Absolute Powers: data mining will finally find and explain everything.
Gold Rush: with the right tool one can rip the stock market or predict the World Series winner to become obscenely rich.
Quest for the Holy Grail: search for an algorithm that will always produce 100% accurate models.
Magic Wand: getting a complete solution from start to finish with a single button push.
The End