Data Mining in Sports Analytics. Salford Systems Dan Steinberg Mikhail Golovnya




Data Mining Defined
Data mining is the search for patterns in data using modern, highly automated, computer-intensive methods.
Data mining may be best defined as the use of a specific class of tools (data mining methods) in the analysis of data.
The literature often refers to finding hidden information in data.
Salford Systems, 2012

Uses of Data Mining
[Figure: diagram centered on DATA MINING]

Long Live the King: Your Data
The analyst asks the right questions but makes no assumptions.
The success of data mining depends solely on the quality of the available data.
Garbage in, garbage out.

The Essence of Machine Learning
In a nutshell: use historical data to gain insights and/or make predictions on new data.

Data in Sports Analytics
Any game is the ultimate and unambiguous source of quality data.
This is very different from data availability and quality in other areas of research.
However, there is no universal agreement on the best way of organizing and summarizing the results in numeric form.
A large number of various game statistics are available.
Common sense and game rules are at the core.
There are heated debates on which stats best describe the potential for a future win.

Baseball Stats
Available from many sources, including the Internet.
Player level: summarize performance in a season, post season, and entire career.
Team level: wins and losses.
Game level: most detailed.

Baseball Databases
Widely known public database.
Gathers baseball stats all the way back to 1871.
We will use parts of it to illustrate the potential of data mining.

Typical DM Problem
Focus on the 2010 versus 2011 regular season performance in both leagues.
We have access to the player stats for the entire season, organized in a flat table.
Define a measure of overall player success simply as the player's team winning its division.
Thus 6 out of the 30 participating teams in 2010 are declared successes.
Question: Which of the player stats were associated with the team winning the division?
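The success label described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the column names `year` and `team`, the team codes, the stat values, and the helper `label_division_winners` are all hypothetical.

```python
def label_division_winners(player_rows, winners):
    """Return copies of the rows with WIN = 1 if the player's team won its division."""
    labeled = []
    for row in player_rows:
        out = dict(row)  # copy so the input table is left untouched
        out["WIN"] = 1 if (row["year"], row["team"]) in winners else 0
        labeled.append(out)
    return labeled


# Placeholder data: team codes and stat values are invented for illustration
winners = {(2010, "AAA"), (2010, "BBB")}
players = [
    {"year": 2010, "team": "AAA", "HR": 31, "SO": 120},
    {"year": 2010, "team": "CCC", "HR": 12, "SO": 80},
]
labeled = label_division_winners(players, winners)
```

Every player on a division-winning team receives the same label, so the target describes the team outcome rather than the individual player.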

Batting Stats

Core Stats
Name  Description
AB    At Bats
R     Runs
H     Hits
2B    Doubles
3B    Triples
HR    Home Runs
RBI   Runs Batted In
SB    Stolen Bases
CS    Caught Stealing
BB    Base on Balls
SO    Strikeouts
SF    Sacrifice Flies
HBP   Hit by Pitch

Derived Stats
Name  Description            Formula
AVG   Batting Average        H/AB
TB    Total Bases            B1 + 2x2B + 3x3B + 4xHR
SLG   Slugging               TB/AB
OBP   On Base Percentage     (H+BB+HBP)/(AB+BB+SF+HBP)
OPS   On Base Plus Slugging  OBP + SLG

Many more exist.
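As a worked example, the derived batting stats can be computed directly from the core counts. The sketch below follows the formulas listed on the slide; it assumes B1 means singles (hits that are not doubles, triples, or home runs), and the function name and the sample season line are made up for illustration.

```python
def derived_batting_stats(s):
    """Compute AVG, TB, SLG, OBP and OPS from a dict of core batting counts."""
    singles = s["H"] - s["2B"] - s["3B"] - s["HR"]  # B1, assumed to be singles
    tb = singles + 2 * s["2B"] + 3 * s["3B"] + 4 * s["HR"]
    avg = s["H"] / s["AB"] if s["AB"] else 0.0
    slg = tb / s["AB"] if s["AB"] else 0.0
    obp_den = s["AB"] + s["BB"] + s["SF"] + s["HBP"]
    obp = (s["H"] + s["BB"] + s["HBP"]) / obp_den if obp_den else 0.0
    return {"AVG": avg, "TB": tb, "SLG": slg, "OBP": obp, "OPS": obp + slg}


# Invented season line, for illustration only
example = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20,
           "BB": 50, "SF": 5, "HBP": 5}
stats = derived_batting_stats(example)
```

The zero-denominator guards are a design choice for players with no at bats; a real pipeline might prefer to drop such rows instead.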

Conventional Statistical Approaches
This is how the problem is usually attacked.
Each dot represents a single batter record for the whole 2010 season.
1,245 records overall.
16 core stats.
Winning team batters are marked in red.
No obvious insights!

Unique Personalities
Starting with CART in 1984, laid the foundation for tree-based modeling techniques.
Conduct a deep look into all available data.
Point out the most relevant variables and features.
Automatically identify optimal transformations.
Capable of extracting complex patterns going way beyond the

TreeNet Model on Core Stats

Key Findings
6 core batter stats were identified as most predictive.
15-20% of the total variation can be directly associated with the batter stats.
The single-variable plots show the non-linear nature of many of the relationships.
Fine plot irregularities should be ignored.
Striking result: in the 2010 season, HR above 30 is associated with losing the division!
The 2011 season looks fine.
Proceed by digging into pair-wise contribution plots.

Surprise: 2010 HR Leads to Division Loss!

Comments on Batting
3D dependency plots further highlight the rather unusual HR finding for the 2010 season.
It is a well-known fact that batters aiming at a home run have a higher number of strikeouts.
This is supported by both graphs.
However, in the 2010 regular season the HR-centered approach led to defeat!

Compare with Conventional Plot
This plot shows two performance stats from the data table plotted against each other.
Note the difficulty of discerning the identified HR x SO pattern visually.

Pitching Stats
Similar to batting stats.
A large number of derived stats exist.

Pitching Stats

Core Stats
Name    Description
W       Wins
L       Losses
H       Hits Allowed
BFP     Batters Faced
R       Runs Allowed
HR      Home Runs Allowed
WP      Wild Pitches
IPOUTS  Outs Pitched
SHO     Shutouts
BB      Base on Balls
SO      Strikeouts
ER      Earned Runs
HBP     Batters Hit by Pitch

Derived Stats
Name  Description                    Formula
ERA   Earned Run Average             9xER/InningsPitched
DICE  Defense Independent Component  3.0+(13HR+3(BB+HBP)-2SO)/IP
FIP   Fielding Independent Pitching  3.1+(13HR+3BB-2SO)/IP
dERA  Defense Independent ERA        10-line algorithm
CERA  Component ERA                  Long convoluted equation

Many more exist.
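The three closed-form pitching stats can be computed as below. This is a sketch using the formulas and constants exactly as listed on the slide (other sources use different constants), and it assumes innings pitched equals IPOUTS divided by 3. The function name and the sample pitcher line are invented for illustration.

```python
def derived_pitching_stats(s):
    """Compute ERA, DICE and FIP from core pitching counts, or None if no outs pitched."""
    ip = s["IPOUTS"] / 3.0  # assumption: three outs per inning pitched
    if ip == 0:
        return None  # rate stats are undefined without innings pitched
    era = 9 * s["ER"] / ip
    dice = 3.0 + (13 * s["HR"] + 3 * (s["BB"] + s["HBP"]) - 2 * s["SO"]) / ip
    fip = 3.1 + (13 * s["HR"] + 3 * s["BB"] - 2 * s["SO"]) / ip
    return {"IP": ip, "ERA": era, "DICE": dice, "FIP": fip}


# Invented pitcher line, for illustration only
pitcher = {"IPOUTS": 600, "ER": 80, "HR": 20, "BB": 50, "HBP": 5, "SO": 180}
pitching = derived_pitching_stats(pitcher)
```

dERA and CERA are left out because the slide gives no formula for them.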

Modeling Steps
Started by feeding in the complete set of 26 available pitching stats for the 2010 season.
Using top variable elimination followed by bottom variable elimination, reduced the list to only 7 important stats.
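The general idea of shrinking a variable list can be illustrated with plain backward elimination. This is a generic stand-in, not the top/bottom elimination procedure used in the slides, whose details are not given here. The `score` callable, the toy scoring rule, and the stat subset are all hypothetical; in practice `score` would refit a model on the given variables and return a validation metric.

```python
def backward_eliminate(features, score, tolerance=0.0):
    """Drop one variable at a time while the score stays within tolerance of the full model."""
    kept = list(features)
    baseline = score(kept)  # score of the model with every variable included
    while len(kept) > 1:
        # Score every candidate removal and pick the one that hurts least
        trials = [(score([f for f in kept if f != c]), c) for c in kept]
        trial_score, candidate = max(trials)
        if trial_score < baseline - tolerance:
            break  # any further removal costs too much
        kept.remove(candidate)
    return kept


# Toy score: pretend only SO, BB and HR carry signal, with a small penalty per variable
USEFUL = {"SO", "BB", "HR"}


def toy_score(features):
    return len(USEFUL & set(features)) - 0.01 * len(features)


selected = backward_eliminate(["W", "L", "SO", "BB", "HR", "WP"], toy_score)
```

Comparing against the fixed full-model baseline, rather than the previous step, keeps small losses from accumulating unnoticed across many removals.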

One-Variable Contributions
The 2011 season agrees with what is normally expected.
The 2010 season surprises with higher HR values working in favor of pitchers!
This is further supported by the SO dependency.

One-Variable Contributions
The BB and R stats agree with expectations in both seasons.
The 2010 season surprises with higher WP values working in favor of pitchers!
This could be related to the previous findings.

Two-Variable Contributions
2010 surprise: keep the strikeouts high and the base on balls low to win the division!

Two-Variable Contributions
2010 surprise: more wild pitches, more home runs allowed, more strikeouts => the division is won!

Compare with Conventional Plot
The conventional plot for the 2010 season IGNORES the other dimensions, which effectively project on top of each other.
As a result, there is a lot of confusion on the plot, making it difficult to see any pattern.
In contrast, the TreeNet (TN) dependence plot shows the given pair's contribution AFTER the influence of the other dimensions has been eliminated.
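One common way to isolate a variable's contribution from a fitted model is a partial dependence curve: force the variable to a fixed value on every record, average the predictions, and repeat over a grid of values. Whether the TN plots are computed exactly this way is an assumption; the sketch below only illustrates the averaging idea, and the `predict` function and the records are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # leave the original record untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


# Hypothetical fitted model and records, for illustration only
def toy_predict(row):
    return 0.1 * row["SO"] - 0.2 * row["BB"]


records = [{"SO": 100, "BB": 40}, {"SO": 150, "BB": 60}]
so_curve = partial_dependence(toy_predict, records, "SO", [100, 200])
```

A two-variable version would loop over a grid of value pairs and override both features at once, giving a surface instead of a curve.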

Compare with Conventional Regression
These plots show the results of running conventional linear regression (LR) on the pitching data.
While the anomalous HR effect is present, the model fails to identify the fine local nature of the phenomenon.
LR does not provide

What Have We Learnt
It appears that in the 2010 regular season a home-run-driven strategy did not work!
At least, this is what the data tell us; further understanding will require experts in the field.
Core stats have good explanatory potential once put into a true multivariate modeling framework.
Conventional statistics approaches do not have
Modern data mining helps identify realized patterns and allows a quick and efficient check of the usefulness of the various performance measures available to a manager or researcher.

Data Mining Mythology
NEVER FALL FOR THESE
Absolute Powers: data mining will finally find and explain everything.
Gold Rush: with the right tool one can beat the stock market or predict the World Series winner and become obscenely rich.
Quest for the Holy Grail: the search for an algorithm that will always produce 100% accurate models.
Magic Wand: getting a complete solution from start to finish with a single button push.

The End