MODEL CONSIDERATIONS FOR MULTI-ENTRY COMPETITIONS. A Thesis. Presented to the. Faculty of. San Diego State University. In Partial Fulfillment


MODEL CONSIDERATIONS FOR MULTI-ENTRY COMPETITIONS A Thesis Presented to the Faculty of San Diego State University In Partial Fulfillment of the Requirements for the Degree Master of Science in Statistics by Vincent Stanley Dayes Summer 2010

Copyright © 2010 by Vincent Stanley Dayes

Gambler's Prayer: "Dear Lord, please let me break even, because I really need the money." - Mr. X

ABSTRACT OF THE THESIS

Model Considerations for Multi-Entry Competitions
by Vincent Stanley Dayes
Master of Science in Statistics
San Diego State University, 2010

A unique and highly practical system for identifying good and bad bets at the major Southern California Thoroughbred racetracks is created and analyzed. A probability model is created for each individual race as a function of odds, Morning Line, each horse's past performances, current trainer and jockey, and miscellaneous factors depending on the type of race. A continuous variable, Perf (a numerical performance estimator), is used as the response variable in the regression analysis. After new estimates for Perf were obtained, Monte Carlo methods were used to calculate the probabilities of each horse's 1st, 2nd, 3rd, or 4th place finish. Horses were then grouped according to odds, and reports were generated to analyze results and calculate Expected Values. To find the numerous hidden factors and patterns that occur only under specific conditions, numerous subsets of races and horses were analyzed using hundreds of covariates.

A Baseline of probabilities is created using a simple model based mainly on the odds of a horse. The final model probabilities resulting from the estimated regression equation are then compared to the Baseline probabilities. Those that differ significantly are separated into two groups: estimated probabilities higher than the Baseline's are considered profitable bets (Overlays), while those lower than the Baseline's are Underlays (unprofitable bets). Each group is displayed in the odds-based report format. The study uses 10 3/4 years of horse-racing data, with 8 3/4 years set up as the Regression Dataset and the two most recent complete years as the Testing Dataset. Of primary interest is the Profitability, or Expected Value, of each group. In sum, various parameters and wagering options are analyzed for their positive or negative effects on profitability.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
  1.1 History
  1.2 Statement of Problem
  1.3 Objective
  1.4 A Typical Horse Race
  1.5 Definition of Terms
2 DATA
  2.1 Variables Input into SAS
  2.2 Summary Statistics of Numerical Data
  2.3 Subgroups
  2.4 Data Separated into Odds Ranges
  2.5 The Daily Racing Form for the Serious Handicapper
3 METHODOLOGY
  3.1 Perf: The Important Response Variable
  3.2 Data Preparation in MS ACCESS
  3.3 SAS Operations and Processing
    3.3.1 Non-Indicator Covariates Created in SAS
    3.3.2 Indicator Type Covariates Created in SAS
    3.3.3 WBF Exponent Found Using Box-Cox Method
    3.3.4 SAS Regression and Model Selection
  3.4 Matlab: Simulating Horse Races for Probability Estimates
  3.5 Comparing Probability Files in ACCESS
4 RESULTS
  4.1 Overlays
  4.2 Underlays
  4.3 Comparisons of Results by the Four Major Odds Ranges
5 MULTICOLLINEARITY
6 DISCUSSION
  6.1 Response Variable: Perf (and Power Point)
  6.2 Subgroups
  6.3 Limitations of the Study
  6.4 Predictor Variables Included in the Final Regression Model
  6.5 Miscellaneous
7 CONCLUSIONS
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Descriptive Statistics of Numerical Variables
Table 2.2 Regression Data by Collapsing Odds Ranges
Table 3.1 Trainer Names and ID Codes
Table 3.2 Test Data Baseline Model
Table 3.3 Regression Model Coefficients
Table 4.1 Overlays: Comparison of Win Results(A) to Baseline(B)
Table 4.2 Overlays: Comparison of 2nd, 3rd, and 4th Results(B) to Baseline(A)
Table 4.3 Underlays: Comparison of Win Results(A) to Baseline(B)
Table 4.4 Underlays: Comparison of 2nd, 3rd, 4th Results(A) to Baseline(B)
Table 5.1 Covariates With Variance of Inflation Greater Than 2.0
Table 7.1 Comparison of Overlays and Underlays Totals
Table 7.2 Odds Range 9-27 of Overlays and Underlays Totals

LIST OF FIGURES

Figure 1.1 Daily Racing Form. Note abundance of information for each horse.
Figure 4.1 Win percentage comparisons between Underlays, Baseline, and Overlays by odds ranges. Win percentages for Overlays substantially greater than those for Underlays.
Figure 4.2 EV comparisons between Underlays, Baseline, and Overlays by odds ranges. EVs for Overlays are much greater than those for Underlays.
Figure 4.3 Test results: Finish comparisons between Underlays, Baseline, and Overlays by odds ranges. Total percentages significantly greater for Overlays than Underlays except for 0-4 range.

CHAPTER 1
INTRODUCTION

Perhaps the most complex and challenging multi-entry competition is the horse race. Each horse race is essentially unique and independent of the others. The race conditions, restrictions, and eligibility requirements determine which horses may be entered in a race, and they apply to all the horses in that race. In a Maiden race, for example, only horses that have never won a race may be entered. Typical restrictions are by sex, age, state in which the horse was bred, types and number of races previously won, etc. Race conditions include distance, racing surface (dirt, turf, or synthetic track), physical condition of the track, purse offered, etc. Thus each race is a cluster of horses running under race-specific factors. Horse-specific factors are jockey, trainer, post position, equipment (blinkers, type of shoe, etc.), (legal) drugs, assigned weight, etc. But for the serious handicapper, the most important information is the past performances of each horse listed in the Daily Racing Form. Listed in reverse chronological order, up to 10 previous races of each horse are summarized.

1.1 HISTORY

Gambling on horse races has been around since man first started riding horses. Modern horse racing exists because it is a popular form of legalized gambling and is accepted as benefiting local and state economies by generating large amounts of tax dollars and providing jobs and money.

Statisticians have been analyzing horse racing data for many years, with milestone works by Harville [1], Henery [2], Stern [3], and others. Researchers in many other disciplines also investigate the ponies. Hausch et al. [4] cover articles from economists, psychologists, management scientists, and probability theorists, as well as professional gamblers. The first model, proposed by Harville [1], is a simple way of computing ordering probabilities based on winning probabilities. Henery [2] suggested using a normal distribution for estimating running times, whereas Stern [3] recommended using gamma distributions for the same purpose. Bacon-Shone, Lo and Busche [5] and Lo and Bacon-Shone [6] showed that the Henery and Stern models were better fits than the Harville model for particular racing data. However, since both the Henery and Stern models are complicated to use in practice, Lo and Bacon-Shone [7] suggested a simple approximation for both.
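As an illustration of how ordering probabilities can be built from winning probabilities alone, the following is a minimal sketch of the standard Harville formula: the probability of an exact finishing order is the product of each horse's win probability renormalized over the horses not yet placed. The function names and the three-horse example are illustrative assumptions, not code from this study.

```python
from itertools import permutations


def harville_order_probability(win_probs, order):
    """Harville probability that the horses in `order` finish 1st, 2nd, ...

    win_probs: list of win probabilities, one per horse, summing to 1.
    order: sequence of distinct horse indices in finishing order.
    """
    if len(set(order)) != len(order):
        raise ValueError("a horse cannot appear twice in a finishing order")
    prob = 1.0
    remaining = 1.0  # probability mass of horses not yet placed
    for horse in order:
        if remaining <= 0.0:
            raise ValueError("no probability mass left to distribute")
        # Chance that this horse beats every horse still unplaced.
        prob *= win_probs[horse] / remaining
        remaining -= win_probs[horse]
    return prob


def all_exacta_probabilities(win_probs):
    """Harville probability of every ordered (1st, 2nd) pair (hypothetical helper)."""
    horses = range(len(win_probs))
    return {pair: harville_order_probability(win_probs, pair)
            for pair in permutations(horses, 2)}


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.2]  # made-up win probabilities for a 3-horse race
    print(harville_order_probability(probs, (0, 1)))
    print(sum(all_exacta_probabilities(probs).values()))
```

With the made-up probabilities above, the exacta "horse 0 then horse 1" has probability 0.5 × 0.3 / (1 − 0.5) = 0.3, and the probabilities of all ordered pairs sum to 1.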

Also heavily investigated is the favorite-longshot bias (where favorites are typically underbet, so their odds are too high, and longshots are overbet, so their odds are too low), which has often been found in gambling data (see Ali [8], Asch et al. [9], Ziemba and Hausch [10], Lo [4], and Bacon-Shone and Lo [11]). This bias also appears in this study, but it is not the main focus.

Essentially all the works cited in this section were aimed at finding models that accurately estimate probabilities and could therefore turn a profit at the racetrack, by finding profitable wagers and/or avoiding unprofitable ones. This work is aimed at developing a system that facilitates evaluating whichever patterns, variables, statistics, etc. a handicapper might wish to investigate, and at continually improving an already useful model by adding new covariates that are significant.

1.2 STATEMENT OF PROBLEM

The problem is to beat the odds: finding profitable bets and avoiding losing propositions. Turning a profit at the horse races involves the basic calculation of return versus risk, payoff versus probability. By waiting until the last few minutes before a race goes off, a bettor has an accurate approximation of the payoff for a straight win bet, which can also be a strong indicator of the payoffs of other types of bets. So the key to success is obtaining accurate probabilities of winning and then choosing the wagers where return (odds) far outweighs risk (win probability).

Even for the most experienced handicappers, one race may take from 20 minutes to two hours to produce accurate probabilities for each horse. Factors vary from race to race, but starting points are: Morning Line, each horse's past performances, current trainer and jockey, pace style (relative to the pace styles of the other horses in the race), and miscellaneous factors and patterns depending on the type of race. The amount of available information is overwhelming, and only a computer-assisted handicapper can accurately assess each horse's probability of winning in a reasonable amount of time, much less calculate the probabilities of coming in 2nd, 3rd, or 4th that are required to bet Superfectas (picking the exact order of the top four finishers).

The betting public in general does a good job of estimating most horses' probabilities of winning, which means the obvious predictor variables, like best jockeys and trainers, recent strong performances, sparkling workouts, etc., are reflected in the odds. The problem then is to find predictor variables that are significant and, at the same time, relatively independent of odds. Then there needs to be a way to determine how strongly these covariates weigh against each other and how they relate to a horse's performance. There also needs to be a numerical rating system of performance for each horse in a race, a system that takes into account the number of horses in a race and where each horse finishes.
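The return-versus-risk calculation described above can be sketched in a few lines. This is a minimal illustration, assuming odds are quoted "to 1" as on the tote board (odds of 4.0 return $5 for each $1 bet, stake included) and ignoring breakage; the function names and the numbers in the example are hypothetical.

```python
def break_even_probability(odds):
    """Win probability at which a win bet at these odds has zero expected profit."""
    return 1.0 / (odds + 1.0)


def expected_value_per_dollar(win_prob, odds):
    """Expected profit per $1 win bet: collect (odds + 1) with probability win_prob."""
    return win_prob * (odds + 1.0) - 1.0


if __name__ == "__main__":
    odds = 4.0                   # hypothetical final odds of 4-1
    estimated_win_prob = 0.25    # hypothetical model estimate
    print(break_even_probability(odds))                         # 0.2
    print(expected_value_per_dollar(estimated_win_prob, odds))  # 0.25
```

In this made-up case the estimated win probability (0.25) exceeds the break-even probability (0.20), so the bet has a positive expected value of 25 cents per dollar wagered.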

1.3 OBJECTIVE

The objective is to create and analyze a practical system for weighing all the positive and negative factors associated with each horse in a race and then calculating probabilities for 1st, 2nd, 3rd, and 4th place finishes. Horses whose (1st place) probabilities are significantly higher than the probabilities reflected by their odds should be good bets (Overlays), and those whose probabilities are significantly lower are wagers to avoid (Underlays). Finding a response variable that numerically rates a horse's performance is also part of the objective.

1.4 A TYPICAL HORSE RACE

Trainers pick the races for their horses and find a jockey by dealing with jockey agents. Horses are entered a few days before a race, and post positions are assigned by drawing small numbered balls ("pills"). The Racing Form is usually available two days before the race and the Morning Line one day before.

Everything is synchronized at the racetrack. Even as one race is running, the horses for the next race are being led to the Paddock, where they are saddled, checked over, and calmed down. Then the horses are led to a viewing ring surrounded by a crowd of bettors who are intensely studying the horses for good or bad signs. The horses are led around the outside of the ring while the owners and their friends and families, on the inside (usually all dressed up and pretending they do not notice the crowd on the outside), watch their horse in a confident manner. At this point the trainer gives final instructions to the jockey (who usually ignores them), an official calls "Riders up," and the jockeys jump on the horses. The horses are then led through a tunnel to the track, where there is a post parade in front of the grandstands. The horses warm up by jogging around the track, and a few minutes before Post Time they start walking to the starting gate.

Around Post Time the horses are loaded into the starting gate, all the while being checked over by the track veterinarian. When the horses are all loaded and calm, the official starter pushes a button that opens all the gates in front of each horse, and the race is off. The start is the most chaotic point in the race, with horses frequently going sideways, bumping and cutting each other off, or sometimes leaving the gate very slowly. At the start, jockeys seek an advantageous position that gives their horse its best chance of winning. Some horses are front-runners: they like to be in the lead and steal the race by setting the pace without using up their reserve energy, and then having enough gas in the tank to fight off late challengers (going "wire to wire"). Other horses may have a stalking style, where they stay right behind the leader or leaders until near the end of the race and then go all out. Then there are the closers, who may stay near the rear of the field and then close strongly in the last part of the race. All jockeys try to save ground on the turns by staying as near the rail as possible while at the same time trying to avoid trouble in the form of being blocked by horses in front or being pinched into the rail.

The end of a race can be quite exciting, as horses are frequently tightly packed at the finish line, separated by inches after running a mile or more. Jockeys are expert at using the whip. Some horses respond well to whipping and others do not, in which case the jockey may just show the whip by placing it in front of the horse's eyes or lightly tap the horse once or twice. Jockeys are also adept at urging their mounts to give their best efforts, especially in the final straightaway before the finish line (the "home stretch"). Note that jockeys do not actually sit on their mounts but balance themselves on the stirrups the whole race so as not to impede or interfere with their horse's running action. Jockeys are tremendous athletes who must have large amounts of strength, courage, lightning reflexes, and good judgment and instincts to succeed.

The flip side to the running of the race is the wagering. At the track there is a huge Totalizer Board that displays the odds for win bets and the total amounts bet on each horse, updated every minute or so. This information is also available on monitors displayed all around the public areas. Wagering is done over the internet right up until post time. At the track, wagering is allowed until a loud bell goes off, a few seconds after the race starts. The difference is usually from two or three minutes to 10 or more. Thus bettors at the track have an advantage: they have time to see the effects of last-minute internet wagering, get an accurate estimate of the final odds, and still make their own bets.

1.5 DEFINITION OF TERMS

Baseline : Estimated Perfs derived from the Simple Regression Model, which is based only on odds and the number of horses in the race. From each horse's estimated Perf, estimated probabilities are calculated; these are the baseline values against which the estimated probabilities from the final regression model are compared
Bay : Reddish brown color of horses
Betting Pool : Each type of wager has its own pool of money bet, separate from all other pools
Blinkers : A hood placed over a horse's head with cups sewn onto the eye openings. This restricts a horse's vision so it can only see straight ahead
Box-Cox Method : Used to find the best fit of the win bet fraction (wbf) to the performance response variable (Perf) by finding the exponent λ that minimizes the Sum of Squares Error. A new predictor variable, wbfall, equal to wbf raised to λ, is used in place of wbf (see Kutner [12])
Breakage : This is due to odds being rounded downward to the nearest tenth of a dollar and the wagering establishment keeping the difference
Breeder : Whoever breeds the horse
Claimed : When a horse runs in a claiming race and is claimed by a licensed owner or trainer, it is purchased for the claiming amount specified for that race. The horse must be in the starting gate when the race goes off. Once the race starts, the horse officially belongs to the new owner even if it is injured or drops dead, but any monies won go to the original owner
Claiming Race : A race in which the horses may be claimed (purchased) for a specified price
Class : Level of competition - numerical evaluation of the general strength of a race. The concept of Class is used here to categorize numerically the quality of a race and therefore its entrants. The strongest runners are in the highest classes (with the highest purses to be won) and vice versa
Colt : A male horse age 4 or less
Cushion Track : A type of synthetic surface
Daily Double : A wager where the winners of two consecutive races must be picked to win the bet. Originally offered on the first two races of the day; now most tracks offer it on all consecutive races
Daily Racing Form : Resembles a small newspaper filled with racing information for each horse running on a particular day (see Figure 1.1)
Entry : When two or more horses are entered in a race and are considered a single entity for wagering purposes
Exacta : An exotic wager where the exact order of the first two finishers in a single race is specified
Exotics : Newer, more complicated bets such as Trifectas and Superfectas on single races and multiple-race bets like the Pick 3, Pick 4, and Pick 6
EV : Expected Value - used here in the same sense as Profitability - expected or average return on a wager

Figure 1.1. Daily Racing Form. Note abundance of information for each horse.

Favorite : The most heavily bet horse in a race
Filly : Female horse age 4 or less
Furlong : An eighth of a mile
Gelding : A castrated male horse of any age
Handicapper : An experienced Daily Racing Form reader able to hold huge amounts of information in his head and at the same time judge the relative merits of each horse in a race, coming up with estimates of winning probabilities
Handicap Race : A stakes race where weights (see Weight) are assigned according to a horse's past performances
Horse : Specifically a male horse (not gelded) of age 5 or greater
House Take : House Cut, Track Percentage - the amount taken out of the Betting Pool by the House or race track. For simple pools like the win, place, or show, it is around 14% to 18%. For the exotic pools, it is around 20%. It varies by state and race track
Indicator 0 or 1 : Covariates are set to 1 if they occur, otherwise they are set to 0
Jockey : Professional rider of horses
Lasix : Legal anti-bleeding drug - common in California, illegal in some states and countries
Length : About nine feet - the length of a generic horse from the tip of its nose to the end of its tail (when running) - also a rough time measurement: one length is about a fifth of a second
Line : Refers to a past performance line in the Daily Racing Form
Longshot : General term meaning a horse that is unlikely to win
Maiden Race : A race only for horses that have never won a race
Major Odds Range : In this study, the (four) major ranges are: 0.1 to 4.0, 4.1 to 9.0, 9.1 to 27.0, and 27.1 and up
Mare : Female horse 5 or more years in age
Monte Carlo Method : Computational system of simulation using repeated random sampling to compute results

Morning Line : Predicted final odds - may appear in the Racing Form one or more days before the race
Odds : Return on investment, should it be successful
Overlay : When the probability of a horse winning is greater than the probability indicated by its odds
Out Finish : Finishing in 5th place or worse - not 1st, 2nd, 3rd, or 4th
Pace : The speed of the early leaders in a race
Pace Style : The usual early-race location of a horse - may be forwardly placed early or in the rear
Paddock : The area where the horses are viewed before a race
Past Performances : Daily Racing Form information lines (see Figure 1.1)
Perf : Dependent (Response) Variable - numerical evaluation of a horse's performance in a given race, a function of lengths ahead of or behind the Power Point. Originally just 10 * lengths from the Power Point (negative if behind the Power Point, positive if ahead). For example, if the winning horse were two lengths ahead of the second place horse in a 7-horse race (where the Power Point equals the second place finish), its Perf would be 20. There is a set minimum for Perf (in this study it is -210)
Photo Finish : A close finish where the finish picture must be examined to determine the order of finish
Pick 3, 4, or 6 : Wagers where the winners of all the included races must be picked
Place : A wager where a bettor wins if his horse comes in 1st or 2nd. Also, the place position is 2nd place
Polytrack : Synthetic race track (general term)
Post Parade : After horses leave the paddock and before the race, the horses come out onto the racetrack and parade in front of the grandstands
Post Time : Official time horses are supposed to be at the starting gate. Most races start a few minutes after post time

Power Point : A numerical indicator of a strength threshold value at the finish of a race: a function of the number of horses in the race and the distances in lengths between the top four finishers. Originally equal to the second place finish for races with 7 or fewer horses, equal to the midpoint between 2nd and 3rd for races with 8, 9, and 10 horses, and equal to the 3rd place finish for races with 11 or more horses. If a horse finished at the Power Point, it was assigned a Perf of zero. If ahead of the Power Point, Perf would equal the number of lengths times 10; if behind, Perf was minus the number of lengths multiplied by 10
Profitability : The positive or negative return per wager. Synonymous with EV
Pro-Ride : A specific type of synthetic racetrack
Purse : Prize money offered in a race, of which typically 60% goes to the winner, 20% to 2nd place, 12% to 3rd, 6% to 4th, and 2% to 5th (the distribution percentages vary from state to state)
Race Restrictions : Restrictions on horses allowed into a specific race
Racetrack : The three California tracks in this study are all flat ovals, with Santa Anita and Del Mar being a mile in circumference and Hollywood being a mile and 1/8. The turf or grass course is just inside the main course, which up until 2007 was a dirt track. Racetracks are publicly owned but strictly regulated by state agencies
Racing Form : See Daily Racing Form
Reflected Probs : Inverted odds: probabilities that reflect how a horse is bet, with the estimated Track Percentage taken into account
\[ \frac{1}{\mathrm{odds} \times 1.2 + 1} \tag{1.1} \]
Regression Data : Data used to develop the model and find the Estimated Parameters/Regression Coefficients
Regression Funct. : Model/equation used to predict new values of the response variable Perf from the Test Data/Prediction Set
Results Dataset : Subset of the Testing Dataset consisting of horses whose estimated win probabilities (from the Regression Function) differ significantly from the Baseline win probabilities

Saddle Cloth Number : Official number of the horse, used when placing bets or checking results - frequently the same as the post position, but not always
Saving Ground : Minimizing the distance a horse has to run by staying close to the inside rail on turns
Scratch : A horse does not run (for whatever reason) in a race it is entered in
Show : A wager where the bettor wins if his pick comes in 1st, 2nd, or 3rd. Usually has a small return, sometimes 10 cents on the dollar
Stakes Race : Highest class of races with the largest purses, for example, the Kentucky Derby
Superfecta : An exotic wager where the exact order of the first four finishers in a single race is specified
Synthetic Track : In May 2006, the California Horse Racing Board mandated that all California horse racing tracks switch their dirt tracks to synthetic surfaces for the safety and welfare of horses. Hollywood has been using Cushion Track since November 2006. Santa Anita tried Cushion Track from September 2007 to the summer of 2008, when it switched to Pro-Ride due to drainage problems. Del Mar has been using Polytrack since July 2007. Polytrack is similar to Pro-Ride, and the two have been treated as the same surface type in this study
Testing Data : Data set aside for model validation - this data is not used in creating the Regression Function/Estimated Parameters
Totalizer Board : Huge display at the race track showing all important betting information
Trainer : Responsible for the training and behavior of horses and for overseeing their exercise routine; selects the races for a horse to run in and picks the jockey
Trifecta : An exotic wager where the exact order of the first three finishers in a single race is specified
Wbf : Win Bet Fraction, inverted odds:
\[ \mathrm{wbf} = \frac{1}{\mathrm{odds} + 1} \tag{1.2} \]
WbfAll : Wbf raised to the exponent λ found using the Box-Cox method:
\[ \mathrm{wbfall} = \mathrm{wbf}^{\lambda} \tag{1.3} \]

Weight : Horses are assigned minimum weights according to race conditions. All jockeys are weighed before all races - if under the assigned weight, they carry extra weights in their saddle. Overweights don't matter, except to the horse
Whip : Leather instrument used by a jockey to encourage his horse
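To make the Power Point, Perf, and inverted-odds definitions above concrete, here is a minimal sketch that follows the "original" rules as stated in the glossary. It assumes each horse's finish is recorded as lengths behind the winner, that the race has at least three finishers, and that the track-percentage adjustment in the Reflected Probs definition is a multiplicative factor of 1.2 on the odds; all function and parameter names are hypothetical, and the study's final Perf definition may differ from this original version.

```python
PERF_FLOOR = -210.0  # minimum Perf used in this study


def power_point(beaten_lengths, n_horses):
    """Power Point in lengths behind the winner.

    beaten_lengths: lengths behind the winner, in finishing order (winner = 0.0).
    n_horses: number of horses in the race.
    """
    second, third = beaten_lengths[1], beaten_lengths[2]
    if n_horses <= 7:
        return second                   # 2nd place finish
    if n_horses <= 10:
        return (second + third) / 2.0   # midpoint between 2nd and 3rd
    return third                        # 3rd place finish


def perf(beaten_lengths, n_horses, floor=PERF_FLOOR):
    """Perf for every horse: 10 * lengths ahead of (+) or behind (-) the Power Point."""
    pp = power_point(beaten_lengths, n_horses)
    return [max(floor, 10.0 * (pp - behind)) for behind in beaten_lengths]


def wbf(odds):
    """Win bet fraction, the inverted odds of equation (1.2)."""
    return 1.0 / (odds + 1.0)


def reflected_prob(odds, take_factor=1.2):
    """Inverted odds with the estimated track percentage taken into account.

    take_factor=1.2 is an assumed reading of equation (1.1).
    """
    return 1.0 / (odds * take_factor + 1.0)


if __name__ == "__main__":
    # Hypothetical 7-horse race: winner is two lengths clear of the second horse.
    finish = [0.0, 2.0, 3.0, 5.0, 8.0, 12.0, 30.0]
    print(perf(finish, n_horses=7))
    # [20.0, 0.0, -10.0, -30.0, -60.0, -100.0, -210.0]
```

The winner's Perf of 20 matches the glossary's worked example, and the last horse, 28 lengths behind the Power Point, is held at the -210 floor rather than scored at -280.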

CHAPTER 2
DATA

The data comes from the three major Southern California horse racing tracks: Santa Anita (Los Angeles), Hollywood (Los Angeles), and Del Mar (San Diego). These three tracks form a circuit, since only one is open at a time and the same horses, trainers, and jockeys move from track to track. Thus the data has a consistent, homogeneous nature. The races were run from January 1999 to November 2009, a span of 10 3/4 years. Out of 23,478 races, 19,930 races (168,253 horses) were used for the individual race model. The others were rejected because of too few horses (a minimum of 6 horses per race was required), abnormalities, entries (multiple horses coupled together for betting purposes, causing complications in odds analysis), and corrupted data.

There are three different types of race data. Current Race data is the data a handicapper has before the race goes off (typically found in the Daily Racing Form). Results data is the results of the Current Races. Past Performances data is how a horse performed in previous races, so it is a combination of the other two types of data. The pre-race data was exported from the Daily Racing Form files, imported into MS ACCESS, checked for errors, processed for easy analysis, and then exported from ACCESS in comma-delimited files, which were then read and analyzed by SAS and Matlab. The results data and the final odds came from Equibase Inc., which specializes in horse-racing results data. The data was purchased through Post Time Solutions, Inc.

The 10 3/4 years of data was split into two groups. The first (Regression Dataset) is 8 3/4 years of data (1/27/99 to 11/05/07; 16,284 races, 136,855 horses), and the second (Testing Dataset) is the last two years of data (11/06/07 to 11/07/09; 3,646 races, 31,398 horses).

2.1 VARIABLES INPUT INTO SAS

Note: see Table 2.1 for statistics on numerical data.
age : Age of horses allowed in race: age 2: 13.27% of races, age 3: 19.05%, age 4: 2.61%, age 3UP: 41.76%, age 4UP: 23.31%. Note that there were 8 races for 3 and 4 year olds only.
blinks : Blinkers changed: X = blinkers taken off: 2.59%, B = blinkers put on: 4.95%, no change in blinkers: 92.45%.

Table 2.1. Descriptive Statistics of Numerical Variables

Variable       Minimum   Median   Maximum   Std Dev   10th PCT   90th PCT
days1st        2         29       1876      85.64     15         180
days2nd        11        72       1903      109.39    37         223
days3rd        17        116      1945      124.40    62         297
dist           20        65       140       13.07     55         85
horseage       2         3        12        1.31      2          5
ML1            0.01      0.10     0.68      0.07      0.03       0.22
monthborn      1         3        12        1.56      2          5
nhor           6         9        14        1.95      6          12
numlines       0         6        10        3.83      0          10
numlinediff    -8.90     0        8.75      2.02      -2.57      2.40
odds           0.10      8.70     243.2     21.69     2.2        44.6
odds1          0.05      9.00     339.5     18.62     2.0        35.6
odds2          0.05      10.40    339.5     16.66     2.0        30.7
perf           -210      -30      121       99.67     -210       55
pp             1         5        14        2.78      1          9
speed1diff     0         2        10        3.83      0          10
speed12diff    0         3        10        3.77      0          10
speed123diff   0         4        10        3.74      0          10
turfstarts     0         0        78        5.92      0          10
turfwins       0         0        15        1.31      0          2
wbf            0.004     0.10     0.91      0.13      0.02       0.32
wbfall         0.43      0.70     0.99      0.11      0.55       0.84
wbfold1        0.003     0.10     0.95      0.13      0.03       0.33
wbfold2        0.003     0.09     0.95      0.13      0.03       0.33

cl12 : Claim indicator for last three races: 1 = claimed in last race, 2 = claimed in second race back, 4 = claimed in 3rd race back - 19,366 horses out of 168,253 were claimed in at least one of their last three races (11.51%)
date : Julian date of race - 36187 to 40124 (1/27/1999 to 11/7/2009)
days1st : Number of days since last race (see Table 2.1)
days2nd : Number of days since 2nd race back (see Table 2.1)
days3rd : Number of days since 3rd race back (see Table 2.1)
dist : Distance of race in tenths of a furlong - a furlong is 1/8th of a mile - from 20 to 140 (1/4 mile to 1 3/4 miles); most common distance: 60, or 6 furlongs (3/4 mile): 4,718 out of 19,930 races (see Table 2.1)
finish : Place of finish (1 to 14)

flags : Indicator-type: Flag is one when current race is 2nd race within 60 days after maiden win: 3,737 out of 168,253 (2.2%)
horseage : Age of individual horse: from 2 to 12 (see Table 2.1)
horsetype : Type of horse: f = filly (female age 4 or less) 35.2%, m = mare (female age 5 and up) 6.5%, c = colt (male age 4 or less) 25.2%, h = horse (male age 5 and up) 5.7%, g = gelding (castrated male any age) 27.2%
JockID : Three character code for professional race-riders; of 470 different jockeys, 41 had 1000 or more rides
lasix1st : Indicator-type: 1 if this is the first time in its life the horse has had the drug lasix (1st time starters not included): 3,760/168,253 (2.2%)
lasixl : L if horse has been given lasix (96.5% have lasix, 3.5% do not)
ML1 : Inverted Morning Line fraction - ML1 = 1/(1 + ML), where ML is the original Morning Line pre-race estimate of the final odds - ML1 is normalized to account for horses that scratch before the race goes off (see Table 2.1)
monthborn : Month horse is foaled - note that horses born in the same year are all considered to have the same age, whether born January 1 or December 31. 93.9% are foaled in January through May (March 27.8%, April 25.4%, and February 21.0%) and the other 6.1% in June through December, these being mainly Southern Hemisphere horses (see Table 2.1)
nhor : Number of horses in a race, from 6 to 14 (minimum was set to 6 for analysis purposes). Percentages by number of horses: 6: 17.8%, 7: 19.9%, 8: 21.1%, 9: 16.7%, 10: 13.8%, 11: 7.9%, 12: 6.9%, 13: 2.0%, 14: 0.8% (see Table 2.1)
numlines : Number of previous races to a maximum of 10. Refers to the number of lines of past performances in the Daily Racing Form (see Table 2.1)
numlinediff : For each race, the average number of lines is calculated. Then each horse's number of lines is subtracted to get numlinediff (see Table 2.1)
odds : Final odds horse went off at: from 0.1 (minimum by law) to 243.2. For odds distribution of Regression Data see Table 2.1
odds1 : Odds in last race. Note that in other states minimum odds may be 0.05 (see Table 2.1)

odds2 : Odds in 2nd race back (see Table 2.1)
perf : Response variable - numerical evaluation of a horse's performance in a given race, from 121 to -210 (see Table 2.1)
pp : Post position in race - 1 to 14 (see Table 2.1)
sex of Race : Race restriction by sex: 41.56% of races were for female horses only and 58.44% were for either sex, although only 270 out of 98,329 horses (0.3%) were fillies or mares running against the boys
speed1diff : Difference from average speed of race (see Table 2.1)
speed12diff : Difference from average speed of the best of each horse's last two races (see Table 2.1)
speed123diff : Difference from average speed of the best of each horse's last three races (see Table 2.1)
statebred : Three character code for state or country horse was bred in. Most common states are California: 36.4%, Kentucky: 25.4%, Florida: 6.1%, and the most common foreign countries are Ireland: 2.0%, Great Britain: 1.7%, and Argentina: 1.1%
track : Three racetracks: SA had 45.75% of the races, HOL had 36.26%, and DMR had 17.98%
trainid : Three character code for trainers. There are 985 trainers, of which Doug O'Neil had the most horses entered: 4083, Bob Baffert had 3568, and 34 other trainers had 1000 or more horses entered
turf : One character field where T indicates a turf race: 27.2%, P a race on Polytrack or Pro-Ride synthetic surfaces: 7.8%, C a Cushion Track synthetic surface: 10.7%, and a blank a dirt surface: 54.3%
turfstarts : Number of lifetime races run on the turf: from 0 to a maximum of 78 (see Table 2.1)
turfwins : Number of lifetime wins on the turf: from 0 to a maximum of 15 (see Table 2.1)
type : Numerical designator of type of race: types are 0, 1, 4, 6, 8, 10, 14, 21, 22, 23, 31, 32, 33. Most common race types: Maiden Claiming (type = 0): 21.2%, Maiden Allowance (type = 1): 18.7%, Allowance Non-Winners of 1 (31): 13.9%, Claiming Middle (22): 12.9%, Claiming High (23): 10.0%, and Claiming Low (21): 9.7%

wbfold1 : Win bet fraction from odds of previous race (see Table 2.1)
wbfold2 : Win bet fraction from odds of 2nd race back (see Table 2.1)

2.2 SUMMARY STATISTICS OF NUMERICAL DATA

Some of the interesting statistics from Table 2.1: one horse raced 2 days after its previous race and came in 5th out of 6. Another, a 9 year old gelding, came back to the races after a layoff of more than 5 years and came in last in an 8 horse field. The oldest horse running was 12. The highest odds at the California tracks were 243.2, while a horse went off at 339.5 somewhere else.

2.3 SUBGROUPS

Various subsets of races and/or horses were run through the regression stepwise selection process with all covariates, to find predictor variables that are either hidden or are much more significant in a subgroup than in the overall set of all races and horses. Subgroups considered:

MClm : Maiden Claiming - races for horses who have never won a race and can be claimed for a specified claiming amount - considered to have the most volatile and unpredictable horses - many veteran jockeys avoid riding in these races - has the lowest purse amounts - Claimed covariates may be significant in these races
MAlw : Maiden Allowance - races for horses who have never won a race and are not claimable - many stars of the future are in these races
NonMaid : Races for horses who have won at least one race
Turf/Grass : Races run on a turf course - turf surface is thought to suit the running style of some horses and be slippery and unsuitable for others due to different leg action
NonTurf : Races not run on a turf course
Poly : Races run on Polytrack synthetic surface (replacing dirt surfaces)
Cush : Races run on Cushion Track synthetic surface (replacing dirt surfaces)
Alw : Allowance races - horses are not claimable - various restrictions usually apply limiting horses eligible for race - not including Allowance races for Non-Winners of one or two races
AlwNW12 : Allowance races for Non-Winners of either one or two races - these races are a threshold between horses that go on to have profitable careers and run in Handicap and Stakes races and those who fade into the lower class Claiming races

Stakes : Special races with highest purse amounts - also important for establishing a horse's reputation, which directly influences its breeding value
Age2 : Races for two-year-olds - young horses may be quite inconsistent in their performances
Age3 : Races for three-year-olds
Age3UP : Races for three-year-olds and up and races for four-year-olds and up
Sprint : Races for short distances, less than 7 furlongs - usually favors horses with early speed
MidDist : Races for distances of 8 to 9 furlongs - two-turn races where the first turn is close to the start (Santa Anita and Del Mar), so post position may be more significant in these races
LongDist : Races for distances greater than 9 furlongs - favors horses with stamina and lighter weight assignments
Fill : Races for fillies and mares only - these races may have more longshots
Male : Races for any sex - usually all male, but not always, so the Fill covariate can be analyzed here
yr9902 : Data set from 1/27/99 to 12/25/2002 - first 3 and 11/12 years of Regression Data
yr0305 : Data set from 12/26/02 to 12/25/2005 - middle 3 years of Regression Data
yr0607 : Data set from 12/26/05 to 11/4/2007 - last two complete years of Regression Data - may show trends that are changing
yr07 : Data set from 10/29/06 to 11/4/2007 - last complete year of Regression Data - may show trends that are changing
a67 : Races with 6 or 7 horses in race - predictors may vary, especially when compared to the a11up subgroup
a8910 : Races with 8, 9, or 10 horses in race
a11up : Races with 11 or more horses in race - predictors may vary when compared to the a67 subgroup, and post position may be more significant in these races
ClaimLow : Classes with low Claiming Amounts (8,000, 10,000, 12,500) - Claimed covariates may be significant in these races

ClaimMid : Classes with middle Claiming Amounts (16,000, 20,000, 25,500, 32,000) - Claimed covariates may be significant in these races
ClaimHigh : Classes with highest Claiming Amounts (40,000 and up) - Claimed covariates may be significant in these races
DMR : Races run only at Del Mar Race Track - trainers and jockeys may do better here than at other tracks
HOL : Races run only at Hollywood Race Track - trainers and jockeys may do better here than at other tracks
SA : Races run only at Santa Anita Race Track - trainers and jockeys may do better here than at other tracks
T65 : Races run on the downhill turf course at Santa Anita - these races are so different from all others that the covariates may greatly change values

2.4 DATA SEPARATED INTO ODDS RANGES

Regression data in Table 2.2 is separated into odds ranges for analysis. The top portion has 16 odds ranges, which are folded into 8 odds ranges just below. The 8 odds ranges are collapsed into 4 ranges, then 2, and then 1 line of totals. The table shows that the profitability (EV) of horses with odds from 0.1 to 9.0 is around 0.82 to 0.90, with an average of about 0.85 (third line from bottom of table). The EV then tapers off: for the 9 to 27 range it varies from 0.76 to 0.84, with an average of 0.81 (fifth row from bottom). EV then decreases rapidly to a low of 0.30 for the 75 and Up range. This supports the famous favorite-longshot bias (favorites are underbet so their odds are too high, and longshots are overbet so their odds are too low) that has often been found in gambling data. See Ali [8], Asch et al. [9], Ziemba and Hausch [10], Lo [4], and Bacon-Shone and Lo [11].

Looking at Table 2.2, the Perf column shows a definite decrease as the rows descend and the odds increase (and wbf decreases). Although the response variable Perf is independent of odds when it is calculated (see Section 1.5), it is highly (inversely) correlated with odds: the lower the odds, the higher the average Perf, and vice versa.
This is to be expected since the best performing horses (judging from previous races and other factors) get bet the most and thereby have the lowest odds. The sixteen odds ranges were chosen so that the separations fell on whole integers and an approximately equal number of horses would accumulate in each range (except for the two extremes). Having 16 ranges made it easy to convert to 8 ranges, then 4, 2, and 1 (overall totals). The report that generates Table 2.2 was designed so that it could be used for any number of horses and there would be an appropriate grouping of odds ranges for that number of horses. For analyses based on large sample sizes (many thousands of horses), the top group of 16 odds ranges, as presented in Table 2.2, is preferable and thus used in subsequent analyses. However, in some subgroup analyses, the Overlays and Underlays are shown using 4, 2, and 1 odds range groupings, since the number of horses considered is on the order of 1000-1500, too few for the full 16 odds range grouping. Frequently, for very small subsets of 50 or less, only the totals line is appropriate, but even then it could be of interest to scan upward, even to the 8 and 16 odds ranges, to see the odds distribution of the selected horses. Making the number of lines of each grouping of odds ranges a power of 2 enables the user to scan up and down the report, get an understanding of the distribution of the odds of the horses selected, and so understand the totals line better.

Table 2.2. Regression Data by Collapsing Odds Ranges

Odds-range   Total    wins    win%   EV     Perf
0-1          2552     1354    53.1   0.90   31.27
1-2          8890     3054    34.4   0.85   7.14
2-3          11889    2961    24.9   0.85   -10.63
3-4          11205    2078    18.5   0.82   -22.63
4-5          9473     1483    15.7   0.85   -34.66
5-6          7748     1009    13.0   0.84   -42.69
6-7          6797     782     11.5   0.85   -47.06
7-9          11136    1069    9.6    0.85   -55.31
9-11         8960     645     7.2    0.78   -65.90
11-14        9976     629     6.3    0.84   -75.66
14-19        10709    477     4.5    0.76   -87.24
19-27        10889    394     3.6    0.84   -99.83
27-35        7005     156     2.2    0.70   -111.82
35-50        8133     119     1.5    0.62   -126.91
50-75        6924     59      0.9    0.52   -145.51
75-UP        4569     15      0.3    0.30   -169.46

0.1-2        11442    4408    38.5   0.86   12.52
2-4          23094    5039    21.8   0.84   -16.45
4-6          17221    2492    14.5   0.84   -38.27
6-9          17933    1851    10.3   0.85   -52.18
9-14         18936    1274    6.7    0.81   -71.04
14-27        21598    871     4.0    0.80   -93.59
27-50        15138    275     1.8    0.66   -119.93
50-UP        11493    74      0.6    0.43   -155.03

0-4          34536    9447    27.4   0.84   -6.85
4-9          35154    4343    12.4   0.85   -45.37
9-27         40534    2145    5.3    0.81   -83.06
27-UP        26631    349     1.3    0.56   -135.08

All          136855   16284   11.9   0.78   -64.27

2.5 THE DAILY RACING FORM FOR THE SERIOUS HANDICAPPER

Most of the important pre-race information comes from The Daily Racing Form. The Racing Form is similar to a small newspaper and contains key information on every horse running in each race at a particular racetrack. Figure 1.1 shows information for a fifth race (the big 5 in the upper left corner) at Santa Anita on March 9, 2007. The race information is given at the top: the distance is 7 furlongs, and it is a claiming race, which means the horses may be claimed or purchased (by registered trainers or owners only) for $25,000. The purse or prize money is $28,000, of which typically 60% goes to the winner, 20% to 2nd place, 12% to 3rd, 6% to 4th, and 2% to 5th (the distribution percentages vary from state to state). After the purse amount come the race restrictions: this race is open only to three year old fillies. Next is the weight assignment: all must carry at least 122 pounds (jockey with added weights as needed). Horses are allowed three pounds off if they have not won a race since January 30th of this year. Also, if the horses run for a lower claiming amount ($22,500), they are allowed two pounds off. All information pertaining to the race in general is given in this top area.

Below the race information are three sections beginning with 2 Tee Dee, 3 Brought It, and 1 Warrens Grindstone. Each of these sections is a detailed summary of the key information for the three fillies: Tee Dee, Brought It, and Warrens Grindstone (truncated). Horses are listed in post position order (the source for the pp variable), so Tee Dee, if she runs, will leave from the inside, or post position one. If for some reason she does not run (scratches), then Brought It starts in the one post position.
The big number in front of the names is the official number used for wagering purposes, known as the Saddle Number or Cloth Number. All wagers are made using this number, which is not to be confused with post position. Below the Saddle Number is the jockey name and his (or her) record for the year and for the previous year. For Tee Dee, the jockey is M A Pedroza and to the right is the trainer name, Jeff Mullins. Just above Jeff Mullins is the breeder, Nicholas... (Ky), the Ky indicating Tee Dee was bred in Kentucky, which is the source for the statebred variable. Note that if a horse is from another country, that country's code would be in parentheses next to the horse's name. For this study, country bred in and state bred in were lumped together into the same field, statebred. On the top line, above the trainer and breeder names, is B f 3 (Jan), which indicates Tee Dee is a bay colored, three year old filly who was foaled in January; the monthborn, horsetype, and horseage variables are obtained from this. To the right of the trainer and breeder names is a large L 119. The L signifies that the horse will be given the legal anti-bleeding drug Lasix, and the 119 is the assigned weight. To the right of L 119, at the top, is Life 7. This is where the variable numlines comes from (to a maximum of 10), 7 being the number of races Tee Dee has had so far in her racing career.

Looking back to the far left below Pedroza is 11Feb07, indicating that Tee Dee's last race was February 11, 2007. The variable days1st is the difference between the current race date and the previous race date. Just below 11Feb07 is 12Jan07, which is Tee Dee's 2nd race back, and below that is 3Dec06, Tee Dee's 3rd race back. The two predictor variables days2nd and days3rd come from these dates. In the blank area above and to the left of the large L 119 is where a short note may appear, such as blinkers off, blinkers on, or 1st time lasix. The variables blinksoff, blinkson, and lasix1st come from here. Also not shown here, but of extreme importance, is the Morning Line, which will appear in large numbers just to the left of the horse's name. Directly below the L 119 in Tee Dee's section is 2.60, and just below that is 16.20. These are the odds Tee Dee went off at in her last two races and are the source of the odds1 and odds2 variables. Looking at the line for Brought It that begins with 11Feb07, there is a (circled) f followed by Clm c-(20-18), which indicates that Brought It was claimed in her last race, where the claiming prices were from $18,000 to $20,000. Note that Tee Dee has a similar notation in her 2nd race back. These notations are where the cl12 variable is from.

Obviously there is a lot more information here that is not used in this study. The number of possible patterns and combinations of variables that could be analyzed is almost endless.

CHAPTER 3 METHODOLOGY

Data is first prepared in MS ACCESS: the response variable Perf is calculated, predictor variables are created, and two datasets, the Regression Data and the Testing Data, are prepared and then exported to SAS. Regression analysis is performed in SAS on the Regression Data until a suitable model is found. Test Data is then used in the model to generate two files: the Baseline and Results Files. Estimated values for the response variable, Perf, are found using the model's regression coefficients and then exported to Matlab for Monte Carlo type processing. For the Baseline Perfs, only a simple formula using two predictor variables is applied. For the Results file, all the significant predictor variables are used to calculate Perf. At this point, horses must be grouped together so they can be processed as clusters of horses within a race. Monte Carlo processing produces estimated probabilities of each horse finishing 1st, 2nd, 3rd, or 4th in its race for both the Baseline and Results Datasets. These probabilities are then exported back into ACCESS for comparison and report generation. Win probabilities in the Results dataset that differ significantly from the Baseline win probabilities are separated into two groups: Overlays (profitable bets), where estimated probabilities are greater than the corresponding Baseline probability, and Underlays (unprofitable bets), where probabilities are less. Tables are generated showing these results and key information.

3.1 PERF: THE IMPORTANT RESPONSE VARIABLE

For regression analysis, the key step is getting a functional response variable. A continuous dependent variable, Perf, is calculated from the Results data. Perf is an estimated numerical evaluation of each horse's performance in a race, independent of the race's Class (level of competition). See Section 1.5 for more information and background on Perf and Power Point.
Finding Perf is a two step process. First, a Power Point for each race is derived as a function of the number of horses in the race and the distances in lengths between the top four finishers. Then, for each horse, Perf is a function of finish and distance (in lengths) from the Power Point. The greater the Perf, the stronger the finish, and vice versa. Perf varies from a maximum of 121 to a minimum of -210. The minimum Perf, -210, is assigned to all horses considered to have finished sufficiently far back that there is no value in trying to evaluate their performance. This cut-off point is around 5 to 8 lengths behind the Power Point, depending on surface type. Perf also increases as the wbf increases, since the strongest horses have the lower odds (and thus higher wbf) and higher Perfs.

3.2 DATA PREPARATION IN MS ACCESS

Datasets are prepared for SAS to facilitate easy importing and analysis. Although each race is a cluster of horses, SAS processes each horse record independently (so the order of the records does not matter). Therefore some predictor variables that are race specific have to be prepared accordingly. For example, The Daily Racing Form provides a speed rating which varies from 0 to 117. Horses in high class races will, in general, have high speed ratings and those in low class races will have low speed ratings. What matters is a horse's speed rating relative to the other horses in the race. So to get Speed1Diff, the average of the speed ratings from each horse's last race is found for the current race and then subtracted from each horse's speed rating. Thus Speed1Diff indicates the speed of each horse relative to the other horses in the race, independent of the speed ratings of all other races and horses. Similar processing in ACCESS was done for other variables that were specific to the race. Some variables, including the response variable, Perf, were created using Visual Basic programs developed within the MS ACCESS framework or by using the flexible Query system in ACCESS. Other covariates developed like this are MLadj, Odds1, Odds2, and Flags.
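The race-relative adjustment described for Speed1Diff can be sketched as follows. This is a minimal Python illustration, not the Visual Basic or Query code used in ACCESS; the field names race_id and speed and the sample ratings are hypothetical. It follows the description in the text: average the speed ratings within a race, then subtract that average from each horse's rating.

```python
from collections import defaultdict

def add_speed1diff(records):
    """Attach a race-relative speed difference to each horse record."""
    # Group the horse records by race (field names are assumptions).
    by_race = defaultdict(list)
    for rec in records:
        by_race[rec["race_id"]].append(rec)
    # Within each race, subtract the race average from each horse's rating.
    for horses in by_race.values():
        race_avg = sum(h["speed"] for h in horses) / len(horses)
        for h in horses:
            h["speed1diff"] = h["speed"] - race_avg
    return records

# Hypothetical speed ratings for two small races.
sample = [
    {"race_id": 1, "speed": 80},
    {"race_id": 1, "speed": 90},
    {"race_id": 1, "speed": 100},
    {"race_id": 2, "speed": 60},
    {"race_id": 2, "speed": 70},
]
add_speed1diff(sample)
print([r["speed1diff"] for r in sample])  # [-10.0, 0.0, 10.0, -5.0, 5.0]
```

Because the differences are centered within each race, a rating of 90 in a fast field and a rating of 65 in a slow field can both map to a value near zero, which is the point of the adjustment.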
Predictor Variables Requiring Special Processing in ACCESS:

MLadj : Adjusted Morning Line - Morning Lines are normalized so total inverted Morning Lines add to 1
Perf : Performance Indicator (and Power Point) calculated for each horse
Speed1Diff : Average Speed for race is calculated, then subtracted from each horse's Speed Rating
Speed12Diff : Each horse's maximum speed rating in last 2 races is subtracted from race average
Speed123Diff : Each horse's maximum speed rating in last 3 races is subtracted from race average
NumLinesDiff : Each horse's number of race-lines (previous races up to a maximum of 10) is subtracted from race average
Flags : Indicator-type: Flag is zero unless current race is 2nd race within 60 days after maiden win
Odds1 : Odds in last race
Odds2 : Odds in 2nd race back
Days1st : Number of days since last race; if zero, set to 180 for processing purposes (1st time starters had values of zero, which would throw off calculations)
Days2nd : Number of days since 2nd race back; if zero, set to the maximum of 200 or Days1st + 20 for processing purposes (1st and 2nd time starters had values of zero, which would throw off calculations)
Days3rd : Number of days since 3rd race back; if zero, set to the maximum of 230 or Days2nd + 30 for processing purposes (1st, 2nd, and 3rd time starters had values of zero, which would throw off calculations)

3.3 SAS OPERATIONS AND PROCESSING

Many predictor variables are created in SAS based on data imported from ACCESS. Most are of the indicator type: 1 if present in a data field, 0 if not present. Specific jockeys and trainers are examples: 72 individual trainers have their own covariate created from the one field TrainID, and 24 jockey covariates are created from JockID. Other predictor variables are calculated in SAS and have continuous values, such as wbfold1 and wbfold2, which are the win bet fractions for odds1 and odds2 respectively.

3.3.1 Non-Indicator Covariates Created in SAS

See Table 2.1 for statistics on these covariates.

wbf : Win bet fraction of odds: 1 / (1 + odds)
wbfall : Win bet fraction raised to an exponent determined through the Box-Cox Method
wbfold1 : Win bet fraction from odds of previous race
wbfold2 : Win bet fraction from odds of 2nd race back

3.3.2 Indicator Type Covariates Created in SAS

The Post Position field yielded five indicator variables that were of interest, covering the three inside posts 1 to 3 and the two far outside post positions: pp1, pp2, pp3, ppout (far outside post), and ppinout (the post just to the left of the far outside post). Saving ground (running distance) on the turns is quite important: the less distance a horse has to run, the better its chances of a good finish.
Post position is a definite factor for getting a horse into a favorable position on turns. In many two-turn races, such as a mile at Santa Anita and Del Mar, the first turn comes up in less than a furlong, and the inside positions can be an advantage for quick starting horses who then save ground on the first turn. However, post position 1 is considered the most dangerous position because of its proximity to the inside rail, where many horse racing accidents have taken place - oftentimes horses are pinched between the rail and other horses.

Indicator variables for seven countries and eight states came from the statebred field. The jockey field is used to create 24 indicator-type covariates for individual jockeys. In a similar fashion, 72 indicator-type covariates for individual trainers were created (see Table 3.1). Other indicator variables included three Claimed indicators from the cl12 field (cl1: horse claimed in last race, cl2: claimed 2nd race back, and cl3: claimed 3rd race back), two from the blinkers field (blinkson and BlinksOff), and two from the numlines field (start1st and start2nd). Two input fields were changed to indicator types (Lasix1st and notlasix) to facilitate processing.

Table 3.1. Trainer Names and ID Codes

ID    Name                  ID    Name                  ID    Name
A     Barry Abrams          Ag    Paul Aguirre          AV    A. C. Avila
B     Bob Baffert           Bec   Rafael Becerra        C     Jack Carava
Cad   Ruben Cardenas        Cec   Ben Cecil             CJ    Julio Canani
Cs    James Cassidy         CV    Vladmir Cerin         D     Neil Drysdale
DC    Caesar Dominguez      Dej   Jose DeLima           DO    Craig Dollase
EL    Ronald Ellis          Eur   Peter Eurton          F     Robert Frankel
FA    Jerry Fanning         Ga    Carla Gaines          GL    Patrick Gallagher
Gla   Mark Glatt            Gok   Sal Gonzalez          Gp    Paco Gonzalez
Gre   Beau Greely           Gut   Jorge Guitierrez      H     Robert B. Hess
HA    Mike Harrington       Hab   Eoin Harty            HD    Bruce Headley
Hen   Dan Hendriks          HF    David Hofmans         Hol   Jerry Hollendorfer
Jom   Martin F. Jones       Kna   Steve Knapp           Kor   Brian Koriner
La    David La Croix        LE    Craig Lewis           Ma    Michael Machowsky
Ma2   Gary Mandella         MC    Ronald McAnally       MD    Richard Mandella
Mii   Peter Miller          MM    Mike Mitchell         Mo    Henry Moreno
Mog   Ed Moger              Mul   Jeff Mullins          Mum   Kristin Mulhall
ON    Doug O'Neil           Paa   Christopher Paasch    Pei   Jorge Periban
Pol   Marcelo Polanco       Pow   Leonard Powell        Puy   Mike Puype
SA    John Sadler           SH    Sanford Shulman       Shc   Gary Sherlock
She   Art Sherman           Shi   John Shirreffs        Si    Clifford Sise
SJ    Jenine Sahadi         SM    Melvin Stute          SP    William Spawr
Ste   Roger Stein           Stg   Gary Stute            TR    Eddie Truman
VB    Jack Van Berg         VD    Darrell Vienna        Wa    Ward Wesley
WK    Kathy Walsh           WT    Ted West              Zuc   Howard Zucker

3.3.3 WBF Exponent Found Using Box-Cox Method

The best predictor of a horse's performance is the odds it goes off at, as shown by Table 2.2, where the two performance measurements, win percentage and Perf, decrease reading down the table as the odds increase. The powerful betting public, made up of thousands of bettors wagering many thousands and frequently millions of dollars on a single race, is constantly searching for a bargain horse: one whose return is better than expected. As in the stock market, there are last minute corrections to horses that appear to have value. Although the odds are the best predictor, they do not come in an easy-to-use form, since odds do not translate directly to probabilities and the total odds of all the horses in a race has no significance. Inverting the odds to get the win bet fraction, wbf = 1/(odds + 1), is a start, since the win bet fractions would add to one if there was no House Cut. With the House Cut, which varies due to Breakage, the win bet fractions sum to around 1.20. Thus win bet fractions indicate how strongly each horse is bet relative to the others.

In the early stages of this project, it was noticed that the square root of wbf was a better fit than wbf itself. So it seemed likely that the best fit was wbf raised to an optimal exponent. Thus the well-known Box-Cox [12] transformation procedure, based on a maximum likelihood estimation routine, is used to find the optimal exponent for wbf. Notice that in this instance, wbf is the response variable and Perf is the predictor variable. This procedure was performed starting with coarse intervals of 0.1 for the exponent; then intervals of 0.01, 0.001, and 0.0001 were used, reaching the limits of accuracy for the SAS Box-Cox procedure. Thus an exponent was found to the 4th decimal place (0.1548). A new predictor variable was then created for each horse: wbfall = wbf^0.1548.
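The two transformations in this section, wbf = 1/(odds + 1) and wbfall = wbf^0.1548, can be sketched in Python as below. Only the two formulas and the exponent come from the text; the function names and the sample odds are illustrative assumptions, and the Box-Cox search itself (done in SAS) is not reproduced.

```python
BOX_COX_EXPONENT = 0.1548  # exponent reported in the text

def win_bet_fraction(odds):
    """wbf = 1 / (odds + 1)."""
    return 1.0 / (odds + 1.0)

def wbf_all(odds, exponent=BOX_COX_EXPONENT):
    """wbfall = wbf raised to the Box-Cox exponent."""
    return win_bet_fraction(odds) ** exponent

# Hypothetical final odds for a six-horse race (not taken from the data).
race_odds = [1.5, 3.0, 5.0, 8.0, 12.0, 20.0]
fractions = [win_bet_fraction(o) for o in race_odds]

# The fractions sum to more than 1 because the odds include the House Cut.
print(sum(fractions))

# A horse at odds of 9.0 has wbf = 0.10.
print(round(wbf_all(9.0), 2))  # 0.7
```

At odds of 9.0 the sketch gives wbf = 0.10 and wbfall of about 0.70, close to the medians listed for those two variables in Table 2.1.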
3.3.4 SAS Regression and Model Selection

The REG procedure in SAS fits a linear regression model by least squares to find estimated coefficients for each predictor variable. The Stepwise, Forward, and Backward Selection processes are used (with a selection criterion of 0.05) and compared to find the best model. These selection processes depended on Mallows' Cp criterion. The Variance Inflation Factor (VIF) diagnostic is used to check for multicollinearity. After consideration, various covariates were deleted from the final model due to correlation problems and low significance.

Data Subgroups are run through the same process as described above and, if warranted, new predictor variables are created, always of the indicator type since they are specific to the subgroups. Note that in some cases original covariates may be set to 0 when the new covariates are set to 1, to avoid correlation problems. The regression process is repeated with the new and original covariates. The VIF diagnostic is especially important for checking for correlation between old and new covariates. The standard deviation used for Monte Carlo processing of test results is generated in this step.

A Baseline model for testing was created using wbfall and the number of horses in the race to get a predicted Perf value for each horse. Table 3.2 presents the parameter estimates.

Table 3.2. Test Data Baseline Model

Variable    Parameter Estimate   Standard Error   P-value   95% CI
Intercept   -412.04              2.28             <.0001    (-416.50, -407.57)
wbfall      459.28               2.38             <.0001    (454.62, 463.93)
nhor        2.96                 0.13             <.0001    (2.71, 3.22)

3.4 MATLAB: SIMULATING HORSE RACES FOR PROBABILITY ESTIMATES

Two files containing the predicted Perfs for the horses in the Test Dataset were created: the Test Baseline File and the Test Results File. They were then exported to Matlab for Monte Carlo-style processing. The standard deviation used here is generated in the step described in Section 3.3.4. For this step, horses are grouped together by race. Each horse in each race has a random normal number times the standard deviation added to its predicted Perf, to simulate the variance in performance as predicted by the Regression Model of Table 3.3. Each race was simulated 100,000 times. The number of simulations in which a particular horse had the highest total was divided by 100,000 to get its estimated probability of winning. The same process was used to get estimated values for the 2nd, 3rd, and 4th place probabilities.

3.5 COMPARING PROBABILITY FILES IN ACCESS

The two probability files, Test Baseline and Test Results, are exported to ACCESS for comparison reports. The final model probabilities, which originated from the estimated regression equation, are compared to the baseline probabilities. Those that differ significantly are separated into two groups: estimated probabilities higher than the baseline's are considered good bets (Overlays), while those less than the baseline probabilities are bad bets (Underlays). Each group is displayed in an odds-based report format.
The Results File was generated by applying the regression function to each horse in the Test Data, plugging in the regression coefficients shown in Table 3.3 to get predicted values for Perf (pred).
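The Monte Carlo step of Section 3.4 can be illustrated with a short Python sketch (the study itself used MATLAB). The function name, the predicted Perf values and the standard deviation of 40 are hypothetical placeholders; only the procedure (add normal noise, rank, count finishes, divide by the number of simulations) follows the text.

```python
import random


def simulate_finish_probs(pred_perf, sd, n_sims=100_000, seed=0):
    """Estimate each horse's probability of finishing 1st through 4th.

    pred_perf: predicted Perf values for the horses in one race.
    sd: standard deviation of the performance noise (assumed known).
    Returns one row per horse: [P(1st), P(2nd), P(3rd), P(4th)].
    """
    rng = random.Random(seed)
    n = len(pred_perf)
    places = min(4, n)
    counts = [[0] * places for _ in range(n)]
    for _ in range(n_sims):
        # Simulated performance = predicted Perf + sd * standard normal draw.
        totals = [p + sd * rng.gauss(0.0, 1.0) for p in pred_perf]
        # Highest simulated total finishes first.
        ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
        for place in range(places):
            counts[ranked[place]][place] += 1
    return [[c / n_sims for c in row] for row in counts]


# Hypothetical five-horse race; fewer simulations than the study for speed.
probs = simulate_finish_probs([-10.0, -35.0, -50.0, -70.0, -90.0], sd=40.0, n_sims=20_000)
for horse, row in enumerate(probs, start=1):
    print(horse, [round(p, 3) for p in row])
```

Because every simulated race has exactly one horse in each place, each column of the result sums to 1.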

Table 3.3. Regression Model Coefficients

Variable   Description             Est.     Std. Error  P-value  95% CI               VIF
Intercept  intercept               -403.22  2.37        <.0001   (-407.88, -398.58)   0
wbfall     wbf to exponent         447.57   2.64        <.0001   (442.40, 452.75)     1.35
nhor       number of horses        3.14     0.13        <.0001   (2.89, 3.41)         1.11
days2nd    days 2nd race back      -0.03    0.023       <.0001   (-0.035, -0.25)      1.12
speed12    diff speed last 2       0.38     0.07        <.0001   (0.24, 0.52)         1.25
numline    diff number of lines    3.64     0.13        <.0001   (3.34, 3.89)         1.10
blinkon    blinkers on             -16.68   1.10        <.0001   (-18.84, -14.53)     1.00
pp2        post position 2         2.61     0.76        0.0006   (1.11, 4.01)         1.05
ppout      far outside post pos.   -2.50    0.76        0.0010   (-3.99, -1.01)       1.05
pp3        post position 3         2.20     0.76        0.0039   (0.71, 3.70)         1.05
notlasix   not using lasix         -7.86    1.29        <.0001   (-10.40, -5.33)      1.03
jesp       jockey - V. Espinoza    -3.10    1.02        0.0023   (-5.10, -1.10)       1.03
jbat       jockey - T. Baze        9.11     2.72        0.0317   (5.03, 13.25)        1.02
jsol       jockey - A. Solis       4.41     1.17        0.0002   (2.13, 6.71)         1.08
jgar       jockey - M. Garcia      6.00     1.37        <.0001   (3.32, 8.68)         1.01
jsmi       jockey - M. Smith       -5.47    1.71        0.0014   (-8.84, -2.11)       1.01
jbav       jockey - M. Baze        9.91     2.26        <.0001   (5.47, 14.34)        1.00
jsor       jockey - D. Sorenson    6.99     2.00        0.0005   (3.07, 10.91)        1.00
jbj        jockey - R. Bejarano    8.41     2.94        0.0319   (4.27, 12.87)        1.04
jqui       jockey - A. Quimez      6.24     2.70        0.0211   (2.16, 10.36)        1.02
jror       jockey - J. Rosario     14.98    3.59        <.0001   (10.77, 19.43)       1.03
Bec        trainer - R. Becerra    8.02     2.64        0.0024   (2.84, 13.20)        1.01
Cad        trainer - R. Cardenas   -12.38   3.67        0.0007   (-19.57, -5.20)      1.00
Cec        trainer - B. Cecil      7.48     3.70        0.0433   (0.23, 14.74)        1.00
EL         trainer - R. Ellis      6.49     2.86        0.0235   (0.87, 12.10)        1.00
Gut        trainer - J. Guiterrez  15.19    3.70        <.0001   (7.94, 22.44)        1.00
HD         trainer - B. Headley    6.40     2.45        0.0274   (2.75, 10.55)        1.04
Kna        trainer - S. Knapp      -7.37    2.11        0.0005   (-11.5, -3.24)       1.00
LE         trainer - C. Lewis      -6.85    2.53        0.0067   (-11.81, -1.90)      1.00
Ma2        trainer - G. Mandella   12.12    3.90        0.0018   (4.49, 19.74)        1.00
MD         trainer - R. Mandella   4.40     2.13        0.0391   (0.22, 8.57)         1.02
Puy        trainer - M. Puype      12.70    3.81        0.0008   (5.24, 20.16)        1.00
SA         trainer - J. Sadler     8.34     1.72        <.0001   (4.97, 11.71)        1.02
VB         trainer - J. Van Berg   -9.28    2.25        <.0001   (-13.69, -4.88)      1.00
VD         trainer - D. Vienna     9.25     2.66        0.0005   (4.03, 14.47)        1.00
WK         trainer - K. Walsh      6.28     3.70        0.0353   (3.17, 9.34)         1.00
zfr        bred in France          -7.00    2.60        0.0070   (-12.09, -1.91)      1.01
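To make the use of Table 3.3 concrete, the following sketch evaluates the fitted linear equation for a single horse. Only a subset of the coefficients is copied from the table, and the covariate values of the example horse (including the wbfall value of 0.8) are hypothetical, chosen purely for illustration.

```python
# A subset of the Table 3.3 estimates; the remaining rows would be added the same way.
COEF = {
    "Intercept": -403.22,
    "wbfall": 447.57,
    "nhor": 3.14,
    "days2nd": -0.03,
    "speed12": 0.38,
    "numline": 3.64,
    "blinkon": -16.68,
    "notlasix": -7.86,
}


def predict_perf(covariates, coef=COEF):
    """Predicted Perf = intercept + sum of (estimate * covariate value)."""
    return coef["Intercept"] + sum(coef[name] * value for name, value in covariates.items())


# Hypothetical horse: wbfall of 0.8 in an 8-horse field, adding blinkers today.
horse = {"wbfall": 0.8, "nhor": 8, "blinkon": 1}
print(round(predict_perf(horse), 3))
```

Since blinkon is an indicator variable, switching it from 0 to 1 shifts the predicted Perf by exactly its coefficient, -16.68, with everything else held fixed.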

CHAPTER 4 RESULTS

Test Results are divided into two groups: Overlays, horses with estimated probabilities significantly greater than the baseline's probabilities (more than 1.33 times the Baseline), and Underlays, horses with estimated probabilities significantly less than the baseline's probabilities (less than 0.6 times the Baseline). Each group is further divided into two tables, the first being a comparison of win results to baseline results and the second being a comparison of 2nd, 3rd, and 4th place finishes.

4.1 OVERLAYS

For the Overlays (Tables 4.1 and 4.2), 1,548 horses were selected from the test group. Note that the proportion selected for the low (0 to 4) odds range, 181/1548 = 11.7%, was much less than the proportion for the whole test group, 7588/31398 = 24.2%. Thus one must be careful when comparing totals for the Results to totals for the Baseline, since the 0 to 4 odds range has a much higher percentage of 1sts, 2nds, 3rds, and 4ths. For example, when comparing the Out numbers in Table 4.2, the total percentage for the Baseline (OutB = 53.6) is lower than that for the Results (OutA = 55.2), while at the same time OutA is lower than OutB within each of the four odds ranges. This is a kind of numerical optical illusion caused by the Results being weighted toward high-odds horses.

For the Underlays, the weighted distribution is much more severe. Of the 1,036 horses selected in the Underlays group, only 11 were in the 0-4 odds range: 11/1036 = 1.1%, compared with 7588/31398 = 24.2% of the whole test group. Thus care must be taken when looking at totals.

For horses with Results probabilities significantly greater than the Baseline probabilities (more than 1.33 times the Baseline), the Overlays in Table 4.1 show an overall increase in EV/Profitability from 0.78 to 1.03.

Table 4.1. Overlays: Comparison of Win Results(A) to Baseline(B)

range   TotB    TotA    winB    winA    1%B     1%A     EV:B    EV:A    PerfB   PerfA
0-4     7588    181     2022    52      26.6    28.7    0.85    1.09    -0.8    12.1
4-9     8519    416     1019    53      12.0    12.7    0.83    0.94    -35.8   -22.3
9-27    9509    592     533     44      5.6     7.4     0.84    1.20    -71.3   -52.0
27-UP   5782    359     72      6       1.2     1.7     0.53    0.82    -128.5  -113.8
0-9     16107   597     3041    105     18.9    17.6    0.83    0.98    -19.3   -11.9
9-UP    15291   951     605     50      4.0     5.3     0.72    1.06    -92.9   -75.4
All     31398   1548    3646    155     11.6    10.0    0.78    1.03    -55.2   -50.9
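The caution about comparing totals can be checked directly from the counts in Table 4.1. The short sketch below (variable and function names are illustrative) recomputes the win percentages: the Overlays beat the Baseline within each of the four odds ranges, yet the pooled Overlay win percentage is lower, because far fewer Overlays fall in the 0-4 range.

```python
# (odds range, baseline starters, baseline winners, overlay starters, overlay winners)
# Counts copied from Table 4.1.
TABLE_4_1 = [
    ("0-4", 7588, 2022, 181, 52),
    ("4-9", 8519, 1019, 416, 53),
    ("9-27", 9509, 533, 592, 44),
    ("27-UP", 5782, 72, 359, 6),
]


def win_pct(winners, starters):
    """Win percentage for a group of starters."""
    return 100.0 * winners / starters


for name, tot_b, win_b, tot_a, win_a in TABLE_4_1:
    print(f"{name:6s} baseline {win_pct(win_b, tot_b):5.1f}%  overlays {win_pct(win_a, tot_a):5.1f}%")

# Pooled percentages weight each range by its number of starters.
pooled_baseline = win_pct(sum(r[2] for r in TABLE_4_1), sum(r[1] for r in TABLE_4_1))
pooled_overlays = win_pct(sum(r[4] for r in TABLE_4_1), sum(r[3] for r in TABLE_4_1))
print(f"All    baseline {pooled_baseline:5.1f}%  overlays {pooled_overlays:5.1f}%")
```

The same weighting argument applies to the Out percentages discussed in the text.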

Table 4.2. Overlays: Comparison of 2nd, 3rd, and 4th Results(A) to Baseline(B)

range   TotA    2A      3A      2%B     2%A     3%B     3%A     4%B     4%A     OutB    OutA
0-4     181     37      31      20.6    20.4    16.3    17.1    12.1    10.5    24.4    23.2
4-9     416     78      57      14.4    18.8    14.6    13.7    14.0    13.7    45.0    41.1
9-27    592     53      69      7.6     9.0     9.9     11.7    12.4    13.3    64.4    58.6
27-UP   359     9       22      2.3     2.5     3.9     6.1     6.1     7.5     86.5    82.2
0-9     597     115     88      17.3    19.3    15.4    14.7    13.1    12.7    35.3    35.7
9-UP    951     62      91      5.6     6.5     7.6     9.6     10.0    11.1    72.8    67.5
All     1548    177     179     11.6    11.4    11.6    11.6    11.6    11.8    53.6    55.2

4.2 UNDERLAYS

Underlays are the horses whose estimated winning probabilities are significantly less than the Baseline's probabilities (less than 0.6 times the Baseline). These horses show a marked decrease in overall EV/Profitability, from 0.78 to 0.46 (see Table 4.3). These values are somewhat misleading, since most of the horses selected in this Results group were in the high odds ranges, which had low EVs to begin with (see Table 2.2), but the EVs of 0.0 and 0.67 for the 4-9 and 9-27 odds ranges are lower than the corresponding baseline EVs. In the breakdown by the four odds ranges in Table 4.4, the first odds range, 0.1 to 4, should be ignored since it has only 11 horses in it. The other three odds ranges showed decreases in all finishes (2nd, 3rd, and 4th), which resulted in increases in the OutA percentages over the Baseline's OutB numbers.

Table 4.3. Underlays: Comparison of Win Results(A) to Baseline(B)

range   TotB    TotA    winB    winA    1%B     1%A     EV:B    EV:A    PerfB   PerfA
0-4     7588    11      2022    3       26.6    27.3    0.85    1.10    -0.8    -23.5
4-9     8519    61      1019    0       12.0    0.0     0.83    0.00    -35.8   -86.4
9-27    9509    334     533     13      5.6     3.9     0.84    0.67    -71.3   -107.7
27-UP   5782    630     72      5       1.2     0.8     0.53    0.39    -128.5  -160.7
0-9     16107   72      3041    3       18.9    4.2     0.83    0.17    -19.3   -76.8
9-UP    15291   964     605     18      4.0     1.9     0.72    0.49    -92.9   -142.3
All     31398   1036    3646    21      11.6    2.0     0.78    0.46    -55.2   -137.8

Table 4.4. Underlays: Comparison of 2nd, 3rd, 4th Results(A) to Baseline(B)

range   TotA    2A      3A      2%B     2%A     3%B     3%A     4%B     4%A     OutB    OutA
0-4     11      1       2       20.6    9.1     16.3    18.2    12.1    9.1     24.4    36.4
4-9     61      6       7       14.4    9.8     14.6    11.5    14.0    11.5    45.0    67.2
9-27    334     10      17      7.6     3.0     9.9     5.1     12.4    11.4    64.4    76.6
27-UP   630     3       15      2.3     0.5     3.9     2.4     6.1     2.1     86.5    94.3
0-9     72      7       9       17.3    9.7     15.4    12.5    13.1    11.1    35.3    62.5
9-UP    964     13      32      5.6     1.3     7.6     3.3     10.0    5.3     72.8    88.2
All     1036    20      41      11.6    1.9     11.6    4.0     11.6    5.7     53.6    86.4

4.3 COMPARISONS OF RESULTS BY THE FOUR MAJOR ODDS RANGES

A visual representation of the results best highlights the differences between the three datasets: Underlays, Overlays, and the Baseline. Care must be taken when comparing the overall results, since the proportions of horses in the four major odds ranges are different for each of the three datasets, as noted in Sections 4.1 and 4.2. For example, Figure 4.1 shows that the win percentages are fairly even for the odds range 0-4, but it should be noted that the Underlays had only 11 horses in that group, of which three were winners. Figure 4.2 also reflects this situation in the 0-4 range. In general, though, the two figures show that there is a substantial increase in win percentage and Expected Value with the Overlays subset and a definite decrease with the Underlays. Figure 4.3 shows the combined percentages for 1st through 4th place finishes.

[Figure 4.1: six bar-chart panels, one per odds range (0 to 4, 4 to 9, 9 to 27, 27 and UP, 0 to 9, 9 and UP); vertical axis: Win %; bars: Under, Base, Over.]

Figure 4.1. Win percentage comparisons between Underlays, Baseline, and Overlays by odds ranges. Win percentages for Overlays are substantially greater than those for Underlays.

[Figure 4.2: six bar-chart panels, one per odds range (0 to 4, 4 to 9, 9 to 27, 27 and UP, 0 to 9, 9 and UP); vertical axis: EV; bars: Under, Base, Over.]

Figure 4.2. EV comparisons between Underlays, Baseline, and Overlays by odds ranges. EVs for Overlays are much greater than those for Underlays.

[Figure 4.3: four stacked bar-chart panels, one per odds range (0 to 4, 4 to 9, 9 to 27, 27 and UP); vertical axis: Total %; stacks: 1st, 2nd, 3rd, 4th; bars: Under, Base, Over.]

Figure 4.3. Test results: Finish comparisons between Underlays, Baseline, and Overlays by odds ranges. Total percentages are significantly greater for Overlays than for Underlays except for the 0-4 range.

CHAPTER 5 MULTICOLLINEARITY

Table 5.1 shows part of the SAS Variance Inflation Factor (VIF) diagnostic results. Covariates not shown (mostly trainers, jockeys, and state-bred) all had VIF values less than 1.4 and so were not flagged for high collinearity concerns. As expected, days1st (days since last race), days2nd (days since 2nd race back), and days3rd are correlated, since days2nd is by definition always larger than days1st, and days3rd is always larger than days2nd. Their VIF values in Table 5.1 reflect this correlation, but when days2nd is by itself in the Final Model, its VIF drops to 1.12 (Table 3.3), showing only small correlation with the remaining covariates. Similarly, speed1diff, speed12diff, and speed123diff show VIF values over 2 in Table 5.1, but when speed12diff is by itself, the VIF value drops to around 1.25, as shown in Table 3.3. For the Final Model, the highest VIF in Table 3.3 is 1.35, for wbfall, from which the conclusion is that there is no serious concern of multicollinearity in the model.

Table 5.1. Covariates With Variance Inflation Factor Greater Than 2.0

Variable      Parameter Estimate  Standard Error  P-value  95% CI                Variance Inflation
days1st       0.0072              0.0045          0.1046   (-0.0015, 0.016)      2.56
days2nd       -0.029              0.0049          <.0001   (-0.039, -0.019)      5.07
days3rd       -0.0077             0.0036          0.0323   (-0.015, -0.00065)    3.42
speed1diff    -0.085              0.13            0.5124   (0.52, 0.087)         4.25
speed123diff  0.22                0.12            0.0620   (-0.011, 0.46)        3.51
speed12diff   0.19                0.09            <.0001   (0.04, 0.34)          2.08
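The behavior of the days covariates can be reproduced on synthetic data. The sketch below is not the SAS VIF option; it uses the standard identity that the VIF of covariate j equals the j-th diagonal element of the inverse of the covariates' correlation matrix. The data are simulated (independent uniform gaps between races), and all function and variable names are illustrative assumptions.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = [[x - sum(c) / n for x in c] for c in columns]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariates are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF of each covariate = diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Synthetic days-since-race covariates built from independent gaps between races.
rng = random.Random(0)
gaps = [[rng.uniform(10, 60) for _ in range(3)] for _ in range(2000)]
days1st = [g[0] for g in gaps]
days2nd = [g[0] + g[1] for g in gaps]
days3rd = [g[0] + g[1] + g[2] for g in gaps]
unrelated = [rng.uniform(0, 1) for _ in range(2000)]

vifs_together = vif([days1st, days2nd, days3rd])   # nested covariates inflate each other
vifs_alone = vif([days2nd, unrelated])             # days2nd without its neighbors
print([round(v, 2) for v in vifs_together], [round(v, 2) for v in vifs_alone])
```

With independent, equal-variance gaps the three nested covariates have VIFs of roughly 2, 4, and 3, with the middle one highest, which mirrors the ordering in Table 5.1; kept on its own, the days2nd stand-in has a VIF close to 1.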

CHAPTER 6 DISCUSSION

The Overlays showed significant improvements in Win EV/Profitability as well as improvements in 2nd, 3rd, and 4th place finish percentages, as noted in Section 4.1. Conversely, the Underlays indicated horses to be avoided due to low Win EV/Profitability and lower 2nd, 3rd, and 4th place percentages, as noted in Section 4.2.

6.1 RESPONSE VARIABLE: PERF (AND POWER POINT)

Perhaps the most critical component of this system is the response variable Perf. In the early stages of analysis, a Perf with a minimum of -210 and a simple function of lengths ahead of or behind the Power Point was used. This was deemed unsatisfactory because some horses won big, and their added lengths of victory were not nearly as important as, perhaps, the lengths just ahead of and behind the area of a close finish among two or more horses vying for the win. So Perf values were increased proportionately more for added distance directly ahead of the Power Point, up to a close winner. For big winners, the Perfs were high, but extra lengths beyond a 5-length winning margin did not increase the Perf very much. Similarly, for horses close to but behind the Power Point, Perf ratings fall more rapidly the further they are from the Power Point, down to the -210 minimum. Other Perf scores were tried, including one with a -390 minimum, but it was deemed unacceptable as it seemed to fit the horses in the middle and rear of the race better than those in the front.

The final Perf used here was also a function of the track surface. We noticed that races run on turf generally have markedly closer finishes. Perf values were thus adjusted accordingly. Note that the synthetic surfaces, Polytrack and Cushion Track, had finishes far more similar to those on dirt tracks.

The Power Point is an attempt to numerically evaluate a point in a race's finish that serves as a threshold separating strong efforts from lesser efforts.
This is heavily related to the 2nd place finish in a race with few horses (6-7), 3rd place finish in races with 8 to 10 horses, and some point between 3rd and 4th for races with over 10 horses. Thus it is a function of number of horses in a race, and the distance in lengths between 2nd and 3rd, and the distance between 3rd and 4th, and sometimes between 1st and 2nd.
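The dependence of the Power Point on field size can be summarized in a small lookup. This is only a rough sketch of the field-size rule stated above, not the Power Point computation itself, which also uses the lengths between finishers. The function name is hypothetical, the value 3.5 is an assumed stand-in for "some point between 3rd and 4th", and fields of fewer than 6 horses are rejected because the text does not describe them.

```python
def power_point_reference_place(num_horses):
    """Finishing position the Power Point is most closely tied to, by field size.

    Illustrative only: 3.5 is an assumed midpoint, and fields of fewer than
    6 horses raise an error because they are not covered in the text.
    """
    if num_horses < 6:
        raise ValueError("field sizes below 6 are not described in the text")
    if num_horses <= 7:
        return 2.0   # small fields: tied to the 2nd place finish
    if num_horses <= 10:
        return 3.0   # 8 to 10 horses: tied to the 3rd place finish
    return 3.5       # over 10 horses: assumed midpoint between 3rd and 4th


for field in (6, 9, 12):
    print(field, power_point_reference_place(field))
```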

6.2 SUBGROUPS

Subgroups did not yield many new significant covariates. This was a bit of a surprise, as some covariates and subgroups had been designed with each other in mind, such as fillies and mares running against males, and post position 1 in the Middle Distance subgroup, where the first turn comes up quickly. Of course the coefficients for individual trainers and jockeys varied from subgroup to subgroup, but there were no gigantic increases or decreases that coincided with other strong indicators (p-values, F-values, partial R-squared contributions, etc.).

There is still a lot of valuable information to be gleaned from subgroups; the trick is finding the best covariates to test against. Oftentimes a handicapper will wonder how a certain pattern looks in a specific Subgroup. Since almost any pattern can be converted to an indicator-type covariate, it can then be processed through the system described in this paper to find its value as a predictor variable. There are undoubtedly numerous (currently unidentified) covariates that do not show up as significant when looking at the total Regression Dataset but would be highly significant if looked at in a certain Subgroup. The potential in this area is enormous.

6.3 LIMITATIONS OF THE STUDY

Unfortunately, synthetics replaced dirt surfaces right at the end of the Regression Data period. It is impossible to judge the effect the switch to synthetic track surfaces has had on this study. Of the 16,284 races making up the Regression Dataset, 10,982 (66.4%) were run on dirt and 896 (5.5%) on synthetics. For the Testing Dataset, 2,668 of the 3,646 races (73.2%) were run on synthetics and none on dirt. How great a difference the surface makes to a race is open to speculation, but many trainers complained about the switch of surfaces, and some owners and trainers took their horses to other tracks [13]. Undoubtedly there has been an adjustment period for the trainers, jockeys, and horses to get used to the synthetics [14].
Unfortunately, that adjustment period occurs mainly during the period of the Testing Dataset.

The majority of covariates analyzed here are either trainers or jockeys, although in essence they are basically just two covariates: jockey and trainer. Like other humans, trainers and jockeys come and go, and have ascending and descending periods. Significant new jockeys and trainers may have appeared at the end of the Regression Data period and so either do not have regression coefficients or have estimates that may be inaccurate due to small sample size. The subgroup datasets for the last year of Regression Data (2007) and the last two years (2006-2007) helped somewhat in this regard: one jockey and two trainers were deemed significant enough to be added. Note that there were hundreds of trainers and jockeys who appeared sporadically and, because of their lack of data, were not added to the study.

6.4 PREDICTOR VARIABLES INCLUDED IN THE FINAL REGRESSION MODEL

Out of the 72 indicator-type trainer covariates, 16 appeared in the final model as shown in Table 3.3, with 12 of the 16 having positive parameter estimates, which means a positive influence on predicted Perf (since they are indicator-type variables). Since the 72 trainers were selected because they had the most horses, and in general that is an indicator of financial success at horse racing, it was expected that the majority of trainers would have positive parameters. It is interesting that Bob Baffert [15], probably the most famous trainer in this study (a member of Horse Racing's Hall of Fame and the trainer of three Kentucky Derby winners, to name two of his numerous accolades), was not in the final model. Although he enjoys great success, his celebrity status translates into his horses being heavily bet. Since he did not have a negative parameter estimate, his success and notoriety appear to balance each other.

Ten of 24 jockeys were included in the final regression model and, as with the trainers, most of them (8 of the 10) had positive parameter estimates. The two most famous, Kent Desormeaux [16] (who holds the current record for most wins by a jockey in the U.S. in one year and is one of four jockeys to win three national titles) and Garrett Gomez [17] (the leading U.S. jockey in total earnings for the years 2006 to 2009), were not in the final model, probably because their excellent riding skills were offset by their name recognition.

From post position (pp), three of the five covariates made the final cut; from the statebred field, one of the seven countries (France) and none of the eight states did. Actually, the statebred field was not expected to have any covariates in the final model. The foreign horses especially should be indicator variables in conjunction with number of races in the U.S. (see item 6 in the Conclusion Chapter).
The biggest surprise was the effect of adding blinkers to horses that had not worn them in their previous race. The blinkon covariate had the largest (absolute) value of any indicator-type covariate: -16.68. Handicappers typically consider adding blinkers a positive sign. The other big surprise was numlinediff, the difference between a horse's number of lines and the race's average. Basically this says that experience helps. NotLasix was also an interesting indicator, with its value of -7.86. Since so many (California) horses use lasix (96.5%), and in most races it seems like all the horses have it, it is easy to overlook horses that are not using it.

Many covariates that were of high interest to us, and therefore purposely included in this study, did not show up in the final model. Some of these were the Claimed covariates cl1 (claimed last race), cl2 (claimed 2nd race back), and cl3 (claimed 3rd race back); the flags covariate (2nd race since maiden win if within 60 days); and odds1 (odds in last race).

6.5 MISCELLANEOUS

The best predictors other than the baseline predictors (Intercept, wbfall, and nhor) were, in order of strength: numlinediff, blinkon, days2nd, notlasix, speed12diff, ppout, pp2, and pp3. The rest of the predictors were trainers, jockeys, and horses bred in France.

The only time-intensive step in the process was the computation of Monte Carlo probability estimates, which took around 10 to 25 hours depending on the speed of the computer used.

EVs (Expected Value/Profitability) do not always increase directly with increases in Perf. It may be that the improvement in Perf shows up in improved 2nd, 3rd, or 4th place performances.

In an article by Clive Thompson [18] in Wired magazine titled "Advantage: Cyborgs", it is pointed out that in a 2005 freestyle online chess tournament, where any kind of entrant was allowed, the most successful players were Cyborgs, those able to use computers as assistants most efficiently. That principle undoubtedly holds at the racetracks. The system described here has tremendous potential for assisting handicappers. Finding accurate probabilities should translate into high profitability.

CHAPTER 7 CONCLUSIONS

1. The system works. Table 7.1 shows a comparison of the totals for Overlays versus Underlays. The differences are dramatic even taking into account the differences in distribution by odds ranges.

Table 7.1. Comparison of Overlays and Underlays Totals

Totals of Important Statistics               Underlays   Overlays
Profitability/EV                             0.46        1.03
Winning Percentage                           2.0         10.0
Combined 1st, 2nd %                          3.9         21.6
Combined 1st, 2nd, 3rd %                     7.9         33.2
Combined 1st - 4th %                         13.6        45.0
Average Perf                                 -137.8      -50.9
% of Total Horses in Odds Range 0-4          1.1         11.7
% of Total Horses in Odds Range 4-9          5.9         26.9
% of Total Horses in Odds Range 9-27         32.2        38.2
% of Total Horses in Odds Range 27 and UP    60.8        23.2

2. A better comparison is Table 7.2, since it is for the odds range 9-27, where the percentage of total horses in the range is about the same for both groups (32.2% to 38.2%). Horses in the 9-27 odds range are longshots, basically overlooked or lightly bet. Although a bettor has to be patient for Overlays and Underlays to happen, they can lead to profitable bets when used in exotic wagering, especially exactas, trifectas, and superfectas, since which horses to bet and which to avoid are clearly identified. To hit a 15 or 20 to one longshot in the correct spot on an exotic bet can really boost the payoff!

Table 7.2. Odds Range 9-27 of Overlays and Underlays Totals

Important Statistics: Odds 9-27              Underlays   Overlays
Profitability/EV                             0.67        1.20
Winning Percentage                           3.9         7.4
Combined 1st, 2nd %                          6.9         16.4
Combined 1st, 2nd, 3rd %                     13.0        28.1
Combined 1st - 4th %                         24.4        41.4
Average Perf                                 -107.7      -52.0

3. The system, though it is in its infancy, works well at identifying a predictive model. Using these regression methods will produce more accurate probabilities on some horses than those reflected by the odds.

4. The system is usable at the racetrack. Once a regression equation is found, new estimated probabilities can be generated and calculations quickly made on any new horse to highlight wagers that are likely to be profitable. This includes not only straight win bets but, perhaps more importantly, the exotic single-race bets such as Exactas, Trifectas, and Superfectas, as well as the multiple-race wagers such as the Daily Double, Pick 3, Pick 4, and Pick 6.

5. Just about any pattern, combination of factors, or subset of horses can easily and quickly be turned into a predictor variable and analyzed to see whether and how it affects a horse's probabilities. The flags covariate (see Section 2.1) is an example of an obscure pattern that we found interesting and wanted to investigate, and we were able to do so just by making it an indicator-type predictor. This is a tremendous tool for handicappers who have often wondered about special situations but had no feasible way to get an accurate answer.

6. Improvements are possible: the response variable, Perf, and its underlying key statistic, the Power Point, can both be tweaked for better overall performance. Possible new predictor variables, each with some appropriate variable number N, include: weight drops from one race to the next; lowest weight in the race by N or more pounds; switching distance type after N or more races at one specific type; a new jockey after the previous jockey rode N or more times; moving up or down in class; three-year-old horses in races for ages three and up; and adding (or removing) blinkers after N races of not wearing (or wearing) them. Others may involve comparing lifetime and current-year records for statistics such as average earnings per race or percentages for winning, placing, or showing. The foreign horses could also provide valuable predictors, such as first race in the U.S., 2nd race, etc., or when they switch to a dirt or synthetic surface for the first time (many horses from Europe have run only on turf when they come to the U.S.). There are numerous possibilities for new predictors.
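As an illustration of item 5, the sketch below turns the flags pattern (2nd race since maiden win if within 60 days, as described in Section 6.4) into an indicator-type covariate. The function and field names are hypothetical, and the reading of the pattern used here (today's race is the horse's second start after its maiden win, run no more than 60 days after that win) is an assumption about how the covariate is defined.

```python
def flags_indicator(starts_since_maiden_win, days_since_maiden_win, max_days=60):
    """Return 1 if the pattern applies to today's race, else 0.

    starts_since_maiden_win: 1 for the first start after the maiden win, 2 for
        the second, and so on (hypothetical field).
    days_since_maiden_win: days between the maiden win and today's race
        (hypothetical field).
    """
    is_second_start = starts_since_maiden_win == 2
    within_window = 0 <= days_since_maiden_win <= max_days
    return 1 if (is_second_start and within_window) else 0


# Hypothetical past-performance records for three horses.
horses = [
    {"name": "A", "starts_since_maiden_win": 2, "days_since_maiden_win": 45},
    {"name": "B", "starts_since_maiden_win": 2, "days_since_maiden_win": 90},
    {"name": "C", "starts_since_maiden_win": 1, "days_since_maiden_win": 30},
]
for h in horses:
    h["flags"] = flags_indicator(h["starts_since_maiden_win"], h["days_since_maiden_win"])
    print(h["name"], h["flags"])
```

Once a pattern is coded as a 0/1 column in this way, it can be added to the regression like any other indicator-type predictor.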

BIBLIOGRAPHY

[1] D.A. Harville. Assigning probabilities to the outcomes of multi-entry competitions. Journal of the American Statistical Association, 68:312-316, 1973.
[2] R.J. Henery. Permutation probabilities as models for horse races. Journal of the Royal Statistical Society B, 43:86-91, 1981.
[3] H. Stern. Models for distributions on permutations. Journal of the American Statistical Association, 85:558-564, 1990.
[4] D.B. Hausch, V.S.Y. Lo, and W.T. Ziemba. Efficiency of Racetrack Betting Markets. Academic Press, New York, NY, 1994.
[5] J.B. Bacon-Shone, V.S.Y. Lo, and K. Busche. Logistics analyses of complicated bets. Research Report 11, Department of Statistics, the University of Hong Kong, 1992.
[6] V.S.Y. Lo and J. Bacon-Shone. Comparison between two models for predicting ordering probabilities in multi-entry competitions. The Statistician, 43(2):317-327, 1994.
[7] V.S.Y. Lo and J. Bacon-Shone. Handbook of Investments: Efficiency of Sports and Lottery Markets. Elsevier, London, England, 2008.
[8] M.M. Ali. Probability and utility estimates for racetrack bettors. Journal of Political Economy, 84:803-815, 1977.
[9] P. Asch, B. Malkiel, and R. Quandt. Market efficiency in racetrack betting. Journal of Business, 57:165-174, 1984.
[10] W.T. Ziemba and D.B. Hausch. Dr. Z's Beat the Racetrack. Morrow, New York, NY, 1987.
[11] J.B. Bacon-Shone and V.S.Y. Lo. Probability and statistical models for racing. Journal of Quantitative Analysis in Sports, 4(2):2-11, 2008.
[12] M.H. Kutner, C.J. Nachtsheim, and J. Neter. Applied Linear Regression Models. McGraw-Hill Irwin, New York, NY, 2004.
[13] B. Harris. Emotional Bob Baffert heads into Thoroughbred Racing Hall of Fame. Sports News, August 12, 2009.
[14] J. Bossert. Trainers bemoan synthetic tracks as Breeders' Cup approaches. New York Daily News, October 22, 2008.
[15] Wikipedia. Bob Baffert, 2010. http://en.wikipedia.org/wiki/bob_Baffert, accessed May 2010.
[16] Wikipedia. Kent Desormeaux, 2010. http://en.wikipedia.org/wiki/kent_Desormeaux, accessed May 2010.
[17] Wikipedia. Garrett Gomez, 2010. http://en.wikipedia.org/wiki/garrett_K._Gomez, accessed May 2010.
[18] C. Thompson. Advantage: Cyborgs. Wired Magazine, 42, April 2010.