TrueSkill Through Time: Revisiting the History of Chess

Pierre Dangauthier, INRIA Rhone Alpes, Grenoble, France, pierre.dangauthier@imag.fr
Ralf Herbrich, Microsoft Research Ltd., Cambridge, UK, rherb@microsoft.com
Tom Minka, Microsoft Research Ltd., Cambridge, UK, minka@microsoft.com
Thore Graepel, Microsoft Research Ltd., Cambridge, UK, thoreg@microsoft.com

Abstract

We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players' lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player's ability to force a draw provides significantly better predictive power.

1 Introduction

Competitive games and sports can benefit from statistical skill ratings for use in matchmaking as well as for providing criteria for the admission to tournaments. From a historical perspective, skill ratings also provide information about the general development of skill within the discipline or for a particular group of interest. Also, they can give a fascinating narrative about the key players in a given discipline, allowing a glimpse at their rise and fall or their struggle against their contemporaries.
In order to provide good estimates of the current skill level of players, skill rating systems have traditionally been designed as filters that combine a new game outcome with knowledge about a player's skill from the past to obtain a new estimate. In contrast, when taking a historical view we would like to infer the skill of a player at a given point in the past when both their past as well as their future achievements are known. The best known such skill filter based rating system is the Elo system [3] developed by Arpad Elo in 1959 and adopted by the World Chess Federation FIDE in 1970 [4]. Elo models the
probability of the game outcome as

$$P(1 \text{ wins over } 2 \mid s_1, s_2) := \Phi\left(\frac{s_1 - s_2}{\sqrt{2}\,\beta}\right),$$

where $s_1$ and $s_2$ are the skill ratings of each player, $\Phi$ denotes the cumulative density of a zero-mean unit-variance Gaussian and $\beta$ is the assumed variability of performance around skill. Denote the game outcomes by $y = +1$ if player 1 wins, $y = -1$ if player 2 wins and $y = 0$ if a draw occurs. Then the resulting (linearised) Elo update is given by $s_1 \leftarrow s_1 + y\Delta$, $s_2 \leftarrow s_2 - y\Delta$ and

$$\Delta = \underbrace{\alpha\beta\sqrt{\pi}}_{K\text{-Factor}} \left(\frac{y+1}{2} - P(1 \text{ wins over } 2 \mid s_1, s_2)\right),$$

where $0 < \alpha < 1$ determines how much the filter weighs the new evidence versus the old estimate.

The TrueSkill rating system [6] improves on the Elo system in a number of ways. TrueSkill's current belief about a player's skill is represented by a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. As a consequence, TrueSkill does not require a provisional rating period and converges to the true skills of players very quickly. Also, in contrast to Elo, TrueSkill explicitly models the probability of draws. Crucially for its application in the Xbox Live online gaming system (see [6] for details) it can also infer skills from games with more than two participating entities and infers individual players' skills from the outcomes of team games.

As a skill rating and matchmaking system TrueSkill operates as a filter as discussed above. However, due to its fully probabilistic formulation it is possible to extend TrueSkill to perform smoothing on a time series of player skills. In this paper we extend TrueSkill to provide accurate estimates of the past skill levels of players at any point in time taking into account both their past and their future achievements. We carry out a large-scale analysis of about 3.5 million games of chess played over the last 150 years.

The paper is structured as follows. In Section 2 we review previous work on historical chess ratings. In Section 3 we present two models for historical ratings through time, one assuming a fixed draw margin and one estimating the draw margin per player per year.
We indicate how large scale approximate message passing (EP) can be used to efficiently perform inference in these huge models. In Section 4 we present experimental results on a huge data set from ChessBase with over 3.5 million games and gain some fascinating chess specific insights from the data.

2 Previous Work on Historical Chess Ratings

Estimating players' skills in retrospect allows one to take into account more information and hence can be expected to lead to more precise estimates. The pioneer in this field was Arpad Elo himself, when he encountered the necessity of initializing the skill values of the Elo system when it was first deployed. To that end he fitted a smooth curve to skill estimates from five-year periods; however, little is known about the details of his method [3].

Probably best known in the chess community is the Chessmetrics system [8], which aims at improving the Elo scores by attempting to obtain a better fit with the observed data. Although constructed in a very thoughtful manner, Chessmetrics is not a statistically well-founded method and is a filtering algorithm that disregards information from future games.

The first approach to the historical rating problem with a solid statistical foundation was developed by Mark Glickman, chairman of the USCF Rating Committee. Glicko 1 & 2 [5] are Bayesian rating systems that address a number of drawbacks of the Elo system while still being based on the Bradley-Terry paired-comparison method [1] used by modern Elo. Glickman models skills as Gaussian variables whose variances indicate the reliability of the skill estimate, an idea later adopted in the TrueSkill model as well. Glicko 2 adds volatility measures, indicating the degree of expected fluctuation in a player's rating. After an initial estimate, past estimates are smoothed by propagating information back in time.

The second statistically well-founded approach is Rod Edwards's Edo Historical Chess Ratings [2], which are also based on the Bradley-Terry model but have been applied only to historical games from the 19th century. In order to model skill dynamics Edwards considers
the same player at different times as several distinct players, whose skills are linked together by a set of virtual games which are assumed to end in draws. While Edo incorporates a dynamics model via virtual games and returns uncertainty measures in terms of the estimator's variance, it is not a full Bayesian model: it neither provides posterior distributions over skills nor explicitly models draws.

In light of the above previous work on historical chess ratings the goal of this paper is to introduce a fully probabilistic model of chess ratings through time which explicitly accounts for draws and provides posterior distributions of skills that reflect the reliability of the estimate at every point in time.

3 Models for Ranking through Time

This paper strongly builds on the original TrueSkill paper [6]. Although TrueSkill is applicable to the case of multiple team games, we will only consider the two player case for this application to chess. It should be clear, however, that the methods presented can equally well be used for games with any number of teams competing.

Consider a game such as chess in which a number of, say, $N$ players $\{1, \ldots, N\}$ are competing over a period of $T$ time steps, say, years. Denote the series of game outcomes between two players $i$ and $j$ in year $t$ by $y_{ij}^t(k) \in \{+1, -1, 0\}$, where $k \in \{1, \ldots, K_{ij}^t\}$ indexes the $K_{ij}^t$ game outcomes available for that pair of players in that year. Furthermore, let $y = +1$ if player $i$ wins, $y = -1$ if player $j$ wins and $y = 0$ in case of a draw.

3.1 Vanilla TrueSkill

In the Vanilla TrueSkill system, each player $i$ is assumed to have an unknown skill $s_i^t \in \mathbb{R}$ at time $t$. We assume that a game outcome $y_{ij}^t(k)$ is generated as follows. For each of the two players $i$ and $j$, performances $p_{ij}^t(k)$ and $p_{ji}^t(k)$ are drawn according to

$$p\left(p_{ij}^t(k) \mid s_i^t\right) = \mathcal{N}\left(p_{ij}^t(k);\, s_i^t,\, \beta^2\right).$$

The outcome $y_{ij}^t(k)$ of the game between players $i$ and $j$ is then determined as

$$y_{ij}^t(k) := \begin{cases} +1 & \text{if } p_{ij}^t(k) > p_{ji}^t(k) + \varepsilon \\ -1 & \text{if } p_{ji}^t(k) > p_{ij}^t(k) + \varepsilon \\ 0 & \text{if } \left|p_{ij}^t(k) - p_{ji}^t(k)\right| \le \varepsilon \end{cases}$$

where the parameter $\varepsilon > 0$ is the draw margin.
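The generative process of Section 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_outcome` and its argument names are hypothetical and do not come from the paper, and the sketch draws a single game rather than a whole year of outcomes.

```python
import random


def sample_outcome(s_i, s_j, beta, eps, rng=random):
    """Sample one game outcome y in {+1, -1, 0} for players i and j.

    s_i, s_j : skills of the two players
    beta     : standard deviation of performance around skill
    eps      : draw margin (eps > 0 in the model)
    rng      : the `random` module or a `random.Random` instance
    """
    # Performances are Gaussian around the skills with variance beta^2.
    p_ij = rng.gauss(s_i, beta)
    p_ji = rng.gauss(s_j, beta)
    if p_ij > p_ji + eps:
        return +1  # player i wins
    if p_ji > p_ij + eps:
        return -1  # player j wins
    return 0       # performances within the draw margin: draw
```

Passing a seeded `random.Random` instance as `rng` makes simulated game series reproducible.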
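In order to infer the unknown skills $s_i^t$ the TrueSkill model assumes a factorising Gaussian prior $p(s_i^0) = \mathcal{N}(s_i^0;\, \mu_0,\, \sigma_0^2)$ over skills and a Gaussian drift of skills between time steps given by $p(s_i^t \mid s_i^{t-1}) = \mathcal{N}(s_i^t;\, s_i^{t-1},\, \tau^2)$. The model can be well described as a factor graph (see Figure 1, left), which clarifies the factorisation assumptions of the model and allows one to develop efficient (approximate) inference algorithms based on message passing (for details see [6]).

In the Vanilla TrueSkill algorithm, denoting the winning player by $W$ and the losing player by $L$ and dropping the time index for now, approximate Bayesian inference (Gaussian density filtering [7]) leads to the following update equations for $\mu_W$, $\mu_L$, $\sigma_W$ and $\sigma_L$:

$$\mu_W \leftarrow \mu_W + \frac{\sigma_W^2}{c_{ij}}\, v\left(\frac{\mu_W - \mu_L}{c_{ij}}, \frac{\varepsilon}{c_{ij}}\right) \quad\text{and}\quad \sigma_W \leftarrow \sigma_W \sqrt{1 - \frac{\sigma_W^2}{c_{ij}^2}\, w\left(\frac{\mu_W - \mu_L}{c_{ij}}, \frac{\varepsilon}{c_{ij}}\right)},$$

$$\mu_L \leftarrow \mu_L - \frac{\sigma_L^2}{c_{ij}}\, v\left(\frac{\mu_W - \mu_L}{c_{ij}}, \frac{\varepsilon}{c_{ij}}\right) \quad\text{and}\quad \sigma_L \leftarrow \sigma_L \sqrt{1 - \frac{\sigma_L^2}{c_{ij}^2}\, w\left(\frac{\mu_W - \mu_L}{c_{ij}}, \frac{\varepsilon}{c_{ij}}\right)}.$$

The overall variance is $c_{ij}^2 = 2\beta^2 + \sigma_W^2 + \sigma_L^2$ and the two functions $v$ and $w$ are given by

$$v(t, \alpha) := \frac{\mathcal{N}(t - \alpha;\, 0, 1)}{\Phi(t - \alpha)} \quad\text{and}\quad w(t, \alpha) := v(t, \alpha) \cdot \left(v(t, \alpha) + (t - \alpha)\right).$$

For the case of a draw we have the following update equations:

$$\mu_i \leftarrow \mu_i + \frac{\sigma_i^2}{c_{ij}}\, \tilde{v}\left(\frac{\mu_i - \mu_j}{c_{ij}}, \frac{\varepsilon}{c_{ij}}\right) \quad\text{and}\quad \sigma_i \leftarrow \sigma_i \sqrt{1 - \frac{\sigma_i^2}{c_{ij}^2}\, \tilde{w}\left(\frac{\mu_i - \mu_j}{c_{ij}}, \frac{\varepsilon}{c_{ij}}\right)},$$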
and similarly for player $j$. Defining $d := \alpha - t$ and $s := \alpha + t$, the functions $\tilde{v}$ and $\tilde{w}$ are given by

$$\tilde{v}(t, \alpha) := \frac{\mathcal{N}(-s;\, 0, 1) - \mathcal{N}(d;\, 0, 1)}{\Phi(d) - \Phi(-s)} \quad\text{and}\quad \tilde{w}(t, \alpha) := \tilde{v}^2(t, \alpha) + \frac{d \cdot \mathcal{N}(d;\, 0, 1) + s \cdot \mathcal{N}(s;\, 0, 1)}{\Phi(d) - \Phi(-s)}.$$

In order to approximate the skill parameters $\mu_i^t$ and $\sigma_i^t$ for all players $i \in \{1, \ldots, N\}$ at all times $t \in \{0, \ldots, T\}$ the Vanilla TrueSkill algorithm initialises each skill belief with $\mu_i^0 \leftarrow \mu_0$ and $\sigma_i^0 \leftarrow \sigma_0$. It then proceeds through the years $t \in \{1, \ldots, T\}$ in order, goes through the game outcomes $y_{ij}^t(k)$ in random order and updates the skill beliefs according to the equations above.

3.2 TrueSkill through Time (TTT)

The Vanilla TrueSkill algorithm suffers from two major disadvantages:

1. Inference within a given year $t$ depends on the random order chosen for the updates. Since no knowledge is assumed about game outcomes within a given year, the results of inference should be independent of the order of games within a year.
2. Information across years is only propagated forward in time. More concretely, if player A beats player B and player B later turns out to be very strong (i.e., as evidenced by him beating very strong player C repeatedly), then Vanilla TrueSkill cannot propagate that information backwards in time to correct player A's skill estimate upwards.

Both problems can be addressed by extending the Gaussian density filtering to running full expectation propagation (EP) until convergence [7]. The basic idea is to update repeatedly on the same game outcomes but making sure that the effect of the previous update on that game outcome is removed before the new effect is added. This way, the model remains the same but the inferences are less approximate.

More specifically, we go through the game outcomes $y_{ij}^t$ within a year $t$ several times until convergence. The update for a game outcome $y_{ij}^t(k)$ is performed in the same way as before but saving the upward messages $m_{f(p_{ij}^t(k),\, s_i^t) \to s_i^t}(s_i^t)$, which describe the effect of the updated performance $p_{ij}^t(k)$ on the underlying skill $s_i^t$.
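The single-game updates of Section 3.1, which are reused for every game outcome here, can be sketched as follows. This is a minimal illustration, not the authors' F# implementation: all function names are hypothetical, the standard normal density and distribution come from Python's `statistics.NormalDist`, and `w_draw` uses the sum $d\,\mathcal{N}(d) + s\,\mathcal{N}(s)$ that follows from the variance of a doubly truncated Gaussian. No safeguards against numerical underflow of $\Phi$ for very one-sided games are included.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # zero-mean, unit-variance Gaussian


def v_win(t, a):
    """v(t, alpha) = N(t - alpha; 0, 1) / Phi(t - alpha)."""
    return _STD.pdf(t - a) / _STD.cdf(t - a)


def w_win(t, a):
    """w(t, alpha) = v * (v + (t - alpha))."""
    v = v_win(t, a)
    return v * (v + (t - a))


def v_draw(t, a):
    d, s = a - t, a + t
    return (_STD.pdf(-s) - _STD.pdf(d)) / (_STD.cdf(d) - _STD.cdf(-s))


def w_draw(t, a):
    d, s = a - t, a + t
    z = _STD.cdf(d) - _STD.cdf(-s)
    return v_draw(t, a) ** 2 + (d * _STD.pdf(d) + s * _STD.pdf(s)) / z


def update_win_loss(mu_w, sigma_w, mu_l, sigma_l, beta, eps):
    """Return updated (mu_W, sigma_W, mu_L, sigma_L) after W beats L."""
    c = sqrt(2 * beta**2 + sigma_w**2 + sigma_l**2)
    t, a = (mu_w - mu_l) / c, eps / c
    v, w = v_win(t, a), w_win(t, a)
    return (
        mu_w + sigma_w**2 / c * v,
        sigma_w * sqrt(1 - sigma_w**2 / c**2 * w),
        mu_l - sigma_l**2 / c * v,
        sigma_l * sqrt(1 - sigma_l**2 / c**2 * w),
    )


def update_draw(mu_i, sigma_i, mu_j, sigma_j, beta, eps):
    """Return updated (mu_i, sigma_i) for player i after a draw with j."""
    c = sqrt(2 * beta**2 + sigma_i**2 + sigma_j**2)
    t, a = (mu_i - mu_j) / c, eps / c
    return (
        mu_i + sigma_i**2 / c * v_draw(t, a),
        sigma_i * sqrt(1 - sigma_i**2 / c**2 * w_draw(t, a)),
    )
```

For a draw, `update_draw` would be called twice with the roles of the two players swapped, using the pre-update beliefs in both calls.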
When game outcome $y_{ij}^t(k)$ comes up for update again, the new downward message $m_{f(p_{ij}^t(k),\, s_i^t) \to p_{ij}^t(k)}\left(p_{ij}^t(k)\right)$ can be calculated by

$$m_{f(p_{ij}^t(k),\, s_i^t) \to p_{ij}^t(k)}\left(p_{ij}^t(k)\right) = \int_{-\infty}^{\infty} f\left(p_{ij}^t(k), s_i^t\right) \frac{p(s_i^t)}{m_{f(p_{ij}^t(k),\, s_i^t) \to s_i^t}(s_i^t)}\, ds_i^t,$$

thus effectively dividing out the earlier upward message to avoid double counting. The integral above is easily evaluated since the messages as well as the marginals $p(s_i^t)$ have been assumed Gaussian. The new downward message serves as the effective prior belief on the performance $p_{ij}^t(k)$. At convergence, the dependency of the inferred skills on the order of game outcomes vanishes.

The second problem is addressed by performing inference for TrueSkill through time (TTT), i.e. by repeatedly smoothing forward and backward in time. The first forward pass of TTT is identical to the inference pass of Vanilla TrueSkill except that the forward messages $m_{f(s_i^{t-1},\, s_i^t) \to s_i^t}(s_i^t)$ are stored. They represent the influence of skill estimate $s_i^{t-1}$ at time $t-1$ on skill estimate $s_i^t$ at time $t$. In the backward pass, these messages are then used to calculate the new backward messages $m_{f(s_i^{t-1},\, s_i^t) \to s_i^{t-1}}\left(s_i^{t-1}\right)$, which effectively serve as the new prior for time step $t-1$,

$$m_{f(s_i^{t-1},\, s_i^t) \to s_i^{t-1}}\left(s_i^{t-1}\right) = \int_{-\infty}^{\infty} f\left(s_i^{t-1}, s_i^t\right) \frac{p(s_i^t)}{m_{f(s_i^{t-1},\, s_i^t) \to s_i^t}(s_i^t)}\, ds_i^t.$$

This procedure is repeated forward and backward along the time series of skills until convergence. The backward passes make it possible to propagate information from the future into the past.
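Because all messages and marginals are Gaussian, "dividing out" a message and integrating over the drift factor both have closed forms. The sketch below illustrates this for the backward message; the class `Gaussian` and the function `backward_message` are hypothetical names chosen for this example, and the drift standard deviation is passed in as `tau`.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Gaussian:
    mu: float
    sigma: float

    def __truediv__(self, other):
        """Divide two Gaussian densities (up to a constant).

        In natural parameters, precision and precision-weighted mean
        of the divisor are simply subtracted.
        """
        prec = 1.0 / self.sigma**2 - 1.0 / other.sigma**2
        if prec <= 0.0:
            raise ValueError("quotient is not a proper Gaussian")
        prec_mean = self.mu / self.sigma**2 - other.mu / other.sigma**2
        return Gaussian(prec_mean / prec, sqrt(1.0 / prec))


def backward_message(marginal_t, forward_msg_t, tau):
    """Message from the drift factor N(s_t; s_{t-1}, tau^2) to s_{t-1}.

    The stored forward message is divided out of the marginal at time t;
    integrating the result against the drift factor adds tau^2 to its
    variance and leaves the mean unchanged.
    """
    cavity = marginal_t / forward_msg_t
    return Gaussian(cavity.mu, sqrt(cavity.sigma**2 + tau**2))
```

The same division applies within a year, where the stored upward message of a game is removed from the skill marginal before that game is revisited.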
[Figure 1: three factor graphs over skills $s$, draw margins $\varepsilon$, performances $p$, winning thresholds $u$ and performance differences $d$.]

Figure 1: Factor graphs of single game outcomes for TTT (left) and TTT-D. In the left graph there are three types of variables: skills $s$, performances $p$, performance differences $d$. In the TTT-D graphs there are two additional types: draw margins $\varepsilon$ and winning thresholds $u$. The graphs only require three different types of factors: the Gaussian factor $\sigma$ takes the form $\mathcal{N}(\cdot\,;\, \cdot,\, \sigma^2)$, factor $>0$ takes the form $\mathbb{I}(\cdot > 0)$ and factor $\pm$ takes the form $\mathbb{I}(\cdot \pm \cdot = \cdot)$.

3.3 TTT with Individual Draw Margins (TTT-D)

From exploring the data it is known that the probability of draw not only increases markedly through the history of chess, but is also positively correlated with playing skill and even varies considerably across individual players. We would thus like to extend the TrueSkill model to incorporate another player-specific parameter which indicates a player's ability to force a draw.

Suppose each player $i$ at every time-step $t$ is characterised by an unknown skill $s_i^t \in \mathbb{R}$ and a player-specific draw margin $\varepsilon_i^t > 0$. Again, performances $p_{ij}^t(k)$ and $p_{ji}^t(k)$ are drawn according to $p\left(p_{ij}^t(k) \mid s_i^t\right) = \mathcal{N}\left(p_{ij}^t(k);\, s_i^t,\, \beta^2\right)$. In this model a game outcome $y_{ij}^t(k)$ between players $i$ and $j$ at time $t$ is generated as follows:

$$y_{ij}^t(k) = \begin{cases} +1 & \text{if } p_{ij}^t(k) > p_{ji}^t(k) + \varepsilon_j^t \\ -1 & \text{if } p_{ji}^t(k) > p_{ij}^t(k) + \varepsilon_i^t \\ 0 & \text{if } -\varepsilon_i^t \le p_{ij}^t(k) - p_{ji}^t(k) \le \varepsilon_j^t. \end{cases}$$

In addition to the Gaussian assumption about player skills as in the Vanilla TrueSkill model of Section 3.1 we assume a factorising Gaussian distribution for the player-specific draw margins, $p(\varepsilon_i^0) = \mathcal{N}(\varepsilon_i^0;\, \nu_0,\, \varsigma_0^2)$, and a Gaussian drift of draw margins between time steps given by $p(\varepsilon_i^t \mid \varepsilon_i^{t-1}) = \mathcal{N}(\varepsilon_i^t;\, \varepsilon_i^{t-1},\, \varsigma^2)$. The factor graph for the case of win/loss is shown in Figure 1 (centre) and for the case of a draw in Figure 1 (right). Note that the positivity of the player-specific draw margins at each time step $t$ is enforced by a factor $>0$.
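The asymmetric outcome rule of TTT-D is easy to misread, so a small sketch may help. The function name `outcome_ttt_d` and its arguments are hypothetical; it takes already-sampled performances and the two player-specific draw margins.

```python
def outcome_ttt_d(p_ij, p_ji, eps_i, eps_j):
    """Outcome of one game under player-specific draw margins.

    To win, a player must beat the opponent's performance by more than
    the OPPONENT's draw margin; otherwise the game is drawn.
    """
    if p_ij > p_ji + eps_j:
        return +1  # player i overcomes j's draw margin
    if p_ji > p_ij + eps_i:
        return -1  # player j overcomes i's draw margin
    return 0       # -eps_i <= p_ij - p_ji <= eps_j


# Player j trails by 40 but has a large draw margin of 50: the game is drawn.
print(outcome_ttt_d(100.0, 60.0, eps_i=10.0, eps_j=50.0))
# With the margins swapped, the same performances give a win for player i.
print(outcome_ttt_d(100.0, 60.0, eps_i=50.0, eps_j=10.0))
```

The two calls illustrate why a large draw margin can be read as an ability to hold a draw against a better-performing opponent.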
Inference in the TTT-D model is again performed by expectation propagation, both within a given year $t$ as well as across years in a forward backward manner. Note that in this model the current belief about the skill of a player is represented by four numbers: $\mu_i^t$ and $\sigma_i^t$ for the skill and $\nu_i^t$ and $\varsigma_i^t$ for the player-specific draw margin. Players with a high value of $\nu_i^t$ can be thought of as having the ability to achieve a draw against strong players, while players with a high value of $\mu_i^t$ have the ability to achieve a win.
[Figure 2, left panel: histogram of game frequency (up to about $2.5 \times 10^5$ per year) against year, 1850 to 2004. Right panel: log-evidence surface over $\beta$ and $\tau$.]

Figure 2: (Left) Distribution over number of recorded match outcomes played per year in the ChessBase database. (Right) The log-evidence $\log P(\mathbf{y} \mid \beta, \tau)$ for the TTT model as a function of the variation of player performance, $\beta$, and skill dynamics, $\tau$. The maximizing parameter settings are indicated by a black dot.

4 Experiments and Results

Our experiments are based on a data-set of chess match outcomes collected by ChessBase¹. This database is the largest top-class annotated database in the world and covers more than 3.5 million chess games from 1560 to 2006 played between 200,000 unique players. From this database, we selected all the matches between 1850 (the birth of modern Chess) and 2006. This results in 3,505,366 games between 206,059 unique players. Note that a large proportion of games was collected between 1987 and 2006 (see Figure 2 (left)).

Our implementation of the TrueSkill through Time algorithms was done in F#² and builds a factor graph with approximately 11,700,000 variables and 15,200,000 factors (TTT) or 18,500,000 variables and 27,600,000 factors (TTT-D). The whole schedule allocates no more than 6 GB (TTT) or 11 GB (TTT-D) and converges in less than 10 minutes (TTT) or 20 minutes (TTT-D) of CPU time on a standard Pentium 4 machine. The code for this analysis will be made publicly available.

In the first experiment, we built the TTT model for the above mentioned collection of Chess games. The draw margin $\varepsilon$ was chosen such that the a-priori probability of draw between two equally skilled players matches the overall draw probability of 30.3%. Moreover, the model has a translational invariance in the skills and a scale invariance in $\beta/\sigma_0$ and $\tau/\sigma_0$. Thus, we fixed $\mu_0 = 1200$, $\sigma_0 = 400$ and computed the log-evidence $L := \log P(\mathbf{y} \mid \beta, \tau)$ for varying values of $\beta$ and $\tau$ (see Figure 2 (right)). The plots show that the model is very robust to setting these two parameters except if $\beta$ is chosen too small.
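One way to choose the draw margin from a target draw probability follows directly from the model of Section 3.1: for two players of equal skill the performance difference is distributed as $\mathcal{N}(0, 2\beta^2)$, so the draw probability is $2\Phi\left(\varepsilon / (\sqrt{2}\beta)\right) - 1$. The sketch below inverts this relation. The function names are hypothetical, the value $\beta = 480$ in the usage line is only an example, and whether the original implementation computes the margin in exactly this way is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def draw_margin(p_draw, beta):
    """Draw margin eps giving prior draw probability p_draw at equal skill."""
    return NormalDist().inv_cdf((p_draw + 1.0) / 2.0) * sqrt(2.0) * beta


def draw_probability(eps, beta, skill_diff=0.0):
    """Prior probability that |p_i - p_j| <= eps given s_i - s_j = skill_diff."""
    diff = NormalDist(mu=skill_diff, sigma=sqrt(2.0) * beta)
    return diff.cdf(eps) - diff.cdf(-eps)


eps = draw_margin(0.303, beta=480.0)
print(eps, draw_probability(eps, beta=480.0))
```

Since $\varepsilon$ scales linearly with $\beta$, the margin has to be recomputed for every value of $\beta$ tried in the grid over $(\beta, \tau)$.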
Interestingly, the log-evidence is neither largest for $\tau \to \infty$ (complete de-coupling) nor for $\tau \to 0$ (constant skill over life-time), indicating that it is important to model the dynamics of Chess players. Note that the log-evidence is $L_{\text{TTT}} = -3{,}953{,}997$, larger than that of the naive model ($L_{\text{naive}} = -4{,}228{,}005$) which always predicts 30.3% for a draw and correspondingly for win/loss³.

In a second experiment, we picked the optimal values $(\beta, \tau) = (480, 60)$ for TTT and optimised the remaining prior and dynamics parameters of TTT-D to arrive at a model with a log-evidence of $L_{\text{TTT-D}} = -3{,}661{,}813$.

In Figure 3 we have plotted the skill evolution for some well known players of the last 150 years when fitting the TTT model ($\mu^t$, $\sigma^t$ are shown). In Figure 4 the skill evolution of the same players is plotted when fitting the TTT-D model; the dashed lines show $\mu^t + \varepsilon^t$ whereas the solid lines display $\mu^t$; for comparison we added the $\mu^t$ of the TTT model as dotted lines.

¹ For more information, see http://www.bcmchess.co.uk/softdatafrcb.html.
² For more details, see http://research.microsoft.com/fsharp/fsharp.aspx.
³ Leakage due to approximate inference.

Figure 3: Skill evolution of top Chess players with TTT; see text for details.

As a first observation, the uncertainties always grow towards the beginning and end of a career since they are not constrained by past/future years. In fact, for Bobby Fischer the uncertainty grows very large in his 20 years of inactivity (1972-1992). Moreover, there seems to be a noticeable increase in overall skill since the 1980s. Looking at Figure 4 we see that players have different abilities to force a draw; the strongest player to do so is Boris Spassky (1937-). This ability got stronger after 1975, which explains why the model with a fixed draw margin estimates Spassky's skill to be larger.

Looking at individual players we see that Paul Morphy (1837-1884), "The Pride and Sorrow of Chess", is particularly strong when comparing his skill to those of his contemporaries in the next 80 years. He is considered to have been the greatest chess master of his time, and this is well supported by our analysis. Bobby Fischer (1943-) tied with Boris Spassky at the age of 17 and later defeated Spassky in the "Match of the Century" in 1972. Again, this is well supported by our model. Note how the uncertainty grows during the 20 years of inactivity (1972-1992) but starts to shrink again in light of the (future) re-match of Spassky and Fischer in 1992 (which Fischer won). Also, Fischer is the only one of these players whose $\varepsilon^t$ decreased over time when he was active; he was known for the large margin by which he won! Finally, Garry Kasparov (1963-) is considered the strongest Chess player of all time. This is well supported by our analysis. In fact, based on our analysis Kasparov is still considerably stronger than Vladimir Kramnik (1975-), but a contender for the crown of strongest player in the world is Viswanathan Anand (1969-), a former FIDE world champion.

5 Conclusion

We have extended the Bayesian rating system TrueSkill to provide player ratings through time on a unified scale.
In addition, we introduced a new model that tracks player-specific draw margins and thus models the game outcomes even more precisely. The resulting factor graph model for our large ChessBase database of game outcomes has 18.5 million nodes and 27.6 million factors, thus constituting one of the largest non-trivial Bayesian models ever tackled. Full approximate inference takes a mere 20 minutes in our F# implementation and thus demonstrates the efficiency of EP in appropriately structured factor graphs.

[Figure 4: skill estimate (1500 to 3500) against year (1850 to 2006). Legend: Skill (Variable Draw Margin), Skill + Draw Margin, Skill (Fixed Draw Margin). Players shown: Anderssen, Adolf; Morphy, Paul; Eichborn, Louis; Steinitz, William; Lasker, Emanuel; Capablanca, Jose Raul; Botvinnik, Mikhail; Spassky, Boris V; Fischer, Robert James; Karpov, Anatoly; Kasparov, Garry; Kramnik, Vladimir; Anand, Viswanathan.]

Figure 4: Skill evolution of top Chess players with TTT-D; see text for details.

One of the key questions provoked by this work concerns the comparability of skill estimates across different eras of chess history. Can we directly compare Fischer's rating in 1972 with Kasparov's in 1991? Edwards [2] points out that we would not be able to detect any skill improvement if two players of equal skill were to learn about a skill-improving breakthrough in chess theory at the same time but would only play against each other. However, this argument does not rule out the possibility that with more players and chess knowledge flowing less perfectly the improvement may be detectable. After all, we do see a marked improvement in the average skill of the top players.

In future work, we would like to address the issue of skill calibration across years further, e.g., by introducing a latent variable for each year that serves as the prior for new players joining the pool. Also, it would be interesting to model the effect of playing white rather than black.

References

[1] H. A. David. The method of paired comparisons. Oxford University Press, New York, 1988.
[2] R. Edwards. Edo historical chess ratings. http://members.shaw.ca/edo1/.
[3] A. E. Elo. The rating of chess players: Past and present. Arco Publishing, New York, 1978.
[4] M. E. Glickman. A comprehensive guide to chess ratings. Amer. Chess Journal, 3:59-102, 1995.
[5] M. E. Glickman. Parameter estimation in large dynamic paired comparison experiments. Applied Statistics, 48:377-394, 1999.
[6] R. Herbrich, T. Minka, and T. Graepel. TrueSkill(TM): A Bayesian skill rating system. In Advances in Neural Information Processing Systems 20, 2007.
[7] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT, 2001.
[8] J. Sonas. Chessmetrics. http://db.chessmetrics.com/.