On Learning Algorithms for Nash Equilibria


Constantinos Daskalakis¹, Rafael Frongillo², Christos H. Papadimitriou², George Pierrakos², and Gregory Valiant²

¹ MIT, costis@csail.mit.edu
² UC Berkeley, {raf, christos, georgios, gvaliant}@cs.berkeley.edu

Abstract. Can learning algorithms find a Nash equilibrium? This is a natural question for several reasons. Learning algorithms resemble the behavior of players in many naturally arising games, and thus results on the convergence or nonconvergence properties of such dynamics may inform our understanding of the applicability of Nash equilibria as a plausible solution concept in some settings. A second reason for asking this question is in the hope of being able to prove an impossibility result, not dependent on complexity assumptions, for computing Nash equilibria via a restricted class of reasonable algorithms. In this work, we begin to answer this question by considering the dynamics of the standard multiplicative weights update learning algorithms (which are known to converge to a Nash equilibrium for zero-sum games). We revisit a 3 × 3 game defined by Shapley [10] in the 1950s in order to establish that fictitious play does not converge in general games. For this simple game, we show via a potential function argument that in a variety of settings the multiplicative updates algorithm impressively fails to find the unique Nash equilibrium, in that the cumulative distributions of players produced by learning dynamics actually drift away from the equilibrium.

1 Introduction

In complexity, once a problem is shown intractable, research shifts towards two directions¹: (a) polynomial algorithms for more modest goals such as special cases and approximation, and (b) exponential lower bounds for restricted classes of algorithms. In other words, we weaken either the problem or the algorithmic model. For the problem of finding Nash equilibria in games, the first avenue has been followed extensively and productively, but, to our knowledge, not yet the second.
It has been shown that a general and natural class of algorithms fails to solve multiplayer games in polynomial time in the number of players [4], but such games have an exponential input anyway, and the point of that proof is to show, via communication complexity arguments, that, if the players do not know the input, they have to communicate large parts of it, at least for some games, in order to reach equilibrium. We conjecture that a very strong lower bound result, of sweeping generality, is possible even for bimatrix games. In particular, we suspect that a broad class of algorithms that maintains and updates mixed distributions in essentially arbitrary ways can be shown to fail to efficiently find Nash equilibria in bimatrix games, as long as these algorithms cannot identify the matrices; since our ambition here falls short of proving that P ≠ NP, such restriction needs to be in place.

¹ In addition, of course, to the perennial challenge of collapsing complexity classes...

In this paper we start on this path of research. In targeting restricted classes of algorithms, it is often most meaningful to focus on algorithmic ideas which are known to perform well under certain circumstances or in related tasks. For games, learning is the undisputed champion among algorithmic styles. By learning we mean a large variety of algorithmic ways of playing games which maintain weights for the strategies (unnormalized probabilities of the current mixed strategy) and update them based on the performance of the current mixed strategy, or single strategy sampled from it, against the opponent's mixed strategy (or, again, sampled strategy). Learning algorithms are known to converge to the Nash equilibrium in zero-sum games [2], essentially because they can be shown to have diminishing regret. Furthermore, in general games, a variant in which regret is minimized explicitly [5] is known to always converge to a correlated equilibrium. Learning is of such central importance in games that it is broadly discussed as a loosely defined equilibrium concept; for example, it has been recently investigated viz. the price of anarchy [1, 7, 9].

There are three distinct variants of the learning algorithmic style with respect to games. In the first, which we call the distribution payoff setting, the players get feedback on the expected utility of the opponent's mixed strategy on all of their strategies; in other words, in a bimatrix game (R, C), if the row player plays mixed strategy $x$ and the column player $y$, then the row player sees at each stage the vector $Ry^T$ while the column player sees $x^T C$. In the second variant, which we call the stochastic setting, we sample from the two mixed strategies and both players learn the payoffs of all of their strategies against the one chosen by the opponent; that is, the row player learns $R_{\cdot,j}$, the whole column corresponding to the column player's choice, and vice-versa. A third variant is the multi-armed setting, introduced in [2], in which the players sample the distributions and update them according to the payoff of the combined choices.
In all three cases we are interested in studying the behavior of the cumulative distributions of the players, and in seeing whether they converge to the Nash equilibrium (as is the case for zero-sum games).

An early fourth kind of learning algorithm called fictitious play does not fall into our framework. In fictitious play both players maintain the opponent's histogram of past plays, adopt the belief that this histogram is the mixed strategy being played by the opponent, and keep best-responding to it. In 1951 Julia Robinson proved that fictitious play (or more accurately, the cumulative distributions of players resulting from fictitious play) converges to the Nash equilibrium in zero-sum games. Incidentally, Robinson's inductive proof implies a convergence that is exponentially slow in the number of strategies, but Karlin [6] conjectured in the 1960s quadratic convergence; this conjecture remains open. Shapley [10] showed that fictitious play fails to converge in a particular simple 3 × 3 nonzero-sum game (it does converge in all 2 × n games).

But how about learning dynamics? Is there a proof that this class of algorithms fails to solve the general case of the Nash equilibrium problem? This question has been discussed in the past, and has in fact been treated extensively in Zinkevich's thesis [14]. Zinkevich presents extensive experimental results showing that, for the same 3 × 3 game considered by Shapley in [10] (and which is the object of our investigation), as well as in a variant of the same game, the cumulative distributions do not converge to a Nash equilibrium (we come back to Zinkevich's work later in the last section). However, to our knowledge there is no actual proof in the literature establishing that learning algorithms fail to converge to a Nash equilibrium.

Our main result is such a non-convergence proof; in fact, we establish this for each of the variants of learning algorithms. For each of the three styles, we consider the standard learning algorithm in which the weight updates are multiplicative, that is, the weights are multiplied by an exponential in the observed utility, hence the name multiplicative experts weight update algorithms. (In the multi-armed setting, we analyze the variant of the multiplicative weights algorithm that applies in this setting, in which payoffs are scaled so as to boost low-probability strategies.) In all three settings, our results are negative: for Shapley's 3 × 3 game the learning algorithms fail, in general, to converge to the unique Nash equilibrium. In fact, we prove the much more striking result that in all settings, the dynamics lead the players' cumulative distributions away from the equilibrium exponentially quickly. The precise statements of the theorems differ, reflecting the different dynamics and the analytical difficulties they entail.

At this point it is important to emphasize that most of the work in the field focuses on proving the non-convergence of private distributions of the players, i.e. the distribution over strategies of each player at each time-step. In general, this is easy to do. In sharp contrast, we prove the non-convergence of the cumulative distributions of the players; the cumulative distribution is essentially the time-average of the private distributions played up to some time-step. This is a huge difference, because this weaker definition of convergence (corresponding to a realistic sense of what it means to play a mixed strategy in a repeated game) yields a much stronger result. Only Shapley in his original paper [10] (and Benaim and Hirsch [?]
for a more elaborate setting) prove nonconvergence results for the cumulative distributions, but for fictitious play dynamics. We show this for multiplicative weight updates, arguably (on the evidence of its many other successes, see the survey [12]) a much stronger class of algorithms.

2 The Model

We start by describing the characteristics of game-play; to do that we need to specify the type of information that the players receive at each time step. In this section we briefly describe the three learning environments which we consider, and then for each environment describe the types of learning algorithms which we consider.

2.1 Learning Environments

The first setting we consider is the distribution payoff setting, in which each player receives a vector of the expected payoffs that each of his strategies would receive, given the distribution of the other player. Formally, we have the following definition:

Definition 1. [Distribution payoff setting] Given mixed strategy profiles $c_t = (c_{1,t}, \ldots, c_{n,t})$ and $r_t = (r_{1,t}, \ldots, r_{n,t})^T$ with $\sum_i r_{i,t} = \sum_i c_{i,t} = 1$ for the column and row player, respectively, and payoff matrices $C$, $R$ of the underlying game,

$$r_{t+1} = f(R c_t^T, r_t), \qquad c_{t+1} = g(r_t^T C, c_t),$$

where $f, g$ are update functions of the row and column player, respectively, with the condition that $r_{t+1}, c_{t+1}$ are distributions.

It may seem that this setting gives too much information to the players, to the point of being unrealistic. We consider this setting for two reasons. First, intuitively, if learning algorithms can find Nash equilibria in any setting, then they should in this setting. Since we will provide largely negative results, it is natural to consider this setting that furnishes the players with the most power. The second reason is that in this setting, provided $f, g$ are deterministic functions, the entire dynamics is deterministic, simplifying the analysis. Our results and proof approaches for this setting provide the guiding intuition for our results in the more realistic learning settings.

The second setting we consider is the stochastic setting, in which each player selects a single strategy to play, according to their private strategy distributions $r_t$ and $c_t$, and each player may update his strategy distribution based on the entire vector of payoffs that his different strategies would have received given the single strategy choice of the opponent. Formally, we have:

Definition 2. [Stochastic setting] Given mixed strategy profiles $r_t$ and $c_t$ for the row and column player, respectively, at some time $t$, and payoff matrices $R$, $C$ of the underlying game, the row and column players select strategies $i$ and $j$ according to $r_t$ and $c_t$, respectively, and

$$r_{t+1} = f(R_{\cdot,j}, r_t), \qquad c_{t+1} = g(C_{i,\cdot}, c_t),$$

where $f, g$ are update functions of the row and column player, respectively, $r_{t+1}, c_{t+1}$ are required to be distributions, and $M_{i,\cdot}$, $M_{\cdot,i}$, respectively, denote the $i$-th row and column of matrix $M$.

Finally, we will consider the multi-armed setting, in which both players select strategies according to their private distributions, knowing only the single payoff value given by their combined choices of strategies.

Definition 3. [Multi-armed setting] Given mixed strategy profiles $r_t$ and $c_t$ for the row and column player, respectively, at some time $t$, and payoff matrices $R$, $C$ of the underlying game, the row and column players select strategies $i$ and $j$ according to $r_t$ and $c_t$, respectively, and

$$r_{t+1} = f(R_{i,j}, r_t), \qquad c_{t+1} = g(C_{i,j}, c_t),$$

where $f, g$ are update functions of the row and column player, respectively, and $r_{t+1}, c_{t+1}$ are distributions.

While the multi-armed setting is clearly the weakest setting to learn in, it is also, arguably, the most realistic and closely resembles the type of setting in which many everyday games are played.

Almost all of the results in this paper refer to the non-convergence of the cumulative distributions of the players, defined as:

$$R_{i,t} = \frac{\sum_{j=0}^{t} r_{i,j}}{t}, \qquad C_{i,t} = \frac{\sum_{j=0}^{t} c_{i,j}}{t}.$$
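The three feedback models of Definitions 1 to 3, and the cumulative distribution, can be sketched in a few lines of Python. This is only an illustrative sketch: the 2 × 2 matrices and every function name below are hypothetical, and the cumulative average here divides by the number of recorded distributions.

```python
import random

# Hypothetical 2x2 payoff matrices, used only to exercise the three feedback models.
R = [[1.0, 0.0], [0.0, 2.0]]   # row player's payoffs
C = [[0.0, 1.0], [3.0, 0.0]]   # column player's payoffs

def distribution_feedback(R, C, r, c):
    """Definition 1: each player sees the expected payoff of each of its strategies."""
    row_view = [sum(R[i][j] * c[j] for j in range(len(c))) for i in range(len(r))]
    col_view = [sum(r[i] * C[i][j] for i in range(len(r))) for j in range(len(c))]
    return row_view, col_view

def stochastic_feedback(R, C, i, j):
    """Definition 2: given sampled strategies (i, j), the row player sees
    column j of R and the column player sees row i of C."""
    return [R[k][j] for k in range(len(R))], list(C[i])

def multi_armed_feedback(R, C, i, j):
    """Definition 3: each player sees only its own realized payoff."""
    return R[i][j], C[i][j]

def sample(dist, rng):
    """Draw a strategy index according to the probability vector dist."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def cumulative(history):
    """Time average of a list of probability vectors."""
    n = len(history[0])
    return [sum(p[i] for p in history) / len(history) for i in range(n)]
```

The update functions $f$ and $g$ are deliberately left out here; the sketch only separates what each player observes in each setting.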

2.2 Learning Algorithms

For each game-play setting, the hope is to characterize which types of learning algorithms are capable of efficiently converging to an equilibrium. In this paper, we tackle the much more modest goal of analyzing the behavior of standard learning models that are known to perform well in each setting. For the distribution payoff setting and the stochastic setting, we consider the dynamics induced by multiplicative weight updates. Specifically, for a given update parameter $\epsilon > 0$, at each timestep $t$, a player's distribution $w_t = (w_{1,t}, \ldots, w_{n,t})$ is updated according to

$$w_{i,t+1} = \frac{w_{i,t}(1+\epsilon)^{P_i}}{\sum_{j} w_{j,t}(1+\epsilon)^{P_j}},$$

where $P_i$ is the payoff that the $i$-th strategy would receive at time $t$. We focus on this learning algorithm as it is extraordinarily successful, both practically and theoretically, and is known to have vanishing regret (which, by the min-max theorem, guarantees that cumulative distributions $\frac{1}{T}\sum_{t=1}^{T} w_t$ converge to the Nash equilibrium for zero-sum games [12]).

For the multi-armed setting, the above weight update algorithm is not known to perform well, as low-probability strategies are driven down by the dynamics. There is a simple fix, first suggested in [11]: one scales the payoffs by the inverse of the probability with which the given strategy was played, then applies multiplicative weights as above with the scaled payoffs in place of the raw payoff. Intuitively, this modification gives the low-weight strategies the extra boost that is needed in this setting. Formally, given update parameter $\epsilon$ and distribution $w_t$, if strategy $s$ is chosen at time $t$ and payoff $P$ is received, we update according to the following:

$$\hat{w}_s = w_{s,t}(1+\epsilon)^{P/w_{s,t}}, \qquad \hat{w}_i = w_{i,t} \ \text{ for } i \neq s, \qquad w_{j,t+1} = \frac{\hat{w}_j}{\sum_k \hat{w}_k}.$$

We note that this update scheme differs slightly from the originally proposed scheme in [11], in which a small drift towards the uniform distribution is explicitly added. We omit this drift as it greatly simplifies the analysis; additionally, arguments from [13] can be used to show that our update scheme also has the guarantee that the algorithm will have low regret in expectation (and thus the dynamics converge for zero-sum games).
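The two update rules of this section can be sketched as follows. This is a minimal illustration rather than a reference implementation; the function names `mw_update` and `bandit_mw_update` are hypothetical.

```python
def mw_update(w, payoffs, eps):
    """Multiplicative weights: w_i <- w_i * (1 + eps)**P_i, then renormalize."""
    raw = [wi * (1.0 + eps) ** p for wi, p in zip(w, payoffs)]
    total = sum(raw)
    return [x / total for x in raw]

def bandit_mw_update(w, s, payoff, eps):
    """Multi-armed variant: only the played strategy s is updated, with its
    payoff scaled by the inverse of the probability w[s]; no uniform drift."""
    raw = list(w)
    raw[s] = w[s] * (1.0 + eps) ** (payoff / w[s])
    total = sum(raw)
    return [x / total for x in raw]
```

For instance, with `eps = 1.0`, the call `mw_update([0.5, 0.5], [1, 0], 1.0)` doubles the first weight before normalizing, giving roughly (2/3, 1/3), while `bandit_mw_update([0.5, 0.5], 0, 1.0, 1.0)` scales the payoff to 2 and gives roughly (0.8, 0.2).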
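2.3 The game

For all of our results, we will make use of Shapley's 3 × 3 bimatrix game with row and column payoffs given by

$$R = \begin{pmatrix} 0 & 1 & 2 \\ 2 & 0 & 1 \\ 1 & 2 & 0 \end{pmatrix}, \qquad C = \begin{pmatrix} 0 & 2 & 1 \\ 1 & 0 & 2 \\ 2 & 1 & 0 \end{pmatrix}.$$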
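As a quick sanity check, the following Python sketch encodes the game and verifies two basic facts: against a uniform opponent every strategy earns the same payoff, and no pure strategy pair is an equilibrium. The matrix entries are an assumption here: they are the ones implied by the update equations in the proof of Lemma 1, and the function names are hypothetical.

```python
# Payoff matrices assumed for Shapley's game (inferred from the update
# equations in the proof of Lemma 1; treat the entries as an assumption).
R = [[0, 1, 2],
     [2, 0, 1],
     [1, 2, 0]]
C = [[0, 2, 1],
     [1, 0, 2],
     [2, 1, 0]]

def row_payoffs(c):
    """Expected payoff of each row strategy against column distribution c."""
    return [sum(R[i][j] * c[j] for j in range(3)) for i in range(3)]

def col_payoffs(r):
    """Expected payoff of each column strategy against row distribution r."""
    return [sum(r[i] * C[i][j] for i in range(3)) for j in range(3)]

def pure_equilibria():
    """All pure strategy pairs (i, j) that are mutual best responses."""
    found = []
    for i in range(3):
        for j in range(3):
            row_best = all(R[i][j] >= R[k][j] for k in range(3))
            col_best = all(C[i][j] >= C[i][k] for k in range(3))
            if row_best and col_best:
                found.append((i, j))
    return found

uniform = [1 / 3, 1 / 3, 1 / 3]
```

With these matrices, `row_payoffs(uniform)` and `col_payoffs(uniform)` are constant vectors, so neither player can gain by deviating from the uniform profile, and `pure_equilibria()` returns an empty list.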

This game has a single Nash equilibrium in which both players play each strategy with equal probabilities. It was originally used by Shapley to show that fictitious play does not converge for general games.

3 Distribution Payoff Setting

In this section we consider the deterministic dynamics of running the experts weights algorithm in the distribution payoff setting. We show that under these dynamics, provided that the initial distributions satisfy $r \neq c$, the cumulative distributions $R_t, C_t$ tend away from the Nash equilibrium. The proof splits into three main pieces. First, we define a potential function, which we show is strictly increasing throughout the dynamics, and argue that the value of the potential cannot be bounded by any constant. Next, we argue that given a sufficiently large value of the potential function, eventually the private row and column distributions $r_t, c_t$ must become unbalanced in the sense that for some $i \in \{1, 2, 3\}$, $r_i > .999$ and $c_i < .001$ (or $r_i < .001$, $c_i > .999$). Finally, given this imbalance, we argue that the dynamics consists of each player switching between essentially pure strategies, with the amount of time spent playing each strategy increasing in a geometric progression, from which it follows that the cumulative distributions will not converge. Each of the three components of the proof, including the potential function argument, will also apply in the stochastic and multi-armed settings, although the details will differ.

Before stating our main non-convergence results, we start by observing that in the case that both players perform multiplicative experts weight updates with parameters $\epsilon_R = \epsilon_C$, and start with identical initial distributions $r = c$, the dynamics do converge to the equilibrium. In fact, not only do the cumulative distributions $R_t, C_t$ converge, but so do the private distributions $r_t, c_t$.

Proposition 1. If both players start with a common distribution $r = c$ and perform their weight updates with $\epsilon_R = \epsilon_C = 3/5$, then the dynamics of $r_t, c_t$ converge to the Nash equilibrium exponentially fast.

The proof is simple and is delegated to the full version of this paper.
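The last step of the proof outline in this section, that geometrically growing phases keep time averages from converging, can be illustrated with a toy calculation. The sketch below is not the dynamics of the paper: it assumes an idealized player who plays pure strategy $k \bmod 3$ during phase $k$, with phase lengths doubling (the growth factor 2 and the function name are assumptions made for illustration).

```python
from fractions import Fraction

def share_after_phases(num_phases, strategy=0, growth=2, n=3):
    """Fraction of total time spent on `strategy` after `num_phases` phases,
    where phase k plays pure strategy k mod n for growth**k steps."""
    time_on = 0
    total = 0
    for k in range(num_phases):
        length = growth ** k
        total += length
        if k % n == strategy:
            time_on += length
    return Fraction(time_on, total)
```

Here `share_after_phases(10)` equals 585/1023 (about 0.57), measured just after a long phase of strategy 0, while `share_after_phases(12)` equals exactly 1/7, measured two phases later. The time average keeps oscillating between values near 4/7 and 1/7, so it has no limit.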
We now turn our attention to the main non-convergence result of this section: if the initial distributions are not equal, then the dynamics diverge.

Theorem 1. In the distribution payoff setting, with a row player performing experts weight updates with parameter $1+\epsilon_R$, and column player performing updates with parameter $1+\epsilon_C$, the cumulative distributions $R_t = \frac{\sum_{i=0}^{t} r_i}{t}$, $C_t = \frac{\sum_{i=0}^{t} c_i}{t}$ diverge, provided that the initial weights do not satisfy $r_i = c_i^{\alpha}$, with $\alpha = \frac{\log(1+\epsilon_R)}{\log(1+\epsilon_C)}$.

The first component of the proof will hinge upon the following potential function for the dynamics:

$$\Phi(r, c) := \log \max_i \left(\frac{r_i}{c_i^{\alpha}}\right) - \log \min_i \left(\frac{r_i}{c_i^{\alpha}}\right), \tag{1}$$

with $\alpha = \frac{\log(1+\epsilon_R)}{\log(1+\epsilon_C)}$. We are going to use the same potential function for the other two learning settings as well. The following lemma argues that $\Phi(r_t, c_t)$ increases unboundedly.

Lemma 1. Given initial private distributions $r_0, c_0$ such that $\Phi(r_0, c_0) \neq 0$, then $\Phi(r_t, c_t)$ is strictly increasing, and for any constant $k$, there exists some $t_0$ such that $\Phi(r_{t_0}, c_{t_0}) > k$.

Proof. We consider the change in $\Phi$ after one step of the dynamics. For convenience, we give the proof in the case that $\epsilon_R = \epsilon_C = \epsilon$; without this assumption identical arguments yield the desired general result. Also note that without loss of generality, by the symmetry of the game, it suffices to consider the case when $r_{1,t} \ge c_{1,t}$. The dynamics define the following updates:

$$(r_{1,t+1},\, r_{2,t+1},\, r_{3,t+1}) = n_1 \left( r_{1,t}(1+\epsilon)^{c_{2,t}+2c_{3,t}},\; r_{2,t}(1+\epsilon)^{2c_{1,t}+c_{3,t}},\; r_{3,t}(1+\epsilon)^{c_{1,t}+2c_{2,t}} \right),$$

$$(c_{1,t+1},\, c_{2,t+1},\, c_{3,t+1}) = n_2 \left( c_{1,t}(1+\epsilon)^{r_{2,t}+2r_{3,t}},\; c_{2,t}(1+\epsilon)^{2r_{1,t}+r_{3,t}},\; c_{3,t}(1+\epsilon)^{r_{1,t}+2r_{2,t}} \right),$$

for some positive normalizing constants $n_1, n_2$. By the symmetry of the game, it suffices to consider the following two cases: when $\arg\max_i (r_i/c_i) = 1$ and $\arg\min_i (r_i/c_i) = 2$, and the case when $\arg\max_i (r_i/c_i) = 1$ and $\arg\min_i (r_i/c_i) = 3$. We start by considering the first case:

$$\begin{aligned}
\Phi(r_{t+1}, c_{t+1}) &= \log \max_i \left(\frac{r_{i,t+1}}{c_{i,t+1}}\right) - \log \min_i \left(\frac{r_{i,t+1}}{c_{i,t+1}}\right) \\
&\ge \log \frac{r_{1,t+1}}{c_{1,t+1}} - \log \frac{r_{2,t+1}}{c_{2,t+1}} \\
&= \log(n_1/n_2) + \log \frac{r_{1,t}}{c_{1,t}} + (c_{2,t} + 2c_{3,t} - r_{2,t} - 2r_{3,t}) \log(1+\epsilon) \\
&\quad - \log(n_1/n_2) - \log \frac{r_{2,t}}{c_{2,t}} - (c_{3,t} + 2c_{1,t} - r_{3,t} - 2r_{1,t}) \log(1+\epsilon) \\
&= \Phi(r_t, c_t) + (-2c_{1,t} + c_{2,t} + c_{3,t} - r_{2,t} - r_{3,t} + 2r_{1,t}) \log(1+\epsilon) \\
&= \Phi(r_t, c_t) + 3(r_{1,t} - c_{1,t}) \log(1+\epsilon).
\end{aligned}$$

In the second case, where $\arg\max_i (r_i/c_i) = 1$ and $\arg\min_i (r_i/c_i) = 3$, a similar calculation yields that $\Phi(r_{t+1}, c_{t+1}) \ge \Phi(r_t, c_t) + 3(c_{3,t} - r_{3,t}) \log(1+\epsilon)$. In either case, note that $\Phi$ is strictly increasing unless $r_i/c_i = 1$ for each $i$, which can only happen when $\Phi(r_t, c_t) = 0$.

To see that $\Phi$ is unbounded, we first argue that if the private distributions $r, c$ are both sufficiently far from the boundary of the unit cube, then the value of the potential function will be increasing at a rate proportionate to its value.
If $r$ or $c$ is near the boundary of the unit cube, and $\max_i |r_i - c_i|$ is small, then we argue that the dynamics will drive the private distributions towards the interior of the unit cube. Thus it will follow that the value of the potential function is unbounded.
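The monotonicity claimed in Lemma 1 can be observed numerically. The sketch below runs the deterministic distribution payoff dynamics with $\epsilon_R = \epsilon_C$ (so $\alpha = 1$) and records the potential at each step. The payoff matrices are assumed to be the ones implied by the update equations above, and the function names and the starting point are hypothetical.

```python
import math

# Shapley's game, with entries assumed from the update equations in Lemma 1.
R = [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
C = [[0, 2, 1], [1, 0, 2], [2, 1, 0]]

def mw_update(w, payoffs, eps):
    """One multiplicative weights step followed by normalization."""
    raw = [wi * (1.0 + eps) ** p for wi, p in zip(w, payoffs)]
    total = sum(raw)
    return [x / total for x in raw]

def potential(r, c, alpha=1.0):
    """Phi(r, c) = log max_i(r_i / c_i^alpha) - log min_i(r_i / c_i^alpha)."""
    logs = [math.log(ri) - alpha * math.log(ci) for ri, ci in zip(r, c)]
    return max(logs) - min(logs)

def step(r, c, eps_r, eps_c):
    """One round of the deterministic (distribution payoff) dynamics."""
    row_pay = [sum(R[i][j] * c[j] for j in range(3)) for i in range(3)]
    col_pay = [sum(r[i] * C[i][j] for i in range(3)) for j in range(3)]
    return mw_update(r, row_pay, eps_r), mw_update(c, col_pay, eps_c)

def potential_trajectory(r, c, eps, steps):
    """Potential values along the dynamics when both players use the same eps."""
    values = [potential(r, c)]
    for _ in range(steps):
        r, c = step(r, c, eps, eps)
        values.append(potential(r, c))
    return values
```

Calling `potential_trajectory([0.4, 0.3, 0.3], [0.3, 0.3, 0.4], 0.1, 50)` yields a non-decreasing sequence, whereas starting both players at the same distribution keeps the potential at zero. The run is kept short so that no probability underflows inside the logarithm.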

Specifically, if $r, c \in [.1, 1]^3$, then from the derivative of the logarithm, we have $30 \max_i |r_i - c_i| \ge \Phi(r, c)$, and thus, provided $r_t, c_t$ are in this range, $\Phi(r_{t+1}, c_{t+1}) \ge \Phi(r_t, c_t)\left(1 + \frac{\log(1+\epsilon)}{30}\right)$. If $r_t, c_t \notin [.1, 1]^3$, then arguments from the proof of Proposition 1 can be used to show that after some time $t_0$, either $r_{t_0}, c_{t_0} \in [.2, 1]^3$, or for some time $t < t_0$, $\max_i |r_{i,t} - c_{i,t}| \ge .01$, in which case by the above arguments the value of the potential function must have increased by at least $.01 \log(1+\epsilon)$, and thus our lemma holds.

The above lemma guarantees that the potential function will get arbitrarily large. We now leverage this result to argue that there is some time $t_0$ and a coordinate $i$ such that $r_{i,t_0}$ is very close to 1, whereas $c_{i,t_0}$ is very close to zero. The proof consists of first considering some time at which the potential function is quite large. Then, we argue that there must be some future time at which, for some $i, j$ with $i \neq j$, the contributions of coordinates $i$ and $j$ to the value of the potential function are both significant. Given that $|\log(r_i/c_i)|$ and $|\log(r_j/c_j)|$ are both large, we then argue that after some more time, we get the desired imbalance in some coordinate $k$, namely that $r_k > .999$ and $c_k < .001$ (or vice versa).

Lemma 2. Given initial distributions $r_0 = (r_{1,0}, r_{2,0}, r_{3,0})$, $c_0 = (c_{1,0}, c_{2,0}, c_{3,0})$, with $\Phi(r_0, c_0) \ge 40 \log_{1+\epsilon_R}(2000)$, assuming that the cumulative distributions converge to the equilibrium, then there exists $t_0 > 0$ and $i$ such that either $r_{i,t_0} > .999$ and $c_{i,t_0} < .001$, or $r_{i,t_0} < .001$ and $c_{i,t_0} > .999$.

Proof. For convenience, we will assume all logarithms are to the base $1+\epsilon_R$, unless otherwise specified. For ease of notation, let $k = \log_{1+\epsilon_R}(2000)$. Also, for simplicity, we give the proof in the case that $\epsilon_R = \epsilon_C = \epsilon$; as above, the proof of the general case is nearly identical.
Assuming for the sake of contradiction that the cumulative distributions converge to the equilibrium of the game, it must be the case that there exists some time $t > 0$ for which $\arg\max_i |\log(r_{i,t}/c_{i,t})| \neq \arg\max_i |\log(r_{i,0}/c_{i,0})|$, and thus, without loss of generality, we may assume that at time 0, for some $i, j$ with $i \neq j$, $\left|\log \frac{r_{i,0}}{c_{i,0}}\right| > 13k$ and $\left|\log \frac{r_{j,0}}{c_{j,0}}\right| > 13k$. Without loss of generality, we may assume that $r_i > c_i$. We will first consider the cases in which $\log(r_i/c_i) > 13k$ and $\log(r_j/c_j) > 13k$, and then will consider the cases when $\log(r_i/c_i) > 13k$ and $\log(r_j/c_j) < -13k$.

Consider the case when $\log(r_1/c_1) > 13k$ and $\log(r_2/c_2) > 13k$. Observe that $c_3 > r_3$ and that $k = \ln(2000)/\ln(1+\epsilon_R) \ge \ln(2000)/\epsilon_R$. Let $t_0$ be the smallest time at which $\log(r_{3,t_0}) - \max(\log(r_{1,t_0}), \log(r_{2,t_0})) \le -k$. We argue by induction that

$$\log(c_{3,t}) - \max(\log(c_{1,t}), \log(c_{2,t})) - \big(\log(r_{3,t}) - \max(\log(r_{1,t}), \log(r_{2,t}))\big) \ge 12k$$

for any $t \in \{0, \ldots, t_0 - 1\}$. When $t = 0$, this quantity is at least $13k$. Assuming the claim holds for all $t' < t$, for some fixed $t < t_0 - 1$, we have that $\sum_{t'=0}^{t+1} r_{1,t'}$ … 2 1 2 $\epsilon_R$ 2000, where the factor of 2 in the numerator takes into account the fact that the payoffs are slightly different than 2, 1, 0 for the three row strategies. Similarly, $\sum_{t'=0}^{t+1} r_{2,t'}$ … 2 1 $\epsilon_R$. Thus we have that $\log(c_{3,t+1}) - \log(c_{1,t+1})$ … $\log(c_{3,0}) - \log(c_{1,0})$ … $2(t+1)$ … 4 2 $\epsilon_R$ … $\log(c_{3,0}) - \log(c_{1,0})$ … $2(t+1)$ … $k$. Similarly, we can write a corresponding expression for $\log(c_{3,t+1}) - \log(c_{2,t+1})$, from which our claim follows.

Thus we have that $\log(c_{3,t_0}) - \max(\log(c_{1,t_0}), \log(c_{2,t_0})) \ge 12k$, and $\log(r_{3,t_0}) - \max(\log(r_{1,t_0}), \log(r_{2,t_0})) \le -k$. After another $2.1k$ timesteps, we have that $\log(r_{3,t_0+2.1k}) - \max(\log(r_{1,t_0+2.1k}), \log(r_{2,t_0+2.1k})) \le -k$, and $\log(c_{3,t_0+2.1k}) - \max(\log(c_{1,t_0+2.1k}), \log(c_{2,t_0+2.1k})) \ge 7k$. If $\log(r_{1,t_0+2.1k}) - \log(r_{2,t_0+2.1k}) < -k$, then we are done, since $r_{2,t_0+2.1k} > .999$ and $c_{2,t_0+2.1k} < .001$. If $\log(r_{1,t_0+2.1k}) - \log(r_{2,t_0+2.1k}) > -k$, then it must be the case that $\log(r_{1,t_0+4.2k}) - \log(r_{2,t_0+4.2k}) > k$, at which point we still have $\log(c_{3,t_0+4.2k}) - \max(\log(c_{1,t_0+4.2k}), \log(c_{2,t_0+4.2k})) > 2k$, so we have $r_{1,t_0+4.2k} > .999$ and $c_{1,t_0+4.2k} < .001$. The case when $\log(r_1/c_1) > 13k$ and $\log(r_3/c_3) > 13k$ is identical.

In the case when $\log(r_1/c_1) > 13k$ and $\log(r_2/c_2) < -13k$, we let $t_0$ be the first time at which either $\log(r_{1,t_0}) - \log(r_{3,t_0}) > k$ or $\log(c_{2,t_0}) - \log(c_{3,t_0}) > k$. As above, we can show by induction that $\log(r_{2,t_0}) - \max(\log(r_{1,t_0}), \log(r_{3,t_0})) < -12k$, and $\log(c_{1,t_0}) - \max(\log(c_{2,t_0}), \log(c_{3,t_0})) < -12k$. After another $2.1k$ timesteps, either $r_1 > .999$ and $c_1 < .001$, or $c_{2,t_0+2.1k} > .1$, in which case after an additional $2.1k$ timesteps, $c_2 > .999$ and $r_2 < .001$. The remaining case, when $\log(r_1/c_1) > 13k$ and $\log(r_3/c_3) < -13k$, is identical, as can be seen by switching the players and permuting the rows and columns of the matrix.

The following lemma completes our proof of Theorem 1.

Lemma 3. Given initial distributions $r_0 = (r_{1,0}, r_{2,0}, r_{3,0})$, $c_0 = (c_{1,0}, c_{2,0}, c_{3,0})$, such that for some $i$, $r_{i,0} > .999$ and $c_{i,0} < .001$, the cumulative distributions defined by

$$R_{i,t} = \frac{\sum_{j=0}^{t} r_{i,j}}{t}, \qquad C_{i,t} = \frac{\sum_{j=0}^{t} c_{i,j}}{t}$$

do not converge as $t \to \infty$.

Proof.
As above, for the sake of clarity we present the proof in the case that $\epsilon_R = \epsilon_C = \epsilon$. Throughout the following proof, all logarithms will be taken with base $1+\epsilon$. Assume without loss of generality that $r_{1,0} > .999$ and $c_{1,0} < .001$. First note that if $c_{2,t} < 1/2$ then $r_1$ must increase and $c_1$ will decrease, and thus without loss of generality, we may assume that $r_{1,0} \ge .999$, $c_{1,0} < .001$, and $c_{2,0} \ge 1/2$. For some $k \le \log 10$, it must be the case that after $k$ timesteps we have $c_{2,k} \ge .9$, and $\log(r_{1,k}) - \log(r_{i,k}) \ge \log 999 - k$, for $i = 2, 3$. At this point $\log(c_2/c_3)$, $\log(c_3/c_1)$, and $\log(r_1/r_2)$, $\log(r_3/r_2)$ will all continue to increase until $r_3 \ge 1/\ldots$ Let $t_1$ denote the number of steps before $r_1 < .9$, and note that $t_1 \ge \log 999 - k - \log 10$.

At this point, we must have $\log(r_1/r_2) \ge .9 t_1$, $\log(c_2/c_3) \ge .9 t_1$, $\log(c_3/c_1) \ge .9 t_1$. After another at most $\log 10$ steps, $r_3 > .9$, and $r_3$ will continue to increase until $c_2 < .9$. Let $t_2$ denote the time until $c_2 \le .9$, which must mean that $c_1 \ge .1$ since $c_3$ is decreasing, and note that … $\log 10$, where the last term is due to the steps that occur when neither $r_1$ nor $r_3$ were at least .9. At this time point, we must have that $\log(c_2/c_3) \ge .9 t_2$, $\log(r_3/r_1) \ge .9 t_2$, $\log(r_1/r_2) \ge .9 t_2$. After another at most $k_3$ steps, $c_1 > .9$, and we can continue arguing as above, to yield that after another … $\log 10$ steps, $r_3 < .9$, $r_2 \ge .1$, and $\log(c_1/c_2) \ge .9 t_3$, $\log(c_2/c_3) \ge .9 t_3$. Inductively applying these arguments shows that the amount of time during which the weight of a single strategy is held above .9 increases by a factor of at least 1.8 in each iteration, and thus the cumulative distributions $\frac{1}{t}\sum_{j=1}^{t} r_{i,j}$ cannot converge.

4 Stochastic Setting

In this section we prove an analog of Theorem 1 for the multiplicative weights learning algorithm in the stochastic setting. We show that in this setting, no matter the initial configuration, with probability tending towards 1, the cumulative distributions of the row and column player will be far from the Nash equilibrium. To show this, we will make use of the same potential function (1) as in the proof of Theorem 1, and analyze its expected drift. Although the expectation operator doesn't commute with the application of the potential function (and thus we cannot explicitly use the monotonicity of the potential function as calculated above), unsurprisingly, in expectation the potential function increases. While the drift in the potential function vanished at the equilibrium in the distribution payoff setting, in this setting, the randomness, together with the nonnegativity of the potential function, allows us to bound the expected drift by a positive constant when the distributions are not near the boundary of the unit cube. Given this, as in the previous section we will then be able to show that for any constant, with probability 1 after a sufficiently long time the value of the potential function will be at least that constant.
Given this, analogs of Lemmas 2 and 3 then show that the cumulative distributions tend away from the equilibrium with all but inverse exponential probability. Our main theorem in this setting is the following.

Theorem 2. If the row player uses multiplicative updates with update parameter $(1+\epsilon_R)$, and the column player uses multiplicative updates with update parameter $(1+\epsilon_C)$, then from any initial pair of distributions, after $t$ time steps, either the dynamics have left the simplex $r_i, c_i \in (1/3 - .2,\ 1/3 + .2)$ at some time step $t_0 \le t$, or with all but inverse exponential probability will be at a distance $\exp(\Omega(t))$ from the equilibrium.
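The stochastic dynamics that Theorem 2 refers to can be simulated directly. The following Python sketch samples one pure strategy per player per round, applies the multiplicative update to each player's payoff column or row, and records the potential (1) together with whether both distributions are still inside the box $(1/3 - .2,\ 1/3 + .2)$. The payoff matrices are assumed as in the proof of Lemma 1; the function names, the uniform starting point, and the seed are hypothetical choices for illustration.

```python
import math
import random

# Shapley's game, with entries assumed from the update equations in Lemma 1.
R = [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
C = [[0, 2, 1], [1, 0, 2], [2, 1, 0]]

def mw_update(w, payoffs, eps):
    """One multiplicative weights step followed by normalization."""
    raw = [wi * (1.0 + eps) ** p for wi, p in zip(w, payoffs)]
    total = sum(raw)
    return [x / total for x in raw]

def potential(r, c, alpha):
    """Phi(r, c) = log max_i(r_i / c_i^alpha) - log min_i(r_i / c_i^alpha)."""
    logs = [math.log(ri) - alpha * math.log(ci) for ri, ci in zip(r, c)]
    return max(logs) - min(logs)

def inside_box(r, c, radius=0.2):
    """True while every probability is within `radius` of 1/3."""
    return all(abs(x - 1 / 3) < radius for x in r + c)

def simulate_stochastic(eps_r, eps_c, steps, seed=0):
    """Run the stochastic setting from the uniform profile; return the final
    distributions and a per-step list of (potential, inside_box) pairs."""
    rng = random.Random(seed)
    alpha = math.log(1 + eps_r) / math.log(1 + eps_c)
    r, c = [1 / 3] * 3, [1 / 3] * 3
    history = []
    for _ in range(steps):
        i = rng.choices(range(3), weights=r)[0]   # row player's sampled strategy
        j = rng.choices(range(3), weights=c)[0]   # column player's sampled strategy
        new_r = mw_update(r, [R[k][j] for k in range(3)], eps_r)  # column j of R
        new_c = mw_update(c, C[i], eps_c)                         # row i of C
        r, c = new_r, new_c
        history.append((potential(r, c, alpha), inside_box(r, c)))
    return r, c, history
```

A single run such as `simulate_stochastic(0.1, 0.2, 200, seed=1)` is only one sample path, so it illustrates the setting rather than the probabilistic statement of the theorem.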

To prove the theorem, we need the following lemma, whose proof is deferred to the full version, that establishes the desired drift of potential (1).

Lemma 4. If $r_i, c_i \in (1/3 - .2,\ 1/3 + .2)$, then

$$E[\Phi(r_{t+1}, c_{t+1}) \mid r_t, c_t] \ge \Phi(r_t, c_t) + \max\left(\Phi(r_t, c_t)\log(1+\epsilon_R)\,\ldots,\ (\log(1+\epsilon_R))\,\ldots\right)$$

We are now prepared to finish our proof of Theorem 2. We do so by analyzing the one-dimensional random walk defined by the value of the potential function over time. As long as our pair of distributions has probability values in $(1/3 - .2,\ 1/3 + .2)$, there is a constant (a function of $\epsilon_R$) drift pushing us away from the equilibrium (which corresponds to the minimum of the potential function). Using martingale arguments we can then show that with all but inverse exponential probability the value of the potential function will be $\gamma t$ for some constant $\gamma$, independent of $t$, unless we have exited the ball of radius 0.2 around the equilibrium.

Proof of Theorem 2: We wish to analyze the random walk $(r_0, c_0), (r_1, c_1), \ldots$, where the evolution is according to the stochastic dynamics. To do this analysis, we'll consider the one-dimensional random walk $X_0, X_1, \ldots$, where $X_t = \Phi(r_t, c_t)$, assuming that the walk starts within the ball $r_i, c_i \in (1/3 - .2,\ 1/3 + .2)$. Note first that $|X_{t+1} - X_t| \le 4 \log(1+\epsilon_R)$. Next, from the $X_i$'s, we can define a martingale sequence $Y_0, Y_1, \ldots$ where $Y_0 = X_0$, and for $i \ge 0$, $Y_{i+1} := Y_i + X_{i+1} - E[X_{i+1} \mid X_i]$. Clearly the sequence $Y_i$ has the bounded difference property, specifically $|Y_{i+1} - Y_i| \le 8 \log(1+\epsilon_R)$, and thus we can apply Azuma's inequality² to yield that with probability at least $1 - 2\exp(-t^{2/3}/2)$, $Y_t \ge Y_0 - t^{5/6}\, 8 \log(1+\epsilon_R)$. Notice next that, from our definition of the martingale sequence $\{Y_t\}$ and Lemma 4, it follows that, as long as the distributions are contained within the ball $r_i, c_i \in (1/3 - .2,\ 1/3 + .2)$, $X_t \ge Y_t + t \cdot (\log(1+\epsilon_R))\,\ldots$ Let us then define $T$ to be the random time where the distributions exit the ball for the first time, and consider the sequence of random variables $\{Y_{t \wedge T}\}$.
Clearly, the new sequence is also a martingale, and from the above we get $X_{t \wedge T} \ge Y_{t \wedge T} + (t \wedge T) \cdot (\log(1+\epsilon_R))\,\ldots$, and, with probability at least $1 - 2\exp(-t^{2/3}/2)$, $Y_{t \wedge T} \ge Y_0 - t^{5/6}\, 8 \log(1+\epsilon_R)$. Hence, with probability at least $1 - 2\exp(-t^{2/3}/2)$, $X_{t \wedge T} \ge Y_0 - t^{5/6}\, 8 \log(1+\epsilon_R) + (t \wedge T) \cdot (\log(1+\epsilon_R))\,\ldots$, and the theorem follows.

5 Multi-armed Setting

Perhaps unsurprisingly in light of the inability of multiplicative weight updates to converge to the Nash equilibrium in the stochastic setting, we show the analogous result for the multi-armed setting. The proof very closely mirrors that of Theorem 2, and, in fact, the only notable difference is in the calculation of the expected drift of the potential function. The analogue of Lemma 4 can be easily shown to hold and the rest of the proof follows easily; we defer details to the full version.

² Azuma's inequality: Let $X_1, X_2, \ldots$ be a martingale sequence with the property that for all $t$, $|X_t - X_{t+1}| \le c$; then for all positive $t$, and any $\gamma > 0$, $\Pr[|X_t - X_1| \ge c\gamma\sqrt{t}] \le 2e^{-\gamma^2/2}$.
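For readers who want the two martingale steps in the proof of Theorem 2 spelled out, the following short derivation shows where the constant $8\log(1+\epsilon_R)$ and the probability $1 - 2\exp(-t^{2/3}/2)$ come from. It is a sketch under two assumptions: that Azuma's inequality is applied with the sequence indexed from 0 rather than from 1, and that $\gamma = t^{1/3}$ is the intended choice.

$$\begin{aligned}
|Y_{i+1} - Y_i| &= \big|X_{i+1} - E[X_{i+1} \mid X_i]\big| \\
&= \big|(X_{i+1} - X_i) - E[X_{i+1} - X_i \mid X_i]\big| \\
&\le |X_{i+1} - X_i| + E\big[\,|X_{i+1} - X_i| \mid X_i\big] \\
&\le 4\log(1+\epsilon_R) + 4\log(1+\epsilon_R) = 8\log(1+\epsilon_R).
\end{aligned}$$

Taking $c = 8\log(1+\epsilon_R)$ and $\gamma = t^{1/3}$ in Azuma's inequality gives $c\gamma\sqrt{t} = 8\log(1+\epsilon_R)\, t^{5/6}$ and

$$\Pr\big[\,|Y_t - Y_0| \ge 8\log(1+\epsilon_R)\, t^{5/6}\,\big] \le 2\exp\!\left(-\frac{t^{2/3}}{2}\right),$$

which matches the bound $Y_t \ge Y_0 - t^{5/6}\, 8\log(1+\epsilon_R)$ used in the proof.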

6 Conclusions and Open Problems

We showed that simple learning approaches which are known to solve zero-sum games cannot work for Nash equilibria in general bimatrix games; we did so by considering the simplest possible game. Some of our non-convergence proofs are rather daunting; it would be interesting to investigate whether considering more complicated games results in simpler (and easier to generalize to larger classes of algorithms) proofs. In particular, Shapley's game has a unique Nash equilibrium; intuitively, one algorithmically nasty aspect of Nash equilibria in nonzero-sum games is their non-convexity: there may be multiple discrete equilibria. Zinkevich [14] has taken an interesting step in this direction, defining a variant of Shapley's game with an extra pure Nash equilibrium. However, after quite a bit of effort, it seems to us that a non-convergence proof in Zinkevich's game may not be ultimately much easier than the ones presented here. Despite the apparent difficulties, however, we feel that a very strong lower bound, valid for a very large class of algorithms, may ultimately be proved.

References

1. A. Blum, M. Hajiaghayi, K. Ligett, and A. Roth. Regret minimization and price of total anarchy, STOC.
2. E. Friedman and S. Shenker. Learning and implementation on the Internet, Working paper.
3. Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights, Games and Economic Behavior, 29:79-103.
4. S. Hart and Y. Mansour. The communication complexity of uncoupled Nash equilibrium procedures, STOC.
5. S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium, Econometrica, 68, 5.
6. S. Karlin. Mathematical Methods and Theory in Games, Programming, and Economics, Dover.
7. R. Kleinberg, G. Piliouras, and É. Tardos. Multiplicative updates outperform generic no-regret learning in congestion games, STOC.
8. J. Robinson. An iterative method of solving a game, Annals of Mathematics.
9. T. Roughgarden. Intrinsic robustness of the price of anarchy, STOC.
10. L. S. Shapley. Some topics in two-person games, in Advances in game theory, edited by M. Dresher, R. J. Aumann, L. S. Shapley, A. W. Tucker.
11. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The Nonstochastic Multiarmed Bandit Problem, SIAM J. Comput., Vol 32, 48:77.
12. S. Arora, E. Hazan, and S. Kale. The Multiplicative Weights Update Method: a Meta Algorithm and Applications, Manuscript.
13. J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability, COLT.
14. M. Zinkevich. Theoretical guarantees for algorithms in multi-agent settings, Tech. Rep. CMU-CS, Carnegie Mellon University, 2004.