Distributed Strategic Learning with Application to Network Security


2011 American Control Conference, on O'Farrell Street, San Francisco, CA, USA, June 29 - July 1, 2011

Quanyan Zhu, Hamidou Tembine, and Tamer Başar

Abstract

We consider in this paper a class of two-player nonzero-sum stochastic games with incomplete information. We develop fully distributed reinforcement learning algorithms, which require for each player a minimal amount of information regarding the other player. At each time, each player can be in an active mode or in a sleep mode. If a player is in an active mode, she updates her strategy and estimates of unknown quantities using a specific pure or hybrid learning pattern. We use stochastic approximation techniques to show that, under appropriate conditions, the pure or hybrid learning schemes with random updates can be studied using their deterministic ordinary differential equation (ODE) counterparts. Convergence to state-independent equilibria is analyzed under specific payoff functions. Results are applied to a class of security games in which the attacker and the defender adopt different learning schemes and update their strategies at random times.

I. INTRODUCTION

In recent years, game-theoretic methods have been applied to study resource allocation problems in communication networks, security mechanisms for network security and privacy [1], and economic pricing in power networks [7]. Most frameworks have assumed the rationality of the agents or the decision-makers, as well as complete information about their payoffs and strategies. However, in practice, due to noise and uncertainties in the environment, agents often have information limitations in their knowledge not only of other players' payoffs and strategies but also of their own. For this reason, we must consider the learning aspects of the decision-makers and address their estimation and assessment of their payoffs and strategies based on the information available to them.

Learning in games has been investigated in many recent papers. In [9], [16], a fictitious-play algorithm is used to find Nash equilibrium in a nonzero-sum game.
Players observe opponents' actions and update their strategies in reaction to others' actions in a best-response fashion. The authors in [14] propose a modified version of fictitious play, called joint fictitious play with inertia, for potential games, in which players alternate their updates at different time slots. In standard fictitious play (Brown 1951, Robinson 1951), players have to monitor the actions of every other player and need to know their own payoff so as to find their optimal actions. In this paper, we are interested in fully distributed learning procedures where players do not need any information about the actions or payoffs of the other players and, moreover, do not need to have complete information of their own payoff structure.

The focus of this paper is on finite games, where the existence of mixed Nash equilibrium is ensured by [15]. Some recent work has been done for infinite (or continuous-kernel) games under incomplete information, where extremum-seeking methods have been used for (local) convergence to pure-strategy Nash equilibria; see [8] and several of the references therein. A similar setting was adopted in [18], which however dealt with zero-sum games. Here we extend these results in a nontrivial way to general-sum two-person games and introduce the new paradigm of hybrid learning, where players can choose different pure learning schemes at different times based on their rationality and preferences. The heterogeneous learning in [18] can be seen as a special case of the generalized hybrid learning of this paper.

This work was supported in part by grants from AFOSR and DOE. Q. Zhu and T. Başar are with Dept. ECE and CSL, University of Illinois at Urbana-Champaign, 1308 West Main, Urbana, IL 61801, USA. {zhu31, basar1}@illinois.edu. H. Tembine is with Telecommunication Department, Supelec, France. E-mail: tembine@ieee.org

In order to render the learning more practical in the context of network security, we introduce additional features of the game:

(F1) In addition to exogenous environment uncertainties, we introduce inherent mode uncertainties in players. Each player can be in an active mode or a sleeping mode.
Players learn their strategies and average payoffs only when they are in an active mode.

(F2) We allow the interaction between the players to occur at random times unknown by the players.

We use stochastic approximation techniques to show that the hybrid learning schemes with random updates can be studied using their deterministic ordinary differential equation (ODE) counterparts. The ODE obtained for hybrid learning is a linear combination of ODEs from pure learning schemes. We show the convergence properties of the learning algorithms for special classes of games, namely, games with two actions as well as potential games, and demonstrate their applicability in a network security environment.

The paper is structured as follows. In Section II, we formulate the two-player nonzero-sum stochastic game with incomplete information and introduce the solution concept of state-independent Nash equilibrium. In Section III, we present a number of distinct learning schemes and discuss their properties. In Section IV, we present main results on learning for general-sum games. In Section V, we apply the learning algorithms to a network security application. Section VI concludes the paper.

II. TWO-PERSON GAME

In this section, we consider a finite two-person nonzero-sum game (NZSG) in which each player has stochastic payoffs and the interactions between the players are random. Let

$$\Xi := \left( \mathcal{N}, \{S_i\}_{i \in \mathcal{N}}, \{\Omega_i\}_{i \in \mathcal{N}}, \{A_i\}_{i \in \mathcal{N}}, \{U_i(s, B, \cdot)\}_{s \in S,\, B \in \Omega,\, i \in \mathcal{N}} \right)$$

be the stochastic NZSG, where $\mathcal{N} = \{1, 2\}$ is the set of players $P_1$ and $P_2$, who maximize their payoffs, and $A_1$, $A_2$ are the finite sets of actions available to players $P_1$ and $P_2$, respectively. The set $S_i := [s_{i1}, s_{i2}, \ldots, s_{iN_{S,i}}]$ comprises all possible $N_{S,i}$ external states of $P_i$, which describe the environment where $P_i$ resides. We assume that the state space $S := \prod_{i \in \mathcal{N}} S_i$ and the probability transition on the states are both unknown to the players. A state $s$ is randomly and independently chosen at each time from the set $S$. We assume that the action spaces are the same in each state.

In addition, players do not interact at all times. A player can be in one of two modes: active mode or sleep mode, denoted by mode $B_i = 1$ and $B_i = 0$, respectively. Let $B_i$, $i \in \mathcal{N}$, be an i.i.d. random variable on $\Omega_i := \{0, 1\}$ whose probability mass function is given by $\rho_i(B_i = 1) = p_i$, $\rho_i(B_i = 0) = 1 - p_i$, $i \in \mathcal{N}$. The player modes can be viewed as internal states that are governed by the inherent randomness of the player. The system mode $B \in \Omega := \Omega_1 \times \Omega_2$ is a set of independent modes of the players, and we denote by $B^{\mathcal{N}}(B) \subseteq \mathcal{N}$ the corresponding set of active players. The NZSG is characterized by utility functions $U_i : S \times \Omega \times A_1 \times A_2 \to \mathbb{R}$. $P_i$ collects a payoff $U_i(s, B, a_1, a_2)$ when $P_1$ chooses $a_1 \in A_1$ and $P_2$ uses $a_2 \in A_2$ at state $s \in S$ and mode $B$.

Remark 1: The preceding game model can be viewed as a special class of stochastic games in which the state transitions are independent of the player actions as well as the current state, and we assume that the state processes and the activities of the players are i.i.d. random variables.

We have slotted time, $t \in \{0, 1, \ldots\}$, when players pick their mixed strategies as functions of what has transpired in the past, to the extent the information available to them allows. Toward this end, we let $x_{i,t}(a_i)$ denote the probability of $P_i$ choosing $a_i \in A_i$ at time $t$, and let $x_{i,t} = [x_{i,t}(a_i)]_{a_i \in A_i}$ be the mixed strategy of $P_i$ at time $t$, where, more precisely,

$$x_{i,t} \in X_i := \left\{ x_i \in \mathbb{R}^{|A_i|} : x_i(a_i) \in [0, 1],\ \sum_{a_i \in A_i} x_i(a_i) = 1 \right\}.$$

In particular, we define $e_{a_i} \in \mathbb{R}^{|A_i|}$, with $a_i \in A_i$, as unit vectors of size $|A_i|$ whose entry that corresponds to $a_i$ is 1 while the others are zeros. We assume that the mixed strategies of the players are independent of the current state $s$ and the player mode $B$.
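The objects defined above (unit vectors $e_{a_i}$, the simplex $X_i$, independent Bernoulli modes $B_i$) can be illustrated with a minimal Python sketch. The function names and the numerical values of $p_i$ and $x_1$ are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import random

def unit_vector(index, size):
    """Return e_a: 1.0 in position `index`, 0.0 elsewhere."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def is_mixed_strategy(x, tol=1e-9):
    """Check that x lies on the probability simplex X_i."""
    in_range = all(-tol <= v <= 1.0 + tol for v in x)
    return in_range and abs(sum(x) - 1.0) <= tol

def sample_active_set(p, rng):
    """Draw each mode B_i ~ Bernoulli(p_i) independently.

    Returns the (0-based) indices of the active players.
    """
    return [i for i, p_i in enumerate(p) if rng.random() < p_i]

def sample_action(x, rng):
    """Draw an action index according to the mixed strategy x."""
    return rng.choices(range(len(x)), weights=x, k=1)[0]

rng = random.Random(0)           # fixed seed, for reproducibility only
x1 = [0.25, 0.75]                # hypothetical mixed strategy of P1
# Hypothetical activity probabilities: P1 always active, P2 never active.
active = sample_active_set([1.0, 0.0], rng)
pure_choice = sample_action(unit_vector(0, 2), rng)
```

With a degenerate strategy such as `unit_vector(0, 2)`, sampling always returns the corresponding pure action, which matches the role of $e_{a_i}$ as a pure strategy.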
For any given pair of mixed strategies $(x_1, x_2) \in X_1 \times X_2$, and for fixed $s \in S$, $B \in \Omega$, we define the expected utility (as expected payoff to $P_i$) as

$$U_i(s, B, x_1, x_2) := \mathbb{E}_{x_1, x_2} U_i(s, B, a_1, a_2),$$

where $\mathbb{E}_{x_1, x_2} U_i$ denotes expectation of $U_i$ over the action sets of the players under the given mixed strategies. A further expectation of this quantity over the states $s$ and modes $B$, denoted by $\mathbb{E}_{s,B}$, yields the performance index of the expected game. We now define the equilibrium concept of interest for this game, that is, the equilibrium of the expected game.

Definition 1 (State-independent equilibrium): A strategy profile $(x_1^*, x_2^*) \in X_1 \times X_2$ is a state-independent equilibrium of the game $\Xi$ if it is an equilibrium of the expected game, i.e., for all $x_1 \in X_1$, $x_2 \in X_2$,

$$\mathbb{E}_{s,B} U_1(s, B, x_1, x_2^*) \le \mathbb{E}_{s,B} U_1(s, B, x_1^*, x_2^*) \quad \text{and} \quad \mathbb{E}_{s,B} U_2(s, B, x_1^*, x_2) \le \mathbb{E}_{s,B} U_2(s, B, x_1^*, x_2^*).$$

Since the expected game is a two-player game with finite actions for each player, the existence of an equilibrium follows from Nash's existence theorem [15], and hence we have the following lemma.

Lemma 1: The stochastic NZSG $\Xi$ with unknown states and changing modes admits a state-independent equilibrium.

III. LEARNING IN NZSGS

A. Learning Procedures

In many practical applications, players in two-person NZSGs do not have complete knowledge of each other's utility functions and the state of their environment. Moreover, they do not know whether they interact with the other player or not. Hence the equilibrium strategy has to be learned online by observing the realized payoffs during each time slot. A general learning procedure is outlined as follows. At each time slot $t \in \mathbb{Z}_+$, each player generates an internal mode $B_i$ to determine whether to participate in the game or not. If both players are active, they interact and receive a payoff after the play. If only one of the players is active, then the active player receives his payoff as an outcome of his action at $t$ only, without interaction with the other player. If players do not have knowledge of their active mode probability $p_i$, then each player keeps a count of its interaction with others by updating its counters $\theta_{ij,t} \in \mathbb{R}$, $j \in \{1, 2\}$, as follows:

$$\theta_{ij,t+1} = \theta_{ij,t} + \mathbb{1}_{\{B_j = 1\}},$$

where $\theta_{ij,t}$ is $P_i$'s count of $P_j$'s number of activities since $t = 0$, and the initial condition is given by $\theta_{ij,0} = 0$, $j \in \{1, 2\}$.
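The activity counter above is a running count of how often each player has been active. A minimal Python sketch follows; the function name and the sample sequence of modes are hypothetical and serve only as an illustration.

```python
def update_activity_counts(theta_i, modes):
    """One step of theta_{ij,t+1} = theta_{ij,t} + 1{B_j = 1}.

    theta_i: list of P_i's current counts, one entry per player j.
    modes:   tuple of realized modes (B_1, B_2) in this time slot.
    """
    return [count + (1 if mode == 1 else 0)
            for count, mode in zip(theta_i, modes)]

theta_1 = [0, 0]  # initial condition theta_{ij,0} = 0
# Hypothetical mode realizations over three time slots.
for modes in [(1, 0), (1, 1), (0, 1)]:
    theta_1 = update_activity_counts(theta_1, modes)
```

After these three slots each player has been active twice, so both counters equal 2.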
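The active players choose an action $a_{i,t} \in A_i$ at time $t$ and observe or measure an output $u_{i,t} \in \mathbb{R}$ as an outcome of their actions. Players estimate their payoffs by updating the entry of the estimated payoff vector $\hat{u}_{i,t+1} \in \mathbb{R}^{|A_i|}$ that corresponds to the chosen action $a_{i,t}$. In a similar way, players update their strategy vectors $x_{i,t+1}$ based on a specific learning scheme (to be introduced later). The update of the strategy vectors can exploit the payoff information $\hat{u}_{i,t}$ from the previous time step. In this case, we say the learning is combined. The general combined learning updates on the strategy and utility vectors take the following form:

$$\begin{cases} x_{i,t+1} = x_{i,t} + \Pi_{i,t}(\lambda_{i,t}, a_{i,t}, u_{i,t}, \hat{u}_{i,t}, x_{i,t}), \\ \hat{u}_{i,t+1} = \hat{u}_{i,t} + \Sigma_{i,t}(\nu_{i,t}, a_{i,t}, u_{i,t}, x_{i,t}, \hat{u}_{i,t}), \end{cases} \qquad (1)$$

where $\Pi_{i,t}$, $\Sigma_{i,t}$, $i \in \mathcal{N}$, are properly chosen functions for strategy and utility updates, respectively. The parameters $\lambda_{i,t}$, $\nu_{i,t}$ are learning rates indicating players' capabilities of information retrieval and update. The vectors $x_{i,t} \in X_i$ are mixed strategies of the players at time $t$; $\hat{u}_{i,t}$, $i \in \mathcal{N}$, are estimated average payoffs updated at each iteration $t$; and $u_{i,t}$, $i \in \mathcal{N}$, are the observed payoffs received by players at time $t$. The learning rates $\lambda_{i,t}, \nu_{i,t} \in \mathbb{R}_+$ need to satisfy the conditions

(C1) $\sum_t \lambda_{i,t}^2 < \infty$, $\sum_t \nu_{i,t}^2 < \infty$;
(C2) $\sum_t \lambda_{i,t} = +\infty$, $\sum_t \nu_{i,t} = +\infty$.

The learning rates of $P_i$ are relative to their frequency of activity. In general, they are functions of $\theta_i$, $i \in \mathcal{N}$, and can be written as $\lambda_i(\theta_i(t))$, $\nu_i(\theta_i(t))$. We need to adopt a time reference for the game using maximum learning rates among the active players, i.e.,

$$\lambda_t := \max_{i \in B^{\mathcal{N}}(t)} \lambda_i(\theta_i(t)), \qquad \nu_t := \max_{i \in B^{\mathcal{N}}(t)} \nu_i(\theta_i(t)).$$

It can be verified that the reference learning rates $\lambda_t$, $\nu_t$ satisfy (C1) and (C2) if $\lambda_{i,t}$, $\nu_{i,t}$ satisfy the conditions for every $i \in \mathcal{N}$. The learning rates $\lambda_t$, $\nu_t$, as we will see later, affect the ODE approximation. We call the learning in (1) a COmbined DIstributed PAyoff and Strategy Reinforcement Learning (CODIPAS-RL) [18].

The players can have different learning rates for their utility and strategy updates. The payoff learning rate is on a faster time scale than the strategy learning rate if $\lambda_{i,t}/\nu_{i,t} \to 0$ as $t \to \infty$; if it is the other way around, $\nu_{i,t}/\lambda_{i,t} \to 0$ as $t \to \infty$. In the former case, the strategy can be seen as quasi-static compared to the payoff learning, and vice versa for the latter.

B. Learning Schemes

We introduce different learning schemes in the form of (1) for the stochastic NZSG. Let $\mathcal{L} = \{L_k,\ k \in \{1, \ldots, 5\}\}$ be the set of five pure learning schemes. A player $P_i$ chooses a learning scheme from the set $\mathcal{L}$. We call the learning homogeneous if both players use the same pure learning scheme, and heterogeneous if players use different learning schemes, i.e., the schemes chosen by $P_1$ and $P_2$ differ.

1) Bush-Mosteller-based CODIPAS-RL $L_1$: Let $\Gamma_i \in \mathbb{R}$ be a reference level of $P_i$ and

$$\tilde{\Gamma}_{i,t} := \frac{u_{i,t} - \Gamma_i}{\sup_{s,B,a} |U_i(s, B, a) - \Gamma_i|}.$$

The learning pattern $L_1$ is given by

$$\begin{aligned} x_{i,t+1}(a_i) &= x_{i,t}(a_i) + \lambda_{i,t}\, \mathbb{1}_{\{i \in B^{\mathcal{N}}(t)\}}\, \tilde{\Gamma}_{i,t} \left( \mathbb{1}_{\{a_{i,t} = a_i\}} - x_{i,t}(a_i) \right), \\ \hat{u}_{i,t+1}(a_i) &= \hat{u}_{i,t}(a_i) + \nu_{i,t}\, \mathbb{1}_{\{a_{i,t} = a_i,\ i \in B^{\mathcal{N}}(t)\}} \left( u_{i,t} - \hat{u}_{i,t}(a_i) \right). \end{aligned}$$

The updates on the strategy and the estimated payoff are decoupled, but they are implicitly dependent. The strategy update does not exploit the knowledge of the estimated payoff but only relies on the observed payoffs during each time slot. The strategy update of $L_1$ is widely studied in machine learning and was initially proposed by Bush and Mosteller in [6]. Combined with the payoff update, we obtain a CODIPAS-RL based on Bush-Mosteller learning. When $\Gamma_i = 0$, we obtain the learning schemes in [2], [4].

2) Boltzmann-Gibbs-based CODIPAS-RL $L_2$: Let $\beta_{i,\epsilon} : \mathbb{R}^{|A_i|} \to \mathbb{R}^{|A_i|}$ be the Boltzmann-Gibbs (B-G) strategy mapping given by

$$\beta_{i,\epsilon}(\hat{u}_{i,t})(a_i) := \frac{e^{\frac{1}{\epsilon} \hat{u}_{i,t}(a_i)}}{\sum_{a_i' \in A_i} e^{\frac{1}{\epsilon} \hat{u}_{i,t}(a_i')}}, \qquad a_i \in A_i.$$

It is also known as the soft-max function.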
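As an illustration of the soft-max mapping, here is a minimal Python sketch. The function name is hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability device that does not change the result; it is not taken from the paper.

```python
import math

def boltzmann_gibbs(u_hat, eps):
    """Soft-max of the estimated payoffs u_hat with temperature eps > 0.

    Entry a is exp(u_hat[a] / eps) / sum_b exp(u_hat[b] / eps).
    """
    shift = max(u_hat)  # shifting leaves the ratio unchanged, avoids overflow
    weights = [math.exp((u - shift) / eps) for u in u_hat]
    total = sum(weights)
    return [w / total for w in weights]

uniform = boltzmann_gibbs([1.0, 1.0], eps=0.2)   # equal payoffs
sharp = boltzmann_gibbs([0.0, 1.0], eps=0.01)    # small temperature
```

Equal payoff estimates give the uniform strategy, while a small temperature concentrates almost all probability on the largest entry.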
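When $\epsilon \to 0$, the B-G strategy yields a (pure) strategy that picks the maximum entry of the payoff vector $\hat{u}_{i,t}$. The learning pattern $L_2$ is given by

$$\begin{aligned} x_{i,t+1}(a_i) &= x_{i,t}(a_i) + \lambda_{i,t}\, \mathbb{1}_{\{i \in B^{\mathcal{N}}(t)\}} \left( \beta_{i,\epsilon}(\hat{u}_{i,t})(a_i) - x_{i,t}(a_i) \right), \\ \hat{u}_{i,t+1}(a_i) &= \hat{u}_{i,t}(a_i) + \nu_{i,t}\, \mathbb{1}_{\{a_{i,t} = a_i,\ i \in B^{\mathcal{N}}(t)\}} \left( u_{i,t} - \hat{u}_{i,t}(a_i) \right). \end{aligned}$$

The strategy and the estimated payoff are updated in a coupled fashion. The numerical value of the experiment is used in the estimation, and the estimated payoffs are used to build the strategy (here the estimations are crucial, since a player does not know the numerical value of the payoff corresponding to the other actions that he did not use). The strategy update is a B-G based reinforcement learning. Combined together, one gets the B-G based CODIPAS-RL. The rest point of $L_2$ can be seen as the equilibrium for a modified game with the perturbed payoff $\mathbb{E}_{s,B} U_i + \epsilon_i H_i$, where $H_i$ is the extra entropy term, as discussed in [16].

3) Imitative B-G CODIPAS-RL $L_3$: Let $\beta^{I}_{i,\epsilon,t} : X_i \times \mathbb{R}^{|A_i|} \to \mathbb{R}^{|A_i|}$ be the imitative B-G strategy mapping given by

$$\beta^{I}_{i,\epsilon,t}(x_{i,t}, \hat{u}_{i,t})(a_i) = \frac{x_{i,t}(a_i)\, e^{\frac{1}{\epsilon} \hat{u}_{i,t}(a_i)}}{\sum_{a_i' \in A_i} x_{i,t}(a_i')\, e^{\frac{1}{\epsilon} \hat{u}_{i,t}(a_i')}}, \qquad a_i \in A_i.$$

The learning pattern $L_3$ is given by

$$\begin{aligned} x_{i,t+1}(a_i) &= x_{i,t}(a_i) + \lambda_{i,t}\, \mathbb{1}_{\{i \in B^{\mathcal{N}}(t)\}} \left( \beta^{I}_{i,\epsilon,t}(x_{i,t}, \hat{u}_{i,t})(a_i) - x_{i,t}(a_i) \right), \\ \hat{u}_{i,t+1}(a_i) &= \hat{u}_{i,t}(a_i) + \nu_{i,t}\, \mathbb{1}_{\{a_{i,t} = a_i,\ i \in B^{\mathcal{N}}(t)\}} \left( u_{i,t} - \hat{u}_{i,t}(a_i) \right). \end{aligned}$$

The imitative B-G learning weights the B-G strategy with the current strategy vector $x_{i,t}$, and the strategy mapping $\beta^{I}_{i,\epsilon,t}$ is time-dependent. It allows the learning strategies to be attained at the boundary of the simplex $X_i$.

4) Weighted Imitative B-G CODIPAS-RL $L_4$: Let $\beta^{W}_{i,t} : X_i \times \mathbb{R} \times \mathbb{R}^{|A_i|} \to \mathbb{R}^{|A_i|}$ be the imitative weighted B-G strategy mapping given by

$$\beta^{W}_{i,t}(x_{i,t}, \lambda_{i,t}, \hat{u}_{i,t})(a_i) := \frac{x_{i,t}(a_i)\,(1 + \lambda_{i,t})^{\hat{u}_{i,t}(a_i)}}{\sum_{a_i' \in A_i} x_{i,t}(a_i')\,(1 + \lambda_{i,t})^{\hat{u}_{i,t}(a_i')}} \quad \text{for every } a_i \in A_i.$$

The learning pattern $L_4$ is given by

$$\begin{aligned} x_{i,t+1}(a_i) &= x_{i,t}(a_i) + \mathbb{1}_{\{i \in B^{\mathcal{N}}(t)\}} \left( \beta^{W}_{i,t}(x_{i,t}, \lambda_{i,t}, \hat{u}_{i,t})(a_i) - x_{i,t}(a_i) \right), \\ \hat{u}_{i,t+1}(a_i) &= \hat{u}_{i,t}(a_i) + \nu_{i,t}\, \mathbb{1}_{\{a_{i,t} = a_i,\ i \in B^{\mathcal{N}}(t)\}} \left( u_{i,t} - \hat{u}_{i,t}(a_i) \right). \end{aligned}$$

Note that the exploitation function $\beta^{W}_{i,t}$ is time-dependent in $L_4$ and is independent of the parameter $\epsilon$. If the learning yields an interior point as the equilibrium, then it is the exact equilibrium of the expected game, while the equilibrium in $L_2$ is an approximate one for the $\epsilon$-perturbed game.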
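One iteration of the Boltzmann-Gibbs-based pattern $L_2$ for a single player can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `active` flag standing in for the indicator of $i \in B^{\mathcal{N}}(t)$, and all numerical values are hypothetical.

```python
import math

def softmax(u_hat, eps):
    """Boltzmann-Gibbs mapping with temperature eps > 0."""
    shift = max(u_hat)
    weights = [math.exp((u - shift) / eps) for u in u_hat]
    total = sum(weights)
    return [w / total for w in weights]

def codipas_bg_step(x, u_hat, action, payoff, lam, nu, eps, active=True):
    """One L2-style update for one player; returns (new_x, new_u_hat).

    x:       current mixed strategy (list summing to 1)
    u_hat:   current payoff estimates, one per action
    action:  index of the action just played
    payoff:  observed (noisy) payoff for that action
    lam, nu: strategy and payoff learning rates
    active:  False means the player sleeps and nothing changes
    """
    if not active:
        return list(x), list(u_hat)
    # Strategy moves toward the soft-max of the *previous* estimates.
    target = softmax(u_hat, eps)
    new_x = [xa + lam * (ta - xa) for xa, ta in zip(x, target)]
    # Only the entry of the played action is updated.
    new_u = list(u_hat)
    new_u[action] += nu * (payoff - u_hat[action])
    return new_x, new_u

# Hypothetical single step.
x_next, u_next = codipas_bg_step(
    x=[0.5, 0.5], u_hat=[0.0, 0.0], action=1, payoff=1.0,
    lam=0.1, nu=0.5, eps=0.2)
```

Because the new strategy is a convex combination of two points of the simplex (for a strategy learning rate in $[0, 1]$), it remains a valid mixed strategy.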
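5) Weakened Fictitious-Play $L_5$: Let $\beta^{F}_{i,t}$ be a point-to-set mapping (correspondence) on $\mathbb{R}^{|A_i|}$,

$$\beta^{F}_{i,t}(\hat{u}_{i,t}) := (1 - \epsilon_i)\, \delta_{\beta_i(\hat{u}_{i,t})} + \frac{\epsilon_i}{|A_i|} \mathbf{1},$$

where $\mathbf{1} \in \mathbb{R}^{|A_i|}$ is a vector with all its entries being 1; $\beta_i$ is the best-response correspondence, $\beta_i(\hat{u}_{i,t}) \in \arg\max_{a_i \in A_i} \hat{u}_{i,t}(a_i)$; and $\delta_Z$, $Z \subseteq A_i$, denotes a set of unit vectors $\{e_a,\ a \in Z\}$. The learning pattern $L_5$ is given by

$$\begin{aligned} x_{i,t+1}(a_i) &\in x_{i,t}(a_i) + \lambda_{i,t}\, \mathbb{1}_{\{i \in B^{\mathcal{N}}(t)\}} \left( \beta^{F}_{i,t}(\hat{u}_{i,t})(a_i) - x_{i,t}(a_i) \right), \\ \hat{u}_{i,t+1}(a_i) &= \hat{u}_{i,t}(a_i) + \nu_{i,t}\, \mathbb{1}_{\{a_{i,t} = a_i,\ i \in B^{\mathcal{N}}(t)\}} \left( u_{i,t} - \hat{u}_{i,t}(a_i) \right). \end{aligned}$$

The weakened fictitious play $L_5$ has been discussed in [] [4]. Different from the classical fictitious play, a player does not observe the action played by the other player at the previous step, and the utility function is unknown. Each player estimates its payoff by updating $\hat{u}_{i,t}$ using perceived payoffs. The strategy update equation is composed of two parts. A player chooses one of his optimal actions with probability $(1 - \epsilon_i)$ by optimizing the up-to-date payoff estimation $\hat{u}_{i,t}$ and, with probability $\epsilon_i$, plays an arbitrary action chosen with equal probability.

IV. MAIN RESULTS

A. Stochastic approximation of the pure learning schemes

The pure learning schemes introduced in Section III share the same learning structure for average utility but differ in their strategy learning. Denote by $\Pi^{(l)}_{i,t}$ the strategy learning function for $l \in \mathcal{L}$ in the general form (1). Following the multiple time-scale stochastic approximation framework developed in [3], [5], [11], [13], one can write the pure learning schemes in the form

$$\begin{cases} x_{i,t+1} - x_{i,t} \in q_{i,t} \left( f_i^{(l)}(x_{i,t}, \hat{u}_{i,t}) + M^{(l)}_{i,t+1} \right), \\ \hat{u}_{i,t+1} - \hat{u}_{i,t} \in \bar{q}_{i,t} \left( \mathbb{E}_{s, x_{-i,t}, B} U_i - \hat{u}_{i,t} + \bar{M}_{i,t+1} \right), \end{cases}$$

where $f_i^{(l)} = \mathbb{E}[\Pi^{(l)}_{i,t+1} \mid \mathcal{F}_t]$, $l \in \mathcal{L}$, is a learning pattern in the form of stochastic approximation; $q_{i,t}$ is a time-scaling factor, which is a function of the learning rates $\lambda_t$ and the probability of $P_i$ being in active mode at time $t$, denoted by $\mathbb{P}(i \in B^{\mathcal{N}}(t))$; and $\bar{q}_{i,t}$ is the time-scaling factor for $\hat{u}_{i,t}$.

To use the ODE approximation, we first check the assumptions given in the Appendix. The term $M^{(l)}_{i,t+1}$ is a bounded martingale difference because the strategies are in the product of simplices, which are convex and compact, and the conditional expectation of $M_{t+1}$ given the sigma-algebra generated by the random variables $(s_{t'}, x_{t'}, u_{t'}, \hat{u}_{t'})$, $t' \le t$, is zero. Similar properties hold for $\bar{M}_{t+1}$. The function $f$ is a regular function and hence Lipschitz over a compact set, which implies linear growth. Note that the case of constant learning rates can be analyzed under the same setting, but the convergence result is weaker.

Thus the asymptotic pseudo-trajectories for the non-vanishing time-scale ratio, i.e., $\lambda_{i,t}/\nu_{i,t} \to \gamma_i$ for some $\gamma_i \in \mathbb{R}_{++}$, are

$$\begin{cases} \frac{d}{dt} x_{i,t} \in g_{i,t}\, f_i^{(l)}(x_{i,t}, \hat{u}_{i,t}), \\ \frac{d}{dt} \hat{u}_{i,t} = \bar{g}_{i,t} \left( \mathbb{E}_{s, x_{-i,t}, B} U_i - \hat{u}_{i,t} \right), \end{cases}$$

where $g_{i,t}$ (resp. $\bar{g}_{i,t}$) are the asymptotic functions of $q_{i,t}/(\lambda_t p_i)$ (resp. $\bar{q}_{i,t}/(\nu_t p_i)$). If the learning rates have the vanishing ratio, i.e., $\lambda_t/\nu_t \to 0$, the asymptotic pseudo-trajectories are

$$\begin{cases} \frac{d}{dt} x_{i,t} \in g_{i,t}\, f_i^{(l)}\left(x_{i,t}, \mathbb{E}_{s, x_{-i,t}, B} U_i\right), \\ \hat{u}_{i,t} \to \mathbb{E}_{s, x_{-i}, B} U_i. \end{cases}$$

B. Stochastic approximation of the hybrid learning scheme

Players can choose different patterns during different time slots. Consider the hybrid and switching learning

$$\begin{cases} x_{i,t+1} - x_{i,t} \in q_{i,t} \sum_{l \in \mathcal{L}} \mathbb{1}_{\{l_{i,t} = l\}} \left( f_i^{(l)}(x_{i,t}, \hat{u}_{i,t}) + M^{(l)}_{i,t+1} \right), \\ \hat{u}_{i,t+1} - \hat{u}_{i,t} \in \bar{q}_{i,t} \left( \mathbb{E}_{s, x_{-i,t}, B} U_i - \hat{u}_{i,t} + \bar{M}_{i,t+1} \right), \end{cases}$$

where $l_{i,t} \in \mathcal{L}$ is the learning pattern chosen by $P_i$ at time $t$.

TABLE I
ASYMPTOTIC PSEUDO-TRAJECTORIES OF PURE LEARNING

Learning pattern    Class of ODE
L_1                 Adjusted replicator dynamics
L_2                 Smooth best response dynamics
L_3                 Imitative BG dynamics
L_4                 [...]-scaled replicator dynamics
L_5                 Perturbed best response dynamics

Theorem 1: Assume that each player $P_i$, $i \in \mathcal{N}$, adopts one of the CODIPAS-RLs in $\mathcal{L}$ with probability $\omega_i = [\omega_{i,l}]_{l \in \mathcal{L}} \in \Delta(\mathcal{L})$ and the learning rates satisfy conditions (C1) and (C2). Then the asymptotic pseudo-trajectories of the hybrid and switching learning can be written in the form

$$\begin{cases} \frac{d}{dt} x_{i,t} \in g_{i,t} \sum_{l \in \mathcal{L}} \omega_{i,l}\, f_i^{(l)}(x_{i,t}, \hat{u}_{i,t}), \\ \frac{d}{dt} \hat{u}_{i,t} = \bar{g}_{i,t} \left( \mathbb{E}_{s, x_{-i,t}, B} U_i - \hat{u}_{i,t} \right), \end{cases}$$

for the non-vanishing time-scale learning ratio $\lambda_{i,t}/\nu_{i,t} \to \gamma_i$; and

$$\begin{cases} \frac{d}{dt} x_{i,t} \in g_{i,t} \sum_{l \in \mathcal{L}} \omega_{i,l}\, f_i^{(l)}\left(x_{i,t}, \mathbb{E}_{s, x_{-i,t}, B} U_i\right), \\ \hat{u}_{i,t} \to \mathbb{E}_{s, x_{-i}, B} U_i, \end{cases}$$

for the vanishing learning ratio $\lambda_{i,t}/\nu_{i,t} \to 0$.

In Table I, we give the asymptotic pseudo-trajectory of the pure learning when the rate of payoff learning is faster than the strategy learning. Let $U_j(x) := \mathbb{E}_{s,B} U_j(s, B, x)$, $j \in \mathcal{N}$. In Table I, the replicator dynamics are given by

$$\dot{x}_j(a_j) = q_j\, x_j(a_j) \left[ U_j(e_{a_j}, x_{-j}) - \sum_{a_j' \in A_j} U_j(e_{a_j'}, x_{-j})\, x_j(a_j') \right].$$

The smooth best response dynamics are given by

$$\dot{x}_j(a_j) = q_j \left( \frac{e^{\frac{1}{\epsilon} U_j(e_{a_j}, x_{-j})}}{\sum_{a_j'} e^{\frac{1}{\epsilon} U_j(e_{a_j'}, x_{-j})}} - x_j(a_j) \right).$$

The imitative Boltzmann-Gibbs dynamics are given by

$$\dot{x}_j(a_j) = q_j \left( \frac{x_j(a_j)\, e^{\frac{1}{\epsilon} U_j(e_{a_j}, x_{-j})}}{\sum_{a_j'} x_j(a_j')\, e^{\frac{1}{\epsilon} U_j(e_{a_j'}, x_{-j})}} - x_j(a_j) \right).$$

The best response dynamics are given by $\dot{x}_j \in q_j \left( \beta_j(x_{-j}) - x_j \right)$, and the payoff dynamics are

$$\frac{d}{dt} \hat{u}_j(a_j) = \bar{q}_j\, x_j(a_j) \left( U_j(e_{a_j}, x_{-j}) - \hat{u}_j(a_j) \right).$$

C. Connection with equilibria of the expected game

We study the convergence properties of the dynamics and their connection with the state-independent Nash equilibrium for three special classes of games.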
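To make the replicator dynamics listed above concrete, the following minimal Python sketch integrates them with a forward Euler scheme for a two-player, two-action expected game. The payoff matrix (a simple coordination game), the step size, the initial strategies, and the constant time-scaling factor `q` are hypothetical choices for illustration; they are not taken from the paper.

```python
def expected_payoffs(U_own, y):
    """U_j(e_a, x_{-j}) for each own action a, given opponent strategy y.

    U_own[a][b] is the expected payoff for own action a against
    opponent action b.
    """
    return [sum(row[b] * y[b] for b in range(len(y))) for row in U_own]

def replicator_step(x, payoffs, dt, q=1.0):
    """Euler step of x_dot(a) = q * x(a) * (payoff(a) - average payoff)."""
    avg = sum(xa * pa for xa, pa in zip(x, payoffs))
    return [xa + dt * q * xa * (pa - avg) for xa, pa in zip(x, payoffs)]

# Hypothetical symmetric coordination game (own action = row).
U = [[2.0, 0.0],
     [0.0, 1.0]]
x1, x2 = [0.6, 0.4], [0.6, 0.4]
for _ in range(2000):  # integrate up to time 20 with dt = 0.01
    p1 = expected_payoffs(U, x2)
    p2 = expected_payoffs(U, x1)
    x1, x2 = replicator_step(x1, p1, 0.01), replicator_step(x2, p2, 0.01)
```

In this particular game both players start in the basin of the first pure equilibrium, so the trajectories approach it; other games, such as the security game studied later, can instead exhibit persistent oscillations under replicator dynamics.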
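1) Games with two actions: For two-player games with two actions, i.e., $A_1 = \{a_1^1, a_1^2\}$, $A_2 = \{a_2^1, a_2^2\}$, one can transform the system of ODEs of the strategy learning into a planar system of the form

$$\dot{\alpha}_1 = Q_1(\alpha_1, \alpha_2), \qquad \dot{\alpha}_2 = Q_2(\alpha_1, \alpha_2),$$

where we let $\alpha_i = x_i(a_i^1)$. The dynamics for $P_i$ can be expressed in terms of $\alpha_1$, $\alpha_2$ only, since $x_1(a_1^2) = 1 - x_1(a_1^1)$ and $x_2(a_2^2) = 1 - x_2(a_2^1)$. We use the Poincaré-Bendixson theorem and the Dulac criterion [10] to establish a convergence result for this planar system.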

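Theorem 2 ([10]): Consider an autonomous planar vector field $\dot{\alpha}_1 = Q_1(\alpha_1, \alpha_2)$, $\dot{\alpha}_2 = Q_2(\alpha_1, \alpha_2)$. Let $\gamma(\cdot)$ be a scalar function defined on the unit square $[0, 1]^2$. If

$$\frac{\partial [\gamma(\alpha)\, \dot{\alpha}_1]}{\partial \alpha_1} + \frac{\partial [\gamma(\alpha)\, \dot{\alpha}_2]}{\partial \alpha_2}$$

is not identically zero and does not change sign in $[0, 1]^2$, then there are no cycles lying entirely in $[0, 1]^2$.

Corollary 1: Consider a two-player two-action game. Assume that each player adopts the Boltzmann-Gibbs CODIPAS-RL with $\lambda_{1,t}/\nu_{1,t} = \lambda_{2,t}/\nu_{2,t}$. Then the asymptotic pseudo-trajectory reduces to a planar system of the form

$$\dot{\alpha}_1 = \beta_{1,\epsilon}\left(u_1(e_{a_1}, \alpha_2)\right) - \alpha_1; \qquad \dot{\alpha}_2 = \beta_{2,\epsilon}\left(u_2(\alpha_1, e_{a_2})\right) - \alpha_2.$$

Moreover, the system satisfies the conditions of Theorem 2 (known as Dulac's criterion).

Note that for the replicator dynamics, the Dulac criterion reduces to

$$(1 - 2\alpha_1)\left( U_1(e_{a_1^1}, \alpha_2) - U_1(e_{a_1^2}, \alpha_2) \right) + (1 - 2\alpha_2)\left( U_2(\alpha_1, e_{a_2^1}) - U_2(\alpha_1, e_{a_2^2}) \right),$$

which vanishes for $(\alpha_1, \alpha_2) = (1/2, 1/2)$. It is possible to have limit cycles in replicator dynamics, and hence the Dulac criterion does not apply. However, the stability of the replicator dynamics can be directly studied in the two-action case by identifying the game as belonging to one of the types: coordination, anticoordination, prisoner's dilemma, hawk-and-dove. The following corollary now follows from Theorem 2.

Corollary 2:

(CR1) Heterogeneous learning: If $P_1$ is with Boltzmann-Gibbs CODIPAS-RL and $P_2$'s learning leads to replicator dynamics, then the convergence condition reduces to

$$(1 - 2\alpha_2)\left( u_2(\alpha_1, e_{a_2^1}) - u_2(\alpha_1, e_{a_2^2}) \right) < 1 \quad \text{for all } (\alpha_1, \alpha_2).$$

(CR2) Hybrid learning: If each player $P_i$ uses a hybrid learning obtained by combining Boltzmann-Gibbs CODIPAS-RL with weight $\omega_{i,1}$ and the replicator dynamics with weight $\omega_{i,2}$, then the Dulac criterion reduces to

$$\omega_{1,2}\left[ (1 - 2\alpha_1)\left( u_1(e_{a_1^1}, \alpha_2) - u_1(e_{a_1^2}, \alpha_2) \right) \right] + \omega_{2,2}\left[ (1 - 2\alpha_2)\left( u_2(\alpha_1, e_{a_2^1}) - u_2(\alpha_1, e_{a_2^2}) \right) \right] < \omega_{1,1} + \omega_{2,1}$$

for all $(\alpha_1, \alpha_2)$.

Remark 2 (Symmetric games with three actions): If the expected game is a symmetric game with three actions per player, then the symmetric game dynamics reduce to a two-dimensional dynamical system. This allows us to apply the Dulac criterion.

2) Lyapunov games: We say that the game $\Xi$ is a Lyapunov game under the hybrid dynamics if the resulting dynamics has a Lyapunov function.

Theorem 3: Consider a Lyapunov game under the learning schemes $L_1$, $L_4$.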
Then the learning procedure has global convergence to the set of equilibria of the expected robust game for all interior initial conditions.

Note that this result holds also for $n$-player stochastic games with random updates. We say that the stochastic game $\Xi$ is an expected robust potential game if the expected payoff derives from a potential function. Potential games constitute a special class of games where the payoff functions of the players are governed by a potential function $\Phi : \mathbb{R}^{\sum_{i \in \mathcal{N}} |A_i|} \to \mathbb{R}$, i.e.,

$$U_i(e_{a_i}, x_{-i}) = \frac{\partial \Phi(x)}{\partial x_i(a_i)}, \qquad \forall i \in \mathcal{N},\ a_i \in A_i.$$

We use a Lyapunov approach to show the global convergence of hybrid learning for potential games.

Lemma 2: Assume that the stochastic NZSG $\Xi$ has a potential function $\Phi$. Then there exists a Lyapunov function $V^R(x_1, x_2) : \mathbb{R}^{|A_1| + |A_2|} \to \mathbb{R}$ for the replicator dynamics associated with the learning schemes $L_1$, $L_4$, and it is given by the potential, $V^R = \Phi$. Hence the replicator dynamics converge to a rest point. In addition, if starting from an interior point of the simplex, the dynamics converge to the Nash equilibrium of the game $\Xi$.

Lemma 3: Let $V^B(x_1, x_2) : \mathbb{R}^{|A_1| + |A_2|} \to \mathbb{R}$ be a Lyapunov function for the dynamics associated with learning pattern $L_l$, for $l$ [...], such that

$$V^B(x_1, x_2) = \Phi(x_1, x_2) + \epsilon_1 H_1(x_1) + \epsilon_2 H_2(x_2),$$

where $H_i : \mathbb{R}^{|A_i|} \to \mathbb{R}$ are strictly concave perturbation functions, which can take different forms depending on the pure learning scheme $l$. The ODEs corresponding to the learning schemes converge to a set of perturbed equilibria of the game $\Xi$.

Theorem 4: Assume that the stochastic NZSG $\Xi$ has a potential function $\Phi$. The hybrid learning with $L_1$ and $L_2$ converges locally to a perturbed state-independent Nash equilibrium $(x_1^*, x_2^*)$ of the potential game $\Xi$ for sufficiently small $\epsilon$.

The proofs of Lemmas 2, 3 and Theorem 4 can be found in the internal technical report [17].

V. SECURITY APPLICATION

In this section, we use the learning algorithm to study a two-person security game in a network between an intruder and an administrator. An administrator $P_1$ can use different levels of protection. The intruder $P_2$ can launch an attack that can be of high or low intensity. Let the action sets for $P_1$ and $P_2$ be $A_1 := \{H, L\}$ and $A_2 := \{S, W\}$, respectively.
The network administrator is assumed to be always on alert, while the intruder attacks with a probability $p_2$. Hence the set $B^{\mathcal{N}}(t)$ can be of two types, i.e., (C1) $\{P_1, P_2\}$ or (C2) $\{P_1\}$. The former case (C1) corresponds to the scenario where the intruder and the administrator attack and defend, respectively, whereas the latter (C2) suggests that the administrator faces no threat. We represent the payoffs under these two scenarios by $M_1$ and $M_2$, respectively, with rows indexed by $P_1$'s actions H, L and the columns of $M_1$ indexed by $P_2$'s actions S, W:

$$M_1 := [\ldots], \qquad M_2 := [\ldots].$$

In (C1), a successful defense against attack yields a payoff of [...] for $P_1$, while a failure results in a payoff of -[...]. A successful attack yields $P_2$ a payoff of [...], while a failed attack yields a zero payoff. The employment of strong defense (H) or strong attack (S) costs an extra unit of effort as compared to the low defense (L) and the weak attack (W) for $P_1$ and $P_2$, respectively. In (C2), $P_1$ stays secure without the threat from the intruder, hence yields a payoff of [...]. However, the high security level costs an extra unit of energy from the player. The payoffs in $M_1$ and $M_2$ are subject to exogenous noise, which depends on the environmental state $s$. The state-independent equilibrium of the game is found to be at $x_1^* = [\ldots]^T$, $x_2^* = [\ldots]^T$, and the optimal average payoffs are $\hat{u}_1^* = [\ldots]^T$, $\hat{u}_2^* = [\ldots]^T$.

In Figures 1 and 2, we show the payoffs and mixed strategies of both players when both players use the learning pattern $L_1$. We can see that the replicator dynamics from $L_1$ do not converge. However, the time-average strategies $\lim_{T \to \infty} \frac{1}{T} \int_0^T x_{i,t}\, dt$ converge to $x_1^*$, $x_2^*$, respectively, and the time-average payoffs $\lim_{T \to \infty} \frac{1}{T} \int_0^T \hat{u}_{i,t}\, dt$ converge to $\hat{u}_1^*$, $\hat{u}_2^*$, respectively.

Fig. 1. The payoffs to the players with both players using $L_1$.
Fig. 2. The mixed strategies of the players with both players using $L_1$.
Fig. 3. The payoffs to the players with both players using $L_2$.
Fig. 4. The mixed strategies of the players with both players using $L_2$.
Fig. 5. The payoffs to the heterogeneous players, with $P_1$ and $P_2$ using different schemes from $\{L_1, L_2\}$.
Fig. 6. The mixed strategies of the heterogeneous players, with $P_1$ and $P_2$ using different schemes from $\{L_1, L_2\}$.

In Figures 3 and 4, we show the payoffs and mixed strategies of the players when they both adopt the learning pattern $L_2$. We choose $\epsilon = [\ldots]$ and observe that the mixed strategies converge to $x_1 = [\ldots]^T$, $x_2 = [\ldots]^T$ and the payoffs converge to $\hat{u}_1 = [\ldots]^T$, $\hat{u}_2 = [\ldots]^T$, which are in the close neighborhood of $\hat{u}_1^*$, $\hat{u}_2^*$. In Figures 5 and 6, we show the convergence of the heterogeneous learning scheme, where $P_1$ and $P_2$ use different schemes, one $L_1$ and the other $L_2$. With $\epsilon = [\ldots]$, we find the converging strategies at $x_1$, $x_2$ and the payoffs at $\hat{u}_1$, $\hat{u}_2$ [...]. In Figures 7 and 8, we show the convergence of the hybrid learning scheme, where $P_1$ and $P_2$ adopt $L_1$ and $L_2$ with equal weights.
The strategies converge to $[\ldots]^T$, $[\ldots]^T$ for $P_1$ and $P_2$, respectively, whereas the payoffs converge to $[\ldots]^T$, $[\ldots]^T$ for $P_1$ and $P_2$, respectively.

Fig. 7. The payoffs to the players with both players using the hybrid learning scheme with equal weights on $L_1$ and $L_2$.
Fig. 8. The mixed strategies of the players with both players using the hybrid learning scheme with equal weights on $L_1$ and $L_2$.

VI. CONCLUSIONS AND FUTURE WORKS

We have presented distributed strategic learning algorithms for two-person nonzero-sum stochastic games, along with their general convergence or non-convergence properties. Interesting work that we leave for the future is to extend this learning framework to an arbitrary number of players, each of them adopting hybrid learning with a diffusion term, leading to stochastic differential equations. Another extension will be to more general stochastic games where the state evolution depends on the actions used by the players and their states. This situation is more complicated because the noises are correlated and depend on states and actions, and the convergence issue in that case is a very challenging open problem.

REFERENCES

[1] T. Alpcan and T. Başar. Network Security: A Decision and Game Theoretic Approach. Cambridge University Press.
[2] W. B. Arthur. On designing economic agents that behave like human agents. J. Evolutionary Econ., 3, 1993.
[3] M. Benaïm and M. Faure. Stochastic approximations, cooperative dynamics and supermodular games. Preprint.
[4] T. Borgers and R. Sarin. Learning through reinforcement and replicator dynamics. Mimeo, University College London.
[5] V. S. Borkar. Stochastic approximation: a dynamical systems viewpoint. 2008.
[6] R. Bush and F. Mosteller. Stochastic Models of Learning.
[7] R. W. Ferrero, S. M. Shahidehpour, and V. C. Ramesh. Transaction analysis in deregulated power systems using game theory. Power Systems, IEEE Transactions on, Aug.
[8] P. Frihauf, M. Krstic, and T. Başar. Nash equilibrium seeking with infinitely-many players. In Proceedings of American Control Conference (ACC).
[9] D. Fudenberg and D. Levine. Learning in Games.
[10] J. Guckenheimer and P. Holmes. Nonlinear Oscillations, Dynamical Systems, and Bifurcations of Vector Fields.
[11] H. Kushner. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics.
[12] D. Leslie and E. Collins. Individual Q-learning in normal form games. SIAM J. Control Optim., 44.
[13] D. S. Leslie and E. J. Collins. Convergent multiple timescales reinforcement learning algorithms in normal form games. The Annals of Applied Probability, 13(4).
[14] J. R. Marden, G. Arslan, and J. S. Shamma. Joint strategy fictitious play with inertia for potential games. In Proc. 44th IEEE Conf. Decision Control, Dec. 2005.
[15] J. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1).
[16] J. S. Shamma and G. Arslan. Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria. IEEE Trans. Automatic Control, 50(3):312-327, March 2005.
[17] Q. Zhu, T. Hamidou, and T. Başar. Distributed strategic learning with application to network security. Internal Technical Report, CSL, UIUC.
[18] Q. Zhu, H. Tembine, and T. Başar. Heterogeneous learning in zero-sum stochastic games with incomplete information. In 49th IEEE Conf. on Decision and Control, Atlanta, GA, USA.