One Practical Algorithm for Both Stochastic and Adversarial Bandits

Yevgeny Seldin, Queensland University of Technology, Brisbane, Australia
Aleksandrs Slivkins, Microsoft Research, New York NY, USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We present an algorithm for multiarmed bandits that achieves almost optimal performance in both stochastic and adversarial regimes without prior knowledge about the nature of the environment. Our algorithm is based on augmentation of the EXP3 algorithm with a new control lever in the form of exploration parameters that are tailored individually for each arm. The algorithm simultaneously applies the "old" control lever, the learning rate, to control the regret in the adversarial regime and the new control lever to detect and exploit gaps between the arm losses. This secures problem-dependent "logarithmic" regret when gaps are present without compromising on the worst-case performance guarantee in the adversarial regime. We show that the algorithm can exploit both the usual expected gaps between the arm losses in the stochastic regime and deterministic gaps between the arm losses in the adversarial regime. The algorithm retains a "logarithmic" regret guarantee in the stochastic regime even when some observations are contaminated by an adversary, as long as on average the contamination does not reduce the gap by more than a half. Our results for the stochastic regime are supported by experimental validation.

1. Introduction

Stochastic multiarmed bandits (Thompson, 1933; Robbins, 1952; Lai & Robbins, 1985; Auer et al., 2002a) and adversarial multiarmed bandits (Auer et al., 1995; 2002b) have co-existed in parallel for almost two decades by now, in the sense that no algorithm for stochastic multiarmed bandits is applicable to adversarial multiarmed bandits and algorithms for adversarial bandits are unable to exploit the simpler regime of stochastic bandits. The recent attempt of Bubeck & Slivkins (2012) to bring them together did not make it in the full sense of unification, since the algorithm of Bubeck and Slivkins relies on the knowledge of the time horizon and makes a one-time irreversible switch between stochastic and adversarial operation modes if the beginning of the game is estimated to exhibit adversarial behavior.

We present an algorithm that treats both stochastic and adversarial multiarmed bandit problems without distinguishing between them. Our algorithm just runs, as most other bandit algorithms, without knowledge of the time horizon and without making any hard statements about the nature of the environment. We show that if the environment happens to be adversarial the performance of the algorithm is just a factor of 2 worse than the performance of the EXP3 algorithm with the best constants, as described in Bubeck & Cesa-Bianchi (2012), and if the environment happens to be stochastic the performance of our algorithm is comparable to the performance of UCB1 of Auer et al. (2002a). Thus, we cover the full range and achieve almost optimal performance at the extreme points.

Furthermore, we show that the new algorithm can exploit both the usual expected gaps between the arm losses in the stochastic regime and deterministic gaps between the arm losses in the adversarial regime. We also show that the algorithm retains a "logarithmic" regret guarantee in the stochastic regime even when some observations are adversarially contaminated, as long as on average the contamination does not reduce the gap by more than a half. To the best of our knowledge, no other algorithm has yet been shown to be able to exploit gaps in the adversarial or adversarially contaminated stochastic regimes. The contaminated stochastic regime is a very practical model, since in many real-life situations we are dealing with stochastic environments with occasional disturbances.

Since the introduction of Thompson's sampling (Thompson, 1933), which was analyzed only after 80 years (Kaufmann et al., 2012; Agrawal & Goyal, 2013), a variety of algorithms were invented for the stochastic multiarmed bandit problem. The most powerful for today are KL-UCB (Cappé et al., 2013), EwS (Maillard, 2011), and the aforementioned Thompson's sampling. It is easy to show that any deterministic algorithm can potentially suffer linear regret in the adversarial regime (see the supplementary material for a proof). Although nothing is known about the performance of randomized algorithms for stochastic bandits in the adversarial regime, empirically they are extremely sensitive to deviations from the stochastic assumption.

In the adversarial world the most powerful algorithm for today is INF (Audibert & Bubeck, 2009; Bubeck & Cesa-Bianchi, 2012). Nevertheless, the EXP3 algorithm of Auer et al. (2002b) still retains an important place, mainly due to its simplicity and wide applicability, which covers combinatorial bandits, partial monitoring games, and many other adversarial problems. Since any stochastic problem can be seen as an instance of an adversarial problem, both INF and EXP3 have the worst-case "root-$t$" regret guarantee in the stochastic regime, but it is not known whether they can do better. Empirically, in the stochastic regime EXP3 is inferior to all other known algorithms for this setting, including the simplest UCB1 algorithm.

It is interesting to take a brief look into the development of EXP3. The algorithm was first suggested in Auer et al. (1995) and its parametrization and analysis were improved in Auer et al. (2002b). The EXP3 of Auer et al. was designed for the multiarmed bandit game with rewards and its playing strategy is based on mixing a Gibbs distribution (also known as exponential weights) with a uniform exploration distribution in proportion to the learning rate. The uniform exploration leaves no hope for achieving "logarithmic" regret in the stochastic regime simultaneously with the "root-$t$" regret in the adversarial regime, since each arm is played at least $\Omega(\sqrt{t})$ times in $t$ rounds of the game. By changing the learning rate Cesa-Bianchi & Fischer (1998) managed to derive a different parametrization of the algorithm that was shown to achieve logarithmic regret in the stochastic regime, but it had no regret guarantees in the adversarial regime.

Stoltz (2005) has observed that in the game with losses the "root-$t$" regret guarantee in the adversarial regime can be achieved without mixing in the uniform distribution (and this even leads to better constants). However, mixing in any distribution that element-wise does not exceed the learning rate does not break the worst-case performance of the algorithm in the game with losses. We exploit this emerged freedom in order to derive a modification of the EXP3 algorithm that achieves almost optimal regret in both adversarial and stochastic regimes without prior knowledge about the nature of the environment. (Rewards can be transformed into losses by taking $\ell = 1 - r$.)

2. Problem Setting

We study the multiarmed bandit (MAB) game with losses. In each round $t$ of the game the algorithm chooses one action $A_t$ among $K$ possible actions, a.k.a. arms, and observes the corresponding loss $\ell_t^{A_t}$. The losses of other arms are not observed. There is a large number of loss generation models, four of which are considered below. In this work we restrict ourselves to loss sequences $\{\ell_t^a\}_{t,a}$ that are generated independently of the algorithm's actions. Under this assumption we can assume that the loss sequences are written down before the game starts (but not revealed to the algorithm). We also make a standard assumption that the losses are bounded in the $[0,1]$ interval.

The performance of the algorithm is quantified by regret, defined as the difference between the expected loss of the algorithm up to round $t$ and the expected loss of the best arm up to round $t$:

$$R(t) = \mathbb{E}\left[\sum_{s=1}^{t} \ell_s^{A_s}\right] - \min_a \left\{ \mathbb{E}\left[\sum_{s=1}^{t} \ell_s^{a}\right] \right\}.$$

The expectation is taken over the possible randomness of the algorithm and the loss generation model. The goal of the algorithm is to minimize the regret.

We consider two standard loss generation models, the adversarial regime and the stochastic regime, and two intermediate regimes, the contaminated stochastic regime and the adversarial regime with a gap.

Adversarial regime. In this regime the loss sequences are generated by an unrestricted adversary (who is oblivious to the algorithm's actions). This is the most general setting and the other three regimes can be seen as special cases of the adversarial regime. An arm $a \in \arg\min_{a'} \sum_{s=1}^{t} \ell_s^{a'}$ is known as a best arm in hindsight for the first $t$ rounds.

Stochastic regime.
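In this regime the losses $\ell_t^a$ are sampled independently from an unknown distribution that depends on $a$, but not on $t$. We use $\mu(a) = \mathbb{E}[\ell_t^a]$ to denote the expected loss of arm $a$. Arm $a$ is called a best arm if $\mu(a) = \min_{a'} \{\mu(a')\}$ and suboptimal otherwise; let $a^*$ denote some best arm. For each arm $a$, define the gap $\Delta(a) = \mu(a) - \mu(a^*)$. Let $\Delta = \min_{a : \Delta(a) > 0} \{\Delta(a)\}$ denote the minimal gap. Letting $N_t(a)$ be the number of times arm $a$ was played up to (and including) round $t$, the regret can be rewritten as

$$R(t) = \sum_a \mathbb{E}[N_t(a)]\, \Delta(a).$$

Contaminated stochastic regime. In this regime the adversary picks some round-arm pairs $(t, a)$ ("locations") before the game starts and assigns the loss values there in an arbitrary way. The remaining losses are generated according to the stochastic regime. We call a contaminated stochastic regime moderately contaminated after $\tau$ rounds if for all $t \ge \tau$ the total number of contaminated locations of each suboptimal arm up to time $t$ is at most $t\Delta(a)/4$ and the number of contaminated locations of each best arm is at most $t\Delta/4$. By this definition, for all $t \ge \tau$, on average over the stochasticity of the loss sequences the adversary can reduce the gap of every arm by at most a half.

Adversarial regime with a gap. An adversarial regime is named by us an adversarial regime with a gap if there exists a round $\tau$ and an arm $a^*_\tau$ that persists to be the best arm in hindsight for all rounds $t \ge \tau$. We name such an arm a consistently best arm after round $\tau$. If no such arm exists then $a^*_\tau$ is undefined. Note that if $a^*_\tau$ is defined for some $\tau$ then $a^*_{\tau'}$ is defined for all $\tau' > \tau$. We use $\lambda_t(a) = \sum_{s=1}^{t} \ell_s^a$ to denote the cumulative loss of arm $a$. Whenever $a^*_\tau$ is defined we define a deterministic gap of arm $a$ on round $\tau$ as:

$$\Delta(\tau, a) = \min_{t \ge \tau} \left\{ \frac{1}{t} \left( \lambda_t(a) - \lambda_t(a^*_\tau) \right) \right\}.$$

If $a^*_\tau$ is undefined, $\Delta(\tau, a)$ is defined as zero.

Notation. We use $\mathbb{1}\{E\}$ to denote the indicator function of event $E$ and $\mathbb{1}_t^a = \mathbb{1}\{A_t = a\}$ to denote the indicator function of the event that arm $a$ was played on round $t$.

3. Main Results

Our main results include a new algorithm, which we name EXP3++, and its analysis in the four regimes defined in the previous section. The EXP3++ algorithm, provided in the Algorithm 1 box, is a generalization of the EXP3 algorithm with losses.

Algorithm 1. Algorithm EXP3++.
Remark: See text for the definition of $\eta_t$ and $\xi_t(a)$.
  $\forall a$: $\tilde{L}_0(a) = 0$.
  for $t = 1, 2, \ldots$ do
    $\beta_t = \frac{1}{2}\sqrt{\frac{\ln K}{tK}}$.
    $\forall a$: $\varepsilon_t(a) = \min\left\{\frac{1}{2K}, \beta_t, \xi_t(a)\right\}$.
    $\forall a$: $\rho_t(a) = e^{-\eta_t \tilde{L}_{t-1}(a)} / \sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}$.
    $\forall a$: $\tilde{\rho}_t(a) = \left(1 - \sum_{a'} \varepsilon_t(a')\right) \rho_t(a) + \varepsilon_t(a)$.
    Draw action $A_t$ according to $\tilde{\rho}_t$ and play it.
    Observe and suffer the loss $\ell_t^{A_t}$.
    $\forall a$: $\tilde{\ell}_t^a = \ell_t^{A_t} \mathbb{1}_t^a / \tilde{\rho}_t(a)$.
    $\forall a$: $\tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a$.
  end for

The EXP3++ algorithm has two control levers: the learning rate $\eta_t$ and the exploration parameters $\xi_t(a)$. The EXP3 with losses (as described in Bubeck & Cesa-Bianchi (2012)) is a special case of EXP3++ with $\eta_t = 2\beta_t$ and $\xi_t(a) = 0$. The crucial innovation in EXP3++ is the introduction of the exploration parameters $\xi_t(a)$, which are tuned individually for each arm depending on the past observations.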
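
To make one round of the algorithm concrete, the following is a minimal Python sketch of the sampling distribution and the importance-weighted update. It assumes $\beta_t = \frac{1}{2}\sqrt{\ln K/(tK)}$ and $K \ge 2$; the function names `exp3pp_distribution` and `exp3pp_update` are hypothetical, and subtracting the minimum cumulative loss before exponentiating is a numerical-stability choice of this sketch, not part of the algorithm statement (it leaves $\rho_t$ unchanged).

```python
import math

def exp3pp_distribution(L_tilde, t, xi, eta=None):
    """Return (rho_tilde, eps) for round t of an EXP3++-style update.

    L_tilde: cumulative importance-weighted loss estimates after round t-1.
    xi:      non-negative exploration parameters, one per arm.
    eta:     learning rate; defaults to beta_t (the adversarial-safe choice).
    """
    K = len(L_tilde)
    beta = 0.5 * math.sqrt(math.log(K) / (t * K))
    eta = beta if eta is None else eta
    # Per-arm exploration rates, capped so that their sum is at most 1/2.
    eps = [min(1.0 / (2 * K), beta, x) for x in xi]
    # Gibbs distribution; shifting by the minimum avoids overflow and
    # does not change the normalised weights.
    L_min = min(L_tilde)
    weights = [math.exp(-eta * (L - L_min)) for L in L_tilde]
    total = sum(weights)
    rho = [w / total for w in weights]
    mix = 1.0 - sum(eps)
    rho_tilde = [mix * r + e for r, e in zip(rho, eps)]
    return rho_tilde, eps

def exp3pp_update(L_tilde, rho_tilde, arm, loss):
    """Add the importance-weighted loss estimate of the played arm."""
    updated = list(L_tilde)
    updated[arm] += loss / rho_tilde[arm]
    return updated
```

With all entries of `xi` set to zero the returned distribution is the plain Gibbs distribution, which is the EXP3-with-losses behavior described above (up to the choice of learning rate).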
In the sequel we show that tuning only the learning rate $\eta_t$ suffices to control the regret of EXP3++ in the adversarial regime, irrespective of the choice of the exploration parameters $\xi_t(a)$. Then we show that tuning only the exploration parameters $\xi_t(a)$ suffices to control the regret of EXP3++ in the stochastic regime irrespective of the choice of $\eta_t$, as long as $\eta_t \ge \beta_t$. Applying the two control levers simultaneously we obtain an algorithm that achieves the optimal "root-$t$" regret in the adversarial regime (up to logarithmic factors) and almost optimal "logarithmic" regret in the stochastic regime (though with a suboptimal power in the logarithm). Then we show that the new control lever is even more powerful and allows us to detect and exploit the gap in even more challenging situations, including the moderately contaminated stochastic regime and the adversarial regime with a gap.

Adversarial Regime

First, we show that tuning $\eta_t$ is sufficient to control the regret of EXP3++ in the adversarial regime.

Theorem 1. For $\eta_t = \beta_t$ and any $\xi_t(a) \ge 0$ the regret of EXP3++ for any $t$ satisfies:

$$R(t) \le 4\sqrt{Kt \ln K}.$$

Note that the regret bound in Theorem 1 is just a factor of 2 worse than the regret of EXP3 with losses (Bubeck & Cesa-Bianchi, 2012).

Stochastic Regime

Now we show that for any $\eta_t \ge \beta_t$ tuning the exploration parameters $\xi_t(a)$ suffices to control the regret of the algorithm in the stochastic regime. By choosing $\eta_t = \beta_t$ we obtain algorithms that have both the optimal "root-$t$" regret scaling in the adversarial regime and "logarithmic" regret scaling in the stochastic regime. We consider a number of different ways of tuning the exploration parameters $\xi_t(a)$, which lead to different parametrizations of EXP3++. We start with an idealistic assumption that the gap is known, just to give an idea of what is the best result we can hope for.

Theorem 2. Assume that the gaps $\Delta(a)$ are known. For any choice of $\eta_t \ge \beta_t$ and any $c \ge 18$, the regret of EXP3++ with $\xi_t(a) = \frac{c \ln(t\Delta(a)^2)}{t\Delta(a)^2}$ in the stochastic regime satisfies:

$$R(t) \le \sum_a O\left(\frac{\ln(t)^2}{\Delta(a)}\right) + \sum_a \tilde{O}\left(\frac{K}{\Delta(a)^3}\right).$$

The constants in this theorem are small and are provided explicitly in the analysis. We also show that $c$ can be made almost as small as 2.

Next we show that using the empirical gap as an estimate of the true gap,

$$\hat{\Delta}_t(a) = \min\left\{1, \frac{1}{t}\left(\tilde{L}_t(a) - \min_{a'} \tilde{L}_t(a')\right)\right\},$$

we can also achieve a polylogarithmic regret guarantee. We call this algorithm EXP3++$^{\mathrm{AVG}}$.

Theorem 3. Let $c \ge 18$ and $\eta_t \ge \beta_t$. Let $t^*$ be the minimal integer that satisfies $t^* \ge \frac{4c^2 K \ln(t^*)^4}{\ln K}$ and let $t^*(a) = \max\left\{t^*, e^{1/\Delta(a)^2}\right\}$. The regret of EXP3++ with $\xi_t(a) = \frac{c (\ln t)^2}{t \hat{\Delta}_t(a)^2}$ (termed EXP3++$^{\mathrm{AVG}}$) in the stochastic regime satisfies:

$$R(t) \le \sum_a O\left(\frac{\ln(t)^3}{\Delta(a)}\right) + \sum_a \Delta(a)\, t^*(a).$$

Although the additive constants in this theorem are very large, in the experimental section we show that a minor modification of this algorithm performs comparably to UCB1 in the stochastic regime (and has the adversarial regret guarantee in addition).

In the following theorem we show that if we assume a known time horizon $T$, then we can eliminate the additive term $e^{1/\Delta(a)^2}$ in the regret bound. The algorithm in Theorem 4 replaces the empirical gap estimate in the definition of $\xi_t(a)$ with a lower confidence bound on the gap and slightly adjusts other terms. We name this algorithm EXP3++$^{\mathrm{LCBT}}$.

Theorem 4. Consider the stochastic regime with a known time horizon $T$. The EXP3++$^{\mathrm{LCBT}}$ algorithm with any $\eta_t \ge \beta_t$ and appropriately defined $\xi_t(a)$ achieves regret $R(T) \le O(\log^3 T)$.

The precise definition of EXP3++$^{\mathrm{LCBT}}$ and the proof of Theorem 4 are provided in the supplementary material. It seems that simultaneous elimination of the assumption on the known time horizon and of the exponentially large additive term is a very challenging problem and we defer it for future work.

Contaminated Stochastic Regime

Next we show that EXP3++$^{\mathrm{AVG}}$ can sustain moderate contamination in the stochastic regime without a significant deterioration in performance.

Theorem 5. Under the parametrization given in Theorem 3, for $t^*(a) = \max\left\{t^*, e^{4/\Delta(a)^2}\right\}$, where $t^*$ is defined as before, the regret of EXP3++$^{\mathrm{AVG}}$ in the stochastic regime that is moderately contaminated after $\tau$ rounds satisfies:

$$R(t) \le \sum_a O\left(\frac{\ln(t)^3}{\Delta(a)}\right) + \sum_a \Delta(a) \max\{t^*(a), \tau\}.$$

The price that is paid for moderate contamination after $\tau$ rounds is the scaling of $\Delta(a)$ by a factor of $1/2$ and the additive factor of $\tau$. The scaling of $\Delta(a)$ affects the definition of $t^*(a)$ and the constant in $O\left(\ln(t)^3/\Delta(a)\right)$.
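As before, the regret guarantee of Theorem 5 comes in addition to the guarantee of Theorem 1.

Adversarial Regime with a Gap

Finally, we show that EXP3++$^{\mathrm{AVG}}$ can also take advantage of a deterministic gap in the adversarial regime.

Theorem 6. Under the parametrization given in Theorem 3, the regret of EXP3++$^{\mathrm{AVG}}$ in the adversarial regime satisfies:

$$R(t) \le \sum_a \min_\tau \left\{ \max\left\{t^*, \tau, e^{1/\Delta(\tau,a)^2}\right\} + O\left(\frac{\ln(t)^3}{\Delta(\tau,a)^2}\right) \right\}.$$

We remind the reader that in the absence of a consistently best arm $\Delta(\tau, a)$ is defined as zero and the regret bound is vacuous (but the regret bound of Theorem 1 still holds). We also note that $\Delta(\tau, a)$ is a non-decreasing function of $\tau$. Therefore, there is a trade-off: increasing $\tau$ increases $\Delta(\tau, a)$, but loses the regret guarantee on the rounds before $\tau$ (for simplicity, we assume that we have no guarantees before $\tau$). Theorem 6 allows us to pick the $\tau$ that minimizes this trade-off. An important implication of the theorem is that if the deterministic gap is growing with time the regret guarantee improves too.

4. Proofs

We prove the theorems from the previous section in the order they were presented.

The Adversarial Regime

The proof of Theorem 1 relies on the following lemma, which is an intermediate step in the analysis of EXP3 by Bubeck (2010) (see also Bubeck & Cesa-Bianchi (2012)).

Lemma 7. For any $K$ sequences of non-negative numbers $X_1^a, X_2^a, \ldots$ indexed by $a \in \{1, \ldots, K\}$ and any non-increasing positive sequence $\eta_1, \eta_2, \ldots$, for

$$\rho_t(a) = \frac{\exp\left(-\eta_t \sum_{s=1}^{t-1} X_s^a\right)}{\sum_{a'} \exp\left(-\eta_t \sum_{s=1}^{t-1} X_s^{a'}\right)}$$

(assuming for $t = 1$ the sum in the exponent is zero) we have:

$$\sum_{t=1}^{T} \sum_a \rho_t(a) X_t^a - \min_a \sum_{t=1}^{T} X_t^a \le \frac{1}{2} \sum_{t=1}^{T} \eta_t \sum_a \rho_t(a) (X_t^a)^2 + \frac{\ln K}{\eta_T}.$$

More precisely, we are using the following corollary, which follows by allowing the $X_t^a$-s to be random variables, taking expectations of the two sides of the inequality in Lemma 7, and using the fact that $\mathbb{E}[\min[\cdot]] \le \min[\mathbb{E}[\cdot]]$. We decompose expectations of incremental sums into incremental sums of conditional expectations and use $\mathbb{E}_t[\cdot]$ to denote expectations conditioned on the realization of all random variables up to round $t$.

Corollary 8. Let $X_1^a, X_2^a, \ldots$ for $a \in \{1, \ldots, K\}$ be non-negative random variables and let $\eta_t$ and $\rho_t$ be as defined in Lemma 7. Then:

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{E}_t\left[\sum_a \rho_t(a) X_t^a\right]\right] - \min_a \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{E}_t[X_t^a]\right] \le \frac{1}{2} \mathbb{E}\left[\sum_{t=1}^{T} \eta_t\, \mathbb{E}_t\left[\sum_a \rho_t(a) (X_t^a)^2\right]\right] + \frac{\ln K}{\eta_T}.$$

Proof of Theorem 1. We associate $X_t^a$ in Corollary 8 with $\tilde{\ell}_t^a$ in the EXP3++ algorithm. We have $\mathbb{E}_t[\tilde{\ell}_t^a] = \ell_t^a$ and since $\rho_t(a) = \frac{\tilde{\rho}_t(a) - \varepsilon_t(a)}{1 - \sum_{a'} \varepsilon_t(a')} \ge \tilde{\rho}_t(a) - \varepsilon_t(a)$ and $\ell_t^a \in [0,1]$ we also have:

$$\mathbb{E}_t\left[\sum_a \rho_t(a) \tilde{\ell}_t^a\right] \ge \mathbb{E}_t\left[\sum_a \left(\tilde{\rho}_t(a) - \varepsilon_t(a)\right) \tilde{\ell}_t^a\right] \ge \mathbb{E}_t\left[\ell_t^{A_t}\right] - \sum_a \varepsilon_t(a).$$

As well, we have:

$$\mathbb{E}_t\left[\sum_a \rho_t(a) (\tilde{\ell}_t^a)^2\right] = \mathbb{E}_t\left[\sum_a \rho_t(a) \left(\frac{\ell_t^{A_t} \mathbb{1}_t^a}{\tilde{\rho}_t(a)}\right)^2\right] \le \sum_a \frac{\rho_t(a)}{\tilde{\rho}_t(a)} = \sum_a \frac{\rho_t(a)}{\left(1 - \sum_{a'} \varepsilon_t(a')\right) \rho_t(a) + \varepsilon_t(a)} \le 2K,$$

where the last inequality follows by the fact that $1 - \sum_{a'} \varepsilon_t(a') \ge \frac{1}{2}$ by the definition of $\varepsilon_t(a)$. Substitution of the above calculations into Corollary 8 yields:

$$R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - \min_a \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^a\right] \le K \sum_{t=1}^{T} \eta_t + \frac{\ln K}{\eta_T} + \sum_{t=1}^{T} \sum_a \varepsilon_t(a) \le 2K \sum_{t=1}^{T} \eta_t + \frac{\ln K}{\eta_T}.$$

The result of the theorem follows by the choice of $\eta_t$.

The Stochastic Regime

Our proofs are based on the following form of Bernstein's inequality, which is a minor improvement over Cesa-Bianchi & Lugosi (2006, Lemma A.8) based on the ideas from Boucheron et al. (2013).

Theorem 9 (Bernstein's inequality for martingales). Let $X_1, \ldots, X_n$ be a martingale difference sequence with respect to a filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and let $S_i = \sum_{j=1}^{i} X_j$ be the associated martingale. Assume that there exist positive numbers $\nu$ and $c$, such that $X_j \le c$ for all $j$ with probability 1 and $\sum_{i=1}^{n} \mathbb{E}\left[(X_i)^2 \mid \mathcal{F}_{i-1}\right] \le \nu$ with probability 1. Then for all $b > 0$:

$$\mathbb{P}\left[S_n > \sqrt{2\nu b} + \frac{cb}{3}\right] \le e^{-b}.$$

We are also using the following technical lemma, which is proved in the supplementary material.

Lemma 10. For any $c > 0$:

$$\sum_{t=0}^{\infty} e^{-c\sqrt{t}} = O\left(\frac{1}{c^2}\right).$$

The proof of Theorems 2 and 3 is based on the following lemma.

Lemma 11. Let $\{\underline{\varepsilon}_t(a)\}_{t=1}^{\infty}$ be non-increasing deterministic sequences, such that $\underline{\varepsilon}_t(a) \le \varepsilon_t(a)$ with probability 1 and $\underline{\varepsilon}_t(a) \le \underline{\varepsilon}_t(a^*)$ for all $t$ and $a$. Define $\nu_t(a) = \sum_{s=1}^{t} \frac{1}{\underline{\varepsilon}_s(a)}$ and define the event $\mathcal{E}_t^a$:

$$t\Delta(a) - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right) \le \sqrt{2\left(\nu_t(a) + \nu_t(a^*)\right) b_t} + \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a^*)}. \qquad (\mathcal{E}_t^a)$$

Then for any positive sequence $b_1, b_2, \ldots$ and any $t^* \ge 2$ the number of times arm $a$ is played by EXP3++ up to round $t$ is bounded as:

$$\mathbb{E}[N_t(a)] \le (t^* - 1) + \sum_{s=t^*}^{t} e^{-b_{s-1}} + \sum_{s=t^*}^{t} \varepsilon_s(a)\, \mathbb{1}\{\mathcal{E}_{s-1}^a\} + \sum_{s=t^*}^{t} e^{-\eta_s g_{s-1}(a)},$$

where

$$g_t(a) = t\Delta(a) - \sqrt{2 t b_t \left(\frac{1}{\underline{\varepsilon}_t(a)} + \frac{1}{\underline{\varepsilon}_t(a^*)}\right)} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a^*)}.$$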
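
Proof. Note that the elements of the martingale difference sequence $\left\{\Delta(a) - \left(\tilde{\ell}_t^a - \tilde{\ell}_t^{a^*}\right)\right\}_{t=1}^{\infty}$ are upper bounded by $\frac{1}{\underline{\varepsilon}_t(a^*)} + 1$. Since $\underline{\varepsilon}_t(a^*) \le \varepsilon_t(a^*) \le 1/(2K) \le 1/4$ we can simplify the upper bound by using $\frac{1}{\underline{\varepsilon}_t(a^*)} + 1 \le \frac{1.25}{\underline{\varepsilon}_t(a^*)}$. Further note that

$$\sum_{s=1}^{t} \mathbb{E}_s\left[\left(\Delta(a) - \left(\tilde{\ell}_s^a - \tilde{\ell}_s^{a^*}\right)\right)^2\right] \le \sum_{s=1}^{t} \mathbb{E}_s\left[\left(\tilde{\ell}_s^a - \tilde{\ell}_s^{a^*}\right)^2\right] = \sum_{s=1}^{t} \left(\mathbb{E}_s\left[(\tilde{\ell}_s^a)^2\right] + \mathbb{E}_s\left[(\tilde{\ell}_s^{a^*})^2\right]\right)$$
$$\le \sum_{s=1}^{t} \left(\frac{1}{\tilde{\rho}_s(a)} + \frac{1}{\tilde{\rho}_s(a^*)}\right) \le \sum_{s=1}^{t} \left(\frac{1}{\varepsilon_s(a)} + \frac{1}{\varepsilon_s(a^*)}\right) \le \sum_{s=1}^{t} \left(\frac{1}{\underline{\varepsilon}_s(a)} + \frac{1}{\underline{\varepsilon}_s(a^*)}\right) = \nu_t(a) + \nu_t(a^*)$$

with probability 1. Let $\bar{\mathcal{E}}_t^a$ denote the complement of event $\mathcal{E}_t^a$. Then by Bernstein's inequality $\mathbb{P}\left[\bar{\mathcal{E}}_t^a\right] \le e^{-b_t}$. The number of times arm $a$ is played up to round $t$ is bounded as:

$$\mathbb{E}[N_t(a)] \le (t^* - 1) + \sum_{s=t^*}^{t} \mathbb{P}[A_s = a]$$
$$= (t^* - 1) + \sum_{s=t^*}^{t} \left(\mathbb{P}\left[A_s = a \mid \mathcal{E}_{s-1}^a\right] \mathbb{P}\left[\mathcal{E}_{s-1}^a\right] + \mathbb{P}\left[A_s = a \mid \bar{\mathcal{E}}_{s-1}^a\right] \mathbb{P}\left[\bar{\mathcal{E}}_{s-1}^a\right]\right)$$
$$\le (t^* - 1) + \sum_{s=t^*}^{t} \left(\mathbb{P}\left[A_s = a \mid \mathcal{E}_{s-1}^a\right] \mathbb{1}\{\mathcal{E}_{s-1}^a\} + \mathbb{P}\left[\bar{\mathcal{E}}_{s-1}^a\right]\right)$$
$$\le (t^* - 1) + \sum_{s=t^*}^{t} \left(\mathbb{P}\left[A_s = a \mid \mathcal{E}_{s-1}^a\right] \mathbb{1}\{\mathcal{E}_{s-1}^a\} + e^{-b_{s-1}}\right).$$

For the terms of the sum above we have:

$$\mathbb{P}\left[A_t = a \mid \mathcal{E}_{t-1}^a\right] \mathbb{1}\{\mathcal{E}_{t-1}^a\} = \tilde{\rho}_t(a)\, \mathbb{1}\{\mathcal{E}_{t-1}^a\} \le \left(\rho_t(a) + \varepsilon_t(a)\right) \mathbb{1}\{\mathcal{E}_{t-1}^a\}$$
$$= \left(\varepsilon_t(a) + \frac{e^{-\eta_t \tilde{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}}\right) \mathbb{1}\{\mathcal{E}_{t-1}^a\} \le \left(\varepsilon_t(a) + e^{-\eta_t \left(\tilde{L}_{t-1}(a) - \tilde{L}_{t-1}(a^*)\right)}\right) \mathbb{1}\{\mathcal{E}_{t-1}^a\}$$
$$\le \varepsilon_t(a)\, \mathbb{1}\{\mathcal{E}_{t-1}^a\} + e^{-\eta_t g_{t-1}(a)},$$

where in the last inequality we used the facts that event $\mathcal{E}_{t-1}^a$ holds and that since $\underline{\varepsilon}_t(a)$ is a non-increasing sequence, $\nu_t(a) \le \frac{t}{\underline{\varepsilon}_t(a)}$. Substitution of this result back into the computation of $\mathbb{E}[N_t(a)]$ completes the proof.

Proof of Theorem 2. The proof is based on Lemma 11. Let $b_t = \ln(t\Delta(a)^2)$ and $\underline{\varepsilon}_t(a) = \varepsilon_t(a)$. For any $c \ge 18$ and any $t \ge t^*$, where $t^*$ is the minimal integer for which $t^* \ge \frac{4c^2 K \ln(t^*\Delta(a)^2)^2}{\Delta(a)^4 \ln K}$, we have:

$$g_t(a) = t\Delta(a) - \sqrt{2 t b_t \left(\frac{1}{\underline{\varepsilon}_t(a)} + \frac{1}{\underline{\varepsilon}_t(a^*)}\right)} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a^*)} \ge t\Delta(a) - 2\sqrt{\frac{t b_t}{\underline{\varepsilon}_t(a)}} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a)}$$
$$\ge t\Delta(a) \left(1 - \frac{2}{\sqrt{c}} - \frac{1.25}{3c}\right) \ge \frac{1}{2} t\Delta(a).$$

The choice of $t^*$ ensures that for $t \ge t^*$ and all suboptimal actions $a$ we have $\varepsilon_t(a) = \xi_t(a)$, which slightly simplifies the calculations. Also note that since $\varepsilon_t(a^*) = \min\left\{\frac{1}{2K}, \beta_t\right\}$, asymptotically the $1/\varepsilon_t(a)$ term in $g_t(a)$ dominates the $1/\varepsilon_t(a^*)$ term and with a bit more careful bounding $c$ can be made almost as small as 2. By substitution of the lower bound on $g_t(a)$ into Lemma 11 we have:

$$\mathbb{E}[N_t(a)] \le t^* + \frac{\ln t}{\Delta(a)^2} + \frac{c \ln(t)^2}{\Delta(a)^2} + \sum_{s=1}^{t} e^{-\frac{\Delta(a)}{4} \sqrt{\frac{s \ln K}{K}}} \le \frac{c \ln(t)^2}{\Delta(a)^2} + \frac{\ln t}{\Delta(a)^2} + O\left(\frac{K}{\Delta(a)^2}\right) + t^*,$$

where we used Lemma 10 to bound the sum of the exponents. Note that $t^*$ is of order $\tilde{O}\left(\frac{K}{\Delta(a)^4}\right)$.

Proof of Theorem 3. Note that since by our definition $\hat{\Delta}_t(a) \le 1$, the sequence $\underline{\varepsilon}_t(a) = \underline{\varepsilon}_t = \min\left\{\frac{1}{2K}, \beta_t, \frac{c \ln(t)^2}{t}\right\}$ satisfies the condition of Lemma 11. Also note that for $t$ large enough, so that $t \ge \frac{4c^2 K \ln(t)^4}{\ln K}$, we have $\underline{\varepsilon}_t = \frac{c \ln(t)^2}{t}$. Let $b_t = \ln t$ and let $t^*$ be large enough, so that for all $t \ge t^*$ we have $t \ge \frac{4c^2 K \ln(t)^4}{\ln K}$ and $t \ge e^{1/\Delta(a)^2}$. We are going to bound the three terms in the bound on $\mathbb{E}[N_t(a)]$ in Lemma 11. Bounding $\sum_{s} e^{-b_s}$ is easy. For bounding $\sum_{s} \varepsilon_s(a)\, \mathbb{1}\{\mathcal{E}_s^a\}$ we note that when $\mathcal{E}_t^a$ holds and $c \ge 18$ we have:

$$\frac{1}{t}\left(\tilde{L}_t(a) - \min_{a'} \tilde{L}_t(a')\right) \ge \frac{1}{t}\left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right) \ge \frac{1}{t} g_t(a) = \frac{1}{t}\left(t\Delta(a) - 2\sqrt{\frac{t b_t}{\underline{\varepsilon}_t}} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t}\right)$$
$$= \Delta(a) \left(1 - \frac{2}{\sqrt{c \ln t}\, \Delta(a)} - \frac{1.25}{3 c \ln(t)\, \Delta(a)}\right) \ge \Delta(a) \left(1 - \frac{2}{\sqrt{c}} - \frac{1.25}{3c}\right) \ge \frac{1}{2}\Delta(a),$$

where in the step from $\tilde{L}_t(a) - \tilde{L}_t(a^*)$ to $g_t(a)$ we used the fact that $\mathcal{E}_t^a$ holds and in the last line we used the fact that for $t \ge t^*$ we have $\sqrt{\ln t} \ge 1/\Delta(a)$. Hence $\hat{\Delta}_t(a) \ge \frac{1}{2}\Delta(a)$. Thus $\varepsilon_t(a)\, \mathbb{1}\{\mathcal{E}_t^a\} \le \frac{c \ln(t)^2}{t \hat{\Delta}_t(a)^2} \le \frac{4c \ln(t)^2}{t \Delta(a)^2}$ and $\sum_{s} \varepsilon_s(a)\, \mathbb{1}\{\mathcal{E}_s^a\} = O\left(\frac{\ln(t)^3}{\Delta(a)^2}\right)$. Finally, for the last term in Lemma 11 we have already shown as an intermediate step in the calculation of the bound on $\hat{\Delta}_t(a)$ that for $t \ge t^*$ we have $g_t(a) \ge \frac{1}{2} t\Delta(a)$. Therefore, the last term is of order $O\left(\frac{K}{\Delta(a)^2}\right)$. By taking all these calculations together we obtain the result of the theorem. Note that the result holds for any $\eta_t \ge \beta_t$.

The Contaminated Stochastic Regime

Proof of Theorem 5. The key element of the previous proof was a high-probability lower bound on $\tilde{L}_t(a) - \tilde{L}_t(a^*)$. We show that we can obtain a similar lower bound in the contaminated setting too. Let $\mathbb{1}^\star_{t,a}$ denote the indicator function of contamination in location $(t, a)$ ($\mathbb{1}^\star_{t,a}$ takes value 1 if contamination occurred and 0 otherwise). Let $m_t^a = \mathbb{1}^\star_{t,a}\, \ell_t^a + (1 - \mathbb{1}^\star_{t,a})\, \mu(a)$; in other words, if arm $a$ was contaminated on round $t$ then $m_t^a$ is the adversarially assigned value of the loss of arm $a$ on round $t$ and otherwise it is the expected loss. Let $M_t(a) = \sum_{s=1}^{t} m_s^a$; then $\left(M_t(a) - M_t(a^*)\right) - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right)$ is a martingale. By the definition of a moderately contaminated after $\tau$ rounds process, for $t \ge \tau$ and any suboptimal action $a$ the total number of rounds up to $t$ where either $a$ itself or $a^*$ were contaminated is at most $t\Delta(a)/2$. Therefore, $M_t(a) - M_t(a^*) \ge t\Delta(a) - t\Delta(a)/2 = t\Delta(a)/2$. Define the event $\mathcal{B}_t^a$:

$$\frac{t\Delta(a)}{2} - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right) \le 2\sqrt{\nu_t b_t} + \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t}, \qquad (\mathcal{B}_t^a)$$

where $\underline{\varepsilon}_t$ is defined in the proof of Theorem 3 and $\nu_t = \sum_{s=1}^{t} \frac{1}{\underline{\varepsilon}_s}$. Then by Bernstein's inequality $\mathbb{P}\left[\bar{\mathcal{B}}_t^a\right] \le e^{-b_t}$. The remainder of the proof is identical to the proof of Theorem 3 with $\Delta(a)$ replaced by $\Delta(a)/2$.

The Adversarial Regime with a Gap

The proof of Theorem 6 is based on the following lemma, which is an analogue of Theorems 3 and 5.

Lemma 12. Under the parametrization given in Theorem 3, the number of times a suboptimal arm $a$ is played by EXP3++$^{\mathrm{AVG}}$ in an adversarial regime with a gap satisfies:

$$\mathbb{E}[N_t(a)] \le \max\left\{t^*, \tau, e^{1/\Delta(\tau,a)^2}\right\} + O\left(\frac{\ln(t)^3}{\Delta(\tau,a)^2}\right).$$

Proof. Again, the only modification we need is a high-probability lower bound on $\tilde{L}_t(a) - \tilde{L}_t(a^*_\tau)$. We note that $\left(\lambda_t(a) - \lambda_t(a^*_\tau)\right) - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*_\tau)\right)$ is a martingale and that by definition for $t \ge \tau$ we have $\lambda_t(a) - \lambda_t(a^*_\tau) \ge t\Delta(\tau, a)$. Define the events $\mathcal{W}_t^a$:

$$t\Delta(\tau, a) - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*_\tau)\right) \le 2\sqrt{\nu_t b_t} + \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t}, \qquad (\mathcal{W}_t^a)$$

where $\underline{\varepsilon}_t$ and $\nu_t$ are as in the proof of Theorem 5. By Bernstein's inequality $\mathbb{P}\left[\bar{\mathcal{W}}_t^a\right] \le e^{-b_t}$. The remainder of the proof is identical to the proof of Theorem 3.

Proof of Theorem 6. Note that by definition $\Delta(\tau, a)$ is a non-decreasing sequence of $\tau$.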
Since Lemma 12 is a deterministic result it holds for all $\tau$ simultaneously and we are free to choose the one that minimizes the bound.

5. Empirical Evaluation: Stochastic Regime

We consider the stochastic multiarmed bandit problem with Bernoulli rewards. For all the suboptimal arms the rewards are Bernoulli with bias 0.5 and for the single best arm the reward is Bernoulli with bias $0.5 + \Delta$. We run the experiments with $K = 2$, $K = 10$, and $K = 100$, and $\Delta = 0.1$ and $\Delta = 0.01$ (in total, six combinations of $K$ and $\Delta$). We run each game for $10^7$ rounds and make ten repetitions of each experiment. The solid lines in the graphs in Figure 1 represent the mean performance over the experiments and the dashed lines represent the mean plus one standard deviation (std) over the ten repetitions of the corresponding experiment.

In the experiments EXP3++ is parametrized by $\xi_t(a) = \frac{\ln(t \hat{\Delta}_t(a)^2)}{32\, t\, \hat{\Delta}_t(a)^2}$, where $\hat{\Delta}_t(a)$ is the empirical estimate of $\Delta(a)$ defined in Section 3. In order to demonstrate that in the stochastic regime the exploration parameters are in full control of the performance we run the EXP3++ algorithm with two different learning rates. EXP3++$^{\mathrm{EMP}}$ corresponds to $\eta_t = \beta_t$ and EXP3++$^{\mathrm{ACC}}$ corresponds to $\eta_t = 1$. Note that only EXP3++$^{\mathrm{EMP}}$ has a performance guarantee in the adversarial regime.

We compare the EXP3++ algorithm with the EXP3 algorithm (as described in Bubeck & Cesa-Bianchi (2012)), the UCB1 algorithm of Auer et al. (2002a), and Thompson's sampling. Since it was demonstrated empirically in Seldin et al. (2012) that in the above experiments the performance of Thompson sampling is comparable or superior to the performance of EwS and KL-UCB, the latter two algorithms are excluded from the comparison. For the EXP3++ and the EXP3 algorithms we transform the rewards into losses via the $\ell_t^a = 1 - r_t^a$ transformation; the other algorithms operate directly on the rewards.
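
The experimental setup above can be sketched in a few lines of Python. This is a small-scale illustration under stated assumptions, not the code used for the reported experiments: the function names `run_exp3pp_emp` and `pseudo_regret` are hypothetical, the gap estimate at round $t$ is computed from the cumulative estimates of the previous round, negative values of the logarithm are clipped to zero so that $\xi_t(a) \ge 0$, and an arm with zero empirical gap gets $\xi_t(a) = \infty$. The constant 32 follows the parametrization stated for the experiments.

```python
import math
import random

def run_exp3pp_emp(loss_means, T, seed=0):
    """Run EXP3++ with eta_t = beta_t on Bernoulli losses; return pull counts."""
    rng = random.Random(seed)
    K = len(loss_means)
    L = [0.0] * K          # cumulative importance-weighted loss estimates
    pulls = [0] * K
    for t in range(1, T + 1):
        beta = 0.5 * math.sqrt(math.log(K) / (t * K))
        L_min = min(L)
        eps = []
        for a in range(K):
            # Empirical gap from the estimates available so far (assumption).
            gap_hat = min(1.0, (L[a] - L_min) / t)
            x = t * gap_hat * gap_hat
            # Clip the logarithm at zero so that xi stays non-negative (assumption).
            xi = max(0.0, math.log(x)) / (32.0 * x) if x > 0.0 else math.inf
            eps.append(min(1.0 / (2 * K), beta, xi))
        weights = [math.exp(-beta * (v - L_min)) for v in L]
        total = sum(weights)
        mix = 1.0 - sum(eps)
        probs = [mix * w / total + e for w, e in zip(weights, eps)]
        arm = rng.choices(range(K), weights=probs)[0]
        loss = 1.0 if rng.random() < loss_means[arm] else 0.0
        L[arm] += loss / probs[arm]
        pulls[arm] += 1
    return pulls

def pseudo_regret(pulls, loss_means):
    """Sum over arms of N_T(a) * Delta(a), the stochastic-regime regret."""
    best = min(loss_means)
    return sum(n * (m - best) for n, m in zip(pulls, loss_means))

# Rewards with bias 0.5 + Delta (best arm) and 0.5 (others) become
# losses with means 0.5 - Delta and 0.5 under l = 1 - r.
delta = 0.1
means = [0.5 - delta, 0.5, 0.5]
pulls = run_exp3pp_emp(means, T=2000, seed=1)
regret = pseudo_regret(pulls, means)
```

Averaging `pseudo_regret` over several seeds and horizons gives curves of the kind plotted in Figure 1, though the horizon of 2000 rounds used here is far shorter than the $10^7$ rounds of the reported experiments.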

[Figure 1: six panels, (a) to (f), one for each combination of $K$ and $\Delta$, plotting cumulative regret against $t$ (horizontal axis in units of $10^6$) for UCB1, Thom, EXP3, EXP3++$^{\mathrm{EMP}}$, and EXP3++$^{\mathrm{ACC}}$.]

Figure 1. Comparison of UCB1, Thompson sampling ("Thom"), EXP3, and EXP3++ algorithms in the stochastic regime. The legend in figure (f) corresponds to all the figures. EXP3++$^{\mathrm{EMP}}$ is the Empirical EXP3++ algorithm and EXP3++$^{\mathrm{ACC}}$ is an Accelerated Empirical EXP3++, where we take $\eta_t = 1$. Solid lines correspond to means over 10 repetitions of the corresponding experiments and dashed lines correspond to the means plus one standard deviation.

The results are presented in Figure 1. We see that in all the experiments the performance of EXP3++$^{\mathrm{EMP}}$ is almost identical to the performance of UCB1. However, unlike UCB1 and Thompson's sampling, EXP3++$^{\mathrm{EMP}}$ is secured against the possibility that the game is controlled by an adversary. In the supplementary material we show that any deterministic algorithm is vulnerable against an adversary.

The EXP3++$^{\mathrm{ACC}}$ algorithm can be seen as a teaser for future work. It performs better than EXP3++$^{\mathrm{EMP}}$, but it does not have the adversarial regime performance guarantee. However, we do not exclude the possibility that by some more sophisticated simultaneous control of $\eta_t$ and the $\varepsilon_t(a)$-s it may be possible to design an algorithm that will have both better performance in the stochastic regime and a regret guarantee in the adversarial regime. An example of such sophisticated control of the learning rate in full information games can be found in de Rooij et al. (2014).

6. Discussion

We presented a generalization of the EXP3 algorithm, the EXP3++ algorithm, which augments the EXP3 algorithm with a new control lever in the form of exploration parameters $\varepsilon_t(a)$ that are tuned individually for each arm. We have shown that the new control lever is extremely useful in detecting and exploiting the gap in a wide range of regimes, while the old control lever always keeps the worst-case performance of the algorithm under control.
Due to the central role of the EXP3 algorithm in the adversarial analysis, which stretches far beyond the adversarial bandits, and due to the simplicity of our generalization, we believe that our result will lead to a multitude of new algorithms for other problems that exploit the gaps without compromising on the worst-case performance guarantees. There is also room for further improvement of the presented technique that we plan to pursue in future work.

Acknowledgments

The authors would like to thank Sébastien Bubeck and Wouter Koolen for useful discussions and Csaba Szepesvári for bringing up the reference to Cesa-Bianchi & Fischer (1998). This research was supported by an Australian Research Council Australian Laureate Fellowship (FL110100281).

References

Agrawal, Shipra and Goyal, Navin. Further optimal regret bounds for Thompson sampling. In AISTATS, 2013.

Audibert, Jean-Yves and Bubeck, Sébastien. Minimax policies for adversarial and stochastic bandits. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2009.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. Gambling in a rigged casino: The adversarial multiarmed bandit problem. In Proceedings of the Annual Symposium on Foundations of Computer Science, 1995.

Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 2002a.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32(1), 2002b.

Boucheron, Stéphane, Lugosi, Gábor, and Massart, Pascal. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bubeck, Sébastien. Bandits Games and Clustering Foundations. PhD thesis, Université Lille 1, 2010.

Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.

Bubeck, Sébastien and Slivkins, Aleksandrs. The best of both worlds: stochastic and adversarial bandits. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2012.

Cappé, Olivier, Garivier, Aurélien, Maillard, Odalric-Ambrym, Munos, Rémi, and Stoltz, Gilles. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3), 2013.

Cesa-Bianchi, Nicolò and Fischer, Paul. Finite-time regret bounds for the multiarmed bandit problem. In Proceedings of the International Conference on Machine Learning (ICML), 1998.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Prediction, Learning, and Games. Cambridge University Press, 2006.

de Rooij, Steven, van Erven, Tim, Grünwald, Peter D., and Koolen, Wouter M. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 2014.

Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson sampling: An optimal finite time analysis. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2012.

Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 1985.

Maillard, Odalric-Ambrym. Apprentissage Séquentiel: Bandits, Statistique et Renforcement. PhD thesis, INRIA Lille, 2011.

Robbins, Herbert. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.

Seldin, Yevgeny, Szepesvári, Csaba, Auer, Peter, and Abbasi-Yadkori, Yasin. Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. In JMLR Workshop and Conference Proceedings, volume 24 (EWRL), 2012.

Stoltz, Gilles. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD thesis, Université Paris-Sud, 2005.

Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 1933.