The Sample Complexity of Exploration in the Multi-Armed Bandit Problem


Journal of Machine Learning Research 5 (2004) Submitted 1/04; Published 6/04

Shie Mannor [email protected]
John N. Tsitsiklis [email protected]
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

Editors: Kristin Bennett and Nicolò Cesa-Bianchi

Abstract

We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given $n$ arms, a total of $O\big((n/\varepsilon^2)\log(1/\delta)\big)$ trials suffices in order to find an $\varepsilon$-optimal arm with probability at least $1-\delta$. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not, is also discussed. For this case, we provide a lower bound of $\Theta\big((1/\varepsilon^2)(n+\log(1/\delta))\big)$ on the expected number of trials, as well as a sampling policy with a matching upper bound. If instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish a matching upper and lower bound of the form $\Theta\big((n/\varepsilon^2)\log(1/\delta)\big)$. Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins.

1. Introduction

The multi-armed bandit problem is a classical problem in decision theory. There is a number of alternative arms, each with a stochastic reward whose probability distribution is initially unknown. We try these arms in some order, which may depend on the sequence of rewards that have been observed so far. A common objective in this context is to find a policy for choosing the next arm to be tried, under which the sum of the expected rewards comes as close as possible to the ideal reward, i.e., the expected reward that would be obtained if we were to try the "best" arm at all times.

One of the attractive features of the multi-armed bandit problem is that despite its simplicity, it encompasses many important decision theoretic issues, such as the tradeoff between exploration and exploitation. The multi-armed bandit problem has been widely studied in a variety of setups. The problem was first considered in the 50's, in the seminal work of Robbins (1952), which derives policies that asymptotically attain an average reward that converges in the limit to the reward of the best arm. The multi-armed bandit problem was later studied in discounted, Bayesian, Markovian, expected reward, and adversarial setups. See Berry and Fristedt (1985) for a review of the classical results on the multi-armed bandit problem.

© 2004 Shie Mannor and John Tsitsiklis.

Lower bounds for different variants of the multi-armed bandit have been studied by several authors. For the expected regret model, where the regret is defined as the difference between the ideal reward (if the best arm were known) and the reward under an online policy, the seminal work of Lai and Robbins (1985) provides asymptotically tight bounds in terms of the Kullback-Leibler divergence between the distributions of the rewards of the different arms. These bounds grow logarithmically with the number of steps. The adversarial multi-armed bandit problem (i.e., without any probabilistic assumptions) was considered in Auer et al. (1995, 2002b), where it was shown that the expected regret grows proportionally to the square root of the number of steps. Of related interest is the work of Kulkarni and Lugosi (2000) which shows that for any specific time $t$, one can choose the reward distributions so that the expected regret is linear in $t$.

The focus of this paper is the classical multi-armed bandit problem, but rather than looking at the expected regret, we are concerned with PAC-type bounds on the number of steps needed to identify a near-optimal arm. In particular, we are interested in the expected number of steps that are required in order to identify with high probability (at least $1-\delta$) an arm whose expected reward is within $\varepsilon$ from the expected reward of the best arm. This naturally abstracts the case where one must eventually commit to one specific arm, and quantifies the amount of exploration necessary. This is in contrast to most of the results for the multi-armed bandit problem, where the main aim is to maximize the expected cumulative reward while both exploring and exploiting. In Even-Dar et al. (2002), a policy, called the median elimination algorithm, was provided which requires $O\big((n/\varepsilon^2)\log(1/\delta)\big)$ trials, and which finds an $\varepsilon$-optimal arm with probability at least $1-\delta$. A matching lower bound was also derived in Even-Dar et al. (2002), but it only applied to the case where $\delta > 1/n$, and therefore did not capture the case where high confidence (small $\delta$) is desired.
In this paper, we derive a matching lower bound which also applies when $\delta > 0$ is arbitrarily small. Our main result can be viewed as a generalization of a $O\big((1/\varepsilon^2)\log(1/\delta)\big)$ lower bound provided in Anthony and Bartlett (1999), and Chernoff (1972), for the case of two bandits. The proof in Anthony and Bartlett (1999) is based on a hypothesis interchange argument, and relies critically on the fact there are only two underlying hypotheses. Furthermore, it is limited to "nonadaptive" policies, for which the number of trials is fixed a priori. The technique we use is based on a likelihood ratio argument and a tight martingale bound, and applies to general policies.

A different type of lower bound was derived in Auer et al. (2002b) for the expected regret in an adversarial setup. The bounds derived there can also be used to derive a lower bound for our problem, but do not appear to be tight enough to capture the $\log(1/\delta)$ dependence on $\delta$.

Our work also provides fundamental lower bounds in the context of sequential analysis (see, e.g., Chernoff, 1972; Jennison et al., 1982; Siegmund, 1985). In the language of Siegmund (1985), we provide a lower bound on the expected length of a sequential sampling policy under any adaptive allocation scheme. For the case of two arms, it was shown in Siegmund (1985) (p. 148) that if one restricts to sampling policies that only take into account the empirical average rewards from the different arms, then the problems of inference and arm selection can be treated separately. As a consequence, and under this restriction, Siegmund (1985) shows that an optimal allocation cannot be much better than a uniform one. Our results are different in a number of ways. First, we consider multiple hypotheses (multiple arms). Second, we allow the allocation rule to be completely general and to depend on the whole history. Third, unlike most of the sequential analysis literature (see, e.g., Jennison et al., 1982), we do not restrict ourselves to the limiting case where the probability of error converges to zero. Finally, we consider finite time bounds, rather than asymptotic ones. We further comment that our results extend those of Jennison et al. (1982), in that we consider the case where the reward is not Gaussian.

Paper Outline

The paper is organized as follows. In Section 2, we set up our framework, and since we are mainly interested in lower bounds, we restrict to the special case where each arm is a "coin," i.e., the rewards are Bernoulli random variables, but with unknown parameters ("biases"). In Section 3, we provide a $O\big((n/\varepsilon^2)\log(1/\delta)\big)$ lower bound on the expected number of trials under any policy that finds an $\varepsilon$-optimal coin with probability at least $1-\delta$. In Section 4, we provide a refined lower bound that depends explicitly on the specific (though unknown) biases of the coins. This lower bound has the same $\log(1/\delta)$ dependence on $\delta$; furthermore, every coin roughly contributes a factor inversely proportional to the square difference between its bias and the bias of a best coin, but no more than $1/\varepsilon^2$. In Section 5, we derive a lower bound similar to the one in Section 3, but within a Bayesian setting, under a prior distribution on the set of biases of the different coins.

In Section 6 we provide a bound on the expected regret which is similar in spirit to the bound in Lai and Robbins (1985). The constants in our bounds are slightly worse than the ones in Lai and Robbins (1985), but the different derivation, which links the PAC model to regret bounds, may be of independent interest. Our bound holds for any finite time, as opposed to the asymptotic result provided in Lai and Robbins (1985).

The case where the coin biases are known in advance, but the identities of the coins are not, is discussed in Section 7. We provide a policy that finds an $\varepsilon$-optimal coin with probability at least $1-\delta$, under which the expected number of trials is $O\big((1/\varepsilon^2)(n+\log(1/\delta))\big)$. We show that this bound is tight up to a multiplicative constant. If instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish matching upper and lower bounds of the form $\Theta\big((n/\varepsilon^2)\log(1/\delta)\big)$. Finally, Section 8 contains some brief concluding remarks.
2. Problem Definition

The exploration problem for multi-armed bandits is defined as follows. We are given $n$ arms. Each arm $\ell$ is associated with a sequence of identically distributed Bernoulli (i.e., taking values in $\{0,1\}$) random variables $X_k^\ell$, $k = 1,2,\ldots$, with unknown mean $p_\ell$. Here, $X_k^\ell$ corresponds to the reward obtained the $k$th time that arm $\ell$ is tried. We assume that the random variables $X_k^\ell$, for $\ell = 1,\ldots,n$, $k = 1,2,\ldots$, are independent, and we define $p = (p_1,\ldots,p_n)$. Given that we restrict to the Bernoulli case, we will use in the sequel the term "coin" instead of "arm."

A policy is a mapping that given a history, chooses a particular coin to be tried next, or selects a particular coin and stops. We allow a policy to use randomization when choosing the next coin to be tried or when making a final selection. However, we only consider policies that are guaranteed to stop with probability 1, for every possible vector $p$. (Otherwise, the expected number of steps would be infinite.) Given a particular policy, we let $P_p$ be the corresponding probability measure (on the natural probability space for this model). This probability space captures both the randomness in the coins (according to the vector $p$), as well as any additional randomization carried out by the policy. We introduce the following random variables, which are well defined, except possibly on the set of measure zero where the policy does not stop. We let $T_\ell$ be the total number of times that coin $\ell$ is tried, and let $T = T_1 + \cdots + T_n$ be the total number of trials. We also let $I$ be the coin which is selected when the policy decides to stop.

We say that a policy is $(\varepsilon,\delta)$-correct if

$$P_p\Big(p_I > \max_\ell p_\ell - \varepsilon\Big) \ge 1-\delta,$$

for every $p \in [0,1]^n$. It was shown in Even-Dar et al. (2002) that there exist constants $c_1$ and $c_2$ such that for every $n$, $\varepsilon > 0$, and $\delta > 0$, there exists an $(\varepsilon,\delta)$-correct policy under which

$$E_p[T] \le c_1 \frac{n}{\varepsilon^2}\log\frac{c_2}{\delta}, \qquad \forall\, p \in [0,1]^n.$$

A matching lower bound was also established in Even-Dar et al. (2002), but only for "large" values of $\delta$, namely, for $\delta > 1/n$. In contrast, we aim at deriving bounds that capture the dependence of the sample-complexity on $\delta$, as $\delta$ becomes small.

3. A Lower Bound on the Sample Complexity

We start with our central result, which can be viewed as an extension of Lemma 5.1 from Anthony and Bartlett (1999), as well as a special case of Theorem 5. We present it here because it admits a simpler proof, but also because parts of the proof will be used later. Throughout the rest of the paper, log will stand for the natural logarithm.

Theorem 1 There exist positive constants $c_1$, $c_2$, $\varepsilon_0$, and $\delta_0$, such that for every $n \ge 2$, $\varepsilon \in (0,\varepsilon_0)$, and $\delta \in (0,\delta_0)$, and for every $(\varepsilon,\delta)$-correct policy, there exists some $p \in [0,1]^n$ such that

$$E_p[T] \ge c_1 \frac{n}{\varepsilon^2}\log\frac{c_2}{\delta}.$$

In particular, $\varepsilon_0$ and $\delta_0$ can be taken equal to $1/8$ and $e^{-4}/4$, respectively.

Proof Let us consider a multi-armed bandit problem with $n+1$ coins, which we number from 0 to $n$. We consider a finite set of $n+1$ possible parameter vectors $p$, which we will refer to as "hypotheses." Under any one of the hypotheses, coin 0 has a known bias $p_0 = (1+\varepsilon)/2$. Under one hypothesis, denoted by $H_0$, all the coins other than zero have a bias of $1/2$,

$$H_0:\quad p_0 = \frac{1}{2} + \frac{\varepsilon}{2}, \qquad p_i = \frac{1}{2}, \ \text{for } i \ne 0,$$

which makes coin 0 the best coin. Furthermore, for $\ell = 1,\ldots,n$, there is a hypothesis

$$H_\ell:\quad p_0 = \frac{1}{2} + \frac{\varepsilon}{2}, \qquad p_\ell = \frac{1}{2} + \varepsilon, \qquad p_i = \frac{1}{2}, \ \text{for } i \ne 0, \ell,$$

which makes coin $\ell$ the best coin.

We define $\varepsilon_0 = 1/8$ and $\delta_0 = e^{-4}/4$. From now on, we fix some $\varepsilon \in (0,\varepsilon_0)$ and $\delta \in (0,\delta_0)$, and a policy, which we assume to be $(\varepsilon/2,\delta)$-correct.
If $H_0$ is true, the policy must have probability at least $1-\delta$ of eventually stopping and selecting coin 0. If $H_\ell$ is true, for some $\ell \ne 0$, the policy must have probability at least $1-\delta$ of eventually stopping and selecting coin $\ell$. We denote by $E_\ell$ and $P_\ell$ the expectation and probability, respectively, under hypothesis $H_\ell$.
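As a concrete illustration of the hypotheses used in this proof, the following Python sketch builds the bias vectors under $H_0$ and $H_\ell$ and estimates by simulation how often a simple policy selects the best coin. The policy shown (toss every coin `m` times and pick the highest empirical mean) and all function names are assumptions made for illustration; it is not a policy analyzed in the paper.

```python
import random


def hypothesis(n, eps, ell):
    """Bias vector (p_0, ..., p_n) under hypothesis H_ell (ell = 0 gives H_0)."""
    p = [0.5] * (n + 1)
    p[0] = 0.5 + eps / 2          # coin 0 always has bias (1 + eps) / 2
    if ell != 0:
        p[ell] = 0.5 + eps        # under H_ell, coin ell becomes the best coin
    return p


def naive_policy(p, m, rng):
    """Hypothetical baseline: toss each coin m times, return the empirical best."""
    heads = [sum(rng.random() < q for _ in range(m)) for q in p]
    return max(range(len(p)), key=lambda i: heads[i])


def selection_rate(n, eps, ell, m, runs, seed=0):
    """Fraction of runs in which the naive policy selects coin ell under H_ell."""
    rng = random.Random(seed)
    p = hypothesis(n, eps, ell)
    hits = sum(naive_policy(p, m, rng) == ell for _ in range(runs))
    return hits / runs
```

For example, `selection_rate(2, 0.0625, 1, m=4000, runs=100)` estimates $P_1(I = 1)$ for this baseline; the lower bound of Theorem 1 concerns how small the number of tosses can be made, in expectation, while keeping such rates above $1-\delta$ under every hypothesis.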

We define $t^*$ by

$$t^* = \frac{1}{c\varepsilon^2}\log\frac{1}{4\delta} = \frac{1}{c\varepsilon^2}\log\frac{1}{\theta}, \qquad (1)$$

where $\theta = 4\delta$, and where $c$ is an absolute constant whose value will be specified later.[1] Note that $\theta < e^{-4}$ and $\varepsilon < 1/4$.

Recall that $T_\ell$ stands for the number of times that coin $\ell$ is tried. We assume that for some coin $\ell \ne 0$, we have $E_0[T_\ell] \le t^*$. We will eventually show that under this assumption, the probability of selecting $H_0$ under $H_\ell$ exceeds $\delta$, and violates $(\varepsilon/2,\delta)$-correctness. It will then follow that we must have $E_0[T_\ell] > t^*$ for all $\ell \ne 0$. Without loss of generality, we can and will assume that the above condition holds for $\ell = 1$, so that $E_0[T_1] \le t^*$.

We will now introduce some special events $A$ and $C$ under which various random variables of interest do not deviate significantly from their expected values. We define

$$A = \{T_1 \le 4t^*\},$$

and obtain

$$t^* \ge E_0[T_1] \ge 4t^*\, P_0(T_1 > 4t^*) = 4t^*\big(1 - P_0(T_1 \le 4t^*)\big),$$

from which it follows that

$$P_0(A) \ge 3/4.$$

We define $K_t = X_1^1 + \cdots + X_t^1$, which is the number of unit rewards ("heads") if the first coin is tried a total of $t$ (not necessarily consecutive) times. We let $C$ be the event defined by

$$C = \Big\{ \max_{1 \le t \le 4t^*} \Big| K_t - \frac{1}{2}t \Big| < \sqrt{t^*\log(1/\theta)} \Big\}.$$

We now establish two lemmas that will be used in the sequel.

Lemma 2 We have $P_0(C) > 3/4$.

Proof We will prove a more general result:[2] we assume that coin $i$ has bias $p_i$ under hypothesis $H_\ell$, define $K_t^i$ as the number of unit rewards ("heads") if coin $i$ is tested for $t$ (not necessarily consecutive) times, and let

$$C_i = \Big\{ \max_{1 \le t \le 4t^*} \big| K_t^i - p_i t \big| < \sqrt{t^*\log(1/\theta)} \Big\}.$$

First, note that $K_t^i - p_i t$ is a $P_\ell$-martingale (in the context of Theorem 1, $p_i = 1/2$ is the bias of coin $i = 1$ under hypothesis $H_0$). Using Kolmogorov's inequality (Corollary 7.66, in p. 44 of Ross, 1983), the probability of the complement of $C_i$ can be bounded as follows:

$$P_\ell\Big( \max_{1 \le t \le 4t^*} \big| K_t^i - p_i t \big| \ge \sqrt{t^*\log(1/\theta)} \Big) \le \frac{E_\ell\big[(K_{4t^*}^i - 4p_i t^*)^2\big]}{t^*\log(1/\theta)}.$$

Since $E_\ell\big[(K_{4t^*}^i - 4p_i t^*)^2\big] = 4p_i(1-p_i)t^*$, we obtain

$$P_\ell(C_i) \ge 1 - \frac{4p_i(1-p_i)}{\log(1/\theta)} > \frac{3}{4}, \qquad (2)$$

where the last inequality follows because $\theta < e^{-4}$ and $4p_i(1-p_i) \le 1$.

[1] In this and subsequent proofs, and in order to avoid repeated use of truncation symbols, we treat $t^*$ as if it were integer.

[2] The proof for a general $p_i$ will be useful later.
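The bound of Lemma 2 can be checked numerically. The sketch below estimates, by Monte Carlo simulation, the probability that the centered head count of a coin with bias `p` stays within $\sqrt{t^*\log(1/\theta)}$ of zero during its first $4t^*$ tosses, and compares it with the Kolmogorov lower bound $1 - 4p(1-p)/\log(1/\theta)$. The function names, the sample sizes, and the choice of $t^*$ and $\theta$ in the usage note are illustrative assumptions.

```python
import math
import random


def prob_event_C(t_star, theta, p=0.5, runs=2000, seed=0):
    """Monte Carlo estimate of P(max_{1<=t<=4t*} |K_t - p t| < sqrt(t* log(1/theta)))."""
    rng = random.Random(seed)
    threshold = math.sqrt(t_star * math.log(1.0 / theta))
    horizon = 4 * t_star
    inside = 0
    for _ in range(runs):
        heads = 0
        stayed_inside = True
        for t in range(1, horizon + 1):
            heads += rng.random() < p          # one more toss of the coin
            if abs(heads - p * t) >= threshold:
                stayed_inside = False          # the path left the band: C fails
                break
        inside += stayed_inside
    return inside / runs


def kolmogorov_lower_bound(theta, p=0.5):
    """Lower bound on P(C) from Kolmogorov's inequality, as in Eq. (2)."""
    return 1.0 - 4.0 * p * (1.0 - p) / math.log(1.0 / theta)
```

With, say, `t_star = 50` and `theta = math.exp(-5)`, the analytic bound equals 0.8, and the simulated frequency is expected to sit above it, since Kolmogorov's inequality is not tight for this random walk.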

Lemma 3 If $0 \le x \le 1/\sqrt{2}$ and $y \ge 0$, then

$$(1-x)^y \ge e^{-dxy},$$

where $d = 1.78$.

Proof A straightforward calculation shows that $\log(1-x) + dx \ge 0$ for $0 \le x \le 1/\sqrt{2}$. Therefore, $y(\log(1-x) + dx) \ge 0$ for every $y \ge 0$. Rearranging and exponentiating, leads to $(1-x)^y \ge e^{-dxy}$.

We now let $B$ be the event that $I = 0$, i.e., that the policy eventually selects coin 0. Since the policy is $(\varepsilon/2,\delta)$-correct for $\delta < e^{-4}/4 < 1/4$, we have $P_0(B) > 3/4$. We have already shown that $P_0(A) \ge 3/4$ and $P_0(C) > 3/4$. Let $S$ be the event that $A$, $B$, and $C$ occur, that is $S = A \cap B \cap C$. We then have $P_0(S) > 1/4$.

Lemma 4 If $E_0[T_1] \le t^*$ and $c \ge 100$, then $P_1(B) > \delta$.

Proof We let $W$ be the history of the process (the sequence of coins chosen at each time, and the sequence of observed coin rewards) until the policy terminates. We define the likelihood function $L_\ell$ by letting

$$L_\ell(w) = P_\ell(W = w),$$

for every possible history $w$. Note that this function can be used to define a random variable $L_\ell(W)$. We also let $K$ be a shorthand notation for $K_{T_1}$, the total number of unit rewards ("heads") obtained from coin 1. Given the history up to time $t-1$, the coin choice at time $t$ has the same probability distribution under either hypothesis $H_0$ and $H_1$; similarly, the coin reward at time $t$ has the same probability distribution, under either hypothesis, unless the chosen coin was coin 1. For this reason, the likelihood ratio $L_1(W)/L_0(W)$ is given by

$$\frac{L_1(W)}{L_0(W)} = \frac{(\frac{1}{2}+\varepsilon)^K (\frac{1}{2}-\varepsilon)^{T_1-K}}{(\frac{1}{2})^{T_1}} = (1+2\varepsilon)^K (1-2\varepsilon)^K (1-2\varepsilon)^{T_1-2K} = (1-4\varepsilon^2)^K (1-2\varepsilon)^{T_1-2K}. \qquad (3)$$

We will now proceed to lower bound the terms in the right-hand side of Eq. (3) when event $S$ occurs. If event $S$ has occurred, then $A$ has occurred, and we have $K \le T_1 \le 4t^*$, so that

$$(1-4\varepsilon^2)^K \ge (1-4\varepsilon^2)^{4t^*} = (1-4\varepsilon^2)^{(4/(c\varepsilon^2))\log(1/\theta)} \ge e^{-(16d/c)\log(1/\theta)} = \theta^{16d/c}.$$

We have used here Lemma 3, which applies because $4\varepsilon^2 < 4/4^2 < 1/\sqrt{2}$. Similarly, if event $S$ has occurred, then $A \cap C$ has occurred, which implies,

$$T_1 - 2K \le 2\sqrt{t^*\log(1/\theta)} = \frac{2}{\varepsilon\sqrt{c}}\log(1/\theta),$$

where the equality above made use of the definition of $t^*$. Therefore,

$$(1-2\varepsilon)^{T_1-2K} \ge (1-2\varepsilon)^{(2/(\varepsilon\sqrt{c}))\log(1/\theta)} \ge e^{-(4d/\sqrt{c})\log(1/\theta)} = \theta^{4d/\sqrt{c}}.$$

Substituting the above in Eq. (3), we obtain

$$\frac{L_1(W)}{L_0(W)} \ge \theta^{(16d/c)+(4d/\sqrt{c})}.$$

By picking $c$ large enough ($c = 100$ suffices), we obtain that $L_1(W)/L_0(W)$ is larger than $\theta = 4\delta$ whenever the event $S$ occurs. More precisely, we have

$$\frac{L_1(W)}{L_0(W)} 1_S \ge 4\delta\, 1_S,$$

where $1_S$ is the indicator function of the event $S$. Then,

$$P_1(B) \ge P_1(S) = E_1[1_S] = E_0\Big[\frac{L_1(W)}{L_0(W)} 1_S\Big] \ge E_0[4\delta\, 1_S] = 4\delta P_0(S) > \delta,$$

where we used the fact that $P_0(S) > 1/4$.

To summarize, we have shown that when $c \ge 100$, if $E_0[T_1] \le (1/(c\varepsilon^2))\log(1/(4\delta))$, then $P_1(B) > \delta$. Therefore, if we have an $(\varepsilon/2,\delta)$-correct policy, we must have $E_0[T_\ell] > (1/(c\varepsilon^2))\log(1/(4\delta))$, for every $\ell > 0$. Equivalently, if we have an $(\varepsilon,\delta)$-correct policy, we must have $E_0[T] > (n/(4c\varepsilon^2))\log(1/(4\delta))$, which is of the desired form.

4. A Lower Bound on the Sample Complexity - General Probabilities

In Theorem 1, we worked with a particular unfavorable vector $p$ (the one corresponding to hypothesis $H_0$), under which a lot of exploration is necessary. This leaves open the possibility that for other, more favorable choices of $p$, less exploration might suffice. In this section, we refine Theorem 1 by developing a lower bound that explicitly depends on the actual (though unknown) vector $p$. Of course, for any given vector $p$, there is an "optimal" policy, which selects the best coin without any exploration: e.g., if $p_1 \ge p_\ell$ for all $\ell$, the policy that immediately selects coin 1 is "optimal." However, such a policy will not be $(\varepsilon,\delta)$-correct for all possible vectors $p$.

We start with a lower bound that applies when all coin biases $p_i$ lie in the range $[0,1/2]$. We will later use a reduction technique to extend the result to a generic range of biases. In the rest of the paper, we use the notational convention $(x)^+ = \max\{0,x\}$.

Theorem 5 Fix some $\underline{p} \in (0,1/2)$.
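There exists a positive constant $\delta_0$, and a positive constant $c_1$ that depends only on $\underline{p}$, such that for every $\varepsilon \in (0,1/2)$, every $\delta \in (0,\delta_0)$, every $p \in [0,1/2]^n$, and every $(\varepsilon,\delta)$-correct policy, we have

$$E_p[T] \ge c_1 \Big\{ \frac{(|M(p,\varepsilon)|-1)^+}{\varepsilon^2} + \sum_{\ell \in N(p,\varepsilon)} \frac{1}{(p^*-p_\ell)^2} \Big\} \log\frac{1}{8\delta},$$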

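where $p^* = \max_i p_i$,

$$M(p,\varepsilon) = \Big\{ \ell : p_\ell > p^* - \varepsilon, \text{ and } p_\ell > \underline{p}, \text{ and } p_\ell \ge \frac{\varepsilon + p^*}{1+\sqrt{1/2}} \Big\}, \qquad (4)$$

and

$$N(p,\varepsilon) = \Big\{ \ell : p_\ell \le p^* - \varepsilon, \text{ and } p_\ell > \underline{p}, \text{ and } p_\ell \ge \frac{\varepsilon + p^*}{1+\sqrt{1/2}} \Big\}. \qquad (5)$$

In particular, $\delta_0$ can be taken equal to $e^{-8}/8$.

Remarks:

(a) The lower bound involves two sets of coins whose biases are not too far from the best bias $p^*$. The first set $M(p,\varepsilon)$ contains coins that are within $\varepsilon$ from the best and would therefore be legitimate selections. In the presence of multiple such coins, a certain amount of exploration is needed to obtain the required confidence that none of these coins is significantly better than the others. The second set $N(p,\varepsilon)$ contains coins whose bias is more than $\varepsilon$ away from $p^*$; they come into the lower bound because again some exploration is needed in order to obtain the required confidence that none of these coins is significantly better than the best coin in $M(p,\varepsilon)$.

(b) The expression $(\varepsilon + p^*)/(1+\sqrt{1/2})$ in Eqs. (4) and (5) can be replaced by $(\varepsilon + p^*)/(2-\alpha)$ for any positive constant $\alpha$, by changing some of the constants in the proof.

(c) This result actually provides a family of lower bounds, one for every possible choice of $\underline{p}$. A tighter bound can be obtained by optimizing the choice of $\underline{p}$, while also taking into account the dependence of the constant $c_1$ on $\underline{p}$. This is not hard (the dependence of $c_1$ on $\underline{p}$ is described in Remark 7), but does not provide any new insights.

Proof Let us fix $\delta_0 = e^{-8}/8$, some $\underline{p} \in (0,1/2)$, $\varepsilon \in (0,1/2)$, $\delta \in (0,\delta_0)$, an $(\varepsilon,\delta)$-correct policy, and some $p \in [0,1/2]^n$. Without loss of generality, we assume that $p^* = p_1$. Let us denote the true (unknown) bias of each coin by $q_i$. We consider the following hypotheses:

$$H_0:\quad q_i = p_i, \ \text{for } i = 1,\ldots,n,$$

and for $\ell = 1,\ldots,n$,

$$H_\ell:\quad q_\ell = p_1 + \varepsilon, \qquad q_i = p_i, \ \text{for } i \ne \ell.$$

If hypothesis $H_\ell$ is true, the policy must select coin $\ell$. We will bound from below the expected number of times the coins in the sets $N(p,\varepsilon)$ and $M(p,\varepsilon)$ must be tried, when hypothesis $H_0$ is true. As in Section 3, we use $E_\ell$ and $P_\ell$ to denote the expectation and probability, respectively, under the policy being considered and under hypothesis $H_\ell$.

We define $\theta = 8\delta$, and note that $\theta < e^{-8}$. Let

$$t_\ell^* = \begin{cases} \dfrac{1}{c\varepsilon^2}\log\dfrac{1}{\theta}, & \text{if } \ell \in M(p,\varepsilon), \\[2mm] \dfrac{1}{c(p_1-p_\ell)^2}\log\dfrac{1}{\theta}, & \text{if } \ell \in N(p,\varepsilon), \end{cases}$$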

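where $c$ is a constant that only depends on $\underline{p}$, and whose value will be chosen later. Recall that $T_\ell$ stands for the total number of times that coin $\ell$ is tried. We define the event

$$A_\ell = \{T_\ell \le 4t_\ell^*\}.$$

As in the proof of Theorem 1, if $E_0[T_\ell] \le t_\ell^*$, then $P_0(A_\ell) \ge 3/4$.

We define $K_t^\ell = X_1^\ell + \cdots + X_t^\ell$, which is the number of unit rewards ("heads") if the $\ell$-th coin is tried a total of $t$ (not necessarily consecutive) times. We let $C_\ell$ be the event defined by

$$C_\ell = \Big\{ \max_{1 \le t \le 4t_\ell^*} \big| K_t^\ell - p_\ell t \big| < \sqrt{t_\ell^*\log(1/\theta)} \Big\}.$$

Similar to Lemma 2, and since $\theta = 8\delta < e^{-8}$, we have[3]

$$P_0(C_\ell) > 7/8.$$

Let $B_\ell$ be the event $\{I = \ell\}$, i.e., that the policy eventually selects coin $\ell$, and let $B_\ell^c$ be its complement. Since the policy is $(\varepsilon,\delta)$-correct with $\delta < \delta_0 < 1/2$, we must have

$$P_0(B_\ell^c) > 1/2, \qquad \forall\, \ell \in N(p,\varepsilon).$$

We also have $\sum_{\ell \in M(p,\varepsilon)} P_0(B_\ell) \le 1$, so that the inequality $P_0(B_\ell) > 1/2$ can hold for at most one element of $M(p,\varepsilon)$. Equivalently, the inequality $P_0(B_\ell^c) \le 1/2$ can hold for at most one element of $M(p,\varepsilon)$. Let

$$M_0(p,\varepsilon) = \Big\{ \ell \in M(p,\varepsilon) \text{ and } P_0(B_\ell^c) > \frac{1}{2} \Big\}.$$

It follows that $|M_0(p,\varepsilon)| \ge (|M(p,\varepsilon)|-1)^+$.

The following lemma is an analog of Lemma 4.

Lemma 6 Suppose that $\ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon)$ and that $E_0[T_\ell] \le t_\ell^*$. If the constant $c$ in the definition of $t_\ell^*$ is chosen large enough (possibly depending on $\underline{p}$), then $P_\ell(B_\ell^c) > \delta$.

Proof Fix some $\ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon)$. For future reference, we note that the definitions of $M(p,\varepsilon)$ and $N(p,\varepsilon)$ include the condition $p_\ell \ge (\varepsilon + p^*)/(1+\sqrt{1/2})$. Recalling that $p^* = p_1$, $p_\ell \le 1/2$, and using the definition $\Delta_\ell = p_1 - p_\ell \ge 0$, some easy algebra leads to the conditions

$$\frac{\varepsilon + \Delta_\ell}{1-p_\ell} \le \frac{\varepsilon + \Delta_\ell}{p_\ell} \le \frac{1}{\sqrt{2}}. \qquad (6)$$

We define the event $S_\ell$ by $S_\ell = A_\ell \cap B_\ell^c \cap C_\ell$. Since $P_0(A_\ell) \ge 3/4$, $P_0(B_\ell^c) > 1/2$, and $P_0(C_\ell) > 7/8$, we have

$$P_0(S_\ell) > \frac{1}{8}, \qquad \forall\, \ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon).$$

[3] The derivation is identical to Lemma 2 except for Eq. (2), where one should replace the assumption that $\theta < e^{-4}$ with the stricter assumption that $\theta < e^{-8}$ used here.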

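As in the proof of Lemma 4, we define the likelihood function $L_\ell$ by letting $L_\ell(w) = P_\ell(W = w)$, for every possible history $w$, and use again $L_\ell(W)$ to define the corresponding random variable. Let $K$ be a shorthand notation for $K_{T_\ell}^\ell$, the total number of unit rewards ("heads") obtained from coin $\ell$. We have

$$\frac{L_\ell(W)}{L_0(W)} = \frac{(p_1+\varepsilon)^K (1-p_1-\varepsilon)^{T_\ell-K}}{p_\ell^K (1-p_\ell)^{T_\ell-K}} = \Big(\frac{p_1}{p_\ell} + \frac{\varepsilon}{p_\ell}\Big)^K \Big(\frac{1-p_1}{1-p_\ell} - \frac{\varepsilon}{1-p_\ell}\Big)^{T_\ell-K} = \Big(1 + \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^K \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{T_\ell-K},$$

where we have used the definition $\Delta_\ell = p_1 - p_\ell$. It follows that

$$\begin{aligned}
\frac{L_\ell(W)}{L_0(W)} &= \Big(1 + \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^K \Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^K \Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^{-K} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{T_\ell-K} \\
&= \Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^K \Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^{-K} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{T_\ell-K} \\
&= \Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^K \Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^{-K} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{K(1-p_\ell)/p_\ell} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{(p_\ell T_\ell-K)/p_\ell}. \qquad (7)
\end{aligned}$$

We will now proceed to lower bound the right-hand side of Eq. (7) for histories under which event $S_\ell$ occurs. If event $S_\ell$ has occurred, then $A_\ell$ has occurred, and we have $K \le T_\ell \le 4t_\ell^*$, so that for every $\ell \in N(p,\varepsilon)$, we have

$$\begin{aligned}
\Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^K &\ge \Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^{4t_\ell^*} \\
&= \Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^{(4/(c\Delta_\ell^2))\log(1/\theta)} \\
&\overset{(a)}{\ge} \exp\Big\{ -d\,\frac{4}{c}\,\frac{\big((\varepsilon/\Delta_\ell)+1\big)^2}{p_\ell^2}\log(1/\theta) \Big\} \\
&\overset{(b)}{\ge} \exp\Big\{ -d\,\frac{16}{c\,p_\ell^2}\log(1/\theta) \Big\} \\
&= \theta^{16d/(p_\ell^2 c)}.
\end{aligned}$$

In step (a), we have used Lemma 3 which applies because of Eq. (6); in step (b), we used the fact $\varepsilon/\Delta_\ell \le 1$, which holds because $\ell \in N(p,\varepsilon)$.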

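Similarly, for $\ell \in M(p,\varepsilon)$, we have

$$\begin{aligned}
\Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^K &\ge \Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^{4t_\ell^*} \\
&= \Big(1 - \Big(\frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^2\Big)^{(4/(c\varepsilon^2))\log(1/\theta)} \\
&\overset{(a)}{\ge} \exp\Big\{ -d\,\frac{4}{c}\,\frac{\big(1+(\Delta_\ell/\varepsilon)\big)^2}{p_\ell^2}\log(1/\theta) \Big\} \\
&\overset{(b)}{\ge} \exp\Big\{ -d\,\frac{16}{c\,p_\ell^2}\log(1/\theta) \Big\} \\
&= \theta^{16d/(p_\ell^2 c)}.
\end{aligned}$$

In step (a), we have again used Lemma 3; in step (b), we used the fact $\Delta_\ell/\varepsilon \le 1$, which holds because $\ell \in M(p,\varepsilon)$.

We now bound the product of the second and third terms in Eq. (7). If $b \ge 1$, then the mapping $y \mapsto (1-y)^b$ is convex for $y \in [0,1]$. Thus, $(1-y)^b \ge 1 - by$, which implies that

$$\Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{(1-p_\ell)/p_\ell} \ge 1 - \frac{\varepsilon+\Delta_\ell}{p_\ell},$$

so that the product of the second and third terms can be lower bounded by

$$\Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^{-K} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{K(1-p_\ell)/p_\ell} \ge \Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^{-K} \Big(1 - \frac{\varepsilon+\Delta_\ell}{p_\ell}\Big)^{K} = 1.$$

We still need to bound the fourth term of Eq. (7). We start with the case where $\ell \in N(p,\varepsilon)$. We have

$$\begin{aligned}
\Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{(p_\ell T_\ell-K)/p_\ell} &\overset{(a)}{\ge} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{(1/p_\ell)\sqrt{t_\ell^*\log(1/\theta)}} \qquad (8) \\
&\overset{(b)}{=} \Big(1 - \frac{\varepsilon+\Delta_\ell}{1-p_\ell}\Big)^{(1/(p_\ell\sqrt{c}\,\Delta_\ell))\log(1/\theta)} \\
&\overset{(c)}{\ge} \exp\Big\{ -\frac{d}{\sqrt{c}}\,\frac{\varepsilon+\Delta_\ell}{\Delta_\ell\, p_\ell(1-p_\ell)}\log(1/\theta) \Big\} \qquad (9) \\
&\overset{(d)}{\ge} \exp\Big\{ -\frac{2d}{\sqrt{c}\,p_\ell(1-p_\ell)}\log(1/\theta) \Big\} \\
&\overset{(e)}{\ge} \exp\Big\{ -\frac{4d}{\sqrt{c}\,p_\ell}\log(1/\theta) \Big\} \qquad (10) \\
&= \theta^{4d/(p_\ell\sqrt{c})}.
\end{aligned}$$

Here, (a) holds because we are assuming that the events $A_\ell$ and $C_\ell$ occurred; (b) uses the definition of $t_\ell^*$ for $\ell \in N(p,\varepsilon)$; (c) follows from Eq. (6) and Lemma 3; (d) follows because $\Delta_\ell \ge \varepsilon$; and (e) holds because $0 \le p_\ell \le 1/2$, which implies that $1/(1-p_\ell) \le 2$.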

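Consider now the case where $\ell \in M_0(p,\varepsilon)$. Equation (8) holds for the same reasons as when $\ell \in N(p,\varepsilon)$. The only difference from the above calculation is in step (b), where $t_\ell^*$ should be replaced with $(1/(c\varepsilon^2))\log(1/\theta)$. Then, the right-hand side in Eq. (9) becomes

$$\exp\Big\{ -\frac{d}{\sqrt{c}}\,\frac{\varepsilon+\Delta_\ell}{\varepsilon\, p_\ell(1-p_\ell)}\log(1/\theta) \Big\}.$$

For $\ell \in M_0(p,\varepsilon)$, we have $\Delta_\ell \le \varepsilon$, which implies that $(\varepsilon+\Delta_\ell)/\varepsilon \le 2$, which then leads to the same expression as in Eq. (10). The rest of the derivation is identical. Summarizing the above, we have shown that if $\ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon)$, and event $S_\ell$ has occurred, then

$$\frac{L_\ell(W)}{L_0(W)} \ge \theta^{(4d/(p_\ell\sqrt{c}))+(16d/(p_\ell^2 c))}.$$

For $\ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon)$, we have $\underline{p} < p_\ell$. We can choose $c$ large enough so that $L_\ell(W)/L_0(W) \ge \theta = 8\delta$; the value of $c$ depends only on the constant $\underline{p}$. Similar to the proof of Theorem 1, we have

$$\frac{L_\ell(W)}{L_0(W)} 1_{S_\ell} \ge 8\delta\, 1_{S_\ell},$$

where $1_{S_\ell}$ is the indicator function of the event $S_\ell$. It follows that

$$P_\ell(B_\ell^c) \ge P_\ell(S_\ell) = E_\ell[1_{S_\ell}] = E_0\Big[\frac{L_\ell(W)}{L_0(W)} 1_{S_\ell}\Big] \ge E_0[8\delta\, 1_{S_\ell}] = 8\delta P_0(S_\ell) > \delta,$$

where the last inequality relies on the already established fact $P_0(S_\ell) > 1/8$.

Since the policy is $(\varepsilon,\delta)$-correct, we must have $P_\ell(B_\ell^c) \le \delta$, for every $\ell$. Lemma 6 then implies that $E_0[T_\ell] > t_\ell^*$ for every $\ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon)$. We sum over all $\ell \in M_0(p,\varepsilon) \cup N(p,\varepsilon)$, use the definition of $t_\ell^*$, together with the fact $|M_0(p,\varepsilon)| \ge (|M(p,\varepsilon)|-1)^+$, to conclude the proof of the theorem.

Remark 7 A close examination of the proof reveals that the dependence of $c_1$ on $\underline{p}$ is captured by a requirement of the form $c_1 \le c_2\,\underline{p}^2$, for some absolute constant $c_2$. This suggests that there is a tradeoff in the choice of $\underline{p}$. By choosing a large $\underline{p}$, the constant $c_1$ is made larger, but the sets $M$ and $N$ become smaller, and vice versa.

The preceding result may give the impression that the sample complexity is high only when the $p_i$ are bounded by $1/2$. The next result shows that similar lower bounds hold (with a different constant) whenever the $p_i$ can be assumed to be bounded away from 1. However, the lower bound becomes weaker (i.e., the constant $c_1$ is smaller) when the upper bound on the $p_i$ approaches 1. In fact, the dependence of a lower bound on $\varepsilon$ cannot be $\Theta(1/\varepsilon^2)$ when $\max_i p_i = 1$. To see this, consider the following policy $\pi'$.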
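Try each coin $O\big((1/\varepsilon)\log(n/\delta)\big)$ times. If one of the coins always resulted in heads, select it. Otherwise, use some $(\varepsilon,\delta)$-correct policy $\pi$. It can be shown that the policy $\pi'$ is $(\varepsilon,\delta)$-correct (for every $p \in [0,1]^n$), and that if $\max_i p_i = 1$, then $E_p[T] = O\big((n/\varepsilon)\log(n/\delta)\big)$.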
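The two-phase policy just described can be sketched in Python as follows. The constant hidden in the $O\big((1/\varepsilon)\log(n/\delta)\big)$ count is taken to be 1, and the fallback is a plain uniform-sampling rule sized by Hoeffding's inequality; both choices are assumptions, as are all function names. The text only requires the fallback to be some $(\varepsilon,\delta)$-correct policy.

```python
import math
import random


def naive_fallback(coins, eps, delta):
    """Assumed fallback: toss every coin equally often, return the empirical best."""
    n = len(coins)
    # Hoeffding: this many tosses put each empirical mean within eps/2 of its
    # bias with probability at least 1 - delta/n.
    m = math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta))
    heads = [sum(toss() for _ in range(m)) for toss in coins]
    best = max(range(n), key=lambda i: heads[i])
    return best, n * m


def two_phase_policy(coins, eps, delta, fallback=naive_fallback):
    """coins: list of zero-argument callables returning 0 or 1.
    Returns (selected coin index, total number of tosses)."""
    n = len(coins)
    m = math.ceil(math.log(n / delta) / eps)   # assumed constant 1 in the O(.)
    # Phase 1: toss each coin m times and record whether it showed only heads.
    only_heads = [all([toss() for _ in range(m)]) for toss in coins]
    trials = n * m
    for i, flag in enumerate(only_heads):
        if flag:
            return i, trials
    # Phase 2: no coin was perfect, so hand over to the fallback policy.
    best, extra = fallback(coins, eps, delta)
    return best, trials + extra


def bernoulli(p, rng):
    """Helper: a coin with bias p driven by the given random generator."""
    return lambda: 1 if rng.random() < p else 0
```

With $m = \lceil (1/\varepsilon)\log(n/\delta) \rceil$ tosses, a coin whose bias is at most $1-\varepsilon$ shows only heads with probability at most $(1-\varepsilon)^m \le e^{-\varepsilon m} \le \delta/n$, which motivates this choice of $m$. In this sketch the two phases can each err with probability up to $\delta$, so one way to account for both is to call it with $\delta/2$.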

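Theorem 8 Fix an integer $s \ge 2$, and some $\underline{p} \in (0,1/2)$. There exists a positive constant $c_1$ that depends only on $\underline{p}$ such that for every $\varepsilon \in (0,2^{-(s+2)})$, every $\delta \in (0,e^{-8}/8)$, every $p \in [0,1-2^{-s}]^n$, and every $(\varepsilon,\delta)$-correct policy, we have

$$E_p[T] \ge \frac{c_1}{s\eta^2} \Big\{ \frac{(|M(\tilde{p},\varepsilon\eta)|-1)^+}{\varepsilon^2} + \sum_{\ell \in N(\tilde{p},\eta\varepsilon)} \frac{1}{(\tilde{p}^*-\tilde{p}_\ell)^2} \Big\} \log\frac{1}{8\delta},$$

where $\tilde{p}^* = \max_i \tilde{p}_i$, $\eta = 2^{s+1}/s$, $\tilde{p}$ is the vector with components $\tilde{p}_i = 1-(1-p_i)^{1/s}$ ($i = 1,2,\ldots,n$), and $M$ and $N$ are as defined in Theorem 5.

Proof Let us fix $s \ge 2$, $\underline{p} \in (0,1/2)$, $\varepsilon \in (0,2^{-(s+2)})$, and $\delta \in (0,e^{-8}/8)$. Suppose that we have an $(\varepsilon,\delta)$-correct policy $\pi$ whose expected time to termination is $E_p[T]$, whenever the vector of coin biases happens to be $p$. We will use the policy $\pi$ to construct a new policy $\tilde{\pi}$ such that

$$P_{\tilde{p}}\Big(\tilde{p}_I > \max_i \tilde{p}_i - \eta\varepsilon\Big) \ge 1-\delta, \qquad \forall\, \tilde{p} \in [0,(1/2)+\eta\varepsilon]^n;$$

(we will then say that $\tilde{\pi}$ is $(\eta\varepsilon,\delta)$-correct on $[0,(1/2)+\eta\varepsilon]^n$). Finally, we will use the lower bounds from Theorem 5, applied to $\tilde{\pi}$, to obtain a lower bound on the sample complexity of $\pi$.

The new policy $\tilde{\pi}$ is specified as follows. Run the original policy $\pi$. Whenever $\pi$ chooses to try a certain coin $i$ once, policy $\tilde{\pi}$ tries coin $i$ for $s$ consecutive times. Policy $\tilde{\pi}$ then "feeds" $\pi$ with 0 if all $s$ trials resulted in 0, and "feeds" $\pi$ with 1 otherwise. If $\tilde{p}$ is the true vector of coin biases faced by policy $\tilde{\pi}$, and if policy $\pi$ chooses to sample coin $i$, then policy $\pi$ "sees" an outcome which equals 1 with probability $p_i = 1-(1-\tilde{p}_i)^s$. Let us define two mappings $f,g : [0,1] \to [0,1]$, which are inverses of each other, by

$$f(p_i) = 1-(1-p_i)^{1/s}, \qquad g(\tilde{p}_i) = 1-(1-\tilde{p}_i)^s,$$

and with a slight abuse of notation, let $f(p) = (f(p_1),\ldots,f(p_n))$, and similarly for $g(\tilde{p})$. With our construction, when policy $\tilde{\pi}$ is faced with a bias vector $\tilde{p}$, it evolves in an identical manner as the policy $\pi$ faced with a bias vector $p = g(\tilde{p})$. But under policy $\tilde{\pi}$, there are $s$ trials associated with every trial under policy $\pi$, which implies that $\tilde{T} = sT$ ($\tilde{T}$ is the number of trials under policy $\tilde{\pi}$) and therefore

$$E^{\tilde{\pi}}_{\tilde{p}}[\tilde{T}] = s\,E^{\pi}_{g(\tilde{p})}[T], \qquad E^{\tilde{\pi}}_{f(p)}[\tilde{T}] = s\,E^{\pi}_{p}[T], \qquad (11)$$

where the superscript in the expectation operator indicates the policy being used.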
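The reduction above can be illustrated with a short Python sketch: a wrapper that turns one trial requested by $\pi$ into $s$ tosses of the underlying coin and reports whether any of them came up heads, together with the two bias mappings. The names `g`, `f`, and `block_coin` are chosen here for illustration only.

```python
import random


def g(p_tilde, s):
    """Bias seen by the original policy: probability of at least one head in s tosses."""
    return 1.0 - (1.0 - p_tilde) ** s


def f(p, s):
    """Inverse of g: the per-toss bias that produces block bias p."""
    return 1.0 - (1.0 - p) ** (1.0 / s)


def block_coin(toss, s):
    """Wrap a coin so that one trial consists of s tosses, reported as their OR."""
    def trial():
        # Always perform all s tosses, so the wrapped policy uses exactly s
        # tosses per trial of the original policy.
        outcomes = [toss() for _ in range(s)]
        return 1 if any(outcomes) else 0
    return trial
```

For instance, wrapping a fair coin with `s = 2` gives a trial that returns 1 with probability `g(0.5, 2) = 0.75`, and `f(0.75, 2)` recovers the per-toss bias 0.5.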
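We will now determine the "correctness" guarantees of policy $\tilde{\pi}$. We first need some algebraic preliminaries. Let us fix some $\tilde{p} \in [0,(1/2)+\eta\varepsilon]^n$ and a corresponding vector $p$, related by $\tilde{p} = f(p)$ and $p = g(\tilde{p})$. Let also $p^* = \max_i p_i$ and $\tilde{p}^* = \max_i \tilde{p}_i$. Using the definition $\eta = 2^{s+1}/s$ and the assumption $\varepsilon < 2^{-(s+2)}$, we have $\tilde{p}^* \le (1/2) + (1/(2s))$, from which it follows that

$$p^* \le 1 - \Big(\frac{1}{2} - \frac{1}{2s}\Big)^s = 1 - \frac{1}{2^s}\Big(1 - \frac{1}{s}\Big)^s \le 1 - \frac{1}{2^s}\cdot\frac{1}{4} = 1 - 2^{-(s+2)}.$$

The derivative $f'$ of $f$ is monotonically increasing on $[0,1)$. Therefore,

$$f'(p^*) \le f'\big(1-2^{-(s+2)}\big) = \frac{1}{s}\Big(1-\big(1-2^{-(s+2)}\big)\Big)^{(1/s)-1} = \frac{1}{s}\,2^{-(s+2)(1-s)/s} = \frac{1}{s}\,2^{s+1-(2/s)} \le \frac{1}{s}\,2^{s+1} = \eta.$$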

Thus, the derivative of the inverse mapping $g$ satisfies $g'(\tilde{p}^*) \ge 1/\eta$, which implies, using the concavity of $g$, that

$$g(\tilde{p}^* - \eta\varepsilon) \le g(\tilde{p}^*) - g'(\tilde{p}^*)\varepsilon\eta \le g(\tilde{p}^*) - \varepsilon.$$

Let $I$ be the coin index finally selected by policy $\tilde{\pi}$ when faced with $\tilde{p}$, which is the same as the index chosen by $\pi$ when faced with $p$. We have (the superscript in the probability indicates the policy being used)

$$P^{\tilde{\pi}}_{\tilde{p}}(\tilde{p}_I > \tilde{p}^* - \eta\varepsilon) = P^{\tilde{\pi}}_{\tilde{p}}\big(g(\tilde{p}_I) > g(\tilde{p}^* - \eta\varepsilon)\big) \ge P^{\tilde{\pi}}_{\tilde{p}}\big(g(\tilde{p}_I) > g(\tilde{p}^*) - \varepsilon\big) = P^{\pi}_{p}(p_I > p^* - \varepsilon) \ge 1-\delta,$$

where the last inequality follows because policy $\pi$ was assumed to be $(\varepsilon,\delta)$-correct. We have therefore established that $\tilde{\pi}$ is $(\eta\varepsilon,\delta)$-correct on $[0,(1/2)+\eta\varepsilon]^n$. We now apply Theorem 5, with $\eta\varepsilon$ instead of $\varepsilon$. Even though that theorem is stated for a policy which is $(\varepsilon,\delta)$-correct for all possible $p$, the proof only requires the policy to be $(\varepsilon,\delta)$-correct for $p \in [0,(1/2)+\varepsilon]^n$. This gives a lower bound on $E^{\tilde{\pi}}_{\tilde{p}}[\tilde{T}]$ which, using Eq. (11), translates to the claimed lower bound on $E^{\pi}_{p}[T]$. This lower bound applies whenever $p = g(\tilde{p})$, for some $\tilde{p} \in [0,1/2]^n$, and therefore whenever $p \in [0,1-2^{-s}]^n$.

5. The Bayesian Setting

There is another variant of the problem which is of interest. In this variant, the parameters $p_i$ associated with each arm are not unknown constants, but random variables described by a given prior. In this case, there is a single underlying probability measure which we denote by $P$, and which is the average of the measures $P_p$ over the prior distribution of $p$. We also use $E$ to denote the expectation with respect to $P$. We then define a policy to be $(\varepsilon,\delta)$-correct, for a particular prior and associated measure $P$, if

$$P\Big(p_I > \max_i p_i - \varepsilon\Big) \ge 1-\delta.$$

We then have the following result.

Theorem 9 There exist positive constants $c_1$, $c_2$, $\varepsilon_0$, and $\delta_0$, such that for every $n \ge 2$ and $\varepsilon \in (0,\varepsilon_0)$, there exists a prior for the $n$-bandit problem such that for every $\delta \in (0,\delta_0)$, and $(\varepsilon,\delta)$-correct policy for this prior, we have

$$E[T] \ge c_1 \frac{n}{\varepsilon^2}\log\frac{c_2}{\delta}.$$

In particular, $\varepsilon_0$ and $\delta_0$ can be taken equal to $1/8$ and $e^{-4}/12$, respectively.

15 EXPLORATION IN MULTI-ARMED BANDITS Proof Let ε 0 = 1/8 ad δ 0 = e 4 /1, ad et us fix ε 0,ε 0 ) ad δ 0,δ 0 ). Cosider the hypotheses H 0,...,H, itroduced i the proof of Theorem 1. Let the prior probabiity of H 0 be 1/, ad the prior probabiity of H be 1/, for = 1,...,. Fix a ε/,δ)-correct poicy with respect to this prior, ad ote that it satisfies E[T ] 1 E 0[T ] 1 =1 E 0 [T ]. 1) Sice the poicy is ε/,δ)-correct, we have Pp I > max ε/)) 1 δ. As i the proof of Theorem 5, et B be the evet that the poicy evetuay seects coi. We have 1 P 0B 0 ) + 1 P B ) 1 δ, which impies that 1 =1 =1 P B 0 ) δ. 13) Let G be the set of hypotheses 0 uder which the probabiity of seectig coi 0 is at most 3δ, i.e., G = : 1, P B 0 ) 3δ}. From Eq. 13), we obtai 1 G )3δ < δ, which impies that G > /3. Foowig the same argumet as i the proof of Lemma 4, we obtai that there exists a costat c such that if δ 0,e 4 /4) ad E 0 [T ] 1/cε )og1/4δ ), the P B 0 ) > δ. By takig δ = 3δ ad requirig that δ 0,e 4 /1), we see that the iequaity E 0 [T ] 1/cε )og1/1δ) impies that P B 0 ) > 3δ here, c is the same costat as i Lemma 4). But for every G we have P B 0 ) 3δ, ad therefore E 0 [T ] 1/cε )og1/1δ). The, Eq. 1) impies that E[T ] 1 E 0 [T ] G 1 G cε og 1 1δ c 1 ε og c δ, where we have used the fact G > /3 i the ast iequaity. To cocude, we have show that there exists costats c 1 ad c ad a prior for a probem with + 1 cois, such that ay ε/,δ)-correct poicy satisfies E[T ] c 1 /ε )ogc /δ). The resut foows by takig a arger costat c 1 to accout for havig + 1 ad ot cois, ad ε istead of ε/). 6. Regret Bouds I this sectio we cosider ower bouds o the regret of ay poicy, ad show that oe ca derive the Θogt) regret boud of Lai ad Robbis 1985) usig the techiques i this paper. The resuts of Lai ad Robbis 1985) are asymptotic as t, whereas ours dea with fiite times t. Our ower boud has simiar depedece i t as the upper bouds give by Auer et a. 00a) for some 637

natural sampling algorithms. As in Lai and Robbins (1985) and Auer et al. (2002a), we also show that when $t$ is large, the regret depends linearly on the number of coins.

Given a policy, let $S_t$ be the total number of unit rewards ("heads") obtained in the first $t$ time steps. The regret by time $t$ is denoted by $R_t$, and is defined by
$$R_t = t \max_i p_i - S_t.$$
Note that the regret is a random variable that depends on the results of the coin tosses as well as of the randomization carried out by the policy.

Theorem 10 There exist positive constants $c_1, c_2, c_3, c_4$, and a constant $c_5$, such that for every $n \ge 2$, and for every policy, there exists some $p \in [0,1]^n$ such that for all $t \ge 1$,
$$E_p[R_t] \ge \min\{c_1 t,\ c_2 n + c_3 t,\ c_4 n(\log t - \log n + c_5)\}. \qquad (14)$$
The inequality (14) suggests that there are essentially two regimes for the expected regret. When $n$ is large compared to $t$, the expected regret is linear in $t$. When $t$ is large compared to $n$, the regret behaves like $\log t$, but depends linearly on $n$.

Proof We will prove a stronger result, by considering the regret in a Bayesian setting. By proving that the expectation with respect to the prior is lower bounded by the right-hand side in Eq. (14), it will follow that the bound also holds for at least one of the hypotheses. Consider the same scenario as in Theorem 1, where we have $n+1$ coins and $n+1$ hypotheses $H_0, H_1, \ldots, H_n$. The prior assigns a probability of $1/2$ to $H_0$, and a probability of $1/2n$ to each of the hypotheses $H_1, H_2, \ldots, H_n$. Similar to Theorem 1, we will use the notation $E_\ell$ and $P_\ell$ to denote expectation and probability when the $\ell$th hypothesis is true, and $E$ to denote expectation with respect to the prior.

Let us fix $t$ for the rest of the proof. We define $T_\ell$ as the number of times coin $\ell$ is tried in the first $t$ time steps. The expected regret when $H_0$ is true is
$$E_0[R_t] = \frac{\epsilon}{2} \sum_{\ell=1}^{n} E_0[T_\ell],$$
and the expected regret when $H_\ell$ ($\ell = 1,\ldots,n$) is true is
$$E_\ell[R_t] = \frac{\epsilon}{2} E_\ell[T_0] + \epsilon \sum_{i \ne 0,\ell} E_\ell[T_i],$$
so that the expected (Bayesian) regret is
$$E[R_t] = \frac{1}{2}\cdot\frac{\epsilon}{2} \sum_{\ell=1}^{n} E_0[T_\ell] + \frac{\epsilon}{2}\cdot\frac{1}{2n} \sum_{\ell=1}^{n} E_\ell[T_0] + \frac{\epsilon}{2n} \sum_{\ell=1}^{n} \sum_{i \ne 0,\ell} E_\ell[T_i]. \qquad (15)$$
Let $D$ be the event that coin 0 is tried at least $t/2$ times, i.e.,
$$D = \{T_0 \ge t/2\}.$$
We consider separately the two cases $P_0(D) < 3/4$ and $P_0(D) \ge 3/4$. Suppose first that $P_0(D) < 3/4$. In that case, $E_0[T_0] < 7t/8$, so that $\sum_{\ell=1}^{n} E_0[T_\ell] \ge t/8$. Substituting in Eq. (15), we obtain $E[R_t] \ge \epsilon t/32$. This gives the first term in the right-hand side of Eq. (14), with $c_1 = \epsilon/32$.

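We assume from now on that $P_0(D) \ge 3/4$. Rearranging Eq. (15), and omitting the third term, we have
$$E[R_t] \ge \frac{\epsilon}{4} \sum_{\ell=1}^{n} \Big( E_0[T_\ell] + \frac{1}{n} E_\ell[T_0] \Big).$$
Since $E_\ell[T_0] \ge (t/2) P_\ell(D)$, we have
$$E[R_t] \ge \frac{\epsilon}{4} \sum_{\ell=1}^{n} \Big( E_0[T_\ell] + \frac{t}{2n} P_\ell(D) \Big). \qquad (16)$$
For every $\ell \ne 0$, let us define $\delta_\ell$ by
$$E_0[T_\ell] = \frac{1}{c\epsilon^2} \log\frac{1}{4\delta_\ell}.$$
(Such a $\delta_\ell$ exists because of the monotonicity of the mapping $x \mapsto \log(1/x)$.) Let $\delta_0 = e^{-4}/4$. If $\delta_\ell < \delta_0$, we argue exactly as in Lemma 4, except that the event $B$ in that lemma is replaced by event $D$. Since $P_0(D) \ge 3/4$, the same proof applies and shows that $P_\ell(D) \ge \delta_\ell$, so that
$$E_0[T_\ell] + \frac{t}{2n} P_\ell(D) \ge \frac{1}{c\epsilon^2} \log\frac{1}{4\delta_\ell} + \frac{t}{2n}\,\delta_\ell.$$
If on the other hand, $\delta_\ell \ge \delta_0$, then $E_0[T_\ell] \le (1/c\epsilon^2)\log(1/4\delta_0)$, which implies (by the earlier analogy with Lemma 4) that $P_\ell(D) \ge \delta_0$, and
$$E_0[T_\ell] + \frac{t}{2n} P_\ell(D) \ge \frac{1}{c\epsilon^2} \log\frac{1}{4\delta_\ell} + \frac{t}{2n}\,\delta_0.$$
Using the above bounds in Eq. (16), we obtain
$$E[R_t] \ge \frac{\epsilon}{4} \sum_{\ell=1}^{n} \Big( \frac{1}{c\epsilon^2} \log\frac{1}{4\delta_\ell} + h(\delta_\ell)\,\frac{t}{2n} \Big), \qquad (17)$$
where $h(\delta) = \delta$ if $\delta < \delta_0$, and $h(\delta) = \delta_0$ otherwise. We can now view the $\delta_\ell$ as free parameters, and conclude that $E[R_t]$ is lower bounded by the minimum of the right-hand side of Eq. (17), over all $\delta_\ell$. When optimizing, all the $\delta_\ell$ will be set to the same value. The minimizing value can be $\delta_0$, in which case we have
$$E[R_t] \ge \frac{n}{4c\epsilon} \log\frac{1}{4\delta_0} + \frac{\delta_0 \epsilon}{8}\, t.$$
Otherwise, the minimizing value is $\delta_\ell = n/2ct\epsilon^2$, in which case we have
$$E[R_t] \ge \Big( \frac{1}{16c\epsilon} + \frac{1}{4c\epsilon} \log(c\epsilon^2/2) \Big) n + \frac{1}{4c\epsilon}\, n \log(1/n) + \frac{n}{4c\epsilon} \log t.$$
Thus, the theorem holds with $c_2 = (1/4c\epsilon)\log(1/4\delta_0)$, $c_3 = \delta_0\epsilon/8$, $c_4 = 1/4c\epsilon$, and $c_5 = (1/4) + \log(c\epsilon^2/2)$.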
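The constants obtained in the proof can be evaluated numerically to see which term of the minimum in Eq. (14) is active for given $t$ and $n$. The following is only an illustrative sketch: the function name is hypothetical, and the values $\epsilon = 0.1$ and $c = 1$ are assumptions made for the example (the text does not give a numerical value for the constant $c$ of Lemma 4).

```python
import math


def regret_lower_bound(t, n, eps=0.1, c=1.0):
    """Evaluate the right-hand side of Eq. (14) using the constants from the proof.

    eps and c are illustrative assumptions; c stands for the unspecified
    constant of Lemma 4. Returns (value, index of the active term).
    """
    delta0 = math.exp(-4) / 4
    c1 = eps / 32
    c2 = math.log(1 / (4 * delta0)) / (4 * c * eps)
    c3 = delta0 * eps / 8
    c4 = 1 / (4 * c * eps)
    c5 = 0.25 + math.log(c * eps**2 / 2)
    terms = [
        c1 * t,                                      # linear regime
        c2 * n + c3 * t,                             # intermediate term
        c4 * n * (math.log(t) - math.log(n) + c5),   # logarithmic regime
    ]
    active = min(range(3), key=lambda i: terms[i])
    return terms[active], active
```

Under these assumed values, `regret_lower_bound(10_000, 10)` is governed by the linear term $c_1 t$, while `regret_lower_bound(10**9, 10)` is governed by the logarithmic term.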

7. Permutations

We now consider the case where the coin biases $p_i$ are known up to a permutation. More specifically, we are given a vector $q \in [0,1]^n$, and we are told that the true vector $p$ of coin biases is of the form $p = q \circ \sigma$, where $\sigma$ is an unknown permutation of the set $\{1,\ldots,n\}$, and where $q \circ \sigma$ stands for permuting the components of the vector $q$ according to $\sigma$, i.e., $(q \circ \sigma)_\ell = q_{\sigma(\ell)}$. We say that a policy is $(q,\epsilon,\delta)$-correct if the coin $I$ eventually selected satisfies
$$P_{q \circ \sigma}\Big(p_I > \max_\ell q_\ell - \epsilon\Big) \ge 1 - \delta,$$
for every permutation $\sigma$ of the set $\{1,\ldots,n\}$.

We start with an $O((n + \log(1/\delta))/\epsilon^2)$ upper bound on the expected number of trials, which is significantly smaller than the bound obtained when the coin biases are completely unknown (cf. Sections 3 and 4). We also provide a lower bound which is within a constant factor of our upper bound.

We then consider a different measure of sample complexity: instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials. We show that for every $(q,\epsilon,\delta)$-correct policy, there is a $\Theta((n/\epsilon^2)\log(1/\delta))$ lower bound on the maximum number of trials. We note that in the median elimination algorithm of Even-Dar et al. (2002), the length of all sample paths is the same and within a constant factor from our lower bound. Hence our bound is again tight.

We therefore see that for the permutation case, the sample complexity depends critically on whether our criterion involves the expected or maximum number of trials. This is in contrast to the general case considered in Section 3: the lower bound in that section applies under both criteria, as does the matching upper bound from Even-Dar et al. (2002).

7.1 An Upper Bound on the Expected Number of Trials

Suppose we are given a vector $q \in [0,1]^n$, and we are told that the true vector $p$ of coin biases is a permutation of $q$. The policy in Table 1 takes as input the accuracy $\epsilon$, the confidence parameter $\delta$, and the vector $q$. In fact the policy only needs to know the bias of the best coin, which we denote by $q^* = \max_\ell q_\ell$. The policy also uses an additional parameter $\delta' \in (0,1/2]$.
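The following theorem establishes the correctness of the policy, and provides an upper bound on the expected number of trials.

Theorem 11 For every $\delta' \in (0,1/2]$, $\epsilon \in (0,1)$, and $\delta \in (0,1)$, the policy in Table 1 is guaranteed to terminate after a finite number of steps, with probability 1, and is $(q,\epsilon,\delta)$-correct. For every permutation $\sigma$, the expected number of trials satisfies
$$E_{q \circ \sigma}[T] \le \frac{1}{\epsilon^2} \Big( c_1 n + c_2 \log\frac{1}{\delta} \Big),$$
for some positive constants $c_1$ and $c_2$ that depend only on $\delta'$.

Proof We start with a useful calculation. Suppose that at iteration $k$, the median elimination algorithm selects a coin $I_k$ whose true bias is $p_{I_k}$. Then, using the Hoeffding inequality, we have
$$P\big(\hat p_k - p_{I_k} \ge \epsilon/3\big) \le \exp\{-2(\epsilon/3)^2 m_k\} \le \frac{\delta}{2^k}, \qquad (18)$$
and the same bound holds for a deviation of $\hat p_k$ below $p_{I_k}$ by at least $\epsilon/3$.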

Input: Accuracy and confidence parameters $\epsilon \in (0,1)$ and $\delta \in (0,1)$; the bias of the best coin $q^*$.
Parameter: $\delta' \le 1/2$.
0. $k = 1$;
1. Run the median elimination algorithm to find a coin $I_k$ whose bias is within $\epsilon/3$ of $q^*$, with probability at least $1 - \delta'$.
2. Try coin $I_k$ for $m_k = \lceil (9/2\epsilon^2)\log(2^k/\delta) \rceil$ times. Let $\hat p_k$ be the fraction of these trials that result in heads.
3. If $\hat p_k \ge q^* - 2\epsilon/3$ declare that coin $I_k$ is an $\epsilon$-optimal coin and terminate.
4. Set $k := k + 1$ and go back to Step 1.

Table 1: A policy for finding an $\epsilon$-optimal coin when the bias of the best coin is known.

Let $K$ be the number of iterations until the policy terminates. Given that $K > k - 1$ (i.e., the policy did not terminate in the first $k - 1$ iterations), there is probability at least $1 - \delta' \ge 1/2$ that $p_{I_k} \ge q^* - (\epsilon/3)$, in which case, from Eq. (18), there is probability at least $1 - (\delta/2^k) \ge 1/2$ that $\hat p_k \ge q^* - (2\epsilon/3)$. Thus, $P(K > k \mid K > k - 1) \le 1 - \eta$, with $\eta = 1/4$. Consequently, the probability that the policy does not terminate by the $k$th iteration, $P(K > k)$, is bounded by $(1 - \eta)^k$. Thus, the probability that the policy never terminates is bounded above by $(3/4)^k$ for all $k$, and is therefore 0.

We now bound the expected number of trials. Let $c$ be such that the number of trials in one execution of the median elimination algorithm is bounded by $(cn/\epsilon^2)\log(1/\delta')$. Then, the number of trials, $t(k)$, during the $k$th iteration is bounded by $(cn/\epsilon^2)\log(1/\delta') + m_k$. It follows that the expected total number of trials under our policy is bounded by
$$\sum_{k=1}^{\infty} P(K \ge k)\, t(k) \le \frac{1}{\epsilon^2} \sum_{k=1}^{\infty} (1 - \eta)^{k-1} \Big( cn\log(1/\delta') + (9/2)\log(2^k/\delta) + 1 \Big)$$
$$= \frac{1}{\epsilon^2} \sum_{k=1}^{\infty} (1 - \eta)^{k-1} \Big( cn\log(1/\delta') + (9/2)\log(1/\delta) + (9k/2)\log 2 + 1 \Big)$$
$$\le \frac{1}{\epsilon^2} \big( c_1 n + c_2 \log(1/\delta) \big),$$
for some positive constants $c_1$ and $c_2$.

We finally argue that the policy is $(q,\epsilon,\delta)$-correct. For the policy to select a coin $I$ with bias $p_I \le q^* - \epsilon$, it must be that at some iteration $k$, a coin $I_k$ with $p_{I_k} \le q^* - \epsilon$ was obtained, but $\hat p_k$ came out larger than $q^* - 2\epsilon/3$. From Eq. (18), for any fixed $k$, the probability of this occurring is bounded by $\delta/2^k$. By the union bound, the probability that $p_I \le q^* - \epsilon$ is bounded by $\sum_{k=1}^{\infty} \delta/2^k = \delta$.
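The steps of Table 1 can be sketched in a short simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, coins are simulated from a list of true biases, and the median elimination routine uses a sampling schedule assumed here (start from $\epsilon/4$ and $\delta/2$, keep the better half of the coins each round, then shrink the accuracy by a factor $3/4$ and halve the confidence parameter).

```python
import math
import random


def empirical_mean(bias, m, rng):
    """Fraction of heads in m independent tosses of a coin with the given bias."""
    return sum(rng.random() < bias for _ in range(m)) / m


def median_elimination(biases, eps, delta, rng):
    """Return the index of a coin intended to be eps-optimal w.p. >= 1 - delta.

    The schedule below is an assumption of this sketch.
    """
    arms = list(range(len(biases)))
    eps_l, delta_l = eps / 4, delta / 2
    while len(arms) > 1:
        m = math.ceil(4 / eps_l**2 * math.log(3 / delta_l))
        means = {a: empirical_mean(biases[a], m, rng) for a in arms}
        arms.sort(key=lambda a: means[a], reverse=True)
        arms = arms[: (len(arms) + 1) // 2]   # keep the better half
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2
    return arms[0]


def known_best_bias_policy(biases, q_star, eps, delta, delta_prime=0.5, seed=0):
    """Policy of Table 1; returns (selected coin, number of iterations K)."""
    rng = random.Random(seed)
    k = 1
    while True:
        # Step 1: candidate coin within eps/3 of q_star w.p. >= 1 - delta_prime.
        coin = median_elimination(biases, eps / 3, delta_prime, rng)
        # Step 2: validate the candidate with m_k fresh tosses.
        m_k = math.ceil(9 / (2 * eps**2) * math.log(2**k / delta))
        p_hat = empirical_mean(biases[coin], m_k, rng)
        # Step 3: accept if the empirical mean is close enough to q_star.
        if p_hat >= q_star - 2 * eps / 3:
            return coin, k
        # Step 4: otherwise start a new iteration.
        k += 1
```

For example, `known_best_bias_policy([0.1, 0.9, 0.1, 0.1], q_star=0.9, eps=0.6, delta=0.1)` runs the loop on four simulated coins. The point of the design is visible in Step 3: the test compares only against the known value $q^*$, so the policy can stop without examining the remaining coins again.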
Remark 12 The knowledge of $q^*$ turns out to be significant: it enables the policy to terminate as soon as there is high confidence that a coin has been found whose bias is larger than $q^* - \epsilon$, without having to check the other coins. A policy of this type would not work for the hypotheses

considered in the proofs of Theorems 1 and 5: under those hypotheses, the value of $q^*$ is not a priori known. We note that Theorem 11 disagrees with a lower bound in a preliminary version (Mannor and Tsitsiklis, 2003) of this paper. It turns out that the latter lower bound is only valid under an additional restriction on the set of policies, which will be the subject of Section 7.3.

7.2 A Lower Bound

We now prove that the upper bound in Theorem 11 is tight, within a constant.

Theorem 13 There exist positive constants $c_1$, $c_2$, $\epsilon_0$, and $\delta_1$, such that for every $n \ge 2$ and $\epsilon \in (0,\epsilon_0)$, there exists some $q \in [0,1]^n$, such that for every $\delta \in (0,\delta_1)$ and every $(q,\epsilon,\delta)$-correct policy, there exists some permutation $\sigma$ such that
$$E_{q \circ \sigma}[T] \ge \frac{1}{\epsilon^2} \Big( c_1 n + c_2 \log\frac{1}{\delta} \Big).$$

Proof Let $\epsilon_0 = 1/4$ and let $\delta_1 = \delta_0/5$, where $\delta_0$ is the same constant as in Theorem 5. Let us fix some $n \ge 2$ and $\epsilon \in (0,\epsilon_0)$. We will establish the claimed lower bound for
$$q = (0.5 + \epsilon,\ 0.5 - \epsilon,\ \ldots,\ 0.5 - \epsilon), \qquad (19)$$
and for every $\delta \in (0,\delta_1)$. In fact, it is sufficient to establish a lower bound of the form $(c_2/\epsilon^2)\log(1/\delta)$ and a lower bound of the form $c_1 n/\epsilon^2$. We start with the former.

Part I. Let us consider the following three hypothesis testing problems. For each problem, we are interested in a $\delta$-correct policy, i.e., a policy whose probability of error is less than $\delta$ under any hypothesis. We will show that a $\delta$-correct policy for the first problem can be used to construct a $\delta$-correct policy for the third problem, with the same sample complexity, and then apply Theorem 5 to obtain a lower bound.

$\Pi_1$: We have two coins and the bias vector is either $(0.5 - \epsilon,\ 0.5 + \epsilon)$ or $(0.5 + \epsilon,\ 0.5 - \epsilon)$. We wish to determine the best coin. This is a special case of our permutation problem, with $n = 2$.

$\Pi_2$: We have a single coin whose bias is either $0.5 - \epsilon$ or $0.5 + \epsilon$, and we wish to determine the bias of the coin.⁴

$\Pi_3$: We have two coins and the bias vector can be $(0.5,\ 0.5 - \epsilon)$, $(0.5 + \epsilon,\ 0.5 - \epsilon)$, or $(0.5,\ 0.5 + \epsilon)$. We wish to determine the best coin.

Consider a $\delta$-correct policy for problem $\Pi_1$ except that the coin outcomes are encoded as follows.
Whenever coin 1 is tried, record the outcome unchanged; whenever coin 2 is tried, record the opposite of the outcome (i.e., record a 0 outcome as a 1, and vice versa). Under the first hypothesis in problem $\Pi_1$, every trial (no matter which coin was tried) has probability $0.5 - \epsilon$ of being equal to 1, and under the second hypothesis has probability $0.5 + \epsilon$ of being equal to 1. With this encoding, it is apparent that the information provided by a trial of either coin in problem $\Pi_1$ is the same as the information provided by a trial of the single coin in problem $\Pi_2$. Thus, a $\delta$-correct policy for $\Pi_1$ translates to a $\delta$-correct policy for $\Pi_2$, with the same sample complexity.

4. A lower bound for this problem was provided in Lemma 5.1 from Anthony and Bartlett (1999). However, that bound is only established for policies with an a priori fixed number of trials, whereas our policies allow the number of trials to be determined adaptively, based on observed outcomes.

In problem $\Pi_3$, note that coin 2 is the best coin if and only if its bias is equal to $0.5 + \epsilon$ (as opposed to $0.5 - \epsilon$). Thus, any $\delta$-correct policy for $\Pi_2$ can be applied to the second coin in $\Pi_3$, to yield a $\delta$-correct policy for $\Pi_3$ with the same sample complexity.

We now observe that problem $\Pi_3$ involves a set of three hypotheses, of the form considered in the proof of Theorem 5, for the case of two coins. More specifically, in terms of the notation used in that proof, we have $p = (0.5,\ 0.5 - \epsilon)$, and $N(p,\epsilon) = \{2\}$. It follows that the sample complexity of any $\delta$-correct policy for $\Pi_3$ is lower bounded by $(c_1/\epsilon^2)\log(1/8\delta)$, where $c_1$ is the constant in Theorem 5.⁵ Because of the relation between the three problems established earlier, the same lower bound applies to any $\delta$-correct policy for problem $\Pi_1$, which is the permutation problem of interest.

We have so far established a lower bound proportional to $(1/\epsilon^2)\log(1/\delta)$ for problem $\Pi_1$, which is the permutation problem we are interested in, with a $q$ vector of the form (19), for the case $n = 2$. Consider now the permutation problem for the $q$ vector in (19), but for general $n$. If we are given the information that the best coin can only be one of the first two coins, we obtain problem $\Pi_1$. In the absence of this side-information, the permutation problem cannot be any easier. This shows that the same lower bound holds for every $n \ge 2$.

Part II. We now continue with the second part of the proof. We will establish a lower bound of the form $c_1 n/\epsilon^2$ for the permutation problem associated with the bias vector $q$ introduced in Eq. (19), to be referred to as problem $\Pi$. The proof involves a reduction of $\Pi$ to a problem $\tilde\Pi$ of the form considered in the proof of Theorem 5. The problem $\tilde\Pi$ involves $n + 1$ coins (coins $0,1,\ldots,n$) and the following $n + 2$ hypotheses:
$$H_0' : p_0 = 0.5,\quad p_i = 0.5 - \epsilon, \text{ for } i \ne 0,$$
$$H_0'' : p_0 = 0.5 + \epsilon,\quad p_i = 0.5 - \epsilon, \text{ for } i \ne 0,$$
and
$$H_\ell : p_0 = 0.5,\quad p_\ell = 0.5 + \epsilon,\quad p_i = 0.5 - \epsilon, \text{ for } i \ne 0,\ell.$$
Note that the best coin is coin 0 under either hypothesis $H_0'$ or $H_0''$, and the best coin is coin $\ell$ under $H_\ell$, for $\ell \ge 1$. This leads us to define $H_0$ as the hypothesis that either $H_0'$ or $H_0''$ is true. We say that a policy for $\tilde\Pi$ is $\delta$-correct if it selects the best coin with probability at least $1 - \delta$. We will show that if we have a $(q,\epsilon,\delta)$-correct policy $\pi$ for $\Pi$, with a certain sample complexity, we can construct an $(\epsilon, 5\delta)$-correct policy $\tilde\pi$ with a related sample complexity. We will then apply Theorem 5 to lower bound the sample complexity of $\tilde\pi$, and finally translate to a lower bound for $\pi$.

The idea of the reduction is as follows. In problem $\tilde\Pi$, if we knew that $H_0$ is not true, we would be left with a permutation problem with $n$ coins, to which $\pi$ could be applied. However, if $H_0$ is true, the behavior of $\pi$ is unpredictable. (In particular, $\pi$ might not terminate, or it might terminate with an arbitrary decision: this is because we are only assuming that $\pi$ behaves properly when faced with the permutation problem $\Pi$.) If $H_0$ is true, we can replace coin 1 with a coin whose bias is $0.5 + \epsilon$, resulting in the bias vector $q$, in which case $\pi$ is guaranteed to work properly. But what if we replace coin 1 as above, but some $H_\ell$, $\ell \ne 0,1$, happens to be true? In that case, there will be two coins with bias $0.5 + \epsilon$ and $\pi$ may misbehave. The solution is to run two processes in parallel, one with and one without this modification, in which case one of the two will have certain performance guarantees that we can exploit.

5. Although Theorem 5 is stated for $(\epsilon,\delta)$-correct policies, it is clear from the proof that the lower bound applies to any policy that has the desired behavior under all of the hypotheses considered in the proof.

Consider the $(q,\epsilon,\delta)$-correct policy $\pi$ for problem $\Pi$. Let $t_\pi$ be the maximum (over all permutations $\sigma$) expected time until $\pi$ terminates when the true coin bias vector is $q \circ \sigma$. We define two more bias vectors that will be used below:
$$q^- = (0.5 - \epsilon,\ \ldots,\ 0.5 - \epsilon),$$
and
$$q^+ = (0.5 + \epsilon,\ 0.5 + \epsilon,\ 0.5 - \epsilon,\ \ldots,\ 0.5 - \epsilon).$$
Note that if $H_0$ is true in problem $\tilde\Pi$, and $\pi$ is applied to coins $1,\ldots,n$, then $\pi$ will be faced with the bias vector $q^-$. Also, if $H_\ell$ is true in problem $\tilde\Pi$, for some $\ell \ne 0,1$, and we modify the bias of coin 1 to $0.5 + \epsilon$, then policy $\pi$ will be faced with the bias vector $q^+$.

Let us note for future reference that, as in Eq. (18), if we sample a coin with bias $0.5 + \epsilon$ for $m = (1/\epsilon^2)\log(1/\delta)$ times, the empirical mean reward is larger than 0.5 with probability at least $1 - \delta$. Similarly, if we sample a coin with bias $0.5 - \epsilon$ for $m$ times, the empirical mean reward is smaller than 0.5 with probability at least $1 - \delta$. Sampling a specific coin that many times, and comparing the empirical mean to 0.5, will be referred to as validating the coin.

We now describe policy $\tilde\pi$ for problem $\tilde\Pi$. The policy involves two parallel processes A and B: it alternates between the two processes, and each process samples one coin in each one of its turns. The processes continue to sample the coins alternately until one of them terminates and declares one of the coins as the best coin (or equivalently selects a hypothesis). The parameter $k$ below is set to $k = \lceil 18\log(1/\delta) \rceil$.

A: Apply policy $\pi$ to coins $1,2,\ldots,n$. If $\pi$ terminates and selects coin $\ell$, validate coin $\ell$ by sampling it $m$ times. If the empirical mean reward is more than 0.5, then A terminates and declares coin $\ell$ as the best coin. If the empirical mean reward is less than or equal to 0.5, then A terminates and declares coin 0 as the best coin. If $\pi$ has carried out $\lceil t_\pi/\delta \rceil$ trials, then A terminates and declares coin 0 as the best coin.

B: Sample coin 1 for $m$ times. If the empirical mean reward is more than 0.5, then B terminates and declares coin 1 as the best coin. Otherwise, replace coin 1 with another coin whose bias is $0.5 + \epsilon$.
Initialize a counter $N$ with $N = 0$. Repeat $k$ times the following:
(a) Pick a random permutation $\tau$ (uniformly over the set of permutations).
(b) Run a $\tau$-permuted version of $\pi$, to be referred to as $\tau \circ \pi$; that is, whenever $\pi$ is supposed to sample coin $i$, $\tau \circ \pi$ samples coin $\tau(i)$ instead.
(c) If $\tau \circ \pi$ terminates and selects coin 1 as the best coin, set $N := N + 1$.
If $N > 2k/3$, then B terminates and declares coin 0 as the best coin. Otherwise, wait until process A terminates.

We first address the issue of correctness of policy $\tilde\pi$. Note that $\tilde\pi$ is guaranteed to terminate in finite time, because process A can only run for a bounded number of steps. We consider separately the following cases:

1. Process A terminates first, $H_0$ is true. In this case the true bias vector faced by $\pi$ is $q^-$ rather than $q$. An error can happen only if $\pi$ terminates and the coin erroneously passes the validation test. The probability of this occurring is at most $\delta$. In all other cases (validation fails or the running time exceeds the time limit $\lceil t_\pi/\delta \rceil$), process A correctly declares $H_0$ to be the true hypothesis.

2. Process A terminates first, $H_\ell$ is true for some $\ell \ne 0$. Process A does not declare the correct $H_\ell$ if one of the following events occurs: A fails to select the best coin (probability at most $\delta$, since $\pi$ is $(q,\epsilon,\delta)$-correct); or the validation makes an error (probability at most $\delta$); or the time limit is exceeded. By Markov's inequality, the probability that the running time $T_\pi$ of policy $\pi$ exceeds the time limit satisfies
$$P_q(T_\pi \ge t_\pi/\delta) \le E_q[T_\pi]\,\delta/t_\pi \le t_\pi\,\delta/t_\pi = \delta.$$
So, the total probability of errors that fall within this case is bounded by $3\delta$.

3. Process B terminates first, $H_0$ is true. Note that under $H_0$, process B is faced with $n$ coins whose bias vector is $q$. Process B does not declare $H_0$ if one of the following events occurs: the initial validation of coin 1 makes an error (probability at most $\delta$); or in $k$ runs, the permuted versions of policy $\pi$ make at least $k/3$ errors. Each such run has probability at most $\delta$ of making an error (since $\pi$ is $(q,\epsilon,\delta)$-correct), independently for each run (because we use a random permutation before each run). Using Hoeffding's inequality, the probability of at least $k/3$ errors is bounded by $\exp\{-2k(1/3 - \delta)^2\}$. Since $\delta < 1/12$, this probability is at most $e^{-k/8}$. So, the total probability of errors that fall within this case is bounded by $\delta + e^{-k/8}$.

4. Process B terminates first, $H_\ell$ is true for some $\ell > 1$. Process B does not declare $H_\ell$ if one of the following events occurs: the initial validation of coin 1 makes an error (probability at most $\delta$); or in $k$ runs, the permuted versions of policy $\pi$ select coin 1 for $N > 2k/3$ times. Since in each of the $k$ runs the policy is faced with the bias vector $q^+$, there are no guarantees on its behavior.
However, since we use a random permutation before each run, and since there are two coins with bias $0.5 + \epsilon$, namely coins 1 and $\ell$, the probability that the permuted version of $\pi$ selects coin 1 at any given run is bounded by $1/2$. Using Hoeffding's inequality, the probability of selecting coin 1 more than $2k/3$ times is bounded by $\exp\{-2k((2/3) - (1/2))^2\} = e^{-k/18}$. So, the total probability of errors that fall within this case is bounded by $\delta + e^{-k/18}$.

5. Process B terminates first, $H_1$ is true. Process B fails to declare coin 1 only if an error is made in the initial validation step, which happens with probability bounded by $\delta$.

To conclude, using the union bound, when $H_0$ is true, the probability of error is bounded by $2\delta + e^{-k/8}$; when $H_1$ is true, it is bounded by $4\delta$; and when $H_\ell$, $\ell > 1$, is true, it is bounded by $4\delta + e^{-k/18}$. Since $k = \lceil 18\log(1/\delta) \rceil$, we see that the probability of error, under any hypothesis, is bounded by $5\delta$.

We now examine the sample complexity of policy $\tilde\pi$. We consider two cases, depending on which hypothesis is true.

1. $H_0$ is true. In process A, policy $\pi$ is faced with the bias vector $q^-$ and has no termination guarantees, unless it reaches the time limit $\lceil t_\pi/\delta \rceil$. As established earlier (case 3), Process B

will terminate after the initial validation of coin 1 ($m$ trials), plus possibly $k$ runs of policy $\pi$ (expected number of trials $k t_\pi$), with probability at least $1 - e^{-k/8}$. Otherwise, B waits until A terminates (at most $1 + t_\pi/\delta$ time). Multiplying everything by a factor of 2 (because the two processes alternate), the expected time until $\tilde\pi$ terminates is bounded by
$$2(m + k t_\pi) + 2e^{-k/8}(m + t_\pi/\delta + 1).$$

2. $H_\ell$ is true for some $\ell \ne 0$. In this case, process A terminates after the validation time $m$ and the time it takes for $\pi$ to run. Thus, the expected time until termination is bounded by $2(m + 1 + t_\pi)$.

We have constructed an $(\epsilon, 5\delta)$-correct policy $\tilde\pi$ for problem $\tilde\Pi$. Using the above derived time bounds, and the definitions of $k$ and $m$, the expected number of trials, under any hypothesis $H_\ell$, is bounded from above by
$$4m + 36\Big(\log\frac{2}{\delta}\Big) t_\pi.$$
On the other hand, problem $\tilde\Pi$ involves hypotheses of the form considered in the proof of Theorem 5, with $p = (0.5,\ 0.5 - \epsilon,\ \ldots,\ 0.5 - \epsilon)$, and with $N(p,\epsilon) = \{1,2,\ldots,n\}$. Thus, the expected number of trials under some hypothesis is bounded below by $(c_1 n/\epsilon^2)\log(c_2/\delta)$, for some positive constants $c_1$ and $c_2$, leading to
$$c_1 \frac{n}{\epsilon^2} \log\frac{c_2}{\delta} \le 4m + 36\Big(\log\frac{2}{\delta}\Big) t_\pi.$$
This translates to a lower bound of the form $t_\pi \ge c_2 n/\epsilon^2$, for some new constant $c_2$, and for all $n$ larger than some $n_0$. But for $n \le n_0$, we can use the lower bound $(c_2/\epsilon^2)\log(1/\delta)$ that was derived in the first part of the proof.

7.3 Pathwise Sample Complexity

The sample complexity of the policy presented in Section 7.1 was measured in terms of the expected number of trials. Suppose, however, that we are interested in a policy for which the number of trials is always low. Let us say that a policy has a pathwise sample complexity of $t$, if the policy terminates after at most $t$ trials, with probability 1. We note that the median elimination algorithm of Even-Dar et al. (2002) is a $(q,\epsilon,\delta)$-correct policy whose pathwise sample complexity is of the form $(c_1 n/\epsilon^2)\log(c_2/\delta)$. In this section, we show that at least for a certain $q$, there is a matching lower bound on the pathwise sample complexity of any $(q,\epsilon,\delta)$-correct policy.
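The remark that median elimination has the same number of trials on every sample path can be made concrete by counting its tosses without simulating any coin. The sketch below is illustrative: the function name is hypothetical, and the round-by-round schedule (start from $\epsilon/4$ and $\delta/2$, toss every surviving coin $\lceil (4/\epsilon_l^2)\log(3/\delta_l) \rceil$ times, keep half of the coins, then shrink $\epsilon_l$ by $3/4$ and halve $\delta_l$) is an assumption about the algorithm's parameters rather than something stated in this section.

```python
import math


def median_elimination_trials(n, eps, delta):
    """Total number of tosses used by median elimination on n coins.

    The count depends only on (n, eps, delta) and never on observed outcomes,
    which is why the pathwise and expected sample complexities coincide.
    The schedule used here is an assumption of this sketch.
    """
    arms, eps_l, delta_l, total = n, eps / 4, delta / 2, 0
    while arms > 1:
        m = math.ceil(4 / eps_l**2 * math.log(3 / delta_l))
        total += arms * m           # every surviving coin is tossed m times
        arms = (arms + 1) // 2      # half of the coins are eliminated
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2
    return total
```

Because the per-round cost shrinks geometrically once the growth of the logarithmic factor is dominated, the total stays within a constant factor of $(n/\epsilon^2)\log(1/\delta)$ under this schedule, in line with the form quoted above.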
Theorem 14 There exist positive constants $c_1$, $c_2$, $\epsilon_0$, and $\delta_1$ such that for every $n \ge 2$ and $\epsilon \in (0,\epsilon_0)$, there exists some $q \in [0,1]^n$, such that for every $\delta \in (0,\delta_1)$ and every $(q,\epsilon,\delta)$-correct policy $\pi$, there exists some permutation $\sigma$ under which the pathwise sample complexity of $\pi$ is at least
$$c_1 \frac{n}{\epsilon^2} \log\frac{c_2}{\delta}.$$

Proof The proof uses a reduction similar to the one in the proof of Theorem 13. Let $\epsilon_0 = 1/8$ and $\delta_1 = \delta_0/2$, where $\delta_0 = e^{-8}/8$ is the constant in Theorem 5. Let $q$ be the same as in the proof of Theorem 13 (cf. Eq. (19)), and consider the associated permutation problem, referred to as problem

$\Pi$. Fix some $\delta \in (0,\delta_1)$ and suppose that we have a $(q,\epsilon,\delta)$-correct policy $\pi$ for problem $\Pi$ whose pathwise sample complexity is bounded by $t_\pi$ for every permutation $\sigma$. Consider also the problem $\tilde\Pi$ introduced in the proof of Theorem 13, involving the hypotheses $H_0, H_1, \ldots, H_n$. We will now use the policy $\pi$ to construct a policy $\tilde\pi$ for problem $\tilde\Pi$.

We run $\pi$ on the coins $1,2,\ldots,n$. If $\pi$ terminates at or before time $t_\pi$ and selects some coin $\ell$, we sample coin $\ell$ for $(1/\epsilon^2)\log(1/\delta)$ times. If the empirical mean reward is larger than 0.5 we declare $H_\ell$ as the true hypothesis. If the empirical mean reward of coin $\ell$ is less than or equal to 0.5, or if $\pi$ does not terminate by time $t_\pi$, we declare $H_0$ as the true hypothesis.

We start by showing correctness. Suppose first that $H_0$ is true. For the policy $\tilde\pi$ to make an incorrect decision, it must be the case that policy $\pi$ selected some coin $\ell$ and the empirical mean reward of this coin was larger than $1/2$; using Hoeffding's inequality, the probability of this event is bounded by $\delta$. Suppose instead that $H_\ell$ is true for some $\ell \ge 1$. In this case, policy $\pi$ is guaranteed to terminate within $t_\pi$ steps. Policy $\tilde\pi$ will make an incorrect decision if either policy $\pi$ makes an incorrect decision (probability at most $\delta$), or if policy $\pi$ makes a correct decision but the selected coin fails to validate (probability at most $\delta$). It follows that policy $\tilde\pi$ is $(\epsilon,2\delta)$-correct.

The number of trials under policy $\tilde\pi$ is bounded by $t_\pi + (1/\epsilon^2)\log(1/\delta)$, under any hypothesis. On the other hand, using Theorem 5, the expected number of trials under some hypothesis is bounded below by $(c_1 n/\epsilon^2)\log(c_2/\delta)$, leading to
$$c_1 \frac{n}{\epsilon^2} \log\frac{c_2}{\delta} \le t_\pi + \frac{1}{\epsilon^2} \log\frac{1}{\delta}.$$
This translates to a lower bound of the form $t_\pi \ge (c_1 n/\epsilon^2)\log(c_2/\delta)$, for some new constants $c_1$ and $c_2$.

8. Concluding Remarks

We have provided bounds on the number of trials required to identify a near-optimal arm in a multi-armed bandit problem, with high probability. For the problem formulations studied in Sections 3 and 5, the lower bounds match the existing $O((n/\epsilon^2)\log(1/\delta))$ upper bounds.
For the case where the values of the biases are known but the identities of the coins are not, we provided two different tight bounds, depending on the particular criterion being used (expected versus maximum number of trials).

Our results have been derived under the assumption of Bernoulli rewards. Clearly, the lower bounds also apply to more general problem formulations, as long as they include Bernoulli rewards as a special case. It would be of some interest to derive similar lower bounds for other special cases of reward distributions. It is reasonable to expect that essentially the same results will carry over, as long as the Kullback-Leibler divergence between the reward distributions associated with different arms is finite (as in Lai and Robbins, 1985).

Acknowledgments

We would like to thank Susan Murphy and David Siegmund for pointing out some relevant references from the sequential analysis literature. We thank two anonymous reviewers for their comments. This research was supported by the MIT-Merrill Lynch partnership, the ARO under grant DAAD , and the National Science Foundation under grant ECS

References

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002a.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proc. 36th Annual Symposium on Foundations of Computer Science. IEEE Computer Society Press.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32:48-77, 2002b.

D. A. Berry and B. Fristedt. Bandit Problems. Chapman and Hall, 1985.

H. Chernoff. Sequential Analysis and Optimal Design. Society for Industrial & Applied Mathematics, 1972.

E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In J. Kivinen and R. H. Sloan, editors, Fifteenth Annual Conference on Computational Learning Theory. Springer, 2002.

C. Jennison, I. M. Johnstone, and B. W. Turnbull. Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. In S. S. Gupta and J. Berger, editors, Statistical Decision Theory and Related Topics III, volume 3. Academic Press, 1982.

S. R. Kulkarni and G. Lugosi. Finite-time lower bounds for the two-armed bandit problem. IEEE Trans. Aut. Control, 45(4), 2000.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.

S. Mannor and J. N. Tsitsiklis. Lower bounds on the sample complexity of exploration in the multi-armed bandit problem. In B. Schölkopf and M. K. Warmuth, editors, Sixteenth Annual Conference on Computational Learning Theory. Springer, 2003.

H. Robbins. Some aspects of sequential design of experiments. Bulletin of the American Mathematical Society, 55:527-535, 1952.

S. M. Ross. Stochastic Processes. Wiley.

D. Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer Verlag.