Multi-armed Bandit Problems with History

Transcription

Pannaga Shivaswamy, Department of Computer Science, Cornell University, Ithaca NY
Thorsten Joachims, Department of Computer Science, Cornell University, Ithaca NY

Appearing in Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS) 2012, La Palma, Canary Islands. Volume XX of JMLR: W&CP XX. Copyright 2012 by the authors.

Abstract

In this paper we consider the stochastic multi-armed bandit problem. However, unlike in the conventional version of this problem, we do not assume that the algorithm starts from scratch. Many applications offer observations of (some of) the arms even before the algorithm starts. We propose three novel multi-armed bandit algorithms that can exploit this data. An upper bound on the regret is derived in each case. The results show that a logarithmic amount of historic data can reduce regret from logarithmic to constant. The effectiveness of the proposed algorithms is demonstrated on a large-scale malicious URL detection problem.

1 Introduction

Many real-world problems, ranging from the optimization of advertising revenue in search engines to the scheduling of clinical trials, can be modeled as multi-armed bandit problems. At each time step, the algorithm chooses one of the possible arms (i.e. advertisements, treatments) and observes its rewards. The goal is to maximize the sum of rewards over all time steps, typically expressed as regret compared to the best arm in hindsight.

In the conventional formulation of the problem, the algorithm has no prior knowledge about the arms. Many applications, however, provide some data about the arms even before the algorithm starts. For example:

- A search engine company has designed K retrieval functions. Historic data is available from a beta-test on a small sample of paid users, but now the functions should be fielded in the production system so as to maximize clickthrough.
- An online movie company has K different recommender functions to suggest movies to a user. When a new user signs up, he is asked to rate a few pivotal movies, which provides historic data for optimizing the choice of recommender function in the long run.
- A clinical trial experiment was stopped due to a legal hurdle. Now the courts went in favor of continuing the clinical trial but also warn that the losses should be minimum from now on.

More generally, we define historic data as any observations of the arms that are collected before the start of the online learning algorithm. The algorithm itself has no control over the choice of arms in the historic data, nor do all arms have to be sampled uniformly. The availability of such historic data leads to the question of how online learning algorithms can best use it to reduce regret. This problem is meaningful only for the case of stochastic arms [8, 5], since no amount of historic data can help in the adversarial setting [4].

To our best knowledge, this problem has not been studied in the literature. However, the work by [9] on the bandit problem with side information is related. Their work assumes that historic data collected via some policy is available to evaluate a mapping from side information to arms. In the absence of side information, their policy evaluation strategy reduces to choosing the arm with the highest mean reward on the historic data. Related is also the Sleeping Bandits Problem [7], where only a subset of the arms is active at each time step. While it can mimic historic data to some extent (e.g. it allows the addition of a new arm at any time), algorithms and bounds are weaker since they cannot rely on a separation of historic data and online learning.

This paper proposes three new online learning algorithms that are able to exploit historic data. We derive

upper bounds on the regret for each of the three algorithms, showing that a logarithmic amount of historic data allows them to achieve constant regret. A desirable property of any bandit algorithm with historic observations is that the regret is zero with infinite historic data. All the three algorithms that we propose satisfy this property. We also evaluate the algorithms empirically on a malicious URL detection problem, finding that historic data can make a substantial difference on practical problems.

2 Problem Definition and Notation

The stochastic K-armed bandit problem considers bounded random variables $X_{j,t} \in [0,1]$ for $1 \le j \le K$ and time index $t \ge 1$. Each $X_{j,t}$ denotes the reward that is incurred when the $j$-th arm is pulled the $t$-th time. For arm $j$, the rewards $X_{j,t}$ are independent and identically distributed with an unknown mean $\mu_j$ and an unknown variance $\sigma_j^2$. The arm with the largest mean reward is denoted by $*$, i.e., $\mu_* := \max_{1 \le i \le K} \mu_i$. Further, for any arm $j$, $\Delta_j$ denotes $\mu_* - \mu_j$. Often, we replace $j$ with $*$ in any notation to denote a quantity that corresponds to $*$.

Historic observations are denoted by $X^h_{j,t} \in [0,1]$ for $1 \le j \le K$ and $1 \le t \le H_j$, indexing the $t$-th instance of historic reward for arm $j$. $H_j$ is the number of historic instances available for arm $j$, and $H$ is defined as $H := \sum_{j=1}^{K} H_j$. The historic rewards for each arm are assumed to be drawn independently from the same distributions as the non-historic rewards.

$T_j(n)$ denotes the number of times the arm $j$ is pulled between times 1 and $n$; this excludes the pulls of the arm in the historic data. The regret at time $n$ is defined as

$R_n := \mu_* n - \sum_{j=1}^{K} \mu_j E[T_j(n)]$,

where $E[T_j(n)]$ is the expectation of $T_j(n)$. The per-round regret at time $n$ is defined as $R_n/n$.

The mean reward from the historic data for arm $j$ is defined as $\bar X^h_j := \frac{1}{H_j} \sum_{t=1}^{H_j} X^h_{j,t}$. The mean reward of arm $j$ during the execution of the algorithms until its $n_j$-th pull is defined as $\bar X_{j,n_j} := \frac{1}{n_j} \sum_{t=1}^{n_j} X_{j,t}$. Analogously, the joint mean reward of arm $j$ incorporating both the historic and the online data is

$\bar X^h_{j,n_j} := \frac{\sum_{t=1}^{H_j} X^h_{j,t} + \sum_{t=1}^{n_j} X_{j,t}}{H_j + n_j}$.

Finally, $V^h_{j,n_j}$ denotes the sample variance of the rewards for arm $j$ until its $n_j$-th pull including the historic data, and $V_{j,n_j}$ denotes the sample variance without history.
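The two quantities used most often in the rest of the paper, the joint mean and the regret, can be sketched in a few lines of Python. This is only an illustration of the definitions above; the function names `joint_mean` and `regret` are hypothetical, and the regret function assumes the caller already knows the true means and the expected pull counts.

```python
from typing import Sequence


def joint_mean(historic: Sequence[float], online: Sequence[float]) -> float:
    """Joint mean of one arm over its H_j historic and n_j online rewards."""
    total = len(historic) + len(online)
    if total == 0:
        raise ValueError("arm has no observations")
    return (sum(historic) + sum(online)) / total


def regret(mu: Sequence[float], expected_pulls: Sequence[float]) -> float:
    """R_n = mu_* n - sum_j mu_j E[T_j(n)].

    One arm is pulled per round, so n is the sum of the expected pull
    counts. Historic pulls are excluded from expected_pulls.
    """
    n = sum(expected_pulls)
    return max(mu) * n - sum(m * t for m, t in zip(mu, expected_pulls))
```

For instance, `regret([0.5, 0.25], [90, 10])` charges the ten pulls of the worse arm at a gap of 0.25 each.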
3 A Naive Algorithm

We first consider the simplest algorithm that makes use of the historic data: pick the arm with the maximum mean reward on the historic data and then simply play that arm in every iteration. Unfortunately, this is not a very good strategy, since there is a constant probability of suffering regret in each step. By constructing an example, Theorem 1 shows that this algorithm can have regret that grows polynomially with time even if the arms have a logarithmic amount of historic data.

Theorem 1 Consider a two-armed bandit problem. The first arm has a fixed reward $0.25+\epsilon$ ($0.5 > \epsilon > 0$), the second arm has a Bernoulli reward with mean $0.25$. Suppose $H_2 = 3\delta \ln n / (16\epsilon^2)$; then the naive strategy has regret growing polynomially with $n$ for any $n > \exp(1/\delta)$.

Proof We lower bound the probability that the observed mean reward for the (worse) second arm is higher than the mean reward for the (better) first arm:

$P[\bar X^h_2 > \bar X^h_1] = P[B > H_2(0.25+\epsilon)] \ge P\left[Z > 4\sqrt{H_2}\,\epsilon/\sqrt{3}\right]$. (1)

In (1) we applied Slud's inequality [1], which states:

$P[B > t] \ge P\left[Z > (t - np)/\sqrt{np(1-p)}\right]$,

for a binomial random variable $B$ parametrized by $n$ and $p$ such that $p \le 1/2$, $np \le t \le n(1-p)$, and $Z \sim N(0,1)$ (i.e. a standard Gaussian random variable). Here $n = H_2$ and $p = 0.25$. Further, for $Z \sim N(0,1)$, we have from a result in [6]:

$P[Z > \theta] \ge \frac{\theta \exp(-\theta^2/2)}{\sqrt{2\pi}\,(1+\theta^2)}$.

For $\theta > 1$, it is easy to verify that $\theta/(1+\theta^2) > \exp(-\theta^2/2)$. Thus, $P[Z > \theta] \ge \exp(-\theta^2)/\sqrt{2\pi}$. Applying this to (1), we get

$P[\bar X^h_2 > \bar X^h_1] \ge \exp(-16 H_2 \epsilon^2/3)/\sqrt{2\pi}$.

Substituting the value of $H_2$ from the statement of the theorem, we get

$P[\bar X^h_2 > \bar X^h_1] \ge \frac{1}{\sqrt{2\pi}\, n^{\delta}}$.

Thus, the regret achieved by the algorithm in $n$ steps is at least $\epsilon\, n^{1-\delta}/\sqrt{2\pi}$. From $\theta > 1$, we get $n > \exp(1/\delta)$.

4 Algorithms and Analysis

In this section, we propose three new algorithms for the stochastic multi-armed bandit problem with historic data. For each algorithm, we prove a logarithmic regret bound. Interestingly, these bounds show

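that a logarithmic amount of historic data is sufficient to allow these algorithms to achieve constant regret. Moreover, as the number of historic observations for every arm tends to infinity, the regret achieved is zero. In particular, we derive bounds for the expected number of pulls for any suboptimal arm $j$, i.e., $E[T_j(n)]$. From these, the regret bound can be computed as $\sum_{j: \Delta_j > 0} \Delta_j E[T_j(n)]$.

Algorithm 1 UCB1
At time $t$ play the arm $j$ that maximizes $\bar X_{j,n_j} + \sqrt{2 \ln t / n_j}$, where $n_j$ denotes $T_j(t-1)$.

Algorithm 2 HUCB1
At time $t$ play the arm $j$ that maximizes $\bar X^h_{j,n_j} + \sqrt{2 \ln(H_j + t)/(n_j + H_j)}$, where $n_j$ denotes $T_j(t-1)$.

4.1 HUCB1: UCB1 with Historic Data

Our first algorithm is derived from the UCB1 algorithm [5]. The original UCB1 algorithm is given in Algorithm 1, while our extension of UCB1 for historic data, called HUCB1, is shown in Algorithm 2. For a given amount $H_j$ of historical data for each arm, the following theorem provides an upper bound for HUCB1 on the expected number of pulls for any suboptimal arm.

Theorem 2 The expected number of pulls of any suboptimal arm $j$, for any time horizon $n$, satisfies

$E[T_j(n)] \le 1 + l_j^+ + \frac{\pi^2 (1+6H_j)}{6(2H_j+1)^2} + \frac{\pi^2 (1+6H_*)}{6(2H_*+1)^2}$,

where

$l_j^+ = \max\left(0,\; \frac{8\log(n+H_j)}{\Delta_j^2} - H_j\right)$. (2)

Proof Define $c^i_{t,s} = \sqrt{2\ln(t+H_i)/(H_i+s)}$. We then have, for any integer $l > 0$,

$T_j(n) = \sum_{t=1}^{n} 1\{I_t = j\} \le l + \sum_{t=1}^{n} 1\{I_t = j,\; T_j(t-1) \ge l\}$
$\le l + \sum_{t=1}^{n} 1\{\bar X^h_{*,T_*(t-1)} + c^*_{t-1,T_*(t-1)} \le \bar X^h_{j,T_j(t-1)} + c^j_{t-1,T_j(t-1)},\; T_j(t-1) \ge l\}$
$\le l + \sum_{t=1}^{n} 1\{\min_{0<s<t} (\bar X^h_{*,s} + c^*_{t-1,s}) \le \max_{l \le s_j < t} (\bar X^h_{j,s_j} + c^j_{t-1,s_j})\}$
$\le l + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_j=l}^{t} 1\{\bar X^h_{*,s} + c^*_{t,s} \le \bar X^h_{j,s_j} + c^j_{t,s_j}\}$.

The event $\{\bar X^h_{*,s} + c^*_{t,s} \le \bar X^h_{j,s_j} + c^j_{t,s_j}\}$ implies that at least one of the following holds (footnote 1):

$\bar X^h_{*,s} \le \mu_* - c^*_{t,s}$, $\quad \bar X^h_{j,s_j} \ge \mu_j + c^j_{t,s_j}$, $\quad \mu_* < \mu_j + 2c^j_{t,s_j}$. (3)

The derivation so far is very similar to that in the original UCB1 analysis. However, from this point, having historic data starts to have a significant impact. The probability that the first two inequalities in (3) hold can be bounded using Hoeffding's inequality; inclusion of historic data gives significantly tighter bounds:

$P[\bar X^h_{*,s} \le \mu_* - c^*_{t,s}] \le e^{-4\log(t+H_*)} = (t+H_*)^{-4}$,
$P[\bar X^h_{j,s_j} \ge \mu_j + c^j_{t,s_j}] \le e^{-4\log(t+H_j)} = (t+H_j)^{-4}$.

Further, for our choice of $l = \lceil l_j^+ \rceil$ given in (2), the third inequality in (3) is false. We are now ready to bound the expected number of pulls for arm $j$.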
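We have,

$E[T_j(n)] \le l + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_j=l}^{t} \left( P[\bar X^h_{*,s} \le \mu_* - c^*_{t,s}] + P[\bar X^h_{j,s_j} \ge \mu_j + c^j_{t,s_j}] \right)$
$\le l + \sum_{t=1}^{\infty} \sum_{s=1}^{t} \sum_{s_j=l}^{t} \left( (t+H_*)^{-4} + (t+H_j)^{-4} \right)$
$\le 1 + l_j^+ + \frac{\pi^2 (1+6H_j)}{6(2H_j+1)^2} + \frac{\pi^2 (1+6H_*)}{6(2H_*+1)^2}$.

In the above, we have used the fact that

$\frac{m(2m-1)\pi^2}{3(2m+1)^2} \le \sum_{t=1}^{m} \frac{1}{t^2} \le \frac{m(2m+2)\pi^2}{3(2m+1)^2}$

to derive an upper bound for $\sum_{t} (t+H)^{-2}$.

First, note that the above bound reduces to the bound for the UCB1 algorithm [5] when $H_j = 0$ for all $j$. Next, to see how much impact historic data can have on the regret, consider $n = \exp(H_j \Delta_j^2/8) - H_j$. In this situation $l_j^+ = 0$, and the remaining history-dependent terms of the HUCB1 bound, $\frac{\pi^2 (1+6H_j)}{6(2H_j+1)^2} + \frac{\pi^2 (1+6H_*)}{6(2H_*+1)^2}$, are inversely related to $H_j$ and $H_*$. However, for UCB1, for the above choice of $n$, the upper bound on $E[T_j(n)]$ is $1 + \frac{8}{\Delta_j^2} \log\left(\exp(H_j\Delta_j^2/8) - H_j\right) + \frac{\pi^2}{3}$, which is approximately linear in $H_j$.

Footnote 1: It is easy to check this by negating this claim.

Rather than the confidence interval shown in Algorithm 2, at first glance one might think that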

$\sqrt{2\log(H+t)/(n_j+H_j)}$ is the most natural choice to use for historic data. It can be shown that this choice leads to the following bound:

$E[T_j(n)] \le \max\left(0,\; \frac{8\log(n+H)}{\Delta_j^2} - H_j\right) + \frac{\pi^2 (1+6H)}{3(2H+1)^2}$.

It therefore has two disadvantages. First, it does not take into account that there could be different numbers of pulls for different arms in the historic data. Second, when $H_j$ is small but $H$ is quite large, the above bound can be worse than the one derived in Theorem 2.

4.2 HUCB3: An ε-greedy Algorithm

Arguably the simplest bandit algorithm is UCB3 [5]. We now explore whether there is a similar algorithm with historic data. We first present a slightly modified version of UCB3 in Algorithm 3. Instead of having a single rate $\epsilon$ for all arms, the following version has a different rate $\epsilon_j$ for arm $j$. Despite this change, the analysis of this algorithm is analogous to that of the original UCB3 algorithm. UCB3 has two parameters, $d$, which is a lower bound on the smallest non-zero $\Delta_j$, and another parameter $c > 0$, but these two parameters always appear together as $c/d^2$.

Algorithm 3 UCB3
Parameters: $c > 0$ and $0 < d < \min_j \Delta_j$.
Define a sequence for each arm: $\epsilon_j^n := \min\left(\frac{1}{K}, \frac{c}{d^2 n}\right)$.
At iteration $n$, let $i$ be the arm with the highest average reward (with no historic data); play arm $i$ with probability $1 - \sum_{j=1}^{K} \epsilon_j^n$. Play arm $j$ with probability $\epsilon_j^n$.

To derive an algorithm that can exploit historic data, the key is to set the rates $\epsilon_j^n$ in a way that accounts for historic data. It might seem, at first, that replacing the rate $\frac{c}{d^2 n}$ with $\frac{c}{d^2 (n+H_j)}$ would work. Unfortunately, this approach does not lead to strong guarantees. First, observe that in the case of UCB3, $\epsilon_j^n$ is $1/K$ until $n \le cK/d^2$. The amount of exploration done by UCB3 between times $t' := cK/d^2 + 1$ (footnote 2) and $n$ is lower bounded as follows:

$\sum_{t=t'}^{n} P[I_t = j] \ge \sum_{t=t'}^{n} \frac{c}{d^2 t} \ge \frac{c}{d^2} \log\frac{n d^2}{cK} - O(1)$.

To derive an ε-greedy-like algorithm that can exploit historic data, we first find $n_j$ such that the expected exploration exceeds $H_j$. This is done by setting $H_j$ equal to the lower bound (we ignore the constant additive term in the above equation). This gives $n_j = \frac{cK}{d^2} \exp\left(\frac{H_j d^2}{c}\right)$. The historic version of the ε-greedy algorithm will have the same rates as UCB3 in the first $cK/d^2$ steps.

Footnote 2: We ignore the floor on $cK/d^2$ for brevity.
However, after that, the rate used by the historic algorithm at time step $t > cK/d^2$ will be that of UCB3 at step $n_j + t - cK/d^2$. Based on these ideas, HUCB3 is presented in Algorithm 4. Note that when $H_j = 0$, $\epsilon_j^n$ for HUCB3 reduces to $c/(d^2 n)$, which is exactly the same rate as in UCB3. We now provide an upper bound on the instantaneous regret of HUCB3 in Theorem 3. The proof of the theorem is provided as an appendix due to space constraints. The overall idea of the proof is the same as the corresponding proof for UCB3. The two differences in our proof are the availability of historic data while applying concentration inequalities and the alternate definition of $\epsilon_j^n$ as proposed in Algorithm 4.

Algorithm 4 HUCB3
Parameters: $c > 0$ and $0 < d < 1$.
Define a sequence for each arm: $\epsilon_j^n := 1/K$ for $n \le cK/d^2$ and $\epsilon_j^n := \frac{c}{d^2} \cdot \frac{1}{\frac{cK}{d^2}\left(e^{H_j d^2/c} - 1\right) + n}$ for $n > cK/d^2$.
At iteration $n$, let $i = \arg\max_j \bar X^h_{j,T_j(n-1)}$. Play arm $i$ with probability $1 - \sum_{j=1}^{K} \epsilon_j^n$. Play arm $j$ with probability $\epsilon_j^n$.

Theorem 3 For any $n \ge cK/d^2$, where $c \ge 10$, HUCB3 satisfies

$P[I_n = j] \le \frac{c}{d^2} \cdot \frac{1}{\frac{cK}{d^2}\left(\exp(H_j d^2/c) - 1\right) + n} + o(1/n)$.

The following corollary gives an upper bound on the expected number of pulls of any sub-optimal arm. It is obtained by summing the instantaneous regrets for arm $j$ given in Theorem 3.

Corollary 4 HUCB3 admits the following bound for any sub-optimal arm $j$, for any $n > cK/d^2$:

$E[T_j(n)] \le \frac{c}{d^2} \log \frac{K\left(\exp(H_j d^2/c) - 1\right) + n d^2/c}{K \exp(H_j d^2/c)} + O(1)$.

To see how the above bound changes with historic data, suppose $H_j = \frac{c}{d^2}\log\frac{n d^2}{cK}$; then $E[T_j(n)] = O(1)$. This again shows that a logarithmic amount of historic data suffices to achieve constant regret. It is also easy to see from the proof of Theorem 3 that these additive terms go to zero exponentially with $H_j$ and $H_*$, thus showing that the regret approaches zero as the number of historic observations approaches infinity.
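The exploration schedule described above can be sketched as follows. This is a minimal illustration derived from the prose description (same rate as UCB3 during the first $cK/d^2$ steps, then the UCB3 rate shifted forward by $n_j - cK/d^2$ steps), not a transcription of the authors' implementation. The function name `hucb3_rate` and the overflow guard are assumptions made here.

```python
import math


def hucb3_rate(n: int, K: int, H_j: int, c: float, d: float) -> float:
    """Exploration probability of arm j at iteration n (hypothetical helper).

    During the first c*K/d^2 steps the rate is 1/K, as in UCB3. Afterwards
    it equals the UCB3 rate c/(d^2 * step) evaluated at the shifted step
    n_j + n - c*K/d^2, where n_j = (c*K/d^2) * exp(H_j * d^2 / c).
    """
    ratio = c / d ** 2            # the two parameters only appear as c/d^2
    burn_in = ratio * K           # c*K/d^2 (floor ignored, as in the text)
    if n <= burn_in:
        return 1.0 / K
    try:
        n_j = burn_in * math.exp(H_j / ratio)
    except OverflowError:
        # Assumption: with this much history, exploration is effectively off.
        return 0.0
    return ratio / (n_j + n - burn_in)
```

With `H_j = 0` the shift vanishes and the function returns `c / (d**2 * n)`, matching the remark that HUCB3 reduces to UCB3 without history; increasing `H_j` lowers the rate exponentially.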

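4.3 HUCBV: Exploiting Sample Variance

Our final algorithm is based on a recent version of the UCB algorithm which also incorporates the sample variance of the rewards [2, 3]. In its most basic form, the UCBV algorithm is as shown in Algorithm 5. Audibert et al. [2] show that a value of $\theta = 1.2$ is enough for logarithmic convergence. The expected regret of the UCBV algorithm was shown to be upper bounded by $10 \sum_{j: \mu_j < \mu_*} \left(\sigma_j^2/\Delta_j + 2\right) \log n$. The advantage of UCBV over algorithms that do not incorporate the sample variance is that the regret bound for UCBV involves $\sigma_j^2/\Delta_j$ instead of $1/\Delta_j$. The variance $\sigma_j^2$ can be substantially smaller than 1.

Algorithm 5 UCBV
At time $t$ play the arm $j$ that maximizes $\bar X_{j,n_j} + \sqrt{\frac{2\theta V_{j,n_j} \log t}{n_j}} + \frac{3\theta \log t}{n_j}$.

The historic version of the UCBV algorithm is summarized in Algorithm 6. We will now derive an upper bound on its regret.

Algorithm 6 HUCBV
At time $t$ play the arm $j$ that maximizes $B_{j,T_j(t-1),t}$ with
$B_{j,s,t} = \bar X^h_{j,s} + \sqrt{\frac{2\theta V^h_{j,s} \log(t+H_j)}{s+H_j}} + \frac{3\theta \log(t+H_j)}{s+H_j}$.

Theorem 5 For $\theta = 1.2$, HUCBV satisfies

$E[T_j(n)] \le 1 + v_j + O(1)$,

where $v_j$ is defined as

$v_j := \max\left\{0,\; 8\left(\frac{\sigma_j^2}{\Delta_j^2} + \frac{2}{\Delta_j}\right) E_n - H_j\right\}$, (4)

and $E_n$ denotes $\theta \log(n+H_j)$.

Proof We start with inequality (8) from [3], which holds for any integer $u > 1$:

$E[T_j(n)] \le u + \sum_{t=u+K-1}^{n} \sum_{s=u}^{t-1} P[B_{j,s,t} > \mu_*] + \sum_{t=u+K-1}^{n} P[\exists s: 1 \le s \le t-1 \text{ s.t. } B_{*,s,t} \le \mu_*]$. (5)

Our choice of $u$ is the smallest integer greater than $v_j$ defined in (4). Following [3], for $u \le s \le t$ and $t \ge 2$, our choice of $u$ ensures that

$\sqrt{\frac{2(\sigma_j^2 + \Delta_j/2) E_t}{s+H_j}} + \frac{3E_t}{s+H_j} \le \sqrt{\frac{2(\sigma_j^2 + \Delta_j/2) E_n}{u+H_j}} + \frac{3E_n}{u+H_j} \le \frac{\Delta_j}{2}\left[\sqrt{\frac{2\sigma_j^2 + \Delta_j}{2\sigma_j^2 + 4\Delta_j}} + \frac{3\Delta_j}{4\sigma_j^2 + 8\Delta_j}\right] \le \frac{\Delta_j}{2}$. (6)

Consider the probability in the first term in (5):

$P[B_{j,s,t} > \mu_*] \le P\left[\bar X^h_{j,s} + \sqrt{\frac{2 V^h_{j,s} E_t}{s+H_j}} + \frac{3E_t}{s+H_j} > \mu_j + \Delta_j\right]$
$\le P\left[\bar X^h_{j,s} + \sqrt{\frac{2(\sigma_j^2 + \Delta_j/2) E_t}{s+H_j}} + \frac{3E_t}{s+H_j} > \mu_j + \Delta_j\right] + P[V^h_{j,s} \ge \sigma_j^2 + \Delta_j/2]$
$\le P[\bar X^h_{j,s} - \mu_j > \Delta_j/2] + P[V^h_{j,s} \ge \sigma_j^2 + \Delta_j/2]$
$\le 2 e^{-(s+H_j)\Delta_j^2/(8\sigma_j^2 + 4\Delta_j/3)}$.

In the above, the second step follows from (6). In the last step, Bernstein's inequality has been used twice, and the extra term $H_j$ in the exponent is a result of having historic data for arm $j$.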
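Summing the above upper bounds from $s = u$ to $t-1$ and using the fact that $1 - e^{-x} \ge 2x/3$ for $0 \le x \le 3/4$ gives

$\sum_{s=u}^{t-1} P[B_{j,s,t} > \mu_*] \le \left(\frac{24\sigma_j^2}{\Delta_j^2} + \frac{4}{\Delta_j}\right) e^{-E_n}$.

Now consider the last term in (5). Using Theorem 1 (the empirical Bernstein bound) of [3], it can be upper bounded by $3\sum_{t=u+1}^{n} \beta(E_t, t)$, where

$\beta(x,t) := \inf_{1 < \alpha \le 3} \min\left(t, \frac{\log t}{\log \alpha}\right) e^{-x/\alpha}$.

Therefore, we can write the upper bound on $E[T_j(n)]$ as

$E[T_j(n)] \le 1 + \max\left\{0,\; 8\left(\frac{\sigma_j^2}{\Delta_j^2} + \frac{2}{\Delta_j}\right) E_n - H_j\right\} + n e^{-E_n}\left(\frac{24\sigma_j^2}{\Delta_j^2} + \frac{4}{\Delta_j}\right) + 3\sum_{t=u+1}^{n} \beta(E_t, t)$.

For the choice $\theta = 1.2$, $e^{-E_n}$ in the third term above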

becomes $(n+H_j)^{-1.2}$. Now consider the last term:

$\sum_{t=u+1}^{n} \beta(E_t, t) \le \sum_{t=3}^{\infty} \beta(E_t, t)$
$\le O(1) + \sum_{t=4}^{\infty} \min\left(t, \frac{\log t}{\log 1.1}\right) e^{-\theta \log(t+H_*)/1.1}$
$\le O(1) + \sum_{t=4}^{\infty} \frac{\log t}{\log 1.1}\, e^{-1.2 \log(t+H_*)/1.1}$
$= O(1) + \sum_{t=4}^{\infty} \frac{\log t / \log 1.1}{(t+H_*)^{1.09}} = O(1)$.

In the second step, we replaced the infimum over a range by a specific value in the range. In the third step, we used the fact that $\log t/\log \alpha < t$ for $t \ge 4$ and $\alpha = 1.1$. In the last step, we used the fact that $\sum_t \frac{\log t}{(t+H_*)^{1.09}}$ is a convergent series; it is easy to verify this fact by the integral test.

In the case of HUCBV, $E[T_j(n)] = O(1)$ when $n = \exp\left(\frac{H_j}{9.6(\sigma_j^2/\Delta_j^2 + 2/\Delta_j)}\right) - H_j$. Thus, with a logarithmic amount of historic data, the regret is constant once again. It can again be seen from the proof that the additive terms approach zero as $H_j$ and $H_*$ approach infinity. In practice, the performance of HUCBV is significantly better compared to the other versions of the algorithms that we have proposed. This will be a recurring theme in our experiments.

5 Experiments

Experiments were conducted on a large-scale real-world dataset [10] containing about 2.4 million instances. Each instance corresponds to a URL and has more than 3.2 million features associated with it. The label of an instance indicates whether the URL is malicious or not. Five different SVM classifiers were trained using a subset of twenty thousand examples. The different SVMs corresponded to different C parameter values, which trade off between margin and slack variables in the SVM. Predictions were then obtained on all the remaining instances for all the five classifiers. The instances used in training were not used in the rest of the experiments.

The five classifiers were then used as the arms of a multi-armed bandit problem. The reward was simply one when the prediction of the classifier matched the true URL reputation label and zero when it did not. The best arm differed from the second best by about .8, whereas the best arm differed from the worst by $\Delta_{\max} := .55$. We show per-round regret expressed as a fraction of $\Delta_{\max}$, i.e. $R_n/(n\Delta_{\max})$, in our results. Note that these values were estimated from about 2.4 million examples. All the experiments in this section were performed by drawing random samples from this population.
Hence the above values denote true values for the underlying distribution from which examples were drawn.

Baselines Obviously, the original UCB algorithms and the NAIVE strategy (Section 3) are baselines in our experiments. However, we also considered three other, stronger strategies called BUCB1, BUCBV and BUCB3. These stronger strategies (BUCB) cannot be run with arbitrary historic data and were included merely for a wider perspective. These strategies were as follows. In any experiment, if there were $H$ historic examples for all the arms together, the corresponding UCB algorithms were run for $H$ extra rounds at the start, but the regret accumulated in these first $H$ rounds was simply ignored. Note that the arms pulled in the first $H$ iterations of the BUCB strategies are completely determined by the underlying UCB algorithm. In contrast, our algorithms for historic data can have arbitrary history for any subset of arms.

It is possible to argue that BUCB strategies have higher regret compared to HUCB algorithms. Suppose UCB1 (footnote 3) is run for $n + H$ iterations; then the number of pulls of the sub-optimal arm $j$ is $O(\ln(n+H)/\Delta_j^2)$. The number of pulls in the first $H$ steps is $O(\ln H/\Delta_j^2)$. The worst possible scenario is when $\Theta(\ln H/\Delta_j^2)$ pulls are made in the first $H$ steps. Thus ignoring the pulls of arm $j$ in the first $H$ steps would give $O\left((\ln(n+H) - \ln H)/\Delta_j^2\right)$ pulls. In contrast, $E[T_j(n)]$ for HUCB1 is of the order $O\left(\ln(n+H_j)/\Delta_j^2 - H_j\right)$. This shows that the upper bounds for our algorithms are much better even though these baseline strategies are stronger than completely ignoring history. Our experiments confirm this finding.

While the regret bounds we proved for the three algorithms prescribe what parameters to use, these parameter choices are often very conservative since the bounds hold for any distribution. We therefore considered variants of the proposed algorithms where the trade-off between exploration and exploitation is tuned empirically. In the case of UCB1, HUCB1, UCBV and HUCBV, we put a weight $\theta$ on the confidence interval; in the case of UCB3 and HUCB3, the parameter $d$ was always set at .8; however, the parameter $c$ was tuned.
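The weighted variant of HUCB1 can be illustrated with a short Python sketch. It is only a sketch under assumptions: the weight `theta` is taken to multiply the confidence width of Algorithm 2, the function names `hucb1_index` and `choose_arm` are hypothetical, and an arm with neither historic nor online observations is given an infinite index so that it is pulled first, a convention the text does not specify.

```python
import math
from typing import Sequence


def hucb1_index(mean_h: float, n_j: int, H_j: int, t: int,
                theta: float = 1.0) -> float:
    """Joint mean plus theta times the HUCB1 confidence width."""
    if n_j + H_j == 0:
        return math.inf  # assumption: force a first pull when no data exists
    width = math.sqrt(2.0 * math.log(t + H_j) / (n_j + H_j))
    return mean_h + theta * width


def choose_arm(means_h: Sequence[float], pulls: Sequence[int],
               history: Sequence[int], t: int, theta: float = 1.0) -> int:
    """Index of the arm with the largest weighted HUCB1 index at time t."""
    scores = [hucb1_index(m, n, h, t, theta)
              for m, n, h in zip(means_h, pulls, history)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal joint means, the arm with less history has the wider interval and is chosen; setting `theta = 0` reduces the rule to a purely greedy choice on the joint mean.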
Footnote 3: We can show similar results for UCB3 and UCBV.

To tune the values of these parameters, UCB1, UCB3, and UCBV were run multiple times, where the rewards came from a random draw of instances each time. The parameters corresponding to the smallest average regret from these runs were fixed

for the rest of the experiments. For UCB1, a value of $\theta$ was determined in this way. In the case of UCB3, the parameter $c$ was found to be .3. Finally, in the case of UCBV, $\theta$ was equal to .4. For our proposed algorithms (e.g. HUCB1) and for the baselines above (e.g. BUCB1), we simply used the same parameter values found for the corresponding base algorithm (e.g. UCB1).

[Figure 1: $R_n/(n\Delta_{\max})$ vs iterations with 4 historical examples per arm. Three panels with legends NAIVE, UCB1, HUCB1, BUCB; NAIVE, UCBV, HUCBV, BUCBV; NAIVE, UCB3, HUCB3, BUCB.]

5.1 How does history affect the regret?

The aim of the first experiment was to study the behavior of regret in the presence of historic data. The total amount of historic data was fixed and uniformly split into 4 per arm. The algorithms were then run on sampled instances, and the per-round regret was noted after each iteration for each algorithm. The experiments were repeated multiple times by randomly selecting the instances. A different set of historic data was selected for each run. The results ($R_n/(n\Delta_{\max})$ and error bars) of this experiment are shown in Figure 1.

Examining the impact of historical data on the regret, we see that all algorithms that exploit historic data indeed outperform their counterparts. This experiment shows how a comparably small amount of historic data can help achieve a substantial improvement in regret. As expected, the NAIVE algorithm performs poorly, whereas HUCBV has the best performance among all the algorithms. It can also be seen that the HUCB algorithms perform slightly better than the corresponding BUCB strategies for large $n$ in the case of HUCB3 and HUCBV.

5.2 How does regret change with the amount of history?

In this experiment, the amount of historic data is varied to study its effect on the regret. The setup is analogous to the previous experiments, and again the historic data is split uniformly among the arms. Per-round regret is measured after 5, iterations. The results of this experiment are shown in Figure 2. The regret at 5, iterations for UCB1, UCB3, and UCBV is shown as a baseline. As the amount of historic data increases, the regret decreases as expected.
Over most amounts of historic data, HUCB1, HUCB3, and HUCBV outperform their conventional counterparts. We once again see a small improvement over BUCB strategies as well. BUCB3 has slightly better performance than HUCB3 with a large amount of history at 5, iterations. This is due to the fact that UCB3 algorithms take a longer time to converge (a fact that can be verified from Figure 1 as well) due to constant rates in the beginning. For a large amount of historic data, the NAIVE algorithm can reliably pick the best arm using only the historic data. However, when the amount of history is small, the regret from the naive strategy is significantly higher when compared to our algorithms.

5.3 How does the distribution of historic data affect regret?

The final experiment was designed to study the effect of unbalanced amounts of historic data per arm. Since the bounds we derived in this paper showed that the number of times an arm is pulled depends on $H_j$ and $H_*$, we fixed the number of instances at 4 for the four non-optimal arms, i.e. $H_j = 4$ when $j \ne *$. The number of instances for the optimal arm, $H_*$, was then varied in steps. The BUCB baselines have an advantage in this experiment since we cannot enforce a distribution of historic data over the arms in that case (the algorithms decide which rewards are revealed to them).

The results of this experiment are shown in Figure 3. We show the behavior of the algorithms at 5, iterations (footnote 4). When $H_*$ is large, the sub-optimal arms are under-sampled and they tend to be pulled more often in the beginning. This can be seen by almost flat curves for HUCBV and HUCB3 and by an increase in regret in the case of HUCB1 for larger $H_*$ values.

Footnote 4: After a large number of rounds (e.g. $10^5$) there was hardly any difference in regret for different $H_*$ values.

Obviously, the naive algorithm has the opposite behavior

compared to our algorithms, since the higher $H_*$, the more likely it is to choose the best arm. Among the three algorithms, HUCB1 seems to be the most sensitive with respect to unbalanced history.

[Figure 2: $R_n/(n\Delta_{\max})$ vs the amount of history at 5, iterations. Three panels with legends UCB1, HUCB1, BUCB; UCBV, HUCBV, BUCBV; UCB3, HUCB3, BUCB. Horizontal axis: amount of history.]

[Figure 3: $R_n/(n\Delta_{\max})$ vs the amount of $H_*$ at 5, iterations. Three panels with legends UCBV, HUCBV, BUCBV; UCB3, HUCB3, BUCB; UCB1, HUCB1, BUCB. Horizontal axis: amount of $H_*$.]

6 Discussion

As we pointed out in Section 4.1, the naive way of incorporating history is to have $\sqrt{2\log(t+H)/(n_j+H_j)}$ in the confidence interval rather than our choice of $\sqrt{2\log(t+H_j)/(n_j+H_j)}$. We also pointed out that the regret bounds can be significantly better for our choice of confidence interval compared to the naive choice. This leads to an intriguing possibility for multi-armed bandit problems with no historic data. If we closely examine UCB algorithms (UCB1 for instance), the confidence interval there is $\sqrt{2\log t/n_j}$. A natural question is whether it is possible to replace $t$ inside the logarithm such that the per-arm history during a run of UCB1 is better incorporated. An algorithm of this kind would better exploit the per-arm history during a run. Proposing a formal confidence interval and proving rigorous upper bounds in this case seem like interesting directions of research to pursue.

7 Conclusions

We proposed three novel algorithms to exploit historic data in stochastic multi-armed bandit problems. The algorithms themselves have no control over the historic data, nor do the arms have to be sampled uniformly. Logarithmic finite-time regret bounds were derived for each of the three proposed algorithms. The bounds showed that already a logarithmic amount of historic data can lead to constant regret with our algorithms. Experiments were conducted on a large-scale dataset. The experiments validated our theory and showed that even a little historic data can make a significant difference in terms of regret. Overall, HUCBV has the best performance among all the algorithms. A properly tuned HUCB3 often performs better than HUCB1.
A future direction in this line of research is to derive algorithms that can exploit historic data also for other stochastic bandit settings, such as bandits with a continuum of arms, dueling bandits, etc. While we only showed upper bounds on the performance of the proposed algorithms in this paper, a natural next step is to also prove lower bounds for bandit problems with historic data.

Acknowledgments

We thank Bobby Kleinberg for helpful discussions. This work was funded in part under NSF award IIS

References

[1] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.
[2] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. In ALT. Springer, 2007.
[3] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Variance estimates and exploration function in multi-armed bandit. Research report 07-31, Certis - Ecole des Ponts, 2007.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[5] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[6] P. Borjesson and C.-E. Sundberg. Simple approximations of the error function Q(x) for communications applications. IEEE Transactions on Communications, 27, 1979.
[7] Robert D. Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. In COLT, 2008.
[8] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
[9] John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging. In ICML, 2008.
[10] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying suspicious URLs: An application of large-scale online learning. In ICML, 2009.

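A Appendix: Proof of Theorem 3

Proof Only a sketch of this proof, showing the differences from the corresponding steps in a similar derivation for UCB3, is given. The probability that the arm $j$ is chosen at time $n$ is given by:

$P[I_n = j] = \epsilon_j^n + \left(1 - \sum_{i=1}^{K} \epsilon_i^n\right) P[\bar X^h_{j,T_j(n-1)} \ge \bar X^h_{*,T_*(n-1)}]$. (1)

Moreover,

$P[\bar X^h_{j,T_j(n)} \ge \bar X^h_{*,T_*(n)}] \le P[\bar X^h_{j,T_j(n)} \ge \mu_j + \Delta_j/2] + P[\bar X^h_{*,T_*(n)} \le \mu_* - \Delta_j/2]$. (2)

Denoting $\frac{1}{2}\sum_{t=1}^{n} \epsilon_j^t$ by $x_0$, it can be shown that the first term above is upper bounded by

$P[\bar X^h_{j,T_j(n)} \ge \mu_j + \Delta_j/2] \le x_0\, P[T_j^R(n) \le x_0] + \frac{2}{\Delta_j^2}\, e^{-\Delta_j^2 x_0/2}\, e^{-H_j \Delta_j^2/2}$,

where we get the extra factor $\exp(-H_j\Delta_j^2/2)$ from an application of Hoeffding's inequality incorporating the historic data, and $T_j^R(n)$ is the number of times arm $j$ is selected at random in the first $n$ draws. Since $\Delta_j \ge d$ for all $j$, we can replace $\exp(-H_j\Delta_j^2/2)$ with $\exp(-H_j d^2/2)$. It can further be shown that

$P[T_j^R(n) \le x_0] \le e^{-x_0/5}$, (3)

using Bernstein's inequality. Finally, we can lower bound $x_0$ as follows:

x = 1 ǫ t = 1 K log 1 K + 1 ek t= K +1 K e H / 1+t e H / 1+e K e H / (4)

Using (1), (2), (3) and (4), it can be shown that:

P[I = ] K e H / 1+ + P 1 1 log + P 4 + P 1 log P 1 P. 4 e H / + P 4 e H / (5)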

where P := K e H / K e H / . Thus, for $c \ge 10$, the last four terms in (5) are $o(1/n)$ since $d < 1$.