Follow the Leader If You Can, Hedge If You Must
Journal of Machine Learning Research 15 (2014). Submitted 1/13; Revised 1/14; Published 4/14.

Follow the Leader If You Can, Hedge If You Must

Steven de Rooij, VU University and University of Amsterdam, Science Park 904, P.O. Box 94323, 1090 GH Amsterdam, the Netherlands

Tim van Erven, Département de Mathématiques, Université Paris-Sud, Orsay Cedex, France

Peter D. Grünwald and Wouter M. Koolen, Leiden University (Grünwald) and Centrum Wiskunde & Informatica (Grünwald and Koolen), Science Park 123, P.O. Box 94079, 1090 GB Amsterdam, the Netherlands

Editor: Nicolò Cesa-Bianchi

Abstract

Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has poor performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As a stepping stone for our analysis, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour, and Stoltz (2007), yielding improved worst-case guarantees. By interleaving AdaHedge and FTL, FlipFlop achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge's worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.

Keywords: Hedge, learning rate, mixability, online learning, prediction with expert advice

1. Introduction

We consider sequential prediction in the general framework of Decision Theoretic Online Learning (DTOL) or the Hedge setting (Freund and Schapire, 1997), which is a variant of prediction with expert advice (Littlestone and Warmuth, 1994; Vovk, 1998; Cesa-Bianchi and Lugosi, 2006). Our goal is to develop a sequential prediction algorithm that performs well not only on adversarial data, which is the scenario most studies worry about, but also when the data are easy, as is often the case in practice. Specifically, with adversarial data, the worst-case regret (defined below) for any algorithm is $\Omega(\sqrt{T})$, where $T$ is the number of predictions to be made. Algorithms such as Hedge, which have been designed to achieve this lower bound, typically continue to suffer regret of order $\sqrt{T}$, even for easy data, where

© 2014 Steven de Rooij, Tim van Erven, Peter D. Grünwald and Wouter M. Koolen.
the regret of the more intuitive but less robust Follow-the-Leader (FTL) algorithm (also defined below) is bounded. Here, we present the first algorithm which, up to constant factors, provably achieves both the regret lower bound in the worst case, and a regret not exceeding that of FTL. Below, we first describe the Hedge setting. Then we introduce FTL, discuss sophisticated versions of Hedge from the literature, and give an overview of the results and contents of this paper.

1.1 Overview

In the Hedge setting, prediction proceeds in rounds. At the start of each round $t = 1, 2, \ldots$, a learner has to decide on a weight vector $\mathbf{w}_t = (w_{t,1}, \ldots, w_{t,K}) \in \mathbb{R}^K$ over $K$ experts. Each weight $w_{t,k}$ is required to be nonnegative, and the sum of the weights should be 1. Nature then reveals a $K$-dimensional vector containing the losses of the experts, $\boldsymbol{\ell}_t = (\ell_{t,1}, \ldots, \ell_{t,K}) \in \mathbb{R}^K$. Learner's loss is the dot product $h_t = \mathbf{w}_t \cdot \boldsymbol{\ell}_t$, which can be interpreted as the expected loss if Learner uses a mixed strategy and chooses expert $k$ with probability $w_{t,k}$. We denote aggregates of per-trial quantities by their capital letter, and vectors are in bold face. Thus, $L_{t,k} = \ell_{1,k} + \ldots + \ell_{t,k}$ denotes the cumulative loss of expert $k$ after $t$ rounds, and $H_t = h_1 + \ldots + h_t$ is Learner's cumulative loss (the Hedge loss). Learner's performance is evaluated in terms of her regret, which is the difference between her cumulative loss and the cumulative loss of the best expert: $R_t = H_t - L_t^*$, where $L_t^* = \min_k L_{t,k}$. We will always analyse the regret after an arbitrary number of rounds $T$. We will omit the subscript $T$ for aggregate quantities such as $L_T^*$ or $R_T$ wherever this does not cause confusion.

A simple and intuitive strategy for the Hedge setting is Follow-the-Leader (FTL), which puts all weight on the expert(s) with the smallest loss so far. More precisely, we will define the weights $\mathbf{w}_t$ for FTL to be uniform on the set of leaders $\{k \mid L_{t-1,k} = L_{t-1}^*\}$, which is often just a singleton. FTL works very well in many circumstances, for example in stochastic scenarios where the losses are independent and identically distributed (i.i.d.). In particular, the regret for Follow-the-Leader is bounded by the number of times the leader is overtaken by another expert (Lemma 10), which in the i.i.d. case almost surely happens only a finite number of times (by the uniform law of large numbers), provided the mean loss of the best expert is strictly smaller than the mean loss of the other experts. As demonstrated by the experiments in Section 5, many more sophisticated algorithms can perform significantly worse than FTL.
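To make the FTL baseline concrete, the following minimal matlab sketch (our own illustration, not code from the paper; the uniform tie-breaking matches the definition above) plays FTL and reports the resulting regret.

```matlab
% Minimal Follow-the-Leader sketch: l(t,k) is the loss of expert k at time t.
function R = ftl_regret(l)
    [T, K] = size(l);
    L = zeros(1, K);        % cumulative losses of the experts
    H = 0;                  % cumulative loss of the learner
    for t = 1:T
        w = (L == min(L));  % uniform weights on the current leaders
        w = w / sum(w);
        H = H + w * l(t,:)';
        L = L + l(t,:);
    end
    R = H - min(L);         % regret with respect to the best expert
end
```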
The problem with FTL is that it breaks down badly when the data are antagonistic. For example, if one out of two experts incurs losses $\frac{1}{2}, 0, 1, 0, \ldots$ while the other incurs opposite losses $0, 1, 0, 1, \ldots$, the regret for FTL at time $T$ is about $T/2$ (this scenario is further discussed in Section 5.1). This has prompted the development of a multitude of alternative algorithms that provide better worst-case regret guarantees.

The seminal strategy for the learner is called Hedge (Freund and Schapire, 1997, 1999). Its performance crucially depends on a parameter $\eta$ called the learning rate. Hedge can be interpreted as a generalisation of FTL, which is recovered in the limit for $\eta \to \infty$. In many analyses, the learning rate is changed from infinity to a lower value that optimizes some upper bound on the regret. Doing so requires precognition of the number of rounds of the game, or of some property of the data such as the eventual loss of the best expert $L_T^*$. Provided that the relevant statistic is monotonically nondecreasing in $t$ (such as $L_t^*$), a simple way to address this issue is the so-called doubling trick: setting a budget on the statistic, and restarting the algorithm with a double budget when the budget is depleted (Cesa-Bianchi and Lugosi, 2006; Cesa-Bianchi et al., 1997; Hazan and Kale, 2008); $\eta$ can then be optimised for each individual block in terms of the budget. Better bounds, but harder analyses, are typically obtained if the learning rate is adjusted each round based on previous observations, see e.g. (Cesa-Bianchi and Lugosi, 2006; Auer et al., 2002).

The Hedge strategy presented by Cesa-Bianchi, Mansour, and Stoltz (2007) is a sophisticated example of such adaptive tuning. The relevant algorithm, which we refer to as CBMS, is defined in (16) in Section 4.2 of their paper. To discuss its guarantees, we need the following notation. Let $\ell_t^- = \min_k \ell_{t,k}$ and $\ell_t^+ = \max_k \ell_{t,k}$ denote the smallest and largest loss in round $t$, and let $L_t^- = \ell_1^- + \ldots + \ell_t^-$ and $L_t^+ = \ell_1^+ + \ldots + \ell_t^+$ denote the cumulative minimum and maximum loss respectively. Further let $s_t = \ell_t^+ - \ell_t^-$ denote the loss range in trial $t$ and let $S_t = \max\{s_1, \ldots, s_t\}$ denote the largest loss range after $t$ trials. Then, without prior knowledge of any property of the data, including $T$, $S$ and $L^*$, the CBMS strategy achieves regret bounded by(1)

$$R^{\mathrm{cbms}} \le 4\sqrt{\frac{(L^* - L^-)(L^- + ST - L^*)}{T}\ln K} + \text{lower order terms} \qquad (1)$$

(Cesa-Bianchi et al., 2007, Corollary 3). Hence, in the worst case $L^* = L^- + ST/2$ and the bound is of order $S\sqrt{T\ln K}$, but when the loss of the best expert $L^* \in [L^-, L^- + ST]$ is close to either boundary the guarantees are much stronger.

The contributions of this work are twofold: first, in Section 2, we develop AdaHedge, which is a refinement of the CBMS strategy. A (very) preliminary version of this strategy was presented at NIPS (Van Erven et al., 2011). Like CBMS, AdaHedge is completely parameterless and tunes the learning rate in terms of a direct measure of past performance. We derive an improved worst-case bound of the following form. Again without any assumptions, we have

$$R^{\mathrm{ah}} \le 2\sqrt{S\,\frac{(L^* - L^-)(L^+ - L^*)}{L^+ - L^-}\ln K} + \text{lower order terms} \qquad (2)$$

(see Theorem 8). The parabola under the square root is always smaller than or equal to its CBMS counterpart (since it is nondecreasing in $L^+$ and $L^+ \le L^- + ST$); it expresses that the regret is small if $L^* \in [L^-, L^+]$ is close to either boundary. It is maximized in $L^*$ at the midpoint between $L^-$ and $L^+$, and in this case we recover the worst-case bound of order $S\sqrt{T\ln K}$.

Like (1), the regret bound (2) is fundamental, which means that it is invariant under translation of the losses and proportional to their scale. Moreover, not only AdaHedge's regret bound is fundamental: the weights issued by the algorithm are themselves invariant

Footnote 1: As pointed out by a referee, it is widely known that the leading constant of 4 can be improved to $2\sqrt{2} \approx 2.83$ using techniques by Györfi and Ottucsák (2007) that are essentially equivalent to our Lemma 2 below; Gerchinovitz (2011, Remark 2.2) reduced it to approximately 2.63. AdaHedge allows a slight further reduction to 2.
under translation and scaling (see Section 4). The CBMS algorithm and AdaHedge are insensitive to trials in which all experts suffer the same loss, a natural property we call "timelessness". An attractive feature of the new bound (2) is that it expresses this property. A more detailed discussion appears below Theorem 8.

Our second contribution is to develop a second algorithm, called FlipFlop, that retains the worst-case bound (2) (up to a constant factor), but has even better guarantees for easy data: its performance is never substantially worse than that of Follow-the-Leader. At first glance, this may seem trivial to accomplish: simply take both FTL and AdaHedge, and combine the two by using FTL or Hedge recursively. To see why such approaches do not work, suppose that FTL achieves regret $R^{\mathrm{fl}}$, while AdaHedge achieves regret $R^{\mathrm{ah}}$. We would only be able to prove that the regret of the combined strategy compared to the best original expert satisfies $R^c \le \min\{R^{\mathrm{fl}}, R^{\mathrm{ah}}\} + G^c$, where $G^c$ is the worst-case regret guarantee for the combination method, e.g. (1). In general, either $R^{\mathrm{fl}}$ or $R^{\mathrm{ah}}$ may be close to zero, while at the same time the regret of the combination method, or at least its bound $G^c$, is proportional to $\sqrt{T}$. That is, the overhead of the combination method will dominate the regret! The FlipFlop approach we describe in Section 3 circumvents this by alternating between Following the Leader and using AdaHedge in a carefully specified way. For this strategy we can guarantee $R^{\mathrm{ff}} = O(\min\{R^{\mathrm{fl}}, G^{\mathrm{ah}}\})$, where $G^{\mathrm{ah}}$ is the regret guarantee for AdaHedge; Theorem 15 provides a precise statement. Thus, FlipFlop is the first algorithm that provably combines the benefits of Follow-the-Leader with robust behaviour for antagonistic data.

A key concept in the design and analysis of our algorithms is what we call the mixability gap, introduced in Section 2.1. This quantity also appears in earlier works, and seems to be of fundamental importance in both the current Hedge setting as well as in stochastic settings. We elaborate on this in Section 6.2, where we provide the big picture underlying this research and we briefly indicate how it relates to practical work such as (Devaine et al., 2013).

1.2 Related Work

As mentioned, AdaHedge is a refinement of the strategy analysed by Cesa-Bianchi et al. (2007), which is itself more sophisticated than most earlier approaches, with two notable exceptions. First, Chaudhuri, Freund, and Hsu (2009) describe a strategy called NormalHedge that can efficiently compete with the best $\epsilon$-quantile of experts; their bound is incomparable with the bounds for CBMS and for AdaHedge. Second, Hazan and Kale (2008) develop a strategy called Variation MW that has especially low regret when the losses of the best expert vary little between rounds. They show that the regret of Variation MW is of order $\sqrt{\mathrm{VAR}_T^{\max}\ln K}$, where $\mathrm{VAR}_T^{\max} = \max_{t \le T} \sum_{s=1}^t \big(\ell_{s,k_t^*} - \tfrac{1}{t}L_{t,k_t^*}\big)^2$ with $k_t^*$ the best expert after $t$ rounds. This bound dominates our worst-case result (2) (up to a multiplicative constant). As demonstrated by the experiments in Section 5, their method does not achieve the benefits of FTL, however. In Section 5 we also discuss the performance of NormalHedge and Variation MW compared to AdaHedge and FlipFlop.
Other approaches to sequential prediction include Defensive Forecasting (Vovk et al., 2005), and Following the Perturbed Leader (Kalai and Vempala, 2003). These radically different approaches also allow competing with the best $\epsilon$-quantile, as shown by Chernov and Vovk (2010) and Hutter and Poland (2005); the latter also consider nonuniform weights on the experts.

The safe MDL and safe Bayesian algorithms by Grünwald (2011, 2012) share the present work's focus on the mixability gap as a crucial part of the analysis, but are concerned with the stochastic setting where losses are not adversarial but i.i.d. FlipFlop, safe MDL and safe Bayes can all be interpreted as methods that attempt to choose a learning rate $\eta$ that keeps the mixability gap small (or, equivalently, that keeps the Bayesian posterior or Hedge weights "concentrated").

1.3 Outline

In the next section we present and analyse AdaHedge and compare its worst-case regret bound to existing results, in particular the bound for CBMS. Then, in Section 3, we build on AdaHedge to develop the FlipFlop strategy. The analysis closely parallels that of AdaHedge, but with extra complications at each of the steps. In Section 4 we show that both algorithms have the property that their behaviour does not change under translation and scaling of the losses. We further illustrate the relationship between the learning rate and the regret, and compare AdaHedge and FlipFlop to existing methods, in experiments with artificial data in Section 5. Finally, Section 6 contains a discussion, with ambitious suggestions for future work.

2. AdaHedge

In this section, we present and analyse the AdaHedge strategy. To introduce our notation and proof strategy, we start with the simplest possible analysis of vanilla Hedge, and then move on to refine it for AdaHedge.

2.1 Basic Hedge Analysis for Constant Learning Rate

Following Freund and Schapire (1997), we define the Hedge or exponential weights strategy as the choice of weights

$$w_{t,k} = \frac{w_{1,k}\,e^{-\eta L_{t-1,k}}}{Z_t}, \qquad (3)$$

where $\mathbf{w}_1 = (1/K, \ldots, 1/K)$ is the uniform distribution, $Z_t = \mathbf{w}_1 \cdot e^{-\eta \mathbf{L}_{t-1}}$ is a normalizing constant, and $\eta \in (0, \infty)$ is a parameter of the algorithm called the learning rate. If $\eta = 1$ and one imagines $L_{t-1,k}$ to be the negative log-likelihood of a sequence of observations, then $w_{t,k}$ is the Bayesian posterior probability of expert $k$ and $Z_t$ is the marginal likelihood of the observations. Like in Bayesian inference, the weights are updated multiplicatively, i.e. $w_{t+1,k} \propto w_{t,k}e^{-\eta\ell_{t,k}}$.

The loss incurred by Hedge in round $t$ is $h_t = \mathbf{w}_t \cdot \boldsymbol{\ell}_t$, the cumulative Hedge loss is $H_t = h_1 + \ldots + h_t$, and our goal is to obtain a good bound on $H_T$.
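As a concrete illustration (our own sketch, not code from the paper), Hedge with a fixed learning rate takes only a few lines of matlab; subtracting min(L) before exponentiation avoids numerical underflow without changing the normalized weights.

```matlab
% Fixed learning rate Hedge (sketch): l(t,k) is the loss of expert k at time t.
function h = hedge(l, eta)
    [T, K] = size(l);
    h = nan(T, 1);
    L = zeros(1, K);                   % cumulative expert losses
    for t = 1:T
        w = exp(-eta * (L - min(L)));  % unnormalized exponential weights (3)
        w = w / sum(w);
        h(t) = w * l(t,:)';            % Hedge loss: dot product w_t . l_t
        L = L + l(t,:);
    end
end
```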
To this end, it turns out to be technically convenient to approximate $h_t$ by the mix loss

$$m_t = -\frac{1}{\eta}\ln\big(\mathbf{w}_t \cdot e^{-\eta\boldsymbol{\ell}_t}\big), \qquad (4)$$

which accumulates to $M_t = m_1 + \ldots + m_t$. This approximation is a standard tool in the literature. For example, the mix loss $m_t$ corresponds to the loss of Vovk's (1998; 2001) Aggregating Pseudo Algorithm, and tracking the evolution of $m_t$ is a crucial ingredient in the proof of Theorem 2.2 of Cesa-Bianchi and Lugosi (2006).

The definitions may be extended to $\eta = \infty$ by letting $\eta$ tend to $\infty$. We then find that $\mathbf{w}_t$ becomes a uniform distribution on the set of experts $\{k \mid L_{t-1,k} = L_{t-1}^*\}$ that have incurred smallest cumulative loss before time $t$. That is, Hedge with $\eta = \infty$ reduces to Follow-the-Leader, where in case of ties the weights are distributed uniformly. The limiting value for the mix loss is $m_t = L_t^* - L_{t-1}^*$.

In our approximation of the Hedge loss $h_t$ by the mix loss $m_t$, we call the approximation error $\delta_t = h_t - m_t$ the mixability gap. Bounding this quantity is a standard part of the analysis of Hedge-type algorithms (see, for example, Lemma 4 of Cesa-Bianchi et al. 2007) and it also appears to be a fundamental notion in sequential prediction even when only so-called mixable losses are considered (Grünwald, 2011, 2012); see also Section 6.2. We let $\Delta_t = \delta_1 + \ldots + \delta_t$ denote the cumulative mixability gap, so that the regret for Hedge may be decomposed as

$$R = H - L^* = M - L^* + \Delta. \qquad (5)$$

Here $M - L^*$ may be thought of as the regret under the mix loss and $\Delta$ is the cumulative approximation error when approximating the Hedge loss by the mix loss. Throughout the paper, our proof strategy will be to analyse these two contributions to the regret, $M - L^*$ and $\Delta$, separately.

The following lemma, which is proved in Appendix A, collects a few basic properties of the mix loss:

Lemma 1 (Mix Loss with Constant Learning Rate) For any learning rate $\eta \in (0, \infty]$:

1. $\ell_t^- \le m_t \le h_t \le \ell_t^+$, so that $0 \le \delta_t \le s_t$.
2. Cumulative mix loss telescopes: $M_T = -\frac{1}{\eta}\ln\big(\mathbf{w}_1 \cdot e^{-\eta\mathbf{L}_T}\big)$ for $\eta < \infty$, and $M_T = L_T^*$ for $\eta = \infty$.
3. Cumulative mix loss approximates the loss of the best expert: $L_T^* \le M_T \le L_T^* + \frac{\ln K}{\eta}$.
4. The cumulative mix loss $M_T$ is nonincreasing in $\eta$.
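To see where property #2 comes from (a one-step reconstruction of the argument, not the paper's Appendix A proof), substitute the Hedge weights (3) into the mix loss (4):

$$m_t = -\frac{1}{\eta}\ln\big(\mathbf{w}_t \cdot e^{-\eta\boldsymbol{\ell}_t}\big) = -\frac{1}{\eta}\ln\frac{\mathbf{w}_1 \cdot e^{-\eta\mathbf{L}_t}}{\mathbf{w}_1 \cdot e^{-\eta\mathbf{L}_{t-1}}},$$

so consecutive terms cancel and $M_T = \sum_{t=1}^T m_t = -\frac{1}{\eta}\ln\big(\mathbf{w}_1 \cdot e^{-\eta\mathbf{L}_T}\big)$. Property #3 then follows because $\mathbf{w}_1 \cdot e^{-\eta\mathbf{L}_T} = \frac{1}{K}\sum_k e^{-\eta L_{T,k}}$ lies between $\frac{1}{K}e^{-\eta L_T^*}$ and $e^{-\eta L_T^*}$.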
In order to obtain a bound for Hedge, one can use the following well-known bound on the mixability gap, which is obtained using Hoeffding's bound on the cumulant generating function (Cesa-Bianchi and Lugosi, 2006, Lemma A.1):

$$\delta_t \le \frac{\eta}{8}s_t^2, \qquad (6)$$

from which $\Delta_T \le \eta S^2T/8$, where (as in the introduction) $S_t = \max\{s_1, \ldots, s_t\}$ is the maximum loss range in the first $t$ rounds. Together with the bound $M - L^* \le \ln(K)/\eta$ from mix loss property #3 this leads to

$$R = (M - L^*) + \Delta \le \frac{\ln K}{\eta} + \frac{\eta S^2T}{8}. \qquad (7)$$

The bound is optimized for $\eta = \sqrt{8\ln(K)/(S^2T)}$, which equalizes the two terms. This leads to a bound on the regret of $S\sqrt{T\ln(K)/2}$, matching the lower bound on worst-case regret from the textbook by Cesa-Bianchi and Lugosi (2006, Section 3.7). We can use this tuned learning rate if the time horizon $T$ is known in advance. To deal with the situation where $T$ is unknown, either the doubling trick or a time-varying learning rate (see Lemma 2 below) can be used, at the cost of a worse constant factor in the leading term of the regret bound.

In the remainder of this section, we introduce a completely parameterless algorithm called AdaHedge. We then refine the steps of the analysis above to obtain a better regret bound.

2.2 AdaHedge Analysis

In the previous section, we split the regret for Hedge into two parts: $M - L^*$ and $\Delta$, and we obtained a bound for both. The learning rate $\eta$ was then tuned to equalise these two bounds. The main distinction between AdaHedge and other Hedge approaches is that AdaHedge does not consider an upper bound on $\Delta$ in order to obtain this balance: instead it aims to equalize $\Delta$ and $\ln(K)/\eta$. As the cumulative mixability gap $\Delta_t$ is nondecreasing in $t$ (by mix loss property #1) and can be observed on-line, it is possible to adapt the learning rate directly based on $\Delta_t$. Perhaps the easiest way to achieve this is by using the doubling trick: each subsequent block uses half the learning rate of the previous block, and a new block is started as soon as the observed cumulative mixability gap exceeds the bound on the mix loss $\ln(K)/\eta$, which ensures these two quantities are equal at the end of each block. This is the approach taken in an earlier version of AdaHedge (Van Erven et al., 2011). However, we can achieve the same goal much more elegantly, by decreasing the learning rate with time according to

$$\eta_t^{\mathrm{ah}} = \frac{\ln K}{\Delta_{t-1}^{\mathrm{ah}}} \qquad (8)$$

(where $\Delta_0^{\mathrm{ah}} = 0$, so that $\eta_1^{\mathrm{ah}} = \infty$). Note that the AdaHedge learning rate does not involve the end time $T$ or any other unobserved properties of the data; all subsequent analysis is therefore valid for all $T$ simultaneously. The definitions (3) and (4) of the weights and the mix loss are modified to use this new learning rate:

$$w_{t,k}^{\mathrm{ah}} = \frac{w_{1,k}^{\mathrm{ah}}\,e^{-\eta_t^{\mathrm{ah}}L_{t-1,k}}}{\mathbf{w}_1^{\mathrm{ah}} \cdot e^{-\eta_t^{\mathrm{ah}}\mathbf{L}_{t-1}}} \qquad \text{and} \qquad m_t^{\mathrm{ah}} = -\frac{1}{\eta_t^{\mathrm{ah}}}\ln\big(\mathbf{w}_t^{\mathrm{ah}} \cdot e^{-\eta_t^{\mathrm{ah}}\boldsymbol{\ell}_t}\big), \qquad (9)$$

with $\mathbf{w}_1^{\mathrm{ah}} = (1/K, \ldots, 1/K)$ uniform. Note that the multiplicative update rule for the weights no longer applies when the learning rate varies with $t$; the last three results of Lemma 1 are also no longer valid.
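As a small worked example of the recursion (8) (our own illustration): consider $K = 2$ experts with first loss vector $\boldsymbol{\ell}_1 = (1, 0)$. Since $\Delta_0^{\mathrm{ah}} = 0$, the first round uses $\eta_1^{\mathrm{ah}} = \infty$, so AdaHedge plays FTL with uniform weights $\mathbf{w}_1 = (\tfrac12, \tfrac12)$, and the mix loss takes its limiting value $m_1 = L_1^* - L_0^* = 0$. Then

$$h_1 = \tfrac12 \cdot 1 + \tfrac12 \cdot 0 = \tfrac12, \qquad \delta_1 = h_1 - m_1 = \tfrac12, \qquad \eta_2^{\mathrm{ah}} = \frac{\ln 2}{\Delta_1^{\mathrm{ah}}} = 2\ln 2,$$

so a single ambiguous round already brings the learning rate down from $\infty$ to a finite value; later rounds decrease it only as fast as the mixability gap actually accumulates.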
Later we will also consider other algorithms to determine variable learning rates; to avoid confusion the considered algorithm is always specified in the superscript in our notation. See Table 1 for reference.

Table 1: Notation

Single round quantities for trial $t$:
  $\boldsymbol{\ell}_t$ — Loss vector
  $\ell_t^- = \min_k \ell_{t,k}$, $\ell_t^+ = \max_k \ell_{t,k}$ — Min and max loss
  $s_t = \ell_t^+ - \ell_t^-$ — Loss range
  $w_{t,k}^{\mathrm{alg}} = e^{-\eta_t^{\mathrm{alg}}L_{t-1,k}} \big/ \sum_k e^{-\eta_t^{\mathrm{alg}}L_{t-1,k}}$ — Weights played
  $h_t^{\mathrm{alg}} = \mathbf{w}_t^{\mathrm{alg}} \cdot \boldsymbol{\ell}_t$ — Hedge loss
  $m_t^{\mathrm{alg}} = -\frac{1}{\eta_t^{\mathrm{alg}}}\ln\big(\mathbf{w}_t^{\mathrm{alg}} \cdot e^{-\eta_t^{\mathrm{alg}}\boldsymbol{\ell}_t}\big)$ — Mix loss
  $\delta_t^{\mathrm{alg}} = h_t^{\mathrm{alg}} - m_t^{\mathrm{alg}}$ — Mixability gap
  $v_t^{\mathrm{alg}} = \mathrm{Var}_{k \sim \mathbf{w}_t^{\mathrm{alg}}}[\ell_{t,k}]$ — Loss variance

Aggregate quantities after $t$ rounds (the final time $T$ is omitted from the subscript where possible, e.g. $L^* = L_T^*$):
  $L_{t,k}$, $L_t^-$, $L_t^+$, $H_t^{\mathrm{alg}}$, $M_t^{\mathrm{alg}}$, $\Delta_t^{\mathrm{alg}}$, $V_t^{\mathrm{alg}}$ — sums over $\tau = 1, \ldots, t$ of $\ell_{\tau,k}$, $\ell_\tau^-$, $\ell_\tau^+$, $h_\tau^{\mathrm{alg}}$, $m_\tau^{\mathrm{alg}}$, $\delta_\tau^{\mathrm{alg}}$, $v_\tau^{\mathrm{alg}}$
  $S_t = \max\{s_1, \ldots, s_t\}$ — Maximum loss range
  $L_t^* = \min_k L_{t,k}$ — Cumulative loss of the best expert
  $R_t^{\mathrm{alg}} = H_t^{\mathrm{alg}} - L_t^*$ — Regret

Algorithms (the "alg" in the superscript above):
  $(\eta)$ — Hedge with fixed learning rate $\eta$
  ah — AdaHedge, defined by (8)
  fl — Follow-the-Leader ($\eta^{\mathrm{fl}} = \infty$)
  ff — FlipFlop, defined by (16)

From now on, AdaHedge will be defined as the Hedge algorithm with learning rate defined by (8). For concreteness, a matlab implementation appears in Figure 1. Our learning rate is similar to that of Cesa-Bianchi et al. (2007), but it is less pessimistic as it is based on the mixability gap itself rather than its bound, and as such may exploit easy sequences of losses more aggressively. Moreover our tuning of the learning rate simplifies the analysis, leading to tighter results; the essential new technical ingredients appear as Lemmas 3, 5 and 7 below.

We analyse the regret for AdaHedge like we did for a fixed learning rate in the previous section: we again consider $M^{\mathrm{ah}} - L^*$ and $\Delta^{\mathrm{ah}}$ separately. This time, both legs of the analysis become slightly more involved. Luckily, a good bound can still be obtained with only a small amount of work. First we show that the mix loss is bounded by the mix loss we would have incurred if we would have used the final learning rate $\eta_T^{\mathrm{ah}}$ all along:

Lemma 2 Let "dec" be any strategy for choosing the learning rate such that $\eta_1 \ge \eta_2 \ge \ldots$ Then the cumulative mix loss for dec does not exceed the cumulative mix loss for the strategy that uses the last learning rate $\eta_T$ from the start: $M^{\mathrm{dec}} \le M^{(\eta_T)}$.
```matlab
% Returns the losses of AdaHedge.
% l(t,k) is the loss of expert k at time t
function h = adahedge(l)
    [T, K] = size(l);
    h = nan(T,1);
    L = zeros(1,K);
    Delta = 0;

    for t = 1:T
        eta = log(K)/Delta;
        [w, Mprev] = mix(eta, L);
        h(t) = w * l(t,:)';
        L = L + l(t,:);
        [~, M] = mix(eta, L);
        delta = max(0, h(t)-(M-Mprev)); % max clips numeric Jensen violation
        Delta = Delta + delta;
    end
end

% Returns the posterior weights and mix loss
% for learning rate eta and cumulative loss
% vector L, avoiding numerical instability.
function [w, M] = mix(eta, L)
    mn = min(L);
    if (eta == Inf) % Limit behaviour: FTL
        w = L==mn;
    else
        w = exp(-eta .* (L-mn));
    end
    s = sum(w);
    w = w / s;
    M = mn - log(s/length(L))/eta;
end
```

Figure 1: Numerically robust matlab implementation of AdaHedge

This lemma was first proved in its current form by Kalnishkan and Vyugin (2005, Lemma 3), and an essentially equivalent bound was introduced by Györfi and Ottucsák (2007) in the proof of their Lemma 1. Related techniques for dealing with time-varying learning rates go back to Auer et al. (2002).

Proof Using mix loss property #4, we have

$$M_T^{\mathrm{dec}} = \sum_{t=1}^T m_t^{\mathrm{dec}} = \sum_{t=1}^T\Big(M_t^{(\eta_t)} - M_{t-1}^{(\eta_t)}\Big) \le \sum_{t=1}^T\Big(M_t^{(\eta_t)} - M_{t-1}^{(\eta_{t-1})}\Big) = M_T^{(\eta_T)},$$

which was to be shown.

We can now show that the two contributions to the regret are still balanced.

Lemma 3 The AdaHedge regret is $R^{\mathrm{ah}} = M^{\mathrm{ah}} - L^* + \Delta^{\mathrm{ah}} \le 2\Delta^{\mathrm{ah}}$.

Proof As $\delta_t^{\mathrm{ah}} \ge 0$ for all $t$ (by mix loss property #1), the cumulative mixability gap $\Delta_t^{\mathrm{ah}}$ is nondecreasing. Consequently, the AdaHedge learning rate $\eta_t^{\mathrm{ah}}$ as defined in (8) is nonincreasing in $t$. Thus Lemma 2 applies to $M^{\mathrm{ah}}$; together with mix loss property #3 and (8) this yields

$$M^{\mathrm{ah}} \le M^{(\eta_T^{\mathrm{ah}})} \le L^* + \frac{\ln K}{\eta_T^{\mathrm{ah}}} = L^* + \Delta_{T-1}^{\mathrm{ah}} \le L^* + \Delta_T^{\mathrm{ah}}.$$

Substitution into the trivial decomposition $R^{\mathrm{ah}} = M^{\mathrm{ah}} - L^* + \Delta^{\mathrm{ah}}$ yields the result.
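As an aside, here is a hypothetical usage example of the adahedge function of Figure 1 (our own illustration: it assumes the figure's code is saved on the matlab path as adahedge.m, and the random data and all constants are invented; the implicit column expansion requires R2016b or later).

```matlab
% Run AdaHedge on T=1000 rounds of random binary losses for K=2 experts,
% where expert 2 is better on average (mean loss 0.4 versus 0.6).
rng(0);
l = double(rand(1000, 2) < [0.6 0.4]);     % column k is Bernoulli with the given mean
h = adahedge(l);
regret = cumsum(h) - min(cumsum(l), [], 2);
fprintf('AdaHedge regret after T=1000: %.2f\n', regret(end));
```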
The remaining task is to establish a bound on $\Delta^{\mathrm{ah}}$. As before, we start with a bound on the mixability gap in a single round, but rather than (6), we use Bernstein's bound on the mixability gap in a single round to obtain a result that is expressed in terms of the variance of the losses, $v_t^{\mathrm{ah}} = \mathrm{Var}_{k \sim \mathbf{w}_t^{\mathrm{ah}}}[\ell_{t,k}] = \sum_k w_{t,k}^{\mathrm{ah}}(\ell_{t,k} - h_t^{\mathrm{ah}})^2$.

Lemma 4 (Bernstein's Bound) Let $\eta_t = \eta_t^{\mathrm{alg}} \in (0, \infty)$ denote the finite learning rate chosen for round $t$ by any algorithm "alg". The mixability gap $\delta_t^{\mathrm{alg}}$ satisfies

$$\delta_t^{\mathrm{alg}} \le \frac{g(s_t\eta_t)}{s_t}v_t^{\mathrm{alg}}, \qquad \text{where } g(x) = \frac{e^x - x - 1}{x}. \qquad (10)$$

Further, $v_t^{\mathrm{alg}} \le (\ell_t^+ - h_t^{\mathrm{alg}})(h_t^{\mathrm{alg}} - \ell_t^-) \le s_t^2/4$.

Proof This is Bernstein's bound (Cesa-Bianchi and Lugosi, 2006, Lemma A.5) on the cumulant generating function, applied to the random variable $(\ell_{t,k} - \ell_t^-)/s_t \in [0, 1]$ with $k$ distributed according to $\mathbf{w}_t^{\mathrm{alg}}$.

Bernstein's bound is more sophisticated than Hoeffding's bound (6), because it expresses that the mixability gap $\delta_t$ is small not only when $\eta_t$ is small, but also when all experts have approximately the same loss, or when the weights $\mathbf{w}_t$ are concentrated on a single expert.

The next step is to use Bernstein's inequality to obtain a bound on the cumulative mixability gap $\Delta^{\mathrm{ah}}$. In the analysis of Cesa-Bianchi et al. (2007) this is achieved by first applying Bernstein's bound for each individual round, and then using a telescoping argument to obtain a bound on the sum. With our learning rate (8) it is convenient to reverse these steps: we first telescope, which can now be done with equality, and subsequently apply Bernstein's inequality in a stricter way.

Lemma 5 AdaHedge's cumulative mixability gap satisfies

$$\big(\Delta^{\mathrm{ah}}\big)^2 \le V^{\mathrm{ah}}\ln K + \big(\tfrac{2}{3}\ln K + 1\big)S\Delta^{\mathrm{ah}}.$$

Proof In this proof we will omit the superscript "ah". Using the definition of the learning rate (8) and $\delta_t \le s_t$ (from mix loss property #1), we get

$$\Delta_T^2 = \sum_{t=1}^T\big(\Delta_t^2 - \Delta_{t-1}^2\big) = \sum_t\big(2\delta_t\Delta_{t-1} + \delta_t^2\big) = \sum_t\Big(2\frac{\ln K}{\eta_t}\delta_t + \delta_t^2\Big) \le \sum_t\Big(2\frac{\ln K}{\eta_t}\delta_t + s_t\delta_t\Big) \le 2\ln K\sum_t\frac{\delta_t}{\eta_t} + S\Delta_T. \qquad (11)$$

The inequalities in this equation replace a $\delta_t$ term by $S$, which is of no concern: the resulting term $S\Delta_T$ adds at most $2S$ to the regret bound. We will now show

$$\frac{\delta_t}{\eta_t} \le \tfrac{1}{2}v_t + \tfrac{1}{3}s_t\delta_t. \qquad (12)$$

This supersedes the bound $\delta_t/\eta_t \le (e - 2)v_t$ for $\eta_ts_t \le 1$ used by Cesa-Bianchi et al. (2007). Even though at first sight circular, the form (12) has two major advantages. First, inclusion of the overhead $\frac{1}{3}s_t\delta_t$ will only affect smaller order terms of the regret, but admits a reduction of the leading constant to the optimal factor $\frac{1}{2}$. This gain directly percolates to our regret bounds below. Second, (12) holds for unbounded $\eta_t$, which simplifies tuning considerably.
First note that (12) is clearly valid if $\eta_t = \infty$. Assuming that $\eta_t$ is finite, we can obtain this result by rewriting Bernstein's bound (10) as follows:

$$\tfrac{1}{2}v_t \ge \frac{s_t\delta_t}{2g(s_t\eta_t)} = \frac{\delta_t}{\eta_t} - f(s_t\eta_t)s_t\delta_t, \qquad \text{where } f(x) = \frac{e^x - \tfrac{1}{2}x^2 - x - 1}{xe^x - x^2 - x}.$$

It remains to show that $f(x) \le 1/3$ for all $x \ge 0$. After rearranging, we find this to be the case if $(3 - x)e^x \le \tfrac{1}{2}x^2 + 2x + 3$. Taylor expansion of the left-hand side around zero reveals that $(3 - x)e^x = \tfrac{1}{2}x^2 + 2x + 3 - \tfrac{1}{6}x^3ue^u$ for some $0 \le u \le x$, from which the result follows. The proof is completed by plugging (12) into (11) and finally relaxing $s_t \le S$.

Combination of these results yields the following natural regret bound, analogous to Theorem 5 of Cesa-Bianchi et al. (2007).

Theorem 6 AdaHedge's regret is bounded by

$$R^{\mathrm{ah}} \le 2\sqrt{V^{\mathrm{ah}}\ln K} + S\big(\tfrac{4}{3}\ln K + 2\big).$$

Proof Lemma 5 is of the form

$$\big(\Delta^{\mathrm{ah}}\big)^2 \le a + b\Delta^{\mathrm{ah}}, \qquad (13)$$

with $a$ and $b$ nonnegative numbers. Solving for $\Delta^{\mathrm{ah}}$ then gives

$$\Delta^{\mathrm{ah}} \le \tfrac{1}{2}b + \tfrac{1}{2}\sqrt{b^2 + 4a} \le \tfrac{1}{2}b + \tfrac{1}{2}\big(b + 2\sqrt{a}\big) = \sqrt{a} + b,$$

which by Lemma 3 implies that

$$R^{\mathrm{ah}} \le 2\sqrt{a} + 2b.$$

Plugging in the values $a = V^{\mathrm{ah}}\ln K$ and $b = S(\tfrac{2}{3}\ln K + 1)$ from Lemma 5 completes the proof.

This first regret bound for AdaHedge is difficult to interpret, because the cumulative loss variance $V^{\mathrm{ah}}$ depends on the actions of the AdaHedge strategy itself (through the weights $\mathbf{w}_t^{\mathrm{ah}}$). Below, we will derive a regret bound for AdaHedge that depends only on the data. However, AdaHedge has one important property that is captured by this first result that is no longer expressed by the worst-case bound we will derive below. Namely, if the data are easy in the sense that there is a clear best expert, say $k^*$, then the weights played by AdaHedge will concentrate on that expert. If $w_{t,k^*}^{\mathrm{ah}} \to 1$ as $t$ increases, then the loss variance must decrease: $v_t^{\mathrm{ah}} \to 0$. Thus, Theorem 6 suggests that the AdaHedge regret may be bounded if the weights concentrate on the best expert sufficiently quickly. This indeed turns out to be the case: we can prove that the regret is bounded for the stochastic setting where the loss vectors $\boldsymbol{\ell}_t$ are independent, and $\mathbb{E}[L_{t,k} - L_{t,k^*}] = \Omega(t^\beta)$ for all $k \ne k^*$ and any $\beta > 1/2$. This is an important feature of AdaHedge when it is used as a stand-alone algorithm, and Van Erven et al. (2011) provide a proof for the previous version of the strategy.
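The concentration effect is easy to observe numerically. The following self-contained sketch (our own illustration; all constants are invented, and the mix loss computation mirrors Figure 1) tracks the AdaHedge weight of the better of two experts under i.i.d. losses with a gap in means, which corresponds to $\beta = 1$ above.

```matlab
% Concentration sketch: with i.i.d. losses and a gap in mean losses,
% the AdaHedge weight of the better expert tends to one.
rng(0); T = 10000; K = 2;
l = double(rand(T, K) < [0.55 0.45]);    % expert 2 has the smaller mean loss
Delta = 0; L = zeros(1, K); w2 = nan(T, 1);
for t = 1:T
    eta = log(K) / Delta;                % eta = Inf while Delta == 0
    if isinf(eta), w = (L == min(L)); else, w = exp(-eta * (L - min(L))); end
    w = w / sum(w);
    w2(t) = w(2);                        % weight on the better expert
    h = w * l(t,:)';
    Lnew = L + l(t,:);
    if isinf(eta)
        m = min(Lnew) - min(L);          % limiting mix loss (FTL)
    else
        m = -log(w * exp(-eta * l(t,:))') / eta;
    end
    Delta = Delta + max(0, h - m);
    L = Lnew;
end
plot(w2);                                % tends to 1 as t increases
```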
See Section 5.4 for an example of concentration of the AdaHedge weights. Here we will not pursue this further, because the Follow-the-Leader strategy also incurs bounded loss in that case; we rather focus attention on how to successfully compete with FTL in Section 3.

We now proceed to derive a bound that depends only on the data, using an approach similar to the one taken by Cesa-Bianchi et al. (2007). We first bound the cumulative loss variance as follows:

Lemma 7 Assume $L^* \le H^{\mathrm{ah}}$. The cumulative loss variance for AdaHedge satisfies

$$V^{\mathrm{ah}} \le S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-} + 2S\Delta^{\mathrm{ah}}.$$

In the degenerate case $L^- = L^+$ the fraction reads $0/0$, but since we then have $V^{\mathrm{ah}} = 0$, from here on we define the ratio to be zero in that case, which is also its limiting value.

Proof We omit all "ah" superscripts. By Lemma 4 we have $v_t \le (\ell_t^+ - h_t)(h_t - \ell_t^-)$. Now

$$V = \sum_{t=1}^T v_t \le \sum_{t=1}^T(\ell_t^+ - h_t)(h_t - \ell_t^-) \le S\sum_{t=1}^T\frac{(\ell_t^+ - h_t)(h_t - \ell_t^-)}{(\ell_t^+ - h_t) + (h_t - \ell_t^-)} \le S\,\frac{(L^+ - H)(H - L^-)}{L^+ - L^-}, \qquad (14)$$

where the last inequality is an instance of Jensen's inequality applied to the function $B$ defined on the domain $x, y \ge 0$ by $B(x, y) = \frac{xy}{x+y}$ for $xy > 0$ and $B(x, y) = 0$ for $xy = 0$ to ensure continuity. To verify that $B$ is jointly concave, we will show that the Hessian is negative semi-definite on the interior $xy > 0$. Concavity on the whole domain then follows from continuity. The Hessian, which turns out to be the rank one matrix

$$\nabla^2B(x, y) = -\frac{2}{(x+y)^3}\begin{pmatrix}y \\ -x\end{pmatrix}\begin{pmatrix}y \\ -x\end{pmatrix}^{\!\top},$$

is negative semi-definite since it is a negative scaling of a positive outer product.

Subsequently using $H \ge L^*$ (by assumption) and $H \le L^* + 2\Delta$ (by Lemma 3) yields

$$\frac{(L^+ - H)(H - L^-)}{L^+ - L^-} \le \frac{(L^+ - L^*)(L^* + 2\Delta - L^-)}{L^+ - L^-} \le \frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-} + 2\Delta,$$

as desired.

This can be combined with Lemmas 5 and 3 to obtain our first main result:

Theorem 8 (AdaHedge Worst-Case Regret Bound) AdaHedge's regret is bounded by

$$R^{\mathrm{ah}} \le 2\sqrt{S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K} + S\big(\tfrac{16}{3}\ln K + 2\big). \qquad (15)$$
Proof If $H^{\mathrm{ah}} < L^*$, then $R^{\mathrm{ah}} < 0$ and the result is clearly valid. But if $H^{\mathrm{ah}} \ge L^*$, we can bound $V^{\mathrm{ah}}$ using Lemma 7 and plug the result into Lemma 5 to get an inequality of the form (13) with $a = S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K$ and $b = S(\tfrac{8}{3}\ln K + 1)$. Following the steps of the proof of Theorem 6 with these modified values for $a$ and $b$ we arrive at the desired result.

This bound has several useful properties:

1. It is always smaller than the CBMS bound (1), with a leading constant that has been reduced from the previously best-known value of 2.63 to 2. To see this, note that (15) increases to (1) if we replace $L^+$ by the upper bound $L^- + ST$. It can be substantially stronger than (1) if the range of the losses $s_t$ is highly variable.

2. The bound is fundamental, a concept discussed in detail by Cesa-Bianchi et al. (2007): it is invariant to translations of the losses and proportional to their scale. It is therefore valid for arbitrary loss ranges, regardless of sign. In fact, not just the bound, but AdaHedge itself is fundamental in this sense: see Section 4 for a discussion and proof.

3. The regret is small when the best expert either has a very low loss, or a very high loss. The latter is important if the algorithm is to be used for a scenario in which we are provided with a sequence of gain vectors $\mathbf{g}_t$ rather than losses: we can transform these gains into losses using $\boldsymbol{\ell}_t = -\mathbf{g}_t$, and then run AdaHedge. The bound then implies that we incur small regret if the best expert has very small cumulative gain relative to the minimum gain.

4. The bound is not dependent on the number of trials but only on the losses; it is a "timeless" bound as discussed below.

2.3 What are Timeless Bounds?

All bounds presented for AdaHedge (and FlipFlop) are timeless. We call a regret bound timeless if it does not change under insertion of additional trials where all experts are assigned the same loss. Intuitively, the prediction task does not become more difficult if nature should insert same-loss trials. Since these trials do nothing to differentiate between the experts, they can safely be ignored by the learner without affecting her regret; in fact, many Hedge strategies, including Hedge with a fixed learning rate, FTL, AdaHedge and CBMS already have the property that their future behaviour does not change under such insertions: they are robust against such time dilation. If any strategy does not have this property by itself, it can easily be modified to ignore equal-loss trials.

It is easy to imagine practical scenarios where this robustness property would be important. For example, suppose you hire a number of experts who continually monitor the assets in your portfolio. Usually they do not recommend any changes, but occasionally, when they see a rare opportunity or receive subtle warning signs, they may urge you to trade, resulting in a potentially very large gain or loss. It seems only beneficial to poll the experts often, and there is no reason why the many resulting equal-loss trials should complicate the learning task.
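This robustness is easy to test empirically. The following sketch (our own illustration, assuming the adahedge function of Figure 1; the loss values are invented) inserts one equal-loss trial and checks that the losses incurred in all other rounds are unchanged.

```matlab
% Timelessness sanity check (sketch): inserting a trial where both experts
% suffer the same loss should not change AdaHedge's behaviour elsewhere.
l  = [0 1; 1 0; 0.3 0.7; 1 0];
l2 = [0 1; 1 0; 5 5; 0.3 0.7; 1 0];    % equal-loss trial inserted as round 3
h  = adahedge(l);
h2 = adahedge(l2);
disp(max(abs(h2([1 2 4 5]) - h)));     % ~0: the other rounds are unaffected
disp(h2(3));                           % equals 5: the common loss is unavoidable
```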
The oldest bounds for Hedge scale with $\sqrt{T}$ or $\sqrt{L^*}$, and are thus not timeless. From the results above we can obtain fundamental and timeless variants with, for parameterless algorithms, the best known leading constants (the first item below follows Corollary 1 of Cesa-Bianchi et al. 2007):

Corollary 9 The AdaHedge regret satisfies the following inequalities:

$$R^{\mathrm{ah}} \le \sqrt{\sum_{t=1}^T s_t^2\ln K} + S\big(\tfrac{4}{3}\ln K + 2\big) \qquad \text{(analogue of traditional $\sqrt{T}$-based bounds)},$$
$$R^{\mathrm{ah}} \le 2\sqrt{S(L^* - L^-)\ln K} + S\big(\tfrac{16}{3}\ln K + 2\big) \qquad \text{(analogue of traditional $\sqrt{L^*}$-based bounds)},$$
$$R^{\mathrm{ah}} \le 2\sqrt{S(L^+ - L^*)\ln K} + S\big(\tfrac{16}{3}\ln K + 2\big) \qquad \text{(symmetric bound, useful for gains)}.$$

Proof We could get a bound that depends only on the loss ranges $s_t$ by substituting the worst case $L^* = (L^+ + L^-)/2$ into Theorem 8, but a sharper result is obtained by plugging the inequality $v_t \le s_t^2/4$ from Lemma 4 directly into Theorem 6. This yields the first item above. The other two inequalities follow easily from Theorem 8.

In the next section, we show how we can compete with FTL while at the same time maintaining all these worst-case guarantees up to a constant factor.

3. FlipFlop

AdaHedge balances the cumulative mixability gap $\Delta^{\mathrm{ah}}$ and the mix loss regret $M^{\mathrm{ah}} - L^*$ by reducing $\eta^{\mathrm{ah}}$ as necessary. But, as we observed previously, if the data are not hopelessly adversarial we might not need to worry about the mixability gap: as Lemma 4 expresses, $\delta_t^{\mathrm{ah}}$ is also small if the variance $v_t^{\mathrm{ah}}$ of the loss under the weights $w_{t,k}^{\mathrm{ah}}$ is small, which is the case if the weight on the best expert $\max_k w_{t,k}^{\mathrm{ah}}$ becomes close to one.

AdaHedge is able to exploit such a lucky scenario to an extent: as explained in the discussion that follows Theorem 6, if the weight of the best expert goes to one quickly, AdaHedge will have a small cumulative mixability gap, and therefore, by Lemma 3, a small regret. This happens, for example, in the stochastic setting with independent, identically distributed losses, when a single expert has the smallest expected loss. Similarly, in the experiment of Section 5.4, the AdaHedge weights concentrate sufficiently quickly for the regret to be bounded.

There is the potential for a nasty feedback loop, however. Suppose there are a small number of difficult early trials, during which the cumulative mixability gap increases relatively quickly. AdaHedge responds by reducing the learning rate (8), with the effect that the weights on the experts become more uniform. As a consequence, the mixability gap in future trials may be larger than what it would have been if the learning rate had stayed high, leading to further unnecessary reductions of the learning rate, and so on. The end result may be that AdaHedge behaves as if the data are difficult and incurs substantial regret, even in cases where the regret of Hedge with a fixed high learning rate, or of Follow-the-Leader, is bounded! Precisely this phenomenon occurs in the experiment in Section 5.2 below: AdaHedge's regret is close to the worst-case bound, whereas FTL hardly incurs any regret at all.
It appears, then, that we must either hope that the data are easy enough that we can make the weights concentrate quickly on a single expert, by not reducing the learning rate at all; or we fear the worst and reduce the learning rate as much as we need to be able to provide good guarantees. We cannot really interpolate between these two extremes: an intermediate learning rate may not yield small regret in favourable cases and may at the same time destroy any performance guarantees in the worst case. It is unclear a priori whether we can get away with keeping the learning rate high, or whether it is wiser to play it safe using AdaHedge. The most extreme case of keeping the learning rate high is the limit as $\eta$ tends to $\infty$, for which Hedge reduces to Follow-the-Leader.

In this section we work out a strategy that combines the advantages of FTL and AdaHedge: it retains AdaHedge's worst-case guarantees up to a constant factor, but its regret is also bounded by a constant times the regret of FTL (Theorem 15). Perhaps surprisingly, this is not easy to achieve. To see why, imagine a scenario where the average loss of the best expert is substantial, whereas the regret of either Follow-the-Leader or AdaHedge is small. Since our combination has to guarantee a similarly small regret, it has only a very limited margin for error. We cannot, for example, simply combine the two algorithms by recursively plugging them into Hedge with a fixed learning rate, or into AdaHedge: the performance guarantees we have for those methods of combination are too weak. Even if both FTL and AdaHedge yield small regret on the original problem, choosing the actions of FTL for some rounds and those of AdaHedge for the other rounds may fail if we do it naively, because the regret is not necessarily increasing, and we may end up picking each algorithm precisely in those rounds where the other one is better.

Luckily, alternating between the optimistic FTL strategy and the worst-case-proof AdaHedge does turn out to be possible if we do it in a careful way. In this section we explain the appropriate strategy, called FlipFlop (superscript: "ff"), and show that it combines the desirable properties of both FTL and AdaHedge.

3.1 Exploiting Easy Data by Following the Leader

We first investigate the potential benefits of FTL over AdaHedge. Lemma 10 below identifies the circumstances under which FTL will perform well, which is when the number of leader changes is small. It also shows that the regret for FTL is equal to its cumulative mixability gap when FTL is interpreted as a Hedge strategy with infinite learning rate.

Lemma 10 Let $c_t$ be an indicator for a leader change at time $t$: define $c_t = 1$ if there exists an expert $k$ such that $L_{t-1,k} = L_{t-1}^*$ while $L_{t,k} \ne L_t^*$, and $c_t = 0$ otherwise. Let $C_t = c_1 + \ldots + c_t$ be the cumulative number of leader changes. Then the FTL regret satisfies $R^{\mathrm{fl}} = \Delta^{(\infty)} \le SC_T$.

Proof We have $M^{(\infty)} = L^*$ by mix loss property #3, and consequently $R^{\mathrm{fl}} = \Delta^{(\infty)} + M^{(\infty)} - L^* = \Delta^{(\infty)}$. To bound $\Delta^{(\infty)}$, notice that, for any $t$ such that $c_t = 0$, all leaders remained leaders and incurred identical loss. It follows that $m_t^{(\infty)} = L_t^* - L_{t-1}^* = h_t^{(\infty)}$ and hence $\delta_t^{(\infty)} = 0$. By bounding $\delta_t^{(\infty)} \le s_t \le S$ for all other $t$ we obtain

$$\Delta^{(\infty)} = \sum_{t=1}^T\delta_t^{(\infty)} = \sum_{t:\,c_t=1}\delta_t^{(\infty)} \le \sum_{t:\,c_t=1}S = SC_T,$$

as required.
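The leader-change count $C_T$ of Lemma 10 is directly computable from the data. A minimal sketch (our own illustration, following the definition of $c_t$ above):

```matlab
% Count leader changes C_T; by Lemma 10 the FTL regret is at most S*C_T.
function C = leader_changes(l)
    [T, K] = size(l);
    L = zeros(1, K);
    C = 0;
    for t = 1:T
        prev_leaders = (L == min(L));      % leaders before round t
        L = L + l(t,:);
        dropped = prev_leaders & (L ~= min(L));
        C = C + any(dropped);              % c_t = 1 if some leader was overtaken
    end
end
```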
We see that the regret for FTL is bounded by the number of leader changes. This quantity is both fundamental and timeless. It is a natural measure of the difficulty of the problem, because it remains small whenever a single expert makes the best predictions on average, even in the scenario described above, in which AdaHedge gets caught in a feedback loop.

One example where FTL outperforms AdaHedge is when the losses for two experts are $(1, 0)$ on the first round, and keep alternating according to $(1, 0), (0, 1), (1, 0), \ldots$ for the remainder of the rounds. Then the FTL regret is only $1/2$, whereas AdaHedge's performance is close to the worst-case bound (because its weights $\mathbf{w}_t^{\mathrm{ah}}$ converge to $(1/2, 1/2)$, for which the bound (6) on the mixability gap is tight). This scenario is illustrated further in the experiments, Section 5.2.

3.2 FlipFlop

FlipFlop is a Hedge strategy in the sense that it uses exponential weights defined by (9), but the learning rate $\eta_t^{\mathrm{ff}}$ now alternates between infinity, such that the algorithm behaves like FTL, and the AdaHedge value, which decreases as a function of the mixability gap accumulated over the rounds where AdaHedge is used. In Definition 11 below, we will specify the "flip" regime $\overline{\mathcal{R}}_t$, which is the subset of times $\{1, \ldots, t\}$ where we follow the leader by using an infinite learning rate, and the "flop" regime $\underline{\mathcal{R}}_t = \{1, \ldots, t\} \setminus \overline{\mathcal{R}}_t$, which is the set of times where the learning rate is determined by AdaHedge (mnemonic: the position of the bar refers to the value of the learning rate). We accumulate the mixability gap, the mix loss and the variance for these two regimes separately:

$$\overline{\Delta}_t = \sum_{\tau \in \overline{\mathcal{R}}_t}\delta_\tau^{\mathrm{ff}}; \qquad \overline{M}_t = \sum_{\tau \in \overline{\mathcal{R}}_t}m_\tau^{\mathrm{ff}}; \qquad \text{(flip)}$$
$$\underline{\Delta}_t = \sum_{\tau \in \underline{\mathcal{R}}_t}\delta_\tau^{\mathrm{ff}}; \qquad \underline{M}_t = \sum_{\tau \in \underline{\mathcal{R}}_t}m_\tau^{\mathrm{ff}}; \qquad \underline{V}_t = \sum_{\tau \in \underline{\mathcal{R}}_t}v_\tau^{\mathrm{ff}}. \qquad \text{(flop)}$$

We also change the learning rate from its definition for AdaHedge in (8) to the following, which differentiates between the two regimes of the strategy:

$$\eta_t^{\mathrm{ff}} = \begin{cases}\eta_t^{\mathrm{flip}} & \text{if } t \in \overline{\mathcal{R}}_t,\\ \eta_t^{\mathrm{flop}} & \text{if } t \in \underline{\mathcal{R}}_t,\end{cases} \qquad \text{where } \eta_t^{\mathrm{flip}} = \eta^{\mathrm{fl}} = \infty \text{ and } \eta_t^{\mathrm{flop}} = \frac{\ln K}{\underline{\Delta}_{t-1}}. \qquad (16)$$

Like for AdaHedge, $\eta_t^{\mathrm{flop}} = \infty$ as long as $\underline{\Delta}_{t-1} = 0$, which now happens for all $t$ such that $\underline{\mathcal{R}}_{t-1} = \emptyset$. Note that while the learning rates are defined separately for the two regimes, the exponential weights (9) of the experts are still always determined using the cumulative losses $L_{t,k}$ over all rounds. We also point out that, for rounds $t \in \underline{\mathcal{R}}_t$, the learning rate $\eta_t^{\mathrm{ff}} = \eta_t^{\mathrm{flop}}$ is not equal to $\eta_t^{\mathrm{ah}}$, because it uses $\underline{\Delta}_{t-1}$ instead of $\Delta_{t-1}^{\mathrm{ah}}$. For this reason, the FlipFlop regret may be either better or worse than the AdaHedge regret; our results below only preserve the regret bound up to a constant factor. In contrast, we do compete with the actual regret of FTL.
```matlab
% Returns the losses of FlipFlop.
% l(t,k) is the loss of expert k at time t; phi > 1 and alpha > 0 are parameters
function h = flipflop(l, alpha, phi)
    [T, K] = size(l);
    h = nan(T,1);
    L = zeros(1,K);
    Delta = [0 0];
    scale = [phi/alpha alpha];
    regime = 1; % 1=FTL, 2=AH

    for t = 1:T
        if regime==1, eta = Inf; else eta = log(K)/Delta(2); end
        [w, Mprev] = mix(eta, L);
        h(t) = w * l(t,:)';
        L = L + l(t,:);
        [~, M] = mix(eta, L);
        delta = max(0, h(t)-(M-Mprev));
        Delta(regime) = Delta(regime) + delta;
        if Delta(regime) > scale(regime) * Delta(3-regime)
            regime = 3-regime;
        end
    end
end
```

Figure 2: FlipFlop, with new ingredients in boldface (the mix function is as in Figure 1)

It remains to define the flip regime $\overline{\mathcal{R}}_t$ and the flop regime $\underline{\mathcal{R}}_t$, which we will do by specifying the times at which to switch from one to the other. FlipFlop starts optimistically, with an epoch of the flip regime, which means it follows the leader, until $\overline{\Delta}_t$ becomes too large compared to $\underline{\Delta}_t$. At that point it switches to an epoch of the flop regime, and keeps using $\eta^{\mathrm{flop}}$ until $\underline{\Delta}_t$ becomes too large compared to $\overline{\Delta}_t$. Then the process repeats with the next epochs of the flip and flop regimes. The regimes are determined as follows:

Definition 11 (FlipFlop's Regimes) Let $\varphi > 1$ and $\alpha > 0$ be parameters of the algorithm (tuned below in Corollary 16). Then FlipFlop starts in the flip regime. If $t$ is the earliest time since the start of a flip epoch where $\overline{\Delta}_t > (\varphi/\alpha)\underline{\Delta}_t$, then the transition to the subsequent flop epoch occurs between rounds $t$ and $t + 1$. (Recall that during flip epochs $\overline{\Delta}_t$ increases in $t$ whereas $\underline{\Delta}_t$ is constant.) Vice versa, if $t$ is the earliest time since the start of a flop epoch where $\underline{\Delta}_t > \alpha\overline{\Delta}_t$, then the transition to the subsequent flip epoch occurs between rounds $t$ and $t + 1$.

This completes the definition of the FlipFlop strategy. See Figure 2 for a matlab implementation.
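A hypothetical side-by-side run (our own illustration; it assumes flipflop above and adahedge plus mix from Figure 1 are on the path, and uses $\alpha \approx 1.243$, our own numerical evaluation of the tuning suggested by Corollary 16 below):

```matlab
% Compare FlipFlop and AdaHedge on the best-case-for-FTL data of Section 5.2.
T = 1000;
l = repmat([0 1; 1 0], T/2, 1);   % alternating losses for t >= 2
l(1,:) = [1 0];                   % initial loss vector of Experiment 2
h_ff = flipflop(l, 1.243, 2.37);
h_ah = adahedge(l);
Lstar = min(cumsum(l), [], 2);
fprintf('FlipFlop regret: %.2f, AdaHedge regret: %.2f\n', ...
        sum(h_ff) - Lstar(end), sum(h_ah) - Lstar(end));
```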
The analysis proceeds much like the analysis for AdaHedge. We first show that, analogously to Lemma 3, the FlipFlop regret can be bounded in terms of the cumulative mixability gap; in fact, we can use the smallest cumulative mixability gap that we encountered in either of the two regimes, at the cost of slightly increased constant factors. This is the fundamental building block in our FlipFlop analysis. We then proceed to develop analogues of Lemmas 5 and 7, whose proofs do not have to be changed much to apply to FlipFlop. Finally, all these results are combined to bound the regret of FlipFlop in Theorem 15, which, after Theorem 8, is the second main result of this paper.

Lemma 12 (FlipFlop version of Lemma 3) The following two bounds hold simultaneously for the regret of the FlipFlop strategy with parameters $\varphi > 1$ and $\alpha > 0$:

$$R^{\mathrm{ff}} \le \Big(\frac{\varphi\alpha}{\varphi - 1} + 2\alpha + 1\Big)\overline{\Delta}_T + \Big(\frac{\varphi}{\varphi - 1} + 2\Big)S; \qquad (17)$$

$$R^{\mathrm{ff}} \le \Big(\frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2\Big)\underline{\Delta}_T + S. \qquad (18)$$

Proof The regret can be decomposed as

$$R^{\mathrm{ff}} = H^{\mathrm{ff}} - L^* = \overline{\Delta} + \underline{\Delta} + \overline{M} + \underline{M} - L^*. \qquad (19)$$

Our first step will be to bound the mix loss $\overline{M} + \underline{M}$ in terms of the mix loss $M^{\mathrm{flop}}$ of the auxiliary strategy that uses $\eta_t^{\mathrm{flop}}$ for all $t$. As $\eta_t^{\mathrm{flop}}$ is nonincreasing, we can then apply Lemma 2 and mix loss property #3 to further bound

$$M^{\mathrm{flop}} \le M^{(\eta_T^{\mathrm{flop}})} \le L^* + \frac{\ln K}{\eta_T^{\mathrm{flop}}} = L^* + \underline{\Delta}_{T-1} \le L^* + \underline{\Delta}_T. \qquad (20)$$

Let $0 = u_1 < u_2 < \ldots < u_b < T$ denote the times just before the epochs of the flip regime begin, i.e. round $u_i + 1$ is the first round in the $i$-th flip epoch. Similarly let $0 < v_1 < \ldots < v_b \le T$ denote the times just before the epochs of the flop regime begin, where we artificially define $v_b = T$ if the algorithm is in the flip regime after $T$ rounds. These definitions ensure that we always have $u_b < v_b \le T$. For the mix loss in the flop regime we have

$$\underline{M} = \big(M_{u_2}^{\mathrm{flop}} - M_{v_1}^{\mathrm{flop}}\big) + \big(M_{u_3}^{\mathrm{flop}} - M_{v_2}^{\mathrm{flop}}\big) + \ldots + \big(M_{u_b}^{\mathrm{flop}} - M_{v_{b-1}}^{\mathrm{flop}}\big) + \big(M_T^{\mathrm{flop}} - M_{v_b}^{\mathrm{flop}}\big). \qquad (21)$$

Let us temporarily write $\eta_t = \eta_t^{\mathrm{flop}}$ to avoid double superscripts. For the flip regime, the properties in Lemma 1, together with the observation that $\eta_t^{\mathrm{flop}}$ does not change during the flip regime (so that $\eta_{v_i} = \eta_{u_i+1} = \ln K/\underline{\Delta}_{u_i}$), give

$$\overline{M} = \sum_{i=1}^b\big(M_{v_i}^{(\infty)} - M_{u_i}^{(\infty)}\big) = \sum_{i=1}^b\big(L_{v_i}^* - L_{u_i}^*\big) \le \sum_{i=1}^b\Big(M_{v_i}^{(\eta_{v_i})} - M_{u_i}^{(\eta_{v_i})} + \frac{\ln K}{\eta_{v_i}}\Big) = \sum_{i=1}^b\big(M_{v_i}^{\mathrm{flop}} - M_{u_i}^{\mathrm{flop}}\big) + \sum_{i=1}^b\underline{\Delta}_{u_i}. \qquad (22)$$

From the definition of the regime changes (Definition 11), we know the value of $\underline{\Delta}_{u_i}$ very accurately at the time $u_i$ of a change from a flop to a flip regime:

$$\underline{\Delta}_{u_i} > \alpha\overline{\Delta}_{u_i} = \alpha\overline{\Delta}_{v_{i-1}} > \varphi\underline{\Delta}_{v_{i-1}} = \varphi\underline{\Delta}_{u_{i-1}}.$$
By unrolling from low to high $i$, we see that

$$\sum_{i=1}^b\underline{\Delta}_{u_i} \le \sum_{i=1}^b\varphi^{i-b}\underline{\Delta}_{u_b} \le \sum_{i=1}^\infty\varphi^{1-i}\underline{\Delta}_{u_b} = \frac{\varphi}{\varphi - 1}\underline{\Delta}_{u_b}.$$

Adding up (21) and (22), we therefore find that the total mix loss is bounded by

$$\overline{M} + \underline{M} \le M^{\mathrm{flop}} + \sum_{i=1}^b\underline{\Delta}_{u_i} \le M^{\mathrm{flop}} + \frac{\varphi}{\varphi - 1}\underline{\Delta}_{u_b} \le L^* + \Big(\frac{\varphi}{\varphi - 1} + 1\Big)\underline{\Delta},$$

where the last inequality uses (20). Combination with (19) yields

$$R^{\mathrm{ff}} \le \Big(\frac{\varphi}{\varphi - 1} + 2\Big)\underline{\Delta} + \overline{\Delta}. \qquad (23)$$

Our next goal is to relate $\overline{\Delta}$ and $\underline{\Delta}$: by construction of the regimes, they are always within a constant factor of each other. First, suppose that after $T$ trials we are in the $b$th epoch of the flip regime, that is, we will behave like FTL in round $T + 1$. In this state, we know from Definition 11 that $\underline{\Delta}$ is stuck at the value $\underline{\Delta}_{u_b}$ that prompted the start of the current epoch. As the regime change happened after $u_b$, we have $\underline{\Delta}_{u_b} - S \le \alpha\overline{\Delta}_{u_b}$, so that $\underline{\Delta} - S \le \alpha\overline{\Delta}$. At the same time, we know that $\overline{\Delta}$ is not large enough to trigger the next regime change. From this we can deduce the following bounds:

$$\frac{1}{\alpha}\big(\underline{\Delta} - S\big) \le \overline{\Delta} \le \frac{\varphi}{\alpha}\underline{\Delta} + S.$$

On the other hand, if after $T$ rounds we are in the $b$th epoch of the flop regime, then a similar reasoning yields

$$\alpha\big(\overline{\Delta} - S\big) \le \underline{\Delta} \le \alpha\overline{\Delta} + S.$$

In both cases, it follows that

$$\overline{\Delta} < \frac{\varphi}{\alpha}\underline{\Delta} + S; \qquad \underline{\Delta} < \alpha\overline{\Delta} + S.$$

The two bounds of the lemma are obtained by plugging first one, then the other of these bounds into (23).

The flop cumulative mixability gap is related, as before, to the variance of the losses.

Lemma 13 (FlipFlop version of Lemma 5) The cumulative mixability gap for the flop regime is bounded by the cumulative variance of the losses for the flop regime:

$$\underline{\Delta}^2 \le \underline{V}\ln K + \big(\tfrac{2}{3}\ln K + 1\big)S\underline{\Delta}. \qquad (24)$$
Proof The proof is analogous to the proof of Lemma 5, with $\underline{\Delta}$ instead of $\Delta^{\mathrm{ah}}$, $\underline{V}$ instead of $V^{\mathrm{ah}}$, and using $\eta_t = \eta_t^{\mathrm{flop}} = \ln(K)/\underline{\Delta}_{t-1}$ instead of $\eta_t = \eta_t^{\mathrm{ah}} = \ln(K)/\Delta_{t-1}^{\mathrm{ah}}$. Furthermore, we only need to sum over the rounds $t \in \underline{\mathcal{R}}_T$ in the flop regime, because $\underline{\Delta}_t$ does not change during the flip regime.

As it is straight-forward to prove an analogue of Theorem 6 for FlipFlop by solving the quadratic inequality in (24), we proceed directly towards establishing an analogue of Theorem 8. The following lemma provides the equivalent of Lemma 7 for FlipFlop. It can probably be strengthened to improve the lower order terms; we provide the version that is easiest to prove.

Lemma 14 (FlipFlop version of Lemma 7) Suppose $H^{\mathrm{ff}} \ge L^*$. The cumulative loss variance for FlipFlop with parameters $\varphi > 1$ and $\alpha > 0$ satisfies

$$\underline{V} \le S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-} + \Big(\frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2\Big)S\underline{\Delta} + S^2.$$

Proof The sum of variances satisfies

$$\underline{V} = \sum_{t \in \underline{\mathcal{R}}_T}v_t^{\mathrm{ff}} \le \sum_{t=1}^Tv_t^{\mathrm{ff}} \le S\,\frac{(L^+ - H^{\mathrm{ff}})(H^{\mathrm{ff}} - L^-)}{L^+ - L^-},$$

where the first inequality simply includes the variances for FTL rounds (which are often all zero), and the second follows from the same reasoning as employed in (14). Subsequently using $L^* \le H^{\mathrm{ff}}$ (by assumption) and, from Lemma 12, $H^{\mathrm{ff}} \le L^* + \gamma$, where $\gamma$ denotes the right-hand side of the bound (18), we find

$$\underline{V} \le S\frac{(L^+ - L^*)(L^* + \gamma - L^-)}{L^+ - L^-} \le S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-} + S\gamma,$$

which was to be shown.

Combining Lemmas 12, 13 and 14, we obtain our second main result:

Theorem 15 (FlipFlop Regret Bound) The regret for FlipFlop with doubling parameters $\varphi > 1$ and $\alpha > 0$ simultaneously satisfies the two bounds

$$R^{\mathrm{ff}} \le \Big(\frac{\varphi\alpha}{\varphi - 1} + 2\alpha + 1\Big)R^{\mathrm{fl}} + \Big(\frac{\varphi}{\varphi - 1} + 2\Big)S,$$

$$R^{\mathrm{ff}} \le c_1\sqrt{S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K} + c_1S\Big(\big(c_1 + \tfrac{2}{3}\big)\ln K + \sqrt{\ln K} + 1\Big) + S, \qquad \text{where } c_1 = \frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2.$$

This shows that, up to a multiplicative factor in the regret, FlipFlop is always as good as the best of Follow-the-Leader and AdaHedge's bound from Theorem 8. Of course, if AdaHedge significantly outperforms its bound, it is not guaranteed that FlipFlop will outperform the bound in the same way.
In the experiments in Section 5 we demonstrate that the multiplicative factor is not just an artifact of the analysis, but can actually be observed on simulated data.

Proof From Lemma 10, we know that $\overline{\Delta} \le \Delta^{(\infty)} = R^{\mathrm{fl}}$. Substitution in (17) of Lemma 12 yields the first inequality. For the second inequality, note that $L^* > H^{\mathrm{ff}}$ means the regret is negative, in which case the result is clearly valid. We may therefore assume w.l.o.g. that $L^* \le H^{\mathrm{ff}}$ and apply Lemma 14. Combination with Lemma 13 yields

$$\underline{\Delta}^2 \le \underline{V}\ln K + \big(\tfrac{2}{3}\ln K + 1\big)S\underline{\Delta} \le S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K + S^2\ln K + c_2S\underline{\Delta},$$

where $c_2 = \big(c_1 + \tfrac{2}{3}\big)\ln K + 1$. We now solve this quadratic inequality as in (13) and relax it using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for nonnegative numbers $a, b$ to obtain

$$\underline{\Delta} \le \sqrt{S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K + S^2\ln K} + c_2S \le \sqrt{S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K} + S\big(\sqrt{\ln K} + c_2\big).$$

In combination with Lemma 12, this yields the second bound of the theorem.

Finally, we propose to select the parameter values that minimize the constant factor in front of the leading terms of these regret bounds.

Corollary 16 The parameter values $\varphi^* = 2.37$ and $\alpha^* = 1.243$ approximately minimize the worst of the two leading factors in the bounds of Theorem 15. The regret for FlipFlop with these parameters is simultaneously bounded by

$$R^{\mathrm{ff}} \le 5.64\,R^{\mathrm{fl}} + 3.73\,S,$$

$$R^{\mathrm{ff}} \le 5.64\sqrt{S\frac{(L^+ - L^*)(L^* - L^-)}{L^+ - L^-}\ln K} + S\big(35.6\ln K + 5.64\sqrt{\ln K} + 6.64\big).$$

Proof The leading factors $f(\varphi, \alpha) = \frac{\varphi\alpha}{\varphi - 1} + 2\alpha + 1$ and $g(\varphi, \alpha) = \frac{\varphi}{\varphi - 1} + \frac{\varphi}{\alpha} + 2$ are respectively increasing and decreasing in $\alpha$. They are equalized for

$$\alpha(\varphi) = \frac{2\varphi - 1 + \sqrt{12\varphi^3 - 16\varphi^2 + 4\varphi + 1}}{6\varphi - 4}.$$

The analytic solution for the minimum of $f(\varphi, \alpha(\varphi))$ in $\varphi$ is too long to reproduce here, but it is approximately equal to $\varphi^* = 2.37$, at which point $\alpha(\varphi^*) \approx 1.243$.
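The optimization in Corollary 16 is easy to reproduce numerically. The following sketch (our own check, using the $\alpha(\varphi)$ formula from the proof above; fminbnd and the bracketing interval are our choices) recovers the stated constants:

```matlab
% Numerical check of Corollary 16: equalize the two leading factors, then
% minimize the common value over phi.
alpha = @(phi) (2*phi - 1 + sqrt(12*phi.^3 - 16*phi.^2 + 4*phi + 1)) ./ (6*phi - 4);
f = @(phi) (phi .* alpha(phi)) ./ (phi - 1) + 2*alpha(phi) + 1;
phi_star = fminbnd(f, 1.01, 10);   % approximately 2.37
factor = f(phi_star);              % approximately 5.64
fprintf('phi* = %.3f, alpha* = %.3f, factor = %.3f\n', ...
        phi_star, alpha(phi_star), factor);
```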
4. Invariance to Rescaling and Translation

A common simplifying assumption made in the literature is that the losses $\ell_{t,k}$ are translated and normalised to take values in the interval $[0, 1]$. However, doing so requires a priori knowledge of the range of the losses. One would therefore prefer algorithms that do not require the losses to be normalised. As discussed by Cesa-Bianchi et al. (2007), the regret bounds for such algorithms should not change when losses are translated (because this does not change the regret) and should scale by $\sigma$ when the losses are scaled by a factor $\sigma > 0$ (because the regret scales by $\sigma$). They call such regret bounds fundamental and show that most of the methods they introduce satisfy such fundamental bounds. Here we go even further: it is not just our bounds that are fundamental, but also our algorithms, which do not change their output weights if the losses are scaled or translated.

Theorem 17 Both AdaHedge and FlipFlop are invariant to translation and rescaling of the losses. Starting with losses $\boldsymbol{\ell}_1, \ldots, \boldsymbol{\ell}_T$, obtain rescaled, translated losses $\boldsymbol{\ell}_1', \ldots, \boldsymbol{\ell}_T'$ by picking any $\sigma > 0$ and arbitrary reals $\tau_1, \ldots, \tau_T$, and setting $\ell_{t,k}' = \sigma\ell_{t,k} + \tau_t$ for $t = 1, \ldots, T$ and $k = 1, \ldots, K$. Both AdaHedge and FlipFlop issue the exact same sequence of weights $\mathbf{w}_t' = \mathbf{w}_t$ on $\boldsymbol{\ell}'$ as they do on $\boldsymbol{\ell}$.

Proof We annotate any quantity with a prime to denote that it is defined with respect to the losses $\boldsymbol{\ell}'$. We omit the algorithm name from the superscript. First consider AdaHedge. We will prove the following relations by induction on $t$:

$$\Delta_{t-1}' = \sigma\Delta_{t-1}; \qquad \eta_t' = \frac{\eta_t}{\sigma}; \qquad \mathbf{w}_t' = \mathbf{w}_t. \qquad (25)$$

For $t = 1$, these are valid since $\Delta_0' = \sigma\Delta_0 = 0$, $\eta_1' = \eta_1/\sigma = \infty$, and $\mathbf{w}_1' = \mathbf{w}_1$ are uniform. Now assume towards induction that (25) is valid for some $t \in \{1, \ldots, T\}$. We can then compute the following values from their definition:

$$h_t' = \mathbf{w}_t' \cdot \boldsymbol{\ell}_t' = \sigma h_t + \tau_t; \qquad m_t' = -\frac{1}{\eta_t'}\ln\big(\mathbf{w}_t' \cdot e^{-\eta_t'\boldsymbol{\ell}_t'}\big) = \sigma m_t + \tau_t; \qquad \delta_t' = h_t' - m_t' = \sigma(h_t - m_t) = \sigma\delta_t.$$

Thus, the mixability gaps are also related by the scale factor $\sigma$. From here we can re-establish the induction hypothesis for the next round: we have $\Delta_t' = \Delta_{t-1}' + \delta_t' = \sigma\Delta_{t-1} + \sigma\delta_t = \sigma\Delta_t$, and $\eta_{t+1}' = \ln(K)/\Delta_t' = \eta_{t+1}/\sigma$. For the weights we get

$$w_{t+1,k}' \propto e^{-\eta_{t+1}'L_{t,k}'} = e^{-(\eta_{t+1}/\sigma)(\sigma L_{t,k} + \tau_1 + \ldots + \tau_t)} \propto e^{-\eta_{t+1}L_{t,k}} \propto w_{t+1,k},$$

which means the two must be equal since both sum to one. Thus the relations of (25) are also valid for time $t + 1$, proving the result for AdaHedge.

For FlipFlop, if we assume regime changes occur at the same times for $\boldsymbol{\ell}$ and $\boldsymbol{\ell}'$, then similar reasoning reveals $\overline{\Delta}_t' = \sigma\overline{\Delta}_t$; $\underline{\Delta}_t' = \sigma\underline{\Delta}_t$; $\eta_t'^{\mathrm{flip}} = \eta_t^{\mathrm{flip}}/\sigma = \infty$; $\eta_t'^{\mathrm{flop}} = \eta_t^{\mathrm{flop}}/\sigma$; and $\mathbf{w}_t' = \mathbf{w}_t$. It remains to check that the regime changes do indeed occur at the same times. Note that in Definition 11, the flop regime is started when $\overline{\Delta}_t' > (\varphi/\alpha)\underline{\Delta}_t'$, which is equivalent to testing $\overline{\Delta}_t > (\varphi/\alpha)\underline{\Delta}_t$ since both sides of the inequality are scaled by $\sigma$. Similarly, the flip regime starts when $\underline{\Delta}_t' > \alpha\overline{\Delta}_t'$, which is equivalent to the test $\underline{\Delta}_t > \alpha\overline{\Delta}_t$.
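Theorem 17 can also be checked empirically. Since the weights are identical, the per-round Hedge losses must transform exactly as the losses do: $h_t' = \sigma h_t + \tau_t$. A sketch (our own illustration, assuming adahedge from Figure 1; the constants are invented, and implicit expansion requires R2016b or later):

```matlab
% Empirical check of Theorem 17: rescaling and translating the losses
% transforms the AdaHedge losses by the same map, h' = sigma*h + tau.
rng(1);
T = 500; K = 3;
l = randn(T, K);
sigma = 2.5; tau = randn(T, 1);
l2 = sigma * l + tau;                    % expands tau over the K columns
h  = adahedge(l);
h2 = adahedge(l2);
disp(max(abs(h2 - (sigma * h + tau))));  % ~0 up to floating point rounding
```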
5. Experiments

We performed four experiments on artificial data, designed to clarify how the learning rate determines performance in a variety of Hedge algorithms. These experiments are designed to illustrate as clearly as possible the intricacies involved in the central question of this paper: whether to use a high learning rate (by following the leader) or to play it safe by using a smaller learning rate instead. Rather than mimic real-world data, on which high learning rates often seem to work well (Devaine et al., 2013), we vary the main factor that appears to drive the best choice of learning rate: the difference in cumulative loss between the experts.

We have kept the experiments as simple as possible: the data are deterministic, and involve two experts. In each case, the data consist of one initial hand-crafted loss vector $\boldsymbol{\ell}_1$, followed by a sequence of loss vectors $\boldsymbol{\ell}_2, \ldots, \boldsymbol{\ell}_T$, which are either $(0, 1)$ or $(1, 0)$. For each experiment $\xi \in \{1, 2, 3, 4\}$, we want the cumulative loss difference $L_{t,1} - L_{t,2}$ between the experts to follow a target $f_\xi(t)$, which will be a continuous, nondecreasing function of $t$. As the losses are binary, we cannot make $L_{t,1} - L_{t,2}$ exactly equal to the target $f_\xi(t)$, but after the initial loss $\boldsymbol{\ell}_1$, we choose every subsequent loss vector such that it brings $L_{t,1} - L_{t,2}$ as close as possible to $f_\xi(t)$ (a sketch of this data-generating scheme appears after the list below). All functions $f_\xi$ change slowly enough that $|L_{t,1} - L_{t,2} - f_\xi(t)| \le 1$ for all $t$.

For each experiment, we let the number of trials be $T = 1000$, and we first plot the regret $R_T^{(\eta)}$ of the Hedge algorithm as a function of the fixed learning rate $\eta$. We subsequently plot the regret $R_t^{\mathrm{alg}}$ as a function of $t = 1, \ldots, T$, for each of the following algorithms "alg":

1. Follow-the-Leader (Hedge with learning rate $\infty$)
2. Hedge with fixed learning rate $\eta = 1$
3. Hedge with the learning rate that optimizes the worst-case bound (7), which equals $\eta = \sqrt{8\ln(K)/(S^2T)}$; we will call this algorithm "safe Hedge" for brevity.
4. AdaHedge
5. FlipFlop, with parameters $\varphi = 2.37$ and $\alpha = 1.243$ as in Corollary 16
6. Variation MW by Hazan and Kale (2008), using the fixed learning rate that optimises the bound provided in their Theorem 4
7. NormalHedge, described by Chaudhuri et al. (2009)

Note that the "safe Hedge" strategy (the third item above) can only be used in practice if the horizon $T$ is known in advance. Variation MW (the sixth item) additionally requires precognition of the empirical variance of the sequence of losses of the best expert up until $T$ (that is, $\mathrm{VAR}_T^{\max}$ as defined in Section 1.2), which is not available in practice, but which we are supplying anyway. We include algorithms 6 and 7 because, as explained in Section 1.2, they are the state of the art in Hedge-style algorithms. Like AdaHedge, Variation MW is a refinement of the CBMS strategy described by Cesa-Bianchi et al. (2007). They modify the definition of the weights in the Hedge algorithm to include second-order terms; the resulting bound is never more than a constant factor worse than the bounds (1) for CBMS and (15) for AdaHedge, but for some easy data it can be substantially better. For this reason it is a natural performance target for AdaHedge. The bounds for CBMS and AdaHedge are incomparable with the bound for NormalHedge, being better for some, worse for other data. The reason we include it in the experiments is because, compared to the other methods, its performance in practice turns out to be excellent. We do not know whether there are data sequences on which FlipFlop significantly outperforms NormalHedge, nor whether there is a theoretical reason for this good performance, as the NormalHedge bound (Chaudhuri et al., 2009) is not tight for our experiments.
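The data-generating scheme described above is straightforward to implement. A minimal sketch (our own illustration, up to how ties with the target are broken):

```matlab
% Generate the experiment data: after l1, each loss vector is (0,1) or (1,0),
% keeping the cumulative difference L(t,1)-L(t,2) as close as possible to f(t).
function l = make_losses(l1, f, T)
    l = zeros(T, 2);
    l(1,:) = l1;
    d = l1(1) - l1(2);              % current cumulative loss difference
    for t = 2:T
        if d < f(t)
            l(t,:) = [1 0];         % move the difference up, towards the target
        else
            l(t,:) = [0 1];         % move it down
        end
        d = d + l(t,1) - l(t,2);
    end
end
```

For example, l = make_losses([1 0], @(t) t.^0.4, 1000) reproduces the data of Experiment 3 (Section 5.3).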
To reduce clutter, we omit results for CBMS; its behaviour is very similar to that of AdaHedge. Below we provide an exact description of each experiment, and discuss the results.

5.1 Experiment 1. Worst Case for FTL

The experiment is defined by $\boldsymbol{\ell}_1 = (\tfrac{1}{2}, 0)$, and $f_1(t) = 0$. This yields the following losses:

$$\begin{pmatrix}1/2\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}, \ldots$$

These data are the worst case for FTL: each round, the leader incurs loss one, while each of the two individual experts only receives a loss once every two rounds. Thus, the FTL regret increases by one every two rounds and ends up around 500. For any learning rate $\eta$, the weights used by the Hedge algorithm are repeated every two rounds, so the regret $H_t - L_t^*$ increases by the same amount every two rounds: the regret increases linearly in $t$ for every fixed $\eta$ that does not vary with $t$. However, the constant of proportionality can be reduced greatly by reducing the value of $\eta$, as the top graph in Figure 3 shows: for $T = 1000$, the regret becomes negligible once $\eta$ is sufficiently small. Thus, in this experiment, a learning algorithm must reduce the learning rate to shield itself from incurring an excessive overhead.

The bottom graph in Figure 3 shows the expected breakdown of the FTL algorithm; Hedge with fixed learning rate $\eta = 1$ also performs quite badly. When $\eta$ is reduced to the value that optimises the worst-case bound, the regret becomes competitive with that of the other algorithms. Note that Variation MW has the best performance; this is because its learning rate is tuned in relation to the bound proved in the paper, which has a relatively large constant in front of the leading term. As a consequence the algorithm always uses a relatively small learning rate, which turns out to be helpful in this case but harmful in later experiments.

FlipFlop behaves as theory suggests it should: its regret increases alternately like the regret of AdaHedge and the regret of FTL. The latter performs horribly, so during those intervals the regret increases quickly; on the other hand the FTL intervals are relatively short-lived, so in the end they do not harm the regret by more than a constant factor. The NormalHedge algorithm still has acceptable performance, although its regret is relatively large in this experiment; we have no explanation for this, but in fairness we do observe good performance of NormalHedge in the other three experiments as well as in numerous further unreported simulations.

5.2 Experiment 2. Best Case for FTL

The second experiment is defined by $\boldsymbol{\ell}_1 = (1, 0)$ and $f_2(t) = 3/2$. This leads to the sequence of losses

$$\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}, \ldots$$

in which the loss vectors are alternating for $t \ge 2$. These data look very similar to the first experiment, but as the top graph in Figure 4 illustrates, because of the small changes at
5.2 Experiment 2. Best Case for FTL

The second experiment is defined by $l_1 = (1, 0)$ and $f_2(t) = 3/2$. This leads to the sequence of losses
$$(1, 0),\ (1, 0),\ (0, 1),\ (1, 0),\ (0, 1),\ \ldots,$$
in which the loss vectors alternate for $t \ge 2$. These data look very similar to the first experiment but, as the top graph in Figure 4 illustrates, because of the small changes at the start of the sequence it is now viable to reduce the regret by using a very high learning rate. In particular, since there are no leader changes after the first round, FTL incurs a regret of only 1/2. As in the first experiment, the regret increases linearly in $t$ for every fixed $\eta$ (provided it is less than $\infty$); but now the constant of linearity is large only for learning rates close to 1. Once FlipFlop enters the FTL regime for the second time, it stays there indefinitely, which results in bounded regret. After this small change in the setup compared to the previous experiment, NormalHedge also suddenly adapts very well to the data. The behaviour of the other algorithms is very similar to the first experiment: their regret grows without bound.

5.3 Experiment 3. Weights do not Concentrate in AdaHedge

The third experiment uses $l_1 = (1, 0)$ and $f_3(t) = t^{0.4}$. The first few loss vectors are the same as in the previous experiment, but every now and then there are two loss vectors $(1, 0)$ in a row, so that the first expert gradually falls behind the second in terms of performance. By $t = T = 1000$, the first expert has accumulated 508 loss, while the second expert has only 492.

For any fixed learning rate $\eta$, the weights used by Hedge now concentrate on the second expert. We know from Lemma 4 that the mixability gap in any round is bounded by a constant times the variance of the loss under the weights played by the algorithm; as these weights concentrate on the second expert, this variance must go to zero. One can show that this happens quickly enough for the cumulative mixability gap to be bounded for any fixed $\eta$ that does not vary with $t$ or depend on $T$. From (5) we then have
$$R^{(\eta)} = M^{(\eta)} - L^* + \Delta^{(\eta)} \le \frac{\ln K}{\eta} + \text{bounded} = \text{bounded}.$$
So in this scenario, as long as the learning rate is kept fixed, we will eventually learn the identity of the best expert. However, if the learning rate is very small, this will happen so slowly that the weights still have not converged by $t = 1000$. Even worse, the top graph in Figure 5 shows that for intermediate values of the learning rate, not only do the weights fail to converge on the second expert sufficiently quickly, but they are also sensitive enough to the alternation of the loss vectors to increase the overhead incurred each round. For this experiment, it really pays to use a large learning rate rather than a safe small one. Thus FTL, Hedge with $\eta = 1$, FlipFlop and NormalHedge perform excellently, while safe Hedge, AdaHedge and Variation MW incur a substantial overhead. Extrapolating the trend in the graph, it appears that the overhead of these algorithms is not bounded. This is possible because the three algorithms with poor performance use a learning rate that decreases as a function of $t$; as a consequence, the learning rate they use may remain too small for the weights to concentrate. For the case of AdaHedge, this is an example of the nasty feedback loop described above.
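Before turning to the last experiment, note that for two experts with uniform prior weights the Hedge weight on the second expert has the logistic form $w_{t,2} = 1/(1 + e^{-\eta d_{t-1}})$, where $d_t = L_{t,1} - L_{t,2}$. This makes the boundedness claim above easy to check numerically. A minimal sketch (Python) of our own, in which we take $d_t = t^{0.4}$ exactly and use the per-round loss variance $w_{t,1} w_{t,2}$ as a proxy for the mixability gap, omitting the constant from Lemma 4:

```python
import numpy as np

eta = 0.1
for T in [10**3, 10**4, 10**5]:
    t = np.arange(1, T + 1)
    d = t ** 0.4                           # cumulative loss difference, as in f_3
    w2 = 1.0 / (1.0 + np.exp(-eta * d))    # Hedge weight on the better expert
    var = w2 * (1.0 - w2)                  # loss variance under the weights
    print(T, var.sum())                    # partial sums converge to a finite limit
```

The printed partial sums stabilise as $T$ grows, illustrating that for any fixed $\eta$ the cumulative mixability gap remains bounded even though the weights concentrate only slowly.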
5.4 Experiment 4. Weights do Concentrate in AdaHedge

The fourth and last experiment uses $l_1 = (1, 0)$ and $f_4(t) = t^{0.6}$. The losses are comparable to those of the third experiment, but the performance gap between the two experts is somewhat larger. By $t = T = 1000$, the two experts have loss 532 and 468, respectively. It is now so easy to determine which of the experts is better that the top graph in Figure 6 is nonincreasing: the larger the learning rate, the better.

The algorithms that managed to keep their regret bounded in the previous experiment obviously still perform very well, but it is clearly visible that AdaHedge now achieves the same. As discussed below Theorem 6, this happens because the weight concentrates on the second expert quickly enough that AdaHedge's regret is bounded in this setting. The crucial difference with the previous experiment is that now we have $f_\xi(t) = t^\beta$ with $\beta > 1/2$. Thus, while the previous experiment shows that AdaHedge can be tricked into reducing the learning rate when it would be better not to do so, the present experiment shows that, on the other hand, AdaHedge sometimes adapts really nicely to easy data, in contrast to algorithms that are tuned in terms of a worst-case bound.

[Figure 3: Hedge regret for Experiment 1 (FTL worst-case). Top panel: regret as a function of the fixed learning rate; bottom panel: regret as a function of time for each algorithm.]
[Figure 4: Hedge regret for Experiment 2 (FTL best-case). Same panels.]
[Figure 5: Hedge regret for Experiment 3 (weights do not concentrate in AdaHedge). Same panels.]
[Figure 6: Hedge regret for Experiment 4 (weights do concentrate in AdaHedge). Same panels.]

6. Discussion and Conclusion

The main contributions of this work are twofold. First, we develop a new hedging algorithm called AdaHedge. The analysis simplifies existing results and we obtain improved bounds (Theorems 6 and 8). Moreover, AdaHedge is fundamental in the sense that its weights are invariant under translation and scaling of the losses (Section 4), and its bounds are timeless in the sense that they do not degenerate when rounds are inserted in which all experts incur the same loss. Second, we explain in detail why it is difficult to tune the learning rate such that good performance is obtained both for easy and for hard data, and we address the issue by developing the FlipFlop algorithm. FlipFlop never performs much worse than the Follow-the-Leader strategy, which works very well on easy data (Lemma 10), but it also retains a worst-case bound similar to the bound for AdaHedge (Theorem 15).

As such, this work may be seen as solving a special case of a more general question: can we compete with Hedge for any fixed learning rate? We will now briefly discuss this question and then place our work in a broader context, which provides an ambitious agenda for future work.

6.1 General Question: Competing with Hedge for any Fixed Learning Rate

Up to multiplicative constants, FlipFlop is at least as good as FTL and as (the bound for) AdaHedge. These two algorithms represent two extremes of choosing the learning rate $\eta$ in Hedge: FTL takes $\eta = \infty$ to exploit easy data, whereas AdaHedge decreases $\eta$ with $t$ to protect against the worst case. It is now natural to ask whether we can design a "Universal Hedge" algorithm that can compete with Hedge with any fixed learning rate $\eta \in (0, \infty]$. That is, for all $T$, the regret up to time $T$ of Universal Hedge should be within a constant factor $C$ of the regret incurred by Hedge run with the fixed $\hat\eta$ that minimizes the Hedge loss $H^{(\hat\eta)}$.

This appears to be a difficult question, and maybe such an algorithm does not even exist. Yet even partial results (such as an algorithm that competes with $\eta \in [\sqrt{\ln(K)/(S^2 T)}, \infty]$, or with a factor $C$ that increases slowly, say logarithmically, in $T$) would already be of significant interest.

In this regard, it is interesting to note that, in practice, the learning rates chosen by sophisticated versions of Hedge do not always perform very well; higher learning rates often do better. This is noted by Devaine et al. (2013), who resolve the issue by adapting the learning rate sequentially in an ad-hoc fashion, which works well in their application, but for which they can provide no guarantees. A Universal Hedge algorithm would adapt to the learning rate that is optimal with hindsight. FlipFlop is a first step in this direction. Indeed, it already has some of the properties of such an ideal algorithm: under some conditions we can show that if Hedge achieves bounded regret using any learning rate, then FTL, and therefore FlipFlop, also achieves bounded regret:

Theorem 18 Fix any $\eta > 0$. For $K = 2$ experts with losses in $\{0, 1\}$ we have
$$R^{(\eta)} \text{ is bounded} \implies R^{\mathrm{ftl}} \text{ is bounded} \implies R^{\mathrm{ff}} \text{ is bounded}.$$

The proof is in Appendix B. While the second implication remains valid for more experts and other losses, we currently do not know whether the first implication continues to hold as well.

6.2 The Big Picture

Broadly speaking, a learning rate is any single scalar parameter controlling the relative weight of the data and a prior regularization term in a learning task. Such learning rates pop up in batch settings as diverse as $L_1$/$L_2$-regularized regression such as Lasso and Ridge, standard Bayesian nonparametric and PAC-Bayesian inference (Zhang, 2006; Audibert, 2004; Catoni, 2007), and, as in this paper, in sequential prediction. All the applications just mentioned can formally be seen as variants of Bayesian inference: Bayesian MAP in Lasso and Ridge, randomized drawing from the posterior ("Gibbs sampling") in the PAC-Bayesian setting and in the Hedge setting. Moreover, in each of these applications, selecting the appropriate learning rate is nontrivial: simply adding the learning rate as another parameter and putting a Bayesian prior on it can lead to very bad results (Grünwald and Langford, 2007). An ideal method for adapting the learning rate would work in all such applications. In addition to the FlipFlop algorithm described here, we currently have methods that are guaranteed to work for several PAC-Bayesian style stochastic settings (Grünwald, 2011, 2012). It is encouraging that all these methods are based on the same, apparently fundamental, quantity: the mixability gap as defined before Lemma 1. They all employ different techniques to ensure a learning rate under which the posterior is concentrated and hence the mixability gap is small. This gives some hope that the approach can be taken even further.

To give but one example, the Safe Bayesian method of Grünwald (2012) uses essentially the same technique as Devaine et al. (2013), with an additional online-to-batch conversion step. Grünwald (2012) proves that this approach adapts to the optimal learning rate in an i.i.d. stochastic setting with arbitrary (countably or uncountably infinite) sets of experts (predictors); in contrast, AdaHedge and FlipFlop in the form presented in this paper are suitable for a worst-case setting with a finite set of experts. This raises, of course, the question of whether either the Safe Bayesian method can be extended to the worst-case setting (which would imply formal guarantees for the method of Devaine et al. 2013), or the FlipFlop algorithm can be extended to the setting with infinitely many experts.

Thus, we have two major, interrelated questions for future work: first, as explained in Section 6.1, we would like to be able to compete with all $\eta$ in some set that contains a whole range rather than just two values. Second, we would like to compete with the best $\eta$ in a setting with a countably infinite or even uncountable number of experts, equipped with an arbitrary prior distribution.
A third question for future work is whether our methods can be extended beyond the standard worst-case Hedge setting and the stochastic i.i.d. setting. A particularly intriguing (and, as initial research suggests, nontrivial) question is whether AdaHedge and FlipFlop can be adapted to settings with limited feedback, such as the adversarial bandit setting (Cesa-Bianchi and Lugosi, 2006). We would also like to extend our approach to the Hedge-based strategies for combinatorial decision domains, like Component Hedge by Koolen et al. (2010), and for matrix-valued predictions, like those by Tsuda et al. (2005).

Acknowledgments

We would like to thank Wojciech Kotłowski, Gilles Stoltz and two anonymous referees for critical feedback. This work was supported in part by the IST Programme of the European Community under the PASCAL Network of Excellence, and by NWO Rubicon grants.

Appendix A. Proof of Lemma 1

The result for $\eta = \infty$ follows from $\eta < \infty$ as a limiting case, so we may assume without loss of generality that $\eta < \infty$. Then $m_t \le h_t$ is obtained by using Jensen's inequality to move the logarithm inside the expectation, and $m_t \ge l_t^-$ and $h_t \le l_t^+$ follow by bounding all losses by their minimal and maximal values, respectively.

The next two items are analogues of similar basic results in Bayesian probability. Item 2 generalizes the chain rule of probability $\Pr(x_1, \ldots, x_T) = \prod_{t=1}^T \Pr(x_t \mid x_1, \ldots, x_{t-1})$:
$$M = -\frac{1}{\eta} \ln \prod_{t=1}^{T} \frac{w_1 \cdot e^{-\eta L_t}}{w_1 \cdot e^{-\eta L_{t-1}}} = -\frac{1}{\eta} \ln\big(w_1 \cdot e^{-\eta L_T}\big).$$

For the third item, use item 2 to write
$$M = -\frac{1}{\eta} \ln\Big(\sum_k w_{1,k}\, e^{-\eta L_{T,k}}\Big).$$
The lower bound is obtained by bounding all $L_{T,k}$ from below by $L^*$; for the upper bound we drop all terms in the sum except for the term corresponding to the best expert and use $w_{1,k} = 1/K$.

For the last item, let $0 < \eta < \gamma$ be any two learning rates. Then Jensen's inequality gives
$$-\frac{1}{\eta} \ln\big(w_1 \cdot e^{-\eta L}\big) = -\frac{1}{\eta} \ln\big(w_1 \cdot (e^{-\gamma L})^{\eta/\gamma}\big) \ge -\frac{1}{\eta} \ln\big((w_1 \cdot e^{-\gamma L})^{\eta/\gamma}\big) = -\frac{1}{\gamma} \ln\big(w_1 \cdot e^{-\gamma L}\big).$$
This completes the proof.
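Item 2, the telescoping of the mix losses, can also be verified numerically as a sanity check. A minimal sketch of ours (Python), using uniform prior weights and the standard Hedge weights $w_{t,k} \propto w_{1,k} e^{-\eta L_{t-1,k}}$; the random losses are just an arbitrary test case:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, K, T = 0.5, 3, 50
losses = rng.random((T, K))      # arbitrary losses in [0, 1]
w1 = np.full(K, 1.0 / K)         # uniform prior weights

M = 0.0                          # cumulative mix loss
L = np.zeros(K)                  # cumulative expert losses
for l in losses:
    w = w1 * np.exp(-eta * L)
    w /= w.sum()                 # Hedge weights w_t
    M += -np.log(w @ np.exp(-eta * l)) / eta   # mix loss m_t
    L += l

# Item 2: the mix losses telescope to -(1/eta) ln(w_1 . e^{-eta L_T}).
assert np.isclose(M, -np.log(w1 @ np.exp(-eta * L)) / eta)
```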
Appendix B. Proof of Theorem 18

The second implication follows from Theorem 15, so we only need to prove the first implication. To this end, consider any infinite sequence of losses on which FTL has unbounded regret. We will argue that Hedge with fixed $\eta$ must have unbounded regret as well.

Our argument is based on finding an infinite subsequence of the losses on which (a) the regret for Hedge with fixed $\eta$ is at most as large as on the original sequence of losses; and (b) the regret for Hedge is infinite. To construct this subsequence, first remove all trials $t$ such that $l_{t,1} = l_{t,2}$ (that is, both experts suffer the same loss), as these trials do not change the regret of either FTL or Hedge, nor their behaviour on any of the other rounds. Next, we will selectively remove certain local extrema. We call a pair of two consecutive trials $(t, t+1)$ a local extremum if the losses in these trials are opposite: either $l_t = (0, 1)$ and $l_{t+1} = (1, 0)$, or vice versa.

Removing any local extremum will only decrease the regret for Hedge, as may be seen as follows. We observe that removing a local extremum will not change the cumulative losses of the experts or the behaviour of Hedge on other rounds, so it suffices to consider only the regret incurred on rounds $t$ and $t+1$ themselves. By symmetry it is further sufficient to consider the case that $l_t = (0, 1)$ and $l_{t+1} = (1, 0)$. Then, over trials $t$ and $t+1$, the individual experts both suffer loss 1, and for Hedge the loss is $h_t + h_{t+1} = w_t \cdot l_t + w_{t+1} \cdot l_{t+1} = w_{t,2} + w_{t+1,1}$. Now, since the loss received by expert 1 in round $t$ was less than that of expert 2, some weight shifts to the first expert: we must have $w_{t+1,1} > w_{t,1}$. Substitution gives $h_t + h_{t+1} > w_{t,1} + w_{t,2} = 1$. Thus, Hedge suffers more loss in these two rounds than whichever expert turns out to be best in hindsight, and it follows that removing trials $t$ and $t+1$ will only decrease its regret (by an amount that depends only on $\eta$).
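The strict inequality $h_t + h_{t+1} > 1$ used in this argument is easy to verify numerically. A minimal sketch for two experts with uniform prior weights (Python; here $d$ denotes the cumulative loss difference $L_{t-1,2} - L_{t-1,1}$ with which the local extremum is entered — an illustration of ours, not code from the paper):

```python
import numpy as np

def extremum_loss(eta, d):
    """Hedge loss over a local extremum l_t = (0,1), l_{t+1} = (1,0)."""
    w_t1 = 1.0 / (1.0 + np.exp(-eta * d))        # weight on expert 1 at time t
    w_u1 = 1.0 / (1.0 + np.exp(-eta * (d + 1)))  # after l_t, expert 1 gains weight
    return (1.0 - w_t1) + w_u1                   # h_t + h_{t+1} = w_{t,2} + w_{t+1,1}

for d in [-3, 0, 2]:
    print(extremum_loss(0.5, d))   # always strictly greater than 1
```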
We proceed to select the local extrema to remove. To this end, let $d_t = L_{t,2} - L_{t,1}$ denote the difference in cumulative loss between the experts after $t$ trials, and observe that removal of a local extremum at $(t, t+1)$ will simply remove the elements $d_t$ and $d_{t+1}$ from the sequence $d_1, d_2, \ldots$, while leaving the other elements of the sequence unchanged. We will remove local extrema in a way that leads to an infinite subsequence of losses such that
$$d_1, d_2, d_3, d_4, d_5, \ldots = \pm 1, 0, \pm 1, 0, \pm 1, \ldots \tag{26}$$
In this subsequence, every two consecutive trials still constitute a local extremum, on which Hedge incurs a certain fixed positive regret. Consequently, the Hedge regret $R_t$ grows linearly in $t$ and is therefore unbounded.

If the losses already satisfy (26), we are done. If not, then observe that there can only be a leader change at time $t+1$, in the sense of Lemma 10, when $d_t = 0$. Since the FTL regret is bounded by the number of leader changes (Lemma 10), and since FTL was assumed to have infinite regret, there must therefore be an infinite number of trials $t$ such that $d_t = 0$. We will remove local extrema in a way that preserves this property. In addition, we must have $|d_{t+1} - d_t| = 1$ for all $t$, because $d_{t+1} = d_t$ would imply that $l_{t+1,1} = l_{t+1,2}$, and we have already removed such trials. This second property is automatically preserved regardless of which trials we remove.

If the losses do not yet satisfy (26), there must be a first trial $u$ with $|d_u| \ge 2$. Since there are infinitely many $t$ with $d_t = 0$, there must then also be a first trial $w > u$ with $d_w = 0$. Now choose any $v \in [u, w)$ such that $|d_v| = \max_{t \in [u,w]} |d_t|$ maximizes the discrepancy between the cumulative losses of the experts. Since $v$ attains the maximum and $|d_{t+1} - d_t| = 1$ for all $t$, as mentioned above, we have $|d_{v+1}| = |d_v| - 1$, so that $(v, v+1)$ must be a local extremum, and this is the local extremum we remove. Since $|d_v| \ge |d_u| \ge 2$, we also have $|d_{v+1}| \ge 1$, so that this does not remove any of the trials in which $d_t = 0$. Repetition of this process will eventually lead to $v = u$, so that trial $u$ is removed. Given any $T$, the process may therefore be repeated until $|d_t| \le 1$ for all $t \le T$. As $|d_{t+1} - d_t| = 1$ for all $t$, we then match (26) for the first $T$ trials. Hence by letting $T$ go to infinity we obtain the desired result.
References

Jean-Yves Audibert. PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI, 2004.

Peter Auer, Nicolò Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75, 2002.

Olivier Catoni. PAC-Bayesian Supervised Classification. Lecture Notes–Monograph Series. IMS, 2007.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3), 1997.

Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2/3), 2007.

Kamalika Chaudhuri, Yoav Freund, and Daniel Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems 22 (NIPS 2009), 2009.

Alexey V. Chernov and Vladimir Vovk. Prediction with advice of unknown number of experts. In Peter Grünwald and Peter Spirtes, editors, UAI. AUAI Press, 2010.

Marie Devaine, Pierre Gaillard, Yannig Goude, and Gilles Stoltz. Forecasting electricity consumption by aggregating specialized experts; a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-day-ahead (half-)hourly predictions. Machine Learning, 90(2), February 2013.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 1997.

Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79–103, 1999.

Sébastien Gerchinovitz. Prédiction de suites individuelles et cadre statistique classique : étude de quelques liens autour de la régression parcimonieuse et des techniques d'agrégation. PhD thesis, Université Paris-Sud, 2011.

Peter Grünwald. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th International Conference on Learning Theory (COLT 2011), 2011.

Peter Grünwald. The safe Bayesian: learning the learning rate via the mixability gap. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory (ALT 2012), 2012.

Peter Grünwald and John Langford. Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning, 66(2–3), 2007.

László Györfi and György Ottucsák. Sequential prediction of unbounded stationary time series. IEEE Transactions on Information Theory, 53(5), 2007.

Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 57–67, 2008.

Marcus Hutter and Jan Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6, 2005.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. In Proceedings of the 16th Annual Conference on Learning Theory (COLT), 2003.

Yuri Kalnishkan and Michael V. Vyugin. The weak aggregating algorithm and weak mixability. In Proceedings of the 18th Annual Conference on Learning Theory (COLT), 2005.

Wouter M. Koolen, Manfred K. Warmuth, and Jyrki Kivinen. Hedging structured concepts. In A. T. Kalai and M. Mohri, editors, Proceedings of the 23rd Annual Conference on Learning Theory (COLT 2010), 2010.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2), 1994.

Koji Tsuda, Gunnar Rätsch, and Manfred K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projection. Journal of Machine Learning Research, 6, 2005.

Tim van Erven, Peter Grünwald, Wouter M. Koolen, and Steven de Rooij. Adaptive hedge. In Advances in Neural Information Processing Systems 24 (NIPS 2011), 2011.

Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2), 1998.

Vladimir Vovk. Competitive on-line statistics. International Statistical Review, 69(2), 2001.
Vladimir Vovk, Akimichi Takemura, and Glenn Shafer. Defensive forecasting. In Proceedings of AISTATS 2005, 2005. Archive version available at http://

Tong Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4), 2006.