JMLR: Workshop and Conference Proceedings 1:1–7, 2016. ICML 2016 AutoML Workshop

Parameter-Free Convex Learning through Coin Betting

Francesco Orabona FRANCESCO@ORABONA.COM
Dávid Pál DPAL@YAHOO-INC.COM
Yahoo Research, New York

Abstract
We present a new parameter-free algorithm for online linear optimization over any Hilbert space. It is theoretically optimal, with regret guarantees as good as those obtainable with the best possible learning rate. The algorithm is simple and easy to implement. The analysis is given via the adversarial coin-betting game, Kelly betting, and the Krichevsky-Trofimov estimator. Applications to obtain parameter-free convex optimization and machine learning algorithms are shown.

1. Introduction
We consider Online Linear Optimization (OLO) (Cesa-Bianchi and Lugosi, 2006; Shalev-Shwartz, 2011) over a Hilbert space $\mathcal{H}$. In each round $t$, an algorithm chooses a point $w_t \in \mathcal{H}$ and then receives a loss vector $\ell_t \in \mathcal{H}$. The algorithm's goal is to keep its regret small, defined as the difference between its cumulative loss and the cumulative loss of a fixed strategy $u \in \mathcal{H}$, that is
$$\mathrm{Regret}_T(u) = \sum_{t=1}^T \langle \ell_t, w_t \rangle - \sum_{t=1}^T \langle \ell_t, u \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product in $\mathcal{H}$. OLO is a basic building block of many machine learning problems. For example, Online Convex Optimization (OCO) is a problem analogous to OLO where the linear function $u \mapsto \langle \ell_t, u \rangle$ is generalized to an arbitrary convex function $f_t(u)$. OCO is solved through a reduction to OLO by feeding the algorithm $\ell_t \in \partial f_t(w_t)$ (Shalev-Shwartz, 2011). Batch and stochastic convex optimization can also be solved through a reduction to OLO by taking the average of $w_1, w_2, \ldots, w_T$ (Shalev-Shwartz, 2011). To achieve optimal regret, most of the existing online algorithms (e.g. Online Gradient Descent, Hedge) require the user to set the learning rate to an unknown/oracle value.
Recently, new parameter-free algorithms have been proposed for OLO/OCO (Chaudhuri et al., 2009; Chernov and Vovk, 2010; Streeter and McMahan, 2012; Orabona, 2013; McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Luo and Schapire, 2014; Orabona, 2014; Luo and Schapire, 2015; Koolen and van Erven, 2015). These algorithms adapt to the characteristics of the optimal predictor, without the need to tune parameters. However, their design and underlying intuition remain a challenge. Our contributions are as follows. We connect algorithms for OLO with coin betting. Namely, we show that an algorithm for OLO can be viewed as an algorithm for betting on the outcomes of adversarial coin flips. The wealth the algorithm can generate in the betting problem is connected to the regret in the OLO setting. This insight allows us to design novel parameter-free algorithms,

© 2016 F. Orabona & D. Pál.
which are extremely simple and natural. We also show some applications of our results to convex optimization and machine learning, as well as some empirical results.

2. How to Tune the Learning Rates?
Consider OLO over a Hilbert space $\mathcal{H}$. Denote by $\|\cdot\| = \sqrt{\langle \cdot, \cdot \rangle}$ the induced norm in $\mathcal{H}$ and assume that $\|\ell_t\| \le 1$. Online Gradient Descent (OGD) with learning rate $\eta$ satisfies (Shalev-Shwartz, 2011)
$$\forall u \in \mathcal{H} \qquad \mathrm{Regret}_T(u) \le \frac{\|u\|^2}{2\eta} + \frac{\eta T}{2}. \qquad (1)$$
It is obvious that the optimal tuning of the learning rate depends on the unknown norm of $u$. The simple choice $\eta = 1/\sqrt{T}$ leads to an algorithm that satisfies
$$\forall u \in \mathcal{H} \qquad \mathrm{Regret}_T(u) \le \tfrac{1}{2}\left(1 + \|u\|^2\right)\sqrt{T}. \qquad (2)$$
However, in this bound the dependency on $\|u\|$ is suboptimal: the quadratic dependency can be replaced by an (almost) linear dependency. Starting from (1), if we choose the learning rate $\eta = D/\sqrt{T}$, we get a family of algorithms parameterized by $D \in [0, \infty)$ that satisfy
$$\forall u \in \mathcal{H}: \|u\| \le D \implies \mathrm{Regret}_T(u) \le D\sqrt{T}. \qquad (3)$$
Instead of a family of algorithms parameterized by $D \in [0, \infty)$ satisfying the bound (3), one would like to have a single algorithm (without any tuning parameters) satisfying
$$\forall u \in \mathcal{H} \qquad \mathrm{Regret}_T(u) \le \|u\|\sqrt{T}. \qquad (4)$$
Notice that (4) is stronger than (3) in the following sense: a single algorithm satisfying (4) implies (3) for all values of $D \in [0, \infty)$. However, a family of algorithms $\{A_D : D \in [0, \infty)\}$ parameterized by $D$, where $A_D$ satisfies (3), does not yield a single algorithm that satisfies (4). Finally, note that (4) has a better dependency on $\|u\|$ than (2). Better guarantees are indeed possible. In fact, there has been a lot of work on algorithms (Streeter and McMahan, 2012; Orabona, 2013; McMahan and Abernethy, 2013; McMahan and Orabona, 2014; Orabona, 2014) that satisfy a slightly weaker version of (4). Namely, their regret satisfies
$$\forall u \in \mathcal{H} \qquad \mathrm{Regret}_T(u) \le \left(O(1) + \|u\|\,\mathrm{polylog}(1 + \|u\|)\right)\sqrt{T}. \qquad (5)$$
It can be shown that for OLO over a Hilbert space the extra poly-logarithmic factor is necessary (McMahan and Abernethy, 2013; Orabona, 2013). Algorithms satisfying (5) are called parameter-free, since they do not need to know $D$, yet they have an optimal dependency on $\|u\|$.
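To make the tuning issue concrete, the following sketch (our own illustration, not code from the paper) runs unprojected OGD on a random sequence of linear losses with $\|\ell_t\| \le 1$ and checks bound (1) for several choices of $D$ in $\eta = D/\sqrt{T}$; the bound holds for every $D$, but its value, and the actual regret, depend strongly on how well $D$ matches $\|u\|$.

```python
import numpy as np

def ogd_regret(losses, eta, u):
    """Unprojected OGD on linear losses: w_{t+1} = w_t - eta * l_t.
    Returns the regret against the fixed comparator u."""
    w = np.zeros_like(u)
    regret = 0.0
    for l in losses:
        regret += np.dot(l, w) - np.dot(l, u)
        w = w - eta * l
    return regret

rng = np.random.default_rng(0)
T, d = 1000, 5
# loss vectors with norm at most 1, as assumed in Section 2
losses = [v / max(1.0, np.linalg.norm(v)) for v in rng.normal(size=(T, d))]
u = rng.normal(size=d)

for D in [0.1, 1.0, 10.0]:
    eta = D / np.sqrt(T)
    r = ogd_regret(losses, eta, u)
    bound = np.dot(u, u) / (2 * eta) + eta * T / 2  # right-hand side of (1)
    assert r <= bound + 1e-9
    print(f"D={D:5.1f}  regret={r:9.2f}  bound (1)={bound:9.2f}")
```

Since (1) holds for any loss sequence with $\|\ell_t\| \le 1$, the assertion passes for every $D$; the printed bounds show the sensitivity to the choice of $D$.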
3. Parameter-Free Algorithm From Coin Betting
Here, we present our new parameter-free algorithm for OLO over a Hilbert space $\mathcal{H}$, stated as Algorithm 1. We would like to stress the extreme simplicity of the algorithm. The theorem below upper bounds its regret in the form of (5); the proof can be found in Orabona and Pál (2016).
Algorithm 1 Algorithm for OLO over a Hilbert space $\mathcal{H}$ based on the Krichevsky-Trofimov estimator
1: for $t = 1, 2, \ldots$ do
2:   Predict with $w_t \leftarrow \left(-\frac{1}{t}\sum_{i=1}^{t-1} \ell_i\right)\left(1 - \sum_{i=1}^{t-1} \langle \ell_i, w_i \rangle\right)$
3:   Receive loss vector $\ell_t \in \mathcal{H}$ such that $\|\ell_t\| \le 1$
4: end for

Theorem 1 (Regret Bound for Algorithm 1) Let $\{\ell_t\}_{t=1}^\infty$ be any sequence of loss vectors in a Hilbert space $\mathcal{H}$ such that $\|\ell_t\| \le 1$. Algorithm 1 satisfies
$$\forall T \ge 0 \quad \forall u \in \mathcal{H} \qquad \mathrm{Regret}_T(u) \le \|u\|\sqrt{T \ln\left(1 + 4T^2\|u\|^2\right)} + 1.$$

We now explain how Algorithm 1 is derived from the Krichevsky-Trofimov solution to the adversarial coin-betting problem.

Adversarial Coin Betting. Consider a gambler making repeated bets on the outcomes of adversarial coin flips. The gambler starts with an initial endowment of 1 dollar. In each round $t$, he bets on the outcome of a coin flip $c_t \in \{-1, 1\}$, where $+1$ denotes heads and $-1$ denotes tails. The outcome $c_t$ is chosen by an adversary. The gambler can bet any amount on either heads or tails. However, he cannot borrow any additional money. If he loses, he loses the betted amount; if he wins, he gets the betted amount back and, in addition to that, he gets the same amount as a reward. We encode the gambler's bet in round $t$ by a single number $\beta_t \in [-1, 1]$. The sign of $\beta_t$ encodes whether he is betting on heads or tails. The absolute value encodes the betted amount as a fraction of his current wealth. Let $\mathrm{Wealth}_t$ be the gambler's wealth at the end of round $t$. It satisfies
$$\mathrm{Wealth}_0 = 1 \qquad \text{and} \qquad \mathrm{Wealth}_t = (1 + c_t \beta_t)\,\mathrm{Wealth}_{t-1} \quad \text{for } t \ge 1. \qquad (6)$$
Note that since $\beta_t \in [-1, 1]$, the gambler's wealth always stays non-negative.

Kelly Betting and Krichevsky-Trofimov Estimator. For sequential betting on i.i.d. coin flips, the optimal strategy was proposed by Kelly (1956). The strategy assumes that the coin flips $\{c_t\}_{t=1}^\infty$, $c_t \in \{+1, -1\}$, are generated i.i.d. with known probability of heads. If $p \in [0, 1]$ is the probability of heads, the Kelly bet is $\beta_t = 2p - 1$. He showed that, in the long run, this strategy provides more wealth than betting any other fixed fraction (Kelly, 1956). For adversarial coins, Kelly betting does not make sense.
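Kelly's optimality claim can be checked numerically. By (6), the expected per-round log-wealth growth of a constant bet $\beta$ on an i.i.d. coin with heads probability $p$ is $p \log(1+\beta) + (1-p)\log(1-\beta)$, and a grid search recovers $\beta = 2p - 1$ as its maximizer. This small check is our own illustration, not code from the paper:

```python
import numpy as np

def log_growth(beta, p):
    # expected per-round log-wealth growth of a constant fraction beta,
    # E[log(1 + c * beta)] with P(c = +1) = p, following recurrence (6)
    return p * np.log(1 + beta) + (1 - p) * np.log(1 - beta)

p = 0.7
betas = np.linspace(-0.99, 0.99, 1999)
best = betas[np.argmax(log_growth(betas, p))]
# the maximizer is the Kelly bet 2p - 1 = 0.4
assert abs(best - (2 * p - 1)) < 1e-2
```

The concavity of the growth rate in $\beta$ is what makes the maximizer unique; differentiating and setting the derivative to zero gives $\beta = 2p - 1$ exactly.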
Krichevsky and Trofimov (1981) proposed to replace $p$ with an estimate: after seeing the coin flips $c_1, c_2, \ldots, c_{t-1}$, use the empirical estimate $k_t = \frac{1/2 + \sum_{i=1}^{t-1} \mathbf{1}[c_i = +1]}{t}$. Their estimate is commonly called the KT estimator,¹ and it results in the betting strategy $\beta_t = 2k_t - 1 = \frac{1}{t}\sum_{i=1}^{t-1} c_i$. Krichevsky and Trofimov showed that this strategy guarantees almost the same wealth that one would obtain knowing in advance the fraction of heads. Namely, if we denote by $\mathrm{Wealth}_T(\beta)$ the wealth of the strategy that bets the fraction $\beta$ in every round, then the wealth of the Krichevsky-Trofimov betting strategy satisfies
$$\forall \beta \in [-1, 1] \qquad \mathrm{Wealth}_T \ge \frac{\mathrm{Wealth}_T(\beta)}{2\sqrt{T}}. \qquad (7)$$
Moreover, this guarantee is optimal up to constant multiplicative factors (Cesa-Bianchi and Lugosi, 2006).

1. Compared to the standard maximum likelihood estimate $\frac{\sum_{i=1}^{t-1} \mathbf{1}[c_i = +1]}{t-1}$, the KT estimator shrinks slightly towards $\frac{1}{2}$.
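The vectorial version of this KT strategy is exactly Algorithm 1, which translates almost line-for-line into code. The sketch below (our own rendering in NumPy, not from the paper) runs Algorithm 1 on random loss vectors with $\|\ell_t\| \le 1$ and checks the guarantee of Theorem 1 against a random comparator:

```python
import numpy as np

def kt_olo(losses):
    """Algorithm 1: parameter-free OLO via vectorial KT betting.
    losses: sequence of loss vectors l_t with ||l_t|| <= 1."""
    sum_l = np.zeros_like(losses[0])   # sum_{i<t} l_i
    wealth = 1.0                       # Wealth_{t-1} = 1 - sum_{i<t} <l_i, w_i>
    preds = []
    for t, l in enumerate(losses, start=1):
        w = (-sum_l / t) * wealth      # bet the KT fraction of the current wealth
        preds.append(w)
        sum_l = sum_l + l
        wealth -= np.dot(l, w)
    return preds

rng = np.random.default_rng(1)
T, d = 500, 3
losses = [v / max(1.0, np.linalg.norm(v)) for v in rng.normal(size=(T, d))]
preds = kt_olo(losses)

# check the bound of Theorem 1 against a random comparator u
u = rng.normal(size=d)
regret = sum(np.dot(l, w - u) for l, w in zip(losses, preds))
bound = np.linalg.norm(u) * np.sqrt(T * np.log(1 + 4 * T**2 * np.dot(u, u))) + 1
assert regret <= bound
```

Note that there is no learning rate anywhere: the only state is the sum of past losses and the current wealth.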
Algorithm 2 SGD algorithm based on the KT estimator
Require: Convex functions $f_1, f_2, \ldots, f_N$ and desired number of iterations $T$
1: Initialize $\mathrm{Wealth}_0 \leftarrow 1$ and $\theta_0 \leftarrow 0$
2: for $t = 1, 2, \ldots, T$ do
3:   Set $w_t \leftarrow \frac{\theta_{t-1}}{t}\,\mathrm{Wealth}_{t-1}$
4:   Select an index $j$ at random from $\{1, 2, \ldots, N\}$ and compute $\ell_t = \partial f_j(w_t)$
5:   Update $\theta_t \leftarrow \theta_{t-1} - \ell_t$ and $\mathrm{Wealth}_t \leftarrow \mathrm{Wealth}_{t-1} - \langle \ell_t, w_t \rangle$
6: end for
7: Output $\bar{w}_T = \frac{1}{T}\sum_{t=1}^T w_t$

From betting to OLO. In Algorithm 1, the coin outcome is the vector $c_t \in \mathcal{H}$, where $c_t = -\ell_t$, and the algorithm's wealth is $\mathrm{Wealth}_t = 1 + \sum_{i=1}^t \langle c_i, w_i \rangle = 1 - \sum_{i=1}^t \langle \ell_i, w_i \rangle$. The algorithm explicitly keeps track of its wealth, and it bets the vectorial fraction $\beta_t = \frac{1}{t}\sum_{i=1}^{t-1} c_i = -\frac{1}{t}\sum_{i=1}^{t-1} \ell_i$ of its current wealth. The regret bound (Theorem 1) is a consequence of the Krichevsky-Trofimov lower bound (7) on the wealth and the duality between regret and wealth. For more details, see Orabona and Pál (2016).

4. From Online Learning to Convex Optimization and Machine Learning
The result in Section 3 immediately implies new algorithms and results in convex optimization and machine learning. We state some of them here; see Orabona (2014) for more results.

Convex Optimization. Consider an empirical risk minimization problem of the form
$$F(w) = \frac{1}{N}\sum_{i=1}^N f_i(w), \qquad (8)$$
where $f_i : \mathbb{R}^d \to \mathbb{R}$ is convex.² It is immediate to transform Algorithm 1 into a Stochastic Gradient Descent (SGD) algorithm for this problem, obtaining Algorithm 2. In Algorithm 2, $\partial f_j(w)$ denotes a subgradient of $f_j$ at a point $w$. We assume that the norm of the subgradients of $f_j$ is bounded by 1. Besides its simplicity, an important property of Algorithm 2 is that it does not have a learning rate to be tuned, yet it achieves the optimal convergence rate. In fact, denoting by $\hat{w} = \arg\min_w F(w)$ the optimal solution of (8), the following theorem states the rate of convergence of Algorithm 2.

Theorem 2 The average $\bar{w}_T$ produced by Algorithm 2 is an approximate minimizer of the objective function (8):
$$\mathbb{E}[F(\bar{w}_T)] - F(\hat{w}) \le \frac{\|\hat{w}\|\sqrt{T \log\left(1 + 4T^2\|\hat{w}\|^2\right)} + 1}{T}.$$
Note that in the above theorem, $T$ can be larger (multiple epochs) or smaller than $N$.
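Algorithm 2 can be sketched in a few lines. The example below is our own toy instantiation (not code from the paper): the objective is $F(w) = \frac{1}{N}\sum_i |w - a_i|$, whose subgradients have norm at most 1 and whose minimizer is the median of the $a_i$; with no learning rate to tune, the averaged iterate gets close to the optimum.

```python
import numpy as np

def kt_sgd(subgrad_of, N, T, d, rng):
    """Algorithm 2: parameter-free SGD based on the KT estimator."""
    wealth, theta = 1.0, np.zeros(d)
    avg = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / t * wealth          # betting step (line 3 of Algorithm 2)
        j = rng.integers(N)             # pick a random function index
        l = subgrad_of(j, w)            # subgradient with ||l|| <= 1
        theta -= l
        wealth -= np.dot(l, w)
        avg += (w - avg) / t            # running average of the iterates
    return avg

# toy problem: F(w) = (1/N) sum_i |w - a_i|, minimized at the median of a
rng = np.random.default_rng(0)
a = 2.0 + 0.1 * rng.normal(size=50)
sg = lambda j, w: np.sign(w - a[j])     # subgradient of |w - a_j|
w_bar = kt_sgd(sg, N=len(a), T=4000, d=1, rng=rng)

F = lambda w: np.mean(np.abs(w - a))
assert F(w_bar) - F(np.median(a)) < 0.5
```

The tolerance in the final check is deliberately loose: Theorem 2 bounds the suboptimality in expectation, so a single run is only guaranteed to be close to that bound, not equal to it.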
Machine Learning. In machine learning, the minimization of a function of the form (8) is just a proxy to minimize the true risk over an unknown distribution. For example, $f_i(w)$ can be of the form $f_i(w) = f(w, X_i, Y_i)$, where $\{(X_i, Y_i)\}_{i=1}^N$ is a sequence of labeled samples generated i.i.d. from some unknown distribution and $f(w, X_i, Y_i)$ is the logistic loss of a weight vector $w$ on the sample

2. The algorithm can also be implemented and analyzed with kernels (Orabona, 2014).
Algorithm 3 Averaging algorithm based on the KT estimator
Require: Sample $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$
1: Initialize $\mathrm{Wealth}_0 \leftarrow 1$ and $\theta_0 \leftarrow 0$
2: for $i = 1, 2, \ldots, N$ do
3:   Set $w_i \leftarrow \frac{\theta_{i-1}}{i}\,\mathrm{Wealth}_{i-1}$
4:   Compute $\ell_i = \left.\frac{\partial f(w, X_i, Y_i)}{\partial w}\right|_{w = w_i}$
5:   Update $\theta_i \leftarrow \theta_{i-1} - \ell_i$ and $\mathrm{Wealth}_i \leftarrow \mathrm{Wealth}_{i-1} - \langle \ell_i, w_i \rangle$
6: end for
7: Output $\bar{w}_N = \frac{1}{N}\sum_{i=1}^N w_i$

$(X_i, Y_i)$. A common approach to obtain a small risk on the test set is to minimize a regularized objective function over the training set:
$$F_{\mathrm{Reg}}(w) = \lambda\|w\|^2 + \frac{1}{N}\sum_{i=1}^N f(w, X_i, Y_i). \qquad (9)$$
This problem is strongly convex, so there are very efficient methods to minimize it; hence we can assume to be able to get the minimizer of $F_{\mathrm{Reg}}$ with arbitrarily high precision, with algorithms that do not require tuning learning rates. Yet, this is not enough. In fact, we are rarely interested in the value of the objective function $F_{\mathrm{Reg}}$ or its minimizer; rather, we are interested in the true risk of a solution $w$, that is $\mathbb{E}[f(w, X, Y)]$, where $(X, Y)$ is an independent test sample from the same distribution from which the training set $\{(X_i, Y_i)\}_{i=1}^N$ came. Hence, in order to get a good performance we have to select a good regularization parameter $\lambda$. In particular, from Sridharan et al. (2009) we get
$$\mathbb{E}[f(\hat{w}, X, Y)] - \mathbb{E}[f(w^*, X, Y)] \le O\left(\lambda\|w^*\|^2 + \frac{1}{\lambda N}\right),$$
where $w^* = \arg\min_w \mathbb{E}[f(w, X, Y)]$ and $\hat{w} = \arg\min_w F_{\mathrm{Reg}}(w)$. From the above bound, it is clear that the optimal value of $\lambda$ depends on $w^*$, which is unknown. Yet another possibility is to select the optimal learning rate and/or the number of epochs of SGD to directly minimize $\mathbb{E}[f(w, X, Y)]$. However, all these methods are equivalent (Lin et al., 2016) and they still require tuning at least one parameter. We would like to stress that this is not just a theoretical problem: any practitioner knows how painful it is to find the right regularization for the problem at hand. Assuming we knew $w^*$, we could set $\lambda = O\left(1/(\|w^*\|\sqrt{N})\right)$ to achieve the worst-case optimal bound
$$\mathbb{E}[f(\hat{w}, X, Y)] - \mathbb{E}[f(w^*, X, Y)] \le O\left(\frac{\|w^*\|}{\sqrt{N}}\right). \qquad (10)$$
However, we can get the same guarantee without knowing $w^*$ or the optimal $\lambda$, by doing a single pass over the data set. More precisely, we derive Algorithm 3 from Algorithm 1 by applying the standard online-to-batch reduction (Shalev-Shwartz, 2011). The algorithm makes only a single pass over the dataset and it does not have any tuning parameters. Yet, it has almost the same guarantee (10) without knowing $w^*$ or the optimal regularization parameter or the learning rate, or any other tuning parameter.
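A single pass of Algorithm 3 with the logistic loss fits in a dozen lines. The following sketch (our own toy example on hypothetical synthetic data, not code from the paper) normalizes the inputs so that the loss gradients have norm at most 1, makes one pass, and checks that the averaged solution beats the trivial predictor $w = 0$ on the training objective:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 2000, 5
u_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i|| <= 1
y = np.sign(X @ u_true)

# Algorithm 3: a single pass over the sample, no tuning parameters
wealth, theta, avg = 1.0, np.zeros(d), np.zeros(d)
for i in range(1, N + 1):
    w = theta / i * wealth
    x, lab = X[i - 1], y[i - 1]
    # gradient of the logistic loss log(1 + exp(-lab * <w, x>)) at w
    l = -lab * x / (1.0 + np.exp(lab * np.dot(w, x)))
    theta -= l
    wealth -= np.dot(l, w)
    avg += (w - avg) / i                 # running average: the output w_bar_N

logloss = lambda w: np.mean(np.log1p(np.exp(-y * (X @ w))))
assert logloss(avg) < logloss(np.zeros(d))  # beats the trivial predictor
```

The same loop works for any convex loss whose gradients are bounded by 1; only the gradient line changes.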
Figure 1: Test loss versus the learning rate parameter of SGD (in log scale), compared with the parameter-free Algorithm 3. (a) yearpredictionmsd dataset, absolute loss; (b) cpusmall dataset, absolute loss; (c) cadata dataset, absolute loss.

Theorem 3 Assume that $(X, Y), (X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ are i.i.d. The output $\bar{w}_N$ of Algorithm 3 satisfies
$$\mathbb{E}[f(\bar{w}_N, X, Y)] - \mathbb{E}[f(w^*, X, Y)] \le \frac{\|w^*\|\sqrt{N \log\left(1 + 4N^2\|w^*\|^2\right)} + 1}{N}.$$
Comparing this guarantee to the one in (10), we see that, by just paying a sub-logarithmic price, we obtain the optimal convergence rate and we remove all the parameters.

5. Empirical Evaluation
We have also run a small empirical evaluation to show that the difference between classic learning algorithms and parameter-free ones is real and not just theoretical. In Figure 1, we used three regression datasets,³ and solved the OCO problem through OLO. In all three cases, we used the absolute loss and normalized the input vectors to have L2 norm equal to 1. The datasets were split in two parts: 75% training set and the remaining as test set. The training is done with one pass over the training set and the final classifier is evaluated on the test set. We used 5 different splits of training/test and we report averages and standard deviations. We ran SGD with different learning rates and show the performance of its last solution on the test set. For Algorithm 3, we do not have any parameter to tune, so we just plot its test set performance as a line. From the empirical results, it is clear that the optimal learning rate is completely data-dependent. It is also interesting to note how the performance of SGD becomes very unstable with large learning rates. Yet, our parameter-free algorithm has a performance very close to the unknown optimal tuning of the learning rate of SGD.
Acknowledgments
The authors thank Jacob Abernethy, Nicolò Cesa-Bianchi, Satyen Kale, Chansoo Lee, and Giuseppe Molteni for useful discussions on this work.

3. Datasets available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
References
N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 (NIPS 2009), December 7-12, Vancouver, BC, Canada, pages 297-305, 2009.

A. Chernov and V. Vovk. Prediction with advice of unknown number of experts. In Peter Grünwald and Peter Spirtes, editors, Proc. of the 26th Conf. on Uncertainty in Artificial Intelligence (UAI 2010), July 8-11, Catalina Island, California, USA. AUAI Press, 2010.

J. L. Kelly. A new interpretation of information rate. Information Theory, IRE Transactions on, 2(3):185-189, September 1956.

W. M. Koolen and T. van Erven. Second-order quantile methods for experts and combinatorial games. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proc. of the 28th Conf. on Learning Theory (COLT 2015), July 3-6, Paris, France, pages 1155-1175, 2015.

R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199-206, 1981.

J. Lin, R. Camoriano, and L. Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In Proc. of the International Conference on Machine Learning (ICML), 2016.

H. Luo and R. E. Schapire. A drifting-games analysis for online learning and applications to boosting. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1368-1376. Curran Associates, Inc., 2014.

H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proc. of the 28th Conf. on Learning Theory (COLT 2015), July 3-6, Paris, France, pages 1286-1304, 2015.

H. B. McMahan and J. Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 2724-2732, 2013.

H. B. McMahan and F. Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Vitaly Feldman, Csaba Szepesvári, and Maria Florina Balcan, editors, Proc. of The 27th Conf. on Learning Theory (COLT 2014), June 13-15, Barcelona, Spain, pages 1020-1039, 2014.

F. Orabona. Dimension-free exponentiated gradient. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1806-1814. Curran Associates, Inc., 2013.

F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems 27, pages 1116-1124, 2014.

F. Orabona and D. Pál. From coin betting to parameter-free online learning, 2016. Available from: http://arxiv.org/pdf/1602.04128.pdf.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.

K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1545-1552. Curran Associates, Inc., 2009.

M. Streeter and B. McMahan. No-regret algorithms for unconstrained online convex optimization. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25 (NIPS 2012), December 3-8, Lake Tahoe, Nevada, USA, pages 2402-2410, 2012.