Online Multiple Kernel Learning: Algorithms and Mistake Bounds




Rong Jin 1, Steven C.H. Hoi 2, and Tianbao Yang 1
1 Department of Computer Science and Engineering, Michigan State University, MI 48824, USA
2 School of Computer Engineering, Nanyang Technological University, 639798, Singapore
1 {rongjin,yangtia1}@cse.msu.edu, 2 chhoi@ntu.edu.sg

Abstract. Online learning and kernel learning are two active research topics in machine learning. Although each of them has been studied extensively, there has been limited effort in addressing the intersecting research. In this paper, we introduce a new research problem, termed Online Multiple Kernel Learning (OMKL), that aims to learn a kernel-based prediction function from a pool of predefined kernels in an online learning fashion. OMKL is generally more challenging than typical online learning because both the kernel classifiers and their linear combination weights must be learned simultaneously. In this work, we consider two setups for OMKL, i.e., combining binary predictions or real-valued outputs from multiple kernel classifiers, and we propose both deterministic and stochastic approaches in the two setups for OMKL. The deterministic approach updates all kernel classifiers for every misclassified example, while the stochastic approach randomly chooses classifiers for updating according to some sampling strategies. Mistake bounds are derived for all the proposed OMKL algorithms.

Keywords: Online learning and relative loss bounds, Kernels

1 Introduction

In recent years, we have witnessed increasing interest in both online learning and kernel learning. Online learning refers to the learning process of answering a sequence of questions given the feedback of correct answers to previous questions and possibly some additional prior information [27]; kernel learning aims to identify an effective kernel for a given learning task [20, 28, 12]. A well-known kernel learning method is Multiple Kernel Learning (MKL) [3, 28], which seeks a combination of multiple kernels in order to optimize the performance of kernel-based learning methods (e.g., Support Vector Machines, SVM).
Although the kernel trick has been explored in online learning [10, 7], it is often assumed that the kernel function is given a priori. In this work, we address a new research problem, Online Multiple Kernel Learning (OMKL), which aims to simultaneously learn multiple kernel classifiers and their linear combination from a pool of given kernels in an online fashion. Compared to the existing methods for multiple kernel learning (see [17] and references therein), online multiple kernel learning is computationally advantageous in that it only requires going through the training examples once. We emphasize that online multiple kernel learning is significantly more challenging than typical online learning because both the optimal kernel classifiers and their linear combinations have to be learned simultaneously in an online fashion.

In this paper, we consider two different setups for online multiple kernel learning. In the first setup, termed Online Multiple Kernel Learning by Predictions (OMKL-P), the objective is to combine the binary predictions from multiple kernel classifiers. The second setup, termed Online Multiple Kernel Learning by Outputs (OMKL-O), improves OMKL-P by combining the real-valued outputs from multiple kernel classifiers. Our online learning framework for multiple kernel learning is based on the combination of two types of online learning techniques: the Perceptron algorithm [25], which learns a classifier for a given kernel, and the Hedge algorithm [9], which linearly combines multiple classifiers. Based on the proposed framework, we present two types of approaches for each setup of OMKL, i.e., deterministic and stochastic approaches. The deterministic approach updates each kernel classifier for every misclassified example, while the stochastic approach chooses a subset of classifiers for updating based on certain sampling strategies. Mistake bounds are derived for all the proposed algorithms for online kernel learning.

The rest of this paper is organized as follows. Section 2 reviews the related work on both online learning and kernel learning. Section 3 overviews the problem of online multiple kernel learning. Section 4 presents the algorithms for Online Multiple Kernel Learning by Predictions and their mistake bounds; Section 5 presents the algorithms for Online Multiple Kernel Learning by Outputs and their mistake bounds. Section 6 concludes this study with future work.

2 Related Work

Our work is closely related to both online learning and kernel learning. Below we briefly review the important work in both areas.

Extensive studies have been devoted to online learning for classification. Starting from the Perceptron algorithm [1, 25, 23], a number of online classification algorithms have been proposed, including the ROMMA algorithm [21], the ALMA algorithm [11], the MIRA algorithm [8], the NORMA algorithm [16, 15], and the online Passive-Aggressive algorithms [7]. Several studies extended the Perceptron algorithm into a nonlinear classifier by the introduction of kernel functions [16, 10]. Although these algorithms are effective for nonlinear classification, they usually assume that appropriate kernel functions are given a priori, which limits their applications. Besides online classification, our work is also related to online prediction with expert advice [9, 22, 30]. The most well-known work is probably the Hedge algorithm [9], which was a direct generalization of Littlestone and Warmuth's Weighted Majority (WM) algorithm [22]. We refer readers to the book [4] for an in-depth discussion of this subject.

Kernel learning has been actively studied thanks to the great successes of kernel methods, such as support vector machines (SVM) [29, 26]. Recent studies of kernel learning focus on learning an effective kernel automatically from training data. Various algorithms have been proposed to learn parametric or semi-parametric kernels from labeled and/or unlabeled data. Example techniques include cluster kernels [5], diffusion kernels [18], marginalized kernels [14], graph-based spectral kernel learning approaches [32, 13], non-parametric kernel learning [12, 6], and low-rank kernel learning [19]. Among the various approaches for kernel learning, Multiple Kernel Learning (MKL) [20], whose goal is to learn an optimal combination of multiple kernels, has emerged as a promising technique. A number of approaches have been proposed to solve the optimization problem related to MKL, including the conic combination approach via convex optimization [20], the semi-infinite linear program (SILP) approach [28], the Subgradient Descent approach [24], and the recent level method [31].

We emphasize that although both online learning and kernel learning have been extensively studied, little work has been done to address online kernel learning, especially online multiple kernel learning. To the best of our knowledge, this is the first theoretical study that addresses the OMKL problem.

3 Online Multiple Kernel Learning

Before presenting the algorithms for online multiple kernel learning, we first briefly describe the Multiple Kernel Learning (MKL) problem. Given a set of training examples D_T = {(x_t, y_t), t = 1, ..., T}, where y_t ∈ {−1, +1}, and a collection of kernel functions K_m = {κ_i : X × X → R, i = 1, ..., m}, our goal is to identify the optimal combination of kernel functions, denoted by u = (u_1, ..., u_m), that minimizes the margin classification error. It is cast as the following optimization problem:

    min_{u ∈ Δ} min_{f ∈ H_{κ(u)}} (1/2) ||f||²_{H_{κ(u)}} + C Σ_{t=1}^T ℓ(f(x_t), y_t)        (1)

where H_{κ(u)} denotes the reproducing kernel Hilbert space defined by the kernel κ(u), Δ denotes the simplex, i.e., Δ = {θ ∈ R_+^m : Σ_{i=1}^m θ_i = 1}, and

    κ(u)(·, ·) = Σ_{j=1}^m u_j κ_j(·, ·),    ℓ(f(x_t), y_t) = max(0, 1 − y_t f(x_t)).

It can also be cast into the following minimax problem:

    min_{u ∈ Δ} max_{α ∈ [0,C]^T} { Σ_{t=1}^T α_t − (1/2) (α ∘ y)ᵀ ( Σ_{i=1}^m u_i K^i ) (α ∘ y) }        (2)

where K^i ∈ R^{T×T} with K^i_{j,l} = κ_i(x_j, x_l), y = (y_1, ..., y_T), and ∘ is the element-wise product between two vectors. The formulation for batch-mode multiple kernel learning in (1) aims to learn a single function in the space H_{κ(u)}.
It is well recognized that solving the optimization problem in (1) is in general computationally expensive. In this work, we aim to alleviate the computational difficulty of multiple kernel learning by online learning that only needs to scan through the training examples once. The following theorem allows us to simplify this problem by decomposing it into two separate tasks, i.e., learning a classifier for each individual kernel, and learning the weights that combine the outputs of the individual kernel classifiers to form the final prediction.
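To make the batch MKL formulation above concrete, the following minimal sketch (our illustration, not the authors' code; the kernel pool, weights u, and dual coefficients α are hypothetical) builds the combined Gram matrix κ(u) = Σ_j u_j κ_j and evaluates the hinge loss ℓ(f(x_t), y_t) = max(0, 1 − y_t f(x_t)) on a toy sample:

```python
import numpy as np

def rbf(gamma):
    # Gaussian kernel matrix between row sets A and B (hypothetical pool entry)
    return lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

# Toy data with labels y_t in {-1, +1}
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

kernels = [rbf(0.5), rbf(2.0), rbf(8.0)]   # pool K_m of m = 3 predefined kernels
u = np.array([0.5, 0.3, 0.2])              # simplex weights: u_i >= 0, sum_i u_i = 1

# Combined kernel kappa(u) = sum_j u_j kappa_j, as a Gram matrix on the sample
K_u = sum(u_j * k(X, X) for u_j, k in zip(u, kernels))

# Hinge loss of a kernel expansion f(x) = sum_t alpha_t y_t kappa(u)(x_t, x)
alpha = np.full(len(y), 0.1)               # hypothetical dual coefficients in [0, C]
f = K_u @ (alpha * y)
loss = np.maximum(0.0, 1.0 - y * f)        # l(f(x_t), y_t) = max(0, 1 - y_t f(x_t))
print(K_u.shape, loss)
```

Note that since each κ_i here satisfies κ_i(x, x) = 1 and u lies on the simplex, the combined Gram matrix has unit diagonal, consistent with the assumption κ(x, x) ≤ 1 used later in the paper.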

Theorem 1. The optimization problem in (1) is equivalent to

    min_{u ∈ Δ, {f_i ∈ H_{κ_i}}_{i=1}^m} Σ_{i=1}^m (u_i/2) ||f_i||²_{H_{κ_i}} + C Σ_{t=1}^T ℓ( Σ_{i=1}^m u_i f_i(x_t), y_t )        (3)

Proof. It is important to note that the problem in (3) is non-convex, and therefore we cannot directly deploy the standard approach to convert it into its dual form. In order to transform (3) into (1), we rewrite ℓ(z, y) = max_{α ∈ [0,1]} α(1 − yz), and rewrite (3) as follows:

    min_{u ∈ Δ} min_{{f_i ∈ H_{κ_i}}_{i=1}^m} max_{α ∈ [0,C]^T} Σ_{i=1}^m (u_i/2) ||f_i||²_{H_{κ_i}} + Σ_{t=1}^T α_t ( 1 − y_t Σ_{i=1}^m u_i f_i(x_t) )

Since the problem is convex in f_i and concave in α, we can switch the minimization over f_i with the maximization over α. By taking the minimization over f_i, we have

    f_i(·) = Σ_{t=1}^T α_t y_t κ_i(x_t, ·),    i = 1, ..., m

and the resulting optimization problem becomes

    min_{u ∈ Δ} max_{α ∈ [0,C]^T} Σ_{t=1}^T α_t − Σ_{i=1}^m (u_i/2) (α ∘ y)ᵀ K^i (α ∘ y),

which is identical to the optimization problem (2) of batch-mode multiple kernel learning.

Based on the result of the above theorem, our strategy toward online kernel learning is to simultaneously learn a set of kernel classifiers f_i, i = 1, ..., m, and their combination weights u. We consider two setups for Online Multiple Kernel Learning (OMKL). In the first setup, termed Online Multiple Kernel Learning by Predictions (OMKL-P), we simplify the problem by only considering the combination of the binary predictions from the multiple kernel classifiers, i.e., ŷ = Σ_{i=1}^m u_i sign(f_i(x)). In the second setup, termed Online Multiple Kernel Learning by Outputs (OMKL-O), we learn to combine the real-valued outputs from the multiple kernel classifiers, i.e., ŷ = Σ_{i=1}^m u_i f_i(x). In the next two sections, we discuss algorithms and theoretical properties for OMKL-P and OMKL-O, respectively.

For the convenience of analysis, throughout the paper we assume κ_i(x, x) ≤ 1 for all kernel functions κ_i and for any example x. Below we summarize the notations that are used throughout this paper:

- D_T = {(x_t, y_t), t = 1, ..., T} denotes a sequence of T training examples. K_m = {κ_i : X × X → R, i = 1, ..., m} denotes a collection of m kernel functions.
- f^t = (f_1^t, ..., f_m^t) denotes the collection of m classifiers in round t, where f_i^t represents the classifier learned using the kernel κ_i(·, ·). For the purpose of presentation, we write f_i^t and f^t for short. f^t(x) = (f_1^t(x), ..., f_m^t(x)) denotes the real-valued outputs on example x of the m classifiers learned in round t, and sign(f^t(x)) = (sign(f_1^t(x)), ..., sign(f_m^t(x))) denotes the binary predictions of the corresponding classifiers on example x.

- w^t = (w_1^t, ..., w_m^t) denotes the weight vector for the m classifiers in round t; W_t = Σ_{i=1}^m w_i^t represents the sum of the weights in round t; q^t = (q_1^t, ..., q_m^t) denotes the normalized weight vector, i.e., q_i^t = w_i^t / W_t.
- z^t = (z_1^t, ..., z_m^t) denotes the indicator vector, where z_i^t = I(y_t f_i^t(x_t) ≤ 0) indicates whether the i-th kernel classifier makes a mistake on example x_t; I(C) outputs 1 when C is true and 0 otherwise.
- m^t = (m_1^t, ..., m_m^t) denotes the 0-1 random variable vector, where m_i^t ∈ {0, 1} indicates whether the i-th kernel classifier is chosen for updating in round t.
- p^t = (p_1^t, ..., p_m^t) denotes a probability vector, i.e., p_i^t ∈ [0, 1].
- a · b denotes the dot product between vectors a and b; 1 denotes a vector with all elements equal to 1, and 0 denotes a vector with all elements equal to 0.
- Mult_Sample(p^t) denotes a multinomial sampling process following the probability distribution p^t that outputs i_t ∈ {1, ..., m}. Bern_Sample(p_i^t) denotes a Bernoulli sampling process following the probability p_i^t that outputs a binary variable m_i^t ∈ {0, 1}.

4 Algorithms for Online Kernel Learning by Predictions (OMKL-P)

4.1 Deterministic Approaches (DA)

As already pointed out, the main challenge of OMKL is that both the kernel classifiers and their combination are unknown. The most straightforward approach is to learn a classifier for each individual kernel function and decide its combination weight based on the number of mistakes made by that kernel classifier. To this end, we combine the Perceptron algorithm and the Hedge algorithm: for each kernel, the Perceptron algorithm is employed to learn a classifier, and the Hedge algorithm is used to update its weight. Algorithm 1 shows the deterministic algorithm for OMKL-P. The theorem below shows the mistake bound for Algorithm 1.

For the convenience of presentation, we define the optimal margin error for a kernel κ with respect to a collection of training examples D_T as:

    g(κ, ℓ) = min_{f ∈ H_κ} ( ||f||²_{H_κ} + 2 Σ_{t=1}^T ℓ(f(x_t), y_t) )

Theorem 2. After receiving a sequence of T training examples D_T, the number of mistakes made by running Algorithm 1 is bounded as follows:

    M = Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ≤ (2 ln(1/β))/(1 − β) min_{1≤i≤m} g(κ_i, ℓ) + (2 ln m)/(1 − β)        (4)

The proof for this theorem, as well as for the following theorems, is sketched in the Appendix. Note that in Algorithm 1 the weight of each individual kernel classifier is updated based only on whether it classifies the training example correctly. An alternative approach for updating the weights is to take into account the output values of {f_i^t}_{i=1}^m, by penalizing a

kernel classifier more if its degree of misclassification, measured by −y_t f_i^t(x_t), is large. To this end, we present a second version of the deterministic approach for OMKL-P in Algorithm 2, which takes into account the values of {f_i^t}_{i=1}^m when updating the weights {w_i}_{i=1}^m. In this alternative algorithm, we introduce the parameter γ, which can be interpreted as the maximum level of misclassification. The key quantity introduced in Algorithm 2 is ν_i^t, which measures the degree of misclassification by 1/2 + min(γ, −y_t f_i^t(x_t)). Note that we do not directly use −y_t f_i^t(x_t) for updating the weights {w_i}_{i=1}^m because it is unbounded.

Algorithm 1 DA for OMKL-P (1)
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1)
2: Initialization: f^1 = 0, w^1 = 1
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   for i = 1, 2, ..., m do
8:     Set z_i^t = I(y_t f_i^t(x_t) ≤ 0)
9:     Update w_i^{t+1} = w_i^t β^{z_i^t}
10:    Update f_i^{t+1} = f_i^t + z_i^t y_t κ_i(x_t, ·)
11:  end for
12: end for

Algorithm 2 DA for OMKL-P (2)
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1); Max-misclassification level: γ > 0
2: Initialization: f^1 = 0, w^1 = 1
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   for i = 1, 2, ..., m do
8:     Set z_i^t = I(y_t f_i^t(x_t) ≤ 0), ν_i^t = z_i^t (1/2 + min(γ, −y_t f_i^t(x_t)))
9:     Update w_i^{t+1} = w_i^t β^{ν_i^t}
10:    Update f_i^{t+1} = f_i^t + z_i^t y_t κ_i(x_t, ·)
11:  end for
12: end for

Theorem 3. After receiving a sequence of T training examples D_T, the number of mistakes made by Algorithm 2 is bounded as follows:

    M = Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ≤ (2(1/2 + γ) ln(1/β))/(1 − β^{1/2+γ}) min_{1≤i≤m} g(κ_i, ℓ) + (4(1/2 + γ) ln m)/(1 − β^{1/2+γ})

The proof is essentially similar to that of Theorem 2, with modifications that address the variable ν_i^t introduced in Algorithm 2.

One problem with Algorithm 1 is how to decide an appropriate value for β. A straightforward approach is to choose the β that minimizes the mistake bound, leading to the following corollary.

Corollary 4. By choosing

    β = √( min_{1≤i≤m} Σ_{t=1}^T z_i^t ) / ( √( min_{1≤i≤m} Σ_{t=1}^T z_i^t ) + √(ln m) ),

we have the following mistake bound:

    M ≤ 2 ( min_{1≤i≤m} g(κ_i, ℓ) + ln m + 2 √( min_{1≤i≤m} g(κ_i, ℓ) ln m ) )

Proof. Following the inequality in (4), we have

    M ≤ (2/β) min_{1≤i≤m} Σ_{t=1}^T z_i^t + (2 ln m)/(1 − β)

where we use ln(1/β) ≤ 1/β − 1. By setting the derivative of the above upper bound with respect to β to zero, and using the inequality Σ_{t=1}^T z_i^t ≤ g(κ_i, ℓ) as shown in the Appendix, we have the result.

Directly using the result in Corollary 4 is impractical because it requires foreseeing the future to compute the quantity min_{1≤i≤m} Σ_{t=1}^T z_i^t. We resolve this problem by exploiting the doubling trick [4]. In particular, we divide the sequence 1, 2, ..., T into s segments:

    [T_0 + 1 = 1, T_1], [T_1 + 1, T_2], ..., [T_{s−1} + 1, T_s = T]

such that (a) min_{1≤i≤m} Σ_{t=T_k+1}^{T_{k+1}} z_i^t = 2^k for k = 0, ..., s − 2, and (b) min_{1≤i≤m} Σ_{t=T_{s−1}+1}^{T_s} z_i^t ≤ 2^{s−1}. Now, for each segment [T_k + 1, T_{k+1}], we introduce a different β, denoted by β_k, and set its value as

    β_k = 2^{k/2} / ( √(ln m) + 2^{k/2} ),    k = 0, ..., s − 1        (5)

The following theorem shows the mistake bound of Algorithm 1 with such β.

Theorem 5. By running Algorithm 1 with the β_k specified in (5), we have the following mistake bound:

    M ≤ 2 ( 2 min_{1≤i≤m} g(κ_i, ℓ) + ln m + (√2/(√2 − 1)) √( min_{1≤i≤m} g(κ_i, ℓ) ln m ) ) + 2 ⌈ log₂( min_{1≤i≤m} g(κ_i, ℓ) / ln m ) ⌉ ln m

where ⌈x⌉ computes the smallest integer that is larger than or equal to x.

4.2 Stochastic Approaches

The analysis in the previous section allows us to bound the mistakes made when classifying examples with a mixture of kernels. The main shortcoming of the deterministic approach is that in each round, all the kernel classifiers have to be checked and potentially updated if the training example is classified incorrectly. This could lead to a high computational cost when the number of kernels is large. In this section, we present stochastic approaches for online multiple kernel learning that explicitly address this challenge.
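The deterministic scheme of Section 4.1 can be sketched in a few lines of Python (a toy illustration under our own data and kernel choices, not the authors' implementation): each kernel classifier is a dual-form Perceptron, and Hedge discounts the weight of every kernel that errs on the current example.

```python
import numpy as np

def rbf(gamma):
    # Gaussian kernel on row vectors (a hypothetical kernel pool entry)
    return lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def da_omkl_p(X, y, kernels, beta=0.8):
    """Sketch of Algorithm 1 (DA for OMKL-P): one dual-form Perceptron per kernel,
    with Hedge weights discounted by beta on each per-kernel mistake."""
    T, m = len(y), len(kernels)
    alpha = np.zeros((m, T))              # alpha[i, t] = z_i^t (Perceptron updates)
    w = np.ones(m)                        # Hedge weights w^1 = 1
    mistakes = 0
    for t in range(T):
        q = w / w.sum()                   # normalized weights q^t
        if t == 0:
            f = np.zeros(m)
        else:
            # real-valued outputs f_i^t(x_t) via kernel expansions over past examples
            f = np.array([(alpha[i, :t] * y[:t]) @ kernels[i](X[:t], X[t:t+1]).ravel()
                          for i in range(m)])
        y_hat = 1.0 if q @ np.sign(f) >= 0 else -1.0  # sign(q^t . sign(f^t(x_t)))
        mistakes += int(y_hat != y[t])
        z = (y[t] * f <= 0).astype(float) # z_i^t = I(y_t f_i^t(x_t) <= 0)
        w *= beta ** z                    # Hedge: discount kernels that erred
        alpha[:, t] = z                   # f_i^{t+1} = f_i^t + z_i^t y_t k_i(x_t, .)
    return mistakes, w / w.sum()

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.1]])
y = np.array([-1.0, 1.0, -1.0, 1.0])
M, q = da_omkl_p(X, y, [rbf(0.5), rbf(5.0)])
print(M, q)
```

On this tiny stream the learner errs on the first two rounds (all classifiers start at f = 0) and then predicts the remaining two points correctly.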

Single Update Approach (SUA). Algorithm 3 shows a stochastic algorithm for OMKL-P. In each round, instead of checking every kernel classifier, we sample a single kernel classifier according to the weights, which are computed based on the number of mistakes made by the individual kernel classifiers. It is important to note, however, that rather than using q^t directly to sample one classifier to update, we add a smoothing term δ/m to the sampling probability p_i^t. This smoothing term guarantees a lower bound on p_i^t, which ensures that each kernel classifier will be explored with at least a certain amount of probability. This is similar to methods for the multi-armed bandit problem [2] that ensure a tradeoff between exploration and exploitation. The theorem below shows the mistake bound of Algorithm 3.

Theorem 6. After receiving a sequence of T training examples D_T, the expected number of mistakes made by Algorithm 3 is bounded as follows:

    E[M] = E[ Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ] ≤ (2m ln(1/β))/(δ(1 − β)) min_{1≤i≤m} g(κ_i, ℓ) + (2m ln m)/(δ(1 − β))

Remark. Compared to the mistake bound of Algorithm 1 in Theorem 2, the mistake bound of Algorithm 3 is amplified by a factor of m/δ due to the stochastic procedure of updating one out of m kernel classifiers. The smoothing parameter δ essentially controls the tradeoff between efficacy and efficiency. To see this, we note that the bound for the expected number of mistakes is inversely proportional to δ; in contrast, the bound for the expected number of updates, E[ Σ_{t=1}^T Σ_{i=1}^m m_i^t z_i^t ] = E[ Σ_{t=1}^T Σ_{i=1}^m p_i^t z_i^t ] ≤ (1 − δ) E[ Σ_{t=1}^T Σ_{i=1}^m q_i^t z_i^t ] + δT, has a leading term δT when δ is large, which is proportional to δ.

Multiple Updates Approach (MUA). Compared with the deterministic approaches, the stochastic approach, i.e., the single update algorithm, does significantly improve the computational efficiency. However, one major problem with the single update algorithm is that in each round only one single kernel classifier is selected for updating. As a result, the unselected but possibly effective kernel classifiers lose their opportunity for updating. This issue is particularly critical at the beginning of an online multiple kernel learning task, where most individual kernel classifiers could perform poorly. In order to make a better tradeoff between efficacy and efficiency, we develop another stochastic algorithm for online multiple kernel learning. The main idea of this new algorithm is to randomly choose multiple kernel classifiers for updating and prediction. Instead of choosing a kernel classifier from a multinomial distribution, the updating of each individual kernel classifier is determined by a separate Bernoulli distribution governed by p_i^t for each classifier. The detailed procedure is shown in Algorithm 4. The theorem below shows the mistake bound of the multiple updates algorithm.

Theorem 7. After receiving a sequence of T training examples D_T, the expected number of mistakes made by Algorithm 4 is bounded as follows:

    E[M] = E[ Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ] ≤ (2 ln(1/β))/(δ(1 − β)) min_{1≤i≤m} g(κ_i, ℓ) + (2 ln m)/(δ(1 − β))
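The smoothed sampling shared by the stochastic approaches can be sketched as follows (our illustration; the weights and mistake indicators below are hypothetical): p^t = (1 − δ)q^t + δ/m keeps every classifier's sampling probability at or above δ/m, and a single SUA-style round discounts only the sampled classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_probs(w, delta):
    """p^t = (1 - delta) * q^t + delta / m: every classifier keeps
    sampling probability at least delta/m (exploration floor)."""
    q = w / w.sum()
    return (1.0 - delta) * q + delta / len(w)

def single_update_round(w, z, delta, beta):
    """One SUA round (sketch): sample one classifier i_t ~ Mult(p^t) and
    apply the Hedge discount only if the sampled classifier erred (z[i_t] = 1)."""
    p = smoothed_probs(w, delta)
    i_t = rng.choice(len(w), p=p)
    w = w.copy()
    w[i_t] *= beta ** z[i_t]
    return w, i_t

w = np.ones(4)
z = np.array([1.0, 0.0, 1.0, 0.0])        # hypothetical per-kernel mistake indicators
w2, i_t = single_update_round(w, z, delta=0.2, beta=0.5)
print(i_t, w2)
```

An MUA-style round would instead draw an independent Bernoulli variable per classifier from p^t (with the smoothing term δ·1 rather than δ/m) and discount every sampled classifier that erred.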

Algorithm 3 SUA for OMKL-P
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1); Smoothing parameter: δ ∈ (0, 1)
2: Initialization: f^1 = 0, w^1 = 1, p^1 = 1/m
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   i_t = Mult_Sample(p^t)
8:   for i = 1, 2, ..., m do
9:     Set m_i^t = I(i = i_t)
10:    Set z_i^t = I(y_t f_i^t(x_t) ≤ 0)
11:    Update w_i^{t+1} = w_i^t β^{m_i^t z_i^t}
12:    Update f_i^{t+1} = f_i^t + m_i^t z_i^t y_t κ_i(x_t, ·)
13:  end for
14:  Update p^{t+1} = (1 − δ) q^{t+1} + δ 1/m
15: end for

Algorithm 4 MUA for OMKL-P
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1); Smoothing parameter: δ ∈ (0, 1)
2: Initialization: f^1 = 0, w^1 = 1, p^1 = 1
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   for i = 1, 2, ..., m do
8:     Sample m_i^t = Bern_Sample(p_i^t)
9:     Set z_i^t = I(y_t f_i^t(x_t) ≤ 0)
10:    Update w_i^{t+1} = w_i^t β^{m_i^t z_i^t}
11:    Update f_i^{t+1} = f_i^t + m_i^t z_i^t y_t κ_i(x_t, ·)
12:  end for
13:  Update p^{t+1} = (1 − δ) q^{t+1} + δ 1
14: end for

Remark. Compared to the mistake bound of Algorithm 1 in Theorem 2, the mistake bound of Algorithm 4 is amplified by a factor of 1/δ due to the stochastic procedure. On the other hand, compared to the mistake bound of the single update approach in Theorem 6, the mistake bound of Algorithm 4 is improved by a factor of m, mainly due to simultaneously updating multiple kernel classifiers in each round. The expected number of updates for the multiple updates approach is bounded by E[ Σ_{t=1}^T Σ_{i=1}^m m_i^t z_i^t ] = E[ Σ_{t=1}^T Σ_{i=1}^m p_i^t z_i^t ] ≤ (1 − δ) E[ Σ_{t=1}^T Σ_{i=1}^m q_i^t z_i^t ] + δmT, where the first term is discounted by a factor of m and the second term is amplified by a factor of m compared to that of the single update approach.

5 Algorithms for Online Multiple Kernel Learning by Outputs (OMKL-O)

5.1 A Deterministic Approach

In the following analysis, we assume that the functional norm of any classifier f_i is bounded by R, i.e., ||f_i||_{H_{κ_i}} ≤ R. We define the domain Ω_{κ_i} as Ω_{κ_i} = {f ∈ H_{κ_i} : ||f||_{H_{κ_i}} ≤ R}. Algorithm 5 shows the deterministic algorithm for OMKL-O. Compared to Algorithm 1, there are three key features of Algorithm 5. First, in step 11, the updated kernel classifier f_i is projected into the domain Ω_{κ_i} to ensure that its norm is no more than R. This projection step is important for the proof of the mistake bound that will be shown later. Second, each individual kernel classifier is updated only when

10 Onlne Multple Kernel Learnng the predcton of the combned classfer s ncorrect.e., y t ŷ t 0. Ths s n contrast to the Algorthm 1, where each kernel classfer f t s updated when t msclassfes the tranng example x t. Ths feature wll make the proposed algorthm sgnfcantly more effcent than Algorthm 1. Fnally, n step 9 of Algorthm 5, we update weghts w t+1 based on the output f tx t. Ths s n contrast to Algorthm 1 where weghts are updated only based on f the ndvdual classfers classfy the example correctly. Theorem 8 After recevng a sequence oft tranng examplesd T, the number of mstakes made by Algorthm 5 s bounded as follows f R 2 ln1/β < 1 M 1 1 R 2 ln1/β mn gu,{f } m u,{f Ω κ } m =1 + =1 where gu,{f } m =1 = m =1 u f 2 H κ +2 T lu fx t,y t. 2lnm 1 R 2 ln1/βln1/β Usng the result n Theorem 1, we have the followng corollary that bounds the number of mstakes of onlne kernel learnng by the objectve functon used n the batch mode multple kernel learnng. Corollary 9 We have the followng mstake bound for runnng Algorthm 5 fr 2 ln1/β < 1 M 1 1 R 2 ln1/β mn u,f Ω κu gκu,l+ where gκu,l = f 2 H κu +2 T lfx t,y t. 2lnm 1 R 2 ln1/βln1/β 5.2 A Stochastc Approach Fnally, we present a stochastc strategy n Algorthm 6 for OMKL-O. In each round, we randomly sample one classfer to update by followng the probablty dstrbuton p t. Smlar to Algorthm 3, the probablty dstrbutonp t s a mxture of the normalzed weghts for classfers and a smoothng term δ/m. Dfferent from Algorthm 5, the updatng rule for Algorthm 6 has two addtonal factors,.e. m t whch s non-zero for the chosen classfer and has expectaton equal to 1, and the step sze η whch s essentally ntroduced to ensure a good mstake bound as shown n the followng theorem. Theorem 10 After recevng a sequence oft tranng examplesd T, the expected number of mstakes made by Algorthm 6 s bounded as follows E[M] mn u,{f Ω κ } m =1 lu fx t,y t +2 ar,β,δbr,β,δt where ar,β,δ = R2 2 + lnm ln1/β, br,β,δ = ln1/βr2 m 2 2δ 2 ar,β,δ η = br,β,δ + m, andη s set to 2δ

Onlne Multple Kernel Learnng 11 Algorthm 5 DA for OMKL-O 1: INPUT: Kernels: K m Dscount weght: β 0, 1 Maxmum functonal norm: R 2: Intlzaton:f 1 = 0,w 1 = 1 3: for t = 1,2,... do 4: Receve an nstance x t 5: Predct:ŷ t = sgn q t f tx t 6: Receve the class label y t 7: fŷ ty t 0 then 8: for = 1,2,...,m do 9: Update w t+1 = wβ t ytft xt 10: Update 11: Project t+1 f t+1 f = f t +y tκ x t, ntoω κ by f t+1 t+1 = f /max1, f t+1 Hκ /R 12: end for 13: end f 14: end for Algorthm 6 SUA for OMKL-O 1: INPUT: K m,β,r as n Algorthm 5 Smoothng parameter δ 0,1, and Step sze: η > 0 2: Intalzaton:f 1 = 0,w 1 = 1,p 1 = 1/m 3: for t = 1,2,... do 4: Receve an nstance x t 5: Predct: ŷ t = sgn q t f tx t 6: Receve the class label y t 7: f ŷ ty t 0 then 8: t=mult Samplep t 9: for = 1,2,...,m do 10: Setm t = I = t, m t = m t /p t 11: Update w t+1 = wβ t η mt ytft xt t+1 12: Update f = f t +η m t y tκ x t, t+1 f 13: Project x ntoω κ. 14: end for 15: Update p t = 1 δq t +δ1/m 16: end f 17: end for 6 Conclusons Ths paper nvestgates a new research problem, onlne multple kernel learnng OMKL, whch ams to attack an onlne learnng task by learnng a kernel based predcton functon from a pool of predefned kernels. We consder two setups for onlne kernel learnng, onlne kernel learnng by predctons that combnes the bnary predctons from multple kernel classfers and onlne kernel learnng by outputs that combnes the realvalued outputs from kernel classfers. We proposed a framework for OMKL by learnng a combnaton of multple kernel classfers from a pool of gven kernel functons. We emphasze that OMKL s generally more challengng than typcal onlne learnng because both the kernel classfers and ther lnear combnaton are unknown. To solve ths challenge, we propose to combne two onlne learnng algorthms,.e., the Perceptron algorthm that learns a classfer for a gven kernel, and the Hedge algorthm that combnes classfers by lnear weghtng. 
Based on this idea, we present two types of algorithms for OMKL, i.e., deterministic approaches and stochastic approaches. Theoretical bounds were derived for the proposed OMKL algorithms.

Acknowledgement

The work was supported in part by National Science Foundation (IIS-0643494), Office of Naval Research (N00014-09-1-0663), and Army Research Office (W911NF-09-1-0421). Any opinions, findings, and conclusions or recommendations expressed in this

material are those of the authors and do not necessarily reflect the views of NSF, ONR, and ARO.

Appendix

Proof of Theorem 2

Proof. The proof essentially combines the proofs of the Perceptron algorithm [25] and the Hedge algorithm [9]. First, following the analysis in [9], we can easily have

$$\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i \le \frac{\ln(1/\beta)}{1-\beta} \min_{1 \le i \le m} \sum_{t=1}^{T} z_t^i + \frac{\ln m}{1-\beta}.$$

Second, due to the convexity of $\ell$ and the updating rule for $f^i$, when $z_t^i = 1$ we have

$$\ell(f_t^i(x_t), y_t) - \ell(f(x_t), y_t) \le -y_t\big\langle f_t^i - f,\ z_t^i \kappa_i(x_t, \cdot)\big\rangle_{\mathcal{H}_{\kappa_i}} = \big\langle f_t^i - f,\ f_t^i - f_{t+1}^i\big\rangle_{\mathcal{H}_{\kappa_i}} \le \frac{1}{2}\Big(\|f_t^i - f\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f\|^2_{\mathcal{H}_{\kappa_i}} + z_t^i\Big).$$

Since $\ell(f_t^i(x_t), y_t) \ge z_t^i$, we then have

$$z_t^i \le \|f_t^i - f\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f\|^2_{\mathcal{H}_{\kappa_i}} + 2\,\ell(f(x_t), y_t).$$

Taking the summation on both sides and telescoping (note that $f_1^i = 0$), we have

$$\sum_{t=1}^{T} z_t^i \le \min_{f \in \mathcal{H}_{\kappa_i}} \|f\|^2_{\mathcal{H}_{\kappa_i}} + 2\sum_{t=1}^{T} \ell(f(x_t), y_t) = g(\kappa_i, \ell).$$

Using the above inequality and noting that $M = \sum_{t=1}^{T} I\big(\sum_{i=1}^{m} q_t^i z_t^i \ge 0.5\big) \le 2\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i$, we have the result in the theorem.

Proof of Theorem 3

Proof. The proof can be constructed similarly to the proof of Theorem 2 by noting the following three differences. First, the updating rule for the weights can be written as $w_{t+1}^i = w_t^i \big(\beta^{1/2+\gamma}\big)^{\nu_t^i/(1/2+\gamma)}$, where $\nu_t^i \le 1/2+\gamma$. Second, $\sum_i q_t^i \nu_t^i \ge \sum_i q_t^i z_t^i / 2$. Third, we have $\ell(f_t^i(x_t), y_t) \ge \nu_t^i + z_t^i/2$, and therefore $\sum_{t=1}^{T} \nu_t^i \le g(\kappa_i, \ell)/2$.

Proof of Theorem 5

Proof. We denote by $M_k$ the number of mistakes made during the segment $[T_k + 1, T_{k+1}]$. It can be shown that the following inequality

$$M_k \le 2\min_{1 \le i \le m} \sum_{t=T_k+1}^{T_{k+1}} z_t^i + \ln m + 2\sqrt{\ln m}\; 2^{k/2}$$

holds for any $k = 0, \ldots, s-1$. Taking the summation over all $M_k$,

$$M = \sum_{k=0}^{s-1} M_k \le 2\sum_{k=0}^{s-1} \min_{1 \le i \le m} \sum_{t=T_k+1}^{T_{k+1}} z_t^i + 2\sqrt{\ln m}\sum_{k=0}^{s-1} 2^{k/2} + 2s\ln m,$$

we can obtain the bound in the theorem by noting that $2^{s} - 1 \le T$ and $\min_{1 \le i \le m} \sum_{t=1}^{T} z_t^i \le \min_{1 \le i \le m} g(\kappa_i, \ell)$.

Proof of Theorem 6

Proof. Similar to the proof of Theorem 2, we can prove

$$\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i\, \widetilde{m}_t^i z_t^i \le \frac{\ln(1/\beta)}{1-\beta} \min_{1 \le i \le m} \sum_{t=1}^{T} \widetilde{m}_t^i z_t^i + \frac{\ln m}{1-\beta}, \qquad \sum_{t=1}^{T} \widetilde{m}_t^i z_t^i \le g(\kappa_i, \ell).$$

Taking the expectation on both sides, and noting that $\mathrm{E}[m_t^i] = p_t^i \ge \delta/m$, we have

$$\mathrm{E}\Big[\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i\Big] \le \frac{m\ln(1/\beta)}{\delta(1-\beta)} \min_{1 \le i \le m} g(\kappa_i, \ell) + \frac{m\ln m}{\delta(1-\beta)}.$$

Since $M \le 2\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i$, we have the result stated in the theorem.

Proof of Theorem 7

Proof. The proof can be duplicated similarly to the proof of Theorem 6, except that $p_t^i \ge \delta$, $i = 1, \ldots, m$, in this case.

Proof of Theorem 8

Proof. In the following analysis, we only consider the subset of iterations where the algorithm makes a mistake. By slightly abusing the notation, we denote by $1, 2, \ldots, M$ the trials in which the examples are misclassified by Algorithm 5. For any combination weight $u = (u_1, \ldots, u_m)$ and any kernel classification functions $f_i \in \Omega_\kappa$, $i = 1, \ldots, m$, we have

$$\ell\Big(\sum_i q_t^i f_t^i(x_t), y_t\Big) - \ell\Big(\sum_i u_i f_i(x_t), y_t\Big) \le g_t\Big(y_t \sum_i q_t^i f_t^i(x_t) - y_t \sum_i u_i f_i(x_t)\Big) = g_t y_t\Big(\sum_i q_t^i f_t^i(x_t) - \sum_i u_i f_t^i(x_t)\Big) + g_t y_t\Big(\sum_i u_i f_t^i(x_t) - \sum_i u_i f_i(x_t)\Big)$$

where $g_t = \frac{\partial}{\partial z}\max(0, 1-z)\big|_{z = y_t \sum_i q_t^i f_t^i(x_t)} = -1$ because the examples are misclassified in these trials. For the first term, following the proof of Theorem 11.3 in [4], we can have

$$g_t y_t \sum_i \big(q_t^i - u_i\big) f_t^i(x_t) \le \frac{\ln(1/\beta)\, R^2}{2} + \frac{1}{\ln(1/\beta)}\big\{\mathrm{KL}(u\|q_t) - \mathrm{KL}(u\|q_{t+1})\big\}$$

For the second term, following the analysis in the proof of Theorem 2, we can have

$$g_t y_t\Big(\sum_i u_i f_t^i(x_t) - \sum_i u_i f_i(x_t)\Big) = \sum_{i=1}^{m} u_i \big\langle f_t^i - f_i,\ g_t y_t \kappa_i(x_t, \cdot)\big\rangle_{\mathcal{H}_{\kappa_i}} \le \frac{1}{2} + \sum_{i=1}^{m} \frac{u_i}{2}\Big(\|f_t^i - f_i\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f_i\|^2_{\mathcal{H}_{\kappa_i}}\Big).$$

Combining the above results together, we arrive at

$$\sum_{t=1}^{M} \ell\Big(\sum_i q_t^i f_t^i(x_t), y_t\Big) - \ell\Big(\sum_i u_i f_i(x_t), y_t\Big) \le \frac{1}{2}\sum_{i=1}^{m} u_i \|f_i\|^2_{\mathcal{H}_{\kappa_i}} + \frac{\ln m}{\ln(1/\beta)} + \frac{1}{2}\big(\ln(1/\beta) R^2 + 1\big) M.$$

Since $\ell\big(\sum_i q_t^i f_t^i(x_t), y_t\big) \ge 1$ in these trials, we can have the result in the theorem.

Proof of Theorem 10

Proof. Similar to the proof of Theorem 8, we have the following bounds for the two terms:

$$\mathrm{E}\Big[g_t y_t \sum_i \big(q_t^i - u_i\big) f_t^i(x_t)\Big] = \frac{1}{\eta}\, \mathrm{E}\Big[g_t y_t \sum_i \big(q_t^i - u_i\big)\, \eta\, \widetilde{m}_t^i f_t^i(x_t)\Big] \le \frac{1}{\eta \ln(1/\beta)}\, \mathrm{E}\big[\mathrm{KL}(u\|q_t) - \mathrm{KL}(u\|q_{t+1})\big] + \frac{\ln(1/\beta)\, R^2 \eta\, m^2}{2\delta^2}$$

and

$$\mathrm{E}\Big[g_t y_t \sum_i u_i\big(f_t^i(x_t) - f_i(x_t)\big)\Big] = \mathrm{E}\Big[\sum_{i=1}^{m} \frac{u_i}{\eta}\big\langle f_t^i - f_i,\ \eta\, \widetilde{m}_t^i g_t y_t \kappa_i(x_t, \cdot)\big\rangle_{\mathcal{H}_{\kappa_i}}\Big] \le \frac{m\eta}{2\delta} + \mathrm{E}\Big[\sum_{i=1}^{m} \frac{u_i}{2\eta}\Big(\|f_t^i - f_i\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f_i\|^2_{\mathcal{H}_{\kappa_i}}\Big)\Big].$$

Combining the above results together, and setting $\eta$ as in the theorem, we have the result in the theorem.

References

1. S. Agmon. The relaxation method for linear inequalities. CJM, 6(3):382–392, 1954.
2. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SICOMP, 32(1), 2003.
3. F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
4. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
5. O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, pages 585–592, 2002.
6. Y. Chen, M. R. Gupta, and B. Recht. Learning kernels from indefinite similarities. In ICML, pages 145–152, 2009.

7. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7, 2006.
8. K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. JMLR, 3, 2003.
9. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1), 1997.
10. Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. ML, 37(3), 1999.
11. C. Gentile. A new approximate maximal margin classification algorithm. JMLR, 2, 2001.
12. S. C. Hoi, R. Jin, and M. R. Lyu. Learning non-parametric kernel matrices from pairwise constraints. In ICML, pages 361–368, 2007.
13. S. C. H. Hoi, M. R. Lyu, and E. Y. Chang. Learning the unified kernel machines for classification. In KDD, pages 187–196, 2006.
14. H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In ICML, pages 321–328, 2003.
15. J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. IEEE Trans. on Sig. Proc., 52(8), 2004.
16. J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. In NIPS, pages 785–792, 2001.
17. M. Kloft, U. Rückert, and P. L. Bartlett. A unifying view of multiple kernel learning. CoRR, abs/1005.0437, 2010.
18. R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML, pages 315–322, 2002.
19. B. Kulis, M. Sustik, and I. Dhillon. Learning low-rank kernel matrices. In ICML, pages 505–512, 2006.
20. G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.
21. Y. Li and P. M. Long. The relaxed online maximum margin algorithm. ML, 46(1–3), 2002.
22. N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In FOCS, 1989.
23. A. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, 1962.
24. A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL.
JMLR, 11, 2008.
25. F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 1958.
26. B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. 2001.
27. S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, 2007.
28. S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 7, 2006.
29. V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
30. V. Vovk. A game of prediction with expert advice. J. Comput. Syst. Sci., 56(2), 1998.
31. Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825–1832, 2008.
32. X. Zhu, J. S. Kandola, Z. Ghahramani, and J. D. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In NIPS, pages 1641–1648, 2004.