Online Multiple Kernel Learning: Algorithms and Mistake Bounds




Rong Jin 1, Steven C.H. Hoi 2, and Tianbao Yang 1
1 Department of Computer Science and Engineering, Michigan State University, MI 48824, USA
2 School of Computer Engineering, Nanyang Technological University, 639798, Singapore
1 {rongjin,yangtia1}@cse.msu.edu, 2 chhoi@ntu.edu.sg

Abstract. Online learning and kernel learning are two active research topics in machine learning. Although each of them has been studied extensively, there has been limited effort in addressing the intersecting research. In this paper, we introduce a new research problem, termed Online Multiple Kernel Learning (OMKL), that aims to learn a kernel-based prediction function from a pool of predefined kernels in an online learning fashion. OMKL is generally more challenging than typical online learning because both the kernel classifiers and their linear combination weights must be learned simultaneously. In this work, we consider two setups for OMKL, i.e., combining binary predictions or real-valued outputs from multiple kernel classifiers, and we propose both deterministic and stochastic approaches in the two setups for OMKL. The deterministic approach updates all kernel classifiers for every misclassified example, while the stochastic approach randomly chooses classifiers for updating according to some sampling strategies. Mistake bounds are derived for all the proposed OMKL algorithms.

Keywords: Online learning and relative loss bounds, Kernels

1 Introduction

In recent years, we have witnessed increasing interest in both online learning and kernel learning. Online learning refers to the learning process of answering a sequence of questions given the feedback of correct answers to previous questions and possibly some additional prior information [27]; kernel learning aims to identify an effective kernel for a given learning task [20, 28, 12]. A well-known kernel learning method is Multiple Kernel Learning (MKL) [3, 28], which seeks a combination of multiple kernels in order to optimize the performance of kernel-based learning methods (e.g., Support Vector Machines, SVM).
Although the kernel trick has been explored in online learning [10, 7], it is often assumed that the kernel function is given a priori. In this work, we address a new research problem, Online Multiple Kernel Learning (OMKL), which aims to simultaneously learn multiple kernel classifiers and their linear combination from a pool of given kernels in an online fashion. Compared to the existing methods for multiple kernel learning (see [17] and references therein), online multiple kernel learning is computationally advantageous in that it only requires going through the training examples once. We emphasize that online multiple kernel learning is significantly more challenging than typical online learning because both the optimal kernel classifiers and their linear combinations have to be learned simultaneously in an online fashion.

In this paper, we consider two different setups for online multiple kernel learning. In the first setup, termed Online Multiple Kernel Learning by Predictions (OMKL-P), the objective is to combine the binary predictions from multiple kernel classifiers. The second setup, termed Online Multiple Kernel Learning by Outputs (OMKL-O), improves OMKL-P by combining the real-valued outputs from multiple kernel classifiers. Our online learning framework for multiple kernel learning is based on the combination of two types of online learning techniques: the Perceptron algorithm [25], which learns a classifier for a given kernel, and the Hedge algorithm [9], which linearly combines multiple classifiers. Based on the proposed framework, we present two types of approaches for each setup of OMKL, i.e., deterministic and stochastic approaches. The deterministic approach updates each kernel classifier for every misclassified example, while the stochastic approach chooses a subset of classifiers for updating based on certain sampling strategies. Mistake bounds are derived for all the proposed algorithms for online kernel learning.

The rest of this paper is organized as follows. Section 2 reviews the related work on both online learning and kernel learning. Section 3 overviews the problem of online multiple kernel learning. Section 4 presents the algorithms for Online Multiple Kernel Learning by Predictions and their mistake bounds; Section 5 presents the algorithms for Online Multiple Kernel Learning by Outputs and their mistake bounds. Section 6 concludes this study with future work.

2 Related Work

Our work is closely related to both online learning and kernel learning. Below we briefly review the important work in both areas.

Extensive studies have been devoted to online learning for classification. Starting from the Perceptron algorithm [1, 25, 23], a number of online classification algorithms have been proposed, including the ROMMA algorithm [21], the ALMA algorithm [11], the MIRA algorithm [8], the NORMA algorithm [16, 15], and the online Passive-Aggressive algorithms [7]. Several studies extended the Perceptron algorithm into a nonlinear classifier by the introduction of kernel functions [16, 10]. Although these algorithms are effective for nonlinear classification, they usually assume that appropriate kernel functions are given a priori, which limits their applications. Besides online classification, our work is also related to online prediction with expert advice [9, 22, 30]. The most well-known work is probably the Hedge algorithm [9], which was a direct generalization of Littlestone and Warmuth's Weighted Majority (WM) algorithm [22]. We refer readers to the book [4] for an in-depth discussion of this subject.

Kernel learning has been actively studied thanks to the great successes of kernel methods, such as support vector machines (SVM) [29, 26]. Recent studies of kernel learning focus on learning an effective kernel automatically from training data. Various algorithms have been proposed to learn parametric or semi-parametric kernels from labeled and/or unlabeled data. Example techniques include cluster kernels [5], diffusion kernels [18], marginalized kernels [14], graph-based spectral kernel learning approaches [32, 13], non-parametric kernel learning [12, 6], and low-rank kernel learning [19]. Among the various approaches for kernel learning, Multiple Kernel Learning (MKL) [20], whose goal is to learn an optimal combination of multiple kernels, has emerged as a promising technique. A number of approaches have been proposed to solve the optimization problem related to MKL, including the conic combination approach via convex optimization [20], the semi-infinite linear program (SILP) approach [28], the Subgradient Descent approach [24], and the recent level method [31].

We emphasize that although both online learning and kernel learning have been extensively studied, little work has been done to address online kernel learning, especially online multiple kernel learning. To the best of our knowledge, this is the first theoretical study that addresses the OMKL problem.

3 Online Multiple Kernel Learning

Before presenting the algorithms for online multiple kernel learning, we first briefly describe the Multiple Kernel Learning (MKL) problem. Given a set of training examples D_T = {(x_t, y_t), t = 1, ..., T}, where y_t ∈ {−1, +1}, and a collection of kernel functions K_m = {κ_i : X × X → R, i = 1, ..., m}, our goal is to identify the optimal combination of kernel functions, denoted by u = (u_1, ..., u_m), that minimizes the margin classification error. It is cast as the following optimization problem:

    min_{u ∈ Δ} min_{f ∈ H_{κ(u)}} (1/2) ||f||²_{H_{κ(u)}} + C Σ_{t=1}^T ℓ(f(x_t), y_t)        (1)

where H_{κ(u)} denotes the reproducing kernel Hilbert space defined by the kernel κ(u), Δ denotes the simplex, i.e., Δ = {θ ∈ R_+^m : Σ_{i=1}^m θ_i = 1}, and

    κ(u)(·, ·) = Σ_{j=1}^m u_j κ_j(·, ·),    ℓ(f(x_t), y_t) = max(0, 1 − y_t f(x_t)).

It can also be cast into the following minimax problem:

    min_{u ∈ Δ} max_{α ∈ [0,C]^T} { Σ_{t=1}^T α_t − (1/2) (α ∘ y)ᵀ ( Σ_{i=1}^m u_i K^i ) (α ∘ y) }        (2)

where K^i ∈ R^{T×T} with K^i_{j,l} = κ_i(x_j, x_l), y = (y_1, ..., y_T), and ∘ is the element-wise product between two vectors. The formulation for batch-mode multiple kernel learning in (1) aims to learn a single function in the space H_{κ(u)}.
It is well recognized that solving the optimization problem in (1) is in general computationally expensive. In this work, we aim to alleviate the computational difficulty of multiple kernel learning by online learning that only needs to scan through the training examples once. The following theorem allows us to simplify this problem by decomposing it into two separate tasks, i.e., learning a classifier for each individual kernel, and learning the weights that combine the outputs of the individual kernel classifiers to form the final prediction.
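To make the batch MKL formulation above concrete, the following minimal sketch (our illustration, not the authors' code; the kernel pool, weights u, and dual coefficients α are hypothetical) builds the combined Gram matrix κ(u) = Σ_j u_j κ_j and evaluates the hinge loss ℓ(f(x_t), y_t) = max(0, 1 − y_t f(x_t)) on a toy sample:

```python
import numpy as np

def rbf(gamma):
    # Gaussian kernel matrix between row sets A and B (hypothetical pool entry)
    return lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

# Toy data with labels y_t in {-1, +1}
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

kernels = [rbf(0.5), rbf(2.0), rbf(8.0)]   # pool K_m of m = 3 predefined kernels
u = np.array([0.5, 0.3, 0.2])              # simplex weights: u_i >= 0, sum_i u_i = 1

# Combined kernel kappa(u) = sum_j u_j kappa_j, as a Gram matrix on the sample
K_u = sum(u_j * k(X, X) for u_j, k in zip(u, kernels))

# Hinge loss of a kernel expansion f(x) = sum_t alpha_t y_t kappa(u)(x_t, x)
alpha = np.full(len(y), 0.1)               # hypothetical dual coefficients in [0, C]
f = K_u @ (alpha * y)
loss = np.maximum(0.0, 1.0 - y * f)        # l(f(x_t), y_t) = max(0, 1 - y_t f(x_t))
print(K_u.shape, loss)
```

Note that since each κ_i here satisfies κ_i(x, x) = 1 and u lies on the simplex, the combined Gram matrix has unit diagonal, consistent with the assumption κ(x, x) ≤ 1 used later in the paper.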

Theorem 1. The optimization problem in (1) is equivalent to

    min_{u ∈ Δ, {f_i ∈ H_{κ_i}}_{i=1}^m} Σ_{i=1}^m (u_i/2) ||f_i||²_{H_{κ_i}} + C Σ_{t=1}^T ℓ( Σ_{i=1}^m u_i f_i(x_t), y_t )        (3)

Proof. It is important to note that the problem in (3) is non-convex, and therefore we cannot directly deploy the standard approach to convert it into its dual form. In order to transform (3) into (1), we rewrite ℓ(z, y) = max_{α ∈ [0,1]} α(1 − yz), and rewrite (3) as follows:

    min_{u ∈ Δ} min_{{f_i ∈ H_{κ_i}}_{i=1}^m} max_{α ∈ [0,C]^T} Σ_{i=1}^m (u_i/2) ||f_i||²_{H_{κ_i}} + Σ_{t=1}^T α_t ( 1 − y_t Σ_{i=1}^m u_i f_i(x_t) )

Since the problem is convex in f_i and concave in α, we can switch the minimization over f_i with the maximization over α. By taking the minimization over f_i, we have

    f_i(·) = Σ_{t=1}^T α_t y_t κ_i(x_t, ·),    i = 1, ..., m

and the resulting optimization problem becomes

    min_{u ∈ Δ} max_{α ∈ [0,C]^T} Σ_{t=1}^T α_t − Σ_{i=1}^m (u_i/2) (α ∘ y)ᵀ K^i (α ∘ y),

which is identical to the optimization problem (2) of batch-mode multiple kernel learning.

Based on the result of the above theorem, our strategy toward online kernel learning is to simultaneously learn a set of kernel classifiers f_i, i = 1, ..., m, and their combination weights u. We consider two setups for Online Multiple Kernel Learning (OMKL). In the first setup, termed Online Multiple Kernel Learning by Predictions (OMKL-P), we simplify the problem by only considering the combination of the binary predictions from the multiple kernel classifiers, i.e., ŷ = Σ_{i=1}^m u_i sign(f_i(x)). In the second setup, termed Online Multiple Kernel Learning by Outputs (OMKL-O), we learn to combine the real-valued outputs from the multiple kernel classifiers, i.e., ŷ = Σ_{i=1}^m u_i f_i(x). In the next two sections, we discuss algorithms and theoretical properties for OMKL-P and OMKL-O, respectively.

For the convenience of analysis, throughout the paper we assume κ_i(x, x) ≤ 1 for all kernel functions κ_i and for any example x. Below we summarize the notations that are used throughout this paper:

- D_T = {(x_t, y_t), t = 1, ..., T} denotes a sequence of T training examples. K_m = {κ_i : X × X → R, i = 1, ..., m} denotes a collection of m kernel functions.
- f^t = (f_1^t, ..., f_m^t) denotes the collection of m classifiers in round t, where f_i^t represents the classifier learned using the kernel κ_i(·, ·). For the purpose of presentation, we write f_i^t and f^t for short. f^t(x) = (f_1^t(x), ..., f_m^t(x)) denotes the real-valued outputs on example x of the m classifiers learned in round t, and sign(f^t(x)) = (sign(f_1^t(x)), ..., sign(f_m^t(x))) denotes the binary predictions of the corresponding classifiers on example x.

- w^t = (w_1^t, ..., w_m^t) denotes the weight vector for the m classifiers in round t; W_t = Σ_{i=1}^m w_i^t represents the sum of the weights in round t; q^t = (q_1^t, ..., q_m^t) denotes the normalized weight vector, i.e., q_i^t = w_i^t / W_t.
- z^t = (z_1^t, ..., z_m^t) denotes the indicator vector, where z_i^t = I(y_t f_i^t(x_t) ≤ 0) indicates whether the i-th kernel classifier makes a mistake on example x_t; I(C) outputs 1 when C is true and 0 otherwise.
- m^t = (m_1^t, ..., m_m^t) denotes the 0-1 random variable vector, where m_i^t ∈ {0, 1} indicates whether the i-th kernel classifier is chosen for updating in round t.
- p^t = (p_1^t, ..., p_m^t) denotes a probability vector, i.e., p_i^t ∈ [0, 1].
- a · b denotes the dot product between vectors a and b; 1 denotes a vector with all elements equal to 1, and 0 denotes a vector with all elements equal to 0.
- Mult_Sample(p^t) denotes a multinomial sampling process following the probability distribution p^t that outputs i_t ∈ {1, ..., m}. Bern_Sample(p_i^t) denotes a Bernoulli sampling process following the probability p_i^t that outputs a binary variable m_i^t ∈ {0, 1}.

4 Algorithms for Online Kernel Learning by Predictions (OMKL-P)

4.1 Deterministic Approaches (DA)

As already pointed out, the main challenge of OMKL is that both the kernel classifiers and their combination are unknown. The most straightforward approach is to learn a classifier for each individual kernel function and decide its combination weight based on the number of mistakes made by that kernel classifier. To this end, we combine the Perceptron algorithm and the Hedge algorithm: for each kernel, the Perceptron algorithm is employed to learn a classifier, and the Hedge algorithm is used to update its weight. Algorithm 1 shows the deterministic algorithm for OMKL-P. The theorem below shows the mistake bound for Algorithm 1.

For the convenience of presentation, we define the optimal margin error for a kernel κ with respect to a collection of training examples D_T as:

    g(κ, ℓ) = min_{f ∈ H_κ} ( ||f||²_{H_κ} + 2 Σ_{t=1}^T ℓ(f(x_t), y_t) )

Theorem 2. After receiving a sequence of T training examples D_T, the number of mistakes made by running Algorithm 1 is bounded as follows:

    M = Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ≤ (2 ln(1/β))/(1 − β) min_{1≤i≤m} g(κ_i, ℓ) + (2 ln m)/(1 − β)        (4)

The proof for this theorem, as well as for the following theorems, is sketched in the Appendix. Note that in Algorithm 1 the weight of each individual kernel classifier is updated based only on whether it classifies the training example correctly. An alternative approach for updating the weights is to take into account the output values of {f_i^t}_{i=1}^m, by penalizing a

kernel classifier more if its degree of misclassification, measured by −y_t f_i^t(x_t), is large. To this end, we present a second version of the deterministic approach for OMKL-P in Algorithm 2, which takes into account the values of {f_i^t}_{i=1}^m when updating the weights {w_i}_{i=1}^m. In this alternative algorithm, we introduce the parameter γ, which can be interpreted as the maximum level of misclassification. The key quantity introduced in Algorithm 2 is ν_i^t, which measures the degree of misclassification by 1/2 + min(γ, −y_t f_i^t(x_t)). Note that we do not directly use −y_t f_i^t(x_t) for updating the weights {w_i}_{i=1}^m because it is unbounded.

Algorithm 1 DA for OMKL-P (1)
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1)
2: Initialization: f^1 = 0, w^1 = 1
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   for i = 1, 2, ..., m do
8:     Set z_i^t = I(y_t f_i^t(x_t) ≤ 0)
9:     Update w_i^{t+1} = w_i^t β^{z_i^t}
10:    Update f_i^{t+1} = f_i^t + z_i^t y_t κ_i(x_t, ·)
11:  end for
12: end for

Algorithm 2 DA for OMKL-P (2)
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1); Max-misclassification level: γ > 0
2: Initialization: f^1 = 0, w^1 = 1
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   for i = 1, 2, ..., m do
8:     Set z_i^t = I(y_t f_i^t(x_t) ≤ 0), ν_i^t = z_i^t (1/2 + min(γ, −y_t f_i^t(x_t)))
9:     Update w_i^{t+1} = w_i^t β^{ν_i^t}
10:    Update f_i^{t+1} = f_i^t + z_i^t y_t κ_i(x_t, ·)
11:  end for
12: end for

Theorem 3. After receiving a sequence of T training examples D_T, the number of mistakes made by Algorithm 2 is bounded as follows:

    M = Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ≤ (2(1/2 + γ) ln(1/β))/(1 − β^{1/2+γ}) min_{1≤i≤m} g(κ_i, ℓ) + (4(1/2 + γ) ln m)/(1 − β^{1/2+γ})

The proof is essentially similar to that of Theorem 2, with modifications that address the variable ν_i^t introduced in Algorithm 2.

One problem with Algorithm 1 is how to decide an appropriate value for β. A straightforward approach is to choose the β that minimizes the mistake bound, leading to the following corollary.

Corollary 4. By choosing

    β = √( min_{1≤i≤m} Σ_{t=1}^T z_i^t ) / ( √( min_{1≤i≤m} Σ_{t=1}^T z_i^t ) + √(ln m) ),

we have the following mistake bound:

    M ≤ 2 ( min_{1≤i≤m} g(κ_i, ℓ) + ln m + 2 √( min_{1≤i≤m} g(κ_i, ℓ) ln m ) )

Proof. Following the inequality in (4), we have

    M ≤ (2/β) min_{1≤i≤m} Σ_{t=1}^T z_i^t + (2 ln m)/(1 − β)

where we use ln(1/β) ≤ 1/β − 1. By setting the derivative of the above upper bound with respect to β to zero, and using the inequality Σ_{t=1}^T z_i^t ≤ g(κ_i, ℓ) as shown in the Appendix, we have the result.

Directly using the result in Corollary 4 is impractical because it requires foreseeing the future to compute the quantity min_{1≤i≤m} Σ_{t=1}^T z_i^t. We resolve this problem by exploiting the doubling trick [4]. In particular, we divide the sequence 1, 2, ..., T into s segments:

    [T_0 + 1 = 1, T_1], [T_1 + 1, T_2], ..., [T_{s−1} + 1, T_s = T]

such that (a) min_{1≤i≤m} Σ_{t=T_k+1}^{T_{k+1}} z_i^t = 2^k for k = 0, ..., s − 2, and (b) min_{1≤i≤m} Σ_{t=T_{s−1}+1}^{T_s} z_i^t ≤ 2^{s−1}. Now, for each segment [T_k + 1, T_{k+1}], we introduce a different β, denoted by β_k, and set its value as

    β_k = 2^{k/2} / ( √(ln m) + 2^{k/2} ),    k = 0, ..., s − 1        (5)

The following theorem shows the mistake bound of Algorithm 1 with such β.

Theorem 5. By running Algorithm 1 with the β_k specified in (5), we have the following mistake bound:

    M ≤ 2 ( 2 min_{1≤i≤m} g(κ_i, ℓ) + ln m + (√2/(√2 − 1)) √( min_{1≤i≤m} g(κ_i, ℓ) ln m ) ) + 2 ⌈ log₂( min_{1≤i≤m} g(κ_i, ℓ) / ln m ) ⌉ ln m

where ⌈x⌉ computes the smallest integer that is larger than or equal to x.

4.2 Stochastic Approaches

The analysis in the previous section allows us to bound the mistakes made when classifying examples with a mixture of kernels. The main shortcoming of the deterministic approach is that in each round, all the kernel classifiers have to be checked and potentially updated if the training example is classified incorrectly. This could lead to a high computational cost when the number of kernels is large. In this section, we present stochastic approaches for online multiple kernel learning that explicitly address this challenge.
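The deterministic scheme of Section 4.1 can be sketched in a few lines of Python (a toy illustration under our own data and kernel choices, not the authors' implementation): each kernel classifier is a dual-form Perceptron, and Hedge discounts the weight of every kernel that errs on the current example.

```python
import numpy as np

def rbf(gamma):
    # Gaussian kernel on row vectors (a hypothetical kernel pool entry)
    return lambda A, B: np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def da_omkl_p(X, y, kernels, beta=0.8):
    """Sketch of Algorithm 1 (DA for OMKL-P): one dual-form Perceptron per kernel,
    with Hedge weights discounted by beta on each per-kernel mistake."""
    T, m = len(y), len(kernels)
    alpha = np.zeros((m, T))              # alpha[i, t] = z_i^t (Perceptron updates)
    w = np.ones(m)                        # Hedge weights w^1 = 1
    mistakes = 0
    for t in range(T):
        q = w / w.sum()                   # normalized weights q^t
        if t == 0:
            f = np.zeros(m)
        else:
            # real-valued outputs f_i^t(x_t) via kernel expansions over past examples
            f = np.array([(alpha[i, :t] * y[:t]) @ kernels[i](X[:t], X[t:t+1]).ravel()
                          for i in range(m)])
        y_hat = 1.0 if q @ np.sign(f) >= 0 else -1.0  # sign(q^t . sign(f^t(x_t)))
        mistakes += int(y_hat != y[t])
        z = (y[t] * f <= 0).astype(float) # z_i^t = I(y_t f_i^t(x_t) <= 0)
        w *= beta ** z                    # Hedge: discount kernels that erred
        alpha[:, t] = z                   # f_i^{t+1} = f_i^t + z_i^t y_t k_i(x_t, .)
    return mistakes, w / w.sum()

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [0.9, 1.1]])
y = np.array([-1.0, 1.0, -1.0, 1.0])
M, q = da_omkl_p(X, y, [rbf(0.5), rbf(5.0)])
print(M, q)
```

On this tiny stream the learner errs on the first two rounds (all classifiers start at f = 0) and then predicts the remaining two points correctly.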

Single Update Approach (SUA). Algorithm 3 shows a stochastic algorithm for OMKL-P. In each round, instead of checking every kernel classifier, we sample a single kernel classifier according to the weights, which are computed based on the number of mistakes made by the individual kernel classifiers. It is important to note, however, that rather than using q^t directly to sample one classifier to update, we add a smoothing term δ/m to the sampling probability p_i^t. This smoothing term guarantees a lower bound on p_i^t, which ensures that each kernel classifier will be explored with at least a certain amount of probability. This is similar to methods for the multi-armed bandit problem [2] that ensure a tradeoff between exploration and exploitation. The theorem below shows the mistake bound of Algorithm 3.

Theorem 6. After receiving a sequence of T training examples D_T, the expected number of mistakes made by Algorithm 3 is bounded as follows:

    E[M] = E[ Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ] ≤ (2m ln(1/β))/(δ(1 − β)) min_{1≤i≤m} g(κ_i, ℓ) + (2m ln m)/(δ(1 − β))

Remark. Compared to the mistake bound of Algorithm 1 in Theorem 2, the mistake bound of Algorithm 3 is amplified by a factor of m/δ due to the stochastic procedure of updating one out of m kernel classifiers. The smoothing parameter δ essentially controls the tradeoff between efficacy and efficiency. To see this, we note that the bound for the expected number of mistakes is inversely proportional to δ; in contrast, the bound for the expected number of updates, E[ Σ_{t=1}^T Σ_{i=1}^m m_i^t z_i^t ] = E[ Σ_{t=1}^T Σ_{i=1}^m p_i^t z_i^t ] ≤ (1 − δ) E[ Σ_{t=1}^T Σ_{i=1}^m q_i^t z_i^t ] + δT, has a leading term δT when δ is large, which is proportional to δ.

Multiple Updates Approach (MUA). Compared with the deterministic approaches, the stochastic approach, i.e., the single update algorithm, does significantly improve the computational efficiency. However, one major problem with the single update algorithm is that in each round only one single kernel classifier is selected for updating. As a result, the unselected but possibly effective kernel classifiers lose their opportunity for updating. This issue is particularly critical at the beginning of an online multiple kernel learning task, where most individual kernel classifiers could perform poorly. In order to make a better tradeoff between efficacy and efficiency, we develop another stochastic algorithm for online multiple kernel learning. The main idea of this new algorithm is to randomly choose multiple kernel classifiers for updating and prediction. Instead of choosing a kernel classifier from a multinomial distribution, the updating of each individual kernel classifier is determined by a separate Bernoulli distribution governed by p_i^t for each classifier. The detailed procedure is shown in Algorithm 4. The theorem below shows the mistake bound of the multiple updates algorithm.

Theorem 7. After receiving a sequence of T training examples D_T, the expected number of mistakes made by Algorithm 4 is bounded as follows:

    E[M] = E[ Σ_{t=1}^T I(q^t · z^t ≥ 0.5) ] ≤ (2 ln(1/β))/(δ(1 − β)) min_{1≤i≤m} g(κ_i, ℓ) + (2 ln m)/(δ(1 − β))
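The smoothed sampling shared by the stochastic approaches can be sketched as follows (our illustration; the weights and mistake indicators below are hypothetical): p^t = (1 − δ)q^t + δ/m keeps every classifier's sampling probability at or above δ/m, and a single SUA-style round discounts only the sampled classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_probs(w, delta):
    """p^t = (1 - delta) * q^t + delta / m: every classifier keeps
    sampling probability at least delta/m (exploration floor)."""
    q = w / w.sum()
    return (1.0 - delta) * q + delta / len(w)

def single_update_round(w, z, delta, beta):
    """One SUA round (sketch): sample one classifier i_t ~ Mult(p^t) and
    apply the Hedge discount only if the sampled classifier erred (z[i_t] = 1)."""
    p = smoothed_probs(w, delta)
    i_t = rng.choice(len(w), p=p)
    w = w.copy()
    w[i_t] *= beta ** z[i_t]
    return w, i_t

w = np.ones(4)
z = np.array([1.0, 0.0, 1.0, 0.0])        # hypothetical per-kernel mistake indicators
w2, i_t = single_update_round(w, z, delta=0.2, beta=0.5)
print(i_t, w2)
```

An MUA-style round would instead draw an independent Bernoulli variable per classifier from p^t (with the smoothing term δ·1 rather than δ/m) and discount every sampled classifier that erred.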

Algorithm 3 SUA for OMKL-P
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1); Smoothing parameter: δ ∈ (0, 1)
2: Initialization: f^1 = 0, w^1 = 1, p^1 = 1/m
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   i_t = Mult_Sample(p^t)
8:   for i = 1, 2, ..., m do
9:     Set m_i^t = I(i = i_t)
10:    Set z_i^t = I(y_t f_i^t(x_t) ≤ 0)
11:    Update w_i^{t+1} = w_i^t β^{m_i^t z_i^t}
12:    Update f_i^{t+1} = f_i^t + m_i^t z_i^t y_t κ_i(x_t, ·)
13:  end for
14:  Update p^{t+1} = (1 − δ) q^{t+1} + δ 1/m
15: end for

Algorithm 4 MUA for OMKL-P
1: INPUT: Kernels: K_m; Discount weight: β ∈ (0, 1); Smoothing parameter: δ ∈ (0, 1)
2: Initialization: f^1 = 0, w^1 = 1, p^1 = 1
3: for t = 1, 2, ... do
4:   Receive an instance x_t
5:   Predict: ŷ_t = sign(q^t · sign(f^t(x_t)))
6:   Receive the class label y_t
7:   for i = 1, 2, ..., m do
8:     Sample m_i^t = Bern_Sample(p_i^t)
9:     Set z_i^t = I(y_t f_i^t(x_t) ≤ 0)
10:    Update w_i^{t+1} = w_i^t β^{m_i^t z_i^t}
11:    Update f_i^{t+1} = f_i^t + m_i^t z_i^t y_t κ_i(x_t, ·)
12:  end for
13:  Update p^{t+1} = (1 − δ) q^{t+1} + δ 1
14: end for

Remark. Compared to the mistake bound of Algorithm 1 in Theorem 2, the mistake bound of Algorithm 4 is amplified by a factor of 1/δ due to the stochastic procedure. On the other hand, compared to the mistake bound of the single update approach in Theorem 6, the mistake bound of Algorithm 4 is improved by a factor of m, mainly due to simultaneously updating multiple kernel classifiers in each round. The expected number of updates for the multiple updates approach is bounded by E[ Σ_{t=1}^T Σ_{i=1}^m m_i^t z_i^t ] = E[ Σ_{t=1}^T Σ_{i=1}^m p_i^t z_i^t ] ≤ (1 − δ) E[ Σ_{t=1}^T Σ_{i=1}^m q_i^t z_i^t ] + δmT, where the first term is discounted by a factor of m and the second term is amplified by a factor of m compared to that of the single update approach.

5 Algorithms for Online Multiple Kernel Learning by Outputs (OMKL-O)

5.1 A Deterministic Approach

In the following analysis, we assume that the functional norm of any classifier f_i is bounded by R, i.e., ||f_i||_{H_{κ_i}} ≤ R. We define the domain Ω_{κ_i} as Ω_{κ_i} = {f ∈ H_{κ_i} : ||f||_{H_{κ_i}} ≤ R}. Algorithm 5 shows the deterministic algorithm for OMKL-O. Compared to Algorithm 1, there are three key features of Algorithm 5. First, in step 11, the updated kernel classifier f_i is projected into the domain Ω_{κ_i} to ensure that its norm is no more than R. This projection step is important for the proof of the mistake bound that will be shown later. Second, each individual kernel classifier is updated only when

10 Onlne Multple Kernel Learnng the predcton of the combned classfer s ncorrect.e., y t ŷ t 0. Ths s n contrast to the Algorthm 1, where each kernel classfer f t s updated when t msclassfes the tranng example x t. Ths feature wll make the proposed algorthm sgnfcantly more effcent than Algorthm 1. Fnally, n step 9 of Algorthm 5, we update weghts w t+1 based on the output f tx t. Ths s n contrast to Algorthm 1 where weghts are updated only based on f the ndvdual classfers classfy the example correctly. Theorem 8 After recevng a sequence oft tranng examplesd T, the number of mstakes made by Algorthm 5 s bounded as follows f R 2 ln1/β < 1 M 1 1 R 2 ln1/β mn gu,{f } m u,{f Ω κ } m =1 + =1 where gu,{f } m =1 = m =1 u f 2 H κ +2 T lu fx t,y t. 2lnm 1 R 2 ln1/βln1/β Usng the result n Theorem 1, we have the followng corollary that bounds the number of mstakes of onlne kernel learnng by the objectve functon used n the batch mode multple kernel learnng. Corollary 9 We have the followng mstake bound for runnng Algorthm 5 fr 2 ln1/β < 1 M 1 1 R 2 ln1/β mn u,f Ω κu gκu,l+ where gκu,l = f 2 H κu +2 T lfx t,y t. 2lnm 1 R 2 ln1/βln1/β 5.2 A Stochastc Approach Fnally, we present a stochastc strategy n Algorthm 6 for OMKL-O. In each round, we randomly sample one classfer to update by followng the probablty dstrbuton p t. Smlar to Algorthm 3, the probablty dstrbutonp t s a mxture of the normalzed weghts for classfers and a smoothng term δ/m. Dfferent from Algorthm 5, the updatng rule for Algorthm 6 has two addtonal factors,.e. m t whch s non-zero for the chosen classfer and has expectaton equal to 1, and the step sze η whch s essentally ntroduced to ensure a good mstake bound as shown n the followng theorem. Theorem 10 After recevng a sequence oft tranng examplesd T, the expected number of mstakes made by Algorthm 6 s bounded as follows E[M] mn u,{f Ω κ } m =1 lu fx t,y t +2 ar,β,δbr,β,δt where ar,β,δ = R2 2 + lnm ln1/β, br,β,δ = ln1/βr2 m 2 2δ 2 ar,β,δ η = br,β,δ + m, andη s set to 2δ

Onlne Multple Kernel Learnng 11 Algorthm 5 DA for OMKL-O 1: INPUT: Kernels: K m Dscount weght: β 0, 1 Maxmum functonal norm: R 2: Intlzaton:f 1 = 0,w 1 = 1 3: for t = 1,2,... do 4: Receve an nstance x t 5: Predct:ŷ t = sgn q t f tx t 6: Receve the class label y t 7: fŷ ty t 0 then 8: for = 1,2,...,m do 9: Update w t+1 = wβ t ytft xt 10: Update 11: Project t+1 f t+1 f = f t +y tκ x t, ntoω κ by f t+1 t+1 = f /max1, f t+1 Hκ /R 12: end for 13: end f 14: end for Algorthm 6 SUA for OMKL-O 1: INPUT: K m,β,r as n Algorthm 5 Smoothng parameter δ 0,1, and Step sze: η > 0 2: Intalzaton:f 1 = 0,w 1 = 1,p 1 = 1/m 3: for t = 1,2,... do 4: Receve an nstance x t 5: Predct: ŷ t = sgn q t f tx t 6: Receve the class label y t 7: f ŷ ty t 0 then 8: t=mult Samplep t 9: for = 1,2,...,m do 10: Setm t = I = t, m t = m t /p t 11: Update w t+1 = wβ t η mt ytft xt t+1 12: Update f = f t +η m t y tκ x t, t+1 f 13: Project x ntoω κ. 14: end for 15: Update p t = 1 δq t +δ1/m 16: end f 17: end for 6 Conclusons Ths paper nvestgates a new research problem, onlne multple kernel learnng OMKL, whch ams to attack an onlne learnng task by learnng a kernel based predcton functon from a pool of predefned kernels. We consder two setups for onlne kernel learnng, onlne kernel learnng by predctons that combnes the bnary predctons from multple kernel classfers and onlne kernel learnng by outputs that combnes the realvalued outputs from kernel classfers. We proposed a framework for OMKL by learnng a combnaton of multple kernel classfers from a pool of gven kernel functons. We emphasze that OMKL s generally more challengng than typcal onlne learnng because both the kernel classfers and ther lnear combnaton are unknown. To solve ths challenge, we propose to combne two onlne learnng algorthms,.e., the Perceptron algorthm that learns a classfer for a gven kernel, and the Hedge algorthm that combnes classfers by lnear weghtng. 
Based on this idea, we present two types of algorithms for OMKL, i.e., deterministic approaches and stochastic approaches. Theoretical bounds were derived for the proposed OMKL algorithms.

Acknowledgement

The work was supported in part by National Science Foundation (IIS-0643494), Office of Naval Research (N00014-09-1-0663), and Army Research Office (W911NF-09-1-0421). Any opinions, findings, and conclusions or recommendations expressed in this

material are those of the authors and do not necessarily reflect the views of NSF, ONR, and ARO.

Appendix

Proof of Theorem 2

Proof. The proof essentially combines the proofs of the Perceptron algorithm [25] and the Hedge algorithm [9]. First, following the analysis in [9], we can easily have

$$\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i \le \frac{\ln(1/\beta)}{1-\beta} \min_{1 \le i \le m} \sum_{t=1}^{T} z_t^i + \frac{\ln m}{1-\beta}.$$

Second, due to the convexity of $\ell$ and the updating rule for $f^i$, when $z_t^i = 1$ we have

$$\ell(f_t^i(x_t), y_t) - \ell(f(x_t), y_t) \le -y_t\big\langle f_t^i - f,\ z_t^i \kappa_i(x_t, \cdot)\big\rangle_{\mathcal{H}_{\kappa_i}} = \big\langle f_t^i - f,\ f_t^i - f_{t+1}^i\big\rangle_{\mathcal{H}_{\kappa_i}} \le \frac{1}{2}\Big(\|f_t^i - f\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f\|^2_{\mathcal{H}_{\kappa_i}} + z_t^i\Big).$$

Since $\ell(f_t^i(x_t), y_t) \ge z_t^i$, we then have

$$z_t^i \le \|f_t^i - f\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f\|^2_{\mathcal{H}_{\kappa_i}} + 2\,\ell(f(x_t), y_t).$$

Taking the summation on both sides and telescoping (note that $f_1^i = 0$), we have

$$\sum_{t=1}^{T} z_t^i \le \min_{f \in \mathcal{H}_{\kappa_i}} \|f\|^2_{\mathcal{H}_{\kappa_i}} + 2\sum_{t=1}^{T} \ell(f(x_t), y_t) = g(\kappa_i, \ell).$$

Using the above inequality and noting that $M = \sum_{t=1}^{T} I\big(\sum_{i=1}^{m} q_t^i z_t^i \ge 0.5\big) \le 2\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i$, we have the result in the theorem.

Proof of Theorem 3

Proof. The proof can be constructed similarly to the proof of Theorem 2 by noting the following three differences. First, the updating rule for the weights can be written as $w_{t+1}^i = w_t^i \big(\beta^{1/2+\gamma}\big)^{\nu_t^i/(1/2+\gamma)}$, where $\nu_t^i \le 1/2+\gamma$. Second, $\sum_i q_t^i \nu_t^i \ge \sum_i q_t^i z_t^i / 2$. Third, we have $\ell(f_t^i(x_t), y_t) \ge \nu_t^i + z_t^i/2$, and therefore $\sum_{t=1}^{T} \nu_t^i \le g(\kappa_i, \ell)/2$.

Proof of Theorem 5

Proof. We denote by $M_k$ the number of mistakes made during the segment $[T_k + 1, T_{k+1}]$. It can be shown that the following inequality

$$M_k \le 2\min_{1 \le i \le m} \sum_{t=T_k+1}^{T_{k+1}} z_t^i + \ln m + 2\sqrt{\ln m}\; 2^{k/2}$$

holds for any $k = 0, \ldots, s-1$. Taking the summation over all $M_k$,

$$M = \sum_{k=0}^{s-1} M_k \le 2\sum_{k=0}^{s-1} \min_{1 \le i \le m} \sum_{t=T_k+1}^{T_{k+1}} z_t^i + 2\sqrt{\ln m}\sum_{k=0}^{s-1} 2^{k/2} + 2s\ln m,$$

we can obtain the bound in the theorem by noting that $2^{s} - 1 \le T$ and $\min_{1 \le i \le m} \sum_{t=1}^{T} z_t^i \le \min_{1 \le i \le m} g(\kappa_i, \ell)$.

Proof of Theorem 6

Proof. Similar to the proof of Theorem 2, we can prove

$$\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i\, \widetilde{m}_t^i z_t^i \le \frac{\ln(1/\beta)}{1-\beta} \min_{1 \le i \le m} \sum_{t=1}^{T} \widetilde{m}_t^i z_t^i + \frac{\ln m}{1-\beta}, \qquad \sum_{t=1}^{T} \widetilde{m}_t^i z_t^i \le g(\kappa_i, \ell).$$

Taking the expectation on both sides, and noting that $\mathrm{E}[m_t^i] = p_t^i \ge \delta/m$, we have

$$\mathrm{E}\Big[\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i\Big] \le \frac{m\ln(1/\beta)}{\delta(1-\beta)} \min_{1 \le i \le m} g(\kappa_i, \ell) + \frac{m\ln m}{\delta(1-\beta)}.$$

Since $M \le 2\sum_{t=1}^{T}\sum_{i=1}^{m} q_t^i z_t^i$, we have the result stated in the theorem.

Proof of Theorem 7

Proof. The proof can be duplicated similarly to the proof of Theorem 6, except that $p_t^i \ge \delta$, $i = 1, \ldots, m$, in this case.

Proof of Theorem 8

Proof. In the following analysis, we only consider the subset of iterations where the algorithm makes a mistake. By slightly abusing the notation, we denote by $1, 2, \ldots, M$ the trials in which the examples are misclassified by Algorithm 5. For any combination weight $u = (u_1, \ldots, u_m)$ and any kernel classification functions $f_i \in \Omega_\kappa$, $i = 1, \ldots, m$, we have

$$\ell\Big(\sum_i q_t^i f_t^i(x_t), y_t\Big) - \ell\Big(\sum_i u_i f_i(x_t), y_t\Big) \le g_t\Big(y_t \sum_i q_t^i f_t^i(x_t) - y_t \sum_i u_i f_i(x_t)\Big) = g_t y_t\Big(\sum_i q_t^i f_t^i(x_t) - \sum_i u_i f_t^i(x_t)\Big) + g_t y_t\Big(\sum_i u_i f_t^i(x_t) - \sum_i u_i f_i(x_t)\Big)$$

where $g_t = \frac{\partial}{\partial z}\max(0, 1-z)\big|_{z = y_t \sum_i q_t^i f_t^i(x_t)} = -1$ because the examples are misclassified in these trials. For the first term, following the proof of Theorem 11.3 in [4], we can have

$$g_t y_t \sum_i \big(q_t^i - u_i\big) f_t^i(x_t) \le \frac{\ln(1/\beta)\, R^2}{2} + \frac{1}{\ln(1/\beta)}\big\{\mathrm{KL}(u\|q_t) - \mathrm{KL}(u\|q_{t+1})\big\}$$

For the second term, following the analysis in the proof of Theorem 2, we can have

$$g_t y_t\Big(\sum_i u_i f_t^i(x_t) - \sum_i u_i f_i(x_t)\Big) = \sum_{i=1}^{m} u_i \big\langle f_t^i - f_i,\ g_t y_t \kappa_i(x_t, \cdot)\big\rangle_{\mathcal{H}_{\kappa_i}} \le \frac{1}{2} + \sum_{i=1}^{m} \frac{u_i}{2}\Big(\|f_t^i - f_i\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f_i\|^2_{\mathcal{H}_{\kappa_i}}\Big).$$

Combining the above results together, we arrive at

$$\sum_{t=1}^{M} \ell\Big(\sum_i q_t^i f_t^i(x_t), y_t\Big) - \ell\Big(\sum_i u_i f_i(x_t), y_t\Big) \le \frac{1}{2}\sum_{i=1}^{m} u_i \|f_i\|^2_{\mathcal{H}_{\kappa_i}} + \frac{\ln m}{\ln(1/\beta)} + \frac{1}{2}\big(\ln(1/\beta) R^2 + 1\big) M.$$

Since $\ell\big(\sum_i q_t^i f_t^i(x_t), y_t\big) \ge 1$ in these trials, we can have the result in the theorem.

Proof of Theorem 10

Proof. Similar to the proof of Theorem 8, we have the following bounds for the two terms:

$$\mathrm{E}\Big[g_t y_t \sum_i \big(q_t^i - u_i\big) f_t^i(x_t)\Big] = \frac{1}{\eta}\, \mathrm{E}\Big[g_t y_t \sum_i \big(q_t^i - u_i\big)\, \eta\, \widetilde{m}_t^i f_t^i(x_t)\Big] \le \frac{1}{\eta \ln(1/\beta)}\, \mathrm{E}\big[\mathrm{KL}(u\|q_t) - \mathrm{KL}(u\|q_{t+1})\big] + \frac{\ln(1/\beta)\, R^2 \eta\, m^2}{2\delta^2}$$

and

$$\mathrm{E}\Big[g_t y_t \sum_i u_i\big(f_t^i(x_t) - f_i(x_t)\big)\Big] = \mathrm{E}\Big[\sum_{i=1}^{m} \frac{u_i}{\eta}\big\langle f_t^i - f_i,\ \eta\, \widetilde{m}_t^i g_t y_t \kappa_i(x_t, \cdot)\big\rangle_{\mathcal{H}_{\kappa_i}}\Big] \le \frac{m\eta}{2\delta} + \mathrm{E}\Big[\sum_{i=1}^{m} \frac{u_i}{2\eta}\Big(\|f_t^i - f_i\|^2_{\mathcal{H}_{\kappa_i}} - \|f_{t+1}^i - f_i\|^2_{\mathcal{H}_{\kappa_i}}\Big)\Big].$$

Combining the above results together, and setting $\eta$ as in the theorem, we have the result in the theorem.

References

1. S. Agmon. The relaxation method for linear inequalities. CJM, 6(3):382–392, 1954.
2. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SICOMP, 32(1), 2003.
3. F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
4. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
5. O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, pages 585–592, 2002.
6. Y. Chen, M. R. Gupta, and B. Recht. Learning kernels from indefinite similarities. In ICML, pages 145–152, 2009.

7. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7, 2006.
8. K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. JMLR, 3, 2003.
9. Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1), 1997.
10. Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. ML, 37(3), 1999.
11. C. Gentile. A new approximate maximal margin classification algorithm. JMLR, 2, 2001.
12. S. C. Hoi, R. Jin, and M. R. Lyu. Learning non-parametric kernel matrices from pairwise constraints. In ICML, pages 361–368, 2007.
13. S. C. H. Hoi, M. R. Lyu, and E. Y. Chang. Learning the unified kernel machines for classification. In KDD, pages 187–196, 2006.
14. H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In ICML, pages 321–328, 2003.
15. J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. IEEE Trans. on Sig. Proc., 52(8), 2004.
16. J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. In NIPS, pages 785–792, 2001.
17. M. Kloft, U. Rückert, and P. L. Bartlett. A unifying view of multiple kernel learning. CoRR, abs/1005.0437, 2010.
18. R. I. Kondor and J. D. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML, pages 315–322, 2002.
19. B. Kulis, M. Sustik, and I. Dhillon. Learning low-rank kernel matrices. In ICML, pages 505–512, 2006.
20. G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.
21. Y. Li and P. M. Long. The relaxed online maximum margin algorithm. ML, 46(1–3), 2002.
22. N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In FOCS, 1989.
23. A. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, 1962.
24. A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL.
JMLR, 11, 2008.
25. F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 1958.
26. B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. 2001.
27. S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, 2007.
28. S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 7, 2006.
29. V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
30. V. Vovk. A game of prediction with expert advice. J. Comput. Syst. Sci., 56(2), 1998.
31. Z. Xu, R. Jin, I. King, and M. R. Lyu. An extended level method for efficient multiple kernel learning. In NIPS, pages 1825–1832, 2008.
32. X. Zhu, J. S. Kandola, Z. Ghahramani, and J. D. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In NIPS, pages 1641–1648, 2004.