On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions

Purushottam Kar
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, UP, INDIA.

Bharath K Sriperumbudur
Statistical Laboratory, Centre for Mathematical Sciences, Wilberforce Road, Cambridge, CB3 0WB, ENGLAND.

Prateek Jain
Microsoft Research India, Vigyan, #9, Lavelle Road, Bangalore, KA, INDIA.

Harish C Karnick
Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, UP, INDIA.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than those obtained by Wang et al. (2012) for the same problem. Using our decoupling technique, we are further able to obtain fast convergence rates for strongly convex pairwise loss functions. We are also able to analyze a class of memory efficient online learning algorithms for pairwise learning problems that use only a bounded subset of past training samples to update the hypothesis at each step. Finally, in order to complement our generalization bounds, we propose a novel memory efficient online learning algorithm for higher order learning problems with bounded regret guarantees.

1. Introduction

Several supervised learning problems involve working with pairwise or higher order loss functions, i.e., loss functions that depend on more than one training sample. Take for example the metric learning problem (Jin et al., 2009), where the goal is to learn a metric $M$ that brings points of a similar label together while keeping differently labeled points apart.
In this case the loss function used is a pairwise loss function $\ell(M, (x, y), (x', y')) = \phi\left(yy'\left(1 - M^2(x, x')\right)\right)$ where $\phi$ is the hinge loss function. In general, a pairwise loss function is of the form $\ell : \mathcal{H} \times \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$ where $\mathcal{H}$ is the hypothesis space and $\mathcal{X}$ is the input domain. Other examples include preference learning (Xing et al., 2002), ranking (Agarwal & Niyogi, 2009), AUC maximization (Zhao et al., 2011) and multiple kernel learning (Kumar et al., 2012).

In practice, algorithms for such problems use intersecting pairs of training samples to learn. Hence the training data pairs are not i.i.d. and consequently, standard generalization error analysis techniques do not apply to these algorithms. Recently, the analysis of batch algorithms learning from such coupled samples has received much attention (Cao et al., 2012; Clémençon et al., 2008; Brefeld & Scheffer, 2005) where a dominant idea has been to use an alternate representation of the U-statistic and provide uniform convergence bounds. Another popular approach has been to use algorithmic stability (Agarwal & Niyogi, 2009; Jin et al., 2009) to obtain algorithm-specific results.

While batch algorithms for pairwise (and higher-order) learning problems have been studied well theoretically, online learning based stochastic algorithms are more popular in practice due to their scalability. However, their generalization properties were not studied until recently. Wang et al. (2012) provided the first generalization error analysis of online learning methods applied to pairwise loss functions. In particular, they showed that such higher-order online learning methods also admit online to batch conversion bounds (similar to those for first-order problems (Cesa-Bianchi et al., 2001)) which can be combined with regret bounds to obtain generalization error bounds. However, due to their proof technique and dependence on $L_\infty$ covering numbers of function classes, their bounds are not tight and have a strong dependence on the dimensionality of the input space.

In the literature, there are several instances where Rademacher complexity based techniques achieve sharper bounds than those based on covering numbers (Kakade et al., 2008). However, the coupling of different input pairs in our problem does not allow us to use such techniques directly.

In this paper we introduce a generic technique for analyzing online learning algorithms for higher order learning problems. Our technique, which uses an extension of Rademacher complexities to higher order function classes (instead of covering numbers), allows us to give bounds that are tighter than those of Wang et al. (2012) and that, for several learning scenarios, have no dependence on input dimensionality at all. Key to our proof is a technique we call Symmetrization of Expectations which acts as a decoupling step and allows us to reduce excess risk estimates to Rademacher complexities of function classes. Wang et al. (2012), on the other hand, perform a symmetrization with probabilities which, apart from being more involved, yields suboptimal bounds.

Another advantage of our technique is that it allows us to obtain fast convergence rates for learning algorithms that use strongly convex loss functions. Our result, which uses a novel two stage proof technique, extends a similar result in the first order setting by Kakade & Tewari (2008) to the pairwise setting.

Wang et al. (2012) (and our results mentioned above) assume an online learning setup in which a stream of points $z_1, \ldots, z_n$ is observed and the penalty function used at the $t$-th step is $\hat{L}_t(h) = \frac{1}{t-1}\sum_{\tau=1}^{t-1} \ell(h, z_t, z_\tau)$. Consequently, the results of Wang et al. (2012) expect regret bounds with respect to these all-pairs penalties $\hat{L}_t$. This requires one to use/store all previously seen points which is computationally/storagewise expensive and hence in practice, learning algorithms update their hypotheses using only a bounded subset of the past samples (Zhao et al., 2011).

In the above mentioned setting, we are able to give generalization bounds that only require algorithms to give regret bounds with respect to finite-buffer penalty functions such as $\hat{L}^{\mathrm{buf}}_t(h) = \frac{1}{|B|}\sum_{z \in B} \ell(h, z_t, z)$ where $B$ is a buffer that is updated at each step. Our proofs hold for any stream oblivious buffer update policy including FIFO and the widely used reservoir sampling policy (Vitter, 1985; Zhao et al., 2011).

To complement our online to batch conversion bounds, we also provide a memory efficient online learning algorithm that works with bounded buffers. Although our algorithm is constrained to observe and learn using the finite-buffer penalties $\hat{L}^{\mathrm{buf}}_t$ alone, we are still able to provide high confidence regret bounds with respect to the all-pairs penalty functions $\hat{L}_t$. We note that Zhao et al. (2011) also propose an algorithm that uses finite buffers and claim an all-pairs regret bound for the same. However, their regret bound does not hold due to a subtle mistake in their proof.

We also provide empirical validation of our proposed online learning algorithm on AUC maximization tasks and show that our algorithm performs competitively with that of Zhao et al. (2011), in addition to being able to offer theoretical regret bounds.

Our Contributions:
(a) We provide a generic online-to-batch conversion technique for higher-order supervised learning problems offering bounds that are sharper than those of Wang et al. (2012).
(b) We obtain fast convergence rates when loss functions are strongly convex.
(c) We analyze online learning algorithms that are constrained to learn using a finite buffer.
(d) We propose a novel online learning algorithm that works with finite buffers but is able to provide a high confidence regret bound with respect to the all-pairs penalty functions.

Footnote 1. Independently, Wang et al. (2013) also extended their proof to give similar guarantees. However, their bounds hold only for the FIFO update policy and have worse dependence on dimensionality in several cases (see Section 5).

2. Problem Setup

For ease of exposition, we introduce an online learning model for higher order supervised learning problems in this section; concrete learning instances such as AUC maximization and metric learning are given in Section 6. For the sake of simplicity, we restrict ourselves to pairwise problems in this paper; our techniques can be readily extended to higher order problems as well.

For pairwise learning problems, our goal is to learn a real valued bivariate function $h : \mathcal{X} \times \mathcal{X} \to \mathcal{Y}$, where $h \in \mathcal{H}$, under some loss function $\ell : \mathcal{H} \times \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}^+$ where $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$.

The online learning algorithm is given sequential access to a stream of elements $z_1, z_2, \ldots, z_n$ chosen i.i.d. from the domain $\mathcal{Z}$. Let $Z^t := \{z_1, \ldots, z_t\}$. At each time step $t = 2, \ldots, n$, the algorithm posits a hypothesis $h_{t-1} \in \mathcal{H}$ upon which the element $z_t$ is revealed and the algorithm incurs the following penalty:

$$\hat{L}_t(h_{t-1}) = \frac{1}{t-1}\sum_{\tau=1}^{t-1} \ell(h_{t-1}, z_t, z_\tau). \qquad (1)$$

For any $h \in \mathcal{H}$, we define its expected risk as:

$$L(h) := \mathbb{E}_{z, z'}\left[\ell(h, z, z')\right]. \qquad (2)$$

Our aim is to present an ensemble $h_1, \ldots, h_{n-1}$ such that the expected risk of the ensemble is small. More specifically, we desire that, for some small $\epsilon > 0$,

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le L(h^*) + \epsilon,$$

where $h^* = \arg\min_{h \in \mathcal{H}} L(h)$ is the population risk minimizer. Note that this allows us to do hypothesis selection in a way that ensures small expected risk. Specifically, if one chooses a hypothesis as $\hat{h} := \frac{1}{n-1}\sum_{t=2}^{n} h_{t-1}$ (for convex $\ell$) or $\hat{h} := \arg\min_{t = 2, \ldots, n} L(h_{t-1})$, then we have $L(\hat{h}) \le L(h^*) + \epsilon$.

Since the model presented above requires storing all previously seen points, it becomes unusable in large scale learning scenarios. Instead, in practice, a sketch of the stream is maintained in a buffer $B$ of capacity $s$. At each step, the penalty is now incurred only on the pairs $\{(z_t, z) : z \in B_t\}$ where $B_t$ is the state of the buffer at time $t$. That is,

$$\hat{L}^{\mathrm{buf}}_t(h_{t-1}) = \frac{1}{|B_t|}\sum_{z \in B_t} \ell(h_{t-1}, z_t, z). \qquad (3)$$

We shall assume that the buffer is updated at each step using some stream oblivious policy such as FIFO or Reservoir sampling (Vitter, 1985) (see Section 5).

In Section 3, we present online-to-batch conversion bounds for online learning algorithms that give regret bounds w.r.t. penalty functions given by (1). In Section 4, we extend our analysis to algorithms using strongly convex loss functions. In Section 5 we provide generalization error bounds for algorithms that give regret bounds w.r.t. finite-buffer penalty functions given by (3). Finally in Section 7 we present a novel memory efficient online learning algorithm with regret bounds.

3. Online to Batch Conversion Bounds for Bounded Loss Functions

We now present our generalization bounds for algorithms that provide regret bounds with respect to the all-pairs loss functions (see Eq. (1)). Our results give tighter bounds and have a much better dependence on input dimensionality than the bounds given by Wang et al. (2012). See Section 3.1 for a detailed comparison.

As was noted by Wang et al. (2012), the generalization error analysis of online learning algorithms in this setting does not follow from existing techniques for first-order problems (such as (Cesa-Bianchi et al., 2001; Kakade & Tewari, 2008)). The reason is that the terms $V_t = \hat{L}_t(h_{t-1})$ do not form a martingale due to the intersection of training samples in $V_t$ and $V_\tau$, $\tau < t$.

Our technique, which aims to utilize the Rademacher complexities of function classes in order to get tighter bounds, faces yet another challenge at the symmetrization step, a precursor to the introduction of Rademacher complexities. It turns out that, due to the coupling between the "head" variable $z_t$ and the "tail" variables $z_\tau$ in the loss function $\hat{L}_t$, a standard symmetrization between true $z_\tau$ and ghost $\tilde{z}_\tau$ samples does not succeed in generating Rademacher averages and instead yields complex looking terms.

More specifically, suppose we have true variables $z_t$ and ghost variables $\tilde{z}_t$ and are in the process of bounding the expected excess risk by analyzing expressions of the form

$$E_{\mathrm{orig}} = \ell(h_{t-1}, z_t, z_\tau) - \ell(h_{t-1}, \tilde{z}_t, \tilde{z}_\tau).$$

Performing a traditional symmetrization of the variables $z_\tau$ with $\tilde{z}_\tau$ would give us expressions of the form

$$E_{\mathrm{symm}} = \ell(h_{t-1}, z_t, \tilde{z}_\tau) - \ell(h_{t-1}, \tilde{z}_t, z_\tau).$$

At this point the analysis hits a barrier since, unlike first order situations, we cannot relate $E_{\mathrm{symm}}$ to $E_{\mathrm{orig}}$ by means of introducing Rademacher variables.

We circumvent this problem by using a technique that we call Symmetrization of Expectations. The technique allows us to use standard symmetrization to obtain Rademacher complexities.
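More specifically, we analyze expressions of the form

$$E_{\mathrm{orig}} = \mathbb{E}_z\left[\ell(h_{t-1}, z, z_\tau)\right] - \mathbb{E}_z\left[\ell(h_{t-1}, z, \tilde{z}_\tau)\right]$$

which upon symmetrization yield expressions such as

$$E_{\mathrm{symm}} = \mathbb{E}_z\left[\ell(h_{t-1}, z, \tilde{z}_\tau)\right] - \mathbb{E}_z\left[\ell(h_{t-1}, z, z_\tau)\right]$$

which allow us to introduce Rademacher variables since $E_{\mathrm{symm}} = -E_{\mathrm{orig}}$. This idea is exploited by the lemma given below that relates the expected risk of the ensemble to the penalties incurred during the online learning process. In the following we use the following extension of Rademacher averages (Kakade et al., 2008) to bivariate function classes:

$$\mathcal{R}_n(\mathcal{H}) = \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{\tau=1}^{n} \epsilon_\tau h(z, z_\tau)\right]$$

where the expectation is over $\epsilon_\tau$, $z$ and $z_\tau$. We shall denote composite function classes as follows: $\ell \circ \mathcal{H} := \{(z, z') \mapsto \ell(h, z, z'),\ h \in \mathcal{H}\}$.

Lemma 1. Let $h_1, \ldots, h_{n-1}$ be an ensemble of hypotheses generated by an online learning algorithm working with a bounded loss function $\ell : \mathcal{H} \times \mathcal{Z} \times \mathcal{Z} \to [0, B]$. Then for any $\delta > 0$, we have with probability at least $1 - \delta$,

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le \frac{1}{n-1}\sum_{t=2}^{n} \hat{L}_t(h_{t-1}) + \frac{2}{n-1}\sum_{t=2}^{n} \mathcal{R}_{t-1}(\ell \circ \mathcal{H}) + 3B\sqrt{\frac{\log(n/\delta)}{n-1}}.$$

The proof of the lemma involves decomposing the excess risk term into a martingale difference sequence and a residual term in a manner similar to Wang et al. (2012). The martingale sequence, being a bounded one, is shown to converge using the Azuma-Hoeffding inequality. The residual term is handled using uniform convergence techniques involving Rademacher averages. The complete proof of the lemma is given in Appendix A.

Similar to Lemma 1, the following converse relation between the population and empirical risk of the population risk minimizer $h^*$ can also be shown.

Lemma 2. For any $\delta > 0$, we have with probability at least $1 - \delta$,

$$\frac{1}{n-1}\sum_{t=2}^{n} \hat{L}_t(h^*) \le L(h^*) + \frac{2}{n-1}\sum_{t=2}^{n} \mathcal{R}_{t-1}(\ell \circ \mathcal{H}) + 3B\sqrt{\frac{\log(1/\delta)}{n-1}}.$$

An online learning algorithm will be said to have an all-pairs regret bound $\mathfrak{R}_n$ if it presents an ensemble $h_1, \ldots, h_{n-1}$ such that

$$\sum_{t=2}^{n} \hat{L}_t(h_{t-1}) \le \inf_{h \in \mathcal{H}} \sum_{t=2}^{n} \hat{L}_t(h) + \mathfrak{R}_n.$$

Suppose we have an online learning algorithm with a regret bound $\mathfrak{R}_n$. Then combining Lemmata 1 and 2 gives us the following online to batch conversion bound:

Theorem 3. Let $h_1, \ldots, h_{n-1}$ be an ensemble of hypotheses generated by an online learning algorithm working with a $B$-bounded loss function $\ell$ that guarantees a regret bound of $\mathfrak{R}_n$. Then for any $\delta > 0$, we have with probability at least $1 - \delta$,

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le L(h^*) + \frac{4}{n-1}\sum_{t=2}^{n} \mathcal{R}_{t-1}(\ell \circ \mathcal{H}) + \frac{\mathfrak{R}_n}{n-1} + 6B\sqrt{\frac{\log(n/\delta)}{n-1}}.$$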
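The combination step behind Theorem 3 can be written out as a short chain of inequalities. The sketch below is a reading aid rather than the authors' proof: it assumes Lemma 1, Lemma 2 and the all-pairs regret bound in the normalized forms $\frac{1}{n-1}\sum_{t=2}^{n}(\cdot)$ with regret $\mathfrak{R}_n$, introduces the shorthand $\Delta_n$ (our notation) for the Rademacher term, and uses a plain union bound, so the chain as written holds with probability at least $1 - 2\delta$; restating it at level $1 - \delta$ only changes constants inside the logarithm.

```latex
% Shorthand (assumption of this sketch):
%   \Delta_n := \frac{2}{n-1}\sum_{t=2}^{n} \mathcal{R}_{t-1}(\ell \circ \mathcal{H})
\begin{align*}
\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1})
  &\le \frac{1}{n-1}\sum_{t=2}^{n} \hat{L}_t(h_{t-1}) + \Delta_n
       + 3B\sqrt{\frac{\log(n/\delta)}{n-1}}
  && \text{(Lemma 1)} \\
  &\le \frac{1}{n-1}\sum_{t=2}^{n} \hat{L}_t(h^*) + \frac{\mathfrak{R}_n}{n-1} + \Delta_n
       + 3B\sqrt{\frac{\log(n/\delta)}{n-1}}
  && \text{(regret bound, } \inf_{h} \le h^* \text{)} \\
  &\le L(h^*) + 2\Delta_n + \frac{\mathfrak{R}_n}{n-1}
       + 3B\sqrt{\frac{\log(n/\delta)}{n-1}} + 3B\sqrt{\frac{\log(1/\delta)}{n-1}}
  && \text{(Lemma 2)} \\
  &\le L(h^*) + \frac{4}{n-1}\sum_{t=2}^{n} \mathcal{R}_{t-1}(\ell \circ \mathcal{H})
       + \frac{\mathfrak{R}_n}{n-1} + 6B\sqrt{\frac{\log(n/\delta)}{n-1}}
  && \text{(} \log(1/\delta) \le \log(n/\delta) \text{)}
\end{align*}
```

The factor 4 in front of the Rademacher sum is simply the two factors of 2 contributed by the two lemmata, and the $6B$ term collects their two $3B$ deviation terms.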
As we shall see in Section 6, for several learning problems, the Rademacher complexities behave as $\mathcal{R}_{t-1}(\ell \circ \mathcal{H}) \le C_d \cdot O\left(\frac{1}{\sqrt{t-1}}\right)$ where $C_d$ is a constant dependent only on the dimension $d$ of the input space and the $O(\cdot)$ notation hides constants dependent on the domain size and the loss function. This allows us to bound the excess risk as follows:

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le L(h^*) + \frac{\mathfrak{R}_n}{n-1} + O\left(\frac{C_d}{\sqrt{n-1}} + \sqrt{\frac{\log(n/\delta)}{n-1}}\right).$$

Here, the error decreases with $n$ at a standard $1/\sqrt{n}$ rate (up to a $\sqrt{\log n}$ factor), similar to that obtained by Wang et al. (2012). However, for several problems the above bound can be significantly tighter than those offered by covering number based arguments. We provide below a detailed comparison of our results with those of Wang et al. (2012).

3.1. Discussion on the nature of our bounds

As mentioned above, our proof enables us to use Rademacher complexities which are typically easier to analyze and provide tighter bounds (Kakade et al., 2008). In particular, as shown in Section 6, for $L_2$ regularized learning formulations, the Rademacher complexities are dimension independent, i.e. $C_d = 1$. Consequently, unlike the bounds of Wang et al. (2012) that have a linear dependence on $d$, our bound becomes independent of the input space dimension. For sparse learning formulations with $L_1$ or trace norm regularization, we have $C_d = \sqrt{\log d}$, giving us a mild dependence on the input dimensionality.

Our bounds are also tighter than those of Wang et al. (2012) in general. Whereas we provide a confidence bound of $\delta < \exp\left(-n\epsilon^2 + \log n\right)$, Wang et al. (2012) offer a weaker bound $\delta < (1/\epsilon)^d \exp\left(-n\epsilon^2 + \log n\right)$.

An artifact of the proof technique of Wang et al. (2012) is that their proof is required to exclude a constant fraction of the ensemble $(h_1, \ldots, h_{cn})$ from the analysis, failing which their bounds turn vacuous. Our proof on the other hand is able to give guarantees for the entire ensemble.

In addition to this, as the following sections show, our proof technique enjoys the flexibility of being extendable to give fast convergence guarantees for strongly convex loss functions as well as being able to accommodate learning algorithms that use finite buffers.

4. Fast Convergence Rates for Strongly Convex Loss Functions

In this section we extend results of the previous section to give fast convergence guarantees for online learning algorithms that use strongly convex loss functions of the following form: $\ell(h, z, z') = g(\langle h, \phi(z, z') \rangle) + r(h)$, where $g$ is a convex function and $r(h)$ is a $\sigma$-strongly convex regularizer (see Section 6 for examples), i.e. for all $h_1, h_2 \in \mathcal{H}$ and $\alpha \in [0, 1]$, we have

$$r(\alpha h_1 + (1 - \alpha) h_2) \le \alpha r(h_1) + (1 - \alpha) r(h_2) - \frac{\sigma}{2}\alpha(1 - \alpha)\|h_1 - h_2\|^2.$$

For any norm $\|\cdot\|$, let $\|\cdot\|_*$ denote its dual norm. Our analysis reduces the pairwise problem to a first order problem and a martingale convergence problem. We require the following fast convergence bound in the standard first order batch learning setting:

Theorem 4. Let $\mathcal{F}$ be a closed and convex set of functions over $\mathcal{X}$. Let $\wp(f, x) = p(\langle f, \phi(x) \rangle) + r(f)$, for a $\sigma$-strongly convex function $r$, be a loss function with $P$ and $\hat{P}$ as the associated population and empirical risk functionals and $f^*$ as the population risk minimizer. Suppose $\wp$ is $L$-Lipschitz and $\|\phi(x)\|_* \le R$ for all $x \in \mathcal{X}$. Then w.p. $1 - \delta$, for any $\epsilon > 0$, we have for all $f \in \mathcal{F}$,

$$P(f) - P(f^*) \le (1 + \epsilon)\left(\hat{P}(f) - \hat{P}(f^*)\right) + \frac{C_\delta}{\epsilon \sigma n}$$

where $C_\delta = C_d^2 \cdot (4(1 + \epsilon)LR)^2 (32 + \log(1/\delta))$ and $C_d$ is the dependence of the Rademacher complexity of the class $\mathcal{F}$ on the input dimensionality $d$.

The above theorem is a minor modification of a similar result by Sridharan et al. (2008) and the proof (given in Appendix B) closely follows their proof as well. We can now state our online to batch conversion result for strongly convex loss functions.

Theorem 5. Let $h_1, \ldots, h_{n-1}$ be an ensemble of hypotheses generated by an online learning algorithm working with a $B$-bounded, $L$-Lipschitz and $\sigma$-strongly convex loss function $\ell$. Further suppose the learning algorithm guarantees a regret bound of $\mathfrak{R}_n$. Let $V_n = \max\left\{\mathfrak{R}_n, 2C_d^2 \log n \log(n/\delta)\right\}$. Then for any $\delta > 0$, we have with probability at least $1 - \delta$,

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le L(h^*) + \frac{\mathfrak{R}_n}{n-1} + C_d \cdot O\left(\frac{\sqrt{V_n \log n \log(n/\delta)}}{n-1}\right),$$

where the $O(\cdot)$ notation hides constants dependent on domain size and the loss function such as $L$, $B$ and $\sigma$.

The decomposition of the excess risk in this case is not made explicitly but rather emerges as a side-effect of the proof progression. The proof starts off by applying Theorem 4 to the hypothesis in each round with the following loss function $\wp(h, z') := \mathbb{E}_z\left[\ell(h, z, z')\right]$. Applying the regret bound to the resulting expression gives us a martingale difference sequence which we then bound using Bernstein-style inequalities and a proof technique from Kakade & Tewari (2008). The complete proof is given in Appendix C.

We now note some properties of this result. The effective dependence of the above bound on the input dimensionality is $C_d^2$ since the expression $\sqrt{V_n}$ hides a $C_d$ term. We have $C_d^2 = 1$ for non sparse learning formulations and $C_d^2 = \log d$ for sparse learning formulations. We note that our bound matches that of Kakade & Tewari (2008) (for first-order learning problems) up to a logarithmic factor.

5. Analyzing Online Learning Algorithms that use Finite Buffers

In this section, we present our online to batch conversion bounds for algorithms that work with finite-buffer loss functions $\hat{L}^{\mathrm{buf}}_t$. Recall that an online learning algorithm working with finite buffers incurs a loss $\hat{L}^{\mathrm{buf}}_t(h_{t-1}) = \frac{1}{|B_t|}\sum_{z \in B_t} \ell(h_{t-1}, z_t, z)$ at each step where $B_t$ is the state of the buffer at time $t$.

An online learning algorithm will be said to have a finite-buffer regret bound $\mathfrak{R}^{\mathrm{buf}}_n$ if it presents an ensemble $h_1, \ldots, h_{n-1}$ such that

$$\sum_{t=2}^{n} \hat{L}^{\mathrm{buf}}_t(h_{t-1}) - \inf_{h \in \mathcal{H}} \sum_{t=2}^{n} \hat{L}^{\mathrm{buf}}_t(h) \le \mathfrak{R}^{\mathrm{buf}}_n.$$

For our guarantees to hold, we require the buffer update policy used by the learning algorithm to be stream oblivious.
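Section 2 names FIFO and reservoir sampling as stream oblivious buffer update policies. As a minimal sketch of what that property looks like in code, the two update rules below touch only the buffer length, the stream index `t` and their own random bits, never the value of the incoming point. The function names (`fifo_update`, `reservoir_update`, `retained_indices`) are illustrative and not from the paper, and the reservoir rule shown is the textbook one (keep the `t`-th point with probability `s/t`, evicting a uniformly chosen resident), assumed here to stand in for the RS policy.

```python
import random
from collections import deque


def fifo_update(buffer, point, capacity):
    """FIFO policy: keep only the `capacity` most recent points (buffer is a deque)."""
    buffer.append(point)
    if len(buffer) > capacity:
        buffer.popleft()
    return buffer


def reservoir_update(buffer, point, capacity, t, rng):
    """Textbook reservoir rule; `t` is the 1-based stream index of `point`."""
    if len(buffer) < capacity:
        buffer.append(point)
    else:
        j = rng.randrange(t)      # uniform over {0, ..., t-1}
        if j < capacity:          # happens with probability capacity / t
            buffer[j] = point
    return buffer


def retained_indices(stream, capacity, seed):
    """Run reservoir sampling and report WHICH stream indices survive."""
    rng = random.Random(seed)
    buffer = []
    for t, z in enumerate(stream, start=1):
        reservoir_update(buffer, (t, z), capacity, t, rng)
    return sorted(index for index, _ in buffer)
```

With a fixed seed, `retained_indices` returns the same index set for any two streams of equal length, which is the decoupling of buffer randomness from sample randomness that the analysis relies on.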
More specifically, we require the buffer update rule to decide upon the inclusion of a particular point $z_i$ in the buffer based only on its stream index $i \in [n]$. Popular examples of stream oblivious policies include Reservoir sampling (Vitter, 1985) (referred to as RS henceforth) and FIFO. Stream oblivious policies allow us to decouple buffer construction randomness from training sample randomness which makes analysis easier; we leave the analysis of stream aware buffer update policies as a topic of future research.

In the above mentioned setting, we can prove the following online to batch conversion bounds:

Theorem 6. Let $h_1, \ldots, h_{n-1}$ be an ensemble of hypotheses generated by an online learning algorithm working with a finite buffer of capacity $s$ and a $B$-bounded loss function $\ell$. Moreover, suppose that the algorithm guarantees a regret bound of $\mathfrak{R}^{\mathrm{buf}}_n$. Then for any $\delta > 0$, we have with probability at least $1 - \delta$,

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le L(h^*) + \frac{\mathfrak{R}^{\mathrm{buf}}_n}{n-1} + O\left(\frac{C_d}{\sqrt{s}} + B\sqrt{\frac{\log(n/\delta)}{s}}\right).$$

If the loss function is Lipschitz and strongly convex as well, then with the same confidence, we have

$$\frac{1}{n-1}\sum_{t=2}^{n} L(h_{t-1}) \le L(h^*) + \frac{\mathfrak{R}^{\mathrm{buf}}_n}{n-1} + C_d \cdot O\left(\sqrt{\frac{W_n \log(n/\delta)}{sn}}\right)$$

where $W_n = \max\left\{\mathfrak{R}^{\mathrm{buf}}_n, \frac{2C_d^2\, n \log(n/\delta)}{s}\right\}$ and $C_d$ is the dependence of $\mathcal{R}_n(\mathcal{H})$ on the input dimensionality $d$.

The above bound guarantees an excess error of $\tilde{O}(1/s)$ for algorithms (such as Follow-the-leader (Hazan et al., 2006)) that offer logarithmic regret $\mathfrak{R}^{\mathrm{buf}}_n = O(\log n)$. We stress that this theorem is not a direct corollary of our results for the infinite buffer case (Theorems 3 and 5). Instead, our proofs require a more careful analysis of the excess risk in order to accommodate the finiteness of the buffer and the randomness (possibly) used in constructing it.

More specifically, care needs to be taken to handle randomized buffer update policies such as RS which introduce additional randomness into the analysis. A naive application of techniques used to prove results for the unbounded buffer case would result in bounds that give non-trivial generalization guarantees only for large buffer sizes such as $s = \omega(\sqrt{n})$. Our bounds, on the other hand, only require $s = \tilde{\omega}(1)$.

Key to our proofs is a conditioning step where we first analyze the conditional excess risk by conditioning upon randomness used by the buffer update policy. Such conditioning is made possible by the stream-oblivious nature of the update policy and thus, stream-obliviousness is required by our analysis. Subsequently, we analyze the excess risk by taking expectations over randomness used by the buffer update policy. The complete proofs of both parts of Theorem 6 are given in Appendix D.

Note that the above results only require an online learning algorithm to provide regret bounds w.r.t. the finite-buffer penalties $\hat{L}^{\mathrm{buf}}_t$ and do not require any regret bounds w.r.t. the all-pairs penalties $\hat{L}_t$.

For instance, the finite buffer based online learning algorithms $\mathrm{OAM}_{\mathrm{seq}}$ and $\mathrm{OAM}_{\mathrm{gra}}$ proposed in (Zhao et al., 2011) are able to provide a regret bound w.r.t. $\hat{L}^{\mathrm{buf}}_t$ (Zhao et al., 2011, Lemma 2) but are not able to do so w.r.t. the all-pairs loss function (see Section 7 for a discussion). Using Theorem 6, we are able to give a generalization bound for $\mathrm{OAM}_{\mathrm{seq}}$ and $\mathrm{OAM}_{\mathrm{gra}}$ and hence explain the good empirical performance of these algorithms as reported in (Zhao et al., 2011). Note that Wang et al. (2013) are not able to analyze $\mathrm{OAM}_{\mathrm{seq}}$ and $\mathrm{OAM}_{\mathrm{gra}}$ since their analysis is restricted to algorithms that use the (deterministic) FIFO update policy whereas $\mathrm{OAM}_{\mathrm{seq}}$ and $\mathrm{OAM}_{\mathrm{gra}}$ use the (randomized) RS policy of Vitter (1985).

6. Applications

In this section we make explicit our online to batch conversion bounds for several learning scenarios and also demonstrate their dependence on input dimensionality by calculating their respective Rademacher complexities. Recall that our definition of Rademacher complexity for a pairwise function class is given by

$$\mathcal{R}_n(\mathcal{H}) = \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{\tau=1}^{n} \epsilon_\tau h(z, z_\tau)\right].$$

For our purposes, we would be interested in the Rademacher complexities of composition classes of the form $\ell \circ \mathcal{H} := \{(z, z') \mapsto \ell(h, z, z'),\ h \in \mathcal{H}\}$ where $\ell$ is some Lipschitz loss function. Frequently we have $\ell(h, z, z') = \phi\left(h(x, x')\,Y(y, y')\right)$ where $Y(y, y') = y - y'$ or $Y(y, y') = yy'$ and $\phi : \mathbb{R} \to \mathbb{R}$ is some margin loss function (Steinwart & Christmann, 2008). Suppose $\phi$ is $L$-Lipschitz and $Y = \sup_{y, y' \in \mathcal{Y}} |Y(y, y')|$. Then we have

Theorem 7. $\mathcal{R}_n(\ell \circ \mathcal{H}) \le LY\, \mathcal{R}_n(\mathcal{H})$.
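To make the pairwise Rademacher complexity $\mathcal{R}_n(\mathcal{H})$ recalled above more concrete, here is a hedged Monte Carlo sketch for one illustrative class: linear pairwise hypotheses $h_w(x, x') = w^\top x - w^\top x'$ with $\|w\|_2 \le W$, and inputs drawn uniformly from a sphere of radius $X$. Both the class and the input distribution are assumptions of this sketch, as is the function name. For this class the supremum has a closed form, $\sup_w \frac{1}{n}\sum_\tau \epsilon_\tau h_w(x, x_\tau) = \frac{W}{n}\left\|\sum_\tau \epsilon_\tau (x - x_\tau)\right\|_2$, so no optimization is needed. The comparison value $2XW/\sqrt{n}$ is our reading of the $L_p$ bound from the AUC maximization example with $p = q = 2$.

```python
import numpy as np


def pairwise_rademacher_l2(n, d, w_norm=1.0, x_norm=1.0, trials=2000, seed=0):
    """Monte Carlo estimate of R_n(H) for h_w(x, x') = w.x - w.x', ||w||_2 <= w_norm.

    Assumes (for illustration only) inputs uniform on the sphere of radius x_norm.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        # One "head" point x and n "tail" points x_1..x_n.
        pts = rng.normal(size=(n + 1, d))
        pts *= x_norm / np.linalg.norm(pts, axis=1, keepdims=True)
        head, tails = pts[0], pts[1:]
        eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        # Closed-form supremum over the L2 ball of radius w_norm.
        v = (eps[:, None] * (head - tails)).sum(axis=0)
        total += w_norm * np.linalg.norm(v) / n
    return total / trials


if __name__ == "__main__":
    for n in (50, 200, 800):
        est = pairwise_rademacher_l2(n=n, d=5, trials=500)
        print(n, est, 2.0 / np.sqrt(n))
```

In this setup the estimate shrinks roughly like $1/\sqrt{n}$ and does not depend on `d`, which is the dimension independence the text attributes to $L_2$ regularized classes.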
The proof of Theorem 7 uses standard contraction inequalities and is given in Appendix E. This reduces our task to computing the values of $\mathcal{R}_n(\mathcal{H})$ which we do using a two stage proof technique (see Appendix F). For any subset $\mathcal{X}$ of a Banach space and any norm $\|\cdot\|_p$, we define $\|\mathcal{X}\|_p := \sup_{x \in \mathcal{X}} \|x\|_p$. Let the domain $\mathcal{X} \subset \mathbb{R}^d$.

AUC maximization (Zhao et al., 2011): the goal here is to maximize the area under the ROC curve for a linear classification problem where the hypothesis space $\mathcal{W} \subset \mathbb{R}^d$. We have $h_w(x, x') = w^\top x - w^\top x'$ and $\ell(h_w, z, z') = \phi\left((y - y')\,h_w(x, x')\right)$ where $\phi$ is the hinge loss. In case our classifiers are $L_p$ regularized for $p > 1$, we can show that $\mathcal{R}_n(\mathcal{W}) \le 2\|\mathcal{X}\|_q \|\mathcal{W}\|_p \sqrt{\frac{q-1}{n}}$ where $q = p/(p-1)$. Using the sparsity promoting $L_1$ regularizer gives us $\mathcal{R}_n(\mathcal{W}) \le 2\|\mathcal{X}\|_\infty \|\mathcal{W}\|_1 \sqrt{\frac{e \log d}{n}}$. Note that we obtain dimension independence, for example when the classifiers are $L_2$ regularized, which allows us to bound the Rademacher complexities of kernelized function classes for bounded kernels as well.

Metric learning (Jin et al., 2009): the goal here is to learn a Mahalanobis metric $M_W(x, x') = (x - x')^\top W (x - x')$ using the loss function $\ell(W, z, z') = \phi\left(yy'\left(1 - M_W^2(x, x')\right)\right)$ for a hypothesis class $\mathcal{W} \subset \mathbb{R}^{d \times d}$. In this case it is possible to use a variety of mixed norm $\|\cdot\|_{p,q}$ and Schatten norm $\|\cdot\|_{S(p)}$ regularizations on matrices in the hypothesis class. In case we use trace norm regularization on the matrix class, we get $\mathcal{R}_n(\mathcal{W}) \le \|\mathcal{X}\|_2^2 \|\mathcal{W}\|_{S(1)} \sqrt{\frac{e \log d}{n}}$. The $(2, 2)$-norm regularization offers a dimension independent bound $\mathcal{R}_n(\mathcal{W}) \le \|\mathcal{X}\|_2^2 \|\mathcal{W}\|_{2,2} \sqrt{\frac{1}{n}}$. The mixed $(2, 1)$-norm regularization offers $\mathcal{R}_n(\mathcal{W}) \le \|\mathcal{X}\|_2 \|\mathcal{X}\|_\infty \|\mathcal{W}\|_{2,1} \sqrt{\frac{e \log d}{n}}$.

Multiple kernel learning (Kumar et al., 2012): the goal here is to improve the SVM classification algorithm by learning a good kernel $K$ that is a positive combination of base kernels $K_1, \ldots, K_p$, i.e. $K_\mu(x, x') = \sum_{i=1}^{p} \mu_i K_i(x, x')$ for some $\mu \in \mathbb{R}^p$, $\mu \ge 0$. The base kernels are bounded, i.e. for all $i$, $|K_i(x, x')| \le \kappa^2$ for all $x, x' \in \mathcal{X}$. The notion of goodness used here is the one proposed by Balcan & Blum (2006) and involves using the loss function $\ell(\mu, z, z') = \phi\left(yy' K_\mu(x, x')\right)$ where $\phi(\cdot)$ is a margin loss function meant to encode some notion of alignment. The two hypothesis classes for the combination vector $\mu$ that we study are the $L_1$ regularized unit simplex $\Delta(1) = \{\mu : \|\mu\|_1 = 1, \mu \ge 0\}$ and the $L_2$ regularized unit sphere $S_2(1) = \{\mu : \|\mu\|_2 = 1, \mu \ge 0\}$. We are able to show the following Rademacher complexity bounds for these classes: $\mathcal{R}_n(S_2(1)) \le \kappa^2 \sqrt{\frac{p}{n}}$ and $\mathcal{R}_n(\Delta(1)) \le \kappa^2 \sqrt{\frac{e \log p}{n}}$.

The details of the Rademacher complexity derivations for these problems and other examples such as similarity learning can be found in Appendix F.

7. OLP: Online Learning with Pairwise Loss Functions

In this section, we present an online learning algorithm for learning with pairwise loss functions in a finite buffer setting. The key contribution in this section is a buffer update policy that, when combined with a variant of the GIGA algorithm (Zinkevich, 2003), allows us to give high probability regret bounds.

In previous work, Zhao et al. (2011) presented an online learning algorithm that uses finite buffers with the RS policy and proposed an all-pairs regret bound. The RS policy ensures, over the randomness used in buffer updates, that at any given time, the buffer contains a uniform sample from the preceding stream. Using this property, (Zhao et al., 2011, Lemma 2) claimed that $\mathbb{E}\left[\hat{L}^{\mathrm{buf}}_t(h_{t-1})\right] = \hat{L}_t(h_{t-1})$ where the expectation is taken over the randomness used in buffer construction. However, a property such as $\mathbb{E}\left[\hat{L}^{\mathrm{buf}}_t(h)\right] = \hat{L}_t(h)$ holds only for functions $h$ that are either fixed or obtained independently of the random variables used in buffer updates (over which the expectation is taken). Since $h_{t-1}$ is learned from points in the buffer itself, the above property, and consequently the regret bound, does not hold.

We remedy this issue by showing a relatively weaker claim; we show that with high probability we have $\hat{L}_t(h_{t-1}) \le \hat{L}^{\mathrm{buf}}_t(h_{t-1}) + \epsilon$.

Algorithm 1 RS-x: Stream Subsampling with Replacement
Input: Buffer $B$, new point $z_t$, buffer size $s$, timestep $t$.
    if $|B| < s$ then  // There is space
        $B \leftarrow B \cup \{z_t\}$
    else  // Overflow situation
        if $t = s + 1$ then  // Repopulation step
            TMP $\leftarrow B \cup \{z_t\}$
            Repopulate $B$ with $s$ points sampled uniformly with replacement from TMP.
        else  // Normal update step
            Independently, replace each point of $B$ with $z_t$ with probability $1/t$.
        end if
    end if

Algorithm 2 OLP: Online Learning with Pairwise Loss Functions
Input: Step length scale $\eta$, buffer size $s$
Output: An ensemble $w_2, \ldots, w_n \in \mathcal{W}$ with low regret
    $w_0 \leftarrow 0$, $B \leftarrow \emptyset$
    for $t = 1$ to $n$ do
        Obtain a training point $z_t$
        Set step length $\eta_t \leftarrow \eta / \sqrt{t}$
        $w_t \leftarrow \Pi_{\mathcal{W}}\left[w_{t-1} + \frac{\eta_t}{|B|}\sum_{z \in B} \nabla_w \ell(w_{t-1}, z_t, z)\right]$  // $\Pi_{\mathcal{W}}$ projects onto the set $\mathcal{W}$
        $B \leftarrow$ Update-buffer($B$, $z_t$, $s$, $t$)  // using RS-x
    end for
    return $w_2, \ldots, w_n$
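The two listings (Algorithm 1, RS-x, and Algorithm 2, OLP) can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes the AUC-style loss $\phi\left((y - y')(w^\top x - w^\top x')\right)$ with $\phi$ the hinge loss, takes $\mathcal{W}$ to be an $L_2$ ball of radius `radius`, skips the update while the buffer is empty, and steps against the subgradient so that the loss decreases (the sign convention is an assumption of this sketch). All function and parameter names are illustrative.

```python
import numpy as np


def rs_x_update(buffer, z_t, s, t, rng):
    """RS-x buffer update: fill, repopulate once at t = s + 1, then replace w.p. 1/t."""
    if len(buffer) < s:                      # there is space
        buffer.append(z_t)
    elif t == s + 1:                         # repopulation step
        tmp = buffer + [z_t]
        buffer[:] = [tmp[rng.integers(len(tmp))] for _ in range(s)]
    else:                                    # normal update step
        for i in range(s):
            if rng.random() < 1.0 / t:
                buffer[i] = z_t
    return buffer


def hinge_pair_subgrad(w, z_t, z):
    """Subgradient in w of max(0, 1 - (y_t - y) * w.(x_t - x))  (assumed loss)."""
    (x_t, y_t), (x, y) = z_t, z
    if (y_t - y) * w.dot(x_t - x) < 1.0:
        return -(y_t - y) * (x_t - x)
    return np.zeros_like(w)


def project_l2(w, radius):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def olp(stream, dim, s, eta=1.0, radius=1.0, seed=0):
    """Projected subgradient steps on the finite-buffer penalty; returns every iterate."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    buffer, ensemble = [], []
    for t, z_t in enumerate(stream, start=1):
        if buffer:                           # no pairs to learn from at t = 1
            eta_t = eta / np.sqrt(t)
            grad = sum(hinge_pair_subgrad(w, z_t, z) for z in buffer) / len(buffer)
            w = project_l2(w - eta_t * grad, radius)
        rs_x_update(buffer, z_t, s, t, rng)  # buffer sees z_t only after the step
        ensemble.append(w.copy())
    return ensemble
```

Each stream element is a pair `(x, y)` with `x` a NumPy vector and `y` a label in `{-1, +1}`. The listing's output $w_2, \ldots, w_n$ corresponds to `ensemble[1:]` here, since the sketch records every iterate.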
At a high level, this claim is similar to showing uniform convergence bounds for $\hat{L}^{\mathrm{buf}}_t$. However, the reservoir sampling algorithm is not particularly well suited to prove such uniform convergence bounds as it essentially performs sampling without replacement (see Appendix G for a discussion). We overcome this hurdle by proposing a new buffer update policy RS-x (see Algorithm 1) that, at each time step, guarantees $s$ i.i.d. samples from the preceding stream (see Appendix H for a proof).

Our algorithm uses this buffer update policy in conjunction with an online learning algorithm OLP (see Algorithm 2) that is a variant of the well-known GIGA algorithm (Zinkevich, 2003). We provide the following all-pairs regret guarantee for our algorithm:

Theorem 8. Suppose the OLP algorithm working with an $s$-sized buffer generates an ensemble $w_1, \ldots, w_{n-1}$. Then with probability at least $1 - \delta$,

$$\frac{\mathfrak{R}_n}{n-1} \le O\left(C_d\sqrt{\frac{\log(n/\delta)}{s}} + \sqrt{\frac{1}{n-1}}\right).$$

See Appendix I for the proof. A drawback of our bound is that it offers sublinear regret only for buffer sizes $s = \omega(\log n)$. A better regret bound for constant $s$ or a lower-bound on the regret is an open problem.

8. Experimental Evaluation

In this section we present experimental evaluation of our proposed OLP algorithm. We stress that the aim of this evaluation is to show that our algorithm, which enjoys high confidence regret bounds, also performs competitively in practice with respect to the $\mathrm{OAM}_{\mathrm{gra}}$ algorithm proposed by Zhao et al. (2011), since our results in Section 5 show that $\mathrm{OAM}_{\mathrm{gra}}$ does enjoy good generalization guarantees despite the lack of an all-pairs regret bound.

In our experiments, we adapted the OLP algorithm to the AUC maximization problem and compared it with $\mathrm{OAM}_{\mathrm{gra}}$ on 8 different benchmark datasets. We used 60% of the available data points, up to a fixed maximum number of points, to train both algorithms. We refer the reader to Appendix J for a discussion on the implementation of the RS-x algorithm. Figure 1 presents the results of our experiments on 4 datasets across 5 random training/test splits. Results on other datasets can be found in Appendix K. The results demonstrate that OLP performs competitively to $\mathrm{OAM}_{\mathrm{gra}}$ while in some cases having slightly better performance for small buffer sizes.

Figure 1. Performance of OLP (using RS-x) and $\mathrm{OAM}_{\mathrm{gra}}$ (using RS) by (Zhao et al., 2011) on AUC maximization tasks with varying buffer sizes. Panels: (a) Sonar, (b) Segment, (c) IJCNN, (d) Covertype. Each panel plots average AUC value against buffer size for Vitter's RS policy and the RS-x policy.

9. Conclusion

In this paper we studied the generalization capabilities of online learning algorithms for pairwise loss functions from several different perspectives. Using the method of Symmetrization of Expectations, we first provided sharp online to batch conversion bounds for algorithms that offer all-pairs regret bounds. Our results for bounded and strongly convex loss functions closely match their first order counterparts. We also extended our analysis to algorithms that are only able to provide finite-buffer regret bounds using which we were able to explain the good empirical performance of some existing algorithms. Finally we presented a new memory-efficient online learning algorithm that is able to provide all-pairs regret bounds in addition to performing well empirically.

Several interesting directions can be pursued for future work, foremost being the development of online learning algorithms that can guarantee sub-linear regret at constant buffer sizes or else a regret lower bound for finite buffer algorithms. Secondly, the idea of a stream-aware buffer update policy is especially interesting both from an empirical as well as theoretical point of view and would possibly require novel proof techniques for its analysis. Lastly, scalability issues that arise when working with higher order loss functions also pose an interesting challenge.

Acknowledgment

The authors thank the anonymous referees for comments that improved the presentation of the paper. PK is supported by the Microsoft Corporation and Microsoft Research India under a Microsoft Research India Ph.D. fellowship award.

References

Agarwal, Shivani and Niyogi, Partha. Generalization Bounds for Ranking Algorithms via Algorithmic Stability. JMLR, 10:441–474, 2009.

Balcan, Maria-Florina and Blum, Avrim. On a Theory of Learning with Similarity Functions. In ICML, 2006.

Bellet, Aurélien, Habrard, Amaury, and Sebban, Marc. Similarity Learning for Provably Accurate Sparse Linear Classification. In ICML, 2012.

Brefeld, Ulf and Scheffer, Tobias. AUC Maximizing Support Vector Learning. In ICML workshop on ROC Analysis in Machine Learning, 2005.

Cao, Qiong, Guo, Zheng-Chu, and Ying, Yiming. Generalization Bounds for Metric and Similarity Learning, 2012. arXiv preprint.

Cesa-Bianchi, Nicoló and Gentile, Claudio. Improved Risk Tail Bounds for On-Line Algorithms. IEEE Trans. on Inf. Theory, 54(1), 2008.

Cesa-Bianchi, Nicoló, Conconi, Alex, and Gentile, Claudio. On the Generalization Ability of On-Line Learning Algorithms. In NIPS, 2001.

Clémençon, Stéphan, Lugosi, Gábor, and Vayatis, Nicolas. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36, 2008.

Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Generalization Bounds for Learning Kernels. In ICML, 2010a.

Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Two-Stage Learning Kernel Algorithms. In ICML, 2010b.

Cristianini, Nello, Shawe-Taylor, John, Elisseeff, André, and Kandola, Jaz S. On Kernel-Target Alignment. In NIPS, 2001.

Freedman, David A. On Tail Probabilities for Martingales. Annals of Probability, 3(1):100–118, 1975.

Hazan, Elad, Kalai, Adam, Kale, Satyen, and Agarwal, Amit. Logarithmic Regret Algorithms for Online Convex Optimization. In COLT, 2006.

Jin, Rong, Wang, Shijun, and Zhou, Yang. Regularized Distance Metric Learning: Theory and Algorithm. In NIPS, 2009.

Kakade, Sham M. and Tewari, Ambuj. On the Generalization Ability of Online Strongly Convex Programming Algorithms. In NIPS, 2008.

Kakade, Sham M., Sridharan, Karthik, and Tewari, Ambuj. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In NIPS, 2008.

Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. Regularization Techniques for Learning with Matrices. JMLR, 13, 2012.

Kumar, Abhishek, Niculescu-Mizil, Alexandru, Kavukcuoglu, Koray, and III, Hal Daumé. A Binary Classification Framework for Two-Stage Multiple Kernel Learning. In ICML, 2012.

Ledoux, Michel and Talagrand, Michel. Probability in Banach Spaces: Isoperimetry and Processes. Springer.

Sridharan, Karthik, Shalev-Shwartz, Shai, and Srebro, Nathan. Fast Rates for Regularized Objectives. In NIPS, 2008.

Steinwart, Ingo and Christmann, Andreas. Support Vector Machines. Information Science and Statistics. Springer, 2008.

Vitter, Jeffrey Scott. Random Sampling with a Reservoir. ACM Trans. on Math. Soft., 11(1):37–57, 1985.

Wang, Yuyang, Khardon, Roni, Pechyony, Dmitry, and Jones, Rosie. Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions. JMLR - Proceedings Track, 23, 2012.

Wang, Yuyang, Khardon, Roni, Pechyony, Dmitry, and Jones, Rosie. Online Learning with Pairwise Loss Functions, 2013. arXiv preprint.

Xing, Eric P., Ng, Andrew Y., Jordan, Michael I., and Russell, Stuart J. Distance Metric Learning with Application to Clustering with Side-Information. In NIPS, 2002.

Zhao, Peilin, Hoi, Steven C. H., Jin, Rong, and Yang, Tianbao. Online AUC Maximization. In ICML, 2011.

Zinkevich, Martin. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML, 2003.