Boosting as a Regularized Path to a Maximum Margin Classifier
Journal of Machine Learning Research 5 (2004). Submitted 5/03; Revised 10/03; Published 8/04.

Saharon Rosset, Data Analytics Research Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA ([email protected])
Ji Zhu, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA ([email protected])
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA 94305, USA ([email protected])

Editor: Robert Schapire

(c) 2004 Saharon Rosset, Ji Zhu and Trevor Hastie.

Abstract

In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l1 constraint on the coefficient vector. This helps understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed, or equivalently as the boosting iterations proceed, the solution converges (in the separable case) to an "l1-optimal" separating hyper-plane. We prove that this l1-optimal separating hyper-plane has the property of maximizing the minimal l1-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, using a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.

Keywords: boosting, regularized optimization, support vector machines, margin maximization

1. Introduction and Outline

Boosting is a method for iteratively building an additive model

    F_T(x) = Σ_{t=1}^T α_t h_{j_t}(x),    (1)

where h_{j_t} ∈ H, a large (but we will assume finite) dictionary of candidate predictors or "weak learners", and h_{j_t} is the basis function selected as the "best candidate" to modify the function at stage t. The model F_T can equivalently be represented by assigning a coefficient to each dictionary
function h_j ∈ H rather than to the selected h_{j_t}'s only:

    F_T(x) = Σ_{j=1}^J h_j(x) β_j^(T),    (2)

where J = |H| and β_j^(T) = Σ_{t: j_t = j} α_t. The β representation allows us to interpret the coefficient vector β^(T) as a vector in R^J or, equivalently, as the hyper-plane which has β^(T) as its normal. This interpretation will play a key role in our exposition. Some examples of common dictionaries are:

- The training variables themselves, in which case h_j(x) = x_j. This leads to our additive model F_T being just a linear model in the original data. The number of dictionary functions will be J = d, the dimension of x.
- Polynomial dictionary of degree p, in which case the number of dictionary functions will be J = C(p+d, d), the binomial coefficient "p+d choose d".
- Decision trees with up to k terminal nodes, if we limit the split points to data points (or mid-way between data points, as CART does). The number of possible trees is bounded from above (trivially) by J ≤ (np)^k · 2^(k^2). Note that regression trees do not fit into our framework, since they will give J = ∞.

The boosting idea was first introduced by Freund and Schapire (1995), with their AdaBoost algorithm. AdaBoost and other boosting algorithms have attracted a lot of attention due to their great success in data modeling tasks, and the mechanism which makes them work has been presented and analyzed from several perspectives. Friedman et al. (2000) develop a statistical perspective, which ultimately leads to viewing AdaBoost as a gradient-based incremental search for a good additive model (more specifically, it is a coordinate descent algorithm), using the exponential loss function C(y,F) = exp(-yF), where y ∈ {-1,1}. The gradient boosting (Friedman, 2001) and anyboost (Mason et al., 1999) generic algorithms have used this approach to generalize the boosting idea to wider families of problems and loss functions. In particular, Friedman et al. (2000) have pointed out that the binomial log-likelihood loss C(y,F) = log(1 + exp(-yF)) is a more natural loss for classification, and is more "robust" to outliers and misspecified data.

A different analysis of boosting, originating in the machine learning community, concentrates on the effect of boosting on the margins y_i F(x_i). For example, Schapire et al. (1998) use margin-based arguments to prove convergence of boosting to perfect classification performance on the training data under general conditions, and to derive bounds on the generalization error (on future, unseen data).

In this paper we combine the two approaches, to conclude that gradient-based boosting can be described, in the separable case, as an approximate margin maximizing process. The view we develop of boosting as an approximate path of optimal solutions to regularized problems also justifies early stopping in boosting as specifying a value for the "regularization parameter".

We consider the problem of minimizing non-negative convex loss functions (in particular the exponential and binomial log-likelihood loss functions) over the training data, with an l1 bound on the model coefficients:

    β̂(c) = arg min_{‖β‖_1 ≤ c} Σ_i C(y_i, h(x_i)'β),    (3)
where h(x_i) = [h_1(x_i), h_2(x_i), ..., h_J(x_i)]' and J = |H|.(1)

Hastie et al. (2001, Chapter 10) have observed that "slow" gradient-based boosting (i.e., we set α_t = ε, ∀t in (1), with ε small) tends to follow the penalized path β̂(c) as a function of c, under some mild conditions on this path. In other words, using the notation of (2), (3), this implies that ‖β^(c/ε) − β̂(c)‖ vanishes with ε, for all (or a wide range of) values of c. Figure 1 illustrates this equivalence between ε-boosting and the optimal solution of (3) on a real-life data set, using squared error loss as the loss function. In this paper we demonstrate this equivalence further and formally state it as a conjecture. Some progress towards proving this conjecture has been made by Efron et al. (2004), who prove a weaker "local" result for the case where C is squared error loss, under some mild conditions on the optimal path. We generalize their result to general convex loss functions.

[Figure 1: Exact coefficient paths (left) for l1-constrained squared error regression and boosting coefficient paths (right) on the data from a prostate cancer study. Each panel shows the coefficient curves for the predictors (lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp), plotted against Σ_j |β̂_j(c)| on the left and against the boosting iteration on the right.]

Combining the empirical and theoretical evidence, we conclude that boosting can be viewed as an approximate incremental method for following the l1-regularized path. We then prove that in the separable case, for both the exponential and logistic log-likelihood loss functions, β̂(c)/c converges as c → ∞ to an "optimal" separating hyper-plane β̂ described by

    β̂ = arg max_{‖β‖_1 = 1} min_i y_i β'h(x_i).    (4)

In other words, β̂ maximizes the minimal margin among all vectors with l1-norm equal to 1.(2) This result generalizes easily to other l_p-norm constraints. For example, if p = 2, then β̂ describes the optimal separating hyper-plane in the Euclidean sense, i.e., the same one that a non-regularized support vector machine would find.

1. Our notation assumes that the minimum in (3) is unique, which requires some mild assumptions. To avoid notational complications we use this slightly abusive notation throughout this paper. In Appendix B we give explicit conditions for uniqueness of this minimum.
2. The margin maximizing hyper-plane in (4) may not be unique, and we show that in that case the limit β̂ is still defined and it also maximizes the "second minimal" margin. See Appendix B.2 for details.

Combining our two main results, we get the following characterization of boosting:
ε-Boosting can be described as a gradient-descent search, approximately following the path of l1-constrained optimal solutions to its loss criterion, and converging, in the separable case, to a "margin maximizer" in the l1 sense.

Note that boosting with a large dictionary H (in particular if n < J = |H|) guarantees that the data will be separable (except for pathologies), hence separability is a very mild assumption here.

As in the case of support vector machines in high dimensional feature spaces, the non-regularized "optimal" separating hyper-plane is usually of theoretical interest only, since it typically represents an over-fitted model. Thus, we would want to choose a good regularized model. Our results indicate that boosting gives a natural method for doing that, by "stopping early" in the boosting process. Furthermore, they point out the fundamental similarity between boosting and SVMs: both approaches allow us to fit regularized models in high-dimensional predictor space, using a computational trick. They differ in the regularization approach they take (exact l2 regularization for SVMs, approximate l1 regularization for boosting) and in the computational trick that facilitates fitting (the "kernel trick" for SVMs, coordinate descent for boosting).

1.1 Related Work

Schapire et al. (1998) have identified the normalized margins as distance from an l1-normed separating hyper-plane. Their results relate the boosting iterations' success to the minimal margin of the combined model. Rätsch et al. (2001b) take this further using an asymptotic analysis of AdaBoost. They prove that the "normalized" minimal margin, min_i y_i Σ_t α_t h_t(x_i) / Σ_t |α_t|, is asymptotically equal for both classes. In other words, they prove that the asymptotic separating hyper-plane is equally far away from the closest points on either side. This is a property of the margin maximizing separating hyper-plane as we define it. Both papers also illustrate the margin maximizing effects of AdaBoost through experimentation. However, they both stop short of proving convergence to optimal (margin maximizing) solutions. Motivated by our result, Rätsch and Warmuth (2002) have recently asserted the margin-maximizing properties of ε-AdaBoost, using a different approach than the one used in this paper. Their results relate only to the asymptotic convergence of infinitesimal AdaBoost, compared to our analysis of the regularized path "traced along the way" and of a variety of boosting loss functions, which also leads to a convergence result on binomial log-likelihood loss.

The convergence of boosting to an "optimal" solution from a loss function perspective has been analyzed in several papers. Rätsch et al. (2001a) and Collins et al. (2000) give results and bounds on the convergence of the training-set loss, Σ_i C(y_i, Σ_t α_t h_t(x_i)), to its minimum. However, in the separable case convergence of the loss to 0 is inherently different from convergence of the linear separator to the optimal separator. Any solution which separates the two classes perfectly can drive the exponential (or log-likelihood) loss to 0, simply by scaling coefficients up linearly.

Two recent papers have made the connection between boosting and l1 regularization in a slightly different context than this paper. Zhang (2003) suggests a shrinkage version of boosting which converges to l1 regularized solutions, while Zhang and Yu (2003) illustrate the quantitative relationship between early stopping in boosting and l1 constraints.
2. Boosting as Gradient Descent

Generic gradient-based boosting algorithms (Friedman, 2001; Mason et al., 1999) attempt to find a good linear combination of the members of some dictionary of basis functions to optimize a given loss function over a sample. This is done by searching, at each iteration, for the basis function which gives the "steepest descent" in the loss, and changing its coefficient accordingly. In other words, this is a "coordinate descent" algorithm in R^J, where we assign one dimension (or coordinate) for the coefficient of each dictionary function.

Assume we have data {x_i, y_i}_{i=1}^n with x_i ∈ R^d, a loss (or cost) function C(y,F), and a set of dictionary functions {h_j(x)}: R^d → R. Then all of these algorithms follow the same essential steps:

Algorithm 1: Generic gradient-based boosting algorithm
1. Set β^(0) = 0.
2. For t = 1 : T,
   (a) Let F_i = β^(t-1)'h(x_i), i = 1,...,n (the current fit).
   (b) Set w_i = -∂C(y_i, F_i)/∂F_i, i = 1,...,n.
   (c) Identify j_t = arg max_j |Σ_i w_i h_j(x_i)|.
   (d) Set β_{j_t}^(t) = β_{j_t}^(t-1) + α_t · sign(Σ_i w_i h_{j_t}(x_i)) and β_k^(t) = β_k^(t-1), k ≠ j_t.

Here β^(t) is the "current" coefficient vector and α_t > 0 is the current step size. Notice that Σ_i w_i h_{j_t}(x_i) = -∂ Σ_i C(y_i, F_i) / ∂β_{j_t}.

As we mentioned, Algorithm 1 can be interpreted simply as a coordinate descent algorithm in "weak learner" space. Implementation details include the dictionary H of weak learners, the loss function C(y,F), the method of searching for the optimal j_t, and the way in which α_t is determined.(3) For example, the original AdaBoost algorithm uses this scheme with the exponential loss C(y,F) = exp(-yF), and an implicit line search to find the "best" α_t once a direction j_t has been chosen (see Hastie et al., 2001; Mason et al., 1999). The dictionary used by AdaBoost in this formulation would be a set of candidate classifiers, i.e., h_j(x_i) ∈ {-1,+1}; usually decision trees are used in practice.

3. The sign of α_t will always be sign(Σ_i w_i h_{j_t}(x_i)), since we want the loss to be reduced. In most cases, the dictionary H is negation closed, and so it can be assumed WLOG that the coefficients are always positive and increasing.
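To make the scheme concrete, here is a minimal NumPy sketch of Algorithm 1 for a finite dictionary stored as a matrix. It is an illustration only, not the paper's implementation: it performs the exhaustive search of step 2(c) instead of a greedy tree search, fixes the step size at α_t = ε (the "slow boosting" variant discussed below), and the function and variable names are our own.

```python
import numpy as np

def eps_boost(H, y, loss_grad, n_iter=1000, eps=0.01):
    """Generic coordinate-descent (epsilon-)boosting, following Algorithm 1.

    H         : (n, J) matrix; column j holds dictionary function h_j
                evaluated at the n training points.
    y         : (n,) response vector (+1/-1 for the classification losses).
    loss_grad : function(y, F) returning dC(y, F)/dF element-wise.
    eps       : fixed step size alpha_t (the "slow boosting" choice).
    """
    n, J = H.shape
    beta = np.zeros(J)                      # step 1: beta^(0) = 0
    for _ in range(n_iter):
        F = H @ beta                        # step 2(a): current fit
        w = -loss_grad(y, F)                # step 2(b): generalized residuals
        corr = H.T @ w                      # sum_i w_i h_j(x_i), for every j
        j = int(np.argmax(np.abs(corr)))    # step 2(c): steepest coordinate
        beta[j] += eps * np.sign(corr[j])   # step 2(d): small step on that coordinate
    return beta

# Gradients of the two losses used throughout the paper:
exp_grad   = lambda y, F: -y * np.exp(-y * F)        # C_e(y,F) = exp(-yF)
logit_grad = lambda y, F: -y / (1.0 + np.exp(y * F)) # C_l(y,F) = log(1+exp(-yF))
```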
2.1 Practical Implementation of Boosting

The dictionaries used for boosting are typically very large, practically infinite, and therefore the generic boosting algorithm we have presented cannot be implemented verbatim. In particular, it is not practical to exhaustively search for the maximizer in step 2(c). Instead, an approximate, usually greedy search is conducted to find a good candidate weak learner h_{j_t} which makes the first order decline in the loss large (even if not maximal among all possible models).

In the common case that the dictionary of weak learners is comprised of decision trees with up to k nodes, the way AdaBoost and other boosting algorithms solve stage 2(c) is by building a decision tree on a re-weighted version of the data, with the weights |w_i|. Thus they first replace step 2(c) with minimization of Σ_i |w_i| 1{y_i ≠ h_{j_t}(x_i)}, which is easily shown to be equivalent to the original step 2(c). They then use a greedy decision-tree building algorithm such as CART or C5 to build a k-node decision tree which minimizes this quantity, i.e., achieves low "weighted misclassification error" on the weighted data. Since the tree is built greedily, one split at a time, it will not be the global minimizer of weighted misclassification error among all k-node decision trees. However, it will be a good fit for the re-weighted data, and can be considered an approximation to the optimal tree.

This use of approximate optimization techniques is critical, since much of the strength of the boosting approach comes from its ability to build additive models in very high-dimensional predictor spaces. In such spaces, standard exact optimization techniques are impractical: any approach which requires calculation and inversion of Hessian matrices is completely out of the question, and even approaches which require only first derivatives, such as coordinate descent, can only be implemented approximately.

2.2 Gradient-Based Boosting as a Generic Modeling Tool

As Friedman (2001) and Mason et al. (1999) mention, this view of boosting as gradient descent allows us to devise boosting algorithms for any function estimation problem; all we need is an appropriate loss and an appropriate dictionary of weak learners. For example, Friedman et al. (2000) suggested using the binomial log-likelihood loss instead of the exponential loss of AdaBoost for binary classification, resulting in the LogitBoost algorithm. However, there is no need to limit boosting algorithms to classification: Friedman (2001) applied this methodology to regression estimation, using squared error loss and regression trees, and Rosset and Segal (2003) applied it to density estimation, using the log-likelihood criterion and Bayesian networks as weak learners. Their experiments and those of others illustrate that the practical usefulness of this approach, coordinate descent in high dimensional predictor space, carries beyond classification, and even beyond supervised learning.

The view we present in this paper, of coordinate-descent boosting as approximate l1-regularized fitting, offers some insight into why this approach would be good in general: it allows us to fit regularized models directly in high dimensional predictor space. In this it bears a conceptual similarity to support vector machines, which exactly fit an l2 regularized model in high dimensional (RKHS) predictor space.

2.3 Loss Functions

The two most commonly used loss functions for boosting classification models are the exponential and the (minus) binomial log-likelihood:

    Exponential:     C_e(y,F) = exp(-yF);
    Log-likelihood:  C_l(y,F) = log(1 + exp(-yF)).

These two loss functions bear some important similarities to each other. As Friedman et al. (2000) show, the population minimizer of expected loss at point x is similar for both loss functions and is
given by

    F̂(x) = c · log [ P(y = 1|x) / P(y = -1|x) ],

where c_e = 1/2 for exponential loss and c_l = 1 for binomial loss.

[Figure 2: The two classification loss functions, exponential and logistic, plotted against the margin yF.]

More importantly for our purpose, we have the following simple proposition, which illustrates the strong similarity between the two loss functions for positive margins (i.e., correct classifications):

Proposition 1    yF ≥ 0  ⟹  0.5 · C_e(y,F) ≤ C_l(y,F) ≤ C_e(y,F).    (5)

In other words, the two losses become similar if the margins are positive, and both behave like exponentials.

Proof  Consider the functions f_1(z) = z and f_2(z) = log(1+z) for z ∈ [0,1]. Then f_1(0) = f_2(0) = 0, and

    f_1'(z) = 1  ≥  f_2'(z) = 1/(1+z)  ≥  1/2    for z ∈ [0,1].

Thus we can conclude 0.5·f_1(z) ≤ f_2(z) ≤ f_1(z). Now set z = exp(-yF) and we get the desired result.

For negative margins the behaviors of C_e and C_l are very different, as Friedman et al. (2000) have noted. In particular, C_l is more robust against outliers and misspecified data.
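A quick numerical sanity check of Proposition 1 (our own addition, not part of the paper): evaluate both losses on a grid of non-negative margins and verify the sandwich inequality.

```python
import numpy as np

m = np.linspace(0.0, 10.0, 1001)        # margins m = y*F >= 0
C_e = np.exp(-m)                        # exponential loss
C_l = np.log1p(np.exp(-m))              # binomial log-likelihood loss

# Proposition 1: 0.5*C_e <= C_l <= C_e whenever yF >= 0
assert np.all(0.5 * C_e <= C_l + 1e-12)
assert np.all(C_l <= C_e + 1e-12)
print("Proposition 1 holds on the grid; C_l/C_e ranges between",
      round(float((C_l / C_e).min()), 3), "and", round(float((C_l / C_e).max()), 3))
```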
2.4 Line-Search Boosting vs. ε-Boosting

As mentioned above, AdaBoost determines α_t using a line search. In our notation for Algorithm 1 this would be

    α_t = arg min_α Σ_i C(y_i, F_i + α·h_{j_t}(x_i)).

The alternative approach, suggested by Friedman (2001) and Hastie et al. (2001), is to "shrink" all α_t to a single small value ε. This may slow down learning considerably (depending on how small ε is), but is attractive theoretically: the first-order theory underlying gradient boosting implies that the weak learner chosen is the best increment only "locally". It can also be argued that this approach is "stronger" than line search, as we can keep selecting the same h_{j_t} repeatedly if it remains optimal, and so ε-boosting dominates line-search boosting in terms of training error. In practice, this approach of "slowing the learning rate" usually performs better than line search in terms of prediction error as well (see Friedman, 2001). For our purposes, we will mostly assume ε is infinitesimally small, so the theoretical boosting algorithm which results is the limit of a series of boosting algorithms with shrinking ε.

In regression terminology, the line-search version is equivalent to forward stage-wise modeling, infamous in the statistics literature for being too greedy and highly unstable (see Friedman, 2001). This is intuitively obvious, since by increasing the coefficient until it "saturates" we are destroying signal which may help us select other good predictors.

3. l_p Margins, Support Vector Machines and Boosting

We now introduce the concept of margins as a geometric interpretation of a binary classification model. In the context of boosting, this view offers a different understanding of AdaBoost from the gradient descent view presented above. In the following sections we connect the two views.

3.1 The Euclidean Margin and the Support Vector Machine

Consider a classification model in high dimensional predictor space: F(x) = Σ_j h_j(x)β_j. We say that the model separates the training data {x_i, y_i}_{i=1}^n if sign(F(x_i)) = y_i, ∀i. From a geometrical perspective this means that the hyper-plane defined by F(x) = 0 is a separating hyper-plane for this data, and we define its (Euclidean) margin as

    m_2(β) = min_i  y_i F(x_i) / ‖β‖_2.    (6)

The margin-maximizing separating hyper-plane for this data would be defined by the β which maximizes m_2(β). Figure 3 shows a simple example of separable data in two dimensions, with its margin-maximizing separating hyper-plane. The Euclidean margin-maximizing separating hyper-plane is the (non-regularized) support vector machine solution. Its margin maximizing properties play a central role in deriving generalization error bounds for these models, and form the basis for a rich literature.

3.2 The l_1 Margin and Its Relation to Boosting

Instead of considering the Euclidean margin as in (6) we can define an l_p margin concept as

    m_p(β) = min_i  y_i F(x_i) / ‖β‖_p.    (7)

Of particular interest to us is the case p = 1.
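The following small computation (our own toy example, with data and candidate directions chosen purely for illustration) evaluates m_2 and m_1 from (6) and (7) for two candidate normal vectors, showing how a "diagonal" direction can win in the Euclidean sense while a sparse, coordinate-aligned direction wins in the l_1 sense, the effect visible in Figures 3 and 4.

```python
import numpy as np

def margin(beta, X, y, p):
    # m_p(beta) = min_i y_i F(x_i) / ||beta||_p with F(x) = x . beta, as in (6)-(7)
    return np.min(y * (X @ beta)) / np.linalg.norm(beta, ord=p)

# Toy separable data: two points per class in R^2 (an arbitrary construction).
X = np.array([[3.0, 1.0], [1.0, -0.5], [-3.0, -1.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])

beta_diag   = np.array([1.0, -0.5])   # a "diagonal" direction, good in the l2 sense
beta_sparse = np.array([1.0,  0.0])   # a sparse, coordinate-aligned direction

for name, b in [("diagonal", beta_diag), ("sparse", beta_sparse)]:
    print(f"{name:8s}  l2-margin = {margin(b, X, y, 2):.3f}"
          f"   l1-margin = {margin(b, X, y, 1):.3f}")
# Here the diagonal direction has the larger l2-margin (about 1.118 vs 1.0),
# while the sparse direction has the larger l1-margin (1.0 vs about 0.833).
```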
[Figure 3: A simple data example, with two observations from class "O" and two observations from class "X". The full line is the Euclidean margin-maximizing separating hyper-plane.]

[Figure 4: The l_1-margin maximizing separating hyper-plane for the same data set as Figure 3. The difference between the diagonal Euclidean optimal separator and the vertical l_1 optimal separator illustrates the "sparsity" effect of optimal l_1 separation.]

Figure 4 shows the l_1-margin maximizing separating hyper-plane for the same simple example as Figure 3. Note the fundamental difference between
the two solutions: the l_2-optimal separator is diagonal, while the l_1-optimal one is vertical. To understand why this is so we can relate the two margin definitions to each other as

    y·F(x) / ‖β‖_1  =  ( y·F(x) / ‖β‖_2 ) · ( ‖β‖_2 / ‖β‖_1 ).    (8)

From this representation we can observe that the l_1 margin will tend to be big if the ratio ‖β‖_2/‖β‖_1 is big. This ratio will generally be big if β is sparse. To see this, consider fixing the l_1 norm of the vector and then comparing the l_2 norm of two candidates: one with many small components, and the other a sparse one with a few large components and many zero components. It is easy to see that the second vector will have bigger l_2 norm, and hence (if the l_2 margin for both vectors is equal) a bigger l_1 margin.

A different perspective on the difference between the optimal solutions is given by a theorem due to Mangasarian (1999), which states that the l_p margin maximizing separating hyper-plane maximizes the l_q distance from the closest points to the separating hyper-plane, with 1/p + 1/q = 1. Thus the Euclidean optimal separator (p = 2) also maximizes Euclidean distance between the points and the hyper-plane, while the l_1 optimal separator maximizes l_∞ distance. This interesting result gives another intuition why l_1 optimal separating hyper-planes tend to be coordinate-oriented (i.e., have sparse representations): since l_∞ projection considers only the largest coordinate distance, some coordinate distances may be 0 at no cost of decreased l_∞ distance.

Schapire et al. (1998) have pointed out the relation between AdaBoost and the l_1 margin. They prove that, in the case of separable data, the boosting iterations increase the "boosting margin" of the model, defined as

    min_i  y_i F(x_i) / ‖α‖_1.    (9)

In other words, this is the l_1 margin of the model, except that it uses the α incremental representation rather than the β "geometric" representation for the model. The two representations give the same l_1 norm if there is "sign consistency", or monotonicity in the coefficient paths traced by the model, i.e., if at every iteration t of the boosting algorithm

    β_{j_t} ≠ 0  ⟹  sign(α_t) = sign(β_{j_t}).    (10)

As we will see later, this monotonicity condition will play an important role in the equivalence between boosting and l_1 regularization.

The l_1-margin maximization view of AdaBoost presented by Schapire et al. (1998), and a whole plethora of papers that followed, is important for the analysis of boosting algorithms for two distinct reasons:

- It gives an intuitive, geometric interpretation of the model that AdaBoost is looking for: a model which separates the data well in this l_1-margin sense. Note that the view of boosting as gradient descent in a loss criterion doesn't really give the same kind of intuition: if the data is separable, then any model which separates the training data will drive the exponential or binomial loss to 0 when scaled up,

      m_1(β) > 0  ⟹  Σ_i C(y_i, d·β'h(x_i)) → 0  as  d → ∞.
- The l_1-margin behavior of a classification model on its training data facilitates generation of generalization (or prediction) error bounds, similar to those that exist for support vector machines (Schapire et al., 1998). The important quantity in this context is not the margin but the normalized margin, which considers the conjugate norm of the predictor vectors:

      y_i β'h(x_i) / ( ‖β‖_1 · ‖h(x_i)‖_∞ ).

  When the dictionary we are using is comprised of classifiers, then ‖h(x_i)‖_∞ = 1 always, and thus the l_1 margin is exactly the relevant quantity. The error bounds described by Schapire et al. (1998) allow using the whole l_1 margin distribution, not just the minimal margin. However, boosting's tendency to separate well in the l_1 sense is a central motivation behind their results.

From a statistical perspective, however, we should be suspicious of margin-maximization as a method for building good prediction models in high dimensional predictor space. Margin maximization in high dimensional space is likely to lead to over-fitting and bad prediction performance. This has been observed in practice by many authors, in particular Breiman (1999). Our results in the next two sections suggest an explanation based on model complexity: margin maximization is the "limit" of parametric regularized optimization models, as the regularization vanishes, and the regularized models along the path may well be superior to the margin maximizing "limiting" model, in terms of prediction performance. In Section 7 we return to discuss these issues in more detail.

4. Boosting as Approximate Incremental l_1 Constrained Fitting

In this section we introduce an interpretation of the generic coordinate-descent boosting algorithm as tracking a path of approximate solutions to l_1-constrained (or equivalently, regularized) versions of its loss criterion. This view serves our understanding of what boosting does, in particular the connection between early stopping in boosting and regularization. We will also use this view to get a result about the asymptotic margin-maximization of regularized classification models, and by analogy of classification boosting. We build on ideas first presented by Hastie et al. (2001, Chapter 10) and Efron et al. (2004).

Given a convex non-negative loss criterion C(·,·), consider the 1-dimensional path of optimal solutions to l_1 constrained optimization problems over the training data:

    β̂(c) = arg min_{‖β‖_1 ≤ c} Σ_i C(y_i, h(x_i)'β).    (11)

As c varies, β̂(c) traces a 1-dimensional "optimal curve" through R^J. If an optimal solution for the non-constrained problem exists and has finite l_1 norm c_0, then obviously β̂(c) = β̂(c_0) = β̂ for all c > c_0. In the case of separable 2-class data, using either C_e or C_l, there is no finite-norm optimal solution. Rather, the constrained solution will always have ‖β̂(c)‖_1 = c.

A different way of building a solution which has l_1 norm c is to run our ε-boosting algorithm for c/ε iterations. This will give an α^(c/ε) vector which has l_1 norm exactly c. For the norm of the geometric representation β^(c/ε) to also be equal to c, we need the monotonicity condition (10) to hold as well. This condition will play a key role in our exposition.

We are going to argue that the two solution paths β̂(c) and β^(c/ε) are very similar for ε small; a small simulation in this spirit is sketched below.
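The following sketch (our own synthetic example, with arbitrary parameter choices) illustrates the claimed similarity for squared error loss: it computes an exact l_1-penalized (lasso) solution with scikit-learn, then runs ε-boosting (forward stagewise) until the coefficient vector reaches the same l_1 norm, and compares the two. When the coefficient paths are monotone, the two vectors come out nearly identical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, J = 100, 5
X = rng.standard_normal((n, J))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + 0.3 * rng.standard_normal(n)

# Exact l1-regularized solution for one penalty value; its l1 norm plays the
# role of the constraint c in (11) (the penalized and constrained forms agree
# at matching values, by convex duality).
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
c = np.abs(lasso.coef_).sum()

# epsilon-boosting (forward stagewise) with squared error, run until ||beta||_1 ~ c.
eps, beta = 1e-3, np.zeros(J)
while np.abs(beta).sum() < c:
    r = y - X @ beta                 # generalized residuals (up to a factor of 2)
    corr = X.T @ r
    j = int(np.argmax(np.abs(corr)))
    beta[j] += eps * np.sign(corr[j])

print("lasso     :", np.round(lasso.coef_, 3))
print("stagewise :", np.round(beta, 3))
```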
Let us start by observing this similarity in practice. Figure 1 in the introduction shows an example of this similarity for squared error loss fitting with an l_1 (lasso) penalty. Figure 5 shows another example in the same mold, taken from Efron et al. (2004). The data is a diabetes study and the dictionary used is just the original 10 variables. The panel on the left shows the path of optimal l_1-constrained solutions β̂(c), and the panel on the right shows the ε-boosting path with the 10-dimensional dictionary (the total number of boosting iterations is about 6000). The 1-dimensional path through R^10 is described by 10 coordinate curves, corresponding to each one of the variables.

[Figure 5: Another example of the equivalence between the Lasso optimal solution path (left) and ε-boosting with squared error loss (right). Note that the equivalence breaks down when the path of variable 7 becomes non-monotone.]

The interesting phenomenon we observe is that the two coefficient traces are not completely identical. Rather, they agree up to the point where the coefficient path of variable 7 becomes non-monotone, i.e., it violates (10) (this point is where variable 8 comes into the model; see the arrow on the right panel of Figure 5). This example illustrates that the monotonicity condition, and its implication that ‖α‖_1 = ‖β‖_1, is critical for the equivalence between ε-boosting and l_1-constrained optimization.

The two examples we have seen so far have used squared error loss, and we should ask ourselves whether this equivalence stretches beyond this loss. Figure 6 shows a similar result, but this time for the binomial log-likelihood loss, C_l. We used the "spam" data set, taken from the UCI repository (Blake and Merz, 1998). We chose only 5 predictors of the 57 to make the plots more interpretable and the computations more accommodating. We see that there is a perfect equivalence between the exact constrained solution (i.e., regularized logistic regression) and ε-boosting in this case, since the paths are fully monotone.

To justify why this observed equivalence is not surprising, let us consider the following "l_1-locally optimal monotone direction" problem of finding the best monotone ε increment to a given model β_0:

    min  C(β)    (12)
    s.t.  ‖β‖_1 - ‖β_0‖_1 ≤ ε,
          β ≥ β_0 (component-wise).
[Figure 6: Exact coefficient paths (left) for l_1-constrained logistic regression and boosting coefficient paths (right) with binomial log-likelihood loss on five variables from the "spam" data set. The boosting path was generated using a small ε and 7000 iterations.]

Here we use C(β) as shorthand for Σ_i C(y_i, h(x_i)'β). A first order Taylor expansion gives us

    C(β) = C(β_0) + ∇C(β_0)'(β - β_0) + O(ε²).

Given the l_1 constraint on the increase in ‖β‖_1, it is easy to see that a first-order optimal solution (and therefore an optimal solution as ε → 0) will make a "coordinate descent" step, i.e., β_j ≠ β_{0,j} only for a coordinate j with

    |∇C(β_0)_j| = max_k |∇C(β_0)_k|,

assuming the signs match, i.e., sign(β_{0,j}) = -sign(∇C(β_0)_j).

So we get that if the optimal solution to (12) without the monotonicity constraint happens to be monotone, then it is equivalent to a coordinate descent step. And so it is reasonable to expect that if the optimal l_1 regularized path is monotone (as it indeed is in Figures 1 and 6), then an infinitesimal ε-boosting algorithm would follow the same path of solutions. Furthermore, even if the optimal path is not monotone, we can still use the formulation (12) to argue that ε-boosting would tend to follow an approximate l_1-regularized path. The main difference between the ε-boosting path and the true optimal path is that the former will tend to "delay" becoming non-monotone, as we observe for variable 7 in Figure 5. Understanding this specific phenomenon would require analysis of the true optimal path, which falls outside the scope of our discussion; Efron et al. (2004) cover the subject for squared error loss, and their discussion applies to any continuously differentiable convex loss, using second-order approximations.

We can employ this understanding of the relationship between boosting and l_1 regularization to construct l_p boosting algorithms by changing the coordinate-selection criterion in the coordinate descent algorithm. We will get back to this point in Section 7, where we design an l_2 boosting algorithm.

The experimental evidence and heuristic discussion we have presented lead us to the following conjecture, which connects "slow boosting" and l_1-regularized optimization:
Conjecture 2  Consider applying the ε-boosting algorithm to any convex loss function, generating a path of solutions β^(ε)(t). If the optimal coefficient paths are monotone for all c < c_0, i.e., if for each j, |β̂(c)_j| is non-decreasing in the range c < c_0, then

    lim_{ε→0} β^(ε)(c_0/ε) = β̂(c_0).

Efron et al. (2004, Theorem 2) prove a weaker "local" result for the case of squared error loss only. We generalize their result to any convex loss. However this result still does not prove the global convergence which the conjecture claims, and which the empirical evidence implies. For the sake of brevity and readability, we defer this proof, together with a concise mathematical definition of the different types of convergence, to Appendix A.

In the context of real-life boosting, where the number of basis functions is usually very large, and making ε small enough for the theory to apply would require running the algorithm "forever", these results should not be considered directly applicable. Instead, they should be taken as an intuitive indication that boosting, especially the ε version, is indeed approximating optimal solutions to the constrained problems it encounters along the way.

5. l_p-Constrained Classification Loss Functions

Having established the relation between boosting and l_1 regularization, we now turn our attention to the regularized optimization problem. By analogy, our results will apply to boosting as well. We concentrate on C_e and C_l, the two classification losses defined above, and the solution paths of their l_p constrained versions:

    β̂^(p)(c) = arg min_{‖β‖_p ≤ c} Σ_i C(y_i, β'h(x_i)),    (13)

where C is either C_e or C_l. As we discussed below Equation (11), if the training data is separable in span(H), then we have ‖β̂^(p)(c)‖_p = c for all values of c, and consequently ‖β̂^(p)(c)‖_p / c = 1. We may ask what are the convergence points of the sequence β̂^(p)(c)/c as c → ∞. The following theorem shows that these convergence points describe l_p-margin maximizing separating hyper-planes.

Theorem 3  Assume the data is separable, i.e., there exists β such that y_i β'h(x_i) > 0 for all i. Then for both C_e and C_l, every convergence point of β̂(c)/c corresponds to an l_p-margin-maximizing separating hyper-plane. If the l_p-margin-maximizing separating hyper-plane is unique, then it is the unique convergence point, i.e.,

    β̂^(p) = lim_{c→∞} β̂^(p)(c)/c = arg max_{‖β‖_p = 1} min_i y_i β'h(x_i).    (14)

Proof  This proof applies to both C_e and C_l, given the property in (5). Consider two separating candidates β_1 and β_2 such that ‖β_1‖_p = ‖β_2‖_p = 1. Assume that β_1 separates better, i.e.,

    m_1 := min_i y_i β_1'h(x_i)  >  m_2 := min_i y_i β_2'h(x_i)  >  0.

Then we have the following simple lemma:
15 BOOSTING AS A REGULARIZED PATH Lemma 4 There exsts some D = D(m 1,m 2 ) such that d > D, dβ 1 ncurs smaller loss than dβ 2, n other words: C(y,dβ 1h(x )) < C(y,dβ 2h(x )). Gven ths lemma, we can now prove that any convergence pont of ˆβ (p) (c) c must be an l p -margn maxmzng separator. Assume β s a convergence pont of ˆβ (p) (c) c. Denote ts mnmal margn on the data by m. If the data s separable, clearly m > 0 (snce otherwse the loss of dβ does not even converge to 0 as d ). Now, assume some β wth β p = 1 has bgger mnmal margn m > m. By contnuty of the mnmal margn n β, there exsts some open neghborhood of β and an ε > 0, such that N β = {β : β β 2 < δ} mn y β h(x ) < m ε, β N β. Now by the lemma we get that there exsts some D = D( m, m ε) such that d β ncurs smaller loss than dβ for any d > D, β N β. Therefore β cannot be a convergence pont of ˆβ (p) (c) c. We conclude that any convergence pont of the sequence ˆβ (p) (c) c must be an l p -margn maxmzng separator. If the margn maxmzng separator s unque then t s the only possble convergence pont, and therefore ˆβ (p) = lm ˆβ(p) (c) = arg max c c mn y β h(x ). β p =1 Proof of Lemma Usng (5) and the defnton of C e, we get for both loss functons: C(y,dβ 1h(x )) nexp( d m 1 ). Now, snce β 1 separates better, we can fnd our desred such that D = D(m 1,m 2 ) = logn + log2 m 1 m 2 d > D, nexp( d m 1 ) < 0.5exp( d m 2 ). And usng (5) and the defnton of C e agan we can wrte 0.5exp( d m 2 ) C(y,dβ 2h(x )). Combnng these three nequaltes we get our desred result: d > D, C(y,dβ 1h(x )) C(y,dβ 2h(x )). 955
16 ROSSET, ZHU AND HASTIE We thus conclude that f the l p -margn maxmzng separatng hyper-plane s unque, the normalzed constraned soluton converges to t. In the case that the margn maxmzng separatng hyper-plane s not unque, we can n fact prove a stronger result, whch ndcates that the lmt of the regularzed solutons would then be determned by the second smallest margn, then by the thrd and so on. Ths result s manly of techncal nterest and we prove t n Appendx B, Secton Implcatons of Theorem 3 We now brefly dscuss the mplcatons of ths theorem for boostng and logstc regresson BOOSTING IMPLICATIONS Combned wth our results from Secton 4, Theorem 3 ndcates that the normalzed boostng path β (t) u t α u wth ether C e or C l used as loss approxmately converges to a separatng hyper-plane ˆβ, whch attans max mn y β h(x ) = max β 2 mny d, (15) β 1 =1 β 1 =1 where d s the (sgned) Eucldean dstance from the tranng pont to the separatng hyper-plane. In other words, t maxmzes Eucldean dstance scaled by an l 2 norm. As we have mentoned already, ths mples that the asymptotc boostng soluton wll tend to be sparse n representaton, due to the fact that for fxed l 1 norm, the l 2 norm of vectors that have many 0 entres wll generally be larger. In fact, under rather mld condtons, the asymptotc soluton ˆβ = lm c ˆβ(1) (c)/c, wll have at most n (the number of observatons) non-zero coeffcents, f we use ether C l or C e as the loss. See Appendx B, Secton 1 for proof LOGISTIC REGRESSION IMPLICATIONS Recall, that the logstc regresson (maxmum lkelhood) soluton s undefned f the data s separable n the Eucldean space spanned by the predctors. Theorem 3 allows us to defne a logstc regresson soluton for separable data, as follows: 1. Set a hgh constrant value c max 2. Fnd ˆβ (p) (c max ), the soluton to the logstc regresson problem subject to the constrant β p c max. The problem s convex for any p 1 and dfferentable for any p > 1, so nteror pont methods can be used to solve ths problem. 3. Now you have (approxmately) the l p -margn maxmzng soluton for ths data, descrbed by ˆβ (p) (c max ) c max. Ths s a soluton to the orgnal problem n the sense that t s, approxmately, the convergence pont of the normalzed l p -constraned solutons, as the constrant s relaxed. 956
17 BOOSTING AS A REGULARIZED PATH Of course, wth our result from Theorem 3 t would probably make more sense to smply fnd the optmal separatng hyper-plane drectly ths s a lnear programmng problem for l 1 separaton and a quadratc programmng problem for l 2 separaton. We can then consder ths optmal separator as a logstc regresson soluton for the separable data. 6. Examples We now apply boostng to several data sets and nterpret the results n lght of our regularzaton and margn-maxmzaton vew. 6.1 Spam Data Set We now know f the data are separable and we let boostng run forever, we wll approach the same optmal separator for both C e and C l. However f we stop early or f the data s not separable the behavor of the two loss functons may dffer sgnfcantly, snce C e weghs negatve margns exponentally, whle C l s approxmately lnear n the margn for large negatve margns (see Fredman et al., 2000). Consequently, we can expect C e to concentrate more on the hard tranng data, n partcular n the non-separable case. Fgure 7 llustrates the behavor of ε-boostng wth both Mnmal margns Test error exponental logstc AdaBoost mnmal margn test error exponental logstc AdaBoost β β 1 Fgure 7: Behavor of boostng wth the two loss functons on spam data set loss functons, as well as that of AdaBoost, on the spam data set (57 predctors, bnary response). We used 10 node trees and ε = 0.1. The left plot shows the mnmal margn as a functon of the l 1 norm of the coeffcent vector β 1. Bnomal loss creates a bgger mnmal margn ntally, but the mnmal margns for both loss functons are convergng asymptotcally. AdaBoost ntally lags behnd but catches up ncely and reaches the same mnmal margn asymptotcally. The rght plot shows the test error as the teratons proceed, llustratng that both ε-methods ndeed seem to over-ft eventually, even as ther separaton (mnmal margn) s stll mprovng. AdaBoost dd not sgnfcantly over-ft n the 1000 teratons t was allowed to run, but t obvously would have f t were allowed to run on. We should emphasze that the comparson between AdaBoost and ε-boostng presented consders as a bass for comparson the l 1 norm, not the number of teratons. In terms of computatonal complexty, as represented by the number of teratons, AdaBoost reaches both a large mnmal mar- 957
18 ROSSET, ZHU AND HASTIE gn and good predcton performance much more quckly than the slow boostng approaches, as AdaBoost tends to take larger steps. 6.2 Smulated Data To make a more educated comparson and more compellng vsualzaton, we have constructed an example of separaton of 2-dmensonal data usng a 8-th degree polynomal dctonary (45 functons). The data conssts of 50 observatons of each class, drawn from a mxture of Gaussans, and presented n Fgure 8. Also presented, n the sold lne, s the optmal l 1 separator for ths data n ths dctonary (easly calculated as a lnear programmng problem - note the dfference from the l 2 optmal decson boundary, presented n Secton 7.1, Fgure 11 ). The optmal l 1 separator has only 12 non-zero coeffcents out of optmal boost 10 5 ter boost 3*10 6 ter Fgure 8: Artfcal data set wth l 1 -margn maxmzng separator (sold), and boostng models after 10 5 teratons (dashed) and 10 6 teratons (dotted) usng ε = We observe the convergence of the boostng separator to the optmal separator We ran an ε-boostng algorthm on ths data set, usng the logstc log-lkelhood loss C l, wth ε = 0.001, and Fgure 8 shows two of the models generated after 10 5 and teratons. We see that the models seem to converge to the optmal separator. A dfferent vew of ths convergence s gven n Fgure 9, where we see two measures of convergence: the mnmal margn (left, maxmum value obtanable s the horzontal lne) and the l 1 -norm dstance between the normalzed models (rght), gven by j ˆβ j β (t) j β (t) 1, where ˆβ s the optmal separator wth l 1 norm 1 and β (t) s the boostng model after t teratons. We can conclude that on ths smple artfcal example we get nce convergence of the logstcboostng model path to the l 1 -margn maxmzng separatng hyper-plane. We can also use ths example to llustrate the smlarty between the boosted path and the path of l 1 optmal solutons, as we have dscussed n Secton
19 BOOSTING AS A REGULARIZED PATH Mnmal margn l 1 dfference β β 1 Fgure 9: Two measures of convergence of boostng model path to optmal l 1 separator: mnmal margn (left) and l 1 dstance between the normalzed boostng coeffcent vector and the optmal model (rght) l 1 norm: 20 l 1 norm: 350 l 1 norm: 2701 l 1 norm: 5401 Fgure 10: Comparson of decson boundary of boostng models (broken) and of optmal constraned solutons wth same norm (full) Fgure 10 shows the class decson boundares for 4 models generated along the boostng path, compared to the optmal solutons to the constraned logstc regresson problem wth the same bound on the l 1 norm of the coeffcent vector. We observe the clear smlartes n the way the solutons evolve and converge to the optmal l 1 separator. The fact that they dffer (n some cases sgnfcantly) s not surprsng f we recall the monotoncty condton presented n Secton 4 for exact correspondence between the two model paths. In ths case f we look at the coeffcent paths 959
20 ROSSET, ZHU AND HASTIE (not shown), we observe that the monotoncty condton s consstently volated n the low norm ranges, and hence we can expect the paths to be smlar n sprt but not dentcal. 7. Dscusson We can now summarze what we have learned about boostng from the prevous sectons: Boostng approxmately follows the path of l 1 -regularzed models for ts loss crteron If the loss crteron s the exponental loss of AdaBoost or the bnomal log-lkelhood loss of logstc regresson, then the l 1 regularzed model converges to an l 1 -margn maxmzng separatng hyper-plane, f the data are separable n the span of the weak learners We may ask, whch of these two ponts s the key to the success of boostng approaches. One emprcal clue to answerng ths queston, can be found n Breman (1999), who programmed an algorthm to drectly maxmze the margns. Hs results were that hs algorthm consstently got sgnfcantly hgher mnmal margns than AdaBoost on many data sets (and, n fact, a hgher margn dstrbuton beyond the mnmal margn), but had slghtly worse predcton performance. Hs concluson was that margn maxmzaton s not the key to AdaBoost s success. From a statstcal perspectve we can embrace ths concluson, as reflectng the mportance of regularzaton n hghdmensonal predctor space. By our results from the prevous sectons, margn maxmzaton can be vewed as the lmt of parametrc regularzed models, as the regularzaton vanshes. 4 Thus we would generally expect the margn maxmzng solutons to perform worse than regularzed models. In the case of boostng, regularzaton would correspond to early stoppng of the boostng algorthm. 7.1 Boostng and SVMs as Regularzed Optmzaton n Hgh-dmensonal Predctor Spaces Our exposton has led us to vew boostng as an approxmate way to solve the regularzed optmzaton problem mn β C(y,β h(x )) + λ β 1 (16) whch converges as λ 0 to ˆβ (1), f our loss s C e or C l. In general, the loss C can be any convex dfferentable loss and should be defned to match the problem doman. Support vector machnes can be descrbed as solvng the regularzed optmzaton problem (see Fredman et al., 2000, Chapter 12) mn β (1 y β h(x )) + + λ β 2 2 (17) whch converges as λ 0 to the non-regularzed support vector machne soluton,.e., the optmal Eucldean separator, whch we denoted by ˆβ (2). An nterestng connecton exsts between these two approaches, n that they allow us to solve the regularzed optmzaton problem n hgh dmensonal predctor space: 4. It can be argued that margn-maxmzng models are stll regularzed n some sense, as they mnmze a norm crteron among all separatng models. Ths s arguably the property whch stll allows them to generalze reasonably well n many cases. 960
21 BOOSTING AS A REGULARIZED PATH We are able to solve the l 1 - regularzed problem approxmately n very hgh dmenson va boostng by applyng the approxmate coordnate descent trck of buldng a decson tree (or otherwse greedly selectng a weak learner) based on re-weghted versons of the data. Support vector machnes facltate a dfferent trck for solvng the regularzed optmzaton problem n hgh dmensonal predctor space: the kernel trck. If our dctonary H spans a Reproducng Kernel Hlbert Space, then RKHS theory tells us we can fnd the regularzed solutons by solvng an n-dmensonal problem, n the space spanned by the kernel representers {K(x,x)}. Ths fact s by no means lmted to the hnge loss of (17), and apples to any convex loss. We concentrate our dscusson on SVM (and hence hnge loss) only snce t s by far the most common and well-known applcaton of ths result. So we can vew both boostng and SVM as methods that allow us to ft regularzed models n hgh dmensonal predctor space usng a computatonal shortcut. The complexty of the model bult s controlled by regularzaton. These methods are dstnctly dfferent than tradtonal statstcal approaches for buldng models n hgh dmenson, whch start by reducng the dmensonalty of the problem so that standard tools (e.g., Newton s method) can be appled to t, and also to make over-fttng less of a concern. Whle the merts of regularzaton wthout dmensonalty reducton lke Rdge regresson or the Lasso are well documented n statstcs, computatonal ssues make t mpractcal for the sze of problems typcally solved va boostng or SVM, wthout computatonal trcks. We beleve that ths dfference may be a sgnfcant reason for the endurng success of boostng and SVM n data modelng,.e.: Workng n hgh dmenson and regularzng s statstcally preferable to a two-step procedure of frst reducng the dmenson, then fttng a model n the reduced space. It s also nterestng to consder the dfferences between the two approaches, n the loss (flexble vs. hnge loss), the penalty (l 1 vs. l 2 ), and the type of dctonary used (usually trees vs. RKHS). These dfferences ndcate that the two approaches wll be useful for dfferent stuatons. For example, f the true model has a sparse representaton n the chosen dctonary, then l 1 regularzaton may be warranted; f the form of the true model facltates descrpton of the class probabltes va a logstc-lnear model, then the logstc loss C l s the best loss to use, and so on. The computatonal trcks for both SVM and boostng lmt the knd of regularzaton that can be used for fttng n hgh dmensonal space. However, the problems can stll be formulated and solved for dfferent regularzaton approaches, as long as the dmensonalty s low enough: Support vector machnes can be ftted wth an l 1 penalty, by solvng the 1-norm verson of the SVM problem, equvalent to replacng the l 2 penalty n (17) wth an l 1 penalty. In fact, the 1- norm SVM s used qute wdely, because t s more easly solved n the lnear, non-rkhs, stuaton (as a lnear program, compared to the standard SVM whch s a quadratc program) and tends to gve sparser solutons n the prmal doman. Smlarly, we descrbe below an approach for developng a boostng algorthm for fttng approxmate l 2 regularzed models. Both of these methods are nterestng and potentally useful. However they lack what s arguably the most attractve property of the standard boostng and SVM algorthms: a computatonal trck to allow fttng n hgh dmensons. 961
22 ROSSET, ZHU AND HASTIE AN l 2 BOOSTING ALGORITHM We can use our understandng of the relaton of boostng to regularzaton and Theorem 3 to formulate l p -boostng algorthms, whch wll approxmately follow the path of l p -regularzed solutons and converge to the correspondng l p -margn maxmzng separatng hyper-planes. Of partcular nterest s the l 2 case, snce Theorem 3 mples that l 2 -constraned fttng usng C l or C e wll buld a regularzed path to the optmal separatng hyper-plane n the Eucldean (or SVM) sense. To construct an l 2 boostng algorthm, consder the equvalent optmzaton problem (12), and change the step-sze constrant to an l 2 constrant: β 2 β 0 2 ε. It s easy to see that the frst order soluton to ths problem entals selectng for modfcaton the coordnate whch maxmzes C(β 0 ) k β 0,k and that subject to monotoncty, ths wll lead to a correspondence to the locally l 2 -optmal drecton. Followng ths ntuton, we can construct an l 2 boostng algorthm by changng only step 2(c) of our generc boostng algorthm of Secton 2 to 2(c)* Identfy j t whch maxmzes w h j t (x ) β j t. Note that the need to consder the current coeffcent (n the denomnator) makes the l 2 algorthm approprate for toy examples only. In stuatons where the dctonary of weak learner s prohbtvely large, we wll need to fgure out a trck lke the one we presented n Secton 2.1, to allow us to make an approxmate search for the optmzer of step 2(c)*. Another problem n applyng ths algorthm to large problems s that we never choose the same dctonary functon twce, untl all have non-0 coeffcents. Ths s due to the use of the l 2 penalty, where the current coeffcent value affects the rate at whch the penalty term s ncreasng. In partcular, f β j = 0 then ncreasng t causes the penalty term β 2 to ncrease at rate 0, to frst order (whch s all the algorthm s consderng). The convergence of our l 2 boostng algorthm on the artfcal data set of Secton 6.2 s llustrated n Fgure 11. We observe that the l 2 boostng models do ndeed approach the optmal l 2 separator. It s nterestng to note the sgnfcant dfference between the optmal l 2 separator as presented n Fgure 11 and the optmal l 1 separator presented n Secton 6.2 (Fgure 8). 8. Summary and Future Work In ths paper we have ntroduced a new vew of boostng n general, and two-class boostng n partcular, comprsed of two man ponts: We have generalzed results from Efron et al. (2004) and Haste et al. (2001), to descrbe boostng as approxmate l 1 -regularzed optmzaton. We have shown that the exact l 1 -regularzed solutons converge to an l 1 -margn maxmzng separatng hyper-plane. 962
23 BOOSTING AS A REGULARIZED PATH optmal boost 5*10 6 ter boost 10 8 ter Fgure 11: Artfcal data set wth l 2 -margn maxmzng separator (sold), and l 2 -boostng models after teratons (dashed) and 10 8 teratons (dotted) usng ε = We observe the convergence of the boostng separator to the optmal separator We hope our results wll help n better understandng how and why boostng works. It s an nterestng and challengng task to separate the effects of the dfferent components of a boostng algorthm: Loss crteron Dctonary and greedy learnng method Lne search / slow learnng and relate them to ts success n dfferent scenaros. The mplct l 1 regularzaton n boostng may also contrbute to ts success, as t has been shown that n some stuatons l 1 regularzaton s nherently superor to others (see Donoho et al., 1995). An mportant ssue when analyzng boostng s over-fttng n the nosy data case. To deal wth over-fttng, Rätsch et al. (2001b) propose several regularzaton methods and generalzatons of the orgnal AdaBoost algorthm to acheve a soft margn by ntroducng slack varables. Our results ndcate that the models along the boostng path can be regarded as l 1 regularzed versons of the optmal separator, hence regularzaton can be done more drectly and naturally by stoppng the boostng teratons early. It s essentally a choce of the l 1 constrant parameter c. Many other questons arse from our vew of boostng. Among the ssues to be consdered: Is there a smlar separator vew of mult-class boostng? We have some tentatve results to ndcate that ths mght be the case f the boostng problem s formulated properly. Can the constraned optmzaton vew of boostng help n producng generalzaton error bounds for boostng that would be more tght than the current exstng ones? 963
24 ROSSET, ZHU AND HASTIE Acknowledgments We thank Stephen Boyd, Brad Efron, Jerry Fredman, Robert Schapre and Rob Tbshran for helpful dscussons. We thank the referees for ther thoughtful and useful comments. Ths work was partally supported by Stanford graduate fellowshp, grant DMS from the Natonal Scence Foundaton, and grant ROI-CA from the Natonal Insttutes of Health. Appendx A. Local Equvalence of Infntesmal ε-boostng and l 1 -Constraned Optmzaton As before, we assume we have a set of tranng data (x 1,y 1 ),(x 2,y 2 ),...(x n,y n ), a smooth cost functon C(y,F), and a set of bass functons (h 1 (x),h 2 (x),...h J (x)). We denote by ˆβ(s) be the optmal soluton of the l 1 -constraned optmzaton problem: mn β n =1 C(y,h(x ) β) (18) subject to β 1 s. (19) Suppose we ntalze the ε-boostng verson of Algorthm 1, as descrbed n Secton 2, at ˆβ(s) and run the algorthm for T steps. Let β(t ) denote the coeffcents after T steps. The global convergence Conjecture 2 n Secton 4 mples that s > 0: β( s/ε) ˆβ(s + s) as ε 0 under some mld assumptons. Instead of provng ths global result, we show here a local result by lookng at the dervatve of ˆβ(s). Our proof bulds on the proof by Efron et al. (2004, Theorem 2) of a smlar result for the case that the cost s squared error loss C(y,F) = (y F) 2. Theorem 1 below shows that f we start the ε-boostng algorthm at a soluton ˆβ(s) of the l 1 -constraned optmzaton problem (18) (19), the drecton of change of the ε-boostng soluton wll agree wth that of the l 1 -constraned optmzaton problem. Theorem 1 Assume the optmal coeffcent paths ˆβ j (s) j are monotone n s and the coeffcent paths β j (T ) j are also monotone as ε-boostng proceeds, then β(t ) ˆβ(s) T ε Proof Frst we ntroduce some notatons. Let ˆβ(s) as ε 0,T,T ε 0. h j = (h j (x 1 ),...h j (x n )) be the jth bass functon evaluated at the n tranng data. Let F = (F(x 1 ),...F(x n )) be the vector of current ft. Let ( r = C(y 1,F 1 ),... C(y ) n,f n ) F 1 F n 964
25 BOOSTING AS A REGULARIZED PATH be the current generalzed resdual vector as defned n Fredman (2001). Let c j = h jr, j = 1,...J be the current correlaton between h j and r. Let A = { j : c j = max j c j } be the set of ndces for the maxmum absolute correlaton. For clarty, we re-wrte ths ε-boostng algorthm, startng from ˆβ(s), as a specal case of Algorthm 1, as follows: (1) Intalze β(0) = ˆβ(s),F 0 = F,r 0 = r. (2) For t = 1 : T (a) Fnd j t = argmax j h j r t 1. (b) Update β t, jt β t 1, jt + ε sgn(c jt ) (c) Update F t and r t. Notce n the above algorthm, we start from ˆβ(s), rather than 0. As proposed n Efron et al. (2004), we consder an dealzed ε-boostng case: ε 0. As ε 0, T and T ε 0, under the monotone paths condton, Secton 3.2 and Secton 6 of Efron et al. (2004) showed where u and v satsfy two constrants: F T F 0 T ε r T r 0 T ε u, (20) v, (21) (Constrant 1) u s n the convex cone generated by {sgn(c j )h j : j A},.e.: u = P j sgn(c j )h j,p j 0. j A (Constrant 2) v has equal correlaton wth sgn(c j )h j, j A: sgn(c j )h jv = λ A for j A. The frst constrant s true because the bass functons n A C wll not be able to catch up n terms of c j for suffcently small T ε; the P j s are non-negatve because the coeffcent paths β j (T ) are monotone. The second constrant can be seen by takng a Taylor expanson of C(y,F) around F 0 to the quadratc term, lettng T ε go to zero and applyng the result for the squared error loss from Efron et al. (2004). Once the two constrants are establshed, we notce that v = 2 C(y,F) F 2 u. F0 (x ) 965
26 ROSSET, ZHU AND HASTIE Hence we can plug the constrant 1 nto the constrant 2 and get the followng set of equatons: where H T A W H A P = λ A 1, H A = ( sgn(c j )h j ), j A, ( ) W = dag 2 C(y,F) F 2, P = ( P j ), j A. F0 (x ) If H s of rank A (we wll get back to ths ssue n detals n Appendx B), then P, or equvalently u and v, are unquely determned up to a scale number. Now we consder the l 1 -constraned optmzaton problem (18) (19). Let ˆF(s) be the ftted vector and ˆr(s) be the correspondng resdual vector. Snce ˆF(s) and ˆr(s) are smooth, defne u lm s 0 v lm s 0 ˆF(s + s) ˆF(s), s (22) ˆr(s + s) ˆr(s). s (23) Lemma 2 Under the monotone coeffcent paths assumpton, u and v also satsfy constrants 1 2. Proof Wrte the coeffcent β j as β + j β j, where { β + j = β j,β j = 0 f β j > 0, β + j = 0,β j = β j f β j < 0. The l 1 -constraned optmzaton problem (18) (19) s then equvalent to mn β +,β The correspondng Lagrangan dual s n =1 C ( y,h(x ) (β + β ) ), (24) subject to β β 1 s,β + 0,β 0. (25) L = n =1 C ( y,h(x ) (β + β ) ) + λ λ s J j=1 (β + j + β j ) (26) J λ + J j β+ j λ j β j, (27) j=1 j=1 where λ 0,λ + j 0,λ j 0 are Lagrange multplers. By dfferentatng the Lagrangan dual, we get the soluton of (24) (25) needed to satsfy the followng Karush-Kuhn-Tucker condtons: L β + j = h jˆr + λ λ+ j = 0, (28) 966
Let $c_j = h_j'\hat r$ and $A = \{j : |c_j| = \max_{j'} |c_{j'}|\}$. We can see the following facts from the Karush-Kuhn-Tucker conditions:

(Fact 1) Using (28), (29) and $\lambda \ge 0$, $\lambda_j^+ \ge 0$, $\lambda_j^- \ge 0$, we have $|c_j| \le \lambda$.

(Fact 2) If $\hat\beta_j \ne 0$, then $|c_j| = \lambda$ and $j \in A$. For example, suppose $\hat\beta_j^+ \ne 0$; then $\lambda_j^+ = 0$ and (28) implies $c_j = \lambda$.

(Fact 3) If $\hat\beta_j \ne 0$, then $\mathrm{sgn}(\hat\beta_j) = \mathrm{sgn}(c_j)$.

We also note that $\hat\beta_j^+$ and $\hat\beta_j^-$ cannot both be non-zero; otherwise $\lambda_j^+ = \lambda_j^- = 0$, and (28) and (29) cannot hold at the same time. It is possible that $\hat\beta_j = 0$ and $j \in A$; this only happens for a finite number of $s$ values, where basis $h_j$ is about to enter the model.

For sufficiently small $\Delta s$, since the second derivative of the cost function $C(y,F)$ is finite, $A$ will stay the same. Since $j \in A$ whenever $\hat\beta_j \ne 0$, the change in the fitted vector is
$$\hat F(s+\Delta s) - \hat F(s) = \sum_{j \in A} Q_j h_j.$$
Since $\mathrm{sgn}(\hat\beta_j) = \mathrm{sgn}(c_j)$ and the coefficients $\hat\beta_j$ change monotonically, $\mathrm{sgn}(Q_j)$ agrees with $\mathrm{sgn}(c_j)$. Hence we have
$$\hat F(s+\Delta s) - \hat F(s) = \Delta s \sum_{j \in A} P_j\, \mathrm{sgn}(c_j) h_j, \qquad (32)$$
with $P_j \ge 0$. This implies that $u^*$ satisfies constraint 1. The claim that $v^*$ satisfies constraint 2 follows directly from Fact 2, since both $\hat r(s+\Delta s)$ and $\hat r(s)$ have equal correlation with $\mathrm{sgn}(c_j) h_j$ for all $j \in A$, and hence so does their difference.

Completion of the proof of Theorem 1: We further notice that in both the ε-boosting case and the constrained optimization case we have $\sum_{j \in A} P_j = 1$, by definition and by the monotone coefficient paths condition; hence $u$ and $v$ are uniquely determined, i.e., $u = u^*$ and $v = v^*$.

To translate the result into a statement about $\hat\beta(s)$ and $\beta(T)$, we note that $F(x) = h(x)'\beta$. Efron et al. (2004) showed that for $\hat\beta(s)$ to be well defined, $A$ can have at most $n$ elements, i.e., $|A| \le n$. We give sufficient conditions for when this is true in Appendix B. Now let
$$H_A = \big(h_j(x_i)\big), \quad i = 1,\ldots,n;\; j \in A,$$
be an $n \times |A|$ matrix, which we assume is of rank $|A|$. Then $\nabla\hat\beta(s)$ is given by
$$\nabla\hat\beta(s) = (H_A' W H_A)^{-1} H_A' W u,$$
and
$$\frac{\beta(T) - \hat\beta(s)}{T\varepsilon} \to (H_A' W H_A)^{-1} H_A' W u.$$
Hence the theorem is proved.
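As a quick sanity check of this formula (our own remark, not part of the original proof), specialize to the squared error loss treated by Efron et al. (2004):
$$C(y,F) = (y-F)^2 \;\Longrightarrow\; \frac{\partial^2 C}{\partial F^2} \equiv 2,\quad W = 2I,\quad \nabla\hat\beta(s) = (H_A' W H_A)^{-1} H_A' W u = (H_A' H_A)^{-1} H_A' u,$$
so the active coefficients move along the least-squares regression of the change in fit $u$ on the active columns $H_A$, which is the incremental forward stagewise / least angle regression direction studied in Efron et al. (2004).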
Appendix B. Uniqueness and Existence Results

In this appendix we give some details on the properties of regularized solution paths. In Section B.1 we formulate and prove sparseness and uniqueness results for $\ell_1$-regularized solutions with any convex loss. In Section B.2 we extend Theorem 3 of Section 5, which proved the margin-maximizing property of the limit of $\ell_p$-regularized solutions as the regularization vanishes, to the case where the margin-maximizing solution is not unique.

B.1 Sparseness and Uniqueness of ℓ1-Regularized Solutions and Their Limits

Consider the $\ell_1$-constrained optimization problem:
$$\min_{\|\beta\|_1 \le c} \sum_{i=1}^n C\big(y_i, \beta' h(x_i)\big). \qquad (33)$$

In this section we give sufficient conditions for the following properties of the solutions of (33):

1. Existence of a sparse solution (with at most $n$ non-zero coefficients),
2. Non-existence of non-sparse solutions with more than $n$ non-zero coefficients,
3. Uniqueness of the solution,
4. Convergence of the solutions to a sparse solution as $c$ increases.

Theorem 3 Assume that the unconstrained solution of problem (33) has $\ell_1$ norm bigger than $c$. Then there exists a solution of (33) which has at most $n$ non-zero coefficients.

Proof As in Lemma 2 of Appendix A, we prove the theorem using the Karush-Kuhn-Tucker (KKT) formulation of the optimization problem. The chain rule for differentiation gives us
$$\frac{\partial}{\partial \beta_j} \sum_{i=1}^n C\big(y_i, \beta' h(x_i)\big) = -h_j' r(\beta), \qquad (34)$$
where $h_j$ and $r(\beta)$ are defined as in Appendix A; $r(\beta)$ is the generalized residual vector. Using this simple relationship and Facts 2 and 3 from the proof of Lemma 2, we can write a system of equations for the non-zero coefficients at the optimal constrained solution as follows (denote by $A$ the set of indices of the non-zero coefficients):
$$H_A' r(\beta) = \lambda\, \mathrm{sgn}(\beta_A). \qquad (35)$$
In other words, we get $|A|$ equations in the $|A|$ variables corresponding to the non-zero $\beta_j$'s. However, each column of the matrix $H_A$ is of length $n$, and so $H_A$ can have at most $n$ linearly independent columns: $\mathrm{rank}(H_A) \le n$.

Assume now that we have an optimal solution of (33) with $|A| > n$. Then there exists $l \in A$ such that
$$h_l = \sum_{j \in A,\, j \ne l} \alpha_j h_j. \qquad (36)$$
Substituting (36) into the $l$th row of (35) we get
$$\Big(\sum_{j \in A,\, j \ne l} \alpha_j h_j\Big)' r(\beta) = \lambda\, \mathrm{sgn}(\beta_l). \qquad (37)$$
But from (35) we know that $h_j' r(\beta) = \lambda\, \mathrm{sgn}(\beta_j)$, $\forall j \in A$, so we can re-phrase (37) as
$$\sum_{j \in A,\, j \ne l} \alpha_j\, \mathrm{sgn}(\beta_j)\, \mathrm{sgn}(\beta_l) = 1. \qquad (38)$$
In other words, $h_l$ is a linear combination of the columns of $H_{A \setminus \{l\}}$ which must obey the specific numeric relation (38).

Now we can construct an alternative optimal solution of (33) with one less non-zero coefficient, as follows:

1. Start from $\beta$.

2. Define the direction $\gamma$ in coefficient space implied by (36), that is: $\gamma_l = -\mathrm{sgn}(\beta_l)$, $\gamma_j = \alpha_j\, \mathrm{sgn}(\beta_l)$ for $j \in A \setminus \{l\}$, and $\gamma_j = 0$ otherwise.

3. Move in direction $\gamma$ until some coefficient in $A$ hits zero, i.e., define
$$\delta^* = \min\{\delta > 0 : \exists j \in A \text{ s.t. } \beta_j + \gamma_j \delta = 0\}$$
(we know that $\delta^* \le |\beta_l|$).

4. Set $\beta^* = \beta + \delta^* \gamma$.

Then from (36) we get that $\beta^{*\prime} h(x_i) = \beta' h(x_i)$ for all $i$, and from (38) we get that
$$\|\beta^*\|_1 = \|\beta\|_1 + \sum_{j \in A} \big[\,|\beta_j + \gamma_j \delta^*| - |\beta_j|\,\big] \qquad (39)$$
$$\qquad = \|\beta\|_1 - \delta^*\Big(1 - \sum_{j \in A,\, j \ne l} \alpha_j\, \mathrm{sgn}(\beta_j)\, \mathrm{sgn}(\beta_l)\Big) = \|\beta\|_1.$$
So $\beta^*$ generates the same fit as $\beta$ and has the same $\ell_1$ norm, therefore it is also an optimal solution, with at least one less non-zero coefficient (by the definition of $\delta^*$). We can obviously apply this process repeatedly until we get a solution with at most $n$ non-zero coefficients.
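For concreteness, the construction in steps 1-4 can be written as a short NumPy function. This is our illustrative sketch, not code from the paper; the name `eliminate_one_coefficient` and its argument conventions are hypothetical, and it assumes an index $l$ and weights $\alpha_j$ satisfying (36) and (38) have already been identified.

```python
import numpy as np

def eliminate_one_coefficient(beta, A, l, alpha):
    """One reduction step from the proof of Theorem 3 (illustrative sketch).

    beta  : (J,) current optimal coefficient vector
    A     : iterable of indices j with beta[j] != 0
    l     : index in A with h_l = sum_{j in A, j != l} alpha[j] * h_j          (eq. 36)
    alpha : dict mapping each j in A \\ {l} to alpha_j; assumed to satisfy
            sum_j alpha_j * sign(beta_j) * sign(beta_l) = 1                    (eq. 38)
    """
    beta = np.asarray(beta, dtype=float)
    gamma = np.zeros_like(beta)
    gamma[l] = -np.sign(beta[l])                    # drive beta_l towards zero
    for j in A:
        if j != l:
            gamma[j] = alpha[j] * np.sign(beta[l])  # compensate so the fit is unchanged

    # largest step before some active coefficient hits zero (delta* <= |beta_l|)
    candidates = [-beta[j] / gamma[j] for j in A
                  if gamma[j] != 0 and -beta[j] / gamma[j] > 0]
    delta = min(candidates)

    # same fit by (36), same l1 norm by (38), at least one more zero coefficient
    return beta + delta * gamma
```

Applying such a step repeatedly, as in the last line of the proof, yields a solution with at most $n$ non-zero coefficients.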
This theorem has the following immediate implication:

Corollary 4 If there is no set of more than $n$ dictionary functions which obeys the equalities (36), (38) on the training data, then any solution of (33) has at most $n$ non-zero coefficients.

This corollary implies, for example, that if the basis functions come from a continuous, non-redundant distribution (so that any such equality holds with probability 0), then with probability 1 any solution of (33) has at most $n$ non-zero coefficients.

Theorem 5 Assume that there is no set of more than $n$ dictionary functions which obeys the equalities (36), (38) on the training data. In addition assume:

1. The loss function $C$ is strictly convex (the squared error loss, $C_l$ and $C_e$ obviously qualify),

2. No set of $n$ dictionary functions is linearly dependent on the training data.

Then problem (33) has a unique solution.

Proof The previous corollary tells us that any solution has at most $n$ non-zero coefficients. Now assume $\beta_1$, $\beta_2$ are both solutions of (33). From the strict convexity of the loss we get
$$h(x_i)'\beta_1 = h(x_i)'\beta_2 = h(x_i)'\big(\alpha\beta_1 + (1-\alpha)\beta_2\big), \quad \forall i,\; 0 \le \alpha \le 1; \qquad (40)$$
and from the convexity of the $\ell_1$ norm we get
$$\|\alpha\beta_1 + (1-\alpha)\beta_2\|_1 \le \|\beta_1\|_1 = \|\beta_2\|_1 = c. \qquad (41)$$
So $\alpha\beta_1 + (1-\alpha)\beta_2$ must also be a solution. Thus the total number of variables with non-zero coefficients in either $\beta_1$ or $\beta_2$ cannot be bigger than $n$, since otherwise $\alpha\beta_1 + (1-\alpha)\beta_2$ would have more than $n$ non-zero coefficients for almost all values of $\alpha$, contradicting Corollary 4. Thus, by ignoring all coefficients which are 0 in both $\beta_1$ and $\beta_2$, we get that both $\beta_1$ and $\beta_2$ can be represented in the same (at most) $n$-dimensional sub-space of $\mathbb{R}^J$, which leads to a contradiction between (40) and assumption 2.

Corollary 6 Consider a sequence $\{\hat\beta(c)/c : 0 \le c < \infty\}$ of normalized solutions of problem (33). Assume that all these solutions have at most $n$ non-zero coefficients. Then any limit point of the sequence has at most $n$ non-zero coefficients.

Proof This is a trivial consequence of convergence. Assume, by contradiction, that $\beta^*$ is a limit point with more than $n$ non-zero coefficients. Let $k = \arg\min_j\{|\beta^*_j| : \beta^*_j \ne 0\}$. Then any vector $\beta$ with at most $n$ non-zero coefficients must have $\beta_j = 0$ for some $j$ with $\beta^*_j \ne 0$, and therefore $\|\beta - \beta^*\|_\infty \ge |\beta^*_k| > 0$, so we get a contradiction to convergence.
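The sparseness results can also be checked numerically. The sketch below is ours and assumes the cvxpy package (with one of its bundled conic solvers supporting the exponential cone) and the binomial log-likelihood loss; it solves the $\ell_1$-constrained problem (33) on random continuous data with $J$ much larger than $n$, so Corollary 4 applies with probability 1, and counts the non-zero coefficients, which, up to solver tolerance, should not exceed $n$.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, J, c = 30, 200, 5.0                      # n observations, J basis functions, l1 bound
H = rng.standard_normal((n, J))             # H[i, j] = h_j(x_i), continuous random dictionary
y = rng.choice([-1.0, 1.0], size=n)

beta = cp.Variable(J)
margins = cp.multiply(y, H @ beta)          # y_i * h(x_i)' beta
loss = cp.sum(cp.logistic(-margins))        # binomial log-likelihood loss log(1 + exp(-y F))
problem = cp.Problem(cp.Minimize(loss), [cp.norm1(beta) <= c])
problem.solve()

n_nonzero = int(np.sum(np.abs(beta.value) > 1e-6))
print(f"non-zero coefficients: {n_nonzero} (at most n = {n} expected)")
```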
B.2 Uniqueness of the Limiting Solution in Theorem 3 when the Margin-Maximizing Separator is not Unique

Recall that we are interested in convergence points of the normalized regularized solutions $\hat\beta^{(p)}(c)/c$. Theorem 3 proves that any such convergence point corresponds to an $\ell_p$-margin-maximizing separating hyper-plane. We now extend it to the case where this first-order separator is not unique, by extending the result to consider the second smallest margin as a tie breaker. We show that any convergence point maximizes the second smallest margin among all models with maximal minimal margin. If there are also ties in the second smallest margin, then any limit point maximizes the third smallest margin among all models which still remain, and so on.

It should be noted that the minimal margin is typically not attained by one observation only in margin-maximizing models. In case of ties in the smallest margins, our reference to the smallest, second smallest, etc. margin implies arbitrary tie-breaking (i.e., our decision on which one of the tied margins is considered smallest and which one second smallest is of no consequence).

Theorem 7 Assume that the data is separable and that the margin-maximizing separating hyper-plane, as defined in (4), is not unique. Then any convergence point of $\hat\beta^{(p)}(c)/c$ will correspond to a margin-maximizing separating hyper-plane which also maximizes the second smallest margin.

Proof The proof is essentially the same as that of Theorem 3; we outline it below. From Theorem 3 we know that we only need to consider margin-maximizing models as limit points. Thus let $\beta_1$, $\beta_2$ be two margin-maximizing models with $\ell_p$ norm 1, but let $\beta_1$ have a bigger second smallest margin. Assume that $\beta_1$ attains its smallest margin on observation $i_1$ and $\beta_2$ attains the same smallest margin on observation $i_2$. Now define
$$m_1 = \min_{i \ne i_1} y_i\, h(x_i)'\beta_1 \;>\; \min_{i \ne i_2} y_i\, h(x_i)'\beta_2 = m_2.$$
Then Lemma 4 of Theorem 3 holds for $\beta_1$ and $\beta_2$ (the proof is exactly the same, except that we ignore the smallest-margin observation for each model, since these always contribute the same amount to the combined loss).

Let $\beta^*$ be a convergence point. We know $\beta^*$ maximizes the margin, by Theorem 3. Now assume $\tilde\beta$ also maximizes the margin but has a bigger second-smallest margin than $\beta^*$. Then we can proceed exactly as in the proof of Theorem 3, considering only $n-1$ observations for each model and using our modified Lemma 4, to conclude that $\beta^*$ cannot be a convergence point (again note that the smallest-margin observation always contributes the same to the loss of both models).

In the case that the two smallest margins still do not define a unique solution, we can continue up the list of margins, applying this result recursively. The conclusion is that the limit of the normalized, $\ell_p$-regularized models maximizes the margins, and not just the minimal margin. The only case in which this convergence point is not unique is, therefore, the case where the whole order statistic of margins of the optimal separator is not unique. It is an interesting research question to investigate under which conditions this scenario is possible.
References

C. L. Blake and C. J. Merz. Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.

L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493-1517, 1999.

M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. In Computational Learning Theory, 2000.

D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? J. R. Statist. Soc. B, 57(2), 1995.

B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2), 2004.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37, 1995.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 2001.

J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337-407, 2000.

T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001.

O. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24(1-2):15-23, 1999.

L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In Neural Information Processing Systems, volume 12, 1999.

G. Rätsch, S. Mika, and M. K. Warmuth. On the convergence of leveraging. NeuroCOLT2 Technical Report 98, Royal Holloway College, London, 2001a.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, March 2001b.

G. Rätsch and M. K. Warmuth. Efficient margin maximization with boosting. Submitted to the Journal of Machine Learning Research, December.

S. Rosset and E. Segal. Boosting density estimation. In Advances in Neural Information Processing Systems (NIPS-02), 2002.

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26:1651-1686, 1998.

T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49, 2003.

T. Zhang and B. Yu. Boosting with early stopping: Convergence and results. Technical report, Department of Statistics, University of California, Berkeley, 2003.
