Boosting as a Regularized Path to a Maximum Margin Classifier


Journal of Machine Learning Research 5 (2004). Submitted 5/03; Revised 10/03; Published 8/04

Boosting as a Regularized Path to a Maximum Margin Classifier

Saharon Rosset, Data Analytics Research Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
Ji Zhu, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA 94305, USA

Editor: Robert Schapire

Abstract

In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l_1 constraint on the coefficient vector. This helps understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed, or equivalently as the boosting iterations proceed, the solution converges (in the separable case) to an "l_1-optimal" separating hyper-plane. We prove that this l_1-optimal separating hyper-plane has the property of maximizing the minimal l_1-margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, using a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.

Keywords: boosting, regularized optimization, support vector machines, margin maximization

1. Introduction and Outline

Boosting is a method for iteratively building an additive model

    F_T(x) = \sum_{t=1}^{T} α_t h_{j_t}(x),    (1)

where h_{j_t} ∈ H, a large (but we will assume finite) dictionary of candidate predictors or "weak learners", and h_{j_t} is the basis function selected as the best candidate to modify the function at stage t.

The model F_T can equivalently be represented by assigning a coefficient to each dictionary function h ∈ H rather than to the selected h_{j_t}'s only:

    F_T(x) = \sum_{j=1}^{J} h_j(x) β_j^{(T)},    (2)

where J = |H| and β_j^{(T)} = \sum_{t: j_t = j} α_t. The "β" representation allows us to interpret the coefficient vector β^{(T)} as a vector in R^J or, equivalently, as the hyper-plane which has β^{(T)} as its normal. This interpretation will play a key role in our exposition.

Some examples of common dictionaries are:

- The training variables themselves, in which case h_j(x) = x_j. This leads to our additive model F_T being just a linear model in the original data. The number of dictionary functions will be J = d, the dimension of x.

- A polynomial dictionary of degree p, in which case the number of dictionary functions will be J = \binom{p+d}{d}.

- Decision trees with up to k terminal nodes, if we limit the split points to data points (or midway between data points, as CART does). The number of possible trees is bounded from above (trivially) by J ≤ (np)^k 2^{k^2}. Note that regression trees do not fit into our framework, since they will give J = ∞.

The boosting idea was first introduced by Freund and Schapire (1995), with their AdaBoost algorithm. AdaBoost and other boosting algorithms have attracted a lot of attention due to their great success in data modeling tasks, and the mechanism which makes them work has been presented and analyzed from several perspectives. Friedman et al. (2000) develop a statistical perspective, which ultimately leads to viewing AdaBoost as a gradient-based incremental search for a good additive model (more specifically, it is a coordinate descent algorithm), using the exponential loss function C(y,F) = exp(−yF), where y ∈ {−1, 1}. The gradient boosting (Friedman, 2001) and AnyBoost (Mason et al., 1999) generic algorithms have used this approach to generalize the boosting idea to wider families of problems and loss functions. In particular, Friedman et al. (2000) have pointed out that the binomial log-likelihood loss C(y,F) = log(1 + exp(−yF)) is a more natural loss for classification, and is more "robust" to outliers and misspecified data.

A different analysis of boosting, originating in the machine learning community, concentrates on the effect of boosting on the margins y_i F(x_i). For example, Schapire et al. (1998) use margin-based arguments to prove convergence of boosting to perfect classification performance on the training data under general conditions, and to derive bounds on the generalization error (on future, unseen data).

In this paper we combine the two approaches, to conclude that gradient-based boosting can be described, in the separable case, as an approximate margin maximizing process. The view we develop of boosting as an approximate path of optimal solutions to regularized problems also justifies early stopping in boosting as specifying a value for the regularization parameter.

We consider the problem of minimizing non-negative convex loss functions (in particular the exponential and binomial log-likelihood loss functions) over the training data, with an l_1 bound on the model coefficients:

    \hat{β}(c) = arg min_{||β||_1 ≤ c} \sum_i C(y_i, h(x_i)'β),    (3)

where h(x_i) = [h_1(x_i), h_2(x_i), ..., h_J(x_i)]' and J = |H|.^1
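To make (3) concrete, here is a minimal sketch (ours, not the paper's) of how \hat{β}(c) could be computed for a small dictionary with an off-the-shelf solver, using the standard trick of splitting β into positive and negative parts so the l_1 constraint becomes a smooth linear one. The matrix H, the labels y and the constraint value c below are placeholder toy data.

```python
import numpy as np
from scipy.optimize import minimize

def l1_constrained_fit(H, y, c, loss="exp"):
    """Minimize sum_i C(y_i, h(x_i)'beta) subject to ||beta||_1 <= c.

    H is the n x J matrix with H[i, j] = h_j(x_i); y takes values in {-1, +1}.
    We write beta = u - v with u, v >= 0, so ||beta||_1 <= sum(u) + sum(v) <= c
    becomes a linear constraint that a smooth solver can handle.
    """
    J = H.shape[1]

    def total_loss(z):                           # z = [u, v]
        beta = z[:J] - z[J:]
        margins = y * (H @ beta)
        if loss == "exp":                        # exponential loss
            return np.exp(-margins).sum()
        return np.log1p(np.exp(-margins)).sum()  # binomial log-likelihood loss

    constraint = {"type": "ineq", "fun": lambda z: c - z.sum()}
    bounds = [(0.0, None)] * (2 * J)
    res = minimize(total_loss, np.zeros(2 * J), method="SLSQP",
                   bounds=bounds, constraints=[constraint])
    return res.x[:J] - res.x[J:]

# toy usage
rng = np.random.default_rng(0)
H = rng.normal(size=(30, 5))                     # 30 points, 5 dictionary functions
y = np.sign(H[:, 0] + 0.5 * H[:, 1])
print(l1_constrained_fit(H, y, c=2.0))
```

Tracing this over a grid of c values gives (approximately) the path \hat{β}(c) that the boosting iterations are claimed to follow.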

Hastie et al. (2001, Chapter 10) have observed that "slow" gradient-based boosting (i.e., we set α_t = ε, ∀t, in (1), with ε small) tends to follow the penalized path \hat{β}(c) as a function of c, under some mild conditions on this path. In other words, using the notation of (2) and (3), this implies that ||β^{(c/ε)} − \hat{β}(c)|| vanishes with ε, for all (or a wide range of) values of c. Figure 1 illustrates this equivalence between ε-boosting and the optimal solution of (3) on a real-life data set, using squared error loss as the loss function. In this paper we demonstrate this equivalence further and formally state it as a conjecture. Some progress towards proving this conjecture has been made by Efron et al. (2004), who prove a weaker "local" result for the case where C is squared error loss, under some mild conditions on the optimal path. We generalize their result to general convex loss functions.

Figure 1: Exact coefficient paths (left, "Lasso") for l_1-constrained squared error regression and boosting coefficient paths (right, "Forward Stagewise") on the data from a prostate cancer study. The left panel is plotted against ∑_j |\hat{β}_j(c)| and the right panel against the iteration number; the coefficient curves correspond to the predictors lcavol, svi, lweight, pgg45, lbph, gleason, age and lcp.

Combining the empirical and theoretical evidence, we conclude that boosting can be viewed as an approximate incremental method for following the l_1-regularized path. We then prove that in the separable case, for both the exponential and logistic log-likelihood loss functions, \hat{β}(c)/c converges as c → ∞ to an "optimal" separating hyper-plane \hat{β} described by

    \hat{β} = arg max_{||β||_1 = 1} min_i y_i β'h(x_i).    (4)

In other words, \hat{β} maximizes the minimal margin among all vectors with l_1-norm equal to 1.^2 This result generalizes easily to other l_p-norm constraints. For example, if p = 2, then \hat{β} describes the optimal separating hyper-plane in the Euclidean sense, i.e., the same one that a non-regularized support vector machine would find.

Footnotes:
1. Our notation assumes that the minimum in (3) is unique, which requires some mild assumptions. To avoid notational complications we use this slightly abusive notation throughout this paper. In Appendix B we give explicit conditions for uniqueness of this minimum.
2. The margin maximizing hyper-plane in (4) may not be unique, and we show that in that case the limit \hat{β} is still defined and it also maximizes the "second minimal margin". See Appendix B.2 for details.

Combining our two main results, we get the following characterization of boosting:

- ε-boosting can be described as a gradient-descent search, approximately following the path of l_1-constrained optimal solutions to its loss criterion, and converging, in the separable case, to a "margin maximizer" in the l_1 sense.

Note that boosting with a large dictionary H (in particular if n < J = |H|) guarantees that the data will be separable (except for pathologies), hence separability is a very mild assumption here.

As in the case of support vector machines in high dimensional feature spaces, the non-regularized "optimal" separating hyper-plane is usually of theoretical interest only, since it typically represents an over-fitted model. Thus, we would want to choose a "good" regularized model. Our results indicate that boosting gives a natural method for doing that, by "stopping early" in the boosting process. Furthermore, they point out the fundamental similarity between boosting and SVMs: both approaches allow us to fit regularized models in high-dimensional predictor space, using a computational trick. They differ in the regularization approach they take (exact l_2 regularization for SVMs, approximate l_1 regularization for boosting) and in the computational trick that facilitates fitting (the kernel trick for SVMs, coordinate descent for boosting).

1.1 Related Work

Schapire et al. (1998) have identified the normalized margins as distances from an l_1-normed separating hyper-plane. Their results relate the boosting iterations' success to the minimal margin of the combined model. Rätsch et al. (2001b) take this further using an asymptotic analysis of AdaBoost. They prove that the "normalized" minimal margin, min_i y_i ∑_t α_t h_t(x_i) / ∑_t |α_t|, is asymptotically equal for both classes. In other words, they prove that the asymptotic separating hyper-plane is equally far away from the closest points on either side. This is a property of the margin maximizing separating hyper-plane as we define it. Both papers also illustrate the margin maximizing effects of AdaBoost through experimentation. However, they both stop short of proving the convergence to optimal (margin maximizing) solutions.

Motivated by our result, Rätsch and Warmuth (2002) have recently asserted the margin-maximizing properties of ε-AdaBoost, using a different approach than the one used in this paper. Their results relate only to the asymptotic convergence of infinitesimal AdaBoost, compared to our analysis of the regularized path traced along the way and of a variety of boosting loss functions, which also leads to a convergence result on binomial log-likelihood loss.

The convergence of boosting to an optimal solution from a loss function perspective has been analyzed in several papers. Rätsch et al. (2001a) and Collins et al. (2000) give results and bounds on the convergence of the training-set loss, ∑_i C(y_i, ∑_t α_t h_t(x_i)), to its minimum. However, in the separable case, convergence of the loss to 0 is inherently different from convergence of the linear separator to the optimal separator. Any solution which separates the two classes perfectly can drive the exponential (or log-likelihood) loss to 0, simply by scaling its coefficients up linearly.

Two recent papers have made the connection between boosting and l_1 regularization in a slightly different context than this paper. Zhang (2003) suggests a shrinkage version of boosting which converges to l_1-regularized solutions, while Zhang and Yu (2003) illustrate the quantitative relationship between early stopping in boosting and l_1 constraints.

2. Boosting as Gradient Descent

Generic gradient-based boosting algorithms (Friedman, 2001; Mason et al., 1999) attempt to find a good linear combination of the members of some dictionary of basis functions to optimize a given loss function over a sample. This is done by searching, at each iteration, for the basis function which gives the "steepest descent" in the loss, and changing its coefficient accordingly. In other words, this is a coordinate descent algorithm in R^J, where we assign one dimension (or coordinate) to the coefficient of each dictionary function.

Assume we have data {x_i, y_i}_{i=1}^n with x_i ∈ R^d, a loss (or cost) function C(y,F), and a set of dictionary functions {h_j(x)}: R^d → R. Then all of these algorithms follow the same essential steps:

Algorithm 1  Generic gradient-based boosting algorithm

1. Set β^{(0)} = 0.
2. For t = 1 : T,
   (a) Let F_i = β^{(t−1)'} h(x_i), i = 1, ..., n (the current fit).
   (b) Set w_i = ∂C(y_i, F_i)/∂F_i, i = 1, ..., n.
   (c) Identify j_t = arg max_j |∑_i w_i h_j(x_i)|.
   (d) Set β^{(t)}_{j_t} = β^{(t−1)}_{j_t} − α_t sign(∑_i w_i h_{j_t}(x_i)) and β^{(t)}_k = β^{(t−1)}_k, k ≠ j_t.

Here β^{(t)} is the current coefficient vector and α_t > 0 is the current step size. Notice that

    ∑_i w_i h_{j_t}(x_i) = ∂ ∑_i C(y_i, F_i) / ∂β_{j_t}.

As we mentioned, Algorithm 1 can be interpreted simply as a coordinate descent algorithm in "weak learner" space. Implementation details include the dictionary H of weak learners, the loss function C(y,F), the method of searching for the optimal j_t, and the way in which α_t is determined.^3 For example, the original AdaBoost algorithm uses this scheme with the exponential loss C(y,F) = exp(−yF), and an implicit line search to find the "best" α_t once a direction j_t has been chosen (see Hastie et al., 2001; Mason et al., 1999). The dictionary used by AdaBoost in this formulation would be a set of candidate classifiers, i.e., h_j(x_i) ∈ {−1, +1}; usually decision trees are used in practice.

2.1 Practical Implementation of Boosting

The dictionaries used for boosting are typically very large (practically infinite), and therefore the generic boosting algorithm we have presented cannot be implemented verbatim. In particular, it is not practical to exhaustively search for the maximizer in step 2(c). Instead, an approximate, usually greedy search is conducted to find a "good" candidate weak learner h_{j_t} which makes the first order decline in the loss large (even if not maximal among all possible models).

Footnote: 3. The sign of the coefficient change will always be −sign(∑_i w_i h_{j_t}(x_i)), since we want the loss to be reduced. In most cases, the dictionary H is negation closed, and so it can be assumed WLOG that the coefficients are always positive and increasing.
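A minimal numpy sketch of Algorithm 1 in its ε-boosting form (α_t ≡ ε), assuming the finite dictionary is given as a pre-computed matrix H of evaluations h_j(x_i); the function and variable names are ours, not the paper's.

```python
import numpy as np

def epsilon_boost(H, y, loss="exp", eps=0.01, T=1000):
    """Generic coordinate-descent boosting (Algorithm 1) with a fixed step alpha_t = eps.

    H: n x J matrix with H[i, j] = h_j(x_i); y takes values in {-1, +1}.
    Returns the coefficient vector beta^{(T)}.
    """
    J = H.shape[1]
    beta = np.zeros(J)
    for t in range(T):
        F = H @ beta                              # (a) current fit F_i
        if loss == "exp":
            w = -y * np.exp(-y * F)               # (b) w_i = dC_e(y_i, F_i) / dF_i
        else:
            w = -y / (1.0 + np.exp(y * F))        # (b) w_i for the binomial log-likelihood
        grad = H.T @ w                            # sum_i w_i h_j(x_i), one entry per j
        j_t = int(np.argmax(np.abs(grad)))        # (c) coordinate of steepest descent
        beta[j_t] -= eps * np.sign(grad[j_t])     # (d) small step against the gradient
    return beta
```

Replacing the fixed eps in step (d) with a line search over the chosen coordinate, and using the exponential loss on a dictionary of classifiers, recovers AdaBoost in this formulation.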

In the common case that the dictionary of weak learners is comprised of decision trees with up to k nodes, the way AdaBoost and other boosting algorithms solve stage 2(c) is by building a decision tree on a re-weighted version of the data, with the weights |w_i|. Thus they first replace step 2(c) with minimization of

    ∑_i |w_i| 1{y_i ≠ h_{j_t}(x_i)},

which is easily shown to be equivalent to the original step 2(c). They then use a greedy decision-tree building algorithm such as CART or C5 to build a k-node decision tree which minimizes this quantity, i.e., achieves low "weighted misclassification error" on the weighted data. Since the tree is built greedily, one split at a time, it will not be the global minimizer of weighted misclassification error among all k-node decision trees. However, it will be a good fit for the re-weighted data, and can be considered an approximation to the optimal tree.

This use of approximate optimization techniques is critical, since much of the strength of the boosting approach comes from its ability to build additive models in very high-dimensional predictor spaces. In such spaces, standard exact optimization techniques are impractical: any approach which requires calculation and inversion of Hessian matrices is completely out of the question, and even approaches which require only first derivatives, such as coordinate descent, can only be implemented approximately.

2.2 Gradient-Based Boosting as a Generic Modeling Tool

As Friedman (2001) and Mason et al. (1999) mention, this view of boosting as gradient descent allows us to devise boosting algorithms for any function estimation problem: all we need is an appropriate loss and an appropriate dictionary of weak learners. For example, Friedman et al. (2000) suggested using the binomial log-likelihood loss instead of the exponential loss of AdaBoost for binary classification, resulting in the LogitBoost algorithm. However, there is no need to limit boosting algorithms to classification: Friedman (2001) applied this methodology to regression estimation, using squared error loss and regression trees, and Rosset and Segal (2003) applied it to density estimation, using the log-likelihood criterion and Bayesian networks as weak learners. Their experiments, and those of others, illustrate that the practical usefulness of this approach (coordinate descent in high dimensional predictor space) carries beyond classification, and even beyond supervised learning.

The view we present in this paper, of coordinate-descent boosting as approximate l_1-regularized fitting, offers some insight into why this approach would be good in general: it allows us to fit regularized models directly in high dimensional predictor space. In this it bears a conceptual similarity to support vector machines, which exactly fit an l_2-regularized model in high dimensional (RKHS) predictor space.

2.3 Loss Functions

The two most commonly used loss functions for boosting classification models are the exponential and the (minus) binomial log-likelihood:

    Exponential:      C_e(y,F) = exp(−yF);
    Log-likelihood:   C_l(y,F) = log(1 + exp(−yF)).

These two loss functions bear some important similarities to each other. As Friedman et al. (2000) show, the population minimizer of expected loss at point x is similar for both loss functions and is given by

    \hat{F}(x) = c · log [ P(y = 1|x) / P(y = −1|x) ],

where c_e = 1/2 for exponential loss and c_l = 1 for binomial loss.
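As a quick numerical sanity check (ours, not the paper's), one can minimize the expected loss at a fixed conditional probability p = P(y = 1 | x) and compare with c · log(p/(1−p)):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def expected_loss(F, p, loss):
    """E[C(y, F)] when P(y = 1) = p, for the exponential or binomial loss."""
    if loss == "exp":
        return p * np.exp(-F) + (1 - p) * np.exp(F)
    return p * np.log1p(np.exp(-F)) + (1 - p) * np.log1p(np.exp(F))

p = 0.8
for loss, c in [("exp", 0.5), ("log", 1.0)]:
    F_star = minimize_scalar(expected_loss, args=(p, loss),
                             bounds=(-10, 10), method="bounded").x
    print(loss, round(F_star, 4), round(c * np.log(p / (1 - p)), 4))
# both population minimizers match c * log(p / (1 - p)), with c_e = 1/2 and c_l = 1
```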

Figure 2: The two classification loss functions, exponential and logistic.

More importantly for our purpose, we have the following simple proposition, which illustrates the strong similarity between the two loss functions for positive margins (i.e., correct classifications):

Proposition 1

    yF ≥ 0 ⟹ 0.5 C_e(y,F) ≤ C_l(y,F) ≤ C_e(y,F).    (5)

In other words, the two losses become similar if the margins are positive, and both behave like exponentials.

Proof  Consider the functions f_1(z) = z and f_2(z) = log(1+z) for z ∈ [0,1]. Then f_1(0) = f_2(0) = 0, and for z ∈ [0,1]

    f_1'(z) = 1 ≥ f_2'(z) = 1/(1+z) ≥ 1/2 = (1/2) f_1'(z).

Thus we can conclude 0.5 f_1(z) ≤ f_2(z) ≤ f_1(z). Now set z = exp(−yF) and we get the desired result.

For negative margins the behaviors of C_e and C_l are very different, as Friedman et al. (2000) have noted. In particular, C_l is more robust against outliers and misspecified data.

2.4 Line-Search Boosting vs. ε-Boosting

As mentioned above, AdaBoost determines α_t using a line search. In our notation for Algorithm 1 this would be

    α_t = arg min_α ∑_i C(y_i, F_i + α h_{j_t}(x_i)).
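For a chosen direction j_t this one-dimensional problem can be handed to any scalar optimizer; a sketch with our own names follows, with the search bounded because in the separable case the unconstrained minimizer can be infinite.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(F, h_jt, y, loss="exp", alpha_max=100.0):
    """Approximate alpha_t = argmin_alpha sum_i C(y_i, F_i + alpha * h_jt(x_i))."""
    def total_loss(alpha):
        margins = y * (F + alpha * h_jt)
        if loss == "exp":
            return np.exp(-margins).sum()
        return np.log1p(np.exp(-margins)).sum()
    # bounded search; the cap matters when the loss keeps decreasing in this direction
    return minimize_scalar(total_loss, bounds=(0.0, alpha_max), method="bounded").x
```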

The alternative approach, suggested by Friedman (2001) and Hastie et al. (2001), is to "shrink" all α_t to a single small value ε. This may slow down learning considerably (depending on how small ε is), but is attractive theoretically: the first-order theory underlying gradient boosting implies that the weak learner chosen is the best increment only locally. It can also be argued that this approach is "stronger" than line search, as we can keep selecting the same h_{j_t} repeatedly if it remains optimal, and so ε-boosting "dominates" line-search boosting in terms of training error. In practice, this approach of "slowing the learning rate" usually performs better than line search in terms of prediction error as well (see Friedman, 2001). For our purposes, we will mostly assume ε is infinitesimally small, so the theoretical boosting algorithm which results is the limit of a series of boosting algorithms with shrinking ε.

In regression terminology, the line-search version is equivalent to forward stage-wise modeling, infamous in the statistics literature for being too greedy and highly unstable (see Friedman, 2001). This is intuitively obvious, since by increasing the coefficient until it "saturates" we are destroying signal which may help us select other good predictors.

3. l_p Margins, Support Vector Machines and Boosting

We now introduce the concept of margins as a geometric interpretation of a binary classification model. In the context of boosting, this view offers a different understanding of AdaBoost from the gradient descent view presented above. In the following sections we connect the two views.

3.1 The Euclidean Margin and the Support Vector Machine

Consider a classification model in high dimensional predictor space: F(x) = ∑_j h_j(x) β_j. We say that the model separates the training data {x_i, y_i}_{i=1}^n if sign(F(x_i)) = y_i, ∀i. From a geometrical perspective this means that the hyper-plane defined by F(x) = 0 is a separating hyper-plane for this data, and we define its (Euclidean) margin as

    m_2(β) = min_i y_i F(x_i) / ||β||_2.    (6)

The margin-maximizing separating hyper-plane for this data would be defined by the β which maximizes m_2(β). Figure 3 shows a simple example of separable data in two dimensions, with its margin-maximizing separating hyper-plane. The Euclidean margin-maximizing separating hyper-plane is the (non-regularized) support vector machine solution. Its margin maximizing properties play a central role in deriving generalization error bounds for these models, and form the basis for a rich literature.

3.2 The l_1 Margin and Its Relation to Boosting

Instead of considering the Euclidean margin as in (6) we can define an l_p margin concept as

    m_p(β) = min_i y_i F(x_i) / ||β||_p.    (7)

Of particular interest to us is the case p = 1. Figure 4 shows the l_1 margin maximizing separating hyper-plane for the same simple example as Figure 3.
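Definitions (6) and (7) translate directly into code; a tiny sketch with our own names:

```python
import numpy as np

def margin(beta, H, y, p=2):
    """Minimal l_p margin m_p(beta) = min_i y_i h(x_i)'beta / ||beta||_p."""
    return np.min(y * (H @ beta)) / np.linalg.norm(beta, ord=p)

# margin(beta, H, y, p=2) is the Euclidean margin (6);
# margin(beta, H, y, p=1) is the l_1 margin (7) that boosting is related to.
```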

Figure 3: A simple data example, with two observations from class "O" and two observations from class "X". The full line is the Euclidean margin-maximizing separating hyper-plane.

Figure 4: l_1 margin-maximizing separating hyper-plane for the same data set as Figure 3. The difference between the diagonal Euclidean optimal separator and the vertical l_1 optimal separator illustrates the "sparsity" effect of optimal l_1 separation.

Note the fundamental difference between the two solutions: the l_2-optimal separator is diagonal, while the l_1-optimal one is vertical. To understand why this is so, we can relate the two margin definitions to each other as

    yF(x)/||β||_1 = (yF(x)/||β||_2) · (||β||_2/||β||_1).    (8)

From this representation we can observe that the l_1 margin will tend to be big if the ratio ||β||_2/||β||_1 is big. This ratio will generally be big if β is sparse. To see this, consider fixing the l_1 norm of the vector and then comparing the l_2 norm of two candidates: one with many small components, and the other a sparse one with a few large components and many zero components. It is easy to see that the second vector will have bigger l_2 norm, and hence (if the l_2 margin for both vectors is equal) a bigger l_1 margin.

A different perspective on the difference between the optimal solutions is given by a theorem due to Mangasarian (1999), which states that the l_p-margin maximizing separating hyper-plane maximizes the l_q distance from the closest points to the separating hyper-plane, with 1/p + 1/q = 1. Thus the Euclidean optimal separator (p = 2) also maximizes Euclidean distance between the points and the hyper-plane, while the l_1 optimal separator maximizes l_∞ distance. This interesting result gives another intuition why l_1 optimal separating hyper-planes tend to be coordinate-oriented (i.e., have sparse representations): since l_∞ projection considers only the largest coordinate distance, some coordinate distances may be 0 at no cost of decreased l_∞ distance.

Schapire et al. (1998) have pointed out the relation between AdaBoost and the l_1 margin. They prove that, in the case of separable data, the boosting iterations increase the "boosting" margin of the model, defined as

    min_i y_i F(x_i) / ||α||_1.    (9)

In other words, this is the l_1 margin of the model, except that it uses the α incremental representation rather than the β geometric representation for the model. The two representations give the same l_1 norm if there is "sign consistency", or monotonicity in the coefficient paths traced by the model, i.e., if at every iteration t of the boosting algorithm

    β_{j_t} ≠ 0 ⟹ sign(α_t) = sign(β_{j_t}).    (10)

As we will see later, this monotonicity condition will play an important role in the equivalence between boosting and l_1 regularization.

The l_1-margin maximization view of AdaBoost presented by Schapire et al. (1998), and a whole plethora of papers that followed, is important for the analysis of boosting algorithms for two distinct reasons:

- It gives an intuitive, geometric interpretation of the model that AdaBoost is looking for: a model which separates the data well in this l_1-margin sense. Note that the view of boosting as gradient descent in a loss criterion doesn't really give the same kind of intuition: if the data is separable, then any model which separates the training data will drive the exponential or binomial loss to 0 when scaled up:

      m_1(β) > 0 ⟹ ∑_i C(y_i, d β'h(x_i)) → 0 as d → ∞.

- The l_1-margin behavior of a classification model on its training data facilitates generation of generalization (or prediction) error bounds, similar to those that exist for support vector machines (Schapire et al., 1998). The important quantity in this context is not the margin but the normalized margin, which considers the conjugate norm of the predictor vectors:

      y_i β'h(x_i) / (||β||_1 ||h(x_i)||_∞).

  When the dictionary we are using is comprised of classifiers then ||h(x_i)||_∞ ≡ 1 always, and thus the l_1 margin is exactly the relevant quantity. The error bounds described by Schapire et al. (1998) allow using the whole l_1 margin distribution, not just the minimal margin. However, boosting's tendency to separate well in the l_1 sense is a central motivation behind their results.

From a statistical perspective, however, we should be suspicious of margin-maximization as a method for building good prediction models in high dimensional predictor space. Margin maximization in high dimensional space is likely to lead to over-fitting and bad prediction performance. This has been observed in practice by many authors, in particular Breiman (1999). Our results in the next two sections suggest an explanation based on model complexity: margin maximization is the limit of parametric regularized optimization models, as the regularization vanishes, and the regularized models along the path may well be superior to the margin maximizing limiting model, in terms of prediction performance. In Section 7 we return to discuss these issues in more detail.

4. Boosting as Approximate Incremental l_1-Constrained Fitting

In this section we introduce an interpretation of the generic coordinate-descent boosting algorithm as tracking a path of approximate solutions to l_1-constrained (or equivalently, regularized) versions of its loss criterion. This view serves our understanding of what boosting does, in particular the connection between early stopping in boosting and regularization. We will also use this view to get a result about the asymptotic margin-maximization of regularized classification models, and by analogy of classification boosting. We build on ideas first presented by Hastie et al. (2001, Chapter 10) and Efron et al. (2004).

Given a convex non-negative loss criterion C(·,·), consider the 1-dimensional path of optimal solutions to l_1-constrained optimization problems over the training data:

    \hat{β}(c) = arg min_{||β||_1 ≤ c} ∑_i C(y_i, h(x_i)'β).    (11)

As c varies, \hat{β}(c) traces a 1-dimensional optimal curve through R^J. If an optimal solution for the non-constrained problem exists and has finite l_1 norm c_0, then obviously \hat{β}(c) = \hat{β}(c_0) = \hat{β} for all c > c_0. In the case of separable 2-class data, using either C_e or C_l, there is no finite-norm optimal solution; rather, the constrained solution will always have ||\hat{β}(c)||_1 = c.

A different way of building a solution which has l_1 norm c is to run our ε-boosting algorithm for c/ε iterations. This will give an α^{(c/ε)} vector which has l_1 norm exactly c. For the norm of the geometric representation β^{(c/ε)} to also be equal to c, we need the monotonicity condition (10) to hold as well. This condition will play a key role in our exposition.

We are going to argue that the two solution paths \hat{β}(c) and β^{(c/ε)} are very similar for ε small. Let us start by observing this similarity in practice. Figure 1 in the introduction shows an example of this similarity for squared error loss fitting with the l_1 (lasso) penalty.
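For squared error loss such a comparison is easy to set up. The following rough sketch (ours, on made-up data) pairs sklearn's lasso_path, which computes exact l_1-regularized solutions, with an incremental forward-stagewise (ε-boosting) loop:

```python
import numpy as np
from sklearn.linear_model import lasso_path

def forward_stagewise(X, y, eps=0.001, steps=5000):
    """epsilon-boosting with squared error loss and the raw variables as the dictionary."""
    beta = np.zeros(X.shape[1])
    path = []
    for _ in range(steps):
        r = y - X @ beta                    # current residuals
        corr = X.T @ r                      # negative gradient, one entry per coordinate
        j = int(np.argmax(np.abs(corr)))    # steepest coordinate
        beta[j] += eps * np.sign(corr[j])   # increment the chosen coefficient by eps
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

stagewise_path = forward_stagewise(X, y)
alphas, lasso_coefs, _ = lasso_path(X, y)   # exact l1-regularized coefficient paths
# plotting both sets of coefficients against their l1 norm gives the flavor of
# Figures 1 and 5: the paths agree while the coefficient curves stay monotone.
```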

Figure 5: Another example of the equivalence between the Lasso optimal solution path (left) and ε-boosting with squared error loss (right, "Stagewise"). Note that the equivalence breaks down when the path of variable 7 becomes non-monotone.

Figure 5 shows another example in the same mold, taken from Efron et al. (2004). The data come from a diabetes study and the dictionary used is just the original 10 variables. The panel on the left shows the path of optimal l_1-constrained solutions \hat{β}(c) and the panel on the right shows the ε-boosting path with the 10-dimensional dictionary (the total number of boosting iterations is about 6000). The 1-dimensional path through R^10 is described by 10 coordinate curves, corresponding to each one of the variables. The interesting phenomenon we observe is that the two coefficient traces are not completely identical. Rather, they agree up to the point where the coefficient path of variable 7 becomes non-monotone, i.e., it violates (10) (this point is where variable 8 comes into the model; see the arrow on the right panel). This example illustrates that the monotonicity condition, and its implication that ||α||_1 = ||β||_1, is critical for the equivalence between ε-boosting and l_1-constrained optimization.

The two examples we have seen so far have used squared error loss, and we should ask ourselves whether this equivalence stretches beyond this loss. Figure 6 shows a similar result, but this time for the binomial log-likelihood loss, C_l. We used the spam data set, taken from the UCI repository (Blake and Merz, 1998). We chose only 5 predictors of the 57 to make the plots more interpretable and the computations more accommodating. We see that there is a perfect equivalence between the exact constrained solution (i.e., regularized logistic regression) and ε-boosting in this case, since the paths are fully monotone.

Figure 6: Exact coefficient paths (left) for l_1-constrained logistic regression and boosting coefficient paths (right) with binomial log-likelihood loss, on five variables from the spam data set. The boosting path was generated using a small ε and 7000 iterations.

To justify why this observed equivalence is not surprising, let us consider the following l_1-locally optimal monotone direction problem of finding the best "monotone" ε-increment to a given model β_0:

    min_β C(β)    (12)
    s.t. ||β||_1 − ||β_0||_1 ≤ ε,
         |β| ≥ |β_0| (component-wise).

Here we use C(β) as shorthand for ∑_i C(y_i, h(x_i)'β). A first order Taylor expansion gives us

    C(β) = C(β_0) + ∇C(β_0)'(β − β_0) + O(ε^2).

Given the l_1 constraint on the increase in ||β||_1, it is easy to see that a first-order optimal solution (and therefore an optimal solution as ε → 0) will make a coordinate descent step, i.e.,

    β_j ≠ β_{0,j} ⟹ |∇C(β_0)_j| = max_k |∇C(β_0)_k|,

assuming the signs match, i.e., sign(β_{0,j}) = sign(−∇C(β_0)_j). So we get that if the optimal solution to (12) without the monotonicity constraint happens to be monotone, then it is equivalent to a coordinate descent step. And so it is reasonable to expect that if the optimal l_1-regularized path is monotone (as it indeed is in Figures 1 and 6), then an infinitesimal ε-boosting algorithm would follow the same path of solutions. Furthermore, even if the optimal path is not monotone, we can still use the formulation (12) to argue that ε-boosting would tend to follow an approximate l_1-regularized path. The main difference between the ε-boosting path and the true optimal path is that the boosting path will tend to "delay" becoming non-monotone, as we observe for variable 7 in Figure 5. To understand this specific phenomenon would require analysis of the true optimal path, which falls outside the scope of our discussion; Efron et al. (2004) cover the subject for squared error loss, and their discussion applies to any continuously differentiable convex loss, using second-order approximations.

We can employ this understanding of the relationship between boosting and l_1 regularization to construct l_p boosting algorithms by changing the coordinate-selection criterion in the coordinate descent algorithm. We will get back to this point in Section 7, where we design an l_2 boosting algorithm.

The experimental evidence and heuristic discussion we have presented lead us to the following conjecture, which connects slow boosting and l_1-regularized optimization:

Conjecture 2  Consider applying the ε-boosting algorithm to any convex loss function, generating a path of solutions β^{(ε)}(t). Then if the optimal coefficient paths are monotone for all c < c_0, i.e., if for each j, |\hat{β}(c)_j| is non-decreasing in the range c < c_0, then

    lim_{ε→0} β^{(ε)}(c_0/ε) = \hat{β}(c_0).

Efron et al. (2004, Theorem 2) prove a weaker "local" result for the case of squared error loss only. We generalize their result to any convex loss. However, this result still does not prove the global convergence which the conjecture claims and the empirical evidence implies. For the sake of brevity and readability, we defer this proof, together with a concise mathematical definition of the different types of convergence, to Appendix A.

In the context of real-life boosting, where the number of basis functions is usually very large, and making ε small enough for the theory to apply would require running the algorithm forever, these results should not be considered directly applicable. Instead, they should be taken as an intuitive indication that boosting, especially the ε version, is indeed approximating optimal solutions to the constrained problems it encounters along the way.

5. l_p-Constrained Classification Loss Functions

Having established the relation between boosting and l_1 regularization, we are going to turn our attention to the regularized optimization problem. By analogy, our results will apply to boosting as well. We concentrate on C_e and C_l, the two classification losses defined above, and the solution paths of their l_p-constrained versions:

    \hat{β}^{(p)}(c) = arg min_{||β||_p ≤ c} ∑_i C(y_i, β'h(x_i)),    (13)

where C is either C_e or C_l. As we discussed below Equation (11), if the training data is separable in span(H), then we have ||\hat{β}^{(p)}(c)||_p = c for all values of c. Consequently ||\hat{β}^{(p)}(c)/c||_p = 1.

We may ask what the convergence points of this sequence are as c → ∞. The following theorem shows that these convergence points describe l_p-margin maximizing separating hyper-planes.

Theorem 3  Assume the data is separable, i.e., there exists β such that y_i β'h(x_i) > 0 for all i. Then for both C_e and C_l, every convergence point of \hat{β}^{(p)}(c)/c corresponds to an l_p-margin-maximizing separating hyper-plane. If the l_p-margin-maximizing separating hyper-plane is unique, then it is the unique convergence point, i.e.,

    \hat{β}^{(p)} = lim_{c→∞} \hat{β}^{(p)}(c)/c = arg max_{||β||_p = 1} min_i y_i β'h(x_i).    (14)

Proof  This proof applies to both C_e and C_l, given the property in (5). Consider two separating candidates β_1 and β_2 such that ||β_1||_p = ||β_2||_p = 1. Assume that β_1 separates better, i.e.,

    m_1 := min_i y_i β_1'h(x_i) > m_2 := min_i y_i β_2'h(x_i) > 0.

Then we have the following simple lemma:

Lemma 4  There exists some D = D(m_1, m_2) such that for all d > D, dβ_1 incurs smaller loss than dβ_2; in other words,

    ∑_i C(y_i, dβ_1'h(x_i)) < ∑_i C(y_i, dβ_2'h(x_i)).

Given this lemma, we can now prove that any convergence point of \hat{β}^{(p)}(c)/c must be an l_p-margin maximizing separator. Assume β* is a convergence point of \hat{β}^{(p)}(c)/c. Denote its minimal margin on the data by m*. If the data is separable, clearly m* > 0 (since otherwise the loss of dβ* does not even converge to 0 as d → ∞).

Now, assume some \tilde{β} with ||\tilde{β}||_p = 1 has bigger minimal margin \tilde{m} > m*. By continuity of the minimal margin in β, there exists some open neighborhood of β*,

    N_{β*} = {β : ||β − β*||_2 < δ},

and an ε > 0, such that

    min_i y_i β'h(x_i) < \tilde{m} − ε, ∀β ∈ N_{β*}.

Now by the lemma we get that there exists some D = D(\tilde{m}, \tilde{m} − ε) such that d\tilde{β} incurs smaller loss than dβ for any d > D, β ∈ N_{β*}. Therefore β* cannot be a convergence point of \hat{β}^{(p)}(c)/c.

We conclude that any convergence point of the sequence \hat{β}^{(p)}(c)/c must be an l_p-margin maximizing separator. If the margin maximizing separator is unique then it is the only possible convergence point, and therefore

    \hat{β}^{(p)} = lim_{c→∞} \hat{β}^{(p)}(c)/c = arg max_{||β||_p = 1} min_i y_i β'h(x_i).

Proof of Lemma 4  Using (5) and the definition of C_e, we get for both loss functions

    ∑_i C(y_i, dβ_1'h(x_i)) ≤ n exp(−d m_1).

Now, since β_1 separates better, we can find our desired constant, namely

    D = D(m_1, m_2) = (log n + log 2) / (m_1 − m_2),

such that

    ∀d > D, n exp(−d m_1) < 0.5 exp(−d m_2).

And using (5) and the definition of C_e again we can write

    0.5 exp(−d m_2) ≤ ∑_i C(y_i, dβ_2'h(x_i)).

Combining these three inequalities we get our desired result:

    ∀d > D, ∑_i C(y_i, dβ_1'h(x_i)) < ∑_i C(y_i, dβ_2'h(x_i)).

We thus conclude that if the l_p-margin maximizing separating hyper-plane is unique, the normalized constrained solution converges to it. In the case that the margin maximizing separating hyper-plane is not unique, we can in fact prove a stronger result, which indicates that the limit of the regularized solutions would then be determined by the second smallest margin, then by the third, and so on. This result is mainly of technical interest and we prove it in Appendix B, Section 2.

5.1 Implications of Theorem 3

We now briefly discuss the implications of this theorem for boosting and logistic regression.

Boosting Implications

Combined with our results from Section 4, Theorem 3 indicates that the normalized boosting path β^{(t)} / ∑_{u ≤ t} α_u, with either C_e or C_l used as loss, approximately converges to a separating hyper-plane \hat{β} which attains

    max_{||β||_1 = 1} min_i y_i β'h(x_i) = max_{||β||_1 = 1} ||β||_2 min_i y_i d_i,    (15)

where d_i is the (signed) Euclidean distance from the training point to the separating hyper-plane. In other words, it maximizes Euclidean distance scaled by an l_2 norm. As we have mentioned already, this implies that the asymptotic boosting solution will tend to be sparse in representation, due to the fact that for fixed l_1 norm, the l_2 norm of vectors that have many 0 entries will generally be larger. In fact, under rather mild conditions, the asymptotic solution \hat{β} = lim_{c→∞} \hat{β}^{(1)}(c)/c will have at most n (the number of observations) non-zero coefficients, if we use either C_l or C_e as the loss. See Appendix B, Section 1 for a proof.

Logistic Regression Implications

Recall that the logistic regression (maximum likelihood) solution is undefined if the data is separable in the Euclidean space spanned by the predictors. Theorem 3 allows us to define a logistic regression solution for separable data, as follows:

1. Set a high constraint value c_max.
2. Find \hat{β}^{(p)}(c_max), the solution to the logistic regression problem subject to the constraint ||β||_p ≤ c_max. The problem is convex for any p ≥ 1 and differentiable for any p > 1, so interior point methods can be used to solve it.
3. Now you have (approximately) the l_p-margin maximizing solution for this data, described by \hat{β}^{(p)}(c_max)/c_max.

This is a solution to the original problem in the sense that it is, approximately, the convergence point of the normalized l_p-constrained solutions, as the constraint is relaxed.
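A sketch of this recipe for p = 2, with our own function name and a placeholder c_max (a general-purpose constrained solver stands in for the interior point method mentioned above):

```python
import numpy as np
from scipy.optimize import minimize

def approx_l2_margin_maximizer(H, y, c_max=30.0):
    """Steps 1-3 above for p = 2: l2-constrained logistic regression, then normalize."""
    def total_loss(beta):
        return np.log1p(np.exp(-y * (H @ beta))).sum()   # binomial log-likelihood loss
    # ||beta||_2 <= c_max written as a smooth inequality constraint
    constraint = {"type": "ineq", "fun": lambda b: c_max ** 2 - b @ b}
    res = minimize(total_loss, np.zeros(H.shape[1]),
                   method="SLSQP", constraints=[constraint])
    beta = res.x
    return beta / np.linalg.norm(beta)   # approximate direction of the l2-margin maximizer
```

Larger values of c_max approximate the limit better, at the cost of a flatter and numerically harder optimization problem.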

Of course, with our result from Theorem 3 it would probably make more sense to simply find the optimal separating hyper-plane directly: this is a linear programming problem for l_1 separation and a quadratic programming problem for l_2 separation. We can then consider this optimal separator as a logistic regression solution for the separable data.

6. Examples

We now apply boosting to several data sets and interpret the results in light of our regularization and margin-maximization view.

6.1 Spam Data Set

We now know that if the data are separable and we let boosting run forever, we will approach the same optimal separator for both C_e and C_l. However, if we stop early, or if the data is not separable, the behavior of the two loss functions may differ significantly, since C_e weighs negative margins exponentially, while C_l is approximately linear in the margin for large negative margins (see Friedman et al., 2000). Consequently, we can expect C_e to concentrate more on the "hard" training data, in particular in the non-separable case.

Figure 7 illustrates the behavior of ε-boosting with both loss functions, as well as that of AdaBoost, on the spam data set (57 predictors, binary response). We used 10-node trees and ε = 0.1. The left plot shows the minimal margin as a function of the l_1 norm of the coefficient vector, ||β||_1. The binomial loss creates a bigger minimal margin initially, but the minimal margins for both loss functions converge asymptotically. AdaBoost initially lags behind but catches up nicely and reaches the same minimal margin asymptotically. The right plot shows the test error as the iterations proceed, illustrating that both ε-methods indeed seem to over-fit eventually, even as their separation (minimal margin) is still improving. AdaBoost did not significantly over-fit in the 1000 iterations it was allowed to run, but it obviously would have if it were allowed to run on.

Figure 7: Behavior of boosting with the two loss functions on the spam data set: minimal margin (left) and test error (right) for the exponential loss, the logistic loss and AdaBoost, plotted against ||β||_1.

We should emphasize that the comparison between AdaBoost and ε-boosting presented here considers the l_1 norm, not the number of iterations, as the basis for comparison. In terms of computational complexity, as represented by the number of iterations, AdaBoost reaches both a large minimal margin and good prediction performance much more quickly than the "slow" boosting approaches, as AdaBoost tends to take larger steps.

6.2 Simulated Data

To make a more educated comparison and a more compelling visualization, we have constructed an example of separation of 2-dimensional data using an 8-th degree polynomial dictionary (45 functions). The data consist of 50 observations of each class, drawn from a mixture of Gaussians, and are presented in Figure 8. Also presented, in the solid line, is the optimal l_1 separator for this data in this dictionary (easily calculated as a linear programming problem; note the difference from the l_2-optimal decision boundary, presented in Section 7.1, Figure 11). The optimal l_1 separator has only 12 non-zero coefficients out of 45.
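The linear program behind this solid line can be written down directly; here is a sketch (our formulation and names) using scipy.optimize.linprog, with β split into positive and negative parts so the l_1 constraint is linear:

```python
import numpy as np
from scipy.optimize import linprog

def l1_margin_separator(H, y):
    """Solve max_{||beta||_1 <= 1} min_i y_i h(x_i)'beta as a linear program.

    Variables are [u, v, m] with beta = u - v, u, v >= 0, and m the minimal margin.
    """
    n, J = H.shape
    yH = y[:, None] * H
    # margin constraints: m - y_i h(x_i)'(u - v) <= 0 for every i
    A_margin = np.hstack([-yH, yH, np.ones((n, 1))])
    # norm constraint: sum(u) + sum(v) <= 1
    A_norm = np.hstack([np.ones((1, 2 * J)), np.zeros((1, 1))])
    A_ub = np.vstack([A_margin, A_norm])
    b_ub = np.concatenate([np.zeros(n), [1.0]])
    c = np.zeros(2 * J + 1)
    c[-1] = -1.0                                  # maximize m by minimizing -m
    bounds = [(0, None)] * (2 * J) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    u, v, m = res.x[:J], res.x[J:2 * J], res.x[-1]
    return u - v, m                               # beta and its minimal l1 margin
```

In the separable case the returned m is positive and equals the maximal l_1 margin m_1(β).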

Figure 8: Artificial data set with the l_1-margin maximizing separator (solid), and boosting models after 10^5 iterations (dashed) and 3·10^6 iterations (dotted), using ε = 0.001. We observe the convergence of the boosting separator to the optimal separator.

We ran an ε-boosting algorithm on this data set, using the logistic log-likelihood loss C_l, with ε = 0.001, and Figure 8 shows two of the models generated, after 10^5 and 3·10^6 iterations. We see that the models seem to converge to the optimal separator. A different view of this convergence is given in Figure 9, where we see two measures of convergence: the minimal margin (left; the maximum value obtainable is the horizontal line) and the l_1-norm distance between the normalized models (right), given by

    ∑_j | \hat{β}_j − β^{(t)}_j / ||β^{(t)}||_1 |,

where \hat{β} is the optimal separator with l_1 norm 1 and β^{(t)} is the boosting model after t iterations. We can conclude that on this simple artificial example we get nice convergence of the logistic-boosting model path to the l_1-margin maximizing separating hyper-plane.

We can also use this example to illustrate the similarity between the boosted path and the path of l_1 optimal solutions, as we have discussed in Section 4.

Figure 9: Two measures of convergence of the boosting model path to the optimal l_1 separator: minimal margin (left) and l_1 distance between the normalized boosting coefficient vector and the optimal model (right), both plotted against ||β||_1.

Figure 10: Comparison of the decision boundaries of boosting models (broken) and of optimal constrained solutions with the same norm (full), at l_1 norms 20, 350, 2701 and 5401.

Figure 10 shows the class decision boundaries for 4 models generated along the boosting path, compared to the optimal solutions of the constrained logistic regression problem with the same bound on the l_1 norm of the coefficient vector. We observe the clear similarities in the way the solutions evolve and converge to the optimal l_1 separator. The fact that they differ (in some cases significantly) is not surprising if we recall the monotonicity condition presented in Section 4 for exact correspondence between the two model paths. In this case, if we look at the coefficient paths (not shown), we observe that the monotonicity condition is consistently violated in the low norm ranges, and hence we can expect the paths to be similar in spirit but not identical.

7. Discussion

We can now summarize what we have learned about boosting from the previous sections:

- Boosting approximately follows the path of l_1-regularized models for its loss criterion.
- If the loss criterion is the exponential loss of AdaBoost or the binomial log-likelihood loss of logistic regression, then the l_1-regularized model converges to an l_1-margin maximizing separating hyper-plane, if the data are separable in the span of the weak learners.

We may ask which of these two points is the key to the success of boosting approaches. One empirical clue to answering this question can be found in Breiman (1999), who programmed an algorithm to directly maximize the margins. His results were that his algorithm consistently got significantly higher minimal margins than AdaBoost on many data sets (and, in fact, a higher margin distribution beyond the minimal margin), but had slightly worse prediction performance. His conclusion was that margin maximization is not the key to AdaBoost's success. From a statistical perspective we can embrace this conclusion, as reflecting the importance of regularization in high-dimensional predictor space. By our results from the previous sections, margin maximization can be viewed as the limit of parametric regularized models, as the regularization vanishes.^4 Thus we would generally expect the margin maximizing solutions to perform worse than regularized models. In the case of boosting, regularization would correspond to early stopping of the boosting algorithm.

7.1 Boosting and SVMs as Regularized Optimization in High-dimensional Predictor Spaces

Our exposition has led us to view boosting as an approximate way to solve the regularized optimization problem

    min_β ∑_i C(y_i, β'h(x_i)) + λ||β||_1,    (16)

which converges as λ → 0 to \hat{β}^{(1)}, if our loss is C_e or C_l. In general, the loss C can be any convex differentiable loss and should be defined to match the problem domain.

Support vector machines can be described as solving the regularized optimization problem (see Hastie et al., 2001, Chapter 12)

    min_β ∑_i (1 − y_i β'h(x_i))_+ + λ||β||_2^2,    (17)

which converges as λ → 0 to the non-regularized support vector machine solution, i.e., the optimal Euclidean separator, which we denoted by \hat{β}^{(2)}.

An interesting connection exists between these two approaches, in that they allow us to solve the regularized optimization problem in high dimensional predictor space:

Footnote: 4. It can be argued that margin-maximizing models are still regularized in some sense, as they minimize a norm criterion among all separating models. This is arguably the property which still allows them to generalize reasonably well in many cases.
