Boosting as a Regularized Path to a Maximum Margin Classifier


Journal of Machine Learning Research 5 (2004). Submitted 5/03; Revised 10/03; Published 8/04.

Boosting as a Regularized Path to a Maximum Margin Classifier

Saharon Rosset, Data Analytics Research Group, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
Ji Zhu, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA
Trevor Hastie, Department of Statistics, Stanford University, Stanford, CA 94305, USA

Editor: Robert Schapire

Abstract

In this paper we study boosting methods from a new perspective. We build on recent work by Efron et al. to show that boosting approximately (and in some cases exactly) minimizes its loss criterion with an l_1 constraint on the coefficient vector. This helps understand the success of boosting with early stopping as regularized fitting of the loss criterion. For the two most commonly used criteria (exponential and binomial log-likelihood), we further show that as the constraint is relaxed (or, equivalently, as the boosting iterations proceed) the solution converges, in the separable case, to an "l_1-optimal" separating hyperplane. We prove that this l_1-optimal separating hyperplane has the property of maximizing the minimal l_1 margin of the training data, as defined in the boosting literature. An interesting fundamental similarity between boosting and kernel support vector machines emerges, as both can be described as methods for regularized optimization in high-dimensional predictor space, using a computational trick to make the calculation practical, and converging to margin-maximizing solutions. While this statement describes SVMs exactly, it applies to boosting only approximately.

Keywords: boosting, regularized optimization, support vector machines, margin maximization

1. Introduction and Outline

Boosting is a method for iteratively building an additive model

F_T(x) = Σ_{t=1}^T α_t h_{j_t}(x),    (1)

where h_{j_t} ∈ H, a large (but we will assume finite) dictionary of candidate predictors or "weak learners"; and h_{j_t} is the basis function selected as the best candidate to modify the function at stage t.
The model F_T can equivalently be represented by assigning a coefficient to each dictionary function h ∈ H rather than to the selected h_{j_t}'s only:

F_T(x) = Σ_{j=1}^J h_j(x) β_j^{(T)},    (2)

where J = |H| and β_j^{(T)} = Σ_{t: j_t = j} α_t. The β representation allows us to interpret the coefficient vector β^{(T)} as a vector in R^J or, equivalently, as the hyperplane which has β^{(T)} as its normal. This interpretation will play a key role in our exposition.

© 2004 Saharon Rosset, Ji Zhu and Trevor Hastie.

Some examples of common dictionaries are:

- The training variables themselves, in which case h_j(x) = x_j. This leads to our additive model F_T being just a linear model in the original data. The number of dictionary functions will be J = d, the dimension of x.
- Polynomial dictionary of degree p, in which case the number of dictionary functions will be J = (p+d choose d).
- Decision trees with up to k terminal nodes, if we limit the split points to data points (or midway between data points, as CART does). The number of possible trees is bounded from above (trivially) by J ≤ (np)^k 2^{k^2}. Note that regression trees do not fit into our framework, since they will give J = ∞.

The boosting idea was first introduced by Freund and Schapire (1995), with their AdaBoost algorithm. AdaBoost and other boosting algorithms have attracted a lot of attention due to their great success in data modeling tasks, and the mechanism which makes them work has been presented and analyzed from several perspectives. Friedman et al. (2000) develop a statistical perspective, which ultimately leads to viewing AdaBoost as a gradient-based incremental search for a good additive model (more specifically, it is a coordinate descent algorithm), using the exponential loss function C(y, F) = exp(-yF), where y ∈ {-1, 1}. The gradient boosting (Friedman, 2001) and AnyBoost (Mason et al., 1999) generic algorithms have used this approach to generalize the boosting idea to wider families of problems and loss functions. In particular, Friedman et al. (2000) have pointed out that the binomial log-likelihood loss C(y, F) = log(1 + exp(-yF)) is a more natural loss for classification, and is more robust to outliers and misspecified data.
A different analysis of boosting, originating in the machine learning community, concentrates on the effect of boosting on the margins y_i F(x_i). For example, Schapire et al. (1998) use margin-based arguments to prove convergence of boosting to perfect classification performance on the training data under general conditions, and to derive bounds on the generalization error (on future, unseen data).

In this paper we combine the two approaches, to conclude that gradient-based boosting can be described, in the separable case, as an approximate margin-maximizing process. The view we develop of boosting as an approximate path of optimal solutions to regularized problems also justifies early stopping in boosting as specifying a value for the regularization parameter. We consider the problem of minimizing non-negative convex loss functions (in particular the exponential and binomial log-likelihood loss functions) over the training data, with an l_1 bound on the model coefficients:

β̂(c) = arg min_{‖β‖_1 ≤ c} Σ_i C(y_i, h(x_i)'β),    (3)

where h(x_i) = [h_1(x_i), h_2(x_i), ..., h_J(x_i)]' and J = |H| (footnote 1).

Hastie et al. (2001, Chapter 10) have observed that "slow" gradient-based boosting (i.e., we set α_t = ε, ∀t in (1), with ε small) tends to follow the penalized path β̂(c) as a function of c, under some mild conditions on this path. In other words, using the notation of (2), (3), this implies that ‖β^{(c/ε)} - β̂(c)‖ vanishes with ε, for all (or a wide range of) values of c. Figure 1 illustrates this equivalence between ε-boosting and the optimal solution of (3) on a real-life data set, using squared error loss as the loss function.

Figure 1: Exact coefficient paths (left) for l_1-constrained squared error regression and boosting coefficient paths (right) on data from a prostate cancer study.

In this paper we demonstrate this equivalence further and formally state it as a conjecture. Some progress towards proving this conjecture has been made by Efron et al. (2004), who prove a weaker "local" result for the case where C is squared error loss, under some mild conditions on the optimal path. We generalize their result to general convex loss functions. Combining the empirical and theoretical evidence, we conclude that boosting can be viewed as an approximate incremental method for following the l_1-regularized path.

We then prove that in the separable case, for both the exponential and logistic log-likelihood loss functions, β̂(c)/c converges as c → ∞ to an "optimal" separating hyperplane β̂ described by

β̂ = arg max_{‖β‖_1 = 1} min_i y_i β'h(x_i).    (4)

In other words, β̂ maximizes the minimal margin among all vectors with l_1 norm equal to 1 (footnote 2). This result generalizes easily to other l_p-norm constraints. For example, if p = 2, then β̂ describes the optimal separating hyperplane in the Euclidean sense, i.e., the same one that a non-regularized support vector machine would find.
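The l_1-margin-maximizing hyperplane in (4) can also be computed directly as a linear program, which is convenient for checking this limiting behavior on small examples. Below is a minimal sketch using scipy.optimize.linprog; the function name and the tiny four-point data set are our own illustration, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def l1_margin_hyperplane(H, y):
    """Solve (4): max_{||b||_1 = 1} min_i y_i * H[i] @ b as a linear program.

    Split b = bp - bm with bp, bm >= 0 and sum(bp + bm) = 1; the LP
    variable vector is [bp, bm, m] and we maximize the margin m.
    """
    n, J = H.shape
    # inequality rows: m - y_i * H[i] @ (bp - bm) <= 0 for every i
    A_ub = np.hstack([-y[:, None] * H, y[:, None] * H, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # equality: sum(bp) + sum(bm) = 1 (the l_1 norm of b)
    A_eq = np.hstack([np.ones((1, 2 * J)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    c = np.zeros(2 * J + 1)
    c[-1] = -1.0                                   # minimize -m, i.e. maximize m
    bounds = [(0, None)] * (2 * J) + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    bp, bm, m = res.x[:J], res.x[J:2 * J], res.x[-1]
    return bp - bm, m

# A 2-D example in the spirit of Figure 4: two points per class.
H = np.array([[1., 1.], [1., -1.], [-1., 1.], [-1., -1.]])
y = np.array([1., 1., -1., -1.])
beta, m = l1_margin_hyperplane(H, y)
```

On this data the l_1-optimal direction concentrates all of its norm on a single coordinate, illustrating the sparsity effect discussed in Section 3.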
Combining our two main results, we get the following characterization of boosting:

1. Our notation assumes that the minimum in (3) is unique, which requires some mild assumptions. To avoid notational complications we use this slightly abusive notation throughout this paper. In Appendix B we give explicit conditions for uniqueness of this minimum.
2. The margin-maximizing hyperplane in (4) may not be unique, and we show that in that case the limit β̂ is still defined and it also maximizes the "second minimal" margin. See Appendix B.2 for details.
ε-boosting can be described as a gradient-descent search, approximately following the path of l_1-constrained optimal solutions to its loss criterion, and converging, in the separable case, to a "margin maximizer" in the l_1 sense.

Note that boosting with a large dictionary H (in particular if n < J = |H|) guarantees that the data will be separable (except for pathologies), hence separability is a very mild assumption here.

As in the case of support vector machines in high-dimensional feature spaces, the non-regularized "optimal" separating hyperplane is usually of theoretical interest only, since it typically represents an overfitted model. Thus, we would want to choose a good regularized model. Our results indicate that boosting gives a natural method for doing that, by "stopping early" in the boosting process. Furthermore, they point out the fundamental similarity between boosting and SVMs: both approaches allow us to fit regularized models in high-dimensional predictor space, using a computational trick. They differ in the regularization approach they take (exact l_2 regularization for SVMs, approximate l_1 regularization for boosting) and in the computational trick that facilitates fitting (the kernel trick for SVMs, coordinate descent for boosting).

1.1 Related Work

Schapire et al. (1998) have identified the normalized margins as distance from an l_1-normed separating hyperplane. Their results relate the boosting iterations' success to the minimal margin of the combined model. Rätsch et al. (2001b) take this further using an asymptotic analysis of AdaBoost. They prove that the "normalized" minimal margin, min_i y_i Σ_t α_t h_t(x_i) / Σ_t |α_t|, is asymptotically equal for both classes. In other words, they prove that the asymptotic separating hyperplane is equally far away from the closest points on either side. This is a property of the margin-maximizing separating hyperplane as we define it. Both papers also illustrate the margin-maximizing effects of AdaBoost through experimentation. However, they both stop short of proving the convergence to optimal (margin-maximizing) solutions.
Motivated by our result, Rätsch and Warmuth (2002) have recently asserted the margin-maximizing properties of ε-AdaBoost, using a different approach than the one used in this paper. Their results relate only to the asymptotic convergence of infinitesimal AdaBoost, compared to our analysis of the regularized path traced along the way and of a variety of boosting loss functions, which also leads to a convergence result on binomial log-likelihood loss.

The convergence of boosting to an "optimal" solution from a loss function perspective has been analyzed in several papers. Rätsch et al. (2001a) and Collins et al. (2000) give results and bounds on the convergence of training-set loss, Σ_i C(y_i, Σ_t α_t h_t(x_i)), to its minimum. However, in the separable case convergence of the loss to 0 is inherently different from convergence of the linear separator to the optimal separator. Any solution which separates the two classes perfectly can drive the exponential (or log-likelihood) loss to 0, simply by scaling coefficients up linearly.

Two recent papers have made the connection between boosting and l_1 regularization in a slightly different context than this paper. Zhang (2003) suggests a shrinkage version of boosting which converges to l_1-regularized solutions, while Zhang and Yu (2003) illustrate the quantitative relationship between early stopping in boosting and l_1 constraints.
2. Boosting as Gradient Descent

Generic gradient-based boosting algorithms (Friedman, 2001; Mason et al., 1999) attempt to find a good linear combination of the members of some dictionary of basis functions to optimize a given loss function over a sample. This is done by searching, at each iteration, for the basis function which gives the "steepest descent" in the loss, and changing its coefficient accordingly. In other words, this is a coordinate descent algorithm in R^J, where we assign one dimension (or coordinate) for the coefficient of each dictionary function.

Assume we have data {x_i, y_i}_{i=1}^n with x_i ∈ R^d, a loss (or cost) function C(y, F), and a set of dictionary functions {h_j(x)} : R^d → R. Then all of these algorithms follow the same essential steps:

Algorithm 1: Generic gradient-based boosting algorithm

1. Set β^{(0)} = 0.
2. For t = 1 : T,
   (a) Let F_i = β^{(t-1)'} h(x_i), i = 1, ..., n (the current fit).
   (b) Set w_i = ∂C(y_i, F_i) / ∂F_i, i = 1, ..., n.
   (c) Identify j_t = arg max_j |Σ_i w_i h_j(x_i)|.
   (d) Set β^{(t)}_{j_t} = β^{(t-1)}_{j_t} - α_t sign(Σ_i w_i h_{j_t}(x_i)), and β^{(t)}_k = β^{(t-1)}_k for k ≠ j_t.

Here β^{(t)} is the "current" coefficient vector and α_t > 0 is the current step size. Notice that Σ_i w_i h_{j_t}(x_i) = ∂ Σ_i C(y_i, F_i) / ∂β_{j_t}.

As we mentioned, Algorithm 1 can be interpreted simply as a coordinate descent algorithm in "weak learner" space. Implementation details include the dictionary H of weak learners, the loss function C(y, F), the method of searching for the optimal j_t, and the way in which α_t is determined (footnote 3). For example, the original AdaBoost algorithm uses this scheme with the exponential loss C(y, F) = exp(-yF), and an implicit line search to find the best α_t once a direction j_t has been chosen (see Hastie et al., 2001; Mason et al., 1999). The dictionary used by AdaBoost in this formulation would be a set of candidate classifiers, i.e., h_j(x_i) ∈ {-1, +1}; usually decision trees are used in practice.

2.1 Practical Implementation of Boosting

The dictionaries used for boosting are typically very large (practically infinite), and therefore the generic boosting algorithm we have presented cannot be implemented verbatim.
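For a small, finite dictionary, Algorithm 1 with a fixed step size α_t = ε is only a few lines of numerical code. The sketch below is our own illustration (here the dictionary is just the raw coordinates, h_j(x) = x_j); the exhaustive argmax in step 2(c) is exactly what becomes impractical for large dictionaries.

```python
import numpy as np

def eps_boost(H, y, loss_grad, eps=0.01, T=1000):
    """Generic gradient-based boosting (Algorithm 1) with fixed step size eps.

    H: n x J matrix of dictionary evaluations h_j(x_i)
    y: labels in {-1, +1}
    loss_grad: elementwise dC(y, F)/dF
    """
    n, J = H.shape
    beta = np.zeros(J)                     # step 1
    for t in range(T):
        F = H @ beta                       # (a) current fit
        w = loss_grad(y, F)                # (b) pointwise loss gradients
        g = w @ H                          # d(total loss)/d(beta_j) for every j
        jt = np.argmax(np.abs(g))          # (c) steepest coordinate
        beta[jt] -= eps * np.sign(g[jt])   # (d) small step against the gradient
    return beta

# Exponential loss C(y, F) = exp(-y F), so dC/dF = -y exp(-y F).
exp_grad = lambda y, F: -y * np.exp(-y * F)

# Toy separable data: coordinate 0 separates the classes by itself.
H = np.array([[1., 0.], [2., 0.], [-1., 0.], [-2., 0.5]])
y = np.array([1., 1., -1., -1.])
beta = eps_boost(H, y, exp_grad, eps=0.05, T=200)
```

After a few hundred small steps the fit separates the training data and the exponential loss is far below its value at β = 0.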
In particular, it is not practical to exhaustively search for the maximizer in step 2(c). Instead, an approximate, usually greedy search is conducted to find a good candidate weak learner h_{j_t} which makes the first-order decline in the loss large (even if not maximal among all possible models).

In the common case that the dictionary of weak learners is comprised of decision trees with up to k nodes, the way AdaBoost and other boosting algorithms solve stage 2(c) is by building a decision tree on a re-weighted version of the data, with the weights |w_i|. Thus they first replace step 2(c) with minimization of Σ_i |w_i| 1{y_i ≠ h_{j_t}(x_i)}, which is easily shown to be equivalent to the original step 2(c). They then use a greedy decision-tree building algorithm such as CART or C5 to build a k-node decision tree which minimizes this quantity, i.e., achieves low "weighted misclassification error" on the weighted data. Since the tree is built greedily (one split at a time) it will not be the global minimizer of weighted misclassification error among all k-node decision trees. However, it will be a good fit for the re-weighted data, and can be considered an approximation to the optimal tree.

This use of approximate optimization techniques is critical, since much of the strength of the boosting approach comes from its ability to build additive models in very high-dimensional predictor spaces. In such spaces, standard exact optimization techniques are impractical: any approach which requires calculation and inversion of Hessian matrices is completely out of the question, and even approaches which require only first derivatives, such as coordinate descent, can only be implemented approximately.

3. The sign of α_t will always be -sign(Σ_i w_i h_{j_t}(x_i)), since we want the loss to be reduced. In most cases, the dictionary H is negation closed, and so it can be assumed WLOG that the coefficients are always positive and increasing.

2.2 Gradient-Based Boosting as a Generic Modeling Tool

As Friedman (2001) and Mason et al. (1999) mention, this view of boosting as gradient descent allows us to devise boosting algorithms for any function estimation problem: all we need is an appropriate loss and an appropriate dictionary of weak learners. For example, Friedman et al. (2000) suggested using the binomial log-likelihood loss instead of the exponential loss of AdaBoost for binary classification, resulting in the LogitBoost algorithm. However, there is no need to limit boosting algorithms to classification: Friedman (2001) applied this methodology to regression estimation, using squared error loss and regression trees, and Rosset and Segal (2003) applied it to density estimation, using the log-likelihood criterion and Bayesian networks as weak learners.
Their experiments and those of others illustrate that the practical usefulness of this approach (coordinate descent in high-dimensional predictor space) carries beyond classification, and even beyond supervised learning. The view we present in this paper, of coordinate-descent boosting as approximate l_1-regularized fitting, offers some insight into why this approach would be good in general: it allows us to fit regularized models directly in high-dimensional predictor space. In this it bears a conceptual similarity to support vector machines, which exactly fit an l_2-regularized model in high-dimensional (RKHS) predictor space.

2.3 Loss Functions

The two most commonly used loss functions for boosting classification models are the exponential and the (minus) binomial log-likelihood:

Exponential:      C_e(y, F) = exp(-yF);
Log-likelihood:   C_l(y, F) = log(1 + exp(-yF)).

These two loss functions bear some important similarities to each other. As Friedman et al. (2000) show, the population minimizer of expected loss at point x is similar for both loss functions and is given by

F̂(x) = c · log [ P(y = 1 | x) / P(y = -1 | x) ],

where c_e = 1/2 for exponential loss and c_l = 1 for binomial loss.

Figure 2: The two classification loss functions.

More importantly for our purpose, we have the following simple proposition, which illustrates the strong similarity between the two loss functions for positive margins (i.e., correct classifications):

Proposition 1: yF ≥ 0 ⇒ 0.5 C_e(y, F) ≤ C_l(y, F) ≤ C_e(y, F).    (5)

In other words, the two losses become similar if the margins are positive, and both behave like exponentials.

Proof: Consider the functions f_1(z) = z and f_2(z) = log(1 + z) for z ∈ [0, 1]. Then f_1(0) = f_2(0) = 0, and

f_1'(z) = 1 ≥ f_2'(z) = 1/(1 + z) ≥ 1/2.

Thus we can conclude 0.5 f_1(z) ≤ f_2(z) ≤ f_1(z). Now set z = exp(-yF) and we get the desired result.

For negative margins the behaviors of C_e and C_l are very different, as Friedman et al. (2000) have noted. In particular, C_l is more robust against outliers and misspecified data.

2.4 Line-Search Boosting vs. ε-Boosting

As mentioned above, AdaBoost determines α_t using a line search. In our notation for Algorithm 1 this would be

α_t = arg min_α Σ_i C(y_i, F_i + α h_{j_t}(x_i)).
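For the exponential loss this line search even has the familiar AdaBoost closed form, α = 0.5 log((1 - err)/err) with err the weighted misclassification rate, which we can check against a numeric minimization. A small sketch with made-up data (the variable names and the 40-of-200 error pattern are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Current fit F = 0 (the first boosting iteration) and a {-1, +1}-valued
# weak learner that is wrong on exactly 40 of the 200 points.
y = np.tile([1.0, -1.0], 100)
F = np.zeros(200)
h = y.copy()
h[:40] = -h[:40]

# Numeric line search over alpha for the exponential loss.
loss = lambda a: np.sum(np.exp(-y * (F + a * h)))
alpha_num = minimize_scalar(loss, bounds=(0.0, 5.0), method="bounded").x

# Closed form: weighted error under the current weights w_i = exp(-y_i F_i).
w = np.exp(-y * F)
err = w[h != y].sum() / w.sum()                  # = 0.2 here
alpha_closed = 0.5 * np.log((1 - err) / err)     # = 0.5 * log(4)
```

The two values agree, since setting the derivative of the exponential loss in α to zero gives exactly the weighted-error formula.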
The alternative approach, suggested by Friedman (2001) and Hastie et al. (2001), is to "shrink" all α_t to a single small value ε. This may slow down learning considerably (depending on how small ε is), but is attractive theoretically: the first-order theory underlying gradient boosting implies that the weak learner chosen is the best increment only "locally". It can also be argued that this approach is "stronger" than line search, as we can keep selecting the same h_{j_t} repeatedly if it remains optimal, and so ε-boosting "dominates" line-search boosting in terms of training error. In practice, this approach of "slowing the learning rate" usually performs better than line search in terms of prediction error as well (see Friedman, 2001). For our purposes, we will mostly assume ε is infinitesimally small, so the theoretical boosting algorithm which results is the limit of a series of boosting algorithms with shrinking ε.

In regression terminology, the line-search version is equivalent to forward stagewise modeling, infamous in the statistics literature for being too greedy and highly unstable (see Friedman, 2001). This is intuitively obvious, since by increasing the coefficient until it "saturates" we are destroying "signal" which may help us select other good predictors.

3. l_p Margins, Support Vector Machines and Boosting

We now introduce the concept of margins as a geometric interpretation of a binary classification model. In the context of boosting, this view offers a different understanding of AdaBoost from the gradient descent view presented above. In the following sections we connect the two views.

3.1 The Euclidean Margin and the Support Vector Machine

Consider a classification model in high-dimensional predictor space: F(x) = Σ_j h_j(x) β_j. We say that the model separates the training data {x_i, y_i}_{i=1}^n if sign(F(x_i)) = y_i, ∀i. From a geometrical perspective this means that the hyperplane defined by F(x) = 0 is a separating hyperplane for this data, and we define its (Euclidean) margin as

m_2(β) = min_i y_i F(x_i) / ‖β‖_2.    (6)

The margin-maximizing separating hyperplane for this data would be defined by the β which maximizes m_2(β). Figure 3 shows a simple example of separable data in two dimensions, with its margin-maximizing separating hyperplane. The Euclidean margin-maximizing separating hyperplane is the (non-regularized) support vector machine solution. Its margin-maximizing properties play a central role in deriving generalization error bounds for these models, and form the basis for a rich literature.

3.2 The l_1 Margin and Its Relation to Boosting

Instead of considering the Euclidean margin as in (6) we can define an l_p margin concept as

m_p(β) = min_i y_i F(x_i) / ‖β‖_p.    (7)

Of particular interest to us is the case p = 1. Figure 4 shows the l_1-margin-maximizing separating hyperplane for the same simple example as Figure 3. Note the fundamental difference between
Figure 3: A simple data example, with two observations from class "O" and two observations from class "X". The full line is the Euclidean margin-maximizing separating hyperplane.

Figure 4: l_1-margin-maximizing separating hyperplane for the same data set as Figure 3. The difference between the diagonal Euclidean optimal separator and the vertical l_1 optimal separator illustrates the "sparsity" effect of optimal l_1 separation.
the two solutions: the l_2-optimal separator is diagonal, while the l_1-optimal one is vertical. To understand why this is so, we can relate the two margin definitions to each other as

yF(x) / ‖β‖_1 = [ yF(x) / ‖β‖_2 ] · [ ‖β‖_2 / ‖β‖_1 ].    (8)

From this representation we can observe that the l_1 margin will tend to be big if the ratio ‖β‖_2 / ‖β‖_1 is big. This ratio will generally be big if β is sparse. To see this, consider fixing the l_1 norm of the vector and then comparing the l_2 norm of two candidates: one with many small components, and the other a sparse one with a few large components and many zero components. It is easy to see that the second vector will have bigger l_2 norm, and hence (if the l_2 margin for both vectors is equal) a bigger l_1 margin.

A different perspective on the difference between the optimal solutions is given by a theorem due to Mangasarian (1999), which states that the l_p-margin-maximizing separating hyperplane maximizes the l_q distance from the closest points to the separating hyperplane, with 1/p + 1/q = 1. Thus the Euclidean optimal separator (p = 2) also maximizes Euclidean distance between the points and the hyperplane, while the l_1 optimal separator maximizes l_∞ distance. This interesting result gives another intuition why l_1-optimal separating hyperplanes tend to be coordinate-oriented (i.e., have sparse representations): since l_∞ projection considers only the largest coordinate distance, some coordinate distances may be 0 at no cost of decreased l_∞ distance.

Schapire et al. (1998) have pointed out the relation between AdaBoost and the l_1 margin. They prove that, in the case of separable data, the boosting iterations increase the "boosting" margin of the model, defined as

min_i y_i F(x_i) / ‖α‖_1.    (9)

In other words, this is the l_1 margin of the model, except that it uses the α incremental representation rather than the β "geometric" representation for the model.
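The claim that sparsity inflates the ratio ‖β‖_2 / ‖β‖_1, and hence the l_1 margin via (8), is easy to verify numerically. A toy illustration of our own, comparing a maximally sparse and a maximally dense direction with the same l_1 norm:

```python
import numpy as np

# Two directions in R^100 with the same l_1 norm (= 1):
b_sparse = np.zeros(100)
b_sparse[0] = 1.0                 # one large component
b_dense = np.full(100, 0.01)      # many small components

# Per (8), m_1(b) = m_2(b) * ||b||_2 / ||b||_1, so for a fixed l_2 margin
# the sparser direction earns the larger l_1 margin.
ratio_sparse = np.linalg.norm(b_sparse) / np.abs(b_sparse).sum()   # 1.0
ratio_dense = np.linalg.norm(b_dense) / np.abs(b_dense).sum()      # 0.1
```

The sparse direction's ratio is ten times larger here, matching the intuition in the text.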
The two representations give the same l_1 norm if there is "sign consistency", or monotonicity in the coefficient paths traced by the model, i.e., if at every iteration t of the boosting algorithm

β^{(t)}_{j_t} ≠ 0 ⇒ sign(α_t) = sign(β^{(t)}_{j_t}).    (10)

As we will see later, this monotonicity condition will play an important role in the equivalence between boosting and l_1 regularization.

The l_1-margin-maximization view of AdaBoost presented by Schapire et al. (1998), and a whole plethora of papers that followed, is important for the analysis of boosting algorithms for two distinct reasons:

- It gives an intuitive, geometric interpretation of the model that AdaBoost is looking for: a model which separates the data well in this l_1-margin sense. Note that the view of boosting as gradient descent in a loss criterion doesn't really give the same kind of intuition: if the data is separable, then any model which separates the training data will drive the exponential or binomial loss to 0 when scaled up:

m_1(β) > 0 ⇒ Σ_i C(y_i, d β'h(x_i)) → 0 as d → ∞.
- The l_1-margin behavior of a classification model on its training data facilitates generation of generalization (or prediction) error bounds, similar to those that exist for support vector machines (Schapire et al., 1998). The important quantity in this context is not the margin but the "normalized" margin, which considers the conjugate norm of the predictor vectors:

y_i β'h(x_i) / (‖β‖_1 ‖h(x_i)‖_∞).

When the dictionary we are using is comprised of classifiers, then ‖h(x_i)‖_∞ ≡ 1 always, and thus the l_1 margin is exactly the relevant quantity. The error bounds described by Schapire et al. (1998) allow using the whole l_1-margin distribution, not just the minimal margin. However, boosting's tendency to "separate well" in the l_1 sense is a central motivation behind their results.

From a statistical perspective, however, we should be suspicious of margin maximization as a method for building good prediction models in high-dimensional predictor space. Margin maximization in high-dimensional space is likely to lead to overfitting and bad prediction performance. This has been observed in practice by many authors, in particular Breiman (1999). Our results in the next two sections suggest an explanation based on model complexity: margin maximization is the "limit" of parametric regularized optimization models, as the regularization vanishes, and the regularized models along the path may well be superior to the margin-maximizing "limiting" model in terms of prediction performance. In Section 7 we return to discuss these issues in more detail.

4. Boosting as Approximate Incremental l_1-Constrained Fitting

In this section we introduce an interpretation of the generic coordinate-descent boosting algorithm as tracking a path of approximate solutions to l_1-constrained (or equivalently, regularized) versions of its loss criterion. This view serves our understanding of what boosting does, in particular the connection between early stopping in boosting and regularization. We will also use this view to get a result about the asymptotic margin maximization of regularized classification models, and by analogy of classification boosting.
We build on ideas first presented by Hastie et al. (2001, Chapter 10) and Efron et al. (2004). Given a convex non-negative loss criterion C(·,·), consider the 1-dimensional path of optimal solutions to l_1-constrained optimization problems over the training data:

β̂(c) = arg min_{‖β‖_1 ≤ c} Σ_i C(y_i, h(x_i)'β).    (11)

As c varies, we get that β̂(c) traces a 1-dimensional "optimal curve" through R^J. If an optimal solution for the non-constrained problem exists and has finite l_1 norm c_0, then obviously β̂(c) = β̂(c_0) = β̂, ∀c > c_0. In the case of separable 2-class data, using either C_e or C_l, there is no finite-norm optimal solution. Rather, the constrained solution will always have ‖β̂(c)‖_1 = c.

A different way of building a solution which has l_1 norm c is to run our ε-boosting algorithm for c/ε iterations. This will give an α^{(c/ε)} vector which has l_1 norm exactly c. For the norm of the geometric representation β^{(c/ε)} to also be equal to c, we need the monotonicity condition (10) to hold as well. This condition will play a key role in our exposition.

We are going to argue that the two solution paths β̂(c) and β^{(c/ε)} are very similar for ε "small". Let us start by observing this similarity in practice. Figure 1 in the introduction shows an example of this similarity for squared error loss fitting with an l_1 (lasso) penalty. Figure 5 shows another example in the same mold, taken from Efron et al. (2004). The data is a diabetes study and the dictionary used is just the original 10 variables. The panel on the left shows the path of optimal l_1-constrained solutions β̂(c), and the panel on the right shows the ε-boosting path with the 10-dimensional dictionary (the total number of boosting iterations is about 6000). The 1-dimensional path through R^10 is described by 10 coordinate curves, corresponding to each one of the variables.

Figure 5: Another example of the equivalence between the Lasso optimal solution path (left) and ε-boosting with squared error loss (right). Note that the equivalence breaks down when the path of variable 7 becomes non-monotone.

The interesting phenomenon we observe is that the two coefficient traces are not completely identical. Rather, they agree up to the point where variable 7's coefficient path becomes non-monotone, i.e., it violates (10) (this point is where variable 8 comes into the model; see the arrow on the right panel). This example illustrates that the monotonicity condition, and its implication that ‖α‖_1 = ‖β‖_1, is critical for the equivalence between ε-boosting and l_1-constrained optimization.

The two examples we have seen so far have used squared error loss, and we should ask ourselves whether this equivalence stretches beyond this loss. Figure 6 shows a similar result, but this time for the binomial log-likelihood loss, C_l. We used the "spam" data set, taken from the UCI repository (Blake and Merz, 1998). We chose only 5 predictors of the 57 to make the plots more interpretable and the computations more accommodating. We see that there is a perfect equivalence between the exact constrained solution (i.e., regularized logistic regression) and ε-boosting in this case, since the paths are fully monotone.
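For squared error loss, the ε-boosting (forward stagewise) procedure behind Figures 1 and 5 is only a few lines; the sketch below is our own, not the code used for the figures. Each iteration adds ε to one coordinate of α, so ‖α‖_1 = tε after t steps, and ‖β^{(t)}‖_1 can only fall short of tε when the path violates the monotonicity condition (10).

```python
import numpy as np

def stagewise(H, y, eps=0.01, T=2000):
    """Forward stagewise (eps-boosting) for squared error loss."""
    n, J = H.shape
    beta = np.zeros(J)
    for t in range(T):
        r = y - H @ beta             # residuals; r = -dC/dF for squared loss
        g = H.T @ r                  # correlation of each h_j with the residual
        j = np.argmax(np.abs(g))     # steepest coordinate, step 2(c)
        beta[j] += eps * np.sign(g[j])
    return beta

# A small synthetic example with a sparse true coefficient vector.
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 5))
y = H @ np.array([2.0, -3.0, 0.0, 0.0, 1.0])
beta = stagewise(H, y, eps=0.01, T=2000)
```

By the triangle inequality ‖β‖_1 ≤ Tε always holds here, with equality exactly when every step moves its coordinate away from zero.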
To justify why this observed equivalence is not surprising, let us consider the following l_1-"locally optimal monotone direction" problem of finding the best monotone ε-increment to a given model β_0:

min C(β)    (12)
s.t. ‖β‖_1 - ‖β_0‖_1 ≤ ε,
     |β| ≥ |β_0| (component-wise).
Figure 6: Exact coefficient paths (left) for l_1-constrained logistic regression and boosting coefficient paths (right) with binomial log-likelihood loss on five variables from the "spam" data set. The boosting path was generated using a small fixed ε and 7000 iterations.

Here we use C(β) as shorthand for Σ_i C(y_i, h(x_i)'β). A first-order Taylor expansion gives us

C(β) = C(β_0) + ∇C(β_0)'(β - β_0) + O(ε²).

And given the l_1 constraint on the increase in ‖β‖_1, it is easy to see that a first-order optimal solution (and therefore an optimal solution as ε → 0) will make a coordinate descent step, i.e.,

β_j ≠ β_{0,j} ⇒ |∇C(β_0)_j| = max_k |∇C(β_0)_k|,

assuming the signs match, i.e., sign(β_{0,j}) = -sign(∇C(β_0)_j).

So we get that if the optimal solution to (12) without the monotonicity constraint happens to be monotone, then it is equivalent to a coordinate descent step. And so it is reasonable to expect that if the optimal l_1-regularized path is monotone (as it indeed is in Figures 1 and 6), then an infinitesimal ε-boosting algorithm would follow the same path of solutions. Furthermore, even if the optimal path is not monotone, we can still use the formulation (12) to argue that ε-boosting would tend to follow an approximate l_1-regularized path. The main difference between the ε-boosting path and the true optimal path is that the former will tend to "delay" becoming non-monotone, as we observe for variable 7 in Figure 5. To understand this specific phenomenon would require analysis of the true optimal path, which falls outside the scope of our discussion. Efron et al. (2004) cover the subject for squared error loss, and their discussion applies to any continuously differentiable convex loss, using second-order approximations.

We can employ this understanding of the relationship between boosting and l_1 regularization to construct "l_p boosting" algorithms by changing the coordinate-selection criterion in the coordinate descent algorithm. We will get back to this point in Section 7, where we design an l_2 boosting algorithm.
The experimental evidence and heuristic discussion we have presented lead us to the following conjecture, which connects "slow boosting" and l_1-regularized optimization:
14 ROSSET, ZHU AND HASTIE Conjecture 2 Consder applyng the εboostng algorthm to any convex loss functon, generatng a path of solutons β (ε) (t). Then f the optmal coeffcent paths are monotone c < c 0,.e., f j, ˆβ(c) j s nondecreasng n the range c < c 0, then lm ε 0 β(ε) (c 0 /ε) = ˆβ(c 0 ). Efron et al. (2004, Theorem 2) prove a weaker local result for the case of squared error loss only. We generalze ther result to any convex loss. However ths result stll does not prove the global convergence whch the conjecture clams, and the emprcal evdence mples. For the sake of brevty and readablty, we defer ths proof, together wth concse mathematcal defnton of the dfferent types of convergence, to appendx A. In the context of reallfe boostng, where the number of bass functons s usually very large, and makng ε small enough for the theory to apply would requre runnng the algorthm forever, these results should not be consdered drectly applcable. Instead, they should be taken as an ntutve ndcaton that boostng especally the ε verson s, ndeed, approxmatng optmal solutons to the constraned problems t encounters along the way. 5. l p Constraned Classfcaton Loss Functons Havng establshed the relaton between boostng and l 1 regularzaton, we are gong to turn our attenton to the regularzed optmzaton problem. By analogy, our results wll apply to boostng as well. We concentrate on C e and C l, the two classfcaton losses defned above, and the soluton paths of ther l p constraned versons: ˆβ (p) (c) = arg mn β p c C(y,β h(x )). (13) where C s ether C e or C l. As we dscussed below Equaton (11), f the tranng data s separable n span(h ), then we have ˆβ (p) (c) p = c for all values of c. Consequently ˆβ (p) (c) p = 1. c We may ask what are the convergence ponts of ths sequence as c. The followng theorem shows that these convergence ponts descrbe l p margn maxmzng separatng hyperplanes. Theorem 3 Assume the data s separable,.e., β s.t., y β h(x ) > 0. 
Then for both C_e and C_l, every convergence point of β̂^(p)(c)/c corresponds to an l_p-margin-maximizing separating hyperplane. If the l_p-margin-maximizing separating hyperplane is unique, then it is the unique convergence point, i.e.,

    β̂^(p) = lim_{c→∞} β̂^(p)(c)/c = argmax_{||β||_p = 1} min_i y_i β'h(x_i).    (14)

Proof This proof applies to both C_e and C_l, given the property in (5). Consider two separating candidates β1 and β2 such that ||β1||_p = ||β2||_p = 1. Assume that β1 separates better, i.e.,

    m1 := min_i y_i β1'h(x_i) > m2 := min_i y_i β2'h(x_i) > 0.

Then we have the following simple lemma:
Lemma 4 There exists some D = D(m1, m2) such that for all d > D, dβ1 incurs smaller loss than dβ2; in other words,

    Σ_i C(y_i, dβ1'h(x_i)) < Σ_i C(y_i, dβ2'h(x_i)).

Given this lemma, we can now prove that any convergence point of β̂^(p)(c)/c must be an l_p-margin-maximizing separator.

Assume β* is a convergence point of β̂^(p)(c)/c. Denote its minimal margin on the data by m*. If the data is separable, clearly m* > 0 (since otherwise the loss of dβ* does not even converge to 0 as d → ∞). Now, assume some β̃ with ||β̃||_p = 1 has bigger minimal margin m̃ > m*. By continuity of the minimal margin in β, there exist some open neighborhood of β*,

    N_{β*} = {β : ||β − β*||_2 < δ},

and an ε > 0, such that

    min_i y_i β'h(x_i) < m̃ − ε,  ∀β ∈ N_{β*}.

Now by the lemma we get that there exists some D = D(m̃, m̃ − ε) such that dβ̃ incurs smaller loss than dβ for any d > D, β ∈ N_{β*}. Therefore β* cannot be a convergence point of β̂^(p)(c)/c.

We conclude that any convergence point of the sequence β̂^(p)(c)/c must be an l_p-margin-maximizing separator. If the margin-maximizing separator is unique then it is the only possible convergence point, and therefore

    β̂^(p) = lim_{c→∞} β̂^(p)(c)/c = argmax_{||β||_p = 1} min_i y_i β'h(x_i).

Proof of Lemma 4 Using (5) and the definition of C_e, we get for both loss functions

    Σ_i C(y_i, dβ1'h(x_i)) ≤ n exp(−d m1).

Now, since β1 separates better, we can find our desired D:

    D = D(m1, m2) = (log n + log 2) / (m1 − m2)

gives

    ∀d > D,  n exp(−d m1) < 0.5 exp(−d m2).

And using (5) and the definition of C_e again, we can write

    0.5 exp(−d m2) ≤ Σ_i C(y_i, dβ2'h(x_i)).

Combining these three inequalities we get our desired result:

    ∀d > D,  Σ_i C(y_i, dβ1'h(x_i)) < Σ_i C(y_i, dβ2'h(x_i)).
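The bound in the proof of Lemma 4 is easy to check numerically. The following sketch (the toy data and the two unit-l1-norm candidates beta1, beta2 are our own illustrative choices, not from the paper's experiments) verifies that beyond D = (log n + log 2)/(m1 − m2), the better separator incurs the smaller total exponential loss:

```python
import numpy as np

# Toy "data": rows are z_i = y_i * h(x_i), so the margin of beta on
# point i is simply z_i @ beta.
Z = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [1.5, 1.5]])
n = len(Z)

def min_margin(beta):
    return (Z @ beta).min()

def exp_loss(beta, d):
    # Total exponential loss sum_i exp(-d * z_i @ beta) of the scaled vector d*beta.
    return np.exp(-d * (Z @ beta)).sum()

beta1 = np.array([0.5, 0.5])   # unit l1 norm, minimal margin 1.5
beta2 = np.array([0.9, 0.1])   # unit l1 norm, minimal margin 1.1
m1, m2 = min_margin(beta1), min_margin(beta2)

# Lemma 4's threshold: for every d > D, d*beta1 incurs smaller loss than d*beta2.
D = (np.log(n) + np.log(2)) / (m1 - m2)
for d in (D + 1, D + 5, D + 25):
    assert exp_loss(beta1, d) < exp_loss(beta2, d)
```

Note that D is only a sufficient threshold: the ordering may already hold for smaller d.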
We thus conclude that if the l_p-margin-maximizing separating hyperplane is unique, the normalized constrained solution converges to it. In the case that the margin-maximizing separating hyperplane is not unique, we can in fact prove a stronger result, which indicates that the limit of the regularized solutions would then be determined by the second smallest margin, then by the third, and so on. This result is mainly of technical interest, and we prove it in Appendix B.

5.1 Implications of Theorem 3

We now briefly discuss the implications of this theorem for boosting and logistic regression.

5.1.1 Boosting Implications

Combined with our results from Section 4, Theorem 3 indicates that the normalized boosting path β^(t) / Σ_{u≤t} α_u, with either C_e or C_l used as loss, approximately converges to a separating hyperplane β̂, which attains

    max_{||β||_1 = 1} min_i y_i β'h(x_i) = max_{||β||_1 = 1} ||β||_2 · min_i y_i d_i,    (15)

where d_i is the (signed) Euclidean distance from the training point x_i to the separating hyperplane. In other words, it maximizes Euclidean distance scaled by an l2 norm.

As we have mentioned already, this implies that the asymptotic boosting solution will tend to be sparse in representation, due to the fact that for a fixed l1 norm, the l2 norm of vectors that have many zero entries will generally be larger. In fact, under rather mild conditions, the asymptotic solution β̂ = lim_{c→∞} β̂^(1)(c)/c will have at most n (the number of observations) non-zero coefficients, if we use either C_l or C_e as the loss. See Appendix B, Section 1 for a proof.

5.1.2 Logistic Regression Implications

Recall that the logistic regression (maximum likelihood) solution is undefined if the data is separable in the Euclidean space spanned by the predictors. Theorem 3 allows us to define a logistic regression solution for separable data, as follows:

1. Set a high constraint value c_max.

2. Find β̂^(p)(c_max), the solution to the logistic regression problem subject to the constraint ||β||_p ≤ c_max. The problem is convex for any p ≥ 1 and differentiable for any p > 1, so interior point methods can be used to solve it.

3.
Now you have (approximately) the l_p-margin-maximizing solution for this data, described by β̂^(p)(c_max)/c_max. This is a solution to the original problem in the sense that it is, approximately, the convergence point of the normalized l_p-constrained solutions as the constraint is relaxed.
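As a concrete sketch of this recipe for p = 1, one can use scikit-learn's penalized (rather than constrained) l1 logistic regression: the penalty and constraint forms trace out the same solution path, so a very weak penalty (large C) plays the role of a large c_max. The two-Gaussian toy data below are our own illustration, not the paper's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Separable toy data: two well-separated Gaussian blobs in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.3, size=(20, 2)),
               rng.normal(+2.0, 0.3, size=(20, 2))])
y = np.r_[np.zeros(20), np.ones(20)]

# Steps 1-2: solve the l1-penalized logistic regression with a very
# weak penalty (large C), approximating the relaxed l1 constraint.
clf = LogisticRegression(penalty="l1", C=1e4, solver="liblinear",
                         fit_intercept=False)
clf.fit(X, y)

# Step 3: normalize to unit l1 norm to get the (approximate)
# l1-margin-maximizing direction.
beta = clf.coef_.ravel()
beta_hat = beta / np.abs(beta).sum()
min_margin = ((2 * y - 1) * (X @ beta_hat)).min()  # > 0: separates the data
```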
Of course, with our result from Theorem 3, it would probably make more sense to simply find the optimal separating hyperplane directly: this is a linear programming problem for l1 separation and a quadratic programming problem for l2 separation. We can then consider this optimal separator as a logistic regression solution for the separable data.

6. Examples

We now apply boosting to several data sets and interpret the results in light of our regularization and margin-maximization view.

6.1 Spam Data Set

We now know that if the data are separable and we let boosting run forever, we will approach the same optimal separator for both C_e and C_l. However, if we stop early, or if the data is not separable, the behavior of the two loss functions may differ significantly, since C_e weighs negative margins exponentially, while C_l is approximately linear in the margin for large negative margins (see Friedman et al., 2000). Consequently, we can expect C_e to concentrate more on the "hard" training data, in particular in the non-separable case.

Figure 7 illustrates the behavior of ε-boosting with both loss functions, as well as that of AdaBoost, on the spam data set (57 predictors, binary response). We used 10-node trees and ε = 0.1.

[Figure 7: Behavior of boosting with the two loss functions on the spam data set. Left panel: minimal margins of exponential loss, logistic loss, and AdaBoost against ||β||_1; right panel: test error of the same three methods against ||β||_1.]

The left plot shows the minimal margin as a function of the l1 norm of the coefficient vector, ||β||_1. Binomial loss creates a bigger minimal margin initially, but the minimal margins for both loss functions converge asymptotically. AdaBoost initially lags behind, but catches up nicely and reaches the same minimal margin asymptotically. The right plot shows the test error as the iterations proceed, illustrating that both ε-methods indeed seem to overfit eventually, even as their separation (minimal margin) is still improving. AdaBoost did not significantly overfit in the 1000 iterations it was allowed to run, but it obviously would have if it had been allowed to run on.
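For reference, the ε-boosting loop used in these experiments can be sketched generically: at each iteration, take a step of fixed size ε in the coordinate (weak learner) with the steepest loss gradient. The sketch below is a minimal illustration on hypothetical toy data with a tiny hand-built dictionary, not the spam set or tree-based learners:

```python
import numpy as np

def eps_boost(H, y, loss_grad, n_iter=500, eps=0.1):
    """Generic eps-boosting: H is the n x J matrix of weak-learner
    outputs h_j(x_i), y in {-1, +1}, loss_grad gives dC/dmargin."""
    n, J = H.shape
    beta = np.zeros(J)
    for _ in range(n_iter):
        margins = y * (H @ beta)
        grad = H.T @ (y * loss_grad(margins))  # gradient w.r.t. beta
        j = np.argmax(np.abs(grad))            # steepest coordinate
        beta[j] -= eps * np.sign(grad[j])      # fixed small step
    return beta

exp_grad = lambda m: -np.exp(-m)                 # C_e(m) = exp(-m)
logit_grad = lambda m: -1.0 / (1.0 + np.exp(m))  # C_l(m) = log(1 + exp(-m))

# Tiny separable example with a redundant 3-function dictionary.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.4, (15, 2)), rng.normal(1.0, 0.4, (15, 2))])
y = np.r_[-np.ones(15), np.ones(15)]
H = np.column_stack([X, X.sum(axis=1)])

beta_e = eps_boost(H, y, exp_grad)    # exponential-loss path endpoint
beta_l = eps_boost(H, y, logit_grad)  # binomial-loss path endpoint
```

On such easy data both loss functions yield separating directions; on harder data the exponential loss concentrates more on the worst-margin points, as discussed above.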
We should emphasize that the comparison between AdaBoost and ε-boosting presented here uses the l1 norm, not the number of iterations, as the basis for comparison. In terms of computational complexity, as represented by the number of iterations, AdaBoost reaches both a large minimal margin and good prediction performance much more quickly than the slow boosting approaches, as AdaBoost tends to take larger steps.

6.2 Simulated Data

To make a more educated comparison and a more compelling visualization, we have constructed an example of separation of 2-dimensional data using an 8th-degree polynomial dictionary (45 functions). The data consist of 50 observations of each class, drawn from a mixture of Gaussians, and are presented in Figure 8. Also presented, as the solid line, is the optimal l1 separator for this data in this dictionary (easily calculated as a linear programming problem; note the difference from the l2-optimal decision boundary, presented in Section 7.1, Figure 11). The optimal l1 separator has only 12 non-zero coefficients out of 45.

[Figure 8: Artificial data set with the l1-margin-maximizing separator (solid), and boosting models after 10^5 iterations (dashed) and 3·10^6 iterations (dotted), using ε = 0.001. We observe the convergence of the boosting separator to the optimal separator.]

We ran an ε-boosting algorithm on this data set, using the logistic log-likelihood loss C_l, with ε = 0.001, and Figure 8 shows two of the models generated, after 10^5 and 3·10^6 iterations. We see that the models seem to converge to the optimal separator. A different view of this convergence is given in Figure 9, where we see two measures of convergence: the minimal margin (left; the maximum value obtainable is the horizontal line) and the l1-norm distance between the normalized models (right), given by

    Σ_j | β̂_j − β^(t)_j / ||β^(t)||_1 |,

where β̂ is the optimal separator with l1 norm 1, and β^(t) is the boosting model after t iterations.

We can conclude that on this simple artificial example we get nice convergence of the logistic-boosting model path to the l1-margin-maximizing separating hyperplane. We can also use this example to illustrate the similarity between the boosted path and the path of l1-optimal solutions, as we have discussed in Section 4.
[Figure 9: Two measures of convergence of the boosting model path to the optimal l1 separator: the minimal margin (left) and the l1 distance between the normalized boosting coefficient vector and the optimal model (right), both plotted against ||β||_1.]

[Figure 10: Comparison of the decision boundaries of boosting models (broken lines) and of the optimal constrained solutions with the same norm (full lines), at l1 norms 20, 350, 2701, and 5401.]

Figure 10 shows the class decision boundaries for four models generated along the boosting path, compared to the optimal solutions to the constrained logistic regression problem with the same bound on the l1 norm of the coefficient vector. We observe the clear similarities in the way the solutions evolve and converge to the optimal l1 separator. The fact that they differ (in some cases significantly) is not surprising if we recall the monotonicity condition presented in Section 4 for exact correspondence between the two model paths. In this case, if we look at the coefficient paths
(not shown), we observe that the monotonicity condition is consistently violated in the low-norm ranges, and hence we can expect the paths to be similar in spirit, but not identical.

7. Discussion

We can now summarize what we have learned about boosting from the previous sections:

- Boosting approximately follows the path of l1-regularized models for its loss criterion.

- If the loss criterion is the exponential loss of AdaBoost or the binomial log-likelihood loss of logistic regression, then the l1-regularized model converges to an l1-margin-maximizing separating hyperplane, if the data are separable in the span of the weak learners.

We may ask which of these two points is the key to the success of boosting approaches. One empirical clue to answering this question can be found in Breiman (1999), who programmed an algorithm to directly maximize the margins. His results were that his algorithm consistently achieved significantly higher minimal margins than AdaBoost on many data sets (and, in fact, a higher margin distribution beyond the minimal margin), but had slightly worse prediction performance. His conclusion was that margin maximization is not the key to AdaBoost's success.

From a statistical perspective we can embrace this conclusion, as reflecting the importance of regularization in high-dimensional predictor space. By our results from the previous sections, margin maximization can be viewed as the limit of parametric regularized models, as the regularization vanishes.⁴ Thus we would generally expect the margin-maximizing solutions to perform worse than regularized models. In the case of boosting, regularization would correspond to early stopping of the boosting algorithm.

7.1 Boosting and SVMs as Regularized Optimization in High-Dimensional Predictor Spaces

Our exposition has led us to view boosting as an approximate way to solve the regularized optimization problem

    min_β Σ_i C(y_i, β'h(x_i)) + λ||β||_1,    (16)

which converges as λ → 0 to β̂^(1), if our loss is C_e or C_l. In general, the loss C can be any convex differentiable loss and should be defined to match the problem domain.
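The λ → 0 limit β̂^(1) of (16), the l1-margin-maximizing separator, can also be computed directly as the linear program mentioned in Section 5.1.2: maximize the minimal margin ρ subject to ||β||_1 = 1, after splitting β into positive and negative parts. A sketch with scipy, on our own illustrative toy data and dictionary:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative separable data with a small dictionary h(x) = (x1, x2, x1^2, x2^2).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.3, (10, 2)), rng.normal(1.0, 0.3, (10, 2))])
y = np.r_[-np.ones(10), np.ones(10)]
H = np.column_stack([X, X ** 2])
n, J = H.shape

# Variables: [beta_plus (J), beta_minus (J), rho].  Maximize rho subject to
#   y_i h(x_i)'(beta_plus - beta_minus) >= rho  and  sum(beta_plus + beta_minus) = 1.
c = np.zeros(2 * J + 1)
c[-1] = -1.0                                        # linprog minimizes, so use -rho
A_ub = np.hstack([-(y[:, None] * H), y[:, None] * H, np.ones((n, 1))])
b_ub = np.zeros(n)
A_eq = np.r_[np.ones(2 * J), 0.0][None, :]          # l1 norm fixed to 1
b_eq = [1.0]
bounds = [(0, None)] * (2 * J) + [(None, None)]     # rho is free
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

beta_hat = res.x[:J] - res.x[J:2 * J]               # the l1-optimal separator
rho = res.x[-1]                                     # the maximal l1 margin
```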
Support vector machines can be described as solving the regularized optimization problem (see Hastie et al., 2001, Chapter 12)

    min_β Σ_i (1 − y_i β'h(x_i))_+ + λ||β||_2²,    (17)

which converges as λ → 0 to the non-regularized support vector machine solution, i.e., the optimal Euclidean separator, which we denoted by β̂^(2).

An interesting connection exists between these two approaches, in that both allow us to solve the regularized optimization problem in high-dimensional predictor space:

4. It can be argued that margin-maximizing models are still regularized in some sense, as they minimize a norm criterion among all separating models. This is arguably the property which still allows them to generalize reasonably well in many cases.
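Problem (17) can likewise be sketched numerically. In scikit-learn's LinearSVC the parameter C plays the role of 1/λ, so a large C approximates the vanishing-regularization limit, i.e., the optimal Euclidean separator β̂^(2). The toy data are again our own illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Separable toy data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2.0, 0.4, (25, 2)), rng.normal(2.0, 0.4, (25, 2))])
y = np.r_[-np.ones(25), np.ones(25)]

def euclidean_min_margin(C):
    # Hinge loss + l2 penalty, as in (17); C acts as 1/lambda.
    svm = LinearSVC(C=C, loss="hinge", fit_intercept=False, max_iter=100_000)
    svm.fit(X, y)
    beta = svm.coef_.ravel()
    return (y * (X @ beta)).min() / np.linalg.norm(beta)  # Euclidean margin

# All three fits separate this easy data; the large-C fit approximates
# the non-regularized (maximum Euclidean margin) solution.
margins = [euclidean_min_margin(C) for C in (0.01, 1.0, 100.0)]
```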