A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting*


Journal of Computer and System Sciences 55 (1997), Article No. SS

Yoav Freund and Robert E. Schapire†

AT&T Labs, 180 Park Avenue, Florham Park, New Jersey

Received December 19, 1996

In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in $\mathbb{R}^n$. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line. © 1997 Academic Press

1. INTRODUCTION

A gambler, frustrated by persistent horse-racing losses and envious of his friends' winnings, decides to allow a group of his fellow gamblers to make bets on his behalf. He decides he will wager a fixed sum of money in every race, but that he will apportion his money among his friends based on how well they are doing. Certainly, if he knew psychically ahead of time which of his friends would win the most, he would naturally have that friend handle all his wagers. Lacking such clairvoyance, however, he attempts to allocate each race's wager in such a way that his total winnings for the season will be reasonably close to what he would have won had he bet everything with the luckiest of his friends.
In this paper, we describe a simple algorithm for solving such dynamic allocation problems, and we show that our solution can be applied to a great assortment of learning problems. Perhaps the most surprising of these applications is the derivation of a new algorithm for "boosting," i.e., for converting a "weak" PAC learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy.

* An extended abstract of this work appeared in the "Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, March, 1995."
† E-mail: {yoav, schapire}@research.att.com.
Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

We formalize our on-line allocation model as follows. The allocation agent $A$ has $N$ options or strategies to choose from; we number these using the integers $1, \ldots, N$. At each time step $t = 1, 2, \ldots, T$, the allocator $A$ decides on a distribution $\mathbf{p}^t$ over the strategies; that is, $p_i^t \ge 0$ is the amount allocated to strategy $i$, and $\sum_{i=1}^N p_i^t = 1$. Each strategy $i$ then suffers some loss $\ell_i^t$ which is determined by the (possibly adversarial) "environment." The loss suffered by $A$ is then $\sum_{i=1}^N p_i^t \ell_i^t = \mathbf{p}^t \cdot \boldsymbol{\ell}^t$, i.e., the average loss of the strategies with respect to $A$'s chosen allocation rule. We call this loss function the mixture loss.

In this paper, we always assume that the loss suffered by any strategy is bounded so that, without loss of generality, $\ell_i^t \in [0, 1]$. Besides this condition, we make no assumptions about the form of the loss vectors $\boldsymbol{\ell}^t$, or about the manner in which they are generated; indeed, the adversary's choice for $\boldsymbol{\ell}^t$ may even depend on the allocator's chosen mixture $\mathbf{p}^t$.

The goal of the algorithm $A$ is to minimize its cumulative loss relative to the loss suffered by the best strategy. That is, $A$ attempts to minimize its net loss

$$L_A - \min_i L_i$$

where

$$L_A = \sum_{t=1}^T \mathbf{p}^t \cdot \boldsymbol{\ell}^t$$

is the total cumulative loss suffered by algorithm $A$ on the first $T$ trials, and

$$L_i = \sum_{t=1}^T \ell_i^t$$

is strategy $i$'s cumulative loss.

In Section 2, we show that Littlestone and Warmuth's [20] "weighted majority" algorithm can be generalized to
handle this problem, and we prove a number of bounds on the net loss. For instance, one of our results shows that the net loss of our algorithm can be bounded by $O(\sqrt{T \ln N})$ or, put another way, that the average per trial net loss is decreasing at the rate $O(\sqrt{(\ln N)/T})$. Thus, as $T$ increases, this difference decreases to zero.

Our results for the on-line allocation model can be applied to a wide variety of learning problems, as we describe in Section 3. In particular, we generalize the results of Littlestone and Warmuth [20] and Cesa-Bianchi et al. [4] for the problem of predicting a binary sequence using the advice of a team of "experts." Whereas these authors proved worst-case bounds for making on-line randomized decisions over a binary decision and outcome space with a $\{0, 1\}$-valued discrete loss, we prove (slightly weaker) bounds that are applicable to any bounded loss function over any decision and outcome spaces. Our bounds express explicitly the rate at which the loss of the learning algorithm approaches that of the best expert.

Related generalizations of the expert prediction model were studied by Vovk [25], Kivinen and Warmuth [19], and Haussler et al. [15]. Like us, these authors focused primarily on multiplicative weight-update algorithms. Chung [5] also presented a generalization, giving the problem a game-theoretic treatment.

Boosting

Returning to the horse-racing story, suppose now that the gambler grows weary of choosing among the experts and instead wishes to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks his favorite expert to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule-of-thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds").
Although such a rule-of-thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules-of-thumb.

In order to use these rules-of-thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules-of-thumb from the expert that will be the most useful? Second, once he has collected many rules-of-thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb. In the second part of the paper, we present and analyze a new boosting algorithm inspired by the methods we used for solving the on-line allocation problem.

Formally, boosting proceeds as follows: The booster is provided with a set of labelled training examples $(x_1, y_1), \ldots, (x_N, y_N)$, where $y_i$ is the label associated with instance $x_i$; for instance, in the horse-racing example, $x_i$ might be the observable data associated with a particular horse race, and $y_i$ the outcome (winning horse) of that race. On each round $t = 1, \ldots, T$, the booster devises a distribution $D_t$ over the set of examples, and requests (from an unspecified oracle) a weak hypothesis (or rule-of-thumb) $h_t$ with low error $\epsilon_t$ with respect to $D_t$ (that is, $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i]$). Thus, distribution $D_t$ specifies the relative importance of each example for the current round. After $T$ rounds, the booster must combine the weak hypotheses into a single prediction rule.

Unlike the previous boosting algorithms of Freund [10, 11] and Schapire [22], the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy.
For binary prediction problems, we prove in Section 4 that the error of this final hypothesis (with respect to the given set of examples) is bounded by $\exp(-2 \sum_{t=1}^T \gamma_t^2)$ where $\epsilon_t = 1/2 - \gamma_t$ is the error of the $t$-th weak hypothesis. Since a hypothesis that makes entirely random guesses has error $1/2$, $\gamma_t$ measures the accuracy of the $t$-th weak hypothesis relative to random guessing. Thus, this bound shows that if we can consistently find weak hypotheses that are slightly better than random guessing, then the error of the final hypothesis drops exponentially fast.

Note that the bound on the accuracy of the final hypothesis improves when any of the weak hypotheses is improved. This is in contrast with previous boosting algorithms whose performance bound depended only on the accuracy of the least accurate weak hypothesis. At the same time, if the weak hypotheses all have the same accuracy, the performance of the new algorithm is very close to that achieved by the best of the known boosting algorithms.

In Section 5, we give two extensions of our boosting algorithm to multi-class prediction problems in which each example belongs to one of several possible classes (rather than just two). We also give an extension to regression problems in which the goal is to estimate a real-valued function.

2. THE ON-LINE ALLOCATION ALGORITHM AND ITS ANALYSIS

In this section, we present our algorithm, called Hedge($\beta$), for the on-line allocation problem. The algorithm
and its analysis are direct generalizations of Littlestone and Warmuth's weighted majority algorithm [20].

The pseudo-code for Hedge($\beta$) is shown in Fig. 1. The algorithm maintains a weight vector whose value at time $t$ is denoted $\mathbf{w}^t = (w_1^t, \ldots, w_N^t)$. At all times, all weights will be nonnegative. All of the weights of the initial weight vector $\mathbf{w}^1$ must be nonnegative and sum to one, so that $\sum_{i=1}^N w_i^1 = 1$. Besides these conditions, the initial weight vector may be arbitrary, and may be viewed as a "prior" over the set of strategies. Since our bounds are strongest for those strategies receiving the greatest initial weight, we will want to choose the initial weights so as to give the most weight to those strategies which we expect are most likely to perform the best. Naturally, if we have no reason to favor any of the strategies, we can set all of the initial weights equally so that $w_i^1 = 1/N$. Note that the weights on future trials need not sum to one.

Our algorithm allocates among the strategies using the current weight vector, after normalizing. That is, at time $t$, Hedge($\beta$) chooses the distribution vector

$$\mathbf{p}^t = \frac{\mathbf{w}^t}{\sum_{i=1}^N w_i^t}. \tag{1}$$

After the loss vector $\boldsymbol{\ell}^t$ has been received, the weight vector $\mathbf{w}^t$ is updated using the multiplicative rule

$$w_i^{t+1} = w_i^t \cdot \beta^{\ell_i^t}. \tag{2}$$

More generally, it can be shown that our analysis is applicable with only minor modification to an alternative update rule of the form

$$w_i^{t+1} = w_i^t \cdot U_\beta(\ell_i^t)$$

where $U_\beta : [0, 1] \to [0, 1]$ is any function, parameterized by $\beta \in [0, 1]$, satisfying

$$\beta^r \le U_\beta(r) \le 1 - (1 - \beta) r$$

for all $r \in [0, 1]$.

Algorithm Hedge($\beta$)
Parameters:
    $\beta \in [0, 1]$
    initial weight vector $\mathbf{w}^1 \in [0, 1]^N$ with $\sum_{i=1}^N w_i^1 = 1$
    number of trials $T$
Do for $t = 1, 2, \ldots, T$:
    1. Choose allocation $\mathbf{p}^t = \mathbf{w}^t / \sum_{i=1}^N w_i^t$.
    2. Receive loss vector $\boldsymbol{\ell}^t \in [0, 1]^N$ from environment.
    3. Suffer loss $\mathbf{p}^t \cdot \boldsymbol{\ell}^t$.
    4. Set the new weights vector to be $w_i^{t+1} = w_i^t \beta^{\ell_i^t}$.

FIG. 1. The on-line allocation algorithm.

2.1. Analysis

The analysis of Hedge($\beta$) mimics directly that given by Littlestone and Warmuth [20]. The main idea is to derive upper and lower bounds on $\sum_{i=1}^N w_i^{T+1}$ which, together, imply an upper bound on the loss of the algorithm. We begin with an upper bound.

Lemma 1. For any sequence of loss vectors $\boldsymbol{\ell}^1, \ldots, \boldsymbol{\ell}^T$,

$$\ln\left(\sum_{i=1}^N w_i^{T+1}\right) \le -(1 - \beta) L_{\text{Hedge}(\beta)}.$$

Proof. By a convexity argument, it can be shown that

$$\alpha^r \le 1 - (1 - \alpha) r \tag{3}$$

for $\alpha \ge 0$ and $r \in [0, 1]$. Combined with Eqs.
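(1) and (2), this implies

$$\sum_{i=1}^N w_i^{t+1} = \sum_{i=1}^N w_i^t \beta^{\ell_i^t} \le \sum_{i=1}^N w_i^t \left(1 - (1 - \beta) \ell_i^t\right) = \left(\sum_{i=1}^N w_i^t\right) \left(1 - (1 - \beta)\, \mathbf{p}^t \cdot \boldsymbol{\ell}^t\right). \tag{4}$$

Applying repeatedly for $t = 1, \ldots, T$ yields

$$\sum_{i=1}^N w_i^{T+1} \le \prod_{t=1}^T \left(1 - (1 - \beta)\, \mathbf{p}^t \cdot \boldsymbol{\ell}^t\right) \le \exp\left(-(1 - \beta) \sum_{t=1}^T \mathbf{p}^t \cdot \boldsymbol{\ell}^t\right)$$

since $1 + x \le e^x$ for all $x$. The lemma follows immediately. ∎

Thus,

$$L_{\text{Hedge}(\beta)} \le \frac{-\ln\left(\sum_{i=1}^N w_i^{T+1}\right)}{1 - \beta}. \tag{5}$$

Note that, from Eq. (2),

$$w_i^{T+1} = w_i^1 \prod_{t=1}^T \beta^{\ell_i^t} = w_i^1 \beta^{L_i}. \tag{6}$$

This is all that is needed to complete our analysis.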
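The allocation rule of Eq. (1) and the update rule of Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `hedge`, the random loss sequence, and the choice $\beta = 0.9$ are assumptions made for the example. The demonstration at the end checks Lemma 1 and Eq. (6) numerically on one run.

```python
import math
import random


def hedge(loss_vectors, beta, initial_weights=None):
    """Run Hedge(beta) on a sequence of loss vectors with entries in [0, 1].

    Returns (total_loss, final_weights), where total_loss is the cumulative
    mixture loss, i.e. the sum over trials of p^t . l^t.
    """
    n = len(loss_vectors[0])
    # Uniform initial weights unless a prior is supplied (hypothetical default).
    weights = list(initial_weights) if initial_weights else [1.0 / n] * n
    total_loss = 0.0
    for losses in loss_vectors:
        norm = sum(weights)
        p = [w / norm for w in weights]                      # Eq. (1)
        total_loss += sum(pi * li for pi, li in zip(p, losses))
        weights = [w * beta ** li for w, li in zip(weights, losses)]  # Eq. (2)
    return total_loss, weights


# Illustrative run: 5 strategies, 200 trials of arbitrary losses in [0, 1].
rng = random.Random(0)
demo_losses = [[rng.random() for _ in range(5)] for _ in range(200)]
demo_total, demo_weights = hedge(demo_losses, beta=0.9)
demo_cumulative = [sum(l[i] for l in demo_losses) for i in range(5)]

# Lemma 1: ln(sum of final weights) <= -(1 - beta) * L_Hedge.
print(math.log(sum(demo_weights)) <= -(1 - 0.9) * demo_total + 1e-9)
# Eq. (6): each final weight equals w_i^1 * beta ** L_i.
print(all(math.isclose(w, 0.2 * 0.9 ** c, rel_tol=1e-9)
          for w, c in zip(demo_weights, demo_cumulative)))
```

Because the weights are only ever multiplied by factors in $[\beta, 1]$, they never need to be renormalized between trials; normalization happens only when the allocation is computed.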
Theorem 2. For any sequence of loss vectors $\boldsymbol{\ell}^1, \ldots, \boldsymbol{\ell}^T$, and for any $i \in \{1, \ldots, N\}$, we have

$$L_{\text{Hedge}(\beta)} \le \frac{-\ln(w_i^1) - L_i \ln \beta}{1 - \beta}. \tag{7}$$

More generally, for any nonempty set $S \subseteq \{1, \ldots, N\}$, we have

$$L_{\text{Hedge}(\beta)} \le \frac{-\ln\left(\sum_{i \in S} w_i^1\right) - (\ln \beta) \max_{i \in S} L_i}{1 - \beta}. \tag{8}$$

Proof. We prove the more general statement (8) since Eq. (7) follows in the special case that $S = \{i\}$. From Eq. (6),

$$\sum_{i=1}^N w_i^{T+1} \ge \sum_{i \in S} w_i^{T+1} = \sum_{i \in S} w_i^1 \beta^{L_i} \ge \beta^{\max_{i \in S} L_i} \sum_{i \in S} w_i^1.$$

The theorem now follows immediately from Eq. (5). ∎

The simpler bound (7) states that Hedge($\beta$) does not perform "too much worse" than the best strategy $i$ for the sequence. The difference in loss depends on our choice of $\beta$ and on the initial weight $w_i^1$ of each strategy. If each weight is set equally so that $w_i^1 = 1/N$, then this bound becomes

$$L_{\text{Hedge}(\beta)} \le \frac{\min_i L_i \ln(1/\beta) + \ln N}{1 - \beta}. \tag{9}$$

Since it depends only logarithmically on $N$, this bound is reasonable even for a very large number of strategies.

The more complicated bound (8) is a generalization of the simpler bound that is especially applicable when the number of strategies is infinite. Naturally, for uncountable collections of strategies, the sum appearing in Eq. (8) can be replaced by an integral, and the maximum by a supremum.

The bound given in Eq. (9) can be written as

$$L_{\text{Hedge}(\beta)} \le c \min_i L_i + a \ln N, \tag{10}$$

where $c = \ln(1/\beta)/(1 - \beta)$ and $a = 1/(1 - \beta)$. Vovk [24] analyzes prediction algorithms that have performance bounds of this form, and proves tight upper and lower bounds for the achievable values of $c$ and $a$. Using Vovk's results, we can show that the constants $a$ and $c$ achieved by Hedge($\beta$) are optimal.

Theorem 3. Let $B$ be an algorithm for the on-line allocation problem with an arbitrary number of strategies. Suppose that there exist positive real numbers $a$ and $c$ such that for any number of strategies $N$ and for any sequence of loss vectors $\boldsymbol{\ell}^1, \ldots, \boldsymbol{\ell}^T$,

$$L_B \le c \min_i L_i + a \ln N.$$

Then for all $\beta \in (0, 1)$, either

$$c \ge \frac{\ln(1/\beta)}{1 - \beta} \quad \text{or} \quad a \ge \frac{1}{1 - \beta}.$$

The proof is given in the appendix.

2.2. How to Choose $\beta$

So far, we have analyzed Hedge($\beta$) for a given choice of $\beta$, and we have proved reasonable bounds for any choice of $\beta$. In practice, we will often want to choose $\beta$ so as to maximally exploit any prior knowledge we may have about the specific problem at hand.
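The following lemma will be helpful for choosing $\beta$ using the bounds derived above.

Lemma 4. Suppose $0 \le L \le \tilde{L}$ and $0 < R \le \tilde{R}$. Let $\beta = g(\tilde{L}/\tilde{R})$ where $g(z) = 1/(1 + \sqrt{2/z})$. Then

$$\frac{-L \ln \beta + R}{1 - \beta} \le L + \sqrt{2 \tilde{L} \tilde{R}} + R.$$

Proof. (Sketch) It can be shown that $-\ln \beta \le (1 - \beta^2)/(2\beta)$ for $\beta \in (0, 1]$. Applying this approximation and the given choice of $\beta$ yields the result. ∎

Lemma 4 can be applied to any of the bounds above since all of these bounds have the form given in the lemma. For example, suppose we have $N$ strategies, and we also know a prior bound $\tilde{L}$ on the loss of the best strategy. Then, combining Eq. (9) and Lemma 4, we have

$$L_{\text{Hedge}(\beta)} \le \min_i L_i + \sqrt{2 \tilde{L} \ln N} + \ln N \tag{11}$$

for $\beta = g(\tilde{L}/\ln N)$. In general, if we know ahead of time the number of trials $T$, then we can use $\tilde{L} = T$ as an upper bound on the cumulative loss of each strategy.

Dividing both sides of Eq. (11) by $T$, we obtain an explicit bound on the rate at which the average per-trial loss of Hedge($\beta$) approaches the average loss for the best strategy:

$$\frac{L_{\text{Hedge}(\beta)}}{T} \le \min_i \frac{L_i}{T} + \frac{\sqrt{2 \tilde{L} \ln N}}{T} + \frac{\ln N}{T}. \tag{12}$$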
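The tuning rule of Lemma 4 can be checked numerically. The sketch below is an illustration, not part of the original analysis: it computes $\beta = g(\tilde{L}/\ln N)$ and compares the right-hand side of Eq. (9) with the tuned bound of Eq. (11). The function names and the sample values $N = 10$ and $\tilde{L} = 100$ are assumptions chosen for the example, and $N \ge 2$ is assumed so that $\ln N > 0$.

```python
import math


def g(z):
    """Tuning function of Lemma 4: g(z) = 1 / (1 + sqrt(2 / z))."""
    return 1.0 / (1.0 + math.sqrt(2.0 / z))


def tuned_beta(loss_bound, n_strategies):
    """beta = g(L~ / ln N), the choice used for Eq. (11). Assumes N >= 2."""
    return g(loss_bound / math.log(n_strategies))


def bound_eq9(best_loss, n_strategies, beta):
    """Right-hand side of Eq. (9) for uniform initial weights."""
    ln_n = math.log(n_strategies)
    return (best_loss * math.log(1.0 / beta) + ln_n) / (1.0 - beta)


def bound_eq11(best_loss, loss_bound, n_strategies):
    """Right-hand side of Eq. (11): min L_i + sqrt(2 L~ ln N) + ln N."""
    ln_n = math.log(n_strategies)
    return best_loss + math.sqrt(2.0 * loss_bound * ln_n) + ln_n


# Illustrative values: 10 strategies, best strategy known to lose at most 100.
beta = tuned_beta(100.0, 10)
for best_loss in (0.0, 25.0, 100.0):
    print(best_loss, bound_eq9(best_loss, 10, beta), bound_eq11(best_loss, 100.0, 10))
```

For every admissible value of $\min_i L_i$, the Eq. (9) column should not exceed the Eq. (11) column, which is exactly what Lemma 4 asserts with $R = \tilde{R} = \ln N$.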
Since $\tilde{L} \le T$, this gives a worst-case rate of convergence of $O(\sqrt{(\ln N)/T})$. However, if $\tilde{L}$ is close to zero, then the rate of convergence will be much faster, roughly $O((\ln N)/T)$.

Lemma 4 can also be applied to the other bounds given in Theorem 2 to obtain analogous results.

The bound given in Eq. (11) can be improved in special cases in which the loss is a function of a prediction and an outcome and this function is of a special form (see Example 4 below). However, for the general case, one cannot improve the square-root term $\sqrt{2 \tilde{L} \ln N}$ by more than a constant factor. This is a corollary of the lower bound given by Cesa-Bianchi et al. ([4], Theorem 7) who analyze an on-line prediction problem that can be seen as a special case of the on-line allocation model.

3. APPLICATIONS

The framework described up to this point is quite general and can be applied in a wide variety of learning problems. Consider the following set-up used by Chung [5]. We are given a decision space $\Delta$, a space of outcomes $\Omega$, and a bounded loss function $\lambda : \Delta \times \Omega \to [0, 1]$. (Actually, our results require only that $\lambda$ be bounded, but, by rescaling, we can assume that its range is $[0, 1]$.) At every time step $t$, the learning algorithm selects a decision $\delta^t \in \Delta$, receives an outcome $\omega^t \in \Omega$, and suffers loss $\lambda(\delta^t, \omega^t)$. More generally, we may allow the learner to select a distribution $\mathcal{D}^t$ over the space of decisions, in which case it suffers the expected loss of a decision randomly selected according to $\mathcal{D}^t$; that is, its expected loss is $\Lambda(\mathcal{D}^t, \omega^t)$ where

$$\Lambda(\mathcal{D}, \omega) = E_{\delta \sim \mathcal{D}}[\lambda(\delta, \omega)].$$

To decide on distribution $\mathcal{D}^t$, we assume that the learner has access to a set of $N$ experts. At every time step $t$, expert $i$ produces its own distribution $\mathcal{E}_i^t$ on $\Delta$, and suffers loss $\Lambda(\mathcal{E}_i^t, \omega^t)$.

The goal of the learner is to combine the distributions produced by the experts so as to suffer expected loss "not much worse" than that of the best expert.

The results of Section 2 provide a method for solving this problem. Specifically, we run algorithm Hedge($\beta$), treating each expert as a strategy. At every time step, Hedge($\beta$) produces a distribution $\mathbf{p}^t$ on the set of experts which is used to construct the mixture distribution

$$\mathcal{D}^t = \sum_{i=1}^N p_i^t \mathcal{E}_i^t.$$

For any outcome $\omega^t$, the loss suffered by Hedge($\beta$) will then be

$$\Lambda(\mathcal{D}^t, \omega^t) = \sum_{i=1}^N p_i^t \Lambda(\mathcal{E}_i^t, \omega^t).$$
Thus, if we define $\ell_i^t = \Lambda(\mathcal{E}_i^t, \omega^t)$ then the loss suffered by the learner is $\mathbf{p}^t \cdot \boldsymbol{\ell}^t$, i.e., exactly the mixture loss that was analyzed in Section 2.

Hence, the bounds of Section 2 can be applied to our current framework. For instance, applying Eq. (11), we obtain the following:

Theorem 5. For any loss function $\lambda$, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge($\beta$) if used as described above is at most

$$\sum_{t=1}^T \Lambda(\mathcal{D}^t, \omega^t) \le \min_i \sum_{t=1}^T \Lambda(\mathcal{E}_i^t, \omega^t) + \sqrt{2 \tilde{L} \ln N} + \ln N$$

where $\tilde{L} \le T$ is an assumed bound on the expected loss of the best expert, and $\beta = g(\tilde{L}/\ln N)$.

Example 1. In the $k$-ary prediction problem, $\Delta = \Omega = \{1, 2, \ldots, k\}$, and $\lambda(\delta, \omega)$ is 1 if $\delta \ne \omega$ and 0 otherwise. In other words, the problem is to predict a sequence of letters over an alphabet of size $k$. The loss function $\lambda$ is 1 if a mistake was made, and 0 otherwise. Thus, $\Lambda(\mathcal{D}, \omega)$ is the probability (with respect to $\mathcal{D}$) of a prediction that disagrees with $\omega$. The cumulative loss of the learner, or of any expert, is therefore the expected number of mistakes on the entire sequence. So, in this case, Theorem 2 states that the expected number of mistakes of the learning algorithm will exceed the expected number of mistakes of the best expert by at most $O(\sqrt{T \ln N})$, or possibly much less if the loss of the best expert can be bounded ahead of time.

Bounds of this type were previously proved in the binary case ($k = 2$) by Littlestone and Warmuth [20] using the same algorithm. Their algorithm was later improved by Vovk [25] and Cesa-Bianchi et al. [4]. The main result of this section is a proof that such bounds can be shown to hold for any bounded loss function.

Example 2. The loss function $\lambda$ may represent an arbitrary matrix game, such as "rock, paper, scissors." Here, $\Delta = \Omega = \{R, P, S\}$, and the loss function is defined by the matrix

$$\begin{array}{c|ccc} \delta \backslash \omega & R & P & S \\ \hline R & 1/2 & 1 & 0 \\ P & 0 & 1/2 & 1 \\ S & 1 & 0 & 1/2 \end{array}$$

The decision $\delta$ represents the learner's play, and the outcome $\omega$ is the adversary's play; then $\lambda(\delta, \omega)$, the learner's loss, is 1 if the learner loses the round, 0 if it wins the round, and $1/2$ if the round is tied. (For instance, $\lambda(S, P) = 0$ since "scissors cut paper.") So the cumulative loss of the learner
(or an expert) is the expected number of losses in a series of rounds of game play (counting ties as half a loss). Our results show then that, in repeated play, the expected number of rounds lost by our algorithm will converge quickly to the expected number that would have been lost by the best of the experts (for the particular sequence of moves that were actually played by the adversary).

Example 3. Suppose that $\Delta$ and $\Omega$ are finite, and that $\lambda$ represents a game matrix as in the last example. Suppose further that we create one expert for each decision $\delta \in \Delta$ and that expert always recommends playing $\delta$. In game-theoretic terminology such experts would be identified with pure strategies. Von Neumann's classical min-max theorem states that for any fixed game matrix there exists a distribution over the actions, also called a mixed strategy, which achieves the min-max optimal value of the expected loss against any adversarial strategy. This min-max value is also called the value of the game.

Suppose that we use algorithm Hedge($\beta$) to choose distributions over the actions when playing a matrix game repeatedly. In this case, Theorem 2 implies that the learner's average per-round loss can never be much larger than that of the best pure strategy, and that the maximal gap decreases to zero at the rate $O(\sqrt{(\log |\Delta|)/T})$, as given by Eq. (12) with $N = |\Delta|$. However, the expected loss of the optimal mixed strategy is a fixed convex combination of the losses of the pure strategies, thus it can never be smaller than the loss of the best pure strategy for a particular sequence of events. We conclude that the expected per-trial loss of Hedge($\beta$) is upper bounded by the value of the game plus $O(\sqrt{(\log |\Delta|)/T})$. In other words, the algorithm can never perform much worse than an algorithm that uses the optimal mixed strategy for the game, and it can be better if the adversary does not play optimally. Moreover, this holds true even if the learner knows nothing at all about the game that is being played (so that $\lambda$ is unknown to the learner), and even if the adversarial opponent has complete knowledge both of the game that is being played and the algorithm that is being used by the learner.
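The claim of Example 3 can be illustrated on the rock, paper, scissors matrix of Example 2. The following sketch is an assumption-laden illustration, not the authors' experiment: each pure strategy is an expert, $\beta$ is tuned as $g(T/\ln N)$, and the adversary (a hypothetical one chosen for the example) knows the learner's current distribution and always plays the outcome that maximizes the learner's expected loss.

```python
import math

# LOSS[d][o]: learner's loss when it plays d and the adversary plays o,
# with plays indexed as 0 = rock, 1 = paper, 2 = scissors (Example 2).
LOSS = [[0.5, 1.0, 0.0],
        [0.0, 0.5, 1.0],
        [1.0, 0.0, 0.5]]


def play(rounds):
    """Play Hedge(beta) against a best-responding adversary.

    Returns (learner's average expected loss, best pure strategy's average loss).
    """
    n = 3
    # beta = g(T / ln N) with g(z) = 1 / (1 + sqrt(2 / z)), i.e. L~ = T.
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / rounds))
    weights = [1.0 / n] * n
    learner_loss = 0.0
    pure_losses = [0.0] * n
    for _ in range(rounds):
        total = sum(weights)
        p = [w / total for w in weights]
        # Expected loss of the mixture p against each possible outcome.
        expected = [sum(p[d] * LOSS[d][o] for d in range(n)) for o in range(n)]
        # The adversary picks the outcome that hurts the learner most.
        outcome = max(range(n), key=lambda o: expected[o])
        learner_loss += expected[outcome]
        for d in range(n):
            pure_losses[d] += LOSS[d][outcome]
            weights[d] *= beta ** LOSS[d][outcome]
    return learner_loss / rounds, min(pure_losses) / rounds


average_loss, best_pure = play(1000)
print(average_loss, best_pure)
```

The value of this game is $1/2$, so against a best-responding adversary the learner's average loss cannot fall below $1/2$, while Eq. (12) with $\tilde{L} = T$ keeps it within $\sqrt{2 \ln 3 / T} + (\ln 3)/T$ of the best pure strategy in hindsight.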
Algorithms with similar properties (but weaker convergence bounds) were first devised by Blackwell [2] and Hannan [14]. For more details see our related paper [13].

Example 4. Suppose that $\Delta = \Omega$ is the unit ball in $\mathbb{R}^n$, and that $\lambda(\delta, \omega) = \|\delta - \omega\|$. Thus, the problem here is to predict the location of a point $\omega$, and the loss suffered is the Euclidean distance between the predicted point $\delta$ and the actual outcome $\omega$. Theorem 2 can be applied if probabilistic predictions are allowed. However, in this setting it is more natural to require that the learner and each expert predict a single point (rather than a measure on the space of possible points). Essentially, this is the problem of "tracking" a sequence of points $\omega^1, \ldots, \omega^T$ where the loss function measures the distance to the predicted point.

To see how to handle the problem of finding deterministic predictions, notice that the loss function $\lambda(\delta, \omega)$ is convex with respect to $\delta$:

$$\|(a \delta_1 + (1 - a) \delta_2) - \omega\| \le a \|\delta_1 - \omega\| + (1 - a) \|\delta_2 - \omega\| \tag{13}$$

for any $a \in [0, 1]$ and any $\omega \in \Omega$. Thus we can do as follows. At time $t$, the learner predicts with the weighted average of the experts' predictions: $\delta^t = \sum_{i=1}^N p_i^t \epsilon_i^t$ where $\epsilon_i^t \in \mathbb{R}^n$ is the prediction of the $i$-th expert at time $t$. Regardless of the outcome $\omega^t$, Eq. (13) implies that

$$\|\delta^t - \omega^t\| \le \sum_{i=1}^N p_i^t \|\epsilon_i^t - \omega^t\|.$$

Since Theorem 2 provides an upper bound on the right hand side of this inequality, we also obtain upper bounds for the left hand side. Thus, our results in this case give explicit bounds on the total error (i.e., distance between predicted and observed points) for the learner relative to the best of a team of experts.

In the one-dimensional case ($n = 1$), this case was previously analyzed by Littlestone and Warmuth [20], and later improved upon by Kivinen and Warmuth [19].

This result depends only on the convexity and the bounded range of the loss function $\lambda(\delta, \omega)$ with respect to $\delta$. Thus, it can also be applied, for example, to the squared-distance loss function $\lambda(\delta, \omega) = \|\delta - \omega\|^2$, as well as the log loss function $\lambda(\delta, \omega) = -\ln(\delta \cdot \omega)$ used by Cover [6] for the design of "universal" investment portfolios. (In this last case, $\Delta$ is the set of probability vectors on $n$ points, and $\Omega = [1/B, B]^n$ for some constant $B > 1$.)

In many of the cases listed above, superior algorithms or analyses are known.
Although weaker in specific cases, it should be emphasized that our results are far more general, and can be applied in settings that exhibit considerably less structure, such as the horse-racing example described in the introduction.

4. BOOSTING

In this section we show how the algorithm presented in Section 2 for the on-line allocation problem can be modified to boost the performance of weak learning algorithms.

We very briefly review the PAC learning model (see, for instance, Kearns and Vazirani [18] for a more detailed description). Let $X$ be a set called the domain. A concept is a Boolean function $c : X \to \{0, 1\}$. A concept class $\mathcal{C}$ is a collection of concepts. The learner has access to an oracle which provides labelled examples of the form $(x, c(x))$ where $x$ is chosen randomly according to some fixed but
unknown and arbitrary distribution $\mathcal{D}$ on the domain $X$, and $c \in \mathcal{C}$ is the target concept. After some amount of time, the learner must output a hypothesis $h : X \to [0, 1]$. The value $h(x)$ can be interpreted as a randomized prediction of the label of $x$ that is 1 with probability $h(x)$ and 0 with probability $1 - h(x)$. (Although we assume here that we have direct access to the bias of this prediction, our results can be extended to the case that $h$ is instead a random mapping into $\{0, 1\}$.) The error of the hypothesis $h$ is the expected value $E_{x \sim \mathcal{D}}(|h(x) - c(x)|)$ where $x$ is chosen according to $\mathcal{D}$. If $h(x)$ is interpreted as a stochastic prediction, then this is simply the probability of an incorrect prediction.

A strong PAC-learning algorithm is an algorithm that, given $\epsilon, \delta > 0$ and access to random examples, outputs with probability $1 - \delta$ a hypothesis with error at most $\epsilon$. Further, the running time must be polynomial in $1/\epsilon$, $1/\delta$ and other relevant parameters (namely, the "size" of the examples received, and the "size" or "complexity" of the target concept). A weak PAC-learning algorithm satisfies the same conditions but only for $\epsilon \ge 1/2 - \gamma$ where $\gamma > 0$ is either a constant, or decreases as $1/p$ where $p$ is a polynomial in the relevant parameters. We use WeakLearn to denote a generic weak learning algorithm.

Schapire [22] showed that any weak learning algorithm can be efficiently transformed or "boosted" into a strong learning algorithm. Later, Freund [10, 11] presented the "boost-by-majority" algorithm that is considerably more efficient than Schapire's. Both algorithms work by calling a given weak learning algorithm WeakLearn multiple times, each time presenting it with a different distribution over the domain $X$, and finally combining all of the generated hypotheses into a single hypothesis. The intuitive idea is to alter the distribution over the domain $X$ in a way that increases the probability of the "harder" parts of the space, thus forcing the weak learner to generate new hypotheses that make fewer mistakes on these parts.

An important, practical deficiency of the boost-by-majority algorithm is the requirement that the bias $\gamma$ of the weak learning algorithm WeakLearn be known ahead of time.
Not only is this worst-case bias usually unknown in practice, but the bias that can be achieved by WeakLearn will typically vary considerably from one distribution to the next. Unfortunately, the boost-by-majority algorithm cannot take advantage of hypotheses computed by WeakLearn with error significantly smaller than the presumed worst-case bias of $1/2 - \gamma$.

In this section, we present a new boosting algorithm which was derived from the on-line allocation algorithm of Section 2. This new algorithm is very nearly as efficient as boost-by-majority. However, unlike boost-by-majority, the accuracy of the final hypothesis produced by the new algorithm depends on the accuracy of all the hypotheses returned by WeakLearn, and so is able to more fully exploit the power of the weak learning algorithm.

Also, this new algorithm gives a clean method for handling real-valued hypotheses which often are produced by neural networks and other learning algorithms.

4.1. The New Boosting Algorithm

Although boosting has its roots in the PAC model, for the remainder of the paper, we adopt a more general learning framework in which the learner receives examples $(x_i, y_i)$ chosen randomly according to some fixed but unknown distribution $\mathcal{P}$ on $X \times Y$, where $Y$ is a set of possible labels. As usual, the goal is to learn to predict the label $y$ given an instance $x$.

We start by describing our new boosting algorithm in the simplest case that the label set $Y$ consists of just two possible labels, $Y = \{0, 1\}$. In later sections, we give extensions of the algorithm for more general label sets.

Freund [11] describes two frameworks in which boosting can be applied: boosting by filtering and boosting by sampling. In this paper, we use the boosting by sampling framework, which is the natural framework for analyzing "batch" learning, i.e., learning using a fixed training set which is stored in the computer's memory.

We assume that a sequence of $N$ training examples (labelled instances) $(x_1, y_1), \ldots, (x_N, y_N)$ is drawn randomly from $X \times Y$ according to distribution $\mathcal{P}$. We use boosting to find a hypothesis $h_f$ which is consistent with most of the sample (i.e., $h_f(x_i) = y_i$ for most $1 \le i \le N$).
In general, a hypothesis which is accurate on the training set might not be accurate on examples outside the training set; this problem is sometimes referred to as "over-fitting." Often, however, over-fitting can be avoided by restricting the hypothesis to be simple. We will come back to this problem in Section 4.3.

The new boosting algorithm is described in Fig. 2. The goal of the algorithm is to find a final hypothesis with low error relative to a given distribution $D$ over the training examples. Unlike the distribution $\mathcal{P}$ which is over $X \times Y$ and is set by "nature," the distribution $D$ is only over the instances in the training set and is controlled by the learner. Ordinarily, this distribution will be set to be uniform so that $D(i) = 1/N$. The algorithm maintains a set of weights $\mathbf{w}^t$ over the training examples. On iteration $t$ a distribution $\mathbf{p}^t$ is computed by normalizing these weights. This distribution is fed to the weak learner WeakLearn which generates a hypothesis $h_t$ that (we hope) has small error with respect to the distribution.¹

¹ Some learning algorithms can be generalized to use a given distribution directly. For instance, gradient based algorithms can use the probability associated with each example to scale the update step size which is based on the example. If the algorithm cannot be generalized in this way, the training sample can be resampled to generate a new set of training examples that is distributed according to the given distribution. The computation required to generate each resampled example takes $O(\log N)$ time.

Using the new hypothesis $h_t$, the boosting
algorithm generates the next weight vector $\mathbf{w}^{t+1}$, and the process repeats. After $T$ such iterations, the final hypothesis $h_f$ is output. The hypothesis $h_f$ combines the outputs of the $T$ weak hypotheses using a weighted majority vote.

Algorithm AdaBoost
Input:
    sequence of $N$ labeled examples $((x_1, y_1), \ldots, (x_N, y_N))$
    distribution $D$ over the $N$ examples
    weak learning algorithm WeakLearn
    integer $T$ specifying number of iterations
Initialize the weight vector: $w_i^1 = D(i)$ for $i = 1, \ldots, N$.
Do for $t = 1, 2, \ldots, T$:
    1. Set $\mathbf{p}^t = \mathbf{w}^t / \sum_{i=1}^N w_i^t$.
    2. Call WeakLearn, providing it with the distribution $\mathbf{p}^t$; get back a hypothesis $h_t : X \to [0, 1]$.
    3. Calculate the error of $h_t$: $\epsilon_t = \sum_{i=1}^N p_i^t |h_t(x_i) - y_i|$.
    4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
    5. Set the new weights vector to be $w_i^{t+1} = w_i^t \beta_t^{1 - |h_t(x_i) - y_i|}$.
Output the hypothesis
    $h_f(x) = 1$ if $\sum_{t=1}^T (\log 1/\beta_t)\, h_t(x) \ge \frac{1}{2} \sum_{t=1}^T \log 1/\beta_t$, and $h_f(x) = 0$ otherwise.

FIG. 2. The adaptive boosting algorithm.

We call the algorithm AdaBoost because, unlike previous algorithms, it adjusts adaptively to the errors of the weak hypotheses returned by WeakLearn. If WeakLearn is a PAC weak learning algorithm in the sense defined above, then $\epsilon_t \le 1/2 - \gamma$ for all $t$ (assuming the examples have been generated appropriately with $y_i = c(x_i)$ for some $c \in \mathcal{C}$). However, such a bound on the error need not be known ahead of time. Our results hold for any $\epsilon_t \in [0, 1]$, and depend only on the performance of the weak learner on those distributions that are actually generated during the boosting process.

The parameter $\beta_t$ is chosen as a function of $\epsilon_t$ and is used for updating the weight vector. The update rule reduces the probability assigned to those examples on which the hypothesis makes a good prediction and increases the probability of the examples on which the prediction is poor.²

² Furthermore, if $h_t$ is Boolean (with range $\{0, 1\}$), then it can be shown that this update rule exactly removes the advantage of the last hypothesis. That is, the error of $h_t$ on distribution $\mathbf{p}^{t+1}$ is exactly $1/2$.

Note that AdaBoost, unlike boost-by-majority, combines the weak hypotheses by summing their probabilistic predictions. Drucker, Schapire and Simard [9], in experiments they performed using boosting to improve the performance
of a real-valued neural network, observed that summing the outcomes of the networks and then selecting the best prediction performs better than selecting the best prediction of each network and then combining them with a majority rule. It is interesting that the new boosting algorithm's final hypothesis uses the same combination rule that was observed to be better in practice, but which previously lacked theoretical justification.

Since it was first introduced, several successful experiments have been conducted using AdaBoost, including work by the authors [12], Drucker and Cortes [8], Jackson and Craven [16], Quinlan [21], and Breiman [3].

4.2. Analysis

Comparing Figs. 1 and 2, there is an obvious similarity between the algorithms Hedge($\beta$) and AdaBoost. This similarity reflects a surprising "dual" relationship between the on-line allocation model and the problem of boosting. Put another way, there is a direct mapping or reduction of the boosting problem to the on-line allocation problem. In such a reduction, one might naturally expect a correspondence relating the strategies to the weak hypotheses and the trials (and associated loss vectors) to the examples in the training set. However, the reduction we have used is reversed: the "strategies" correspond to the examples, and the trials are associated with the weak hypotheses. Another reversal is in the definition of the loss: in Hedge($\beta$) the loss $\ell_i^t$ is small if the $i$-th strategy suggests a good action on the $t$-th trial while in AdaBoost the "loss" $\ell_i^t = 1 - |h_t(x_i) - y_i|$ appearing in the weight-update rule (Step 5) is small if the $t$-th hypothesis suggests a bad prediction on the $i$-th example. The reason is that in Hedge($\beta$) the weight associated with a strategy is increased if the strategy is successful while in AdaBoost the weight associated with an example is increased if the example is "hard."

The main technical difference between the two algorithms is that in AdaBoost the parameter $\beta$ is no longer fixed ahead of time but rather changes at each iteration according to $\epsilon_t$. If we are given ahead of time the information that $\epsilon_t \le 1/2 - \gamma$ for some $\gamma > 0$ and for all $t = 1, \ldots, T$, then we could instead directly apply algorithm Hedge($\beta$) and its analysis as follows: Fix $\beta$ to be $1 - \gamma$, and set $\ell_i^t = 1 - |h_t(x_i) - y_i|$, and $h_f$ as in AdaBoost, but with equal weight assigned to all $T$ hypotheses.
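Then $\mathbf{p}^t \cdot \boldsymbol{\ell}^t$ is exactly the accuracy of $h_t$ on distribution $\mathbf{p}^t$, which, by assumption, is at least $1/2 + \gamma$. Also, letting $S = \{i : h_f(x_i) \ne y_i\}$, it is straightforward to show that if $i \in S$ then

$$\frac{L_i}{T} = \frac{1}{T} \sum_{t=1}^T \ell_i^t = 1 - \frac{1}{T} \sum_{t=1}^T |y_i - h_t(x_i)| = 1 - \left| y_i - \frac{1}{T} \sum_{t=1}^T h_t(x_i) \right| \le \frac{1}{2}$$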
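by $h_f$'s definition, and since $y_i \in \{0, 1\}$. Thus, by Theorem 2,

$$T \cdot (1/2 + \gamma) \le \sum_{t=1}^T \mathbf{p}^t \cdot \boldsymbol{\ell}^t \le \frac{-\ln\left(\sum_{i \in S} D(i)\right) + (\gamma + \gamma^2)(T/2)}{\gamma}$$

since $-\ln(\beta) = -\ln(1 - \gamma) \le \gamma + \gamma^2$ for $\gamma \in [0, 1/2]$. This implies that the error $\epsilon = \sum_{i \in S} D(i)$ of $h_f$ is at most $e^{-T \gamma^2 / 2}$.

The boosting algorithm AdaBoost has two advantages over this direct application of Hedge($\beta$). First, by giving a more refined analysis and choice of $\beta$, we obtain a significantly superior bound on the error $\epsilon$. Second, the algorithm does not require prior knowledge of the accuracy of the hypotheses that WeakLearn will generate. Instead, it measures the accuracy of $h_t$ at each iteration and sets its parameters accordingly. The update factor $\beta_t$ decreases with $\epsilon_t$ which causes the difference between the distributions $\mathbf{p}^t$ and $\mathbf{p}^{t+1}$ to increase. Decreasing $\beta_t$ also increases the weight $\ln(1/\beta_t)$ which is associated with $h_t$ in the final hypothesis. This makes intuitive sense: more accurate hypotheses cause larger changes in the generated distributions and have more influence on the outcome of the final hypothesis.

We now give our analysis of the performance of AdaBoost. Note that this theorem applies also if, for some hypotheses, $\epsilon_t \ge 1/2$.

Theorem 6. Suppose the weak learning algorithm WeakLearn, when called by AdaBoost, generates hypotheses with errors $\epsilon_1, \ldots, \epsilon_T$ (as defined in Step 3 of Fig. 2). Then the error $\epsilon = \Pr_{i \sim D}[h_f(x_i) \ne y_i]$ of the final hypothesis $h_f$ output by AdaBoost is bounded above by

$$\epsilon \le 2^T \prod_{t=1}^T \sqrt{\epsilon_t (1 - \epsilon_t)}. \tag{14}$$

Proof. We adapt the main arguments from Lemma 1 and Theorem 2. We use $\mathbf{p}^t$ and $\mathbf{w}^t$ as they are defined in Fig. 2.

Similar to Eq. (4), the update rule given in Step 5 in Fig. 2 implies that

$$\sum_{i=1}^N w_i^{t+1} = \sum_{i=1}^N w_i^t \beta_t^{1 - |h_t(x_i) - y_i|} \le \sum_{i=1}^N w_i^t \left(1 - (1 - \beta_t)(1 - |h_t(x_i) - y_i|)\right) = \left(\sum_{i=1}^N w_i^t\right) \left(1 - (1 - \epsilon_t)(1 - \beta_t)\right). \tag{15}$$

Combining this inequality over $t = 1, \ldots, T$, we get that

$$\sum_{i=1}^N w_i^{T+1} \le \prod_{t=1}^T \left(1 - (1 - \epsilon_t)(1 - \beta_t)\right). \tag{16}$$

The final hypothesis $h_f$, as defined in Fig. 2, makes a mistake on instance $i$ only if

$$\prod_{t=1}^T \beta_t^{-|h_t(x_i) - y_i|} \ge \left(\prod_{t=1}^T \beta_t\right)^{-1/2} \tag{17}$$

(since $y_i \in \{0, 1\}$). The final weight of any instance $i$ is

$$w_i^{T+1} = D(i) \prod_{t=1}^T \beta_t^{1 - |h_t(x_i) - y_i|}. \tag{18}$$

Combining Eqs.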
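(17) and (18), we can lower bound the sum of the final weights by the sum of the final weights of the examples on which $h_f$ is incorrect:

$$\sum_{i=1}^{N} w_i^{T+1} \ge \sum_{i : h_f(x_i) \ne y_i} w_i^{T+1} \ge \left(\sum_{i : h_f(x_i) \ne y_i} D(i)\right) \left(\prod_{t=1}^{T} \beta_t\right)^{1/2} = \epsilon \cdot \left(\prod_{t=1}^{T} \beta_t\right)^{1/2} \tag{19}$$

where $\epsilon$ is the error of $h_f$. Combining Eqs. (16) and (19), we get that

$$\epsilon \le \prod_{t=1}^{T} \frac{1 - (1 - \epsilon_t)(1 - \beta_t)}{\sqrt{\beta_t}}. \tag{20}$$

As all the factors in the product are positive, we can minimize the right-hand side by minimizing each factor separately. Setting the derivative of the $t$th factor to zero, we find that the choice of $\beta_t$ which minimizes the right-hand side is $\beta_t = \epsilon_t / (1 - \epsilon_t)$. Plugging this choice of $\beta_t$ into Eq. (20) we get Eq. (14), completing the proof. ∎

The bound on the error given in Theorem 6 can also be written in the form

$$\epsilon \le \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} = \exp\left(-\sum_{t=1}^{T} \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma_t)\right) \le \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right) \tag{21}$$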
where $\mathrm{KL}(a \,\|\, b) = a \ln(a/b) + (1 - a) \ln((1 - a)/(1 - b))$ is the Kullback-Leibler divergence, and where $\epsilon_t$ has been replaced by $1/2 - \gamma_t$. In the case where the errors of all the hypotheses are equal to $1/2 - \gamma$, Eq. (21) simplifies to

$$\epsilon \le (1 - 4\gamma^2)^{T/2} = \exp\left(-T \cdot \mathrm{KL}(1/2 \,\|\, 1/2 - \gamma)\right) \le \exp(-2T\gamma^2). \tag{22}$$

This is a form of the Chernoff bound for the probability that less than $T/2$ coin flips turn out "heads" in $T$ tosses of a random coin whose probability for "heads" is $1/2 - \gamma$. This bound has the same asymptotic behavior as the bound given for the boost-by-majority algorithm [11]. From Eq. (22) we get that the number of iterations of the boosting algorithm that is sufficient to achieve error $\epsilon$ of $h_f$ is

$$T = \frac{1}{\mathrm{KL}(1/2 \,\|\, 1/2 - \gamma)} \ln\frac{1}{\epsilon} \le \frac{1}{2\gamma^2} \ln\frac{1}{\epsilon}. \tag{23}$$

Note, however, that when the errors of the hypotheses generated by WeakLearn are not uniform, Theorem 6 implies that the final error depends on the error of all of the weak hypotheses. Previous bounds on the errors of boosting algorithms depended only on the maximal error of the weakest hypothesis and ignored the advantage that can be gained from the hypotheses whose errors are smaller. This advantage seems to be very relevant to practical applications of boosting, because there one expects the error of the learning algorithm to increase as the distributions fed to WeakLearn shift more and more away from the target distribution.

4.3. Generalization Error

We now come back to discussing the error of the final hypothesis outside the training set. Theorem 6 guarantees that the error of $h_f$ on the sample is small; however, the quantity that interests us is the generalization error of $h_f$, which is the error of $h_f$ over the whole instance space $X$; that is, $\epsilon_g = \Pr_{(x, y) \sim P}[h_f(x) \ne y]$. In order to make $\epsilon_g$ close to the empirical error $\hat{\epsilon}$ on the training set, we have to restrict the choice of $h_f$ in some way. One natural way of doing this in the context of boosting is to restrict the weak learner to choose its hypotheses from some simple class of functions and restrict $T$, the number of weak hypotheses that are combined to make $h_f$. The choice of the class of weak hypotheses is specific to the learning problem at hand and should reflect our knowledge about the properties of the unknown concept.
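Since $T$ is one of the quantities being restricted here, it may help to see how the training-error bound of Theorem 6 and the iteration count of Eq. (23) behave numerically. The following Python sketch is only an illustration: the function names and the sample error values are made up for this example and do not come from the paper.

```python
import math

def kl_bernoulli(a, b):
    """KL(a || b) = a ln(a/b) + (1-a) ln((1-a)/(1-b)) for a, b in (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def theorem6_bound(eps):
    """Right-hand side of Eq. (14): 2^T * prod_t sqrt(eps_t * (1 - eps_t))."""
    return math.prod(2.0 * math.sqrt(e * (1.0 - e)) for e in eps)

def kl_form(eps):
    """Middle expression of Eq. (21); note 1/2 - gamma_t is just eps_t."""
    return math.exp(-sum(kl_bernoulli(0.5, e) for e in eps))

def exp_form(eps):
    """Looser right-hand side of Eq. (21): exp(-2 * sum_t gamma_t^2)."""
    return math.exp(-2.0 * sum((0.5 - e) ** 2 for e in eps))

def iterations_needed(gamma, target):
    """Smallest integer T with (1 - 4 gamma^2)^(T/2) <= target, cf. Eq. (23)."""
    return math.ceil(math.log(1.0 / target) / kl_bernoulli(0.5, 0.5 - gamma))

# Illustrative (made-up) weak-hypothesis errors.
eps = [0.30, 0.35, 0.40, 0.45]
print(theorem6_bound(eps), kl_form(eps), exp_form(eps))
print(iterations_needed(0.1, 0.01))
```

The first two printed values should agree up to rounding, since they are two ways of writing the same product, while the third is the weaker exponential bound.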
As for the choice of $T$, various general methods can be devised. One popular method is to use an upper bound on the VC-dimension of the concept class. This method is sometimes called "structural risk minimization." See Vapnik's book [23] for an extensive discussion of the theory of structural risk minimization. For our purposes, we quote Vapnik's Theorem 6.7:

Theorem 7 (Vapnik). Let $H$ be a class of binary functions over some domain $X$. Let $d$ be the VC-dimension of $H$. Let $P$ be a distribution over the pairs $X \times \{0, 1\}$. For $h \in H$, define the (generalization) error of $h$ with respect to $P$ to be

$$\epsilon_g(h) \doteq \Pr_{(x, y) \sim P}[h(x) \ne y].$$

Let $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ be a sample (training set) of $N$ independent random examples drawn from $X \times \{0, 1\}$ according to $P$. Define the empirical error of $h$ with respect to the sample $S$ to be

$$\hat{\epsilon}(h) \doteq \frac{|\{i : h(x_i) \ne y_i\}|}{N}.$$

Then, for any $\delta > 0$ we have that

$$\Pr\left[\exists h \in H : |\hat{\epsilon}(h) - \epsilon_g(h)| > 2\sqrt{\frac{d\left(\ln\frac{2N}{d} + 1\right) + \ln\frac{9}{\delta}}{N}}\,\right] \le \delta$$

where the probability is computed with respect to the random choice of the sample $S$.

Let $\Theta : \mathbb{R} \to \{0, 1\}$ be defined by

$$\Theta(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

and, for any class $H$ of functions, let $\Theta_T(H)$ be the class of all functions defined as a linear threshold of $T$ functions in $H$:

$$\Theta_T(H) = \left\{ \Theta\left(\sum_{t=1}^{T} a_t h_t - b\right) : b, a_1, \ldots, a_T \in \mathbb{R};\ h_1, \ldots, h_T \in H \right\}.$$

Clearly, if all hypotheses generated by WeakLearn belong to some class $H$, then the final hypothesis of AdaBoost, after $T$ rounds of boosting, belongs to $\Theta_T(H)$. Thus, the next theorem provides an upper bound on the VC-dimension of the class of final hypotheses generated by AdaBoost in terms of the weak hypothesis class.
Theorem 8. Let $H$ be a class of binary functions of VC-dimension $d \ge 2$. Then the VC-dimension of $\Theta_T(H)$ is at most $2(d+1)(T+1)\log_2(e(T+1))$ (where $e$ is the base of the natural logarithm). Therefore, if the hypotheses generated by WeakLearn are chosen from a class of VC-dimension $d \ge 2$, then the final hypotheses generated by AdaBoost after $T$ iterations belong to a class of VC-dimension at most $2(d+1)(T+1)\log_2[e(T+1)]$.

Proof. We use a result about the VC-dimension of computation networks proved by Baum and Haussler [1]. We can view the final hypothesis output by AdaBoost as a function that is computed by a two-layer feed-forward network where the computation units of the first layer are the weak hypotheses and the computation unit of the second layer is the linear threshold function which combines the weak hypotheses. The VC-dimension of the set of linear threshold functions over $\mathbb{R}^T$ is $T + 1$ [26]. Thus the sum over all computation units of the VC-dimensions of the classes of functions associated with each unit is $Td + (T+1) < (T+1)(d+1)$. Baum and Haussler's Theorem 1 [1] implies that the number of different functions that can be realized by $h \in \Theta_T(H)$ when the domain is restricted to a set of size $m$ is at most $\left((T+1)em / ((T+1)(d+1))\right)^{(T+1)(d+1)}$. If $d \ge 2$, $T \ge 1$ and we set $m = \lceil 2(T+1)(d+1)\log_2[e(T+1)] \rceil$, then the number of realizable functions is smaller than $2^m$, which implies that the VC-dimension of $\Theta_T(H)$ is smaller than $m$. ∎

Following the guidelines of structural risk minimization we can do the following (assuming we know a reasonable upper bound on the VC-dimension of the class of weak hypotheses). Let $h_f^T$ be the hypothesis generated by running AdaBoost for $T$ iterations. By combining the observed empirical error of $h_f^T$ with the bounds given in Theorems 7 and 8, we can compute an upper bound on the generalization error of $h_f^T$ for all $T$. We would then select the hypothesis $h_f^T$ that minimizes the guaranteed upper bound. While structural risk minimization is a mathematically sound method, the upper bounds on $\epsilon_g$ that are generated in this way might be larger than the actual value and so the chosen number of iterations $T$ might be much smaller than the optimal value, leading to inferior performance.
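The selection procedure just described can be sketched in a few lines of Python. This is a rough illustration under stated assumptions rather than the authors' procedure: the function names and the sample training errors are hypothetical, the deviation term follows the reading of Theorem 7 as $2\sqrt{(d(\ln(2N/d)+1)+\ln(9/\delta))/N}$, and the same $\delta$ is reused for every $T$ without any union-bound correction.

```python
import math

def vc_bound_theorem8(d, T):
    """Upper bound 2(d+1)(T+1) log2(e(T+1)) on the VC-dimension of Theta_T(H)."""
    return 2 * (d + 1) * (T + 1) * math.log2(math.e * (T + 1))

def deviation_theorem7(vc, N, delta):
    """Uniform deviation 2*sqrt((vc*(ln(2N/vc) + 1) + ln(9/delta)) / N)."""
    return 2.0 * math.sqrt((vc * (math.log(2 * N / vc) + 1) + math.log(9 / delta)) / N)

def select_T(train_errors, d, N, delta):
    """Pick T in 1..len(train_errors) minimizing empirical error + deviation.

    train_errors[T-1] is the observed training error after T rounds.
    """
    best = None
    for T, err in enumerate(train_errors, start=1):
        bound = err + deviation_theorem7(vc_bound_theorem8(d, T), N, delta)
        if best is None or bound < best[1]:
            best = (T, bound)
    return best

# Made-up training errors after T = 1..5 rounds, for illustration only.
print(select_T([0.30, 0.20, 0.12, 0.08, 0.06], d=2, N=10**6, delta=0.05))
```

Even in this toy setting the deviation term grows with $T$, which is the effect the text warns about: the guaranteed bound can favor a smaller $T$ than the one that would actually generalize best.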
A simple alternative is to use "cross-validation," in which a fraction of the training set is left outside the set used to generate $h_f$ as the so-called "validation" set. The value of $T$ is then chosen to be the one for which the error of the final hypothesis on the validation set is minimized. (For an extensive analysis of the relations between different methods for selecting model complexity in learning, see Kearns et al. [17].)

Some initial experiments using AdaBoost on real-world problems conducted by ourselves and Drucker and Cortes [8] indicate that AdaBoost tends not to overfit; on many problems, even after hundreds of rounds of boosting, the generalization error continues to drop, or at least does not increase.

4.4. A Bayesian Interpretation

The final hypothesis generated by AdaBoost is closely related to one suggested by a Bayesian analysis. As usual, we assume that examples $(x, y)$ are being generated according to some distribution $P$ on $X \times \{0, 1\}$; all probabilities in this subsection are taken with respect to $P$. Suppose we are given a set of $\{0, 1\}$-valued hypotheses $h_1, \ldots, h_T$ and that our goal is to combine the predictions of these hypotheses in the optimal way. Then, given an instance $x$ and the hypothesis predictions $h_t(x)$, the Bayes optimal decision rule says that we should predict the label with the highest likelihood, given the hypothesis values, i.e., we should predict 1 if

$$\Pr[y = 1 \mid h_1(x), \ldots, h_T(x)] > \Pr[y = 0 \mid h_1(x), \ldots, h_T(x)],$$

and otherwise we should predict 0.

This rule is especially easy to compute if we assume that the errors of the different hypotheses are independent of one another and of the target concept, that is, if we assume that the event $h_t(x) \ne y$ is conditionally independent of the actual label $y$ and the predictions of all the other hypotheses $h_1(x), \ldots, h_{t-1}(x), h_{t+1}(x), \ldots, h_T(x)$. In this case, by applying Bayes rule, we can rewrite the Bayes optimal decision rule in a particularly simple form in which we predict 1 if

$$\Pr[y = 1] \prod_{t : h_t(x) = 0} \epsilon_t \prod_{t : h_t(x) = 1} (1 - \epsilon_t) > \Pr[y = 0] \prod_{t : h_t(x) = 0} (1 - \epsilon_t) \prod_{t : h_t(x) = 1} \epsilon_t,$$

and 0 otherwise. Here $\epsilon_t = \Pr[h_t(x) \ne y]$. We add to the set of hypotheses the trivial hypothesis $h_0$ which always predicts the value 1. We can then replace $\Pr[y = 0]$ by $\epsilon_0$.
Taking the logarithm of both sides in this inequality and rearranging the terms, we find that the Bayes optimal decision rule is identical to the combination rule that is generated by AdaBoost.

If the errors of the different hypotheses are dependent, then the Bayes optimal decision rule becomes much more complicated. However, in practice, it is common to use the simple rule described above even when there is no justification for assuming independence. (This is sometimes called "naive Bayes.") An interesting and more principled alternative to this practice would be to use the algorithm AdaBoost to find a combination rule which, by Theorem 6, has a guaranteed non-trivial accuracy.
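The equivalence claimed in this subsection can be checked by brute force on a small example. The Python sketch below compares the product form of the independence-based Bayes rule with a weighted vote that uses weights $\ln((1-\epsilon_t)/\epsilon_t) = \ln(1/\beta_t)$ and includes the trivial hypothesis $h_0 \equiv 1$ with $\epsilon_0 = \Pr[y = 0]$. The function names and the numerical values are invented for illustration, and the vote uses a strict inequality to mirror the Bayes rule, so it can differ from AdaBoost's threshold only on exact ties.

```python
import itertools
import math

def bayes_rule(preds, eps, prior0):
    """Product form of the Bayes rule under the independence assumption."""
    lhs, rhs = 1.0 - prior0, prior0            # Pr[y = 1], Pr[y = 0]
    for h, e in zip(preds, eps):
        lhs *= (1.0 - e) if h == 1 else e
        rhs *= e if h == 1 else (1.0 - e)
    return 1 if lhs > rhs else 0

def weighted_vote(preds, eps, prior0):
    """Vote with weights ln((1 - eps_t)/eps_t), including h_0 = 1 with eps_0 = Pr[y = 0]."""
    hs = [1] + list(preds)
    alphas = [math.log((1.0 - e) / e) for e in [prior0] + list(eps)]
    score = sum(a * h for a, h in zip(alphas, hs))
    return 1 if score > 0.5 * sum(alphas) else 0

# Made-up errors and prior, chosen so that no exact ties occur.
eps, prior0 = [0.3, 0.2, 0.4, 0.1], 0.45
agree = all(
    bayes_rule(p, eps, prior0) == weighted_vote(p, eps, prior0)
    for p in itertools.product([0, 1], repeat=len(eps))
)
print(agree)
```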
4.5. Improving the Error Bound

We show in this section how the bound given in Theorem 6 can be improved by a factor of two. The main idea of this improvement is to replace the "hard" $\{0, 1\}$-valued decision used by $h_f$ by a "soft" threshold. To be more precise, let

$$r(x) = \frac{\sum_{t=1}^{T} \left(\log\frac{1}{\beta_t}\right) h_t(x)}{\sum_{t=1}^{T} \log\frac{1}{\beta_t}}$$

be a weighted average of the weak hypotheses $h_t$. We will here consider final hypotheses of the form $h_f(x) = F(r(x))$ where $F : [0, 1] \to [0, 1]$. For the version of AdaBoost given in Fig. 2, $F(r)$ is the hard threshold that equals 1 if $r \ge 1/2$ and 0 otherwise. In this section, we will instead use soft threshold functions that take values in $[0, 1]$. As mentioned above, when $h_f(x) \in [0, 1]$, we can interpret $h_f$ as a randomized hypothesis and $h_f(x)$ as the probability of predicting 1. Then the error $E_{i \sim D}[|h_f(x_i) - y_i|]$ is simply the probability of an incorrect prediction.

Theorem 9. Let $\epsilon_1, \ldots, \epsilon_T$ be as in Theorem 6, and let $r(x_i)$ be as defined above. Let the modified final hypothesis be defined by $h_f(x) = F(r(x))$ where $F$ satisfies the following for $r \in [0, 1]$:

$$F(1 - r) = 1 - F(r); \quad \text{and} \quad F(r) \le \frac{1}{2}\left(\prod_{t=1}^{T} \beta_t\right)^{1/2 - r}.$$

Then the error $\epsilon$ of $h_f$ is bounded above by

$$\epsilon \le 2^{T-1} \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

For instance, it can be shown that the sigmoid function $F(r) = \left(1 + \prod_{t=1}^{T} \beta_t^{2r - 1}\right)^{-1}$ satisfies the conditions of the theorem.

Proof. By our assumptions on $F$, the error of $h_f$ is

$$\epsilon = \sum_{i=1}^{N} D(i) \cdot |F(r(x_i)) - y_i| = \sum_{i=1}^{N} D(i)\, F(|r(x_i) - y_i|) \le \frac{1}{2} \sum_{i=1}^{N} D(i) \left(\prod_{t=1}^{T} \beta_t\right)^{1/2 - |r(x_i) - y_i|}.$$

Since $y_i \in \{0, 1\}$ and by definition of $r(x_i)$, this implies that

$$\epsilon \le \frac{1}{2} \sum_{i=1}^{N} D(i) \prod_{t=1}^{T} \beta_t^{1/2 - |h_t(x_i) - y_i|} = \frac{1}{2} \left(\sum_{i=1}^{N} w_i^{T+1}\right) \prod_{t=1}^{T} \beta_t^{-1/2} \le \frac{1}{2} \prod_{t=1}^{T} \left(\left(1 - (1 - \epsilon_t)(1 - \beta_t)\right) \beta_t^{-1/2}\right).$$

The last two steps follow from Eqs. (18) and (16), respectively. The theorem now follows from our choice of $\beta_t$. ∎

5. BOOSTING FOR MULTICLASS AND REGRESSION PROBLEMS

So far, we have restricted our attention to binary classification problems in which the set of labels $Y$ contains only two elements. In this section, we describe two possible extensions of AdaBoost to the multiclass case in which $Y$ is any finite set of class labels. We also give an extension for a regression problem in which $Y$ is a real bounded interval. We start with the multiple-label classification problem. Let $Y = \{1, 2, \ldots, k\}$ be the set of possible labels.
The boosting algorithms we present output hypotheses $h_f : X \to Y$, and the error of the final hypothesis is measured in the usual way as the probability of an incorrect prediction.

The first extension of AdaBoost, which we call AdaBoost.M1, is the most direct. The weak learner generates hypotheses which assign to each instance one of the $k$ possible labels. We require that each weak hypothesis have prediction error less than 1/2 (with respect to the distribution on which it was trained). Provided this requirement can be met, we are able to prove that the error of the combined final hypothesis decreases exponentially, as in the binary case. Intuitively, however, this requirement on the performance of the weak learner is stronger than might be desired. In the binary case ($k = 2$), a random guess will be correct with probability 1/2, but when $k > 2$, the probability of a correct random prediction is only $1/k < 1/2$. Thus, our requirement that the accuracy of the weak hypothesis be greater than 1/2 is significantly stronger than simply requiring that the weak hypothesis perform better than random guessing.

In fact, when the performance of the weak learner is measured only in terms of error rate, this difficulty is unavoidable, as is shown by the following informal example (also presented by Schapire [22]): Consider a learning problem where $Y = \{0, 1, 2\}$ and suppose that it is "easy" to predict whether the label is 2 but "hard" to predict whether the label is 0 or 1. Then a hypothesis which predicts correctly whenever the label is 2 and otherwise guesses
randomly between 0 and 1 is guaranteed to be correct at least half of the time (significantly beating the 1/3 accuracy achieved by guessing entirely at random). On the other hand, boosting this learner to an arbitrary accuracy is infeasible since we assumed that it is hard to distinguish 0- and 1-labelled instances.

As a more natural example of this problem, consider classification of handwritten digits in an OCR application. It may be easy for the weak learner to tell that a particular image of a "7" is not a "0" but hard to tell for sure if it is a "7" or a "9". Part of the problem here is that, although the boosting algorithm can focus the attention of the weak learner on the harder examples, it has no way of forcing the weak learner to discriminate between particular labels that may be especially hard to distinguish.

In our second version of multiclass boosting, we attempt to overcome this difficulty by extending the communication between the boosting algorithm and the weak learner. First, we allow the weak learner to generate more expressive hypotheses whose output is a vector in $[0, 1]^k$, rather than a single label in $Y$. Intuitively, the $y$th component of this vector represents a "degree of belief" that the correct label is $y$. The components with large values (close to 1) correspond to those labels considered to be plausible. Likewise, labels considered implausible are assigned a small value (near 0), and questionable labels may be assigned a value near 1/2. If several labels are considered plausible (or implausible), then they all may be assigned large (or small) values.

While we give the weak learning algorithm more expressive power, we also place a more complex requirement on the performance of the weak hypotheses. Rather than using the usual prediction error, we ask that the weak hypotheses do well with respect to a more sophisticated error measure that we call the pseudo-loss. This pseudo-loss varies from example to example, and from one round to the next. On each iteration, the pseudo-loss function is supplied to the weak learner by the boosting algorithm, along with the distribution on the examples.
By manipulating the pseudo-loss function, the boosting algorithm can focus the weak learner on the labels that are hardest to discriminate. The boosting algorithm AdaBoost.M2, described in Section 5.2, is based on these ideas and achieves boosting if each weak hypothesis has pseudo-loss slightly better than random guessing (with respect to the pseudo-loss measure that was supplied to the weak learner).

In addition to the two extensions described in this paper, we mention an alternative, standard approach which would be to convert the given multiclass problem into several binary problems, and then to use boosting separately on each of the binary problems. There are several standard ways of making such a conversion, one of the most successful being the error-correcting output coding approach advocated by Dietterich and Bakiri [7].

Finally, in Section 5.3 we extend AdaBoost to boosting regression algorithms. In this case $Y = [0, 1]$, and the error of a hypothesis is defined as $E_{(x, y) \sim P}[(h(x) - y)^2]$. We describe a boosting algorithm AdaBoost.R which, using methods similar to those used in AdaBoost.M2, boosts the performance of a weak regression algorithm.

5.1. First Multiclass Extension

In our first and most direct extension to the multiclass case, the goal of the weak learner is to generate on round $t$ a hypothesis $h_t : X \to Y$ with low classification error $\epsilon_t \doteq \Pr_{i \sim \mathbf{p}^t}[h_t(x_i) \ne y_i]$. Our extended boosting algorithm, called AdaBoost.M1, is shown in Fig. 3, and differs only slightly from AdaBoost. The main difference is in the replacement of the error $|h_t(x_i) - y_i|$ for the binary case by $[\![h_t(x_i) \ne y_i]\!]$ where, for any predicate $\pi$, we define $[\![\pi]\!]$ to be 1 if $\pi$ holds and 0 otherwise. Also, the final hypothesis $h_f$, for a given instance $x$, now outputs the label $y$ that maximizes the sum of the weights of the weak hypotheses predicting that label.

In the case of binary classification ($k = 2$), a weak hypothesis $h$ with error significantly larger than 1/2 is of equal value to one with error significantly less than 1/2 since $h$ can be replaced by $1 - h$. However, for $k > 2$, a hypothesis $h_t$ with error $\epsilon_t \ge 1/2$ is useless to the boosting algorithm.
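As a rough illustration of AdaBoost.M1 as described here, the following Python sketch runs the loop on a toy one-dimensional problem. Several things in it are assumptions made for this example and are not taken from the paper: labels are shifted to $0, \ldots, k-1$, $D$ is uniform, hypotheses are represented only by their predictions on the training set, the weak learner `stump_learner` is a hypothetical one-threshold rule, and a tiny floor on $\epsilon_t$ guards against division by zero when a weak hypothesis is perfect.

```python
import numpy as np

def stump_learner(x, y, k):
    """Hypothetical weak learner: best single-threshold rule on a 1-D feature."""
    thresholds = np.concatenate(([-np.inf], np.unique(x)))
    def learn(p):
        best_err, best_pred = np.inf, None
        for th in thresholds:
            left = x <= th
            pred = np.empty(len(y), dtype=int)
            for side in (left, ~left):
                if side.any():
                    # predict the label with the largest p-weighted mass on this side
                    mass = np.bincount(y[side], weights=p[side], minlength=k)
                    pred[side] = mass.argmax()
            err = p[pred != y].sum()
            if err < best_err:
                best_err, best_pred = err, pred
        return best_pred
    return learn

def adaboost_m1(y, k, weak_learn, T):
    N = len(y)
    w = np.full(N, 1.0 / N)                  # initial weights = D (uniform, assumed)
    votes = np.zeros((N, k))
    eps_list = []
    for _ in range(T):
        p = w / w.sum()                      # normalize weights into a distribution
        pred = weak_learn(p)                 # weak hypothesis, as training-set predictions
        miss = pred != y
        eps = float(p[miss].sum())           # weighted error of the weak hypothesis
        if eps > 0.5:
            break                            # halt, keeping only earlier hypotheses
        eps = max(eps, 1e-12)                # guard for eps = 0 (not part of the paper)
        beta = eps / (1.0 - eps)
        w = w * np.where(miss, 1.0, beta)    # correctly classified examples are down-weighted
        votes[np.arange(N), pred] += np.log(1.0 / beta)
        eps_list.append(eps)
    return votes.argmax(axis=1), eps_list

x = np.arange(9, dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])    # three contiguous classes, toy data
pred, eps_list = adaboost_m1(y, 3, stump_learner(x, y, 3), T=10)
train_error = float(np.mean(pred != y))
bound = float(np.prod([2 * np.sqrt(e * (1 - e)) for e in eps_list]))
print(train_error, bound)
```

On this toy data a single threshold can never separate all three classes, so the example only works because the weighted vote combines several such rules; `bound` is the product $2^T \prod_t \sqrt{\epsilon_t(1-\epsilon_t)}$ over the hypotheses actually kept.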
If such a weak hypothesis is returned by the weak learner, our algorithm simply halts, using only the weak hypotheses that were already computed.

Algorithm AdaBoost.M1
Input: sequence of $N$ examples $\langle (x_1, y_1), \ldots, (x_N, y_N) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$
       distribution $D$ over the examples
       weak learning algorithm WeakLearn
       integer $T$ specifying number of iterations
Initialize the weight vector: $w_i^1 = D(i)$ for $i = 1, \ldots, N$.
Do for $t = 1, 2, \ldots, T$
  1. Set $\mathbf{p}^t = \mathbf{w}^t / \sum_{i=1}^{N} w_i^t$.
  2. Call WeakLearn, providing it with the distribution $\mathbf{p}^t$; get back a hypothesis $h_t : X \to Y$.
  3. Calculate the error of $h_t$: $\epsilon_t = \sum_{i=1}^{N} p_i^t [\![h_t(x_i) \ne y_i]\!]$. If $\epsilon_t > 1/2$, then set $T = t - 1$ and abort loop.
  4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  5. Set the new weights vector to be $w_i^{t+1} = w_i^t \beta_t^{1 - [\![h_t(x_i) \ne y_i]\!]}$.
Output the hypothesis
  $h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left(\log\frac{1}{\beta_t}\right) [\![h_t(x) = y]\!]$.

FIG. 3. A first multiclass extension of AdaBoost.

Theorem 10. Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M1, generates hypotheses with errors $\epsilon_1, \ldots, \epsilon_T$, where $\epsilon_t$ is as defined in Fig. 3. Assume each $\epsilon_t \le 1/2$. Then the error $\epsilon = \Pr_{i \sim D}[h_f(x_i) \ne y_i]$ of the final hypothesis $h_f$ output by AdaBoost.M1 is bounded above by

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

Proof. To prove this theorem, we reduce our setup for AdaBoost.M1 to an instantiation of AdaBoost, and then apply Theorem 6. For clarity, we mark with tildes variables in the reduced AdaBoost space. For each of the given examples $(x_i, y_i)$, we define an AdaBoost example $(\tilde{x}_i, \tilde{y}_i)$ in which $\tilde{x}_i = i$ and $\tilde{y}_i = 0$. We define the AdaBoost distribution $\tilde{D}$ over examples to be equal to the AdaBoost.M1 distribution $D$. On the $t$th round, we provide AdaBoost with a hypothesis $\tilde{h}_t$ defined by the rule

$$\tilde{h}_t(i) = [\![h_t(x_i) \ne y_i]\!]$$

in terms of the $t$th hypothesis $h_t$ which was returned to AdaBoost.M1 by WeakLearn.

Given this setup, it can be easily proved by induction on the number of rounds that the weight vectors, distributions and errors computed by AdaBoost and AdaBoost.M1 are identical so that $\tilde{\mathbf{w}}^t = \mathbf{w}^t$, $\tilde{\mathbf{p}}^t = \mathbf{p}^t$, $\tilde{\epsilon}_t = \epsilon_t$ and $\tilde{\beta}_t = \beta_t$.

Suppose that AdaBoost.M1's final hypothesis $h_f$ makes a mistake on instance $i$ so that $h_f(x_i) \ne y_i$. Then, by definition of $h_f$,

$$\sum_{t=1}^{T} \alpha_t [\![h_t(x_i) = y_i]\!] \le \sum_{t=1}^{T} \alpha_t [\![h_t(x_i) = h_f(x_i)]\!]$$

where $\alpha_t = \ln(1/\beta_t)$. This implies

$$\sum_{t=1}^{T} \alpha_t [\![h_t(x_i) = y_i]\!] \le \frac{1}{2} \sum_{t=1}^{T} \alpha_t,$$

using the fact that each $\alpha_t \ge 0$ since $\epsilon_t \le 1/2$. By definition of $\tilde{h}_t$, this implies

$$\sum_{t=1}^{T} \alpha_t \tilde{h}_t(i) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t,$$

so $\tilde{h}_f(i) = 1$ by definition of the final AdaBoost hypothesis. Therefore,

$$\Pr_{i \sim D}[h_f(x_i) \ne y_i] \le \Pr_{i \sim D}[\tilde{h}_f(i) = 1].$$

Since each AdaBoost instance has a 0-label, $\Pr_{i \sim D}[\tilde{h}_f(i) = 1]$ is exactly the error of $\tilde{h}_f$. Applying Theorem 6, we can obtain a bound on this error, completing the proof. ∎

It is possible, for this version of the boosting algorithm, to allow hypotheses which generate for each $x$ not only a predicted class label $h(x) \in Y$, but also a "confidence" $\kappa(x) \in [0, 1]$. The learner then suffers loss $1/2 - \kappa(x)/2$ if its prediction is correct and $1/2 + \kappa(x)/2$ otherwise. (Details omitted.)

5.2. Second Multiclass Extension

In this section we describe a second alternative extension of AdaBoost to the case where the label space $Y$ is finite. This extension requires more elaborate communication between the boosting algorithm and the weak learning algorithm. The advantage of doing this is that it gives the weak learner more flexibility in making its predictions. In particular, it sometimes enables the weak learner to make useful contributions to the accuracy of the final hypothesis even when the weak hypothesis does not predict the correct label with probability greater than 1/2.

As described above, the weak learner generates hypotheses which have the form $h : X \times Y \to [0, 1]$. Roughly speaking, $h(x, y)$ measures the degree to which it is believed that $y$ is the correct label associated with instance $x$. If, for a given $x$, $h(x, y)$ attains the same value for all $y$ then we say that the hypothesis is uninformative on instance $x$. On the other hand, any deviation from strict equality is potentially informative, because it predicts some labels to be more plausible than others. As will be seen, any such information is potentially useful for the boosting algorithm.

Below, we formalize the goal of the weak learner by defining a pseudo-loss which measures the goodness of the weak hypotheses. To motivate our definition, we first consider the following setup. For a fixed training example $(x_i, y_i)$, we use a given hypothesis $h$ to answer $k - 1$ binary questions. For each of the incorrect labels $y \ne y_i$ we ask the question: "Which is the label of $x_i$: $y_i$ or $y$?" In other words, we ask that the correct label $y_i$ be discriminated from the incorrect label $y$.

Assume momentarily that $h$ only takes values in $\{0, 1\}$. Then if $h(x_i, y) = 0$ and $h(x_i, y_i) = 1$, we interpret $h$'s answer to the question above to be $y_i$ (since $h$ deems $y_i$ to be a plausible label for $x_i$, but $y$ is considered implausible). Likewise, if $h(x_i, y) = 1$ and $h(x_i, y_i) = 0$ then the answer is $y$. If $h(x_i, y) = h(x_i, y_i)$, then one of the two answers is chosen uniformly at random.
In the more general case that $h$ takes values in $[0, 1]$, we interpret $h(x, y)$ as a randomized decision for the procedure above. That is, we first choose a random bit $b(x, y)$ which is 1 with probability $h(x, y)$ and 0 otherwise. We then apply the above procedure to the stochastically chosen binary function $b$. The probability of choosing the incorrect answer $y$ to the question above is

$$\Pr[b(x_i, y_i) = 0 \wedge b(x_i, y) = 1] + \frac{1}{2}\Pr[b(x_i, y_i) = b(x_i, y)] = \frac{1}{2}\left(1 - h(x_i, y_i) + h(x_i, y)\right).$$

If the answers to all $k - 1$ questions are considered equally important, then it is natural to define the loss of the hypothesis to be the average, over all $k - 1$ questions, of the probability of an incorrect answer:

$$\frac{1}{k - 1} \sum_{y \ne y_i} \frac{1}{2}\left(1 - h(x_i, y_i) + h(x_i, y)\right) = \frac{1}{2}\left(1 - h(x_i, y_i) + \frac{1}{k - 1} \sum_{y \ne y_i} h(x_i, y)\right). \tag{24}$$

However, as was discussed in the introduction to Section 5, different discrimination questions are likely to have different importance in different situations. For example, considering the OCR problem described earlier, it might be that at some point during the boosting process, some example of the digit "7" has been recognized as being either a "7" or a "9". At this stage the question that discriminates between "7" (the correct label) and "9" is clearly much more important than the other eight questions that discriminate "7" from the other digits.

A natural way of attaching different degrees of importance to the different questions is to assign a weight to each question. So, for each instance $x_i$ and incorrect label $y \ne y_i$, we assign a weight $q(i, y)$ which we associate with the question that discriminates label $y$ from the correct label $y_i$. We then replace the average used in Eq. (24) with an average weighted according to $q(i, y)$; the resulting formula is called the pseudo-loss of $h$ on training instance $i$ with respect to $q$:

$$\mathrm{ploss}_q(h, i) \doteq \frac{1}{2}\left(1 - h(x_i, y_i) + \sum_{y \ne y_i} q(i, y)\, h(x_i, y)\right).$$

The function $q : \{1, \ldots, N\} \times Y \to [0, 1]$, called the label weighting function, assigns to each example $i$ in the training set a probability distribution over the $k - 1$ discrimination problems defined above. So, for all $i$,

$$\sum_{y \ne y_i} q(i, y) = 1.$$

The weak learner's goal is to minimize the expected pseudo-loss for given distribution $D$ and weighting function $q$:

$$\mathrm{ploss}_{D, q}(h) = E_{i \sim D}\left[\mathrm{ploss}_q(h, i)\right].$$
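The pseudo-loss just defined is straightforward to compute once $h$, $q$ and $D$ are stored as arrays. The sketch below is a minimal illustration under assumptions chosen for this example: labels are shifted to $0, \ldots, k-1$, the hypothesis is given as a table `H[i, c]` of values $h(x_i, c)$, `q[i, y[i]]` is stored as 0, and the function name is hypothetical.

```python
import numpy as np

def pseudo_loss(H, y, q, D):
    """Expected pseudo-loss and the per-example pseudo-losses.

    H[i, c] = h(x_i, c) in [0, 1]; y[i] in 0..k-1;
    q[i, c] = label weights with q[i, y[i]] = 0 and each row summing to 1;
    D = distribution over the N examples.
    """
    idx = np.arange(len(y))
    per_example = 0.5 * (1.0 - H[idx, y] + (q * H).sum(axis=1))
    return float(D @ per_example), per_example

N, k = 4, 3
y = np.array([0, 1, 2, 1])
idx = np.arange(N)
D = np.full(N, 1.0 / N)
q = np.full((N, k), 1.0 / (k - 1))           # equal weight on each incorrect label
q[idx, y] = 0.0

uninformative = np.full((N, k), 0.7)         # same value for every label
perfect = np.zeros((N, k))
perfect[idx, y] = 1.0                        # 1 on the correct label, 0 elsewhere

print(pseudo_loss(uninformative, y, q, D)[0])
print(pseudo_loss(perfect, y, q, D)[0])
print(pseudo_loss(1.0 - perfect, y, q, D)[0])
```

The three hypotheses illustrate the range of the measure: an uninformative hypothesis, one that puts all belief on the correct label, and its complement $1 - h$.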
As we have seen, by manipulating both the distribution on instances and the label weighting function $q$, our boosting algorithm effectively forces the weak learner to focus not only on the hard instances, but also on the incorrect class labels that are hardest to eliminate. Conversely, this pseudo-loss measure may make it easier for the weak learner to get a weak advantage. For instance, if the weak learner can simply determine that a particular instance does not belong to a certain class (even if it has no idea which of the remaining classes is the correct one), then, depending on $q$, this may be enough to gain a weak advantage.

Theorem 11, the main result of this section, shows that a weak learner can be boosted if it can consistently produce weak hypotheses with pseudo-losses smaller than 1/2. Note that pseudo-loss 1/2 can be achieved trivially by any uninformative hypothesis. Furthermore, a weak hypothesis $h$ with pseudo-loss $\epsilon > 1/2$ is also beneficial to boosting since it can be replaced by the hypothesis $1 - h$ whose pseudo-loss is $1 - \epsilon < 1/2$.

Example 5. As a simple example illustrating the use of pseudo-loss, suppose we seek an oblivious weak hypothesis, i.e., a weak hypothesis whose value depends only on the class label $y$ so that $h(x, y) = h(y)$ for all $x$. Although oblivious hypotheses per se are generally too weak to be of interest, it may often be appropriate to find the best oblivious hypothesis on a part of the instance space (such as the set of instances covered by a leaf of a decision tree).

Let $D$ be the target distribution, and $q$ the label weighting function. For notational convenience, let us define $q(i, y_i) = -1$ for all $i$ so that

$$\mathrm{ploss}_q(h, i) = \frac{1}{2}\left(1 + \sum_{y \in Y} q(i, y)\, h(x_i, y)\right).$$

Setting $\delta(y) = \sum_i D(i)\, q(i, y)$, it can be verified that for an oblivious hypothesis $h$,

$$\mathrm{ploss}_{D, q}(h) = \frac{1}{2}\left(1 + \sum_{y \in Y} h(y)\, \delta(y)\right),$$

which is clearly minimized by the choice

$$h(y) = \begin{cases} 1 & \text{if } \delta(y) < 0 \\ 0 & \text{otherwise.} \end{cases}$$

Suppose now that $q(i, y) = 1/(k - 1)$ for $y \ne y_i$, and let $d(y) = \Pr_{i \sim D}[y_i = y]$ be the proportion of examples with
label $y$. Then it can be verified that $h$ will always have pseudo-loss strictly smaller than 1/2 except in the case of a uniform distribution of labels ($d(y) = 1/k$ for all $y$). In contrast, when the weak learner's goal is minimization of prediction error (as in Section 5.1), it can be shown that an oblivious hypothesis with prediction error strictly less than 1/2 can only be found when one label $y$ covers more than 1/2 the distribution ($d(y) > 1/2$). So in this case, it is much easier to find a hypothesis with small pseudo-loss rather than small prediction error.

On the other hand, if $q(i, y) = 0$ for some values of $y$, then the quality of prediction on these labels is of no consequence. In particular, if $q(i, y) = 0$ for all but one incorrect label for each instance, then in order to make the pseudo-loss smaller than 1/2 the hypothesis has to predict the correct label with probability larger than 1/2, which means that in this case the pseudo-loss criterion is as stringent as the usual prediction error. However, as discussed above, this case is unavoidable because a hard binary classification problem can always be embedded in a multiclass problem.

This example suggests that it may often be significantly easier to find weak hypotheses with small pseudo-loss rather than hypotheses whose prediction error is small. On the other hand, our theoretical bound for boosting using the prediction error (Theorem 10) is stronger than the bound for ploss (Theorem 11). Empirical tests [12] have shown that pseudo-loss is generally more successful when the weak learners use very restricted hypotheses. However, for more powerful weak learners, such as decision-tree learning algorithms, there is little difference between using pseudo-loss and prediction error.

Our algorithm, called AdaBoost.M2, is shown in Fig. 4. Here, we maintain weights $w_{i, y}^t$ for each instance $i$ and each label $y \in Y - \{y_i\}$. The weak learner must be provided both with a distribution $D_t$ and a label weight function $q_t$. Both of these are computed using the weight vector $\mathbf{w}^t$ as shown in Step 1. The weak learner's goal then is to minimize the pseudo-loss $\epsilon_t$, as defined in Step 3. The weights are updated as shown in Step 5.
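The steps referred to above can be sketched in Python, using the oblivious weak hypothesis of Example 5 as the weak learner. This is a hedged illustration, not the authors' code: labels are shifted to $0, \ldots, k-1$, $D$ is taken to be uniform, hypotheses are stored as tables `H[i, c]` on the training set only, and the names `oblivious_weak_learner` and `adaboost_m2` are invented here. Because an oblivious hypothesis ignores $x$, the combined prediction is a single constant label, so the example shows the bookkeeping of the algorithm rather than a useful classifier.

```python
import numpy as np

def oblivious_weak_learner(y, k):
    """Weak learner of Example 5: h(x, c) = h(c) = 1 if delta(c) < 0, else 0."""
    def learn(D_t, q_t):
        q_signed = q_t.copy()
        q_signed[np.arange(len(y)), y] = -1.0     # convention q(i, y_i) = -1
        delta = D_t @ q_signed                    # delta(c) = sum_i D(i) q(i, c)
        h = (delta < 0).astype(float)
        return np.tile(h, (len(y), 1))            # identical row for every instance
    return learn

def adaboost_m2(y, k, weak_learn, T):
    N = len(y)
    idx = np.arange(N)
    D = np.full(N, 1.0 / N)                       # uniform D (an assumption)
    w = np.tile(D[:, None] / (k - 1), (1, k))
    w[idx, y] = 0.0                               # weights exist only for incorrect labels
    votes = np.zeros((N, k))
    eps_list = []
    for _ in range(T):
        W = w.sum(axis=1)                         # Step 1: per-instance totals
        q, D_t = w / W[:, None], W / W.sum()
        H = weak_learn(D_t, q)                    # Step 2
        h_true = H[idx, y]
        eps = 0.5 * float(D_t @ (1.0 - h_true + (q * H).sum(axis=1)))  # Step 3
        beta = eps / (1.0 - eps)                  # Step 4 (requires 0 < eps < 1)
        w = w * beta ** (0.5 * (1.0 + h_true[:, None] - H))            # Step 5
        w[idx, y] = 0.0
        votes += np.log(1.0 / beta) * H
        eps_list.append(eps)
    return votes.argmax(axis=1), eps_list

k = 3
y = np.array([0, 0, 0, 1, 1, 2])                  # toy labels with unequal frequencies
pred, eps_list = adaboost_m2(y, k, oblivious_weak_learner(y, k), T=5)
train_error = float(np.mean(pred != y))
bound = (k - 1) * float(np.prod([2 * np.sqrt(e * (1 - e)) for e in eps_list]))
print(eps_list, train_error, bound)
```

Here `bound` is the quantity $(k-1)\,2^T \prod_t \sqrt{\epsilon_t(1-\epsilon_t)}$; with a learner this weak it is expected to stay above the training error by a wide margin.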
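The final hypothesis $h_f$ outputs, for a given instance $x$, the label $y$ that maximizes a weighted average of the weak hypothesis values $h_t(x, y)$.

Algorithm AdaBoost.M2
Input: sequence of $N$ examples $\langle (x_1, y_1), \ldots, (x_N, y_N) \rangle$ with labels $y_i \in Y = \{1, \ldots, k\}$
       distribution $D$ over the examples
       weak learning algorithm WeakLearn
       integer $T$ specifying number of iterations
Initialize the weight vector: $w_{i, y}^1 = D(i)/(k - 1)$ for $i = 1, \ldots, N$, $y \in Y - \{y_i\}$.
Do for $t = 1, 2, \ldots, T$
  1. Set $W_i^t = \sum_{y \ne y_i} w_{i, y}^t$; $q_t(i, y) = w_{i, y}^t / W_i^t$ for $y \ne y_i$; and set $D_t(i) = W_i^t / \sum_{i=1}^{N} W_i^t$.
  2. Call WeakLearn, providing it with the distribution $D_t$ and label weighting function $q_t$; get back a hypothesis $h_t : X \times Y \to [0, 1]$.
  3. Calculate the pseudo-loss of $h_t$: $\epsilon_t = \frac{1}{2} \sum_{i=1}^{N} D_t(i) \left(1 - h_t(x_i, y_i) + \sum_{y \ne y_i} q_t(i, y)\, h_t(x_i, y)\right)$.
  4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
  5. Set the new weights vector to be $w_{i, y}^{t+1} = w_{i, y}^t\, \beta_t^{(1/2)(1 + h_t(x_i, y_i) - h_t(x_i, y))}$ for $i = 1, \ldots, N$, $y \in Y - \{y_i\}$.
Output the hypothesis
  $h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \left(\log\frac{1}{\beta_t}\right) h_t(x, y)$.

FIG. 4. A second multiclass extension of AdaBoost.

Theorem 11. Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M2, generates hypotheses with pseudo-losses $\epsilon_1, \ldots, \epsilon_T$, where $\epsilon_t$ is as defined in Fig. 4. Then the error $\epsilon = \Pr_{i \sim D}[h_f(x_i) \ne y_i]$ of the final hypothesis $h_f$ output by AdaBoost.M2 is bounded above by

$$\epsilon \le (k - 1)\, 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}.$$

Proof. As in the proof of Theorem 10, we reduce to an instance of AdaBoost and apply Theorem 6. As before, we mark AdaBoost variables with a tilde.

For each training instance $(x_i, y_i)$ and for each incorrect label $y \in Y - \{y_i\}$, we define one AdaBoost instance $\tilde{x}_{i, y} = (i, y)$ with associated label $\tilde{y}_{i, y} = 0$. Thus, there are $\tilde{N} = N(k - 1)$ AdaBoost instances, each indexed by a pair $(i, y)$. The distribution over these instances is defined to be $\tilde{D}(i, y) = D(i)/(k - 1)$. The $t$th hypothesis $\tilde{h}_t$ provided to AdaBoost for this reduction is defined by the rule

$$\tilde{h}_t(i, y) = \frac{1}{2}\left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right).$$

With this setup, it can be verified that the computed distributions and errors will be identical so that $\tilde{w}_{i, y}^t = w_{i, y}^t$, $\tilde{p}_{i, y}^t = p_{i, y}^t$, $\tilde{\epsilon}_t = \epsilon_t$ and $\tilde{\beta}_t = \beta_t$.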
Suppose now that $h_f(x_i) \ne y_i$ for some example $i$. Then, by definition of $h_f$,

$$\sum_{t=1}^{T} \alpha_t h_t(x_i, y_i) \le \sum_{t=1}^{T} \alpha_t h_t(x_i, h_f(x_i)),$$

where $\alpha_t = \ln(1/\beta_t)$. This implies that

$$\sum_{t=1}^{T} \alpha_t \tilde{h}_t(i, h_f(x_i)) = \frac{1}{2} \sum_{t=1}^{T} \alpha_t \left(1 - h_t(x_i, y_i) + h_t(x_i, h_f(x_i))\right) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t$$

so $\tilde{h}_f(i, h_f(x_i)) = 1$ by definition of $\tilde{h}_f$. Therefore,

$$\Pr_{i \sim D}[h_f(x_i) \ne y_i] \le \Pr_{i \sim D}[\exists y \ne y_i : \tilde{h}_f(i, y) = 1].$$

Since all AdaBoost instances have a 0-label, and by definition of $\tilde{D}$, the error of $\tilde{h}_f$ is

$$\Pr_{(i, y) \sim \tilde{D}}[\tilde{h}_f(i, y) = 1] \ge \frac{1}{k - 1} \Pr_{i \sim D}[\exists y \ne y_i : \tilde{h}_f(i, y) = 1].$$

Applying Theorem 6 to bound the error of $\tilde{h}_f$, this completes the proof. ∎

Although we omit the details, the bound for AdaBoost.M2 can be improved by a factor of two in a manner similar to that described in Section 4.5.

5.3. Boosting Regression Algorithms

In this section we show how boosting can be used for a regression problem. In this setting, the label space is $Y = [0, 1]$. As before, the learner receives examples $(x, y)$ chosen at random according to some distribution $P$, and its goal is to find a hypothesis $h : X \to Y$ which, given some $x$ value, predicts approximately the value $y$ that is likely to be seen. More precisely, the learner attempts to find an $h$ with small mean squared error (MSE):

$$E_{(x, y) \sim P}\left[(h(x) - y)^2\right]. \tag{25}$$

Our methods can be applied to any reasonable bounded error measure, but, for the sake of concreteness, we concentrate here on the squared error measure.

Following our approach for classification problems, we assume that the learner has been provided with a training set $(x_1, y_1), \ldots, (x_N, y_N)$ of examples distributed according to $P$, and we focus only on the minimization of the empirical MSE:

$$\frac{1}{N} \sum_{i=1}^{N} (h(x_i) - y_i)^2.$$

Using techniques similar to those outlined in Section 4.3, the true MSE given in Eq. (25) can be related to the empirical MSE.

To derive a boosting algorithm in this context, we reduce the given regression problem to a binary classification problem, and then apply AdaBoost. As was done for the reductions used in the proofs of Theorems 10 and 11, we mark with tildes all variables in the reduced (AdaBoost) space. For each example $(x_i, y_i)$ in the training set, we define a continuum of examples indexed by pairs $(i, y)$ for all $y \in [0, 1]$: the associated instance is $\tilde{x}_{i, y} = (x_i, y)$, and the label is $\tilde{y}_{i, y} = [\![y \ge y_i]\!]$. (Recall that $[\![\pi]\!]$ is 1 if predicate $\pi$
holds and 0 otherwise.) Although it is obviously infeasible to explicitly maintain an infinitely large training set, we will see later how this method can be implemented efficiently. Also, although the results of Section 4 only deal with finitely large training sets, the extension to infinite training sets is straightforward.

Thus, informally, each instance $(x_i, y_i)$ is mapped to an infinite set of binary questions, one for each $y \in Y$, and each of the form: "Is the correct label $y_i$ bigger or smaller than $y$?"

In a similar manner, each hypothesis $h : X \to Y$ is reduced to a binary-valued hypothesis $\tilde{h} : X \times Y \to \{0, 1\}$ defined by the rule

$$\tilde{h}(x, y) = [\![y \ge h(x)]\!].$$

Thus, $\tilde{h}$ attempts to answer these binary questions in a natural way using the estimated value $h(x)$.

Finally, as was done for classification problems, we assume we are given a distribution $D$ over the training set; ordinarily, this will be uniform so that $D(i) = 1/N$. In our reduction, this distribution is mapped to a density $\tilde{D}$ over pairs $(i, y)$ in such a way that minimization of classification error in the reduced space is equivalent to minimization of MSE for the original problem. To do this, we define

$$\tilde{D}(i, y) = \frac{D(i)\, |y - y_i|}{Z}$$

where $Z$ is a normalization constant:

$$Z = \sum_{i=1}^{N} D(i) \int_0^1 |y - y_i|\, dy.$$

It is straightforward to show that $1/4 \le Z \le 1/2$.
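The bounds on $Z$ follow from a one-line computation, which the text leaves to the reader. The worked derivation below is a sketch of one way to see it, written in LaTeX; it relies only on the definition of $Z$ above and on $D$ being a probability distribution.

```latex
% Each inner integral splits at y_i:
\int_0^1 |y - y_i| \, dy
  = \int_0^{y_i} (y_i - y) \, dy + \int_{y_i}^1 (y - y_i) \, dy
  = \frac{y_i^2 + (1 - y_i)^2}{2}.

% For y_i in [0, 1], this quantity is smallest at y_i = 1/2 and largest at y_i in {0, 1}:
\frac{1}{4} \le \frac{y_i^2 + (1 - y_i)^2}{2} \le \frac{1}{2}.

% Z is a D-weighted average of such terms, so it lies in the same interval:
Z = \sum_{i=1}^{N} D(i) \, \frac{y_i^2 + (1 - y_i)^2}{2}
  \in \left[ \tfrac{1}{4}, \tfrac{1}{2} \right].
```

In particular, $1/(2Z)$ lies between 1 and 2.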
If we calculate the binary error of $\tilde{h}$ with respect to the density $\tilde{D}$, we find that, as desired, it is directly proportional to the mean squared error:

$$\sum_{i=1}^{N} \int_0^1 \left|\tilde{y}_{i, y} - \tilde{h}(\tilde{x}_{i, y})\right| \tilde{D}(i, y)\, dy = \frac{1}{Z} \sum_{i=1}^{N} D(i) \left| \int_{y_i}^{h(x_i)} |y - y_i|\, dy \right| = \frac{1}{2Z} \sum_{i=1}^{N} D(i)\, (h(x_i) - y_i)^2.$$

The constant of proportionality is $1/(2Z) \in [1, 2]$.

Unravelling this reduction, we obtain the regression boosting procedure AdaBoost.R shown in Fig. 5. As prescribed by the reduction, AdaBoost.R maintains a weight $w_{i, y}^t$ for each instance $i$ and label $y \in Y$. The initial weight function $w^1$ is exactly the density $\tilde{D}$ defined above. By normalizing the weights $w^t$, a density $p^t$ is defined at Step 1 and provided to the weak learner at Step 2. The goal of the weak learner is to find a hypothesis $h_t : X \to Y$ that minimizes the loss $\epsilon_t$ defined in Step 3. Finally, at Step 5, the weights are updated as prescribed by the reduction.

The definition of $\epsilon_t$ at Step 3 follows directly from the reduction above; it is exactly the classification error of $\tilde{h}_t$ in the reduced space. Note that, similar to AdaBoost.M2, AdaBoost.R not only varies the distribution over the examples $(x_i, y_i)$, but also modifies from round to round the definition of the loss suffered by a hypothesis on each example. Thus, although our ultimate goal is minimization of the squared error, the weak learner must be able to handle loss functions that are more complicated than MSE.

The final hypothesis $h_f$ also is consistent with the reduction. Each reduced weak hypothesis $\tilde{h}_t(x, y)$ is nondecreasing as a function of $y$. Thus, the final hypothesis $\tilde{h}_f$ generated by AdaBoost in the reduced space, being the threshold of a weighted sum of these hypotheses, also is nondecreasing as a function of $y$. As the output of $\tilde{h}_f$ is binary, this implies that for every $x$ there is one value of $y$ for which $\tilde{h}_f(x, y') = 0$ for all $y' < y$ and $\tilde{h}_f(x, y') = 1$ for all $y' > y$. This is exactly the value of $y$ given by $h_f(x)$ as defined in the figure. Note that $h_f$ is actually computing a weighted median of the weak hypotheses.

At first, it might seem impossible to maintain weights $w_{i, y}^t$ over an uncountable set of points. However, on closer inspection, it can be seen that, when viewed as a function of $y$, $w_{i, y}^t$ is a piecewise linear function.
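For $t = 1$, $w^1_{i,y}$ has two linear pieces, and each update at Step 5 potentially breaks one of the pieces in two at the point $h_t(x_i)$. Initializing, storing and updating such piecewise linear functions are all straightforward operations. Also, the integrals which appear in the figure can be evaluated explicitly since these only involve integration of piecewise linear functions.

Algorithm AdaBoost.R
Input: sequence of $N$ examples $\langle (x_1, y_1), \ldots, (x_N, y_N) \rangle$ with labels $y_i \in Y = [0, 1]$
 distribution $D$ over the examples
 weak learning algorithm WeakLearn
 integer $T$ specifying number of iterations
Initialize the weight vector: $w^1_{i,y} = D(i)\,|y - y_i| / Z$ for $i = 1, \ldots, N$, $y \in Y$, where $Z = \sum_{i=1}^{N} D(i) \int_0^1 |y - y_i| \, dy$.
Do for $t = 1, 2, \ldots, T$:
1. Set $p^t = w^t \big/ \sum_{i=1}^{N} \int_0^1 w^t_{i,y} \, dy$.
2. Call WeakLearn, providing it with the density $p^t$; get back a hypothesis $h_t : X \to Y$.
3. Calculate the loss of $h_t$: $\epsilon_t = \sum_{i=1}^{N} \left| \int_{y_i}^{h_t(x_i)} p^t_{i,y} \, dy \right|$. If $\epsilon_t > 1/2$, then set $T = t - 1$ and abort the loop.
4. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
5. Set the new weights vector to be $w^{t+1}_{i,y} = w^t_{i,y}$ if $y_i \le y \le h_t(x_i)$ or $h_t(x_i) \le y \le y_i$, and $w^{t+1}_{i,y} = w^t_{i,y}\,\beta_t$ otherwise, for $i = 1, \ldots, N$, $y \in Y$.
Output the hypothesis $h_f(x) = \inf \left\{ y \in Y : \sum_{t : h_t(x) \le y} \log(1/\beta_t) \ge \frac{1}{2} \sum_{t} \log(1/\beta_t) \right\}$.
FIG. 5. An extension of AdaBoost to regression problems.

The following theorem describes our performance guarantee for AdaBoost.R. The proof follows from the reduction described above coupled with a direct application of Theorem 6.

Theorem 12. Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.R, generates hypotheses with errors $\epsilon_1, \ldots, \epsilon_T$, where $\epsilon_t$ is as defined in Fig. 5. Then the mean squared error $\epsilon = E_{i \sim D}\left[ (h_f(x_i) - y_i)^2 \right]$ of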
the final hypothesis $h_f$ output by AdaBoost.R is bounded above by

$$\epsilon \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t (1 - \epsilon_t)}. \qquad (26)$$

An unfortunate property of this setup is that there is no trivial way to generate a hypothesis whose loss is 1/2. This is a similar situation to the one we encountered with algorithm AdaBoost.M1. A remedy to this problem might be to allow weak hypotheses from a more general class of functions. One simple generalization is to allow for weak hypotheses that are defined by two functions: $h : X \to [0, 1]$ as before, and $\kappa : X \to [0, 1]$ which associates a measure of confidence to each prediction of $h$. The reduced hypothesis which we associate with this pair of functions is

$$\tilde{h}(x, y) = \begin{cases} (1 + \kappa(x))/2 & \text{if } h(x) \le y \\ (1 - \kappa(x))/2 & \text{otherwise.} \end{cases}$$

These hypotheses are used in the same way as the ones defined before, and a slight variation of algorithm AdaBoost.R can be used to boost the accuracy of these more general weak learners (details omitted). The advantage of this variant is that any hypothesis for which $\kappa(x)$ is identically zero has pseudo-loss exactly 1/2, and slight deviations from this hypothesis can be used to encode very weak predictions.

The method presented in this section for boosting with square loss can be used with any reasonable bounded loss function $L : Y \times Y \to [0, 1]$. Here, $L(y', y)$ is a measure of the ``discrepancy'' between the observed label $y$ and a predicted label $y'$; for instance, above we used $L(y', y) = (y' - y)^2$. The goal of learning is to find a hypothesis $h$ with small average loss $E_{(x,y) \sim P}[L(h(x), y)]$. Assume, for any $y$, that $L(y, y) = 0$ and that $L(y', y)$ is differentiable with respect to $y'$, nonincreasing for $y' \le y$ and nondecreasing for $y' \ge y$. Then, to modify AdaBoost.R to handle such a loss function, we need only replace $|y - y_i|$ in the initialization step with $|\partial L(y, y_i) / \partial y|$. The rest of the algorithm is unchanged, and the modifications needed for the analysis are straightforward.

APPENDIX: PROOF OF THEOREM 3

We start with a brief review of a framework used by Vovk [24], which is very similar to the framework used in Section 3. In this framework, an on-line decision problem consists of a decision space $\Delta$, an outcome space $\Omega$ and a loss function $\lambda : \Delta \times \Omega \to [0, \infty]$, which associates a loss to each decision and outcome pair.
At each trial $t$ the learning algorithm receives the decisions $\epsilon^t_1, \ldots, \epsilon^t_N \in \Delta$ of $N$ experts, and then generates its own decision $\delta^t \in \Delta$. Upon receiving an outcome $\omega^t \in \Omega$, the learner and each expert $i$ incur loss $\lambda(\delta^t, \omega^t)$ and $\lambda(\epsilon^t_i, \omega^t)$, respectively. The goal of the learning algorithm is to generate decisions in such a way that its cumulative loss will not be much larger than the cumulative loss of the best expert. The following four properties are assumed to hold:
1. $\Delta$ is a compact topological space.
2. For each $\omega$, the function $\delta \mapsto \lambda(\delta, \omega)$ is continuous.
3. There exists $\delta$ such that, for all $\omega$, $\lambda(\delta, \omega) < \infty$.
4. There exists no $\delta$ such that, for all $\omega$, $\lambda(\delta, \omega) = 0$.

We now give Vovk's main result [24]. Let a decision problem defined by $\Omega$, $\Delta$ and $\lambda$ obey Assumptions 1-4. Let $c$ and $a$ be positive real numbers. We say that the decision problem is $(c, a)$-bounded if there exists an algorithm $A$ such that for any finite set of experts and for any finite sequence of trials, the cumulative loss of the algorithm is bounded by

$$\sum_{t=1}^{T} \lambda(\delta^t, \omega^t) \le c \min_i \sum_{t=1}^{T} \lambda(\epsilon^t_i, \omega^t) + a \ln N,$$

where $N$ is the number of experts.

We say that a distribution $D$ is simple if it is nonzero on a finite set denoted $\mathrm{dom}(D)$. Let $S$ be the set of simple distributions over $\Delta$. Vovk defines the following function $c : (0, 1) \to [0, \infty]$ which characterizes the hardness of any decision problem:

$$c(\beta) = \sup_{D \in S} \, \inf_{\delta \in \Delta} \, \sup_{\omega \in \Omega} \frac{\lambda(\delta, \omega)}{\log_\beta \sum_{\epsilon \in \mathrm{dom}(D)} \beta^{\lambda(\epsilon, \omega)} D(\epsilon)}. \qquad (27)$$

He then proves the following powerful theorem:

Theorem 13 (Vovk). A decision problem is $(c, a)$-bounded if and only if for all $\beta \in (0, 1)$, $c \ge c(\beta)$ or $a \ge c(\beta) / \ln(1/\beta)$.

Proof of Theorem 3. The proof consists of the following three steps. We first define a decision problem that conforms to Vovk's framework. We then show a lower bound on the function $c(\beta)$ for this problem. Finally, we show how any algorithm $A$ for the on-line allocation problem can be used to generate decisions in the defined problem, and so we get from Theorem 13 a lower bound on the worst case cumulative loss of $A$.

The decision problem is defined as follows. We fix an integer $K > 1$ and set $\Delta = S_K$ where $S_K$ is the $K$-dimensional simplex, i.e., $S_K = \{ x \in [0, 1]^K : \sum_{i=1}^{K} x_i = 1 \}$. We set $\Omega$ to be the set of unit vectors in $\mathbb{R}^K$, i.e., $\Omega = \{ e_1, \ldots, e_K \}$ where $e_i \in \{0, 1\}^K$ has a 1 in the $i$th component, and 0 in all other components.
Finally, we define the loss function to be $\lambda(\delta, e_i) \doteq \delta \cdot e_i = \delta_i$. One can easily verify that these definitions conform to Assumptions 1-4.
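As a numerical sketch of Eq. (27) for this decision problem, the snippet below evaluates the ratio for one particular simple distribution, the uniform distribution over the $K$ unit vectors. That choice of distribution and the function names `c_lower_bound` and `limit_value` are assumptions made for illustration. Since any fixed $D$ gives a lower bound on the supremum in Eq. (27), the computed value is a lower bound on $c(\beta)$. The sketch relies on two facts: for every outcome $e_j$ the sum $\sum_{\epsilon} \beta^{\lambda(\epsilon, e_j)} D(\epsilon)$ equals $(\beta + K - 1)/K$, and every probability vector has a component of size at least $1/K$, so the smallest achievable worst-case loss is $1/K$.

```python
import math


def c_lower_bound(beta, K):
    """Ratio in Eq. (27) for the uniform distribution over the K unit vectors."""
    # inf over delta of max_i delta_i, attained at delta = (1/K, ..., 1/K)
    numerator = 1.0 / K
    # sum over experts e_i of beta^{lambda(e_i, e_j)} / K; identical for every outcome e_j
    mixture = (beta + (K - 1)) / K
    # logarithm to base beta
    denominator = math.log(mixture) / math.log(beta)
    return numerator / denominator


def limit_value(beta):
    """Limit of c_lower_bound(beta, K) as K grows: ln(1/beta) / (1 - beta)."""
    return math.log(1.0 / beta) / (1.0 - beta)


beta = 0.5
for K in (2, 10, 100, 10_000):
    print(K, c_lower_bound(beta, K))
print("limit", limit_value(beta))
```

For a fixed $\beta$ the printed values increase with $K$ and approach $\ln(1/\beta)/(1 - \beta)$, which is the constant that appears in the statement of Theorem 3.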
To prove a lower bound on $c(\beta)$ for this decision problem we choose a particular simple distribution over the decision space $\Delta$. Let $D$ be the uniform distribution over the unit vectors, i.e., $\mathrm{dom}(D) = \{ e_1, \ldots, e_K \}$. For this distribution, we can explicitly calculate

$$c(\beta) \ge \inf_{\delta \in \Delta} \, \sup_{\omega \in \Omega} \frac{\lambda(\delta, \omega)}{\log_\beta \sum_{\epsilon \in \mathrm{dom}(D)} \beta^{\lambda(\epsilon, \omega)} D(\epsilon)}. \qquad (28)$$

First, it is easy to see that the denominator in Eq. (28) is the logarithm of a constant:

$$\sum_{\epsilon \in \mathrm{dom}(D)} \beta^{\lambda(\epsilon, \omega)} D(\epsilon) = \frac{\beta}{K} + \frac{K - 1}{K}. \qquad (29)$$

For any probability vector $\delta \in \Delta$, there must exist one component $i$ for which $\delta_i \ge 1/K$. Thus

$$\inf_{\delta \in \Delta} \, \sup_{\omega \in \Omega} \lambda(\delta, \omega) = 1/K. \qquad (30)$$

Combining Eqs. (28), (29), (30), and using $\log_\beta x = \ln(1/x) / \ln(1/\beta)$, we get that

$$c(\beta) \ge \frac{\ln(1/\beta)}{-K \ln\left( 1 - (1 - \beta)/K \right)}. \qquad (31)$$

We now show how an on-line allocation algorithm $A$ can be used as a subroutine for solving this decision problem. We match each of the $N$ experts of the decision problem with a strategy of the allocation problem. Each iteration $t$ of the decision problem proceeds as follows.
1. Each of the $N$ experts generates a decision $\epsilon^t_i \in S_K$.
2. The algorithm $A$ generates a distribution $p^t \in S_N$.
3. The learner chooses the decision $\delta^t = \sum_{i=1}^{N} p^t_i \epsilon^t_i$.
4. The outcome $\omega^t \in \Omega$ is generated.
5. The learner incurs loss $\delta^t \cdot \omega^t$, and each expert suffers loss $\epsilon^t_i \cdot \omega^t$.
6. Algorithm $A$ receives the loss vector $\ell^t$ where $\ell^t_i = \epsilon^t_i \cdot \omega^t$, and incurs loss

$$p^t \cdot \ell^t = \sum_{i=1}^{N} p^t_i \left( \epsilon^t_i \cdot \omega^t \right) = \left( \sum_{i=1}^{N} p^t_i \epsilon^t_i \right) \cdot \omega^t = \delta^t \cdot \omega^t.$$

Observe that the loss incurred by the learner in the decision problem is equal to the loss incurred by $A$. Thus, if for algorithm $A$ we have an upper bound of the form

$$L_A \le c \min_i L_i + a \ln N,$$

then the decision problem is $(c, a)$-bounded. On the other hand, using the lower bound given by Theorem 13 and the lower bound on $c(\beta)$ given in Eq. (31), we get that for any $K$ and any $\beta$, either

$$c \ge \frac{\ln(1/\beta)}{-K \ln\left( 1 - (1 - \beta)/K \right)} \quad \text{or} \quad a \ge \frac{1}{-K \ln\left( 1 - (1 - \beta)/K \right)}. \qquad (32)$$

As $K$ is a free parameter we can let $K \to \infty$, and the denominators in Eq. (32) become $1 - \beta$, which gives the statement of the theorem. ∎

ACKNOWLEDGMENTS

Thanks are due to Corinna Cortes, Harris Drucker, David Helmbold, Keith Messer, Volodya Vovk, and Manfred Warmuth for helpful discussions.

REFERENCES

1. E. B. Baum and D. Haussler, What size net gives valid generalization?, in ``Adv. Neural Inform. Process. Systems I,'' pp. 81-90, Morgan Kaufmann.
2. D. Blackwell, An analog of the minimax theorem for vector payoffs, Pacific J. Math. 6, No. 1 (Spring 1956).
3. L. Breiman, Bias, variance, and arcing classifiers, Technical Report 460, Statistics Dept., University of California, 1996; available from ftp://ftp.stat.berkeley.edu/pub/users/breiman/arcall.ps.Z.
4. N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. E. Schapire, and M. K. Warmuth, How to use expert advice, in ``Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, 1993.''
5. T. H. Chung, Approximate methods for sequential decision making using expert advice, in ``Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory, 1994.''
6. T. M. Cover, Universal portfolios, Math. Finance 1, No. 1 (Jan. 1991).
7. T. G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res. 2 (January 1995).
8. H. Drucker and C. Cortes, Boosting decision trees, Adv. Neural Inform. Process. Systems 8 (1996).
9. H. Drucker, R. Schapire, and P. Simard, Boosting performance in neural networks, Int. J. Pattern Recognition Artif. Intell. 7, No. 4 (1993).
10. Y. Freund, ``Data Filtering and Distribution Modeling Algorithms for Machine Learning,'' Ph.D. thesis, University of California at Santa Cruz, 1993; retrievable from ftp.cse.ucsc.edu/pub/tr/ucsc-crl-[...].ps.Z.
11. Y. Freund, Boosting a weak learning algorithm by majority, Inform. and Comput. 121, No. 2 (September 1995); an extended abstract appeared in ``Proceedings of the Third Annual Workshop on Computational Learning Theory, 1990.''
12. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in ``Machine Learning: Proceedings of the Thirteenth International Conference, 1996.''
13. Y. Freund and R. E. Schapire, Game theory, on-line prediction and boosting, in ``Proceedings of the Ninth Annual Conference on Computational Learning Theory, 1996.''