Totally Corrective Boosting Algorithms that Maximize the Margin


Manfred K. Warmuth, Jun Liao — University of California at Santa Cruz, Santa Cruz, CA 95064, USA
Gunnar Rätsch — Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany

Boosting, Margins, Convergence, Relative Entropy, Bregman Divergences, Bregman Projection

Abstract

We consider boosting algorithms that maintain a distribution over a set of examples. At each iteration a weak hypothesis is received and the distribution is updated. We motivate these updates as minimizing the relative entropy subject to linear constraints. For example, AdaBoost constrains the edge of the last hypothesis w.r.t. the updated distribution to be at most γ = 0. In some sense, AdaBoost is corrective w.r.t. the last hypothesis. A cleaner boosting method is to be "totally corrective": the edges of all past hypotheses are constrained to be at most γ, where γ is suitably adapted. Using new techniques, we prove the same iteration bounds for the totally corrective algorithms as for their corrective versions. Moreover, with adaptive γ, the algorithms provably maximize the margin. Experimentally, the totally corrective versions return smaller convex combinations of weak hypotheses than the corrective ones and are competitive with LPBoost, a totally corrective boosting algorithm with no regularization, for which no iteration bound is known.

The first author was partially funded by the NSF grant CCR. The first two authors were partially funded by UC Discovery grant ITl and Telik Inc. grant ITl. Part of this work was done while the third author was visiting UC Santa Cruz. The authors thank Telik Inc. for providing the COX-1 dataset.

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

1. Introduction

In this paper we characterize boosting algorithms by the underlying optimization problems rather than the approximation algorithms that solve these problems. The goal is to select a small convex combination of weak hypotheses that maximizes the margin. For lack of space we only compare the algorithms in terms of this goal rather than the generalization error, and refer to (Schapire et al., 1998) for generalization bounds that improve with the margin and degrade with the size of the final convex combination.

One of the most common boosting algorithms is AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999). It can be viewed as minimizing the relative entropy to the last distribution subject to the constraint that the edge of the last hypothesis is zero (equivalently, its weighted error is half) (Kivinen & Warmuth, 1999; Lafferty, 1999). One of the important properties of AdaBoost is that it has a decent iteration bound and approximately maximizes the margin of the examples (Breiman, 1997; Rätsch et al., 2001; Rudin et al., 2004a). A similar algorithm called AdaBoost*_ν provably maximizes the margin and has an analogous iteration bound (Rätsch & Warmuth, 2005).¹ This algorithm enforces only a single constraint at iteration t: the edge of the hypothesis must be at most γ, where γ is adapted. A natural idea is to constrain the edges of all t past hypotheses to be at most γ and otherwise minimize the relative entropy to the initial distribution. Such algorithms were proposed by Kivinen and Warmuth (1999) and are called "totally corrective". However, in that paper only γ = 0 was considered, which leads to

¹ Other algorithms for maximizing the margin with weaker iteration bounds are given in (Breiman, 1999; Rudin et al., 2004a).

an infeasible optimization problem when the training data is separable. Building on the work of Rätsch and Warmuth (2005), we now adapt the edge bound γ of the totally corrective algorithm so that the margin is approximately maximized. We call our new algorithm TotalBoost_ν. The corrective AdaBoost*_ν can be used as a heuristic for implementing TotalBoost by doing many passes over all past hypotheses before adding a new one. However, we can show that this heuristic is often several orders of magnitude less efficient than a vanilla sequential quadratic optimization approach for solving the optimization problem underlying TotalBoost.

A parallel progression occurred for on-line learning algorithms for disjunctions. The original algorithms (variants of the Winnow algorithm (Littlestone, 1988)) can be seen as processing a single constraint induced by the last example. However, more recently an on-line algorithm has been developed for learning disjunctions (in the noise-free case) that enforces the constraints induced by all past examples (Long & Wu, 2005). The proof techniques in both settings are essentially the same, except that for disjunctions the margin/threshold is fixed whereas in boosting we optimize the margin.

Besides emphasizing the new proof methods for iteration bounds of boosting algorithms, this paper also provides an experimental comparison of the algorithms. We show that while TotalBoost has the same iteration bound as AdaBoost*_ν, it often requires several orders of magnitude fewer iterations. When there are many similar weak hypotheses, the totally corrective algorithms have an additional advantage: assume we have 100 groups of 100 weak hypotheses each, where the hypotheses within each group are very similar. TotalBoost picks a small number of hypotheses from each group, whereas the algorithms that process one constraint at a time often come back to the same group and choose many more members from the same group. Therefore, in our experiments the number of weak hypotheses in the final convex combination (with non-zero coefficients) is consistently much smaller for the totally corrective algorithms, making them better suited for the purpose of feature selection.

Perhaps one of the simplest boosting algorithms is LPBoost: it is totally corrective, but unlike TotalBoost, it uses no entropic regularization. Also, the upper bound γ on the edge is chosen to be as small as possible in each iteration, whereas in TotalBoost it is decreased more moderately. Experimentally, we have identified cases where TotalBoost requires considerably fewer iterations than LPBoost, which suggests that either the entropic regularization or the moderate choice of γ is helpful for more than just proving iteration bounds.

2. Preliminaries

Assume we are given N labeled examples $(x_n, y_n)$, $1 \le n \le N$, where the examples are from some domain and the labels $y_n$ lie in $\{\pm 1\}$. A boosting algorithm combines many weak hypotheses or "rules of thumb" for the examples to form a convex combination of hypotheses with high accuracy. In this paper a boosting algorithm adheres to the following protocol: it maintains a distribution $d^t$ on the examples; in each iteration t a weak learner provides a weak hypothesis $h_t$ and the distribution $d^t$ is updated to $d^{t+1}$. Intuitively, the updated distribution incorporates the information obtained from $h_t$ and gives high weights to the remaining hard examples. After iterating T steps the algorithm stops and outputs a convex combination of the T weak hypotheses it received from the weak learner.
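This protocol fixes the interfaces that all algorithms in the paper share. The following minimal Python sketch is our own illustration of that loop; the names `weak_learner` and `update_distribution` are hypothetical placeholders for the components that the individual algorithms (AdaBoost, TotalBoost, LPBoost) instantiate.

```python
import numpy as np

def boost(X, y, weak_learner, update_distribution, T):
    """Generic boosting protocol: maintain a distribution d over the N
    examples, query the weak learner, and update d in each iteration.
    weak_learner(X, y, d) must return a hypothesis h: x -> [-1, 1];
    update_distribution encodes the specific algorithm."""
    N = len(y)
    d = np.full(N, 1.0 / N)            # uniform initial distribution d^1
    hypotheses = []
    for t in range(T):
        h = weak_learner(X, y, d)      # weak hypothesis h_t
        u = y * h(X)                   # u^t_n = y_n h_t(x_n)
        hypotheses.append(h)
        d = update_distribution(d, u)  # incorporate the new edge constraint
    return hypotheses                  # caller forms the convex combination
```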
We first discuss how we measure the performance of a weak hypothesis h w.r.t. the current distribution d. If h is ±1 valued, then the error $\epsilon$ is the total weight on all the examples that are misclassified. When the range of a hypothesis h is the entire interval [−1, +1], then the edge $\gamma_h(d) = \sum_{n=1}^N d_n\, y_n\, h(x_n)$ is a more convenient quantity for measuring the quality of h. This edge is an affine transformation of the error for the case when h has range ±1: $\epsilon_h(d) = \frac{1}{2} - \frac{1}{2}\gamma_h(d)$. Ideally we want a hypothesis of edge 1 (error 0). On the other hand, it is often easy to produce hypotheses of edge at least 0 (or equivalently, error at most 1/2). We define the edge of a set of hypotheses as the maximum of the edges.

Assumption on the weak learner: Assume that for any distribution d on the examples the weak learner returns a hypothesis h with edge $\gamma_h(d)$ at least g. As we will discuss later, the guarantee parameter g might not be known to the boosting algorithm.

Boosting algorithms produce a convex combination of weak hypotheses: $f_\alpha(x) := \sum_{t=1}^T \alpha_t h_t(x)$, where $h_t$ is the hypothesis added in iteration t and $\alpha_t$ is its coefficient. The margin of a given example $(x_n, y_n)$ is defined as $y_n f_\alpha(x_n)$. The margin of a set of examples is always the minimum over the examples. Our algorithms always produce a convex combination of weak learners of margin at least $g - \nu$, where ν is a precision parameter. Also, the size of the convex combination is at most $O(\frac{\log N}{\nu^2})$. Note that the higher the guarantee g of the weak learner, the larger the produced margin.
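The edge and margin quantities defined above are one-liners in code. The sketch below (plain numpy, with hypotheses represented by their prediction vectors; our own illustration, not from the paper) mirrors the definitions:

```python
import numpy as np

def edge(d, y, h_values):
    """Edge gamma_h(d) = sum_n d_n y_n h(x_n)."""
    return float(np.dot(d, y * h_values))

def error_from_edge(gamma):
    """For a +/-1 valued h, the weighted error is eps = 1/2 - gamma/2."""
    return 0.5 - 0.5 * gamma

def margin(alpha, y, H):
    """Margin of the example set under f_alpha = sum_t alpha_t h_t:
    the minimum of y_n f_alpha(x_n) over the examples. H is a (T, N)
    array whose row t holds h_t's predictions on the N examples."""
    return float((y * (alpha @ H)).min())
```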

Algorithm 1 LPBoost algorithm
1. Input: $S = (x_1, y_1), \dots, (x_N, y_N)$, desired accuracy ν.
2. Initialize: $d^1_n = 1/N$ for all $n = 1 \dots N$.
3. Do for $t = 1, \dots$
   (a) Train classifier on $\{S, d^t\}$ and obtain hypothesis $h_t: x \mapsto [-1, 1]$; let $u^t_n = y_n h_t(x_n)$.
   (b) Calculate the edge $\gamma_t$ of $h_t$: $\gamma_t = d^t \cdot u^t$.
   (c) Set $\hat\gamma_t = \min_{q=1,\dots,t} \gamma_q$.
   (d) Compute $\gamma^*_t$ as in (1) and set $d^{t+1}$ to any distribution d for which $u^q \cdot d \le \gamma^*_t$, for $1 \le q \le t$.
   (e) If $\gamma^*_t \ge \hat\gamma_t - \nu$, then $T = t$ and break.
4. Output: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$, where the coefficients $\alpha_t$ realize margin $\gamma^*_T$.

How are edges and margins related? By duality, the minimum edge of the examples w.r.t. the hypothesis set $H_t = \{h_1, \dots, h_t\}$ equals the maximum margin:

$$\gamma^*_t := \min_d \max_{h \in H_t} \gamma_h(d) = \max_\alpha \min_n y_n f_\alpha(x_n) =: \rho^*_t, \qquad (1)$$

where d and α are N- and t-dimensional probability vectors, respectively. Note that the sequence $\gamma^*_t$ is non-decreasing. It will approach the guarantee g from below. The algorithms will stop as soon as the edges are within ν of g (see next section).

The above duality also restricts the range of the guarantee g that a weak learner can possibly have. Let H be the entire (possibly infinite) hypothesis set from which the weak learner is choosing. If H is compact (see discussion in Rätsch & Warmuth, 2005), then

$$\gamma^* := \min_d \max_{h \in H} \gamma_h(d) = \max_\alpha \min_n y_n f_\alpha(x_n) =: \rho^*,$$

where d and α are probability distributions over the examples and H, respectively, and $f_\alpha(x_n)$ now sums over H. Clearly $g \le \rho^*$, and for any non-optimal d, α:

$$\max_{h \in H} \gamma_h(d) > \gamma^* = \rho^* > \min_n y_n f_\alpha(x_n) =: \rho(\alpha). \qquad (2)$$

So even though there always is a weak hypothesis in H with edge at least $\rho^*$, the weak learner is only guaranteed to produce one of edge at least $g \le \rho^*$.

One of the most bare-bones boosting algorithms is LPBoost (Algorithm 1), proposed by Grove and Schuurmans (1998) and Bennett et al. (2000). It uses linear programming to constrain the edges of the past t weak hypotheses to be at most $\gamma^*_t$, which is as small as possible. No iteration bound is known for this algorithm, and also the performance can very much depend on which LP solver is used (see the experimental section).
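Step (d) of LPBoost is exactly the left-hand side of the duality (1): a linear program over the probability simplex. Below is a minimal sketch of that per-iteration LP using scipy's linprog (the paper itself uses CPLEX; the function name and interface here are our own):

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_step(U):
    """Solve gamma*_t = min_d max_q u^q . d over the simplex, as in (1).
    U is a (t, N) array whose rows are the vectors u^q.
    Returns (gamma_star, d) where d satisfies u^q . d <= gamma_star."""
    t, N = U.shape
    # variables z = (d_1, ..., d_N, gamma); objective: minimize gamma
    c = np.zeros(N + 1)
    c[-1] = 1.0
    # edge constraints: u^q . d - gamma <= 0 for each past hypothesis q
    A_ub = np.hstack([U, -np.ones((t, 1))])
    b_ub = np.zeros(t)
    # d must be a probability vector; gamma is a free variable
    A_eq = np.hstack([np.ones((1, N)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * N + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:N]
```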
Algorithm 2 TotalBoost_ν with accuracy param. ν
1. Input: $S = (x_1, y_1), \dots, (x_N, y_N)$, desired accuracy ν.
2. Initialize: $d^1_n = 1/N$ for all $n = 1 \dots N$.
3. Do for $t = 1, \dots$
   (a) Train classifier on $\{S, d^t\}$ and obtain hypothesis $h_t: x \mapsto [-1, 1]$; let $u^t_n = y_n h_t(x_n)$.
   (b) Calculate the edge $\gamma_t$ of $h_t$: $\gamma_t = d^t \cdot u^t$.
   (c) Set $\hat\gamma_t = (\min_{q=1,\dots,t} \gamma_q) - \nu$.
   (d) Update weights: $d^{t+1} = \operatorname{argmin}_{\{d \in \mathcal{P}^N :\ d \cdot u^q \le \hat\gamma_t,\ \text{for } 1 \le q \le t\}} \Delta(d, d^1)$.
   (e) If the above is infeasible or $d^{t+1}$ contains a zero, then $T = t$ and break.
4. Output: $f_\alpha(x) = \sum_{t=1}^T \alpha_t h_t(x)$, where the coefficients $\alpha_t$ maximize the margin over the hypothesis set $\{h_1, \dots, h_T\}$.

Algorithm 3 TotalBoost_g with accuracy parameter ν and edge guarantee g
As TotalBoost_ν, but in step 3(c) we use $\hat\gamma_t = g - \nu$.

Our algorithms are motivated by the minimum relative entropy principle of Jaynes: among the solutions satisfying some linear constraints, choose the one that minimizes a relative entropy to the initial distribution $d^1$, where the relative entropy is defined as follows:

$$\Delta(\tilde d, d) = \sum_n \tilde d_n \ln \frac{\tilde d_n}{d_n}.$$

Our default initial distribution is uniform. However, the analysis works for any choice of $d^1$ with non-zero components. There are two totally corrective versions of the algorithm: one that knows the guarantee g of the weak learner and one that does not. The one that does (called TotalBoost_g; Algorithm 3) simply constrains the edges of the previous hypotheses to be at most $g - \nu$, where ν is a given precision parameter. Our main algorithm, TotalBoost_ν (Algorithm 2), does not know g. It maintains the estimates $\hat\gamma_t = (\min_{q=1}^t \gamma_q) - \nu$ and constrains the edges of the past hypotheses to be at most $\hat\gamma_t$. The sequence $\{\hat\gamma_t\}_t$ is clearly non-increasing. By our assumption $\gamma_t \ge g$, and therefore $\hat\gamma_t \ge g - \nu$.

3. Termination Guarantees

When the algorithms break, we need to guarantee that the margin w.r.t. the current hypothesis set is at least $g - \nu$.

Algorithm 4 AdaBoost*_ν with accuracy parameter ν
As TotalBoost_ν, but minimize the divergence to the last distribution w.r.t. a single constraint:
$$d^{t+1} = \operatorname{argmin}_{\{d :\ d \cdot u^t \le \hat\gamma_t\}} \Delta(d, d^t).$$
Let $\alpha_t$ be the dual coefficient of the constraint on the edge of $h_t$ used in iteration t. The algorithm breaks if the margin w.r.t. the current convex combination (i.e. the normalized $\alpha_t$) is at least $\hat\gamma_t$.

Algorithm 5 AdaBoost*_g with accuracy parameter ν and guarantee g
As AdaBoost*_ν, but in step 3(c) we use $\hat\gamma_t = g - \nu$.

TotalBoost_g is given g and constrains the edges of all past hypotheses to be at most $g - \nu$. When these become infeasible, the edge $\gamma^*_t$ w.r.t. the current hypothesis set is larger than $g - \nu$. The algorithm also breaks when the solution $d^{t+1}$ of the minimization problem lies at the boundary of the simplex (i.e. the distribution has a zero component).² In this case $\gamma^*_t = g - \nu$, because if $\gamma^*_t < g - \nu$, then all constraints would have slack and the solution d that minimizes the divergence $\Delta(d, d^1)$ would lie in the interior of the simplex, since $d^1$ does. Thus, whenever the algorithm breaks, we have $\gamma^*_t \ge g - \nu$. TotalBoost_g outputs a convex combination of the hypotheses $\{h_1, \dots, h_T\}$ that maximizes the margin. By duality, the value $\rho^*_t$ of this margin equals the minimum edge $\gamma^*_t$, and therefore TotalBoost_g is guaranteed to output a combined hypothesis of margin at least $g - \nu$.

The second algorithm, TotalBoost_ν, does not know the guarantee g of the weak learner. It breaks if its optimization problem becomes infeasible, which happens when $\gamma^*_t > \hat\gamma_t \ge g - \nu$. The algorithm also breaks when the solution $d^{t+1}$ of the minimization problem lies at the boundary of the simplex. In this case, $\gamma^*_t = \hat\gamma_t$ by an argument similar to the one used above. Thus, whenever the algorithm breaks, we have $\gamma^*_t \ge \hat\gamma_t \ge g - \nu$, and therefore TotalBoost_ν is guaranteed to output a hypothesis of margin $\rho^*_t = \gamma^*_t \ge g - \nu$.

The termination condition for LPBoost³ follows a similar argument: we directly check for $\gamma^*_t \ge \hat\gamma_t - \nu$. The algorithm AdaBoost*_ν computes the margin using the normalized dual coefficients $\alpha_t$ of its constraints and stops as soon as this margin is at least $\hat\gamma_t$. Finally, AdaBoost*_g breaks when the same margin is at least $g - \nu$. For both of these algorithms the current distribution $d^t$ lies in the interior, because the dual coefficients $\alpha_t$ are finite and $d^t_n \propto d^1_n \exp(-\sum_{q=1}^{t-1} \alpha_q u^q_n)$.

² This second condition for breaking is only added to ensure that the dual variables of the optimization problem of TotalBoost_ν remain finite.
³ We use a different termination condition for LPBoost than in (Bennett et al., 2000; Grove & Schuurmans, 1998).
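The single-constraint projection of Algorithm 4 has a convenient dual form: by Lagrangian duality, minimizing the relative entropy to $d^t$ subject to one edge constraint yields an exponential update $d^{t+1}_n \propto d^t_n \exp(-\alpha_t u^t_n)$ with $\alpha_t \ge 0$ chosen to make the constraint tight when it is active. The sketch below is our own illustration; it recovers $\alpha_t$ by bisection on the monotonically decreasing edge:

```python
import numpy as np

def corrective_update(d, u, gamma_hat, tol=1e-12):
    """Entropy projection of d onto {d' : d'.u <= gamma_hat}.
    The solution is d'_n proportional to d_n exp(-alpha u_n), alpha >= 0;
    alpha = 0 if the constraint already holds, else alpha makes it tight."""
    def project(alpha):
        # shift the exponent by u.min() for numerical stability; the
        # constant factor cancels in the normalization
        w = d * np.exp(-alpha * (u - u.min()))
        return w / w.sum()

    if np.dot(d, u) <= gamma_hat:              # constraint already satisfied
        return d
    lo, hi = 0.0, 1.0
    while np.dot(project(hi), u) > gamma_hat:  # grow an upper bracket for alpha
        hi *= 2.0
        if hi > 1e8:                           # target edge below min_n u_n
            raise ValueError("constraint infeasible for any finite alpha")
    while hi - lo > tol:                       # bisect: edge decreases in alpha
        mid = 0.5 * (lo + hi)
        if np.dot(project(mid), u) > gamma_hat:
            lo = mid
        else:
            hi = mid
    return project(hi)
```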
4. Iteration Bound

In the previous section we showed that when the algorithms break, the output hypothesis has margin at least $g - \nu$. We now show that TotalBoost_ν must break after $T \le \frac{2 \ln N}{\nu^2}$ iterations. In each iteration t, the algorithm updates the distribution to the one that is closest to $d^1$ and lies in a certain convex set, and these sets get smaller as t increases. Here closeness is measured with the relative entropy, which is a special Bregman divergence. This closest point is called a projection of $d^1$ onto the convex set ($d^1$ is assumed to lie in the interior of the simplex). The proof is analogous to an on-line mistake bound for learning disjunctions (Long & Wu, 2005). It employs the Generalized Pythagorean Theorem that holds for such projections w.r.t. any Bregman divergence (Bregman, 1967, Lemma 1; Herbster & Warmuth, 2001, Theorem 2).

Theorem 1 TotalBoost_ν breaks after at most $\frac{2 \ln N}{\nu^2}$ iterations.

Proof Let $C_t$ denote the convex set of all points $d \in \mathbb{R}^N$ that satisfy $\sum_n d_n = 1$, $d_n \ge 0$ (for $1 \le n \le N$), and the edge constraints $d \cdot u^q \le \hat\gamma_t$ for $1 \le q \le t$, where $u^q_n = y_n h_q(x_n)$. The distribution $d^t$ at iteration $t \ge 1$ is the projection of $d^1$ onto the closed convex set $C_{t-1}$. Notice that $C_0$ is the entire simplex, and because $\hat\gamma_t$ can only decrease and a new constraint is added in trial t, we have $C_t \subseteq C_{t-1}$. If $t \le T - 1$, then our termination condition assures that at trial $t \ge 1$ the set $C_{t-1}$ has a feasible solution in the interior of the simplex. Also, $d^1$ lies in the interior and $d^{t+1} \in C_t \subseteq C_{t-1}$. These preconditions assure that at trial $t \ge 1$ the projection $d^t$ of $d^1$ onto $C_{t-1}$ exists and the Generalized Pythagorean Theorem for Bregman divergences can be applied:

$$\Delta(d^{t+1}, d^1) - \Delta(d^t, d^1) \ge \Delta(d^{t+1}, d^t). \qquad (3)$$

Since $d^t \cdot u^t = \gamma_t$ and $d^{t+1} \cdot u^t \le \hat\gamma_t \le \gamma_t - \nu$, we have $d^t \cdot u^t - d^{t+1} \cdot u^t \ge \nu$, and because $u^t \in [-1, 1]^N$, $\|d^{t+1} - d^t\|_1 \ge \nu$. We now apply Pinsker's inequality: $\|d^{t+1} - d^t\|_1 \ge \nu$ implies that

$$\Delta(d^{t+1}, d^t) > \frac{\nu^2}{2}. \qquad (4)$$

By summing (3) over the first $T - 1$ trials we obtain

$$\Delta(d^T, d^1) - \underbrace{\Delta(d^1, d^1)}_{0} > (T - 1)\,\frac{\nu^2}{2}.$$

Since the left-hand side is at most $\ln N$, the bound of the theorem follows.
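Both analytic ingredients of this proof are easy to sanity-check numerically. The snippet below (our own check, not from the paper) verifies Pinsker's inequality $\Delta(p, q) \ge \frac{1}{2}\|p - q\|_1^2$ on random pairs of distributions:

```python
import numpy as np

def relative_entropy(p, q):
    """Delta(p, q) = sum_n p_n ln(p_n / q_n), in nats."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    l1 = np.abs(p - q).sum()
    assert relative_entropy(p, q) >= 0.5 * l1**2 - 1e-12  # Pinsker holds
```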

The key requirement for this proof is that the closed and convex constraint sets $C_t$ used for the projection at trial t must be non-increasing. It is therefore easy to see that the iteration bound also holds for the TotalBoost_g algorithm, because of our assumption that $\gamma_t \ge g$. In the complete paper we prove the same iteration bound for the corrective versions AdaBoost*_ν and AdaBoost*_g, and for the variants of TotalBoost where $\operatorname{argmin} \Delta(d, d^1)$ is replaced by $\operatorname{argmin} \Delta(d, d^t)$.

5. Experiments

In this section we illustrate the behavior of our new algorithms TotalBoost_ν and TotalBoost_g, and compare them with LPBoost and AdaBoost*_ν on three different datasets:

Dataset 1 is a public dataset from Telik Inc. for a drug discovery problem called COX-1: 125 binary labeled examples with a set of 3888 binary features that are complementation closed.

Dataset 2 is an artificial dataset used in Rudin et al. (2004b) for investigating boosting algorithms that maximize the margin: 50 binary labeled examples with 100 binary features. For each original feature we added 99 similar features by inverting the feature value of one randomly chosen example (with replacement). This results in a 10,000-dimensional feature set of 100 blocks of size 100.

Dataset 3 is a series of artificially generated datasets of 1000 examples with a varying number of features but roughly constant margin. We first generated $N_1$ random ±1-valued features $x_1, \dots, x_{N_1}$ and set the label of the examples as $y = \operatorname{sign}(x_1 + x_2 + x_3 + x_4 + x_5)$. We then duplicated each feature $N_2$ times, perturbed the features by Gaussian noise with σ = 0.1, and clipped the feature values so that they lie in the interval [−1, 1]. We considered $N_1$ = 1, 10, 100 and $N_2$ = 10, 100, 1000.

The features of our datasets represent the values of the available weak hypotheses on the examples. In each iteration of boosting, the base learner simply selects the feature that maximizes the edge w.r.t. the current distribution d on the examples. This means that the guarantee g equals the maximum margin $\rho^*$. Note that our datasets and the base learner were chosen to exemplify certain properties of the algorithms, and more extensive experiments are still needed.

We first discuss how the entropy minimization problems can be solved efficiently. We then compare the algorithms w.r.t. the number of iterations and the number of selected hypotheses. Finally, we show how LPBoost is affected by the underlying optimizer and exhibit cases where LPBoost requires considerably more iterations than TotalBoost.

5.1. Solving the Entropy Problems

We use a vanilla sequential quadratic programming algorithm (Nocedal & Wright, 2000) for solving our main optimization problem:

$$\min_{d:\ \sum_n d_n = 1,\ d \ge 0,\ u^q \cdot d \le \hat\gamma_t\ (1 \le q \le t)}\ \ \sum_{n=1}^N d_n \log \frac{d_n}{d^1_n}.$$

We initially set our approximate solution $\tilde d$ to $d^1$ and iteratively optimize $\tilde d$. Given that the current solution $\tilde d$ satisfies the constraints $\sum_n \tilde d_n = 1$ and $\tilde d \ge 0$, we determine an update δ by solving the following problem:

$$\min_\delta\ \sum_{n=1}^N \left( \left(1 + \log \frac{\tilde d_n}{d^1_n}\right) \delta_n + \frac{1}{2 \tilde d_n}\, \delta_n^2 \right),$$

w.r.t. the constraints $\tilde d_n + \delta_n \ge 0$, $\sum_n \delta_n = 0$, and $u^q \cdot (\tilde d + \delta) \le \hat\gamma_t$ (for $1 \le q \le t$). The estimate $\tilde d$ is updated to $\tilde d \leftarrow \tilde d + \delta$, and we repeat this process until convergence. The algorithm typically converges in very few steps. Note that the above objective is the 2nd order Taylor approximation of the relative entropy $\Delta(\tilde d + \delta, d^1)$ at δ = 0. The resulting optimization problem is quadratic with a diagonal Hessian and can be efficiently solved by off-the-shelf optimizer packages (e.g. ILOG CPLEX).
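For illustration, the whole projection of step 3(d) can also be handed to a generic solver. The sketch below uses scipy's SLSQP instead of the CPLEX-based sequential quadratic scheme described above; it is only meant to make the optimization problem concrete, not to reproduce the paper's solver:

```python
import numpy as np
from scipy.optimize import minimize

def entropy_projection(d1, U, gamma_hat):
    """d^{t+1} = argmin Delta(d, d1) over the simplex subject to
    u^q . d <= gamma_hat for every row u^q of U.
    Returns None if the solver reports failure (e.g. an empty
    constraint set, which is the algorithm's termination signal)."""
    N = len(d1)
    objective = lambda d: float(np.sum(d * np.log(np.maximum(d, 1e-300) / d1)))
    constraints = [
        {"type": "eq", "fun": lambda d: d.sum() - 1.0},        # simplex
        {"type": "ineq", "fun": lambda d: gamma_hat - U @ d},  # edges <= gamma_hat
    ]
    res = minimize(objective, d1, method="SLSQP",
                   bounds=[(0.0, 1.0)] * N, constraints=constraints)
    return res.x if res.success else None
```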
5.2. Number of Iterations

First, we consider the number of iterations needed until each of the algorithms has achieved a margin of at least $\rho^* - \nu$. We use dataset 1 and record the margin of the convex combination of hypotheses produced by TotalBoost_ν, LPBoost and AdaBoost*_ν. Additionally, we compute the maximal margin of the current hypothesis sets in each iteration. See Figure 1 for details. The default optimizer used for solving LPs and QPs is ILOG CPLEX's interior point method.

It should be noted that AdaBoost*_ν needs considerably less computation per iteration than the totally corrective algorithms. In the case where calling the base learner is very cheap, AdaBoost*_ν may in some unusual cases require less computation time than TotalBoost_ν. However, in our experiments, the number of iterations required by AdaBoost*_ν to achieve margin at least $\rho^* - \nu$ was 1/10 times the theoretical upper bound $\log(N)/\nu^2$.

Figure 1: TotalBoost_ν, LPBoost and AdaBoost*_ν on dataset 1 for ν = 0.03, 0.01, 0.003: We show the margin realized by the normalized dual coefficients $\hat\alpha_t$ of TotalBoost_ν and AdaBoost*_ν (green) and the LP-optimized margin $\rho^*_t$ of (1) (blue). Observe that AdaBoost*_ν needs several thousand iterations, while the numbers of iterations of TotalBoost_ν and LPBoost are comparable. The margins of TotalBoost_ν and AdaBoost*_ν start growing slowly, in particular when ν is small. The margin of TotalBoost_g (with guarantee g = ρ*) increases faster than that of LPBoost (not shown).

TotalBoost_ν typically requires much fewer iterations, even though no improved theoretical bound is known for this algorithm. In our experience, the iteration number of TotalBoost_ν depends only slightly on the precision parameter ν, and when $\hat\gamma_t$ is close to $\rho^*$, this algorithm converges very fast to the maximum margin solution (LPBoost shows similar behavior). While the algorithms AdaBoost*_ν and TotalBoost_ν provably maximize the margin, they both have the problem of starting too slowly for small ν. If there is any good upper bound available for the guarantee g (which here is the optimal margin ρ*), then we can initialize $\hat\gamma_t$ with this upper bound and speed up the starting phase. In particular, when ρ* is known exactly, the algorithms AdaBoost*_g and TotalBoost_g require drastically fewer iterations and the latter consistently beats LPBoost (not shown). In practical situations it is often easy to obtain a reasonable upper bound for g.

5.3. Number of Hypotheses

In this subsection, we compare how many hypotheses the algorithms need to achieve a large margin. Note that LPBoost and TotalBoost_ν only select a base hypothesis once: after the first selection, the distribution d is maintained such that the edge for that hypothesis is smaller than $\hat\gamma_t$, and it is not selected again. AdaBoost*_ν may select the same hypothesis many times. However, if there are several similar features (as in datasets 2 & 3), then this corrective algorithm often selects hypotheses that are similar to previously selected ones, and the number of weak hypotheses used in the final convex combination is unnecessarily large. Hence, TotalBoost_ν and LPBoost seem better suited for feature selection, when small ensembles are needed.

In Figure 2 we display the margin vs. the number of used and selected hypotheses. The number of selected hypotheses for LPBoost and TotalBoost_ν is equal to the number of iterations. For these algorithms a previously selected hypothesis can become inactive (corresponding α = 0). In this case it is not counted as a used hypothesis. Note that the number of used hypotheses for LPBoost may depend on the choice of the optimizer (also see the discussion below). In the case of AdaBoost*_ν, all dual coefficients $\alpha_t$ are non-zero in the final convex combination. (See the caption of Figure 2 for more details.) We can conclude that the totally corrective algorithms need considerably fewer hypotheses when there are many redundant hypotheses/features. LPBoost and TotalBoost_ν differ in the initial iterations (depending on ν), but produce combined hypotheses of similar size.

In Figure 3 we compare the effect of different choices of the optimizer for LPBoost. For dataset 2 there is a surprisingly large difference between interior point and simplex based methods. The reason is that the weights computed by the simplex method are often sparse, and the changes in the duplicated features are sparse as well (by design). Hence, it can easily happen that the base learner is blind on some examples when selecting the hypotheses. Interior point methods find a solution in the interior and therefore distribute the weights among the examples.
To illustrate that this is the right explanation, we modify LPBoost such that it first computes $\gamma^*_t$, but then it computes the weights using the relative entropy minimization with $\hat\gamma_t = \gamma^*_t + \epsilon$ (where $\epsilon = 10^{-4}$). We call this the regularized LPBoost algorithm.

Figure 2: TotalBoost_ν, LPBoost and AdaBoost*_ν on dataset 2 for ν = 0.01: [left & middle] The realized (green) and the LP-optimized (blue) margin $\rho^*_t$ (as in Figure 1) vs. the number of used (active) and selected (active or inactive) hypotheses in the convex combination. We observe that the totally corrective algorithms use considerably fewer hypotheses than AdaBoost*_ν. If ν ≤ 0.01, then TotalBoost_ν is again affected by the slow start, which leads to a relatively large number of selected hypotheses in the beginning. [right] The number of selected hypotheses vs. the number of selected blocks of hypotheses. AdaBoost*_ν often chooses additional hypotheses from previously chosen blocks, while LPBoost typically uses only one per block and TotalBoost_ν a few per block. When ν = .1, TotalBoost_ν behaves more like LPBoost (not shown).

We observe in Figure 3 that the regularization considerably improves the convergence speed to $\rho^*$ of the simplex based solver.

5.4. Redundancy in High Dimensions

We found that LPBoost usually performs very well and is very competitive with TotalBoost_ν in terms of the number of iterations. Additionally, it only needs to solve linear and not entropy minimization problems. However, no iteration bound is known for LPBoost that is independent of the size of the hypothesis set. We performed a series of experiments with increasing dimensionality and compared LPBoost's and TotalBoost_ν's convergence speed. We found that in rather high dimensional cases, LPBoost converges quite slowly when features are redundant (see Figure 4 for an example using dataset 3). In future work, we will investigate why LPBoost converges more slowly in this example and construct more extreme datasets that show this.

6. Conclusion

We view boosting as a relative entropy projection method and obtain our iteration bounds without bounding the average training error in terms of the product of exponential potentials, as is customarily done in the boosting literature (see e.g. Schapire and Singer (1999)). In the full paper we will relate our methods to the latter, slightly longer proof style.

The proof technique based on Bregman projections and the Generalized Pythagorean Theorem is very versatile. The iteration bound of $O(\frac{\log N}{\nu^2})$ holds for all boosting algorithms that use constrained minimization of any Bregman divergence $\Delta(\cdot, \cdot)$ over a domain that contains the probability simplex, for which $\inf_{d \in C_t} \Delta(d, d^t) = \Omega(\nu^2)$ and $\Delta(d^T, (\frac{1}{N})\mathbf{1}) = O(\log N)$. For example, the sum of binary entropies has both these properties:

$$\inf_{C_t} \overbrace{\sum_n \left( d_n \ln \frac{d_n}{d^t_n} + (1 - d_n) \ln \frac{1 - d_n}{1 - d^t_n} \right)}^{\Delta_2(d,\, d^t)} \ \ge\ \inf_{C_t} \Delta(d, d^t) + \underbrace{\inf_{d:\ \sum_n d_n = 1} \Delta(\mathbf{1} - d,\ \mathbf{1} - d^t)}_{0} \ \stackrel{(4)}{>}\ \frac{\nu^2}{2},$$

where the first inequality follows from splitting the inf and dropping one of the constraints from the constraint set $C_t$, and $\mathbf{1}$ denotes the all-one vector. Furthermore, $\Delta_2(d^T, (\frac{1}{N})\mathbf{1}) \le (\ln N) + 1$, and this leads to an iteration bound of $\frac{2((\ln N) + 1)}{\nu^2}$ (a numerical sanity check of these two properties is sketched below). The corrective version based on this divergence has been called LogitBoost (Friedman et al., 2000; Duffy & Helmbold, 2000). The above reasoning immediately provides $O(\frac{\log N}{\nu^2})$ iteration bounds for the totally corrective versions of LogitBoost that maximize the margin. Even though the theoretical bounds for the LogitBoost variants are essentially the same as the bounds for the standard relative entropy algorithms discussed in this paper, the LogitBoost variants are marginally inferior in practice (not shown).

Both the corrective and totally corrective algorithms for maximizing the margin start rather slowly, and heuristics are needed for decreasing the edge bound $\hat\gamma_t$ so that this slow start is avoided.
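The snippet below (our own sanity check, not from the paper) computes the sum of binary entropies $\Delta_2$ and confirms numerically that $\Delta_2(d, (\frac{1}{N})\mathbf{1}) \le (\ln N) + 1$ on random distributions:

```python
import numpy as np

def delta2(p, q):
    """Sum of binary relative entropies:
    sum_n [ p_n ln(p_n/q_n) + (1-p_n) ln((1-p_n)/(1-q_n)) ]."""
    def part(a, b):
        out = np.zeros_like(a)
        m = a > 0
        out[m] = a[m] * np.log(a[m] / b[m])
        return out
    return float(np.sum(part(p, q) + part(1.0 - p, 1.0 - q)))

rng = np.random.default_rng(1)
N = 50
uniform = np.full(N, 1.0 / N)          # the distribution (1/N) * 1
for _ in range(1000):
    d = rng.dirichlet(np.ones(N))
    assert delta2(d, uniform) <= np.log(N) + 1.0 + 1e-9
```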

Figure 3: LPBoost with different optimizers: shown is the margin vs. the number of selected hypotheses. Different optimizers lead to the selection of different hypotheses with varying maximum margins. Adding a regularizer (see text) significantly improves the simplex solution in some cases.

Figure 4: LPBoost vs. TotalBoost_ν on two 100,000-dimensional datasets. Shown are the margins vs. the number of iterations: [left] data with 100 duplicated blocks (with clipped Gaussian noise) and [right] data with independent features. For TotalBoost_ν, we depict the realized (green) and the LP-optimized (blue) margin. When there are lots of duplicated features, LPBoost stalls after an initial fast phase, while it performs well in other cases. We did not observe this behavior for TotalBoost_ν or AdaBoost*_ν (not shown). The difference becomes larger when the block size is increased.

For practical noisy applications, boosting algorithms are needed that allow for a bias term and for soft margins. LPBoost has already been used this way in Bennett et al. (2000), but no iteration bounds are known for any version of LPBoost. We show in the full paper that our methodology still leads to iteration bounds for boosting algorithms with entropic regularization when a bias term is added. Iteration bounds for soft margin versions are left as future research.

References

Bennett, K., Demiriz, A., & Shawe-Taylor, J. (2000). A column generation algorithm for boosting. Proc. ICML (pp. 65–72). Morgan Kaufmann.

Bregman, L. (1967). The relaxation method for finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Math. and Math. Physics, 7.

Breiman, L. (1997). Prediction games and arcing algorithms (Technical Report 504). Statistics Department, University of California at Berkeley.

Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation, 11.

Duffy, N., & Helmbold, D. (2000). Potential boosters? NIPS '00.

Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. of Comp. & Sys. Sci., 55.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28.

Grove, A., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. Proc. 15th Nat. Conf. on Art. Int.

Herbster, M., & Warmuth, M. (2001). Tracking the best linear predictor. J. Mach. Learn. Res.

Kivinen, J., & Warmuth, M. (1999). Boosting as entropy projection. COLT '99.

Lafferty, J. (1999). Additive models, boosting, and inference for generalized divergences. COLT '99.

Littlestone, N. (1988). Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2.

Long, P. M., & Wu, X. (2005). Mistake bounds for maximum entropy discrimination. NIPS '04.

Nocedal, J., & Wright, S. (2000). Numerical optimization. Springer Series in Op. Res. Springer.

Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42.

Rätsch, G., & Warmuth, M. K. (2005). Efficient margin maximizing with boosting. J. Mach. Learn. Res.

Rudin, C., Daubechies, I., & Schapire, R. (2004a). Dynamics of AdaBoost: Cyclic behavior and convergence of margins. J. Mach. Learn. Res.

Rudin, C., Schapire, R., & Daubechies, I. (2004b). Analysis of boosting algorithms using the smooth margin function: A study of three algorithms. Unpublished manuscript.

Schapire, R., Freund, Y., Bartlett, P., & Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26.

Schapire, R., & Singer, Y. (1999).
Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37.


More information

Normal Distribution.

Normal Distribution. Normal Distributio www.icrf.l Normal distributio I probability theory, the ormal or Gaussia distributio, is a cotiuous probability distributio that is ofte used as a first approimatio to describe realvalued

More information

Hypergeometric Distributions

Hypergeometric Distributions 7.4 Hypergeometric Distributios Whe choosig the startig lie-up for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you

More information

THE ABRACADABRA PROBLEM

THE ABRACADABRA PROBLEM THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected

More information

INVESTMENT PERFORMANCE COUNCIL (IPC)

INVESTMENT PERFORMANCE COUNCIL (IPC) INVESTMENT PEFOMANCE COUNCIL (IPC) INVITATION TO COMMENT: Global Ivestmet Performace Stadards (GIPS ) Guidace Statemet o Calculatio Methodology The Associatio for Ivestmet Maagemet ad esearch (AIM) seeks

More information

AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99

AMS 2000 subject classification. Primary 62G08, 62G20; secondary 62G99 VARIABLE SELECTION IN NONPARAMETRIC ADDITIVE MODELS Jia Huag 1, Joel L. Horowitz 2 ad Fegrog Wei 3 1 Uiversity of Iowa, 2 Northwester Uiversity ad 3 Uiversity of West Georgia Abstract We cosider a oparametric

More information

A Constant-Factor Approximation Algorithm for the Link Building Problem

A Constant-Factor Approximation Algorithm for the Link Building Problem A Costat-Factor Approximatio Algorithm for the Lik Buildig Problem Marti Olse 1, Aastasios Viglas 2, ad Ilia Zvedeiouk 2 1 Ceter for Iovatio ad Busiess Developmet, Istitute of Busiess ad Techology, Aarhus

More information

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,

More information

Chapter 6: Variance, the law of large numbers and the Monte-Carlo method

Chapter 6: Variance, the law of large numbers and the Monte-Carlo method Chapter 6: Variace, the law of large umbers ad the Mote-Carlo method Expected value, variace, ad Chebyshev iequality. If X is a radom variable recall that the expected value of X, E[X] is the average value

More information

HIGH-DIMENSIONAL REGRESSION WITH NOISY AND MISSING DATA: PROVABLE GUARANTEES WITH NONCONVEXITY

HIGH-DIMENSIONAL REGRESSION WITH NOISY AND MISSING DATA: PROVABLE GUARANTEES WITH NONCONVEXITY The Aals of Statistics 2012, Vol. 40, No. 3, 1637 1664 DOI: 10.1214/12-AOS1018 Istitute of Mathematical Statistics, 2012 HIGH-DIMENSIONAL REGRESSION WITH NOISY AND MISSING DATA: PROVABLE GUARANTEES WITH

More information

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals

Overview. Learning Objectives. Point Estimate. Estimation. Estimating the Value of a Parameter Using Confidence Intervals Overview Estimatig the Value of a Parameter Usig Cofidece Itervals We apply the results about the sample mea the problem of estimatio Estimatio is the process of usig sample data estimate the value of

More information

Reliability Analysis in HPC clusters

Reliability Analysis in HPC clusters Reliability Aalysis i HPC clusters Narasimha Raju, Gottumukkala, Yuda Liu, Chokchai Box Leagsuksu 1, Raja Nassar, Stephe Scott 2 College of Egieerig & Sciece, Louisiaa ech Uiversity Oak Ridge Natioal Lab

More information

Stochastic Online Scheduling with Precedence Constraints

Stochastic Online Scheduling with Precedence Constraints Stochastic Olie Schedulig with Precedece Costraits Nicole Megow Tark Vredeveld July 15, 2008 Abstract We cosider the preemptive ad o-preemptive problems of schedulig obs with precedece costraits o parallel

More information

Chapter 7: Confidence Interval and Sample Size

Chapter 7: Confidence Interval and Sample Size Chapter 7: Cofidece Iterval ad Sample Size Learig Objectives Upo successful completio of Chapter 7, you will be able to: Fid the cofidece iterval for the mea, proportio, ad variace. Determie the miimum

More information

1. C. The formula for the confidence interval for a population mean is: x t, which was

1. C. The formula for the confidence interval for a population mean is: x t, which was s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value

More information

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown

Z-TEST / Z-STATISTIC: used to test hypotheses about. µ when the population standard deviation is unknown Z-TEST / Z-STATISTIC: used to test hypotheses about µ whe the populatio stadard deviatio is kow ad populatio distributio is ormal or sample size is large T-TEST / T-STATISTIC: used to test hypotheses about

More information

AP Calculus BC 2003 Scoring Guidelines Form B

AP Calculus BC 2003 Scoring Guidelines Form B AP Calculus BC Scorig Guidelies Form B The materials icluded i these files are iteded for use by AP teachers for course ad exam preparatio; permissio for ay other use must be sought from the Advaced Placemet

More information

Trackless online algorithms for the server problem

Trackless online algorithms for the server problem Iformatio Processig Letters 74 (2000) 73 79 Trackless olie algorithms for the server problem Wolfgag W. Bei,LawreceL.Larmore 1 Departmet of Computer Sciece, Uiversity of Nevada, Las Vegas, NV 89154, USA

More information

Coordinating Principal Component Analyzers

Coordinating Principal Component Analyzers Coordiatig Pricipal Compoet Aalyzers J.J. Verbeek ad N. Vlassis ad B. Kröse Iformatics Istitute, Uiversity of Amsterdam Kruislaa 403, 1098 SJ Amsterdam, The Netherlads Abstract. Mixtures of Pricipal Compoet

More information