Totally Corrective Boosting Algorithms that Maximize the Margin




Manfred K. Warmuth  manfred@cse.ucsc.edu
Jun Liao  liaoju@cse.ucsc.edu
University of California at Santa Cruz, Santa Cruz, CA 95064, USA

Gunnar Rätsch  Gunnar.Raetsch@tuebingen.mpg.de
Friedrich Miescher Laboratory of the Max Planck Society, Spemannstr. 39, 72076 Tübingen, Germany

Keywords: Boosting, Margins, Convergence, Relative Entropy, Bregman Divergences, Bregman Projection

Abstract

We consider boosting algorithms that maintain a distribution over a set of examples. At each iteration a weak hypothesis is received and the distribution is updated. We motivate these updates as minimizing the relative entropy subject to linear constraints. For example, AdaBoost constrains the edge of the last hypothesis w.r.t. the updated distribution to be at most γ = 0. In some sense, AdaBoost is corrective w.r.t. the last hypothesis. A cleaner boosting method is to be "totally corrective": the edges of all past hypotheses are constrained to be at most γ, where γ is suitably adapted. Using new techniques, we prove the same iteration bounds for the totally corrective algorithms as for their corrective versions. Moreover, with adaptive γ, the algorithms provably maximize the margin. Experimentally, the totally corrective versions return smaller convex combinations of weak hypotheses than the corrective ones and are competitive with LPBoost, a totally corrective boosting algorithm with no regularization, for which there is no iteration bound known.

The first author was partially funded by NSF grant CCR-981087. The first two authors were partially funded by UC Discovery grant ITl03-1017 and Telik Inc. grant ITl03-10110. Part of this work was done while the third author was visiting UC Santa Cruz. The authors would like to thank Telik Inc. for providing the COX-1 dataset. Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

1. Introduction

In this paper we characterize boosting algorithms by the underlying optimization problems rather than by the approximation algorithms that solve these problems. The goal is to select a small convex combination of weak hypotheses that maximizes the margin. For lack of space we only compare the algorithms in terms of this goal rather than the generalization error, and refer to (Schapire et al., 1998) for generalization bounds that improve with the margin and degrade with the size of the final convex combination.

One of the most common boosting algorithms is AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999). It can be viewed as minimizing the relative entropy to the last distribution subject to the constraint that the edge of the last hypothesis is zero (equivalently, its weighted error is half) (Kivinen & Warmuth, 1999; Lafferty, 1999). One of the important properties of AdaBoost is that it has a decent iteration bound and approximately maximizes the margin of the examples (Breiman, 1997; Rätsch et al., 2001; Rudin et al., 2004a). A similar algorithm called AdaBoost*_ν provably maximizes the margin and has an analogous iteration bound (Rätsch & Warmuth, 2005).¹ This algorithm enforces only a single constraint at iteration t: the edge of the hypothesis must be at most γ, where γ is adapted. A natural idea is to constrain the edges of all t past hypotheses to be at most γ and otherwise minimize the relative entropy to the initial distribution. Such algorithms were proposed by Kivinen and Warmuth (1999) and are called "totally corrective".

¹ Other algorithms for maximizing the margin with weaker iteration bounds are given in (Breiman, 1999; Rudin et al., 2004a).

However, in that paper only γ = 0 was considered, which leads to an infeasible optimization problem when the training data is separable. Building on the work of Rätsch and Warmuth (2005), we now adapt the edge bound γ of the totally corrective algorithm so that the margin is approximately maximized. We call our new algorithm TotalBoost. The corrective AdaBoost*_ν can be used as a heuristic for implementing TotalBoost by doing many passes over all past hypotheses before adding a new one. However, we can show that this heuristic is often several orders of magnitude less efficient than a vanilla sequential quadratic optimization approach for solving the optimization problem underlying TotalBoost.

A parallel progression occurred for on-line learning algorithms for disjunctions. The original algorithms (variants of the Winnow algorithm (Littlestone, 1988)) can be seen as processing a single constraint induced by the last example. However, more recently an on-line algorithm has been developed for learning disjunctions (in the noise-free case) that enforces the constraints induced by all past examples (Long & Wu, 2005). The proof techniques in both settings are essentially the same, except that for disjunctions the margin/threshold is fixed whereas in boosting we optimize the margin.

Besides emphasizing the new proof methods for iteration bounds of boosting algorithms, this paper also does an experimental comparison of the algorithms. We show that while TotalBoost has the same iteration bound as AdaBoost*_ν, it often requires several orders of magnitude fewer iterations. When there are many similar weak hypotheses, the totally corrective algorithms have an additional advantage: assume we have 100 groups of 100 weak hypotheses each, where the hypotheses within each group are very similar. TotalBoost picks a small number of hypotheses from each group, whereas the algorithms that process one constraint at a time often come back to the same group and choose many more members from the same group. Therefore in our experiments the number of weak hypotheses in the final convex combination (with non-zero coefficients) is consistently much smaller for the totally corrective algorithms, making them better suited for the purpose of feature selection.

Perhaps one of the simplest boosting algorithms is LPBoost: it is totally corrective, but unlike TotalBoost, it uses no entropic regularization. Also, the upper bound γ on the edge is chosen to be as small as possible in each iteration, whereas in TotalBoost it is decreased more moderately. Experimentally, we have identified cases where TotalBoost requires considerably fewer iterations than LPBoost, which suggests that either the entropic regularization or the moderate choice of γ is helpful for more than just proving iteration bounds.

2. Preliminaries

Assume we are given N labeled examples (x_n, y_n), n = 1, ..., N, where the examples are from some domain and the labels y_n lie in {±1}. A boosting algorithm combines many weak hypotheses or rules of thumb for the examples to form a convex combination of hypotheses with high accuracy. In this paper a boosting algorithm adheres to the following protocol: it maintains a distribution d^t on the examples; in each iteration t a weak learner provides a weak hypothesis h_t and the distribution d^t is updated to d^{t+1}. Intuitively the updated distribution incorporates the information obtained from h_t and gives high weights to the remaining hard examples. After iterating T steps the algorithm stops and outputs a convex combination of the T weak hypotheses it received from the weak learner. We first discuss how we measure the performance of a weak hypothesis h w.r.t. the current distribution d.
If h is ±1-valued, then the error ε is the total weight on all the examples that are misclassified. When the range of a hypothesis h is the entire interval [−1, +1], then the edge γ_h(d) = Σ_{n=1}^N d_n y_n h(x_n) is a more convenient quantity for measuring the quality of h. This edge is an affine transformation of the error for the case when h has range ±1: ε_h(d) = 1/2 − (1/2) γ_h(d). Ideally we want a hypothesis of edge 1 (error 0). On the other hand, it is often easy to produce hypotheses of edge at least 0 (or equivalently error at most 1/2). We define the edge of a set of hypotheses as the maximum of the edges.

Assumption on the weak learner: Assume that for any distribution d on the examples, the weak learner returns a hypothesis h with edge γ_h(d) at least g. As we will discuss later, the guarantee parameter g might not be known to the boosting algorithm.

Boosting algorithms produce a convex combination of weak hypotheses: f_α(x) := Σ_{t=1}^T α_t h_t(x), where h_t is the hypothesis added in iteration t and α_t is its coefficient. The margin of a given example (x_n, y_n) is defined as y_n f_α(x_n). The margin of a set of examples is always the minimum over the examples. Our algorithms always produce a convex combination of weak learners of margin at least g − ν, where ν is a precision parameter. Also the size of the convex combination is at most O(log N / ν²). Note that the higher the guarantee g of the weak learner, the larger the produced margin.
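As a concrete illustration of these definitions, the following minimal NumPy sketch (ours, not the authors' code) computes edges, the error/edge relation, and the margin of a convex combination; the toy matrix U with entries u_{t,n} = y_n h_t(x_n) is an assumed input.

```python
import numpy as np

def edge(u_t, d):
    """Edge gamma_h(d) = sum_n d_n * y_n * h(x_n) of a single hypothesis,
    where u_t[n] = y_n * h_t(x_n) and d is a distribution over the examples."""
    return float(np.dot(d, u_t))

def error_from_edge(gamma):
    """Weighted error of a +-1 valued hypothesis: eps_h(d) = 1/2 - gamma/2."""
    return 0.5 - 0.5 * gamma

def margin(U, alpha):
    """Margin of the example set under f_alpha = sum_t alpha_t h_t:
    min_n y_n f_alpha(x_n), with U[t, n] = y_n * h_t(x_n)."""
    return float((U.T @ alpha).min())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 3, 6
    U = rng.choice([-1.0, 1.0], size=(T, N))   # toy values of y_n * h_t(x_n)
    d = np.full(N, 1.0 / N)                    # uniform initial distribution d^1
    alpha = np.array([0.5, 0.3, 0.2])          # a convex combination of the T hypotheses
    print([edge(U[t], d) for t in range(T)])   # edges of the individual hypotheses
    print(margin(U, alpha))                    # margin of the example set
```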

Algorithm 1: LPBoost

1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, desired accuracy ν.
2. Initialize: d^1_n = 1/N for all n = 1, ..., N.
3. Do for t = 1, ...
   (a) Train the classifier on {S, d^t} and obtain hypothesis h_t : x → [−1, 1]; let u^t_n = y_n h_t(x_n).
   (b) Calculate the edge γ_t of h_t: γ_t = d^t · u^t.
   (c) Set γ̂_t = min_{q=1,...,t} γ_q.
   (d) Compute γ*_t as in (1) and set d^{t+1} to any distribution d for which u^q · d ≤ γ*_t, for 1 ≤ q ≤ t.
   (e) If γ*_t ≥ γ̂_t − ν then T = t and break.
4. Output: f_α(x) = Σ_{t=1}^T α_t h_t(x), where the coefficients α_t realize margin γ*_T.

How are edges and margins related? By duality, the minimum edge of the examples w.r.t. the hypothesis set H_t = {h_1, ..., h_t} equals the maximum margin:

   γ*_t := min_d max_{h ∈ H_t} γ_h(d) = max_α min_n y_n f_α(x_n) =: ρ_t,   (1)

where d and α are N- and t-dimensional probability vectors, respectively. Note that the sequence γ*_t is non-decreasing. It will approach the guarantee g from below. The algorithms will stop as soon as the edges are within ν of g (see the next section).

The above duality also restricts the range of the guarantee g that a weak learner can possibly have. Let H be the entire (possibly infinite) hypothesis set from which the weak learner is choosing. If H is compact (see the discussion in Rätsch & Warmuth, 2005), then

   γ* := min_d max_{h ∈ H} γ_h(d) = max_α min_n y_n f_α(x_n) =: ρ*,

where d and α are probability distributions over the examples and over H, respectively, and f_α(x_n) now sums over H. Clearly g ≤ ρ* and for any non-optimal d, α:

   max_{h ∈ H} γ_h(d) > γ* = ρ* > min_n y_n f_α(x_n) =: ρ(α).   (2)

So even though there always is a weak hypothesis in H with edge at least ρ*, the weak learner is only guaranteed to produce one of edge at least g ≤ ρ*.

One of the most bare-bones boosting algorithms is LPBoost (Algorithm 1), proposed by Grove and Schuurmans (1998) and Bennett et al. (2000). It uses linear programming to constrain the edges of the past t weak hypotheses to be at most γ*_t, which is as small as possible. No iteration bound is known for this algorithm, and the performance can also very much depend on which LP solver is used (see the experimental section).

Algorithm 2: TotalBoost_ν with accuracy parameter ν

1. Input: S = ⟨(x_1, y_1), ..., (x_N, y_N)⟩, desired accuracy ν.
2. Initialize: d^1_n = 1/N for all n = 1, ..., N.
3. Do for t = 1, ...
   (a) Train the classifier on {S, d^t} and obtain hypothesis h_t : x → [−1, 1]; let u^t_n = y_n h_t(x_n).
   (b) Calculate the edge γ_t of h_t: γ_t = d^t · u^t.
   (c) Set γ̂_t = (min_{q=1,...,t} γ_q) − ν.
   (d) Update weights: d^{t+1} = argmin_{d ∈ P^N : d · u^q ≤ γ̂_t, for 1 ≤ q ≤ t} Δ(d, d^1).
   (e) If the above problem is infeasible or d^{t+1} contains a zero, then T = t and break.
4. Output: f_α(x) = Σ_{t=1}^T α_t h_t(x), where the coefficients α_t maximize the margin over the hypothesis set {h_1, ..., h_T}.

Algorithm 3: TotalBoost_g with accuracy parameter ν and edge guarantee g

As TotalBoost_ν, but in step 3(c) we use γ̂_t = g − ν.

Our algorithms are motivated by the minimum relative entropy principle of Jaynes: among the solutions satisfying some linear constraints, choose the one that minimizes a relative entropy to the initial distribution d^1, where the relative entropy is defined as Δ(d̃, d) = Σ_n d̃_n ln (d̃_n / d_n). Our default initial distribution is uniform. However, the analysis works for any choice of d^1 with non-zero components.

There are two totally corrective versions of the algorithm: one that knows the guarantee g of the weak learner and one that does not. The one that does (called TotalBoost_g; Algorithm 3) simply constrains the edges of the previous hypotheses to be at most g − ν, where ν is a given precision parameter. Our main algorithm, TotalBoost_ν (Algorithm 2), does not know g.
It maintains the estimates γ̂_t = (min_{q=1,...,t} γ_q) − ν and constrains the edges of the past hypotheses to be at most γ̂_t. The sequence {γ̂_t}_t is clearly non-increasing. By our assumption γ_t ≥ g, and therefore γ̂_t ≥ g − ν.
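Since (1) is an ordinary linear program, the quantities γ*_t and ρ_t (and the coefficients α that realize this margin) can be computed with any off-the-shelf LP solver. Here is a small illustrative sketch (ours, not the authors' implementation) using SciPy's linprog; the matrix U with rows u^t is an assumed input.

```python
import numpy as np
from scipy.optimize import linprog

def lp_margin(U):
    """Solve the max-min margin LP of equation (1):
    rho_t = max_alpha min_n sum_t alpha_t * U[t, n] over probability vectors alpha.
    By LP duality this value equals the minimum edge gamma*_t.
    U[t, n] = y_n * h_t(x_n) for the hypotheses selected so far."""
    T, N = U.shape
    # variables x = (alpha_1, ..., alpha_T, rho); maximize rho <=> minimize -rho
    c = np.zeros(T + 1)
    c[-1] = -1.0
    # margin constraints: rho - sum_t alpha_t U[t, n] <= 0 for every example n
    A_ub = np.hstack([-U.T, np.ones((N, 1))])
    b_ub = np.zeros(N)
    # alpha is a probability vector: sum_t alpha_t = 1, alpha_t >= 0
    A_eq = np.hstack([np.ones((1, T)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * T + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not res.success:
        raise RuntimeError("LP solver failed: " + res.message)
    alpha, rho = res.x[:T], float(res.x[-1])
    return alpha, rho
```

The dual solution of the same LP is a distribution d attaining the min-max edge, i.e. one of the distributions LPBoost may use in step 3(d).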

Algorithm 4: AdaBoost*_ν with accuracy parameter ν

As TotalBoost_ν, but minimize the divergence to the last distribution w.r.t. a single constraint:
   d^{t+1} = argmin_{d : d · u^t ≤ γ̂_t} Δ(d, d^t).
Let α_t be the dual coefficient of the constraint on the edge of h_t used in iteration t. The algorithm breaks if the margin w.r.t. the current convex combination (i.e., the normalized α_t) is at least γ̂_t.

Algorithm 5: AdaBoost*_g with accuracy parameter ν and guarantee g

As AdaBoost*_ν, but in step 3(c) we use γ̂_t = g − ν.

3. Termination Guarantees

When the algorithms break, we need to guarantee that the margin w.r.t. the current hypothesis set is at least g − ν. TotalBoost_g is given g and constrains the edges of all past hypotheses to be at most g − ν. When these constraints become infeasible, the edge γ*_t w.r.t. the current hypothesis set is larger than g − ν. The algorithm also breaks when the solution d^{t+1} of the minimization problem lies at the boundary of the simplex (i.e., the distribution has a zero component).² In this case γ*_t = g − ν, because if γ*_t < g − ν, then all constraints would have slack and the solution d that minimizes the divergence Δ(d, d^1) would lie in the interior of the simplex, since d^1 does. Thus whenever the algorithm breaks, we have ρ* ≥ γ*_T ≥ g − ν. TotalBoost_g outputs a convex combination of the hypotheses {h_1, ..., h_T} that maximizes the margin. By duality, the value ρ_T of this margin equals the minimum edge γ*_T, and therefore TotalBoost_g is guaranteed to output a combined hypothesis of margin larger than g − ν.

The second algorithm, TotalBoost_ν, does not know the guarantee g of the weak learner. It breaks if its optimization problem becomes infeasible, which happens when γ*_t > γ̂_t ≥ g − ν. The algorithm also breaks when the solution d^{t+1} of the minimization problem lies at the boundary of the simplex. In this case, γ*_t = γ̂_t by an argument similar to the one used above. Thus whenever the algorithm breaks, we have γ*_t ≥ γ̂_t ≥ g − ν, and therefore TotalBoost_ν is guaranteed to output a hypothesis of margin ρ_t = γ*_t ≥ g − ν. The termination condition for LPBoost³ follows a similar argument: we directly check for γ*_t ≥ γ̂_t − ν.

The algorithm AdaBoost*_ν computes the margin using the normalized dual coefficients α_t of its constraints and stops as soon as this margin is at least γ̂_t. Finally, AdaBoost*_g breaks when the same margin is at least g − ν. For both of these algorithms the current distribution d^t lies in the interior because the dual coefficients α_t are finite and d^t_n ∝ d^1_n exp(−Σ_{q=1}^{t−1} α_q u^q_n).

² This second condition for breaking is only added to ensure that the dual variables of the optimization problem of TotalBoost remain finite.
³ We use a different termination condition for LPBoost than in (Bennett et al., 2000; Grove & Schuurmans, 1998).

4. Iteration Bound

In the previous section we showed that when the algorithms break, the output hypothesis has margin at least g − ν. We now show that TotalBoost_ν must break after T ≤ 2 ln N / ν² iterations. In each iteration t, the algorithm updates to the distribution that is closest to d^1 and lies in a certain convex set, and these sets get smaller as t increases. Here closeness is measured with the relative entropy, which is a special Bregman divergence. This closest point is called a projection of d^1 onto the convex set (d^1 is assumed to lie in the interior of the simplex). The proof is analogous to an on-line mistake bound for learning disjunctions (Long & Wu, 2005). It employs the Generalized Pythagorean Theorem that holds for such projections w.r.t. any Bregman divergence (Bregman, 1967, Lemma 1; Herbster & Warmuth, 2001, Theorem 2).

Theorem 1. TotalBoost_ν breaks after at most 2 ln N / ν² iterations.

Proof. Let C_t denote the convex set of all points d ∈ R^N that satisfy Σ_n d_n = 1, d_n ≥ 0 (for 1 ≤ n ≤ N), and the edge constraints d · u^q ≤ γ̂_t, for 1 ≤ q ≤ t, where u^q_n = y_n h_q(x_n). The distribution d^t at iteration t ≥ 1 is the projection of d^1 onto the closed convex set C_{t−1}.
Notice that C_0 is the entire simplex, and because γ̂_t can only decrease and a new constraint is added in trial t, we have C_t ⊆ C_{t−1}. If t ≤ T − 1, then our termination condition assures that at trial t ≥ 1 the set C_{t−1} has a feasible solution in the interior of the simplex. Also d^1 lies in the interior and d^{t+1} ∈ C_t ⊆ C_{t−1}. These preconditions assure that at trial t ≥ 1 the projection d^t of d^1 onto C_{t−1} exists and the Generalized Pythagorean Theorem for Bregman divergences can be applied:

   Δ(d^{t+1}, d^1) ≥ Δ(d^t, d^1) + Δ(d^{t+1}, d^t).   (3)

Since d^t · u^t = γ_t and d^{t+1} · u^t ≤ γ̂_t ≤ γ_t − ν, we have d^t · u^t − d^{t+1} · u^t ≥ ν, and because u^t ∈ [−1, 1]^N, ‖d^{t+1} − d^t‖_1 ≥ ν. We now apply Pinsker's inequality: ‖d^{t+1} − d^t‖_1 ≥ ν implies that

   Δ(d^{t+1}, d^t) > ν²/2.   (4)

By summing (3) over the first T − 1 trials we obtain

   Δ(d^T, d^1) − Δ(d^1, d^1) > (T − 1) ν²/2,

where Δ(d^1, d^1) = 0. Since the left-hand side is at most ln N, the bound of the theorem follows.
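To get a feel for the size of this bound, here is a tiny calculation (illustrative only; the values of N and ν below are arbitrary choices, not taken from the paper's experiments).

```python
import math

def iteration_bound(N, nu):
    """Upper bound 2 ln(N) / nu^2 of Theorem 1 on the iterations of TotalBoost_nu."""
    return 2 * math.log(N) / nu ** 2

for nu in (0.03, 0.01, 0.003):
    # e.g. N = 1000 examples; the bound grows only logarithmically in N
    print(nu, round(iteration_bound(1000, nu)))
```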

The key requirement for this proof is that the closed and convex constraint sets C_t used for the projection at trial t must be non-increasing. It is therefore easy to see that the iteration bound also holds for the TotalBoost_g algorithm because of our assumption that γ_t ≥ g. In the complete paper we prove the same iteration bound for the corrective versions AdaBoost*_ν and AdaBoost*_g, and for the variants of TotalBoost where argmin Δ(d, d^1) is replaced by argmin Δ(d, d^t).

5. Experiments

In this section we illustrate the behavior of our new algorithms TotalBoost_ν and TotalBoost_g, and compare them with LPBoost and AdaBoost*_ν on three different datasets:

Dataset 1 is a public dataset from Telik Inc. for a drug discovery problem called COX-1: 15 binary labeled examples with a set of 3888 binary features that are complementation closed.

Dataset 2 is an artificial dataset used in Rudin et al. (2004b) for investigating boosting algorithms that maximize the margin: 50 binary labeled examples with 100 binary features. For each original feature we added 99 similar features by inverting the feature value of one randomly chosen example (with replacement). This results in a 10,000-dimensional feature set of 100 blocks of size 100.

Dataset 3 is a series of artificially generated datasets of 1000 examples with a varying number of features but roughly constant margin. We first generated N_1 random ±1-valued features x_1, ..., x_{N_1} and set the label of the examples as y_n = sign(x_1 + x_2 + x_3 + x_4 + x_5). We then duplicated each feature N_2 times, perturbed the features by Gaussian noise with σ = 0.1, and clipped the feature values so that they lie in the interval [−1, 1]. We considered N_1 = 1, 10, 100 and N_2 = 10, 100, 1000.

The features of our datasets represent the values of the available weak hypotheses on the examples. In each iteration of boosting, the base learner simply selects the feature that maximizes the edge w.r.t. the current distribution d on the examples. This means that the guarantee g equals the maximum margin ρ*. Note that our datasets and the base learner were chosen to exemplify certain properties of the algorithms, and more extensive experiments are still needed.

We first discuss how the entropy minimization problems can be solved efficiently. We then compare the algorithms w.r.t. the number of iterations and the number of selected hypotheses. Finally we show how LPBoost is affected by the underlying optimizer and exhibit cases where LPBoost requires considerably more iterations than TotalBoost.

5.1. Solving the Entropy Problems

We use a vanilla sequential quadratic programming algorithm (Nocedal & Wright, 2000) for solving our main optimization problem:

   min_{d : Σ_n d_n = 1, d ≥ 0, u^q · d ≤ γ̂_t (1 ≤ q ≤ t)}  Σ_{n=1}^N d_n log (d_n / d^1_n).

We initially set our approximate solution to d̄ = d^1 and iteratively optimize d̄. Given that the current solution d̄ satisfies the constraints Σ_n d̄_n = 1 and d̄ ≥ 0, we determine an update δ by solving the following problem:

   min_δ  Σ_{n=1}^N [ (1 + log (d̄_n / d^1_n)) δ_n + (1 / (2 d̄_n)) δ_n² ],

w.r.t. the constraints d̄_n + δ_n ≥ 0, Σ_n δ_n = 0, and u^q · (d̄ + δ) ≤ γ̂_t (for 1 ≤ q ≤ t). The estimate d̄ is updated to d̄ ← d̄ + δ and we repeat this process until convergence. The algorithm typically converges in very few steps. Note that the above objective is the 2nd-order Taylor approximation of the relative entropy Δ(d̄ + δ, d^1) at δ = 0. The resulting optimization problem is quadratic with a diagonal Hessian and can be efficiently solved by off-the-shelf optimizer packages (e.g., ILOG CPLEX).
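For readers who want to experiment, the projection in step 3(d) can also be handed to a general-purpose solver instead of the tailored sequential quadratic scheme described above. The sketch below is our own illustration (not the authors' implementation): it minimizes the relative entropy with SciPy's SLSQP under the simplex and edge constraints, where U holds the vectors u^q of the hypotheses seen so far and gamma_hat is the current bound γ̂_t; a failed solve is treated as infeasibility, which is a cruder test than the one used by TotalBoost.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_projection(d1, U, gamma_hat):
    """Relative-entropy projection of d1 onto
    {d : sum_n d_n = 1, d_n >= 0, U[q] . d <= gamma_hat for all q}.
    Returns the projected distribution, or None if the solver reports failure
    (which we take as a sign that the constraint set is infeasible)."""
    N = d1.shape[0]

    def kl(d):
        d = np.clip(d, 1e-12, None)           # keep the logarithm well defined
        return float(np.sum(d * np.log(d / d1)))

    def kl_grad(d):
        d = np.clip(d, 1e-12, None)
        return 1.0 + np.log(d / d1)

    constraints = [
        {"type": "eq",   "fun": lambda d: np.sum(d) - 1.0},     # simplex constraint
        {"type": "ineq", "fun": lambda d: gamma_hat - U @ d},   # edge constraints
    ]
    res = minimize(kl, d1, jac=kl_grad, method="SLSQP",
                   bounds=[(0.0, 1.0)] * N, constraints=constraints)
    return res.x if res.success else None
```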
5.2. Number of Iterations

First, we consider the number of iterations needed until each of the algorithms has achieved a margin of at least ρ* − ν. We use dataset 1 and record the margin of the convex combination of hypotheses produced by TotalBoost_ν, LPBoost and AdaBoost*_ν. Additionally, we compute the maximal margin of the current hypothesis set in each iteration. See Figure 1 for details. The default optimizer used for solving LPs and QPs is ILOG CPLEX's interior point method.

It should be noted that AdaBoost*_ν needs considerably less computation per iteration than the totally corrective algorithms. In the case where calling the base learner is very cheap, AdaBoost*_ν may in some unusual cases require less computation time than TotalBoost_ν.

Figure 1: TotalBoost_ν, LPBoost and AdaBoost*_ν on dataset 1 for ν = 0.03, 0.01, 0.003: We show the margin realized by the normalized dual coefficients α̂_t of TotalBoost_ν and AdaBoost*_ν (green) and the LP-optimized margin ρ_t of (1) (blue). Observe that AdaBoost*_ν needs several thousands of iterations, while the numbers of iterations of TotalBoost_ν and LPBoost are comparable. The margins of TotalBoost_ν and AdaBoost*_ν start growing slowly, in particular when ν is small. The margin of TotalBoost_g (with guarantee g = ρ*) increases faster than that of LPBoost (not shown).

However, in our experiments, the number of iterations required by AdaBoost*_ν to achieve a margin of at least ρ* − ν was 1/10 times the theoretical upper bound 2 log(N)/ν². TotalBoost_ν typically requires many fewer iterations, even though no improved theoretical bound is known for this algorithm. In our experience, the iteration number of TotalBoost_ν depends only slightly on the precision parameter ν, and when γ̂_t is close to ρ*, this algorithm converges very fast to the maximum margin solution (LPBoost has a similar behavior).

While the algorithms AdaBoost*_ν and TotalBoost_ν provably maximize the margin, they both have the problem of starting too slowly for small ν. If there is any good upper bound available for the guarantee g (which here is the optimal margin ρ*), then we can initialize γ̂_t with this upper bound and speed up the starting phase. In particular, when ρ* is known exactly, the algorithms AdaBoost*_g and TotalBoost_g require drastically fewer iterations, and the latter consistently beats LPBoost (not shown). In practical situations it is often easy to obtain a reasonable upper bound for g.

5.3. Number of Hypotheses

In this subsection, we compare how many hypotheses the algorithms need to achieve a large margin. Note that LPBoost and TotalBoost_ν only select a base hypothesis once: after the first selection, the distribution d is maintained such that the edge for that hypothesis is smaller than γ̂_t and it is not selected again. AdaBoost*_ν may select the same hypothesis many times. However, if there are several similar features (as in datasets 2 & 3), then this corrective algorithm often selects hypotheses that are similar to previously selected ones, and the number of weak hypotheses used in the final convex combination is unnecessarily large. Hence, TotalBoost_ν and LPBoost seem better suited for feature selection, when small ensembles are needed.

In Figure 2 we display the margin vs. the number of used and selected hypotheses. The number of selected hypotheses for LPBoost and TotalBoost_ν is equal to the number of iterations. For these algorithms a previously selected hypothesis can become inactive (corresponding α_t = 0). In this case it is not counted as a used hypothesis. Note that the number of used hypotheses for LPBoost may depend on the choice of the optimizer (also see the discussion below). In the case of AdaBoost*_ν, all dual coefficients α_t are non-zero in the final convex combination. (See the caption of Figure 2 for more details.) We can conclude that the totally corrective algorithms need considerably fewer hypotheses when there are many redundant hypotheses/features. LPBoost and TotalBoost_ν differ in the initial iterations (depending on ν), but produce combined hypotheses of similar size.

In Figure 3 we compare the effect of different choices of the optimizer for LPBoost. For dataset 2 there is a surprisingly large difference between interior point and simplex based methods. The reason is that the weights computed by the simplex method are often sparse and the changes in the duplicated features are sparse as well (by design). Hence, it can easily happen that the base learner is blind on some examples when selecting the hypotheses. Interior point methods find a solution in the interior and therefore distribute the weights among the examples. To illustrate that this is the right explanation, we modify LPBoost such that it first computes γ*_t but then it computes the
To illustrate that this is the right explaatio, we modify LPBoost such that it first computes γt but the it computes the

Figure : TotalBoost, LPBoost ad AdaBoost o dataset for = 0.01: [left & middle] The realized (gree) ad the LP-optimized (blue) margi ρ t (as i Figure 1) vs. the umber of used (active) ad selected (active or iactive) hypotheses i the covex combiatio. We observe that the totally corrective algorithms use cosiderable less hypotheses tha the AdaBoost. If 0.01, the TotalBoost is agai affected by the slow start which leads to a relatively large umber of selected hypotheses i the begiig. [right] The umber of selected hypotheses vs. the umber of selected blocks of hypotheses. AdaBoost ofte chooses additioal hypotheses from previously chose blocks, while LPBoost typically uses oly oe per block ad TotalBoost a few per block. Whe =.1, TotalBoost behaves more like LPBoost (ot show). weights usig the relative etropy miimizatio with γ t = γ t + ɛ (where ɛ = 10 4 ). We call this the regularized LPBoost algorithm. We observe i Figure 3 that the regularizatio cosiderably improves the covergece speed to ρ of the simplex based solver. 5.4. Redudacy i High Dimesios We foud that LPBoost usually performs very well ad is very competitive to TotalBoost i terms of the umber of iteratios. Additioally, it oly eeds to solve liear ad ot etropy miimizatio problems. However, o iteratio boud is kow for LP- Boost that is idepedet of the size of the hypothesis set. We performed a series of experimets with icreasig dimesioality ad compared LPBoost s ad TotalBoost s covergece speed. We foud that i rather high dimesioal cases, LPBoost coverges quite slowly whe features are redudat (see Figure 4 for a example usig dataset 3). I future work, we will ivestigate why LPBoost coverges more slowly i this example ad costruct more extreme datasets that show this. 6. Coclusio We view boostig as a relative etropy projectio method ad obtai our iteratio bouds without boudig the average traiig error i terms of the product of expoetial potetials as is customarily doe i the boostig literature (see e.g. Schapire ad Siger (1999)). I the full paper we will relate our methods to the latter slightly loger proof style. The proof techique based o Bregma projectio ad the Geeralized Pythagorea theorem is very versatile. The iteratio boud of O( log N ) holds for all boostig algorithms that use costraied miimizatio of ay Bregma divergece (.,.) over a domai that cotais the probability simplex for which if d Ct (d, d t ) = Ω( ) ad ( d T, ( 1 N )) = O(log N). For example, the sum of biary etropies has both these properties: if C t := (d,d t ) { }} { ( d l d d t + (1 d ) l 1 d ) 1 d t if (d, d t ) + if C t d: P (1 d, 1 (4) d=1 dt ), } {{ } 0 where the first iequality follows from splittig the if ad droppig oe of the costraits from the costrait set( C t ad 1 deotes the all oe vector. Furthermore, d T 1, ( 1 N )) (l N)+1 ad this leads to a ((l N)+1) iteratio boud of. The corrective versio based o this divergece has bee called LogitBoost (Friedma et al., 000; Duffy & Helmbold, 000). The above reasoig immediately provides O( log N ) iteratio bouds for the totally corrective versios of Log- itboost that maximize the margi. Eve though the theoretical bouds for the LogitBoost variats are essetially the same as the bouds for the stadard relative etropy algorithms discussed i this paper, the LogitBoost variats are margially iferior i practice (ot show). Both the corrective ad totally corrective algorithms for maximizig the margi start rather slowly ad heuristics are eeded for decreasig the edge boud γ t

Figure 3: LPBoost with differet optimizers: show is the margi vs. the o. of selected hypotheses. Differet optimizers lead to the selectio of differet hypotheses with varyig maximum margis. Addig a regularizer (see text) sigificatly improves the simplex solutio i some cases. Figure 4: LPBoost vs. TotalBoost o two 100,000 dimesioal datasets. Show is the margis vs. the umber of iteratios: [left] data with 100 duplicated blocks (with clipped Gaussia oise) ad [right] data with idepedet features. For TotalBoost, we depict the realized (gree) ad the LP-optimized (blue) margi. Whe there are lots of duplicated features, the LPBoost stalls after a iitial fast phase, while it performs well i other cases. We did ot observe this behavior for TotalBoost or AdaBoost (ot show). The differece becomes larger whe the block size is icreased. so that this slow start is avoided. For practical oisy applicatios, boostig algorithms are eeded that allow for a bias term ad for soft margis. LPBoost has already bee used this way i Beett et al. (000) but o iteratio bouds are kow for ay versio of LPBoost. We show i the full paper that our methodology still leads to iteratio bouds for boostig algorithms with etropic regularizatio whe a bias term is added. Iteratio bouds for soft margi versios are left as future research. Refereces Beett, K., Demiriz, A., & Shawe-Taylor, J. (000). A colum geeratio algorithm for boostig. Proc. ICML (pp. 65 7). Morga Kaufma. Bregma, L. (1967). The relaxatio method for fidig the commo poit of covex sets ad its applicatio to the solutio of problems i covex programmig. USSR Computatioal Math. ad Math. Physics, 7, 00 17. Breima, L. (1997). Predictio games ad arcig algorithmstechical Report 504). Statistics Departmet, Uiversity of Califoria at Berkeley. Breima, L. (1999). Predictio games ad arcig algorithms. Neural Computatio, 11, 1493 1518. Duffy, N., & Helmbold, D. (000). NIPS 00 (pp. 58 64). Potetial boosters? Freud, Y., & Schapire, R. (1997). A decisio-theoretic geeralizatio of o-lie learig ad a applicatio to boostig. J. of Comp. & Sys. Sci., 55, 119 139. Friedma, J., Hastie, T., & Tibshirai, R. (000). Additive Logistic Regressio: a statistical view of boostig. Aals of Statistics,, 337 374. Grove, A., & Schuurmas, D. (1998). Boostig i the limit: Maximizig the margi of leared esembles. Proc. 15th Nat. Cof. o Art. It.. Herbster, M., & Warmuth, M. (001). Trackig the best liear predictio. J. Mach. Lear. Res., 81 309. Kivie, J., & Warmuth, M. (1999). Boostig as etropy projectio. COLT 99. Lafferty, J. (1999). Additive models, boostig, ad iferece for geeralized divergeces. COLT 99 (pp. 15 133). Littlestoe, N. (1988). Learig whe irrelevat attributes aboud: A ew liear-threshold algorithm. Machie Learig,, 85 318. Log, P. M., & Wu, X. (005). Mistake bouds for maximum etropy discrimiatio. NIPS 04 (pp. 833 840). Nocedal, J., & Wright, S. (000). Numerical optimizatio. Spriger Series i Op. Res. Spriger. Rätsch, G., Ooda, T., & Müller, K.-R. (001). Soft margis for AdaBoost. Machie Learig, 4, 87 30. Rätsch, G., & Warmuth, M. K. (005). Efficiet margi maximizig with boostig. J. Mach. Lear. Res., 131 15. Rudi, C., Daubechies, I., & Schapire, R. (004a). Dyamics of AdaBoost: Cyclic behavior ad covergece of margis. J. Mach. Lear. Res., 1557 1595. Rudi, C., Schapire, R., & Daubechies, I. (004b). Aalysis of boostig algoritms usig the smooth margi fuctio: A study of three algorithms. Upublished mauscript. Schapire, R., Freud, Y., Bartlett, P., & Lee, W. (1998). 