Multi-Class Deep Boosting


Vitaly Kuznetsov, Courant Institute, 251 Mercer Street, New York, NY 10012
Mehryar Mohri, Courant Institute & Google Research, 251 Mercer Street, New York, NY 10012
Umar Syed, Google Research, 76 Ninth Avenue, New York, NY 10011

Abstract

We present new ensemble learning algorithms for multi-class classification. Our algorithms can use as a base classifier set a family of deep decision trees or other rich or complex families and yet benefit from strong generalization guarantees. We give new data-dependent learning bounds for convex ensembles in the multi-class classification setting expressed in terms of the Rademacher complexities of the sub-families composing the base classifier set, and the mixture weight assigned to each sub-family. These bounds are finer than existing ones both thanks to an improved dependency on the number of classes and, more crucially, by virtue of a more favorable complexity term expressed as an average of the Rademacher complexities based on the ensemble's mixture weights. We introduce and discuss several new multi-class ensemble algorithms benefiting from these guarantees, prove positive results for the H-consistency of several of them, and report the results of experiments showing that their performance compares favorably with that of multi-class versions of AdaBoost and Logistic Regression and their L1-regularized counterparts.

1 Introduction

Devising ensembles of base predictors is a standard approach in machine learning which often helps improve performance in practice. Ensemble methods include the family of boosting meta-algorithms, among which the most notable and widely used one is AdaBoost [Freund and Schapire, 1997], also known as forward stagewise additive modeling [Friedman et al., 1998]. AdaBoost and its other variants learn convex combinations of predictors. They seek to greedily minimize a convex surrogate function upper bounding the misclassification loss by augmenting, at each iteration, the current ensemble with a new suitably weighted predictor. One key advantage of AdaBoost is that, since it is based on a stagewise procedure, it can learn an effective ensemble of base predictors chosen from a very large and potentially infinite family, provided that an efficient algorithm is available for selecting a good predictor at each stage. Furthermore, AdaBoost and its L1-regularized counterpart [Rätsch et al., 2001a] benefit from favorable learning guarantees, in particular theoretical margin bounds [Schapire et al., 1997, Koltchinskii and Panchenko, 2002]. However, those bounds depend not just on the margin and the sample size, but also on the complexity of the base hypothesis set, which suggests a risk of overfitting when using too complex base hypothesis sets. And indeed, overfitting has been reported in practice for AdaBoost in the past [Grove and Schuurmans, 1998, Schapire, 1999, Dietterich, 2000, Rätsch et al., 2001b].

Cortes, Mohri, and Syed [2014] introduced a new ensemble algorithm, DeepBoost, which they proved to benefit from finer learning guarantees, including favorable ones even when using as base classifier set relatively rich families, for example a family of very deep decision trees, or other similarly complex families. In DeepBoost, the decisions in each iteration of which classifier to add to the ensemble and which weight to assign to that classifier depend on the (data-dependent) complexity

of the sub-family to which the classifier belongs: one interpretation of DeepBoost is that it applies the principle of structural risk minimization to each iteration of boosting. Cortes, Mohri, and Syed [2014] further showed that empirically DeepBoost achieves a better performance than AdaBoost, Logistic Regression, and their L1-regularized variants. The main contribution of this paper is an extension of these theoretical, algorithmic, and empirical results to the multi-class setting.

Two distinct approaches have been considered in the past for the definition and the design of boosting algorithms in the multi-class setting. One approach consists of combining base classifiers mapping each example x to an output label y. This includes the SAMME algorithm [Zhu et al., 2009] as well as the algorithm of Mukherjee and Schapire [2013], which is shown to be, in a certain sense, optimal for this approach. An alternative approach, often more flexible and more widely used in applications, consists of combining base classifiers mapping each pair (x, y) formed by an example x and a label y to a real-valued score. This is the approach adopted in this paper, which is also the one used for the design of AdaBoost.MR [Schapire and Singer, 1999] and other variants of that algorithm.

In Section 2, we prove a novel generalization bound for multi-class classification ensembles that depends only on the Rademacher complexity of the hypothesis classes to which the classifiers in the ensemble belong. Our result generalizes the main result of Cortes et al. [2014] to the multi-class setting, and also represents an improvement on the multi-class generalization bound due to Koltchinskii and Panchenko [2002], even if we disregard our finer analysis related to Rademacher complexity. In Section 3, we present several multi-class surrogate losses that are motivated by our generalization bound, and discuss and compare their functional and consistency properties. In particular, we prove that our surrogate losses are realizable H-consistent, a hypothesis-set-specific notion of consistency that was recently introduced by Long and Servedio [2013]. Our results generalize those of Long and Servedio [2013] and admit simpler proofs. We also present a family of multi-class DeepBoost learning algorithms based on each of these surrogate losses, and prove a general convergence guarantee for them. In Section 4, we report the results of experiments demonstrating that multi-class DeepBoost outperforms AdaBoost.MR and multinomial (additive) logistic regression, as well as their L1-norm regularized variants, on several datasets.

2 Multi-class data-dependent learning guarantee for convex ensembles

In this section, we present a data-dependent learning bound in the multi-class setting for convex ensembles based on multiple base hypothesis sets. Let X denote the input space. We denote by Y = {1, ..., c} a set of c ≥ 2 classes. The label associated by a hypothesis f: X × Y → R to x ∈ X is given by argmax_{y ∈ Y} f(x, y). The margin ρ_f(x, y) of the function f for a labeled example (x, y) ∈ X × Y is defined by

$$\rho_f(x, y) = f(x, y) - \max_{y' \ne y} f(x, y'). \qquad (1)$$

Thus, f misclassifies (x, y) iff ρ_f(x, y) ≤ 0. We consider p families H_1, ..., H_p of functions mapping from X × Y to [0, 1] and the ensemble family F = conv(∪_{k=1}^p H_k), that is the family of functions f of the form f = Σ_{t=1}^T α_t h_t, where α = (α_1, ..., α_T) is in the simplex and where, for each t ∈ [1, T], h_t is in H_{k_t} for some k_t ∈ [1, p]. We assume that training and test points are drawn i.i.d.
according to some distribution D over X × Y and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m. For any ρ > 0, the generalization error R(f), its ρ-margin error R_ρ(f) and its empirical margin error are defined as follows:

$$R(f) = \mathop{\mathbb{E}}_{(x,y) \sim D}\big[1_{\rho_f(x,y) \le 0}\big], \quad R_\rho(f) = \mathop{\mathbb{E}}_{(x,y) \sim D}\big[1_{\rho_f(x,y) \le \rho}\big], \quad \widehat{R}_{S,\rho}(f) = \mathop{\mathbb{E}}_{(x,y) \sim S}\big[1_{\rho_f(x,y) \le \rho}\big], \qquad (2)$$

where the notation (x, y) ∼ S indicates that (x, y) is drawn according to the empirical distribution defined by S. For any family of hypotheses G mapping X × Y to R, we define Π_1(G) by

$$\Pi_1(G) = \{x \mapsto h(x, y) \colon y \in Y,\ h \in G\}. \qquad (3)$$
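To make these definitions concrete, here is a small illustrative sketch (ours, not part of the paper) that computes the margins of (1) and the empirical margin error of (2) from a matrix of ensemble scores; representing f by an m × c score array is an assumption of the sketch.

```python
import numpy as np

def margins(scores: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Multi-class margins rho_f(x_i, y_i) = f(x_i, y_i) - max_{y' != y_i} f(x_i, y').

    scores: (m, c) array of scores f(x_i, y) for each example and class.
    y:      (m,) array of true labels in {0, ..., c-1}.
    """
    m = scores.shape[0]
    true_scores = scores[np.arange(m), y]
    runner_up = scores.copy()
    runner_up[np.arange(m), y] = -np.inf      # exclude the true class from the max
    return true_scores - runner_up.max(axis=1)

def empirical_margin_error(scores: np.ndarray, y: np.ndarray, rho: float) -> float:
    """Fraction of sample points with margin at most rho, i.e. hat R_{S,rho}(f)."""
    return float(np.mean(margins(scores, y) <= rho))
```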

3 Theore. Assue p > and let H,..., H p be p failies of functions apping fro X Y to [0, ]. Fix ρ > 0. Then, for any δ > 0, with probability at least δ over the choice of a saple S of size drawn i.i.d. according to D, the following inequality holds for all f = T α th t F: Rf) R S,ρ f)+ 8c ρ α t R Π H kt ))+ 2 log p cρ + 4 ρ log ) c 2 ρ 2 log p 2 4 log p + log 2 δ 2, Thus, Rf) R S,ρ f) + 8c T log p [ ] ) ρ α tr H kt ) + O ρ 2 log ρ 2 c 2 4 log p. The full proof of theore 3 is given in Appendix B. Even for p =, that is for the special case of a single hypothesis set, our analysis iproves upon the ulti-class argin bound of Koltchinskii and Panchenko [2002] since our bound adits only a linear dependency on the nuber of classes c instead of a quadratic one. However, the ain rearkable benefit of this learning bound is that its coplexity ter adits an explicit dependency on the ixture coefficients α t. It is a weighted average of Radeacher coplexities with ixture weights α t, t [, T ]. Thus, the second ter of the bound suggests that, while soe hypothesis sets H k used for learning could have a large Radeacher coplexity, this ay not negatively affect generalization if the corresponding total ixture weight su of α t s corresponding to that hypothesis set) is relatively sall. Using such potentially coplex failies could help achieve a better argin on the training saple. The theore cannot be proven via the standard Radeacher coplexity analysis of Koltchinskii and Panchenko [2002] since the coplexity ter of the bound would then be R conv p k= H k)) = R p k= H k) which does not adit an explicit dependency on the ixture weights and is lower bounded by T α tr H kt ). Thus, the theore provides a finer learning bound than the one obtained via a standard Radeacher coplexity analysis. 3 Algoriths In this section, we will use the learning guarantees just described to derive several new enseble algoriths for ulti-class classification. 3. Optiization proble Let H,..., H p be p disjoint failies of functions taking values in [0, ] with increasing Radeacher coplexities R H k ), k [, p]. For any hypothesis h p k= H k, we denote by dh) the index of the hypothesis set it belongs to, that is h H dh). The bound of Theore 3 holds uniforly for all ρ > 0 and functions f conv p k= H k). Since the last ter of the bound does not depend on α, it suggests selecting α that would iniize: Gα) = ρf x i,y i) ρ + 8c ρ α t r t, where r t = R H dht)) and α. Since for any ρ > 0, f and f/ρ adit the sae generalization error, we can instead search for α 0 with T α t /ρ, which leads to in α 0 ρf x i,y i) + 8c α t r t s.t. α t ρ. 4) The first ter of the objective is not a convex function of α and its iniization is known to be coputationally hard. Thus, we will consider instead a convex upper bound. Let u Φ u) be a non-increasing convex function upper-bounding u u 0 over R. Φ ay be selected to be The condition P T αt = of Theore 3 can be relaxed to P T a null hypothesis h t = 0 for soe t). αt. To see this, use for exaple 3

Let u ↦ Φ(−u) be a non-increasing convex function upper-bounding u ↦ 1_{u ≤ 0} over R. Φ may be selected to be, for example, the exponential function as in AdaBoost [Freund and Schapire, 1997] or the logistic function. Using such an upper bound, we obtain the following convex optimization problem:

$$\min_{\alpha \ge 0} \frac{1}{m} \sum_{i=1}^{m} \Phi\big(1 - \rho_f(x_i, y_i)\big) + \lambda \sum_{t=1}^{T} \alpha_t r_t \quad \text{s.t.} \quad \sum_{t=1}^{T} \alpha_t \le \frac{1}{\rho}, \qquad (5)$$

where we introduced a parameter λ ≥ 0 controlling the balance between the magnitude of the values taken by the function Φ and the second term.² Introducing a Lagrange variable β ≥ 0 associated to the constraint in (5), the problem can be equivalently written as

$$\min_{\alpha \ge 0} \frac{1}{m} \sum_{i=1}^{m} \Phi\left(1 - \min_{y \ne y_i} \left[\sum_{t=1}^{T} \alpha_t h_t(x_i, y_i) - \sum_{t=1}^{T} \alpha_t h_t(x_i, y)\right]\right) + \sum_{t=1}^{T} (\lambda r_t + \beta) \alpha_t.$$

Here, β is a parameter that can be freely selected by the algorithm since any choice of its value is equivalent to a choice of ρ in (5). Since Φ is a non-decreasing function, the problem can be equivalently written as

$$\min_{\alpha \ge 0} \frac{1}{m} \sum_{i=1}^{m} \max_{y \ne y_i} \Phi\left(1 - \sum_{t=1}^{T} \alpha_t \big[h_t(x_i, y_i) - h_t(x_i, y)\big]\right) + \sum_{t=1}^{T} (\lambda r_t + \beta) \alpha_t.$$

Let {h_1, ..., h_N} be the set of distinct base functions, and let F_max be the objective function based on that expression:

$$F_{\max}(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \max_{y \ne y_i} \Phi\left(1 - \sum_{j=1}^{N} \alpha_j h_j(x_i, y_i, y)\right) + \sum_{j=1}^{N} \Lambda_j \alpha_j, \qquad (6)$$

with α = (α_1, ..., α_N) ∈ R^N, h_j(x_i, y_i, y) = h_j(x_i, y_i) − h_j(x_i, y), and Λ_j = λr_j + β for all j ∈ [1, N]. Then, our optimization problem can be rewritten as min_{α ≥ 0} F_max(α). This defines a convex optimization problem since the domain {α ≥ 0} is a convex set and since F_max is convex: each term of the sum in its definition is convex as a pointwise maximum of convex functions (composition of the convex function Φ with an affine function) and the second term is a linear function of α. In general, F_max is not differentiable even when Φ is, but, since it is convex, it admits a sub-differential at every point. Additionally, along each direction, F_max admits left and right derivatives (both non-decreasing) and a differential everywhere except for a set that is at most countable.

3.2 Alternative objective functions

We now consider the following three natural upper bounds on F_max which admit useful properties that we will discuss later, the third one valid when Φ can be written as the composition of two functions Φ_1 and Φ_2, with Φ_1 a non-decreasing function:

$$F_{\mathrm{sum}}(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \sum_{y \ne y_i} \Phi\left(1 - \sum_{j=1}^{N} \alpha_j h_j(x_i, y_i, y)\right) + \sum_{j=1}^{N} \Lambda_j \alpha_j \qquad (7)$$

$$F_{\mathrm{maxsum}}(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \Phi\left(1 - \sum_{j=1}^{N} \alpha_j \rho_{h_j}(x_i, y_i)\right) + \sum_{j=1}^{N} \Lambda_j \alpha_j \qquad (8)$$

$$F_{\mathrm{compsum}}(\alpha) = \frac{1}{m} \sum_{i=1}^{m} \Phi_1\left(\sum_{y \ne y_i} \Phi_2\left(1 - \sum_{j=1}^{N} \alpha_j h_j(x_i, y_i, y)\right)\right) + \sum_{j=1}^{N} \Lambda_j \alpha_j. \qquad (9)$$

F_sum is obtained from F_max simply by replacing the max operator in the definition of F_max by a sum. Clearly, function F_sum is convex and inherits the differentiability properties of Φ. A drawback of F_sum is that for problems with very large c, as in structured prediction, the computation of the sum may require resorting to approximations.

²Note that this is a standard practice in the field of optimization. The optimization problem in (4) is equivalent to a vector optimization problem, where $\big(\frac{1}{m}\sum_{i=1}^m 1_{\rho_f(x_i, y_i) \le 1}, \sum_{t=1}^T \alpha_t r_t\big)$ is minimized over α. The latter problem can be scalarized, leading to the introduction of the parameter λ in (5).
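As a concrete reading of (6) and (7), the following sketch (ours; it assumes the ensemble scores f(x_i, y) = Σ_j α_j h_j(x_i, y) have been accumulated into an m × c array) evaluates F_max and F_sum and makes the max-versus-sum distinction explicit:

```python
import numpy as np

def f_max(scores, y, alpha, Lam, Phi=np.exp):
    """F_max (Eq. 6): average over examples of the max over incorrect labels
    of Phi(1 - score gap), plus the penalty sum_j Lambda_j alpha_j."""
    m = scores.shape[0]
    gaps = scores[np.arange(m), y][:, None] - scores   # f(x_i,y_i) - f(x_i,y)
    loss = Phi(1.0 - gaps)
    loss[np.arange(m), y] = -np.inf                    # exclude y = y_i from the max
    return loss.max(axis=1).mean() + float(np.dot(Lam, alpha))

def f_sum(scores, y, alpha, Lam, Phi=np.exp):
    """F_sum (Eq. 7): identical except that the max is replaced by a sum."""
    m = scores.shape[0]
    gaps = scores[np.arange(m), y][:, None] - scores
    loss = Phi(1.0 - gaps)
    loss[np.arange(m), y] = 0.0                        # exclude y = y_i from the sum
    return loss.sum(axis=1).mean() + float(np.dot(Lam, alpha))
```

Since every summand is non-negative, f_sum never falls below f_max on the same inputs, which is the sense in which (7) upper bounds (6).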

F_maxsum is obtained from F_max by noticing that, by the sub-additivity of the max operator, the following inequality holds:

$$\max_{y \ne y_i} \left(-\sum_{j=1}^{N} \alpha_j h_j(x_i, y_i, y)\right) \le \sum_{j=1}^{N} \alpha_j \max_{y \ne y_i} \big(-h_j(x_i, y_i, y)\big) = -\sum_{j=1}^{N} \alpha_j \rho_{h_j}(x_i, y_i).$$

As with F_sum, function F_maxsum is convex and admits the same differentiability properties as Φ. Unlike F_sum, F_maxsum does not require computing a sum over the classes. Furthermore, note that the expressions ρ_{h_j}(x_i, y_i), i ∈ [1, m], can be pre-computed prior to the application of any optimization algorithm. Finally, for Φ = Φ_1 ∘ Φ_2 with Φ_1 non-decreasing, the max operator can be replaced by a sum before applying Φ_1, as follows:

$$\max_{y \ne y_i} \Phi\big(1 - f(x_i, y_i, y)\big) = \Phi_1\left(\max_{y \ne y_i} \Phi_2\big(1 - f(x_i, y_i, y)\big)\right) \le \Phi_1\left(\sum_{y \ne y_i} \Phi_2\big(1 - f(x_i, y_i, y)\big)\right),$$

where $f(x_i, y_i, y) = \sum_{j=1}^{N} \alpha_j h_j(x_i, y_i, y)$. This leads to the definition of F_compsum.

In Appendix C, we discuss the consistency properties of the loss functions just introduced. In particular, we prove that the loss functions associated to F_max and F_sum are realizable H-consistent (see Long and Servedio [2013]) in the common cases where the exponential or logistic losses are used and that, similarly, in the common case where Φ_1(u) = log(1 + u) and Φ_2(u) = exp(u + 1), the loss function associated to F_compsum is H-consistent. Furthermore, in Appendix D, we show that, under some mild assumptions, the objective functions we just discussed are essentially within a constant factor of each other. Moreover, in the case of binary classification all of these objectives coincide.

3.3 Multi-class DeepBoost algorithms

In this section, we discuss in detail a family of multi-class DeepBoost algorithms, which are derived by application of coordinate descent to the objective functions discussed in the previous paragraphs. We will assume that Φ is differentiable over R and that Φ′(u) ≠ 0 for all u. This condition is not necessary; in particular, our presentation can be extended to non-differentiable functions such as the hinge loss, but it simplifies the presentation. In the case of the objective function F_compsum, we will assume that both Φ_1 and Φ_2, where Φ = Φ_1 ∘ Φ_2, are differentiable. Under these assumptions, F_sum, F_maxsum, and F_compsum are differentiable. F_max is not differentiable due to the presence of the max operators in its definition, but it admits a sub-differential at every point.

For convenience, let α_t = (α_{t,1}, ..., α_{t,N}) denote the vector obtained after t iterations and let α_0 = 0. Let e_k denote the kth unit vector in R^N, k ∈ [1, N]. For a differentiable objective F, we denote by F′(α, e_j) the directional derivative of F along the direction e_j at α. Our coordinate descent algorithm consists of first determining the direction of maximal descent, that is k = argmax_{j ∈ [1, N]} |F′(α_{t−1}, e_j)|, next of determining the best step η along that direction that preserves the non-negativity of α, η = argmin_{α_{t−1} + ηe_k ≥ 0} F(α_{t−1} + ηe_k), and updating α_{t−1} to α_t = α_{t−1} + ηe_k. We will refer to this method as projected coordinate descent. The following theorem provides a convergence guarantee for our algorithms in that case.

Theorem 2. Assume that Φ is twice differentiable and that Φ″(u) > 0 for all u ∈ R. Then, the projected coordinate descent algorithm applied to F converges to the solution α* of the optimization min_{α ≥ 0} F(α) for F = F_sum, F = F_maxsum, or F = F_compsum. If additionally Φ is strongly convex over the path of the iterates α_t, then there exists τ > 0 and γ > 0 such that for all t > τ,

$$F(\alpha_{t+1}) - F(\alpha^*) \le \left(1 - \frac{1}{\gamma}\right)\big(F(\alpha_t) - F(\alpha^*)\big). \qquad (10)$$

The proof is given in Appendix I and is based on the results of Luo and Tseng [1992].
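For concreteness, here is a minimal, generic rendering of the projected coordinate descent scheme analyzed in Theorem 2 (our sketch: forward differences and a bounded ternary line search stand in for the exact directional derivatives and closed-form steps derived in the appendices, and the step cap is arbitrary):

```python
import numpy as np

def line_search(phi, lo, hi, iters=60):
    """Ternary search for the minimizer of a convex 1-D function on [lo, hi]."""
    for _ in range(iters):
        a, b = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if phi(a) < phi(b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

def projected_coordinate_descent(F, N, rounds=100, step_cap=10.0, eps=1e-8):
    """Generic projected coordinate descent over {alpha >= 0} (Section 3.3).

    Each round picks the coordinate whose directional derivative has the
    largest magnitude (approximated here by forward differences, a
    simplification of the selection rule in the text) and takes the best
    step along it that keeps alpha non-negative.
    """
    alpha = np.zeros(N)
    I = np.eye(N)
    for _ in range(rounds):
        F0 = F(alpha)
        grads = np.array([(F(alpha + eps * I[j]) - F0) / eps for j in range(N)])
        k = int(np.argmax(np.abs(grads)))
        # Step constrained to eta >= -alpha_k so that alpha stays non-negative.
        eta = line_search(lambda e: F(alpha + e * I[k]), -alpha[k], step_cap)
        alpha[k] += eta
    return alpha
```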
The theorem can in fact be extended to the case where, instead of the best direction, the derivative for the direction selected at each round is within a constant threshold of the best [Luo and Tseng, 1992]. The conditions of Theorem 2 hold for many cases in practice, in particular in the case of the exponential loss (Φ = exp) or the logistic loss (Φ(x) = log₂(1 + e^x)). In particular, linear convergence is guaranteed in those cases since both the exponential and logistic losses are strongly convex over a compact set containing the converging sequence of α_t's.
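In the convention Φ(1 − margin) used throughout this section, these two admissible losses and the derivatives Φ′ needed for the distributions D_t below can be written as follows (our sketch; both satisfy Φ″ > 0):

```python
import numpy as np

# Exponential loss: Phi(u) = e^u, with Phi'(u) = Phi''(u) = e^u > 0.
exp_loss = {"Phi": np.exp, "dPhi": np.exp}

# Logistic loss: Phi(u) = log2(1 + e^u), Phi'(u) = e^u / ((1 + e^u) ln 2),
# and Phi''(u) > 0 as well.
logistic_loss = {
    "Phi":  lambda u: np.log2(1.0 + np.exp(u)),
    "dPhi": lambda u: np.exp(u) / ((1.0 + np.exp(u)) * np.log(2.0)),
}
```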

MDEEPBOOSTSUM(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      for y ∈ Y − {y_i} do
 3          D_1(i, y) ← 1/(m(c − 1))
 4  for t ← 1 to T do
 5      k ← argmin_{j ∈ [1, N]} ε_{t,j} + Λ_j m / (2 S_t)
 6      if (1 − ε_{t,k}) e^{α_{t−1,k}} − ε_{t,k} e^{−α_{t−1,k}} < Λ_k m / S_t then
 7          η_t ← −α_{t−1,k}
 8      else η_t ← log[ −Λ_k m / (2 ε_{t,k} S_t) + sqrt( (Λ_k m / (2 ε_{t,k} S_t))² + (1 − ε_{t,k}) / ε_{t,k} ) ]
 9      α_t ← α_{t−1} + η_t e_k
10      S_{t+1} ← Σ_{i=1}^m Σ_{y ≠ y_i} Φ′(1 − Σ_{j=1}^N α_{t,j} h_j(x_i, y_i, y))
11      for i ← 1 to m do
12          for y ∈ Y − {y_i} do
13              D_{t+1}(i, y) ← Φ′(1 − Σ_{j=1}^N α_{t,j} h_j(x_i, y_i, y)) / S_{t+1}
14  f ← Σ_{j=1}^N α_{T,j} h_j
15  return f

Figure 1: Pseudocode of the MDeepBoostSum algorithm for both the exponential loss and the logistic loss. The expression of the weighted error ε_{t,j} is given in (12).

We will refer to the algorithm defined by projected coordinate descent applied to F_sum by MDeepBoostSum, to F_maxsum by MDeepBoostMaxSum, to F_compsum by MDeepBoostCompSum, and to F_max by MDeepBoostMax. In the following, we briefly describe MDeepBoostSum, including its pseudocode. We give a detailed description of all of these algorithms in the supplementary material: MDeepBoostSum (Appendix E), MDeepBoostMaxSum (Appendix F), MDeepBoostCompSum (Appendix G), MDeepBoostMax (Appendix H).

Define f_t = Σ_{j=1}^N α_{t,j} h_j. Then, F_sum(α_t) can be rewritten as follows:

$$F_{\mathrm{sum}}(\alpha_t) = \frac{1}{m} \sum_{i=1}^{m} \sum_{y \ne y_i} \Phi\big(1 - f_t(x_i, y_i, y)\big) + \sum_{j=1}^{N} \Lambda_j \alpha_{t,j}.$$

For any t ∈ [1, T], we denote by D_t the distribution over [1, m] × [1, c] defined for all i ∈ [1, m] and y ∈ Y − {y_i} by

$$D_t(i, y) = \frac{\Phi'\big(1 - f_{t-1}(x_i, y_i, y)\big)}{S_t}, \qquad (11)$$

where S_t is a normalization factor, $S_t = \sum_{i=1}^{m} \sum_{y \ne y_i} \Phi'\big(1 - f_{t-1}(x_i, y_i, y)\big)$. For any j ∈ [1, N] and s ∈ [1, T], we also define the weighted error ε_{s,j} as follows:

$$\epsilon_{s,j} = \frac{1}{2}\left[1 - \mathop{\mathbb{E}}_{(i,y) \sim D_s}\big[h_j(x_i, y_i, y)\big]\right]. \qquad (12)$$

Figure 1 gives the pseudocode of the MDeepBoostSum algorithm. The details of the derivation of the expressions are given in Appendix E. In the special cases of the exponential loss (Φ(−u) = exp(−u)) or the logistic loss (Φ(−u) = log₂(1 + exp(−u))), a closed-form expression is given for the step size (lines 6-8), which is the same in both cases (see Sections E.2.1 and E.2.2). In the generic case, the step size can be found using a line search or other numerical methods.

The algorithms presented above have several connections with other boosting algorithms, particularly in the absence of regularization. We discuss these connections in detail in Appendix K.
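The sketch below (ours, not the authors' implementation; storing the base-classifier scores as an (N, m, c) array is an assumption) traces one round of Figure 1 for the exponential loss, from the distribution D_t of (11) and the weighted errors of (12) to the closed-form step of lines 6-8:

```python
import numpy as np

def mdeepboost_sum_round(H, y, alpha, Lam):
    """One round of MDeepBoostSum (Figure 1) for the exponential loss Phi(u) = e^u.

    H:     (N, m, c) scores h_j(x_i, y) of the N base classifiers.
    y:     (m,) true labels; alpha: (N,) current weights; Lam: (N,) penalties.
    """
    N, m, c = H.shape
    # f_{t-1}(x_i, y_i, y) = sum_j alpha_j [h_j(x_i, y_i) - h_j(x_i, y)].
    scores = np.tensordot(alpha, H, axes=1)             # (m, c)
    gaps = scores[np.arange(m), y][:, None] - scores    # (m, c)
    w = np.exp(1.0 - gaps)                              # Phi'(1 - f_{t-1})
    w[np.arange(m), y] = 0.0                            # only y != y_i contribute
    S = w.sum()
    D = w / S                                           # distribution D_t, Eq. (11)
    # Weighted errors eps_{t,j}, Eq. (12).
    hdiff = H[:, np.arange(m), y][:, :, None] - H       # h_j(x_i, y_i, y), (N, m, c)
    eps = 0.5 * (1.0 - (D[None] * hdiff).sum(axis=(1, 2)))
    # Descent direction (line 5) and step size (lines 6-8).
    k = int(np.argmin(eps + Lam * m / (2.0 * S)))
    if (1 - eps[k]) * np.exp(alpha[k]) - eps[k] * np.exp(-alpha[k]) < Lam[k] * m / S:
        eta = -alpha[k]                                 # zero out coordinate k
    else:
        r = Lam[k] * m / (2.0 * eps[k] * S)
        eta = np.log(-r + np.sqrt(r ** 2 + (1 - eps[k]) / eps[k]))
    alpha = alpha.copy()
    alpha[k] += eta
    return alpha
```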

4 Experiments

The algorithms presented in the previous sections can be used with a variety of different base classifier sets. For our experiments, we used multi-class binary decision trees. A multi-class binary decision tree in dimension d can be defined by a pair (t, h), where t is a binary tree with a variable-threshold question at each internal node, e.g., X_j ≤ θ, j ∈ [1, d], and h = (h_l)_{l ∈ Leaves(t)} a vector of distributions over the leaves Leaves(t) of t. At any leaf l ∈ Leaves(t), h_l(y) ∈ [0, 1] for all y ∈ Y and Σ_{y ∈ Y} h_l(y) = 1. For convenience, we will denote by t(x) the leaf l ∈ Leaves(t) associated to x by t. Thus, the score associated by (t, h) to a pair (x, y) ∈ X × Y is h_l(y) where l = t(x).

Let T_n denote the family of all multi-class decision trees with n internal nodes in dimension d. In Appendix J, we derive the following upper bound on the Rademacher complexity of T_n:

$$\mathcal{R}_m\big(\Pi_1(T_n)\big) \le \sqrt{\frac{(4n + 2) \log_2(d + 2) \log(m + 1)}{m}}. \qquad (13)$$

All of the experiments in this section use T_n as the family of base hypothesis sets (parametrized by n). Since T_n is a very large hypothesis set when n is large, for the sake of computational efficiency we make a few approximations. First, although our MDeepBoost algorithms were derived in terms of Rademacher complexity, we use the upper bound in Eq. (13) in place of the Rademacher complexity (thus, in Algorithm 1 we let Λ_n = λB_n + β, where B_n is the bound given in Eq. (13)). Secondly, instead of exhaustively searching for the best decision tree in T_n for each possible size n, we use the following greedy procedure: given the best decision tree of size n (starting with n = 1), we find the best decision tree of size n + 1 that can be obtained by splitting one leaf, and continue this procedure until some maximum depth K. Decision trees are commonly learned in this manner, and so in this context our Rademacher-complexity-based bounds can be viewed as a novel stopping criterion for decision tree learning. Let H_K be the set of trees found by the greedy algorithm just described. In each iteration t of MDeepBoost, we select the best tree in the set H_K ∪ {h_1, ..., h_{t−1}}, where h_1, ..., h_{t−1} are the trees selected in previous iterations.

While we described many objective functions that can be used as the basis of a multi-class deep boosting algorithm, the experiments in this section focus on algorithms derived from F_sum. We also refer the reader to Table 3 in Appendix A for results of experiments with F_compsum objective functions. The F_sum and F_compsum objectives combine several advantages that suggest they will perform well empirically. F_sum is consistent and both F_sum and F_compsum are (by Theorem 4) H-consistent. Also, unlike F_max, both of these objectives are differentiable, and therefore the convergence guarantee in Theorem 2 applies. Our preliminary findings also indicate that algorithms based on the F_sum and F_compsum objectives perform better than those derived from F_max and F_maxsum.

All of our objective functions require a choice for Φ, the loss function. Since Cortes et al. [2014] reported comparable results for the exponential and logistic losses for the binary version of DeepBoost, we let Φ be the exponential loss in all of our experiments with MDeepBoostSum. For MDeepBoostCompSum we select Φ_1(u) = log₂(1 + u) and Φ_2(u) = exp(u).

In our experiments, we used 8 UCI data sets: abalone, handwritten, letters, pageblocks, pendigits, satimage, statlog and yeast (see more details on these datasets in Table 4, Appendix L).
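In code, the penalty used in place of the Rademacher complexity reads as follows (our sketch; B_n is the bound of Eq. (13) and Λ_n = λB_n + β as in the text):

```python
import numpy as np

def tree_complexity_bound(n: int, d: int, m: int) -> float:
    """B_n: upper bound (Eq. 13) on R_m(Pi_1(T_n)) for multi-class binary
    decision trees with n internal nodes in dimension d, sample size m."""
    return float(np.sqrt((4 * n + 2) * np.log2(d + 2) * np.log(m + 1) / m))

def tree_penalty(n: int, d: int, m: int, lam: float, beta: float) -> float:
    """Lambda_n = lam * B_n + beta, the per-tree penalty used in the experiments."""
    return lam * tree_complexity_bound(n, d, m) + beta
```

Because B_n grows with the number of internal nodes n, a deeper tree must buy a proportionally larger margin improvement to be selected, which is the stopping-criterion reading given above.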
In Appendix K, we explain that when λ = β = 0, MDeepBoostSum is equivalent to AdaBoost.MR. Also, if we set λ = 0 and β ≠ 0 then the resulting algorithm is an L1-norm regularized variant of AdaBoost.MR. We compared MDeepBoostSum to these two algorithms, with the results also reported in Table 1 and Table 2 in Appendix A. Likewise, we compared MDeepBoostCompSum with multinomial (additive) logistic regression, LogReg, and its L1-regularized version, LogReg-L1, which, as discussed in Appendix K, are equivalent to MDeepBoostCompSum when λ = β = 0 and λ = 0, β ≠ 0, respectively. Finally, we remark that it can be argued that the parameter optimization procedure (described below) significantly extends AdaBoost.MR since it effectively implements structural risk minimization: for each tree depth, the empirical error is minimized and we choose the depth to achieve the best generalization error.

All of these algorithms use the maximum tree depth K as a parameter. The L1-norm regularized versions admit two parameters: K and β ≥ 0. Deep boosting algorithms have a third parameter, λ ≥ 0. To set these parameters, we used the following parameter optimization procedure: we randomly partitioned each dataset into 4 folds and, for each tuple (λ, β, K) in the set of possible parameters (described below), we ran MDeepBoostSum with a different assignment of folds to the training set, validation set and test set for each run.

Table 1: Empirical results for MDeepBoostSum, Φ = exp. AB stands for AdaBoost. For each of the eight datasets (abalone, handwritten, letters, pageblocks, pendigits, satimage, statlog, yeast), the table reports the average test error and its standard deviation for AB.MR, AB.MR-L1, and MDeepBoost. [Numeric entries illegible in this transcription.]

Specifically, for each run i ∈ {0, 1, 2, 3}, fold i was used for testing, fold i + 1 (mod 4) was used for validation, and the remaining folds were used for training. For each run, we selected the parameters that had the lowest error on the validation set and then measured the error of those parameters on the test set. The average test error and the standard deviation of the test error over all 4 runs is reported in Table 1. Note that an alternative procedure to compare algorithms that is adopted in a number of previous studies of boosting [Li, 2009a,b, Sun et al., 2012] is to simply record the average test error of the best parameter tuples over all runs. While it is of course possible to overestimate the performance of a learning algorithm by optimizing hyperparameters on the test set, this concern is less valid when the size of the test set is large relative to the complexity of the hyperparameter space. We report results for this alternative procedure in Table 2 and Table 3, Appendix A.

For each dataset, the set of possible values for λ and β was initialized to {10^{-5}, 10^{-6}, ..., 10^{-10}}, and to {1, 2, 3, 4, 5} for the maximum tree depth K. However, if we found an optimal parameter value to be at the end point of these ranges, we extended the interval in that direction (by an order of magnitude for λ and β, and by 1 for the maximum tree depth K) and re-ran the experiments. We have also experimented with 200 and 500 iterations but we have observed that the errors do not change significantly and the ranking of the algorithms remains the same.

The results of our experiments show that, for each dataset, deep boosting algorithms outperform the other algorithms evaluated in our experiments. Let us point out that, even though not all of our results are statistically significant, MDeepBoostSum outperforms AdaBoost.MR and AdaBoost.MR-L1 (and, hence, effectively structural risk minimization) on each dataset. More importantly, for each dataset MDeepBoostSum outperforms the other algorithms on most of the individual runs. Moreover, results for some datasets presented here (namely pendigits) appear to be state-of-the-art. We also refer our reader to the experimental results summarized in Table 2 and Table 3 in Appendix A. These results provide further evidence in favor of DeepBoost algorithms. The consistent performance improvement by MDeepBoostSum over AdaBoost.MR or its L1-norm regularized variant shows the benefit of the new complexity-based regularization we introduced.

5 Conclusion

We presented new data-dependent learning guarantees for convex ensembles in the multi-class setting where the base classifier set is composed of increasingly complex sub-families, including very deep or complex ones. These learning bounds generalize to the multi-class setting the guarantees presented by Cortes et al. [2014] in the binary case.
We also introduced and discussed several new multi-class ensemble algorithms benefiting from these guarantees and proved positive results for the H-consistency and convergence of several of them. Finally, we reported the results of several experiments with DeepBoost algorithms, and compared their performance with that of AdaBoost.MR and additive multinomial Logistic Regression and their L1-regularized variants.

Acknowledgments

We thank Andres Muñoz Medina and Scott Yang for discussions and help with the experiments. This work was partly funded by the NSF award IIS-1117591 and supported by a NSERC PGS grant.

References

P. Bühlmann and B. Yu. Boosting with the L2 loss. J. of the Amer. Stat. Assoc., 98(462):324-339, 2003.
M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48:253-285, September 2002.
C. Cortes, M. Mohri, and U. Syed. Deep boosting. In ICML, pages 1179-1187, 2014.
T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.
J. C. Duchi and Y. Singer. Boosting with structural sparsity. In ICML, page 38, 2009.
N. Duffy and D. P. Helmbold. Potential boosters? In NIPS, 1999.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.
J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.
A. J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692-699, 1998.
J. Kivinen and M. K. Warmuth. Boosting as entropy projection. In COLT, pages 134-144, 1999.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
P. Li. ABC-boost: adaptive base class boost for multi-class classification. In ICML, page 79, 2009a.
P. Li. ABC-logitboost for multi-class classification. Technical report, Rutgers University, 2009b.
P. M. Long and R. A. Servedio. Consistency versus realizable H-consistency for multiclass classification. In ICML (3), pages 801-809, 2013.
Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.
L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In NIPS, 1999.
M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
I. Mukherjee and R. E. Schapire. A theory of multiclass boosting. JMLR, 14(1):437-497, 2013.
G. Rätsch and M. K. Warmuth. Maximizing the margin with boosting. In COLT, 2002.
G. Rätsch and M. K. Warmuth. Efficient margin maximizing with boosting. JMLR, 6:2131-2152, 2005.
G. Rätsch, S. Mika, and M. K. Warmuth. On the convergence of leveraging. In NIPS, 2001a.
G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001b.
R. E. Schapire. Theoretical views of boosting and applications. In Proceedings of ALT 1999, volume 1720 of Lecture Notes in Computer Science, pages 13-25. Springer, 1999.
R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. The MIT Press, 2012.
R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322-330, 1997.
P. Sun, M. D. Reid, and J. Zhou. AOSO-LogitBoost: Adaptive one-vs-one LogitBoost for multi-class problem. In ICML, 2012.
A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007-1025, 2007.
M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML, 2006.
T. Zhang. Statistical analysis of some multi-category large margin classification methods. JMLR, 5:1225-1251, 2004a.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004b.
J. Zhu, H. Zou, S. Rosset, and T. Hastie. Multi-class AdaBoost. Statistics and Its Interface, 2:349-360, 2009.
H. Zou, J. Zhu, and T. Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. Annals of Applied Statistics, 2(4):1290-1306, 2008.


Partitioned Elias-Fano Indexes

Partitioned Elias-Fano Indexes Partitioned Elias-ano Indexes Giuseppe Ottaviano ISTI-CNR, Pisa giuseppe.ottaviano@isti.cnr.it Rossano Venturini Dept. of Coputer Science, University of Pisa rossano@di.unipi.it ABSTRACT The Elias-ano

More information

Source. The Boosting Approach. Example: Spam Filtering. The Boosting Approach to Machine Learning

Source. The Boosting Approach. Example: Spam Filtering. The Boosting Approach to Machine Learning Source The Boosting Approach to Machine Learning Notes adapted from Rob Schapire www.cs.princeton.edu/~schapire CS 536: Machine Learning Littman (Wu, TA) Example: Spam Filtering problem: filter out spam

More information

The Virtual Spring Mass System

The Virtual Spring Mass System The Virtual Spring Mass Syste J. S. Freudenberg EECS 6 Ebedded Control Systes Huan Coputer Interaction A force feedbac syste, such as the haptic heel used in the EECS 6 lab, is capable of exhibiting a

More information

Modified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index

Modified Latin Hypercube Sampling Monte Carlo (MLHSMC) Estimation for Average Quality Index Analog Integrated Circuits and Signal Processing, vol. 9, no., April 999. Abstract Modified Latin Hypercube Sapling Monte Carlo (MLHSMC) Estiation for Average Quality Index Mansour Keraat and Richard Kielbasa

More information

AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz

AdaBoost. Jiri Matas and Jan Šochman. Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz AdaBoost Jiri Matas and Jan Šochman Centre for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Presentation Outline: AdaBoost algorithm Why is of interest? How it works? Why

More information

Equivalent Tapped Delay Line Channel Responses with Reduced Taps

Equivalent Tapped Delay Line Channel Responses with Reduced Taps Equivalent Tapped Delay Line Channel Responses with Reduced Taps Shweta Sagari, Wade Trappe, Larry Greenstein {shsagari, trappe, ljg}@winlab.rutgers.edu WINLAB, Rutgers University, North Brunswick, NJ

More information

Energy Proportionality for Disk Storage Using Replication

Energy Proportionality for Disk Storage Using Replication Energy Proportionality for Disk Storage Using Replication Jinoh Ki and Doron Rote Lawrence Berkeley National Laboratory University of California, Berkeley, CA 94720 {jinohki,d rote}@lbl.gov Abstract Energy

More information

ASIC Design Project Management Supported by Multi Agent Simulation

ASIC Design Project Management Supported by Multi Agent Simulation ASIC Design Project Manageent Supported by Multi Agent Siulation Jana Blaschke, Christian Sebeke, Wolfgang Rosenstiel Abstract The coplexity of Application Specific Integrated Circuits (ASICs) is continuously

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

Construction Economics & Finance. Module 3 Lecture-1

Construction Economics & Finance. Module 3 Lecture-1 Depreciation:- Construction Econoics & Finance Module 3 Lecture- It represents the reduction in arket value of an asset due to age, wear and tear and obsolescence. The physical deterioration of the asset

More information

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass. Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

SAMPLING METHODS LEARNING OBJECTIVES

SAMPLING METHODS LEARNING OBJECTIVES 6 SAMPLING METHODS 6 Using Statistics 6-6 2 Nonprobability Sapling and Bias 6-6 Stratified Rando Sapling 6-2 6 4 Cluster Sapling 6-4 6 5 Systeatic Sapling 6-9 6 6 Nonresponse 6-2 6 7 Suary and Review of

More information

Performance Evaluation of Machine Learning Techniques using Software Cost Drivers

Performance Evaluation of Machine Learning Techniques using Software Cost Drivers Perforance Evaluation of Machine Learning Techniques using Software Cost Drivers Manas Gaur Departent of Coputer Engineering, Delhi Technological University Delhi, India ABSTRACT There is a treendous rise

More information

6. Time (or Space) Series Analysis

6. Time (or Space) Series Analysis ATM 55 otes: Tie Series Analysis - Section 6a Page 8 6. Tie (or Space) Series Analysis In this chapter we will consider soe coon aspects of tie series analysis including autocorrelation, statistical prediction,

More information

Managing Complex Network Operation with Predictive Analytics

Managing Complex Network Operation with Predictive Analytics Managing Coplex Network Operation with Predictive Analytics Zhenyu Huang, Pak Chung Wong, Patrick Mackey, Yousu Chen, Jian Ma, Kevin Schneider, and Frank L. Greitzer Pacific Northwest National Laboratory

More information

Pricing Asian Options using Monte Carlo Methods

Pricing Asian Options using Monte Carlo Methods U.U.D.M. Project Report 9:7 Pricing Asian Options using Monte Carlo Methods Hongbin Zhang Exaensarbete i ateatik, 3 hp Handledare och exainator: Johan Tysk Juni 9 Departent of Matheatics Uppsala University

More information

A CHAOS MODEL OF SUBHARMONIC OSCILLATIONS IN CURRENT MODE PWM BOOST CONVERTERS

A CHAOS MODEL OF SUBHARMONIC OSCILLATIONS IN CURRENT MODE PWM BOOST CONVERTERS A CHAOS MODEL OF SUBHARMONIC OSCILLATIONS IN CURRENT MODE PWM BOOST CONVERTERS Isaac Zafrany and Sa BenYaakov Departent of Electrical and Coputer Engineering BenGurion University of the Negev P. O. Box

More information

Lecture L26-3D Rigid Body Dynamics: The Inertia Tensor

Lecture L26-3D Rigid Body Dynamics: The Inertia Tensor J. Peraire, S. Widnall 16.07 Dynaics Fall 008 Lecture L6-3D Rigid Body Dynaics: The Inertia Tensor Version.1 In this lecture, we will derive an expression for the angular oentu of a 3D rigid body. We shall

More information

CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS

CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS 641 CRM FACTORS ASSESSMENT USING ANALYTIC HIERARCHY PROCESS Marketa Zajarosova 1* *Ph.D. VSB - Technical University of Ostrava, THE CZECH REPUBLIC arketa.zajarosova@vsb.cz Abstract Custoer relationship

More information

Near-Optimal Power Control in Wireless Networks: A Potential Game Approach

Near-Optimal Power Control in Wireless Networks: A Potential Game Approach Near-Optial Power Control in Wireless Networks: A Potential Gae Approach Utku Ozan Candogan, Ishai Menache, Asuan Ozdaglar and Pablo A. Parrilo Laboratory for Inforation and Decision Systes Massachusetts

More information

Endogenous Market Structure and the Cooperative Firm

Endogenous Market Structure and the Cooperative Firm Endogenous Market Structure and the Cooperative Fir Brent Hueth and GianCarlo Moschini Working Paper 14-WP 547 May 2014 Center for Agricultural and Rural Developent Iowa State University Aes, Iowa 50011-1070

More information

Markovian inventory policy with application to the paper industry

Markovian inventory policy with application to the paper industry Coputers and Cheical Engineering 26 (2002) 1399 1413 www.elsevier.co/locate/copcheeng Markovian inventory policy with application to the paper industry K. Karen Yin a, *, Hu Liu a,1, Neil E. Johnson b,2

More information

Position Auctions and Non-uniform Conversion Rates

Position Auctions and Non-uniform Conversion Rates Position Auctions and Non-unifor Conversion Rates Liad Blurosen Microsoft Research Mountain View, CA 944 liadbl@icrosoft.co Jason D. Hartline Shuzhen Nong Electrical Engineering and Microsoft AdCenter

More information

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects

Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Evaluating the Effectiveness of Task Overlapping as a Risk Response Strategy in Engineering Projects Lucas Grèze Robert Pellerin Nathalie Perrier Patrice Leclaire February 2011 CIRRELT-2011-11 Bureaux

More information

5.7 Chebyshev Multi-section Matching Transformer

5.7 Chebyshev Multi-section Matching Transformer /9/ 5_7 Chebyshev Multisection Matching Transforers / 5.7 Chebyshev Multi-section Matching Transforer Reading Assignent: pp. 5-55 We can also build a ultisection atching network such that Γ f is a Chebyshev

More information