Cutting-Plane Training of Structural SVMs

Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu

Abstract Discriminative training approaches like structural SVMs have shown much promise for building highly complex and accurate models in areas like natural language processing, protein structure prediction, and information retrieval. However, current training algorithms are computationally expensive or intractable on large datasets. To overcome this bottleneck, this paper explores how cutting-plane methods can provide fast training not only for classification SVMs, but also for structural SVMs. We show that for an equivalent 1-slack reformulation of the linear SVM training problem, our cutting-plane method has time complexity linear in the number of training examples. In particular, the number of iterations does not depend on the number of training examples, and it is linear in the desired precision and the regularization parameter. Furthermore, we present an extensive empirical evaluation of the method applied to binary classification, multi-class classification, HMM sequence tagging, and CFG parsing. The experiments show that the cutting-plane algorithm is broadly applicable and fast in practice. On large datasets, it is typically several orders of magnitude faster than conventional training methods derived from decomposition methods like SVM-light, or conventional cutting-plane methods. Implementations of our methods are available at www.joachims.org.

Key words: Structural SVMs, Support Vector Machines, Structured Output Prediction, Training Algorithms

Thorsten Joachims
Dept. of Computer Science, Cornell University, Ithaca, NY, USA, e-mail: tj@cs.cornell.edu

Thomas Finley
Dept. of Computer Science, Cornell University, Ithaca, NY, USA, e-mail: tomf@cs.cornell.edu

Chun-Nam John Yu
Dept. of Computer Science, Cornell University, Ithaca, NY, USA, e-mail: cnyu@cs.cornell.edu
1 Introduction

Consider the problem of learning a function with complex outputs, where the prediction is not a single univariate response (e.g., 0/1 for classification or a real number for regression), but a complex multivariate object. For example, the desired prediction is a tree in natural language parsing, a total ordering in web search, or an alignment between two amino acid sequences in protein threading. Further instances of such structured prediction problems are ubiquitous in natural language processing, bioinformatics, computer vision, and many other application domains. Recent years have provided intriguing advances in extending methods like Logistic Regression, Perceptrons, and Support Vector Machines (SVMs) to global training of such structured prediction models (e.g., Lafferty et al, 2001; Collins, 2004; Collins and Duffy, 2002; Taskar et al, 2003; Tsochantaridis et al, 2004). In contrast to conventional generative training, these methods are discriminative (e.g., conditional likelihood, empirical risk minimization). Akin to moving from Naive Bayes to an SVM for classification, this provides greater modeling flexibility through the avoidance of independence assumptions, and it was shown to provide substantially improved prediction accuracy in many domains (e.g., Lafferty et al, 2001; Taskar et al, 2003; Tsochantaridis et al, 2004; Taskar et al, 2004; Yu et al, 2007). By eliminating the need to model statistical dependencies between features, discriminative training enables us to freely use more complex and possibly interdependent features, which provides the potential to learn models with improved fidelity. However, training these rich models with a sufficiently large training set is often beyond the reach of current discriminative training algorithms.

We focus on the problem of training structural SVMs in this paper. Formally, this can be thought of as solving a convex quadratic program (QP) with a large (typically exponential or infinite) number of constraints. Existing algorithms fall into two groups.
The first group of algorithms relies on an elegant polynomial-size reformulation of the training problem (Taskar et al, 2003; Anguelov et al, 2005), which is possible for the special case of margin-rescaling (Tsochantaridis et al, 2005) with linearly decomposable loss. These smaller QPs can then be solved, for example, with general-purpose optimization methods (Anguelov et al, 2005) or decomposition methods similar to SMO (Taskar et al, 2003; Platt, 1999). Unfortunately, decomposition methods are known to scale super-linearly with the number of examples (Platt, 1999; Joachims, 1999), and so do general-purpose optimizers, since they do not exploit the special structure of this optimization problem. But most significantly, the algorithms in the first group are limited to applications where the polynomial-size reformulation exists. Similar restrictions also apply to the extragradient method (Taskar et al, 2005), which applies only to problems where subgradients of the QP can be computed via a convex real relaxation, as well as to exponentiated gradient methods (Bartlett et al, 2004; Globerson et al, 2007), which require the ability to compute marginals (e.g., via the sum-product algorithm). The second group of algorithms works directly with the original, exponentially-sized QP. This is feasible, since a polynomially-sized subset of the constraints from the original QP is already sufficient for a solution of arbitrary accuracy (Joachims, 2003; Tsochantaridis
et al, 2005). Such algorithms either take stochastic subgradient steps (Collins, 2002; Ratliff et al, 2007; Shalev-Shwartz et al, 2007), or build a cutting-plane model which is easy to solve directly (Tsochantaridis et al, 2004). The algorithm in (Tsochantaridis et al, 2005) shows how such a cutting-plane model can be constructed efficiently. Compared to the subgradient methods, the cutting-plane approach does not take a single gradient step, but always takes an optimal step in the current cutting-plane model. It requires only the existence of an efficient separation oracle, which makes it applicable to many problems for which no polynomially-sized reformulation is known. In practice, however, the cutting-plane method of Tsochantaridis et al (2005) is known to scale super-linearly with the number of training examples. In particular, since the size of the cutting-plane model typically grows linearly with the dataset size (see (Tsochantaridis et al, 2005) and Section 5.5), QPs of increasing size need to be solved to compute the optimal steps, which leads to the super-linear runtime.

In this paper, we explore an extension of the cutting-plane method presented in (Joachims, 2006) for training linear structural SVMs, both in the margin-rescaling and in the slack-rescaling formulation (Tsochantaridis et al, 2005). In contrast to the cutting-plane method presented in (Tsochantaridis et al, 2005), we show that the size of the cutting-plane models and the number of iterations are independent of the number of training examples. Instead, their size and the number of iterations can be upper bounded by O(C/ε), where C is the regularization constant and ε is the desired precision of the solution (see Optimization Problems OP2 and OP3). Since each iteration of the new algorithm takes O(n) time and memory, it also scales O(n) overall with the number of training examples, both in terms of computation time and memory.
Empirically, the size of the cutting-plane models and the QPs that need to be solved in each iteration is typically very small (less than a few hundred) even for problems with millions of features and hundreds of thousands of examples. A key conceptual difference of the new algorithm compared to the algorithm of Tsochantaridis et al (2005) and most other SVM training methods is that not only individual data points are considered as potential Support Vectors (SVs), but also linear combinations of those. This increased flexibility allows for solutions with far fewer non-zero dual variables, and it leads to the small cutting-plane models discussed above. The new algorithm is applicable to all structural SVM problems where the separation oracle can be computed efficiently, which makes it just as widely applicable as the most general training algorithms known to date. Even further, following the original publication in (Joachims, 2006), Teo et al (2007) have already shown that the algorithm can also be extended to Conditional Random Field training. We provide a theoretical analysis of the algorithm's correctness, convergence rate, and scaling behavior for structured prediction. Furthermore, we present empirical results for several structured prediction problems (i.e., multi-class classification, part-of-speech tagging, and natural language parsing), and compare against conventional algorithms also for the special case of binary classification. On all problems, the new algorithm is substantially faster than conventional decomposition methods and cutting-plane methods, often by several orders of magnitude for large datasets.
2 Structural Support Vector Machines

Structured output prediction describes the problem of learning a function h : X → Y, where X is the space of inputs and Y is the space of (multivariate and structured) outputs. In the case of natural language parsing, for example, X is the space of sentences, and Y is the space of trees over a given set of non-terminal grammar symbols. To learn h, we assume that a training sample of input-output pairs

S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n

is available and drawn i.i.d. from a distribution P(X, Y).¹ The goal is to find a function h from some hypothesis space H that has low prediction error, or, more generally, low risk

R^Δ_P(h) = ∫_{X×Y} Δ(y, h(x)) dP(x, y).

Δ(y, ȳ) is a loss function that quantifies the loss associated with predicting ȳ when y is the correct output value. Furthermore, we assume that Δ(y, y) = 0 and Δ(y, ȳ) ≥ 0 for y ≠ ȳ. We follow the Empirical Risk Minimization Principle (Vapnik, 1998) to infer a function h from the training sample S. The learner evaluates the quality of a function h ∈ H using the empirical risk R^Δ_S(h) on the training sample S:

R^Δ_S(h) = (1/n) Σ_{i=1}^{n} Δ(y_i, h(x_i))

Support Vector Machines select an h ∈ H that minimizes a regularized empirical risk on S. For conventional binary classification where Y = {−1, +1}, SVM training is typically formulated as the following convex quadratic optimization problem² (Cortes and Vapnik, 1995; Vapnik, 1998).

Optimization Problem 1 (CLASSIFICATION SVM (PRIMAL))

  min_{w, ξ_i≥0}  (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i
  s.t. ∀i ∈ {1, ..., n} : y_i (w^T x_i) ≥ 1 − ξ_i

¹ Note, however, that all formal results in this paper also hold for non-i.i.d. data, since our algorithms do not rely on the order or distribution of the examples.
² For simplicity, we consider the case of hyperplanes passing through the origin. By adding a constant feature, an offset can easily be simulated.

It was shown that SVM training can be generalized to structured outputs (Altun et al, 2003; Taskar et al, 2003; Tsochantaridis et al, 2004), leading to an optimization problem that is similar to multi-class SVMs (Crammer and Singer, 2001) and extending the Perceptron approach described in (Collins, 2002). The idea is to learn a discriminant function f : X × Y → R over input/output pairs, from which one derives a prediction by maximizing f over all y ∈ Y for a specific given input x:

h_w(x) = argmax_{y∈Y} f_w(x, y)

We assume that f_w(x, y) takes the form of a linear function

f_w(x, y) = w^T Ψ(x, y)

where w ∈ R^N is a parameter vector and Ψ(x, y) is a feature vector relating input x and output y. Intuitively, one can think of f_w(x, y) as a compatibility function that measures how well the output y matches the given input x. The flexibility in designing Ψ allows us to employ SVMs to learn models for problems as diverse as natural language parsing (Taskar et al, 2004; Tsochantaridis et al, 2004), protein sequence alignment (Yu et al, 2007), learning ranking functions that optimize IR performance measures (Yue et al, 2007), and segmenting images (Anguelov et al, 2005).

For training the weights w of the linear discriminant function, the standard SVM optimization problem can be generalized in several ways (Altun et al, 2003; Joachims, 2003; Taskar et al, 2003; Tsochantaridis et al, 2004, 2005). This paper uses the formulations given in (Tsochantaridis et al, 2005), which subsume all other approaches. We refer to these as the n-slack formulations, since they assign a different slack variable to each of the n training examples. Tsochantaridis et al (2005) identify two different ways of using a hinge loss as a convex upper bound on the loss, namely margin-rescaling and slack-rescaling. In margin-rescaling, the position of the hinge is adapted while the slope is fixed,

  Δ_MR(y, h_w(x)) = max_{ȳ∈Y} {Δ(y, ȳ) − w^T Ψ(x, y) + w^T Ψ(x, ȳ)} ≥ Δ(y, h_w(x))   (1)

while in slack-rescaling, the slope is adjusted while the position of the hinge is fixed.
  Δ_SR(y, h_w(x)) = max_{ȳ∈Y} {Δ(y, ȳ)(1 − w^T Ψ(x, y) + w^T Ψ(x, ȳ))} ≥ Δ(y, h_w(x))   (2)

This leads to the following two training problems, where each slack variable ξ_i is equal to the respective Δ_MR(y_i, h_w(x_i)) or Δ_SR(y_i, h_w(x_i)) for training example (x_i, y_i).
Optimization Problem 2 (n-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING (PRIMAL))

  min_{w, ξ≥0}  (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i
  s.t. ∀ȳ_1 ∈ Y : w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ Δ(y_1, ȳ_1) − ξ_1
       ...
       ∀ȳ_n ∈ Y : w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ Δ(y_n, ȳ_n) − ξ_n

Optimization Problem 3 (n-SLACK STRUCTURAL SVM WITH SLACK-RESCALING (PRIMAL))

  min_{w, ξ≥0}  (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i
  s.t. ∀ȳ_1 ∈ Y : w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ 1 − ξ_1/Δ(y_1, ȳ_1)
       ...
       ∀ȳ_n ∈ Y : w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ 1 − ξ_n/Δ(y_n, ȳ_n)

The objective is the conventional regularized risk used in SVMs. The constraints state that for each training example (x_i, y_i), the score w^T Ψ(x_i, y_i) of the correct structure y_i must be greater than the score w^T Ψ(x_i, ȳ) of all incorrect structures ȳ by a required margin. This margin is 1 in slack-rescaling, and equal to the loss Δ(y_i, ȳ) in margin-rescaling. If the margin is violated, the slack variable ξ_i of the example becomes non-zero. Note that ξ_i is shared among constraints from the same example. The correct labels y_i are not excluded from the constraints, because they correspond to non-negativity constraints on the slack variables ξ_i. It is easy to verify that for both margin-rescaling and for slack-rescaling, (1/n) Σ_{i=1}^{n} ξ_i is an upper bound on the empirical risk R^Δ_S(h) on the training sample S.

It is not immediately obvious that Optimization Problems OP2 and OP3 can be solved efficiently, since they have O(n|Y|) constraints. Y is typically extremely large (e.g., all possible alignments of two amino-acid sequences) or even infinite (e.g., real-valued outputs). For the special case of margin-rescaling with linearly decomposable loss functions Δ, Taskar et al (2003) have shown that the problem can be reformulated as a quadratic program with only a polynomial number of constraints and variables. A more general algorithm that applies to both margin-rescaling and slack-rescaling under a large variety of loss functions was given in (Tsochantaridis et al, 2004, 2005).
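The role of the hinge constructions in Eqs. (1) and (2) as convex upper bounds on the task loss can be checked numerically. The following sketch is an illustration, not from the paper: the block feature map, the zero/one loss, and the toy inputs are assumptions made here. It verifies that the margin-rescaling hinge never falls below the loss of the actual prediction, for any weight vector.

```python
# Numeric check (toy example): for any w, the margin-rescaling hinge
# Delta_MR(y, h_w(x)) upper-bounds the task loss Delta(y, h_w(x)).
K = 4          # number of classes in the toy multi-class setting (assumption)
d = 2          # input dimension (assumption)

def psi(x, y):
    """Joint feature map: x copied into block y of a K*d vector."""
    v = [0.0] * (K * d)
    for j in range(d):
        v[y * d + j] = x[j]
    return v

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def delta(y, ybar):          # zero/one loss
    return 0.0 if y == ybar else 1.0

def predict(w, x):           # h_w(x) = argmax_y w . Psi(x, y)
    return max(range(K), key=lambda y: dot(w, psi(x, y)))

def hinge_mr(w, x, y):       # max_ybar { Delta(y,ybar) - w.[Psi(x,y)-Psi(x,ybar)] }
    return max(delta(y, yb) - dot(w, psi(x, y)) + dot(w, psi(x, yb))
               for yb in range(K))

examples = [((1.0, 0.5), 0), ((-0.3, 0.8), 1), ((0.2, -1.0), 2)]
ws = [[0.0] * (K * d), [0.3, -0.1, 0.2, 0.4, -0.5, 0.1, 0.0, 0.2]]
for w in ws:
    for x, y in examples:
        # The max includes ybar = h_w(x), whose score beats that of y,
        # so the hinge dominates the zero/one loss of the prediction.
        assert hinge_mr(w, x, y) >= delta(y, predict(w, x)) - 1e-12
```

Since ȳ = h_w(x) is one of the candidates inside the max, the bound holds for every w, which is why (1/n) Σ_i ξ_i upper-bounds the empirical risk in OP2.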
The algorithm relies on the theoretical result that, for any desired precision ε, a greedily constructed cutting-plane model of OP2 and OP3 requires only O(n/ε²) many constraints (Joachims, 2003; Tsochantaridis et al, 2005). This greedy algorithm for the case of margin-rescaling is Algorithm 1; for slack-rescaling it leads
Algorithm 1 for training Structural SVMs (with margin-rescaling) via the n-Slack Formulation (OP2).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W_i ← ∅, ξ_i ← 0 for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     ŷ ← argmax_{ŷ∈Y} {Δ(y_i, ŷ) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)]}
6:     if Δ(y_i, ŷ) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)] > ξ_i + ε then
7:       W_i ← W_i ∪ {ŷ}
8:       (w, ξ) ← argmin_{w, ξ≥0} (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i
               s.t. ∀ȳ_1 ∈ W_1 : w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ Δ(y_1, ȳ_1) − ξ_1
                    ...
                    ∀ȳ_n ∈ W_n : w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ Δ(y_n, ȳ_n) − ξ_n
9:     end if
10:   end for
11: until no W_i has changed during iteration
12: return (w, ξ)

Algorithm 2 for training Structural SVMs (with slack-rescaling) via the n-Slack Formulation (OP3).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W_i ← ∅, ξ_i ← 0 for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     ŷ ← argmax_{ŷ∈Y} {Δ(y_i, ŷ)(1 − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)])}
6:     if Δ(y_i, ŷ)(1 − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)]) > ξ_i + ε then
7:       W_i ← W_i ∪ {ŷ}
8:       (w, ξ) ← argmin_{w, ξ≥0} (1/2) w^T w + (C/n) Σ_{i=1}^{n} ξ_i
               s.t. ∀ȳ_1 ∈ W_1 : Δ(y_1, ȳ_1) w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ Δ(y_1, ȳ_1) − ξ_1
                    ...
                    ∀ȳ_n ∈ W_n : Δ(y_n, ȳ_n) w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ Δ(y_n, ȳ_n) − ξ_n
9:     end if
10:   end for
11: until no W_i has changed during iteration
12: return (w, ξ)

to Algorithm 2. The algorithms iteratively construct a working set W = W_1 ∪ ... ∪ W_n of constraints, starting with an empty working set W = ∅. The algorithms iterate through the training examples and find the constraint that is violated most by the current solution w, ξ (Line 5). If this constraint is violated by more than the desired precision ε (Line 6), the constraint is added to the working set (Line 7) and the QP is solved over the extended W (Line 8). The algorithms terminate when no constraint is added in the previous iteration, meaning that all constraints in OP2 or OP3 are fulfilled up to a precision of ε. The algorithm is provably efficient whenever the most violated constraint can be found efficiently. The procedure in Line 5 for finding the most violated constraint is called the separation oracle. The argmax in
Line 5 has an efficient solution for a wide variety of choices for Ψ, Y, and Δ (see e.g., Tsochantaridis et al, 2005; Joachims, 2005; Yu et al, 2007; Yue et al, 2007), and often it involves the same algorithm used for making predictions (see Eq. (1)). Related to Algorithm 1 is the method proposed in (Anguelov et al, 2005), which applies to the special case where the argmax in Line 5 can be computed as a linear program. This allows them not to explicitly maintain a working set, but to implicitly represent it by folding linear programs into the quadratic program OP2. To this special case also applies the method of Taskar et al (2005), which casts the training of max-margin structured predictors as a convex-concave saddle-point problem. It provides improved scalability compared to an explicit reduction to a polynomially-sized QP, but involves the use of a special min-cost quadratic flow solver in the projection steps of the extragradient method. Exponentiated gradient methods, originally proposed for online learning of linear predictors (Kivinen and Warmuth, 1997), have also been applied to the training of structured predictors (Globerson et al, 2007; Bartlett et al, 2004). They solve the optimization problem in the dual, and treat conditional random fields and structural SVMs within the same framework using Bregman divergences. Stochastic gradient methods (Vishwanathan et al, 2006) have been applied to the training of conditional random fields on large-scale problems, and exhibit a faster rate of convergence than BFGS methods. Recently, subgradient methods and their stochastic variants (Ratliff et al, 2007) have also been proposed to solve the optimization problem in max-margin structured prediction. While not yet explored for structured prediction, the PEGASOS algorithm (Shalev-Shwartz et al, 2007) has shown promising performance for binary classification SVMs. Related to such online methods is also the MIRA algorithm (Crammer and Singer, 2003), which has been used for training structured predictors (e.g., McDonald et al (2005)).
However, to deal with the exponential size of Y, heuristics have to be used (e.g., only using a k-best subset of Y), leading to only approximate solutions of Optimization Problem OP2.

3 Training Algorithm

While polynomial runtime was established for most algorithms discussed above, training general structural SVMs on large-scale problems is still a challenging problem. In the following, we present an equivalent reformulation of the training problems for both margin-rescaling and slack-rescaling, leading to a cutting-plane training algorithm that not only has provably linear runtime in the number of training examples, but is also several orders of magnitude faster than conventional cutting-plane methods (Tsochantaridis et al, 2005) on large-scale problems. Nevertheless, the new algorithm is equally general as Algorithms 1 and 2.
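To make the contrast with Algorithms 1 and 2 concrete, the following sketch implements the 1-slack cutting-plane strategy of this section (formalized as Algorithm 3 below) for margin-rescaling on a toy multi-class problem. The dataset, the zero/one loss, and the approximate projected-gradient solver for the working-set QP are illustrative assumptions made here; the algorithm itself assumes an exact QP solver in each iteration.

```python
# Sketch of 1-slack cutting-plane training (margin-rescaling) on a toy
# 3-class problem. Data, loss, and the approximate dual solver are
# illustrative assumptions, not part of the paper's method.
X = [(1.0, 0.2), (0.8, -0.1), (-0.2, 1.0), (-1.0, -1.0)]
Y = [0, 0, 1, 2]
K, d, n = 3, 2, 4            # classes, input dimension, sample size
C, eps = 10.0, 0.1

def psi(x, y):               # joint feature map: x copied into block y
    v = [0.0] * (K * d)
    for j in range(d):
        v[y * d + j] = x[j]
    return v

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def delta(y, ybar):          # zero/one loss
    return 0.0 if y == ybar else 1.0

def oracle(w, x, y):         # most violated label for one example
    return max(range(K), key=lambda yb: delta(y, yb) + dot(w, psi(x, yb)))

def solve_working_set_qp(constraints, iters=5000):
    # Approximate dual solver: max_a sum_k a_k*D_k - 0.5*||sum_k a_k*g_k||^2
    # subject to a >= 0 and sum_k a_k <= C (projected gradient ascent).
    m = len(constraints)
    step = min(0.05, 1.0 / (1.0 + sum(dot(g, g) for _, g in constraints)))
    a = [C / m] * m
    for _ in range(iters):
        w = [sum(a[k] * constraints[k][1][j] for k in range(m))
             for j in range(K * d)]
        a = [max(0.0, a[k] + step * (constraints[k][0] - dot(constraints[k][1], w)))
             for k in range(m)]
        s = sum(a)
        if s > C:
            a = [ak * C / s for ak in a]
    w = [sum(a[k] * constraints[k][1][j] for k in range(m)) for j in range(K * d)]
    xi = max([0.0] + [D - dot(g, w) for D, g in constraints])
    return w, xi

W, w, xi, converged = [], [0.0] * (K * d), 0.0, False
for _ in range(100):
    yhat = [oracle(w, x, y) for x, y in zip(X, Y)]
    D = sum(delta(y, yb) for y, yb in zip(Y, yhat)) / n           # mean loss
    g = [sum(psi(x, y)[j] - psi(x, yb)[j]
             for x, y, yb in zip(X, Y, yhat)) / n for j in range(K * d)]
    if D - dot(w, g) <= xi + eps:    # no constraint violated by more than eps
        converged = True
        break
    W.append((D, g))                 # one joint constraint added per iteration
    w, xi = solve_working_set_qp(W)

# At termination, the mean zero/one training error is bounded by xi + eps.
err = sum(delta(y, max(range(K), key=lambda c: dot(w, psi(x, c))))
          for x, y in zip(X, Y)) / n
assert converged and err <= xi + eps + 1e-6 and len(W) < 50
```

Each working-set entry is a single joint constraint averaged over all n examples, which is why the working set stays small; with an exact QP solver, Theorem 5 below bounds the number of iterations independently of n.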
3.1 1-Slack Formulation

The first step towards the new algorithm is a reformulation of the optimization problems for training. The key idea is to replace the n cutting-plane models of the hinge loss (one for each training example) with a single cutting-plane model for the sum of the hinge losses. Since there is only a single slack variable in the new formulations, we refer to them as the 1-slack formulations.

Optimization Problem 4 (1-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING (PRIMAL))

  min_{w, ξ≥0}  (1/2) w^T w + Cξ
  s.t. ∀(ȳ_1, ..., ȳ_n) ∈ Y^n :
    (1/n) w^T Σ_{i=1}^{n} [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^{n} Δ(y_i, ȳ_i) − ξ

Optimization Problem 5 (1-SLACK STRUCTURAL SVM WITH SLACK-RESCALING (PRIMAL))

  min_{w, ξ≥0}  (1/2) w^T w + Cξ
  s.t. ∀(ȳ_1, ..., ȳ_n) ∈ Y^n :
    (1/n) w^T Σ_{i=1}^{n} Δ(y_i, ȳ_i)[Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^{n} Δ(y_i, ȳ_i) − ξ

While OP4 has |Y|^n constraints, one for each possible combination of labels (ȳ_1, ..., ȳ_n) ∈ Y^n, it has only one slack variable ξ that is shared across all constraints. Each constraint corresponds to a tangent to R^{Δ_MR}_S(h) and R^{Δ_SR}_S(h), respectively, and the set of constraints forms an equivalent model of the risk function. Specifically, the following theorems show that ξ* = R^{Δ_MR}_S(h_{w*}) at the solution (w*, ξ*) of OP4, and ξ* = R^{Δ_SR}_S(h_{w*}) at the solution (w*, ξ*) of OP5, since the n-slack and the 1-slack formulations are equivalent in the following sense.

Theorem 1. (EQUIVALENCE OF OP2 AND OP4) Any solution w* of OP4 is also a solution of OP2 (and vice versa), with ξ* = (1/n) Σ_{i=1}^{n} ξ_i*.

Proof. Generalizing the proof in (Joachims, 2006), we will show that both optimization problems have the same objective value and an equivalent set of constraints. In particular, for every w the smallest feasible ξ and (1/n) Σ_i ξ_i are equal. For a given w, each ξ_i in OP2 can be optimized individually, and the smallest feasible ξ_i given w is achieved for

  ξ_i = max_{ȳ_i∈Y} {Δ(y_i, ȳ_i) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)]}.

For OP4, the smallest feasible ξ for a given w is
Algorithm 3 for training Structural SVMs (with margin-rescaling) via the 1-Slack Formulation (OP4).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W ← ∅
3: repeat
4:   (w, ξ) ← argmin_{w, ξ≥0} (1/2) w^T w + Cξ
        s.t. ∀(ȳ_1, ..., ȳ_n) ∈ W :
          (1/n) w^T Σ_{i=1}^{n} [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^{n} Δ(y_i, ȳ_i) − ξ
5:   for i = 1, ..., n do
6:     ŷ_i ← argmax_{ŷ∈Y} {Δ(y_i, ŷ) + w^T Ψ(x_i, ŷ)}
7:   end for
8:   W ← W ∪ {(ŷ_1, ..., ŷ_n)}
9: until (1/n) Σ_{i=1}^{n} Δ(y_i, ŷ_i) − (1/n) w^T Σ_{i=1}^{n} [Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)] ≤ ξ + ε
10: return (w, ξ)

  ξ = max_{(ȳ_1,...,ȳ_n)∈Y^n} { (1/n) Σ_{i=1}^{n} Δ(y_i, ȳ_i) − (1/n) w^T Σ_{i=1}^{n} [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] }.

Since the function can be decomposed linearly in the ȳ_i, for any given w, each ȳ_i can be optimized independently:

  ξ = (1/n) Σ_{i=1}^{n} max_{ȳ_i∈Y} {Δ(y_i, ȳ_i) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)]} = (1/n) Σ_{i=1}^{n} ξ_i

Therefore, the objective functions of both optimization problems are equal for any w given the corresponding smallest feasible ξ and ξ_i. Consequently, this is also true for w* and its corresponding smallest feasible slacks ξ* and ξ_i*. □

Theorem 2. (EQUIVALENCE OF OP3 AND OP5) Any solution w* of OP5 is also a solution of OP3 (and vice versa), with ξ* = (1/n) Σ_{i=1}^{n} ξ_i*.

Proof. Analogous to Theorem 1. □

3.2 Cutting-Plane Algorithm

What could we possibly have gained by moving from the n-slack to the 1-slack formulation, exponentially increasing the number of constraints in the process? We will show in the following that the dual of the 1-slack formulation has a solution that is extremely sparse, with the number of non-zero dual variables independent of the number of training examples. To find this solution, we propose Algorithms 3 and 4, which are generalizations of the algorithm in (Joachims, 2006) to structural SVMs. Similar to the cutting-plane algorithms for the n-slack formulations, Algorithms 3 and 4 iteratively construct a working set W of constraints.

Algorithm 4 for training Structural SVMs (with slack-rescaling) via the 1-Slack Formulation (OP5).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W ← ∅
3: repeat
4:   (w, ξ) ← argmin_{w, ξ≥0} (1/2) w^T w + Cξ
        s.t. ∀(ȳ_1, ..., ȳ_n) ∈ W :
          (1/n) w^T Σ_{i=1}^{n} Δ(y_i, ȳ_i)[Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^{n} Δ(y_i, ȳ_i) − ξ
5:   for i = 1, ..., n do
6:     ŷ_i ← argmax_{ŷ∈Y} {Δ(y_i, ŷ)(1 − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)])}
7:   end for
8:   W ← W ∪ {(ŷ_1, ..., ŷ_n)}
9: until (1/n) Σ_{i=1}^{n} Δ(y_i, ŷ_i) − (1/n) w^T Σ_{i=1}^{n} Δ(y_i, ŷ_i)[Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)] ≤ ξ + ε
10: return (w, ξ)

In each iteration, the algorithms compute the solution over the current W (Line 4), find the most violated constraint (Lines 5-7), and add it to the working set. The algorithm stops once no constraint can be found that is violated by more than the desired precision ε (Line 9). Unlike in the n-slack algorithms, only a single constraint is added in each iteration. The following theorems characterize the quality of the solutions returned by Algorithms 3 and 4.

Theorem 3. (CORRECTNESS OF ALGORITHM 3) For any training sample S = ((x_1, y_1), ..., (x_n, y_n)) and any ε > 0, if (w*, ξ*) is the optimal solution of OP4, then Algorithm 3 returns a point (w, ξ) that has a better objective value than (w*, ξ*), and for which (w, ξ + ε) is feasible in OP4.

Proof. We first verify that Lines 5-7 in Algorithm 3 compute the vector (ŷ_1, ..., ŷ_n) ∈ Y^n that maximizes

  ξ' = max_{(ŷ_1,...,ŷ_n)∈Y^n} { (1/n) Σ_{i=1}^{n} Δ(y_i, ŷ_i) − (1/n) w^T Σ_{i=1}^{n} [Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)] }.

ξ' is the minimum value needed to fulfill all constraints in OP4 for the current w. The maximization problem is linear in the ŷ_i, so one can maximize over each ŷ_i independently:

  ξ' = (1/n) Σ_{i=1}^{n} max_{ŷ∈Y} {Δ(y_i, ŷ) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)]}   (3)
     = −(1/n) Σ_{i=1}^{n} w^T Ψ(x_i, y_i) + (1/n) Σ_{i=1}^{n} max_{ŷ∈Y} {Δ(y_i, ŷ) + w^T Ψ(x_i, ŷ)}   (4)

Since the first sum in Equation (4) is constant, the second term directly corresponds to the assignment in Line 6. As checked in Line 9, the algorithm terminates only if ξ' does not exceed the ξ from the solution over W by more than ε, as desired.
Since the (w, ξ) returned by Algorithm 3 is the solution on a subset of the constraints from OP4, it holds that (1/2) w^T w + Cξ ≤ (1/2) w*^T w* + Cξ*. □

Theorem 4. (CORRECTNESS OF ALGORITHM 4) For any training sample S = ((x_1, y_1), ..., (x_n, y_n)) and any ε > 0, if (w*, ξ*) is the optimal solution of OP5, then Algorithm 4 returns a point (w, ξ) that has a better objective value than (w*, ξ*), and for which (w, ξ + ε) is feasible in OP5.

Proof. Analogous to the proof of Theorem 3. □

Using a stopping criterion based on the accuracy of the empirical risk ξ is very intuitive and practically meaningful, unlike the stopping criteria typically used in decomposition methods. Intuitively, ε can be used to indicate how close one wants to be to the empirical risk of the best parameter vector. In most machine learning applications, tolerating a training error that is suboptimal by 0.1% is very acceptable. This intuition makes selecting the stopping criterion much easier than in other training methods, where it is usually defined based on the accuracy of the Kuhn-Tucker Conditions of the dual (see e.g., Joachims, 1999). Nevertheless, it is easy to see that ε also bounds the duality gap of the solution by Cε. Solving the optimization problems to an arbitrary but fixed precision of ε is essential in our analysis below, making sure that computation time is not wasted on computing a solution that is more accurate than necessary.

We next analyze the time complexity of Algorithms 3 and 4. It is easy to see that each iteration of the algorithm takes n calls to the separation oracle, and that for the linear kernel the remaining work in each iteration scales linearly with n as well. We show next that the number of iterations until convergence is bounded, and that this upper bound is independent of n. The argument requires the Wolfe-dual programs, which are straightforward to derive (see Appendix). For a more compact notation, we denote vectors of labels as ȳ = (ȳ_1, ..., ȳ_n) ∈ Y^n. For such vectors of labels, we then define Δ(ȳ) and the inner product H_MR(ȳ, ȳ') as follows.
Note that y_i and y_j denote correct training labels, while ȳ_i and ȳ'_j denote arbitrary labels:

  Δ(ȳ) = (1/n) Σ_{i=1}^{n} Δ(y_i, ȳ_i)   (5)

  H_MR(ȳ, ȳ') = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} [Ψ(x_i, y_i)^T Ψ(x_j, y_j) − Ψ(x_i, y_i)^T Ψ(x_j, ȳ'_j) − Ψ(x_i, ȳ_i)^T Ψ(x_j, y_j) + Ψ(x_i, ȳ_i)^T Ψ(x_j, ȳ'_j)]   (6)

The inner products Ψ(x, y)^T Ψ(x', y') are computed either explicitly or via a kernel K(x, y, x', y') = Ψ(x, y)^T Ψ(x', y'). Note that it is typically more efficient to compute

  H_MR(ȳ, ȳ') = (1/n²) [Σ_{i=1}^{n} (Ψ(x_i, y_i) − Ψ(x_i, ȳ_i))]^T [Σ_{j=1}^{n} (Ψ(x_j, y_j) − Ψ(x_j, ȳ'_j))]   (7)

if no kernel is used. The dual of the 1-slack formulation for margin-rescaling is:
Optimization Problem 6 (1-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING (DUAL))

  max_{α≥0}  Σ_{ȳ∈Y^n} Δ(ȳ) α_ȳ − (1/2) Σ_{ȳ∈Y^n} Σ_{ȳ'∈Y^n} α_ȳ α_ȳ' H_MR(ȳ, ȳ')
  s.t. Σ_{ȳ∈Y^n} α_ȳ = C

For the case of slack-rescaling, the respective H(ȳ, ȳ') is as follows. There is an analogous factorization that is more efficient to compute if no kernel is used:

  H_SR(ȳ, ȳ') = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} Δ(y_i, ȳ_i) Δ(y_j, ȳ'_j) [Ψ(x_i, y_i)^T Ψ(x_j, y_j) − Ψ(x_i, y_i)^T Ψ(x_j, ȳ'_j) − Ψ(x_i, ȳ_i)^T Ψ(x_j, y_j) + Ψ(x_i, ȳ_i)^T Ψ(x_j, ȳ'_j)]   (8)

    = (1/n²) [Σ_{i=1}^{n} Δ(y_i, ȳ_i)(Ψ(x_i, y_i) − Ψ(x_i, ȳ_i))]^T [Σ_{j=1}^{n} Δ(y_j, ȳ'_j)(Ψ(x_j, y_j) − Ψ(x_j, ȳ'_j))]   (9)

The dual of the 1-slack formulation is:

Optimization Problem 7 (1-SLACK STRUCTURAL SVM WITH SLACK-RESCALING (DUAL))

  max_{α≥0}  Σ_{ȳ∈Y^n} Δ(ȳ) α_ȳ − (1/2) Σ_{ȳ∈Y^n} Σ_{ȳ'∈Y^n} α_ȳ α_ȳ' H_SR(ȳ, ȳ')
  s.t. Σ_{ȳ∈Y^n} α_ȳ = C

Using the respective dual solution α*, one can compute inner products with the weight vector w* solving the primal via

  w*^T Ψ(x, y) = (1/n) Σ_{ȳ∈Y^n} α*_ȳ Σ_{j=1}^{n} [Ψ(x, y)^T Ψ(x_j, y_j) − Ψ(x, y)^T Ψ(x_j, ȳ_j)]
               = (1/n) [Σ_{ȳ∈Y^n} α*_ȳ Σ_{j=1}^{n} (Ψ(x_j, y_j) − Ψ(x_j, ȳ_j))]^T Ψ(x, y)

for margin-rescaling and via

  w*^T Ψ(x, y) = (1/n) Σ_{ȳ∈Y^n} α*_ȳ Σ_{j=1}^{n} Δ(y_j, ȳ_j) [Ψ(x, y)^T Ψ(x_j, y_j) − Ψ(x, y)^T Ψ(x_j, ȳ_j)]
               = (1/n) [Σ_{ȳ∈Y^n} α*_ȳ Σ_{j=1}^{n} Δ(y_j, ȳ_j)(Ψ(x_j, y_j) − Ψ(x_j, ȳ_j))]^T Ψ(x, y)
for slack-rescaling. We will show in the following that only a small (i.e., polynomial) number of the α_ȳ are non-zero at the solution. In analogy to classification SVMs, we will refer to those ȳ with non-zero α_ȳ as Support Vectors. However, note that Support Vectors in the 1-slack formulation are linear combinations of multiple examples.

We can now state the theorem giving an upper bound on the number of iterations of the 1-slack algorithms. The proof extends the one in (Joachims, 2006) to general structural SVMs, and is based on the technique introduced in (Joachims, 2003) and generalized in (Tsochantaridis et al, 2005). The final step of the proof uses an improvement developed in (Teo et al, 2007).

Theorem 5. (1-SLACK MARGIN-RESCALING SVM ITERATION COMPLEXITY) For any 0 < C, 0 < ε ≤ 4R²C and any training sample S = ((x_1, y_1), ..., (x_n, y_n)), Algorithm 3 terminates after at most

  ⌈log_2(Δ̄ / (4R²C))⌉ + ⌈16R²C / ε⌉   (10)

iterations, where R² = max_{i,ȳ} ‖Ψ(x_i, y_i) − Ψ(x_i, ȳ)‖², Δ̄ = max_{i,ȳ} Δ(y_i, ȳ), and ⌈..⌉ is the integer ceiling function.

Proof. We will show that adding each new constraint to W increases the objective value at the solution of the quadratic program in Line 4 by at least some constant positive value. Since the objective value of the solution of OP6 is upper bounded by CΔ̄ (since w = 0 and ξ = Δ̄ is a feasible point in the primal), the algorithm can only perform a constant number of iterations before termination. The amount by which the solution increases by adding one constraint that is violated by more than ε (i.e., the criterion in Line 9 of Algorithms 3 and 4) to W can be lower bounded as follows. Let ŷ be the newly added constraint and let α be the solution of the dual before the addition. To lower bound the progress made by the algorithm in each iteration, consider the increase in the dual that can be achieved with a line search

  max_{0≤β≤C} {D(α + βη)} − D(α).   (11)

The direction η is constructed by setting η_ŷ = 1 and η_ȳ = −α_ȳ/C for all other ȳ.
Note that the constraints on β and the construction of η ensure that α + βη never leaves the feasible region of the dual. To apply Lemma 2 (see Appendix) for computing the progress made by a line search, we need a lower bound for ∇D(α)^T η and an upper bound for η^T Hη. Starting with the lower bound for ∇D(α)^T η, note that

  ∂D(α)/∂α_ȳ = Δ(ȳ) − Σ_{ȳ'∈W} α_ȳ' H_MR(ȳ, ȳ') = ξ   (12)

for all ȳ with non-zero α_ȳ at the solution over the previous working set W. For the newly added constraint ŷ and some γ > 0,
  ∂D(α)/∂α_ŷ = Δ(ŷ) − Σ_{ȳ'∈W} α_ȳ' H_MR(ŷ, ȳ') = ξ + γ ≥ ξ + ε   (13)

by construction, due to Line 9 of Algorithm 3. It follows that

  ∇D(α)^T η = (ξ + γ) − Σ_{ȳ∈W} (α_ȳ/C) ξ   (14)
            = ξ (1 − (1/C) Σ_{ȳ∈W} α_ȳ) + γ   (15)
            = γ.   (16)

The following gives an upper bound for η^T Hη, where H_{ȳȳ'} = H_MR(ȳ, ȳ') for ȳ, ȳ' ∈ W ∪ {ŷ}:

  η^T Hη = H_MR(ŷ, ŷ) − (2/C) Σ_{ȳ∈W} α_ȳ H_MR(ȳ, ŷ) + (1/C²) Σ_{ȳ∈W} Σ_{ȳ'∈W} α_ȳ α_ȳ' H_MR(ȳ, ȳ')   (17)
         ≤ R² + (2/C) C R² + (1/C²) C² R²   (18)
         = 4R²   (19)

The bound uses that −R² ≤ H_MR(ȳ, ȳ') ≤ R². Plugging everything into the bound of Lemma 2 shows that the increase of the objective is at least

  max_{0≤β≤C} {D(α + βη)} − D(α) ≥ min{ Cγ/2, γ²/(8R²) }   (20)

Note that the first case applies whenever γ ≥ 4R²C, and that the second case applies otherwise.

The final step of the proof is to use this constant increase of the objective value in each iteration to bound the maximum number of iterations. First, note that α_ȳ = 0 for all incorrect vectors of labels ȳ and α_ȳ = C for the correct vector of labels ȳ = (y_1, ..., y_n) is a feasible starting point α_0 with a dual objective of 0. This means the initial optimality gap δ(0) = D(α*) − D(α_0) is at most CΔ̄, where α* is the optimal dual solution. An optimality gap of δ(i) = D(α*) − D(α_i) ensures that there exists a constraint that is violated by at least γ ≥ δ(i)/C. This means that the first case of (20) applies while δ(i) ≥ 4R²C², leading to a decrease in the optimality gap of at least

  δ(i+1) ≤ δ(i) − δ(i)/2   (21)

in each iteration. Starting from the worst possible optimality gap of δ(0) = CΔ̄, the algorithm needs at most
  i₁ ≤ ⌈log₂(Δ / (4R²C))⌉    (22)

iterations until it has reached an optimality gap of δ(i₁) ≤ 4R²C², where the second case of (20) becomes valid. As proposed in (Teo et al, 2007), the recurrence equation

  δ(i+1) ≤ δ(i) − δ(i)² / (8R²C²)    (23)

for the second case of (20) can be upper bounded by solving the differential equation ∂δ(i)/∂i = −δ(i)²/(8R²C²) with boundary condition δ(0) = 4R²C². The solution is δ(i) ≤ 8R²C²/(i+2), showing that the algorithm does not need more than

  i₂ ≤ 8R²C/ε − 2    (24)

iterations until it reaches an optimality gap of Cε when starting at a gap of 4R²C², where ε is the desired target precision given to the algorithm. Once the optimality gap reaches Cε, it is no longer guaranteed that an ε-violated constraint exists. However, such constraints may still exist and so the algorithm does not yet terminate. But since each such constraint leads to an increase in the dual objective of at least ε²/(8R²), only

  i₃ ≤ 8R²C/ε    (25)

can be added before the optimality gap becomes negative. The overall bound results from adding i₁, i₂, and i₃.

Note that the proof of the theorem requires only a line search in each step, while Algorithm 4 actually computes the full QP solution. This suggests the following. On the one hand, the actual number of iterations in Algorithm 4 might be substantially smaller in practice than what is predicted by the bound. On the other hand, it suggests a variant of Algorithm 4, where the QP solver is replaced by a simple line search. This may be beneficial in structured prediction problems where the separation oracle in Line 6 is particularly cheap to compute.

Theorem 6. (1-SLACK SLACK-RESCALING SVM ITERATION COMPLEXITY) For any 0 < C, 0 < ε ≤ 4Δ²R²C and any training sample S = ((x₁,y₁),...,(x_n,y_n)), Algorithm 4 terminates after at most

  ⌈log₂(1 / (4R²ΔC))⌉ + ⌈16R²Δ²C / ε⌉    (26)

iterations, where R² = max_{i,ȳ} ||Ψ(x_i,y_i) − Ψ(x_i,ȳ)||², Δ = max_{i,ȳ} Δ(y_i,ȳ), and ⌈..⌉ is the integer ceiling function.
Proof. The proof for the case of slack-rescaling is analogous. The only difference is that −Δ²R² ≤ H_SR(ȳ,ȳ') ≤ Δ²R².

The O(1/ε) convergence rate in the bound is tight, as the following example shows. Consider a multi-class classification problem with infinitely many classes Y = {1,2,...} and a feature space X = R that contains only one feature. This problem can be encoded using a feature map Ψ(x,y) which takes value x in position y and 0 everywhere else. For a training set with a single training example (x,y) = ((1),1) and using the zero/one-loss, the 1-slack quadratic program for both margin-rescaling and slack-rescaling is

  min_{w,ξ≥0} ½ wᵀw + Cξ    (27)
  s.t. wᵀ[Ψ(x,1) − Ψ(x,2)] ≥ 1 − ξ
       wᵀ[Ψ(x,1) − Ψ(x,3)] ≥ 1 − ξ
       wᵀ[Ψ(x,1) − Ψ(x,4)] ≥ 1 − ξ
       ...

Let's assume without loss of generality that Algorithm 3 (or equivalently Algorithm 4) introduces the first constraint in the first iteration. For C ≥ 1/2 the solution over this working set is wᵀ = (1/2, −1/2, 0, 0, ...) and ξ = 0. All other constraints are now violated by 1/2, and one of them is selected at random to be added to the working set in the next iteration. It is easy to verify that after adding k constraints, the solution over the working set is wᵀ = (k/(k+1), −1/(k+1), ..., −1/(k+1), 0, 0, ...) for C ≥ 1/2, and all constraints outside the working set are violated by ε = 1/(k+1). It therefore takes O(1/ε) iterations to reach a desired precision of ε.

The O(C) scaling with C is tight as well, at least for small values of C. For C ≤ 1/2, the solution over the working set after adding k constraints is wᵀ = (C, −C/k, ..., −C/k, 0, 0, ...). This means that after k constraints, all constraints outside the working set are violated by ε = C/k. Consequently, the bounds in (10) and (26) accurately reflect the scaling with C up to the log-term for C ≤ 1/2.

The following corollary summarizes our characterization of the time complexity of the 1-slack algorithms. In real applications, however, we will see that Algorithm 3 scales much better than what is predicted by these worst-case bounds both w.r.t. C and ε. Note that a support vector (i.e.,
a point with non-zero dual variable) no longer corresponds to a single data point in the 1-slack dual, but is typically a linear combination of data points.

Corollary 1. (TIME COMPLEXITY OF ALGORITHMS 3 AND 4 FOR LINEAR KERNEL) For any n training examples S = ((x₁,y₁),...,(x_n,y_n)) with max_{i,ȳ} ||Ψ(x_i,y_i) − Ψ(x_i,ȳ)||² ≤ R² < ∞ and max_{i,ȳ} Δ(y_i,ȳ) ≤ Δ < ∞ for all n, the 1-slack cutting-plane Algorithms 3 and 4 with constant ε and C using the linear kernel
• require at most O(n) calls to the separation oracle,
• require at most O(n) computation time outside the separation oracle,
• find a solution where the number of support vectors (i.e., the number of non-zero dual variables in the cutting-plane model) does not depend on n,

for any fixed value of C > 0 and ε > 0.

Proof. Theorems 5 and 6 show that the algorithms terminate after a constant number of iterations that does not depend on n. Since only one constraint is introduced in each iteration, the number of support vectors is bounded by the number of iterations. In each iteration, the algorithm performs exactly n calls to the separation oracle, which proves the first statement. Similarly, the QP that is solved in each iteration is of constant size and therefore requires only constant time. It is easily verified that the remaining operations in each iteration can be done in time O(n) using Eqs. (7) and (9). We further discuss the time complexity for the case of kernels in the following section.

Note that the linear-time algorithm proposed in (Joachims, 2006) for training binary classification SVMs is a special case of the 1-slack methods developed here. For binary classification, X = R^N and Y = {−1,+1}. Plugging

  Ψ(x,y) = ½ yx   and   Δ(y,ȳ) = { 0 if y = ȳ, 1 otherwise }    (28)

into either n-slack formulation OP2 or OP3 produces the standard SVM optimization problem OP1. The 1-slack formulations and algorithms are then equivalent to those in (Joachims, 2006). However, the O(1/ε) bound on the maximum number of iterations derived here is tighter than the O(1/ε²) bound in (Joachims, 2006). Using a similar argument, it can also be shown that the ordinal regression method in (Joachims, 2006) is a special case of the 1-slack algorithm.

3.3 Kernels and Low-Rank Approximations

For problems where a (non-linear) kernel is used, the computation time in each iteration is O(n²) instead of O(n), since Eqs. (7) and (9) no longer apply.
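Before turning to kernels, the binary special case just described can be made concrete with a small sketch. The following minimal trainer is illustrative only (all names are ours, and it is not the SVMstruct implementation): it runs the 1-slack margin-rescaling loop with Ψ(x,y) = ½yx and zero/one loss, and replaces the working-set QP by the simple analytic line search along the direction η from the proof of Theorem 5, the variant the text notes is admissible.

```python
import numpy as np

def train_1slack_binary(X, y, C=1.0, eps=0.01, max_iter=1000):
    """Illustrative 1-slack margin-rescaling cutting planes for binary
    classification (Psi(x, y) = y*x/2, zero/one loss).  Each working-set
    constraint is summarized by g = (1/n) sum_i c_i y_i x_i and
    delta = (1/n) sum_i c_i, where c_i = 1 iff example i is 'flipped'.
    The working-set QP is replaced by a line search along the direction
    eta (eta_new = 1, eta_j = -alpha_j / C), as in the iteration bound."""
    n, d = X.shape
    gs, deltas, alpha = [], [], []        # working set and dual variables
    w = np.zeros(d)
    for _ in range(max_iter):
        # Separation oracle: the most violated joint constraint flips
        # exactly the examples with margin < 1.
        c = (y * (X @ w) < 1).astype(float)
        g = (c * y) @ X / n
        delta = c.mean()
        xi = max([0.0] + [dj - w @ gj for gj, dj in zip(gs, deltas)])
        if delta - w @ g <= xi + eps:     # no eps-violated constraint left
            break
        gs.append(g)
        deltas.append(delta)
        alpha.append(0.0)
        grad = np.array([dj - w @ gj for gj, dj in zip(gs, deltas)])
        eta = -np.asarray(alpha) / C
        eta[-1] = 1.0
        dir_g = sum(e * gj for e, gj in zip(eta, gs))
        beta = np.clip((grad @ eta) / max(float(dir_g @ dir_g), 1e-12), 0.0, C)
        alpha = list(np.asarray(alpha) + beta * eta)
        w = sum(a * gj for a, gj in zip(alpha, gs))
    return w

# Toy separable problem: the max-margin solution is w = (0.5,)
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
w = train_1slack_binary(X, y, C=10.0, eps=0.001)
assert abs(w[0] - 0.5) < 1e-6
```

Because only a line search is done per iteration, convergence follows the O(1/ε) analysis rather than the typically faster behavior of the full QP variant; the sketch is meant to expose the structure of the loop, not to be efficient.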
However, the 1-slack algorithm can easily exploit rank-k approximations, which we will show reduces the computation time outside of the separation oracle from O(n²) to O(nk + k³). Let (x₁,y₁),...,(x_k,y_k) be a set of basis functions so that the subspace spanned by Ψ(x₁,y₁),...,Ψ(x_k,y_k) (approximately) contains the solution w of OP4 and OP5 respectively. Algorithms for finding such approximations have been suggested in (Keerthi et al, 2006; Fukumizu et al, 2004; Smola and Schölkopf, 2000) for classification SVMs, and at least some of them can be extended to structural SVMs as well. In the simplest case, the set of k basis functions can be chosen randomly from the set of training examples.
For a kernel K(.) and the resulting Gram matrix K with K_ij = Ψ(x_i,y_i)ᵀΨ(x_j,y_j) = K(x_i,y_i,x_j,y_j), we can compute the inverse L⁻¹ of the Cholesky decomposition L of K in time O(k³). Assuming that w actually lies in the subspace, we can equivalently rewrite the 1-slack optimization problems as

Optimization Problem 8 (1-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING AND k BASIS FUNCTIONS (PRIMAL))

  min_{β,ξ≥0} ½ βᵀβ + Cξ
  s.t. ∀(ȳ₁,...,ȳ_n) ∈ Yⁿ:
    (1/n) βᵀ L⁻¹ Σᵢ₌₁ⁿ ( K(x_i,y_i,x₁,y₁) − K(x_i,ȳ_i,x₁,y₁)
                          ...
                          K(x_i,y_i,x_k,y_k) − K(x_i,ȳ_i,x_k,y_k) ) ≥ (1/n) Σᵢ₌₁ⁿ Δ(y_i,ȳ_i) − ξ

Optimization Problem 9 (1-SLACK STRUCTURAL SVM WITH SLACK-RESCALING AND k BASIS FUNCTIONS (PRIMAL))

  min_{β,ξ≥0} ½ βᵀβ + Cξ
  s.t. ∀(ȳ₁,...,ȳ_n) ∈ Yⁿ:
    (1/n) βᵀ L⁻¹ Σᵢ₌₁ⁿ Δ(y_i,ȳ_i) ( K(x_i,y_i,x₁,y₁) − K(x_i,ȳ_i,x₁,y₁)
                                     ...
                                     K(x_i,y_i,x_k,y_k) − K(x_i,ȳ_i,x_k,y_k) ) ≥ (1/n) Σᵢ₌₁ⁿ Δ(y_i,ȳ_i) − ξ

Intuitively, the values of the kernel K(.) with each of the k basis functions form a new feature vector Ψ'(x,y)ᵀ = (K(x,y,x₁,y₁),...,K(x,y,x_k,y_k)) describing each example (x,y). After multiplication with L⁻¹, OP8 and OP9 become identical to a problem with linear kernel and k features, and it is straightforward to see that Algorithms 3 and 4 apply to this new representation.

Corollary 2. (TIME COMPLEXITY OF ALGORITHMS 3 AND 4 FOR NON-LINEAR KERNEL) For any n training examples S = ((x₁,y₁),...,(x_n,y_n)) with max_{i,ȳ} ||Ψ(x_i,y_i) − Ψ(x_i,ȳ)||² ≤ R² < ∞ and max_{i,ȳ} Δ(y_i,ȳ) ≤ Δ < ∞ for all n, the 1-slack cutting-plane Algorithms 3 and 4 using a non-linear kernel

• require at most O(n) calls to the separation oracle,
• require at most O(n²) computation time outside the separation oracle,
• require at most O(nk + k³) computation time outside the separation oracle, if a set of k basis functions is used,
• find a solution where the number of support vectors does not depend on n,

for any fixed value of C > 0 and ε > 0.

Proof. The proof is analogous to that of Corollary 1. For the low-rank approximation, note that it is more efficient to compute wᵀ = βᵀL⁻¹ once before entering the loop in Line 5, than to compute L⁻¹Ψ'(x,y) for each example.
k³ is the cost of the Cholesky decomposition, but this needs to be computed only once.
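As a sketch of the construction in this section (with our own, hypothetical helper names and a toy joint kernel; the randomly chosen basis pairs correspond to the "simplest case" mentioned above), the following computes Ψ'(x,y) for every example and multiplies by L⁻¹, after which a linear 1-slack solver can be run on the resulting k-dimensional features:

```python
import numpy as np

def low_rank_features(K_fn, basis, examples, jitter=1e-10):
    """Map each pair (x, y) to the k-dim vector L^{-1} Psi'(x, y), where
    Psi'(x, y) = (K(x, y, x_1, y_1), ..., K(x, y, x_k, y_k))^T and L is
    the Cholesky factor of the k x k Gram matrix of the basis pairs."""
    k = len(basis)
    G = np.array([[K_fn(*bi, *bj) for bj in basis] for bi in basis])
    L = np.linalg.cholesky(G + jitter * np.eye(k))  # O(k^3), computed once
    # n x k kernel evaluations: row i is Psi'(x_i, y_i)
    P = np.array([[K_fn(*e, *b) for b in basis] for e in examples])
    # Triangular solve instead of forming L^{-1} explicitly
    return np.linalg.solve(L, P.T).T

# Toy joint kernel (ours, for illustration): RBF on x times a label indicator
def K_fn(x1, y1, x2, y2):
    return np.exp(-np.sum((x1 - x2) ** 2)) * (1.0 if y1 == y2 else 0.0)

rng = np.random.default_rng(0)
examples = [(rng.standard_normal(3), lab) for lab in (0, 1, 0, 1, 1)]
basis = examples[:3]          # simplest case: training pairs as the basis
F = low_rank_features(K_fn, basis, examples)
assert F.shape == (5, 3)      # one k-dimensional feature vector per example
```

The basis rows of F reproduce the basis Gram matrix (F[:3] F[:3]ᵀ ≈ K up to the jitter), which is exactly the property that makes OP8/OP9 equivalent to a linear problem in k features when w lies in the spanned subspace.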
4 Implementation

We implemented both the 1-slack algorithms and the n-slack algorithms in a software package called SVMstruct, which we make publicly available for download at http://svmlight.joachims.org. SVMstruct uses SVM-light as the optimizer for solving the QP sub-problems. Users may adapt SVMstruct to their own structural learning tasks by implementing API functions corresponding to task-specific Ψ, Δ, separation oracle, and inference. User API functions are in C. A popular extension is SVMpython, which allows users to write API functions in Python instead, and eliminates much of the drudge work of C including model serialization/deserialization and memory management.

An efficient implementation of the algorithms required a variety of design decisions, which are summarized in the following. These design decisions have a substantial influence on the practical efficiency of the algorithms.

Restarting the QP Sub-Problem Solver from the Previous Solution. Instead of solving each QP sub-problem from scratch, we restart the optimizer from the dual solution of the previous working set as the starting point. This applies to both the 1-slack and the n-slack algorithms.

Batch Updates for the n-Slack Algorithm. Algorithm 1 recomputes the solution of the QP sub-problem after each update to the working set. While this allows the algorithm to potentially find better constraints to be added in each step, it requires a lot of time in the QP solver. We found that it is more efficient to wait with recomputing the solution of the QP sub-problem until 100 constraints have been added.

Managing the Accuracy of the QP Sub-Problem Solver. In the initial iterations, a relatively low-precision solution of the QP sub-problems is sufficient for identifying the next violated constraint to add to the working set. We therefore adjust the precision of the QP sub-problem optimizer throughout the optimization process for all algorithms.

Removing Inactive Constraints from the Working Set.
For both the 1-slack and the n-slack algorithm, constraints that were added to the working set in early iterations often become inactive later in the optimization process. These constraints can be removed without affecting the theoretical convergence guarantees of the algorithm, leading to smaller QPs being solved in each iteration. At the end of each iteration, we therefore remove constraints from the working set that have not been active in the last 50 QP sub-problems.

Caching Ψ(x_i,y_i) − Ψ(x_i,ŷ_i) in the 1-Slack Algorithm. If the separation oracle returns a label ŷ_i for an example x_i, the constraint added in the n-slack algorithm ensures that this label will never again produce an ε-violated constraint in a subsequent iteration. This is different, however, in the 1-slack algorithm, where the same label can be involved in an ε-violated constraint over and over again. We therefore cache the f most recently used Ψ(x_i,y_i) − Ψ(x_i,ŷ_i) for each training example x_i (typically f = 10 in the following experiments). Let's denote the cache for example x_i with C_i.
Instead of asking the separation oracle in every iteration, the algorithm first tries to construct a sufficiently violated constraint from the caches via

  for i = 1,...,n do
    ŷ_i ← argmax_{ŷ ∈ C_i} {Δ(y_i,ŷ) + wᵀΨ(x_i,ŷ)}
  end for

or the analogous variant for the case of slack-rescaling. Only if this fails will the algorithm ask the separation oracle. The goal of this caching strategy is to decrease the number of calls to the separation oracle. Note that in many applications, the separation oracle is very expensive (e.g., CFG parsing).

Parallelization. While currently not implemented, the loop in Lines 5-7 of the 1-slack algorithms can easily be parallelized. In principle, one could make use of up to n parallel threads, each computing the separation oracle for a subset of the training sample. For applications like CFG parsing, where more than 98% of the overall runtime is spent on the separation oracle (see Section 5), parallelizing this loop will lead to a substantial speed-up that should be almost linear in the number of threads.

Solving the Dual of the QP Sub-Problems in the 1-Slack Algorithm. As indicated by Theorems 5 and 6, the working sets in the 1-slack algorithm stay small independent of the size of the training set. In practice, typically less than 100 constraints are active at the solution, and we never encountered a single instance where the working set grew beyond 1000 constraints. This makes it advantageous to store and solve the QP sub-problems in the dual instead of in the primal, since the dual is not affected by the dimensionality of Ψ(x,y). The algorithm explicitly stores the Hessian H of the dual and adds or deletes a row/column whenever a constraint is added or removed from the working set. Note that this is not feasible for the n-slack algorithm, since the working set size is typically orders of magnitude larger (often > 100,000 constraints).
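The cache-first search just described can be sketched as follows (margin rescaling; all function and variable names are illustrative, not the C API of SVMstruct):

```python
import numpy as np
from collections import deque

def violation(w, examples, ybars, psi, loss, xi):
    """Margin-rescaling violation of the joint 1-slack constraint."""
    total = sum(loss(y, yb) - w @ (psi(x, y) - psi(x, yb))
                for (x, y), yb in zip(examples, ybars))
    return total / len(examples) - xi

def next_constraint(w, examples, caches, oracle, psi, loss, xi, eps):
    """Try to build a sufficiently violated joint constraint from the
    per-example LRU caches; fall back to the separation oracle only if
    the cached constraint is not violated by more than eps."""
    # ybar_i <- argmax_{yhat in C_i} loss(y_i, yhat) + w . psi(x_i, yhat)
    cached = [max(c, key=lambda yh: loss(y, yh) + w @ psi(x, yh))
              for (x, y), c in zip(examples, caches)]
    if violation(w, examples, cached, psi, loss, xi) > eps:
        return cached                       # no oracle call needed
    fresh = [oracle(w, x, y) for (x, y) in examples]   # expensive step
    for c, yh in zip(caches, fresh):        # least-recently-used update
        if yh in c:
            c.remove(yh)
        c.append(yh)                        # deque(maxlen=f) evicts oldest
    return fresh

# Toy binary instance: caches seeded with the correct labels only,
# so the first call must fall through to the oracle.
examples = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
psi = lambda x, y: 0.5 * y * x
loss = lambda y, yh: 0.0 if y == yh else 1.0
oracle = lambda w, x, y: max([-1, 1], key=lambda yh: loss(y, yh) + w @ psi(x, yh))
caches = [deque([lab], maxlen=10) for (_, lab) in examples]
ybar = next_constraint(np.zeros(2), examples, caches, oracle, psi, loss, xi=0.0, eps=0.1)
assert ybar == [-1, 1]     # most violated labels under w = 0
```

The deque capacity plays the role of f (f = 10 in the experiments below): once an oracle answer has been cached, a later call with the same w can be served entirely from the caches without touching the oracle.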
5 Experiments

For the experiments in this paper we will consider the following four applications, namely binary classification, multi-class classification, sequence tagging with linear-chain HMMs, and CFG grammar learning. They cover the whole spectrum of possible applications, from multi-class classification involving a simple Y of low cardinality and with a very inexpensive separation oracle, to CFG parsing with large and complex structural objects and an expensive separation oracle. The particular setup for the different applications is as follows.

Binary Classification. For binary classification, X = R^N and Y = {−1,+1}. Using

  Ψ(x,y) = ½ yx   and   Δ(y,ȳ) = 100·[y ≠ ȳ] = { 0 if y = ȳ, 100 otherwise }    (29)
in the 1-slack formulation, OP4 results in the algorithm presented in (Joachims, 2006) and implemented in the SVM-perf software³. In the n-slack formulation, one immediately recovers Vapnik et al.'s original classification SVM formulation of OP1 (Cortes and Vapnik, 1995; Vapnik, 1998) (up to the more convenient percentage-scale rescaling of the loss function and the absence of the bias term), which we solve using SVM-light.

Multi-Class Classification. This is another simple instance of a structural SVM, where X = R^N and Y = {1,...,k}. Using Δ(y,ȳ) = 100·[y ≠ ȳ] and

  Ψ_multi(x,y) = (0, ..., 0, x, 0, ..., 0)ᵀ    (30)

where the feature vector x is stacked into position y, the resulting 1-slack problem becomes identical to the multi-class SVM of Crammer and Singer (2001). Our SVM-multiclass (V2.3) implementation³ is also built via the SVMstruct API. The argmax for the separation oracle and the prediction are computed by explicit enumeration. We use the Covertype dataset of Blackard, Jock & Dean as our benchmark for the multi-class SVM. It is a 7-class problem with n = 522,911 examples and 54 features. This means that the dimensionality of Ψ(x,y) is N = 378.

Sequence Tagging with Linear-Chain HMMs. In sequence tagging (e.g., part-of-speech tagging) each input x = (x₁,...,x_l) is a sequence of feature vectors (one for each word), and y = (y₁,...,y_l) is a sequence of labels y_i ∈ {1,...,k} of matching length. Isomorphic to a linear-chain HMM, we model dependencies between each y_i and x_i, as well as dependencies between y_i and y_{i−1}. Using the definition of Ψ_multi(x,y) from above, this leads to a joint feature vector of

  Ψ_HMM((x₁,...,x_l),(y₁,...,y_l)) = ( Σᵢ Ψ_multi(x_i,y_i)
                                       Σᵢ [y_{i−1} = 1][y_i = 1]
                                       Σᵢ [y_{i−1} = 1][y_i = 2]
                                       ...
                                       Σᵢ [y_{i−1} = k][y_i = k] )    (31)

We use the number of misclassified tags Δ((y₁,...,y_l),(ȳ₁,...,ȳ_l)) = Σᵢ₌₁ˡ [y_i ≠ ȳ_i] as the loss function. The argmax for prediction and the separation oracle are both computed via the Viterbi algorithm. Note that the separation oracle is equivalent to the

³ Available at svmlight.joachims.org
prediction argmax after adding 1 to the node potentials of all incorrect labels. Our SVM-HMM (V3.0) implementation based on SVMstruct is also available online³.

We evaluate on the part-of-speech tagging dataset from the Penn Treebank corpus (Marcus et al, 1993). After splitting the dataset into training and test set, it has n = 35,531 training examples (i.e., sentences), leading to a total of 854,022 tags over k = 43 labels. The feature vectors x_i describing each word consist of binary features, each indicating the presence of a particular prefix or suffix in the current word, the previous word, and the following word. All prefixes and suffixes observed in the training data are used as features. In addition, there are features encoding the length of the word. The total number of features is approximately 430,000, leading to a Ψ_HMM(x,y) of dimensionality N = 18,573,781.

Parsing with Context-Free Grammars. We use natural language parsing as an example application where the cost of computing the separation oracle is comparatively high. Here, each input x = (x₁,...,x_l) is a sequence of feature vectors (one for each word), and y is a tree with x as its leaves. Admissible trees are those that can be constructed from a given set of grammar rules, in our case all grammar rules observed in the training data. As the loss function, we use Δ(y,ȳ) = 100·[y ≠ ȳ], and Ψ_CFG(x,y) has one feature per grammar rule that counts how often this rule was applied in y. The argmax for prediction can be computed efficiently using a CKY parser. We use the CKY parser implementation⁴ of Johnson (1998). For the separation oracle the same CKY parser is used after extending it to also return the second-best solution. Again, our SVM-CFG (V3.0) implementation based on SVMstruct is available online³.

For the following experiments, we use all sentences with at most 15 words from the Penn Treebank corpus (Marcus et al, 1993). Restricting the dataset to short sentences is not due to a limitation of SVMstruct, but due to the CKY implementation we are using.
It becomes very slow for long sentences. Faster parsers that use pruning could easily handle longer sentences as well. After splitting the data into training and test set, we have n = 9,780 training examples (i.e., sentences) and Ψ_CFG(x,y) has a dimensionality of N = 54,655.

5.1 Experiment Setup

Unless noted otherwise, the following parameters are used in the experiments reported below. Both the 1-slack (SVMstruct options -w 3 and -w 4 with caching) and the n-slack algorithms (option -w 0) use ε = 0.1 as the stopping criterion (option -e 0.1). Given the scaling of the loss for multi-class classification and CFG parsing, this corresponds to a precision of approximately 0.1% of the empirical risk for the 1-slack algorithm, and it is slightly higher for the HMM problem. For the n-slack problem it is harder to interpret the meaning of this ε, but we will see in Section 5.7 that it gives solutions of comparable precision. As the value

⁴ Available at http://www.cog.brown.edu/~mj/software.htm
Table 1 Training CPU-time (in hours), number of calls to the separation oracle, and number of support vectors for both the 1-slack (with caching) and the n-slack algorithm. n is the number of training examples and N is the number of features in Ψ(x,y).

                n           N      CPU-Time          # Sep. Oracle             # Support Vec.
                                 1-slack n-slack   1-slack     n-slack       1-slack   n-slack
  MultiC  522,911         378      1.05    80.56   4,183,288   10,981,311        98    334,524
  HMM      35,531  18,573,781      0.90    77.00   1,134,647    4,476,906        39     83,126
  CFG       9,780      54,655      2.90     8.52     224,940      479,220        70     12,890

of C, we use the setting that achieves the best prediction performance on the test set when using the full training set (C = 10,000,000 for multi-class classification, C = 5,000 for HMM sequence tagging, and C = 20,000 for CFG parsing) (option -c). As the cache size we use f = 10 (option -f 10). For multi-class classification, margin-rescaling and slack-rescaling are equivalent. For the other two problems we use margin-rescaling (option -o 2). Whenever possible, runtime comparisons are done on the full training set. All experiments are run on 3.6 GHz Intel Xeon processors with 4GB of main memory under Linux.

5.2 How Fast is the 1-Slack Algorithm Compared to the n-Slack Algorithm?

We first examine absolute runtimes of the 1-slack algorithm, and then analyze and explain various aspects of its scaling behavior in the following. Table 1 shows the CPU-time that both the 1-slack and the n-slack algorithm take on the multi-class, sequence tagging, and parsing benchmark problems. For all problems, the 1-slack algorithm is substantially faster, for multi-class and HMM by several orders of magnitude. The speed-up is largest for the multi-class problem, which has the least expensive separation oracle. Not counting constraints constructed from the cache, less than 1% of the time is spent on the separation oracle for the multi-class problem, while it is 5% for the HMM and 98% for CFG parsing. Therefore, it is interesting to also compare the number of calls to the separation oracle.
In all cases, Table 1 shows that the 1-slack algorithm requires fewer calls by a factor between 2 and 4, accounting for much of the time saved on the CFG problem. The most striking difference between the two algorithms lies in the number of support vectors they produce (i.e., the number of dual variables that are non-zero). For the n-slack algorithm, the number of support vectors lies in the tens or hundreds of thousands, while all solutions produced by the 1-slack algorithm have only about 100 support vectors. This means that the working sets that need to be solved in each iteration are orders of magnitude smaller in the 1-slack algorithm, accounting for only 26% of the overall runtime in the multi-class experiment compared to more than 99% for the n-slack algorithm. We will further analyze this in the following.
Table 2 Training CPU-time (in seconds) for five binary classification problems comparing the 1-slack algorithm (without caching) with SVM-light. n is the number of training examples, N is the number of features, and s is the fraction of non-zero elements of the feature vectors. The SVM-light results are quoted from (Joachims, 2006), the 1-slack results are re-run with the latest version of SVM-struct using the same experiment setup as in (Joachims, 2006).

                        n        N        s       CPU-Time               # Support Vec.
                                                1-slack  SVM-light     1-slack  SVM-light
  Reuters CCAT    804,414   47,236    0.16%       58.0    20,075.5          8     230,388
  Reuters C11     804,414   47,236    0.16%        7.3     5,187.4          6      60,748
  ArXiv Astro-ph   62,369   99,757    0.08%        4.4        80.1          9          38
  Covertype 1     522,911       54   22.22%       53.4    25,514.3         27     279,092
  KDD04 Physics   150,000       78   38.42%        9.2     1,040.2          3       9,923

5.3 How Fast is the 1-Slack Algorithm Compared to Conventional SVM Training Algorithms?

Since most work on training algorithms for SVMs was done for binary classification, we compare the 1-slack algorithms against algorithms for the special case of binary classification. While there are training algorithms for linear SVMs that scale linearly with n (e.g., Lagrangian SVM (Mangasarian and Musicant, 2001) (using the squared slacks ξᵢ²), Proximal SVM (Fung and Mangasarian, 2001) (using an L2 regression loss), and Interior Point Methods (Ferris and Munson, 2003)), they use the Sherman-Morrison-Woodbury formula (or matrix factorizations) for inverting the Hessian of the dual. This requires operating on N×N matrices, which makes them applicable only for problems with small N. The L2-SVM-MFN method (Keerthi and DeCoste, 2005) avoids explicitly representing N×N matrices by using conjugate gradient techniques. While the worst-case cost is still O(sn min(n,N)) per iteration for feature vectors with sparsity s, they observe that their method empirically scales much better. The discussion in (Joachims, 2006) concludes that runtime is comparable to the 1-slack algorithm implemented in SVM-perf.
The 1-slack algorithm scales linearly in both n and the sparsity s of the feature vectors, even if the total number N of features is large (Joachims, 2006). Note that it is unclear whether any of the conventional algorithms can be extended to structural SVM training.

The most widely used algorithms for training binary SVMs are decomposition methods like SVM-light (Joachims, 1999), SMO (Platt, 1999), and others (Chang and Lin, 2001; Collobert and Bengio, 2001). Taskar et al. (Taskar et al, 2003) extended the SMO algorithm to structured prediction problems based on their polynomial-size reformulation of the n-slack optimization problem OP2 for the special case of decomposable models and decomposable loss functions. In the case of binary classification, their SMO algorithm reduces to a variant of the traditional SMO algorithm, which can be seen as a special case of the SVM-light algorithm. We therefore use SVM-light as a representative of the class of decomposition methods.

Table 2 compares the runtime of the 1-slack algorithm to SVM-light on five benchmark problems with varying numbers of features, sparsity, and numbers of training examples. The benchmarks include two text classification problems from
the Reuters RCV1 collection⁵ (Lewis et al, 2004), a problem of classifying ArXiv abstracts, a binary classifier for class 1 of the Covertype dataset⁶ of Blackard, Jock & Dean, and the KDD04 Physics task from the KDD-Cup 2004 (Caruana et al, 2004). In all cases, the 1-slack algorithm is faster than SVM-light, which is highly optimized for binary classification. On large datasets, the difference spans several orders of magnitude.

After the 1-slack algorithm was originally introduced, new stochastic subgradient descent methods were proposed that are competitive in runtime for classification SVMs, especially the PEGASOS algorithm (Shalev-Shwartz et al, 2007). While currently only explored for classification, it should be possible to extend PEGASOS also to structured prediction problems. Unlike exponentiated gradient methods (Bartlett et al, 2004; Globerson et al, 2007), PEGASOS does not require the computation of marginals, which makes it equally easy to apply as cutting-plane methods. However, unlike for our cutting-plane methods, where the theory provides a practically effective stopping criterion, it is less clear when to stop primal stochastic subgradient methods. Since they do not maintain a dual program, the duality gap cannot be used to characterize the quality of the solution at termination. Furthermore, there is a question of how to incorporate caching into stochastic subgradient methods while still maintaining fast convergence. As shown in the following, caching is essential for problems where the separation oracle (or, equivalently, the computation of subgradients) is expensive (e.g., CFG parsing).

5.4 How does Training Time Scale with the Number of Training Examples?

A key question is the scalability of the algorithm for large datasets. While Corollary 1 shows that an upper bound on the training time scales linearly with the number of training examples, the actual behavior underneath this bound could potentially be different.
Figure 1 shows how training time relates to the number of training examples for the three structural prediction problems. For the multi-class and the HMM problem, training time does indeed scale at most linearly as predicted by Corollary 1, both with and without using the cache. However, the cache helps for larger datasets, and there is a large advantage from using the cache over the whole range for CFG parsing. This is to be expected, given the high cost of the separation oracle in the case of parsing.

As shown in Figure 2, the scaling behavior of the 1-slack algorithm remains essentially unchanged even when the regularization parameter C is not held constant, but is set to the value that gives optimal prediction performance on the test set for each training set size. The scaling with C is analyzed in more detail in Section 5.9.

⁵ http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm
⁶ http://www.ics.uci.edu/~mlearn/MLRepository.html
Fig. 1 Training times for multi-class classification (left), HMM part-of-speech tagging (middle), and CFG parsing (right) as a function of n for the 1-slack algorithm, the n-slack algorithm, and the 1-slack algorithm with caching. (Log-log plots of CPU-seconds vs. number of training examples, with an O(x) reference line.)

Fig. 2 Training times as a function of n using the optimal value of C at each training set size for the 1-slack algorithm (left) and the 1-slack algorithm with caching (right). (Log-log plots of CPU-seconds vs. number of training examples, with an O(x) reference line.)

The n-slack algorithm scales super-linearly for all problems, but so does the 1-slack algorithm for CFG parsing. This can be explained as follows. Since the grammar is constructed from all rules observed in the training data, the number of grammar rules grows with the number of training examples. Even from the second-largest to the largest training set, the number of rules in the grammar still grows by almost 70% (3,550 rules vs. 5,821 rules). This has two effects. First, the separation oracle becomes slower, since its time scales with the number of rules in the grammar. In particular, the time the CFG parser takes to compute a single argmax increases more than six-fold from the smallest to the largest training set. Second, additional rules (in particular unary rules) introduce additional features and allow the construction of larger and larger wrong trees ȳ, which means that R² = max_{i,ȳ} ||Ψ(x_i,y_i) − Ψ(x_i,ȳ)||² is not constant but grows.
Indeed, Figure 3 shows that, consistent with Theorem 5, the number of iterations of the 1-slack
algorithm is roughly constant for multi-class classification and the HMM⁷, while it grows slowly for CFG parsing. Finally, note that in Figure 3 the difference in the number of iterations of the algorithm without caching (left) and with caching (right) is small. Despite the fact that the constraint from the cache is typically not the overall most violated constraint, but only a sufficiently violated constraint, both versions of the algorithm appear to make similar progress in each iteration.

Fig. 3 Number of iterations as a function of n for the 1-slack algorithm (left) and the 1-slack algorithm with caching (right), for multi-class classification, the HMM, and CFG parsing.

Fig. 4 Number of support vectors for multi-class classification (left), HMM part-of-speech tagging (middle), and CFG parsing (right) as a function of n for the 1-slack algorithm, the n-slack algorithm, and the 1-slack algorithm with caching.

⁷ Note that the HMM always considers all possible rules in the regular language, so that there is no growth in the number of rules once all symbols are added.
Fig. 5 Number of calls to the separation oracle for multi-class classification (left), HMM part-of-speech tagging (middle), and CFG parsing (right) as a function of n for the 1-slack algorithm, the n-slack algorithm, and the 1-slack algorithm with caching. (Log-log plots with an O(x) reference line.)

5.5 What is the Size of the Working Set?

As already noted above, the size of the working set and its scaling has a substantial influence on the overall efficiency of the algorithm. In particular, large (and growing) working sets will make it expensive to solve the quadratic programs. While the number of iterations is an upper bound on the working set size for the 1-slack algorithm, the number of support vectors shown in Figure 4 gives a much better idea of its size, since we are removing inactive constraints from the working set. For the 1-slack algorithm, Figure 4 shows that the number of support vectors does not systematically grow with n for any of the problems, making it easy to solve the working set QPs even for large datasets. This is very much in contrast to the n-slack algorithm, where the growing number of support vectors makes each iteration increasingly costly, and is starting to push the limits of what can be kept in main memory.

5.6 How often is the Separation Oracle Called?

Next to solving the working set QPs in each iteration, computing the separation oracle is the other major expense in each iteration. We now investigate how the number of calls to the separation oracle scales with n, and how this is influenced by caching.
Figure 5 shows that for all algorithms the number of calls scales linearly with n for the multi-class problem and the HMM. It is slightly super-linear for CFG parsing due to the increasing number of iterations, as discussed above. For all problems and training-set sizes, the 1-slack algorithm with caching requires the fewest calls. The size of the cache has surprisingly little influence on the reduction of calls to the separation oracle. Figure 6 shows that a cache of size f = 5 already provides all
of the benefits, and that larger cache sizes do not further reduce the number of calls. However, we conjecture that this might be an artifact of our simple least-recently-used caching strategy, and that improved caching methods that selectively call the separation oracle for only a well-chosen subset of the examples will provide further benefits.

Fig. 6 Number of calls to the separation oracle as a function of cache size for the 1-slack algorithm. [Figure: single log-scale panel, Calls to Separation Oracle vs. Size of Cache (1, 2, 5, 10, 20), with curves for Multi-Class, HMM, and CFG.]

5.7 Are the Solutions Different?

Since the stopping criteria are different in the 1-slack and the n-slack algorithms, it remains to verify that they do indeed compute solutions of comparable effectiveness. The plot in Figure 7 shows the dual objective value of the 1-slack solution relative to the n-slack solution. A value below zero indicates that the n-slack solution has a better dual objective value, while a positive value shows by which fraction the 1-slack objective is higher than the n-slack objective. For all values of C the solutions are very close for the multi-class problem and for CFG parsing, and so are their prediction performances on the test set (see the table in Figure 7). This is not surprising, since for both the 1-slack and the n-slack formulation the respective ε bounds the duality gap by Cε. For the HMM, however, this Cε is a substantial fraction of the objective value at the solution, especially for large values of C. Since the training data is almost linearly separable for the HMM, Cε becomes a substantial part of the slack contribution to the objective value. Furthermore, note the different scaling of the HMM loss (i.e., the number of misclassified tags in the sentence), which is roughly 5 times smaller than the loss function on the other problems (i.e., 0 to 100 scale). So, an ε = 0.1 on the HMM problem is comparable to an ε = 0.5 on the other problems. Nevertheless, with a per-token test error rate of 3.29% for the 1-slack solution, the
prediction accuracy is even slightly better than the 3.31% error rate of the n-slack solution.

Fig. 7 Relative difference in dual objective value of the solutions found by the 1-slack algorithm and by the n-slack algorithm as a function of C at the maximum training set size (left), and test-set prediction performance for the optimal value of C (right). [Figure: (Obj_1 - Obj_n)/Obj_n plotted against C on a log scale for Multi-Class, HMM, and CFG, together with the following table:]

  Task     Measure           1-slack   n-slack
  MultiC   Accuracy          72.33     72.35
  HMM      Token Accuracy    96.71     96.69
  CFG      Bracket F1        70.22     70.09

Fig. 8 Number of iterations for the 1-slack algorithm (left) and number of calls to the separation oracle for the 1-slack algorithm with caching (right) as a function of ε at the maximum training set size. [Figure: two log-log panels against Epsilon, with an O(1/eps) reference line.]

5.8 How does the 1-Slack Algorithm Scale with ε?

While the scaling with n is the most important criterion from a practical perspective, it is also interesting to look at the scaling with ε. Theorem 5 shows that the number of iterations (and therefore the number of calls to the separation oracle) scales as O(1/ε) in the worst case. Figure 8, however, shows that the scaling is much better in practice. In particular, the number of calls to the separation oracle is largely independent of
ε and remains constant when caching is used. It seems that the additional iterations can be done almost entirely from the cache.

Fig. 9 Number of iterations for the 1-slack algorithm (left) and number of calls to the separation oracle for the 1-slack algorithm with caching (right) as a function of C at the maximum training set size. [Figure: two log-log panels against C, with an O(C) reference line.]

5.9 How does the 1-Slack Algorithm Scale with C?

With increasing training set size, the optimal value of C will typically change (some theoretical results suggest an increase on the order of √n). In practice, finding the optimal value of C typically requires training over a large range of C values as part of a cross-validation experiment. It is therefore interesting to know how the algorithm scales with C. While Theorem 5 bounds the number of iterations with O(C), Figure 9 shows that the actual scaling is again much better. The number of iterations increases more slowly than linearly in C on all problems. Furthermore, as already observed for ε above, the additional iterations are almost entirely based on the cache, so that C has hardly any influence on the number of calls to the separation oracle.

6 Conclusions

We presented a cutting-plane algorithm for training structural SVMs. Unlike existing cutting-plane methods for this problem, the number of constraints that are generated does not depend on the number of training examples, but only on C and the desired precision ε. Empirically, the new algorithm is substantially faster than existing methods, in particular decomposition methods like SMO and SVM-light, and it includes the training algorithm of Joachims (2006) for linear binary classification SVMs as a special case. An implementation of the algorithm is available online
with instances for multi-class classification, HMM sequence tagging, CFG parsing, and binary classification.

Acknowledgements We thank Evan Herbst for implementing a prototype of the HMM instance of SVMstruct, which was used in some of our preliminary experiments. This work was supported in part through the grant NSF IIS-0713483 from the National Science Foundation and through a gift from Yahoo!.

References

Altun Y, Tsochantaridis I, Hofmann T (2003) Hidden Markov support vector machines. In: International Conference on Machine Learning (ICML), pp 3-10
Anguelov D, Taskar B, Chatalbashev V, Koller D, Gupta D, Heitz G, Ng AY (2005) Discriminative learning of Markov random fields for segmentation of 3D scan data. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, pp 169-176
Bartlett P, Collins M, Taskar B, McAllester D (2004) Exponentiated gradient algorithms for large-margin structured classification. In: Advances in Neural Information Processing Systems (NIPS), pp 305-312
Caruana R, Joachims T, Backstrom L (2004) KDDCup 2004: Results and analysis. ACM SIGKDD Newsletter 6(2):95-108
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Collins M (2002) Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In: Empirical Methods in Natural Language Processing (EMNLP), pp 1-8
Collins M (2004) Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In: New Developments in Parsing Technology, Kluwer, (paper accompanied invited talk at IWPT 2001)
Collins M, Duffy N (2002) New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In: Annual Meeting of the Association for Computational Linguistics (ACL), pp 263-270
Collobert R, Bengio S (2001) SVMTorch: Support vector machines for large-scale regression problems.
Journal of Machine Learning Research (JMLR) 1:143-160
Cortes C, Vapnik VN (1995) Support vector networks. Machine Learning 20:273-297
Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR) 2:265-292
Crammer K, Singer Y (2003) Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research (JMLR) 3:951-991
Ferris M, Munson T (2003) Interior-point methods for massive support vector machines. SIAM Journal of Optimization 13(3):783-804
Fukumizu K, Bach F, Jordan M (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research (JMLR) 5:73-99
Fung G, Mangasarian O (2001) Proximal support vector classifiers. In: ACM Conference on Knowledge Discovery and Data Mining (KDD), pp 77-86
Globerson A, Koo TY, Carreras X, Collins M (2007) Exponentiated gradient algorithms for log-linear structured prediction. In: International Conference on Machine Learning (ICML), pp 305-312
Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, chap 11, pp 169-184
Joachims T (2003) Learning to align sequences: A maximum-margin approach, online manuscript
Joachims T (2005) A support vector method for multivariate performance measures. In: International Conference on Machine Learning (ICML), pp 377-384
Joachims T (2006) Training linear SVMs in linear time. In: ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (KDD), pp 217-226
Johnson M (1998) PCFG models of linguistic tree representations. Computational Linguistics 24(4):613-632
Keerthi S, DeCoste D (2005) A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research (JMLR) 6:341-361
Keerthi S, Chapelle O, DeCoste D (2006) Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research (JMLR) 7:1493-1515
Kivinen J, Warmuth MK (1997) Exponentiated gradient versus gradient descent for linear predictors. Information and Computation 132(1):1-63
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: International Conference on Machine Learning (ICML)
Lewis D, Yang Y, Rose T, Li F (2004) RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR) 5:361-397
Mangasarian O, Musicant D (2001) Lagrangian support vector machines.
Journal of Machine Learning Research (JMLR) 1:161-177
Marcus M, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313-330
McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Annual Meeting of the Association for Computational Linguistics (ACL), pp 91-98
Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel Methods - Support Vector Learning, MIT-Press, chap 12
Ratliff ND, Bagnell JA, Zinkevich MA (2007) (Online) subgradient methods for structured prediction. In: Conference on Artificial Intelligence and Statistics (AISTATS)
Shalev-Shwartz S, Singer Y, Srebro N (2007) PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM. In: International Conference on Machine Learning (ICML), ACM, pp 807-814
Smola A, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: International Conference on Machine Learning, pp 911-918
Taskar B, Guestrin C, Koller D (2003) Maximum-margin Markov networks. In: Advances in Neural Information Processing Systems (NIPS)
Taskar B, Klein D, Collins M, Koller D, Manning C (2004) Max-margin parsing. In: Empirical Methods in Natural Language Processing (EMNLP)
Taskar B, Lacoste-Julien S, Jordan MI (2005) Structured prediction via the extragradient method. In: Advances in Neural Information Processing Systems (NIPS)
Teo CH, Smola A, Vishwanathan SV, Le QV (2007) A scalable modular convex solver for regularized risk minimization. In: ACM Conference on Knowledge Discovery and Data Mining (KDD), pp 727-736
Tsochantaridis I, Hofmann T, Joachims T, Altun Y (2004) Support vector machine learning for interdependent and structured output spaces. In: International Conference on Machine Learning (ICML), pp 104-112
Tsochantaridis I, Joachims T, Hofmann T, Altun Y (2005) Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR) 6:1453-1484
Vapnik V (1998) Statistical Learning Theory. Wiley, Chichester, GB
Vishwanathan SVN, Schraudolph NN, Schmidt MW, Murphy KP (2006) Accelerated training of conditional random fields with stochastic gradient methods. In: International Conference on Machine Learning (ICML), pp 969-976
Yu CN, Joachims T, Elber R, Pillardy J (2007) Support vector training of protein alignment models. In: Proceeding of the International Conference on Research in Computational Molecular Biology (RECOMB), pp 253-267
Yue Y, Finley T, Radlinski F, Joachims T (2007) A support vector method for optimizing average precision.
In: ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp 271-278

Appendix

Lemma 1. The Wolfe-dual of the 1-slack optimization problem OP4 for margin-rescaling is

\[
\max_{\alpha \ge 0} D(\alpha) = \sum_{\bar{y} \in \bar{\mathcal{Y}}} \Delta(\bar{y})\,\alpha_{\bar{y}}
 - \frac{1}{2} \sum_{\bar{y} \in \bar{\mathcal{Y}}} \sum_{\bar{y}' \in \bar{\mathcal{Y}}} \alpha_{\bar{y}}\,\alpha_{\bar{y}'}\,H^{MR}(\bar{y},\bar{y}')
 \quad \text{s.t.} \quad \sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}} = C,
\]

and the Wolfe-dual of the 1-slack optimization problem OP5 for slack-rescaling is
\[
\max_{\alpha \ge 0} D(\alpha) = \sum_{\bar{y} \in \bar{\mathcal{Y}}} \Delta(\bar{y})\,\alpha_{\bar{y}}
 - \frac{1}{2} \sum_{\bar{y} \in \bar{\mathcal{Y}}} \sum_{\bar{y}' \in \bar{\mathcal{Y}}} \alpha_{\bar{y}}\,\alpha_{\bar{y}'}\,H^{SR}(\bar{y},\bar{y}')
 \quad \text{s.t.} \quad \sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}} = C.
\]

Proof. The Lagrangian of OP4 is

\[
L(w,\xi,\alpha) = \frac{1}{2} w^T w + C\xi
 + \sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}} \left[ \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i,\bar{y}_i) - \xi
 - \frac{1}{n} \sum_{i=1}^{n} w^T \left[ \Psi(x_i,y_i) - \Psi(x_i,\bar{y}_i) \right] \right].
\]

Differentiating with respect to w and setting the derivative to zero gives

\[
w = \sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}} \left( \frac{1}{n} \sum_{i=1}^{n} \left[ \Psi(x_i,y_i) - \Psi(x_i,\bar{y}_i) \right] \right).
\]

Similarly, differentiating with respect to ξ and setting the derivative to zero gives

\[
\sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}} = C.
\]

Plugging w into the Lagrangian with the constraints on α, we obtain the dual problem:

\[
\max \;\; - \frac{1}{2} \sum_{\bar{y} \in \bar{\mathcal{Y}}} \sum_{\bar{y}' \in \bar{\mathcal{Y}}} \alpha_{\bar{y}}\,\alpha_{\bar{y}'}
 \left[ \frac{1}{n} \sum_{i=1}^{n} [\Psi(x_i,y_i) - \Psi(x_i,\bar{y}_i)] \right]^T
 \left[ \frac{1}{n} \sum_{j=1}^{n} [\Psi(x_j,y_j) - \Psi(x_j,\bar{y}'_j)] \right]
 + \sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}}\, \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i,\bar{y}_i)
\]
\[
\text{s.t.} \quad \sum_{\bar{y} \in \bar{\mathcal{Y}}} \alpha_{\bar{y}} = C \quad \text{and} \quad \forall \bar{y} \in \bar{\mathcal{Y}}: \alpha_{\bar{y}} \ge 0.
\]

The derivation of the dual of OP5 is analogous.

Lemma 2. For any unconstrained quadratic program

\[
\max_{\alpha \in \mathbb{R}^n} \{\Theta(\alpha)\} < \infty, \qquad \Theta(\alpha) = h^T \alpha - \frac{1}{2} \alpha^T H \alpha \tag{32}
\]

with positive semi-definite H, and derivative ∇Θ(α) = h - Hα, a line search starting at α along an ascent direction η with maximum step-size C > 0 improves the objective by at least

\[
\max_{0 \le \beta \le C} \{\Theta(\alpha + \beta\eta)\} - \Theta(\alpha)
 \ge \frac{1}{2} \min\left\{ C, \frac{\nabla\Theta(\alpha)^T \eta}{\eta^T H \eta} \right\} \nabla\Theta(\alpha)^T \eta. \tag{33}
\]

Proof. For any β and η, it is easy to verify that

\[
\Theta(\alpha + \beta\eta) - \Theta(\alpha) = \beta\,\nabla\Theta(\alpha)^T \eta - \frac{1}{2} \beta^2 \eta^T H \eta. \tag{34}
\]
Maximizing this expression with respect to an unconstrained β by setting the derivative to zero, the solution β* is

\[
\beta^* = \frac{\nabla\Theta(\alpha)^T \eta}{\eta^T H \eta}. \tag{35}
\]

Note that ηᵀHη is non-negative, since H is positive semi-definite. Furthermore, ηᵀHη ≠ 0, since otherwise η being an ascent direction would contradict max_{α∈ℝⁿ}{Θ(α)} < ∞. Plugging β* into (34) shows that

\[
\max_{\beta \in \mathbb{R}} \{\Theta(\alpha + \beta\eta)\} - \Theta(\alpha)
 = \frac{1}{2} \frac{\left(\nabla\Theta(\alpha)^T \eta\right)^2}{\eta^T H \eta}. \tag{36}
\]

It remains to check whether the unconstrained solution β* fulfills the constraints 0 ≤ β* ≤ C. Since η is an ascent direction, β* is always non-negative. But one needs to consider the case that β* > C, which happens when ∇Θ(α)ᵀη > C ηᵀHη. In that case, the constrained optimum is at β = C due to convexity. Plugging C into (34) shows that

\[
\max_{0 \le \beta \le C} \{\Theta(\alpha + \beta\eta)\} - \Theta(\alpha)
 = C\,\nabla\Theta(\alpha)^T \eta - \frac{1}{2} C^2 \eta^T H \eta \tag{37}
\]
\[
 \ge \frac{1}{2}\, C\,\nabla\Theta(\alpha)^T \eta. \tag{38}
\]

The inequality follows from C < ∇Θ(α)ᵀη / (ηᵀHη). □
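Lemma 2's guarantee is easy to check numerically. The sketch below is our own illustration in plain Python (function names are hypothetical): it computes the clipped line-search step β = min(C, ∇Θ(α)ᵀη / ηᵀHη) from (35), evaluates the objective gain via Eq. (34), and exposes the lower bound of Eq. (33) for comparison.

```python
def line_search_gain(grad_eta, eta_H_eta, C):
    """Return (beta, gain) for maximizing Theta(alpha + beta*eta) over
    0 <= beta <= C, given grad_eta = grad(Theta)(alpha)^T eta > 0 (ascent
    direction) and eta_H_eta = eta^T H eta > 0 (H positive semi-definite).
    gain evaluates Eq. (34): beta*grad_eta - 0.5*beta^2*eta^T H eta."""
    beta = min(C, grad_eta / eta_H_eta)  # clipped Newton step, Eq. (35)
    gain = beta * grad_eta - 0.5 * beta * beta * eta_H_eta
    return beta, gain

def gain_lower_bound(grad_eta, eta_H_eta, C):
    """Guaranteed improvement from Lemma 2, Eq. (33)."""
    return 0.5 * min(C, grad_eta / eta_H_eta) * grad_eta
```

In the unclipped case the bound is tight up to the factor 1/2 in (36); in the clipped case the actual gain from (37) strictly exceeds the bound (38) whenever ηᵀHη > 0.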