Cutting-Plane Training of Structural SVMs
Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu

Abstract Discriminative training approaches like structural SVMs have shown much promise for building highly complex and accurate models in areas like natural language processing, protein structure prediction, and information retrieval. However, current training algorithms are computationally expensive or intractable on large datasets. To overcome this bottleneck, this paper explores how cutting-plane methods can provide fast training not only for classification SVMs, but also for structural SVMs. We show that for an equivalent 1-slack reformulation of the linear SVM training problem, our cutting-plane method has time complexity linear in the number of training examples. In particular, the number of iterations does not depend on the number of training examples, and it is linear in the desired precision and the regularization parameter. Furthermore, we present an extensive empirical evaluation of the method applied to binary classification, multi-class classification, HMM sequence tagging, and CFG parsing. The experiments show that the cutting-plane algorithm is broadly applicable and fast in practice. On large datasets, it is typically several orders of magnitude faster than conventional training methods derived from decomposition methods like SVM-light, or conventional cutting-plane methods. Implementations of our methods are available at

Key words: Structural SVMs, Support Vector Machines, Structured Output Prediction, Training Algorithms

Thorsten Joachims, Dept. of Computer Science, Cornell University, Ithaca, NY, USA, [email protected]
Thomas Finley, Dept. of Computer Science, Cornell University, Ithaca, NY, USA, [email protected]
Chun-Nam John Yu, Dept. of Computer Science, Cornell University, Ithaca, NY, USA, [email protected]
1 Introduction

Consider the problem of learning a function with complex outputs, where the prediction is not a single univariate response (e.g., 0/1 for classification or a real number for regression), but a complex multivariate object. For example, the desired prediction is a tree in natural language parsing, or a total ordering in web search, or an alignment between two amino acid sequences in protein threading. Further instances of such structured prediction problems are ubiquitous in natural language processing, bioinformatics, computer vision, and many other application domains. Recent years have provided intriguing advances in extending methods like Logistic Regression, Perceptrons, and Support Vector Machines (SVMs) to global training of such structured prediction models (e.g., Lafferty et al, 2001; Collins, 2004; Collins and Duffy, 2002; Taskar et al, 2003; Tsochantaridis et al, 2004). In contrast to conventional generative training, these methods are discriminative (e.g., conditional likelihood, empirical risk minimization). Akin to moving from Naive Bayes to an SVM for classification, this provides greater modeling flexibility through avoidance of independence assumptions, and it was shown to provide substantially improved prediction accuracy in many domains (e.g., Lafferty et al, 2001; Taskar et al, 2003; Tsochantaridis et al, 2004; Taskar et al, 2004; Yu et al, 2007). By eliminating the need to model statistical dependencies between features, discriminative training enables us to freely use more complex and possibly interdependent features, which provides the potential to learn models with improved fidelity. However, training these rich models with a sufficiently large training set is often beyond the reach of current discriminative training algorithms. We focus on the problem of training structural SVMs in this paper. Formally, this can be thought of as solving a convex quadratic program (QP) with a large (typically exponential or infinite) number of constraints. Existing algorithms fall into two groups.
The first group of algorithms relies on an elegant polynomial-size reformulation of the training problem (Taskar et al, 2003; Anguelov et al, 2005), which is possible for the special case of margin-rescaling (Tsochantaridis et al, 2005) with linearly decomposable loss. These smaller QPs can then be solved, for example, with general-purpose optimization methods (Anguelov et al, 2005) or decomposition methods similar to SMO (Taskar et al, 2003; Platt, 1999). Unfortunately, decomposition methods are known to scale super-linearly with the number of examples (Platt, 1999; Joachims, 1999), and so do general-purpose optimizers, since they do not exploit the special structure of this optimization problem. But most significantly, the algorithms in the first group are limited to applications where the polynomial-size reformulation exists. Similar restrictions also apply to the extragradient method (Taskar et al, 2005), which applies only to problems where subgradients of the QP can be computed via a convex real relaxation, as well as exponentiated gradient methods (Bartlett et al, 2004; Globerson et al, 2007), which require the ability to compute marginals (e.g. via the sum-product algorithm). The second group of algorithms works directly with the original, exponentially-sized QP. This is feasible, since a polynomially-sized subset of the constraints from the original QP is already sufficient for a solution of arbitrary accuracy (Joachims, 2003; Tsochantaridis
et al, 2005). Such algorithms either take stochastic subgradient steps (Collins, 2002; Ratliff et al, 2007; Shalev-Shwartz et al, 2007), or build a cutting-plane model which is easy to solve directly (Tsochantaridis et al, 2004). The algorithm in (Tsochantaridis et al, 2005) shows how such a cutting-plane model can be constructed efficiently. Compared to the subgradient methods, the cutting-plane approach does not take a single gradient step, but always takes an optimal step in the current cutting-plane model. It requires only the existence of an efficient separation oracle, which makes it applicable to many problems for which no polynomially-sized reformulation is known. In practice, however, the cutting-plane method of Tsochantaridis et al (2005) is known to scale super-linearly with the number of training examples. In particular, since the size of the cutting-plane model typically grows linearly with the dataset size (see (Tsochantaridis et al, 2005) and Section 5.5), QPs of increasing size need to be solved to compute the optimal steps, which leads to the super-linear runtime. In this paper, we explore an extension of the cutting-plane method presented in (Joachims, 2006) for training linear structural SVMs, both in the margin-rescaling and in the slack-rescaling formulation (Tsochantaridis et al, 2005). In contrast to the cutting-plane method presented in (Tsochantaridis et al, 2005), we show that the size of the cutting-plane models and the number of iterations are independent of the number of training examples. Instead, their size and the number of iterations can be upper bounded by O(C/ε), where C is the regularization constant and ε is the desired precision of the solution (see Optimization Problems OP2 and OP3). Since each iteration of the new algorithm takes O(n) time and memory, it also scales O(n) overall with the number of training examples, both in terms of computation time and memory.
Empirically, the size of the cutting-plane models and the QPs that need to be solved in each iteration is typically very small (less than a few hundred) even for problems with millions of features and hundreds of thousands of examples. A key conceptual difference of the new algorithm compared to the algorithm of Tsochantaridis et al (2005) and most other SVM training methods is that not only individual data points are considered as potential Support Vectors (SVs), but also linear combinations of those. This increased flexibility allows for solutions with far fewer non-zero dual variables, and it leads to the small cutting-plane models discussed above. The new algorithm is applicable to all structural SVM problems where the separation oracle can be computed efficiently, which makes it just as widely applicable as the most general training algorithms known to date. Even further, following the original publication in (Joachims, 2006), Teo et al (2007) have already shown that the algorithm can also be extended to Conditional Random Field training. We provide a theoretical analysis of the algorithm's correctness, convergence rate, and scaling behavior for structured prediction. Furthermore, we present empirical results for several structured prediction problems (i.e., multi-class classification, part-of-speech tagging, and natural language parsing), and compare against conventional algorithms also for the special case of binary classification. On all problems, the new algorithm is substantially faster than conventional decomposition methods and cutting-plane methods, often by several orders of magnitude for large datasets.
2 Structural Support Vector Machines

Structured output prediction describes the problem of learning a function h : X → Y, where X is the space of inputs and Y is the space of (multivariate and structured) outputs. In the case of natural language parsing, for example, X is the space of sentences and Y is the space of trees over a given set of non-terminal grammar symbols. To learn h, we assume that a training sample of input-output pairs

S = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n

is available and drawn i.i.d. from a distribution P(X, Y).¹ The goal is to find a function h from some hypothesis space H that has low prediction error or, more generally, low risk

R^Δ_P(h) = ∫_{X × Y} Δ(y, h(x)) dP(x, y).

Δ(y, ȳ) is a loss function that quantifies the loss associated with predicting ȳ when y is the correct output value. Furthermore, we assume that Δ(y, y) = 0 and Δ(y, ȳ) ≥ 0 for y ≠ ȳ. We follow the Empirical Risk Minimization Principle (Vapnik, 1998) to infer a function h from the training sample S. The learner evaluates the quality of a function h ∈ H using the empirical risk R^Δ_S(h) on the training sample S:

R^Δ_S(h) = (1/n) Σ_{i=1}^n Δ(y_i, h(x_i))

Support Vector Machines select an h ∈ H that minimizes a regularized empirical risk on S. For conventional binary classification where Y = {−1, +1}, SVM training is typically formulated as the following convex quadratic optimization problem² (Cortes and Vapnik, 1995; Vapnik, 1998).

Optimization Problem 1 (CLASSIFICATION SVM (PRIMAL))

min_{w, ξ_i ≥ 0}  (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i
s.t.  ∀i ∈ {1, ..., n} : y_i (w^T x_i) ≥ 1 − ξ_i

It was shown that SVM training can be generalized to structured outputs (Altun et al, 2003; Taskar et al, 2003; Tsochantaridis et al, 2004), leading to an optimization problem that is similar to multi-class SVMs (Crammer and Singer, 2001) and extends the Perceptron approach described in (Collins, 2002). The idea is to learn a discriminant function f : X × Y → R over input/output pairs, from which one derives a prediction by maximizing f over all y ∈ Y for a specific given input x:

h_w(x) = argmax_{y ∈ Y} f_w(x, y)

We assume that f_w(x, y) takes the form of a linear function

f_w(x, y) = w^T Ψ(x, y)

where w ∈ R^N is a parameter vector and Ψ(x, y) is a feature vector relating input x and output y. Intuitively, one can think of f_w(x, y) as a compatibility function that measures how well the output y matches the given input x. The flexibility in designing Ψ allows us to employ SVMs to learn models for problems as diverse as natural language parsing (Taskar et al, 2004; Tsochantaridis et al, 2004), protein sequence alignment (Yu et al, 2007), learning ranking functions that optimize IR performance measures (Yue et al, 2007), and segmenting images (Anguelov et al, 2005).

For training the weights w of the linear discriminant function, the standard SVM optimization problem can be generalized in several ways (Altun et al, 2003; Joachims, 2003; Taskar et al, 2003; Tsochantaridis et al, 2004, 2005). This paper uses the formulations given in (Tsochantaridis et al, 2005), which subsume all other approaches. We refer to these as the n-slack formulations, since they assign a different slack variable to each of the n training examples. Tsochantaridis et al (2005) identify two different ways of using a hinge loss to upper bound the loss by a convex function, namely margin-rescaling and slack-rescaling. In margin-rescaling, the position of the hinge is adapted while the slope is fixed,

Δ_MR(y, h_w(x)) = max_{ȳ ∈ Y} {Δ(y, ȳ) − w^T Ψ(x, y) + w^T Ψ(x, ȳ)} ≥ Δ(y, h_w(x))   (1)

while in slack-rescaling, the slope is adjusted while the position of the hinge is fixed.

¹ Note, however, that all formal results in this paper also hold for non-i.i.d. data, since our algorithms do not rely on the order or distribution of the examples.
² For simplicity, we consider the case of hyperplanes passing through the origin. By adding a constant feature, an offset can easily be simulated.
Δ_SR(y, h_w(x)) = max_{ȳ ∈ Y} {Δ(y, ȳ)(1 − w^T Ψ(x, y) + w^T Ψ(x, ȳ))} ≥ Δ(y, h_w(x))   (2)

This leads to the following two training problems, where each slack variable ξ_i is equal to the respective Δ_MR(y_i, h_w(x_i)) or Δ_SR(y_i, h_w(x_i)) for training example (x_i, y_i).
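Both hinge constructions can be sanity-checked numerically. The following sketch (illustrative random values, not an experiment from the paper) treats the scores w^T Ψ(x, ȳ) of one example as a vector s and verifies that both Eq. (1) and Eq. (2) dominate the loss of the prediction h_w(x):

```python
# Check that the margin-rescaling hinge (Eq. 1) and the slack-rescaling hinge
# (Eq. 2) both upper-bound Delta(y, h_w(x)); scores and losses are random
# illustrative values, with the correct label fixed to y = 0.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    k = 6
    s = rng.standard_normal(k)        # s[ybar] = w^T Psi(x, ybar)
    delta = rng.random(k)             # delta[ybar] = Delta(y, ybar)
    delta[0] = 0.0                    # Delta(y, y) = 0 for the correct label
    h = int(np.argmax(s))             # prediction h_w(x)
    mr = max(delta[yb] - s[0] + s[yb] for yb in range(k))        # Eq. (1)
    sr = max(delta[yb] * (1 - s[0] + s[yb]) for yb in range(k))  # Eq. (2)
    assert mr >= delta[h] - 1e-12 and sr >= delta[h] - 1e-12
```

The check succeeds because at ȳ = h_w(x) the score difference s[h] − s[0] is non-negative, which is exactly the argument behind the inequalities in Eqs. (1) and (2).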
Optimization Problem 2 (n-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING (PRIMAL))

min_{w, ξ ≥ 0}  (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i
s.t.  ∀ȳ_1 ∈ Y : w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ Δ(y_1, ȳ_1) − ξ_1
      ...
      ∀ȳ_n ∈ Y : w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ Δ(y_n, ȳ_n) − ξ_n

Optimization Problem 3 (n-SLACK STRUCTURAL SVM WITH SLACK-RESCALING (PRIMAL))

min_{w, ξ ≥ 0}  (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i
s.t.  ∀ȳ_1 ∈ Y : w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ 1 − ξ_1 / Δ(y_1, ȳ_1)
      ...
      ∀ȳ_n ∈ Y : w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ 1 − ξ_n / Δ(y_n, ȳ_n)

The objective is the conventional regularized risk used in SVMs. The constraints state that for each training example (x_i, y_i), the score w^T Ψ(x_i, y_i) of the correct structure y_i must be greater than the score w^T Ψ(x_i, ȳ) of all incorrect structures ȳ by a required margin. This margin is 1 in slack-rescaling, and equal to the loss Δ(y_i, ȳ) in margin-rescaling. If the margin is violated, the slack variable ξ_i of the example becomes non-zero. Note that ξ_i is shared among constraints from the same example. The correct labels y_i are not excluded from the constraints, because they correspond to non-negativity constraints on the slack variables ξ_i. It is easy to verify that for both margin-rescaling and for slack-rescaling, (1/n) Σ_{i=1}^n ξ_i is an upper bound on the empirical risk R^Δ_S(h) on the training sample S. It is not immediately obvious that Optimization Problems OP2 and OP3 can be solved efficiently, since they have O(n|Y|) constraints. Y is typically extremely large (e.g., all possible alignments of two amino-acid sequences) or even infinite (e.g., real-valued outputs). For the special case of margin-rescaling with linearly decomposable loss functions Δ, Taskar et al (2003) have shown that the problem can be reformulated as a quadratic program with only a polynomial number of constraints and variables. A more general algorithm that applies to both margin-rescaling and slack-rescaling under a large variety of loss functions was given in (Tsochantaridis et al, 2004, 2005).
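To make the constraint scores concrete, consider the standard multi-class instantiation of Ψ (an assumed special case for illustration, not the paper's general setting): Ψ(x, y) stacks x into the block of coordinates reserved for class y, so w^T Ψ(x_i, y_i) and w^T Ψ(x_i, ȳ) are simply per-class scores.

```python
# Joint feature map for k-class classification: Psi(x, y) places x in class y's
# coordinate block, so f_w(x, y) = w^T Psi(x, y) is the score of class y and
# h_w(x) = argmax_y f_w(x, y) recovers the usual multi-class prediction rule.
import numpy as np

def psi(x, y, k):
    """Joint feature vector Psi(x, y): x copied into block y, zeros elsewhere."""
    out = np.zeros(k * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def predict(w, x, k):
    """h_w(x) = argmax over y of w^T Psi(x, y)."""
    return int(np.argmax([w @ psi(x, y, k) for y in range(k)]))
```

With w the concatenation of per-class weight vectors, a margin-rescaling constraint for (x_i, y_i) and an incorrect ȳ reads w @ psi(x_i, y_i, k) − w @ psi(x_i, ȳ, k) ≥ Δ(y_i, ȳ) − ξ_i.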
The algorithm relies on the theoretical result that for any desired precision ε, a greedily constructed cutting-plane model of OP2 and OP3 requires only O(n/ε²) many constraints (Joachims, 2003; Tsochantaridis et al, 2005). This greedy algorithm for the case of margin-rescaling is Algorithm 1; for slack-rescaling it leads
Algorithm 1 for training Structural SVMs (with margin-rescaling) via the n-Slack Formulation (OP2).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W_i ← ∅, ξ_i ← 0 for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     ŷ ← argmax_{ŷ ∈ Y} {Δ(y_i, ŷ) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)]}
6:     if Δ(y_i, ŷ) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)] > ξ_i + ε then
7:       W_i ← W_i ∪ {ŷ}
8:       (w, ξ) ← argmin_{w, ξ ≥ 0} (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i
                 s.t. ∀ȳ_1 ∈ W_1 : w^T [Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ Δ(y_1, ȳ_1) − ξ_1
                      ...
                      ∀ȳ_n ∈ W_n : w^T [Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ Δ(y_n, ȳ_n) − ξ_n
9:     end if
10:  end for
11: until no W_i has changed during iteration
12: return (w, ξ)

Algorithm 2 for training Structural SVMs (with slack-rescaling) via the n-Slack Formulation (OP3).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W_i ← ∅, ξ_i ← 0 for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     ŷ ← argmax_{ŷ ∈ Y} {Δ(y_i, ŷ)(1 − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)])}
6:     if Δ(y_i, ŷ)(1 − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)]) > ξ_i + ε then
7:       W_i ← W_i ∪ {ŷ}
8:       (w, ξ) ← argmin_{w, ξ ≥ 0} (1/2) w^T w + (C/n) Σ_{i=1}^n ξ_i
                 s.t. ∀ȳ_1 ∈ W_1 : w^T Δ(y_1, ȳ_1)[Ψ(x_1, y_1) − Ψ(x_1, ȳ_1)] ≥ Δ(y_1, ȳ_1) − ξ_1
                      ...
                      ∀ȳ_n ∈ W_n : w^T Δ(y_n, ȳ_n)[Ψ(x_n, y_n) − Ψ(x_n, ȳ_n)] ≥ Δ(y_n, ȳ_n) − ξ_n
9:     end if
10:  end for
11: until no W_i has changed during iteration
12: return (w, ξ)

to Algorithm 2. The algorithms iteratively construct a working set W = W_1 ∪ ... ∪ W_n of constraints, starting with an empty working set W = ∅. The algorithms iterate through the training examples and find the constraint that is violated most by the current solution w, ξ (Line 5). If this constraint is violated by more than the desired precision ε (Line 6), the constraint is added to the working set (Line 7) and the QP is solved over the extended W (Line 8). The algorithms terminate when no constraint was added in the previous iteration, meaning that all constraints in OP2 or OP3 are fulfilled up to a precision of ε. The algorithm is provably efficient whenever the most violated constraint can be found efficiently. The procedure in Line 5 for finding the most violated constraint is called the separation oracle. The argmax in
Line 5 has an efficient solution for a wide variety of choices for Ψ, Y, and Δ (see e.g., Tsochantaridis et al, 2005; Joachims, 2005; Yu et al, 2007; Yue et al, 2007), and often it involves the same algorithm used for making predictions (see Eq. (1)). Related to Algorithm 1 is the method proposed in (Anguelov et al, 2005), which applies to the special case where the argmax in Line 5 can be computed as a linear program. This allows them not to explicitly maintain a working set, but to implicitly represent it by folding the linear programs into the quadratic program OP2. To this special case also applies the method of Taskar et al (2005), which casts the training of max-margin structured predictors as a convex-concave saddle-point problem. It provides improved scalability compared to an explicit reduction to a polynomially-sized QP, but involves the use of a special min-cost quadratic flow solver in the projection steps of the extragradient method. Exponentiated gradient methods, originally proposed for online learning of linear predictors (Kivinen and Warmuth, 1997), have also been applied to the training of structured predictors (Globerson et al, 2007; Bartlett et al, 2004). They solve the optimization problem in the dual, and treat conditional random fields and structural SVMs within the same framework using Bregman divergences. Stochastic gradient methods (Vishwanathan et al, 2006) have been applied to the training of conditional random fields on large-scale problems, and exhibit a faster rate of convergence than BFGS methods. Recently, subgradient methods and their stochastic variants (Ratliff et al, 2007) have also been proposed to solve the optimization problem in max-margin structured prediction. While not yet explored for structured prediction, the PEGASOS algorithm (Shalev-Shwartz et al, 2007) has shown promising performance for binary classification SVMs. Related to such online methods is also the MIRA algorithm (Crammer and Singer, 2003), which has been used for training structured predictors (e.g. McDonald et al (2005)).
However, to deal with the exponential size of Y, heuristics have to be used (e.g. only using a k-best subset of Y), leading to only approximate solutions of Optimization Problem OP2.

3 Training Algorithm

While polynomial runtime was established for most algorithms discussed above, training general structural SVMs on large-scale problems is still a challenging problem. In the following, we present an equivalent reformulation of the training problems for both margin-rescaling and slack-rescaling, leading to a cutting-plane training algorithm that has not only provably linear runtime in the number of training examples, but is also several orders of magnitude faster than conventional cutting-plane methods (Tsochantaridis et al, 2005) on large-scale problems. Nevertheless, the new algorithm is equally general as Algorithms 1 and 2.
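As a concrete reference point before the reformulation, the working-set loop of Algorithm 1 can be sketched for the multi-class special case. This is an illustrative sketch, not the paper's implementation: the loss is 0/1, Ψ is the block encoding redefined locally, and the restricted QP of Line 8 is only approximated by subgradient descent so the example stays self-contained.

```python
# Sketch of Algorithm 1 (n-slack, margin-rescaling) for multi-class SVMs with
# 0/1 loss. The restricted QP of Line 8 is approximated by subgradient descent
# on the primal over the working-set constraints, an assumption made here for
# self-containedness rather than the exact QP solver the algorithm calls for.
import numpy as np

def psi(x, y, k):
    out = np.zeros(k * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def oracle(w, x, y, k):
    """Line 5: most violated label, argmax of Delta(y, yhat) + w^T Psi(x, yhat)."""
    return int(np.argmax([(yh != y) + w @ psi(x, yh, k) for yh in range(k)]))

def solve_qp(S, W, k, C, dim, iters=2000):
    """Approximate min 1/2 w'w + (C/n) sum_i xi_i over working-set constraints."""
    n, w = len(S), np.zeros(dim)
    for t in range(1, iters + 1):
        g = w.copy()                                  # gradient of 1/2 w'w
        for i, (x, y) in enumerate(S):
            viols = [(yb != y) - w @ (psi(x, y, k) - psi(x, yb, k)) for yb in W[i]]
            if viols and max(viols) > 0:              # active hinge for example i
                yb = W[i][int(np.argmax(viols))]
                g -= (C / n) * (psi(x, y, k) - psi(x, yb, k))
        w -= g / t
    xi = [max([0.0] + [(yb != y) - w @ (psi(x, y, k) - psi(x, yb, k))
                       for yb in W[i]]) for i, (x, y) in enumerate(S)]
    return w, xi

def nslack_train(S, k, C=1.0, eps=0.05, max_rounds=20):
    dim = k * len(S[0][0])
    W = [[] for _ in S]                               # working sets W_i (Line 2)
    w, xi = np.zeros(dim), [0.0] * len(S)
    for _ in range(max_rounds):                       # repeat (Line 3)
        changed = False
        for i, (x, y) in enumerate(S):
            yh = oracle(w, x, y, k)
            if (yh != y) - w @ (psi(x, y, k) - psi(x, yh, k)) > xi[i] + eps:
                W[i].append(yh)                       # Line 7
                w, xi = solve_qp(S, W, k, C, dim)     # Line 8
                changed = True
        if not changed:                               # Line 11: no W_i changed
            break
    return w
```

On a toy two-example dataset this terminates after a handful of working-set updates with a weight vector that separates the training sample; note that the QP is re-solved every time any W_i grows, which is the source of the super-linear scaling discussed above.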
3.1 1-Slack Formulation

The first step towards the new algorithm is a reformulation of the optimization problems for training. The key idea is to replace the n cutting-plane models of the hinge loss (one for each training example) with a single cutting-plane model for the sum of the hinge losses. Since there is only a single slack variable in the new formulations, we refer to them as the 1-slack formulations.

Optimization Problem 4 (1-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING (PRIMAL))

min_{w, ξ ≥ 0}  (1/2) w^T w + Cξ
s.t.  ∀(ȳ_1, ..., ȳ_n) ∈ Y^n : (1/n) w^T Σ_{i=1}^n [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^n Δ(y_i, ȳ_i) − ξ

Optimization Problem 5 (1-SLACK STRUCTURAL SVM WITH SLACK-RESCALING (PRIMAL))

min_{w, ξ ≥ 0}  (1/2) w^T w + Cξ
s.t.  ∀(ȳ_1, ..., ȳ_n) ∈ Y^n : (1/n) w^T Σ_{i=1}^n Δ(y_i, ȳ_i)[Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^n Δ(y_i, ȳ_i) − ξ

While OP4 has |Y|^n constraints, one for each possible combination of labels (ȳ_1, ..., ȳ_n) ∈ Y^n, it has only one slack variable ξ that is shared across all constraints. Each constraint corresponds to a tangent to R^{Δ_MR}_S(h) and R^{Δ_SR}_S(h) respectively, and the set of constraints forms an equivalent model of the risk function. Specifically, the following theorems show that ξ* = R^{Δ_MR}_S(h_{w*}) at the solution (w*, ξ*) of OP4, and ξ* = R^{Δ_SR}_S(h_{w*}) at the solution (w*, ξ*) of OP5, since the n-slack and the 1-slack formulations are equivalent in the following sense.

Theorem 1. (EQUIVALENCE OF OP2 AND OP4) Any solution w* of OP4 is also a solution of OP2 (and vice versa), with ξ* = (1/n) Σ_{i=1}^n ξ_i*.

Proof. Generalizing the proof in (Joachims, 2006), we will show that both optimization problems have the same objective value and an equivalent set of constraints. In particular, for every w the smallest feasible ξ and (1/n) Σ_{i=1}^n ξ_i are equal. For a given w, each ξ_i in OP2 can be optimized individually, and the smallest feasible ξ_i given w is achieved for

ξ_i = max_{ȳ_i ∈ Y} {Δ(y_i, ȳ_i) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)]}.

For OP4, the smallest feasible ξ for a given w is
Algorithm 3 for training Structural SVMs (with margin-rescaling) via the 1-Slack Formulation (OP4).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W ← ∅
3: repeat
4:   (w, ξ) ← argmin_{w, ξ ≥ 0} (1/2) w^T w + Cξ
        s.t. ∀(ȳ_1, ..., ȳ_n) ∈ W : (1/n) w^T Σ_{i=1}^n [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^n Δ(y_i, ȳ_i) − ξ
5:   for i = 1, ..., n do
6:     ŷ_i ← argmax_{ŷ ∈ Y} {Δ(y_i, ŷ) + w^T Ψ(x_i, ŷ)}
7:   end for
8:   W ← W ∪ {(ŷ_1, ..., ŷ_n)}
9: until (1/n) Σ_{i=1}^n Δ(y_i, ŷ_i) − (1/n) w^T Σ_{i=1}^n [Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)] ≤ ξ + ε
10: return (w, ξ)

ξ = max_{(ȳ_1, ..., ȳ_n) ∈ Y^n} {(1/n) Σ_{i=1}^n Δ(y_i, ȳ_i) − (1/n) w^T Σ_{i=1}^n [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)]}.

Since the function can be decomposed linearly in the ȳ_i, for any given w, each ȳ_i can be optimized independently:

ξ = (1/n) Σ_{i=1}^n max_{ȳ_i ∈ Y} {Δ(y_i, ȳ_i) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)]} = (1/n) Σ_{i=1}^n ξ_i

Therefore, the objective functions of both optimization problems are equal for any w given the corresponding smallest feasible ξ and ξ_i. Consequently, this is also true for w* and its corresponding smallest feasible slacks ξ* and ξ_i*.

Theorem 2. (EQUIVALENCE OF OP3 AND OP5) Any solution w* of OP5 is also a solution of OP3 (and vice versa), with ξ* = (1/n) Σ_{i=1}^n ξ_i*.

Proof. Analogous to Theorem 1.

3.2 Cutting-Plane Algorithm

What could we possibly have gained by moving from the n-slack to the 1-slack formulation, exponentially increasing the number of constraints in the process? We will show in the following that the dual of the 1-slack formulation has a solution that is extremely sparse, with the number of non-zero dual variables independent of the number of training examples. To find this solution, we propose Algorithms 3 and 4, which are generalizations of the algorithm in (Joachims, 2006) to structural SVMs. Similar to the cutting-plane algorithms for the n-slack formulations, Algorithms 3 and 4 iteratively construct a working set W of constraints. In each iteration, the algorithms compute the solution over the current W (Line 4), find the most violated constraint (Lines 5-7), and add it to the working set. The algorithm stops once no constraint can be found that is violated by more than the desired precision ε (Line 9). Unlike in the n-slack algorithms, only a single constraint is added in each iteration. The following theorems characterize the quality of the solutions returned by Algorithms 3 and 4.

Algorithm 4 for training Structural SVMs (with slack-rescaling) via the 1-Slack Formulation (OP5).
1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W ← ∅
3: repeat
4:   (w, ξ) ← argmin_{w, ξ ≥ 0} (1/2) w^T w + Cξ
        s.t. ∀(ȳ_1, ..., ȳ_n) ∈ W : (1/n) w^T Σ_{i=1}^n Δ(y_i, ȳ_i)[Ψ(x_i, y_i) − Ψ(x_i, ȳ_i)] ≥ (1/n) Σ_{i=1}^n Δ(y_i, ȳ_i) − ξ
5:   for i = 1, ..., n do
6:     ŷ_i ← argmax_{ŷ ∈ Y} {Δ(y_i, ŷ)(1 − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)])}
7:   end for
8:   W ← W ∪ {(ŷ_1, ..., ŷ_n)}
9: until (1/n) Σ_{i=1}^n Δ(y_i, ŷ_i) − (1/n) w^T Σ_{i=1}^n Δ(y_i, ŷ_i)[Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)] ≤ ξ + ε
10: return (w, ξ)

Theorem 3. (CORRECTNESS OF ALGORITHM 3) For any training sample S = ((x_1, y_1), ..., (x_n, y_n)) and any ε > 0, if (w*, ξ*) is the optimal solution of OP4, then Algorithm 3 returns a point (w, ξ) that has a better objective value than (w*, ξ*), and for which (w, ξ + ε) is feasible in OP4.

Proof. We first verify that Lines 5-7 in Algorithm 3 compute the vector (ŷ_1, ..., ŷ_n) ∈ Y^n that maximizes

ξ′ = max_{(ŷ_1, ..., ŷ_n) ∈ Y^n} {(1/n) Σ_{i=1}^n Δ(y_i, ŷ_i) − (1/n) w^T Σ_{i=1}^n [Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)]}.

ξ′ is the minimum value needed to fulfill all constraints in OP4 for the current w. The maximization problem is linear in the ŷ_i, so one can maximize over each ŷ_i independently:

ξ′ = (1/n) Σ_{i=1}^n max_{ŷ ∈ Y} {Δ(y_i, ŷ) − w^T [Ψ(x_i, y_i) − Ψ(x_i, ŷ)]}   (3)
   = −(1/n) Σ_{i=1}^n w^T Ψ(x_i, y_i) + (1/n) Σ_{i=1}^n max_{ŷ ∈ Y} {Δ(y_i, ŷ) + w^T Ψ(x_i, ŷ)}   (4)

Since the first sum in Equation (4) is constant, the second term directly corresponds to the assignment in Line 6. As checked in Line 9, the algorithm terminates only if ξ′ does not exceed the ξ from the solution over W by more than ε, as desired.
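The 1-slack loop of Algorithm 3 can be sketched in the same multi-class setting as before (again an illustrative sketch: 0/1 loss, block-encoded Ψ, and the restricted QP of Line 4 approximated by subgradient descent rather than an exact QP solver). Each working-set entry stores the averaged constraint (a, b) with a = (1/n) Σ_i [Ψ(x_i, y_i) − Ψ(x_i, ŷ_i)] and b = (1/n) Σ_i Δ(y_i, ŷ_i):

```python
# Sketch of Algorithm 3 (1-slack, margin-rescaling) for multi-class SVMs with
# 0/1 loss. One joint constraint (a, b) is added per iteration; the restricted
# problem min 1/2 w'w + C*xi s.t. w.a_j >= b_j - xi is approximated by
# subgradient descent (an assumption made for self-containedness).
import numpy as np

def psi(x, y, k):
    out = np.zeros(k * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def solve_restricted(W, dim, C, iters=2000):
    w = np.zeros(dim)
    for t in range(1, iters + 1):
        g = w.copy()                              # gradient of 1/2 w'w
        if W:
            viols = [b - w @ a for a, b in W]
            j = int(np.argmax(viols))
            if viols[j] > 0:                      # xi = max(0, max_j b_j - w.a_j)
                g -= C * W[j][0]
        w -= g / t
    return w, max([0.0] + [b - w @ a for a, b in W])

def oneslack_train(S, k, C=1.0, eps=0.05, max_iters=50):
    n, dim = len(S), k * len(S[0][0])
    W, w, xi = [], np.zeros(dim), 0.0
    for _ in range(max_iters):
        a, b = np.zeros(dim), 0.0
        for x, y in S:                            # Lines 5-7: decomposed oracle
            yh = int(np.argmax([(c != y) + w @ psi(x, c, k) for c in range(k)]))
            a += (psi(x, y, k) - psi(x, yh, k)) / n
            b += (yh != y) / n
        if b - w @ a <= xi + eps:                 # Line 9: stopping criterion
            break
        W.append((a, b))                          # Line 8: one joint constraint
        w, xi = solve_restricted(W, dim, C)       # Line 4 on the next round
    return w
```

Unlike the n-slack sketch, the working set here grows by one averaged constraint per iteration regardless of n, which is the source of the O(n) per-iteration cost and the n-independent iteration bound analyzed below.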
Since the (w, ξ) returned by Algorithm 3 is the solution on a subset of the constraints from OP4, it holds that (1/2) w^T w + Cξ ≤ (1/2) w*^T w* + Cξ*.

Theorem 4. (CORRECTNESS OF ALGORITHM 4) For any training sample S = ((x_1, y_1), ..., (x_n, y_n)) and any ε > 0, if (w*, ξ*) is the optimal solution of OP5, then Algorithm 4 returns a point (w, ξ) that has a better objective value than (w*, ξ*), and for which (w, ξ + ε) is feasible in OP5.

Proof. Analogous to the proof of Theorem 3.

Using a stopping criterion based on the accuracy of the empirical risk ξ is very intuitive and practically meaningful, unlike the stopping criteria typically used in decomposition methods. Intuitively, ε can be used to indicate how close one wants to be to the empirical risk of the best parameter vector. In most machine learning applications, tolerating a training error that is suboptimal by 0.1% is very acceptable. This intuition makes selecting the stopping criterion much easier than in other training methods, where it is usually defined based on the accuracy of the Kuhn-Tucker Conditions of the dual (see e.g., Joachims, 1999). Nevertheless, it is easy to see that ε also bounds the duality gap of the solution by Cε. Solving the optimization problems to an arbitrary but fixed precision of ε is essential in our analysis below, making sure that computation time is not wasted on computing a solution that is more accurate than necessary. We next analyze the time complexity of Algorithms 3 and 4. It is easy to see that each iteration of the algorithm takes n calls to the separation oracle, and that for the linear kernel the remaining work in each iteration scales linearly with n as well. We show next that the number of iterations until convergence is bounded, and that this upper bound is independent of n. The argument requires the Wolfe-dual programs, which are straightforward to derive (see Appendix). For a more compact notation, we denote vectors of labels as ȳ = (ȳ_1, ..., ȳ_n) ∈ Y^n. For such vectors of labels, we then define Δ(ȳ) and the inner product H_MR(ȳ, ȳ′) as follows.
Note that y_i and y_j denote correct training labels, while ȳ_i and ȳ′_j denote arbitrary labels:

Δ(ȳ) = (1/n) Σ_{i=1}^n Δ(y_i, ȳ_i)   (5)

H_MR(ȳ, ȳ′) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n [Ψ(x_i, y_i)^T Ψ(x_j, y_j) − Ψ(x_i, y_i)^T Ψ(x_j, ȳ′_j) − Ψ(x_i, ȳ_i)^T Ψ(x_j, y_j) + Ψ(x_i, ȳ_i)^T Ψ(x_j, ȳ′_j)]   (6)

The inner products Ψ(x, y)^T Ψ(x′, y′) are computed either explicitly or via a kernel K(x, y, x′, y′) = Ψ(x, y)^T Ψ(x′, y′). Note that it is typically more efficient to compute

H_MR(ȳ, ȳ′) = (1/n²) [Σ_{i=1}^n (Ψ(x_i, y_i) − Ψ(x_i, ȳ_i))]^T [Σ_{j=1}^n (Ψ(x_j, y_j) − Ψ(x_j, ȳ′_j))]   (7)

if no kernel is used. The dual of the 1-slack formulation for margin-rescaling is:
Optimization Problem 6 (1-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING (DUAL))

max_{α ≥ 0}  Σ_{ȳ ∈ Y^n} Δ(ȳ) α_ȳ − (1/2) Σ_{ȳ ∈ Y^n} Σ_{ȳ′ ∈ Y^n} α_ȳ α_ȳ′ H_MR(ȳ, ȳ′)
s.t.  Σ_{ȳ ∈ Y^n} α_ȳ = C

For the case of slack-rescaling, the respective H_SR(ȳ, ȳ′) is as follows. There is an analogous factorization that is more efficient to compute if no kernel is used:

H_SR(ȳ, ȳ′) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Δ(y_i, ȳ_i) Δ(y_j, ȳ′_j) [Ψ(x_i, y_i)^T Ψ(x_j, y_j) − Ψ(x_i, y_i)^T Ψ(x_j, ȳ′_j) − Ψ(x_i, ȳ_i)^T Ψ(x_j, y_j) + Ψ(x_i, ȳ_i)^T Ψ(x_j, ȳ′_j)]   (8)

= (1/n²) [Σ_{i=1}^n Δ(y_i, ȳ_i)(Ψ(x_i, y_i) − Ψ(x_i, ȳ_i))]^T [Σ_{j=1}^n Δ(y_j, ȳ′_j)(Ψ(x_j, y_j) − Ψ(x_j, ȳ′_j))]   (9)

The dual of the 1-slack formulation for slack-rescaling is:

Optimization Problem 7 (1-SLACK STRUCTURAL SVM WITH SLACK-RESCALING (DUAL))

max_{α ≥ 0}  Σ_{ȳ ∈ Y^n} Δ(ȳ) α_ȳ − (1/2) Σ_{ȳ ∈ Y^n} Σ_{ȳ′ ∈ Y^n} α_ȳ α_ȳ′ H_SR(ȳ, ȳ′)
s.t.  Σ_{ȳ ∈ Y^n} α_ȳ = C

Using the respective dual solution α*, one can compute inner products with the weight vector w* solving the primal via

w*^T Ψ(x, y) = Σ_{ȳ ∈ Y^n} α*_ȳ (1/n) Σ_{j=1}^n [Ψ(x, y)^T Ψ(x_j, y_j) − Ψ(x, y)^T Ψ(x_j, ȳ_j)]
             = Σ_{ȳ ∈ Y^n} α*_ȳ [(1/n) Σ_{j=1}^n (Ψ(x_j, y_j) − Ψ(x_j, ȳ_j))]^T Ψ(x, y)

for margin-rescaling, and via

w*^T Ψ(x, y) = Σ_{ȳ ∈ Y^n} α*_ȳ (1/n) Σ_{j=1}^n Δ(y_j, ȳ_j) [Ψ(x, y)^T Ψ(x_j, y_j) − Ψ(x, y)^T Ψ(x_j, ȳ_j)]
             = Σ_{ȳ ∈ Y^n} α*_ȳ [(1/n) Σ_{j=1}^n Δ(y_j, ȳ_j)(Ψ(x_j, y_j) − Ψ(x_j, ȳ_j))]^T Ψ(x, y)
for slack-rescaling. We will show in the following that only a small (i.e., polynomial) number of the α_ȳ are non-zero at the solution. In analogy to classification SVMs, we will refer to those ȳ with non-zero α_ȳ as Support Vectors. However, note that Support Vectors in the 1-slack formulation are linear combinations of multiple examples. We can now state the theorem giving an upper bound on the number of iterations of the 1-slack algorithms. The proof extends the one in (Joachims, 2006) to general structural SVMs, and is based on the technique introduced in (Joachims, 2003) and generalized in (Tsochantaridis et al, 2005). The final step of the proof uses an improvement developed in (Teo et al, 2007).

Theorem 5. (1-SLACK MARGIN-RESCALING SVM ITERATION COMPLEXITY) For any 0 < C, 0 < ε ≤ 4R²C and any training sample S = ((x_1, y_1), ..., (x_n, y_n)), Algorithm 3 terminates after at most

⌈log₂(Δ / (4R²C))⌉ + ⌈16R²C / ε⌉   (10)

iterations, where R² = max_{i,ȳ} ||Ψ(x_i, y_i) − Ψ(x_i, ȳ)||², Δ = max_{i,ȳ} Δ(y_i, ȳ), and ⌈..⌉ is the integer ceiling function.

Proof. We will show that adding each new constraint to W increases the objective value at the solution of the quadratic program in Line 4 by at least some constant positive value. Since the objective value of the solution of OP6 is upper bounded by CΔ (since w = 0 and ξ = Δ is a feasible point in the primal), the algorithm can only perform a constant number of iterations before termination. The amount by which the solution increases by adding one constraint that is violated by more than ε (i.e., the criteria in Line 9 of Algorithm 3 and Algorithm 4) to W can be lower bounded as follows. Let ŷ be the newly added constraint and let α be the solution of the dual before the addition. To lower bound the progress made by the algorithm in each iteration, consider the increase in the dual that can be achieved with a line search

max_{0 ≤ β ≤ C} {D(α + βη)} − D(α).   (11)

The direction η is constructed by setting η_ŷ = 1 and η_ȳ = −α_ȳ/C for all other ȳ.
Note that the constraints on β and the construction of η ensure that α + βη never leaves the feasible region of the dual. To apply Lemma 2 (see Appendix) for computing the progress made by a line search, we need a lower bound for ∇D(α)^T η and an upper bound for η^T Hη. Starting with the lower bound for ∇D(α)^T η, note that

∂D(α)/∂α_ȳ = Δ(ȳ) − Σ_{ȳ′ ∈ W} α_ȳ′ H_MR(ȳ, ȳ′) = ξ   (12)

for all ȳ with non-zero α_ȳ at the solution over the previous working set W. For the newly added constraint ŷ and some γ > 0,
∂D(α)/∂α_ŷ = Δ(ŷ) − Σ_{ȳ′ ∈ W} α_ȳ′ H_MR(ŷ, ȳ′) = ξ + γ ≥ ξ + ε   (13)

by construction, due to Line 9 of Algorithm 3. It follows that

∇D(α)^T η = −Σ_{ȳ ∈ W} (α_ȳ/C) ξ + ξ + γ   (14)
          = ξ (1 − (1/C) Σ_{ȳ ∈ W} α_ȳ) + γ   (15)
          = γ.   (16)

The following gives an upper bound for η^T Hη, where H_{ȳȳ′} = H_MR(ȳ, ȳ′) for ȳ, ȳ′ ∈ W ∪ {ŷ}:

η^T Hη = H_MR(ŷ, ŷ) − (2/C) Σ_{ȳ ∈ W} α_ȳ H_MR(ȳ, ŷ) + (1/C²) Σ_{ȳ ∈ W} Σ_{ȳ′ ∈ W} α_ȳ α_ȳ′ H_MR(ȳ, ȳ′)   (17)
       ≤ R² + (2/C) C R² + (1/C²) C² R²   (18)
       = 4R²   (19)

The bound uses that −R² ≤ H_MR(ȳ, ŷ) ≤ R². Plugging everything into the bound of Lemma 2 shows that the increase of the objective is at least

max_{0 ≤ β ≤ C} {D(α + βη)} − D(α) ≥ min{Cγ/2, γ²/(8R²)}   (20)

Note that the first case applies whenever γ ≥ 4R²C, and that the second case applies otherwise. The final step of the proof is to use this constant increase of the objective value in each iteration to bound the maximum number of iterations. First, note that α_ȳ = 0 for all incorrect vectors of labels ȳ and α_ȳ = C for the correct vector of labels ȳ = (y_1, ..., y_n) is a feasible starting point α_0 with a dual objective of 0. This means the initial optimality gap δ(0) = D(α*) − D(α_0) is at most CΔ, where α* is the optimal dual solution. An optimality gap of δ(i) = D(α*) − D(α_i) ensures that there exists a constraint that is violated by at least γ ≥ δ(i)/C. This means that the first case of (20) applies while δ(i) ≥ 4R²C², leading to a decrease in the optimality gap of at least

δ(i+1) ≤ δ(i) − δ(i)/2   (21)

in each iteration. Starting from the worst possible optimality gap of δ(0) = CΔ, the algorithm needs at most
  i₁ ≤ ⌈log₂( Δ / (4R²C) )⌉    (22)

iterations until it has reached an optimality gap of δ(i₁) ≤ 4R²C², where the second case of (20) becomes valid. As proposed in (Teo et al, 2007), the recurrence equation

  δ(i+1) ≤ δ(i) − δ(i)² / (8R²C²)    (23)

for the second case of (20) can be upper bounded by solving the differential equation ∂δ(i)/∂i = −δ(i)²/(8R²C²) with boundary condition δ(0) = 4R²C². The solution is δ(i) ≤ 8R²C²/(i+2), showing that the algorithm does not need more than

  i₂ ≤ 8R²C² / (Cε) = 8R²C / ε    (24)

iterations until it reaches an optimality gap of Cε when starting at a gap of 4R²C², where ε is the desired target precision given to the algorithm. Once the optimality gap reaches Cε, it is no longer guaranteed that an ε-violated constraint exists. However, such constraints may still exist and so the algorithm does not yet terminate. But since each such constraint leads to an increase in the dual objective of at least ε²/(8R²), only

  i₃ ≤ 8R²C / ε    (25)

can be added before the optimality gap becomes negative. The overall bound results from adding i₁, i₂, and i₃.

Note that the proof of the theorem requires only a line search in each step, while Algorithm 4 actually computes the full QP solution. This suggests the following. On the one hand, the actual number of iterations in Algorithm 4 might be substantially smaller in practice than what is predicted by the bound. On the other hand, it suggests a variant of Algorithm 4 where the QP solver is replaced by a simple line search. This may be beneficial in structured prediction problems where the separation oracle in Line 6 is particularly cheap to compute.

Theorem 6. (1-SLACK SLACK-RESCALING SVM ITERATION COMPLEXITY) For any 0 < C, 0 < ε ≤ 4Δ²R²C and any training sample S = ((x₁,y₁),...,(x_n,y_n)), Algorithm 4 terminates after at most

  ⌈log₂( 1 / (4ΔR²C) )⌉ + ⌈16Δ²R²C / ε⌉    (26)

iterations, where R² = max_{i,ȳ} ‖Ψ(x_i,y_i) − Ψ(x_i,ȳ)‖², Δ = max_{i,ȳ} Δ(y_i,ȳ), and ⌈..⌉ is the integer ceiling function.
Proof. The proof for the case of slack-rescaling is analogous. The only difference is that −Δ²R² ≤ H_SR(ȳ,ȳ') ≤ Δ²R².

The O(1/ε) convergence rate in the bound is tight, as the following example shows. Consider a multi-class classification problem with infinitely many classes Y = {1,...,∞} and a feature space X = ℝ that contains only one feature. This problem can be encoded using a feature map Ψ(x,y) which takes value x in position y and 0 everywhere else. For a training set with a single training example (x₁,y₁) = ((1),1) and using the zero/one-loss, the 1-slack quadratic program for both margin-rescaling and slack-rescaling is

  min_{w,ξ≥0} ½ wᵀw + Cξ    (27)
  s.t. wᵀ[Ψ(x₁,1) − Ψ(x₁,2)] ≥ 1 − ξ
       wᵀ[Ψ(x₁,1) − Ψ(x₁,3)] ≥ 1 − ξ
       wᵀ[Ψ(x₁,1) − Ψ(x₁,4)] ≥ 1 − ξ
       ⋮

Let us assume without loss of generality that Algorithm 3 (or equivalently Algorithm 4) introduces the first constraint in the first iteration. For C ≥ 1/2 the solution over this working set is wᵀ = (1/2, −1/2, 0, 0, ...) and ξ = 0. All other constraints are now violated by 1/2, and one of them is selected at random to be added to the working set in the next iteration. It is easy to verify that after adding k constraints, the solution over the working set is wᵀ = (k/(k+1), −1/(k+1), ..., −1/(k+1), 0, 0, ...) for C ≥ 1, and all constraints outside the working set are violated by ε = 1/(k+1). It therefore takes O(1/ε) iterations to reach a desired precision of ε. The O(C) scaling with C is tight as well, at least for small values of C. For C ≤ 1/2, the solution over the working set after adding k constraints is wᵀ = (C, −C/k, ..., −C/k, 0, 0, ...). This means that after k constraints, all constraints outside the working set are violated by ε = C/k. Consequently, the bounds in (10) and (26) accurately reflect the scaling with C up to the log-term for C ≤ 1/2.

The following theorem summarizes our characterization of the time complexity of the 1-slack algorithms. In real applications, however, we will see that Algorithm 3 scales much better than what is predicted by these worst-case bounds, both w.r.t. C and ε. Note that a support vector (i.e., a
point with a non-zero dual variable) no longer corresponds to a single data point in the 1-slack dual, but is typically a linear combination of data points.

Corollary 1. (TIME COMPLEXITY OF ALGORITHMS 3 AND 4 FOR LINEAR KERNEL) For any n training examples S = ((x₁,y₁),...,(x_n,y_n)) with max_{i,ȳ} ‖Ψ(x_i,y_i) − Ψ(x_i,ȳ)‖² ≤ R² < ∞ and max_{i,ȳ} Δ(y_i,ȳ) ≤ Δ < ∞ for all n, the 1-slack cutting-plane Algorithms 3 and 4 with constant ε and C using the linear kernel
• require at most O(n) calls to the separation oracle,
• require at most O(n) computation time outside the separation oracle,
• find a solution where the number of support vectors (i.e., the number of non-zero dual variables in the cutting-plane model) does not depend on n,

for any fixed value of C > 0 and ε > 0.

Proof. Theorems 5 and 6 show that the algorithms terminate after a constant number of iterations that does not depend on n. Since only one constraint is introduced in each iteration, the number of support vectors is bounded by the number of iterations. In each iteration, the algorithm performs exactly n calls to the separation oracle, which proves the first statement. Similarly, the QP that is solved in each iteration is of constant size and therefore requires only constant time. It is easily verified that the remaining operations in each iteration can be done in time O(n) using Eqs. (7) and (9).

We further discuss the time complexity for the case of kernels in the following section. Note that the linear-time algorithm proposed in (Joachims, 2006) for training binary classification SVMs is a special case of the 1-slack methods developed here. For binary classification, X = ℝ^N and Y = {−1,+1}. Plugging

  Ψ(x,y) = ½ yx  and  Δ(y,ȳ) = { 0 if y = ȳ; 1 otherwise }    (28)

into either n-slack formulation OP2 or OP3 produces the standard SVM optimization problem OP1. The 1-slack formulations and algorithms are then equivalent to those in (Joachims, 2006). However, the O(1/ε) bound on the maximum number of iterations derived here is tighter than the O(1/ε²) bound in (Joachims, 2006). Using a similar argument, it can also be shown that the ordinal regression method in (Joachims, 2006) is a special case of the 1-slack algorithm.

3.3 Kernels and Low-Rank Approximations

For problems where a (non-linear) kernel is used, the computation time in each iteration is O(n²) instead of O(n), since Eqs. (7) and (9) no longer apply.
However, the 1-slack algorithm can easily exploit rank-k approximations, which we will show reduces the computation time outside of the separation oracle from O(n²) to O(kn + k³). Let (x₁,y₁),...,(x_k,y_k) be a set of basis functions so that the subspace spanned by Ψ(x₁,y₁),...,Ψ(x_k,y_k) (approximately) contains the solution w of OP4 and OP5 respectively. Algorithms for finding such approximations have been suggested in (Keerthi et al, 2006; Fukumizu et al, 2004; Smola and Schölkopf, 2000) for classification SVMs, and at least some of them can be extended to structural SVMs as well. In the simplest case, the set of k basis functions can be chosen randomly from the set of training examples.
For a kernel K(·) and the resulting Gram matrix K with K_ij = Ψ(x_i,y_i)ᵀΨ(x_j,y_j) = K(x_i,y_i,x_j,y_j), we can compute the inverse L⁻¹ of the Cholesky decomposition L of K in time O(k³). Assuming that w actually lies in the subspace, we can equivalently rewrite the 1-slack optimization problems as

Optimization Problem 8 (1-SLACK STRUCTURAL SVM WITH MARGIN-RESCALING AND k BASIS FUNCTIONS (PRIMAL))

  min_{β,ξ≥0} ½ βᵀβ + Cξ
  s.t. ∀(ȳ₁,...,ȳ_n) ∈ Yⁿ:
    βᵀ L⁻¹ (1/n) Σ_{i=1}^n ( K(x_i,y_i,x₁,y₁) − K(x_i,ȳ_i,x₁,y₁), ..., K(x_i,y_i,x_k,y_k) − K(x_i,ȳ_i,x_k,y_k) )ᵀ ≥ (1/n) Σ_{i=1}^n Δ(y_i,ȳ_i) − ξ

Optimization Problem 9 (1-SLACK STRUCTURAL SVM WITH SLACK-RESCALING AND k BASIS FUNCTIONS (PRIMAL))

  min_{β,ξ≥0} ½ βᵀβ + Cξ
  s.t. ∀(ȳ₁,..,ȳ_n) ∈ Yⁿ:
    βᵀ L⁻¹ (1/n) Σ_{i=1}^n Δ(y_i,ȳ_i) ( K(x_i,y_i,x₁,y₁) − K(x_i,ȳ_i,x₁,y₁), ..., K(x_i,y_i,x_k,y_k) − K(x_i,ȳ_i,x_k,y_k) )ᵀ ≥ (1/n) Σ_{i=1}^n Δ(y_i,ȳ_i) − ξ

Intuitively, the values of the kernel K(·) with each of the k basis functions form a new feature vector Ψ'(x,y)ᵀ = (K(x,y,x₁,y₁),...,K(x,y,x_k,y_k)) describing each example (x,y). After multiplication with L⁻¹, OP8 and OP9 become identical to a problem with linear kernel and k features, and it is straightforward to see that Algorithms 3 and 4 apply to this new representation.

Corollary 2. (TIME COMPLEXITY OF ALGORITHMS 3 AND 4 FOR NON-LINEAR KERNEL) For any n training examples S = ((x₁,y₁),...,(x_n,y_n)) with max_{i,ȳ} ‖Ψ(x_i,y_i) − Ψ(x_i,ȳ)‖² ≤ R² < ∞ and max_{i,ȳ} Δ(y_i,ȳ) ≤ Δ < ∞ for all n, the 1-slack cutting-plane Algorithms 3 and 4 using a non-linear kernel

• require at most O(n) calls to the separation oracle,
• require at most O(n²) computation time outside the separation oracle,
• require at most O(kn + k³) computation time outside the separation oracle, if a set of k basis functions is used,
• find a solution where the number of support vectors does not depend on n,

for any fixed value of C > 0 and ε > 0.

Proof. The proof is analogous to that of Corollary 1. For the low-rank approximation, note that it is more efficient to compute wᵀ = βᵀL⁻¹ once before entering the loop in Line 5 than to compute L⁻¹Ψ'(x,y) for each example.
k³ is the cost of the Cholesky decomposition, but this needs to be computed only once.
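As a concrete illustration of this rewriting, the following self-contained Python sketch (our own code, not part of SVM-struct; it uses scalar inputs and ignores the structured part of the basis pairs for brevity) computes the empirical feature map Ψ'(x) = L⁻¹ (K(x,x₁),...,K(x,x_k))ᵀ and verifies that inner products in the new representation reproduce the kernel on the basis points:

```python
def cholesky(K):
    """Lower-triangular L with K = L L^T (K symmetric positive definite)."""
    k = len(K)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = (K[i][i] - s) ** 0.5
            else:
                L[i][j] = (K[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L y = b by forward substitution, i.e. return y = L^{-1} b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i])
    return y

def empirical_feature_map(kernel, basis, x):
    """Map x to Psi'(x) = L^{-1} (K(x, b_1), ..., K(x, b_k))^T.
    For brevity L is recomputed here; in practice the O(k^3) factorization
    is done once and reused for every example."""
    K = [[kernel(bi, bj) for bj in basis] for bi in basis]
    L = cholesky(K)
    return forward_solve(L, [kernel(x, b) for b in basis])
```

Because Ψ'(x)ᵀΨ'(x') = k(x)ᵀ K⁻¹ k(x'), the map is exact on the span of the basis functions; for points outside the span it gives the low-rank approximation discussed above.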
4 Implementation

We implemented both the n-slack algorithms and the 1-slack algorithms in the software package SVM-struct, which we make publicly available for download at svmlight.joachims.org. SVM-struct uses SVM-light as the optimizer for solving the QP sub-problems. Users may adapt SVM-struct to their own structural learning tasks by implementing API functions corresponding to the task-specific Ψ, Δ, separation oracle, and inference. User API functions are in C. A popular extension is SVM-python, which allows users to write API functions in Python instead, and eliminates much of the drudge work of C, including model serialization/deserialization and memory management.

An efficient implementation of the algorithms required a variety of design decisions, which are summarized in the following. These design decisions have a substantial influence on the practical efficiency of the algorithms.

Restarting the QP Sub-Problem Solver from the Previous Solution. Instead of solving each QP sub-problem from scratch, we restart the optimizer from the dual solution of the previous working set as the starting point. This applies to both the n-slack and the 1-slack algorithms.

Batch Updates for the n-Slack Algorithm. Algorithm 1 recomputes the solution of the QP sub-problem after each update to the working set. While this allows the algorithm to potentially find better constraints to be added in each step, it requires a lot of time in the QP solver. We found that it is more efficient to wait with recomputing the solution of the QP sub-problem until 100 constraints have been added.

Managing the Accuracy of the QP Sub-Problem Solver. In the initial iterations, a relatively low-precision solution of the QP sub-problems is sufficient for identifying the next violated constraint to add to the working set. We therefore adjust the precision of the QP sub-problem optimizer throughout the optimization process for all algorithms.

Removing Inactive Constraints from the Working Set.
For both the n-slack and the 1-slack algorithm, constraints that were added to the working set in early iterations often become inactive later in the optimization process. These constraints can be removed without affecting the theoretical convergence guarantees of the algorithm, leading to smaller QPs being solved in each iteration. At the end of each iteration, we therefore remove constraints from the working set that have not been active in the last 50 QP sub-problems.

Caching Ψ(x_i,y_i) − Ψ(x_i,ŷ_i) in the 1-Slack Algorithm. If the separation oracle returns a label ŷ_i for an example x_i, the constraint added in the n-slack algorithm ensures that this label will never again produce an ε-violated constraint in a subsequent iteration. This is different, however, in the 1-slack algorithm, where the same label can be involved in an ε-violated constraint over and over again. We therefore cache the f most recently used Ψ(x_i,y_i) − Ψ(x_i,ŷ_i) for each training example x_i (typically f = 10 in the following experiments). Let us denote the cache for example x_i with C_i.
Instead of asking the separation oracle in every iteration, the algorithm first tries to construct a sufficiently violated constraint from the caches via

  for i = 1,...,n do
    ŷ_i ← argmax_{ŷ∈C_i} {Δ(y_i,ŷ) + wᵀΨ(x_i,ŷ)}
  end for

or the analogous variant for the case of slack-rescaling. Only if this fails will the algorithm ask the separation oracle. The goal of this caching strategy is to decrease the number of calls to the separation oracle. Note that in many applications, the separation oracle is very expensive (e.g., CFG parsing).

Parallelization. While currently not implemented, the loop in Lines 5-7 of the 1-slack algorithms can easily be parallelized. In principle, one could make use of up to n parallel threads, each computing the separation oracle for a subset of the training sample. For applications like CFG parsing, where more than 98% of the overall runtime is spent on the separation oracle (see Section 5), parallelizing this loop will lead to a substantial speed-up that should be almost linear in the number of threads.

Solving the Dual of the QP Sub-Problems in the 1-Slack Algorithm. As indicated by Theorems 5 and 6, the working sets in the 1-slack algorithm stay small independent of the size of the training set. In practice, typically fewer than 100 constraints are active at the solution, and we never encountered a single instance where the working set grew beyond 1000 constraints. This makes it advantageous to store and solve the QP sub-problems in the dual instead of in the primal, since the dual is not affected by the dimensionality of Ψ(x,y). The algorithm explicitly stores the Hessian H of the dual and adds or deletes a row/column whenever a constraint is added or removed from the working set. Note that this is not feasible for the n-slack algorithm, since the working set size is typically orders of magnitude larger (often > 100,000 constraints).
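The cache-first lookup described above can be sketched in a few lines of Python (our own illustration with hypothetical helper names, not the C implementation; for simplicity the fallback oracle is explicit enumeration over a finite label set, as in the multi-class application of Section 5):

```python
def most_violated_label(w, x, y, labels, psi, loss, dot):
    """argmax over labels of loss(y, yb) + w . psi(x, yb) (margin rescaling)."""
    return max(labels, key=lambda yb: loss(y, yb) + dot(w, psi(x, yb)))

def constraint_from_cache(w, x, y, cache, all_labels, psi, loss, dot, xi, eps, f=10):
    """Cache-first search for an eps-violated constraint: scan the cached
    labels first, and fall back to the expensive oracle (here: enumeration
    over all_labels) only if no cached label is violated by more than eps.
    xi is the slack of the current working-set solution."""
    def violation(yb):
        return loss(y, yb) + dot(w, psi(x, yb)) - dot(w, psi(x, y))
    yhat = most_violated_label(w, x, y, cache, psi, loss, dot) if cache else None
    if yhat is None or violation(yhat) <= xi + eps:
        yhat = most_violated_label(w, x, y, all_labels, psi, loss, dot)
        cache.append(yhat)            # least-recently-used cache of size f
        if len(cache) > f:
            cache.pop(0)
    return yhat
```

A cached label only needs to be sufficiently violated (by more than xi + eps), not the overall most violated one; as discussed in Section 5, this weaker requirement barely changes the number of iterations in practice.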
5 Experiments

For the experiments in this paper we consider the following four applications: binary classification, multi-class classification, sequence tagging with linear-chain HMMs, and CFG grammar learning. They cover the whole spectrum of possible applications, from multi-class classification involving a simple Y of low cardinality and with a very inexpensive separation oracle, to CFG parsing with large and complex structural objects and an expensive separation oracle. The particular setup for the different applications is as follows.

Binary Classification. For binary classification, X = ℝ^N and Y = {−1,+1}. Using

  Ψ(x,y) = ½ yx  and  Δ(y,ȳ) = 100·[y ≠ ȳ] = { 0 if y = ȳ; 100 otherwise }    (29)
in the 1-slack formulation, OP4 results in the algorithm presented in (Joachims, 2006) and implemented in the SVM-perf software³. In the n-slack formulation, one immediately recovers Vapnik et al.'s original classification SVM formulation OP1 (Cortes and Vapnik, 1995; Vapnik, 1998) (up to the more convenient percentage-scale rescaling of the loss function and the absence of the bias term), which we solve using SVM-light.

Multi-Class Classification. This is another simple instance of a structural SVM, where X = ℝ^N and Y = {1,..,k}. Using Δ(y,ȳ) = 100·[y ≠ ȳ] and

  Ψ_multi(x,y) = (0, ..., 0, xᵀ, 0, ..., 0)ᵀ    (30)

where the feature vector x is stacked into position y, the resulting n-slack problem becomes identical to the multi-class SVM of Crammer and Singer (2001). Our SVM-multiclass (V2.3) implementation³ is also built via the SVM-struct API. The argmax for the separation oracle and the prediction are computed by explicit enumeration. We use the Covertype dataset of Blackard, Jock & Dean as our benchmark for the multi-class SVM. It is a 7-class problem with n = 522,911 examples and 54 features. This means that the dimensionality of Ψ(x,y) is N = 378.

Sequence Tagging with Linear-Chain HMMs. In sequence tagging (e.g., Part-of-Speech tagging), each input x = (x¹,...,x^l) is a sequence of feature vectors (one for each word), and y = (y¹,...,y^l) is a sequence of labels y^i ∈ {1..k} of matching length. Isomorphic to a linear-chain HMM, we model dependencies between each y^i and x^i, as well as dependencies between y^i and y^{i−1}. Using the definition of Ψ_multi(x,y) from above, this leads to a joint feature vector of

  Ψ_HMM((x¹,...,x^l),(y¹,...,y^l)) = ( Σ_{i=1}^l Ψ_multi(x^i,y^i) ; Σ_{i=2}^l [y^{i−1}=1][y^i=1] ; Σ_{i=2}^l [y^{i−1}=1][y^i=2] ; ... ; Σ_{i=2}^l [y^{i−1}=k][y^i=k] ).    (31)

We use the number of misclassified tags Δ((y¹,...,y^l),(ȳ¹,...,ȳ^l)) = Σ_{i=1}^l [y^i ≠ ȳ^i] as the loss function. The argmax for prediction and the separation oracle are both computed via the Viterbi algorithm. Note that the separation oracle is equivalent to the

³ Available at svmlight.joachims.org
prediction argmax after adding 1 to the node potentials of all incorrect labels. Our SVM-HMM (V3.0) implementation based on SVM-struct is also available online³. We evaluate on the Part-of-Speech tagging dataset from the Penn Treebank corpus (Marcus et al, 1993). After splitting the dataset into training and test set, it has n = 35,531 training examples (i.e., sentences), leading to a total of 854,022 tags over k = 43 labels. The feature vectors x^i describing each word consist of binary features, each indicating the presence of a particular prefix or suffix in the current word, the previous word, and the following word. All prefixes and suffixes observed in the training data are used as features. In addition, there are features encoding the length of the word. The total number of features is approximately 430,000, leading to a Ψ_HMM(x,y) of dimensionality N = 18,573,781.

Parsing with Context-Free Grammars. We use natural language parsing as an example application where the cost of computing the separation oracle is comparatively high. Here, each input x = (x¹,...,x^l) is a sequence of feature vectors (one for each word), and y is a tree with x as its leaves. Admissible trees are those that can be constructed from a given set of grammar rules, in our case all grammar rules observed in the training data. As the loss function, we use Δ(y,ȳ) = 100·[y ≠ ȳ], and Ψ_CFG(x,y) has one feature per grammar rule that counts how often this rule was applied in y. The argmax for prediction can be computed efficiently using a CKY parser. We use the CKY parser implementation⁴ of Johnson (1998). For the separation oracle, the same CKY parser is used after extending it to also return the second-best solution. Again, our SVM-CFG (V3.0) implementation based on SVM-struct is available online³. For the following experiments, we use all sentences with at most 15 words from the Penn Treebank corpus (Marcus et al, 1993). Restricting the dataset to short sentences is not due to a limitation of SVM-struct, but due to the CKY implementation we are using.
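The loss-augmented Viterbi computation used as the HMM separation oracle above can be sketched as follows (a minimal Python illustration with dense, made-up potentials rather than the sparse SVM-HMM feature representation; node[t][y] plays the role of wᵀΨ for the emission part and trans[y'][y] for the transition part):

```python
def loss_augmented_viterbi(node, trans, gold):
    """Viterbi over tag sequences with 1 added to the node potential of
    every incorrect tag (Hamming-loss margin rescaling): returns the
    sequence maximizing
        sum_t (node[t][y_t] + [y_t != gold[t]]) + sum_t trans[y_{t-1}][y_t]."""
    k, T = len(node[0]), len(node)
    # delta[t][y]: best score of a sequence ending in tag y at position t
    delta = [[node[0][y] + (1.0 if y != gold[0] else 0.0) for y in range(k)]]
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for y in range(k):
            prev = max(range(k), key=lambda yp: delta[-1][yp] + trans[yp][y])
            ptr.append(prev)
            row.append(delta[-1][prev] + trans[prev][y]
                       + node[t][y] + (1.0 if y != gold[t] else 0.0))
        delta.append(row)
        back.append(ptr)
    best = max(range(k), key=lambda y: delta[-1][y])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With all potentials zero, the sketch returns the sequence that disagrees with the gold tags everywhere, since each wrong tag contributes loss 1; once the learned potentials outweigh the loss term, it returns the gold sequence and no violated constraint remains.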
It becomes very slow for long sentences. Faster parsers that use pruning could easily handle longer sentences as well. After splitting the data into training and test set, we have n = 9,780 training examples (i.e., sentences), and Ψ_CFG(x,y) has a dimensionality of N = 54,…

5.1 Experiment Setup

Unless noted otherwise, the following parameters are used in the experiments reported below. Both the 1-slack algorithm (SVM-struct options -w 3, and -w 4 with caching) and the n-slack algorithm (option -w 0) use ε = 0.1 as the stopping criterion (option -e 0.1). Given the scaling of the loss for multi-class classification and CFG parsing, this corresponds to a precision of approximately 0.1% of the empirical risk for the n-slack algorithm, and it is slightly higher for the HMM problem. For the 1-slack problem it is harder to interpret the meaning of this ε, but we will see in Section 5.7 that it gives solutions of comparable precision. As the value
Table 1 Training CPU-time (in hours), number of calls to the separation oracle, and number of support vectors for both the 1-slack (with caching) and the n-slack algorithm. n is the number of training examples and N is the number of features in Ψ(x,y). [The numeric entries of the table are not legible in this copy; the rows are MultiC (n = 522,911, N = 378), HMM (n = 35,531), and CFG (n = 9,780), each listing CPU-time, separation-oracle calls, and support-vector counts for the n-slack and the 1-slack algorithm.]

of C, we use the setting that achieves the best prediction performance on the test set when using the full training set (C = 10,000,000 for multi-class classification, C = 5,000 for HMM sequence tagging, and C = 20,000 for CFG parsing) (option -c). As the cache size, we use f = 10 (option -f 10). For multi-class classification, margin-rescaling and slack-rescaling are equivalent. For the other two problems we use margin-rescaling (option -o 2). Whenever possible, runtime comparisons are done on the full training set. All experiments are run on 3.6 GHz Intel Xeon processors with 4GB of main memory under Linux.

5.2 How Fast is the 1-Slack Algorithm Compared to the n-Slack Algorithm?

We first examine absolute runtimes of the 1-slack algorithm, and then analyze and explain various aspects of its scaling behavior. Table 1 shows the CPU-time that both the 1-slack and the n-slack algorithm take on the multi-class, sequence tagging, and parsing benchmark problems. For all problems, the 1-slack algorithm is substantially faster, for multi-class and HMM by several orders of magnitude. The speed-up is largest for the multi-class problem, which has the least expensive separation oracle. Not counting constraints constructed from the cache, less than 1% of the time is spent on the separation oracle for the multi-class problem, while it is 5% for the HMM and 98% for CFG parsing. Therefore, it is interesting to also compare the number of calls to the separation oracle.
In all cases, Table 1 shows that the 1-slack algorithm requires a factor between 2 and 4 fewer calls, accounting for much of the time saved on the CFG problem. The most striking difference between the two algorithms lies in the number of support vectors they produce (i.e., the number of dual variables that are non-zero). For the n-slack algorithm, the number of support vectors lies in the tens or hundreds of thousands, while all solutions produced by the 1-slack algorithm have only about 100 support vectors. This means that the working sets that need to be solved in each iteration are orders of magnitude smaller in the 1-slack algorithm, accounting for only 26% of the overall runtime in the multi-class experiment, compared to more than 99% for the n-slack algorithm. We will further analyze this in the following.
Table 2 Training CPU-time (in seconds) for five binary classification problems, comparing the 1-slack algorithm (without caching) with SVM-light. n is the number of training examples, N is the number of features, and s is the fraction of non-zero elements of the feature vectors. The SVM-light results are quoted from (Joachims, 2006); the 1-slack results are re-run with the latest version of SVM-struct using the same experiment setup as in (Joachims, 2006). [The numeric entries are not legible in this copy; the rows are Reuters CCAT (n = 804,414, N = 47,236), Reuters C11 (n = 804,414, N = 47,236), ArXiv Astro-ph (n = 62,369), Covertype (n = 522,911, N = 54), and KDD04 Physics (N = 78), each listing CPU-time and support-vector counts for the 1-slack algorithm and SVM-light.]

5.3 How Fast is the 1-Slack Algorithm Compared to Conventional SVM Training Algorithms?

Since most work on training algorithms for SVMs was done for binary classification, we compare the 1-slack algorithm against algorithms for the special case of binary classification. While there are training algorithms for linear SVMs that scale linearly with n (e.g., Lagrangian SVM (Mangasarian and Musicant, 2001) (using the squared slacks ξᵢ²), Proximal SVM (Fung and Mangasarian, 2001) (using an L2 regression loss), and Interior Point Methods (Ferris and Munson, 2003)), they use the Sherman-Morrison-Woodbury formula (or matrix factorizations) for inverting the Hessian of the dual. This requires operating on N×N matrices, which makes them applicable only for problems with small N. The L2-SVM-MFN method (Keerthi and DeCoste, 2005) avoids explicitly representing N×N matrices by using conjugate-gradient techniques. While the worst-case cost is still O(s·n·min(n,N)) per iteration for feature vectors with sparsity s, they observe that their method empirically scales much better. The discussion in (Joachims, 2006) concludes that its runtime is comparable to the 1-slack algorithm implemented in SVM-perf. The 1-slack algorithm scales linearly in both n and the sparsity s of the feature vectors, even if the total number N of features is large (Joachims, 2006).
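As a concrete, self-contained illustration of the 1-slack approach in this binary special case, the following Python sketch implements cutting-plane training with the simple line-search dual update discussed in Section 3 (our own simplified code, not SVM-perf; it uses the 0/1-scale loss rather than the percentage scale, and the most violated aggregated constraint is the averaged hinge subgradient):

```python
def train_binary_svm_1slack(X, Y, C, eps, max_iter=1000):
    """1-slack cutting-plane training of a linear binary SVM (margin
    rescaling).  Each cutting plane (g, d) encodes the aggregated
    constraint  w.g >= d - xi  over the whole training set; the dual is
    updated by a line search instead of a full working-set QP."""
    n, dim = len(X), len(X[0])
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w, s = [0.0] * dim, 0.0          # s tracks sum_j alpha_j * d_j
    planes = [([0.0] * dim, 0.0)]    # the trivial plane keeps xi >= 0
    for _ in range(max_iter):
        # most violated aggregated constraint: averaged hinge subgradient
        g, d = [0.0] * dim, 0.0
        for x, y in zip(X, Y):
            if y * dot(w, x) < 1.0:                  # margin violation
                d += 1.0 / n
                for j in range(dim):
                    g[j] += y * x[j] / n
        xi = max(dj - dot(w, gj) for gj, dj in planes)
        if d - dot(w, g) <= xi + eps:
            return w                  # no eps-violated constraint remains
        planes.append((g, d))
        # dual line search along eta (1 on the new plane, -alpha_j/C elsewhere)
        v = [g[j] - w[j] / C for j in range(dim)]
        grad = (d - dot(w, g)) - (s - dot(w, w)) / C
        beta = max(0.0, min(C, grad / max(dot(v, v), 1e-12)))
        w = [(1.0 - beta / C) * w[j] + beta * g[j] for j in range(dim)]
        s = (1.0 - beta / C) * s + beta * d
    return w
```

On a toy separable problem this converges in a couple of iterations, and the working set (planes) stays tiny regardless of n, mirroring the support-vector counts reported in Table 1.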
Note that it is unclear whether any of the conventional algorithms can be extended to structural SVM training. The most widely used algorithms for training binary SVMs are decomposition methods like SVM-light (Joachims, 1999), SMO (Platt, 1999), and others (Chang and Lin, 2001; Collobert and Bengio, 2001). Taskar et al. (Taskar et al, 2003) extended the SMO algorithm to structured prediction problems based on their polynomial-size reformulation of the n-slack optimization problem OP2 for the special case of decomposable models and decomposable loss functions. In the case of binary classification, their SMO algorithm reduces to a variant of the traditional SMO algorithm, which can be seen as a special case of the SVM-light algorithm. We therefore use SVM-light as a representative of the class of decomposition methods. Table 2 compares the runtime of the 1-slack algorithm to SVM-light on five benchmark problems with varying numbers of features, sparsity, and numbers of training examples. The benchmarks include two text classification problems from
the Reuters RCV1 collection⁵ (Lewis et al, 2004), a problem of classifying ArXiv abstracts, a binary classifier for class 1 of the Covertype dataset⁶ of Blackard, Jock & Dean, and the KDD04 Physics task from the KDD-Cup 2004 (Caruana et al, 2004). In all cases, the 1-slack algorithm is faster than SVM-light, which is highly optimized for binary classification. On large datasets, the difference spans several orders of magnitude.

After the 1-slack algorithm was originally introduced, new stochastic subgradient descent methods were proposed that are competitive in runtime for classification SVMs, especially the PEGASOS algorithm (Shalev-Shwartz et al, 2007). While currently only explored for classification, it should be possible to extend PEGASOS also to structured prediction problems. Unlike exponentiated gradient methods (Bartlett et al, 2004; Globerson et al, 2007), PEGASOS does not require the computation of marginals, which makes it equally easy to apply as cutting-plane methods. However, unlike for our cutting-plane methods, where the theory provides a practically effective stopping criterion, it is less clear when to stop primal stochastic subgradient methods. Since they do not maintain a dual program, the duality gap cannot be used to characterize the quality of the solution at termination. Furthermore, there is the question of how to incorporate caching into stochastic subgradient methods while still maintaining fast convergence. As shown in the following, caching is essential for problems where the separation oracle (or, equivalently, the computation of subgradients) is expensive (e.g., CFG parsing).

5.4 How does Training Time Scale with the Number of Training Examples?

A key question is the scalability of the algorithm for large datasets. While Corollary 1 shows that an upper bound on the training time scales linearly with the number of training examples, the actual behavior underneath this bound could potentially be different.
Figure 1 shows how training time relates to the number of training examples for the three structural prediction problems. For the multi-class and the HMM problem, training time does indeed scale at most linearly, as predicted by Corollary 1, both with and without using the cache. However, the cache helps for larger datasets, and there is a large advantage from using the cache over the whole range for CFG parsing. This is to be expected, given the high cost of the separation oracle in the case of parsing. As shown in Figure 2, the scaling behavior of the 1-slack algorithm remains essentially unchanged even when the regularization parameter C is not held constant, but is set to the value that gives optimal prediction performance on the test set for each training set size. The scaling with C is analyzed in more detail in Section 5.8.

⁵ rcv1v2 README.htm
⁶ mlearn/mlrepository.html
Fig. 1 Training times for multi-class classification (left), HMM part-of-speech tagging (middle), and CFG parsing (right) as a function of n for the n-slack algorithm, the 1-slack algorithm, and the 1-slack algorithm with caching.

Fig. 2 Training times as a function of n using the optimal value of C at each training set size for the 1-slack algorithm (left) and the 1-slack algorithm with caching (right).

The n-slack algorithm scales super-linearly for all problems, but so does the 1-slack algorithm for CFG parsing. This can be explained as follows. Since the grammar is constructed from all rules observed in the training data, the number of grammar rules grows with the number of training examples. Even from the second-largest to the largest training set, the number of rules in the grammar still grows by almost 70% (3,550 rules vs. 5,812 rules). This has two effects. First, the separation oracle becomes slower, since its time scales with the number of rules in the grammar. In particular, the time the CFG parser takes to compute a single argmax increases more than six-fold from the smallest to the largest training set. Second, additional rules (in particular unary rules) introduce additional features and allow the construction of larger and larger wrong trees ȳ, which means that R² = max_{i,ȳ} ‖Ψ(x_i,y_i) − Ψ(x_i,ȳ)‖² is not constant but grows. Indeed, Figure 3 shows that, consistent with Theorem 5, the number of iterations of the 1-slack
Fig. 3 Number of iterations as a function of n for the 1-slack algorithm (left) and the 1-slack algorithm with caching (right).

Fig. 4 Number of support vectors for multi-class classification (left), HMM part-of-speech tagging (middle), and CFG parsing (right) as a function of n for the n-slack algorithm, the 1-slack algorithm, and the 1-slack algorithm with caching.

algorithm is roughly constant for multi-class classification and the HMM⁷, while it grows slowly for CFG parsing. Finally, note that in Figure 3 the difference in the number of iterations of the algorithm without caching (left) and with caching (right) is small. Despite the fact that the constraint from the cache is typically not the overall most violated constraint, but only a sufficiently violated one, both versions of the algorithm appear to make similar progress in each iteration.

⁷ Note that the HMM always considers all possible rules in the regular language, so that there is no growth in the number of rules once all symbols are added.
29 Cuttig-Plae Traiig of Structural SVMs 29 Calls to Separatio Oracle e+0 e+09 e+08 e+07 e Multi-Class HMM CFG e+09 e+08 -slack -slack -slack -slack -slack -slack -slack (cache) -slack (cache) -slack (cache) O(x) e+08 O(x) e+07 O(x) Calls to Separatio Oracle e+07 e Calls to Separatio Oracle e e+06 Number of Traiig Examples Number of Traiig Examples Number of Traiig Examples Fig. 5 Number of calls to the separatio oracle for multi-class classificatio (left) HMM part-ofspeech taggig (middle) ad CFG parsig (right) as a fuctio of for the -slack algorithm, the -slack algorithm, ad the -slack algorithm with cachig. 5.5 What is the Size of the Workig Set? As already oted above, the size of the workig set ad its scalig has a substatial ifluece o the overall efficiecy of the algorithm. I particular, large (ad growig) workig sets will make it expesive to solve the quadratic programs. While the umber of iteratios is a upper boud o the workig set size for the -slack algorithm, the umber of support vectors show i Figure 4 gives a much better idea of its size, sice we are removig iactive costraits from the workig set. For the -slack algorithm, Figure 4 shows that the umber of support vectors does ot systematically grow with for ay of the problems, makig it easy to solve the workig set QPs eve for large datasets. This is very much i cotrast to the - slack algorithm, where the growig umber of support vectors makes each iteratio icreasigly costly, ad is startig to push the limits of what ca be kept i mai memory. 5.6 How ofte is the Separatio Oracle Called? Next to solvig the workig set QPs i each iteratio, computig the separatio oracle is the other major expese i each iteratio. We ow ivestigate how the umber of calls to the separatio oracle scales with, ad how this is iflueced by cachig. Figure 5 shows that for all algorithms the umber of calls scales liearly with for the multi-class problem ad the HMM. 
It is slightly super-linear for CFG parsing due to the increasing number of iterations as discussed above. For all problems and training set sizes, the 1-slack algorithm with caching requires the fewest calls. The size of the cache has surprisingly little influence on the reduction of calls to the separation oracle. Figure 6 shows that a cache of size f = 5 already provides all
Fig. 6 Number of calls to the separation oracle as a function of cache size for the 1-slack algorithm.

of the benefits, and that larger cache sizes do not further reduce the number of calls. However, we conjecture that this might be an artifact of our simple least-recently-used caching strategy, and that improved caching methods that selectively call the separation oracle for only a well-chosen subset of the examples will provide further benefits.

5.7 Are the Solutions Different?

Since the stopping criteria are different in the 1-slack and the n-slack algorithms, it remains to verify that they do indeed compute solutions of comparable effectiveness. The plot in Figure 7 shows the dual objective value of the 1-slack solution relative to the n-slack solution. A value below zero indicates that the n-slack solution has a better dual objective value, while a positive value shows by which fraction the 1-slack objective is higher than the n-slack objective. For all values of C the solutions are very close for the multi-class problem and for CFG parsing, and so are their prediction performances on the test set (see table in Figure 7). This is not surprising, since for both the 1-slack and the n-slack formulation the respective ε bounds the duality gap by Cε. For the HMM, however, this Cε is a substantial fraction of the objective value at the solution, especially for large values of C. Since the training data is almost linearly separable for the HMM, Cε becomes a substantial part of the slack contribution to the objective value. Furthermore, note the different scaling of the HMM loss (i.e., the number of misclassified tags in the sentence), which is roughly 5 times smaller than the loss function on the other problems (i.e., a 0 to 100 scale). So, an ε = 0.1 on the HMM problem is comparable to an ε = 0.5 on the other problems. Nevertheless, with a per-token test error rate of 3.29% for the 1-slack solution, the
Fig. 7 Relative difference in dual objective value of the solutions found by the 1-slack algorithm and by the n-slack algorithm as a function of C at the maximum training set size (left), and test-set prediction performance for the optimal value of C (right). [Left: y-axis (Obj_1 − Obj_n)/Obj_n; right: table with columns Task, Measure, 1-slack, n-slack and rows Multi-Class (Accuracy), HMM (Token Accuracy), CFG (Bracket F1); numeric values not recoverable.]

Fig. 8 Number of iterations for the 1-slack algorithm (left) and number of calls to the separation oracle for the 1-slack algorithm with caching (right) as a function of ε at the maximum training set size.

prediction accuracy is even slightly better than the 3.3% error rate of the n-slack solution.

5.8 How does the 1-Slack Algorithm Scale with ε?

While the scaling with n is the most important criterion from a practical perspective, it is also interesting to look at the scaling with ε. Theorem 5 shows that the number of iterations (and therefore the number of calls to the separation oracle) scales as O(1/ε) in the worst case. Figure 8, however, shows that the scaling is much better in practice. In particular, the number of calls to the separation oracle is largely independent of
Fig. 9 Number of iterations for the 1-slack algorithm (left) and number of calls to the separation oracle for the 1-slack algorithm with caching (right) as a function of C at the maximum training set size.

ε and remains constant when caching is used. It seems that the additional iterations can be done almost entirely from the cache.

5.9 How does the 1-Slack Algorithm Scale with C?

With increasing training set size, the optimal value of C will typically change (some theoretical results suggest an increase on the order of √n). In practice, finding the optimal value of C typically requires training for a large range of C values as part of a cross-validation experiment. It is therefore interesting to know how the algorithm scales with C. While Theorem 5 bounds the number of iterations by O(C), Figure 9 shows that the actual scaling is again much better. The number of iterations increases more slowly than Ω(C) on all problems. Furthermore, as already observed for ε above, the additional iterations are almost entirely based on the cache, so that C has hardly any influence on the number of calls to the separation oracle.

6 Conclusions

We presented a cutting-plane algorithm for training structural SVMs. Unlike existing cutting-plane methods for this problem, the number of constraints that are generated does not depend on the number of training examples, but only on C and the desired precision ε. Empirically, the new algorithm is substantially faster than existing methods, in particular decomposition methods like SMO and SVM-light, and it includes the training algorithm of Joachims (2006) for linear binary classification SVMs as a special case. An implementation of the algorithm is available online
with instances for multi-class classification, HMM sequence tagging, CFG parsing, and binary classification.

Acknowledgements We thank Evan Herbst for implementing a prototype of the HMM instance of SVMstruct, which was used in some of our preliminary experiments. This work was supported in part through the grant NSF IIS from the National Science Foundation and through a gift from Yahoo!.

References

Altun Y, Tsochantaridis I, Hofmann T (2003) Hidden Markov support vector machines. In: International Conference on Machine Learning (ICML), pp 3–10
Anguelov D, Taskar B, Chatalbashev V, Koller D, Gupta D, Heitz G, Ng AY (2005) Discriminative learning of Markov random fields for segmentation of 3D scan data. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society
Bartlett P, Collins M, Taskar B, McAllester D (2004) Exponentiated gradient algorithms for large-margin structured classification. In: Advances in Neural Information Processing Systems (NIPS)
Caruana R, Joachims T, Backstrom L (2004) KDDCup 2004: Results and analysis. ACM SIGKDD Newsletter 6(2):95–108
Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Collins M (2002) Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In: Empirical Methods in Natural Language Processing (EMNLP), pp 1–8
Collins M (2004) Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In: New Developments in Parsing Technology, Kluwer (paper accompanied invited talk at IWPT 2001)
Collins M, Duffy N (2002) New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In: Annual Meeting of the Association for Computational Linguistics (ACL)
Collobert R, Bengio S (2001) SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research (JMLR) 1:143–160
Cortes C, Vapnik VN (1995) Support vector networks.
Machine Learning 20:273–297
Crammer K, Singer Y (2001) On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR) 2:265–292
Crammer K, Singer Y (2003) Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research (JMLR) 3:951–991
Ferris M, Munson T (2003) Interior-point methods for massive support vector machines. SIAM Journal on Optimization 13(3)
Fukumizu K, Bach F, Jordan M (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research (JMLR) 5:73–99
Fung G, Mangasarian O (2001) Proximal support vector classifiers. In: ACM Conference on Knowledge Discovery and Data Mining (KDD)
Globerson A, Koo TY, Carreras X, Collins M (2007) Exponentiated gradient algorithms for log-linear structured prediction. In: International Conference on Machine Learning (ICML)
Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, MA, chap 11
Joachims T (2003) Learning to align sequences: A maximum-margin approach, online manuscript
Joachims T (2005) A support vector method for multivariate performance measures. In: International Conference on Machine Learning (ICML)
Joachims T (2006) Training linear SVMs in linear time. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
Johnson M (1998) PCFG models of linguistic tree representations. Computational Linguistics 24(4):613–632
Keerthi S, DeCoste D (2005) A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research (JMLR) 6:341–361
Keerthi S, Chapelle O, DeCoste D (2006) Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research (JMLR) 7
Kivinen J, Warmuth MK (1997) Exponentiated gradient versus gradient descent for linear predictors. Information and Computation 132(1):1–63
Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: International Conference on Machine Learning (ICML)
Lewis D, Yang Y, Rose T, Li F (2004) RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR) 5
Mangasarian O, Musicant D (2001) Lagrangian support vector machines.
Journal of Machine Learning Research (JMLR) 1:161–177
Marcus M, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330
McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Annual Meeting of the Association for Computational Linguistics (ACL), pp 91–98
Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel Methods – Support Vector Learning, MIT Press, chap 12
Ratliff ND, Bagnell JA, Zinkevich MA (2007) (Online) subgradient methods for structured prediction. In: Conference on Artificial Intelligence and Statistics (AISTATS)
Shalev-Shwartz S, Singer Y, Srebro N (2007) PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM. In: International Conference on Machine Learning (ICML), ACM
Smola A, Schölkopf B (2000) Sparse greedy matrix approximation for machine learning. In: International Conference on Machine Learning (ICML), pp 911–918
Taskar B, Guestrin C, Koller D (2003) Maximum-margin Markov networks. In: Advances in Neural Information Processing Systems (NIPS)
Taskar B, Klein D, Collins M, Koller D, Manning C (2004) Max-margin parsing. In: Empirical Methods in Natural Language Processing (EMNLP)
Taskar B, Lacoste-Julien S, Jordan MI (2005) Structured prediction via the extragradient method. In: Advances in Neural Information Processing Systems (NIPS)
Teo CH, Smola A, Vishwanathan SV, Le QV (2007) A scalable modular convex solver for regularized risk minimization. In: ACM Conference on Knowledge Discovery and Data Mining (KDD)
Tsochantaridis I, Hofmann T, Joachims T, Altun Y (2004) Support vector machine learning for interdependent and structured output spaces. In: International Conference on Machine Learning (ICML), pp 104–112
Tsochantaridis I, Joachims T, Hofmann T, Altun Y (2005) Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR) 6:1453–1484
Vapnik V (1998) Statistical Learning Theory. Wiley, Chichester, GB
Vishwanathan SVN, Schraudolph NN, Schmidt MW, Murphy KP (2006) Accelerated training of conditional random fields with stochastic gradient methods. In: International Conference on Machine Learning (ICML)
Yu CN, Joachims T, Elber R, Pillardy J (2007) Support vector training of protein alignment models. In: Proceedings of the International Conference on Research in Computational Molecular Biology (RECOMB)
Yue Y, Finley T, Radlinski F, Joachims T (2007) A support vector method for optimizing average precision.
In: ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

Appendix

Lemma 1. The Wolfe-dual of the 1-slack optimization problem OP4 for margin-rescaling is

\[ \max_{\alpha \ge 0} D(\alpha) = \sum_{\bar y \in \bar{\mathcal Y}} \Delta(\bar y)\,\alpha_{\bar y} - \frac{1}{2} \sum_{\bar y, \bar y' \in \bar{\mathcal Y}} \alpha_{\bar y}\,\alpha_{\bar y'}\, H_{MR}(\bar y, \bar y') \quad \text{s.t.} \quad \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} = C \]

and the Wolfe-dual of the 1-slack optimization problem OP5 for slack-rescaling is
\[ \max_{\alpha \ge 0} D(\alpha) = \sum_{\bar y \in \bar{\mathcal Y}} \Delta(\bar y)\,\alpha_{\bar y} - \frac{1}{2} \sum_{\bar y, \bar y' \in \bar{\mathcal Y}} \alpha_{\bar y}\,\alpha_{\bar y'}\, H_{SR}(\bar y, \bar y') \quad \text{s.t.} \quad \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} = C \]

Proof. The Lagrangian of OP4 is

\[ L(w, \xi, \alpha) = \frac{1}{2} w^T w + C\xi + \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} \left[ \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar y_i) - \xi - \frac{1}{n}\, w^T \sum_{i=1}^{n} \big[ \Psi(x_i, y_i) - \Psi(x_i, \bar y_i) \big] \right]. \]

Differentiating with respect to w and setting the derivative to zero gives

\[ w = \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} \left( \frac{1}{n} \sum_{i=1}^{n} \big[ \Psi(x_i, y_i) - \Psi(x_i, \bar y_i) \big] \right). \]

Similarly, differentiating with respect to ξ and setting the derivative to zero gives

\[ \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} = C. \]

Plugging w into the Lagrangian with the constraints on α we obtain the dual problem:

\[ \max_{\alpha} \; \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} \left[ \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar y_i) \right] - \frac{1}{2} \sum_{\bar y, \bar y' \in \bar{\mathcal Y}} \alpha_{\bar y}\,\alpha_{\bar y'}\, \frac{1}{n^2} \left[ \sum_{i=1}^{n} \big[ \Psi(x_i, y_i) - \Psi(x_i, \bar y_i) \big] \right]^T \left[ \sum_{j=1}^{n} \big[ \Psi(x_j, y_j) - \Psi(x_j, \bar y'_j) \big] \right] \]
\[ \text{s.t.} \quad \sum_{\bar y \in \bar{\mathcal Y}} \alpha_{\bar y} = C \quad \text{and} \quad \forall \bar y \in \bar{\mathcal Y}: \; \alpha_{\bar y} \ge 0 \]

The derivation of the dual of OP5 is analogous.

Lemma 2. For any unconstrained quadratic program

\[ \max_{\alpha \in \mathbb{R}^n} \{\Theta(\alpha)\} < \infty, \qquad \Theta(\alpha) = h^T \alpha - \frac{1}{2} \alpha^T H \alpha \tag{32} \]

with positive semi-definite H, and derivative ∇Θ(α) = h − Hα, a line search starting at α along an ascent direction η with maximum step-size C > 0 improves the objective by at least

\[ \max_{0 \le \beta \le C} \{\Theta(\alpha + \beta\eta)\} - \Theta(\alpha) \;\ge\; \frac{1}{2} \min\left\{ C,\; \frac{\nabla\Theta(\alpha)^T \eta}{\eta^T H \eta} \right\} \nabla\Theta(\alpha)^T \eta. \tag{33} \]

Proof. For any β and η, it is easy to verify that

\[ \Theta(\alpha + \beta\eta) - \Theta(\alpha) = \beta \left[ \nabla\Theta(\alpha)^T \eta \right] - \frac{1}{2} \beta^2\, \eta^T H \eta. \tag{34} \]
Maximizing this expression with respect to an unconstrained β by setting the derivative to zero, the solution β* is

\[ \beta^* = \frac{\nabla\Theta(\alpha)^T \eta}{\eta^T H \eta}. \tag{35} \]

Note that η^T H η is non-negative, since H is positive semi-definite. Furthermore, η^T H η ≠ 0, since otherwise η being an ascent direction would contradict max_{α∈ℝⁿ} {Θ(α)} < ∞. Plugging β* into (34) shows that

\[ \max_{\beta \in \mathbb{R}} \{\Theta(\alpha + \beta\eta)\} - \Theta(\alpha) = \frac{1}{2}\, \frac{\left( \nabla\Theta(\alpha)^T \eta \right)^2}{\eta^T H \eta}. \tag{36} \]

It remains to check whether the unconstrained solution β* fulfills the constraints 0 ≤ β ≤ C. Since η is an ascent direction, β* is always non-negative. But one needs to consider the case that β* > C, which happens when ∇Θ(α)^T η > C η^T H η. In that case, the constrained optimum is at β = C due to convexity. Plugging C into (34) shows that

\[ \max_{0 \le \beta \le C} \{\Theta(\alpha + \beta\eta)\} - \Theta(\alpha) = C\, \nabla\Theta(\alpha)^T \eta - \frac{1}{2} C^2\, \eta^T H \eta \tag{37} \]
\[ \ge \frac{1}{2}\, C\, \nabla\Theta(\alpha)^T \eta. \tag{38} \]

The inequality follows from C ≤ ∇Θ(α)^T η / (η^T H η).
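The line-search guarantee of Lemma 2 is easy to sanity-check numerically. The sketch below (our own illustration, not part of the chapter) draws random positive semi-definite quadratics Θ(α) = hᵀα − ½αᵀHα, takes the gradient as the ascent direction η, performs the exact line search over 0 ≤ β ≤ C, and asserts that the achieved improvement is at least the bound (33):

```python
import random

def check_line_search_bound(n=5, C=0.5, trials=200, seed=0):
    """Empirically verify Lemma 2's bound (33) on random PSD quadratics."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Random PSD H = M^T M, random h and starting point alpha.
        M = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
        H = [[sum(M[k][i] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
        h = [rng.uniform(-1, 1) for _ in range(n)]
        alpha = [rng.uniform(-1, 1) for _ in range(n)]
        # Gradient of Theta at alpha: grad = h - H alpha.
        grad = [h[i] - sum(H[i][j] * alpha[j] for j in range(n))
                for i in range(n)]
        eta = grad                      # the gradient is an ascent direction
        gTe = sum(g * e for g, e in zip(grad, eta))
        eHe = sum(eta[i] * H[i][j] * eta[j]
                  for i in range(n) for j in range(n))
        if gTe <= 1e-12 or eHe <= 1e-12:
            continue                    # degenerate draw, skip
        # Exact constrained maximizer of the concave quadratic in beta (34).
        beta = min(C, gTe / eHe)
        improvement = beta * gTe - 0.5 * beta * beta * eHe   # identity (34)
        bound = 0.5 * min(C, gTe / eHe) * gTe                # bound (33)
        assert improvement >= bound - 1e-9
    return True
```

When β* = ∇Θ(α)ᵀη / (ηᵀHη) ≤ C the two sides coincide exactly, and when β* > C the achieved improvement strictly exceeds the bound, matching the two cases in the proof.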
Hypothesis testing. Null and alternative hypotheses
Hypothesis testig Aother importat use of samplig distributios is to test hypotheses about populatio parameters, e.g. mea, proportio, regressio coefficiets, etc. For example, it is possible to stipulate
THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction
THE ARITHMETIC OF INTEGERS - multiplicatio, expoetiatio, divisio, additio, ad subtractio What to do ad what ot to do. THE INTEGERS Recall that a iteger is oe of the whole umbers, which may be either positive,
CME 302: NUMERICAL LINEAR ALGEBRA FALL 2005/06 LECTURE 8
CME 30: NUMERICAL LINEAR ALGEBRA FALL 005/06 LECTURE 8 GENE H GOLUB 1 Positive Defiite Matrices A matrix A is positive defiite if x Ax > 0 for all ozero x A positive defiite matrix has real ad positive
Stackelberg Games for Adversarial Prediction Problems
Stackelberg Games for Adversarial Predictio Problems Michael Brücker Departmet of Computer Sciece Uiversity of Potsdam, Germay [email protected] Tobias Scheffer Departmet of Computer Sciece Uiversity
A Faster Clause-Shortening Algorithm for SAT with No Restriction on Clause Length
Joural o Satisfiability, Boolea Modelig ad Computatio 1 2005) 49-60 A Faster Clause-Shorteig Algorithm for SAT with No Restrictio o Clause Legth Evgey Datsi Alexader Wolpert Departmet of Computer Sciece
Overview of some probability distributions.
Lecture Overview of some probability distributios. I this lecture we will review several commo distributios that will be used ofte throughtout the class. Each distributio is usually described by its probability
MTO-MTS Production Systems in Supply Chains
NSF GRANT #0092854 NSF PROGRAM NAME: MES/OR MTO-MTS Productio Systems i Supply Chais Philip M. Kamisky Uiversity of Califoria, Berkeley Our Kaya Uiversity of Califoria, Berkeley Abstract: Icreasig cost
Basic Elements of Arithmetic Sequences and Series
MA40S PRE-CALCULUS UNIT G GEOMETRIC SEQUENCES CLASS NOTES (COMPLETED NO NEED TO COPY NOTES FROM OVERHEAD) Basic Elemets of Arithmetic Sequeces ad Series Objective: To establish basic elemets of arithmetic
AP Calculus AB 2006 Scoring Guidelines Form B
AP Calculus AB 6 Scorig Guidelies Form B The College Board: Coectig Studets to College Success The College Board is a ot-for-profit membership associatio whose missio is to coect studets to college success
Lecture 4: Cauchy sequences, Bolzano-Weierstrass, and the Squeeze theorem
Lecture 4: Cauchy sequeces, Bolzao-Weierstrass, ad the Squeeze theorem The purpose of this lecture is more modest tha the previous oes. It is to state certai coditios uder which we are guarateed that limits
Infinite Sequences and Series
CHAPTER 4 Ifiite Sequeces ad Series 4.1. Sequeces A sequece is a ifiite ordered list of umbers, for example the sequece of odd positive itegers: 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29...
Lesson 17 Pearson s Correlation Coefficient
Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig
Chair for Network Architectures and Services Institute of Informatics TU München Prof. Carle. Network Security. Chapter 2 Basics
Chair for Network Architectures ad Services Istitute of Iformatics TU Müche Prof. Carle Network Security Chapter 2 Basics 2.4 Radom Number Geeratio for Cryptographic Protocols Motivatio It is crucial to
ODBC. Getting Started With Sage Timberline Office ODBC
ODBC Gettig Started With Sage Timberlie Office ODBC NOTICE This documet ad the Sage Timberlie Office software may be used oly i accordace with the accompayig Sage Timberlie Office Ed User Licese Agreemet.
THE HEIGHT OF q-binary SEARCH TREES
THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average
Research Method (I) --Knowledge on Sampling (Simple Random Sampling)
Research Method (I) --Kowledge o Samplig (Simple Radom Samplig) 1. Itroductio to samplig 1.1 Defiitio of samplig Samplig ca be defied as selectig part of the elemets i a populatio. It results i the fact
A Constant-Factor Approximation Algorithm for the Link Building Problem
A Costat-Factor Approximatio Algorithm for the Lik Buildig Problem Marti Olse 1, Aastasios Viglas 2, ad Ilia Zvedeiouk 2 1 Ceter for Iovatio ad Busiess Developmet, Istitute of Busiess ad Techology, Aarhus
Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
Joural of Machie Learig Research 8 (2007) 1027-1061 Submitted 3/06; Revised 12/06; Published 5/07 Dimesioality Reductio of Multimodal Labeled Data by Local Fisher Discrimiat Aalysis Masashi Sugiyama Departmet
Section 11.3: The Integral Test
Sectio.3: The Itegral Test Most of the series we have looked at have either diverged or have coverged ad we have bee able to fid what they coverge to. I geeral however, the problem is much more difficult
Spam Detection. A Bayesian approach to filtering spam
Spam Detectio A Bayesia approach to filterig spam Kual Mehrotra Shailedra Watave Abstract The ever icreasig meace of spam is brigig dow productivity. More tha 70% of the email messages are spam, ad it
Partial Di erential Equations
Partial Di eretial Equatios Partial Di eretial Equatios Much of moder sciece, egieerig, ad mathematics is based o the study of partial di eretial equatios, where a partial di eretial equatio is a equatio
Groups of diverse problem solvers can outperform groups of high-ability problem solvers
Groups of diverse problem solvers ca outperform groups of high-ability problem solvers Lu Hog ad Scott E. Page Michiga Busiess School ad Complex Systems, Uiversity of Michiga, A Arbor, MI 48109-1234; ad
