ESANN 2010 proceedings, European Symposium on Artificial Neural Networks - Computational Intelligence and Machine Learning. Bruges (Belgium), 28-30 April 2010, d-side publ., ISBN 2-930307-10-2.

Least 1-Norm SVMs: a New SVM Variant between Standard and LS-SVMs

Jorge López and José R. Dorronsoro
Universidad Autónoma de Madrid
Departamento de Ingeniería Informática and Instituto de Ingeniería del Conocimiento
C/ Francisco Tomás y Valiente, 11, 28049 Madrid, Spain

(With partial support of Spain's TIN 2007-66862 project and Cátedra IIC en Modelado y Predicción. The first author is kindly supported by FPU-MICINN grant reference AP2007.)

Abstract. Least Squares Support Vector Machines (LS-SVMs) were proposed by replacing the inequality constraints inherent to L1-SVMs with equality constraints. So far this idea has only been suggested for a least squares (L2) loss. We describe how this can also be done for the sum-of-slacks (L1) loss, yielding a new classifier (Least 1-Norm SVMs) which gives similar models in terms of complexity and accuracy and that may also be more robust than LS-SVMs with respect to outliers.

1 Introduction

Assuming a binary classification context, we have a sample of $N$ pre-classified patterns $\{X_i, y_i\}$, $i = 1, \dots, N$, where the outputs $y_i \in \{+1, -1\}$. If we further assume linear inseparability and consider slack variables to allow for misclassifications, the primal of an LS-SVM [1] is

$$\min_{W,b,\xi} \; \frac{1}{2}\|W\|^2 + \frac{C}{2}\sum_i \xi_i^2 \quad \text{s.t.} \quad y_i(W \cdot \Phi(X_i) + b) = 1 - \xi_i, \qquad (1)$$

where $\cdot$ denotes inner product and $\Phi(X_i)$ is the image of $X_i$ in the feature space with feature map $\Phi(\cdot)$. The corresponding dual is

$$\min_\alpha \; \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j \tilde{K}_{ij} - \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i y_i\alpha_i = 0, \qquad (2)$$

with the modified kernel $\tilde{K}_{ij} = k(X_i, X_j) + \delta_{ij}/C$, $\delta_{ij}$ standing for Kronecker's delta symbol and $k(X_i, X_j) = \Phi(X_i)\cdot\Phi(X_j)$ the original kernel.

LS-SVMs were originally derived in [1] from the so-called L1-SVMs [2], whose primal changes (1) in three aspects: 1) the objective function uses the L1 loss $C\sum_i \xi_i$ instead of the L2 loss, 2) the equality constraints become inequality ones, and 3) there is the additional requirement that $\xi_i \geq 0$. In turn, L2-SVMs, also described in [2], lie somewhere in between, since their primal is identical to (1) but with the equality constraints still transformed into inequality ones. To our knowledge, there is no current classifier that combines equality constraints with the L1 loss.
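Because (2) has only an equality constraint, training an LS-SVM amounts to solving one linear system [1]. As a concrete point of reference for the variants discussed below, here is a minimal NumPy sketch of that solve; it is an illustration on our part, assuming a precomputed kernel matrix K and labels y in {+1, -1} (the function name is ours).

    import numpy as np

    def lssvm_fit(K, y, C):
        """Solve the LS-SVM dual (2) through its KKT linear system."""
        N = K.shape[0]
        K_mod = K + np.eye(N) / C              # modified kernel of Eq. (2)
        Omega = np.outer(y, y) * K_mod         # y_i y_j K~_ij
        A = np.zeros((N + 1, N + 1))
        A[0, 1:] = y                           # row enforcing sum_i y_i alpha_i = 0
        A[1:, 0] = y                           # column carrying the bias b
        A[1:, 1:] = Omega
        rhs = np.concatenate(([0.0], np.ones(N)))
        sol = np.linalg.solve(A, rhs)
        return sol[1:], sol[0]                 # multipliers alpha, bias b

Listing 1: A minimal LS-SVM trainer based on the linear system behind (2). The resulting decision function is $f(x) = \sum_i \alpha_i y_i k(X_i, x) + b$.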
                            Squared slacks    Slacks
    Inequality constraints  L2-SVMs           L1-SVMs
    Equality constraints    LS-SVMs           ?

Table 1: Types of SVMs according to how slacks and constraints are treated.

It is desirable to fill this gap mainly because of two facts: 1) in practice L1-SVMs and the L1 loss have become the standard, and 2) the influence of a given pattern (i.e. the value of its coefficient $\alpha_i$) in the model is not bounded when using the L2 loss, so L2- and LS-SVMs are more sensitive to outliers than L1-SVMs.

The central idea of this work is to simplify L1-SVMs similarly to LS-SVMs, but keeping the L1 loss, giving rise to the so-called Least 1-Norm SVMs, which fill the gap above and are expected to preserve the robustness to outliers. The rest of the paper is organized as follows: in Section 2 we give the primal and dual of Least 1-Norm SVMs and briefly discuss their KKT optimality conditions. Section 3 explains how the popular SMO algorithm can be adapted to solve the Least 1-Norm dual. Section 4 reports some experiments that illustrate how they can be more robust to outliers than LS-SVMs while being as accurate, and discusses the varied convergence speeds observed. Finally, Section 5 gives pointers to possible future extensions.

2 Least 1-Norm SVMs

In order to simplify the L1-SVM primal, one may think that it suffices to force equality constraints $y_i(W\cdot\Phi(X_i)+b) = 1 - \xi_i$ while keeping the inherent requirement $\xi_i \geq 0$. However, this is not correct, because it implies that slacks are only allowed in one direction, which is obviously not convenient. Therefore, we propose to remove the constraints $\xi_i \geq 0$ and minimize the 1-norm of the slack vector, which gives the Least 1-Norm SVM primal

$$\min_{W,b,\xi} \; \frac{1}{2}\|W\|^2 + C\sum_i |\xi_i| \quad \text{s.t.} \quad y_i(W\cdot\Phi(X_i)+b) = 1 - \xi_i. \qquad (3)$$

Now we use the standard cast of 1-norm problems as Linear Programming problems [3]: minimizing (3) can be reformulated as

$$\min_{W,b,t} \; \frac{1}{2}\|W\|^2 + C\sum_i t_i \quad \text{s.t.} \quad -t_i \leq 1 - y_i(W\cdot\Phi(X_i)+b) \leq t_i. \qquad (4)$$

Note that (4) transforms the desired equalities of (3) into inequalities; this is needed because otherwise the objective function is not differentiable. Using standard Lagrangian theory with (4) and denoting by $\beta_i$ ($\gamma_i$) the multipliers associated with the $-t_i$ ($+t_i$) constraints, we obtain the following dual, where $\alpha_i = \gamma_i - \beta_i$:

$$\min_\alpha \; \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j y_i y_j K_{ij} - \sum_i \alpha_i \quad \text{s.t.} \quad \sum_i y_i\alpha_i = 0, \;\; -C \leq \alpha_i \leq C, \qquad (5)$$

which happens to be identical to the L1-SVM dual but with the lower bound $-C$ instead of $0$, so that negative values are allowed, as in LS-SVMs.
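For small problems, (5) can be prototyped directly with a generic convex solver before committing to a specialized method. The following sketch is ours, not part of the formulation above; it uses cvxpy, assumes a precomputed kernel matrix K, and wraps the quadratic form with psd_wrap (available in recent cvxpy versions) to assert positive semidefiniteness of $Q_{ij} = y_i y_j K_{ij}$.

    import numpy as np
    import cvxpy as cp

    def least1norm_fit(K, y, C):
        """Solve the Least 1-Norm dual (5) with a generic QP solver."""
        N = K.shape[0]
        Q = np.outer(y, y) * K                   # Q_ij = y_i y_j K_ij
        alpha = cp.Variable(N)
        obj = cp.Minimize(0.5 * cp.quad_form(alpha, cp.psd_wrap(Q))
                          - cp.sum(alpha))
        cons = [y @ alpha == 0,                  # sum_i y_i alpha_i = 0
                alpha >= -C, alpha <= C]         # box [-C, C] instead of [0, C]
        cp.Problem(obj, cons).solve()
        return alpha.value

Listing 2: Prototyping the Least 1-Norm dual (5); the only change with respect to an L1-SVM dual solver is the lower bound $-C$.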
Since all the formulations above are convex with affine constraints, the KKT optimality conditions are necessary and sufficient for optimality [3]. The KKT conditions for (4) are analogous to the well-known ones for L1-SVMs, just substituting the lower bound $-C$ for $0$, which yields

$$y_i(W\cdot\Phi(X_i)+b) = 1 \;\text{ if }\; -C < \alpha_i < C, \quad y_i(W\cdot\Phi(X_i)+b) \leq 1 \;\text{ if }\; \alpha_i = C, \quad y_i(W\cdot\Phi(X_i)+b) \geq 1 \;\text{ if }\; \alpha_i = -C, \qquad (6)$$

together with the dual constraints $W = \sum_i \alpha_i y_i \Phi(X_i)$ and $\sum_i y_i\alpha_i = 0$. These are common to LS-SVMs, whose only primal KKT condition [1] is

$$y_i(W\cdot\Phi(X_i)+b) = 1 - \alpha_i/C, \qquad (7)$$

which shows why LS-SVMs are very sensitive to outliers: outliers are characterized by a large $\xi_i$, which in view of (7) and (1) implies a large $\alpha_i$. On the other hand, in Least 1-Norm SVMs this influence is limited because $|\alpha_i| \leq C$. It also shows another drawback of LS-SVMs: they are not sparse, because $\alpha_i = C\xi_i$, so a pattern takes part in the model whenever $\xi_i \neq 0$, which is almost certain to happen. Observe that this is also the case for Least 1-Norm SVMs, since $\alpha_i = 0$ implies that $y_i(W\cdot\Phi(X_i)+b)$ is exactly $1$, so they are not likely to be sparse either. L1-SVMs are indeed sparse because, instead of $-C$, patterns with $y_i(W\cdot\Phi(X_i)+b) > 1$ are assigned $\alpha_i = 0$.

3 Least 1-Norm SMO

We will adapt SMO for Least 1-Norm SVMs based on a maximum-gain viewpoint (for more details see [4]). In general, SMO performs updates of the form $W' = W + \delta_L y_L X_L + \delta_U y_U X_U$. The constraint $\sum_i y_i\alpha_i = 0$ implies $\delta_U y_U = -\delta_L y_L$ and the updates become $W' = W + \delta y_L (X_L - X_U)$, where we write $\delta = \delta_L$ and, hence, $\delta_U = -y_U y_L \delta$. As a consequence, the multiplier updates are $\alpha'_L = \alpha_L + \delta$, $\alpha'_U = \alpha_U - y_U y_L \delta$, and $\alpha'_j = \alpha_j$ for any other $j$. Therefore, denoting the dual function in (5) as $D(\alpha)$, the optimal value of $D(\alpha')$ over $\delta$ can be written as

$$D(\alpha') = D(\alpha) - \frac{(\Delta_{U,L})^2}{2\|Z_{L,U}\|^2},$$

where we write $\Delta_{U,L} = W\cdot(X_U - X_L) - (y_U - y_L)$ and $Z_{L,U} = X_L - X_U$. Ignoring the denominator, we can approximately maximize the gain in $D(\alpha')$ by choosing $L = \arg\min_j\{W\cdot X_j - y_j\}$ and $U = \arg\max_j\{W\cdot X_j - y_j\}$, so that the violation extent $\Delta_{U,L}$ is largest. Writing $\Delta = \Delta_{U,L}$ and $\lambda = \Delta/\|Z_{U,L}\|^2$, we then have $\Delta > 0$, $\lambda > 0$, $\delta = y_L\lambda$, and the $\alpha$ updates become $\alpha'_L = \alpha_L + y_L\lambda$, $\alpha'_U = \alpha_U - y_U\lambda$. Thus, $\alpha_L$ or $\alpha_U$ will decrease if $y_L = -1$ or $y_U = 1$, which requires the corresponding $\alpha_L$ and $\alpha_U$ to be greater than $-C$. In turn, they will increase if $y_L = 1$ or $y_U = -1$, which requires the corresponding $\alpha_L$ and $\alpha_U$ to be less than $C$. Hence, we must replace the previous $L$, $U$ choices with

$$L = \arg\min_j\{W\cdot X_j - y_j : j \in I_L\}, \quad U = \arg\max_j\{W\cdot X_j - y_j : j \in I_U\}, \qquad (8)$$

where we use the notations $I_U = \{i : (y_i = 1, \alpha_i > -C) \vee (y_i = -1, \alpha_i < C)\}$ and $I_L = \{i : (y_i = 1, \alpha_i < C) \vee (y_i = -1, \alpha_i > -C)\}$. Moreover, to make sure that $\alpha_L$ and $\alpha_U$ then remain in the interval $[-C, C]$, we may have to clip $\lambda$ with

$$\lambda^* = \min\{\lambda, \; C - y_L\alpha_L, \; C + y_U\alpha_U\}. \qquad (9)$$
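The following sketch is our illustration of one such update for a linear kernel, matching the $X$-notation of this section (a kernelized version would track the outputs $F_i = \sum_j \alpha_j y_j K_{ij}$ instead of $W$). It selects the pair according to (8), clips the step as in (9), and returns the violation extent $\Delta$, so a caller can iterate until $\Delta$ falls below a tolerance. It assumes that alpha and w are kept consistent, i.e. $W = \sum_i \alpha_i y_i X_i$, which holds when starting from alpha = 0 and w = 0.

    import numpy as np

    def least1norm_smo_step(X, y, alpha, w, C):
        """One Least 1-Norm SMO update; returns the KKT violation."""
        v = X @ w - y                                # v_j = W . X_j - y_j
        I_U = ((y == 1) & (alpha > -C)) | ((y == -1) & (alpha < C))
        I_L = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > -C))
        idx = np.arange(len(y))
        U = idx[I_U][np.argmax(v[I_U])]              # selection of Eq. (8)
        L = idx[I_L][np.argmin(v[I_L])]
        delta = v[U] - v[L]                          # violation extent Delta_{U,L}
        if delta <= 0.0:                             # KKT-optimal: nothing to do
            return 0.0
        z = X[L] - X[U]                              # Z_{L,U}; assumes X[L] != X[U]
        lam = delta / (z @ z)                        # lambda = Delta / ||Z||^2
        lam = min(lam, C - y[L] * alpha[L],          # clipping of Eq. (9)
                  C + y[U] * alpha[U])
        alpha[L] += y[L] * lam                       # alpha'_L = alpha_L + y_L lambda
        alpha[U] -= y[U] * lam                       # alpha'_U = alpha_U - y_U lambda
        w += lam * z                                 # W' = W + delta y_L (X_L - X_U)
        return delta

Listing 3: One Least 1-Norm SMO iteration for a linear kernel. Repeated calls until the returned violation drops below $\epsilon$ reproduce the stopping rule (11) below.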
4 Numerical Experiments

In this section we illustrate empirically how the Least 1-Norm SVM may be more robust to outliers than its LS-SVM counterpart, as well as its good generalization properties. The training algorithm is SMO: the Least 1-Norm variant explained above and the LS-SVM version in [5]. The stopping criterion is the final KKT violation, specifically when it is less than $\epsilon = 10^{-3}$. For LS-SVMs this means

$$\max_i\{\tilde{W}\cdot\tilde{\Phi}(X_i) - y_i\} - \min_i\{\tilde{W}\cdot\tilde{\Phi}(X_i) - y_i\} \leq \epsilon, \qquad (10)$$

where the tilde indicates that we use the modified kernel $\tilde{k}$ as in (2). For Least 1-Norm SVMs, it means

$$\max_{i \in I_U}\{W\cdot\Phi(X_i) - y_i\} - \min_{i \in I_L}\{W\cdot\Phi(X_i) - y_i\} \leq \epsilon. \qquad (11)$$

The derivation of these KKT-based criteria is given in [6] for LS-SVMs and in [7] for L1-SVMs.
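In kernelized form, both stopping quantities can be computed from the vector of outputs $F_i = \sum_j \alpha_j y_j K_{ij}$. A sketch of ours, assuming precomputed kernel matrices and the index sets of (8):

    import numpy as np

    def lssvm_gap(K, y, alpha, C):
        """Left-hand side of (10), using the modified kernel of (2)."""
        F = (K + np.eye(len(y)) / C) @ (alpha * y)   # F_i = W~ . Phi~(X_i)
        return np.max(F - y) - np.min(F - y)

    def least1norm_gap(K, y, alpha, C):
        """Left-hand side of (11), restricted to the sets I_U and I_L."""
        v = K @ (alpha * y) - y                      # W . Phi(X_i) - y_i
        I_U = ((y == 1) & (alpha > -C)) | ((y == -1) & (alpha < C))
        I_L = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > -C))
        return np.max(v[I_U]) - np.min(v[I_L])

Listing 4: The two KKT-based stopping quantities; training stops when they fall below $\epsilon = 10^{-3}$.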
Firstly, to show generalization, we take datasets from [8], with 100 training/test splits each, and compare the performance of Least 1-Norm and LS-SVMs. We use the RBF kernel $k(X_i, X_j) = \exp(-\|X_i - X_j\|^2/\sigma^2)$. The values of the hyperparameters $C$ and $\sigma$ are sought with a grid in logarithmic scale, and each point of the grid is evaluated with a 10-times-10-fold cross-validation over the whole dataset. We report in Table 2 the accuracy and number of support vectors obtained in the final models, as well as the number of iterations needed by the corresponding SMO version to stop.

             LS-SVM                              Least 1-Norm SVM
             % err.     #SV         #It.         % err.     #SV         #It.
    Titanic  .±.        5.±.        39.±.7       .±.        7.±9.8      53.7±6.
    Heart    5.6±3.     69.9±.3     6.3±7.8      5.6±3.5    69.9±.3     .±.7
    Cancer   5.7±.5     99.8±.5     .6±7.9       5.9±.5     95.9±.3     35.7±7.9
    German   3.3±.      699.±.8     339.8±38.    3.3±.      699.9±.3    667.±9.3

Table 2: Average accuracies, numbers of support vectors and numbers of iterations obtained by a Least 1-Norm SVM and an LS-SVM.

It can be seen that the accuracies obtained are similar for both kinds of SVM, and also similar to the ones reported in [8] for an L1-SVM. Regarding the number of support vectors, as expected, none of the models is sparse, except the Least 1-Norm SVM for the Titanic dataset, which we think is due to the existence of identical points with different labels. Finally, the number of iterations is somewhat puzzling: sometimes the LS-SVM is remarkably faster and sometimes the Least 1-Norm SVM is. This of course depends on the hyperparameters chosen, but their exact influence is not clear. Care must also be taken, since (10) and (11), though formally similar, may require quite different numbers of iterations, as the $W$ vectors are different. Further study is clearly needed to better characterize the convergence speed in each case.

Secondly, to show robustness, we use the toy bidimensional problem depicted in Fig. 1. The positive class patterns are drawn from a normal distribution with mean $(2, 2)$, whereas the negative class has mean $(5, 2)$; in both cases the covariance matrix is the unit one.

[Figure: four contour plots — (a) LS-SVM without outliers; (b) Least 1-Norm SVM without outliers; (c) LS-SVM with outliers; (d) Least 1-Norm SVM with outliers.]

Fig. 1: Contours of the function $W\cdot\Phi(X)+b$ for a toy problem trained with an LS-SVM (left) and a Least 1-Norm SVM (right). Top: original problem. Bottom: modified problem with one outlier for each class.

In the top part of the figure we train an LS-SVM (a) and a Least 1-Norm SVM (b) on this training set, which is linearly separable, with no specific kernel (just the inner product). Note that the final hyperplanes are very similar and the support hyperplanes traverse their corresponding clouds of points. In the bottom part of the figure we introduce two outliers by switching the class labels of two points, so that the classes are no longer linearly separable, and train again the LS-SVM (c) and the Least 1-Norm SVM (d). Observe that the final LS-SVM hyperplane has remarkably changed its orientation because of the outliers' influence, whereas the Least 1-Norm one changes much less, because their influence is limited.
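For completeness, a data-generation sketch for this toy setting; it is ours, the per-class sample size is an arbitrary assumption, and the means follow the description above.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100                                        # patterns per class (assumed)
    pos = rng.normal((2.0, 2.0), 1.0, (n, 2))      # positive class: mean (2, 2), unit covariance
    neg = rng.normal((5.0, 2.0), 1.0, (n, 2))      # negative class: mean (5, 2), unit covariance
    X = np.vstack((pos, neg))
    y = np.concatenate((np.ones(n), -np.ones(n)))  # labels for panels (a)-(b)
    y_out = y.copy()
    y_out[0], y_out[n] = -1.0, 1.0                 # flip one label per class: the two outliers of (c)-(d)

Listing 5: Generating the two Gaussian clouds of Fig. 1, with and without the two label-flip outliers.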
5 Conclusions and further work

In this work we have presented Least 1-Norm SVMs, a new SVM classifier. As LS-SVMs did with L2-SVMs, they are derived by substituting equality for inequality constraints in the primal. The resulting dual is almost identical to the L1 one, with box constraints $[-C, C]$ in lieu of $[0, C]$. This implies that the outliers' influence is also limited, but sparsity is lost, because now the points for which $y_i(W\cdot\Phi(X_i)+b) > 1$ are assigned $\alpha_i = -C$ instead of zero. We have also seen how the new model can be trained with an adaptation of the well-known SMO algorithm, giving models with similar test accuracies. Which particular SVM variant converges faster seems to be problem- and parameter-dependent.

As a possible future extension, the training phase of Least 1-Norm SVMs could be accelerated by making use of the 2nd order variant of the SMO algorithm, as was done for L1-SVMs in [7]; note, however, that this method has been shown not to always accelerate LS-SVM training [6]. As mentioned above, the convergence properties of SMO for Least 1-Norm SVMs will be further studied.

References

[1] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9(3):293-300, 1999.

[2] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] J. López, Á. Barbero, and J. R. Dorronsoro. On the Equivalence of the SMO and MDM Algorithms for SVM Training. In Lecture Notes in Computer Science: Machine Learning and Knowledge Discovery in Databases, volume 5211, pages 288-300. Springer, 2008.

[5] S. S. Keerthi and S. K. Shevade. SMO Algorithm for Least-Squares SVM Formulations. Neural Computation, 15(2):487-507, 2003.

[6] J. López and J. A. K. Suykens. First and Second Order SMO Algorithms for Large Scale LS-SVM Training. Technical Report 09-179, Katholieke Universiteit Leuven, 2009.

[7] R. E. Fan, P. H. Chen, and C. J. Lin. Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918, 2005.

[8] G. Rätsch. Benchmark Repository. Datasets available at http://ida.first.fhg.de/projects/bench/benchmarks.htm.