Least 1-Norm SVMs: a New SVM Variant between Standard and LS-SVMs




Jorge López and José R. Dorronsoro

Universidad Autónoma de Madrid, Departamento de Ingeniería Informática and Instituto de Ingeniería del Conocimiento, C/ Francisco Tomás y Valiente 11, 28049 Madrid, Spain *

Abstract. Least Squares Support Vector Machines (LS-SVMs) were proposed by replacing the inequality constraints inherent to L1-SVMs with equality constraints. So far this idea has only been suggested for a least squares (L2) loss. We describe how this can also be done for the sum-of-slacks (L1) loss, yielding a new classifier (Least 1-Norm SVMs) which gives similar models in terms of complexity and accuracy and which may also be more robust than LS-SVMs with respect to outliers.

1 Introduction

Assuming a binary classification context, we have a sample of $N$ preclassified patterns $\{X_i, y_i\}$, $i = 1, \dots, N$, where the outputs $y_i \in \{+1, -1\}$. If we further assume linear inseparability and consider slack variables to allow for misclassifications, the primal of an LS-SVM [1] is

  $\min_{W,b,\xi}\ \tfrac{1}{2}\|W\|^2 + \tfrac{C}{2}\sum_i \xi_i^2 \quad \text{s.t.}\quad y_i\,(W \cdot \Phi(X_i) + b) = 1 - \xi_i,$   (1)

where $\cdot$ denotes the inner product and $\Phi(X_i)$ is the image of $X_i$ in the feature space with feature map $\Phi(\cdot)$. The corresponding dual is

  $\min_{\alpha}\ \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \tilde{K}_{ij} - \sum_i \alpha_i \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,$   (2)

with the modified kernel $\tilde{K}_{ij} = k(X_i, X_j) + \delta_{ij}/C$, where $\delta_{ij}$ stands for Kronecker's delta and $k(X_i, X_j) = \Phi(X_i) \cdot \Phi(X_j)$ is the original kernel.

LS-SVMs were originally derived in [1] from the so-called L1-SVMs [2], whose primal changes (1) in three aspects: 1) the objective function uses the L1 loss $C \sum_i \xi_i$ instead of the L2 loss, 2) the equality constraints become inequality ones, and 3) there is the additional requirement that $\xi_i \ge 0$. In turn, L2-SVMs, also described in [2], lie somewhere in between, since their primal is identical to (1), but with the equality constraints still transformed into inequality ones.

* With partial support of Spain's TIN 2007-66862 project and Cátedra IIC en Modelado y Predicción. The first author is kindly supported by an FPU-MICINN grant.
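For concreteness, the modified kernel of (2) is easy to build explicitly. The following minimal Python sketch (our illustration, not code from the paper; the function names are ours) computes $\tilde{K}$ for the RBF kernel later used in Section 4:

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Pairwise RBF kernel k(X_i, X_j) = exp(-||X_i - X_j||^2 / (2 sigma^2)).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def modified_kernel(X, C, sigma=1.0):
    # K~_ij = k(X_i, X_j) + delta_ij / C, the kernel of the LS-SVM dual (2).
    return rbf_kernel(X, sigma) + np.eye(X.shape[0]) / C
```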

To our knowledge, there is no current classifier that combines equality constraints with the L1 loss (see Table 1). It is desirable to fill this gap mainly because of two facts: 1) in practice, L1-SVMs and the L1 loss have become the standard; 2) the influence of a given pattern (i.e. the value of its coefficient $\alpha_i$) on the model is not bounded when using the L2 loss, so L2- and LS-SVMs are more sensitive to outliers than L1-SVMs.

                          Squared slacks    Slacks
  Inequality constraints  L2-SVMs           L1-SVMs
  Equality constraints    LS-SVMs           ?

Table 1: Types of SVMs according to how slacks and constraints are treated.

The central idea of this work is to simplify L1-SVMs similarly to LS-SVMs, but keeping the L1 loss, giving rise to the so-called Least 1-Norm SVMs, which fill the gap above and are expected to preserve the robustness to outliers.

The rest of the paper is organized as follows: in Section 2 we give the primal and dual of Least 1-Norm SVMs and briefly discuss their KKT optimality conditions. Section 3 explains how the popular SMO algorithm can be adapted to solve the Least 1-Norm dual. Section 4 reports some experiments that illustrate how they can be more robust to outliers than LS-SVMs while being just as accurate, and discusses the varied convergence speeds observed. Finally, Section 5 gives pointers to possible future extensions.

2 Least 1-Norm SVMs

In order to simplify the L1-SVM primal, one may think that it suffices to force equality constraints $y_i\,(W \cdot \Phi(X_i) + b) = 1 - \xi_i$ while keeping the inherent requirement $\xi_i \ge 0$. However, this is not correct, because it implies that slacks are only allowed in one direction, which is obviously not convenient. Therefore, we propose to remove the constraints $\xi_i \ge 0$ and minimize the 1-norm of the slack vector, which gives the Least 1-Norm SVM primal

  $\min_{W,b,\xi}\ \tfrac{1}{2}\|W\|^2 + C\sum_i |\xi_i| \quad \text{s.t.}\quad y_i\,(W \cdot \Phi(X_i) + b) = 1 - \xi_i.$   (3)

Now we use the standard cast of 1-norm problems as Linear Programming problems [3]: minimizing (3) can be reformulated as

  $\min_{W,b,t}\ \tfrac{1}{2}\|W\|^2 + C\sum_i t_i \quad \text{s.t.}\quad -t_i \le 1 - y_i\,(W \cdot \Phi(X_i) + b) \le t_i.$   (4)

Note that (4) transforms the desired equalities of (3) into inequalities, but otherwise the objective function is not differentiable. Using standard Lagrangian theory with (4) and denoting by $\beta_i$ ($\gamma_i$) the multipliers associated with the $-t_i$ ($+t_i$) constraints, we obtain the following dual, where $\alpha_i = \gamma_i - \beta_i$:

  $\min_{\alpha}\ \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K_{ij} - \sum_i \alpha_i \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,\ \ -C \le \alpha_i \le C,$   (5)

which happens to be identical to the L1-SVM dual but with the lower bound $-C$ instead of $0$, so that negative values are allowed, as in LS-SVMs.
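Since (5) is just a box-constrained QP with one equality constraint, it can also be handled by a generic solver before resorting to the specialized method of the next section. A minimal sketch with SciPy's SLSQP (our illustration, assuming a precomputed kernel matrix K; the paper itself trains with SMO):

```python
import numpy as np
from scipy.optimize import minimize

def least_1norm_dual(K, y, C):
    # Solve (5): min_a 0.5 a'Qa - sum(a), with Q_ij = y_i y_j K_ij,
    # s.t. sum_i a_i y_i = 0 and -C <= a_i <= C.
    y = np.asarray(y, dtype=float)
    N = len(y)
    Q = K * np.outer(y, y)
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()
    jac = lambda a: Q @ a - np.ones(N)
    cons = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}
    res = minimize(fun, np.zeros(N), jac=jac, method="SLSQP",
                   bounds=[(-C, C)] * N, constraints=[cons])
    return res.x
```

Note that, unlike in the standard L1-SVM dual, the box is symmetric around zero, which is where the bounded outlier influence discussed below comes from.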

Since all the formulations above are convex with affine constraints, the KKT optimality conditions are necessary and sufficient for optimality [3]. The KKT conditions for (4) are analogous to the well-known ones for L1-SVMs, just substituting the lower bound $-C$ for $0$, which yields

  $y_i\,(W \cdot \Phi(X_i) + b) = 1 \ \Leftrightarrow\ -C < \alpha_i < C,$
  $y_i\,(W \cdot \Phi(X_i) + b) \ge 1 \ \Leftrightarrow\ \alpha_i = -C,$
  $y_i\,(W \cdot \Phi(X_i) + b) \le 1 \ \Leftrightarrow\ \alpha_i = C,$   (6)

together with the dual constraints $W = \sum_i \alpha_i y_i \Phi(X_i)$ and $\sum_i \alpha_i y_i = 0$. These are common to LS-SVMs, whose only primal KKT condition [1] is

  $y_i\,(W \cdot \Phi(X_i) + b) = 1 - \alpha_i/C,$   (7)

which shows why LS-SVMs are very sensitive to outliers: outliers are characterized by a large $|\xi_i|$, which in view of (7) and (1) implies a large $|\alpha_i|$. In Least 1-Norm SVMs, on the other hand, this influence is limited because $|\alpha_i| \le C$. It also shows another drawback of LS-SVMs: they are not sparse, because $\alpha_i = C\xi_i$, so a pattern takes part in the model whenever $\xi_i \ne 0$, which is almost certain to happen. Observe that this is also the case for Least 1-Norm SVMs, since $\alpha_i = 0$ implies that $y_i\,(W \cdot \Phi(X_i) + b)$ is exactly $1$, so they are not likely to be sparse either. L1-SVMs are indeed sparse because patterns with $y_i\,(W \cdot \Phi(X_i) + b) > 1$ are assigned $\alpha_i = 0$ instead of $-C$.

3 Least 1-Norm SMO

We will adapt SMO for Least 1-Norm SVMs based on a maximum-gain viewpoint (for more details see [4]). In general, SMO performs updates of the form $W' = W + \delta_L y_L X_L + \delta_U y_U X_U$. The constraint $\sum_i \alpha_i y_i = 0$ implies $\delta_U y_U = -\delta_L y_L$, and the updates become $W' = W + \delta\, y_L (X_L - X_U)$, where we write $\delta = \delta_L$ and, hence, $\delta_U = -y_U y_L \delta$. As a consequence, the multiplier updates are $\alpha'_L = \alpha_L + \delta$, $\alpha'_U = \alpha_U - y_U y_L \delta$ and $\alpha'_j = \alpha_j$ for all other $j$. Therefore, denoting the dual function in (5) by $D(\alpha)$, $D(\alpha')$ can be written as

  $D(\alpha') = D(\alpha) - \dfrac{(\Delta_{U,L})^2}{2\,\|Z_{L,U}\|^2},$

where we write $\Delta_{U,L} = W \cdot (X_U - X_L) - (y_U - y_L)$ and $Z_{L,U} = X_L - X_U$. Ignoring the denominator, we can approximately maximize the gain in $D(\alpha')$ by choosing $L = \arg\min_j \{W \cdot X_j - y_j\}$ and $U = \arg\max_j \{W \cdot X_j - y_j\}$, so that the violation extent $\Delta_{U,L}$ is largest. Writing $\Delta = \Delta_{U,L}$ and $\lambda = \Delta / \|Z_{U,L}\|^2$, we then have $\Delta > 0$, $\lambda > 0$, $\delta = y_L \lambda$, and the $\alpha$ updates become $\alpha'_L = \alpha_L + y_L \lambda$, $\alpha'_U = \alpha_U - y_U \lambda$. Thus, $\alpha_L$ or $\alpha_U$ will decrease if $y_L = -1$ or $y_U = 1$, which requires the corresponding $\alpha_L$ and $\alpha_U$ to be greater than $-C$. In turn, they will increase if $y_L = 1$ or $y_U = -1$, which requires the corresponding $\alpha_L$ and $\alpha_U$ to be less than $C$. Hence, we must replace the previous $L$, $U$ choices with

  $L = \arg\min_j \{W \cdot X_j - y_j : j \in I_L\}, \quad U = \arg\max_j \{W \cdot X_j - y_j : j \in I_U\},$   (8)

where we use the notations $I_U = \{i : (y_i = 1,\ \alpha_i > -C) \lor (y_i = -1,\ \alpha_i < C)\}$ and $I_L = \{i : (y_i = 1,\ \alpha_i < C) \lor (y_i = -1,\ \alpha_i > -C)\}$. Moreover, to make sure that $\alpha_L$ and $\alpha_U$ then remain in the interval $[-C, C]$, we may have to clip $\lambda$ as

  $\lambda' = \min\{\lambda,\ C - y_L \alpha_L,\ C + y_U \alpha_U\}.$   (9)
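Putting the pieces of this section together, a compact Python sketch of the resulting SMO loop follows (our reconstruction from (8)-(9) for a precomputed kernel matrix, not the authors' code; the bias $b$ can afterwards be recovered from the unbounded multipliers via (6)):

```python
import numpy as np

def least_1norm_smo(K, y, C, eps=1e-3, max_iter=100_000):
    # SMO for the Least 1-Norm dual (5); K is the kernel matrix, y in {-1,+1}.
    N = len(y)
    alpha = np.zeros(N)
    f = np.zeros(N)                  # f_i = W . X_i = sum_j alpha_j y_j K_ij
    for _ in range(max_iter):
        g = f - y                    # W . X_j - y_j
        in_U = ((y == 1) & (alpha > -C)) | ((y == -1) & (alpha < C))
        in_L = ((y == 1) & (alpha < C)) | ((y == -1) & (alpha > -C))
        U = np.where(in_U)[0][np.argmax(g[in_U])]   # choice (8)
        L = np.where(in_L)[0][np.argmin(g[in_L])]
        if g[U] - g[L] <= eps:       # KKT-based stop, cf. (11) below
            break
        z2 = K[L, L] + K[U, U] - 2.0 * K[L, U]      # ||Z_{L,U}||^2
        lam = (g[U] - g[L]) / max(z2, 1e-12)
        # Clipping (9) keeps alpha_L and alpha_U inside [-C, C].
        lam = min(lam, C - y[L] * alpha[L], C + y[U] * alpha[U])
        alpha[L] += y[L] * lam
        alpha[U] -= y[U] * lam
        f += lam * (K[:, L] - K[:, U])              # rank-two update of W
    return alpha
```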

4 Numerical Experiments

In this section we illustrate empirically how the Least 1-Norm SVM may be more robust to outliers than its LS-SVM counterpart, as well as its good generalization properties. The training algorithm is SMO: the Least 1-Norm variant explained above, and the LS-SVM version in [5]. The stopping criterion is the final KKT violation, specifically when it is less than $\epsilon = 10^{-3}$. For LS-SVMs this means

  $\max_i \{\tilde{W} \cdot \Phi(X_i) - y_i\} - \min_i \{\tilde{W} \cdot \Phi(X_i) - y_i\} \le \epsilon,$   (10)

where the tilde indicates that we use the modified kernel $\tilde{k}$ as in (2). For Least 1-Norm SVMs, it means

  $\max_{i \in I_U} \{W \cdot \Phi(X_i) - y_i\} - \min_{i \in I_L} \{W \cdot \Phi(X_i) - y_i\} \le \epsilon.$   (11)

The derivation of these KKT-based criteria is given in [6] for LS-SVMs and in [7] for L1-SVMs.

Firstly, to show generalization we take 4 datasets from [8], with 100 training-test splits each, and compare the performance of Least 1-Norm and LS-SVMs. We use the RBF kernel $k(X_i, X_j) = \exp(-\|X_i - X_j\|^2 / 2\sigma^2)$. The values of the hyperparameters $C$ and $\sigma$ are sought on a logarithmic grid, and each point of the grid is evaluated with a 10-times 10-fold cross-validation over the whole dataset. We report in Table 2 the accuracy and number of support vectors obtained in the final models, as well as the number of iterations needed by the corresponding SMO version to stop.

             LS-SVM                            Least 1-Norm
             % err.    #SV       #It.          % err.    #SV       #It.
  Titanic     .±.      5.±.      39.±.7         .±.      7.±9.8    53.7±6.
  Heart      5.6±3.    69.9±.3    6.3±7.8       5.6±3.5  69.9±.3     .±.7
  Cancer     5.7±.5    99.8±.5     .6±7.9       5.9±.5   95.9±.3   35.7±7.9
  German     3.3±.    699.±.8    339.8±38.      3.3±.    699.9±.3  667.±9.3

Table 2: Average accuracies, numbers of support vectors and numbers of iterations obtained by a Least 1-Norm SVM and an LS-SVM.
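The model selection loop just described can be set up along the following lines; this is a sketch under our own assumptions (the grids shown are placeholders, since the exact ranges used in the paper are not reproduced here, and train_fn stands for any of the two trainers above):

```python
import numpy as np
from itertools import product

def cv_error(train_fn, X, y, C, sigma, folds=10, reps=10, seed=0):
    # Repeated k-fold cross-validation error for one (C, sigma) grid point.
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        idx = rng.permutation(len(y))
        for f in range(folds):
            test = idx[f::folds]
            train = np.setdiff1d(idx, test)
            model = train_fn(X[train], y[train], C, sigma)  # hypothetical trainer
            errs.append(np.mean(model.predict(X[test]) != y[test]))
    return float(np.mean(errs))

# Placeholder logarithmic grids (assumed, not the paper's):
grid_C = 10.0 ** np.arange(-2, 5)
grid_sigma = 10.0 ** np.arange(-2, 3)
# best_C, best_sigma = min(product(grid_C, grid_sigma),
#                          key=lambda p: cv_error(train_fn, X, y, *p))
```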

Fig. 1: Contours of the function $W \cdot \Phi(X) + b$ for a toy problem trained with an LS-SVM (left) and a Least 1-Norm SVM (right). Top: original problem ((a) LS-SVM, (b) Least 1-Norm SVM, no outliers). Bottom: modified problem with one outlier for each class ((c) LS-SVM, (d) Least 1-Norm SVM).

It can be seen that the accuracies obtained are similar for both kinds of SVM, and also similar to the ones reported in [8] for an L1-SVM. Regarding the number of support vectors, as expected, neither model is sparse, except the Least 1-Norm SVM on the Titanic dataset, which we think is due to the existence of identical points with different labels. Finally, the number of iterations is somewhat puzzling: sometimes the LS-SVM is remarkably faster, and sometimes the Least 1-Norm SVM is. This of course depends on the hyperparameters chosen, but their exact influence is not clear. Care must also be taken, since (10) and (11), though formally similar, may require quite different numbers of iterations, as the $W$ vectors are different. Further study is clearly needed to better characterize the convergence speed in each case.

Secondly, to show robustness we use the toy two-dimensional problem depicted in Fig. 1, where each class contains the same number of patterns. The positive class patterns are drawn from a normal distribution with mean $(0, 0)$, whereas the negative class has mean $(5, 0)$; in both cases the covariance matrix is the identity. In the top part of the figure we train an LS-SVM (a) and a Least 1-Norm SVM (b) on this training set, which is linearly separable, with a fixed $C$ and no specific kernel (just the inner product). Note that the final hyperplanes are very similar and that the support hyperplanes traverse their corresponding clouds of points. In the bottom part of the figure we introduce two outliers by switching the class labels of two points, so that the classes are no longer linearly separable, and train again the LS-SVM (c) and the Least 1-Norm SVM (d). Observe that the final LS-SVM hyperplane has remarkably changed its orientation because of the outliers' influence, whereas the Least 1-Norm one changes much less, because their influence is limited.
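For reference, such a toy set can be generated along the following lines (a sketch with an assumed sample size and seed, since neither is legible in the source):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                           # patterns per class (assumed)
X_pos = rng.normal(loc=(0.0, 0.0), size=(n, 2))   # positive class cloud
X_neg = rng.normal(loc=(5.0, 0.0), size=(n, 2))   # negative class cloud
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(n), -np.ones(n)])

# Introduce one outlier per class by switching two labels: each flipped
# point keeps its position inside its original cloud but now carries the
# opposite label, so the classes are no longer linearly separable.
y[0], y[n] = -1.0, 1.0
```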

5 Conclusions and further work

In this work we have presented Least 1-Norm SVMs, a new SVM classifier. Just as LS-SVMs did with L2-SVMs, they are derived by substituting equality constraints for the inequality ones in the primal. The resulting dual is almost identical to the L1 one, with box constraints $[-C, C]$ in lieu of $[0, C]$. This implies that the outliers' influence is also limited, but sparsity is lost, because the points for which $y_i\,(W \cdot \Phi(X_i) + b) > 1$ are now assigned $\alpha_i = -C$ instead of zero. We have also seen how the new machine can be trained with an adaptation of the well-known SMO algorithm, giving models with similar test accuracies. Which particular SVM variant converges faster seems to be problem- and parameter-dependent.

As a possible future extension, the training phase of Least 1-Norm SVMs might be accelerated by making use of the second-order variant of the SMO algorithm, as was done for L1-SVMs in [7]; this method has been shown not to always accelerate LS-SVM training [6]. As mentioned above, the convergence properties of SMO for Least 1-Norm SVMs will be studied further.

References

[1] J. A. K. Suykens and J. Vandewalle. Least Squares Support Vector Machine Classifiers. Neural Processing Letters, 9(3):293-300, 1999.

[2] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[4] J. López, Á. Barbero, and J. R. Dorronsoro. On the Equivalence of the SMO and MDM Algorithms for SVM Training. In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, volume 5211, pages 288-300. Springer, 2008.

[5] S. S. Keerthi and S. K. Shevade. SMO Algorithm for Least-Squares SVM Formulations. Neural Computation, 15(2):487-507, 2003.

[6] J. López and J. A. K. Suykens. First and Second Order SMO Algorithms for Large Scale LS-SVM Training. Technical Report 09-179, Katholieke Universiteit Leuven, 2009.

[7] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working Set Selection Using Second Order Information for Training Support Vector Machines. Journal of Machine Learning Research, 6:1889-1918, 2005.

[8] G. Rätsch. Benchmark Repository, 2000. Datasets available at http://ida.first.fhg.de/projects/bench/benchmarks.htm.