Support Vector Machines




Max Welling
Department of Computer Science, University of Toronto
10 King's College Road, Toronto, M5S 3G5, Canada
welling@cs.toronto.edu

Abstract

This is a note to explain support vector machines.

1 Preliminaries

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form {x_i, y_i}, i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {-1, +1}. We call {x_i} the covariates or input vectors and {y_i} the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e., I can draw a straight line f(x) = w^T x - b such that all cases with y_i = -1 fall on one side and have f(x_i) < 0, while all cases with y_i = +1 fall on the other side and have f(x_i) > 0. Given that we have achieved that, we could classify new test cases according to the rule y_test = sign(f(x_test)).

However, there are typically infinitely many such hyperplanes, obtained by small perturbations of a given solution. How do we choose between all these hyperplanes, which all solve the separation problem for our training data but may have different performance on newly arriving test cases? For instance, we could choose to put the line very close to the members of one particular class, say y = -1. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified as y = +1, but we will very easily make mistakes on the cases with y = -1 (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible choice is therefore to place the separating line as far away from both the y = -1 and the y = +1 training cases as we can, i.e., right in the middle.

Geometrically, the vector w is directed orthogonal to the line defined by w^T x = b. This can be understood as follows. First take b = 0. It is clear that all vectors x with vanishing inner product with w satisfy this equation, i.e., all vectors orthogonal to w satisfy it. Now translate the hyperplane away from the origin over a vector a. The equation for the plane then becomes (x - a)^T w = 0, i.e., we recover the same form with offset b = a^T w, which is the projection of a onto the vector w. Without loss of generality we may thus choose a perpendicular to the plane, in which case the length ||a|| = |b| / ||w|| represents the shortest, orthogonal distance between the origin and the hyperplane.
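The geometry above is easy to verify numerically. The following minimal sketch (with an arbitrarily chosen w and b, not taken from the note) checks that the closest point of the hyperplane w^T x = b to the origin lies at distance |b| / ||w||, and applies the classification rule sign(w^T x - b):

```python
import numpy as np

# Arbitrary example values for illustration (not from the note).
w = np.array([3.0, 4.0])        # normal vector of the hyperplane w^T x = b
b = 10.0

# The point of the plane closest to the origin is a = (b / ||w||^2) w,
# so its distance to the origin is |b| / ||w||.
a = (b / np.dot(w, w)) * w
assert np.isclose(np.dot(w, a), b)                               # a lies on the plane
assert np.isclose(np.linalg.norm(a), abs(b) / np.linalg.norm(w))

# Classification rule: y = sign(f(x)) with f(x) = w^T x - b.
x_test = np.array([5.0, 1.0])
print(np.sign(np.dot(w, x_test) - b))                            # prints 1.0
```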

We now define two more hyperplanes parallel to the separating hyperplane. They represent the planes that cut through the closest training examples on either side. We will call them support hyperplanes in the following, because the data-vectors they contain support the plane. We define the distance between the support hyperplanes and the separating hyperplane to be d+ and d- respectively. The margin, γ, is defined to be d+ + d-. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes.

We can write the following equations for the support hyperplanes:

    w^T x = b + δ    (1)
    w^T x = b - δ    (2)

We now note that we have over-parameterized the problem: if we scale w, b and δ by a constant factor α, the equations for x are still satisfied. To remove this ambiguity we will require that δ = 1; this sets the scale of the problem, i.e., whether we measure distance in millimeters or meters. We can now also compute the value d+ = (|b + 1| - |b|) / ||w|| = 1 / ||w|| (this is only true if b ∉ (-1, 0), since the origin does not fall in between the hyperplanes in that case; if b ∈ (-1, 0) you should use d+ = (|b + 1| + |b|) / ||w|| = 1 / ||w||). Hence the margin is equal to twice that value: γ = 2 / ||w||.

With the above definition of the support planes we can write down the constraints that any solution must satisfy,

    w^T x_i - b ≤ -1   for y_i = -1    (3)
    w^T x_i - b ≥ +1   for y_i = +1    (4)

or, in one equation,

    y_i (w^T x_i - b) - 1 ≥ 0    (5)

We now formulate the primal problem of the SVM:

    minimize    (1/2) ||w||^2
    subject to  y_i (w^T x_i - b) - 1 ≥ 0   for all i    (6)

Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyperplanes. The data-cases that lie on the support hyperplanes are called support vectors, since they support the hyperplanes and hence determine the solution to the problem.
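As a concrete illustration of the primal problem (6), the sketch below solves it on a tiny, assumed separable toy data set with a generic constrained optimizer. This is only a minimal sketch; a real implementation would use a dedicated quadratic programming or SMO solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data set (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0],     # class +1
              [0.0, 0.0], [1.0, 0.0]])    # class -1
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(z):                          # z = (w_1, w_2, b)
    w = z[:2]
    return 0.5 * np.dot(w, w)

# One inequality constraint y_i (w^T x_i - b) - 1 >= 0 per training case.
constraints = [{'type': 'ineq',
                'fun': lambda z, i=i: y[i] * (np.dot(z[:2], X[i]) - z[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method='SLSQP')
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 2.0 / np.linalg.norm(w))
```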

The primal problem can be solved by a quadratic program. However, it is not ready to be kernelised, because its dependence is not only on inner products between data-vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian,

    L(w, b, α) = (1/2) ||w||^2 - ∑_{i=1}^{n} α_i [ y_i (w^T x_i - b) - 1 ]    (7)

The solution that minimizes the primal problem subject to the constraints is given by min_{w,b} max_{α≥0} L(w, b, α), i.e., a saddle point problem. When the original objective function is convex (and only then), we can interchange the minimization and maximization. Doing that, we find the conditions on w and b that must hold at the saddle point we are solving for. This is done by taking derivatives with respect to w and b and solving,

    w - ∑_i α_i y_i x_i = 0   ⇒   w = ∑_i α_i y_i x_i    (8)
    ∑_i α_i y_i = 0    (9)

Inserting this back into the Lagrangian we obtain what is known as the dual problem,

    maximize    L_D = ∑_{i=1}^{n} α_i - (1/2) ∑_{i,j} α_i α_j y_i y_j x_i^T x_j
    subject to  ∑_i α_i y_i = 0    (10)
                α_i ≥ 0    (11)

The dual formulation of the problem is also a quadratic program, but note that the number of variables α_i in this problem is equal to the number of data-cases, n. The crucial point is, however, that this problem depends on the x_i only through the inner products x_i^T x_j. It is therefore readily kernelised through the substitution x_i^T x_j → k(x_i, x_j). This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not. The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have L_P(w*) = L_D(α*), i.e., the duality gap is zero.

Next we turn to the conditions that must necessarily hold at the saddle point, and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives with respect to w and b to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important condition called complementary slackness needs to be satisfied,

    ∂L_P/∂w = 0   ⇒   w - ∑_i α_i y_i x_i = 0    (12)
    ∂L_P/∂b = 0   ⇒   ∑_i α_i y_i = 0    (13)
    constraint:               y_i (w^T x_i - b) - 1 ≥ 0    (14)
    multiplier condition:     α_i ≥ 0    (15)
    complementary slackness:  α_i [ y_i (w^T x_i - b) - 1 ] = 0    (16)

It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied but not saturated, y_i (w^T x_i - b) - 1 > 0, in which case α_i for that data-case must be zero, or the inequality constraint is saturated, y_i (w^T x_i - b) - 1 = 0, in which case α_i can take any value α_i ≥ 0. Inequality constraints which are saturated are said to be active, while unsaturated constraints are inactive.

One could imagine the process of searching for a solution as a ball which runs down the primal objective function using gradient descent. At some point it will hit a wall, which is a constraint, and although the derivative still points partially towards the wall, the constraint prohibits the ball from going on. This is an active constraint, because the ball is glued to that wall. When a final solution is reached, we could remove some constraints without changing the solution; these are the inactive constraints. One could think of the term ∂L_P/∂w as the force acting on the ball. We see from the first equation above that only the terms with α_i ≠ 0 exert a force on the ball, and this force balances the force coming from the curved quadratic surface (1/2)||w||^2.

The training cases with α_i > 0, representing active constraints on the position of the support hyperplanes, are called support vectors. These are the vectors that are situated in the support hyperplanes, and they determine the solution. Typically there are only a few of them, which people call a sparse solution (most α_i vanish).
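The sketch below solves the dual problem (10)-(11) on the same assumed toy data, again with a generic optimizer rather than a dedicated QP/SMO solver; it recovers w through eq. (8) and shows the sparsity of α. Kernelising amounts to replacing the Gram matrix of inner products by a kernel matrix.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])

K = X @ X.T                                  # Gram matrix x_i^T x_j; a kernelised SVM
                                             # would use K[i, j] = k(x_i, x_j) instead

def neg_dual(alpha):                         # minimize the negative of L_D in (10)
    return -(alpha.sum() - 0.5 * np.sum(np.outer(alpha * y, alpha * y) * K))

cons = [{'type': 'eq', 'fun': lambda a: np.dot(a, y)}]         # sum_i alpha_i y_i = 0
res = minimize(neg_dual, x0=np.zeros(len(y)), bounds=[(0.0, None)] * len(y),
               constraints=cons, method='SLSQP')
alpha = res.x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i, eq. (8)
print("alpha =", np.round(alpha, 3))         # most entries vanish: a sparse solution
print("w =", w)
```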

What we are really interested in is the function f(·) which can be used to classify future test cases,

    f(x) = w^T x - b = ∑_i α_i y_i x_i^T x - b    (17)

As an application of the KKT conditions we derive a solution for b by using the complementary slackness condition,

    b = ∑_j α_j y_j x_j^T x_i - y_i,   with i any support vector    (18)

where we used y_i^2 = 1. So, using any support vector one can determine b, but for numerical stability it is better to average over all of them (although the values should obviously be consistent). The most important conclusion is again that this function f(·) can be expressed solely in terms of inner products x_i^T x, which we can replace with kernel entries k(x_i, x) to move to high-dimensional non-linear spaces. Moreover, since α is typically very sparse, we do not need to evaluate many kernel entries in order to predict the class of a new input x.

2 The Non-Separable case

Obviously, not all datasets are linearly separable, so we need to change the formalism to account for that. Clearly, the problem lies in the constraints, which cannot always be satisfied. So, let us relax those constraints by introducing slack variables ξ_i,

    w^T x_i - b ≤ -1 + ξ_i   for y_i = -1    (19)
    w^T x_i - b ≥ +1 - ξ_i   for y_i = +1    (20)
    ξ_i ≥ 0    (21)

The variables ξ_i allow for violations of the constraints. We should penalize the objective function for these violations, otherwise the above constraints become void (simply always pick ξ_i very large). Penalty functions of the form C (∑_i ξ_i)^k lead to convex optimization problems for positive integers k; for k = 1, 2 the problem is still a quadratic program (QP). In the following we will choose k = 1. C controls the trade-off between the penalty and the margin. To be on the wrong side of the separating hyperplane, a data-case would need ξ_i > 1. Hence, the sum ∑_i ξ_i can be interpreted as a measure of how bad the violations are, and it is an upper bound on the number of violations. The new primal problem thus becomes,

    minimize    L_P = (1/2) ||w||^2 + C ∑_i ξ_i
    subject to  y_i (w^T x_i - b) - 1 + ξ_i ≥ 0    (22)
                ξ_i ≥ 0    (23)
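The following sketch solves the soft-margin primal (22)-(23) on an assumed non-separable toy data set, again with a generic constrained optimizer and an arbitrarily chosen penalty C; the reported slack values show which cases violate their support-plane constraint.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with one mislabelled point, so it is not linearly separable (assumed).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [2.5, 2.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n, d = X.shape
C = 1.0                                      # assumed trade-off parameter

def objective(z):                            # z = (w, b, xi_1, ..., xi_n)
    w, xi = z[:d], z[d + 1:]
    return 0.5 * np.dot(w, w) + C * xi.sum()

cons = ([{'type': 'ineq',                    # y_i (w^T x_i - b) - 1 + xi_i >= 0
          'fun': lambda z, i=i: y[i] * (np.dot(z[:d], X[i]) - z[d]) - 1.0 + z[d + 1 + i]}
         for i in range(n)] +
        [{'type': 'ineq', 'fun': lambda z, i=i: z[d + 1 + i]} for i in range(n)])  # xi_i >= 0

res = minimize(objective, x0=np.zeros(d + 1 + n), constraints=cons, method='SLSQP')
w, b, xi = res.x[:d], res.x[d], res.x[d + 1:]
print("w =", w, "b =", b, "slacks =", np.round(xi, 3))
```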

Introducing Lagrange multipliers α_i for the constraints (22) and µ_i for the constraints (23) leads to the Lagrangian,

    L(w, b, ξ, α, µ) = (1/2) ||w||^2 + C ∑_{i=1}^{n} ξ_i - ∑_{i=1}^{n} α_i [ y_i (w^T x_i - b) - 1 + ξ_i ] - ∑_{i=1}^{n} µ_i ξ_i    (24)

from which we derive the KKT conditions,

    1. ∂L_P/∂w = 0    ⇒   w - ∑_i α_i y_i x_i = 0    (25)
    2. ∂L_P/∂b = 0    ⇒   ∑_i α_i y_i = 0    (26)
    3. ∂L_P/∂ξ_i = 0  ⇒   C - α_i - µ_i = 0    (27)
    4. constraint 1:               y_i (w^T x_i - b) - 1 + ξ_i ≥ 0    (28)
    5. constraint 2:               ξ_i ≥ 0    (29)
    6. multiplier condition 1:     α_i ≥ 0    (30)
    7. multiplier condition 2:     µ_i ≥ 0    (31)
    8. complementary slackness 1:  α_i [ y_i (w^T x_i - b) - 1 + ξ_i ] = 0    (32)
    9. complementary slackness 2:  µ_i ξ_i = 0    (33)

From here we can deduce the following facts. If we assume that ξ_i > 0, then µ_i = 0 (by 9), hence α_i = C (by 3), and thus ξ_i = 1 - y_i (w^T x_i - b) (by 8). Also, when ξ_i = 0 we have µ_i > 0 (by 9) and hence α_i < C (by 3). If, in addition to ξ_i = 0, we also have y_i (w^T x_i - b) - 1 = 0, then α_i > 0 (by 8); otherwise, if y_i (w^T x_i - b) - 1 > 0, then α_i = 0. In summary, as before, for points not on a support plane and on the correct side we have ξ_i = α_i = 0 (all constraints inactive). On the support plane we still have ξ_i = 0, but now α_i > 0. Finally, for data-cases on the wrong side of their support hyperplane the α_i max out at α_i = C, and the ξ_i balance the violation of the constraint such that y_i (w^T x_i - b) - 1 + ξ_i = 0.

Geometrically, we can calculate the gap between the support hyperplane and a violating data-case to be ξ_i / ||w||. This can be seen because the plane defined by y_i (w^T x_i - b) - 1 + ξ_i = 0 is parallel to the support plane, at a distance |1 + y_i b - ξ_i| / ||w|| from the origin. Since the support plane is at a distance |1 + y_i b| / ||w||, the result follows.

Finally, we need to convert to the dual problem in order to solve it efficiently and to kernelise it. Again, we use the KKT equations to get rid of w, b and ξ,

    maximize    L_D = ∑_{i=1}^{n} α_i - (1/2) ∑_{i,j} α_i α_j y_i y_j x_i^T x_j
    subject to  ∑_i α_i y_i = 0    (35)
                0 ≤ α_i ≤ C    (36)

Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers α_i, which now live in a box. This constraint is derived from the fact that α_i = C - µ_i and µ_i ≥ 0. We also note that the problem again depends only on the inner products x_i^T x_j, which are ready to be kernelised.
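The soft-margin dual (35)-(36) differs from the earlier dual sketch only in the box constraint on α. A minimal sketch with the same assumed non-separable toy data and an arbitrarily chosen C:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [2.5, 2.5]])   # last point is mislabelled
y = np.array([+1.0, +1.0, -1.0, -1.0])
K = X @ X.T                                   # replace by a kernel matrix to kernelise
C = 1.0                                       # assumed penalty parameter

def neg_dual(alpha):                          # negative of L_D, to be minimized
    return -(alpha.sum() - 0.5 * np.sum(np.outer(alpha * y, alpha * y) * K))

res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, C)] * len(y),    # the box constraint (36): 0 <= alpha_i <= C
               constraints=[{'type': 'eq', 'fun': lambda a: np.dot(a, y)}],
               method='SLSQP')
print("alpha =", np.round(res.x, 3))          # the violating case is expected to hit alpha_i = C
```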