Large margin classifiers: Support Vector Machines


Ethan Richards
7 years ago

9/4/3

Large margin classifiers (Chapter 7)

Perceptron: find a hyperplane that separates the two classes.
Support Vector Machine (SVM): find a separating hyperplane with a large margin.

The margin is an intuitive concept that is backed by theoretical results (statistical learning theory). It has its origins in the work of Vladimir Vapnik (Vapnik and Lerner, 1963).

The history of SVMs

- Large margin linear classifiers: Vapnik, V., and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 774-780, 1963.
- Large margin non-linear classifiers: B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144-152, 1992.
- SVMs for non-separable data: C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
- Since then: lots of other large margin algorithms.

The geometric margin

The margin of a linear discriminant function $f$ with respect to a labeled dataset $D$ is

$$m_D(f) = \frac{1}{2}\,\hat{w}^T (x_+ - x_-),$$

where $\hat{w}$ is a unit vector in the direction of $w$, and $x_+$ and $x_-$ are the positive and negative examples closest to the hyperplane.

[Figure: a separating hyperplane with weight vector $w$ and the margin on either side.]

The geometric margin

Want to find:

$$m_D(f) = \frac{1}{2}\,\hat{w}^T (x_+ - x_-)$$

Suppose that $x_+$ and $x_-$ are equidistant from the decision boundary:

$$f(x_+) = w^T x_+ + b = a$$
$$f(x_-) = w^T x_- + b = -a$$

Subtracting the two equations:

$$w^T (x_+ - x_-) = 2a$$

Divide by the norm of $w$:

$$\hat{w}^T (x_+ - x_-) = \frac{2a}{\|w\|}$$

To get a well-defined value we fix the value of $f$ at the points closest to the hyperplane by setting $a = 1$. Under this assumption we have that

$$m_D(f) = \frac{1}{\|w\|}$$

Maximizing the margin is therefore equivalent to minimizing $\|w\|^2$.

Theoretical motivation

Theorem: Let $D$ be an i.i.d. sample of size $n$ that is linearly separable, and let $m_D(f)$ be the margin associated with $f(x) = w^T x + b$. Then:

$$P\big(y \neq \mathrm{sign}(f(x))\big) \le \frac{4}{\sqrt{n}\; m_D(f)}$$

Linear SVMs

Objective: maximize the margin while still classifying all examples correctly.

$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{subject to: } y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n.$$
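The relation between the margin and $1/\|w\|$ can be checked numerically. The sketch below is only an illustration: the function name `geometric_margin`, the toy points, and the chosen `w` and `b` are assumptions, not part of the slides. It computes the margin as the smallest signed distance $y_i (w^T x_i + b)/\|w\|$, which coincides with the definition above when the closest positive and negative examples are equidistant from the hyperplane.

```python
import math

def geometric_margin(w, b, X, y):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over the dataset."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) / norm
        for xi, yi in zip(X, y)
    )

# Hypothetical toy data: two positive and two negative points in the plane.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# w and b are scaled so that f = +1 at (2, 2) and f = -1 at (0, 0),
# i.e. a = 1 at the points closest to the hyperplane.
w, b = (0.5, 0.5), -1.0

margin = geometric_margin(w, b, X, y)
inverse_norm = 1.0 / math.sqrt(sum(wj * wj for wj in w))
print(margin, inverse_norm)  # the two values agree for this scaling of w
```

With this scaling the computed margin equals $1/\|w\|$, which is also half the distance between the closest points (2, 2) and (0, 0).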

Digression: constrained optimization

Before considering optimization problems with inequality constraints we will consider ones with equality constraints:

$$\min_x f(x) \quad \text{subject to: } g_i(x) = 0$$

And to make things even simpler, start with the case of a single constraint $g(x)$:

$$\min_x f(x) \quad \text{subject to: } g(x) = 0$$

Claim: A solution $x^*$ of the constrained optimization problem must have the property that $\nabla f(x^*)$ is orthogonal to the constraint surface. Therefore there exists $\lambda \neq 0$ such that

$$\nabla f(x^*) + \lambda \nabla g(x^*) = 0$$

$\lambda$ is known as a Lagrange multiplier.

Lagrange multipliers

When there are multiple equality constraints:

$$\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0$$

The Lagrangian function:

$$\Lambda(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x) = f(x) + \lambda^T g(x)$$

Denote differentiation with respect to $x$ by $\nabla_x$ and with respect to $\lambda$ by $\nabla_\lambda$. The above condition is obtained by setting $\nabla_x \Lambda(x, \lambda) = 0$, and the condition $\nabla_\lambda \Lambda(x, \lambda) = 0$ leads to the constraint equations.

Conclusion: the solution is a stationary point of the Lagrangian.
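A small worked example may make the stationarity conditions concrete. The objective and constraint below are chosen purely for illustration and are not from the slides; the Lagrangian is written as $\Lambda$ and the multiplier as $\lambda$.

```latex
\begin{align*}
&\text{minimize } f(x) = x_1^2 + x_2^2
  \quad \text{subject to } g(x) = x_1 + x_2 - 1 = 0 \\[4pt]
&\Lambda(x, \lambda) = x_1^2 + x_2^2 + \lambda\,(x_1 + x_2 - 1) \\[4pt]
&\nabla_x \Lambda = 0: \quad 2x_1 + \lambda = 0, \qquad 2x_2 + \lambda = 0 \\
&\nabla_\lambda \Lambda = 0: \quad x_1 + x_2 - 1 = 0 \\[4pt]
&\Rightarrow\; x_1 = x_2 = \tfrac{1}{2}, \qquad \lambda = -1
\end{align*}
```

At the solution, $\nabla f(x^*) = (1, 1)$ is parallel to $\nabla g(x^*) = (1, 1)$, so it is orthogonal to the constraint line $x_1 + x_2 = 1$, as the claim requires.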

Inequality constraints

$$\min_x f(x) \quad \text{subject to: } g(x) \le 0$$

Two possible scenarios:

- $g(x) < 0$: the constraint is inactive
- $g(x) = 0$: the constraint is active

If the constraint is inactive, the stationarity condition is $\nabla f(x) = 0$. This corresponds to a stationary point of the Lagrangian with $\lambda = 0$. When the constraint is active, we have $\lambda \neq 0$. Both cases can be summarized by the condition $\lambda\, g(x) = 0$.

The sign of $\lambda$ is important: with the constraint active, $f(x)$ will be minimized only if its gradient is oriented into the region $g(x) < 0$, so that $f$ increases on moving into the feasible region, i.e. $\nabla f(x) = -\lambda \nabla g(x)$ where $\lambda > 0$.

Constrained optimization with inequality constraints

Conclusion: Our constrained optimization problem of minimizing $f(x)$ such that $g(x) \le 0$ is solved by $x, \lambda$ that satisfy:

$$\nabla_x \Lambda(x, \lambda) = 0$$
$$g(x) \le 0$$
$$\lambda \ge 0$$
$$\lambda\, g(x) = 0$$

These are known as the KKT conditions.

With multiple constraints: Our constrained optimization problem of minimizing $f(x)$ such that $g_i(x) \le 0$ is solved by $x, \lambda$ that satisfy:

$$\nabla_x \Lambda(x, \lambda) = 0$$
$$g_i(x) \le 0$$
$$\lambda_i \ge 0$$
$$\lambda_i\, g_i(x) = 0$$

Lagrangian duality

Claim: The problem of minimizing $f(x)$ s.t. $g_i(x) \le 0$ can be expressed as:

$$\min_x \max_\lambda \Lambda(x, \lambda) \quad \text{such that } \lambda_i \ge 0$$

We can see this by performing the inner maximization:

$$\max_{\lambda \ge 0}\; f(x) + \lambda\, g(x) = \begin{cases} f(x) & g(x) \le 0 \\ \infty & g(x) > 0 \end{cases}$$

The solution is a saddle point.
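The KKT conditions can be checked mechanically for a candidate pair of a point and a multiplier. The following is a minimal sketch for a one-dimensional problem; the helper name `kkt_satisfied` and the example problem (minimize $(x-2)^2$ subject to $x - 1 \le 0$) are assumptions made for illustration, not taken from the slides.

```python
def kkt_satisfied(x, lam, grad_f, g, grad_g, tol=1e-9):
    """Check the four KKT conditions for: minimize f(x) subject to g(x) <= 0."""
    stationarity = abs(grad_f(x) + lam * grad_g(x)) <= tol  # gradient of the Lagrangian is 0
    primal_feasible = g(x) <= tol                           # g(x) <= 0
    dual_feasible = lam >= -tol                             # multiplier >= 0
    complementary = abs(lam * g(x)) <= tol                  # lam * g(x) = 0
    return stationarity and primal_feasible and dual_feasible and complementary

# Illustrative objective: f(x) = (x - 2)^2, so grad f(x) = 2 (x - 2).
grad_f = lambda x: 2.0 * (x - 2.0)

# Active constraint: g(x) = x - 1 <= 0 cuts off the unconstrained minimum at x = 2.
g_active = lambda x: x - 1.0
grad_g = lambda x: 1.0

active_ok = kkt_satisfied(1.0, 2.0, grad_f, g_active, grad_g)        # x = 1, multiplier 2
infeasible_ok = kkt_satisfied(2.0, 0.0, grad_f, g_active, grad_g)    # x = 2 violates g <= 0

# Inactive constraint: g(x) = x - 3 <= 0 leaves x = 2 feasible, so the multiplier is 0.
g_inactive = lambda x: x - 3.0
inactive_ok = kkt_satisfied(2.0, 0.0, grad_f, g_inactive, grad_g)

print(active_ok, infeasible_ok, inactive_ok)
```

In the active case the multiplier is strictly positive and the gradient of $f$ at $x = 1$ is $-2$, pointing into the feasible region $x < 1$; in the inactive case the multiplier vanishes and the condition reduces to $\nabla f(x) = 0$.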

Lagrangian duality

Claim: The problem of minimizing $f(x)$ s.t. $g_i(x) \le 0$ can be expressed as:

$$\min_x \max_\lambda \Lambda(x, \lambda) \quad \text{such that } \lambda_i \ge 0$$

Instead of using the primal formulation let's consider:

$$\max_\lambda \min_x \Lambda(x, \lambda) \quad \text{such that } \lambda_i \ge 0$$

This is called the dual. Under certain conditions (convexity) the two problems have the same solution.

Back to SVMs

Lagrangian for the SVM problem (original constraints: $y_i (w^T x_i + b) \ge 1$):

$$\Lambda(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i (w^T x_i + b) \right]$$

Necessary conditions for the saddle point:

$$\frac{\partial \Lambda}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\frac{\partial \Lambda}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

Support Vectors

How do we get $b$? Let's use the KKT conditions:

$$\alpha_i \left[ 1 - y_i (w^T x_i + b) \right] = 0$$

Implication: pick an $i$ such that $\alpha_i > 0$. Then

$$y_i (w^T x_i + b) = 1 \;\Rightarrow\; b = y_i - w^T x_i$$

The corresponding $x_i$ are called support vectors.
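The two formulas above, $w = \sum_i \alpha_i y_i x_i$ and $b = y_i - w^T x_i$ for a support vector, translate directly into code. This is a sketch under assumptions: the function name `recover_w_b` and the two-point dataset are hypothetical, and the multipliers were derived by hand for that dataset rather than produced by a solver.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recover_w_b(alpha, X, y, tol=1e-12):
    """Recover w and b of a hard-margin SVM from its dual variables alpha."""
    dim = len(X[0])
    # w = sum_i alpha_i * y_i * x_i
    w = [sum(a * yi * xi[d] for a, xi, yi in zip(alpha, X, y)) for d in range(dim)]
    # Support vectors are the examples with alpha_i > 0.
    support = [i for i, a in enumerate(alpha) if a > tol]
    # For any support vector, y_i * (w . x_i + b) = 1, hence b = y_i - w . x_i.
    i = support[0]
    b = y[i] - dot(w, X[i])
    return w, b, support

# Hypothetical two-point dataset; alpha = (0.25, 0.25) is its hand-derived dual solution.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]

w, b, support = recover_w_b(alpha, X, y)
print(w, b, support)
```

For this dataset both points are support vectors, and the recovered hyperplane satisfies $y_i (w^T x_i + b) = 1$ at each of them.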

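Support Vectors

Claim: The number of support vectors is an upper bound on the estimated leave-one-out error.

The dual

Start from the Lagrangian

$$\Lambda(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left[ 1 - y_i (w^T x_i + b) \right]$$

and substitute the saddle point conditions $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$:

$$W(\alpha) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \alpha_i y_i x_i^T \Big( \sum_{j=1}^{n} \alpha_j y_j x_j \Big) - b \sum_{i=1}^{n} \alpha_i y_i$$

$$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

The dual problem:

$$\max_\alpha\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to: } \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Comments:

- Quadratic programming problem (no local minima!)
- Usually a sparse solution (many alphas equal to 0)

Compare to the primal:

$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{subject to: } y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n$$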
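To see the dual objective in action, the sketch below evaluates $W(\alpha)$ on a tiny example and compares it with the primal objective $\frac{1}{2}\|w\|^2$. The function name `dual_objective`, the two-point dataset, and the hand-derived multipliers are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, X, y):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i . x_j"""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad

# Hypothetical two-point dataset with hand-derived optimum alpha = (0.25, 0.25).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]

alpha_opt = [0.25, 0.25]
alpha_other = [0.1, 0.1]  # also feasible (alpha >= 0, sum alpha_i y_i = 0), but not optimal

# Primal objective at the corresponding w = sum_i alpha_i y_i x_i = (0.5, 0.5).
w = (0.5, 0.5)
primal_value = 0.5 * dot(w, w)

dual_at_opt = dual_objective(alpha_opt, X, y)
dual_at_other = dual_objective(alpha_other, X, y)
print(primal_value, dual_at_opt, dual_at_other)
```

At the optimum the dual and primal values coincide, while any other feasible choice of the multipliers gives a smaller dual value.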

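The non-separable case

In order to allow for misclassification we replace the constraints

$$y_i (w^T x_i + b) \ge 1$$

with

$$y_i (w^T x_i + b) \ge 1 - \xi_i$$

The $\xi_i \ge 0$ are called slack variables. We need to incorporate the slack variables in the optimization problem because we want to discourage overuse of the slacks. $\sum_i \xi_i$ is an upper bound on the number of misclassified examples.

SVMs for non-separable data

Our optimization problem for the non-separable case:

$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to: } y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

Let's form the Lagrangian:

$$\Lambda(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} \alpha_i \left[ 1 - \xi_i - y_i (w^T x_i + b) \right] - \sum_{i=1}^{n} \beta_i \xi_i$$

Saddle point equations:

$$\frac{\partial \Lambda}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0$$

$$\frac{\partial \Lambda}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0$$

$$\frac{\partial \Lambda}{\partial \xi_i} = C - \alpha_i - \beta_i = 0$$

Plugging into the Lagrangian we get the following dual formulation:

$$\max_\alpha\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j$$

$$\text{subject to: } \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \beta_i \ge 0, \quad C - \alpha_i - \beta_i = 0$$

Beta appears only in the constraints. Since $\beta_i = C - \alpha_i \ge 0$, replace it with the constraint $0 \le \alpha_i \le C$.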
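For a fixed hyperplane, the smallest slack that satisfies the relaxed constraint is $\xi_i = \max(0,\, 1 - y_i (w^T x_i + b))$. The sketch below computes these slacks and the soft-margin objective for a hand-picked hyperplane; the function name `slacks`, the data, and the values of `w`, `b`, and `C` are all assumptions chosen for illustration, and nothing here is optimized.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i * (w . x_i + b) >= 1 - xi_i, for fixed w and b."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

# Hypothetical data: the third point is a positive example on the wrong side.
X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0), (-2.0, 0.0)]
y = [1, 1, 1, -1]
w, b, C = (1.0, 0.0), 0.0, 1.0  # hand-picked, not the optimum

xi = slacks(w, b, X, y)
n_misclassified = sum(1 for xv, yv in zip(X, y) if yv * (dot(w, xv) + b) <= 0)
objective = 0.5 * dot(w, w) + C * sum(xi)
print(xi, n_misclassified, objective)
```

The second point is classified correctly but lies inside the margin, so it has a slack between 0 and 1; the misclassified third point has a slack larger than 1, which is why the sum of the slacks bounds the number of misclassified examples from above.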

The final form of the dual becomes

$$\max_\alpha\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to: } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

SVM: dual and primal

Primal:

$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to: } y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n.$$

Dual:

$$\max_\alpha\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to: } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Dual: simpler constraints; will allow us to use SVMs as nonlinear classifiers.

SVM solvers

Primal:

- Limited to linear SVMs
- Fast
- Software: LibLinear

Dual:

- Interior point methods (generic solvers for quadratic programming problems)
- SVM-specific solvers: SMO (optimize two alphas at a time). Software: LibSVM (a flavor of SMO)
- Approximate solvers (e.g. LASVM)

SMO

Sequential Minimal Optimization (SMO): a solver for the SVM dual problem. When you choose two variables, the resulting problem can be solved analytically!

Issues and tricks:

- Which two variables to choose?
- Shrinking: temporarily remove variables that are less likely to be chosen (those at the upper/lower bounds). Needs occasional unshrinking.

Platt, John (1998), Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.
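The two-variable analytic update at the heart of SMO can be sketched in a few dozen lines. What follows is a simplified teaching variant, not Platt's full algorithm and not what LibSVM implements: the second variable is picked at random instead of by a heuristic, there is no shrinking, and only a linear kernel is used. The function name `simplified_smo`, its parameters, and the toy data are assumptions for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def simplified_smo(X, y, C=10.0, tol=1e-3, max_passes=5, max_sweeps=1000, seed=0):
    """Simplified SMO for the soft-margin dual with a linear kernel (sketch only)."""
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0
    rng = random.Random(seed)

    def f(i):  # current decision value at training point i
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b

    passes = sweeps = 0
    while passes < max_passes and sweeps < max_sweeps:
        sweeps += 1
        changed = 0
        for i in range(n):
            E_i = f(i) - y[i]
            # Only update alpha_i if it violates the KKT conditions beyond tol.
            if not ((y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            j = rng.choice([k for k in range(n) if k != i])  # random second variable
            E_j = f(j) - y[j]
            a_i, a_j = alpha[i], alpha[j]
            # Box bounds that keep 0 <= alpha <= C and sum_i alpha_i y_i unchanged.
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
            if lo == hi or eta >= 0:
                continue
            # Analytic optimum along the constraint line, clipped to [lo, hi].
            new_j = min(hi, max(lo, a_j - y[j] * (E_i - E_j) / eta))
            if abs(new_j - a_j) < 1e-12:
                continue
            alpha[j] = new_j
            alpha[i] = a_i + y[i] * y[j] * (a_j - new_j)
            # Update the threshold b so the KKT conditions hold for the changed pair.
            b1 = b - E_i - y[i] * (alpha[i] - a_i) * K[i][i] - y[j] * (alpha[j] - a_j) * K[i][j]
            b2 = b - E_j - y[i] * (alpha[i] - a_i) * K[i][j] - y[j] * (alpha[j] - a_j) * K[j][j]
            if 0 < alpha[i] < C:
                b = b1
            elif 0 < alpha[j] < C:
                b = b2
            else:
                b = 0.5 * (b1 + b2)
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Hypothetical two-point dataset (linearly separable).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha, b = simplified_smo(X, y)
print(alpha, b)
```

On this two-point problem a single pairwise update already reaches the optimum, because the pair contains every variable in the problem. On larger datasets the choice of the pair strongly affects convergence speed, which is the issue the working-set heuristics and shrinking mentioned above address.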
