Support Vector Machine (SVM)
CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Outline
Margin concept
Hard-margin SVM
Soft-margin SVM
Dual problems of hard-margin SVM and soft-margin SVM
Nonlinear SVM
Kernel trick
Margin
Which line is better to select as the boundary in order to provide better generalization capability? A larger margin provides better generalization to unseen data.
The margin of a hyperplane that separates samples of two linearly separable classes is the smallest distance between the decision boundary and any of the training samples.
Maximum Margin
SVM finds the solution with the maximum margin.
Solution: the hyperplane that is farthest from all training samples.
The hyperplane with the largest margin has equal distances to the nearest samples of both classes.
Hard-Margin SVM: Optimization Problem
Decision boundary: $\mathbf{w}^\top \mathbf{x} + b = 0$. Margin hyperplanes: $\mathbf{w}^\top \mathbf{x} + b = 1$ and $\mathbf{w}^\top \mathbf{x} + b = -1$; the distance between them is $\frac{2}{\|\mathbf{w}\|}$.
Find the separating hyperplane with the maximum margin:
$$\max_{\mathbf{w},\,b}\ \frac{2}{\|\mathbf{w}\|} \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1,\quad i = 1, \dots, N$$
Hard-Margin SVM: Optimization Problem
We can fix the scale of $(\mathbf{w}, b)$ so that $y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) = 1$ for the samples closest to the boundary.
Rescaling $(\mathbf{w}, b)$ to $(c\mathbf{w}, cb)$ does not change the places of the boundary and the margin lines $\mathbf{w}^\top \mathbf{x} + b = \pm 1$.
Hard-Margin SVM: Optimization Problem
We can equivalently optimize:
$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1,\quad i = 1, \dots, N$$
It is a convex Quadratic Programming (QP) problem:
There are computationally efficient packages to solve it.
It has a unique global minimum (if any solution exists).
When the training samples are not linearly separable, it has no solution. How can we extend it to find a solution even when the classes are not linearly separable?
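As a concrete illustration, the following is a minimal sketch of this primal QP passed to a generic QP solver; the toy 2-D data, the tiny regularization on $b$, and the use of the cvxopt package are assumptions made only for this sketch.

import numpy as np
from cvxopt import matrix, solvers

# Toy 2-D data; labels must be in {-1, +1} and the classes linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape

# Decision variables z = [w_1, ..., w_d, b].
# Objective (1/2) w^T w becomes (1/2) z^T P z with q = 0.
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8                      # tiny regularization on b, only for numerical stability
q = np.zeros(d + 1)

# Constraints y_i (w^T x_i + b) >= 1, rewritten in the solver's form G z <= h.
G = -np.hstack([y[:, None] * X, y[:, None]])
h = -np.ones(N)

solvers.options["show_progress"] = False
z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))["x"]).ravel()
w, b = z[:d], z[d]
print("w =", w, "b =", b)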
Beyond Linear Separability
How can the hard-margin SVM be extended to allow classification errors?
Overlapping classes that can be approximately separated by a linear boundary.
Noise in otherwise linearly separable classes.
Beyond Linear Separability: Soft-Margin SVM
Minimizing the number of misclassified points? NP-complete.
Soft margin: maximize the margin while trying to minimize the distance between misclassified points and their correct margin plane.
Soft-Margin SVM
SVM with slack variables: allows samples to fall within the margin, but penalizes them:
$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, N$$
$\xi_i$: slack variables.
$\xi_i > 1$: the sample is misclassified.
$0 < \xi_i \le 1$: the sample is correctly classified but lies inside the margin.
Soft-Margin SVM
A linear penalty (hinge loss) is paid for a sample if it is misclassified or lies inside the margin.
The objective tries to keep the slack variables $\xi_i$ small while maximizing the margin.
It always finds a solution (as opposed to the hard-margin SVM) and is more robust to outliers.
The soft-margin problem is still a convex QP.
Soft-Margin SVM: Parameter C
$C$ is a tradeoff parameter:
Small $C$ allows the margin constraints to be easily ignored: large margin.
Large $C$ makes the constraints hard to ignore: narrow margin.
$C \to \infty$ enforces all constraints: hard margin.
$C$ can be determined using a technique such as cross-validation.
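A minimal sketch of choosing $C$ by cross-validation; the synthetic dataset, the grid of $C$ values, and the use of scikit-learn are assumptions made only for this sketch.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation over a logarithmic grid of C values.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], " cv accuracy:", grid.best_score_)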
Soft-Margin SVM: Cost Function
The soft-margin problem is equivalent to the unconstrained optimization problem:
$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\max\left(0,\ 1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
SVM Loss Function
Hinge loss vs. 0-1 loss:
$$\text{hinge loss:}\quad \max\left(0,\ 1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
[Figure: hinge loss and 0-1 loss plotted as functions of $y\left(\mathbf{w}^\top \mathbf{x} + b\right)$]
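A minimal numpy sketch comparing the two losses on a few illustrative margin values; the numbers are assumptions chosen only to cover the three regimes (outside the margin, inside the margin, misclassified).

import numpy as np

# Values of y * (w^T x + b) for a few hypothetical samples:
# >= 1 means outside the margin, in (0, 1) means inside the margin, <= 0 means misclassified.
margins = np.array([2.0, 1.0, 0.5, 0.0, -0.5, -2.0])

hinge = np.maximum(0.0, 1.0 - margins)      # max(0, 1 - y (w^T x + b))
zero_one = (margins <= 0).astype(float)     # 1 if misclassified, else 0

print("margins :", margins)
print("hinge   :", hinge)
print("0-1     :", zero_one)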
Optimization: Lagrangian Multipliers
Introduce a Lagrange multiplier for each inequality constraint: $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]^\top$, with $\alpha_i \ge 0$.
Optimization: Dual Problem
Primal problem: $p^* = \min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \ge 0} L(\mathbf{w}, \boldsymbol{\alpha})$
Dual problem: $d^* = \max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}} L(\mathbf{w}, \boldsymbol{\alpha})$, obtained by swapping the order of the optimizations.
In general we have weak duality: $\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}} L(\mathbf{w}, \boldsymbol{\alpha}) \le \min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \ge 0} L(\mathbf{w}, \boldsymbol{\alpha})$, i.e., $d^* \le p^*$.
When the original problem is convex (the objective and the inequality constraint functions are convex and the equality constraints are affine), we have strong duality: $d^* = p^*$.
Hard-Margin SVM: Dual Problem
By incorporating the constraints through Lagrange multipliers, we have:
$$\min_{\mathbf{w},\,b}\ \max_{\boldsymbol{\alpha} \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
Dual problem (changing the order of min and max in the above problem):
$$\max_{\boldsymbol{\alpha} \ge 0}\ \min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
Hard-Margin SVM: Dual Problem
Setting the derivatives of the inner minimization to zero:
$$\nabla_{\mathbf{w}} L = 0 \ \Rightarrow\ \mathbf{w} = \sum_{i=1}^{N}\alpha_i\, y^{(i)}\, \mathbf{x}^{(i)} \qquad \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$$
$\mathbf{w}$ and $b$ do not appear in the resulting dual objective; instead, a global constraint on $\boldsymbol{\alpha}$ is created.
Hard-Margin SVM: Dual Problem
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, \mathbf{x}^{(i)\top}\mathbf{x}^{(j)}$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $\alpha_i \ge 0,\ i = 1, \dots, N$.
It is a convex QP. By solving the above problem we first find $\boldsymbol{\alpha}$, then $\mathbf{w} = \sum_{i=1}^{N}\alpha_i\, y^{(i)}\, \mathbf{x}^{(i)}$, and then $b$.
Hard-Margin SVM: Dual Problem
In the dual problem, only the dot product of each pair of training samples appears.
This is an important property that helps extend SVM to the non-linear case: the cost function does not depend explicitly on the dimensionality of the feature space.
Hard-Margin SVM: Support Vectors
Support Vectors (SVs) = $\{\mathbf{x}^{(i)} \mid \alpha_i > 0\}$.
The direction of the hyperplane can be found based only on the support vectors: $\mathbf{w} = \sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \mathbf{x}^{(i)}$.
$b$ can be set by making the margin equidistant to the two classes; it can be found from any SV using $y^{(s)}\left(\mathbf{w}^\top \mathbf{x}^{(s)} + b\right) = 1$.
It is numerically safer to find $b$ using the equations of all SVs (averaging).
Hard-Margin SVM: Classifying New Samples Using Only SVs
Classification of a new sample $\mathbf{x}$:
$$\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \mathbf{x}^{(i)\top}\mathbf{x} + b\right)$$
The support vectors are sufficient to predict the labels of new samples.
The classifier is based on an expansion in terms of dot products of $\mathbf{x}$ with the support vectors.
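A minimal sketch of the hard-margin dual QP, with the support vectors identified and $\mathbf{w}$, $b$ recovered from them; the toy data, the numerical threshold on $\alpha_i$, and the use of the cvxopt package are assumptions made only for this sketch.

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = X.shape[0]

# Dual: max  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j
# cvxopt minimizes (1/2) a^T P a + q^T a, so P_ij = y_i y_j x_i.x_j and q = -1.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))            # -alpha_i <= 0   (i.e., alpha_i >= 0)
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))      # sum_i alpha_i y_i = 0
b_eq = matrix(np.zeros(1))

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

sv = alpha > 1e-6                               # support vectors: alpha_i > 0
w = (alpha[sv] * y[sv]) @ X[sv]                 # w = sum over SVs of alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                  # average of b over all SVs

x_new = np.array([1.0, 0.5])
print("prediction:", np.sign(w @ x_new + b))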
Karush-Kuhn-Tucker (KKT) Conditions
Necessary conditions for the solution:
$$\nabla_{\mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \qquad \frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = 0$$
$$\alpha_i \ge 0, \qquad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) - 1 \ge 0, \qquad \alpha_i\left(y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) - 1\right) = 0, \qquad i = 1, \dots, N$$
Hard-Margin SVM: Support Vectors
Inactive constraint: $y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) > 1$, so $\alpha_i = 0$ and thus $\mathbf{x}^{(i)}$ is not a support vector.
Active constraint: $y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) = 1$, so $\alpha_i$ can be greater than 0 and thus $\mathbf{x}^{(i)}$ can be a support vector.
Note that a sample with $\alpha_i = 0$ can still lie on one of the margin hyperplanes.
Soft-Margin SVM: Dual Problem
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, \mathbf{x}^{(i)\top}\mathbf{x}^{(j)}$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $0 \le \alpha_i \le C,\ i = 1, \dots, N$.
By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then find $\mathbf{w} = \sum_{i=1}^{N}\alpha_i\, y^{(i)}\, \mathbf{x}^{(i)}$, and $b$ is computed from the SVs.
For a test sample $\mathbf{x}$ (as before): $\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \mathbf{x}^{(i)\top}\mathbf{x} + b\right)$.
Soft-Margin SVM: Support Vectors
Support vectors: samples with $\alpha_i > 0$.
If $0 < \alpha_i < C$: SVs on the margin, $\xi_i = 0$.
If $\alpha_i = C$: SVs on or over the margin ($\xi_i$ may be greater than 0).
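In the dual, the only difference from the hard-margin case is the box constraint $0 \le \alpha_i \le C$. A minimal sketch of how this changes the constraint matrices of the earlier hard-margin dual sketch; $N$, $C$, and the reuse of cvxopt are assumptions made only for illustration.

import numpy as np
from cvxopt import matrix

N = 4          # number of training samples (as in the earlier toy example)
C = 1.0        # tradeoff parameter, e.g. chosen by cross-validation

# Stack  -alpha_i <= 0  and  alpha_i <= C  into a single system  G alpha <= h;
# P, q, A, and the equality constraint stay exactly as in the hard-margin dual.
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))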
Primal vs. Dual Soft-Margin SVM Problem
Primal problem of soft-margin SVM: $N$ inequality constraints, $N$ positivity constraints, $d + N + 1$ variables.
Dual problem of soft-margin SVM: one equality constraint, $2N$ positivity constraints, $N$ variables (the Lagrange multipliers), but a more complicated objective function.
The dual problem is helpful and instrumental for using the kernel trick.
Not Linearly Separable Data
Noisy data or overlapping classes (discussed above: soft margin): nearly linearly separable.
Non-linear decision surface: transform to a new feature space.
Nonlinear SVM
For nonlinearly separable classes, map the input to a new feature space:
$\Phi: \mathbf{x} \to \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})]^\top$
$\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: a set of basis functions (or features), with $\phi_i: \mathbb{R}^d \to \mathbb{R}$.
SVM in a Transformed Feature Space
Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ of the feature space.
Find a hyperplane in the transformed feature space: $\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b = 0$.
Basis Functions: Examples
Polynomial: $\phi_j(x) = x^j$
Gaussian: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
Sigmoid: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
[Bishop]
Soft-Margin SVM in a Transformed Space: Primal Problem
Primal problem:
$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)}) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, N$$
$\mathbf{w} \in \mathbb{R}^m$: the weights that must be found.
If $m \gg d$ (a very high-dimensional feature space), then there are many more parameters to learn.
Classifying a new sample:
$$\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \boldsymbol{\phi}(\mathbf{x}^{(i)})^\top \boldsymbol{\phi}(\mathbf{x}) + b\right)$$
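A minimal sketch of the explicit-feature-map approach: expand the input into polynomial features and fit a linear SVM in the transformed space. The dataset, the degree, and the use of scikit-learn are assumptions made only for this sketch.

from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Data that is not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# Map x -> phi(x) with all monomials up to degree 2, then fit a linear SVM
# in the transformed space; the number of learned weights grows with the map.
model = make_pipeline(PolynomialFeatures(degree=2), LinearSVC(C=1.0, max_iter=10000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))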
Soft-Margin SVM in a Transformed Space: Dual Problem
Optimization problem:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, \boldsymbol{\phi}(\mathbf{x}^{(i)})^\top \boldsymbol{\phi}(\mathbf{x}^{(j)})$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $0 \le \alpha_i \le C,\ i = 1, \dots, N$.
If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^\top \boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]^\top$ needs to be learned.
It is not necessary to learn the $m$ parameters of $\mathbf{w}$, as opposed to the primal problem.
Kernelized Soft-Margin SVM
Optimization problem:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $0 \le \alpha_i \le C,\ i = 1, \dots, N$.
Classifying a new sample:
$$\hat{y} = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, k(\mathbf{x}^{(i)}, \mathbf{x}) + b\right), \qquad k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$$
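A minimal sketch of a kernelized soft-margin SVM; the dataset, the choice of the Gaussian (RBF) kernel, the hyperparameter values, and the use of scikit-learn are assumptions made only for this sketch.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# The Gaussian (RBF) kernel k(x, x') = exp(-gamma ||x - x'||^2) replaces the
# explicit feature map; only the dual variables (one per sample) are learned.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)
print("number of support vectors:", clf.n_support_.sum())
print("training accuracy:", clf.score(X, y))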
SVM: Summary
Hard margin: maximizing the margin.
Soft margin: handling noisy data and overlapping classes via slack variables in the problem.
Dual problems of hard-margin and soft-margin SVM: they lead easily to the non-linear SVM method by kernel substitution, and they express the classifier decision in terms of support vectors.
Kernel SVMs learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data.