Lecture 2: The SVM classifier

1 Lecture 2: The SVM classifier
C19 Machine Learning, Hilary 2015, A. Zisserman
- Review of linear classifiers
- Linear separability
- Perceptron
- Support Vector Machine (SVM) classifier
- Wide margin
- Cost function
- Slack variables
- Loss functions revisited
- Optimization

2 Binary Classification
Given training data $(x_i, y_i)$ for $i = 1 \dots N$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(x)$ such that

$$f(x_i) \begin{cases} \geq 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification.

3 Linear separability
[Figure: two example datasets, one linearly separable and one not linearly separable]

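4 Linear classifiers
A linear classifier has the form
$$f(x) = w^\top x + b$$
[Figure: 2D plot with axes $X_1$ and $X_2$ showing the line $f(x) = 0$, with $f(x) < 0$ on one side and $f(x) > 0$ on the other]
- in 2D the discriminant is a line
- $w$ is the normal to the line, and $b$ the bias
- $w$ is known as the weight vector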

5 Linear classifiers
A linear classifier has the form
$$f(x) = w^\top x + b$$
- in 3D the discriminant is a plane, and in nD it is a hyperplane
- For a K-NN classifier it was necessary to "carry" the training data
- For a linear classifier, the training data is used to learn $w$ and then discarded
- Only $w$ is needed for classifying new data
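
As a minimal sketch of the last point, the snippet below classifies new points using only a weight vector and a bias, with no training data kept around. The values of `w` and `b` are hypothetical (the line $x_1 + x_2 - 1 = 0$), and the rule "predict +1 when $f(x) \geq 0$" follows the convention on the binary classification slide.

```python
def f(w, b, x):
    """Linear discriminant f(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 if f(x) >= 0 and -1 otherwise."""
    return 1 if f(w, b, x) >= 0 else -1


# Hypothetical 2D model: the discriminant is the line x1 + x2 - 1 = 0.
w, b = [1.0, 1.0], -1.0

print(classify(w, b, [2.0, 2.0]))  # a point on the positive side of the line
print(classify(w, b, [0.0, 0.0]))  # a point on the negative side of the line
```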

6 The Perceptron Classifier
Given linearly separable data $x_i$ labelled into two categories $y_i \in \{-1, 1\}$, find a weight vector $w$ such that the discriminant function
$$f(x_i) = w^\top x_i + b$$
separates the categories for $i = 1, \dots, N$.
How can we find this separating hyperplane?

The Perceptron Algorithm
Write the classifier as $f(x_i) = \tilde{w}^\top \tilde{x}_i + w_0 = w^\top x_i$, where $w = (\tilde{w}, w_0)$ and $x_i = (\tilde{x}_i, 1)$.
- Initialize $w = 0$
- Cycle through the data points $\{x_i, y_i\}$
  - if $x_i$ is misclassified then $w \leftarrow w - \alpha\, \mathrm{sign}(f(x_i))\, x_i$
- Until all the data is correctly classified
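
The algorithm above can be sketched in a few lines. This is an illustration under stated assumptions rather than the lecture's own code: the update is written as $w \leftarrow w + \alpha y_i x_i$, which moves $w$ toward classifying $x_i$ correctly (for a misclassified point with $f(x_i) \neq 0$, $y_i = -\mathrm{sign}(f(x_i))$); a point with $y_i f(x_i) \leq 0$ is treated as misclassified so that the zero initialization triggers a first update; and the function name, toy data and `max_epochs` cap are hypothetical.

```python
def perceptron(points, labels, alpha=1.0, max_epochs=1000):
    """Train a perceptron; returns the augmented weight vector (w, w0)."""
    # Append a constant 1 to every point so the bias becomes the last weight.
    data = [list(x) + [1.0] for x in points]
    w = [0.0] * len(data[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(data, labels):
            fx = sum(wj * xj for wj, xj in zip(w, x))
            if y * fx <= 0:  # misclassified, or exactly on the boundary
                w = [wj + alpha * y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            return w
    raise RuntimeError("no convergence; the data may not be linearly separable")


# Hypothetical linearly separable toy data.
points = [(2.0, 2.0), (3.0, 1.0), (-1.0, -1.0), (-2.0, 0.0)]
labels = [1, 1, -1, -1]
w = perceptron(points, labels)
print(w)
```

The loop only terminates normally when the data is linearly separable, which is why the sketch caps the number of passes.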

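7 For example in 2D
- Initialize $w = 0$
- Cycle through the data points $\{x_i, y_i\}$
  - if $x_i$ is misclassified then $w \leftarrow w - \alpha\, \mathrm{sign}(f(x_i))\, x_i$
- Until all the data is correctly classified
[Figure: two panels with axes $X_1$ and $X_2$, showing $w$ and a misclassified point $x_i$ before the update and after the update]
NB after convergence $w = \sum_i^N \alpha_i x_i$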

8 Perceptron example
- if the data is linearly separable, then the algorithm will converge
- convergence can be slow
- separating line close to training data
- we would prefer a larger margin for generalization

9 What is the best $w$?
- maximum margin solution: most stable under perturbations of the inputs

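10 Support Vector Machine
linearly separable data
[Figure: separating hyperplane $w^\top x + b = 0$ with normal $w$, at distance $b / \|w\|$ from the origin, with the support vectors marked on either side]
$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$
where the sum runs over the support vectors.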

11 SVM sketch derivation
- Since $w^\top x + b = 0$ and $c(w^\top x + b) = 0$ define the same plane, we have the freedom to choose the normalization of $w$
- Choose normalization such that $w^\top x_+ + b = +1$ and $w^\top x_- + b = -1$ for the positive and negative support vectors respectively
- Then the margin is given by
$$\frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{w^\top (x_+ - x_-)}{\|w\|} = \frac{2}{\|w\|}$$
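
A quick numerical check of this derivation, as a sketch with hypothetical numbers: take $w = (1, 1)$, $b = -3$, a positive support vector $x_+ = (2, 2)$ and a negative one $x_- = (1, 1)$. These satisfy the chosen normalization, so projecting $x_+ - x_-$ onto the unit normal should give the same value as $2 / \|w\|$.

```python
import math


def f(w, b, x):
    """Linear discriminant f(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def margin_from_support_vectors(w, x_pos, x_neg):
    """Project x_pos - x_neg onto the unit normal w / ||w||."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return sum(wj * (p - n) for wj, p, n in zip(w, x_pos, x_neg)) / norm


def margin_from_w(w):
    """Closed form 2 / ||w|| obtained from the normalization."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical values chosen so that f(x_pos) = +1 and f(x_neg) = -1.
w, b = [1.0, 1.0], -3.0
x_pos, x_neg = [2.0, 2.0], [1.0, 1.0]

print(f(w, b, x_pos), f(w, b, x_neg))
print(margin_from_support_vectors(w, x_pos, x_neg), margin_from_w(w))
```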

12 Support Vector Machine
linearly separable data
[Figure: hyperplanes $w^\top x + b = 1$, $w^\top x + b = 0$ and $w^\top x + b = -1$ with normal $w$; the support vectors lie on the two outer planes]
$$\text{Margin} = \frac{2}{\|w\|}$$

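13 SVM Optimization
Learning the SVM can be formulated as an optimization:
$$\max_w \frac{2}{\|w\|} \quad \text{subject to} \quad w^\top x_i + b \; \begin{cases} \geq 1 & \text{if } y_i = +1 \\ \leq -1 & \text{if } y_i = -1 \end{cases} \quad \text{for } i = 1 \dots N$$
Or equivalently
$$\min_w \|w\|^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \geq 1 \quad \text{for } i = 1 \dots N$$
This is a quadratic optimization problem subject to linear constraints and there is a unique minimum.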

14 Linear separability again: What is the best $w$?
- the points can be linearly separated but there is a very narrow margin
- but possibly the large margin solution is better, even though one constraint is violated
In general there is a trade-off between the margin and the number of mistakes on the training data.

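15 Introduce slack variables $\xi_i \geq 0$
- for $0 < \xi \leq 1$ the point is between the margin and the correct side of the hyperplane. This is a margin violation
- for $\xi > 1$ the point is misclassified
- support vectors have $\xi = 0$
[Figure: hyperplanes $w^\top x + b = 1$, $w^\top x + b = 0$ and $w^\top x + b = -1$ with normal $w$ and margin $2 / \|w\|$; a misclassified point at distance $\xi_i / \|w\| > 2 / \|w\|$ and a margin-violating point at distance $\xi_i / \|w\| < 1 / \|w\|$]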

16 Soft margin solution
The optimization problem becomes
$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i$$
subject to
$$y_i (w^\top x_i + b) \geq 1 - \xi_i \quad \text{for } i = 1 \dots N$$
- Every constraint can be satisfied if $\xi_i$ is sufficiently large
- $C$ is a regularization parameter:
  - small $C$ allows constraints to be easily ignored: large margin
  - large $C$ makes constraints hard to ignore: narrow margin
  - $C = \infty$ enforces all constraints: hard margin
- This is still a quadratic optimization problem and there is a unique minimum. Note, there is only one parameter, $C$.
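
To make the role of $C$ concrete, the sketch below evaluates the soft-margin objective for two hand-picked 1D classifiers on hypothetical data containing one outlier. For a fixed $(w, b)$ it uses the smallest slack that satisfies each constraint, $\xi_i = \max(0, 1 - y_i f(x_i))$. The data, the two candidate classifiers and the function names are all assumptions made for illustration; no optimizer is run.

```python
def objective(w, b, xs, ys, C):
    """Soft-margin cost ||w||^2 + C * sum(xi) for a 1D classifier f(x) = w*x + b.

    Each slack xi is the smallest non-negative value satisfying
    y * f(x) >= 1 - xi.
    """
    slack = [max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys)]
    return w * w + C * sum(slack)


# Hypothetical 1D data: the negative point at x = 1.0 sits close to the positives.
xs = [2.0, 3.0, -2.0, -3.0, 1.0]
ys = [1, 1, -1, -1, -1]

narrow = (2.0, -3.0)  # separates every point, margin 2/|w| = 1
wide = (0.5, 0.0)     # margin 2/|w| = 4, but violates the constraint at x = 1.0

for C in (1.0, 10.0):
    print(C, objective(*narrow, xs, ys, C), objective(*wide, xs, ys, C))
```

With the small $C$ the wide-margin candidate has the lower cost despite its violation; with the large $C$ the narrow-margin candidate wins.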

17 [Figure: scatter plot with axes feature x and feature y]
- data is linearly separable
- but only with a narrow margin

18 C = Infinity hard margin

19 C = 10 soft margin

20 Application: Pedestrian detection in Computer Vision
Objective: detect (localize) standing humans in an image
- cf. face detection with a sliding window classifier
- reduces object detection to binary classification
- does an image window contain a person or not?
Method: the HOG detector

21 Training data and features
- Positive data: 1208 positive window examples
- Negative data: 1218 negative window examples (initially)

22 Feature: histogram of oriented gradients (HOG)
[Figure: image, dominant direction and HOG panels; histogram of frequency against orientation]
- tile window into 8 x 8 pixel cells
- each cell represented by HOG
Feature vector dimension = 16 x 8 (for tiling) x 8 (orientations) = 1024
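
The dimension count can be reproduced with a few lines of arithmetic. The sketch assumes non-overlapping cells, in which case a 16 x 8 tiling of 8 x 8 pixel cells also implies a 128 x 64 pixel window; the variable names are illustrative only.

```python
tiling = (16, 8)      # number of cells along each side of the window
cell_size_px = 8      # each cell covers 8 x 8 pixels
orientations = 8      # histogram bins per cell

# One histogram of `orientations` bins per cell.
feature_dim = tiling[0] * tiling[1] * orientations

# Window size in pixels, assuming the cells do not overlap.
window_px = (tiling[0] * cell_size_px, tiling[1] * cell_size_px)

print(feature_dim)  # 1024
print(window_px)    # (128, 64)
```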

24 Averaged positive examples

25 Algorithm
Training (Learning)
- Represent each example window by a HOG feature vector $x_i \in \mathbb{R}^d$, with $d = 1024$
- Train a SVM classifier
Testing (Detection)
- Sliding window classifier $f(x) = w^\top x + b$

26 Dalal and Triggs, CVPR 2005

27 Learned model
$$f(x) = w^\top x + b$$
Slide from Deva Ramanan

28 Slide from Deva Ramanan

29 Optimization
Learning an SVM has been formulated as a constrained optimization problem over $w$ and $\xi$
$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i (w^\top x_i + b) \geq 1 - \xi_i \quad \text{for } i = 1 \dots N$$
The constraint $y_i (w^\top x_i + b) \geq 1 - \xi_i$ can be written more concisely as
$$y_i f(x_i) \geq 1 - \xi_i$$
which, together with $\xi_i \geq 0$, is equivalent to
$$\xi_i = \max(0, 1 - y_i f(x_i))$$
Hence the learning problem is equivalent to the unconstrained optimization problem over $w$
$$\min_{w \in \mathbb{R}^d} \underbrace{\|w\|^2}_{\text{regularization}} + C \sum_i^N \underbrace{\max(0, 1 - y_i f(x_i))}_{\text{loss function}}$$
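
The step from the two constraints to the $\max$ expression relies on the objective pushing each $\xi_i$ as low as the constraints allow. The sketch below checks this on a few hypothetical $(y_i, f(x_i))$ pairs: the hinge value satisfies both constraints, and a slightly smaller slack violates at least one of them. The function names and the sample values are assumptions for illustration.

```python
def hinge(y, fx):
    """Smallest feasible slack: max(0, 1 - y * f(x))."""
    return max(0.0, 1.0 - y * fx)


def feasible(y, fx, xi):
    """Check both constraints: y * f(x) >= 1 - xi and xi >= 0."""
    return y * fx >= 1.0 - xi and xi >= 0.0


# Hypothetical (label, classifier output) pairs.
examples = [(1, 2.5), (1, 1.0), (1, 0.25), (-1, 0.5)]

for y, fx in examples:
    xi = hinge(y, fx)
    # The hinge value is feasible; anything slightly smaller is not.
    print(y, fx, xi, feasible(y, fx, xi), feasible(y, fx, xi - 0.01))
```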

30 Loss function
$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \underbrace{\max(0, 1 - y_i f(x_i))}_{\text{loss function}}$$
[Figure: hyperplane $w^\top x + b = 0$ with normal $w$ and the support vectors marked]
Points are in three categories:
1. $y_i f(x_i) > 1$: point is outside margin. No contribution to loss.
2. $y_i f(x_i) = 1$: point is on margin. No contribution to loss. As in hard margin case.
3. $y_i f(x_i) < 1$: point violates margin constraint. Contributes to loss.

31 Loss functions
[Figure: loss plotted against $y_i f(x_i)$]
- SVM uses the "hinge" loss $\max(0, 1 - y_i f(x_i))$
- an approximation to the 0-1 loss
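
The relationship between the two losses can be tabulated directly. In this sketch the 0-1 loss counts a point with $y_i f(x_i) \leq 0$ as an error, which is a convention assumed here; under it the hinge loss is never smaller than the 0-1 loss at any margin value.

```python
def zero_one_loss(margin):
    """1 for a misclassified point (margin <= 0, by assumption), else 0."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Hinge loss max(0, 1 - margin), where margin = y * f(x)."""
    return max(0.0, 1.0 - margin)


margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
for m in margins:
    print(m, zero_one_loss(m), hinge_loss(m))
```

Points with a margin between 0 and 1 are classified correctly, so their 0-1 loss is zero, yet they still incur hinge loss because they violate the margin.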

32 Optimization continued
$$\min_{w \in \mathbb{R}^d} C \sum_i^N \max(0, 1 - y_i f(x_i)) + \|w\|^2$$
[Figure: a cost curve with a local minimum and a global minimum marked]
- Does this cost function have a unique solution?
- Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?
If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).

33 Convex functions

34 Convex function examples
[Figure: example functions, some labelled convex and some labelled not convex]
A non-negative sum of convex functions is convex
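
One way to build intuition for this is to test the defining inequality $g(\theta a + (1 - \theta) b) \leq \theta g(a) + (1 - \theta) g(b)$ numerically. The sketch below does so on a grid for a 1D hinge term, a quadratic term and a non-negative weighted sum of the two, and contrasts them with a non-convex quartic. Passing the grid test is only evidence, not a proof of convexity, and all function names here are hypothetical.

```python
def passes_convexity_check(func, grid, thetas=(0.25, 0.5, 0.75), tol=1e-12):
    """Return False if any grid pair violates the convexity inequality."""
    for a in grid:
        for b in grid:
            for t in thetas:
                mid = t * a + (1 - t) * b
                if func(mid) > t * func(a) + (1 - t) * func(b) + tol:
                    return False
    return True


def hinge(z):
    return max(0.0, 1.0 - z)


def square(z):
    return z * z


def cost(z):
    # Non-negative weighted sum of two convex functions.
    return 3.0 * hinge(z) + square(z)


def quartic(z):
    # Not convex: it has two separate minima with a bump between them.
    return z ** 4 - 4.0 * z * z


grid = [i / 4 for i in range(-12, 13)]  # -3.0, -2.75, ..., 3.0
for g in (hinge, square, cost, quartic):
    print(g.__name__, passes_convexity_check(g, grid))
```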

35 SVM
$$\min_{w \in \mathbb{R}^d} \underbrace{C \sum_i^N \max(0, 1 - y_i f(x_i))}_{\text{convex}} + \underbrace{\|w\|^2}_{\text{convex}}$$

36 Gradient (or steepest) descent algorithm for SVM
To minimize a cost function $\mathcal{C}(w)$ use the iterative update
$$w_{t+1} \leftarrow w_t - \eta_t \nabla_w \mathcal{C}(w_t)$$
where $\eta$ is the learning rate.
First, rewrite the optimization problem as an average
$$\min_w \mathcal{C}(w) = \frac{\lambda}{2} \|w\|^2 + \frac{1}{N} \sum_i^N \max(0, 1 - y_i f(x_i)) = \frac{1}{N} \sum_i^N \left( \frac{\lambda}{2} \|w\|^2 + \max(0, 1 - y_i f(x_i)) \right)$$
(with $\lambda = 2/(NC)$ up to an overall scale of the problem) and $f(x) = w^\top x + b$.
Because the hinge loss is not differentiable, a sub-gradient is computed.

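37 Sub-gradient for hinge loss
$$L(x_i, y_i; w) = \max(0, 1 - y_i f(x_i)), \qquad f(x_i) = w^\top x_i + b$$
[Figure: hinge loss plotted against $y_i f(x_i)$, with the two linear pieces annotated]
$$\frac{\partial L}{\partial w} = -y_i x_i \quad \text{if } y_i f(x_i) < 1, \qquad \frac{\partial L}{\partial w} = 0 \quad \text{otherwise}$$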
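
A minimal sketch of this sub-gradient with respect to $w$, assuming the choice of $0$ at the kink $y_i f(x_i) = 1$ (any value between the two pieces would be a valid sub-gradient there). The function name and the sample values are hypothetical.

```python
def hinge_subgradient(w, b, x, y):
    """Sub-gradient of max(0, 1 - y * (w^T x + b)) with respect to w."""
    fx = sum(wj * xj for wj, xj in zip(w, x)) + b
    if y * fx < 1.0:
        # Margin violated: the active piece is 1 - y * f(x).
        return [-y * xj for xj in x]
    # On or outside the margin: the loss is flat (0 chosen at the kink).
    return [0.0] * len(w)


w, b = [0.5, -0.5], 0.0
print(hinge_subgradient(w, b, [1.0, 2.0], 1))  # violates the margin
print(hinge_subgradient(w, b, [4.0, 0.0], 1))  # outside the margin
```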

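38 Sub-gradient descent algorithm for SVM
$$\mathcal{C}(w) = \frac{1}{N} \sum_i^N \left( \frac{\lambda}{2} \|w\|^2 + L(x_i, y_i; w) \right)$$
The iterative update is
$$w_{t+1} \leftarrow w_t - \eta \nabla_{w_t} \mathcal{C}(w_t) = w_t - \eta \frac{1}{N} \sum_i^N \left( \lambda w_t + \nabla_w L(x_i, y_i; w_t) \right)$$
where $\eta$ is the learning rate.
Then each iteration $t$ involves cycling through the training data with the updates:
$$w_{t+1} \leftarrow \begin{cases} w_t - \eta (\lambda w_t - y_i x_i) & \text{if } y_i f(x_i) < 1 \\ w_t - \eta \lambda w_t & \text{otherwise} \end{cases}$$
In the Pegasos algorithm the learning rate is set at $\eta_t = \frac{1}{\lambda t}$.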
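
The per-sample updates can be sketched as follows. This is an illustrative reading of the slide, not the reference Pegasos implementation: one training point is drawn at random per step, the step size is $\eta_t = 1/(\lambda t)$, and the bias is dropped ($b = 0$), which suits the hypothetical toy data because it is symmetric about the origin. The function name, `lam`, `n_steps` and `seed` are assumptions.

```python
import random


def pegasos(points, labels, lam=0.1, n_steps=2000, seed=0):
    """Stochastic sub-gradient descent on lam/2 * ||w||^2 + hinge loss (no bias)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = [0.0] * len(points[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(points))  # randomly sample one training point
        x, y = points[i], labels[i]
        eta = 1.0 / (lam * t)           # Pegasos learning rate
        fx = sum(wj * xj for wj, xj in zip(w, x))
        if y * fx < 1.0:
            # Margin violated: regularizer plus hinge sub-gradient.
            w = [wj - eta * (lam * wj - y * xj) for wj, xj in zip(w, x)]
        else:
            # Margin satisfied: only the regularizer shrinks w.
            w = [wj - eta * lam * wj for wj in w]
    return w


# Hypothetical toy data, symmetric about the origin.
points = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w = pegasos(points, labels)
print(w)
```

With this step size each update can be rewritten as $w_{t+1} = (1 - 1/t)\, w_t + \frac{1}{\lambda t} y_i x_i$ when the margin is violated, so the first step discards the initial $w$ entirely.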

39 Pegasos: Stochastic Gradient Descent Algorithm
Randomly sample from the training data
[Figure: plot of the energy (cost) during optimization]

40 Background reading and more
- Next lecture: see that the SVM can be expressed as a sum over the support vectors:
$$f(x) = \sum_i \alpha_i y_i (x_i^\top x) + b$$
- On web page: links to SVM tutorials and video lectures
- MATLAB SVM demo
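
As a small preview of that form, the sketch below evaluates $f(x)$ as a sum over two hypothetical support vectors and compares it with the equivalent $w^\top x + b$, where $w = \sum_i \alpha_i y_i x_i$. The support vectors, the $\alpha_i$ values and $b$ are made-up numbers for illustration, not the output of a trained SVM.

```python
def dot(u, v):
    return sum(uj * vj for uj, vj in zip(u, v))


# Hypothetical support vectors as (x_i, y_i, alpha_i) triples, plus a bias.
support_vectors = [((2.0, 2.0), 1, 1.0), ((1.0, 1.0), -1, 1.0)]
b = -3.0


def f_support(x):
    """f(x) = sum_i alpha_i * y_i * (x_i^T x) + b over the support vectors."""
    return sum(alpha * y * dot(sv, x) for sv, y, alpha in support_vectors) + b


# The same classifier in weight-vector form: w = sum_i alpha_i * y_i * x_i.
w = [sum(alpha * y * sv[j] for sv, y, alpha in support_vectors) for j in range(2)]


def f_weights(x):
    return dot(w, x) + b


x_new = (3.0, 3.0)
print(w, f_support(x_new), f_weights(x_new))
```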
