Lecture 3: SVM dual, kernels and regression
C19 Machine Learning, Hilary 2015, A. Zisserman

• Primal and dual forms
• Linear separability revisited
• Feature maps
• Kernels for SVMs
• Regression
• Ridge regression
• Basis functions
SVM review

We have seen that for an SVM, learning a linear classifier
  f(x) = w^T x + b
is formulated as solving an optimization problem over w:
  min_{w ∈ R^d} ||w||² + C Σ_i max(0, 1 − y_i f(x_i))
This quadratic optimization problem is known as the primal problem.

Instead, the SVM can be formulated to learn a linear classifier
  f(x) = Σ_i α_i y_i (x_i^T x) + b
by solving an optimization problem over the α_i. This is known as the dual problem, and we will look at the advantages of this formulation.
Sketch derivation of dual form

The Representer Theorem states that the solution w can always be written as a linear combination of the training data:
  w = Σ_{j=1}^N α_j y_j x_j
Proof: see example sheet.

Now, substitute for w in f(x) = w^T x + b:
  f(x) = (Σ_j α_j y_j x_j)^T x + b = Σ_j α_j y_j (x_j^T x) + b
and for w in the cost function min_w ||w||² subject to y_i (w^T x_i + b) ≥ 1 for all i:
  ||w||² = (Σ_j α_j y_j x_j)^T (Σ_k α_k y_k x_k) = Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)

Hence, an equivalent optimization problem is over the α_j:
  min_α Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)   subject to   y_i ( Σ_j α_j y_j (x_j^T x_i) + b ) ≥ 1, for all i
and a few more steps are required to complete the derivation.
Primal and dual formulations

N is the number of training points, and d is the dimension of the feature vector x.

Primal problem: for w ∈ R^d
  min_{w ∈ R^d} ||w||² + C Σ_i max(0, 1 − y_i f(x_i))

Dual problem: for α ∈ R^N (stated without proof):
  max_α Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)   subject to   0 ≤ α_i ≤ C for all i, and Σ_i α_i y_i = 0

• Need to learn d parameters for the primal, and N for the dual
• If N << d then it is more efficient to solve for α than for w
• The dual form only involves (x_j^T x_k). We will return to why this is an advantage when we look at kernels.
Primal and dual formulations

Primal version of classifier:  f(x) = w^T x + b
Dual version of classifier:  f(x) = Σ_i α_i y_i (x_i^T x) + b

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points x_i. However, many of the α_i are zero. The ones that are non-zero define the support vectors x_i.
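To make the dual form concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available; the toy data is made up) that fits a linear SVM and evaluates f(x) = Σ_i α_i y_i (x_i^T x) + b using only the support vectors. In scikit-learn's SVC, dual_coef_ stores the products α_i y_i for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data (hypothetical): two linearly separable clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=10.0).fit(X, y)

# Dual-form evaluation: f(x) = sum_i alpha_i y_i (x_i . x) + b
# dual_coef_[0] holds alpha_i * y_i for the support vectors only.
alpha_y = clf.dual_coef_[0]        # shape (n_support_vectors,)
sv = clf.support_vectors_          # the x_i with non-zero alpha_i
b = clf.intercept_[0]

x_test = np.array([0.5, -0.3])
f_dual = np.sum(alpha_y * (sv @ x_test)) + b

# Agrees (up to numerical precision) with scikit-learn's own decision function
print(f_dual, clf.decision_function(x_test[None, :])[0])
```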
Support Vector Machine

[Figure: maximum-margin hyperplane w^T x + b = 0, with normal w, offset b/||w||, and the support vectors lying on the margin]

  f(x) = Σ_i α_i y_i (x_i^T x) + b,   where the sum runs over the support vectors
[Figure: C = 1, soft margin]
Handling data that is not linearly separable

Introduce slack variables ξ_i ≥ 0:
  min_{w ∈ R^d, ξ_i ∈ R^+} ||w||² + C Σ_i ξ_i
  subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i  for i = 1 ... N

But what if a linear classifier is not appropriate for the data at all?
Solution 1: use polar coordinates

  Φ : (x_1, x_2) → (r, θ),   R² → R²

The data is linearly separable in polar coordinates. The map acts non-linearly in the original space.
Solution 2: map data to a higher dimension

  Φ : (x_1, x_2) → (x_1², x_2², √2 x_1 x_2),   R² → R³

The data is linearly separable in 3D. This means that the problem can still be solved by a linear classifier.
SVM classifiers in a transformed feature space

  Φ : x → Φ(x),   R^d → R^D
Φ(x) is a feature map.

Learn a classifier linear in w for R^D:
  f(x) = w^T Φ(x) + b
Primal classifier in a transformed feature space

Classifier, with w ∈ R^D:
  f(x) = w^T Φ(x) + b
Learning, for w ∈ R^D:
  min_{w ∈ R^D} ||w||² + C Σ_i max(0, 1 − y_i f(x_i))

• Simply map x to Φ(x), where the data is separable
• Solve for w in the high-dimensional space R^D
• If D >> d then there are many more parameters to learn for w. Can this be avoided?
Dual classifier in a transformed feature space

Classifier:
  original:     f(x) = Σ_i α_i y_i (x_i^T x) + b
  transformed:  f(x) = Σ_i α_i y_i Φ(x_i)^T Φ(x) + b

Learning:
  original:     max_α Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)
  transformed:  max_α Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k Φ(x_j)^T Φ(x_k)
  subject to  0 ≤ α_i ≤ C for all i, and Σ_i α_i y_i = 0
Dual classifier in a transformed feature space

• Note that Φ(x) only occurs in pairs Φ(x_j)^T Φ(x_i)
• Once the scalar products are computed, only the N-dimensional vector α needs to be learnt; it is not necessary to learn in the D-dimensional space, as it is for the primal
• Write k(x_j, x_i) = Φ(x_j)^T Φ(x_i). This is known as a kernel.

Classifier:
  f(x) = Σ_i α_i y_i k(x_i, x) + b
Learning:
  max_α Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k k(x_j, x_k)
  subject to  0 ≤ α_i ≤ C for all i, and Σ_i α_i y_i = 0
Special transformations

  Φ : (x_1, x_2) → (x_1², x_2², √2 x_1 x_2),   R² → R³

  Φ(x)^T Φ(z) = (x_1², x_2², √2 x_1 x_2) (z_1², z_2², √2 z_1 z_2)^T
              = x_1² z_1² + x_2² z_2² + 2 x_1 x_2 z_1 z_2
              = (x_1 z_1 + x_2 z_2)²
              = (x^T z)²

Kernel trick:
• The classifier can be learnt and applied without explicitly computing Φ(x)
• All that is required is the kernel k(x, z) = (x^T z)²
• The complexity of learning depends on N (typically it is O(N³)), not on D
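A quick numerical check of the identity above; a minimal sketch assuming NumPy, with the feature map phi and the two sample points chosen purely for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map R^2 -> R^3: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    """Kernel computed directly in the original space: (x . z)^2."""
    return (x @ z) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.7])

# The two quantities agree: the scalar product in R^3 equals the kernel in R^2
print(phi(x) @ phi(z), k(x, z))
```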
Example kernels

• Linear kernels: k(x, x′) = x^T x′
• Polynomial kernels: k(x, x′) = (1 + x^T x′)^d for any d > 0
  Contains all polynomial terms up to degree d
• Gaussian kernels: k(x, x′) = exp(−||x − x′||² / 2σ²) for σ > 0
  Infinite-dimensional feature space
SVM classifier with Gaussian kernel

  f(x) = Σ_{i=1}^N α_i y_i k(x_i, x) + b
where N is the size of the training data, α_i is a weight (which may be zero), and x_i is a support vector.

Gaussian kernel: k(x, x′) = exp(−||x − x′||² / 2σ²)

Radial Basis Function (RBF) SVM:
  f(x) = Σ_i α_i y_i exp(−||x − x_i||² / 2σ²) + b
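A minimal sketch of an RBF-kernel SVM, assuming scikit-learn; the ring-shaped toy data and the values of σ and C are illustrative. Note that scikit-learn parametrizes the RBF kernel as exp(−γ||x − x′||²), so γ = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical non-linearly-separable data: class +1 inside a disc, -1 outside
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 0.5, 1, -1)

sigma, C = 0.25, 10.0
# scikit-learn's RBF kernel is exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 sigma^2)
clf = SVC(kernel='rbf', gamma=1.0 / (2 * sigma**2), C=C).fit(X, y)

# decision_function returns f(x) = sum_i alpha_i y_i k(x_i, x) + b
print(clf.decision_function(np.array([[0.0, 0.0], [0.9, 0.9]])))
print("number of support vectors:", len(clf.support_))
```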
RBF Kernel SVM Example

[Figure: training points plotted against feature x and feature y]
The data is not linearly separable in the original feature space.
σ = 1.0, C = ∞
[Figure: decision boundary f(x) = 0 with margin contours f(x) = 1 and f(x) = −1]
  f(x) = Σ_i α_i y_i exp(−||x − x_i||² / 2σ²) + b
σ = 1.0, C = 100
Decreasing C gives a wider (soft) margin.
σ = 1.0, C = 10
  f(x) = Σ_i α_i y_i exp(−||x − x_i||² / 2σ²) + b
σ = 1.0, C = ∞
  f(x) = Σ_i α_i y_i exp(−||x − x_i||² / 2σ²) + b
σ = 0.25, C = ∞
Decreasing σ moves towards a nearest-neighbour classifier.
σ = 0.1, C = ∞
  f(x) = Σ_i α_i y_i exp(−||x − x_i||² / 2σ²) + b
Kernel Trick - Summary

• Classifiers can be learnt for high-dimensional feature spaces without actually having to map the points into the high-dimensional space
• Data may be linearly separable in the high-dimensional space, but not linearly separable in the original feature space
• Kernels can be used for an SVM because of the scalar product in the dual form, but they can also be used elsewhere: they are not tied to the SVM formalism
• Kernels also apply to objects that are not vectors, e.g. k(h, h′) = Σ_k min(h_k, h′_k) for histograms with bins h_k, h′_k
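As an illustration of the last point, the sketch below (assuming scikit-learn and NumPy; the histograms and labels are made up) uses the histogram intersection kernel as a precomputed Gram matrix, which is one way to plug a non-standard kernel into an SVM.

```python
import numpy as np
from sklearn.svm import SVC

def hist_intersection(H1, H2):
    """Histogram intersection kernel k(h, h') = sum_k min(h_k, h'_k),
    computed for every pair of rows of H1 and H2."""
    return np.array([[np.minimum(h1, h2).sum() for h2 in H2] for h1 in H1])

# Made-up training histograms (rows sum to 1) and labels
rng = np.random.default_rng(2)
H_train = rng.dirichlet(np.ones(8), size=30)
y_train = np.where(H_train[:, 0] > H_train[:, -1], 1, -1)

clf = SVC(kernel='precomputed', C=1.0)
clf.fit(hist_intersection(H_train, H_train), y_train)

# At test time the kernel is evaluated between test and training histograms
H_test = rng.dirichlet(np.ones(8), size=5)
print(clf.predict(hist_intersection(H_test, H_train)))
```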
Regression

Suppose we are given a training set of N observations ((x_1, y_1), ..., (x_N, y_N)) with x_i ∈ R^d, y_i ∈ R.
The regression problem is to estimate f(x) from this data such that y_i = f(x_i).
Learning by optimization

As in the case of classification, learning a regressor can be formulated as an optimization: minimize with respect to f ∈ F
  Σ_{i=1}^N l(f(x_i), y_i) + λ R(f)
where l is the loss function and R(f) is the regularization.

There is a choice of both loss function and regularization, e.g.
• loss: squared loss, SVM hinge-like loss
• regularizer: squared regularizer, lasso regularizer
Choice of regression function: non-linear basis functions

The regression function f(x, w) is a non-linear function of x, but linear in w:
  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_M φ_M(x) = w^T Φ(x)

For example, for x ∈ R, polynomial regression with φ_j(x) = x^j:
  f(x, w) = Σ_{j=0}^M w_j x^j = w^T Φ(x)
e.g. for M = 3:
  f(x, w) = (w_0, w_1, w_2, w_3) (1, x, x², x³)^T
  Φ : x → Φ(x) = (1, x, x², x³)^T,   R → R⁴
Least squares ridge regression

Cost function, squared loss:
  E(w) = (1/2) Σ_{i=1}^N {f(x_i, w) − y_i}² + (λ/2) ||w||²
where y_i is the target value at x_i; the first term is the loss and the second the regularization.

Regression function for x (1D):
  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_M φ_M(x) = w^T Φ(x)

NB: the squared loss arises in Maximum Likelihood estimation for an error model
  y_i = ỹ_i + n_i,   n_i ~ N(0, σ²)
where y_i is the measured value and ỹ_i the true value.
Solving for the weights w

Notation: write the target and regressed values as N-vectors

  y = (y_1, y_2, ..., y_N)^T
  f = (Φ(x_1)^T w, Φ(x_2)^T w, ..., Φ(x_N)^T w)^T = Φ w

where

  Φ = [ 1  φ_1(x_1)  ...  φ_M(x_1)
        1  φ_1(x_2)  ...  φ_M(x_2)
        ...
        1  φ_1(x_N)  ...  φ_M(x_N) ]

Φ is an N × M design matrix. e.g. for polynomial regression with basis functions up to x²:

  Φ w = [ 1  x_1  x_1²
          1  x_2  x_2²
          ...
          1  x_N  x_N² ]  (w_0, w_1, w_2)^T
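A minimal sketch (assuming NumPy; the sample points and weights are hypothetical) of building the polynomial design matrix above:

```python
import numpy as np

# Hypothetical 1D sample points
x = np.array([0.1, 0.4, 0.7, 0.9])

# Design matrix for polynomial basis functions up to x^2:
# columns are 1, x, x^2 (np.vander with increasing=True orders powers this way)
Phi = np.vander(x, 3, increasing=True)
print(Phi)

# f = Phi @ w gives the regressed values for a weight vector w = (w0, w1, w2)
w = np.array([0.5, -1.0, 2.0])
print(Phi @ w)
```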
  E(w) = (1/2) Σ_{i=1}^N {f(x_i, w) − y_i}² + (λ/2) ||w||²
       = (1/2) Σ_{i=1}^N (y_i − w^T Φ(x_i))² + (λ/2) ||w||²
       = (1/2) ||y − Φw||² + (λ/2) ||w||²

Now, compute where the derivative w.r.t. w is zero for the minimum:
  dE(w)/dw = −Φ^T (y − Φw) + λw = 0
Hence
  (Φ^T Φ + λI) w = Φ^T y
  w = (Φ^T Φ + λI)^{-1} Φ^T y
M basis functions, N data points

  w = (Φ^T Φ + λI)^{-1} Φ^T y    (dimensions: M×1 = (M×M)(M×N)(N×1), assume N > M)

• This shows that there is a unique solution.
• If λ = 0 (no regularization), then w = (Φ^T Φ)^{-1} Φ^T y = Φ^+ y, where Φ^+ is the pseudo-inverse of Φ (pinv in Matlab)
• Adding the term λI improves the conditioning of the inverse: even if Φ is not full rank, (Φ^T Φ + λI) will be (for sufficiently large λ)
• As λ → ∞, w → (1/λ) Φ^T y → 0
• Often the regularization is applied only to the inhomogeneous part of w, i.e. to w̃, where w = (w_0, w̃)
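A minimal sketch of the closed-form solution, assuming NumPy; the noisy-sine data and the degree and λ values are illustrative only. Solving the linear system is preferred to forming the inverse explicitly.

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Closed-form ridge regression: w = (Phi^T Phi + lambda I)^{-1} Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Hypothetical data: noisy samples of a sine curve, polynomial basis up to degree 7
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 9)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(9)

Phi = np.vander(x, 8, increasing=True)   # columns 1, x, ..., x^7
w = fit_ridge(Phi, y, lam=1e-3)

# Evaluate the fitted curve at new points
x_new = np.linspace(0, 1, 5)
print(np.vander(x_new, 8, increasing=True) @ w)
```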
  w = (Φ^T Φ + λI)^{-1} Φ^T y

  f(x, w) = w^T Φ(x) = Φ(x)^T w = Φ(x)^T (Φ^T Φ + λI)^{-1} Φ^T y = b(x)^T y

The output is a linear blend, b(x), of the training values {y_i}.
Example 1: polynomial basis functions

The red curve is the true function (which is not a polynomial). The data points are samples from the curve with added noise in y.
[Figure: sample points and ideal fit]

There is a choice in both the degree, M, of the basis functions used, and in the strength of the regularization.

  f(x, w) = Σ_{j=0}^M w_j x^j = w^T Φ(x)
  Φ : x → Φ(x),   R → R^{M+1}
w is an (M+1)-dimensional vector.
N = 9 samples, M = 7
[Figure: fits of the M = 7 polynomial for four values of λ, from large λ down to λ = 1e-15, each plotted with the sample points and the ideal fit]
M = 3 and M = 5
[Figure: least-squares solutions (no regularization) for M = 3 and M = 5, each plotted with the sample points and the ideal fit, together with the corresponding polynomial basis functions]
Example 2: Gaussian basis functions

The red curve is the true function (which is not a polynomial). The data points are samples from the curve with added noise in y.
[Figure: sample points and ideal fit]

The basis functions are centred on the training data (N points). There is a choice in both the scale, σ, of the basis functions used, and in the strength of the regularization.

  f(x, w) = Σ_{i=1}^N w_i exp(−(x − x_i)²/σ²) = w^T Φ(x)
  Φ : x → Φ(x),   R → R^N
w is an N-vector.
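A minimal sketch of ridge regression with Gaussian basis functions centred on the training points, assuming NumPy; the noisy-sine data, σ and λ are illustrative only.

```python
import numpy as np

def gaussian_design_matrix(x_query, x_centres, sigma):
    """Phi[i, j] = exp(-(x_query[i] - x_centres[j])^2 / sigma^2),
    with one basis function centred on each training point."""
    return np.exp(-((x_query[:, None] - x_centres[None, :]) ** 2) / sigma**2)

# Hypothetical noisy samples of a sine curve
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 9)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(9)

sigma, lam = 0.334, 1e-3
Phi = gaussian_design_matrix(x, x, sigma)                       # N x N
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(x)), Phi.T @ y)

# Evaluate the fit on a denser grid
x_grid = np.linspace(0, 1, 11)
print(gaussian_design_matrix(x_grid, x, sigma) @ w)
```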
N = 9 samples, σ = 0.334
[Figure: fits for four values of λ, from large λ down to λ = 1e-15, each plotted with the sample points and the ideal fit]
Choosing lambda using a validation set

[Figure, left: training and validation error norms plotted against log λ, with the minimum validation error marked. Right: the fit selected by the validation set, with the sample points and the ideal fit]
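A minimal sketch of this selection procedure, assuming NumPy; the train/validation split, the Gaussian basis scale σ, and the λ grid are all illustrative.

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Hypothetical train/validation split of noisy sine samples
rng = np.random.default_rng(0)
x_tr = np.linspace(0, 1, 9)
y_tr = np.sin(2 * np.pi * x_tr) + 0.1 * rng.standard_normal(9)
x_va = rng.uniform(0, 1, 20)
y_va = np.sin(2 * np.pi * x_va) + 0.1 * rng.standard_normal(20)

# Gaussian basis functions centred on the training points
sigma = 0.334
phi = lambda xq: np.exp(-((xq[:, None] - x_tr[None, :]) ** 2) / sigma**2)

# Sweep lambda on a log scale and keep the value with the lowest validation error
lambdas = 10.0 ** np.arange(-10, 3)
errors = []
for lam in lambdas:
    w = fit_ridge(phi(x_tr), y_tr, lam)
    errors.append(np.linalg.norm(phi(x_va) @ w - y_va))
print("selected lambda:", lambdas[int(np.argmin(errors))])
```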
σ = 0.334 and σ = 0.1
[Figure: validation-set fits for the two values of σ, each plotted with the sample points and the ideal fit, together with the corresponding Gaussian basis functions]
Application: regressing face pose

Estimate two face pose angles:
• yaw (around the Y axis)
• pitch (around the X axis)

• Compute a HOG feature vector for each face region
• Learn a regressor from the HOG vector to the two pose angles
Summary and dual problem

So far we have considered the primal problem, where
  f(x, w) = Σ_{i=1}^M w_i φ_i(x) = w^T Φ(x)
and we wanted a solution for w ∈ R^M.

As in the case of SVMs, we can also consider the dual problem, where
  w = Σ_{i=1}^N a_i Φ(x_i)   and   f(x, a) = Σ_i a_i Φ(x_i)^T Φ(x)
and obtain a solution for a ∈ R^N.

Again there is a closed-form solution for a. The solution involves the N × N Gram matrix k(x_i, x_j) = Φ(x_i)^T Φ(x_j), so we can use the kernel trick again to replace the scalar products.
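The closed form is not derived in these slides; the sketch below (assuming NumPy, with made-up data and parameters) uses the standard kernel ridge regression solution a = (K + λI)^{-1} y with a Gaussian kernel, so that predictions need only kernel evaluations against the training points.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Hypothetical 1D training data (as column vectors)
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 9)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(9)

sigma, lam = 0.2, 1e-3
K = gaussian_kernel(X, X, sigma)                     # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(X)), y)     # dual weights

# Prediction: f(x) = sum_i a_i k(x_i, x), using only kernel evaluations
X_new = np.linspace(0, 1, 5)[:, None]
print(gaussian_kernel(X_new, X, sigma) @ a)
```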
Background reading and more

• Bishop, chapters 6 & 7 for kernels and SVMs
• Hastie et al., chapter 12
• Bishop, chapter 3 for regression
• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml