Lecture 3: SVM dual, kernels and regression

Lecture 3: SVM dual, kernels and regression
C19 Machine Learning, Hilary 2015, A. Zisserman

- Primal and dual forms
- Linear separability revisited
- Feature maps
- Kernels for SVMs
- Regression
- Ridge regression
- Basis functions

SVM review
We have seen that for an SVM, learning a linear classifier

  f(x) = w^T x + b

is formulated as solving an optimization problem over w:

  min_{w ∈ R^d}  ||w||^2 + C Σ_i max(0, 1 − y_i f(x_i))

This quadratic optimization problem is known as the primal problem. Instead, the SVM can be formulated to learn a linear classifier

  f(x) = Σ_i α_i y_i (x_i^T x) + b

by solving an optimization problem over the α_i. This is known as the dual problem, and we will look at the advantages of this formulation.
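
As an aside (not part of the lecture), the primal/dual relationship can be checked numerically: a minimal sketch using scikit-learn on a toy dataset, recovering w from the dual coefficients α_i y_i that SVC exposes.

```python
# Minimal sketch (assumed toy setup): the primal weight vector w of a linear
# SVM equals the dual expansion sum_i alpha_i y_i x_i over the support vectors.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
w_dual = clf.dual_coef_ @ clf.support_vectors_   # shape (1, d)
print(np.allclose(w_dual, clf.coef_))            # True: same classifier
print("support vectors:", len(clf.support_vectors_), "out of", len(X))
```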

Sketch derivation of dual form
The Representer Theorem states that the solution w can always be written as a linear combination of the training data:

  w = Σ_{j=1}^{N} α_j y_j x_j

Proof: see example sheet.

Now, substitute for w in f(x) = w^T x + b:

  f(x) = (Σ_j α_j y_j x_j)^T x + b = Σ_j α_j y_j (x_j^T x) + b

and for w in the cost function min_w ||w||^2 subject to y_i (w^T x_i + b) ≥ 1, ∀i:

  ||w||^2 = (Σ_j α_j y_j x_j)^T (Σ_k α_k y_k x_k) = Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)

Hence, an equivalent optimization problem is over the α_j:

  min_{α_j}  Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)   subject to   y_i ( Σ_j α_j y_j (x_j^T x_i) + b ) ≥ 1, ∀i

and a few more steps are required to complete the derivation.
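
The remaining steps are the standard Lagrangian argument. A hedged sketch (not spelled out on the slides, written with the conventional 1/2 on ||w||^2 so that the result matches the dual stated on the next slide):

```latex
% Sketch of the remaining steps (standard Lagrangian argument, not from the slides).
% Introduce multipliers \alpha_i \ge 0 for the constraints y_i(w^\top x_i + b) \ge 1:
\begin{align*}
L(w, b, \alpha) &= \tfrac{1}{2}\|w\|^2
  - \sum_i \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right], \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0.
\end{align*}
% Substituting w back into L eliminates w and b and gives the dual problem:
\begin{equation*}
\max_{\alpha_i \ge 0}\; \sum_i \alpha_i
  - \tfrac{1}{2} \sum_{jk} \alpha_j \alpha_k\, y_j y_k\, (x_j^\top x_k)
\quad \text{subject to} \quad \sum_i \alpha_i y_i = 0.
\end{equation*}
```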

Primal and dual formulations
N is the number of training points, and d is the dimension of the feature vector x.

Primal problem: for w ∈ R^d,

  min_{w ∈ R^d}  ||w||^2 + C Σ_i max(0, 1 − y_i f(x_i))

Dual problem: for α ∈ R^N (stated without proof),

  max_{α_i ≥ 0}  Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)
  subject to  0 ≤ α_i ≤ C for ∀i, and Σ_i α_i y_i = 0

Need to learn d parameters for the primal, and N for the dual. If N << d then it is more efficient to solve for α than for w. The dual form only involves (x_j^T x_k); we will return to why this is an advantage when we look at kernels.

Primal and dual formulations
Primal version of classifier:  f(x) = w^T x + b
Dual version of classifier:  f(x) = Σ_i α_i y_i (x_i^T x) + b

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points x_i. However, many of the α_i are zero. The ones that are non-zero define the support vectors x_i.

Support Vector Machine
[Figure: linear SVM decision boundary w^T x + b = 0 with its margin; the normal vector w is shown, and the points lying on the margin are labelled as support vectors.]

  f(x) = Σ_i α_i y_i (x_i^T x) + b, with the sum running over the support vectors.

[Figure: soft margin solution for C = 1.]

Handling data that is not linearly separable
Introduce slack variables ξ_i ≥ 0:

  min_{w ∈ R^d, ξ_i ∈ R^+}  ||w||^2 + C Σ_i ξ_i
  subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i  for i = 1 ... N

But what if a linear classifier is simply not appropriate for the data at all?

Solution 1: use polar coordinates

  Φ : (x_1, x_2) ↦ (r, θ),   R^2 → R^2

The data is linearly separable in polar coordinates. The map acts non-linearly in the original space.
[Figure: radially distributed classes in (x_1, x_2) become linearly separable in (r, θ).]

Solution 2: map data to higher dimension

  Φ : (x_1, x_2) ↦ (x_1^2, x_2^2, √2 x_1 x_2),   R^2 → R^3

The data is linearly separable in 3D. This means that the problem can still be solved by a linear classifier.

SVM classifiers in a transformed feature space

  Φ : x ↦ Φ(x),   R^d → R^D

Φ(x) is a feature map. Learn a classifier linear in w for R^D:

  f(x) = w^T Φ(x) + b

[Figure: the map Φ sends the data from R^d to R^D, where the decision boundary f(x) = 0 is a hyperplane.]

Primal classifier in transformed feature space
Classifier, with w ∈ R^D:

  f(x) = w^T Φ(x) + b

Learning, for w ∈ R^D:

  min_{w ∈ R^D}  ||w||^2 + C Σ_i max(0, 1 − y_i f(x_i))

Simply map x to Φ(x) where the data is separable, and solve for w in the high dimensional space R^D. If D >> d then there are many more parameters to learn for w. Can this be avoided?

Dual classifier in transformed feature space
Classifier:

  f(x) = Σ_i α_i y_i (x_i^T x) + b   becomes   f(x) = Σ_i α_i y_i (Φ(x_i)^T Φ(x)) + b

Learning:

  max_{α_i ≥ 0}  Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k (x_j^T x_k)   becomes
  max_{α_i ≥ 0}  Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k (Φ(x_j)^T Φ(x_k))

  subject to  0 ≤ α_i ≤ C for ∀i, and Σ_i α_i y_i = 0

Dual classifier in transformed feature space
Note that Φ(x) only occurs in pairs Φ(x_j)^T Φ(x_i). Once these scalar products are computed, only the N-dimensional vector α needs to be learnt; it is not necessary to learn in the D-dimensional space, as it is for the primal.

Write k(x_j, x_i) = Φ(x_j)^T Φ(x_i). This is known as a kernel.

Classifier:

  f(x) = Σ_i α_i y_i k(x_i, x) + b

Learning:

  max_{α_i ≥ 0}  Σ_i α_i − (1/2) Σ_{jk} α_j α_k y_j y_k k(x_j, x_k)
  subject to  0 ≤ α_i ≤ C for ∀i, and Σ_i α_i y_i = 0

Special transformations

  Φ : (x_1, x_2) ↦ (x_1^2, x_2^2, √2 x_1 x_2),   R^2 → R^3

  Φ(x)^T Φ(z) = (x_1^2, x_2^2, √2 x_1 x_2) (z_1^2, z_2^2, √2 z_1 z_2)^T
              = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2
              = (x_1 z_1 + x_2 z_2)^2
              = (x^T z)^2

Kernel Trick
The classifier can be learnt and applied without explicitly computing Φ(x). All that is required is the kernel k(x, z) = (x^T z)^2. The complexity of learning depends on N (typically it is O(N^3)), not on D.
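
A quick numerical check of this identity (illustration only, not from the slides):

```python
# Check that phi(x) = (x1^2, x2^2, sqrt(2) x1 x2) satisfies
# phi(x) . phi(z) = (x . z)^2, i.e. the kernel trick identity above.
import numpy as np

def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)        # explicit feature map, R^2 -> R^3
rhs = (x @ z) ** 2           # kernel evaluated directly in R^2
print(np.isclose(lhs, rhs))  # True
```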

Example kernels
- Linear kernels:  k(x, x') = x^T x'
- Polynomial kernels:  k(x, x') = (1 + x^T x')^d  for any d > 0. Contains all polynomial terms up to degree d.
- Gaussian kernels:  k(x, x') = exp(−||x − x'||^2 / 2σ^2)  for σ > 0. Infinite dimensional feature space.
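
A sketch of these three kernels as Gram-matrix functions in NumPy (the parameter names degree and sigma mirror d and σ on the slide; the vectorised pairwise-distance computation is one common implementation choice):

```python
# Example kernels as Gram-matrix functions.
# X is N x d, Z is M x d; each function returns the N x M matrix k(x_i, z_j).
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, degree=3):
    return (1.0 + X @ Z.T) ** degree

def gaussian_kernel(X, Z, sigma=1.0):
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z, computed for all pairs at once
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Z**2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))
```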

SVM classifier with Gaussian kernel

  f(x) = Σ_{i=1}^{N} α_i y_i k(x_i, x) + b

where N is the size of the training data, α_i is a weight (which may be zero), and the x_i with non-zero weight are the support vectors.

Gaussian kernel:  k(x, x') = exp(−||x − x'||^2 / 2σ^2)

This gives the Radial Basis Function (RBF) SVM:

  f(x) = Σ_i α_i y_i exp(−||x − x_i||^2 / 2σ^2) + b
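
A minimal sketch of an RBF-kernel SVM in scikit-learn (toy data, assumed parameter values), verifying that its decision function matches the expansion above; note that scikit-learn parameterises the Gaussian kernel as gamma = 1/(2σ^2):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
sigma = 1.0
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma**2)).fit(X, y)

# Evaluate f(x) = sum_i alpha_i y_i exp(-||x - x_i||^2 / 2 sigma^2) + b by hand
x_test = np.array([[0.2, -0.3]])
sq_dists = np.sum((clf.support_vectors_ - x_test) ** 2, axis=1)
k = np.exp(-sq_dists / (2 * sigma**2))

f_manual = clf.dual_coef_ @ k + clf.intercept_   # dual_coef_ = alpha_i * y_i
print(np.allclose(f_manual, clf.decision_function(x_test)))  # True
```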

RBF Kernel SVM Example
[Figure: training points plotted against feature x and feature y; the data is not linearly separable in the original feature space.]

σ = 1.0, C =
[Figure: RBF SVM decision function f(x) = Σ_i α_i y_i exp(−||x − x_i||^2 / 2σ^2) + b, with the contours f(x) = 1, f(x) = 0 and f(x) = −1 marked.]

σ = 1.0, C = 1
Decreasing C gives a wider (soft) margin.
[Figure: the corresponding decision function and margins.]

σ = 1.0, C = 1
[Figure: decision function f(x) = Σ_i α_i y_i exp(−||x − x_i||^2 / 2σ^2) + b and margins.]

σ = 1.0, C =
[Figure: decision function and margins.]

σ = 0.25, C =
Decreasing σ moves the classifier towards a nearest neighbour classifier.
[Figure: the corresponding decision function and margins.]

σ = 0.1, C =
[Figure: decision function f(x) = Σ_i α_i y_i exp(−||x − x_i||^2 / 2σ^2) + b and margins.]

Kernel Trick: Summary
- Classifiers can be learnt for high dimensional feature spaces, without actually having to map the points into the high dimensional space.
- Data may be linearly separable in the high dimensional space, but not linearly separable in the original feature space.
- Kernels can be used for an SVM because of the scalar product in the dual form, but can also be used elsewhere: they are not tied to the SVM formalism.
- Kernels also apply to objects that are not vectors, e.g. k(h, h') = Σ_k min(h_k, h'_k) for histograms with bins h_k, h'_k (see the sketch below).
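
The histogram intersection kernel from the last bullet, sketched as a custom kernel for scikit-learn's SVC (toy random histograms, purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def histogram_intersection(H1, H2):
    # H1 is N x B, H2 is M x B (rows are histograms with B bins);
    # returns the N x M Gram matrix K[i, j] = sum_k min(H1[i, k], H2[j, k]).
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)

rng = np.random.default_rng(0)
H = rng.random((40, 8))
H /= H.sum(axis=1, keepdims=True)    # normalise each histogram
y = rng.integers(0, 2, size=40)      # arbitrary binary labels

clf = SVC(kernel=histogram_intersection, C=1.0).fit(H, y)
print(clf.predict(H[:5]))
```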

Regression
Suppose we are given a training set of N observations ((x_1, y_1), ..., (x_N, y_N)) with x_i ∈ R^d, y_i ∈ R.
The regression problem is to estimate f(x) from this data such that y_i = f(x_i).
[Figure: scatter plot of the training observations (x, y).]

Learning by optimization
As in the case of classification, learning a regressor can be formulated as an optimization: minimize with respect to f ∈ F

  Σ_{i=1}^{N} l(f(x_i), y_i) + λ R(f)

where the first term is the loss function and the second is the regularization. There is a choice of both loss function and regularization, e.g.
- squared loss, SVM hinge-like loss
- squared regularizer, lasso regularizer

Choice of regression function: non-linear basis functions
The function for regression, f(x, w), is a non-linear function of x, but linear in w:

  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_M φ_M(x) = w^T Φ(x)

For example, for x ∈ R, polynomial regression with φ_j(x) = x^j:

  f(x, w) = Σ_{j=0}^{M} w_j x^j = w^T Φ(x)

e.g. for M = 3,

  f(x, w) = (w_0, w_1, w_2, w_3) (1, x, x^2, x^3)^T,   with   Φ : x ↦ (1, x, x^2, x^3)^T,   R^1 → R^4

Least squares ridge regression
Cost function, squared loss plus regularization:

  E(w) = (1/2) Σ_{i=1}^{N} ( f(x_i, w) − y_i )^2 + (λ/2) ||w||^2

where y_i is the target value at x_i; the first term is the loss function, the second the regularization.

Regression function for x (1D):

  f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_M φ_M(x) = w^T Φ(x)

NB: squared loss arises in Maximum Likelihood estimation for the error model

  y_i = ỹ_i + n_i,   n_i ~ N(0, σ^2)

where y_i is the measured value and ỹ_i the true value.

Solving for the weights w
Notation: write the target and regressed values as N-vectors

  y = (y_1, y_2, ..., y_N)^T

  f = (Φ(x_1)^T w, Φ(x_2)^T w, ..., Φ(x_N)^T w)^T = Φw,   with

  Φ = [ 1  φ_1(x_1)  ...  φ_M(x_1)
        1  φ_1(x_2)  ...  φ_M(x_2)
        ...
        1  φ_1(x_N)  ...  φ_M(x_N) ]   and   w = (w_0, w_1, ..., w_M)^T

Φ is an N × M design matrix. e.g. for polynomial regression with basis functions up to x^2:

  Φw = [ 1  x_1  x_1^2
         1  x_2  x_2^2
         ...
         1  x_N  x_N^2 ] (w_0, w_1, w_2)^T
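
For the polynomial case the design matrix is straightforward to build; a small sketch (np.vander is one convenient way to get the columns 1, x, ..., x^M):

```python
# Build the N x (M+1) polynomial design matrix Phi with columns 1, x, ..., x^M.
import numpy as np

def poly_design_matrix(x, M):
    # x is a length-N vector; column j holds x**j
    return np.vander(x, N=M + 1, increasing=True)

x = np.array([0.1, 0.2, 0.3, 0.4])
print(poly_design_matrix(x, M=2))   # rows are [1, x_i, x_i^2]
```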

  E(w) = (1/2) Σ_{i=1}^{N} { f(x_i, w) − y_i }^2 + (λ/2) ||w||^2
       = (1/2) Σ_{i=1}^{N} ( y_i − w^T Φ(x_i) )^2 + (λ/2) ||w||^2
       = (1/2) ||y − Φw||^2 + (λ/2) ||w||^2

Now compute where the derivative w.r.t. w is zero for the minimum:

  dE(w)/dw = −Φ^T (y − Φw) + λw = 0

Hence

  (Φ^T Φ + λI) w = Φ^T y
  w = (Φ^T Φ + λI)^{-1} Φ^T y

M basis functions, N data points

  w = (Φ^T Φ + λI)^{-1} Φ^T y     (assume N > M; dimensions: M×1 = [M×M][M×N][N×1])

This shows that there is a unique solution.

If λ = 0 (no regularization), then w = (Φ^T Φ)^{-1} Φ^T y = Φ^+ y, where Φ^+ is the pseudo-inverse of Φ (pinv in Matlab).

Adding the term λI improves the conditioning of the inverse: even if Φ^T Φ is not full rank, (Φ^T Φ + λI) will be (for sufficiently large λ).

As λ → ∞, w → (1/λ) Φ^T y → 0.

Often the regularization is applied only to the inhomogeneous part of w, i.e. to w̃, where w = (w_0, w̃).
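
A sketch of this closed-form solution in NumPy (toy data; a linear solve is used rather than an explicit matrix inverse, which is the usual numerical practice):

```python
import numpy as np

def ridge_weights(Phi, y, lam):
    # w = (Phi^T Phi + lambda I)^(-1) Phi^T y, via a linear solve
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 9)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

Phi = np.vander(x, N=8, increasing=True)   # polynomial basis, M = 7
w = ridge_weights(Phi, y, lam=1e-3)
w_unreg = np.linalg.pinv(Phi) @ y          # lambda = 0: pseudo-inverse solution
print(w)
print(w_unreg)
```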

  w = (Φ^T Φ + λI)^{-1} Φ^T y

  f(x, w) = w^T Φ(x) = Φ(x)^T w = Φ(x)^T (Φ^T Φ + λI)^{-1} Φ^T y = b(x)^T y

The output is a linear blend, b(x), of the training values {y_i}.

Example 1: polynomial basis functions
The red curve is the true function (which is not a polynomial). The data points are samples from the curve with added noise in y.
There is a choice of both the degree, M, of the basis functions used, and the strength of the regularization.

  f(x, w) = Σ_{j=0}^{M} w_j x^j = w^T Φ(x),   Φ : x ↦ Φ(x),   R → R^{M+1}

w is an (M+1)-dimensional vector.
[Figure: sample points and the ideal fit (true function).]

N = 9 samples, M = 7
[Figure: four ridge regression fits to the sample points, shown with the ideal fit, for regularization values lambda = 1, lambda = .1, lambda = 1e-1 and lambda = 1e-15.]

M = 3 and M = 5
[Figure: least-squares fits (no regularization) with polynomial basis functions of degree M = 3 and M = 5, shown with the sample points and ideal fit; below each fit, the corresponding polynomial basis functions.]

Example 2: Gaussian basis functions
The red curve is the true function (which is not a polynomial). The data points are samples from the curve with added noise in y.
The basis functions are centred on the training data (N points). There is a choice of both the scale, sigma, of the basis functions used, and the strength of the regularization.

  f(x, w) = Σ_{i=1}^{N} w_i exp(−(x − x_i)^2 / σ^2) = w^T Φ(x),   Φ : x ↦ Φ(x),   R → R^N

w is an N-vector.
[Figure: sample points and the ideal fit.]
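
The corresponding design matrix for Gaussian basis functions centred on the training inputs, as a sketch (the σ value and toy data are assumptions):

```python
import numpy as np

def gaussian_design_matrix(x, centres, sigma):
    # entry (i, j) is exp(-(x_i - c_j)^2 / sigma^2), as on the slide
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / sigma**2)

rng = np.random.default_rng(0)
x = np.sort(rng.random(9))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

sigma, lam = 0.334, 1e-3
Phi = gaussian_design_matrix(x, centres=x, sigma=sigma)        # N x N
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(x)), Phi.T @ y)

x_test = np.linspace(0, 1, 5)
f_test = gaussian_design_matrix(x_test, centres=x, sigma=sigma) @ w
print(f_test)
```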

N = 9 samples, sigma = .334
[Figure: four ridge regression fits with Gaussian basis functions, shown with the sample points and ideal fit, for regularization values lambda = 1, lambda = .1, lambda = 1e-1 and lambda = 1e-15.]

Choosing lambda using a validation set
[Figure: left, training and validation error (error norm) against log lambda, with the minimum validation error marked; right, the resulting validation-set fit shown with the sample points and ideal fit.]
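
A sketch of this model selection loop (toy training and validation sets, assumed value grid): fit the weights for each candidate lambda on the training set and keep the lambda with the smallest validation error norm.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = np.sort(rng.random(n))
    return x, np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

def design(x, centres, sigma=0.334):
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / sigma**2)

x_train, y_train = make_data(9)
x_val, y_val = make_data(50)
Phi_train, Phi_val = design(x_train, x_train), design(x_val, x_train)

best = None
for lam in np.logspace(-10, 2, 25):
    w = np.linalg.solve(Phi_train.T @ Phi_train + lam * np.eye(len(x_train)),
                        Phi_train.T @ y_train)
    err = np.linalg.norm(Phi_val @ w - y_val)        # validation error norm
    if best is None or err < best[0]:
        best = (err, lam)

print("chosen lambda:", best[1])
```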

Sigma = .334 and Sigma = .1
[Figure: validation-set fits for Gaussian basis functions with sigma = .334 and sigma = .1, shown with the sample points and ideal fit; below each, the corresponding Gaussian basis functions.]

Application: regressing face pose
- Estimate two face pose angles: yaw (around the Y axis) and pitch (around the X axis).
- Compute a HOG feature vector for each face region.
- Learn a regressor from the HOG vector to the two pose angles.

Summary and dual problem
So far we have considered the primal problem, where

  f(x, w) = Σ_{i=1}^{M} w_i φ_i(x) = w^T Φ(x)

and we wanted a solution for w ∈ R^M.

As in the case of SVMs, we can also consider the dual problem, where

  w = Σ_{i=1}^{N} a_i Φ(x_i)   and   f(x, a) = Σ_i a_i Φ(x_i)^T Φ(x)

and obtain a solution for a ∈ R^N.

Again there is a closed form solution for a; the solution involves the N × N Gram matrix k(x_i, x_j) = Φ(x_i)^T Φ(x_j), so we can use the kernel trick again to replace scalar products.
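
For completeness, a sketch of that dual (kernel ridge regression) solution; the closed form a = (K + λI)^{-1} y is standard but is stated here without derivation, and the kernel and data are toy assumptions:

```python
import numpy as np

def gaussian_k(X1, X2, sigma=0.3):
    # Gram matrix K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2)) for 1-D inputs
    return np.exp(-((X1[:, None] - X2[None, :]) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
x = np.sort(rng.random(30))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

lam = 1e-2
K = gaussian_k(x, x)                                  # N x N Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(x)), y)      # a = (K + lam I)^(-1) y

x_test = np.linspace(0, 1, 5)
f_test = gaussian_k(x_test, x) @ a                    # f(x) = sum_i a_i k(x_i, x)
print(f_test)
```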

Background reading and more
- Bishop, chapters 6 & 7 for kernels and SVMs
- Hastie et al., chapter 12
- Bishop, chapter 3 for regression
- More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml