Kernel Methods and Support Vector Machines

1 Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6

2 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete target variable. Like nearest neighbor, a kernel method: classification is based on weighted similar instances. The kernel defines similarity measure. Sparsity: Tries to find a few important instances, the support vectors. Intuition: Netflix recommendation system.

3 SVMs: Pros and Cons Pros: Very good classification performance, basically unbeatable. Fast and scalable learning. Pretty fast inference. Cons: No model is built, therefore black-box. Not so applicable for discrete inputs. Still need to specify kernel function (like specifying basis functions). Issues with multiple classes; can use a probabilistic version (Relevance Vector Machine).

4 Two Views of SVMs Theoretical View: linear separator SVM looks for linear separator but in new feature space. Uses a new criterion to choose a line separating classes: max-margin. User View: kernel-based classification User specifies a kernel function. SVM learns weights for instances. Classification is performed by taking average of the labels of other instances, weighted by a) similarity b) instance weight. Nice demo on web

8 Example: X-OR X-OR problem: class of $(x_1, x_2)$ is positive iff $x_1 x_2 > 0$. Use 6 basis functions $\phi(x_1, x_2) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. Simple classifier $y(x_1, x_2) = \phi_5(x_1, x_2) = \sqrt{2} x_1 x_2$. Linear in basis function space. Dot product $\phi(x)^T \phi(z) = (1 + x^T z)^2 = k(x, z)$. A quadratic kernel. Let's check the SVM demo.
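
The X-OR slide's identity can be checked numerically. The sketch below assumes the basis functions carry $\sqrt{2}$ coefficients, $\phi(x_1, x_2) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, \sqrt{2} x_1 x_2, x_2^2)$, since those are the coefficients for which $\phi(x)^T \phi(z) = (1 + x^T z)^2$ holds. The function names and test points are illustrative only.

```python
import math

def phi(x):
    """Six quadratic basis functions for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def quadratic_kernel(x, z):
    """k(x, z) = (1 + x^T z)^2, computed without building phi."""
    return (1.0 + dot(x, z)) ** 2

def xor_classifier(x):
    """y(x) = phi_5(x) = sqrt(2) * x1 * x2, positive iff x1 * x2 > 0."""
    return phi(x)[4]

# Four X-OR style points: same-sign coordinates are the positive class.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1 if xor_classifier(p) > 0 else -1 for p in points]

# The explicit feature-space dot product should match the kernel value.
x, z = (0.5, -2.0), (3.0, 1.5)
explicit = dot(phi(x), phi(z))
implicit = quadratic_kernel(x, z)
print(labels, explicit, implicit)
```

The kernel evaluation touches only the two input coordinates, while the explicit route builds six features first; that gap is what grows with higher degrees.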

10 Valid Kernels Valid kernels: if $k(\cdot, \cdot)$ satisfies: Symmetric: $k(x_i, x_j) = k(x_j, x_i)$. Positive definite: for any $x_1, \ldots, x_N$, the Gram matrix $K$ must be positive semi-definite, where $K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{pmatrix}$. Positive semi-definite means $x^T K x \geq 0$ for all $x \in \mathbb{R}^N$. Then $k(\cdot, \cdot)$ corresponds to a dot product in some space $\phi$. A.k.a. Mercer kernel, admissible kernel, reproducing kernel.
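
A rough way to probe the two validity conditions on a finite sample is sketched below. It builds a Gram matrix, checks symmetry, and evaluates the quadratic form $v^T K v$ on random vectors. This is only a spot check, not a proof of positive semi-definiteness: passing it does not establish validity, but a single negative value rules a candidate out. The sample points, helper names, and the counterexample `neg_sq_distance` are assumptions made for illustration.

```python
import math
import random

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)), a known valid kernel."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def neg_sq_distance(x, z):
    """Symmetric but NOT a valid kernel: -||x - z||^2."""
    return -sum((xi - zi) ** 2 for xi, zi in zip(x, z))

def gram_matrix(kernel, xs):
    """K[i][j] = kernel(xs[i], xs[j])."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]

def is_symmetric(K, tol=1e-12):
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol
               for i in range(n) for j in range(n))

def quadratic_form(K, v):
    """v^T K v."""
    n = len(K)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

def passes_psd_spot_check(K, trials=200, tol=1e-9, seed=0):
    """True if no sampled vector gives a negative quadratic form."""
    rng = random.Random(seed)
    n = len(K)
    for _ in range(trials):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if quadratic_form(K, v) < -tol:
            return False
    return True

xs = [(0.0, 0.0), (1.0, 0.5), (-1.0, 2.0), (0.3, -0.7)]
K_good = gram_matrix(gaussian_kernel, xs)
K_bad = gram_matrix(neg_sq_distance, xs)

print(is_symmetric(K_good), passes_psd_spot_check(K_good))
print(is_symmetric(K_bad), quadratic_form(K_bad, [1.0] * len(xs)))
```

The counterexample is symmetric, yet the all-ones vector already gives a negative quadratic form, so symmetry alone is not enough.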

11 Examples of Kernels Some kernels: Linear kernel $k(x_1, x_2) = x_1^T x_2$. Polynomial kernel $k(x_1, x_2) = (1 + x_1^T x_2)^d$: contains all polynomial terms up to degree $d$. Gaussian kernel $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / 2\sigma^2)$: infinite-dimensional feature space.
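
The three kernels on this slide could be written as plain functions, as in the sketch below. It assumes the Gaussian kernel has a negative exponent, $\exp(-\|x_1 - x_2\|^2 / 2\sigma^2)$; the parameter defaults and sample inputs are illustrative choices, not values from the slides.

```python
import math

def linear_kernel(x, z):
    """k(x, z) = x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))

def polynomial_kernel(x, z, d=2):
    """k(x, z) = (1 + x^T z)^d."""
    return (1.0 + linear_kernel(x, z)) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

x, z = (1.0, 2.0), (3.0, 4.0)
lin = linear_kernel(x, z)          # 1*3 + 2*4
poly = polynomial_kernel(x, z, 2)  # (1 + 11)^2

# sigma sets how quickly similarity decays with distance.
wide = gaussian_kernel((0.0, 0.0), (1.0, 1.0), sigma=1.0)
narrow = gaussian_kernel((0.0, 0.0), (1.0, 1.0), sigma=0.5)
same = gaussian_kernel(x, x)       # a point compared with itself
print(lin, poly, wide, narrow, same)
```

A smaller $\sigma$ makes the Gaussian kernel more local: the same pair of points is judged less similar.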

12 Constructing Kernels Can build new valid kernels from existing valid ones: $k(x_1, x_2) = c\,k_1(x_1, x_2)$ with $c > 0$; $k(x_1, x_2) = k_1(x_1, x_2) + k_2(x_1, x_2)$; $k(x_1, x_2) = k_1(x_1, x_2)\,k_2(x_1, x_2)$; $k(x_1, x_2) = \exp(k_1(x_1, x_2))$. Table on p. 296 gives many such rules.
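
The four construction rules map naturally onto higher-order functions. The sketch below is one possible encoding: the combinator names (`scale`, `add`, `multiply`, `exp_of`) are made up for illustration, and the closing quadratic-form loop is only a random spot check on a small sample, not a proof that the composite is valid.

```python
import math
import random

def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def scale(c, k1):
    """c * k1 with c > 0."""
    if c <= 0:
        raise ValueError("c must be positive")
    return lambda x, z: c * k1(x, z)

def add(k1, k2):
    """k1 + k2."""
    return lambda x, z: k1(x, z) + k2(x, z)

def multiply(k1, k2):
    """k1 * k2."""
    return lambda x, z: k1(x, z) * k2(x, z)

def exp_of(k1):
    """exp(k1)."""
    return lambda x, z: math.exp(k1(x, z))

# 2 * (x^T z) + (x^T z)^2, built only from the rules above.
combined = add(scale(2.0, linear_kernel),
               multiply(linear_kernel, linear_kernel))
exp_linear = exp_of(linear_kernel)

value = combined((1.0, 2.0), (3.0, 4.0))  # 2 * 11 + 11 * 11

# Spot check: v^T K v should never be negative for the composite.
xs = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.5)]
K = [[combined(a, b) for b in xs] for a in xs]
rng = random.Random(0)
min_form = min(
    sum(v[i] * K[i][j] * v[j] for i in range(3) for j in range(3))
    for v in ([rng.uniform(-1, 1) for _ in range(3)] for _ in range(200))
)
print(value, exp_linear((0.0, 0.0), (1.0, 1.0)), min_form)
```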

13 More Kernels Stationary kernels are only a function of the difference between arguments: $k(x_1, x_2) = k(x_1 - x_2)$. Translation invariant in input space: $k(x_1, x_2) = k(x_1 + c, x_2 + c)$. Homogeneous kernels, a.k.a. radial basis functions, are only a function of the magnitude of the difference: $k(x_1, x_2) = k(\|x_1 - x_2\|)$. Set subsets: $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$, where $|A|$ denotes the number of elements in $A$. Domain-specific: think hard about your problem, figure out what it means to be similar, define as $k(\cdot, \cdot)$, prove positive definite.
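
Two of these properties are easy to illustrate. The sketch below reads the set kernel as $k(A_1, A_2) = 2^{|A_1 \cap A_2|}$ and compares it with an explicit feature map that has one indicator feature per subset $U$ of a small universe (1 if $U \subseteq A$); the dot product of two such vectors counts the subsets common to both sets. It also checks translation invariance for a Gaussian kernel. The universe, sets, and shift vector are arbitrary illustrative values.

```python
import math
from itertools import combinations

def set_kernel(a, b):
    """k(A1, A2) = 2^{|A1 intersect A2|}."""
    return 2 ** len(set(a) & set(b))

def subsets(universe):
    """Yield every subset of the universe as a frozenset."""
    items = sorted(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def subset_features(a, universe):
    """One indicator per subset U of the universe: 1 if U is inside a."""
    a = set(a)
    return [1 if u <= a else 0 for u in subsets(universe)]

universe = {1, 2, 3, 4}
A1, A2 = {1, 2, 3}, {2, 3, 4}
explicit = sum(p * q for p, q in zip(subset_features(A1, universe),
                                     subset_features(A2, universe)))
implicit = set_kernel(A1, A2)

def gaussian_kernel(x, z, sigma=1.0):
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

# Stationarity: shifting both arguments by c leaves the value unchanged.
x, z, c = (0.5, -1.0), (2.0, 0.25), (10.0, -3.0)
original = gaussian_kernel(x, z)
shifted = gaussian_kernel(tuple(xi + ci for xi, ci in zip(x, c)),
                          tuple(zi + ci for zi, ci in zip(z, c)))
print(explicit, implicit, original, shifted)
```

The explicit feature vector has $2^{|\text{universe}|}$ entries, whereas the kernel needs only one set intersection.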

14 The Kernel Classification Formula Suppose we have a kernel function $k$ and $N$ labelled instances with weights $a_n \geq 0$, $n = 1, \ldots, N$. As with the perceptron, the target labels are $+1$ for the positive class, $-1$ for the negative class. Then $y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$. $x$ is classified as positive if $y(x) > 0$, negative otherwise. If $a_n > 0$, then $x_n$ is a support vector. Don't need to store other vectors. $a$ will be sparse: many zeros.
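
The formula $y(x) = \sum_n a_n t_n k(x, x_n) + b$ translates into a few lines of code. In the sketch below the weights, bias, and data points are hand-picked toy values (assumed for illustration, not produced by a trained SVM), and a linear kernel stands in for whatever kernel the user would specify. It also shows that dropping the zero-weight instances leaves the decision value unchanged.

```python
def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def decision_value(x, instances, b, kernel):
    """y(x) = sum_n a_n * t_n * k(x, x_n) + b."""
    return sum(a * t * kernel(x, xn) for a, t, xn in instances) + b

def classify(x, instances, b, kernel):
    """+1 if y(x) > 0, else -1."""
    return 1 if decision_value(x, instances, b, kernel) > 0 else -1

# Each instance is (a_n, t_n, x_n). Toy values chosen by hand.
instances = [
    (0.25, +1, (1.0, 1.0)),
    (0.25, -1, (-1.0, -1.0)),
    (0.0, +1, (3.0, 2.0)),    # zero weight: not a support vector
    (0.0, -1, (-2.0, -4.0)),  # zero weight: not a support vector
]
b = 0.0

# Only instances with a_n > 0 need to be stored.
support_vectors = [inst for inst in instances if inst[0] > 0]

query = (2.0, 0.0)
full = decision_value(query, instances, b, linear_kernel)
sparse = decision_value(query, support_vectors, b, linear_kernel)
print(full, sparse, classify((-1.0, -3.0), support_vectors, b, linear_kernel))
```

Inference cost scales with the number of support vectors rather than with $N$, which is where the sparsity of $a$ pays off.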

15 Examples SVM with Gaussian kernel Support vectors circled. They are the closest to the other class. Note non-linear decision boundary in x space

16 Examples From Burges, A Tutorial on Support Vector Machines for Pattern Recognition (1998), Figure 9: Degree 3 polynomial kernel. The background colour shows the shape of the decision surface. SVM trained using cubic polynomial kernel $k(x_1, x_2) = (x_1^T x_2 + 1)^3$. Left is linearly separable: note decision boundary is almost linear, even using cubic polynomial kernel. Right is not linearly separable, but is separable using polynomial kernel.

17 Learning the Instance Weights The max-margin classifier is found by solving the following problem: Maximize with respect to $a$: $L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)$ subject to the constraints $a_n \geq 0$, $n = 1, \ldots, N$, and $\sum_{n=1}^{N} a_n t_n = 0$. The objective is quadratic with linear constraints, so this is a convex optimization problem in $a$. Bounded above since $K$ is positive semi-definite. Optimal $a$ can be found. With large datasets, descent strategies are employed.
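
To make the objective concrete, the sketch below evaluates $L(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m k(x_n, x_m)$ on a two-point toy problem. With one positive and one negative instance, the constraint $\sum_n a_n t_n = 0$ forces $a_1 = a_2$, so a one-dimensional grid scan is enough to locate the maximum. The grid scan is a stand-in chosen for brevity, not a general solver; real SVM training uses a dedicated quadratic programming method. The data and helper names are assumptions.

```python
def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def dual_objective(a, t, K):
    """L(a) = sum_n a_n - 1/2 sum_n sum_m a_n a_m t_n t_m K[n][m]."""
    n = len(a)
    quad = sum(a[i] * a[j] * t[i] * t[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(a) - 0.5 * quad

def satisfies_constraints(a, t, tol=1e-9):
    """a_n >= 0 for all n, and sum_n a_n t_n = 0."""
    nonneg = all(ai >= -tol for ai in a)
    balanced = abs(sum(ai * ti for ai, ti in zip(a, t))) <= tol
    return nonneg and balanced

# Toy problem: one positive and one negative instance.
xs = [(1.0, 1.0), (-1.0, -1.0)]
t = [1, -1]
K = [[linear_kernel(xi, xj) for xj in xs] for xi in xs]

# The equality constraint forces a_1 = a_2 = alpha, so scan alpha only.
candidates = [i / 100 for i in range(101)]
best_alpha = max(candidates,
                 key=lambda alpha: dual_objective([alpha, alpha], t, K))
best_a = [best_alpha, best_alpha]
best_value = dual_objective(best_a, t, K)
print(best_alpha, best_value, satisfies_constraints(best_a, t))
```

Here $L$ reduces to $2\alpha - 4\alpha^2$, a downward-opening parabola, which illustrates why the objective is bounded above and has a single maximum.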

18 Regression Kernelized Many classifiers can be written using only dot products. Kernelization = replace dot products by kernel. E.g., the kernel solution for regularized least squares regression is $y(x) = k(x)^T (K + \lambda I_N)^{-1} t$ vs. $y(x) = \phi(x)^T (\Phi^T \Phi + \lambda I_M)^{-1} \Phi^T t$ for the original version. $N$ is the number of datapoints (size of Gram matrix $K$). $M$ is the number of basis functions (size of matrix $\Phi^T \Phi$). Bad if $N > M$, but good otherwise. $k(x) = (k(x, x_1), \ldots, k(x, x_N))$ is the vector of kernel values over data points $x_n$.
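
The equivalence of the two regularized least squares solutions can be checked on a small example. The sketch below takes the identity feature map $\phi(x) = x$ with a linear kernel, so that $K = \Phi \Phi^T$, and compares the kernel form $k(x)^T (K + \lambda I_N)^{-1} t$ with the feature-space form $\phi(x)^T (\Phi^T \Phi + \lambda I_M)^{-1} \Phi^T t$. The data, $\lambda$, and the hand-written linear solver are all illustrative assumptions; a numerical library would normally be used instead.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[dot(row, col) for col in zip(*B)] for row in A]

def add_ridge(A, lam):
    """A + lam * I."""
    n = len(A)
    return [[A[i][j] + (lam if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# N = 4 data points, M = 2 features (identity feature map).
Phi = [[1.0, 2.0], [2.0, 0.0], [0.0, 1.0], [3.0, 1.0]]
t = [1.0, 2.0, 0.5, 3.0]
lam = 0.1
PhiT = transpose(Phi)

# Kernel form: solve an N x N system.
K = matmul(Phi, PhiT)
a = solve(add_ridge(K, lam), t)

def predict_dual(x):
    return dot([dot(x, xn) for xn in Phi], a)

# Feature-space form: solve an M x M system.
w = solve(add_ridge(matmul(PhiT, Phi), lam), [dot(col, t) for col in PhiT])

def predict_primal(x):
    return dot(x, w)

x_new = [1.5, -0.5]
print(predict_dual(x_new), predict_primal(x_new))
```

In this example the kernel form solves a 4 x 4 system and the feature-space form a 2 x 2 one, which is the slide's "bad if $N > M$" case; the trade-off reverses when the feature space is large or infinite.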

19 Conclusion Readings: Ch (pp ) Non-linear features, or domain-specific similarity measurements are useful Dot products of non-linear features, or similarity measurements, can be written as kernel functions Validity by positive semi-definiteness of kernel function Can have algorithm work in non-linear feature space without actually mapping inputs to feature space Advantageous when feature space is high-dimensional
