Kernel Methods in Machine Learning


1 Kernel Methods in Machine Learning. Perceptron. Geometric Margins. Support Vector Machines (SVMs). Maria-Florina Balcan, 03/23/2015

2 Quick Recap about Perceptron and Margins

3 The Online Learning Model Examples arrive sequentially, we need to make a prediction, and afterwards we observe the outcome. For i = 1, 2, ...: in phase i, example x_i arrives, the online algorithm makes a prediction h(x_i), and then we observe c*(x_i). Mistake bound model: analysis-wise, we make no distributional assumptions. Goal: minimize the number of mistakes.

4 Perceptron Algorithm in Online Model WLOG consider homogeneous linear separators [w_0 = 0]. Set t = 1 and start with the all-zero vector w_1. Given example x, predict + iff w_t · x ≥ 0. On a mistake, update as follows: on a mistake on a positive, w_{t+1} ← w_t + x; on a mistake on a negative, w_{t+1} ← w_t − x. Note 1: w_t is a weighted sum of the incorrectly classified examples, w_t = a_{i_1} x_{i_1} + ... + a_{i_k} x_{i_k}, so w_t · x = a_{i_1} x_{i_1} · x + ... + a_{i_k} x_{i_k} · x. Note 2: the number of mistakes ever made depends only on the geometric margin of the examples seen, no matter how long the sequence is or how high the dimension n is!
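To make the update rule concrete, here is a minimal sketch of the online Perceptron in Python with NumPy; the stream of labeled examples and the ±1 label encoding are assumptions for illustration, not the lecture's notation.

```python
import numpy as np

def perceptron(stream):
    """Online Perceptron for homogeneous linear separators.

    `stream` yields (x, y) pairs with x a NumPy array and y in {-1, +1}.
    Returns the final weight vector and the number of mistakes made.
    """
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = np.zeros_like(x, dtype=float)  # start with the all-zero vector
        pred = 1 if w @ x >= 0 else -1         # predict + iff w . x >= 0
        if pred != y:                          # on a mistake: add x (positive) or subtract x (negative)
            w += y * x
            mistakes += 1
    return w, mistakes
```

Note that the update w += y * x is exactly the slide's two cases folded into one line: y = +1 adds x, y = -1 subtracts x.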

5 Geometric Margin Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0. If ‖w‖ = 1, the margin of x w.r.t. w is |x · w|. [Figure: the margins of examples x_1 and x_2 w.r.t. a separator w.]

6 Geometric Margin Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0. Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S. Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w. [Figure: a separator w with margin γ on each side of the data.]

7 Perceptron: Mistake Bound Theorem: If the data is linearly separable by margin γ and the points lie inside a ball of radius R, then Perceptron makes at most (R/γ)^2 mistakes, no matter how long the sequence is or how high the dimension n is! Margin: the amount of wiggle room available for a solution. (Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)

8 Perceptron Extensions Can use it to find a consistent separator for a given set S linearly separable by margin γ (by cycling through the data). Can convert the mistake bound guarantee into a distributional guarantee too (for the case where the x_i's come from a fixed distribution). Can be adapted to the case where there is no perfect separator, as long as the so-called hinge loss (i.e., the total distance needed to move the points to classify them correctly by a large margin) is small. Can be kernelized to handle nonlinear decision boundaries!

9 Theorem: If the data is linearly separable by margin γ and the points lie inside a ball of radius R, then Perceptron makes at most (R/γ)^2 mistakes. This implies that large margin classifiers have smaller complexity!

10 Complexity of Large Margin Linear Separators We know that in R^n we can shatter n+1 points with linear separators, but not n+2 points (the VC dimension of linear separators is n+1). What if we require that the points be linearly separated by margin γ? At most (R/γ)^2 points inside a ball of radius R can be shattered at margin γ (meaning that every labeling is achievable by a separator of margin γ). So, large margin classifiers have smaller complexity! Fewer classifiers to worry about that will look good over the sample but bad overall. Less prone to overfitting! Nice implications for the usual distributional learning setting.

11 Margin: Important Theme in ML Margins have both sample complexity and algorithmic implications. Sample/mistake bound complexity: if the margin is large, the number of mistakes Perceptron makes is small (independent of the dimension of the space)! If the margin γ is large and the algorithm produces a large margin classifier, then the amount of data needed depends only on R/γ [Bartlett & Shawe-Taylor '99]. This suggests searching for a large margin classifier. Algorithmic implications: Perceptron, Kernels, SVMs.

12 So far, we have talked about margins in the context of (nearly) linearly separable datasets.

13 What if Not Linearly Separable? Problem: the data is not linearly separable in the most natural feature representation. Example: telling two image classes apart when there is no good linear separator in the pixel representation. Solutions: learn a more complex class of functions (e.g., decision trees, neural networks, boosting); use a kernel (a neat solution that attracted a lot of attention); use a deep network; combine kernels and deep networks.

14 Overview of Kernel Methods What is a kernel? A kernel K is a legal definition of a dot product: i.e., there exists an implicit mapping Φ s.t. K(x, z) = Φ(x) · Φ(z). E.g., K(x, z) = (x · z + 1)^d maps an n-dimensional space into an ~n^d-dimensional space. Why do kernels matter? Many algorithms interact with data only via dot products. So, if we replace x · z with K(x, z), they act implicitly as if the data were in the higher-dimensional Φ-space. If the data is linearly separable by a large margin in the Φ-space, then we get good sample complexity. [Or other regularity properties for controlling the capacity.]

15 Kernels Definition K(·,·) is a kernel if it can be viewed as a legal definition of an inner product: there exists ϕ : X → R^N s.t. K(x, z) = ϕ(x) · ϕ(z). The range of ϕ is called the Φ-space. N can be very large. But think of ϕ as implicit, not explicit!

16 Example For n = 2, d = 2, the kernel K(x, z) = (x · z)^d corresponds to the mapping (x_1, x_2) → Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2). [Figure: the original space and the Φ-space.]

17 Example ϕ : R^2 → R^3, (x_1, x_2) → Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2). [Figure: the original space and the Φ-space.] ϕ(x) · ϕ(z) = (x_1^2, x_2^2, √2 x_1 x_2) · (z_1^2, z_2^2, √2 z_1 z_2) = (x_1 z_1 + x_2 z_2)^2 = (x · z)^2 = K(x, z).
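As a quick sanity check, a small Python snippet (NumPy assumed) can verify numerically that the explicit map ϕ and the kernel agree; the test points are arbitrary.

```python
import numpy as np

def phi(x):
    # Explicit feature map for n = 2, d = 2: (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))   # 1.0 -- dot product in the Phi-space
print((x @ z) ** 2)      # 1.0 -- the kernel K(x, z) = (x . z)^2
```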

18 Kernels Definition K(·,·) is a kernel if it can be viewed as a legal definition of an inner product: there exists ϕ : X → R^N s.t. K(x, z) = ϕ(x) · ϕ(z). The range of ϕ is called the Φ-space. N can be very large. But think of ϕ as implicit, not explicit!

19 Example Note: the feature space might not be unique. ϕ : R^2 → R^3, (x_1, x_2) → Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2): ϕ(x) · ϕ(z) = (x_1^2, x_2^2, √2 x_1 x_2) · (z_1^2, z_2^2, √2 z_1 z_2) = (x_1 z_1 + x_2 z_2)^2 = (x · z)^2 = K(x, z). ϕ : R^2 → R^4, (x_1, x_2) → Φ(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1): ϕ(x) · ϕ(z) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1) · (z_1^2, z_2^2, z_1 z_2, z_2 z_1) = (x · z)^2 = K(x, z). Both feature maps realize the same kernel.

20 Avoid Explicitly Expanding the Features The feature space can grow really large really quickly. It is crucial to think of ϕ as implicit, not explicit! Polynomial kernel of degree d: K(x, z) = (x · z)^d = φ(x) · φ(z), whose features are the monomials of total degree d (e.g., x_1^d, x_1^{d−1} x_2, ...). The total number of such features is C(d + n − 1, d) = (d + n − 1)! / (d! (n − 1)!). For d = 6 and n = 100 there are about 1.6 billion terms, yet computing K(x, z) = (x · z)^d = φ(x) · φ(z) takes only O(n) time!
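A one-liner reproduces the feature count above; `math.comb` computes the binomial coefficient.

```python
from math import comb

# Number of degree-d monomial features over n variables: C(d + n - 1, d).
d, n = 6, 100
print(comb(d + n - 1, d))  # 1609344100 -- about 1.6 billion features
```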

21 Kernelizing a Learning Algorithm If all computations involving instances are in terms of inner products, then: conceptually, we work in a very high-dimensional space and the algorithm's performance depends only on linear separability in that extended space; computationally, we only need to modify the algorithm by replacing each x · z with K(x, z). Examples of kernelizable algorithms: classification: Perceptron, SVM; regression: linear regression, ridge regression; clustering: k-means.

22 Kernelizing the Perceptron Algorithm Set t = 1 and start with the all-zero vector w_1. Given example x, predict + iff w_t · x ≥ 0. On a mistake, update as follows: on a mistake on a positive, w_{t+1} ← w_t + x; on a mistake on a negative, w_{t+1} ← w_t − x. Easy to kernelize, since w_t is a weighted sum of the incorrectly classified examples: w_t = a_{i_1} x_{i_1} + ... + a_{i_k} x_{i_k}. Replace w_t · x = a_{i_1} x_{i_1} · x + ... + a_{i_k} x_{i_k} · x with a_{i_1} K(x_{i_1}, x) + ... + a_{i_k} K(x_{i_k}, x). Note: we need to store all the mistakes made so far.

23 Kernelizing the Perceptron Algorithm Given x, predict + iff a_{i_1} K(x_{i_1}, x) + ... + a_{i_{t−1}} K(x_{i_{t−1}}, x) ≥ 0. On the t-th mistake, update as follows: on a mistake on a positive, set a_{i_t} ← 1 and store x_{i_t}; on a mistake on a negative, set a_{i_t} ← −1 and store x_{i_t}. This has the exact same behavior/prediction rule as if we mapped the data into the φ-space and ran Perceptron there, but we do it implicitly, so we get the computational savings!
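Putting the last two slides together, a minimal sketch of the kernelized Perceptron might look as follows; the `stream` and `K` interfaces are illustrative assumptions, not the lecture's notation.

```python
def kernel_perceptron(stream, K):
    """Kernelized Perceptron: `stream` yields (x, y) with y in {-1, +1},
    `K` is a kernel function K(x, z).

    Stores only the mistake examples and their signs; prediction uses
    sum_j a_j K(x_j, x) in place of w . x.
    """
    support, signs = [], []          # mistakes seen so far and their coefficients
    for x, y in stream:
        score = sum(a * K(xj, x) for a, xj in zip(signs, support))
        pred = 1 if score >= 0 else -1
        if pred != y:                # on the t-th mistake: store x, set a = y
            support.append(x)
            signs.append(y)
    return support, signs
```

With the linear kernel K(x, z) = x · z this reproduces the ordinary Perceptron exactly, which is the point of the slide: the behavior is identical, only the representation of w changes.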

24 Generalizes Well if Good Margin If the data is linearly separable by a large margin in the φ-space, then we get a small mistake bound: if the margin in the φ-space is γ, then Perceptron makes at most (R/γ)^2 mistakes. [Figure: a large margin separator w* in the Φ-space.]

25 Kernels: More Examples Linear: K(x, z) = x · z. Polynomial: K(x, z) = (x · z)^d or K(x, z) = (1 + x · z)^d. Gaussian: K(x, z) = exp(−‖x − z‖^2 / (2σ^2)). Laplace kernel: K(x, z) = exp(−‖x − z‖ / (2σ^2)). There are also kernels for non-vectorial data, e.g., measuring similarity between sequences.
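For concreteness, the four vector kernels above could be written as follows (a sketch in Python with NumPy; the Laplace kernel follows the slide's 2σ² parameterization, and the default parameter values are arbitrary).

```python
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, d=2, c=1.0):
    # c = 0 gives (x . z)^d, c = 1 gives (1 + x . z)^d
    return (c + x @ z) ** d

def gaussian(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def laplace(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))
```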

26 Properties of Kernels Theorem (Mercer): K is a kernel if and only if K is symmetric and, for any set of training points x_1, x_2, ..., x_m and any a_1, a_2, ..., a_m ∈ R, we have Σ_{i,j} a_i a_j K(x_i, x_j) ≥ 0, i.e., a^T K a ≥ 0 where K = (K(x_i, x_j))_{i,j=1,...,m}. In other words, the Gram matrix K is positive semidefinite.
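In practice, this condition can be sanity-checked numerically on a sample by building the Gram matrix and testing symmetry and eigenvalue nonnegativity; a sketch with NumPy and arbitrary test points follows.

```python
import numpy as np

def gram_matrix(K, xs):
    """Gram matrix (K(x_i, x_j))_{i,j} for training points xs."""
    return np.array([[K(xi, xj) for xj in xs] for xi in xs])

# Mercer check in practice: symmetric + all eigenvalues >= 0 (up to round-off).
xs = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
G = gram_matrix(lambda x, z: np.exp(-np.linalg.norm(x - z) ** 2 / 2), xs)
print(np.allclose(G, G.T))                   # symmetric
print(np.linalg.eigvalsh(G).min() >= -1e-9)  # positive semidefinite
```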

27 Kernel Methods Offer great modularity. There is no need to change the underlying learning algorithm to accommodate a particular choice of kernel function. Also, we can substitute a different algorithm while maintaining the same kernel.

28 Kernels, Closure Properties Easily create new kernels using basic ones! Fact: If K_1(·,·) and K_2(·,·) are kernels and c_1 ≥ 0, c_2 ≥ 0, then K(x, z) = c_1 K_1(x, z) + c_2 K_2(x, z) is a kernel. Key idea: concatenate the φ-spaces, ϕ(x) = (√c_1 ϕ_1(x), √c_2 ϕ_2(x)), so that ϕ(x) · ϕ(z) = c_1 ϕ_1(x) · ϕ_1(z) + c_2 ϕ_2(x) · ϕ_2(z) = c_1 K_1(x, z) + c_2 K_2(x, z).

29 Kernels, Closure Properties Easily create new kernels using basic ones! Fact: If K_1(·,·) and K_2(·,·) are kernels, then K(x, z) = K_1(x, z) K_2(x, z) is a kernel. Key idea: take ϕ(x) = (ϕ_{1,i}(x) ϕ_{2,j}(x)) for i ∈ {1, ..., n}, j ∈ {1, ..., m}. Then ϕ(x) · ϕ(z) = Σ_{i,j} ϕ_{1,i}(x) ϕ_{2,j}(x) ϕ_{1,i}(z) ϕ_{2,j}(z) = Σ_i ϕ_{1,i}(x) ϕ_{1,i}(z) Σ_j ϕ_{2,j}(x) ϕ_{2,j}(z) = Σ_i ϕ_{1,i}(x) ϕ_{1,i}(z) K_2(x, z) = K_1(x, z) K_2(x, z).
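Both closure properties can be checked numerically: a nonnegative combination of Gram matrices, and their entrywise product, should stay positive semidefinite. A sketch (NumPy assumed; random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))          # 5 random points in R^3

K1 = xs @ xs.T                        # Gram matrix of the linear kernel
K2 = (1 + xs @ xs.T) ** 2             # Gram matrix of the degree-2 polynomial kernel

for K in (3 * K1 + 2 * K2,            # nonnegative combination: still a kernel
          K1 * K2):                   # entrywise product (product kernel): still a kernel
    print(np.linalg.eigvalsh(K).min() >= -1e-9)  # PSD check, up to round-off
```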

30 Kernels, Discussion If all computations involving instances are in terms of inner products, then: conceptually, we work in a very high-dimensional space and the algorithm's performance depends only on linear separability in that extended space; computationally, we only need to modify the algorithm by replacing each x · z with K(x, z). Lots of machine learning algorithms are kernelizable: classification: Perceptron, SVM; regression: linear regression; clustering: k-means.

31 Kernels, Discussion If all computations involving instances are in terms of inner products, then: conceptually, we work in a very high-dimensional space and the algorithm's performance depends only on linear separability in that extended space; computationally, we only need to modify the algorithm by replacing each x · z with K(x, z). How to choose a kernel: kernels often encode domain knowledge (e.g., string kernels); use cross-validation to choose the parameters, e.g., σ for the Gaussian kernel K(x, z) = exp(−‖x − z‖^2 / (2σ^2)); or learn a good kernel, e.g., [Lanckriet-Cristianini-Bartlett-El Ghaoui-Jordan '04].
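As an illustration of the cross-validation suggestion, one could tune the Gaussian kernel width with scikit-learn's grid search; this sketch uses an SVM rather than the lecture's Perceptron, and the toy data and parameter grid are assumptions (note that sklearn's RBF kernel uses gamma = 1/(2σ²)).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data with a nonlinear (circular) decision boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.2).astype(int)

# 5-fold cross-validation over candidate kernel widths.
search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"gamma": [0.01, 0.1, 1.0, 10.0]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)
```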
