Introduction to Support Vector Machines

1 Introduction to Support Vector Machines
Liangliang Cao, ECE 547, University of Illinois at Urbana-Champaign, Fall 2010

2 Who invented SVMs?
Vladimir Vapnik
- Ph.D. in Statistics, 1964, Inst. Control Sci., Moscow
- AT&T, USA (developed Support Vector Machines)
- NEC Laboratories, 2002 to now
- U.S. National Academy of Engineering, 2006
Quote: "Until recently, philosophy was based on the very simple idea that the world is simple. As Einstein said, when the number of factors coming into play is too large, scientific methods in most cases fail. In machine learning, for the first time, we have examples where the world is not simple. For example, when we solve the "forest" problem with data of size 15,000 we get 85%-87% accuracy. However, when we use 500,000 training examples we achieve 98% of correct answers. This means that a good decision rule is not a simple one, it cannot be described by a very few parameters."

3 Outline
1. Maximum margin classifiers and linear SVMs
   - Separating hyperplane
   - Geometric margin
   - Comparing with other algorithms
   - Reformulation by rescaling and slack variables
   - General SVM in the linear form
2. Dual problem and nonlinear SVMs
   - Lagrange multiplier and KKT condition
   - Dual problem and kernels
   - Mercer theorem
   - Optimization in primal form: from Perceptron to Pegasos SVM
   - Optimization in dual form: SMO algorithms

4 Resources
- Vladimir Vapnik: The Nature of Statistical Learning Theory. Springer-Verlag (difficult but unique)
- Christopher J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998
- Bernhard Schölkopf and A. J. Smola: Learning with Kernels
- A useful website:
- Software: LIBSVM; SVMLight (svmlight.joachims.org/)

5 Problem
A toy problem for two-category classification: training samples $\{x_i, y_i\}$, $1 \le i \le N$. Here $x_i$ denotes a sample in two-dimensional space, while $y_i \in \{+1, -1\}$ denotes its label.

6 Problem
We consider a linear classifier, which corresponds to a hyperplane separating the training samples (suppose all the samples are separable).

7 Classifier
Which linear classifier is the best?

8 Optimal classifier
When all the samples are correctly classified, we prefer the classifier for which the data points are as far from the decision boundary as possible. We introduce the concept of margin to measure the distance from the data samples to the separating hyperplane. The optimal classifier is the one with the largest margin.

9 Distance from a point to a plane
Distance from $x$ to the plane $w^T x + b = 0$:
$$r = \frac{|w^T x + b|}{\|w\|}$$
Proof. $w$ is orthogonal to the hyperplane $(w, b)$. Suppose $x$ is above the hyperplane and let $x'$ be its projection onto the hyperplane; then we can write $x - x' = r \frac{w}{\|w\|}$. Since $w^T x' + b = 0$, we have $w^T \left(x - r \frac{w}{\|w\|}\right) + b = 0$, from which we get $r = \frac{w^T x + b}{\|w\|}$. Similarly, for $x$ anywhere in the space the distance is $r = \frac{|w^T x + b|}{\|w\|}$.
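
The distance formula above can be checked numerically. This is a minimal sketch; the function name `signed_distance` and the example hyperplane are assumptions made for illustration, not part of the slides. The sign of the result tells which side of the hyperplane the point lies on, and its absolute value is the distance $r$.

```python
import numpy as np

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return (w @ x + b) / np.linalg.norm(w)

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, b = np.array([3.0, 4.0]), -5.0
r_above = signed_distance(w, b, [3.0, 4.0])   # (9 + 16 - 5) / 5 = 4.0
r_below = signed_distance(w, b, [0.0, 0.0])   # -5 / 5 = -1.0, i.e. distance 1 on the other side
```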

10 Geometric margin and support vectors
Geometric margin: The geometric margin is the smallest distance from the samples to the separating hyperplane, i.e.
$$M = \min_i r_i = \min_i \frac{|w^T x_i + b|}{\|w\|}$$
Note that the geometric margin does not depend on the training samples that are far from the boundary. We are more interested in those that define the decision boundary.
Support vectors: The minimum distance is determined by a few data points on the boundary. We call those points support vectors.
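
As an illustration of the two definitions above, the following sketch computes the geometric margin of a given hyperplane and the indices of the points that attain it. The toy data and the helper name `geometric_margin` are assumptions; for this data the hyperplane $x_1 = 0$ happens to be the maximum margin one, so the closest points are its support vectors.

```python
import numpy as np

def geometric_margin(w, b, X):
    """Return the smallest point-to-hyperplane distance and all distances."""
    dist = np.abs(X @ w + b) / np.linalg.norm(w)
    return dist.min(), dist

# Hypothetical 2D samples: two on each side of the line x1 = 0.
X = np.array([[1.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-4.0, 2.0]])
w, b = np.array([1.0, 0.0]), 0.0

M, dist = geometric_margin(w, b, X)
support = np.flatnonzero(np.isclose(dist, M))   # indices of the closest points
```

Moving the far points (indices 1 and 3) around does not change `M` as long as they stay farther away than the closest ones.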

11 Maximum Margin Classifier vs. Nearest Neighbor
Summary: Based on the concept of geometric margin, we can sketch the maximum margin classifier for some simple cases in 2D space. The general optimization problem of finding the optimal classifier will be discussed in the next class.
Comparing with NN:
                  Maximum Margin    NN
Training          Need training     No training
Testing           Fast              Slow
High dimension    Usually good      Not so good
Multi-category    Expensive         Simple

12 Formulation of a convex optimization problem
The maximum margin classifier is the simplest SVM (linear SVM). To maximize the margin, we consider
$$\arg\max_{w,b} M = \arg\max_{w,b} \left\{ \min_i \frac{|w^T x_i + b|}{\|w\|} \right\}$$
To remove the absolute value, we use $y_i$ to indicate whether $x_i$ is above or below the hyperplane, so that we have
$$\arg\max_{w,b} \left\{ \min_i \frac{y_i (w^T x_i + b)}{\|w\|} \right\} = \arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_i \left[ y_i (w^T x_i + b) \right] \right\}$$
However, this formulation is still difficult to solve: the unknown variables appear in both the numerator and the denominator!

13 Rescale w
Intuition: Note that we can rescale $w$ (together with $b$) so that $\min_i [y_i (w^T x_i + b)]$ can be adjusted without moving the hyperplane. We can set it to a constant so that the optimization problem can be separated. A good way to rescale $w$ is to guarantee that the numerator is 1, i.e. $\min_i [y_i (w^T x_i + b)] = 1$.
Formulation:
$$\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_i \left[ y_i (w^T x_i + b) \right] \right\}$$
which can be transformed into
$$\arg\max_{w,b} \frac{1}{\|w\|} \quad \text{subject to} \quad \min_i \left[ y_i (w^T x_i + b) \right] \ge 1$$
which is equivalent to
$$\arg\min_{w,b} \|w\|^2 \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \ \text{for all } i$$
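
A small numerical check of the rescaling argument, as a sketch on made-up data: dividing $(w, b)$ by $\min_i [y_i (w^T x_i + b)]$ sets that minimum to 1 while leaving the geometric margin unchanged, after which the margin equals $1/\|w\|$. The data and helper names below are assumptions for illustration.

```python
import numpy as np

# Hypothetical separable samples and an arbitrarily scaled separating hyperplane.
X = np.array([[1.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-4.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([5.0, 0.0]), 0.0

def functional_margin(w, b):
    """min_i y_i (w^T x_i + b): the numerator of the margin expression."""
    return np.min(y * (X @ w + b))

def geometric_margin(w, b):
    return functional_margin(w, b) / np.linalg.norm(w)

s = functional_margin(w, b)        # 5.0 for this choice of w
w_hat, b_hat = w / s, b / s        # rescaled so that the numerator becomes 1

margin_before = geometric_margin(w, b)
margin_after = geometric_margin(w_hat, b_hat)
```

After rescaling, maximizing the margin is the same as maximizing `1 / np.linalg.norm(w_hat)`, which is why the problem reduces to minimizing $\|w\|^2$.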

14 Non-separable situation
So far we have the SVM for the separable case:
$$\arg\min_{w,b} \|w\|^2 \quad (1)$$
$$\text{s.t.} \quad y_i (w^T x_i + b) \ge 1 \quad (2)$$
However, what if $\min_i [y_i (w^T x_i + b)] \ge 1$ cannot be satisfied, i.e., the data is not linearly separable?

15 Non-separable situation
We introduce a slack variable $\xi_i \ge 0$ for each constraint,
$$y_i (w^T x_i + b) \ge 1 - \xi_i$$
where $\xi_i > 1$ means that sample $i$ is misclassified. Therefore we get the formal SVM formulation
$$\min_{w,b,\xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$
Now we arrive at what is called Support Vector Machines (linear case)!
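
To make the role of the slack variables concrete, here is a minimal sketch that evaluates the soft-margin objective for a fixed $(w, b)$. It assumes the smallest feasible slack, $\xi_i = \max(0, 1 - y_i (w^T x_i + b))$, which is what the minimization over $\xi$ selects; the data and the function name are made up for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C*sum(xi) and the slack vector xi for fixed (w, b)."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)   # smallest slack satisfying the constraints
    return 0.5 * (w @ w) + C * xi.sum(), xi

# Hypothetical 1D-like data; the third sample carries the "wrong" label.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

obj, xi = soft_margin_objective(w, b, X, y, C=1.0)
misclassified = xi > 1.0   # xi in (0, 1]: inside the margin but still on the right side
```

Here the second sample has $\xi = 0.5$ (inside the margin, correctly classified) and the third has $\xi = 1.5$ (misclassified).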

16 Summary
- The maximum margin classifier is easy to understand (and for simple cases, it is possible to compute it by hand).
- For ease of optimization and to handle the non-separable situation, we rewrite the formulation; the result is called the linear SVM.
- Nonlinear SVMs carry richer meaning. We will cover them later.

17 Review
Last class we found that, to maximize the margin, the maximum margin classifier should solve
$$\arg\max_{w,b} M = \arg\max_{w,b} \left\{ \min_i \frac{|w^T x_i + b|}{\|w\|} \right\}$$
It can be written in a generalized form (SVM):
$$\min \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i$$
In this class we will review some advanced topics:
- Lagrange multiplier: KKT condition, dual problem
- Kernels: Mercer theorem
- Optimization in primal form: from Perceptron to Pegasos SVM
- Optimization in dual form: SMO algorithms

18 Problem abstraction
General problem:
$$\min f(x) \quad \text{s.t.} \quad g(x) \le 0, \ h(x) = 0$$
Simplified problem (equality constraints only):
$$\min f(x) \quad \text{s.t.} \quad g(x) = 0$$
A naive solution is to solve the constraint for one variable, $x_2 = \tau(x_1)$, and then substitute into $f$. But this naive approach doesn't work for large problems or complicated constraints.

19 Lagrange multiplier for equality constraints
Interpretation of constraints: $g(x) = 0$ defines a $(p-1)$-dimensional surface in the original $p$-dimensional space, and $\nabla g(x)$ is orthogonal to the surface.
Optimal point: For a point $x^*$ on the constraint surface which minimizes $f(x)$, $\nabla f(x^*)$ is orthogonal to the surface, i.e., parallel to $\nabla g(x^*)$.
Proof. For a small step $\epsilon$ along the surface, $f(x^* + \epsilon) \approx f(x^*) + \epsilon^T \nabla f(x^*)$. If $\epsilon^T \nabla f(x^*) \ne 0$, then we can find an $\epsilon$ (flipping its sign if necessary) so that $f(x^* + \epsilon) < f(x^*)$, which contradicts $x^* = \arg\min f(x)$. So $\epsilon^T \nabla f(x^*) = 0$, and $\nabla f(x^*)$ is orthogonal to the surface, hence parallel to $\nabla g(x^*)$.

20 Lagrange multiplier for equality constraints
Now we know the condition for the optimal point,
$$\nabla f(x) + \lambda \nabla g(x) = 0$$
where $\lambda \ne 0$ can have either sign. We can consider the Lagrangian
$$L = f(x) + \lambda g(x)$$
whose optimal point corresponds to $\nabla_x L = 0$, $\frac{\partial L}{\partial \lambda} = 0$.
Next we will generalize this idea to inequality constraints.
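
A worked example may help here. The toy problem below (minimize $x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$) is an assumption chosen for illustration, not taken from the slides. Because $f$ is quadratic and $g$ is linear, setting the derivatives of the Lagrangian to zero gives a linear system that can be solved directly.

```python
import numpy as np

# Hypothetical toy problem:
#   minimize f(x) = x1^2 + x2^2   subject to   g(x) = x1 + x2 - 1 = 0
# Stationarity of L = f + lam * g gives a linear system in (x1, x2, lam):
#   2*x1 + lam = 0,   2*x2 + lam = 0,   x1 + x2 = 1
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x1, x2, lam = np.linalg.solve(A, rhs)   # x1 = x2 = 0.5, lam = -1

grad_f = np.array([2.0 * x1, 2.0 * x2])   # (1, 1) at the optimum
grad_g = np.array([1.0, 1.0])             # normal of the constraint line
```

The multiplier comes out negative here, which is fine for an equality constraint: the gradients only need to be parallel, so $\lambda$ can have either sign.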

21 Lagrange multiplier for inequality constraint
Problem: $\min f(x)$ s.t. $g(x) \le 0$
Condition 1: the minimum of $f(x)$ lies in the interior, $g(x) < 0$. Optimal condition: $\nabla f(x) = 0$.
Condition 2: the minimum of $f(x)$ lies on the boundary, $g(x) = 0$. Optimal condition: $\nabla f(x) + \lambda \nabla g(x) = 0$. Since we know that $\nabla f(x)$ points in the reverse direction of $\nabla g(x)$, we have $\lambda > 0$.
We do not know in advance which condition applies, but we can unify these two conditions into one formula, which is called the KKT condition.

22 Karush-Kuhn-Tucker (KKT) conditions
In both conditions, we have
$$\nabla f(x) + \lambda \nabla g(x) = 0$$
with $\lambda = 0$ for condition 1 and $\lambda > 0$ for condition 2. Considering the constraint $g(x) \le 0$, we have the following observations:
$$g(x) \le 0, \quad \lambda \ge 0, \quad \lambda g(x) = 0$$
which are named the Karush-Kuhn-Tucker (KKT) conditions.
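
The two cases can be seen in a one-dimensional sketch. The toy problem (minimize $(x - a)^2$ subject to $x - 1 \le 0$) and the function names are assumptions for illustration. When $a \le 1$ the unconstrained minimum is feasible and $\lambda = 0$; when $a > 1$ the solution sits on the boundary and $\lambda > 0$.

```python
def solve_1d(a):
    """Minimize (x - a)^2 subject to g(x) = x - 1 <= 0; return (x, lam)."""
    if a <= 1.0:
        return a, 0.0                # condition 1: constraint inactive, lam = 0
    return 1.0, 2.0 * (a - 1.0)      # condition 2: x on the boundary, lam > 0

def kkt_holds(a, x, lam, tol=1e-12):
    """Check stationarity, feasibility, lam >= 0 and lam * g(x) = 0."""
    g = x - 1.0
    stationarity = abs(2.0 * (x - a) + lam) <= tol   # f'(x) + lam * g'(x) = 0
    return stationarity and g <= tol and lam >= -tol and abs(lam * g) <= tol

interior = solve_1d(0.3)    # (0.3, 0.0)
boundary = solve_1d(2.5)    # (1.0, 3.0)
```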

23 Lagrange multiplier for SVMs
Using the Lagrange multipliers $\alpha_i$, we consider the Lagrangian
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$$
which is minimized with respect to $w, b$ and maximized with respect to $\alpha$. By letting $\frac{\partial L}{\partial w} = 0$ and $\frac{\partial L}{\partial b} = 0$, and applying the KKT condition on the multipliers, we have
$$w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0, \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \ge 0$$

24 Dual problem
Eliminating $w$ and $b$, we rewrite the cost function as
$$L(\alpha) = \frac{1}{2} \|w\|^2 - w^T \sum_i \alpha_i y_i x_i + \sum_i \alpha_i$$
$$= -\frac{1}{2} \Big[ \sum_i \alpha_i y_i x_i \Big]^T \Big[ \sum_j \alpha_j y_j x_j \Big] + \sum_i \alpha_i$$
$$= -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_i \alpha_i$$
which is maximized subject to $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$. (The term in $b$ vanishes because $\sum_i \alpha_i y_i = 0$.)
This is called the dual form of the SVM. The dual form provides not only a different perspective for optimization, but also a way of employing kernels instead of inner products.
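
The dual can be verified by hand on the smallest possible problem. The two-sample dataset below is an assumption for illustration. With $\alpha_1 = \alpha_2 = a$ (forced by $\sum_i \alpha_i y_i = 0$), the dual objective is $2a - 2a^2$, which is maximized at $a = 0.5$; the sketch checks that the dual value then equals the primal value $\frac{1}{2}\|w\|^2$.

```python
import numpy as np

# Hypothetical two-sample problem: one point per class on the x1 axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

def dual_objective(alpha):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j"""
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

alpha = np.array([0.5, 0.5])     # maximizer of 2a - 2a^2 under alpha_1 = alpha_2 = a
w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i = (1, 0)
b = y[0] - w @ X[0]              # from y_i (w^T x_i + b) = 1 at a support vector

primal_value = 0.5 * (w @ w)
dual_value = dual_objective(alpha)
```

Both samples are support vectors here, and the recovered hyperplane $x_1 = 0$ has margin $1/\|w\| = 1$.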

25 Comparison with other linear classifiers
Other linear classifiers:
- Linear Discriminant Analysis (LDA)
- Logistic Regression
SVM is NOT necessarily significantly better than LDA or Logistic Regression, especially for the case of multiple classes. However, SVM is more popular in practice, probably because:
- There exist very good implementations of SVMs (SVMLight and LibSVM).
- Linear SVM can be easily generalized to the nonlinear case by using different kernels. [1]
Next we will discuss kernels.
[1] But there is no free lunch. Compared with linear SVMs, a nonlinear SVM is much slower to compute, and it is not always easy to find a good kernel.

26 Kernel
Consider $L(\alpha) = -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_i \alpha_i$, and replace $x_i^T x_j$ with $\langle x_i, x_j \rangle$.
If we map $x$ to a high-dimensional space $\phi(x)$, for example $\phi(x) = [x, x^2, x^3, x^4, \ldots]^T$, then the inner product is
$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$$
We can compute the kernel function $K$ directly, which is usually easier and faster than computing $\phi$.
As an example, let $x = (x(1), x(2))^T$, $z = (z(1), z(2))^T$. We have
$$\langle x, z \rangle^2 = (x(1)z(1) + x(2)z(2))^2$$
$$= x(1)^2 z(1)^2 + x(2)^2 z(2)^2 + 2 x(1) z(1) x(2) z(2)$$
$$= \langle (x(1)^2, x(2)^2, \sqrt{2}\, x(1) x(2)), (z(1)^2, z(2)^2, \sqrt{2}\, z(1) z(2)) \rangle$$
$$= \langle \phi(x), \phi(z) \rangle$$
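
The identity in this example is easy to confirm numerically. The sketch below uses the explicit map $\phi(x) = (x(1)^2, x(2)^2, \sqrt{2}\,x(1)x(2))$ and compares its inner product with the squared inner product in the original space; the function names and test vectors are assumptions for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def poly2_kernel(x, z):
    """K(x, z) = <x, z>^2, computed without ever forming phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = poly2_kernel(x, z)            # (3 - 2)^2 = 1
k_mapped = float(phi(x) @ phi(z))        # 9 + 4 - 12 = 1
```

The kernel needs one inner product in 2 dimensions, while the explicit map works in 3; the gap grows quickly with the input dimension and the polynomial degree.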

27 Kernel Selection and Existence (Mercer Theorem)
How to select kernels? Some examples of kernels:
$$K(x, z) = (x^T z + c)^d$$
$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\delta^2} \right)$$
Intuition: with $x \to \phi(x)$, $z \to \phi(z)$, try to take $K(x, z) = \langle \phi(x), \phi(z) \rangle$ that is large when $x, z$ are similar, but small when $x, z$ are dissimilar.
Existence: For any $K(\cdot, \cdot)$, does there exist a $\phi$ satisfying $K(x, z) = \langle \phi(x), \phi(z) \rangle$?
Theorem: Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
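
In practice, the theorem suggests a simple sanity check for a candidate kernel: build the kernel matrix on some samples and confirm that it is symmetric with no negative eigenvalues. The sketch below does this for the Gaussian kernel above; the random data, the bandwidth value, and the function name `rbf_gram` are assumptions. Passing the check on one sample set is necessary but does not prove that a function is a valid kernel.

```python
import numpy as np

def rbf_gram(X, delta):
    """Kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 * delta^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)              # clip tiny negatives caused by round-off
    return np.exp(-d2 / (2.0 * delta ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))              # hypothetical samples
K = rbf_gram(X, delta=1.0)

eigvals = np.linalg.eigvalsh(K)           # eigenvalues of a symmetric matrix
is_psd = eigvals.min() > -1e-8            # allow a small numerical tolerance
```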

28 SVM solver
Solver in dual form:
$$\max_\alpha \ L(\alpha) = -\frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_i \alpha_i \quad \text{s.t.} \quad \alpha_i \ge 0, \ \sum_i \alpha_i y_i = 0$$
We will discuss the SMO algorithm.
Solver in primal form:
$$\arg\min_{w,b} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1$$
We will introduce the Pegasos SVM.

29 Optimization in dual form
General quadratic programming problem:
$$\arg\min_x \ \frac{1}{2} x^T P x + q^T x + r \quad \text{subject to} \quad Gx \le h, \ Ax = b$$
SVM problem:
$$\arg\max_\alpha \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ \sum_{i=1}^{N} \alpha_i y_i = 0$$
For the SVM problem, the number of variables is $N$, and the number of constraints also grows with $N$. When training an SVM on a large dataset, general QP optimization approaches (e.g., the interior-point method) are still relatively slow.

30 Sequential Minimal Optimization (SMO)
We will introduce John Platt's SMO algorithm, which solves smaller problems using subsets of the constraints, while adding more constraints until all of them are satisfied. Empirically, SMO is much more efficient than the interior-point method.
Outline of SMO:
1. Heuristically pick 2 variables, say $\alpha_i, \alpha_j$, and freeze the other variables.
2. Analytically update $\alpha_i, \alpha_j$.
3. Iterate until convergence.
Questions left:
- How to select $\alpha_i, \alpha_j$?
- How to find the analytical solution?
- Why will it converge?
Next we will focus on the first two questions but neglect the last (you can find the answer in Platt's paper if you are interested).

31 Heuristics for selecting two variables
First criterion: select the pair which contributes most to the KKT gap.
- Pick $\alpha_i$: loop over the Lagrange multipliers which are at neither the lower nor the upper bound. Once all of these satisfy the KKT conditions, loop over all patterns violating the KKT conditions, to ensure self-consistency over the complete dataset.
- Pick $\alpha_j$:
$$\alpha_j = \arg\max_k \left| (f(x_i) - y_i) - (f(x_k) - y_k) \right|$$
Second criterion: in case the first heuristic was unsuccessful, all other examples are analyzed until an example is found where progress can be made.

32 Analytical solution for two-variable QP
2-variable problem:
$$\min_{\alpha_i, \alpha_j} \ (\alpha_i^2 K_{ii} + \alpha_j^2 K_{jj} + 2 \alpha_i \alpha_j K_{ij}) + c_i \alpha_i + c_j \alpha_j$$
$$\text{subject to} \quad s \alpha_i + \alpha_j = \gamma, \quad 0 \le \alpha_i, \alpha_j \le C$$
Let $\alpha_j = \gamma - s \alpha_i$; then we can represent the objective function in terms of $\alpha_i$ alone. It becomes a one-dimensional quadratic, so the analytical solution for $\alpha_i$ is the unconstrained minimizer clipped to the interval allowed by the box constraints.
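
The following is a minimal sketch of how the two-variable update fits into a full solver. It is a simplified variant, not Platt's algorithm as published: the second variable is picked at random instead of by the heuristics described earlier, the kernel is linear, and the names, tolerances and toy data are all assumptions. It eliminates $\alpha_i$ through the equality constraint and updates $\alpha_j$ by the clipped one-dimensional minimizer, which is the same idea as substituting $\alpha_j = \gamma - s\alpha_i$ with $s = y_i y_j$.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=20, max_iter=10_000, seed=0):
    """Simplified SMO with a linear kernel; returns (alpha, b)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                                   # Gram matrix
    alpha, b = np.zeros(n), 0.0
    passes = iters = 0
    while passes < max_passes and iters < max_iter:
        iters += 1
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]          # f(x_i) - y_i
            violates = (y[i] * E_i < -tol and alpha[i] < C) or \
                       (y[i] * E_i > tol and alpha[i] > 0)
            if not violates:
                continue                                    # KKT holds within tol
            j = int(rng.integers(n - 1))                    # random j != i
            if j >= i:
                j += 1
            E_j = (alpha * y) @ K[:, j] + b - y[j]
            a_i, a_j = alpha[i], alpha[j]
            # Interval for alpha_j implied by the box and equality constraints.
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2.0 * K[i, j] - K[i, i] - K[j, j]         # curvature of the 1D quadratic
            if lo == hi or eta >= 0:
                continue
            new_j = float(np.clip(a_j - y[j] * (E_i - E_j) / eta, lo, hi))
            if abs(new_j - a_j) < 1e-5:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)       # keeps sum_i alpha_i y_i fixed
            b1 = b - E_i - y[i] * (new_i - a_i) * K[i, i] - y[j] * (new_j - a_j) * K[i, j]
            b2 = b - E_j - y[i] * (new_i - a_i) * K[i, j] - y[j] * (new_j - a_j) * K[j, j]
            if 0 < new_i < C:
                b = b1
            elif 0 < new_j < C:
                b = b2
            else:
                b = 0.5 * (b1 + b2)
            alpha[i], alpha[j] = new_i, new_j
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Hypothetical, well-separated toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha, b = simplified_smo(X, y, C=1.0)
w = (alpha * y) @ X
pred = np.sign(X @ w + b)
```

Random selection keeps the sketch short but usually needs more iterations than the heuristic choice of $\alpha_i, \alpha_j$.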

33 Optimization in primal form
Perceptron optimization: To learn $f(x) = w^T x$ (the bias $b$ can be absorbed by appending a constant feature to $x$), the classification is $h = \mathrm{sign}(f(x))$.
Algorithm:
- Randomly select $w_0$ as initialization.
- For each sample $(x_i, y_i)$, $1 \le i \le N$: if $y_i (w_k^T x_i) \le 0$, then $w_{k+1} = w_k + \eta y_i x_i$ and $k = k + 1$.
Pegasos SVM, by Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro.
Algorithm:
- Initialize $w_0$.
- For $t = 0, 1, 2, \ldots, T$:
  - randomly sample a set $A_t$ from the training set $\{x, y\}$
  - select $A_t^+ = \{(x, y) \in A_t : y (w_t^T x) < 1\}$
  - update $w_{t+1}$ using the samples in $A_t^+$
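
Both procedures can be sketched in a few lines. Several details below are assumptions not stated on the slide: both start from $w_0 = 0$ rather than a random vector, the Pegasos step uses the mini-batch update $w_{t+1} = (1 - \eta_t \lambda) w_t + \frac{\eta_t}{k} \sum_{(x,y) \in A_t^+} y x$ with $\eta_t = 1/(\lambda t)$ and omits the optional projection step, and the values of `lam`, `T`, `k` and the toy data are made up.

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Perceptron without bias: update w on every misclassified sample."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w = w + eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:                 # converged: every sample is classified correctly
            break
    return w

def pegasos(X, y, lam=0.1, T=1000, k=2, seed=0):
    """Mini-batch Pegasos sketch (no projection step, no bias)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        idx = rng.choice(len(y), size=k, replace=False)    # A_t
        viol = idx[y[idx] * (X[idx] @ w) < 1]              # A_t^+: margin violators
        eta_t = 1.0 / (lam * t)
        w = (1.0 - eta_t * lam) * w + (eta_t / k) * (y[viol] @ X[viol])
    return w

# Hypothetical data, separable by a hyperplane through the origin.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w_perceptron = perceptron(X, y)
w_pegasos = pegasos(X, y)
```

The perceptron stops at the first separating hyperplane it finds, while Pegasos keeps shrinking $w$ and pushing on margin violators, so it approximates the regularized maximum margin solution.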

34 Conclusion
So far we have introduced SVMs and the related optimization problems. Although understanding SVMs is not a trivial task, I hope what was taught can help you read most books or papers without much difficulty.
For those who just want to use SVMs as off-the-shelf tools, please try LibSVM or SVMLight. The former also comes with a faster variant for the linear case called LibLinear.

35 Conclusion (continued)
For those who want to do research on SVMs, here are some suggestions:
- Try to implement an SVM solver by yourself.
- Try to test your solver on some toy datasets and check where it fails.
- Kernel selection is the most difficult part of SVM learning and one of the hot research areas.
- There are other viewpoints on SVMs, especially:
  - VC dimension (difficult)
  - SVM as a hybrid of generative and discriminative approaches (Tong and Koller 2000)
  - SVM as a regression (UIUC Stat542)