PDEEC Machine Learning 2016/17


1 PDEEC Machine Learning 2016/17
Lecture - Introduction to Support Vector Machines
Jaime S. Cardoso, jaime.cardoso@inesctec.pt
INESC TEC and Faculdade de Engenharia, Universidade do Porto
Nov. 21, 2016

2 Background - Convex Hull
Given a set of points $\{x_i\}$, we define the convex hull to be the set of all points $x$ given by $x = \sum_i \alpha_i x_i$, subject to $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$.
Given two sets of points $\{x_i\}$ and $\{y_i\}$: if their convex hulls intersect, the two sets of points cannot be linearly separable; if they are linearly separable, their convex hulls do not intersect.
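
As a small illustration (not from the slides), the following Python sketch draws random points from the convex hull of a set by sampling coefficients $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$; the data and the Dirichlet-based sampling are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small 2-D point set whose convex hull we want to sample from.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0], [0.5, 0.5]])

# Draw convex-combination weights: alpha_i >= 0 and sum_i alpha_i = 1.
alpha = rng.dirichlet(np.ones(len(X)), size=5)

# Each row of P is a point x = sum_i alpha_i x_i inside the convex hull.
P = alpha @ X
print(P)
```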

3 Background - Reduced Convex Hulls
Given a set of points $\{x_i\}$, we define the reduced convex hull to be the set of all points $x$ given by $x = \sum_i \alpha_i x_i$, subject to $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$ and $\alpha_i \le \alpha$, with $\frac{1}{N} \le \alpha \le 1$.
Figure: from [4]

4 Background - Distance of the origin to a plane
The plane is defined by the equation $y(x) = w^T x + b$, with $y(x_A) = 0$ if $x_A$ belongs to the plane.
$w$ is orthogonal to every vector lying on the plane: $w^T(x_A - x_B) = 0$, with $x_A$ and $x_B$ on the plane.
The distance of the plane to the origin is the projection of any point of the plane onto the direction of $w$: $\frac{\langle x, w \rangle}{\|w\|} = \frac{w^T x}{\|w\|} = \frac{-b}{\|w\|}$ (in absolute value, $|b|/\|w\|$).
Figure: from [1]

5 Background - Distance of a point P to a plane (I)
1st proof: by translation, make $P$ the origin of a new system of coordinates, $x' = x - P$. The equation of the plane, $w^T x + b = 0$, becomes $w^T(x' + P) + b = 0 \Leftrightarrow w^T x' + (w^T P + b) = 0$. The distance of the new origin to the plane is therefore $\frac{|w^T P + b|}{\|w\|} = \frac{|y(P)|}{\|w\|}$.
2nd proof: write $P = P_\perp + P_{\parallel} = P_\perp + r \frac{w}{\|w\|}$, with $P_\perp$ on the plane. Then $w^T P + b = w^T P_\perp + b + r \frac{w^T w}{\|w\|} = 0 + r\|w\|$, so $r = \frac{w^T P + b}{\|w\|} = \frac{y(P)}{\|w\|}$.

6 Background - Distance of a point P to a plane (II)
3rd proof: the problem of finding the minimum of $\|x - P\|$ is equivalent to finding the minimum of $\|x - P\|^2 = (x - P)^T(x - P)$. Using Lagrange multipliers, minimizing $(x - P)^T(x - P)$ subject to $w^T x + b = 0$ is equivalent to finding the stationary point of $L = (x - P)^T(x - P) + \lambda(w^T x + b)$ with respect to the variables $x$ and $\lambda$:
$\frac{\partial L}{\partial x} = 2(x - P) + \lambda w = 0$
$\frac{\partial L}{\partial \lambda} = w^T x + b = 0$
Multiplying the 1st equation by $w^T$ and using the 2nd to substitute $w^T x = -b$, we obtain $\lambda = \frac{2(w^T P + b)}{w^T w}$. Using this value of $\lambda$ in the 1st equation we get $x - P = -\frac{(w^T P + b)\, w}{w^T w}$, whose norm is $\|x - P\| = \frac{|w^T P + b|}{\|w\|} = \frac{|y(P)|}{\|w\|}$.
Algebraic distance of a point $P$ to a plane: $y(P)/\|w\|$.
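
A quick numerical check of the formula $y(P)/\|w\|$ (a sketch with made-up numbers):

```python
import numpy as np

# Plane y(x) = w^T x + b = 0 and a point P (values chosen arbitrarily).
w = np.array([3.0, 4.0])
b = -5.0
P = np.array([2.0, 1.0])

# Signed (algebraic) distance y(P)/||w|| derived on the slide.
signed_dist = (w @ P + b) / np.linalg.norm(w)

# Cross-check: the foot of the perpendicular lies on the plane.
foot = P - signed_dist * w / np.linalg.norm(w)
print(signed_dist, w @ foot + b)   # second value should be ~0
```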

7 Background - Lagrange Multipliers (I): maximizing $f(x)$ subject to $g_i(x) = 0$
Consider the problem of finding the maximum of a function $f(x)$, $x \in \mathbb{R}^d$, subject to a set of constraints $g_i(x) = 0$, $i = 1, \dots, m$. There exists a $\lambda = (\lambda_1, \lambda_2, \dots, \lambda_m)^T$ such that at the maximum $x_0$
$\nabla f(x_0) = \lambda_1 \nabla g_1(x_0) + \lambda_2 \nabla g_2(x_0) + \dots + \lambda_m \nabla g_m(x_0)$
$\lambda = (\lambda_1, \lambda_2, \dots, \lambda_m)$ are called the Lagrange multipliers. Introducing the Lagrangian function defined by $L(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$, we see that $\partial L/\partial \lambda_i = g_i(x)$ and $\nabla_x L = \nabla f(x) + \lambda_1 \nabla g_1(x) + \dots + \lambda_m \nabla g_m(x)$. Just find the stationary point of $L(x, \lambda)$ with respect to both $x$ and $\lambda$.
Figure: from [1]
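
A worked equality-constrained example (my own toy problem, solved with sympy), which simply finds the stationary point of $L(x, \lambda)$:

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)

# Maximize f(x) = 1 - x1^2 - x2^2 subject to g(x) = x1 + x2 - 1 = 0.
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1

# Lagrangian L(x, lambda) = f(x) + lambda * g(x); set all partials to zero.
L = f + lam * g
stationary = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam), dict=True)
print(stationary)   # expect x1 = x2 = 1/2, lam = 1
```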

8 Background - Lagrange Multipliers (II): maximizing $f(x)$ subject to $g_i(x) \ge 0$
Two kinds of solutions: the solution lies in the region $g(x) > 0$, in which case the constraint is not active ($\lambda = 0$); or the solution lies on $g(x) = 0$, in which case the constraint is active ($\lambda \neq 0$); note that now $\lambda > 0$.
Figure: from [1]

9 Background - Lagrange Multipliers (III): Karush-Kuhn-Tucker (KKT) conditions
(a generalization of the method of Lagrange multipliers to inequality constraints)
The solution to the following nonlinear optimization problem
$\operatorname{argmax}_x f(x)$ s.t. $g_i(x) \ge 0$, $h_j(x) = 0$
is obtained by optimizing
$L(x, \lambda_i, \nu_j) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$
with respect to $x$ and $\lambda$, subject to
$g_i(x) \ge 0$, $\lambda_i \ge 0$, $\lambda_i g_i(x) = 0$, $h_j(x) = 0$.

10 Background - Lagrange Multipliers (IV): Karush-Kuhn-Tucker (KKT) conditions
(a generalization of the method of Lagrange multipliers to inequality constraints)
The solution to the following nonlinear optimization problem
$\operatorname{argmin}_x f(x)$ s.t. $g_i(x) \ge 0$, $h_j(x) = 0$
is obtained by optimizing
$L(x, \lambda_i, \nu_j) = f(x) - \sum_i \lambda_i g_i(x) - \sum_j \nu_j h_j(x)$
with respect to $x$ and $\lambda$, subject to
$g_i(x) \ge 0$, $\lambda_i \ge 0$, $\lambda_i g_i(x) = 0$, $h_j(x) = 0$.
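
For intuition, an inequality-constrained minimization of this kind can be handed to a general solver. A sketch using scipy's SLSQP on a hypothetical toy objective (not yet the SVM problem):

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize f(x) = (x1 - 2)^2 + (x2 - 1)^2
# subject to g(x) = 1 - x1 - x2 >= 0 (an inequality constraint).
def f(x):
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

cons = ({'type': 'ineq', 'fun': lambda x: 1.0 - x[0] - x[1]},)

res = minimize(f, x0=np.zeros(2), method='SLSQP', constraints=cons)
print(res.x)   # the constraint is active here: x1 + x2 = 1 at the optimum
```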

11 Background - Lagrange Multipliers (V): Constraints
Example with three constraints $f_1(\theta)$, $f_2(\theta)$, $f_3(\theta)$ on $\theta = (\theta_1, \theta_2)$.
A constraint is called inactive if the corresponding Lagrange multiplier is zero; a constraint $f_i$ is called active if the corresponding Lagrange multiplier is nonzero (for which $f_i(\theta^*) = 0$).

12 Background - Min-Max Duality
Primal: $\min_x \max_y F(x, y)$
Dual: $\max_y \min_x F(x, y)$
In general, $\max_y \min_x F(x, y) \le \min_x \max_y F(x, y)$.

13 Background - Lagrangian Duality (I)
$\min_x J(x)$ s.t. $g_i(x) \ge 0$
$L(x, \lambda_i) = J(x) - \sum_i \lambda_i g_i(x)$
Since $\lambda_i \ge 0$ and $g_i(x) \ge 0$, then $J(x) = \max_{\lambda \ge 0} L(x, \lambda_i)$; the original problem is equivalent to $\min_x \max_{\lambda \ge 0} L(x, \lambda)$.

14 Background - Lagrangian Duality (II): Convex Programming
Assume that $J(x)$ is convex and that the $g_i$ are convex. Then
$\min_x \max_{\lambda \ge 0} L(x, \lambda) = \max_{\lambda \ge 0} \min_x L(x, \lambda)$.

15 Background - Lagrangian Duality (III): Wolfe Dual Representation
A convex programming problem is equivalent to
$\max_{\lambda \ge 0} L(x, \lambda)$ subject to $\frac{\partial}{\partial x} L(x, \lambda) = 0$.

16 Support vector machines - Linearly separable classes (I)
The training set comprises $N$ input vectors $x_1, \dots, x_N$, with corresponding target values $y_1, \dots, y_N$, where $y_i \in \{-1, 1\}$. New data points $x$ will be classified according to the sign of the model $y(x) = w^T x + b$.
Margin: smallest distance between the decision boundary and any of the samples. In SVMs the decision boundary is chosen to be the one for which the margin is maximized.
(a) An example of a linearly separable two-class problem with multiple possible linear classifiers. (b) SVM option: maximize the margin (from [2]).

17 Support vector machines - Linearly separable classes (II)
The distance of a (correctly classified) training point $x_n$ to the decision surface is given by $\frac{y_n y(x_n)}{\|w\|} = \frac{y_n(w^T x_n + b)}{\|w\|}$.
Thus, we want $\operatorname{argmax}_{w,b} \left\{ \frac{1}{\|w\|} \min_n \left[ y_n (w^T x_n + b) \right] \right\}$.
Note that the solution to this problem is not completely well defined: if $(w, b)$ is a solution to this optimization problem, so is any solution of the form $(cw, cb)$, where $c > 0$ is a constant. We can impose an additional constraint to completely define the problem; the constraint can be chosen to simplify the formulation.

18 Support vector machines - Linearly separable classes (III)
Set $y_n(w^T x_n + b) = 1$ for the point closest to the surface. In this case all points satisfy $y_n(w^T x_n + b) \ge 1$, $n = 1, \dots, N$. We obtain the canonical representation:
$\operatorname{argmax}_{w,b} \frac{1}{\|w\|}$ subject to $y_n(w^T x_n + b) \ge 1$, $n = 1, \dots, N$
or, equivalently,
$\operatorname{argmin}_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_n(w^T x_n + b) \ge 1$, $n = 1, \dots, N$.
This is the primal formulation for the linearly separable case.
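
A minimal sketch of this primal hard-margin problem, assuming the cvxpy package is available; the toy data and variable names are my own:

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy set: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# argmin_{w,b} (1/2)||w||^2  s.t.  y_n (w^T x_n + b) >= 1 for all n.
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, b.value)
```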

19 Support vector machines - Linearly separable classes (IV)
Using the Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) conditions, we obtain the dual formulation. The Lagrangian is
$L(w, b, \lambda) = \frac{1}{2}\|w\|^2 - \sum_n \lambda_n \{y_n(w^T x_n + b) - 1\}$
$\partial L/\partial w = 0 \Rightarrow w - \sum_n \lambda_n y_n x_n = 0$
$\partial L/\partial b = 0 \Rightarrow \sum_n \lambda_n y_n = 0$
$\lambda_i \ge 0$, $i = 1, \dots, N$
$\lambda_i (y_i(w^T x_i + b) - 1) = 0$, $i = 1, \dots, N$
$y_i(w^T x_i + b) - 1 \ge 0$, $i = 1, \dots, N$

20 Support vector machines - Linearly separable classes (V)
After a bit of algebra we end up with the equivalent optimization task, called the dual representation:
$\operatorname{argmax}_\lambda \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j x_i^T x_j$
subject to $\sum_n \lambda_n y_n = 0$ and $\lambda_i \ge 0$, $i = 1, \dots, N$.
Note that we now have $N$ variables, instead of the initial $D$ variables ($D$ is the dimension of the data space). Note also that in the objective function the training data only show up in inner products.
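
The dual can be handed to any QP solver. A sketch with scipy for the hard-margin case (only $\lambda_n \ge 0$ and $\sum_n \lambda_n y_n = 0$); toy data and names are mine:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j x_i^T x_j; dual objective = sum(lam) - 0.5 lam^T Q lam.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(lam):
    return -(lam.sum() - 0.5 * lam @ Q @ lam)

cons = ({'type': 'eq', 'fun': lambda lam: lam @ y},)
bounds = [(0, None)] * N

res = minimize(neg_dual, np.zeros(N), method='SLSQP',
               bounds=bounds, constraints=cons)
lam = res.x
print(lam)   # nonzero entries correspond to the support vectors
```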

21 Support vector machines - Linearly separable classes (VI): classifying new data points
After finding the optimal $\lambda$, we can classify new data points by evaluating the sign of
$y(x) = x^T w + b = \sum_n \lambda_n y_n x^T x_n + b$   (1)
Any training point with $\lambda_n = 0$ plays no role in making predictions. Linear regression estimates a set of parameters and then the training points are not used for making predictions; k-nearest neighbours uses ALL training points to make predictions; SVMs are something in between: they use only the points with $\lambda_n \neq 0$ (the support vectors) to make predictions, i.e. they are sparse kernel machines.

22 Support vector machines - Linearly separable classes (VII): classifying new data points
Note that $w$ does not need to be computed explicitly... but we do need $b$. Any support vector satisfies (from the KKT conditions or from the very formulation of SVMs) $y_m y(x_m) = 1 \Leftrightarrow y_m(w^T x_m + b) = 1$. Using (1) we get $y_m\left(\sum_n \lambda_n y_n x_m^T x_n + b\right) = 1$, where $x_m$ is any support vector, hence $b = y_m - \sum_n \lambda_n y_n x_m^T x_n$.
It is numerically more stable to average over ALL support vectors:
$b = \frac{1}{N_S}\sum_{m \in S}\left(y_m - \sum_{n \in S}\lambda_n y_n x_m^T x_n\right)$, with $m$ and $n$ running only over the support vectors.
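
Continuing the previous sketch, $w$ and the averaged $b$ can be recovered from the optimal $\lambda$; this is again only a sketch, with a small tolerance (my choice) to decide which $\lambda_n$ count as support vectors:

```python
import numpy as np

def recover_w_b(lam, X, y, tol=1e-6):
    """Recover w and the averaged bias b from the optimal dual variables."""
    sv = lam > tol                       # indices of the support vectors
    w = (lam * y) @ X                    # w = sum_n lambda_n y_n x_n
    # b = (1/N_S) sum_{m in S} ( y_m - sum_{n in S} lambda_n y_n x_m^T x_n )
    K = X[sv] @ X[sv].T
    b = np.mean(y[sv] - K @ (lam[sv] * y[sv]))
    return w, b

# Hypothetical usage, with lam, X, y from the dual solution above:
#   w, b = recover_w_b(lam, X, y)
#   predictions = np.sign(X_new @ w + b)
```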

23 Support vector machines - Linearly separable classes (VIII): a geometric interpretation
Find the convex hull of both sets of points (the convex hull is the smallest convex set containing the points); find the two closest points in the two convex hulls; construct the plane that bisects these two points.
Figure: from [3]

24 Support vector machines - Linearly separable classes (IX): a geometric interpretation
The closest points in the convex hulls can be found by solving the following quadratic problem:
$\operatorname{argmin}_\alpha \frac{1}{2}\|c - d\|^2$ s.t. $c = \sum_{x_i \in C_1} \alpha_i x_i$, $d = \sum_{x_i \in C_2} \alpha_i x_i$, $\sum_{x_i \in C_1} \alpha_i = 1$, $\sum_{x_i \in C_2} \alpha_i = 1$, $\alpha_i \ge 0$, $i = 1, \dots, N$.
It can be shown to be equivalent to the maximum margin approach.

25 Support vector machines - Non-linearly separable classes (I)
When in possession of a dataset that is not linearly separable, a simple linear separator is bound to misclassify some points. The question we have to ask ourselves is whether the non-linearly separable data portrays some intrinsic property of the problem (in which case a more complex classifier, allowing more general boundaries between classes, may be more appropriate) or whether it can be interpreted as the result of noisy points (measurement errors, uncertainty in class membership, etc.), in which case keeping the linear separator and accepting some errors is more natural.
We will keep the linear boundary for now and accept errors in the training data.

26 Support vector machines - Non-linearly separable classes (II)
Allow some of the training points to be misclassified and penalize the number of misclassifications. Introduce slack variables $\xi_n \ge 0$, $n = 1, \dots, N$:
$\xi_n = 0$ if $x_n$ is on the correct side of the margin, and $\xi_n = |y_n - y(x_n)|$ otherwise.
Figure: from [1]

27 Support vector machines - Non-linearly separable classes (III)
Replace the exact constraints with $y_n y(x_n) \ge 1 - \xi_n \Leftrightarrow y_n(w^T x_n + b) \ge 1 - \xi_n$.
The goal could now be to minimize $\frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \operatorname{sign}(\xi_n)$. The parameter $C$ controls the trade-off between the dual objectives of maximizing the margin of separation and minimizing the misclassification error. For an error to occur, the corresponding $\xi_n$ must exceed unity, so $\sum_{n=1}^N \operatorname{sign}(\xi_n)$ is an upper bound on the number of training errors.
Optimization of the above is difficult since it involves the discontinuous function sign(); we optimize instead the closely related cost function $\frac{1}{2}\|w\|^2 + C\sum_{n=1}^N \xi_n$.

28 Support vector machines - Non-linearly separable classes (III)
Figure: from [1]

29 Support vector machines - Non-linearly separable classes (IV)
$L(w, b, \xi, \mu, \lambda) = \frac{1}{2}\|w\|^2 + C\sum_n \xi_n - \sum_n \lambda_n (y_n y(x_n) - 1 + \xi_n) - \sum_n \mu_n \xi_n$
KKT conditions:
$\lambda_n \ge 0$, $y_n y(x_n) - 1 + \xi_n \ge 0$, $\lambda_n (y_n y(x_n) - 1 + \xi_n) = 0$, $\mu_n \ge 0$, $\xi_n \ge 0$, $\mu_n \xi_n = 0$.

30 Support vector machines - Non-linearly separable classes (Va)
$\partial L/\partial w = 0 \Rightarrow w = \sum_n \lambda_n y_n x_n$
$\partial L/\partial b = 0 \Rightarrow \sum_n \lambda_n y_n = 0$
$\partial L/\partial \xi_n = 0 \Rightarrow \lambda_n = C - \mu_n$
Dual Lagrangian form:
$\operatorname{argmax}_\lambda L(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N \lambda_n \lambda_m y_n y_m x_m^T x_n$
subject to $0 \le \lambda_n \le C$ and $\sum_{n=1}^N \lambda_n y_n = 0$.
This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound $C$ on $\lambda_n$. A QP solver can be used to find $\lambda_n$.
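
In practice one rarely codes this QP by hand; scikit-learn's SVC solves exactly this box-constrained dual. A sketch, assuming scikit-learn is installed (data is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian blobs: not perfectly separable.
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# C is the upper bound on the dual variables lambda_n (0 <= lambda_n <= C).
clf = SVC(kernel='linear', C=1.0).fit(X, y)

print(clf.n_support_)        # number of support vectors per class
print(clf.dual_coef_)        # lambda_n * y_n for the support vectors
print(clf.intercept_)        # the bias b
```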

31 Support vector machines - Non-linearly separable classes (Vb): the SMO algorithm
Consider solving the unconstrained optimization problem $\max_\alpha W(\alpha_1, \dots, \alpha_m)$. Possible approaches: EM (to be discussed), gradient ascent, coordinate ascent.

32 Support vector machines - Non-linearly separable classes (Vc): coordinate ascent
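
A toy illustration of coordinate ascent (my own example, unconstrained): maximize a concave quadratic by exactly optimizing one coordinate at a time while holding the others fixed.

```python
import numpy as np

# Maximize W(a) = -a^T A a + b^T a (concave since A is positive definite).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 1.0])

a = np.zeros(2)
for it in range(20):
    for i in range(len(a)):
        # Exact maximization along coordinate i with the others fixed:
        # dW/da_i = -2 A_ii a_i - 2 sum_{j!=i} A_ij a_j + b_i = 0.
        rest = A[i] @ a - A[i, i] * a[i]
        a[i] = (b[i] - 2.0 * rest) / (2.0 * A[i, i])

print(a)   # converges to the maximizer 0.5 * A^{-1} b
```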

33 Support vector machines - Non-linearly separable classes (Vc): sequential minimal optimization
Constrained optimization:
$\operatorname{argmax}_\lambda L(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N \lambda_n \lambda_m y_n y_m x_m^T x_n$
subject to $0 \le \lambda_n \le C$ and $\sum_{n=1}^N \lambda_n y_n = 0$.
Question: can we do coordinate ascent along one direction at a time (i.e., hold all $\lambda_j$, $j \neq i$, fixed and update $\lambda_i$)?

34 Support vector machines - Non-linearly separable classes (Vd): the SMO algorithm
Repeat till convergence:
select some pair $\lambda_i$ and $\lambda_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum);
re-optimize $L(\lambda)$ with respect to $\lambda_i$ and $\lambda_j$, while holding all the other $\lambda_k$'s ($k \neq i, j$) fixed.

35 Support vector machines - Non-linearly separable classes (VI)
Predictions: $y(x) = \sum_{n=1}^N \lambda_n y_n x^T x_n + b$, with
$b = \frac{1}{N_M}\sum_{n \in M}\left(y_n - \sum_{m \in S}\lambda_m y_m x_m^T x_n\right)$,
where $M$ denotes the set of indices of data having $0 < \lambda_n < C$ and $S$ denotes the set of indices of data having $\lambda_n \neq 0$ (the set of support vectors).

36 Support vector machines - Non-linearly separable classes (VII): a geometric interpretation
Find the reduced convex hull of both sets of points (the convex hull is the smallest convex set containing the points); find the two closest points in the two reduced convex hulls; construct the plane that bisects these two points.
(a) An example of two non-linearly separable datasets (from [3]). (b) Reduced convex hull approach (from [3]).

37 Support vector machines - Non-linearly separable classes (VIII): a geometric interpretation
The closest points in the reduced convex hulls can be found by solving the following quadratic problem:
$\operatorname{argmin}_\alpha \frac{1}{2}\|c - d\|^2$ s.t. $c = \sum_{x_i \in C_1} \alpha_i x_i$, $d = \sum_{x_i \in C_2} \alpha_i x_i$, $\sum_{x_i \in C_1} \alpha_i = 1$, $\sum_{x_i \in C_2} \alpha_i = 1$, $0 \le \alpha_i \le D$, $i = 1, \dots, N$.
It can be shown to be equivalent to the maximum margin approach.

38 Support vector machines - Non-linearly separable classes (IX)
Show that, for a linearly separable dataset, the solution for the formulation with the $\xi$'s may be different from the solution for the formulation without the $\xi$'s.
What is the solution for the formulation without the $\xi$'s for a non-linearly separable dataset?

39 Support vector machines - Non-linearly separable classes (X)
Dual Lagrangian form: $L(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N \lambda_n \lambda_m y_n y_m x_m^T x_n$, subject to $0 \le \lambda_n \le C$ and $\sum_{n=1}^N \lambda_n y_n = 0$.
$b = \frac{1}{N_M}\sum_{m \in M}\left(y_m - \sum_{s \in S}\lambda_s y_s x_s^T x_m\right)$
Predictions: $y(x) = \sum_{n=1}^N \lambda_n y_n x^T x_n + b$
Notes: there are $N$ unknowns (the $\lambda$'s); the input vectors $x$ enter only in the form of scalar products; the dimension of the data has almost no impact on the complexity.

40 Support vector machines - The design of nonlinear classifiers
The design of a nonlinear classifier for the original data space $x \in \mathbb{R}^d$ can be accomplished with the design of a linear classifier in a suitable space $\hat{x} \in \mathbb{R}^D$; usually $D > d$, but that is not always necessary.
Examples:
1. the search for a boundary $x_1^2 + x_2^2 = \text{const}$ in $\mathbb{R}^2$ can be carried out as the search for a linear boundary in the space defined by the variables $r = \sqrt{x_1^2 + x_2^2}$ and $\theta = \arctan(x_2/x_1)$; in fact it could be simplified to the search in $\mathbb{R}$ defined just by the variable $r$;
2. the search for a boundary $w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2 + w_6 = 0$ in $\mathbb{R}^2$ can be carried out as the search for a linear boundary in the space defined by the variables $\hat{x}_1 = x_1$, $\hat{x}_2 = x_2$, $\hat{x}_3 = x_1 x_2$, $\hat{x}_4 = x_1^2$, $\hat{x}_5 = x_2^2$.

41 Support vector machines - The design of nonlinear classifiers
To design a nonlinear SVM we just define a new space $\Phi(x)$ where our problem becomes linear and apply the linear SVM in the new space $\Phi(x)$ (the linearly separable version or the non-linearly separable version). The changes in our formulations are minimal: just replace $x_i^T x_j$ by $\Phi(x_i)^T \Phi(x_j)$. Define $k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. The nonlinear SVM becomes:
$L(\lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N \lambda_n \lambda_m y_n y_m k(x_m, x_n)$, subject to $0 \le \lambda_n \le C$ and $\sum_{n=1}^N \lambda_n y_n = 0$
$b = \frac{1}{N_M}\sum_{m \in M}\left(y_m - \sum_{s \in S}\lambda_s y_s k(x_s, x_m)\right)$
Predictions: $y(x) = \sum_{n=1}^N \lambda_n y_n k(x, x_n) + b$
The new space $\Phi(x)$ should capture the nonlinearity intrinsic to the data (and not the noise). In the simplest case $\Phi(x) = x$ and $k(x_i, x_j) = x_i^T x_j$.
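
A sketch of the kernelised SVM in scikit-learn; kernel='rbf' corresponds to the Gaussian kernel discussed later, and kernel='precomputed' lets you pass the Gram matrix $k(x_i, x_j)$ directly. The dataset and parameters below are arbitrary choices of mine.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# A radially separable problem: class depends on the distance to the origin.
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

# Gaussian (RBF) kernel SVM; gamma plays the role of the kernel width.
clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, y)
print(clf.score(X, y))

# Equivalent route with a precomputed Gram matrix K_ij = k(x_i, x_j).
K = np.exp(-1.0 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
clf2 = SVC(kernel='precomputed', C=1.0).fit(K, y)
print(clf2.score(K, y))
```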

42 Kernel methods - The design of nonlinear classifiers
It would be nice to be able to compute $k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$ without needing to work explicitly in the transformed space $\Phi(x)$.
Mercer's theorem: let $x \in \mathbb{R}^d$ and a mapping $\Phi$, $x \mapsto \Phi(x) \in H$, where $H$ is a Euclidean space (a linear space equipped with an inner product). Then $\Phi(x_1)^T \Phi(x_2) = k(x_1, x_2)$, where $k(x_1, x_2)$ is a symmetric function satisfying $\int\!\!\int k(x_1, x_2)\, g(x_1)\, g(x_2)\, dx_1\, dx_2 \ge 0$ for any $g(x)$, $x \in \mathbb{R}^d$, such that $\int g^2(x)\, dx < \infty$.
Note that Mercer's theorem does not tell us how to find this space. A necessary and sufficient condition for a function $k(x, x')$ to be a valid kernel is that the Gram matrix $K$, whose elements are given by $k(x_n, x_m)$, should be a positive semidefinite matrix for all possible choices of the set $\{x_n\}$.
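
A quick numerical sanity check (a sketch, not a proof): build the Gram matrix of a candidate kernel on random points and inspect its eigenvalues; for a valid kernel they should all be non-negative up to round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

def gaussian_gram(X, gamma=0.5):
    # K_nm = exp(-gamma * ||x_n - x_m||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = gaussian_gram(X)
eigvals = np.linalg.eigvalsh(K)   # symmetric matrix, so eigvalsh is appropriate
print(eigvals.min())              # ~>= 0 (up to numerical precision)
```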

43 Kernel methods - SVM examples

44 Kernel methods - SVM examples

45 Kernel methods - The design of nonlinear classifiers
Many models can be reformulated in terms of a dual representation in which the kernel function arises naturally: linear regression, kernel PCA, nearest neighbour, etc.

46 Beyond basic Linear Regression - Dual variables in regression
$(X^T X + \lambda I) w = X^T y \Rightarrow w = \lambda^{-1} X^T (y - Xw) = X^T \alpha$, showing that $w$ can be written as a linear combination of the training points, $w = \sum_{n=1}^N \alpha_n x_n$, with $\alpha = \lambda^{-1}(y - Xw)$. The elements of $\alpha$ are called the dual variables.
$\alpha = \lambda^{-1}(y - Xw) \Rightarrow \lambda \alpha = y - X X^T \alpha \Rightarrow (X X^T + \lambda I)\alpha = y \Rightarrow \alpha = (G + \lambda I)^{-1} y$, with $G = X X^T$ or, component-wise, $G_{ij} = \langle x_i, x_j \rangle$.
The resulting prediction function is given by $g(x) = \langle w, x \rangle = \left\langle \sum_{n=1}^N \alpha_n x_n, x \right\rangle = \sum_n \alpha_n \langle x_n, x \rangle$.
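
A sketch of this dual view of ridge regression in numpy: $\alpha = (G + \lambda I)^{-1} y$ with $G = XX^T$, and predictions use only inner products with the training points. The data and the value of $\lambda$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1

# Dual solution: alpha = (G + lambda I)^{-1} y, with G = X X^T.
G = X @ X.T
alpha = np.linalg.solve(G + lam * np.eye(len(y)), y)

# Prediction g(x) = sum_n alpha_n <x_n, x>; here for a new point x_new.
x_new = rng.normal(size=3)
g = alpha @ (X @ x_new)

# Cross-check against the primal solution w = (X^T X + lambda I)^{-1} X^T y.
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(g, w @ x_new)   # the two predictions should coincide
```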

47 Kernel methods - An example of a kernel algorithm
Idea: classify points $x$ in the feature space according to which of the two class means is closer.
$c_+ = \frac{1}{N_+}\sum_{i: y_i = +1} x_i$, $c_- = \frac{1}{N_-}\sum_{i: y_i = -1} x_i$
Compute the sign of the dot product between $w = c_+ - c_-$ and $x - c$, where $c$ is the midpoint between the class means.
Figure: from [?]

48 Kernel methods - An example of a kernel algorithm
$f(x) = \operatorname{sgn}\left(\frac{1}{N_+}\sum_{i: y_i = +1}\langle x, x_i\rangle - \frac{1}{N_-}\sum_{i: y_i = -1}\langle x, x_i\rangle + b\right) = \operatorname{sgn}\left(\frac{1}{N_+}\sum_{i: y_i = +1} k(x, x_i) - \frac{1}{N_-}\sum_{i: y_i = -1} k(x, x_i) + b\right)$
where $b = \frac{1}{2}\left(\frac{1}{N_-^2}\sum_{(i,j): y_i = y_j = -1} k(x_i, x_j) - \frac{1}{N_+^2}\sum_{(i,j): y_i = y_j = +1} k(x_i, x_j)\right)$.
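
A direct numpy implementation of this rule (a sketch of my own; with the default linear kernel $k(x, x') = \langle x, x'\rangle$ it reduces to comparing distances to the two class means):

```python
import numpy as np

def kernel_mean_classifier(X, y, x_new, k=lambda a, b: a @ b.T):
    """Classify x_new by which class mean is closer in the kernel feature space."""
    Xp, Xm = X[y == 1], X[y == -1]
    Np, Nm = len(Xp), len(Xm)
    # Offset b = 1/2 ( 1/Nm^2 sum k(x-, x-) - 1/Np^2 sum k(x+, x+) )
    b = 0.5 * (k(Xm, Xm).sum() / Nm**2 - k(Xp, Xp).sum() / Np**2)
    f = k(x_new[None, :], Xp).sum() / Np - k(x_new[None, :], Xm).sum() / Nm + b
    return np.sign(f)

# Toy usage with two well separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
print(kernel_mean_classifier(X, y, np.array([1.5, 2.0])))   # expected: 1.0
```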

49 Kernel methods - Typical kernels
Any algorithm that only depends on dot products can benefit from the kernel trick. Think of the kernel as a nonlinear similarity measure. Examples of common kernels:
linear: $k(x_1, x_2) = x_1^T x_2$
polynomial: $k(x_1, x_2) = (x_1^T x_2 + c)^d$
Gaussian: $k(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$
Kernels are also known as covariance functions.
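
The three kernels written out as plain functions (a sketch, vectorised over rows so each returns a full Gram matrix):

```python
import numpy as np

def linear_kernel(X1, X2):
    return X1 @ X2.T

def polynomial_kernel(X1, X2, c=1.0, d=3):
    return (X1 @ X2.T + c) ** d

def gaussian_kernel(X1, X2, gamma=0.5):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

# Example: Gram matrices on a small random sample.
rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
print(linear_kernel(A, B).shape, polynomial_kernel(A, B).shape, gaussian_kernel(A, B).shape)
```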

50 Kernel methods - Constructing kernels
Given valid kernels $k_1(x, x')$ and $k_2(x, x')$, the following new kernels will also be valid:
$k(x, x') = c\, k_1(x, x')$, $c \in \mathbb{R}^+$
$k(x, x') = f(x)\, k_1(x, x')\, f(x')$, where $f(\cdot)$ is any function
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = \exp(k_1(x, x'))$

51 Regression Support Vector Machine
$\min_{w,b}\; C\sum_n E_\epsilon(y_n - f(x_n)) + \frac{\lambda}{2}\|w\|^2$, with the $\epsilon$-insensitive error function
$E_\epsilon(z) = 0$ if $|z| < \epsilon$, and $E_\epsilon(z) = |z| - \epsilon$ otherwise.
(a) $\epsilon$-insensitive error function. (b) RVM result. Figure: from [1]
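
A sketch of $\epsilon$-insensitive support vector regression with scikit-learn on sinusoidal toy data of my own; epsilon is the width of the insensitive tube and C the error penalty.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

# epsilon-insensitive loss: errors smaller than epsilon are not penalised.
reg = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)

print(len(reg.support_))          # points outside the epsilon-tube become SVs
print(reg.predict(X[:5]))
```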

52 Multiclass Support Vector Machine - Reduction techniques
Conventional approaches:
One-against-all: $K$ two-class problems
Pairwise: $K(K-1)/2$ two-class problems
Decision-tree-based
DAG (Directed Acyclic Graph)
Error-correcting output codes

53 Multiclass Support Vector Machine - Reduction techniques
Conventional approaches:
apply binary classifier 1 to the test example and get prediction $F_1$ (0/1);
apply binary classifier 2 to the test example and get prediction $F_2$ (0/1);
...
apply binary classifier $M$ to the test example and get prediction $F_M$ (0/1);
use all $M$ classifications to get the final multiclass classification in $1..K$.
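
A minimal one-against-all sketch: train $K$ binary SVMs, one per class, and pick the class whose classifier gives the largest decision value. The scikit-learn dataset and estimator are used only for convenience and are my own choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary SVM per class: class k vs. the rest.
models = [SVC(kernel='linear', C=1.0).fit(X, np.where(y == k, 1, -1)) for k in classes]

# Final decision: the class whose binary classifier is most confident.
scores = np.column_stack([m.decision_function(X) for m in models])
y_pred = classes[np.argmax(scores, axis=1)]
print((y_pred == y).mean())
```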

54 Multiclass Support Vector Machine - Reduction techniques
A lesser-known approach solves multiclass problems via a single binary classifier. The reduction is obtained by embedding the original problem in a higher-dimensional space consisting of the original features, as well as one or more other dimensions determined by fixed vectors, termed here extension features. This embedding is implemented by replicating the training set points so that a copy of the original point is concatenated with each of the extension feature vectors. The binary labels of the replicated points are set to maintain a particular structure in the extended space. This construction results in an instance of an artificial binary problem, which is fed to a binary learning algorithm that outputs a single soft binary classifier. To classify a new point, the point is replicated and extended similarly and the resulting replicas are fed to the soft binary classifier, which generates a number of signals, one for each replica. The class is determined as a function of these signals.

55 Multiclass Support Vector Machine - All-at-Once Support Vector Machines
For a $K$-class problem we define the decision rule for class $i$ by $w_i^T x + b_i > w_k^T x + b_k$, for $k \neq i$, $k = 1, \dots, K$.
The L1 soft-margin support vector machine can be obtained by minimizing
$\min_{w,b,\xi} \frac{1}{2}\sum_{k=1}^K \|w_k\|^2 + C\sum_{n=1}^N \sum_{k=1, k \neq y_n}^K \xi_{n,k}$
subject to $(w_{y_n} - w_k)^T x_n + b_{y_n} - b_k \ge 1 - \xi_{n,k}$, for $k \neq y_n$, $k = 1, \dots, K$, $n = 1, \dots, N$.

56 References
[1] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[2] Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Elsevier, Academic Press.
[3] Kristin P. Bennett and Colin Campbell, Support Vector Machines: Hype or Hallelujah?, SIGKDD Explorations, vol. 2, 2000.
[4] Michael E. Mavroforakis and Sergios Theodoridis, A Geometric Approach to Support Vector Machine (SVM) Classification, IEEE Transactions on Neural Networks, 2006.
