Introduction to Machine Learning NPFL 054
1 Introduction to Machine Learning NPFL 054
Barbora Hladká, Martin Holub
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
NPFL054, 2016, Lecture 7
2 Outline
- Evaluation of binary classifiers (continued): ROC curve
- Support Vector Machines (SVM)
3 Support Vector Machines: basic idea of SVM for binary classification tasks
We find a hyperplane that separates the two classes in the feature space. If that is not possible, we either allow some training errors, or enrich the feature space so that a separating hyperplane exists.
4 Support Vector Machines
Three key ideas:
- Maximizing the margin
- Duality of the optimization task
- Kernels
Key concepts needed:
- Hyperplane
- Dot product
- Quadratic programming
5 Hyperplane
A hyperplane of an m-dimensional space is a subspace of dimension m − 1.
Mathematical definition: $\Theta_0 + \Theta^T x = 0$, where $\Theta = \langle \Theta_1, \dots, \Theta_m \rangle$.
- If m = 2, a hyperplane is a line.
- $\Theta$ is a normal vector of the hyperplane.
- If x satisfies the equation, then it lies on the hyperplane.
- If $\Theta_0 + \Theta^T x > 0$, then x lies on one side of the hyperplane; if the value is negative, x lies on the other side.
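A minimal sketch in Python (the hyperplane parameters below are made up for illustration): the sign of $\Theta_0 + \Theta^T x$ tells on which side of the hyperplane a point lies.

```python
import numpy as np

# Hypothetical 2-D example: the hyperplane Theta_0 + Theta^T x = 0
# with Theta = (1, -2) and Theta_0 = 3 is a line in the plane.
theta = np.array([1.0, -2.0])
theta_0 = 3.0

def side(x):
    """Return +1, -1, or 0 depending on which side of the hyperplane x lies."""
    return np.sign(theta_0 + theta @ x)

print(side(np.array([0.0, 0.0])))   # +1 : one side
print(side(np.array([0.0, 5.0])))   # -1 : the other side
print(side(np.array([1.0, 2.0])))   #  0 : on the hyperplane (3 + 1 - 4 = 0)
```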
6 Hyperplane (figure)
7 Hyperplane: separating hyperplane (figure)
8 Hyperplane: point-hyperplane distance
Distance of x to the hyperplane $\Theta_0 + \Theta^T x = 0$:
$$\rho(x) = \frac{|\Theta_0 + \Theta^T x|}{\lVert \Theta \rVert}$$
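A small numeric sketch of this formula (the hyperplane and points are made up):

```python
import numpy as np

# Distance of a point to the hyperplane Theta_0 + Theta^T x = 0,
# i.e. |Theta_0 + Theta^T x| / ||Theta||.
theta = np.array([3.0, 4.0])   # ||theta|| = 5
theta_0 = -5.0

def distance(x):
    return abs(theta_0 + theta @ x) / np.linalg.norm(theta)

print(distance(np.array([0.0, 0.0])))  # 1.0 : |-5| / 5
print(distance(np.array([3.0, 4.0])))  # 4.0 : |25 - 5| / 5
```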
9 Dot product
For $x \in \mathbb{R}^m$:
- length of x (2-norm): $\lVert x \rVert = \sqrt{\sum_{i=1}^m x_i^2}$
- dot product of two vectors $x_1, x_2 \in \mathbb{R}^m$: $x_1 \cdot x_2 = \sum_{i=1}^m x_{1i} x_{2i}$
- $x_1 \cdot x_2 = \lVert x_1 \rVert \, \lVert x_2 \rVert \cos \alpha$
- geometric interpretation of $x_1 \cdot x_2$: the length of the projection of $x_1$ onto the unit vector $x_2$ (when $\lVert x_2 \rVert = 1$)
- $x \cdot x = \lVert x \rVert^2$
10 Dot product: projection onto a unit vector, $\lVert x_2 \rVert = 1$ (figure)
11 Quadratic programming
Quadratic programming is the problem of optimizing a quadratic function of several variables subject to linear constraints on these variables.
12 Support Vector Machines: binary classification task, Y = {+1, −1}
The hypothesis h has the form $h(x) = \mathrm{sgn}(\Theta_0 + \Theta^T x)$.
13 Support Vector Machines: binary classification task, Y = {+1, −1}
Outline:
1. Large margin classifier (linearly separable data)
2. Soft margin classifier (data not linearly separable)
3. Kernels (non-linear class boundaries)
14 Support Vector Machines: binary classification task, Y = {+1, −1}
A data set $\mathrm{Data} = \{\langle x_i, y_i \rangle : x_i \in X, y_i \in \{-1, +1\}\}$ is linearly separable if there exists a hyperplane such that all instances from Data are classified correctly.
15 Support Vector Machines: binary classification task, Y = {+1, −1}
Assume a hyperplane g: $\Theta_0 + \Theta^T x = 0$.
- Margin of x w.r.t. g is the distance of x to g: $\rho_g(x) = \frac{|\Theta_0 + \Theta^T x|}{\lVert \Theta \rVert}$
- Functional margin of $\langle x, y \rangle \in \mathrm{Data}$ w.r.t. g (written $\hat\rho_g$ here to distinguish it from the geometric margin): $\hat\rho_g(x, y) = y(\Theta_0 + \Theta^T x)$. Its sign tells whether x is classified correctly or not; a large functional margin represents a correct and confident classification.
- Geometric margin of $\langle x, y \rangle \in \mathrm{Data}$ w.r.t. g is the functional margin scaled by $\lVert \Theta \rVert$: $\rho_g(x, y) = \hat\rho_g(x, y) / \lVert \Theta \rVert$
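A small numeric sketch (the hyperplane and the labelled points are made up): functional and geometric margins of individual points, and the margin of the whole data set as the minimum over its points.

```python
import numpy as np

# Margins with respect to the hyperplane Theta_0 + Theta^T x = 0.
theta = np.array([2.0, 1.0])
theta_0 = -4.0

def functional_margin(x, y):
    return y * (theta_0 + theta @ x)

def geometric_margin(x, y):
    return functional_margin(x, y) / np.linalg.norm(theta)

data = [(np.array([3.0, 2.0]), +1),   # 6 + 2 - 4 = 4  -> correct and confident
        (np.array([1.0, 1.0]), -1),   # 2 + 1 - 4 = -1 -> correct
        (np.array([3.0, 0.0]), -1)]   # 6 - 4 = 2      -> misclassified (negative margin)

for x, y in data:
    print(functional_margin(x, y), geometric_margin(x, y))

# Margin of the whole data set = minimum margin over all points.
print(min(geometric_margin(x, y) for x, y in data))
```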
16 Geometric margin of x (figure)
17 Support Vector Machines: binary classification task, Y = {+1, −1}
Geometric margin of Data w.r.t. g: $\rho_g(\mathrm{Data}) = \min_{\langle x, y \rangle \in \mathrm{Data}} \rho_g(x, y)$
18 Support Vector Machines: binary classification task, Y = {+1, −1}
Functional margin of Data w.r.t. g: $\hat\rho_g(\mathrm{Data}) = \min_{\langle x, y \rangle \in \mathrm{Data}} \hat\rho_g(x, y)$
The functional margin of the training set is the functional margin of the closest points.
19 Large Margin Classifier: training data is linearly separable
We look for the hyperplane $g^* = \arg\max_g \rho_g(\mathrm{Data})$.
20 Large Margin Classifier: training data is linearly separable
$\Theta_0 + \Theta^T x = 0$ and $k\Theta_0 + (k\Theta)^T x = 0$ define the same hyperplane, i.e. the distances to the separator are unchanged:
$$\frac{y_i(\Theta_0 + \Theta^T x_i)}{\lVert \Theta \rVert} = \frac{y_i(k\Theta_0 + (k\Theta)^T x_i)}{\lVert k\Theta \rVert}$$
Thus we can rescale $\Theta$ so that the functional margin $\hat\rho_g(\mathrm{Data}) = 1$. Then
$$g^* = \arg\max_g \rho_g(\mathrm{Data}) = \arg\max_g \frac{1}{\lVert \Theta \rVert}$$
21 Large Margin Classifier: training data is linearly separable (figure)
22 Large Margin Classifier: training data is linearly separable
Goal: orient the separating hyperplane so that it is as far as possible from the closest instances of both classes:
$$\Theta^* = \arg\max_\Theta \frac{1}{\lVert \Theta \rVert}$$
Support vectors are the instances touching the margins.
23 Large Margin Classifier: training data is linearly separable (figure)
24 Large Margin Classifier: training data is linearly separable
$$\Theta^* = \arg\max_\Theta \frac{1}{\lVert \Theta \rVert} = \arg\min_\Theta \frac{1}{2} \lVert \Theta \rVert^2$$
25 Large Margin Classifier: training data is linearly separable
Primal problem: minimize
$$\frac{1}{2} \lVert \Theta \rVert^2$$
subject to the constraints $y_i(\Theta_0 + \Theta^T x_i) \geq 1$, $i = 1, \dots, n$.
Properties:
1. Convex optimization
2. Unique solution for linearly separable training data
26 Large Margin Classifier: training data is linearly separable
For each training example $\langle x_i, y_i \rangle$ introduce a Lagrange multiplier $\alpha_i \geq 0$, and let $\alpha = \langle \alpha_1, \dots, \alpha_n \rangle$. The primal Lagrangian $L(\Theta, \Theta_0, \alpha)$ is given by
$$L(\Theta, \Theta_0, \alpha) = \frac{1}{2} \lVert \Theta \rVert^2 - \sum_i \alpha_i \big( y_i(\Theta_0 + \Theta^T x_i) - 1 \big) \qquad (1)$$
subject to the constraints $\alpha_i \big[ y_i(\Theta_0 + \Theta^T x_i) - 1 \big] = 0$, $i = 1, \dots, n$.
We wish to find the $\Theta$ and $\Theta_0$ that minimize L and the $\alpha$ that maximizes L.
27 Large Margin Classifier: training data is linearly separable
1. Minimize L w.r.t. $\Theta$: differentiate L w.r.t. $\Theta$ and set $\frac{\partial L}{\partial \Theta} = 0$. This gives
$$\Theta = \sum_{i=1}^n \alpha_i y_i x_i \qquad (2)$$
2. Minimize L w.r.t. $\Theta_0$: differentiate L w.r.t. $\Theta_0$ and set $\frac{\partial L}{\partial \Theta_0} = 0$. This gives
$$\sum_{i=1}^n \alpha_i y_i = 0 \qquad (3)$$
28 Large Margin Classifier: training data is linearly separable
3. Substitute (2) into the primal form (1). Then L reduces to a function of $\alpha$ only:
$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to $\alpha_i \geq 0$, $i = 1, \dots, n$, and $\sum_i \alpha_i y_i = 0$.
29 Large Margin Classifier: training data is linearly separable
4. Solve the dual problem, i.e. maximize the quadratic function
$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$$
subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$, $i = 1, \dots, n$, by quadratic programming.
5. Get $\alpha^*$.
6. Then $\Theta^* = \sum_{i=1}^n \alpha_i^* y_i x_i$.
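A sketch of steps 4-6 in Python on a tiny made-up data set, assuming the cvxopt package is available as the quadratic-programming solver (any QP solver would do); the small ridge added to P is only for numerical stability and is not part of the slides' derivation.

```python
import numpy as np
from cvxopt import matrix, solvers

# Tiny, linearly separable toy data (made up): two positive, two negative points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                                        # Gram matrix of dot products x_i^T x_j
P = matrix(np.outer(y, y) * K + 1e-8 * np.eye(n))  # QP form: min 1/2 a^T P a - 1^T a
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))                             # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))                       # equality constraint sum_i a_i y_i = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])   # step 5: get alpha*

theta = (alpha * y) @ X                            # step 6: Theta* = sum_i a_i y_i x_i
sv = alpha > 1e-6                                  # support vectors have a_i > 0
theta_0 = np.mean(y[sv] - X[sv] @ theta)           # from y_i (Theta_0 + Theta^T x_i) = 1
print("alpha =", alpha.round(3))
print("Theta =", theta, " Theta_0 =", theta_0, " support vectors:", np.where(sv)[0])
```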
30 Large Margin Classifier: training data is linearly separable
$\Theta^*$ is the solution to the primal problem, $\alpha^*$ is the solution to the dual problem. $\Theta^*$ and $\alpha^*$ satisfy the Karush-Kuhn-Tucker conditions, one of which is the so-called Karush-Kuhn-Tucker dual complementarity condition:
$$\alpha_i \big( 1 - y_i(\Theta_0 + \Theta^T x_i) \big) = 0$$
- $y_i(\Theta_0 + \Theta^T x_i) > 1$ ($x_i$ is not a support vector) $\Rightarrow$ $\alpha_i = 0$
- $\alpha_i > 0$ $\Rightarrow$ $y_i(\Theta_0 + \Theta^T x_i) = 1$ ($x_i$ is a support vector)
I.e., finding $\Theta$ is equivalent to finding the support vectors and their weights.
31 Large Margin Classifier: training data is linearly separable
Prediction for a new instance x:
$$h(x) = \mathrm{sgn}\Big( \sum_{i=1}^n \alpha_i y_i (x_i \cdot x) + \Theta_0 \Big)$$
- $x_i \cdot x$ measures the similarity between x and the support vector $x_i$: a support vector that is more similar contributes more to the classification.
- A support vector that is more important, i.e. has a larger $\alpha_i$, contributes more to the classification.
- If $y_i$ is positive, then the contribution is positive, otherwise negative.
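Continuing the hypothetical sketch above: the prediction rule uses only dot products with the training points, and non-support vectors have $\alpha_i \approx 0$, so they do not influence the decision.

```python
# Prediction h(x) = sgn(sum_i alpha_i y_i (x_i . x) + Theta_0), reusing
# X, y, alpha and theta_0 from the quadratic-programming sketch above.
def predict(x_new):
    score = np.sum(alpha * y * (X @ x_new)) + theta_0
    return int(np.sign(score))

print(predict(np.array([3.0, 2.0])))   # +1 : falls on the positive side
print(predict(np.array([0.5, 0.5])))   # -1 : falls on the negative side
```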
32 Soft Margin Classifier: training data is not linearly separable
In a real problem it is unlikely that a hyperplane will exactly separate the data, and even if a curved decision boundary is possible, exactly separating the data is probably not desirable: if the data has noise and outliers, a smooth decision boundary that ignores a few data points is better than one that loops around the outliers.
Thus we minimize $\lVert \Theta \rVert^2$ AND the number of training mistakes.
33 Soft Margin Classifier: training data is not linearly separable (figure)
34 Soft Margin Classifier: training data is not linearly separable
Introduce slack variables $\xi_i \geq 0$:
- $\xi_i = 0$ if $x_i$ is correctly classified,
- otherwise $\xi_i$ is the distance of $x_i$ to "its" supporting hyperplane,
- $0 < \xi_i \leq 1/\lVert \Theta \rVert$: margin violation,
- $\xi_i > 1/\lVert \Theta \rVert$: misclassification.
35 Soft Margin Classifier: training data is not linearly separable
Primal problem: minimize
$$\frac{1}{2} \lVert \Theta \rVert^2 + C \sum_{i=1}^n \xi_i$$
subject to the constraints $y_i(\Theta_0 + \Theta^T x_i) \geq 1 - \xi_i$, $i = 1, \dots, n$.
$C \geq 0$ is a trade-off parameter: a small C yields a large margin, a large C yields a narrow margin.
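A brief illustration of the trade-off parameter C, assuming scikit-learn is available; the two-class data are randomly generated just for this example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_)   # width of the margin band
    print(f"C={C:>6}: margin width = {margin_width:.3f}, "
          f"support vectors = {clf.n_support_.sum()}")
```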
36 Soft Margin Classifier: training data is not linearly separable
Do quadratic programming as for the Large Margin Classifier. Prediction for a new instance x:
$$h(x) = \mathrm{sgn}\Big( \sum_{i=1}^n \alpha_i y_i (x_i \cdot x) + \Theta_0 \Big)$$
37 Support Vector Machines: non-linear boundary (figure: the examples are separated by a non-linear region)
38 Support Vector Machines: non-linear boundary
Recall polynomial regression: an extension of linear regression where the relationship between features and target value is modelled as a d-th order polynomial.
- Simple regression: $y = \Theta_0 + \Theta_1 x_1$
- Polynomial regression: $y = \Theta_0 + \Theta_1 x_1 + \Theta_2 x_1^2 + \dots + \Theta_d x_1^d$
This defines a feature mapping $\varphi(x_1) = [x_1, x_1^2, \dots, x_1^d]$. It is still a linear model, with features $A_1, A_1^2, \dots, A_1^d$.
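A minimal sketch of this feature mapping (the coefficients are hypothetical): a polynomial model in $x_1$ is a linear model in the mapped features.

```python
import numpy as np

# Feature mapping phi(x1) = [x1, x1^2, ..., x1^d].
def phi(x1, d):
    return np.array([x1 ** k for k in range(1, d + 1)])

x1 = 3.0
theta_0, theta = 1.0, np.array([0.5, -0.2, 0.1])   # hypothetical coefficients, d = 3
y = theta_0 + theta @ phi(x1, d=3)                 # 1 + 0.5*3 - 0.2*9 + 0.1*27
print(phi(x1, d=3), y)                             # [ 3.  9. 27.]  3.4
```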
39 Support Vector Machines: kernels
Idea: apply the Large/Soft margin classifier not to the original features but to the features obtained by a feature mapping $\varphi: \mathbb{R}^m \to F$. The Large/Soft margin classifier uses the dot product $x_i \cdot x_j$; now replace it with $\varphi(x_i) \cdot \varphi(x_j)$.
40 Support Vector Machines: kernels
However, computing $\varphi$ could be expensive.
Kernel trick: no need to know what $\varphi$ is or what the feature space is, i.e. no need to explicitly map the data into the feature space.
- Define a kernel function $K: \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$.
- Replace the dot product $x_i \cdot x_j$ with the kernel function $K(x_i, x_j)$:
$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
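A small check of the kernel trick: for 2-dimensional inputs, the kernel $K(x, z) = (x \cdot z)^2$ equals the dot product of the explicit feature map $\varphi(x) = [x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]$, so the kernel can be evaluated without ever forming $\varphi$ (this particular map is a standard textbook example, not taken from the slides).

```python
import numpy as np

def K(x, z):
    """Quadratic kernel: evaluated directly on the original 2-D inputs."""
    return (x @ z) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces K."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), phi(x) @ phi(z))   # both give 1.0
```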
41 Support Vector Machines: kernels (figure). Source: 8008/machine-learning-dir/notes-dir/ker1/ker1-l.html
42 Support Vector Machines: kernels
Common kernel functions:
- Linear: $K(x_i, x_j) = x_i^T x_j$
- Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + c)^d$
- Radial basis function: $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$
- Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + c)$, where $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
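Direct implementations of the four kernels above; the values of $\gamma$, c and d are hyperparameters, and the ones used here are arbitrary.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, gamma=1.0, c=1.0, d=3):
    return (gamma * (xi @ xj) + c) ** d

def rbf(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid(xi, xj, gamma=0.1, c=0.0):
    return np.tanh(gamma * (xi @ xj) + c)

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear(xi, xj), polynomial(xi, xj), rbf(xi, xj), sigmoid(xi, xj))
```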
43 Support Vector Machines: multiclass classification tasks
One-vs-one: train $\binom{K}{2}$ binary SVM classifiers, one per pair of classes. Classify x using each of the $\binom{K}{2}$ classifiers; the instance is assigned to the class that is predicted most frequently in the pairwise classifications.
44 Support Vector Machines: multiclass classification tasks
One-vs-all: train K binary SVM classifiers. Each of them separates the k-th class (+1) from all the others (−1) and is characterized by the hypothesis parameters $\Theta_k$, $k = 1, \dots, K$. The instance x is assigned to the class $k^* = \arg\max_k \Theta_k^T x$.
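A brief sketch of both strategies, assuming scikit-learn: its OneVsOneClassifier and OneVsRestClassifier wrappers build the pairwise and one-vs-all ensembles around a binary linear SVM.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # K = 3 classes

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)   # K*(K-1)/2 classifiers
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)  # K classifiers

print(len(ovo.estimators_), len(ovr.estimators_))   # 3 and 3 for K = 3
print(ovo.predict(X[:1]), ovr.predict(X[:1]))
```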
45 Evaluation of binary classifiers: sensitivity vs. specificity
Confusion matrix (rows = true class, columns = predicted class):
                    Predicted Positive       Predicted Negative
  Actual Positive   True Positive (TP)       False Negative (FN)
  Actual Negative   False Positive (FP)      True Negative (TN)
Measures:
  Precision            = TP / (TP + FP)
  Recall / Sensitivity = TP / (TP + FN)
  Specificity          = TN / (TN + FP)
  Accuracy             = (TP + TN) / (TP + FP + TN + FN)
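A minimal sketch computing the four measures from confusion-matrix counts; the counts below are made up.

```python
def evaluate(tp, fp, fn, tn):
    """Precision, recall (sensitivity), specificity and accuracy from raw counts."""
    return {
        'precision':   tp / (tp + fp),
        'recall':      tp / (tp + fn),           # = sensitivity
        'specificity': tn / (tn + fp),
        'accuracy':    (tp + tn) / (tp + fp + fn + tn),
    }

print(evaluate(tp=40, fp=10, fn=5, tn=45))
# precision 0.80, recall ~0.889, specificity ~0.818, accuracy 0.85
```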
46 Evaluation of binary classifiers: sensitivity vs. specificity. Perfect classifier (figure)
47 Evaluation of binary classifiers: sensitivity vs. specificity. Reality (figure)
48 Evaluation of binary classifiers: sensitivity vs. specificity. 100% sensitive classifier (figure)
49 Evaluation of binary classifiers: sensitivity vs. specificity. 100% specific classifier (figure)
50 Evaluation of binary classifiers: sensitivity vs. specificity (figure)
51 Evaluation of binary classifiers: ROC curve (figure)
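A closing sketch (assuming scikit-learn and matplotlib): an ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the decision threshold of a scoring classifier varies; the labels and scores below are invented.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.55, 0.35]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", auc(fpr, tpr))

plt.plot(fpr, tpr, marker='o')
plt.plot([0, 1], [0, 1], linestyle='--')     # chance level
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.show()
```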