Advanced Introduction to Machine Learning CMU-10715


1 Advanced Introduction to Machine Learning, CMU-10715. Support Vector Machines. Barnabás Póczos, Fall 2014.

2 Linear classifiers: which line is better?

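3 Pick the one with the largest margin! [Figure: the data, with Class 1 on the side $w \cdot x + b > 0$ and Class 2 on the side $w \cdot x + b < 0$; the normal vector $w$ and the margin are marked.]

4 Scaling. The plus-plane is $w \cdot x + b = +1$, the classifier boundary is $w \cdot x + b = 0$, and the minus-plane is $w \cdot x + b = -1$. Classification rule: classify as $+1$ if $w \cdot x + b \ge 1$, and as $-1$ if $w \cdot x + b \le -1$; the universe explodes if $-1 < w \cdot x + b < 1$. How large is the margin of this classifier? Goal: find the maximum-margin classifier.

5 Computing the margin width. Let $x^{+}$ and $x^{-}$ be such that $w \cdot x^{+} + b = +1$, $w \cdot x^{-} + b = -1$, and $x^{+} = x^{-} + \lambda w$, so that $\lVert x^{+} - x^{-} \rVert = M$ (the margin). The margin width is $M = 2 / \sqrt{w \cdot w}$. Maximizing $M$ is the same as minimizing $w \cdot w$.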
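The margin width follows in a few lines from the definitions on this slide. A worked version, using only the slide's symbols ($x^{+}$, $x^{-}$, $\lambda$, $M$):

```latex
\begin{aligned}
w \cdot x^{+} + b &= +1, \qquad w \cdot x^{-} + b = -1, \qquad x^{+} = x^{-} + \lambda w \\
w \cdot (x^{-} + \lambda w) + b &= 1
  \;\Rightarrow\; -1 + \lambda\, (w \cdot w) = 1
  \;\Rightarrow\; \lambda = \frac{2}{w \cdot w} \\
M = \lVert x^{+} - x^{-} \rVert &= \lambda \lVert w \rVert
  = \frac{2 \sqrt{w \cdot w}}{w \cdot w}
  = \frac{2}{\sqrt{w \cdot w}}
\end{aligned}
```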

6 The Primal Hard SVM. This is a QP problem ($m$-dimensional): quadratic cost function, linear constraints.
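The formula on this slide did not survive extraction. As a sketch, the standard hard-margin primal is written below. The notation is an assumption: training pairs $(x_j, y_j)$ with $y_j \in \{-1, +1\}$ and $j = 1, \dots, n$, and $w \in \mathbb{R}^m$.

```latex
\begin{aligned}
\min_{w,\, b} \quad & \tfrac{1}{2}\, w \cdot w \\
\text{subject to} \quad & y_j \,(w \cdot x_j + b) \ge 1, \qquad j = 1, \dots, n
\end{aligned}
```

The objective is quadratic in $w$ and each constraint is linear in $(w, b)$, which is what makes this a QP.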

7 Quadratic Programming. Find the optimum of a quadratic objective, subject to linear inequality constraints and to linear equality constraints. Efficient algorithms exist for QP. They often solve the dual problem instead of the primal.

8 Constrained Optimization

9 Lagrange Multiplier. Move the constraint into the objective function to form the Lagrangian, then solve. The constraint is active when $\alpha > 0$.

10 Lagrange Multiplier, Dual Variables. Solving: when $\alpha > 0$, the constraint is tight.

11 From Primal to Dual. Primal problem and Lagrange function.

12 The Lagrange Problem. Proof, continued.

13 The Dual Problem. Proof, continued.

14 The Dual Hard SVM. Quadratic programming ($n$-dimensional). Lemma.
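The dual formula itself is missing from the extracted slide. For reference, the standard hard-margin dual is sketched below. It assumes $n$ training pairs $(x_j, y_j)$ with $y_j \in \{-1, +1\}$ and one dual variable $\alpha_j$ per training point, which is why the problem is $n$-dimensional.

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_{j=1}^{n} \alpha_j
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
\text{subject to} \quad & \alpha_j \ge 0, \quad j = 1, \dots, n,
  \qquad \sum_{j=1}^{n} \alpha_j y_j = 0 \\
\text{with} \quad & w = \sum_{j=1}^{n} \alpha_j y_j x_j
\end{aligned}
```

The data enter only through the inner products $x_i \cdot x_j$.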

15 The Problem with Hard SVM. It assumes samples are linearly separable. What can we do if the data is not linearly separable?

16 Hard 1-dimensional Dataset. If the data set is not linearly separable, then after adding new features (mapping the data to a larger feature space) the data might become linearly separable.

17 Hard 1-dimensional Dataset. Make up a new feature, sort of computed from the original feature(s): $z_k = (x_k, x_k^2)$. Separable! MAGIC! Now drop this augmented data into our linear SVM.
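A minimal sketch of the mapping $z_k = (x_k, x_k^2)$ on a made-up 1-dimensional data set. The data points and the hyperplane `w = (0, 1)`, `b = -2` are hand-picked assumptions for illustration, not values from the lecture.

```python
def feature_map(x):
    """Map a scalar x to the augmented point z = (x, x^2)."""
    return (x, x * x)


def separable_by_threshold_1d(xs, ys):
    """True if a single threshold on x puts all +1 points on one side."""
    pos = [x for x, y in zip(xs, ys) if y > 0]
    neg = [x for x, y in zip(xs, ys) if y < 0]
    return max(pos) < min(neg) or max(neg) < min(pos)


# Hypothetical data: class -1 sits near the origin, class +1 on both sides of it.
xs = [-3.0, -2.5, 2.0, 3.5, -0.5, 0.0, 0.7]
ys = [+1, +1, +1, +1, -1, -1, -1]

zs = [feature_map(x) for x in xs]

# Hand-picked hyperplane in the augmented space: it thresholds x^2 at 2.
w, b = (0.0, 1.0), -2.0
scores = [w[0] * z[0] + w[1] * z[1] + b for z in zs]
linearly_separated = all(y * s > 0 for y, s in zip(ys, scores))

print(separable_by_threshold_1d(xs, ys))  # no threshold works in 1-D
print(linearly_separated)                 # a line works in the (x, x^2) plane
```

In practice the hyperplane would come from the SVM solver rather than being chosen by hand.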

18 Feature mapping. $n$ points in general position in an $(n-1)$-dimensional space are always linearly separable by a hyperplane, so it is good to map the data to high-dimensional spaces. Having $n$ training data, is it always enough to map the data into a feature space with dimension $n-1$? Nope. We have to think about the test data as well, even if we don't know how many test data we have and what they are. We might want to map our data to a huge ($\infty$) dimensional feature space. Overfitting? Generalization error? We don't care now.

19 How to do feature mapping? Use features of features of features of features.

20 The Problem with Hard SVM. It assumes samples are linearly separable. Solutions: 1. Use a feature transformation to a larger space, so that the training samples are linearly separable in the feature space and hard SVM can be applied; the risk is overfitting. 2. Use soft-margin SVM instead of hard SVM, with slack variables. We will discuss them now.

21 Hard SVM. The hard SVM problem can be rewritten with a loss term that is $\infty$ for a misclassified point or a point inside the margin, and 0 for a correctly classified point outside of the margin.

22 From Hard to Soft Constraints. Instead of using hard constraints (points are linearly separable), we can try to solve the soft version of the problem. Introduce a $\lambda$ parameter! Your loss is only 1 instead of $\infty$ if you misclassify an instance: the loss is 1 for a misclassification and 0 for a correct classification.

23 Problems with the $\ell_{0-1}$ loss. It is not convex in $y f(x)$. It is not convex in $w$ either, and we only like convex functions. Let us approximate it with convex functions!

24 Approximation of the Heaviside step function. Picture is taken from R. Herbrich.

25 Approximations of the $\ell_{0-1}$ loss: piecewise linear approximations (hinge loss, $\ell_{\text{lin}}$) and a quadratic approximation ($\ell_{\text{quad}}$).

26 The hinge loss approximation of $\ell_{0-1}$.
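The slide's formula is missing from the extraction. A small sketch of the usual definitions, written as functions of the margin $y f(x)$, may help. The convention that a margin of exactly 0 counts as an error is an assumption.

```python
def zero_one_loss(margin):
    """0-1 loss as a function of the margin y*f(x); margin <= 0 counts as an error."""
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    """Hinge loss max(0, 1 - y*f(x)): convex and piecewise linear in the margin."""
    return max(0.0, 1.0 - margin)


margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
for m in margins:
    print(m, zero_one_loss(m), hinge_loss(m))
```

On every margin the hinge loss is at least as large as the 0-1 loss, and it reaches zero only once the point is on or beyond the margin ($y f(x) \ge 1$).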

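27 The Slack Variables. [Figure: margin of width $M = 2 / \sqrt{w \cdot w}$; points that violate the margin are labeled with their slack variables $\xi_1$, $\xi_2$, $\xi_7$.]

28 The Primal Soft SVM problem, followed by an equivalent form.

29 The Primal Soft SVM problem. Equivalently, we can use this form, too. What is the dual form of the primal soft SVM?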
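The two equivalent forms were lost in extraction. A sketch of the standard soft-margin primal, first with slack variables and then with the hinge loss written out, is given below. The trade-off weight is named $C$ here as an assumption; the slides may parameterize it differently (for example through $\lambda$).

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \tfrac{1}{2}\, w \cdot w + C \sum_{j=1}^{n} \xi_j
  \qquad \text{subject to} \quad
  y_j \,(w \cdot x_j + b) \ge 1 - \xi_j, \quad \xi_j \ge 0 \\[4pt]
\min_{w,\, b} \quad & \tfrac{1}{2}\, w \cdot w
  + C \sum_{j=1}^{n} \max\bigl(0,\; 1 - y_j \,(w \cdot x_j + b)\bigr)
\end{aligned}
```

At the optimum each slack equals the hinge loss of its point, $\xi_j = \max(0, 1 - y_j (w \cdot x_j + b))$, which is why the two forms agree.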

30 The Dual Soft SVM (using hinge loss)

31 The Dual Soft SVM (using hinge loss)

32 The Dual Soft SVM (using hinge loss)
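The formulas on these three slides are not recoverable from the extraction. For orientation only, the standard soft-margin dual under the hinge loss is sketched below. It assumes training pairs $(x_j, y_j)$ with $y_j \in \{-1, +1\}$, and $C$ is an assumed name for the weight on the total slack.

```latex
\begin{aligned}
\max_{\alpha} \quad & \sum_{j=1}^{n} \alpha_j
  - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
\text{subject to} \quad & 0 \le \alpha_j \le C, \quad j = 1, \dots, n,
  \qquad \sum_{j=1}^{n} \alpha_j y_j = 0
\end{aligned}
```

The only change from the hard-margin dual is the upper bound $\alpha_j \le C$.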

33 SVM classification in the dual space. Solve the dual problem.

34 Why is it called Support Vector Machine? KKT conditions

35 Dual SVM Interpretation: Sparsity. Only a few $\alpha_j$ can be non-zero: those where the constraint is tight, $(\langle w, x_j \rangle + b)\, y_j = 1$. Support vectors are the training points $j$ whose $\alpha_j$ are non-zero. [Figure: points on the margin are labeled $\alpha_j > 0$; all other points are labeled $\alpha_j = 0$.]

36 Support Vectors. The linear hyperplane is defined by the support vectors. Moving other points a little doesn't affect the decision boundary, so we only need to store the support vectors to predict labels of new points. [Figure: the regions $w \cdot x + b > 0$ and $w \cdot x + b < 0$, with margin $\gamma$ on each side of the boundary.]

37 Support vectors in Soft SVM

38 Support vectors in Soft SVM: margin support vectors and non-margin support vectors.

39 SVM classification in the dual space: without $b$ and with $b$.
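A minimal sketch of prediction in the dual space, assuming the usual decision function $f(x) = \sum_j \alpha_j y_j \, k(x_j, x) + b$ summed over the support vectors. The two-point data set and its dual variables are a hand-worked toy example, not taken from the lecture.

```python
def linear_kernel(u, v):
    """Plain inner product u . v."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, labels, alphas, b=0.0, kernel=linear_kernel):
    """f(x) = sum_j alpha_j * y_j * k(x_j, x) + b, summed over support vectors only."""
    total = sum(a * y * kernel(sv, x)
                for a, y, sv in zip(alphas, labels, support_vectors))
    return total + b


def classify(x, support_vectors, labels, alphas, b=0.0, kernel=linear_kernel):
    """Predict +1 or -1 from the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, labels, alphas, b, kernel) >= 0 else -1


# Toy problem: two points, both support vectors. For this data the hard-margin
# solution is w = (0.5, 0.5), b = 0, which corresponds to alpha = (0.25, 0.25).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]

print(decision_value((1.0, 1.0), support_vectors, labels, alphas))
print(classify((2.0, 0.0), support_vectors, labels, alphas))
print(classify((-0.3, -4.0), support_vectors, labels, alphas))
```

Leaving `b` at its default of 0 corresponds to the "without b" case; passing a bias gives the "with b" case. Swapping `kernel` for a non-linear kernel changes nothing else in the code.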

40 SVM with Linear Programs QP: Max margin LP: Min support vectors

41 SVM for Regression

42 Ridge Regression. Linear regression has a primal form and, for a given $\lambda$, a dual form. After some calculations, the dual can be solved in closed form.

43 Kernel Ridge Regression Algorithm
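The algorithm on this slide was an image and is not in the extracted text. One common closed form, given here as an assumption about what the slide shows, is $\alpha = (K + \lambda I)^{-1} y$ with predictions $f(x) = \sum_i \alpha_i \, k(x_i, x)$. The sketch below uses only the standard library; the kernel width, the data, and the helper `solve` are illustrative choices.

```python
import math


def rbf_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_kernel_ridge(X, y, lam, kernel):
    """Return dual coefficients alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam
    return solve(K, y)


def predict(x, X, alpha, kernel):
    """f(x) = sum_i alpha_i * k(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))


# Hypothetical 1-D regression data.
X = [(0.0,), (1.0,), (2.0,), (3.0,)]
y = [0.0, 1.0, 4.0, 9.0]
alpha = fit_kernel_ridge(X, y, lam=1e-6, kernel=rbf_kernel)
fitted = [predict(x, X, alpha, rbf_kernel) for x in X]
print(fitted)
```

With a very small $\lambda$ the fit nearly interpolates the training targets; a larger $\lambda$ trades training accuracy for smoothness. A real implementation would use a linear-algebra library instead of the hand-written `solve`.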

44 SVM vs. Logistic Regression. SVM: hinge loss. Logistic regression: log loss (negative log conditional likelihood). [Figure: log loss, hinge loss, and 0-1 loss plotted together.]

45 Difference between SVMs and Logistic Regression
Loss function: SVMs use hinge loss; logistic regression uses log-loss.
High-dimensional features with kernels: SVMs yes; logistic regression no (but there is kernel logistic regression too).
Solution sparse: SVMs often yes; logistic regression almost always no.
Semantics of output: SVMs give a margin; logistic regression gives real probabilities.

46 Constructing Kernels

47 Common Kernels: polynomials of degree $d$; polynomials of degree up to $d$; Gaussian/radial kernels; sigmoid.
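The kernel formulas are missing from the extracted slide. The sketch below uses the forms these names usually refer to: $(u \cdot v)^d$, $(u \cdot v + 1)^d$, $\exp(-\lVert u - v \rVert^2 / 2\sigma^2)$, and $\tanh(\eta \, u \cdot v + \nu)$. The parameter names `sigma`, `eta`, and `nu` are assumptions.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, d):
    """Polynomials of degree exactly d: (u . v)^d."""
    return dot(u, v) ** d


def poly_up_to_kernel(u, v, d):
    """Polynomials of degree up to d: (u . v + 1)^d."""
    return (dot(u, v) + 1) ** d


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian/radial kernel: exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(u, v, eta=1.0, nu=0.0):
    """Sigmoid kernel: tanh(eta * (u . v) + nu)."""
    return math.tanh(eta * dot(u, v) + nu)


u, v = (1, 2), (3, 4)
print(poly_kernel(u, v, 2))
print(poly_up_to_kernel(u, v, 2))
print(gaussian_kernel(u, v))
print(sigmoid_kernel(u, v, eta=0.1))
```

Note that the sigmoid form is not positive semidefinite for every choice of `eta` and `nu`, so it is a valid kernel only for some parameter settings.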

48 Designing new kernels from kernels are also kernels. Picture is taken from R. Herbrich

49 Designing new kernels from kernels Picture is taken from R. Herbrich

50 Designing new kernels from kernels

51 Higher Order Polynomials. With $m$ input features and polynomial degree $d$, the number of terms grows fast! For $d = 6$, $m = 100$: about 1.6 billion terms.
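The 1.6 billion figure can be checked by counting monomials. The sketch assumes the slide counts distinct monomials of degree exactly $d$ in $m$ variables, which is $\binom{m + d - 1}{d}$; that assumption reproduces the quoted number.

```python
import math


def num_monomials(m, d):
    """Number of distinct monomials of degree exactly d in m variables."""
    return math.comb(m + d - 1, d)


print(num_monomials(100, 6))  # 1609344100, i.e. about 1.6 billion
print(num_monomials(100, 2))  # 5050
```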

52 Dot Product of Polynomials: the cases $d = 1$, $d = 2$, and general $d$.
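The point of this slide is that the dot product of explicit polynomial features can be computed without building them. A sketch for $d = 2$, using the feature map of all ordered products $u_i u_j$ (one convenient choice of feature map, assumed here), for which $\Phi(u) \cdot \Phi(v) = (u \cdot v)^2$:

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def degree2_features(u):
    """Explicit degree-2 feature map: all ordered products u_i * u_j (m^2 features)."""
    return [ui * uj for ui in u for uj in u]


u = (1, 2, 3)
v = (4, 5, 6)

explicit = dot(degree2_features(u), degree2_features(v))  # works in m^2 dimensions
implicit = dot(u, v) ** 2                                  # works in m dimensions

print(explicit, implicit)
```

Both numbers are equal, but the kernel side costs $O(m)$ instead of $O(m^2)$; for general $d$ the same identity gives $(u \cdot v)^d$ against $m^d$ explicit features.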

53 Picture is taken from R. Herbrich

54 The RBF kernel Note: Proof: Note:

55 Overfitting. Huge feature space with kernels: what about overfitting? Maximizing the margin leads to a sparse set of support vectors. Some interesting theory says that SVMs search for a simple hypothesis with a large margin. Often robust to overfitting.

56 String kernels. P-spectrum kernel with P = 3, for s = "statistics" and t = "computation". They contain the following substrings of length 3:
s: sta, tat, ati, tis, ist, sti, tic, ics
t: com, omp, mpu, put, uta, tat, ati, tio, ion
Common substrings: tat, ati, so k(s, t) = 2.
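A short sketch of the p-spectrum kernel as the inner product of substring-count vectors, which reproduces the value on this slide. The function names are illustrative.

```python
from collections import Counter


def p_spectrum(s, p):
    """Count every contiguous substring of length p in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def p_spectrum_kernel(s, t, p):
    """Inner product of the two substring-count vectors."""
    cs, ct = p_spectrum(s, p), p_spectrum(t, p)
    return sum(count * ct[sub] for sub, count in cs.items())


print(sorted(set(p_spectrum("statistics", 3)) & set(p_spectrum("computation", 3))))
print(p_spectrum_kernel("statistics", "computation", 3))  # 2
```

Because the kernel multiplies counts, a substring that occurs several times in both strings contributes more than 1; in this example every length-3 substring occurs once, so the kernel equals the number of common substrings.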

57 Distribution kernels Euclidean: Bhattacharyya's affinity: Mean map:

58 Set kernels Mean map: Intersection kernel: Union complement kernel:

59 What about multiple classes?

60 One against all. Learn 3 classifiers separately: class $k$ vs. rest, giving $(w_k, b_k)$ for $k = 1, 2, 3$. Predict $y = \arg\max_k \, (w_k \cdot x + b_k)$. But the $w_k$ may not be based on the same scale. Note: $(a w) \cdot x + (a b)$ is also a solution for any $a > 0$.
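A sketch of the one-against-all prediction rule and of the scale problem noted on this slide. The three classifiers and the test point are made-up numbers chosen for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def one_vs_all_predict(x, classifiers):
    """Return the 1-based class k that maximizes w_k . x + b_k."""
    scores = [dot(w, x) + b for w, b in classifiers]
    return max(range(len(scores)), key=scores.__getitem__) + 1


# Hypothetical (w_k, b_k) pairs for k = 1, 2, 3.
classifiers = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.2), ((-1.0, -1.0), 0.0)]
x = (1.0, 0.5)
before = one_vs_all_predict(x, classifiers)

# Rescale classifier 2 by a > 0. Its own binary decisions are unchanged,
# because the sign of a * (w . x + b) equals the sign of w . x + b.
a = 2.0
w2, b2 = classifiers[1]
rescaled = list(classifiers)
rescaled[1] = (tuple(a * wi for wi in w2), a * b2)
after = one_vs_all_predict(x, rescaled)

print(before, after)
```

The predicted class changes even though no individual binary classifier changed its decisions, which is the motivation for learning all the weight vectors jointly.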

61 Learn 1 classifier: Multi-class SVM. Simultaneously learn 3 sets of weights. Margin: the gap between the correct class and the nearest other class. Predict $y = \arg\max_k \, (w^{(k)} \cdot x + b^{(k)})$.

62 Learn 1 classifier: Multi-class SVM. Simultaneously learn 3 sets of weights and predict $y = \arg\max_k \, (w^{(k)} \cdot x + b^{(k)})$. Joint optimization: the $w^{(k)}$ have the same scale.

63 Steve Gunn's SVM toolbox. Results, Iris 2vs13, linear kernel

64 Results, Iris 1vs23, 2nd order kernel

65 Results, Iris 1vs23, 2nd order kernel

66 Results, Iris 1vs23, RBF kernel

67 Results, Iris 1vs23, RBF kernel

68 Results, Iris 1vs23, RBF kernel

69 Results, Chessboard, poly kernel

70 Results, Chessboard, poly kernel

71 Results, Chessboard, poly kernel

72 Results, Chessboard, poly kernel

73 Results, Chessboard, poly kernel

74 Results, Chessboard, RBF kernel

75 Sinc=sin(π x)/ (π x), RBF kernel

76 Sinc=sin(π x)/ (π x), RBF kernel

77 Sinc=sin(π x)/ (π x), RBF kernel

78 Sinc=sin(π x)/ (π x), RBF kernel

79 Sinc=sin(π x)/ (π x), RBF kernel

80 Sinc=sin(π x)/ (π x), RBF kernel

81 Sinc=sin(π x)/ (π x), poly kernel

82 Sinc=sin(π x)/ (π x), poly kernel

83 Sinc=sin(π x)/ (π x), poly kernel

84 Sinc=sin(π x)/ (π x), poly kernel

85 Sinc=sin(π x)/ (π x), poly kernel

86 Sinc=sin(π x)/ (π x), poly kernel

87 What you need to know: the dual SVM formulation and how it's derived; common kernels; differences between SVMs and logistic regression.

88 Thanks for your attention
