An Introduction to Machine Learning

1 An Introduction to Machine Learning, L5: Support Vector Classification. Alexander J. Smola, Statistical Machine Learning Program, Canberra, ACT 0200, Australia. MLSS Tübingen, August 2007.

2 Overview
L1: Machine learning and probability theory: introduction to pattern recognition; classification, regression, and novelty detection; probability theory, Bayes rule, inference.
L5: Support Vector estimation: geometrical view, dual problem; convex optimization, kernels.
L6: Structured estimation: sequence annotation; web page ranking, path planning; implementation and optimization.

3 L5: Support Vector Classification (outline)
Support Vector Machine: problem definition; geometrical picture; optimization problem.
Optimization problem: hard margin; convexity; dual problem; soft margin problem.

4 Classification
Data: pairs of observations $(x_i, y_i)$ generated from some distribution $P(x, y)$, e.g. (blood status, cancer), (credit transaction, fraud), (profile of jet engine, defect).
Task: estimate $y$ given $x$ at a new location. Modification: find a function $f(x)$ that does the task.

5 So Many Solutions (figure not included in the transcription)

6 One to rule them all... (figure not included in the transcription)

7 Optimal Separating Hyperplane (figure not included in the transcription)

8 Optimization Problem
Margin to norm: the separation between the two sets is $\frac{2}{\|w\|}$, so maximize that. Equivalently, minimize $\frac{1}{2}\|w\|$, or equivalently minimize $\frac{1}{2}\|w\|^2$.
Constraints: separation with margin, i.e.
$$\langle w, x_i\rangle + b \geq 1 \text{ if } y_i = 1 \qquad\text{and}\qquad \langle w, x_i\rangle + b \leq -1 \text{ if } y_i = -1.$$
Equivalent constraint: $y_i(\langle w, x_i\rangle + b) \geq 1$.
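
Brief justification of the $2/\|w\|$ separation (a standard derivation, not spelled out on the slide): take points $x_+$ and $x_-$ lying on the two margin hyperplanes, so that
$$\langle w, x_+\rangle + b = 1, \qquad \langle w, x_-\rangle + b = -1 \;\Rightarrow\; \langle w, x_+ - x_-\rangle = 2,$$
and projecting $x_+ - x_-$ onto the unit normal $w/\|w\|$ gives a distance of $2/\|w\|$ between the two hyperplanes.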

9 Optimization Problem
Mathematical programming setting: combining the above requirements we obtain
$$\text{minimize}_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad\text{subject to}\quad y_i(\langle w, x_i\rangle + b) - 1 \geq 0 \text{ for all } 1 \leq i \leq m.$$
Properties: the problem is convex, hence it has a unique minimum, and efficient algorithms for solving it exist.
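
To make the mathematical programming setting concrete, here is a minimal sketch that hands this hard-margin primal to a generic QP solver (cvxopt). The helper name hard_margin_primal and the tiny regularizer on b are illustrative additions, not part of the slides.

    import numpy as np
    from cvxopt import matrix, solvers

    def hard_margin_primal(X, y):
        """Solve min 1/2 ||w||^2  s.t.  y_i (<w, x_i> + b) >= 1, as a QP in z = (w, b)."""
        m, d = X.shape
        P = np.zeros((d + 1, d + 1))
        P[:d, :d] = np.eye(d)
        P[d, d] = 1e-8                     # tiny term so P stays positive definite in b
        q = np.zeros(d + 1)
        # y_i (<w, x_i> + b) >= 1   <=>   -y_i [x_i, 1] z <= -1
        G = -(y[:, None] * np.hstack([X, np.ones((m, 1))]))
        h = -np.ones(m)
        sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
        z = np.array(sol['x']).ravel()
        return z[:d], z[d]                 # w, b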

10 Lagrange Function
Objective function: $\tfrac{1}{2}\|w\|^2$. Constraints: $c_i(w, b) := 1 - y_i(\langle w, x_i\rangle + b) \leq 0$.
Lagrange function:
$$L(w, b, \alpha) = \text{PrimalObjective} + \sum_i \alpha_i c_i = \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i\bigl(1 - y_i(\langle w, x_i\rangle + b)\bigr).$$
Saddle point condition: the derivatives of $L$ with respect to $w$ and $b$ must vanish.

11 Support Vector Machines
Optimization problem:
$$\text{minimize}\ \tfrac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle - \sum_{i=1}^m \alpha_i \quad\text{subject to}\quad \sum_{i=1}^m \alpha_i y_i = 0 \text{ and } \alpha_i \geq 0.$$
Support vector expansion:
$$w = \sum_i \alpha_i y_i x_i \quad\text{and hence}\quad f(x) = \sum_{i=1}^m \alpha_i y_i \langle x_i, x\rangle + b.$$
Kuhn-Tucker conditions:
$$\alpha_i\bigl(1 - y_i(\langle w, x_i\rangle + b)\bigr) = 0.$$

12 Proof (optional)
Lagrange function:
$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^m \alpha_i\bigl(1 - y_i(\langle w, x_i\rangle + b)\bigr).$$
Saddle point conditions:
$$\partial_w L(w, b, \alpha) = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^m \alpha_i y_i x_i$$
$$\partial_b L(w, b, \alpha) = -\sum_{i=1}^m \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^m \alpha_i y_i = 0.$$
To obtain the dual optimization problem we substitute these values of $w$ and $b$ into $L$. Note that the dual variables $\alpha_i$ carry the constraint $\alpha_i \geq 0$.

13 Proof (optional)
Dual optimization problem: after substituting in the terms for $b$ and $w$, the Lagrange function becomes
$$-\tfrac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle + \sum_{i=1}^m \alpha_i \quad\text{subject to}\quad \sum_{i=1}^m \alpha_i y_i = 0 \text{ and } \alpha_i \geq 0 \text{ for all } 1 \leq i \leq m.$$
Practical modification: we need to maximize the dual objective function. Rewrite as
$$\text{minimize}\ \tfrac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle - \sum_{i=1}^m \alpha_i$$
subject to the above constraints.
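
The rewritten dual is again a QP and can be handed to the same generic solver. Below is a hedged sketch; the function name hard_margin_dual, the 1e-6 support-vector threshold, and the way b is recovered from the Kuhn-Tucker conditions are illustrative choices, not from the slides.

    import numpy as np
    from cvxopt import matrix, solvers

    def hard_margin_dual(X, y):
        """Minimise 1/2 a'Qa - 1'a  s.t.  y'a = 0, a >= 0, with Q_ij = y_i y_j <x_i, x_j>."""
        m = X.shape[0]
        Q = ((y[:, None] * y[None, :]) * (X @ X.T)).astype(float)
        sol = solvers.qp(matrix(Q), matrix(-np.ones(m)),
                         matrix(-np.eye(m)), matrix(np.zeros(m)),              # -a <= 0
                         matrix(y.reshape(1, -1).astype(float)), matrix(0.0))  # y'a = 0
        alpha = np.array(sol['x']).ravel()
        w = (alpha * y) @ X                     # support vector expansion
        sv = alpha > 1e-6                       # points with non-zero multipliers
        b = np.mean(y[sv] - X[sv] @ w)          # KKT: y_i (<w, x_i> + b) = 1 on SVs
        return alpha, w, b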

14 Support Vector Expansion
Solution: $w = \sum_{i=1}^m \alpha_i y_i x_i$, i.e. $w$ is given by a linear combination of training patterns $x_i$, independent of the dimensionality of $x$; $w$ depends on the Lagrange multipliers $\alpha_i$.
Kuhn-Tucker conditions: at the optimal solution, Constraint $\times$ Lagrange Multiplier $= 0$. In our context this means
$$\alpha_i\bigl(1 - y_i(\langle w, x_i\rangle + b)\bigr) = 0.$$
Equivalently we have $\alpha_i \neq 0 \Rightarrow y_i(\langle w, x_i\rangle + b) = 1$: only points at the decision boundary can contribute to the solution.

15 Mini Summary
Linear classification: many solutions; optimal separating hyperplane; optimization problem.
Support Vector Machines: quadratic problem; Lagrange function; dual problem.
Interpretation: dual variables and SVs; SV expansion; hard margin and infinite weights.

16 Kernels
Nonlinearity via feature maps: replace $x_i$ by $\Phi(x_i)$ in the optimization problem. Equivalent optimization problem:
$$\text{minimize}\ \tfrac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^m \alpha_i \quad\text{subject to}\quad \sum_{i=1}^m \alpha_i y_i = 0 \text{ and } \alpha_i \geq 0.$$
Decision function: $w = \sum_{i=1}^m \alpha_i y_i \Phi(x_i)$ implies
$$f(x) = \langle w, \Phi(x)\rangle + b = \sum_{i=1}^m \alpha_i y_i k(x_i, x) + b.$$
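
As an illustration of the kernel trick, here is a small sketch of a Gaussian RBF kernel and the resulting decision function; the names rbf_kernel and decision_function are illustrative, but the formula matches the expansion above.

    import numpy as np

    def rbf_kernel(X1, X2, sigma=1.0):
        """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def decision_function(Xtrain, y, alpha, b, Xtest, sigma=1.0):
        """f(x) = sum_i alpha_i y_i k(x_i, x) + b, evaluated for each test point."""
        K = rbf_kernel(Xtest, Xtrain, sigma)        # shape (n_test, n_train)
        return K @ (alpha * y) + b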

17 Examples and Problems
Advantage: works well when the data is noise free.
Problem: a single wrong observation can already ruin everything, since we require $y_i f(x_i) \geq 1$ for all $i$.
Idea: limit the influence of individual observations by making the constraints less stringent (introduce slack variables).

18 Optimization Problem (Soft Margin)
Recall the hard margin problem:
$$\text{minimize}\ \tfrac{1}{2}\|w\|^2 \quad\text{subject to}\quad y_i(\langle w, x_i\rangle + b) \geq 1.$$
Softening the constraints:
$$\text{minimize}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i \quad\text{subject to}\quad y_i(\langle w, x_i\rangle + b) - 1 + \xi_i \geq 0 \text{ and } \xi_i \geq 0.$$

19-46 Linear SVM (figures): decision boundaries for C = 1, 2, 5, 10, 20, 50, 100, repeated on four different data sets; plots not included in the transcription.

47 Insights
Changing C: for clean data, C doesn't matter much. For noisy data, a large C leads to a narrow margin (the SVM tries to do a good job at separating, even though it isn't possible).
Noisy data: clean data has few support vectors; noisy data leads to points inside the margin, and hence more support vectors.

48 Python pseudocode (SVM classification with the Elefant toolkit)

    import elefant.kernels.vector
    # linear kernel
    k = elefant.kernels.vector.clinearkernel()
    # Gaussian RBF kernel
    k = elefant.kernels.vector.cgausskernel(rbf)

    import elefant.estimation.svm.svmclass as svmclass
    svm = svmclass.svc(c, kernel=k)     # c is the soft-margin parameter
    alpha, b = svm.train(x, y)          # fit on training data (x, y)
    ytest = svm.test(xtest)             # predict labels for test data
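
The Elefant toolkit is no longer widely available; for readers who want to try this today, a roughly equivalent sketch with scikit-learn looks as follows. SVC, its 'rbf' kernel, and gamma = 1/(2 sigma^2) are scikit-learn conventions, not part of the slide, and the toy data is purely illustrative.

    import numpy as np
    from sklearn.svm import SVC

    # toy data standing in for the x, y of the slide
    rng = np.random.RandomState(0)
    x = rng.randn(100, 2)
    y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)

    svm = SVC(C=1.0, kernel='rbf', gamma=0.5)   # gamma = 1 / (2 sigma^2)
    svm.fit(x, y)
    ytest = svm.predict(x)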

49 Dual Optimization Problem
Optimization problem:
$$\text{minimize}\ \tfrac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^m \alpha_i \quad\text{subject to}\quad \sum_{i=1}^m \alpha_i y_i = 0 \text{ and } C \geq \alpha_i \geq 0 \text{ for all } 1 \leq i \leq m.$$
Interpretation: almost the same optimization problem as before; the constraint bounds the weight of each $\alpha_i$ (and hence the influence of each pattern). Efficient solvers exist (more about that tomorrow).
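
Compared with the hard-margin dual sketched earlier, only the box constraint $0 \leq \alpha_i \leq C$ changes. A hedged modification, where soft_margin_dual is an illustrative name and K is any precomputed kernel matrix:

    import numpy as np
    from cvxopt import matrix, solvers

    def soft_margin_dual(K, y, C):
        """Minimise 1/2 a'Qa - 1'a  s.t.  y'a = 0 and 0 <= a_i <= C."""
        m = len(y)
        Q = ((y[:, None] * y[None, :]) * K).astype(float)
        G = np.vstack([-np.eye(m), np.eye(m)])             # -a <= 0  and  a <= C
        h = np.concatenate([np.zeros(m), C * np.ones(m)])
        sol = solvers.qp(matrix(Q), matrix(-np.ones(m)),
                         matrix(G), matrix(h),
                         matrix(y.reshape(1, -1).astype(float)), matrix(0.0))
        return np.array(sol['x']).ravel()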

50 SV Classification Machine (figure not included in the transcription)

51-58 Gaussian RBF kernel (figures): decision boundaries for C = 0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8; plots not included in the transcription.

59 Insights
Changing C: for clean data, C doesn't matter much. For noisy data, a large C leads to a more complicated margin (the SVM tries to do a good job at separating, even though it isn't possible); overfitting for large C.
Noisy data: clean data has few support vectors; noisy data leads to points inside the margin, and hence more support vectors.
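
A quick way to reproduce the qualitative observation that noisy data yields more support vectors as C grows is to sweep C on a small noisy problem; the data generation below is purely illustrative, and scikit-learn's SVC with its n_support_ attribute is an assumption, not part of the slides.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.randn(200) > 0, 1, -1)   # noisy labels

    for C in [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 6.4, 12.8]:
        svm = SVC(C=C, kernel='rbf', gamma=0.5).fit(X, y)
        print(C, int(svm.n_support_.sum()))        # number of support vectors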

60-75 Gaussian RBF kernel (figures): decision boundaries for σ = 1, 2, 5, 10, repeated on four different data sets; plots not included in the transcription.

76 Insights
Changing σ: for clean data, σ doesn't matter much. For noisy data, a small σ leads to a more complicated margin (the SVM tries to do a good job at separating, even though it isn't possible); lots of overfitting for small σ.
Noisy data: clean data has few support vectors; noisy data leads to points inside the margin, and hence more support vectors.

77 Summary
Support Vector Machine: problem definition; geometrical picture; optimization problem.
Optimization problem: hard margin; convexity; dual problem; soft margin problem.
