Optimization for Machine Learning

Size: px

Start display at page:

Download "Optimization for Machine Learning"

Marianna Pitts
7 years ago
Views:

1 Optimization for Machine Learning Lecture 4: SMO-MKL S.V. N. (vishy) Vishwanathan Purdue University July 11, 2012 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 22

2 Binary Classification y i = +1 y i = 1 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 2 / 22

3 Binary Classification y i = +1 y i = 1 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 2 / 22

4 Binary Classification y i = +1 y i = 1 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 2 / 22

5 Binary Classification y i = +1 w, x 1 + b = +1 w, x 2 + b = 1 w, x 1 x 2 = 2 w w, x 1 x 2 = 2 w x 2 x 1 y i = 1 {x w, x + b = 1} {x w, x + b = 1} {x w, x + b = 0} S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 2 / 22

6 Linear Support Vector Machines Optimization Problem min w,b,ξ s.t. 1 2 w 2 + C m i=1 y i ( w, x i + b) 1 ξ i for all i ξ i 0 ξ i S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 3 / 22

7 The Kernel Trick y x S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 4 / 22

8 The Kernel Trick y x S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 4 / 22

9 The Kernel Trick x 2 + y 2 x S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 4 / 22

10 Kernel Trick Optimization Problem min w,b,ξ s.t. 1 2 w 2 + C m i=1 ξ i y i ( w, φ(x i ) + b) 1 ξ i for all i ξ i 0 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 5 / 22

11 Kernel Trick Optimization Problem max α s.t. 1 2 α Hα + 1 α 0 α i C α i y i = 0 i H ij = y i y j φ(x i ), φ(x j ) S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 5 / 22

12 Kernel Trick Optimization Problem max α s.t. 1 2 α Hα + 1 α 0 α i C α i y i = 0 i H ij = y i y j k(x i, x j ) S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 5 / 22

13 Key Question Which kernel should I use? The Multiple Kernel Learning Answer Cook up as many (base) kernels as you can Compute a data dependent kernel function as a linear combination of base kernels k(x, x ) = k d k k k (x, x ) s.t. d k 0 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 6 / 22

14 Key Question Which kernel should I use? The Multiple Kernel Learning Answer Cook up as many (base) kernels as you can Compute a data dependent kernel function as a linear combination of base kernels k(x, x ) = k d k k k (x, x ) s.t. d k 0 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 6 / 22

15 Object Detection Localize a specified object of interest if it exists in a given image S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 7 / 22

16 Some Examples of MKL Detection S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 8 / 22

17 Summary of Our Results Sonar Dataset with 800 kernels p Training Time (s) # Kernels Selected SMO-MKL Shogun SMO-MKL Shogun Web dataset: 50,000 points and 50 kernels 30 minutes Sonar with a hundred thousand kernels Precomputed: 8 minutes Kernels computed on-the-fly: 30 minutes S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 9 / 22

18 Setting up the Optimization Problem -I The Setup We are given K kernel functions k 1,..., k n with corresponding feature maps φ 1 ( ),..., φ n ( ) We are interested in deriving the feature map φ(x) = d1 φ 1 (x). dn φ n (x) S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 10 / 22

19 Setting up the Optimization Problem -I The Setup We are given K kernel functions k 1,..., k n with corresponding feature maps φ 1 ( ),..., φ n ( ) We are interested in deriving the feature map φ(x) = d1 φ 1 (x). dn φ n (x) = w = w 1. w n S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 10 / 22

20 Setting up the Optimization Problem Optimization Problem min w,b,ξ s.t. 1 2 w 2 + C m i=1 ξ i y i ( w, φ(x i ) + b) 1 ξ i for all i ξ i 0 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 11 / 22

21 Setting up the Optimization Problem Optimization Problem min w,b,ξ,d s.t. 1 2 w k 2 + C k y i ( k m i=1 ξ i dk w k, φ k (x i ) + b ) 1 ξ i for all i ξ i 0 d k 0 for all k S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 11 / 22

22 Setting up the Optimization Problem Optimization Problem min w,b,ξ,d s.t. ( ) 2 1 m w k 2 + C ξ i + ρ d p p 2 2 k k i=1 k ( ) y i dk w k, φ k (x i ) + b 1 ξ i for all i k ξ i 0 d k 0 for all k S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 11 / 22

23 Setting up the Optimization Problem Optimization Problem min w,b,ξ,d s.t. 1 w k 2 + C 2 d k k y i ( k ( m ξ i + ρ 2 k ) i=1 w k, φ k (x i ) + b d p k ) 2 p 1 ξ i for all i ξ i 0 d k 0 for all k S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 11 / 22

24 Setting up the Optimization Problem Optimization Problem min max d α s.t. ( 1 d k α H k α + 1 α + ρ 2 2 k 0 α i C α i y i = 0 i d k 0 k d p k ) 2 p S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 11 / 22

25 Saddle Point Problem d α S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 12 / 22

26 Solving the Saddle Point Saddle Point Problem min max d α s.t. ( 1 d k α H k α + 1 α + ρ 2 2 k 0 α i C α i y i = 0 i d k 0 k d p k ) 2 p S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 13 / 22

27 Our Approach The Key Insight Eliminate d D(α) := max α s.t. 1 8ρ ( k 0 α i C α i y i = 0 i 1 p + 1 q = 1 ) 2 ( ) q q α H k α + 1 α Not a QP but very close to one! S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 14 / 22

28 Our Approach SMO-MKL: High Level Overview D(α) := max α s.t. 1 8ρ ( k 0 α i C α i y i = 0 i ) 2 ( ) q q α H k α + 1 α Algorithm Choose two variables α i and α j to optimize Solve the one dimensional reduced optimization problem Repeat until convergence S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 15 / 22

29 Our Approach SMO-MKL: High Level Overview Selecting the Working Set Compute directional derivative and directional Hessian Greedily select the variables Solving the Reduced Problem Analytic solution for p = q = 2 (one dimensional quartic) For other values of p use Newton Raphson S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 16 / 22

30 Our Approach SMO-MKL: High Level Overview Selecting the Working Set Compute directional derivative and directional Hessian Greedily select the variables Solving the Reduced Problem Analytic solution for p = q = 2 (one dimensional quartic) For other values of p use Newton Raphson S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 16 / 22

31 Experiments Generalization Performance Australian 90 Test Accuracy (%) SMO-MKL Shogun S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 17 / 22

32 Experiments Generalization Performance ionosphere 94 Test Accuracy (%) SMO-MKL Shogun S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 17 / 22

33 Experiments Scaling with Training Set Size Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1 CPU Time in seconds SMO-MKL Shogun Number of Training Examples S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 18 / 22

34 Experiments Scaling with Training Set Size Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1 CPU Time in seconds Observed O(n 1.9 ) Number of Training Examples S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 18 / 22

35 Experiments On Another Dataset Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1 CPU Time in seconds Number of Training Examples S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 19 / 22

36 Experiments Scaling with Number of Kernels Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1 CPU Time in seconds Number of Kernels S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 20 / 22

37 Experiments Scaling with Number of Kernels Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1 CPU Time in seconds SMO-MKL O(n) Number of Kernels S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 20 / 22

38 Experiments On Another Dataset Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1 CPU Time in seconds 10 5 SMO-MKL O(n) Number of Training Examples S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 21 / 22

39 References References S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma. Multiple Kernel Learning and the SMO Algorithm. NIPS Pages Code available for download from com/en-us/um/people/manik/code/smo-mkl/download.html S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning 22 / 22

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.