ECS289: Scalable Machine Learning


1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016

2 Outline: Kernel methods; solving the dual problem; kernel approximation.

3 Support Vector Machines (SVM) SVM is a widely used classifier. Given: training data points $x_1, \dots, x_n$, where each $x_i \in \mathbb{R}^d$ is a feature vector. Consider a simple case with two classes: $y_i \in \{+1, -1\}$. Goal: find a hyperplane to separate these two classes of data: if $y_i = 1$, $w^T x_i \geq 1 - \xi_i$; if $y_i = -1$, $w^T x_i \leq -1 + \xi_i$.

4 Support Vector Machines (SVM) Given training data $x_1, \dots, x_n \in \mathbb{R}^d$ with labels $y_i \in \{+1, -1\}$. SVM primal problem (find the optimal $w \in \mathbb{R}^d$): $\min_{w,\xi} \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i$ s.t. $y_i (w^T x_i) \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \dots, n$. SVM dual problem (find the optimal $\alpha \in \mathbb{R}^n$): $\min_\alpha \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$ s.t. $0 \leq \alpha_i \leq C$, $i = 1, \dots, n$, where $Q$ is the n-by-n kernel matrix with $Q_{ij} = y_i y_j x_i^T x_j$. Each $\alpha_i$ corresponds to one training data point. Primal-dual relationship: $w = \sum_i \alpha_i y_i x_i$.

5 Non-linearly separable problems What if the data is not linearly separable? Solution: map the data $x_i$ to a higher-dimensional (maybe infinite-dimensional) feature space $\phi(x_i)$, where the classes are linearly separable.

6 Support Vector Machines (SVM) SVM primal problem: $\min_{w,\xi} \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i$ s.t. $y_i (w^T \phi(x_i)) \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \dots, n$. The dual problem for SVM: $\min_\alpha \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$ s.t. $0 \leq \alpha_i \leq C$ for $i = 1, \dots, n$, where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \dots, 1]^T$. Kernel trick: define $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. At the optimum: $w = \sum_{i=1}^n \alpha_i y_i \phi(x_i)$.

7 Various types of kernels Gaussian kernel: $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|_2^2}$. Polynomial kernel: $K(x_i, x_j) = (\gamma x_i^T x_j + c)^d$. Other kernels for specific problems: graph kernels (Vishwanathan et al., Graph Kernels. JMLR, 2010); pyramid kernel for image matching (Grauman and Darrell, The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. ICCV, 2005); string kernel (Lodhi et al., Text Classification Using String Kernels. JMLR, 2002).
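As a concrete reference, the two generic kernels above can be computed with a few lines of NumPy (a minimal sketch; the function names are mine):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||_2^2) for every pair of rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def polynomial_kernel(X, Y, gamma=1.0, c=1.0, degree=3):
    """K(x, y) = (gamma * x^T y + c)^degree for every pair of rows of X and Y."""
    return (gamma * X @ Y.T + c) ** degree
```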

8 General Kernelized ERM L2-regularized empirical risk minimization: $\min_w \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \ell_i(w^T \phi(x_i))$, where $x_1, \dots, x_n$ are the training samples and $\phi(x_i)$ is a nonlinear mapping to a higher-dimensional space. Dual problem: $\min_\alpha \frac{1}{2}\alpha^T Q \alpha + \sum_{i=1}^n \ell_i^*(-\alpha_i)$, where $\ell_i^*$ is the conjugate of the loss, $Q \in \mathbb{R}^{n \times n}$, and $Q_{ij} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j)$.

9 Kernel Ridge Regression Given training samples $(x_i, y_i)$, $i = 1, \dots, n$: $\min_w \frac{1}{2}\|w\|^2 + \frac{1}{2\lambda} \sum_{i=1}^n (w^T \phi(x_i) - y_i)^2$. Dual problem: $\min_\alpha \alpha^T Q \alpha + \lambda \|\alpha\|^2 - 2\alpha^T y$.
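The dual above is an unconstrained quadratic, so for moderate n it can be solved in closed form; a minimal sketch, assuming the kernel matrix Q is already computed:

```python
import numpy as np

def kernel_ridge_dual(Q, y, lam):
    """Minimize a^T Q a + lam * ||a||^2 - 2 a^T y over a.
    Setting the gradient 2 Q a + 2 lam a - 2 y to zero gives (Q + lam I) a = y."""
    return np.linalg.solve(Q + lam * np.eye(len(y)), y)
```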

10 Scalability Challenges for solving kernel SVMs (for a dataset with n samples): Space: $O(n^2)$ for storing the n-by-n kernel matrix (can be reduced in some cases). Time: $O(n^3)$ for computing the exact solution.

11 Greedy Coordinate Descent for Kernel SVM

12 Nonlinear SVM SVM primal problem: $\min_{w,\xi} \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i$ s.t. $y_i (w^T \phi(x_i)) \geq 1 - \xi_i$, $\xi_i \geq 0$, $i = 1, \dots, n$. Here $w$ can be an infinite-dimensional vector, so the primal is very hard to solve. The dual problem for SVM: $\min_\alpha \frac{1}{2} \alpha^T Q \alpha - e^T \alpha$ s.t. $0 \leq \alpha_i \leq C$ for $i = 1, \dots, n$, where $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$ and $e = [1, \dots, 1]^T$. Kernel trick: define $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. Example: Gaussian kernel $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$.

13 Nonlinear SVM Can we solve the problem by dual coordinate descent? The vector $w = \sum_i y_i \alpha_i \phi(x_i)$ may have infinite dimensionality, so we cannot maintain $w$. The closed-form solution for one coordinate needs $O(n)$ computational time: $\delta = \max\!\left(-\alpha_i, \min\!\left(C - \alpha_i, \frac{1 - (Q\alpha)_i}{Q_{ii}}\right)\right)$ (assuming $Q$ is stored in memory). Can we improve coordinate descent using the same $O(n)$ time complexity?

14 Greedy Coordinate Descent The Greedy Coordinate Descent (GCD) algorithm: For $t = 1, 2, \dots$: 1. Compute $\delta_i^* := \arg\min_\delta f(\alpha + \delta e_i)$ for all $i = 1, \dots, n$. 2. Find the best coordinate $i^* = \arg\max_i |\delta_i^*|$. 3. Update $\alpha_{i^*} \leftarrow \alpha_{i^*} + \delta_{i^*}^*$.

15 Greedy Coordinate Descent Other variable selection criteria: the coordinate with the maximum step size: $i^* = \arg\max_i |\delta_i^*|$; the coordinate with the maximum objective function reduction: $i^* = \arg\max_i \left( f(\alpha) - f(\alpha + \delta_i^* e_i) \right)$; the coordinate with the maximum projected gradient; ...

16 Greedy Coordinate Descent How to compute the optimal coordinate? Closed-form solution for the best $\delta$: $\delta_i^* = \max\!\left(-\alpha_i, \min\!\left(C - \alpha_i, \frac{1 - (Q\alpha)_i}{Q_{ii}}\right)\right)$. Observations: 1. Computing all $\delta_i^*$ from scratch needs $O(n^2)$ time. 2. If $Q\alpha$ is stored in memory, computing all $\delta_i^*$ only needs $O(n)$ time, and maintaining $Q\alpha$ also needs only $O(n)$ time after each update.

17 Greedy Coordinate Descent Initialize $\alpha$, $z = Q\alpha$. For $t = 1, 2, \dots$: for all $i = 1, \dots, n$, compute $\delta_i^* = \max\!\left(-\alpha_i, \min\!\left(C - \alpha_i, \frac{1 - z_i}{Q_{ii}}\right)\right)$; let $i^* = \arg\max_i |\delta_i^*|$; update $\alpha_{i^*} \leftarrow \alpha_{i^*} + \delta_{i^*}^*$ and $z \leftarrow z + q_{i^*} \delta_{i^*}^*$ ($q_{i^*}$ is the $i^*$-th column of $Q$). (This is a simplified version of the Sequential Minimal Optimization (SMO) algorithm proposed in Platt et al., 1998.) (A similar version is implemented in LIBSVM.)
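The algorithm above can be sketched in a few lines of NumPy (variable names are mine; a real implementation such as LIBSVM also handles shrinking, caching, and the bias term):

```python
import numpy as np

def greedy_cd_svm_dual(Q, C=1.0, max_iter=1000, tol=1e-8):
    """Greedy coordinate descent for min_a 0.5 a^T Q a - e^T a, 0 <= a_i <= C.
    Maintaining z = Q a makes each iteration O(n) instead of O(n^2)."""
    n = Q.shape[0]
    alpha = np.zeros(n)
    z = np.zeros(n)                      # z = Q @ alpha, kept up to date
    diag = np.diag(Q)                    # assumed nonzero (e.g. Gaussian kernel)
    for _ in range(max_iter):
        # closed-form optimal step for every coordinate, O(n) given z
        delta = np.maximum(-alpha, np.minimum(C - alpha, (1.0 - z) / diag))
        i = int(np.argmax(np.abs(delta)))
        if abs(delta[i]) < tol:          # no coordinate can improve: done
            break
        alpha[i] += delta[i]
        z += delta[i] * Q[:, i]          # O(n) update of z = Q @ alpha
    return alpha
```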

18 How to solve problems with millions of samples? $Q \in \mathbb{R}^{n \times n}$ cannot be fully stored. Keep a fixed-size memory buffer to cache the computed columns of $Q$. For each coordinate update: if $q_i$ is in memory, use it directly; if $q_i$ is not in memory: 1. Evict the least recently used column. 2. Recompute $q_i$ and store it in memory. 3. Update $\alpha_i$.
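The caching scheme can be sketched with an `OrderedDict` (a toy illustration; LIBSVM's actual cache is a hand-written C++ structure, and `kernel_col` here is a hypothetical helper that computes one column of Q):

```python
from collections import OrderedDict

class KernelColumnCache:
    """Fixed-capacity LRU cache for columns q_i of the kernel matrix Q.
    On a miss, the least recently used column is evicted and q_i recomputed."""

    def __init__(self, kernel_col, capacity):
        self.kernel_col = kernel_col      # i -> i-th column of Q (expensive)
        self.capacity = capacity
        self._cache = OrderedDict()
        self.misses = 0

    def column(self, i):
        if i in self._cache:
            self._cache.move_to_end(i)    # hit: mark as most recently used
            return self._cache[i]
        self.misses += 1
        if len(self._cache) >= self.capacity:
            self._cache.popitem(last=False)   # evict least recently used
        q_i = self.kernel_col(i)
        self._cache[i] = q_i
        return q_i
```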

19 LIBSVM These techniques are implemented in LIBSVM. Other functionalities: multi-class classification, support vector regression, cross-validation.

20 Kernel Approximation

21 Kernel Approximation Kernel methods are hard to scale up because the kernel matrix $G$ is an n-by-n matrix. Can we approximate the kernel using a low-rank representation? Two main algorithms: Nystrom approximation: approximate the kernel matrix. Random features: approximate the kernel function.

22 Kernel Matrix Approximation We want to form a low-rank approximation of $G \in \mathbb{R}^{n \times n}$, where $G_{ij} = K(x_i, x_j)$. Can we do SVD?

23 Kernel Matrix Approximation Can we do SVD? No: SVD needs to form the n-by-n matrix, which takes $O(n^2)$ space and $O(n^3)$ time.

24 Kernel Matrix Approximation Can we do matrix completion or matrix sketching?

25 Kernel Matrix Approximation Matrix completion and sketching avoid forming all of $G$, but the main problem remains: we need to generalize to new points.

26 Nystrom Approximation Nystrom approximation: randomly sample m columns (and the corresponding rows) of $G$: $G = \begin{bmatrix} W & G_{12} \\ G_{21} & G_{22} \end{bmatrix}$, where $W$ is an m-by-m square matrix (observed), $G_{21}$ is an (n-m)-by-m matrix (observed), $G_{12} = G_{21}^T$ (observed), and $G_{22}$ is an (n-m)-by-(n-m) matrix (not observed). Form a kernel approximation based on these mn elements: $\bar{G} = \begin{bmatrix} W \\ G_{21} \end{bmatrix} W^\dagger \begin{bmatrix} W & G_{12} \end{bmatrix}$, where $W^\dagger$ is the pseudo-inverse of $W$.

27 Nystrom Approximation Why $W^\dagger$? It exactly recovers the top-left m-by-m block. The kernel approximation: $\begin{bmatrix} W & G_{12} \\ G_{21} & G_{22} \end{bmatrix} \approx \begin{bmatrix} W \\ G_{21} \end{bmatrix} W^\dagger \begin{bmatrix} W & G_{12} \end{bmatrix} = \begin{bmatrix} W & G_{12} \\ G_{21} & G_{21} W^\dagger G_{12} \end{bmatrix}$.

28 Nystrom Approximation Algorithm: 1. Sample m landmark points $v_1, \dots, v_m$. 2. Compute the kernel values between all training data and the landmark points: $C = \begin{bmatrix} K(x_1, v_1) & \cdots & K(x_1, v_m) \\ \vdots & & \vdots \\ K(x_n, v_1) & \cdots & K(x_n, v_m) \end{bmatrix} = \begin{bmatrix} W \\ G_{21} \end{bmatrix}$. 3. Form the kernel approximation $\bar{G} = C W^\dagger C^T$ (no need to explicitly form the n-by-n matrix). Time complexity: mn kernel evaluations ($O(mnd)$ time for classical kernels). How to choose m? It is a trade-off between accuracy vs. computational time and memory space.
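The three steps above can be sketched as follows (names are mine; `kernel` is any function returning the pairwise kernel matrix of its two arguments):

```python
import numpy as np

def nystrom(X, m, kernel, seed=0):
    """Nystrom approximation G_hat = C W^+ C^T from m random landmarks.
    Only n*m kernel values are ever computed, never the full n-by-n matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    V = X[idx]                    # landmark points v_1, ..., v_m
    C = kernel(X, V)              # n x m block, equal to [W; G_21]
    W = kernel(V, V)              # m x m block
    return C, np.linalg.pinv(W), idx
```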

29 Nystrom Approximation: Training Solve the dual problem using the low-rank representation. Example: kernel ridge regression $\min_\alpha \alpha^T \bar{G} \alpha + \lambda \|\alpha\|^2 - 2 y^T \alpha$, i.e., solve the linear system $(\bar{G} + \lambda I)\alpha = y$. Use iterative methods (such as CG) with fast matrix-vector multiplication: $(\bar{G} + \lambda I)p = C W^\dagger C^T p + \lambda p$, which gives $O(nm)$ time complexity per iteration.
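With SciPy this is a few lines: wrap the product $p \mapsto C W^\dagger C^T p + \lambda p$ in a `LinearOperator` and hand it to `cg` (a sketch, assuming C and $W^\dagger$ have been computed as on the previous slide):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def krr_nystrom_cg(C, W_pinv, y, lam):
    """Solve (G_hat + lam I) alpha = y with G_hat = C W^+ C^T by CG.
    Each matrix-vector product costs O(nm); the n x n matrix is never formed."""
    n = C.shape[0]
    def matvec(p):
        return C @ (W_pinv @ (C.T @ p)) + lam * p
    A = LinearOperator((n, n), matvec=matvec)
    alpha, info = cg(A, y)
    return alpha, info
```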

30 Nystrom Approximation: Training Another approach: reduce to a linear ERM problem! Rewrite $\bar{G} = C W^\dagger C^T = C (W^\dagger)^{1/2} (W^\dagger)^{1/2} C^T$, so the approximate kernel can be represented by a linear kernel with feature matrix $C (W^\dagger)^{1/2}$: $\bar{K}(x_i, x_j) = \bar{G}_{ij} = u_i^T u_j$, where $u_j = (W^\dagger)^{1/2} [K(x_j, v_1), \dots, K(x_j, v_m)]^T$. The problem is then equivalent to a linear model with features $u_1, \dots, u_n \in \mathbb{R}^m$: kernel SVM becomes linear SVM with m features; kernel ridge regression becomes linear ridge regression with m features.

31 Nystrom Approximation: Prediction Given the dual solution $\alpha$, which prediction should we use for a testing sample x: $\sum_{i=1}^n \alpha_i \bar{K}(x_i, x)$ or $\sum_{i=1}^n \alpha_i K(x_i, x)$?

32 Nystrom Approximation: Prediction Given the dual solution $\alpha$, predict with $\sum_{i=1}^n \alpha_i \bar{K}(x_i, x)$: we need to use the approximate kernel instead of the original kernel! The approximate kernel gives much better accuracy, because training and testing should use the same kernel. Generalization bounds: (Alaoui and Mahoney, Fast Randomized Kernel Methods With Statistical Guarantees); (Rudi et al., Less is More: Nystrom Computational Regularization. NIPS). Using a small number of landmark points can be viewed as a form of regularization.

33 Nystrom Approximation: Prediction How to define $\bar{K}(x, x_i)$ for a new point x? Extend the Nystrom approximation to the test point: compute the kernel values between x and the landmark points, $c(x) = [K(x, v_1), \dots, K(x, v_m)]^T$ (m kernel evaluations), and form $u = (W^\dagger)^{1/2} c(x)$. The approximate kernel is then defined by $\bar{K}(x, x_i) = u^T u_i$.

34 Nystrom Approximation: Prediction Prediction: $f(x) = \sum_{i=1}^n \alpha_i u^T u_i = u^T \left( \sum_{i=1}^n \alpha_i u_i \right)$. Since $\sum_{i=1}^n \alpha_i u_i$ can be pre-computed, the prediction time is m kernel evaluations. The original kernel method needs n kernel evaluations per prediction. Summary: original kernel: exact quality, $n^{1.x}$ kernel evaluations + kernel training, n kernel evaluations per prediction; Nystrom with m landmarks: rank-m quality, nm kernel evaluations + linear training, m kernel evaluations per prediction.

35 Nystrom Approximation How to select the landmark points $v_1, \dots, v_m$? Traditional approach: uniform random sampling from the training data. Importance sampling (leverage scores): Drineas and Mahoney, On the Nystrom Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. JMLR; Gittens and Mahoney, Revisiting the Nystrom Method for Improved Large-Scale Machine Learning. ICML 2013. K-means sampling (performs extremely well): run k-means clustering and set $v_1, \dots, v_m$ to be the cluster centers; Zhang et al., Improved Nystrom Low-Rank Approximation and Error Analysis. ICML. Subspace distance: Lim et al., Double Nystrom Method: An Efficient and Accurate Nystrom Scheme for Large-Scale Data Sets. ICML 2015.

36 Nystrom Approximation (other related papers) Distributed setting: (Kumar et al., Ensemble Nystrom Method. NIPS 2009). Block-diagonal + low-rank approximation: (Si et al., Memory Efficient Kernel Approximation. ICML). Nystrom method for fast prediction: (Hsieh et al., Fast Prediction for Large-Scale Kernel Machines. NIPS). Structured landmark points: (Si et al., Computationally Efficient Nystrom Approximation using Fast Transforms. ICML).

37 Kernel Approximation (random features)

38 Random Features Directly approximate the kernel function (data-independent). Generate nonlinear features and reduce the problem to linear model training. Several examples: Random Fourier features: (Rahimi and Recht, Random Features for Large-Scale Kernel Machines. NIPS 2007). Random features for polynomial kernels: (Kar and Karnick, Random Feature Maps for Dot Product Kernels. AISTATS).

39 Random Fourier Features Random features for shift-invariant kernels: a continuous kernel is shift-invariant if $K(x, y) = k(x - y)$. Gaussian kernel: $K(x, y) = e^{-\|x - y\|^2}$. Laplacian kernel: $K(x, y) = e^{-\|x - y\|_1}$. A shift-invariant kernel (if positive definite) can be written as $k(x - y) = \int_{\mathbb{R}^d} P(w) e^{j w^T (x - y)} \, dw$ for some probability distribution $P(w)$ (Bochner's theorem).

40 Random Fourier Features Taking the real part, we have $k(x - y) = \int_{\mathbb{R}^d} P(w) \cos(w^T x - w^T y) \, dw = \int_{\mathbb{R}^d} P(w) \left( \cos(w^T x)\cos(w^T y) + \sin(w^T x)\sin(w^T y) \right) dw = E_{w \sim P(w)}[z_w(x)^T z_w(y)]$, where $z_w(x)^T = [\cos(w^T x), \sin(w^T x)]$. Thus $z_w(x)^T z_w(y)$ is an unbiased estimator of $K(x, y)$ if w is sampled from $P(\cdot)$.

41 Random Fourier Features Sample m vectors $w_1, \dots, w_m$ from the distribution $P(\cdot)$. Generate random features for each sample: $u_i = \frac{1}{\sqrt{m}} [\sin(w_1^T x_i), \cos(w_1^T x_i), \sin(w_2^T x_i), \cos(w_2^T x_i), \dots, \sin(w_m^T x_i), \cos(w_m^T x_i)]^T$, so that $K(x_i, x_j) \approx u_i^T u_j$ (larger m leads to a better approximation).
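A minimal sketch for the Gaussian kernel $e^{-\gamma\|x-y\|^2}$, whose spectral distribution $P(w)$ is $N(0, 2\gamma I)$ (the $1/\sqrt{m}$ scaling makes $u_i^T u_j$ an average of the m unbiased estimates):

```python
import numpy as np

def rff_features(X, m, gamma, seed=0):
    """Random Fourier features: exp(-gamma ||x-y||^2) is approximated by
    u(x)^T u(y), with w_1, ..., w_m drawn from P(w) = N(0, 2*gamma*I)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, X.shape[1]))
    Z = X @ W.T                                  # Z[i, k] = w_k^T x_i
    # 2m features per sample; 1/sqrt(m) averages the m unbiased estimates
    return np.hstack([np.sin(Z), np.cos(Z)]) / np.sqrt(m)
```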

42 Random Features $G \approx U U^T$, where $U = [u_1, \dots, u_n]^T$. This is a rank-2m approximation. The problem can be reduced to linear classification/regression on $u_1, \dots, u_n$. Time complexity: $O(nmd)$ time + linear training (Nystrom: $O(nmd + m^3)$ time + linear training). Prediction time: $O(md)$ per sample (Nystrom: $O(md)$ per sample).

43 Fastfood Random Fourier features require $O(md)$ time to generate m features. Main bottleneck: computing $w_i^T x$ for all $i = 1, \dots, m$. Can we compute this in $O(m \log d)$ time?

44 Fastfood Random Fourier features require $O(md)$ time to generate m features; the main bottleneck is computing $w_i^T x$ for all $i = 1, \dots, m$. Can we compute this in $O(m \log d)$ time? Fastfood, proposed in (Le et al., Fastfood: Approximating Kernel Expansions in Loglinear Time. ICML), reduces the time by using structured random features.

45 Fastfood Tool: fast matrix-vector multiplication for the Hadamard matrix $H_d \in \mathbb{R}^{d \times d}$: $H_d x$ requires only $O(d \log d)$ time. Hadamard matrix: $H_1 = [1]$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}$. $H_d x$ can be computed efficiently by the fast Hadamard transform (a recursive divide-and-conquer scheme, similar to the FFT).
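The recursion above directly yields an in-place $O(d \log d)$ algorithm (a standard sketch of the fast Walsh-Hadamard transform, not specific to Fastfood):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: returns H_d @ x in O(d log d) time,
    where d = len(x) must be a power of two (Sylvester ordering of H_d)."""
    x = np.asarray(x, dtype=float).copy()
    d = len(x)
    h = 1
    while h < d:
        for start in range(0, d, 2 * h):
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = a + b           # top half: sums
            x[start + h:start + 2 * h] = a - b   # bottom half: differences
        h *= 2
    return x
```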

46 Fastfood Sample m = d random features via $W = \mathrm{diag}(v_1, v_2, \dots, v_d) \, H_d$, so that $W x$ only needs $O(d \log d)$ time. Instead of sampling all of $W$ ($d^2$ elements), we just sample $v_1, \dots, v_d$ from $N(0, \sigma^2)$. Each row of $W$ still has $N(0, \sigma^2)$ marginals (but the rows are not independent). Faster computation, but the performance is slightly worse than original random features.
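The slide's simplified construction can be written directly (an illustrative sketch with an explicit $H_d$; a real Fastfood implementation replaces the matrix product with the fast Hadamard transform and uses additional permutation and scaling matrices from the paper):

```python
import numpy as np
from scipy.linalg import hadamard

def structured_projection(x, v):
    """Compute W x for W = diag(v) H_d, where only the d entries of v are
    sampled (e.g. v_i ~ N(0, sigma^2)) instead of the d^2 entries of W.
    With a fast Hadamard transform, H_d x costs O(d log d) rather than O(d^2)."""
    d = len(x)
    return v * (hadamard(d) @ x)
```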

47 Extensions Fastfood: structured random features to improve computational speed: (Le et al., Fastfood: Approximating Kernel Expansions in Loglinear Time. ICML). Structured landmark points for Nystrom approximation: (Si et al., Computationally Efficient Nystrom Approximation using Fast Transforms. ICML). Doubly stochastic gradient with random features: (Dai et al., Scalable Kernel Methods via Doubly Stochastic Gradients. NIPS). Generalization bound (?)

48 Coming up Choose the paper for presentation (due this Sunday, Oct 23). Questions?


More information

Fast Kernel Classifiers with Online and Active Learning

Fast Kernel Classifiers with Online and Active Learning Journal of Machine Learning Research 6 (2005) 1579 1619 Submitted 3/05; Published 9/05 Fast Kernel Classifiers with Online and Active Learning Antoine Bordes NEC Laboratories America 4 Independence Way

More information

Lecture 6: Logistic Regression

Lecture 6: Logistic Regression Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

More information

Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

More information

Evaluation of Machine Learning Techniques for Green Energy Prediction

Evaluation of Machine Learning Techniques for Green Energy Prediction arxiv:1406.3726v1 [cs.lg] 14 Jun 2014 Evaluation of Machine Learning Techniques for Green Energy Prediction 1 Objective Ankur Sahai University of Mainz, Germany We evaluate Machine Learning techniques

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Introduction: Overview of Kernel Methods

Introduction: Overview of Kernel Methods Introduction: Overview of Kernel Methods Statistical Data Analysis with Positive Definite Kernels Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Classification using Intersection Kernel Support Vector Machines is Efficient

Classification using Intersection Kernel Support Vector Machines is Efficient To Appear in IEEE Computer Vision and Pattern Recognition 28, Anchorage Classification using Intersection Kernel Support Vector Machines is Efficient Subhransu Maji EECS Department U.C. Berkeley smaji@eecs.berkeley.edu

More information

Support Vector Machines

Support Vector Machines CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best)

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning Lecture 4: SMO-MKL S.V. N. (vishy) Vishwanathan Purdue University vishy@purdue.edu July 11, 2012 S.V. N. Vishwanathan (Purdue University) Optimization for Machine Learning

More information

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers Francis R. Bach Computer Science Division University of California Berkeley, CA 9472 fbach@cs.berkeley.edu Abstract

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

Tree based ensemble models regularization by convex optimization

Tree based ensemble models regularization by convex optimization Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000

More information

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

More information

Large Scale Learning to Rank

Large Scale Learning to Rank Large Scale Learning to Rank D. Sculley Google, Inc. dsculley@google.com Abstract Pairwise learning to rank methods such as RankSVM give good performance, but suffer from the computational burden of optimizing

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

Multiclass Classification. 9.520 Class 06, 25 Feb 2008 Ryan Rifkin

Multiclass Classification. 9.520 Class 06, 25 Feb 2008 Ryan Rifkin Multiclass Classification 9.520 Class 06, 25 Feb 2008 Ryan Rifkin It is a tale Told by an idiot, full of sound and fury, Signifying nothing. Macbeth, Act V, Scene V What Is Multiclass Classification? Each

More information

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

O(n 3 ) O(4 d nlog 2d n) O(n) O(4 d log 2d n)

O(n 3 ) O(4 d nlog 2d n) O(n) O(4 d log 2d n) O(n 3 )O(4 d nlog 2d n) O(n)O(4 d log 2d n) O(n 3 ) O(n) O(n 3 ) O(4 d nlog 2d n) O(n)O(4 d log 2d n) 1. Fast Gaussian Filtering Algorithm 2. Binary Classification Using The Fast Gaussian Filtering Algorithm

More information

Learning to Find Pre-Images

Learning to Find Pre-Images Learning to Find Pre-Images Gökhan H. Bakır, Jason Weston and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics Spemannstraße 38, 72076 Tübingen, Germany {gb,weston,bs}@tuebingen.mpg.de

More information

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1

More information

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1 5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1 General Integer Linear Program: (ILP) min c T x Ax b x 0 integer Assumption: A, b integer The integrality condition

More information

Working Set Selection Using Second Order Information for Training Support Vector Machines

Working Set Selection Using Second Order Information for Training Support Vector Machines Journal of Machine Learning Research 6 (25) 889 98 Submitted 4/5; Revised /5; Published /5 Working Set Selection Using Second Order Information for Training Support Vector Machines Rong-En Fan Pai-Hsuen

More information

Scalable Developments for Big Data Analytics in Remote Sensing

Scalable Developments for Big Data Analytics in Remote Sensing Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,

More information

Early defect identification of semiconductor processes using machine learning

Early defect identification of semiconductor processes using machine learning STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

A Study on SMO-type Decomposition Methods for Support Vector Machines

A Study on SMO-type Decomposition Methods for Support Vector Machines 1 A Study on SMO-type Decomposition Methods for Support Vector Machines Pai-Hsuen Chen, Rong-En Fan, and Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 106, Taiwan cjlin@csie.ntu.edu.tw

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Application Scenario: Recommender

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis Genomic, Proteomic and Transcriptomic Lab High Performance Computing and Networking Institute National Research Council, Italy Mathematical Models of Supervised Learning and their Application to Medical

More information

The Many Facets of Big Data

The Many Facets of Big Data Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong ACPR 2013 Big Data 1 volume sample size is big feature dimensionality is big 2 variety multiple formats:

More information

Large-scale Machine Learning in Distributed Environments

Large-scale Machine Learning in Distributed Environments Large-scale Machine Learning in Distributed Environments Chih-Jen Lin National Taiwan University ebay Research Labs Tutorial at ACM ICMR, June 5, 2012 Chih-Jen Lin (National Taiwan Univ.) 1 / 105 Outline

More information