Big Data Analytics: Optimization and Randomization

1 Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of Management Sciences, The University of Iowa, IA, USA Department of Computer Science and Engineering, Michigan State University, MI, USA Institute of Data Science and Technologies at Alibaba Group, Seattle, USA August 10, 2015 Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

2 URL tyng/kdd15-tutorial.pdf Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

3 Some Claims This tutorial is not an exhaustive literature survey. It is not a survey on different machine learning/data mining algorithms. It is about how to efficiently solve machine learning/data mining (formulated as optimization) problems for big data. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

4 Outline Part I: Basics Part II: Optimization Part III: Randomization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

5 Big Data Analytics: Optimization and Randomization Part I: Basics Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

6 Basics Introduction Outline 1 Basics Introduction Notations and Definitions Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

7 Basics Introduction Three Steps for Machine Learning: Data, Model, Optimization. [figure: distance to optimal objective vs. iterations for different convergence rates] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

8 Basics Introduction Big Data Challenge Big Data Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

9 Basics Introduction Big Data Challenge Big Model 60 million parameters Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

10 Basics Introduction Learning as Optimization Ridge Regression Problem: min_{w in R^d} (1/n) sum_{i=1}^n (y_i - w^T x_i)^2 + (lambda/2)||w||_2^2, where the first term is the Empirical Loss and the second term is the Regularization. x_i in R^d: d-dimensional feature vector; y_i in R: target variable; w in R^d: model parameters; n: number of data points. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
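A minimal NumPy sketch of the ridge regression objective above and its closed-form minimizer (set the gradient to zero). The data, variable names, and lambda value are illustrative assumptions, not from the tutorial.

```python
import numpy as np

# Ridge regression: min_w (1/n) sum_i (y_i - w^T x_i)^2 + (lam/2) ||w||_2^2
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                  # rows are feature vectors x_i
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.1

def ridge_objective(w):
    return np.mean((y - X @ w) ** 2) + 0.5 * lam * np.dot(w, w)

# Setting the gradient (2/n) X^T (Xw - y) + lam*w to zero gives
#   w* = (2/n X^T X + lam I)^{-1} (2/n X^T y)
w_star = np.linalg.solve(2.0 / n * X.T @ X + lam * np.eye(d), 2.0 / n * X.T @ y)
print(ridge_objective(w_star))
```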

13 Basics Introduction Learning as Optimization Classification Problems: min_{w in R^d} (1/n) sum_{i=1}^n l(y_i w^T x_i) + (lambda/2)||w||_2^2, where y_i in {+1, -1} is the label. Loss function l(z) with z = y w^T x: 1. SVMs: (squared) hinge loss l(z) = max(0, 1 - z)^p, where p = 1, 2; 2. Logistic Regression: l(z) = log(1 + exp(-z)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
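A small sketch of the loss functions listed above, written as functions of the margin z = y * w^T x; the function names are illustrative.

```python
import numpy as np

def hinge(z):            # SVM, p = 1 (non-smooth)
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):    # SVM, p = 2 (smooth)
    return np.maximum(0.0, 1.0 - z) ** 2

def logistic(z):         # logistic regression (smooth)
    return np.log1p(np.exp(-z))

z = np.linspace(-2, 2, 5)
print(hinge(z), squared_hinge(z), logistic(z))
```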

14 Basics Introduction Learning as Optimization Feature Selection: min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + lambda ||w||_1. l_1 regularization: ||w||_1 = sum_{i=1}^d |w_i|; lambda controls the sparsity level. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

15 Basics Introduction Learning as Optimization Feature Selection using Elastic Net: min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + lambda (||w||_1 + gamma ||w||_2^2). The elastic net regularizer is more robust than the l_1 regularizer. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

16 Basics Introduction Learning as Optimization Multi-class/Multi-task Learning with W in R^{K x d}: min_W (1/n) sum_{i=1}^n l(W x_i, y_i) + lambda r(W). r(W) = ||W||_F^2 = sum_{k=1}^K sum_{j=1}^d W_{kj}^2: Frobenius norm. r(W) = ||W||_* = sum_i sigma_i: nuclear norm (sum of singular values). r(W) = ||W||_{1,inf} = sum_{j=1}^d ||W_{:j}||_inf: l_{1,inf} mixed norm. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

17 Basics Introduction Learning as Optimization Regularized Empirical Loss Minimization: min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w), where both l and R are convex functions. Extensions to matrix cases are possible (sometimes straightforward). Extensions to kernel methods can be combined with randomized approaches. Extensions to non-convex problems (e.g., deep learning) are in progress. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

18 Basics Introduction Data Matrices and Machine Learning The Instance-feature Matrix: X in R^{n x d}, X = [x_1, x_2, ..., x_n]^T (one row per instance). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

19 Basics Introduction Data Matrices and Machine Learning The output vector: y = (y_1, y_2, ..., y_n)^T in R^{n x 1}. Continuous y_i in R: regression (e.g., house price). Discrete, e.g., y_i in {1, 2, 3}: classification (e.g., species of iris). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

20 Basics Introduction Data Matrices and Machine Learning The Instance-Instance Matrix: K R n n Similarity Matrix Kernel Matrix Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

21 Basics Introduction Data Matrices and Machine Learning Some machine learning tasks are formulated on the kernel matrix Clustering Kernel Methods Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

22 Basics Introduction Data Matrices and Machine Learning The Feature-Feature Matrix: C R d d Covariance Matrix Distance Metric Matrix Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

23 Basics Introduction Data Matrices and Machine Learning Some machine learning tasks require the covariance matrix: Principal Component Analysis; Top-k Singular Value (Eigen-Value) Decomposition of the Covariance Matrix. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

24 Basics Introduction Why is Learning from Big Data Challenging? High per-iteration cost High memory cost High communication cost Large iteration complexity Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

25 Basics Notations and Definitions Outline 1 Basics Introduction Notations and Definitions Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

26 Basics Notations and Definitions Norms Vector x in R^d. Euclidean vector norm: ||x||_2 = sqrt(x^T x) = sqrt(sum_{i=1}^d x_i^2). l_p-norm of a vector: ||x||_p = (sum_{i=1}^d |x_i|^p)^{1/p} where p >= 1. 1. l_2 norm ||x||_2 = (sum_{i=1}^d x_i^2)^{1/2}; 2. l_1 norm ||x||_1 = sum_{i=1}^d |x_i|; 3. l_inf norm ||x||_inf = max_i |x_i|. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

29 Basics Notations and Definitions Matrix Factorization Matrix X in R^{n x d}. Singular Value Decomposition X = U Sigma V^T: 1. U in R^{n x r}: orthonormal columns (U^T U = I), span the column space; 2. Sigma in R^{r x r}: diagonal matrix, Sigma_{ii} = sigma_i > 0, sigma_1 >= sigma_2 >= ... >= sigma_r; 3. V in R^{d x r}: orthonormal columns (V^T V = I), span the row space; 4. r <= min(n, d): the largest value such that sigma_r > 0, the rank of X; 5. U_k Sigma_k V_k^T: top-k approximation. Pseudo inverse: X^+ = V Sigma^{-1} U^T. QR factorization: X = QR (n >= d), Q in R^{n x d}: orthonormal columns, R in R^{d x d}: upper triangular matrix. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

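A NumPy sketch of the factorizations above (thin SVD, top-k approximation, pseudo-inverse, QR); the example matrix is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))

# Thin SVD: X = U diag(s) V^T, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_top_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # top-k (best rank-k) approximation

X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T            # pseudo-inverse X^+ = V Sigma^{-1} U^T
print(np.allclose(X_pinv, np.linalg.pinv(X)))

Q, R = np.linalg.qr(X)                            # Q orthonormal columns, R upper triangular
print(np.allclose(Q @ R, X))
```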
32 Basics Notations and Definitions Norms Matrix X in R^{n x d}. Frobenius norm: ||X||_F = sqrt(tr(X^T X)) = sqrt(sum_{i=1}^n sum_{j=1}^d X_{ij}^2). Spectral (induced) norm of a matrix: ||X||_2 = max_{||u||_2 = 1} ||Xu||_2 = sigma_1 (maximum singular value). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

34 Basics Notations and Definitions Convex Optimization min_{x in X} f(x). X is a convex domain: for any x, y in X, their convex combination alpha*x + (1-alpha)*y is in X. f(x) is a convex function. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

35 Basics Notations and Definitions Convex Function Characterization of a convex function: f(alpha*x + (1-alpha)*y) <= alpha*f(x) + (1-alpha)*f(y) for all x, y in X, alpha in [0, 1]; f(x) >= f(y) + grad f(y)^T (x - y) for all x, y in X. A local optimum is a global optimum. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

37 Basics Notations and Definitions Convex vs Strongly Convex Convex function: f(x) >= f(y) + grad f(y)^T (x - y) for all x, y in X. lambda-Strongly Convex function (lambda: strong convexity constant): f(x) >= f(y) + grad f(y)^T (x - y) + (lambda/2)||x - y||_2^2 for all x, y in X. The global optimum is unique. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

39 Basics Notations and Definitions Non-smooth function vs Smooth function Non-smooth function: Lipschitz continuous with constant G, e.g. absolute loss f(x) = |x|: |f(x) - f(y)| <= G||x - y||_2; subgradient: f(x) >= f(y) + partial f(y)^T (x - y). Smooth function: smoothness constant L, e.g. logistic loss f(x) = log(1 + exp(-x)): ||grad f(x) - grad f(y)||_2 <= L||x - y||_2. [figures: a non-smooth function with a subgradient at x; the logistic loss f(x) between a quadratic upper bound and the linear lower bound f(y) + f'(y)(x - y)] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

42 Basics Notations and Definitions Next... Part II: Optimization min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w): stochastic optimization and distributed optimization. Reduce Iteration Complexity: utilizing properties of functions. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

43 Basics Notations and Definitions Next... Part III: Randomization Classification, Regression SVD, K-means, Kernel methods Reduce Data Size: utilizing properties of data Please stay tuned! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

44 Optimization Big Data Analytics: Optimization and Randomization Part II: Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

45 Optimization (Sub)Gradient Methods Outline 2 Optimization (Sub)Gradient Methods Stochastic Optimization Algorithms for Big Data Stochastic Optimization Distributed Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

46 Optimization (Sub)Gradient Methods Learning as Optimization Regularized Empirical Loss Minimization: min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

50 Optimization (Sub)Gradient Methods Convergence Measure Most optimization algorithms are iterative: w_{t+1} = w_t + Delta w_t. Iteration Complexity: the number of iterations T(eps) needed to have F(w_T) - min_w F(w) <= eps (eps << 1). Convergence Rate: after T iterations, how good is the solution: F(w_T) - min_w F(w) <= eps(T). Total Runtime = Per-iteration Cost x Iteration Complexity. [figure: objective value decreasing to within eps after T iterations] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

51 Optimization (Sub)Gradient Methods More on Convergence Measure Big O(.) notation: explicit dependence on T or eps. Convergence Rate vs Iteration Complexity: linear, O(mu^T) (mu < 1) corresponds to O(log(1/eps)); sub-linear, O(1/T^alpha) (alpha > 0) corresponds to O(1/eps^{1/alpha}). Why are we interested in Bounds? Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

56 Optimization (Sub)Gradient Methods More on Convergence Measure Convergence Rate vs Iteration Complexity: linear O(mu^T) (mu < 1) corresponds to O(log(1/eps)); sub-linear O(1/T^alpha) (alpha > 0) corresponds to O(1/eps^{1/alpha}). [figure: distance to optimum vs. iterations T; a linear 0.5^T rate reaches the optimum in seconds, 1/T in minutes, 1/sqrt(T) in hours] Theoretically, we consider O(mu^T) <= O(1/T^2) <= O(1/T) <= O(1/sqrt(T)), i.e. log(1/eps) <= 1/sqrt(eps) <= 1/eps <= 1/eps^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

57 Optimization (Sub)Gradient Methods Non-smooth V.S. Smooth Non-smooth l(z): hinge loss l(w^T x, y) = max(0, 1 - y w^T x); absolute loss l(w^T x, y) = |w^T x - y|. Smooth l(z): squared hinge loss l(w^T x, y) = max(0, 1 - y w^T x)^2; logistic loss l(w^T x, y) = log(1 + exp(-y w^T x)); square loss l(w^T x, y) = (w^T x - y)^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

58 Optimization (Sub)Gradient Methods Strongly convex V.S. Non-strongly convex lambda-strongly convex R(w): l_2 regularizer (lambda/2)||w||_2^2; elastic net regularizer tau||w||_1 + (lambda/2)||w||_2^2. Non-strongly convex R(w): unregularized problem R(w) = 0; l_1 regularizer tau||w||_1. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

59 Optimization (Sub)Gradient Methods Gradient Method in Machine Learning F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Suppose l(z) is smooth. Full gradient: grad F(w) = (1/n) sum_{i=1}^n grad l(w^T x_i, y_i) + lambda*w. Per-iteration cost: O(nd). Gradient Descent (gamma_t: step size): w_t = w_{t-1} - gamma_t grad F(w_{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

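A minimal gradient-descent sketch for the l_2-regularized logistic loss; the data, step size, and iteration count are illustrative assumptions, not recommendations from the tutorial.

```python
import numpy as np

# Gradient descent on F(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

def full_gradient(w):
    z = y * (X @ w)                     # margins y_i * w^T x_i
    coef = -y / (1.0 + np.exp(z))       # derivative of log(1+exp(-z)) w.r.t. z is -1/(1+exp(z))
    return X.T @ coef / n + lam * w     # O(nd) per iteration

w = np.zeros(d)
gamma = 0.1
for t in range(500):
    w -= gamma * full_gradient(w)
```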
61 Optimization (Sub)Gradient Methods Gradient Method in Machine Learning F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/eps). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O((1/lambda) log(1/eps)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

62 Optimization (Sub)Gradient Methods Accelerated Gradient Method Accelerated Gradient Descent: w_t = v_{t-1} - gamma_t grad F(v_{t-1}); v_t = w_t + eta_t (w_t - w_{t-1}) (momentum step). w_t is the output and v_t is an auxiliary sequence. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

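A sketch of the accelerated (momentum) scheme above. The momentum schedule eta_t = (t-1)/(t+2) is a common choice and an assumption here, as are the step size and function names.

```python
import numpy as np

def accelerated_gd(grad_F, w0, gamma, T):
    """Gradient step at the auxiliary point v, then a momentum step."""
    w_prev = w0.copy()
    w = w0.copy()
    for t in range(1, T + 1):
        eta = (t - 1.0) / (t + 2.0)      # illustrative momentum schedule
        v = w + eta * (w - w_prev)       # momentum step
        w_prev = w
        w = v - gamma * grad_F(v)        # gradient step at the auxiliary point
    return w

# usage (with the full_gradient sketch above): accelerated_gd(full_gradient, np.zeros(10), 0.1, 500)
```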
64 Optimization (Sub)Gradient Methods Accelerated Gradient Method F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/sqrt(eps)), better than O(1/eps). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O((1/sqrt(lambda)) log(1/eps)), better than O((1/lambda) log(1/eps)) for small lambda. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

65 Optimization (Sub)Gradient Methods Deal with l_1 regularizer Consider a more general case: min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R'(w) + tau||w||_1, with R(w) = R'(w) + tau||w||_1, where R'(w) is lambda-strongly convex and smooth. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

67 Optimization (Sub)Gradient Methods Deal with l_1 regularizer Accelerated Gradient Descent with a proximal mapping: w_t = arg min_{w in R^d} grad F'(v_{t-1})^T w + (1/(2 gamma_t))||w - v_{t-1}||_2^2 + tau||w||_1 (where F' denotes the smooth part of F); v_t = w_t + eta_t (w_t - w_{t-1}). The proximal mapping has a closed-form solution: soft-thresholding. Iteration complexity and runtime remain unchanged. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

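A minimal sketch of the soft-thresholding operator and the resulting proximal step; function and variable names are illustrative.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Closed-form proximal mapping of thresh * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def proximal_step(v, grad_v, gamma, tau):
    # argmin_w  grad_v^T w + (1/(2*gamma)) ||w - v||^2 + tau ||w||_1
    return soft_threshold(v - gamma * grad_v, gamma * tau)
```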
69 Optimization (Sub)Gradient Methods Sub-Gradient Method in Machine Learning F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Suppose l(z) is non-smooth. Full sub-gradient: partial F(w) = (1/n) sum_{i=1}^n partial l(w^T x_i, y_i) + lambda*w. Sub-Gradient Descent: w_t = w_{t-1} - gamma_t partial F(w_{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

71 Optimization (Sub)Gradient Methods Sub-Gradient Method F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/eps^2). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O(1/(lambda*eps)). No efficient acceleration scheme in general. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

72 Optimization (Sub)Gradient Methods Problem Classes and Iteration Complexity min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w). Iteration complexity: non-strongly convex R(w): O(1/eps^2) for non-smooth l, O(1/eps) for smooth l; lambda-strongly convex R(w): O(1/(lambda*eps)) for non-smooth l, O((1/lambda) log(1/eps)) for smooth l. Per-iteration cost: O(nd), too high if n or d are large. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

73 Optimization Stochastic Optimization Algorithms for Big Data Outline 2 Optimization (Sub)Gradient Methods Stochastic Optimization Algorithms for Big Data Stochastic Optimization Distributed Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

74 Optimization Stochastic Optimization Algorithms for Big Data Stochastic First-Order Method by Data Sampling Stochastic Gradient Descent (SGD) Stochastic Variance Reduced Gradient (SVRG) Stochastic Average Gradient Algorithm (SAGA) Stochastic Dual Coordinate Ascent (SDCA) Accelerated Proximal Coordinate Gradient (APCG) Assumption: ||x_i||_2 <= 1 for any i Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

75 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)) F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Full sub-gradient: partial F(w) = (1/n) sum_{i=1}^n partial l(w^T x_i, y_i) + lambda*w. Randomly sample i in {1,...,n}. Stochastic sub-gradient: partial l(w^T x_i, y_i) + lambda*w, with E_i[partial l(w^T x_i, y_i) + lambda*w] = partial F(w). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

76 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)) Applicable in all settings! min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Sample: i_t in {1,...,n}; update: w_t = w_{t-1} - gamma_t (partial l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}); output: w_bar_T = (1/T) sum_{t=1}^T w_t. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

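A basic SGD sketch for the hinge loss with l_2 regularization, following the sample/update/average scheme above. The data and the step-size schedule gamma_t = 1/(lambda*t) are illustrative assumptions.

```python
import numpy as np

def sgd_hinge(X, y, lam, T):
    """SGD for (1/n) sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2, averaged output."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(n)                                        # sample i_t uniformly
        g = -y[i] * X[i] if y[i] * X[i] @ w < 1 else np.zeros(d)   # subgradient of hinge loss
        w -= (1.0 / (lam * t)) * (g + lam * w)                     # stochastic subgradient step
        w_sum += w
    return w_sum / T                                               # averaged iterate w_bar_T
```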
78 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)) F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/eps^2). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O(1/(lambda*eps)). Exactly the same as sub-gradient descent! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

79 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime Per-iteration cost: O(d), much lower than the full gradient method. E.g. hinge loss (SVM): stochastic gradient partial l(w^T x_{i_t}, y_{i_t}) = -y_{i_t} x_{i_t} if 1 - y_{i_t} w^T x_{i_t} > 0, and 0 otherwise. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

80 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w). Iteration complexity: non-strongly convex R(w): O(1/eps^2) for both non-smooth and smooth l; lambda-strongly convex R(w): O(1/(lambda*eps)) for both non-smooth and smooth l. For SGD, only strong convexity helps; smoothness does not make any difference! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

81 Optimization Stochastic Optimization Algorithms for Big Data Full Gradient V.S. Stochastic Gradient Full gradient method needs fewer iterations; stochastic gradient method has lower cost per iteration. For small eps, use full gradient; satisfied with large eps, use stochastic gradient. Full gradient can be accelerated; stochastic gradient cannot. Full gradient's iteration complexity depends on smoothness and strong convexity; stochastic gradient's iteration complexity only depends on strong convexity. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

82 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Applicable when l(z) is smooth and R(w) is lambda-strongly convex. Stochastic gradient: g_{i_t}(w) = grad l(w^T x_{i_t}, y_{i_t}) + lambda*w, with E_{i_t}[g_{i_t}(w)] = grad F(w), but Var[g_{i_t}(w)] does not go to 0 even if w = w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

84 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) Compute the full gradient at a reference point w_ref: grad F(w_ref) = (1/n) sum_{i=1}^n grad l(w_ref^T x_i, y_i) + lambda*w_ref. Stochastic variance reduced gradient: g~_{i_t}(w) = grad F(w_ref) - g_{i_t}(w_ref) + g_{i_t}(w), with E_{i_t}[g~_{i_t}(w)] = grad F(w) and Var[g~_{i_t}(w)] -> 0 as w, w_ref -> w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

85 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) At the optimal solution w*: grad F(w*) = 0. This does not mean g_{i_t}(w) -> 0 as w -> w*. However, we have g~_{i_t}(w) = grad F(w_ref) - g_{i_t}(w_ref) + g_{i_t}(w) -> 0 as w, w_ref -> w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

86 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) Iterate s = 1,...,T: let w_0 = w_ref_s and compute grad F(w_ref_s); iterate t = 1,...,K: g~_{i_t}(w_{t-1}) = grad F(w_ref_s) - g_{i_t}(w_ref_s) + g_{i_t}(w_{t-1}), w_t = w_{t-1} - gamma_t g~_{i_t}(w_{t-1}); set w_ref_{s+1} = (1/K) sum_{t=1}^K w_t. Output: w_ref_T. K = O(1/lambda). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
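A sketch of the SVRG outer/inner loop above. Here grad_i(w, i) is assumed to return the gradient of the i-th loss term plus lambda*w (as in the slides); gamma, K, and T are illustrative.

```python
import numpy as np

def svrg(grad_i, n, d, gamma, K, T):
    rng = np.random.default_rng(0)
    w_ref = np.zeros(d)
    for s in range(T):
        full_grad = sum(grad_i(w_ref, i) for i in range(n)) / n   # grad F at reference point
        w = w_ref.copy()
        w_sum = np.zeros(d)
        for t in range(K):
            i = rng.integers(n)
            g = full_grad - grad_i(w_ref, i) + grad_i(w, i)       # variance-reduced gradient
            w -= gamma * g
            w_sum += w
        w_ref = w_sum / K                                         # average becomes next reference
    return w_ref
```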

87 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) Per-iteration (per-stage) cost: O(d(n + 1/lambda)). Iteration complexity: N.A. for non-smooth l or non-strongly convex R(w); O(log(1/eps)) for smooth l and lambda-strongly convex R(w). Total runtime: O(d(n + 1/lambda) log(1/eps)). Use the proximal mapping for the l_1 regularizer. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

89 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. A new version of SAG (Roux et al. (2012)). Applicable when l(z) is smooth. Strong convexity is not required. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

90 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) SAGA also reduces the variance of the stochastic gradient but with a different technique. SVRG uses gradients at the same reference point w_ref: g~_{i_t}(w) = grad F(w_ref) - g_{i_t}(w_ref) + g_{i_t}(w), with grad F(w_ref) = (1/n) sum_{i=1}^n grad l(w_ref^T x_i, y_i) + lambda*w_ref. SAGA uses gradients at different points {w~_1, w~_2, ..., w~_n} and their average: g~_{i_t}(w) = G~ - g_{i_t}(w~_{i_t}) + g_{i_t}(w), where the average gradient G~ = (1/n) sum_{i=1}^n g_i(w~_i). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

92 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) Initialize the average gradient G_0 = (1/n) sum_{i=1}^n g_i with g_i = grad l(w_0^T x_i, y_i) + lambda*w_0. Stochastic variance reduced gradient: g~_{i_t}(w_{t-1}) = G_{t-1} - g_{i_t} + (grad l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}); w_t = w_{t-1} - gamma_t g~_{i_t}(w_{t-1}). Update the average gradient: G_t = (1/n) sum_{i=1}^n g_i with g_{i_t} = grad l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

95 Optimization Stochastic Optimization Algorithms for Big Data SAGA: efficient update of the averaged gradient G_t and G_{t-1} only differ in g_i for i = i_t. Before we update g_{i_t}, we update G_t = G_{t-1} - (1/n) g_{i_t} + (1/n)(grad l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}). Computation cost: O(d). To implement SAGA, we have to store and update all of g_1, g_2, ..., g_n, which requires extra memory space O(nd). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

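A SAGA sketch following the slides: a table of per-example gradients (O(nd) memory) and a running average updated in O(d) per iteration. grad_i(w, i) is assumed to return the gradient of the i-th loss term plus lambda*w; gamma and T are illustrative.

```python
import numpy as np

def saga(grad_i, w0, n, gamma, T):
    rng = np.random.default_rng(0)
    w = w0.copy()
    table = np.array([grad_i(w, i) for i in range(n)])   # stored gradients g_1..g_n
    G = table.mean(axis=0)                               # average gradient G_0
    for t in range(T):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        w -= gamma * (G - table[i] + g_new)              # variance-reduced step
        G += (g_new - table[i]) / n                      # O(d) update of the average
        table[i] = g_new
    return w
```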
97 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) Per-iteration cost: O(d). Iteration complexity (smooth l): O(n/eps) for non-strongly convex R(w); O((n + 1/lambda) log(1/eps)) for lambda-strongly convex R(w); N.A. for non-smooth l. Total runtime (strongly convex): O(d(n + 1/lambda) log(1/eps)), the same as SVRG! Use the proximal mapping for the l_1 regularizer. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

99 Optimization Stochastic Optimization Algorithms for Big Data Compare the Runtime of SGD and SVRG/SAGA Smooth but non-strongly convex: SGD O(d/eps^2), SAGA O(dn/eps). Smooth and strongly convex: SGD O(d/(lambda*eps)), SVRG/SAGA O(d(n + 1/lambda) log(1/eps)). For small eps, use SVRG/SAGA; satisfied with large eps, use SGD. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

100 Optimization Stochastic Optimization Algorithms for Big Data Conjugate Duality Define l_i(z) := l(z, y_i). Conjugate function l_i*(alpha): l_i(z) = max_{alpha in R} [alpha*z - l_i*(alpha)], l_i*(alpha) = max_{z in R} [alpha*z - l_i(z)]. E.g. hinge loss l_i(z) = max(0, 1 - y_i z): l_i*(alpha) = alpha*y_i if -1 <= alpha*y_i <= 0, and +infinity otherwise. E.g. squared hinge loss l_i(z) = max(0, 1 - y_i z)^2: l_i*(alpha) = alpha^2/4 + alpha*y_i if alpha*y_i <= 0, and +infinity otherwise. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

102 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008)). Applicable when R(w) is lambda-strongly convex. Smoothness is not required. From the primal problem to the dual problem: min_w (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2 = min_w (1/n) sum_{i=1}^n max_{alpha_i in R} [alpha_i (w^T x_i) - l_i*(alpha_i)] + (lambda/2)||w||_2^2 = max_{alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i) - (lambda/2)||(1/(lambda*n)) sum_{i=1}^n alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

104 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Solve the Dual Problem: max_{alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i) - (lambda/2)||(1/(lambda*n)) sum_{i=1}^n alpha_i x_i||_2^2. Sample i_t in {1,...,n}. Optimize alpha_{i_t} while fixing the others. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

105 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Maintain a primal solution: w_t = (1/(lambda*n)) sum_{i=1}^n alpha_i^t x_i. Change variable alpha_i -> alpha_i^t + Delta alpha_i: max_{Delta alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||(1/(lambda*n))(sum_{i=1}^n alpha_i^t x_i + sum_{i=1}^n Delta alpha_i x_i)||_2^2 = max_{Delta alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*n)) sum_{i=1}^n Delta alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

106 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Dual Coordinate Updates: Delta alpha_{i_t} = arg max_{Delta alpha_{i_t} in R} -(1/n) l_{i_t}*(alpha_{i_t}^t + Delta alpha_{i_t}) - (lambda/2)||w_t + (1/(lambda*n)) Delta alpha_{i_t} x_{i_t}||_2^2; alpha_{i_t}^{t+1} = alpha_{i_t}^t + Delta alpha_{i_t}; w_{t+1} = w_t + (1/(lambda*n)) Delta alpha_{i_t} x_{i_t}. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

107 Optimization Stochastic Optimization Algorithms for Big Data SDCA updates Closed-form solution for Delta alpha_i: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)). E.g. square loss: Delta alpha_i = (y_i - w_t^T x_i - alpha_i^t) / (1 + ||x_i||_2^2/(lambda*n)). Per-iteration cost: O(d). Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
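A minimal SDCA sketch for the square loss using the closed-form coordinate update and the maintained primal solution w = (1/(lambda*n)) sum_i alpha_i x_i stated on the slides above; the data and the number of passes are illustrative assumptions.

```python
import numpy as np

def sdca_square_loss(X, y, lam, epochs):
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        # closed-form dual coordinate update for the square loss (from the slide)
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)      # keep w = (1/(lam*n)) sum_i alpha_i x_i
    return w, alpha
```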

108 Optimization Stochastic Optimization Algorithms for Big Data SDCA Iteration complexity: N.A. for non-strongly convex R(w); for lambda-strongly convex R(w): O(n + 1/(lambda*eps)) with non-smooth l, O((n + 1/lambda) log(1/eps)) with smooth l. Total runtime (smooth loss): O(d(n + 1/lambda) log(1/eps)), the same as SVRG and SAGA! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

110 Optimization Stochastic Optimization Algorithms for Big Data SVRG V.S. SDCA V.S. SGD l_2-regularized logistic regression with lambda = 10^-4, MNIST data. [figure from Johnson & Zhang (2013)] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

111 Optimization Stochastic Optimization Algorithms for Big Data APCG (Lin et al. (2014)) Recall the acceleration scheme for the full gradient method: an auxiliary sequence (beta_t) and a momentum step. Maintain a primal solution: w_t = (1/(lambda*n)) sum_{i=1}^n beta_i^t x_i. Dual Coordinate Update: Delta beta_{i_t} = arg max_{Delta beta_{i_t} in R} -(1/n) l_{i_t}*(beta_{i_t}^t + Delta beta_{i_t}) - (lambda/2)||w_t + (1/(lambda*n)) Delta beta_{i_t} x_{i_t}||_2^2. Momentum Step: alpha_{i_t}^{t+1} = beta_{i_t}^t + Delta beta_{i_t}; beta^{t+1} = alpha^{t+1} + eta_t (alpha^{t+1} - alpha^t). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

113 Optimization Stochastic Optimization Algorithms for Big Data APCG (Lin et al. (2014)) Per-iteration cost: O(d). Iteration complexity (lambda-strongly convex R(w)): O(n + sqrt(n/(lambda*eps))) for non-smooth l, O((n + sqrt(n/lambda)) log(1/eps)) for smooth l; N.A. for non-strongly convex R(w). Compared to SDCA, APCG has a shorter runtime when lambda is very small. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

114 Optimization Stochastic Optimization Algorithms for Big Data APCG V.S. SDCA squared hinge loss SVM on real datasets: rcv1 (n = 20,242, d = 47,236), covtype (n = 581,012), news20 (n = 19,996, d = 1,355,191); the table also lists each dataset's sparsity. Plots show F(w_t) - F(w*) V.S. the number of passes of data. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

115 Optimization Stochastic Optimization Algorithms for Big Data APCG V.S. SDCA Lin et al. (2014) Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

116 Optimization Stochastic Optimization Algorithms for Big Data For general R(w) Dual Problem: max_{alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i) - R*((1/(lambda*n)) sum_{i=1}^n alpha_i x_i), where R* is the conjugate of R. Sample i_t in {1,...,n}; optimize alpha_{i_t} while fixing the others. The update can still be done in O(d) in many cases (Shalev-Shwartz & Zhang (2013)). The iteration complexity and runtime of SDCA and APCG remain unchanged. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

117 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2 + tau||w||_1. Suppose d >> n. Per-iteration cost O(d) is too high. Apply APCG to the primal instead of the dual problem. Sample over features instead of data. Per-iteration cost becomes O(n). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

118 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem min_{w in R^d} F(w) = (1/2)||Xw - y||_2^2 + (lambda/2)||w||_2^2 + tau||w||_1, X = [x_1, x_2, ..., x_n]^T. Full gradient: grad F(w) = X^T(Xw - y) + lambda*w. Partial gradient: grad_i F(w) = (X^T(Xw - y))_i + lambda*w_i. Proximal Coordinate Gradient (PCG) (Nesterov (2012)): w_i^t = arg min_{w_i in R} grad_i F(w^{t-1}) w_i + (1/(2 gamma_t))(w_i - w_i^{t-1})^2 + tau|w_i| (a proximal mapping) if i = i_t, and w_i^t = w_i^{t-1} otherwise. grad_i F(w^t) can be updated in O(n). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

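A sketch of the (non-accelerated) proximal coordinate gradient step above, maintaining the residual Xw - y so each coordinate update costs O(n). The coordinate-wise step size and the random sweep order are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def prox_coordinate_gradient(X, y, lam, tau, T):
    """min_w 0.5||Xw - y||^2 + 0.5*lam*||w||^2 + tau*||w||_1, one coordinate per iteration."""
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                                   # residual, updated incrementally
    rng = np.random.default_rng(0)
    for t in range(T):
        i = rng.integers(d)                          # sample a feature/coordinate
        grad_i = X[:, i] @ r + lam * w[i]            # partial gradient, O(n)
        gamma = 1.0 / (X[:, i] @ X[:, i] + lam)      # coordinate-wise step size (assumption)
        w_new_i = soft_threshold(w[i] - gamma * grad_i, gamma * tau)
        r += (w_new_i - w[i]) * X[:, i]              # O(n) residual update
        w[i] = w_new_i
    return w
```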
121 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem APCG accelerates PCG using an auxiliary sequence (v_t) and a momentum step. APCG: w_i^t = arg min_{w_i in R} grad_i F(v^{t-1}) w_i + (1/(2 gamma_t))(w_i - v_i^{t-1})^2 + tau|w_i| if i = i_t, and w_i^t = w_i^{t-1} otherwise; v^t = w^t + eta_t (w^t - w^{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

122 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem Per-iteration cost: O(n). Iteration complexity (smooth loss): O(d/sqrt(eps)) for non-strongly convex R(w), O((d/sqrt(lambda)) log(1/eps)) for lambda-strongly convex R(w); N.A. for non-smooth loss. n >> d: apply APCG to the dual problem. d >> n: apply APCG to the primal problem. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

123 Optimization Stochastic Optimization Algorithms for Big Data Which Algorithm to Use Satisfied with large eps: SGD. For small eps: non-strongly convex R(w): SGD for non-smooth l, SAGA for smooth l; lambda-strongly convex R(w): APCG for non-smooth l, APCG for smooth l. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

124 Optimization Stochastic Optimization Algorithms for Big Data Summary
Table: Iteration complexity, per-iteration cost O(d)
smooth | str-cvx | SGD           | SAGA                     | SDCA                     | APCG
No     | No      | 1/eps^2       | N.A.                     | N.A.                     | N.A.
Yes    | No      | 1/eps^2       | n/eps                    | N.A.                     | N.A.
No     | Yes     | 1/(lambda*eps)| N.A.                     | n + 1/(lambda*eps)       | n + sqrt(n/(lambda*eps))
Yes    | Yes     | 1/(lambda*eps)| (n + 1/lambda) log(1/eps)| (n + 1/lambda) log(1/eps)| (n + sqrt(n/lambda)) log(1/eps)
Table: Iteration complexity, per-iteration cost O(d(n + 1/lambda))
smooth | str-cvx | SVRG
No     | No      | N.A.
Yes    | No      | N.A.
No     | Yes     | N.A.
Yes    | Yes     | log(1/eps)
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

126 Optimization Stochastic Optimization Algorithms for Big Data Summary
           | SGD     | SVRG       | SAGA  | SDCA | APCG
Memory     | O(d)    | O(d)       | O(dn) | O(d) | O(d)
Parameters | gamma_t | gamma_t, K | gamma_t | None | eta_t
Further rows compare the supported losses (absolute, hinge, square, squared hinge, logistic), the regularization requirement (lambda > 0 or lambda = 0), and whether the method is primal or dual.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

130 Optimization Stochastic Optimization Algorithms for Big Data Outline 2 Optimization (Sub)Gradient Methods Stochastic Optimization Algorithms for Big Data Stochastic Optimization Distributed Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

131 Optimization Stochastic Optimization Algorithms for Big Data Big Data and Distributed Optimization Distributed Optimization: data distributed over a cluster of multiple machines; moving the data to a single machine suffers from low network bandwidth and limited disk or memory. Communication V.S. computation: RAM access ~100 nanoseconds; a standard network connection ~250,000 nanoseconds. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

132 Optimization Stochastic Optimization Algorithms for Big Data Distributed Data N data points are partitioned and distributed to m machines: [x_1, x_2, ..., x_N] = S_1 u S_2 u ... u S_m. Machine j only has access to S_j. W.L.O.G.: |S_j| = n = N/m. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

133 Optimization Stochastic Optimization Algorithms for Big Data A simple solution: Average Solution Global problem: w* = arg min_{w in R^d} {F(w) = (1/N) sum_{i=1}^N l(w^T x_i, y_i) + R(w)}. Machine j solves a local problem: w_hat_j = arg min_{w in R^d} f_j(w) = (1/n) sum_{i in S_j} l(w^T x_i, y_i) + R(w). Center computes: w_hat = (1/m) sum_{j=1}^m w_hat_j. Issue: will not converge to w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

135 Optimization Stochastic Optimization Algorithms for Big Data Mini-Batch SGD: Average Stochastic Gradient Machine j samples i_t in S_j and constructs a stochastic gradient g_j(w_{t-1}) = partial l(w_{t-1}^T x_{i_t}, y_{i_t}) + grad R(w_{t-1}). Center computes: w_t = w_{t-1} - gamma_t (1/m) sum_{j=1}^m g_j(w_{t-1}) (mini-batch stochastic gradient). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

136 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime Single machine: Total Runtime = Per-iteration Cost x Iteration Complexity. Distributed optimization: Total Runtime = (Communication Time Per-round + Local Runtime Per-round) x Rounds of Communication. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

137 Optimization Stochastic Optimization Algorithms for Big Data Mini-Batch SGD Applicable in all settings! Communication time per-round: increases with m in a complicated way. Local runtime per-round: O(1). Suppose R(w) is lambda-strongly convex: rounds of communication O(1/(m*lambda*eps)). Suppose R(w) is non-strongly convex: rounds of communication O(1/(m*eps^2)). More machines reduce the rounds of communication but increase the communication time per-round. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

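A single-process simulation of the mini-batch SGD scheme above: each simulated machine draws one example from its own partition, and the center averages the stochastic gradients. Hinge loss, step sizes, and data layout are illustrative; a real system would use MPI/allreduce-style communication.

```python
import numpy as np

def minibatch_sgd(partitions, y_parts, lam, T):
    """partitions[j], y_parts[j]: the data held by machine j."""
    d = partitions[0].shape[1]
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        grads = []
        for Xj, yj in zip(partitions, y_parts):       # one stochastic gradient per machine
            i = rng.integers(Xj.shape[0])
            g = -yj[i] * Xj[i] if yj[i] * Xj[i] @ w < 1 else np.zeros(d)
            grads.append(g + lam * w)
        w -= (1.0 / (lam * t)) * np.mean(grads, axis=0)   # center averages and updates
    return w
```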
139 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA (Yang, 2013; Ma et al., 2015) Only works when R(w) = (lambda/2)||w||_2^2 with lambda > 0 (no l_1). Global dual problem: max_{alpha in R^N} (1/N) sum_{i=1}^N -l_i*(alpha_i) - (lambda/2)||(1/(lambda*N)) sum_{i=1}^N alpha_i x_i||_2^2, with alpha = [alpha_{S_1}, alpha_{S_2}, ..., alpha_{S_m}]. Machine j solves a local dual problem only over alpha_{S_j}: max_{alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i) - (lambda/2)||(1/(lambda*N)) sum_{i=1}^N alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

141 Optimization Stochastic Optimization Algorithms for Big Data DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015) Center maintains a primal solution: w_t = (1/(lambda*N)) sum_{i=1}^N alpha_i^t x_i. Change variable alpha_i -> alpha_i^t + Delta alpha_i: max_{Delta alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||(1/(lambda*N)) sum_{i=1}^N alpha_i^t x_i + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2 = max_{Delta alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

142 Optimization Stochastic Optimization Algorithms for Big Data DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015) Machine j approximately solves Delta alpha_{S_j}^t ~ arg max_{Delta alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2; alpha_{S_j}^{t+1} = alpha_{S_j}^t + Delta alpha_{S_j}^t; Delta w_j^t = (1/(lambda*N)) sum_{i in S_j} Delta alpha_i^t x_i. Center computes: w_{t+1} = w_t + sum_{j=1}^m Delta w_j^t. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

144 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015) Local objective value: G_j(Delta alpha_{S_j}, w_t) = (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2. Solve Delta alpha_{S_j}^t by any local solver as long as max_{Delta alpha_{S_j}} G_j(Delta alpha_{S_j}, w_t) - G_j(Delta alpha_{S_j}^t, w_t) <= Theta (max_{Delta alpha_{S_j}} G_j(Delta alpha_{S_j}, w_t) - G_j(0, w_t)), with 0 < Theta < 1. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

145 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015) Suppose l(z) is smooth, R(w) is lambda-strongly convex and SDCA is the local solver: local runtime per-round O((1/lambda + N/m) log(1/Theta)); rounds of communication O((Theta/lambda) log(1/eps)). Suppose l(z) is non-smooth, R(w) is lambda-strongly convex and SDCA is the local solver: local runtime per-round O((1/lambda + N/m)(1/Theta)); rounds of communication O(Theta/(lambda*eps)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

147 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA in Practice Choice of Θ (how long we run the local solver?) Choice of m (how many machines to use?) Fast machines but slow network: Use small Θ and small m Fast network but slow machines: Use large Θ and large m Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

149 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) The rounds of communication of distributed SDCA do not depend on m. DiSCO: Distributed Second-Order method (Zhang & Xiao, 2015). The rounds of communication of DiSCO depend on m. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

150 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Global problem: min_{w in R^d} {F(w) = (1/N) sum_{i=1}^N l(w^T x_i, y_i) + R(w)}. Local problem: min_{w in R^d} f_j(w) = (1/n) sum_{i in S_j} l(w^T x_i, y_i) + R(w). The global problem can be written as min_{w in R^d} F(w) = (1/m) sum_{j=1}^m f_j(w). Applicable when f_j(w) is smooth, lambda-strongly convex and self-concordant. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

151 Optimization Stochastic Optimization Algorithms for Big Data Newton Direction At the optimal solution w*: grad F(w*) = 0. We hope moving w_t along v_t leads to grad F(w_t - v_t) = 0. Taylor expansion: grad F(w_t - v_t) ~ grad F(w_t) - hess F(w_t) v_t = 0. Such a v_t is called a Newton direction. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

152 Optimization Stochastic Optimization Algorithms for Big Data Newton Method Find a Newton direction v_t by solving hess F(w_t) v_t = grad F(w_t), then update w_{t+1} = w_t - gamma_t v_t. This requires solving a d x d linear equation system, which is costly when d is large. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

153 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Inexact Newton Method Find an inexact Newton direction v_t using Preconditioned Conjugate Gradient (PCG) (Golub & Ye, 1997): ||hess F(w_t) v_t - grad F(w_t)||_2 <= eps_t. Then update w_{t+1} = w_t - gamma_t v_t. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
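A sketch of computing an inexact Newton direction from Hessian-vector products. For brevity this is plain conjugate gradient rather than DiSCO's preconditioned version, and the tolerance and iteration cap are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(hess_vec, g, tol=1e-6, max_iter=100):
    """Approximately solve H v = g using only Hessian-vector products hess_vec(p) = H p."""
    v = np.zeros_like(g)
    r = g - hess_vec(v)              # residual g - H v
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = rs / (p @ Hp)
        v += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol:   # ||H v - g|| small enough: inexact Newton direction
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v
```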

154 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) PCG: keep computing v -> P^{-1} hess F(w_t) v iteratively until v becomes an inexact Newton direction. Preconditioner: P = hess f_1(w_t) + mu*I, where mu is a tuning parameter such that ||hess f_1(w_t) - hess F(w_t)||_2 <= mu. f_1 is similar to F so that P is a good local preconditioner. mu = O(1/sqrt(n)) = O(sqrt(m/N)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

156 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) DiSCO computes hess F(w_t) v distributedly: hess F(w_t) v = (1/m)(hess f_1(w_t) v [machine 1] + hess f_2(w_t) v [machine 2] + ... + hess f_m(w_t) v [machine m]). Then, compute P^{-1} hess F(w_t) v only on machine 1. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

157 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

161 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) For high-dimensional data (e.g. d >= 1000), computing P^{-1} hess F(w_t) v exactly is too costly. Instead, use SDCA on machine 1 to solve P^{-1} hess F(w_t) v ~ arg min_{u in R^d} (1/2) u^T P u - u^T hess F(w_t) v. Local runtime: O(N/m + (1 + mu)/(lambda + mu)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

162 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Suppose SDCA is the local solver: local runtime per-round O(N/m + (1 + mu)/(lambda + mu)); rounds of communication O(sqrt(mu/lambda) log(1/eps)). Choice of m (how many machines to use?), with mu = O(sqrt(m/N)): fast machines but slow network: use small m; fast network but slow machines: use large m. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

164 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015) A distributed version of SVRG using a round-robin scheme. Assume the user can control the distribution of the data before the algorithm starts. Applicable when f_j(w) is smooth and lambda-strongly convex. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

165 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015) Easy to distribute: the full-gradient step of SVRG (let w_0 = w_ref_s and compute grad F(w_ref_s)). Hard to distribute: the inner stochastic gradient steps (g~_{i_t}(w_{t-1}) = grad F(w_ref_s) - g_{i_t}(w_ref_s) + g_{i_t}(w_{t-1}), w_t = w_{t-1} - gamma_t g~_{i_t}(w_{t-1}), w_ref_{s+1} = (1/K) sum_{t=1}^K w_t), because each machine can only sample from its own data, so E_{i_t in S_j}[g~_{i_t}(w_{t-1})] != grad F(w_{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

169 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015)
Solution: store a second set of data $R_j$ on machine $j$, sampled with replacement from $\{x_1, x_2, \dots, x_n\}$ before the algorithm starts.
Construct the stochastic gradient $g_{i_t}(w_{t-1})$ by sampling $i_t \in R_j$ and removing $i_t$ from $R_j$ afterwards.
$\mathbb{E}_{i_t \in R_j}\big[g_{i_t}(w_{t-1})\big] = \nabla F(w_{t-1})$
When $R_j = \emptyset$, pass $w_t$ to the next machine.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

170-171 Optimization Stochastic Optimization Algorithms for Big Data Full Gradient Step [figure-only slides] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

172-174 Optimization Stochastic Optimization Algorithms for Big Data Stochastic Gradient Step [figure-only slides] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

175 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015)
Suppose $|R_j| = r$ for all $j$.
Local runtime per round: $O\big(\tfrac{N}{m} + r\big)$
Rounds of communication: $O\big(\tfrac{1}{r\lambda}\log\tfrac{1}{\epsilon}\big)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

176 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015)
Choice of $m$ (how many machines to use?):
Fast machines but slow network: use small $m$
Fast network but slow machines: use large $m$
Choice of $r$ (how many data points to pre-sample into $R_j$?): the larger, the better.
Required memory per machine: $|S_j| + |R_j| = \tfrac{N}{m} + r$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

177 Optimization Stochastic Optimization Algorithms for Big Data Other Distributed Optimization Methods
ADMM (Boyd et al., 2011; Ozdaglar, 2015): rounds of communication $O\big(\text{network-graph-dependent term}\times\tfrac{1}{\lambda}\log\tfrac{1}{\epsilon}\big)$
DANE (Shamir et al., 2014): approximates the Newton direction with a different approach from DiSCO
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

178 Optimization Stochastic Optimization Algorithms for Big Data Summary
$\ell(z)$ is smooth and $R(w)$ is $\lambda$-strongly convex (assume $\mu = O(\sqrt{m/N})$):
Mini-SGD: runtime per round $O(1)$; rounds of communication $O\big(\tfrac{1}{\lambda m\epsilon}\big)$
Dist-SDCA: runtime per round $O\big(\tfrac{1}{\lambda} + \tfrac{N}{m}\big)$; rounds of communication $O\big(\tfrac{1}{\lambda}\log\tfrac{1}{\epsilon}\big)$
DiSCO: runtime per round $O\big(\tfrac{N}{m}\big)$; rounds of communication $O\big(\tfrac{m^{1/4}}{\sqrt{\lambda}\,N^{1/4}}\log\tfrac{1}{\epsilon}\big)$
DSVRG, full gradient step: runtime per round $O\big(\tfrac{N}{m}\big)$; rounds of communication $O\big(\log\tfrac{1}{\epsilon}\big)$
DSVRG, stochastic gradient step: runtime per round $O(r)$; rounds of communication $O\big(\tfrac{1}{r\lambda}\log\tfrac{1}{\epsilon}\big)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

179 Optimization Stochastic Optimization Algorithms for Big Data Summary
$\ell(z)$ is non-smooth and $R(w)$ is $\lambda$-strongly convex:
Mini-SGD: runtime per round $O(1)$; rounds of communication $O\big(\tfrac{1}{\lambda m\epsilon}\big)$
Dist-SDCA: runtime per round $O\big(\tfrac{1}{\lambda} + \tfrac{N}{m}\big)$; rounds of communication $O\big(\tfrac{1}{\lambda\epsilon}\big)$
DiSCO: N.A. DSVRG: N.A.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

180 Optimization Stochastic Optimization Algorithms for Big Data Summary
$R(w)$ is non-strongly convex:
Mini-SGD: runtime per round $O(1)$; rounds of communication $O\big(\tfrac{1}{m\epsilon^2}\big)$
Dist-SDCA: N.A. DiSCO: N.A. DSVRG: N.A.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

181 Optimization Stochastic Optimization Algorithms for Big Data Which Algorithm to Use
$R(w)$ non-strongly convex: Mini-SGD (non-smooth $\ell$), DSVRG (smooth $\ell$)
$R(w)$ $\lambda$-strongly convex: Dist-SDCA (non-smooth $\ell$), DiSCO/DSVRG (smooth $\ell$)
Between DiSCO and DSVRG:
Use DiSCO for small $\lambda$, e.g., $\lambda < 10^{-5}$
Use DSVRG for large $\lambda$, e.g., $\lambda > 10^{-5}$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

182 Optimization Stochastic Optimization Algorithms for Big Data Distributed Machine Learning Systems and Libraries Petuum: Apache Spark: Parameter Server: Birds: tyng/software.html Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

183 Optimization Stochastic Optimization Algorithms for Big Data Thank You! Questions? Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

184 Randomized Dimension Reduction Big Data Analytics: Optimization and Randomization Part III: Randomization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

185 Randomized Dimension Reduction Outline 1 Basics 2 Optimization 3 Randomized Dimension Reduction 4 Randomized Algorithms 5 Concluding Remarks Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

186 Randomized Dimension Reduction Random Sketch Approximate a large data matrix by a much smaller sketch Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

187-190 Randomized Dimension Reduction The Framework of Randomized Algorithms [figure-only slides] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

191 Randomized Dimension Reduction Why randomized dimension reduction? Efficient Robust (e.g., dropout) Formal Guarantees Can explore parallel algorithms Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

192 Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

193 Randomized Dimension Reduction JL Lemma
JL Lemma (Johnson & Lindenstrauss, 1984): For any $0 < \epsilon, \delta < 1/2$, there exists a probability distribution on $m\times d$ real matrices $A$ and a small universal constant $c > 0$ such that for any fixed $x\in\mathbb{R}^d$, with probability at least $1-\delta$,
$\big|\,\|Ax\|_2^2 - \|x\|_2^2\,\big| \le c\sqrt{\tfrac{\log(1/\delta)}{m}}\,\|x\|_2^2$
or, for $m = \Theta(\epsilon^{-2}\log(1/\delta))$, with probability at least $1-\delta$,
$\big|\,\|Ax\|_2^2 - \|x\|_2^2\,\big| \le \epsilon\,\|x\|_2^2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

194 Randomized Dimension Reduction Embedding a set of points into low dimensional space
Given a set of points $x_1,\dots,x_n\in\mathbb{R}^d$, we can embed them into a low dimensional space as $Ax_1,\dots,Ax_n\in\mathbb{R}^m$ such that the pairwise distance between any two points is well preserved in the low dimensional space:
$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \le (1+\epsilon)\,\|x_i - x_j\|_2^2$
$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \ge (1-\epsilon)\,\|x_i - x_j\|_2^2$
In other words, to preserve all pairwise Euclidean distances up to $1\pm\epsilon$, only $m = \Theta(\epsilon^{-2}\log(n^2/\delta))$ dimensions are necessary.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

195 Randomized Dimension Reduction JL transforms: Gaussian Random Projection
Gaussian random projection (Dasgupta & Gupta, 2003): $A\in\mathbb{R}^{m\times d}$ with $A_{ij}\sim\mathcal{N}(0, 1/m)$ and $m = \Theta(\epsilon^{-2}\log(1/\delta))$.
Computational cost of $AX$, where $X\in\mathbb{R}^{d\times n}$: $mnd$ for dense matrices, $\mathrm{nnz}(X)\,m$ for sparse matrices.
The computational cost is very high (could be as high as solving many problems).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
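A minimal sketch of the Gaussian random projection (toy code of our own, with illustrative sizes): A has i.i.d. N(0, 1/m) entries and the sketch is AX for X in R^{d x n} whose columns are the points.

import numpy as np

def gaussian_sketch(X, m, seed=0):
    """Gaussian random projection: A has i.i.d. N(0, 1/m) entries; returns A @ X.

    X : (d, n) data matrix (columns are points); A : (m, d); sketch : (m, n).
    The O(m*n*d) cost of this dense product is what motivates the faster transforms below.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))
    return A @ X

# quick check that squared norms are roughly preserved (the JL property)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((2000, 50))          # d = 2000, n = 50 points
    S = gaussian_sketch(X, m=300)
    ratio = np.linalg.norm(S, axis=0) ** 2 / np.linalg.norm(X, axis=0) ** 2
    print(ratio.min(), ratio.max())              # typically close to 1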

196 Randomized Dimension Reduction Accelerate JL transforms: using discrete distributions
Using discrete distributions (Achlioptas, 2003):
$\Pr\big(A_{ij} = \pm\tfrac{1}{\sqrt{m}}\big) = 0.5$, or
$\Pr\big(A_{ij} = \pm\sqrt{\tfrac{3}{m}}\big) = \tfrac{1}{6}$, $\Pr(A_{ij} = 0) = \tfrac{2}{3}$
Database friendly: replaces multiplications by additions and subtractions.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
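A sketch of the database-friendly variant using the sparser of the two distributions above; the function name and the dense construction of A are ours, chosen for clarity rather than speed.

import numpy as np

def achlioptas_sketch(X, m, seed=0):
    """Sparse sign projection of Achlioptas (2003): entries are
    +sqrt(3/m) w.p. 1/6, 0 w.p. 2/3, -sqrt(3/m) w.p. 1/6.
    Returns A @ X for X of shape (d, n)."""
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    vals = rng.choice([-1.0, 0.0, 1.0], size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
    A = np.sqrt(3.0 / m) * vals
    return A @ X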

197 Randomized Dimension Reduction Accelerate JL transforms: using Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform.
Motivation: can we simply use a random sampling matrix $P\in\mathbb{R}^{m\times d}$ that randomly selects $m$ coordinates out of $d$ coordinates (scaled by $\sqrt{d/m}$)?
Unfortunately, by the Chernoff bound,
$\big|\,\|Px\|_2^2 - \|x\|_2^2\,\big| \le \tfrac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2}\sqrt{\tfrac{3\log(2/\delta)}{m}}\,\|x\|_2^2$
Unless $\tfrac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2} \le c$, random sampling does not work.
The remedy is given by the randomized Hadamard transform.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

200 Randomized Dimension Reduction Randomized Hadamard transform
Hadamard transform $H\in\mathbb{R}^{d\times d}$: $H = \tfrac{1}{\sqrt{d}}H_{2^k}$, where
$H_1 = [1]$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}$
$\|Hx\|_2 = \|x\|_2$ and $H$ is orthogonal. Computational cost of $Hx$: $d\log(d)$.
Randomized Hadamard transform: $HD$, where $D\in\mathbb{R}^{d\times d}$ is a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$.
$HD$ is orthogonal and $\|HDx\|_2 = \|x\|_2$.
Key property: $\tfrac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \le \sqrt{\log(d/\delta)}$ w.h.p. $1-\delta$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

203 Randomized Dimension Reduction Accelerate JL transforms: using Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform (Tropp, 2011):
$A = \sqrt{\tfrac{d}{m}}\,PHD$ yields $\big|\,\|Ax\|_2^2 - \|x\|_2^2\,\big| \le \sqrt{\tfrac{\log(2/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2$
$m = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$ suffices for $1\pm\epsilon$; the additional factor $\log(d/\delta)$ can be removed.
Computational cost of $AX$: $O(nd\log(m))$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
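A toy sketch of $A = \sqrt{d/m}\,PHD$ applied to one vector, assuming d is a power of two; the fwht routine below is a plain Python loop implementing the (unnormalized) fast Walsh-Hadamard transform, written for readability rather than speed, and all names are ours.

import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a vector whose length is a power of 2
    (unnormalized: applies H_{2^k} without the 1/sqrt(d) factor)."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def srht_sketch(x, m, seed=0):
    """Subsampled randomized Hadamard transform A x = sqrt(d/m) * P H D x:
    D = random signs, H = normalized Hadamard, P = uniform sampling of m coordinates."""
    rng = np.random.default_rng(seed)
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)          # D
    hx = fwht(signs * x) / np.sqrt(d)                # H D x  (normalized, so norm is preserved)
    idx = rng.choice(d, size=m, replace=False)       # P
    return np.sqrt(d / m) * hx[idx]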

204 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I)
Random hashing (Dasgupta et al., 2010): $A = HD$, where $D\in\mathbb{R}^{d\times d}$ and $H\in\mathbb{R}^{m\times d}$.
Random hashing: $h(j): \{1,\dots,d\}\to\{1,\dots,m\}$, with $H_{ij} = 1$ if $h(j) = i$ (a sparse matrix; each column has only one non-zero entry).
$D\in\mathbb{R}^{d\times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$.
$[Ax]_j = \sum_{i: h(i) = j} x_i D_{ii}$
Technically speaking, random hashing does not satisfy the JL lemma.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
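A minimal sketch of random hashing A = HD applied to a d x n matrix (names are ours): the bucket assignments h(j) and signs D_jj are drawn once, and the sketch is built with a scatter-add, so the cost is O(nnz(X)) for sparse data.

import numpy as np

def random_hashing_sketch(X, m, seed=0):
    """Random hashing A = H D applied to X in R^{d x n}:
    coordinate j goes to bucket h(j) in {0,...,m-1} with a random sign D_jj,
    so (A x)_b = sum_{j: h(j)=b} D_jj * x_j."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    buckets = rng.integers(0, m, size=d)             # h(j)
    signs = rng.choice([-1.0, 1.0], size=d)          # D_jj
    S = np.zeros((m, n))
    np.add.at(S, buckets, signs[:, None] * X)        # scatter-add the signed rows into buckets
    return S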

206 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I)
Key properties: $\mathbb{E}[\langle HDx_1, HDx_2\rangle] = \langle x_1, x_2\rangle$, and norm preserving,
$\big|\,\|HDx\|_2^2 - \|x\|_2^2\,\big| \le \epsilon\,\|x\|_2^2$, only when $\tfrac{\|x\|_\infty}{\|x\|_2} \le \tfrac{1}{\sqrt{c}}$
Apply a randomized Hadamard transform $P$ first: $\Theta(c\log(c/\delta))$ blocks of randomized Hadamard transforms give $\tfrac{\|Px\|_\infty}{\|Px\|_2} \le \tfrac{1}{\sqrt{c}}$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

207 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (II)
Sparse JL transform based on block random hashing (Kane & Nelson, 2014):
$A = \tfrac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$, where each $Q_i\in\mathbb{R}^{v\times d}$ is an independent random hashing ($HD$) matrix.
Set $v = \Theta(\epsilon^{-1})$ and $s = \Theta(\epsilon^{-1}\log(1/\delta))$.
Computational cost of $AX$: $O\!\left(\tfrac{\mathrm{nnz}(X)}{\epsilon}\log\tfrac{1}{\delta}\right)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

208 Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

209 Randomized Dimension Reduction Subspace Embeddings
Definition: a subspace embedding, given parameters $0 < \epsilon, \delta < 1$ and $k \le d$, is a distribution $\mathcal{D}$ over matrices $A\in\mathbb{R}^{m\times d}$ such that for any fixed linear subspace $W\subseteq\mathbb{R}^d$ with $\dim(W) = k$,
$\Pr_{A\sim\mathcal{D}}\big(\forall x\in W:\ \|Ax\|_2 \in (1\pm\epsilon)\,\|x\|_2\big) \ge 1-\delta$
It implies: if $U\in\mathbb{R}^{d\times k}$ is an orthogonal matrix (contains the orthonormal bases of $W$), then $AU\in\mathbb{R}^{m\times k}$ has full column rank, the singular values of $AU$ lie in $1\pm\epsilon$, and $(1-\epsilon)^2 I \preceq U^\top A^\top AU \preceq (1+\epsilon)^2 I$.
These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

212 Randomized Dimension Reduction Subspace Embeddings
From a JL transform to a subspace embedding (Sarlós, 2006): let $A\in\mathbb{R}^{m\times d}$ be a JL transform. If
$m = O\!\left(\tfrac{k\log\frac{k}{\delta\epsilon}}{\epsilon^2}\right)$
then w.h.p. $1-\delta$, $A\in\mathbb{R}^{m\times d}$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

213 Randomized Dimension Reduction Subspace Embeddings
Making block random hashing a subspace embedding (Nelson & Nguyen, 2013):
$A = \tfrac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$, where each $Q_i\in\mathbb{R}^{v\times d}$ is an independent random hashing ($HD$) matrix.
Set $v = \Theta(k\epsilon^{-1}\log^5(k/\delta))$ and $s = \Theta(\epsilon^{-1}\log^3(k/\delta))$.
Then w.h.p. $1-\delta$, $A\in\mathbb{R}^{m\times d}$ with $m = \Theta\!\left(\tfrac{k\log^8(k/\delta)}{\epsilon^2}\right)$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$.
Computational cost of $AX$: $O\!\left(\tfrac{\mathrm{nnz}(X)}{\epsilon}\log^3\tfrac{k}{\delta}\right)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

214 Randomized Dimension Reduction Sparse Subspace Embedding (SSE)
Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013): $A = HD$, where $D\in\mathbb{R}^{d\times d}$ and $H\in\mathbb{R}^{m\times d}$.
$m = \Omega(k^2/\epsilon^2)$ suffices for a subspace embedding with probability $2/3$.
Computational cost of $AX$: $O(\mathrm{nnz}(X))$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

215 Randomized Dimension Reduction Randomized Dimensionality Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column (Row) sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

216 Randomized Dimension Reduction Column sampling Column subset selection (feature selection) More interpretable Uniform sampling usually does not work (not a JL transform) Non-oblivious sampling (data-dependent sampling) leverage-score sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

219 Randomized Dimension Reduction Leverage-score sampling (Drineas et al., 2006)
Let $X\in\mathbb{R}^{d\times n}$ be a rank-$k$ matrix with $X = U\Sigma V^\top$, $U\in\mathbb{R}^{d\times k}$, $\Sigma\in\mathbb{R}^{k\times k}$.
Leverage scores: $\|U_{i*}\|_2^2$, $i = 1,\dots,d$.
Let $p_i = \tfrac{\|U_{i*}\|_2^2}{\sum_{i'=1}^{d}\|U_{i'*}\|_2^2}$, $i = 1,\dots,d$.
Let $i_1,\dots,i_m\in\{1,\dots,d\}$ denote $m$ indices sampled according to $p_i$.
Let $A\in\mathbb{R}^{m\times d}$ be the sampling-and-rescaling matrix: $A_{tj} = \tfrac{1}{\sqrt{m p_j}}$ if $j = i_t$, and $0$ otherwise.
$AX\in\mathbb{R}^{m\times n}$ is a small sketch of $X$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
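A sketch of leverage-score row sampling using an exact rank-k SVD (the cost caveat noted below; in practice one would plug in approximate leverage scores). The function name and interface are our own illustrative choices.

import numpy as np

def leverage_score_sketch(X, k, m, seed=0):
    """Leverage-score row sampling of X in R^{d x n}.
    Returns the m x n sketch A X built from the sampling-and-rescaling matrix A."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, :k]                                    # top-k left singular vectors, d x k
    scores = np.sum(U ** 2, axis=1)                 # leverage scores ||U_i||_2^2
    p = scores / scores.sum()
    idx = rng.choice(X.shape[0], size=m, replace=True, p=p)
    scale = 1.0 / np.sqrt(m * p[idx])               # rescale each sampled row by 1/sqrt(m p_i)
    return scale[:, None] * X[idx]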

224 Randomized Dimension Reduction Properties of Leverage-score sampling
When $m = \Theta\!\left(\tfrac{k}{\epsilon^2}\log\tfrac{2k}{\delta}\right)$, w.h.p. $1-\delta$: $AU\in\mathbb{R}^{m\times k}$ has full column rank and $(1-\epsilon)^2 \le \sigma_i^2(AU) \le (1+\epsilon)^2$, i.e., $1-\epsilon \le \sigma_i(AU) \le 1+\epsilon$.
Leverage-score sampling performs like a subspace embedding (only for $U$, the top singular vector matrix of $X$).
Computational cost: computing the top-$k$ SVD of $X$ is expensive; randomized algorithms can compute approximate leverage scores.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

226 Randomized Dimension Reduction When does uniform sampling make sense?
Coherence measure: $\mu_k = \tfrac{d}{k}\max_{1\le i\le d}\|U_{i*}\|_2^2$
Uniform sampling is valid when the coherence measure is small (some real data mining datasets have small coherence measures).
The Nyström method usually uses uniform sampling (Gittens, 2011).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

227 Randomized Algorithms Randomized Classification (Regression) Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

228 Randomized Algorithms Randomized Classification (Regression) Classification
Classification problems: $\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n}\ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
$y_i\in\{+1,-1\}$: label. Loss function $\ell(z)$ with $z = y\,w^\top x$:
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$
2. Logistic regression: $\ell(z) = \log(1 + \exp(-z))$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

229 Randomized Algorithms Randomized Classification (Regression) Randomized Classification
For large-scale, high-dimensional problems, the computational cost of optimization is $O((nd + d\kappa)\log(1/\epsilon))$.
Use a random reduction $A\in\mathbb{R}^{d\times m}$ ($m\ll d$), e.g. a JL transform or a sparse subspace embedding, to reduce $X\in\mathbb{R}^{n\times d}$ to $\hat X = XA\in\mathbb{R}^{n\times m}$. Then solve
$\min_{u\in\mathbb{R}^m}\ \frac{1}{n}\sum_{i=1}^{n}\ell(y_i u^\top\hat x_i) + \frac{\lambda}{2}\|u\|_2^2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
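A minimal end-to-end sketch of the reduce-then-solve scheme above: a Gaussian reduction followed by L2-regularized logistic regression fit with plain gradient descent (a stand-in solver chosen only to keep the example self-contained); all names and hyperparameters are ours.

import numpy as np

def reduce_then_train(X, y, m, lam=1e-3, seed=0, iters=200, lr=0.5):
    """Reduce X in R^{n x d} to Xhat = X A with a Gaussian A in R^{d x m}, then fit
    L2-regularized logistic regression on the m-dimensional features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    Xh = X @ A                                       # n x m reduced data
    u = np.zeros(m)
    for _ in range(iters):
        z = y * (Xh @ u)                             # margins y_i * u^T xhat_i
        g = -(Xh * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0) + lam * u
        u -= lr * g                                  # gradient step on the reduced objective
    return u, A                                      # predict a new x via sign(u @ (A.T @ x))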

230 Randomized Algorithms Randomized Classification (Regression) Randomized Classification
Two questions:
1. Is there any performance guarantee?
The margin is preserved if the data are linearly separable (Balcan et al., 2006), as long as $m \ge \tfrac{12}{\epsilon^2}\log\tfrac{6m^2}{\delta}$.
Generalization performance is preserved if the data matrix is of low rank and $m = \Omega\!\left(\tfrac{k\,\mathrm{poly}(\log(k/(\delta\epsilon)))}{\epsilon^2}\right)$ (Paul et al., 2013).
2. How to recover an accurate model in the original high-dimensional space?
Dual recovery (Zhang et al., 2014) and dual sparse recovery (Yang et al., 2015).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

235 Randomized Algorithms Randomized Classification (Regression) The Dual Problem
Using the Fenchel conjugate: $\ell_i^*(\alpha_i) = \max_{z}\ \alpha_i z - \ell(z, y_i)$
Primal: $w_* = \arg\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n}\ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Dual: $\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top XX^\top\alpha$
From dual to primal: $w_* = \frac{1}{\lambda n}X^\top\alpha_*$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

236 Randomized Algorithms Randomized Classification (Regression) Dual Recovery for Randomized Reduction
From the dual formulation: $w_*$ lies in the row space of the data matrix $X\in\mathbb{R}^{n\times d}$.
Dual recovery: $\hat w = \frac{1}{\lambda n}X^\top\hat\alpha_*$, where
$\hat\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\hat X\hat X^\top\alpha$
and $\hat X = XA\in\mathbb{R}^{n\times m}$.
Subspace embedding $A$ with $m = \Theta(r\log(r/\delta)\epsilon^{-2})$.
Guarantee: under a low-rank assumption on the data matrix $X$ (e.g., $\mathrm{rank}(X) = r$), with high probability $1-\delta$, $\|\hat w - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\|w_*\|_2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

239 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery for Randomized Reduction
Assume the optimal dual solution $\alpha_*$ is sparse (i.e., the number of support vectors is small).
Dual sparse recovery: $\hat w = \frac{1}{\lambda n}X^\top\hat\alpha_*$, where
$\hat\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\hat X\hat X^\top\alpha - \frac{\tau}{n}\|\alpha\|_1$
and $\hat X = XA\in\mathbb{R}^{n\times m}$.
JL transform $A$ with $m = \Theta(s\log(n/\delta)\epsilon^{-2})$.
Guarantee: if $\alpha_*$ is $s$-sparse, with high probability $1-\delta$, $\|\hat w - w_*\|_2 \le \epsilon\|w_*\|_2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

242 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery
RCV1 text data, n = 677,399 and d = 47,236. [Plots of the relative dual error and relative primal error (L2 norm) versus τ at a fixed λ, for m = 1024, 2048, 4096, 8192.]
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

243 Randomized Algorithms Randomized Least-Squares Regression Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

244 Randomized Algorithms Randomized Least-Squares Regression Least-squares regression
Let $X\in\mathbb{R}^{n\times d}$ with $d \ll n$ and $b\in\mathbb{R}^n$. The least-squares regression problem is to find
$w_* = \arg\min_{w\in\mathbb{R}^d}\ \|Xw - b\|_2$
Computational cost: $O(nd^2)$. Goal of randomized algorithms: $o(nd^2)$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

245 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression
Let $A\in\mathbb{R}^{m\times n}$ be a random reduction matrix. Solve
$\hat w = \arg\min_{w\in\mathbb{R}^d}\ \|A(Xw - b)\|_2 = \arg\min_{w\in\mathbb{R}^d}\ \|AXw - Ab\|_2$
Computational cost: $O(md^2)$ + reduction time.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

246 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression
Theoretical guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
$\|X\hat w - b\|_2 \le (1+\epsilon)\,\|Xw_* - b\|_2$
Total time: $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\,\epsilon^{-2})$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
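A minimal sketch-and-solve example matching the setup above; a Gaussian sketch is used here for brevity, whereas the nnz(X) running time quoted above requires a sparse subspace embedding. Names and sizes are ours.

import numpy as np

def sketched_least_squares(X, b, m, seed=0):
    """Solve the m x d sketched problem min_w ||A X w - A b||_2 with a Gaussian A in R^{m x n}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    w_hat, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)
    return w_hat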

247 Randomized Algorithms Randomized K-means Clustering Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

248 Randomized Algorithms Randomized K-means Clustering K-means Clustering
Let $x_1,\dots,x_n\in\mathbb{R}^d$ be a set of data points. K-means clustering aims to solve
$\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\sum_{x_i\in C_j}\|x_i - \mu_j\|_2^2$
Computational cost: $O(ndkt)$, where $t$ is the number of iterations.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

249 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering
Let $X = (x_1,\dots,x_n)^\top\in\mathbb{R}^{n\times d}$ be the data matrix. For high-dimensional data:
Random sketch: $\hat X = XA\in\mathbb{R}^{n\times m}$, $m\ll d$
Approximate K-means: $\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\sum_{\hat x_i\in C_j}\|\hat x_i - \hat\mu_j\|_2^2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

251 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering
For the random sketch: JL transforms and sparse subspace embeddings all work.
JL transform: $m = O\!\left(\tfrac{k\log(k/(\epsilon\delta))}{\epsilon^2}\right)$
Sparse subspace embedding: $m = O\!\left(\tfrac{k^2}{\epsilon^2\delta}\right)$
$\epsilon$ relates to the approximation accuracy.
The analysis of the approximation error for K-means can be formulated as constrained low-rank approximation (Cohen et al., 2015): $\min_{Q:\,Q^\top Q = I}\ \|X - QQ^\top X\|_F^2$, where $Q$ is orthonormal.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
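A toy sketch of randomized K-means: project the features with a Gaussian A (any JL transform or sparse subspace embedding could be substituted, per the slide above) and run plain Lloyd iterations on XA. The K-means routine below is a bare-bones stand-in of ours, not a tuned implementation.

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def randomized_kmeans(X, k, m, seed=0):
    """Sketch the d features down to m with a Gaussian A and cluster the n x m matrix X A."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    return kmeans(X @ A, k, seed=seed)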

252 Randomized Algorithms Randomized Kernel methods Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

253 Randomized Algorithms Randomized Kernel methods Kernel methods
Kernel function: $\kappa(\cdot,\cdot)$, and a set of examples $x_1,\dots,x_n$.
Kernel matrix: $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = \kappa(x_i, x_j)$; $K$ is a PSD matrix.
Computational and memory costs: $\Omega(n^2)$.
Approximation methods: the Nyström method, random Fourier features.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

255 Randomized Algorithms Randomized Kernel methods The Nyström method
Let $A\in\mathbb{R}^{n\times l}$ be a uniform sampling matrix.
$B = KA\in\mathbb{R}^{n\times l}$, $C = A^\top B = A^\top KA$
The Nyström approximation (Drineas & Mahoney, 2005): $\tilde K = BC^{\dagger}B^\top$
Computational cost: $O(l^3 + nl^2)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

257 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine
The dual problem: $\arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top BC^{\dagger}B^\top\alpha$
Solve it like a linear method with $\hat X = BC^{-1/2}\in\mathbb{R}^{n\times l}$:
$\arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\hat X\hat X^\top\alpha$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
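A minimal sketch of the Nyström pipeline from the two slides above: sample l columns uniformly, form B and C, build the approximation BC†Bᵀ, and also return the feature map X̂ = BC^{-1/2} that the linear dual solver above consumes. The eigendecomposition-based pseudo-inverse and the tolerance are our own choices.

import numpy as np

def nystrom(K, l, seed=0):
    """Nystrom approximation of a PSD kernel matrix K (n x n) from l uniformly sampled columns."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    idx = rng.choice(n, size=l, replace=False)
    B = K[:, idx]                        # n x l
    C = K[np.ix_(idx, idx)]              # l x l
    vals, vecs = np.linalg.eigh(C)       # eigendecomposition for pseudo-inverse / inverse sqrt
    keep = vals > 1e-10
    C_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    C_isqrt = (vecs[:, keep] / np.sqrt(vals[keep])) @ vecs[:, keep].T
    K_approx = B @ C_pinv @ B.T          # K ~= B C^+ B^T
    X_hat = B @ C_isqrt                  # n x l features with X_hat X_hat^T = K_approx
    return K_approx, X_hat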

258 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine [figure-only slide] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234


A Stochastic 3MG Algorithm with Application to 2D Filter Identification A Stochastic 3MG Algorithm with Application to 2D Filter Identification Emilie Chouzenoux 1, Jean-Christophe Pesquet 1, and Anisia Florescu 2 1 Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est,

More information

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

5. Orthogonal matrices

5. Orthogonal matrices L Vandenberghe EE133A (Spring 2016) 5 Orthogonal matrices matrices with orthonormal columns orthogonal matrices tall matrices with orthonormal columns complex matrices with orthonormal columns 5-1 Orthonormal

More information

Bilinear Prediction Using Low-Rank Models

Bilinear Prediction Using Low-Rank Models Bilinear Prediction Using Low-Rank Models Inderjit S. Dhillon Dept of Computer Science UT Austin 26th International Conference on Algorithmic Learning Theory Banff, Canada Oct 6, 2015 Joint work with C-J.

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

CS229T/STAT231: Statistical Learning Theory (Winter 2015)

CS229T/STAT231: Statistical Learning Theory (Winter 2015) CS229T/STAT231: Statistical Learning Theory (Winter 2015) Percy Liang Last updated Wed Oct 14 2015 20:32 These lecture notes will be updated periodically as the course goes on. Please let us know if you

More information

Linear Algebra Review. Vectors

Linear Algebra Review. Vectors Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

More information

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA)

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA) SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI (B.Sc.(Hons.), BUAA) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT OF MATHEMATICS NATIONAL

More information

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)

More information

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization. Based on Mark Schmidt s slides Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

More information

We shall turn our attention to solving linear systems of equations. Ax = b

We shall turn our attention to solving linear systems of equations. Ax = b 59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

Notes on Symmetric Matrices

Notes on Symmetric Matrices CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

More information

Variational approach to restore point-like and curve-like singularities in imaging

Variational approach to restore point-like and curve-like singularities in imaging Variational approach to restore point-like and curve-like singularities in imaging Daniele Graziani joint work with Gilles Aubert and Laure Blanc-Féraud Roma 12/06/2012 Daniele Graziani (Roma) 12/06/2012

More information

Performance of First- and Second-Order Methods for Big Data Optimization

Performance of First- and Second-Order Methods for Big Data Optimization Noname manuscript No. (will be inserted by the editor) Performance of First- and Second-Order Methods for Big Data Optimization Kimon Fountoulakis Jacek Gondzio Received: date / Accepted: date Abstract

More information

Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

More information

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

More information

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix Linear Algebra & Properties of the Covariance Matrix October 3rd, 2012 Estimation of r and C Let rn 1, rn, t..., rn T be the historical return rates on the n th asset. rn 1 rṇ 2 r n =. r T n n = 1, 2,...,

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes

Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz In this lecture we describe a different model of learning which is called online

More information