Big Data Analytics: Optimization and Randomization

1 Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of Management Sciences, The University of Iowa, IA, USA Department of Computer Science and Engineering, Michigan State University, MI, USA Institute of Data Science and Technologies at Alibaba Group, Seattle, USA August 10, 2015 Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

2 URL tyng/kdd15-tutorial.pdf Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

3 Some Claims This tutorial is not an exhaustive literature survey. It is not a survey on different machine learning/data mining algorithms. It is about how to efficiently solve machine learning/data mining (formulated as optimization) problems for big data. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

4 Outline Part I: Basics Part II: Optimization Part III: Randomization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

5 Big Data Analytics: Optimization and Randomization Part I: Basics Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

6 Basics Introduction Outline 1 Basics Introduction Notations and Definitions Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

7 Basics Introduction Three Steps for Machine Learning: Data, Model, Optimization. [figure: distance to optimal objective vs. iterations for different convergence rates] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

8 Basics Introduction Big Data Challenge Big Data Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

9 Basics Introduction Big Data Challenge Big Model 60 million parameters Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

10 Basics Introduction Learning as Optimization Ridge Regression Problem: min_{w in R^d} (1/n) sum_{i=1}^n (y_i - w^T x_i)^2 + (lambda/2)||w||_2^2, where the first term is the Empirical Loss and the second term is the Regularization. x_i in R^d: d-dimensional feature vector; y_i in R: target variable; w in R^d: model parameters; n: number of data points. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
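A minimal NumPy sketch of the ridge regression objective above and its closed-form minimizer (set the gradient to zero). The data, variable names, and lambda value are illustrative assumptions, not from the tutorial.

```python
import numpy as np

# Ridge regression: min_w (1/n) sum_i (y_i - w^T x_i)^2 + (lam/2) ||w||_2^2
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                  # rows are feature vectors x_i
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.1

def ridge_objective(w):
    return np.mean((y - X @ w) ** 2) + 0.5 * lam * np.dot(w, w)

# Setting the gradient (2/n) X^T (Xw - y) + lam*w to zero gives
#   w* = (2/n X^T X + lam I)^{-1} (2/n X^T y)
w_star = np.linalg.solve(2.0 / n * X.T @ X + lam * np.eye(d), 2.0 / n * X.T @ y)
print(ridge_objective(w_star))
```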

13 Basics Introduction Learning as Optimization Classification Problems: min_{w in R^d} (1/n) sum_{i=1}^n l(y_i w^T x_i) + (lambda/2)||w||_2^2, where y_i in {+1, -1} is the label. Loss function l(z) with z = y w^T x: 1. SVMs: (squared) hinge loss l(z) = max(0, 1 - z)^p, where p = 1, 2; 2. Logistic Regression: l(z) = log(1 + exp(-z)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
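A small sketch of the loss functions listed above, written as functions of the margin z = y * w^T x; the function names are illustrative.

```python
import numpy as np

def hinge(z):            # SVM, p = 1 (non-smooth)
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):    # SVM, p = 2 (smooth)
    return np.maximum(0.0, 1.0 - z) ** 2

def logistic(z):         # logistic regression (smooth)
    return np.log1p(np.exp(-z))

z = np.linspace(-2, 2, 5)
print(hinge(z), squared_hinge(z), logistic(z))
```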

14 Basics Introduction Learning as Optimization Feature Selection: min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + lambda ||w||_1. l_1 regularization: ||w||_1 = sum_{i=1}^d |w_i|; lambda controls the sparsity level. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

15 Basics Introduction Learning as Optimization Feature Selection using Elastic Net: min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + lambda (||w||_1 + gamma ||w||_2^2). The elastic net regularizer is more robust than the l_1 regularizer. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

16 Basics Introduction Learning as Optimization Multi-class/Multi-task Learning with W in R^{K x d}: min_W (1/n) sum_{i=1}^n l(W x_i, y_i) + lambda r(W). r(W) = ||W||_F^2 = sum_{k=1}^K sum_{j=1}^d W_{kj}^2: Frobenius norm. r(W) = ||W||_* = sum_i sigma_i: nuclear norm (sum of singular values). r(W) = ||W||_{1,inf} = sum_{j=1}^d ||W_{:j}||_inf: l_{1,inf} mixed norm. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

17 Basics Introduction Learning as Optimization Regularized Empirical Loss Minimization: min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w), where both l and R are convex functions. Extensions to matrix cases are possible (sometimes straightforward). Extensions to kernel methods can be combined with randomized approaches. Extensions to non-convex problems (e.g., deep learning) are in progress. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

18 Basics Introduction Data Matrices and Machine Learning The Instance-feature Matrix: X in R^{n x d}, X = [x_1, x_2, ..., x_n]^T (one row per instance). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

19 Basics Introduction Data Matrices and Machine Learning The output vector: y = (y_1, y_2, ..., y_n)^T in R^{n x 1}. Continuous y_i in R: regression (e.g., house price). Discrete, e.g., y_i in {1, 2, 3}: classification (e.g., species of iris). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

20 Basics Introduction Data Matrices and Machine Learning The Instance-Instance Matrix: K R n n Similarity Matrix Kernel Matrix Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

21 Basics Introduction Data Matrices and Machine Learning Some machine learning tasks are formulated on the kernel matrix Clustering Kernel Methods Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

22 Basics Introduction Data Matrices and Machine Learning The Feature-Feature Matrix: C R d d Covariance Matrix Distance Metric Matrix Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

23 Basics Introduction Data Matrices and Machine Learning Some machine learning tasks require the covariance matrix: Principal Component Analysis; Top-k Singular Value (Eigen-Value) Decomposition of the Covariance Matrix. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

24 Basics Introduction Why is Learning from Big Data Challenging? High per-iteration cost High memory cost High communication cost Large iteration complexity Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

25 Basics Notations and Definitions Outline 1 Basics Introduction Notations and Definitions Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

26 Basics Notations and Definitions Norms Vector x in R^d. Euclidean vector norm: ||x||_2 = sqrt(x^T x) = sqrt(sum_{i=1}^d x_i^2). l_p-norm of a vector: ||x||_p = (sum_{i=1}^d |x_i|^p)^{1/p} where p >= 1. 1. l_2 norm ||x||_2 = (sum_{i=1}^d x_i^2)^{1/2}; 2. l_1 norm ||x||_1 = sum_{i=1}^d |x_i|; 3. l_inf norm ||x||_inf = max_i |x_i|. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

29 Basics Notations and Definitions Matrix Factorization Matrix X in R^{n x d}. Singular Value Decomposition X = U Sigma V^T: 1. U in R^{n x r}: orthonormal columns (U^T U = I), span the column space; 2. Sigma in R^{r x r}: diagonal matrix, Sigma_{ii} = sigma_i > 0, sigma_1 >= sigma_2 >= ... >= sigma_r; 3. V in R^{d x r}: orthonormal columns (V^T V = I), span the row space; 4. r <= min(n, d): the largest value such that sigma_r > 0, the rank of X; 5. U_k Sigma_k V_k^T: top-k approximation. Pseudo inverse: X^+ = V Sigma^{-1} U^T. QR factorization: X = QR (n >= d), Q in R^{n x d}: orthonormal columns, R in R^{d x d}: upper triangular matrix. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

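A NumPy sketch of the factorizations above (thin SVD, top-k approximation, pseudo-inverse, QR); the example matrix is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))

# Thin SVD: X = U diag(s) V^T, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_top_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # top-k (best rank-k) approximation

X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T            # pseudo-inverse X^+ = V Sigma^{-1} U^T
print(np.allclose(X_pinv, np.linalg.pinv(X)))

Q, R = np.linalg.qr(X)                            # Q orthonormal columns, R upper triangular
print(np.allclose(Q @ R, X))
```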
32 Basics Notations and Definitions Norms Matrix X in R^{n x d}. Frobenius norm: ||X||_F = sqrt(tr(X^T X)) = sqrt(sum_{i=1}^n sum_{j=1}^d X_{ij}^2). Spectral (induced) norm of a matrix: ||X||_2 = max_{||u||_2 = 1} ||Xu||_2 = sigma_1 (maximum singular value). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

34 Basics Notations and Definitions Convex Optimization min_{x in X} f(x). X is a convex domain: for any x, y in X, their convex combination alpha*x + (1-alpha)*y is in X. f(x) is a convex function. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

35 Basics Notations and Definitions Convex Function Characterization of a convex function: f(alpha*x + (1-alpha)*y) <= alpha*f(x) + (1-alpha)*f(y) for all x, y in X, alpha in [0, 1]; f(x) >= f(y) + grad f(y)^T (x - y) for all x, y in X. A local optimum is a global optimum. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

37 Basics Notations and Definitions Convex vs Strongly Convex Convex function: f(x) >= f(y) + grad f(y)^T (x - y) for all x, y in X. lambda-Strongly Convex function (lambda: strong convexity constant): f(x) >= f(y) + grad f(y)^T (x - y) + (lambda/2)||x - y||_2^2 for all x, y in X. The global optimum is unique. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

39 Basics Notations and Definitions Non-smooth function vs Smooth function Non-smooth function: Lipschitz continuous with constant G, e.g. absolute loss f(x) = |x|: |f(x) - f(y)| <= G||x - y||_2; subgradient: f(x) >= f(y) + partial f(y)^T (x - y). Smooth function: smoothness constant L, e.g. logistic loss f(x) = log(1 + exp(-x)): ||grad f(x) - grad f(y)||_2 <= L||x - y||_2. [figures: a non-smooth function with a subgradient at x; the logistic loss f(x) between a quadratic upper bound and the linear lower bound f(y) + f'(y)(x - y)] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

42 Basics Notations and Definitions Next... Part II: Optimization min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w): stochastic optimization and distributed optimization. Reduce Iteration Complexity: utilizing properties of functions. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

43 Basics Notations and Definitions Next... Part III: Randomization Classification, Regression SVD, K-means, Kernel methods Reduce Data Size: utilizing properties of data Please stay tuned! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

44 Optimization Big Data Analytics: Optimization and Randomization Part II: Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

45 Optimization (Sub)Gradient Methods Outline 2 Optimization (Sub)Gradient Methods Stochastic Optimization Algorithms for Big Data Stochastic Optimization Distributed Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

46 Optimization (Sub)Gradient Methods Learning as Optimization Regularized Empirical Loss Minimization: min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

50 Optimization (Sub)Gradient Methods Convergence Measure Most optimization algorithms are iterative: w_{t+1} = w_t + Delta w_t. Iteration Complexity: the number of iterations T(eps) needed to have F(w_T) - min_w F(w) <= eps (eps << 1). Convergence Rate: after T iterations, how good is the solution: F(w_T) - min_w F(w) <= eps(T). Total Runtime = Per-iteration Cost x Iteration Complexity. [figure: objective value decreasing to within eps after T iterations] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

51 Optimization (Sub)Gradient Methods More on Convergence Measure Big O(.) notation: explicit dependence on T or eps. Convergence Rate vs Iteration Complexity: linear, O(mu^T) (mu < 1) corresponds to O(log(1/eps)); sub-linear, O(1/T^alpha) (alpha > 0) corresponds to O(1/eps^{1/alpha}). Why are we interested in Bounds? Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

56 Optimization (Sub)Gradient Methods More on Convergence Measure Convergence Rate vs Iteration Complexity: linear O(mu^T) (mu < 1) corresponds to O(log(1/eps)); sub-linear O(1/T^alpha) (alpha > 0) corresponds to O(1/eps^{1/alpha}). [figure: distance to optimum vs. iterations T; a linear 0.5^T rate reaches the optimum in seconds, 1/T in minutes, 1/sqrt(T) in hours] Theoretically, we consider O(mu^T) <= O(1/T^2) <= O(1/T) <= O(1/sqrt(T)), i.e. log(1/eps) <= 1/sqrt(eps) <= 1/eps <= 1/eps^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

57 Optimization (Sub)Gradient Methods Non-smooth V.S. Smooth Non-smooth l(z): hinge loss l(w^T x, y) = max(0, 1 - y w^T x); absolute loss l(w^T x, y) = |w^T x - y|. Smooth l(z): squared hinge loss l(w^T x, y) = max(0, 1 - y w^T x)^2; logistic loss l(w^T x, y) = log(1 + exp(-y w^T x)); square loss l(w^T x, y) = (w^T x - y)^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

58 Optimization (Sub)Gradient Methods Strongly convex V.S. Non-strongly convex lambda-strongly convex R(w): l_2 regularizer (lambda/2)||w||_2^2; elastic net regularizer tau||w||_1 + (lambda/2)||w||_2^2. Non-strongly convex R(w): unregularized problem R(w) = 0; l_1 regularizer tau||w||_1. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

59 Optimization (Sub)Gradient Methods Gradient Method in Machine Learning F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Suppose l(z) is smooth. Full gradient: grad F(w) = (1/n) sum_{i=1}^n grad l(w^T x_i, y_i) + lambda*w. Per-iteration cost: O(nd). Gradient Descent (gamma_t: step size): w_t = w_{t-1} - gamma_t grad F(w_{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

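A minimal gradient-descent sketch for the l_2-regularized logistic loss; the data, step size, and iteration count are illustrative assumptions, not recommendations from the tutorial.

```python
import numpy as np

# Gradient descent on F(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2)||w||^2
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.normal(size=(n, d))
y = np.sign(rng.normal(size=n))

def full_gradient(w):
    z = y * (X @ w)                     # margins y_i * w^T x_i
    coef = -y / (1.0 + np.exp(z))       # derivative of log(1+exp(-z)) w.r.t. z is -1/(1+exp(z))
    return X.T @ coef / n + lam * w     # O(nd) per iteration

w = np.zeros(d)
gamma = 0.1
for t in range(500):
    w -= gamma * full_gradient(w)
```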
61 Optimization (Sub)Gradient Methods Gradient Method in Machine Learning F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/eps). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O((1/lambda) log(1/eps)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

62 Optimization (Sub)Gradient Methods Accelerated Gradient Method Accelerated Gradient Descent: w_t = v_{t-1} - gamma_t grad F(v_{t-1}); v_t = w_t + eta_t (w_t - w_{t-1}) (momentum step). w_t is the output and v_t is an auxiliary sequence. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

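A sketch of the accelerated (momentum) scheme above. The momentum schedule eta_t = (t-1)/(t+2) is a common choice and an assumption here, as are the step size and function names.

```python
import numpy as np

def accelerated_gd(grad_F, w0, gamma, T):
    """Gradient step at the auxiliary point v, then a momentum step."""
    w_prev = w0.copy()
    w = w0.copy()
    for t in range(1, T + 1):
        eta = (t - 1.0) / (t + 2.0)      # illustrative momentum schedule
        v = w + eta * (w - w_prev)       # momentum step
        w_prev = w
        w = v - gamma * grad_F(v)        # gradient step at the auxiliary point
    return w

# usage (with the full_gradient sketch above): accelerated_gd(full_gradient, np.zeros(10), 0.1, 500)
```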
64 Optimization (Sub)Gradient Methods Accelerated Gradient Method F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/sqrt(eps)), better than O(1/eps). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O((1/sqrt(lambda)) log(1/eps)), better than O((1/lambda) log(1/eps)) for small lambda. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

65 Optimization (Sub)Gradient Methods Deal with l_1 regularizer Consider a more general case: min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R'(w) + tau||w||_1, with R(w) = R'(w) + tau||w||_1, where R'(w) is lambda-strongly convex and smooth. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

67 Optimization (Sub)Gradient Methods Deal with l_1 regularizer Accelerated Gradient Descent with a proximal mapping: w_t = arg min_{w in R^d} grad F'(v_{t-1})^T w + (1/(2 gamma_t))||w - v_{t-1}||_2^2 + tau||w||_1 (where F' denotes the smooth part of F); v_t = w_t + eta_t (w_t - w_{t-1}). The proximal mapping has a closed-form solution: soft-thresholding. Iteration complexity and runtime remain unchanged. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

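A minimal sketch of the soft-thresholding operator and the resulting proximal step; function and variable names are illustrative.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Closed-form proximal mapping of thresh * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def proximal_step(v, grad_v, gamma, tau):
    # argmin_w  grad_v^T w + (1/(2*gamma)) ||w - v||^2 + tau ||w||_1
    return soft_threshold(v - gamma * grad_v, gamma * tau)
```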
69 Optimization (Sub)Gradient Methods Sub-Gradient Method in Machine Learning F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Suppose l(z) is non-smooth. Full sub-gradient: partial F(w) = (1/n) sum_{i=1}^n partial l(w^T x_i, y_i) + lambda*w. Sub-Gradient Descent: w_t = w_{t-1} - gamma_t partial F(w_{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

71 Optimization (Sub)Gradient Methods Sub-Gradient Method F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/eps^2). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O(1/(lambda*eps)). No efficient acceleration scheme in general. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

72 Optimization (Sub)Gradient Methods Problem Classes and Iteration Complexity min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w). Iteration complexity: non-strongly convex R(w): O(1/eps^2) for non-smooth l, O(1/eps) for smooth l; lambda-strongly convex R(w): O(1/(lambda*eps)) for non-smooth l, O((1/lambda) log(1/eps)) for smooth l. Per-iteration cost: O(nd), too high if n or d are large. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

73 Optimization Stochastic Optimization Algorithms for Big Data Outline 2 Optimization (Sub)Gradient Methods Stochastic Optimization Algorithms for Big Data Stochastic Optimization Distributed Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

74 Optimization Stochastic Optimization Algorithms for Big Data Stochastic First-Order Method by Data Sampling Stochastic Gradient Descent (SGD) Stochastic Variance Reduced Gradient (SVRG) Stochastic Average Gradient Algorithm (SAGA) Stochastic Dual Coordinate Ascent (SDCA) Accelerated Proximal Coordinate Gradient (APCG) Assumption: ||x_i||_2 <= 1 for any i Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

75 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)) F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Full sub-gradient: partial F(w) = (1/n) sum_{i=1}^n partial l(w^T x_i, y_i) + lambda*w. Randomly sample i in {1,...,n}. Stochastic sub-gradient: partial l(w^T x_i, y_i) + lambda*w, with E_i[partial l(w^T x_i, y_i) + lambda*w] = partial F(w). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

76 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)) Applicable in all settings! min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Sample: i_t in {1,...,n}; update: w_t = w_{t-1} - gamma_t (partial l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}); output: w_bar_T = (1/T) sum_{t=1}^T w_t. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

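A basic SGD sketch for the hinge loss with l_2 regularization, following the sample/update/average scheme above. The data and the step-size schedule gamma_t = 1/(lambda*t) are illustrative assumptions.

```python
import numpy as np

def sgd_hinge(X, y, lam, T):
    """SGD for (1/n) sum_i max(0, 1 - y_i w^T x_i) + (lam/2)||w||^2, averaged output."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        i = rng.integers(n)                                        # sample i_t uniformly
        g = -y[i] * X[i] if y[i] * X[i] @ w < 1 else np.zeros(d)   # subgradient of hinge loss
        w -= (1.0 / (lam * t)) * (g + lam * w)                     # stochastic subgradient step
        w_sum += w
    return w_sum / T                                               # averaged iterate w_bar_T
```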
78 Optimization Stochastic Optimization Algorithms for Big Data Basic SGD (Nemirovski & Yudin (1978)) F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. If lambda = 0: R(w) is non-strongly convex, iteration complexity O(1/eps^2). If lambda > 0: R(w) is lambda-strongly convex, iteration complexity O(1/(lambda*eps)). Exactly the same as sub-gradient descent! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

79 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime Per-iteration cost: O(d), much lower than the full gradient method. E.g. hinge loss (SVM): stochastic gradient partial l(w^T x_{i_t}, y_{i_t}) = -y_{i_t} x_{i_t} if 1 - y_{i_t} w^T x_{i_t} > 0, and 0 otherwise. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

80 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime min_{w in R^d} (1/n) sum_{i=1}^n l(w^T x_i, y_i) + R(w). Iteration complexity: non-strongly convex R(w): O(1/eps^2) for both non-smooth and smooth l; lambda-strongly convex R(w): O(1/(lambda*eps)) for both non-smooth and smooth l. For SGD, only strong convexity helps; smoothness does not make any difference! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

81 Optimization Stochastic Optimization Algorithms for Big Data Full Gradient V.S. Stochastic Gradient Full gradient method needs fewer iterations; stochastic gradient method has lower cost per iteration. For small eps, use full gradient; satisfied with large eps, use stochastic gradient. Full gradient can be accelerated; stochastic gradient cannot. Full gradient's iteration complexity depends on smoothness and strong convexity; stochastic gradient's iteration complexity only depends on strong convexity. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

82 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. Applicable when l(z) is smooth and R(w) is lambda-strongly convex. Stochastic gradient: g_{i_t}(w) = grad l(w^T x_{i_t}, y_{i_t}) + lambda*w, with E_{i_t}[g_{i_t}(w)] = grad F(w), but Var[g_{i_t}(w)] does not go to 0 even if w = w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

84 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) Compute the full gradient at a reference point w_ref: grad F(w_ref) = (1/n) sum_{i=1}^n grad l(w_ref^T x_i, y_i) + lambda*w_ref. Stochastic variance reduced gradient: g~_{i_t}(w) = grad F(w_ref) - g_{i_t}(w_ref) + g_{i_t}(w), with E_{i_t}[g~_{i_t}(w)] = grad F(w) and Var[g~_{i_t}(w)] -> 0 as w, w_ref -> w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

85 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) At the optimal solution w*: grad F(w*) = 0. This does not mean g_{i_t}(w) -> 0 as w -> w*. However, we have g~_{i_t}(w) = grad F(w_ref) - g_{i_t}(w_ref) + g_{i_t}(w) -> 0 as w, w_ref -> w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

86 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) Iterate s = 1,...,T: let w_0 = w_ref_s and compute grad F(w_ref_s); iterate t = 1,...,K: g~_{i_t}(w_{t-1}) = grad F(w_ref_s) - g_{i_t}(w_ref_s) + g_{i_t}(w_{t-1}), w_t = w_{t-1} - gamma_t g~_{i_t}(w_{t-1}); set w_ref_{s+1} = (1/K) sum_{t=1}^K w_t. Output: w_ref_T. K = O(1/lambda). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
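A sketch of the SVRG outer/inner loop above. Here grad_i(w, i) is assumed to return the gradient of the i-th loss term plus lambda*w (as in the slides); gamma, K, and T are illustrative.

```python
import numpy as np

def svrg(grad_i, n, d, gamma, K, T):
    rng = np.random.default_rng(0)
    w_ref = np.zeros(d)
    for s in range(T):
        full_grad = sum(grad_i(w_ref, i) for i in range(n)) / n   # grad F at reference point
        w = w_ref.copy()
        w_sum = np.zeros(d)
        for t in range(K):
            i = rng.integers(n)
            g = full_grad - grad_i(w_ref, i) + grad_i(w, i)       # variance-reduced gradient
            w -= gamma * g
            w_sum += w
        w_ref = w_sum / K                                         # average becomes next reference
    return w_ref
```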

87 Optimization Stochastic Optimization Algorithms for Big Data SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014) Per-iteration (per-stage) cost: O(d(n + 1/lambda)). Iteration complexity: N.A. for non-smooth l or non-strongly convex R(w); O(log(1/eps)) for smooth l and lambda-strongly convex R(w). Total runtime: O(d(n + 1/lambda) log(1/eps)). Use the proximal mapping for the l_1 regularizer. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

89 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2. A new version of SAG (Roux et al. (2012)). Applicable when l(z) is smooth. Strong convexity is not required. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

90 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) SAGA also reduces the variance of the stochastic gradient but with a different technique. SVRG uses gradients at the same reference point w_ref: g~_{i_t}(w) = grad F(w_ref) - g_{i_t}(w_ref) + g_{i_t}(w), with grad F(w_ref) = (1/n) sum_{i=1}^n grad l(w_ref^T x_i, y_i) + lambda*w_ref. SAGA uses gradients at different points {w~_1, w~_2, ..., w~_n} and their average: g~_{i_t}(w) = G~ - g_{i_t}(w~_{i_t}) + g_{i_t}(w), where the average gradient G~ = (1/n) sum_{i=1}^n g_i(w~_i). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

92 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) Initialize the average gradient G_0 = (1/n) sum_{i=1}^n g_i with g_i = grad l(w_0^T x_i, y_i) + lambda*w_0. Stochastic variance reduced gradient: g~_{i_t}(w_{t-1}) = G_{t-1} - g_{i_t} + (grad l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}); w_t = w_{t-1} - gamma_t g~_{i_t}(w_{t-1}). Update the average gradient: G_t = (1/n) sum_{i=1}^n g_i with g_{i_t} = grad l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

95 Optimization Stochastic Optimization Algorithms for Big Data SAGA: efficient update of the averaged gradient G_t and G_{t-1} only differ in g_i for i = i_t. Before we update g_{i_t}, we update G_t = G_{t-1} - (1/n) g_{i_t} + (1/n)(grad l(w_{t-1}^T x_{i_t}, y_{i_t}) + lambda*w_{t-1}). Computation cost: O(d). To implement SAGA, we have to store and update all of g_1, g_2, ..., g_n, which requires extra memory space O(nd). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

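A SAGA sketch following the slides: a table of per-example gradients (O(nd) memory) and a running average updated in O(d) per iteration. grad_i(w, i) is assumed to return the gradient of the i-th loss term plus lambda*w; gamma and T are illustrative.

```python
import numpy as np

def saga(grad_i, w0, n, gamma, T):
    rng = np.random.default_rng(0)
    w = w0.copy()
    table = np.array([grad_i(w, i) for i in range(n)])   # stored gradients g_1..g_n
    G = table.mean(axis=0)                               # average gradient G_0
    for t in range(T):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        w -= gamma * (G - table[i] + g_new)              # variance-reduced step
        G += (g_new - table[i]) / n                      # O(d) update of the average
        table[i] = g_new
    return w
```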
97 Optimization Stochastic Optimization Algorithms for Big Data SAGA (Defazio et al. (2014)) Per-iteration cost: O(d). Iteration complexity (smooth l): O(n/eps) for non-strongly convex R(w); O((n + 1/lambda) log(1/eps)) for lambda-strongly convex R(w); N.A. for non-smooth l. Total runtime (strongly convex): O(d(n + 1/lambda) log(1/eps)), the same as SVRG! Use the proximal mapping for the l_1 regularizer. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

99 Optimization Stochastic Optimization Algorithms for Big Data Compare the Runtime of SGD and SVRG/SAGA Smooth but non-strongly convex: SGD O(d/eps^2), SAGA O(dn/eps). Smooth and strongly convex: SGD O(d/(lambda*eps)), SVRG/SAGA O(d(n + 1/lambda) log(1/eps)). For small eps, use SVRG/SAGA; satisfied with large eps, use SGD. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

100 Optimization Stochastic Optimization Algorithms for Big Data Conjugate Duality Define l_i(z) := l(z, y_i). Conjugate function l_i*(alpha): l_i(z) = max_{alpha in R} [alpha*z - l_i*(alpha)], l_i*(alpha) = max_{z in R} [alpha*z - l_i(z)]. E.g. hinge loss l_i(z) = max(0, 1 - y_i z): l_i*(alpha) = alpha*y_i if -1 <= alpha*y_i <= 0, and +infinity otherwise. E.g. squared hinge loss l_i(z) = max(0, 1 - y_i z)^2: l_i*(alpha) = alpha^2/4 + alpha*y_i if alpha*y_i <= 0, and +infinity otherwise. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

102 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008)). Applicable when R(w) is lambda-strongly convex. Smoothness is not required. From the primal problem to the dual problem: min_w (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2 = min_w (1/n) sum_{i=1}^n max_{alpha_i in R} [alpha_i (w^T x_i) - l_i*(alpha_i)] + (lambda/2)||w||_2^2 = max_{alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i) - (lambda/2)||(1/(lambda*n)) sum_{i=1}^n alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

104 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Solve the Dual Problem: max_{alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i) - (lambda/2)||(1/(lambda*n)) sum_{i=1}^n alpha_i x_i||_2^2. Sample i_t in {1,...,n}. Optimize alpha_{i_t} while fixing the others. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

105 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Maintain a primal solution: w_t = (1/(lambda*n)) sum_{i=1}^n alpha_i^t x_i. Change variable alpha_i -> alpha_i^t + Delta alpha_i: max_{Delta alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||(1/(lambda*n))(sum_{i=1}^n alpha_i^t x_i + sum_{i=1}^n Delta alpha_i x_i)||_2^2 = max_{Delta alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*n)) sum_{i=1}^n Delta alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

106 Optimization Stochastic Optimization Algorithms for Big Data SDCA (Shalev-Shwartz & Zhang (2013)) Dual Coordinate Updates: Delta alpha_{i_t} = arg max_{Delta alpha_{i_t} in R} -(1/n) l_{i_t}*(alpha_{i_t}^t + Delta alpha_{i_t}) - (lambda/2)||w_t + (1/(lambda*n)) Delta alpha_{i_t} x_{i_t}||_2^2; alpha_{i_t}^{t+1} = alpha_{i_t}^t + Delta alpha_{i_t}; w_{t+1} = w_t + (1/(lambda*n)) Delta alpha_{i_t} x_{i_t}. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

107 Optimization Stochastic Optimization Algorithms for Big Data SDCA updates Closed-form solution for Delta alpha_i: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)). E.g. square loss: Delta alpha_i = (y_i - w_t^T x_i - alpha_i^t) / (1 + ||x_i||_2^2/(lambda*n)). Per-iteration cost: O(d). Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
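A minimal SDCA sketch for the square loss using the closed-form coordinate update and the maintained primal solution w = (1/(lambda*n)) sum_i alpha_i x_i stated on the slides above; the data and the number of passes are illustrative assumptions.

```python
import numpy as np

def sdca_square_loss(X, y, lam, epochs):
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        # closed-form dual coordinate update for the square loss (from the slide)
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)      # keep w = (1/(lam*n)) sum_i alpha_i x_i
    return w, alpha
```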

108 Optimization Stochastic Optimization Algorithms for Big Data SDCA Iteration complexity: N.A. for non-strongly convex R(w); for lambda-strongly convex R(w): O(n + 1/(lambda*eps)) with non-smooth l, O((n + 1/lambda) log(1/eps)) with smooth l. Total runtime (smooth loss): O(d(n + 1/lambda) log(1/eps)), the same as SVRG and SAGA! Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

110 Optimization Stochastic Optimization Algorithms for Big Data SVRG V.S. SDCA V.S. SGD l_2-regularized logistic regression with lambda = 10^-4, MNIST data. [figure from Johnson & Zhang (2013)] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

111 Optimization Stochastic Optimization Algorithms for Big Data APCG (Lin et al. (2014)) Recall the acceleration scheme for the full gradient method: an auxiliary sequence (beta_t) and a momentum step. Maintain a primal solution: w_t = (1/(lambda*n)) sum_{i=1}^n beta_i^t x_i. Dual Coordinate Update: Delta beta_{i_t} = arg max_{Delta beta_{i_t} in R} -(1/n) l_{i_t}*(beta_{i_t}^t + Delta beta_{i_t}) - (lambda/2)||w_t + (1/(lambda*n)) Delta beta_{i_t} x_{i_t}||_2^2. Momentum Step: alpha_{i_t}^{t+1} = beta_{i_t}^t + Delta beta_{i_t}; beta^{t+1} = alpha^{t+1} + eta_t (alpha^{t+1} - alpha^t). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

113 Optimization Stochastic Optimization Algorithms for Big Data APCG (Lin et al. (2014)) Per-iteration cost: O(d). Iteration complexity (lambda-strongly convex R(w)): O(n + sqrt(n/(lambda*eps))) for non-smooth l, O((n + sqrt(n/lambda)) log(1/eps)) for smooth l; N.A. for non-strongly convex R(w). Compared to SDCA, APCG has a shorter runtime when lambda is very small. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

114 Optimization Stochastic Optimization Algorithms for Big Data APCG V.S. SDCA squared hinge loss SVM on real datasets: rcv1 (n = 20,242, d = 47,236), covtype (n = 581,012), news20 (n = 19,996, d = 1,355,191); the table also lists each dataset's sparsity. Plots show F(w_t) - F(w*) V.S. the number of passes of data. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

115 Optimization Stochastic Optimization Algorithms for Big Data APCG V.S. SDCA Lin et al. (2014) Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

116 Optimization Stochastic Optimization Algorithms for Big Data For general R(w) Dual Problem: max_{alpha in R^n} (1/n) sum_{i=1}^n -l_i*(alpha_i) - R*((1/(lambda*n)) sum_{i=1}^n alpha_i x_i), where R* is the conjugate of R. Sample i_t in {1,...,n}; optimize alpha_{i_t} while fixing the others. The update can still be done in O(d) in many cases (Shalev-Shwartz & Zhang (2013)). The iteration complexity and runtime of SDCA and APCG remain unchanged. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

117 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem min_{w in R^d} F(w) = (1/n) sum_{i=1}^n l(w^T x_i, y_i) + (lambda/2)||w||_2^2 + tau||w||_1. Suppose d >> n. Per-iteration cost O(d) is too high. Apply APCG to the primal instead of the dual problem. Sample over features instead of data. Per-iteration cost becomes O(n). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

118 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem min_{w in R^d} F(w) = (1/2)||Xw - y||_2^2 + (lambda/2)||w||_2^2 + tau||w||_1, X = [x_1, x_2, ..., x_n]^T. Full gradient: grad F(w) = X^T(Xw - y) + lambda*w. Partial gradient: grad_i F(w) = (X^T(Xw - y))_i + lambda*w_i. Proximal Coordinate Gradient (PCG) (Nesterov (2012)): w_i^t = arg min_{w_i in R} grad_i F(w^{t-1}) w_i + (1/(2 gamma_t))(w_i - w_i^{t-1})^2 + tau|w_i| (a proximal mapping) if i = i_t, and w_i^t = w_i^{t-1} otherwise. grad_i F(w^t) can be updated in O(n). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

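A sketch of the (non-accelerated) proximal coordinate gradient step above, maintaining the residual Xw - y so each coordinate update costs O(n). The coordinate-wise step size and the random sweep order are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def prox_coordinate_gradient(X, y, lam, tau, T):
    """min_w 0.5||Xw - y||^2 + 0.5*lam*||w||^2 + tau*||w||_1, one coordinate per iteration."""
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                                   # residual, updated incrementally
    rng = np.random.default_rng(0)
    for t in range(T):
        i = rng.integers(d)                          # sample a feature/coordinate
        grad_i = X[:, i] @ r + lam * w[i]            # partial gradient, O(n)
        gamma = 1.0 / (X[:, i] @ X[:, i] + lam)      # coordinate-wise step size (assumption)
        w_new_i = soft_threshold(w[i] - gamma * grad_i, gamma * tau)
        r += (w_new_i - w[i]) * X[:, i]              # O(n) residual update
        w[i] = w_new_i
    return w
```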
121 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem APCG accelerates PCG using an auxiliary sequence (v_t) and a momentum step. APCG: w_i^t = arg min_{w_i in R} grad_i F(v^{t-1}) w_i + (1/(2 gamma_t))(w_i - v_i^{t-1})^2 + tau|w_i| if i = i_t, and w_i^t = w_i^{t-1} otherwise; v^t = w^t + eta_t (w^t - w^{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

122 Optimization Stochastic Optimization Algorithms for Big Data APCG for primal problem Per-iteration cost: O(n). Iteration complexity (smooth loss): O(d/sqrt(eps)) for non-strongly convex R(w), O((d/sqrt(lambda)) log(1/eps)) for lambda-strongly convex R(w); N.A. for non-smooth loss. n >> d: apply APCG to the dual problem. d >> n: apply APCG to the primal problem. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

123 Optimization Stochastic Optimization Algorithms for Big Data Which Algorithm to Use Satisfied with large eps: SGD. For small eps: non-strongly convex R(w): SGD for non-smooth l, SAGA for smooth l; lambda-strongly convex R(w): APCG for non-smooth l, APCG for smooth l. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

124 Optimization Stochastic Optimization Algorithms for Big Data Summary
Table: Iteration complexity, per-iteration cost O(d)
smooth | str-cvx | SGD           | SAGA                     | SDCA                     | APCG
No     | No      | 1/eps^2       | N.A.                     | N.A.                     | N.A.
Yes    | No      | 1/eps^2       | n/eps                    | N.A.                     | N.A.
No     | Yes     | 1/(lambda*eps)| N.A.                     | n + 1/(lambda*eps)       | n + sqrt(n/(lambda*eps))
Yes    | Yes     | 1/(lambda*eps)| (n + 1/lambda) log(1/eps)| (n + 1/lambda) log(1/eps)| (n + sqrt(n/lambda)) log(1/eps)
Table: Iteration complexity, per-iteration cost O(d(n + 1/lambda))
smooth | str-cvx | SVRG
No     | No      | N.A.
Yes    | No      | N.A.
No     | Yes     | N.A.
Yes    | Yes     | log(1/eps)
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

126 Optimization Stochastic Optimization Algorithms for Big Data Summary
           | SGD     | SVRG       | SAGA  | SDCA | APCG
Memory     | O(d)    | O(d)       | O(dn) | O(d) | O(d)
Parameters | gamma_t | gamma_t, K | gamma_t | None | eta_t
Further rows compare the supported losses (absolute, hinge, square, squared hinge, logistic), the regularization requirement (lambda > 0 or lambda = 0), and whether the method is primal or dual.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

130 Optimization Stochastic Optimization Algorithms for Big Data Outline 2 Optimization (Sub)Gradient Methods Stochastic Optimization Algorithms for Big Data Stochastic Optimization Distributed Optimization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

131 Optimization Stochastic Optimization Algorithms for Big Data Big Data and Distributed Optimization Distributed Optimization: data distributed over a cluster of multiple machines; moving the data to a single machine suffers from low network bandwidth and limited disk or memory. Communication V.S. computation: RAM access ~100 nanoseconds; a standard network connection ~250,000 nanoseconds. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

132 Optimization Stochastic Optimization Algorithms for Big Data Distributed Data N data points are partitioned and distributed to m machines: [x_1, x_2, ..., x_N] = S_1 u S_2 u ... u S_m. Machine j only has access to S_j. W.L.O.G.: |S_j| = n = N/m. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

133 Optimization Stochastic Optimization Algorithms for Big Data A simple solution: Average Solution Global problem: w* = arg min_{w in R^d} {F(w) = (1/N) sum_{i=1}^N l(w^T x_i, y_i) + R(w)}. Machine j solves a local problem: w_hat_j = arg min_{w in R^d} f_j(w) = (1/n) sum_{i in S_j} l(w^T x_i, y_i) + R(w). Center computes: w_hat = (1/m) sum_{j=1}^m w_hat_j. Issue: will not converge to w*. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

135 Optimization Stochastic Optimization Algorithms for Big Data Mini-Batch SGD: Average Stochastic Gradient Machine j samples i_t in S_j and constructs a stochastic gradient g_j(w_{t-1}) = partial l(w_{t-1}^T x_{i_t}, y_{i_t}) + grad R(w_{t-1}). Center computes: w_t = w_{t-1} - gamma_t (1/m) sum_{j=1}^m g_j(w_{t-1}) (mini-batch stochastic gradient). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

136 Optimization Stochastic Optimization Algorithms for Big Data Total Runtime Single machine: Total Runtime = Per-iteration Cost x Iteration Complexity. Distributed optimization: Total Runtime = (Communication Time Per-round + Local Runtime Per-round) x Rounds of Communication. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

137 Optimization Stochastic Optimization Algorithms for Big Data Mini-Batch SGD Applicable in all settings! Communication time per-round: increases with m in a complicated way. Local runtime per-round: O(1). Suppose R(w) is lambda-strongly convex: rounds of communication O(1/(m*lambda*eps)). Suppose R(w) is non-strongly convex: rounds of communication O(1/(m*eps^2)). More machines reduce the rounds of communication but increase the communication time per-round. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

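A single-process simulation of the mini-batch SGD scheme above: each simulated machine draws one example from its own partition, and the center averages the stochastic gradients. Hinge loss, step sizes, and data layout are illustrative; a real system would use MPI/allreduce-style communication.

```python
import numpy as np

def minibatch_sgd(partitions, y_parts, lam, T):
    """partitions[j], y_parts[j]: the data held by machine j."""
    d = partitions[0].shape[1]
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        grads = []
        for Xj, yj in zip(partitions, y_parts):       # one stochastic gradient per machine
            i = rng.integers(Xj.shape[0])
            g = -yj[i] * Xj[i] if yj[i] * Xj[i] @ w < 1 else np.zeros(d)
            grads.append(g + lam * w)
        w -= (1.0 / (lam * t)) * np.mean(grads, axis=0)   # center averages and updates
    return w
```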
139 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA (Yang, 2013; Ma et al., 2015) Only works when R(w) = (lambda/2)||w||_2^2 with lambda > 0 (no l_1). Global dual problem: max_{alpha in R^N} (1/N) sum_{i=1}^N -l_i*(alpha_i) - (lambda/2)||(1/(lambda*N)) sum_{i=1}^N alpha_i x_i||_2^2, with alpha = [alpha_{S_1}, alpha_{S_2}, ..., alpha_{S_m}]. Machine j solves a local dual problem only over alpha_{S_j}: max_{alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i) - (lambda/2)||(1/(lambda*N)) sum_{i=1}^N alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

141 Optimization Stochastic Optimization Algorithms for Big Data DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015) Center maintains a primal solution: w_t = (1/(lambda*N)) sum_{i=1}^N alpha_i^t x_i. Change variable alpha_i -> alpha_i^t + Delta alpha_i: max_{Delta alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||(1/(lambda*N)) sum_{i=1}^N alpha_i^t x_i + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2 = max_{Delta alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

142 Optimization Stochastic Optimization Algorithms for Big Data DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015) Machine j approximately solves Delta alpha_{S_j}^t ~ arg max_{Delta alpha_{S_j} in R^n} (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2; alpha_{S_j}^{t+1} = alpha_{S_j}^t + Delta alpha_{S_j}^t; Delta w_j^t = (1/(lambda*N)) sum_{i in S_j} Delta alpha_i^t x_i. Center computes: w_{t+1} = w_t + sum_{j=1}^m Delta w_j^t. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

144 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015) Local objective value: G_j(Delta alpha_{S_j}, w_t) = (1/N) sum_{i in S_j} -l_i*(alpha_i^t + Delta alpha_i) - (lambda/2)||w_t + (1/(lambda*N)) sum_{i in S_j} Delta alpha_i x_i||_2^2. Solve Delta alpha_{S_j}^t by any local solver as long as max_{Delta alpha_{S_j}} G_j(Delta alpha_{S_j}, w_t) - G_j(Delta alpha_{S_j}^t, w_t) <= Theta (max_{Delta alpha_{S_j}} G_j(Delta alpha_{S_j}, w_t) - G_j(0, w_t)), with 0 < Theta < 1. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

145 Optimization Stochastic Optimization Algorithms for Big Data CoCoA+ (Ma et al., 2015) Suppose l(z) is smooth, R(w) is lambda-strongly convex and SDCA is the local solver: local runtime per-round O((1/lambda + N/m) log(1/Theta)); rounds of communication O((Theta/lambda) log(1/eps)). Suppose l(z) is non-smooth, R(w) is lambda-strongly convex and SDCA is the local solver: local runtime per-round O((1/lambda + N/m)(1/Theta)); rounds of communication O(Theta/(lambda*eps)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

147 Optimization Stochastic Optimization Algorithms for Big Data Distributed SDCA in Practice Choice of Θ (how long we run the local solver?) Choice of m (how many machines to use?) Fast machines but slow network: Use small Θ and small m Fast network but slow machines: Use large Θ and large m Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

149 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) The rounds of communication of distributed SDCA do not depend on m. DiSCO: Distributed Second-Order method (Zhang & Xiao, 2015). The rounds of communication of DiSCO depend on m. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

150 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Global problem: min_{w in R^d} {F(w) = (1/N) sum_{i=1}^N l(w^T x_i, y_i) + R(w)}. Local problem: min_{w in R^d} f_j(w) = (1/n) sum_{i in S_j} l(w^T x_i, y_i) + R(w). The global problem can be written as min_{w in R^d} F(w) = (1/m) sum_{j=1}^m f_j(w). Applicable when f_j(w) is smooth, lambda-strongly convex and self-concordant. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

151 Optimization Stochastic Optimization Algorithms for Big Data Newton Direction At the optimal solution w*: grad F(w*) = 0. We hope moving w_t along v_t leads to grad F(w_t - v_t) = 0. Taylor expansion: grad F(w_t - v_t) ~ grad F(w_t) - hess F(w_t) v_t = 0. Such a v_t is called a Newton direction. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

152 Optimization Stochastic Optimization Algorithms for Big Data Newton Method Find a Newton direction v_t by solving hess F(w_t) v_t = grad F(w_t), then update w_{t+1} = w_t - gamma_t v_t. This requires solving a d x d linear equation system, which is costly when d is large. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

153 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Inexact Newton Method Find an inexact Newton direction v_t using Preconditioned Conjugate Gradient (PCG) (Golub & Ye, 1997): ||hess F(w_t) v_t - grad F(w_t)||_2 <= eps_t. Then update w_{t+1} = w_t - gamma_t v_t. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
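A sketch of computing an inexact Newton direction from Hessian-vector products. For brevity this is plain conjugate gradient rather than DiSCO's preconditioned version, and the tolerance and iteration cap are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(hess_vec, g, tol=1e-6, max_iter=100):
    """Approximately solve H v = g using only Hessian-vector products hess_vec(p) = H p."""
    v = np.zeros_like(g)
    r = g - hess_vec(v)              # residual g - H v
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = rs / (p @ Hp)
        v += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol:   # ||H v - g|| small enough: inexact Newton direction
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v
```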

154 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) PCG: keep computing v -> P^{-1} hess F(w_t) v iteratively until v becomes an inexact Newton direction. Preconditioner: P = hess f_1(w_t) + mu*I, where mu is a tuning parameter such that ||hess f_1(w_t) - hess F(w_t)||_2 <= mu. f_1 is similar to F so that P is a good local preconditioner. mu = O(1/sqrt(n)) = O(sqrt(m/N)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

156 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) DiSCO computes hess F(w_t) v distributedly: hess F(w_t) v = (1/m)(hess f_1(w_t) v [machine 1] + hess f_2(w_t) v [machine 2] + ... + hess f_m(w_t) v [machine m]). Then, compute P^{-1} hess F(w_t) v only on machine 1. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

157 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

161 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) For high-dimensional data (e.g. d >= 1000), computing P^{-1} hess F(w_t) v exactly is too costly. Instead, use SDCA on machine 1 to solve P^{-1} hess F(w_t) v ~ arg min_{u in R^d} (1/2) u^T P u - u^T hess F(w_t) v. Local runtime: O(N/m + (1 + mu)/(lambda + mu)). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

162 Optimization Stochastic Optimization Algorithms for Big Data DiSCO (Zhang & Xiao, 2015) Suppose SDCA is the local solver: local runtime per-round O(N/m + (1 + mu)/(lambda + mu)); rounds of communication O(sqrt(mu/lambda) log(1/eps)). Choice of m (how many machines to use?), with mu = O(sqrt(m/N)): fast machines but slow network: use small m; fast network but slow machines: use large m. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

164 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015) A distributed version of SVRG using a round-robin scheme. Assume the user can control the distribution of the data before the algorithm starts. Applicable when f_j(w) is smooth and lambda-strongly convex. Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

165 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015) Easy to distribute: the full-gradient step of SVRG (let w_0 = w_ref_s and compute grad F(w_ref_s)). Hard to distribute: the inner stochastic gradient steps (g~_{i_t}(w_{t-1}) = grad F(w_ref_s) - g_{i_t}(w_ref_s) + g_{i_t}(w_{t-1}), w_t = w_{t-1} - gamma_t g~_{i_t}(w_{t-1}), w_ref_{s+1} = (1/K) sum_{t=1}^K w_t), because each machine can only sample from its own data, so E_{i_t in S_j}[g~_{i_t}(w_{t-1})] != grad F(w_{t-1}). Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

169 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015)
Solution: store a second set of data $R_j$ on machine $j$, sampled with replacement from $\{x_1, x_2, \dots, x_n\}$ before the algorithm starts.
Construct the stochastic gradient $g_{i_t}(w_{t-1})$ by sampling $i_t \in R_j$ and removing $i_t$ from $R_j$ afterwards.
$\mathbb{E}_{i_t \in R_j}\big[g_{i_t}(w_{t-1})\big] = \nabla F(w_{t-1})$
When $R_j = \emptyset$, pass $w_t$ to the next machine.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

170-171 Optimization Stochastic Optimization Algorithms for Big Data Full Gradient Step [figure-only slides] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

172-174 Optimization Stochastic Optimization Algorithms for Big Data Stochastic Gradient Step [figure-only slides] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

175 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015)
Suppose $|R_j| = r$ for all $j$.
Local runtime per round: $O\big(\tfrac{N}{m} + r\big)$
Rounds of communication: $O\big(\tfrac{1}{r\lambda}\log\tfrac{1}{\epsilon}\big)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

176 Optimization Stochastic Optimization Algorithms for Big Data DSVRG (Lee et al., 2015)
Choice of $m$ (how many machines to use?):
Fast machines but slow network: use small $m$
Fast network but slow machines: use large $m$
Choice of $r$ (how many data points to pre-sample into $R_j$?): the larger, the better.
Required memory per machine: $|S_j| + |R_j| = \tfrac{N}{m} + r$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

177 Optimization Stochastic Optimization Algorithms for Big Data Other Distributed Optimization Methods
ADMM (Boyd et al., 2011; Ozdaglar, 2015): rounds of communication $O\big(\text{network-graph-dependent term}\times\tfrac{1}{\lambda}\log\tfrac{1}{\epsilon}\big)$
DANE (Shamir et al., 2014): approximates the Newton direction with a different approach from DiSCO
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

178 Optimization Stochastic Optimization Algorithms for Big Data Summary
$\ell(z)$ is smooth and $R(w)$ is $\lambda$-strongly convex (assume $\mu = O(\sqrt{m/N})$):
Mini-SGD: runtime per round $O(1)$; rounds of communication $O\big(\tfrac{1}{\lambda m\epsilon}\big)$
Dist-SDCA: runtime per round $O\big(\tfrac{1}{\lambda} + \tfrac{N}{m}\big)$; rounds of communication $O\big(\tfrac{1}{\lambda}\log\tfrac{1}{\epsilon}\big)$
DiSCO: runtime per round $O\big(\tfrac{N}{m}\big)$; rounds of communication $O\big(\tfrac{m^{1/4}}{\sqrt{\lambda}\,N^{1/4}}\log\tfrac{1}{\epsilon}\big)$
DSVRG, full gradient step: runtime per round $O\big(\tfrac{N}{m}\big)$; rounds of communication $O\big(\log\tfrac{1}{\epsilon}\big)$
DSVRG, stochastic gradient step: runtime per round $O(r)$; rounds of communication $O\big(\tfrac{1}{r\lambda}\log\tfrac{1}{\epsilon}\big)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

179 Optimization Stochastic Optimization Algorithms for Big Data Summary
$\ell(z)$ is non-smooth and $R(w)$ is $\lambda$-strongly convex:
Mini-SGD: runtime per round $O(1)$; rounds of communication $O\big(\tfrac{1}{\lambda m\epsilon}\big)$
Dist-SDCA: runtime per round $O\big(\tfrac{1}{\lambda} + \tfrac{N}{m}\big)$; rounds of communication $O\big(\tfrac{1}{\lambda\epsilon}\big)$
DiSCO: N.A. DSVRG: N.A.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

180 Optimization Stochastic Optimization Algorithms for Big Data Summary
$R(w)$ is non-strongly convex:
Mini-SGD: runtime per round $O(1)$; rounds of communication $O\big(\tfrac{1}{m\epsilon^2}\big)$
Dist-SDCA: N.A. DiSCO: N.A. DSVRG: N.A.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

181 Optimization Stochastic Optimization Algorithms for Big Data Which Algorithm to Use
$R(w)$ non-strongly convex: Mini-SGD (non-smooth $\ell$), DSVRG (smooth $\ell$)
$R(w)$ $\lambda$-strongly convex: Dist-SDCA (non-smooth $\ell$), DiSCO/DSVRG (smooth $\ell$)
Between DiSCO and DSVRG:
Use DiSCO for small $\lambda$, e.g., $\lambda < 10^{-5}$
Use DSVRG for large $\lambda$, e.g., $\lambda > 10^{-5}$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

182 Optimization Stochastic Optimization Algorithms for Big Data Distributed Machine Learning Systems and Libraries Petuum: Apache Spark: Parameter Server: Birds: tyng/software.html Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

183 Optimization Stochastic Optimization Algorithms for Big Data Thank You! Questions? Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

184 Randomized Dimension Reduction Big Data Analytics: Optimization and Randomization Part III: Randomization Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

185 Randomized Dimension Reduction Outline 1 Basics 2 Optimization 3 Randomized Dimension Reduction 4 Randomized Algorithms 5 Concluding Remarks Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

186 Randomized Dimension Reduction Random Sketch Approximate a large data matrix by a much smaller sketch Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

187-190 Randomized Dimension Reduction The Framework of Randomized Algorithms [figure-only slides] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

191 Randomized Dimension Reduction Why randomized dimension reduction? Efficient Robust (e.g., dropout) Formal Guarantees Can explore parallel algorithms Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

192 Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

193 Randomized Dimension Reduction JL Lemma
JL Lemma (Johnson & Lindenstrauss, 1984): For any $0 < \epsilon, \delta < 1/2$, there exists a probability distribution on $m\times d$ real matrices $A$ and a small universal constant $c > 0$ such that for any fixed $x\in\mathbb{R}^d$, with probability at least $1-\delta$,
$\big|\,\|Ax\|_2^2 - \|x\|_2^2\,\big| \le c\sqrt{\tfrac{\log(1/\delta)}{m}}\,\|x\|_2^2$
or, for $m = \Theta(\epsilon^{-2}\log(1/\delta))$, with probability at least $1-\delta$,
$\big|\,\|Ax\|_2^2 - \|x\|_2^2\,\big| \le \epsilon\,\|x\|_2^2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

194 Randomized Dimension Reduction Embedding a set of points into low dimensional space
Given a set of points $x_1,\dots,x_n\in\mathbb{R}^d$, we can embed them into a low dimensional space as $Ax_1,\dots,Ax_n\in\mathbb{R}^m$ such that the pairwise distance between any two points is well preserved in the low dimensional space:
$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \le (1+\epsilon)\,\|x_i - x_j\|_2^2$
$\|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \ge (1-\epsilon)\,\|x_i - x_j\|_2^2$
In other words, to preserve all pairwise Euclidean distances up to $1\pm\epsilon$, only $m = \Theta(\epsilon^{-2}\log(n^2/\delta))$ dimensions are necessary.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

195 Randomized Dimension Reduction JL transforms: Gaussian Random Projection
Gaussian random projection (Dasgupta & Gupta, 2003): $A\in\mathbb{R}^{m\times d}$ with $A_{ij}\sim\mathcal{N}(0, 1/m)$ and $m = \Theta(\epsilon^{-2}\log(1/\delta))$.
Computational cost of $AX$, where $X\in\mathbb{R}^{d\times n}$: $mnd$ for dense matrices, $\mathrm{nnz}(X)\,m$ for sparse matrices.
The computational cost is very high (could be as high as solving many problems).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
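A minimal sketch of the Gaussian random projection (toy code of our own, with illustrative sizes): A has i.i.d. N(0, 1/m) entries and the sketch is AX for X in R^{d x n} whose columns are the points.

import numpy as np

def gaussian_sketch(X, m, seed=0):
    """Gaussian random projection: A has i.i.d. N(0, 1/m) entries; returns A @ X.

    X : (d, n) data matrix (columns are points); A : (m, d); sketch : (m, n).
    The O(m*n*d) cost of this dense product is what motivates the faster transforms below.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, d))
    return A @ X

# quick check that squared norms are roughly preserved (the JL property)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((2000, 50))          # d = 2000, n = 50 points
    S = gaussian_sketch(X, m=300)
    ratio = np.linalg.norm(S, axis=0) ** 2 / np.linalg.norm(X, axis=0) ** 2
    print(ratio.min(), ratio.max())              # typically close to 1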

196 Randomized Dimension Reduction Accelerate JL transforms: using discrete distributions
Using discrete distributions (Achlioptas, 2003):
$\Pr\big(A_{ij} = \pm\tfrac{1}{\sqrt{m}}\big) = 0.5$, or
$\Pr\big(A_{ij} = \pm\sqrt{\tfrac{3}{m}}\big) = \tfrac{1}{6}$, $\Pr(A_{ij} = 0) = \tfrac{2}{3}$
Database friendly: replaces multiplications by additions and subtractions.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
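A sketch of the database-friendly variant using the sparser of the two distributions above; the function name and the dense construction of A are ours, chosen for clarity rather than speed.

import numpy as np

def achlioptas_sketch(X, m, seed=0):
    """Sparse sign projection of Achlioptas (2003): entries are
    +sqrt(3/m) w.p. 1/6, 0 w.p. 2/3, -sqrt(3/m) w.p. 1/6.
    Returns A @ X for X of shape (d, n)."""
    rng = np.random.default_rng(seed)
    d = X.shape[0]
    vals = rng.choice([-1.0, 0.0, 1.0], size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
    A = np.sqrt(3.0 / m) * vals
    return A @ X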

197 Randomized Dimension Reduction Accelerate JL transforms: using Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform.
Motivation: can we simply use a random sampling matrix $P\in\mathbb{R}^{m\times d}$ that randomly selects $m$ coordinates out of $d$ coordinates (scaled by $\sqrt{d/m}$)?
Unfortunately, by the Chernoff bound,
$\big|\,\|Px\|_2^2 - \|x\|_2^2\,\big| \le \tfrac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2}\sqrt{\tfrac{3\log(2/\delta)}{m}}\,\|x\|_2^2$
Unless $\tfrac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2} \le c$, random sampling does not work.
The remedy is given by the randomized Hadamard transform.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

200 Randomized Dimension Reduction Randomized Hadamard transform
Hadamard transform $H\in\mathbb{R}^{d\times d}$: $H = \tfrac{1}{\sqrt{d}}H_{2^k}$, where
$H_1 = [1]$, $H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, $H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}$
$\|Hx\|_2 = \|x\|_2$ and $H$ is orthogonal. Computational cost of $Hx$: $d\log(d)$.
Randomized Hadamard transform: $HD$, where $D\in\mathbb{R}^{d\times d}$ is a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$.
$HD$ is orthogonal and $\|HDx\|_2 = \|x\|_2$.
Key property: $\tfrac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \le \sqrt{\log(d/\delta)}$ w.h.p. $1-\delta$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

203 Randomized Dimension Reduction Accelerate JL transforms: using Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform (Tropp, 2011):
$A = \sqrt{\tfrac{d}{m}}\,PHD$ yields $\big|\,\|Ax\|_2^2 - \|x\|_2^2\,\big| \le \sqrt{\tfrac{\log(2/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2$
$m = \Theta(\epsilon^{-2}\log(1/\delta)\log(d/\delta))$ suffices for $1\pm\epsilon$; the additional factor $\log(d/\delta)$ can be removed.
Computational cost of $AX$: $O(nd\log(m))$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
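A toy sketch of $A = \sqrt{d/m}\,PHD$ applied to one vector, assuming d is a power of two; the fwht routine below is a plain Python loop implementing the (unnormalized) fast Walsh-Hadamard transform, written for readability rather than speed, and all names are ours.

import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a vector whose length is a power of 2
    (unnormalized: applies H_{2^k} without the 1/sqrt(d) factor)."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def srht_sketch(x, m, seed=0):
    """Subsampled randomized Hadamard transform A x = sqrt(d/m) * P H D x:
    D = random signs, H = normalized Hadamard, P = uniform sampling of m coordinates."""
    rng = np.random.default_rng(seed)
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)          # D
    hx = fwht(signs * x) / np.sqrt(d)                # H D x  (normalized, so norm is preserved)
    idx = rng.choice(d, size=m, replace=False)       # P
    return np.sqrt(d / m) * hx[idx]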

204 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I)
Random hashing (Dasgupta et al., 2010): $A = HD$, where $D\in\mathbb{R}^{d\times d}$ and $H\in\mathbb{R}^{m\times d}$.
Random hashing: $h(j): \{1,\dots,d\}\to\{1,\dots,m\}$, with $H_{ij} = 1$ if $h(j) = i$ (a sparse matrix; each column has only one non-zero entry).
$D\in\mathbb{R}^{d\times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$.
$[Ax]_j = \sum_{i: h(i) = j} x_i D_{ii}$
Technically speaking, random hashing does not satisfy the JL lemma.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
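A minimal sketch of random hashing A = HD applied to a d x n matrix (names are ours): the bucket assignments h(j) and signs D_jj are drawn once, and the sketch is built with a scatter-add, so the cost is O(nnz(X)) for sparse data.

import numpy as np

def random_hashing_sketch(X, m, seed=0):
    """Random hashing A = H D applied to X in R^{d x n}:
    coordinate j goes to bucket h(j) in {0,...,m-1} with a random sign D_jj,
    so (A x)_b = sum_{j: h(j)=b} D_jj * x_j."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    buckets = rng.integers(0, m, size=d)             # h(j)
    signs = rng.choice([-1.0, 1.0], size=d)          # D_jj
    S = np.zeros((m, n))
    np.add.at(S, buckets, signs[:, None] * X)        # scatter-add the signed rows into buckets
    return S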

206 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (I)
Key properties: $\mathbb{E}[\langle HDx_1, HDx_2\rangle] = \langle x_1, x_2\rangle$, and norm preserving,
$\big|\,\|HDx\|_2^2 - \|x\|_2^2\,\big| \le \epsilon\,\|x\|_2^2$, only when $\tfrac{\|x\|_\infty}{\|x\|_2} \le \tfrac{1}{\sqrt{c}}$
Apply a randomized Hadamard transform $P$ first: $\Theta(c\log(c/\delta))$ blocks of randomized Hadamard transforms give $\tfrac{\|Px\|_\infty}{\|Px\|_2} \le \tfrac{1}{\sqrt{c}}$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

207 Randomized Dimension Reduction Accelerate JL transforms: using a sparse matrix (II)
Sparse JL transform based on block random hashing (Kane & Nelson, 2014):
$A = \tfrac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$, where each $Q_i\in\mathbb{R}^{v\times d}$ is an independent random hashing ($HD$) matrix.
Set $v = \Theta(\epsilon^{-1})$ and $s = \Theta(\epsilon^{-1}\log(1/\delta))$.
Computational cost of $AX$: $O\!\left(\tfrac{\mathrm{nnz}(X)}{\epsilon}\log\tfrac{1}{\delta}\right)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

208 Randomized Dimension Reduction Randomized Dimension Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

209 Randomized Dimension Reduction Subspace Embeddings
Definition: a subspace embedding, given parameters $0 < \epsilon, \delta < 1$ and $k \le d$, is a distribution $\mathcal{D}$ over matrices $A\in\mathbb{R}^{m\times d}$ such that for any fixed linear subspace $W\subseteq\mathbb{R}^d$ with $\dim(W) = k$,
$\Pr_{A\sim\mathcal{D}}\big(\forall x\in W:\ \|Ax\|_2 \in (1\pm\epsilon)\,\|x\|_2\big) \ge 1-\delta$
It implies: if $U\in\mathbb{R}^{d\times k}$ is an orthogonal matrix (contains the orthonormal bases of $W$), then $AU\in\mathbb{R}^{m\times k}$ has full column rank, the singular values of $AU$ lie in $1\pm\epsilon$, and $(1-\epsilon)^2 I \preceq U^\top A^\top AU \preceq (1+\epsilon)^2 I$.
These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

212 Randomized Dimension Reduction Subspace Embeddings
From a JL transform to a subspace embedding (Sarlós, 2006): let $A\in\mathbb{R}^{m\times d}$ be a JL transform. If
$m = O\!\left(\tfrac{k\log\frac{k}{\delta\epsilon}}{\epsilon^2}\right)$
then w.h.p. $1-\delta$, $A\in\mathbb{R}^{m\times d}$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

213 Randomized Dimension Reduction Subspace Embeddings
Making block random hashing a subspace embedding (Nelson & Nguyen, 2013):
$A = \tfrac{1}{\sqrt{s}}\begin{bmatrix} Q_1 \\ \vdots \\ Q_s \end{bmatrix}$, where each $Q_i\in\mathbb{R}^{v\times d}$ is an independent random hashing ($HD$) matrix.
Set $v = \Theta(k\epsilon^{-1}\log^5(k/\delta))$ and $s = \Theta(\epsilon^{-1}\log^3(k/\delta))$.
Then w.h.p. $1-\delta$, $A\in\mathbb{R}^{m\times d}$ with $m = \Theta\!\left(\tfrac{k\log^8(k/\delta)}{\epsilon^2}\right)$ is a subspace embedding w.r.t. a $k$-dimensional subspace of $\mathbb{R}^d$.
Computational cost of $AX$: $O\!\left(\tfrac{\mathrm{nnz}(X)}{\epsilon}\log^3\tfrac{k}{\delta}\right)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

214 Randomized Dimension Reduction Sparse Subspace Embedding (SSE)
Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013): $A = HD$, where $D\in\mathbb{R}^{d\times d}$ and $H\in\mathbb{R}^{m\times d}$.
$m = \Omega(k^2/\epsilon^2)$ suffices for a subspace embedding with probability $2/3$.
Computational cost of $AX$: $O(\mathrm{nnz}(X))$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

215 Randomized Dimension Reduction Randomized Dimensionality Reduction Johnson-Lindenstrauss (JL) transforms Subspace embeddings Column (Row) sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

216 Randomized Dimension Reduction Column sampling Column subset selection (feature selection) More interpretable Uniform sampling usually does not work (not a JL transform) Non-oblivious sampling (data-dependent sampling) leverage-score sampling Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

219 Randomized Dimension Reduction Leverage-score sampling (Drineas et al., 2006)
Let $X\in\mathbb{R}^{d\times n}$ be a rank-$k$ matrix with $X = U\Sigma V^\top$, $U\in\mathbb{R}^{d\times k}$, $\Sigma\in\mathbb{R}^{k\times k}$.
Leverage scores: $\|U_{i*}\|_2^2$, $i = 1,\dots,d$.
Let $p_i = \tfrac{\|U_{i*}\|_2^2}{\sum_{i'=1}^{d}\|U_{i'*}\|_2^2}$, $i = 1,\dots,d$.
Let $i_1,\dots,i_m\in\{1,\dots,d\}$ denote $m$ indices sampled according to $p_i$.
Let $A\in\mathbb{R}^{m\times d}$ be the sampling-and-rescaling matrix: $A_{tj} = \tfrac{1}{\sqrt{m p_j}}$ if $j = i_t$, and $0$ otherwise.
$AX\in\mathbb{R}^{m\times n}$ is a small sketch of $X$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
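A sketch of leverage-score row sampling using an exact rank-k SVD (the cost caveat noted below; in practice one would plug in approximate leverage scores). The function name and interface are our own illustrative choices.

import numpy as np

def leverage_score_sketch(X, k, m, seed=0):
    """Leverage-score row sampling of X in R^{d x n}.
    Returns the m x n sketch A X built from the sampling-and-rescaling matrix A."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, :k]                                    # top-k left singular vectors, d x k
    scores = np.sum(U ** 2, axis=1)                 # leverage scores ||U_i||_2^2
    p = scores / scores.sum()
    idx = rng.choice(X.shape[0], size=m, replace=True, p=p)
    scale = 1.0 / np.sqrt(m * p[idx])               # rescale each sampled row by 1/sqrt(m p_i)
    return scale[:, None] * X[idx]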

224 Randomized Dimension Reduction Properties of Leverage-score sampling
When $m = \Theta\!\left(\tfrac{k}{\epsilon^2}\log\tfrac{2k}{\delta}\right)$, w.h.p. $1-\delta$: $AU\in\mathbb{R}^{m\times k}$ has full column rank and $(1-\epsilon)^2 \le \sigma_i^2(AU) \le (1+\epsilon)^2$, i.e., $1-\epsilon \le \sigma_i(AU) \le 1+\epsilon$.
Leverage-score sampling performs like a subspace embedding (only for $U$, the top singular vector matrix of $X$).
Computational cost: computing the top-$k$ SVD of $X$ is expensive; randomized algorithms can compute approximate leverage scores.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

226 Randomized Dimension Reduction When does uniform sampling make sense?
Coherence measure: $\mu_k = \tfrac{d}{k}\max_{1\le i\le d}\|U_{i*}\|_2^2$
Uniform sampling is valid when the coherence measure is small (some real data mining datasets have small coherence measures).
The Nyström method usually uses uniform sampling (Gittens, 2011).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

227 Randomized Algorithms Randomized Classification (Regression) Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

228 Randomized Algorithms Randomized Classification (Regression) Classification
Classification problems: $\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n}\ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
$y_i\in\{+1,-1\}$: label. Loss function $\ell(z)$ with $z = y\,w^\top x$:
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$
2. Logistic regression: $\ell(z) = \log(1 + \exp(-z))$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

229 Randomized Algorithms Randomized Classification (Regression) Randomized Classification
For large-scale, high-dimensional problems, the computational cost of optimization is $O((nd + d\kappa)\log(1/\epsilon))$.
Use a random reduction $A\in\mathbb{R}^{d\times m}$ ($m\ll d$), e.g. a JL transform or a sparse subspace embedding, to reduce $X\in\mathbb{R}^{n\times d}$ to $\hat X = XA\in\mathbb{R}^{n\times m}$. Then solve
$\min_{u\in\mathbb{R}^m}\ \frac{1}{n}\sum_{i=1}^{n}\ell(y_i u^\top\hat x_i) + \frac{\lambda}{2}\|u\|_2^2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
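A minimal end-to-end sketch of the reduce-then-solve scheme above: a Gaussian reduction followed by L2-regularized logistic regression fit with plain gradient descent (a stand-in solver chosen only to keep the example self-contained); all names and hyperparameters are ours.

import numpy as np

def reduce_then_train(X, y, m, lam=1e-3, seed=0, iters=200, lr=0.5):
    """Reduce X in R^{n x d} to Xhat = X A with a Gaussian A in R^{d x m}, then fit
    L2-regularized logistic regression on the m-dimensional features."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    Xh = X @ A                                       # n x m reduced data
    u = np.zeros(m)
    for _ in range(iters):
        z = y * (Xh @ u)                             # margins y_i * u^T xhat_i
        g = -(Xh * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0) + lam * u
        u -= lr * g                                  # gradient step on the reduced objective
    return u, A                                      # predict a new x via sign(u @ (A.T @ x))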

230 Randomized Algorithms Randomized Classification (Regression) Randomized Classification
Two questions:
1. Is there any performance guarantee?
The margin is preserved if the data are linearly separable (Balcan et al., 2006), as long as $m \ge \tfrac{12}{\epsilon^2}\log\tfrac{6m^2}{\delta}$.
Generalization performance is preserved if the data matrix is of low rank and $m = \Omega\!\left(\tfrac{k\,\mathrm{poly}(\log(k/(\delta\epsilon)))}{\epsilon^2}\right)$ (Paul et al., 2013).
2. How to recover an accurate model in the original high-dimensional space?
Dual recovery (Zhang et al., 2014) and dual sparse recovery (Yang et al., 2015).
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

235 Randomized Algorithms Randomized Classification (Regression) The Dual Problem
Using the Fenchel conjugate: $\ell_i^*(\alpha_i) = \max_{z}\ \alpha_i z - \ell(z, y_i)$
Primal: $w_* = \arg\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n}\ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$
Dual: $\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top XX^\top\alpha$
From dual to primal: $w_* = \frac{1}{\lambda n}X^\top\alpha_*$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

236 Randomized Algorithms Randomized Classification (Regression) Dual Recovery for Randomized Reduction
From the dual formulation: $w_*$ lies in the row space of the data matrix $X\in\mathbb{R}^{n\times d}$.
Dual recovery: $\hat w = \frac{1}{\lambda n}X^\top\hat\alpha_*$, where
$\hat\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\hat X\hat X^\top\alpha$
and $\hat X = XA\in\mathbb{R}^{n\times m}$.
Subspace embedding $A$ with $m = \Theta(r\log(r/\delta)\epsilon^{-2})$.
Guarantee: under a low-rank assumption on the data matrix $X$ (e.g., $\mathrm{rank}(X) = r$), with high probability $1-\delta$, $\|\hat w - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\|w_*\|_2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

239 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery for Randomized Reduction
Assume the optimal dual solution $\alpha_*$ is sparse (i.e., the number of support vectors is small).
Dual sparse recovery: $\hat w = \frac{1}{\lambda n}X^\top\hat\alpha_*$, where
$\hat\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\hat X\hat X^\top\alpha - \frac{\tau}{n}\|\alpha\|_1$
and $\hat X = XA\in\mathbb{R}^{n\times m}$.
JL transform $A$ with $m = \Theta(s\log(n/\delta)\epsilon^{-2})$.
Guarantee: if $\alpha_*$ is $s$-sparse, with high probability $1-\delta$, $\|\hat w - w_*\|_2 \le \epsilon\|w_*\|_2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

242 Randomized Algorithms Randomized Classification (Regression) Dual Sparse Recovery
RCV1 text data, n = 677,399 and d = 47,236. [Plots of the relative dual error and relative primal error (L2 norm) versus τ at a fixed λ, for m = 1024, 2048, 4096, 8192.]
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

243 Randomized Algorithms Randomized Least-Squares Regression Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

244 Randomized Algorithms Randomized Least-Squares Regression Least-squares regression
Let $X\in\mathbb{R}^{n\times d}$ with $d \ll n$ and $b\in\mathbb{R}^n$. The least-squares regression problem is to find
$w_* = \arg\min_{w\in\mathbb{R}^d}\ \|Xw - b\|_2$
Computational cost: $O(nd^2)$. Goal of randomized algorithms: $o(nd^2)$.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

245 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression
Let $A\in\mathbb{R}^{m\times n}$ be a random reduction matrix. Solve
$\hat w = \arg\min_{w\in\mathbb{R}^d}\ \|A(Xw - b)\|_2 = \arg\min_{w\in\mathbb{R}^d}\ \|AXw - Ab\|_2$
Computational cost: $O(md^2)$ + reduction time.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

246 Randomized Algorithms Randomized Least-Squares Regression Randomized Least-squares regression
Theoretical guarantees (Sarlós, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
$\|X\hat w - b\|_2 \le (1+\epsilon)\,\|Xw_* - b\|_2$
Total time: $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\,\epsilon^{-2})$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
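A minimal sketch-and-solve example matching the setup above; a Gaussian sketch is used here for brevity, whereas the nnz(X) running time quoted above requires a sparse subspace embedding. Names and sizes are ours.

import numpy as np

def sketched_least_squares(X, b, m, seed=0):
    """Solve the m x d sketched problem min_w ||A X w - A b||_2 with a Gaussian A in R^{m x n}."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    w_hat, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)
    return w_hat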

247 Randomized Algorithms Randomized K-means Clustering Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

248 Randomized Algorithms Randomized K-means Clustering K-means Clustering
Let $x_1,\dots,x_n\in\mathbb{R}^d$ be a set of data points. K-means clustering aims to solve
$\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\sum_{x_i\in C_j}\|x_i - \mu_j\|_2^2$
Computational cost: $O(ndkt)$, where $t$ is the number of iterations.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

249 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering
Let $X = (x_1,\dots,x_n)^\top\in\mathbb{R}^{n\times d}$ be the data matrix. For high-dimensional data:
Random sketch: $\hat X = XA\in\mathbb{R}^{n\times m}$, $m\ll d$
Approximate K-means: $\min_{C_1,\dots,C_k}\ \sum_{j=1}^{k}\sum_{\hat x_i\in C_j}\|\hat x_i - \hat\mu_j\|_2^2$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

251 Randomized Algorithms Randomized K-means Clustering Randomized Algorithms for K-means Clustering
For the random sketch: JL transforms and sparse subspace embeddings all work.
JL transform: $m = O\!\left(\tfrac{k\log(k/(\epsilon\delta))}{\epsilon^2}\right)$
Sparse subspace embedding: $m = O\!\left(\tfrac{k^2}{\epsilon^2\delta}\right)$
$\epsilon$ relates to the approximation accuracy.
The analysis of the approximation error for K-means can be formulated as constrained low-rank approximation (Cohen et al., 2015): $\min_{Q:\,Q^\top Q = I}\ \|X - QQ^\top X\|_F^2$, where $Q$ is orthonormal.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
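A toy sketch of randomized K-means: project the features with a Gaussian A (any JL transform or sparse subspace embedding could be substituted, per the slide above) and run plain Lloyd iterations on XA. The K-means routine below is a bare-bones stand-in of ours, not a tuned implementation.

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # squared distances
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def randomized_kmeans(X, k, m, seed=0):
    """Sketch the d features down to m with a Gaussian A and cluster the n x m matrix X A."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    return kmeans(X @ A, k, seed=seed)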

252 Randomized Algorithms Randomized Kernel methods Outline 4 Randomized Algorithms Randomized Classification (Regression) Randomized Least-Squares Regression Randomized K-means Clustering Randomized Kernel methods Randomized Low-rank Matrix Approximation Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

253 Randomized Algorithms Randomized Kernel methods Kernel methods
Kernel function: $\kappa(\cdot,\cdot)$, and a set of examples $x_1,\dots,x_n$.
Kernel matrix: $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = \kappa(x_i, x_j)$; $K$ is a PSD matrix.
Computational and memory costs: $\Omega(n^2)$.
Approximation methods: the Nyström method, random Fourier features.
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

255 Randomized Algorithms Randomized Kernel methods The Nyström method
Let $A\in\mathbb{R}^{n\times l}$ be a uniform sampling matrix.
$B = KA\in\mathbb{R}^{n\times l}$, $C = A^\top B = A^\top KA$
The Nyström approximation (Drineas & Mahoney, 2005): $\tilde K = BC^{\dagger}B^\top$
Computational cost: $O(l^3 + nl^2)$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234

257 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine
The dual problem: $\arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top BC^{\dagger}B^\top\alpha$
Solve it like a linear method with $\hat X = BC^{-1/2}\in\mathbb{R}^{n\times l}$:
$\arg\max_{\alpha\in\mathbb{R}^n}\ -\frac{1}{n}\sum_{i=1}^{n}\ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top\hat X\hat X^\top\alpha$
Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234
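A minimal sketch of the Nyström pipeline from the two slides above: sample l columns uniformly, form B and C, build the approximation BC†Bᵀ, and also return the feature map X̂ = BC^{-1/2} that the linear dual solver above consumes. The eigendecomposition-based pseudo-inverse and the tolerance are our own choices.

import numpy as np

def nystrom(K, l, seed=0):
    """Nystrom approximation of a PSD kernel matrix K (n x n) from l uniformly sampled columns."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    idx = rng.choice(n, size=l, replace=False)
    B = K[:, idx]                        # n x l
    C = K[np.ix_(idx, idx)]              # l x l
    vals, vecs = np.linalg.eigh(C)       # eigendecomposition for pseudo-inverse / inverse sqrt
    keep = vals > 1e-10
    C_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T
    C_isqrt = (vecs[:, keep] / np.sqrt(vals[keep])) @ vecs[:, keep].T
    K_approx = B @ C_pinv @ B.T          # K ~= B C^+ B^T
    X_hat = B @ C_isqrt                  # n x l features with X_hat X_hat^T = K_approx
    return K_approx, X_hat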

258 Randomized Algorithms Randomized Kernel methods The Nyström based kernel machine [figure-only slide] Yang, Lin, Jin Tutorial for KDD 15 August 10, / 234


A Stochastic 3MG Algorithm with Application to 2D Filter Identification A Stochastic 3MG Algorithm with Application to 2D Filter Identification Emilie Chouzenoux 1, Jean-Christophe Pesquet 1, and Anisia Florescu 2 1 Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est,

More information

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers

Variance Reduction. Pricing American Options. Monte Carlo Option Pricing. Delta and Common Random Numbers Variance Reduction The statistical efficiency of Monte Carlo simulation can be measured by the variance of its output If this variance can be lowered without changing the expected value, fewer replications

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

5. Orthogonal matrices

5. Orthogonal matrices L Vandenberghe EE133A (Spring 2016) 5 Orthogonal matrices matrices with orthonormal columns orthogonal matrices tall matrices with orthonormal columns complex matrices with orthonormal columns 5-1 Orthonormal

More information

Bilinear Prediction Using Low-Rank Models

Bilinear Prediction Using Low-Rank Models Bilinear Prediction Using Low-Rank Models Inderjit S. Dhillon Dept of Computer Science UT Austin 26th International Conference on Algorithmic Learning Theory Banff, Canada Oct 6, 2015 Joint work with C-J.

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

CS229T/STAT231: Statistical Learning Theory (Winter 2015)

CS229T/STAT231: Statistical Learning Theory (Winter 2015) CS229T/STAT231: Statistical Learning Theory (Winter 2015) Percy Liang Last updated Wed Oct 14 2015 20:32 These lecture notes will be updated periodically as the course goes on. Please let us know if you

More information

Linear Algebra Review. Vectors

Linear Algebra Review. Vectors Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

More information

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA)

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA) SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI (B.Sc.(Hons.), BUAA) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT OF MATHEMATICS NATIONAL

More information

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)

More information

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization. Based on Mark Schmidt s slides Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

More information

We shall turn our attention to solving linear systems of equations. Ax = b

We shall turn our attention to solving linear systems of equations. Ax = b 59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

Notes on Symmetric Matrices

Notes on Symmetric Matrices CPSC 536N: Randomized Algorithms 2011-12 Term 2 Notes on Symmetric Matrices Prof. Nick Harvey University of British Columbia 1 Symmetric Matrices We review some basic results concerning symmetric matrices.

More information

Variational approach to restore point-like and curve-like singularities in imaging

Variational approach to restore point-like and curve-like singularities in imaging Variational approach to restore point-like and curve-like singularities in imaging Daniele Graziani joint work with Gilles Aubert and Laure Blanc-Féraud Roma 12/06/2012 Daniele Graziani (Roma) 12/06/2012

More information

Performance of First- and Second-Order Methods for Big Data Optimization

Performance of First- and Second-Order Methods for Big Data Optimization Noname manuscript No. (will be inserted by the editor) Performance of First- and Second-Order Methods for Big Data Optimization Kimon Fountoulakis Jacek Gondzio Received: date / Accepted: date Abstract

More information

Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

More information

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels

Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,

More information

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix

October 3rd, 2012. Linear Algebra & Properties of the Covariance Matrix Linear Algebra & Properties of the Covariance Matrix October 3rd, 2012 Estimation of r and C Let rn 1, rn, t..., rn T be the historical return rates on the n th asset. rn 1 rṇ 2 r n =. r T n n = 1, 2,...,

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes

Online Learning. 1 Online Learning and Mistake Bounds for Finite Hypothesis Classes Advanced Course in Machine Learning Spring 2011 Online Learning Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz In this lecture we describe a different model of learning which is called online

More information