Stochastic Optimization for Big Data Analytics: Algorithms and Libraries


Stochastic Optimization for Big Data Analytics: Algorithms and Libraries
Tianbao Yang; collaborators: Rong Jin, Shenghuo Zhu
NEC Laboratories America / Michigan State University
SDM 2014, Philadelphia, Pennsylvania, February 9, 2014

The updates are available here. Thanks.

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

STochastic OPtimization (STOP) and Machine Learning: Introduction
Machine learning problems and stochastic optimization.
Coverage:
Classification and regression in different forms
Motivation to employ STochastic OPtimization (STOP)
Basic convex optimization knowledge

Learning as Optimization
Least Square Regression Problem:
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{Regularization}}$
$x_i \in \mathbb{R}^d$: d-dimensional feature vector
$y_i \in \mathbb{R}$: response variable
$w \in \mathbb{R}^d$: unknown parameters
$n$: number of data points
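
As a concrete illustration of the objective above, the following minimal NumPy sketch builds a synthetic regression problem and solves the regularized least-squares objective in closed form; the data `X`, `y` and the value of `lam` are hypothetical placeholders, not from the tutorial.

```python
import numpy as np

# Minimal sketch: min_w (1/n) * sum_i (y_i - w^T x_i)^2 + (lambda/2) * ||w||_2^2
# on synthetic data (X, y, lam are illustrative placeholders).
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def objective(w):
    return np.mean((y - X @ w) ** 2) + 0.5 * lam * np.dot(w, w)

# Setting the gradient to zero gives ((2/n) X^T X + lam I) w = (2/n) X^T y.
w_star = np.linalg.solve((2.0 / n) * X.T @ X + lam * np.eye(d),
                         (2.0 / n) * X.T @ y)
print("optimal objective:", objective(w_star))
```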

Learning as Optimization
Classification Problems:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
$y_i \in \{+1, -1\}$: label
Loss function $\ell(z)$:
1. SVMs: hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$

Learning as Optimization
Feature Selection:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \lambda \|w\|_1$
$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^d |w_i|$
$\lambda$ controls the sparsity level

Learning as Optimization
Feature Selection using Elastic Net:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \lambda\left(\|w\|_1 + \gamma\|w\|_2^2\right)$
The elastic net regularizer is more robust than the $\ell_1$ regularizer.

Big Data Challenge
A huge amount of data is generated on the internet every day:
Facebook users upload 3 million photos
Google receives 3,000 million queries
GE: 100 million hours of operational data on the world's largest gas turbine fleet
The global internet population is 2.1 billion people
(source: how-much-data-created-every-minute/)

Why is Learning from Big Data Hard? Too Many Data Points
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$
Issue: cannot afford to go through the data set many times
Solution: Stochastic Optimization

Why is Learning from Big Data Hard? High-Dimensional Data
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$
Issue: cannot afford second-order optimization (e.g., Newton's method)
Solution: first-order methods (i.e., gradient-based methods)

Why is Learning from Big Data Hard? Distributed Data
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$
Data are distributed over many machines
Issue: expensive (if not impossible) to move the data
Solution: Distributed Optimization

Warm-up: Vector, Norm, Inner Product, Dual Norm
Bold letters $x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$: d-dimensional vectors
Norm $\|x\| : \mathbb{R}^d \to \mathbb{R}_+$, e.g. $\ell_2$ norm $\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2}$, $\ell_1$ norm $\|x\|_1 = \sum_{i=1}^d |x_i|$, $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$
Inner product $\langle x, y \rangle = x^\top y = \sum_{i=1}^d x_i y_i$
Dual norm $\|x\|_* = \max_{\|y\| \le 1} x^\top y$; the dual of $\|x\|_2$ is $\|x\|_2$, the dual of $\|x\|_1$ is $\|x\|_\infty$

Warm-up: Convex Optimization
$\min_{x \in X} f(x)$
$f(x)$ is a convex function and $X$ is a convex domain:
$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \forall x, y \in X, \ \alpha \in [0, 1]$

Warm-up: Convergence Measure
Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$
Iteration complexity: the number of iterations $T(\epsilon)$ needed to have $f(x_T) - \min_{x \in X} f(x) \le \epsilon$ ($\epsilon \ll 1$)
Convergence rate: after $T$ iterations, how good is the solution: $f(x_T) - \min_{x \in X} f(x) \le \epsilon(T)$
(Figure: objective value $\epsilon(t)$ decreasing toward the optimal value.)

More on Convergence Measure
Convergence rate vs. iteration complexity:
linear: rate $\mu^T$ ($\mu < 1$), iteration complexity $O(\log(1/\epsilon))$
sub-linear: rate $T^{-\alpha}$ ($\alpha > 0$), iteration complexity $O(1/\epsilon^{1/\alpha})$
Why are we interested in bounds?
(Figure: distance to the optimal objective vs. iterations for rates $\mu^T$, $1/T^2$, $1/T$, $1/\sqrt{T}$.)
$\mu^T < \frac{1}{T^2} < \frac{1}{T} < \frac{1}{\sqrt{T}}$, and correspondingly $\log\frac{1}{\epsilon} < \frac{1}{\sqrt{\epsilon}} < \frac{1}{\epsilon} < \frac{1}{\epsilon^2}$

Factors that Affect the Iteration Complexity
Domain $X$: size and geometry
Size of the problem: dimension and number of data points
Property of the function: smoothness

Warm-up: Non-smooth Functions
Lipschitz continuous, e.g. the absolute loss $f(z) = |z|$:
$|f(x) - f(y)| \le L\|x - y\|$
Subgradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$
Iteration complexity $O\left(\frac{1}{\epsilon^2}\right)$, or convergence rate $O\left(\frac{1}{\sqrt{T}}\right)$

Warm-up: Smooth Convex Functions
Smooth, e.g. the logistic loss $f(z) = \log(1 + \exp(-z))$:
$\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|$ with $\beta > 0$; the second-order derivative is upper bounded
Full gradient descent: $O\left(\frac{1}{\epsilon}\right)$, or convergence rate $O\left(\frac{1}{T}\right)$
Stochastic gradient descent: $O\left(\frac{1}{\epsilon^2}\right)$, or convergence rate $O\left(\frac{1}{\sqrt{T}}\right)$

Warm-up: Strongly Convex Functions
Strongly convex, e.g. the squared $\ell_2$ norm $f(x) = \frac{1}{2}\|x\|_2^2$:
$\|\nabla f(x) - \nabla f(y)\| \ge \lambda\|x - y\|$, $\lambda > 0$; the second-order derivative is lower bounded
Iteration complexity $O(1/\epsilon)$, or convergence rate $O(1/T)$

Warm-up: Smooth and Strongly Convex Functions
Smooth and strongly convex: $\lambda\|x - y\| \le \|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|$, $\beta \ge \lambda > 0$
e.g. the square loss $f(z) = \frac{1}{2}(1 - z)^2$
Iteration complexity $O\left(\log\frac{1}{\epsilon}\right)$, or convergence rate $O(\mu^T)$

Warm-up: Stochastic Gradient
Loss: $L(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)$
Stochastic gradient: $\nabla \ell(w^\top x_{i_t}, y_{i_t})$, where $i_t \in \{1, \ldots, n\}$ is randomly sampled
Key equation: $\mathbb{E}_{i_t}[\nabla \ell(w^\top x_{i_t}, y_{i_t})] = \nabla L(w)$
Computation is very cheap

Warm-up: Dual
Convex conjugate $\ell^*(\alpha)$: $\ell(z) = \max_{\alpha \in \Omega} \alpha z - \ell^*(\alpha)$, e.g. $\max(0, 1 - z) = \max_{\alpha \in [0, 1]} \alpha - \alpha z$
Dual objective (classification):
$\min_w \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2 = \min_w \frac{1}{n}\sum_{i=1}^n \left[\max_{\alpha_i \in \Omega} \alpha_i (y_i w^\top x_i) - \ell^*(\alpha_i)\right] + \frac{\lambda}{2}\|w\|_2^2 = \max_{\alpha \in \Omega^n} -\frac{1}{n}\sum_{i=1}^n \ell^*(\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

STOP Algorithms for Big Data Classification and Regression
In this section, we present several well-known STOP algorithms for solving classification and regression problems.
Coverage:
Stochastic Gradient Descent (Pegasos) for L1-SVM (primal)
Stochastic Dual Coordinate Ascent (SDCA) for L2-SVM (dual)
Stochastic Average Gradient (SAG) for Logistic Regression / Regression (primal)

STOP Algorithms for Big Data Classification
Support Vector Machines: L1-SVM and L2-SVM
$\min_{w \in \mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)}_{\text{loss}} + \frac{\lambda}{2}\|w\|_2^2$
L1-SVM: $\ell(w^\top x_i, y_i) = \max(0, 1 - y_i w^\top x_i)$
L2-SVM: $\ell(w^\top x_i, y_i) = \max(0, 1 - y_i w^\top x_i)^2$
Equivalent to the C-SVM formulation with $C = 1/(n\lambda)$
(Figure: the smooth (L2) and non-smooth (L1) hinge losses as functions of $y w^\top x$.)

Pegasos (Shalev-Shwartz et al., ICML 07; primal): STOP for L1-SVM
$\min_{w \in \mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2}_{\text{strongly convex}}$
Pegasos (SGD for strongly convex objectives):
stochastic (sub)gradient: $\partial\ell(w^\top x_i, y_i) = -y_i x_i$ if $1 - y_i w^\top x_i > 0$, and $0$ otherwise
Stochastic Gradient Descent updates:
$\nabla f(w_t; i_t) = \partial\ell(w_t^\top x_{i_t}, y_{i_t}) + \lambda w_t$
$w_{t+1} = w_t - \frac{1}{\lambda t}\nabla f(w_t; i_t)$
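
A minimal sketch of the Pegasos-style update above, assuming dense NumPy inputs; the function name and arguments are illustrative, not part of the original Pegasos code.

```python
import numpy as np

def pegasos(X, y, lam, T, rng=None):
    """Sketch of SGD for the L1-SVM objective: one sampled example per step,
    step size 1/(lam*t). X: (n, d) array; y: (n,) array with labels in {+1, -1}."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        # Subgradient of the hinge loss at the sampled example.
        g_loss = -y[i] * X[i] if y[i] * X[i] @ w < 1 else np.zeros(d)
        g = g_loss + lam * w            # stochastic gradient of f(w; i_t)
        w -= (1.0 / (lam * t)) * g      # w_{t+1} = w_t - (1/(lam*t)) * grad
    return w
```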

SDCA (Hsieh et al., ICML 08; Shalev-Shwartz & Zhang, JMLR 12; dual): STOP for L2-SVM
Convex conjugate: $\max(0, 1 - y w^\top x)^2 = \max_{\alpha \ge 0}\left(\alpha - \frac{1}{4}\alpha^2 - \alpha y w^\top x\right)$
Dual SVM: $\max_{\alpha \ge 0} D(\alpha) = \frac{1}{n}\sum_{i=1}^n \underbrace{\left(\alpha_i - \frac{\alpha_i^2}{4}\right)}_{\phi_i(\alpha_i)} - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$
Primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t y_i x_i$
SDCA: stochastic dual coordinate ascent. Dual coordinate updates:
$\Delta\alpha_i = \arg\max_{\alpha_i^t + \Delta\alpha_i \ge 0} \phi_i(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_i y_i x_i\right\|_2^2$
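
The following is a minimal sketch of the SDCA scheme for the L2-SVM (squared hinge) dual above. The closed-form coordinate increment is derived here from $\phi_i(\alpha) = \alpha - \alpha^2/4$ and is an assumption of this sketch, not code from the tutorial.

```python
import numpy as np

def sdca_l2svm(X, y, lam, epochs, rng=None):
    """Sketch of SDCA for the L2-SVM dual: maintain alpha >= 0 and
    w = (1/(lam*n)) * sum_i alpha_i y_i x_i, updating one coordinate at a time."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum('ij,ij->i', X, X)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Closed-form maximizer of the one-coordinate dual (assumed derivation),
            # then clipped so that alpha_i + delta stays nonnegative.
            delta = (1.0 - y[i] * X[i] @ w - alpha[i] / 2.0) / (0.5 + sq_norms[i] / (lam * n))
            delta = max(delta, -alpha[i])
            alpha[i] += delta
            w += (delta / (lam * n)) * y[i] * X[i]   # keep w consistent with alpha
    return w, alpha
```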

SAG (Schmidt et al., 2011; primal): STOP for Logistic Regression / Regression
Smooth and strongly convex objective:
$\min_{w \in \mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{\text{smooth and strongly convex}}$
e.g. logistic regression, least square regression
SAG (stochastic average gradient):
$w_{t+1} = (1 - \eta\lambda) w_t - \frac{\eta}{n}\sum_{i=1}^n g_i^t$, where $g_i^t = \nabla\ell(w_t^\top x_i, y_i)$ if $i$ is selected, and $g_i^t = g_i^{t-1}$ otherwise
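
A minimal sketch of the SAG update above for l2-regularized logistic regression; storing one gradient per example is the point of the method, and the interface here is illustrative.

```python
import numpy as np

def sag_logistic(X, y, lam, eta, epochs, rng=None):
    """Sketch of SAG: refresh the stored gradient of one sampled example per step
    and move along the average of all stored gradients."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    G = np.zeros((n, d))        # stored per-example loss gradients g_i
    g_avg = np.zeros(d)         # running average (1/n) * sum_i g_i
    for _ in range(epochs * n):
        i = rng.integers(n)
        # Gradient of the logistic loss at example i, evaluated at the current w.
        g_new = -y[i] * X[i] / (1.0 + np.exp(y[i] * X[i] @ w))
        g_avg += (g_new - G[i]) / n
        G[i] = g_new
        # w_{t+1} = (1 - eta*lam) * w_t - eta * (average stored gradient)
        w = (1.0 - eta * lam) * w - eta * g_avg
    return w
```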

OK, tell me which one is the best? Let us compare theoretically first.

alg.      | Pegasos        | SAG                    | SDCA
f(w)      | sc             | sm & sc                | sm loss & sc reg.
e.g.      | SVMs, LR, LSR  | L2-SVM, LR, LSR        | SVMs, LR, LSR
com. cost | O(d)           | O(d)                   | O(d)
mem. cost | O(d)           | O(n + d)               | O(n + d)
Iteration | O(1/(λε))      | O((n + 1/λ) log(1/ε))  | O((n + 1/λ) log(1/ε))

Table: sc: strongly convex; sm: smooth; LR: logistic regression; LSR: least square regression

What about $\ell_1$ regularization?
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1}_{\text{Lasso or Group Lasso}}$
Issue: not strongly convex
Solution: add an $\ell_2^2$ regularization,
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1 + \frac{\lambda}{2}\|w\|_2^2$, setting $\lambda = \Theta(\epsilon)$
Algorithms: Pegasos, SDCA

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

General Strategies for Stochastic Optimization
General strategies for Stochastic Gradient Descent.
Coverage:
SGD for various objectives
Accelerated SGD for Stochastic Composite Optimization
Variance Reduced SGD
Parallel and Distributed Optimization

Stochastic Gradient Descent
$\min_{x \in X} f(x)$
Stochastic gradient $\nabla f(x; \omega)$: $\omega$ is a random variable
SGD updates: $x_t = \Pi_X[x_{t-1} - \gamma_t \nabla f(x_{t-1}; \omega_t)]$, where $\Pi_X[\hat{x}] = \arg\min_{x \in X} \|x - \hat{x}\|_2^2$
Issue: how to determine the learning rate $\gamma_t$? Key: decreasing
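
A minimal sketch of the projected-SGD update above; `stoch_grad` and `project` are caller-supplied, hypothetical interfaces used only to make the loop concrete.

```python
import numpy as np

def projected_sgd(stoch_grad, project, x0, step_sizes, rng=None):
    """Sketch of x_t = Pi_X[x_{t-1} - gamma_t * grad f(x_{t-1}; omega_t)].
    stoch_grad(x, rng) returns an unbiased stochastic gradient,
    project(x) is the Euclidean projection onto the feasible set X."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for gamma in step_sizes:
        x = project(x - gamma * stoch_grad(x, rng))
    return x

# Example usage: minimize E[||x - z||^2]/2 over the unit l2 ball,
# z ~ N(1.5, 1) per coordinate, decreasing steps gamma_t = c/sqrt(t).
if __name__ == "__main__":
    d, T, c = 5, 2000, 0.5
    grad = lambda x, rng: x - (1.5 + rng.standard_normal(d))
    proj = lambda x: x / max(1.0, np.linalg.norm(x))
    steps = [c / np.sqrt(t) for t in range(1, T + 1)]
    print(projected_sgd(grad, proj, np.zeros(d), steps))
```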

Why Does Stochastic Gradient Descent Work? Why a Decreasing Step Size?
GD vs. SGD:
GD: $x_t = x_{t-1} - \eta\nabla f(x_{t-1})$; the gradient $\nabla f(x_{t-1}) \to 0$ near the optimum
SGD: $x_t = x_{t-1} - \eta_t\nabla f(x_{t-1}; \omega_t)$; the stochastic gradient $\nabla f(x_{t-1}; \omega_t)$ does not vanish, so we need $\eta_t \to 0$


How to Decrease the Step Size?
The step size depends on the property of the function.

SGD for Non-smooth Functions: Convergence Rate
Assume $\|x - y\| \le D$ and $\|\nabla f(x; \omega)\| \le G$ for all $x, y \in X$.
Step size: $\gamma_t = \frac{c}{\sqrt{t}}$, where $c$ usually needs to be tuned
Convergence rate of the final solution $x_T$: $O\left(\left(\frac{D^2}{c} + cG^2\right)\frac{\log T}{\sqrt{T}}\right)$
Are we good enough? Close, but not yet: the minimax optimal rate is $O\left(\frac{1}{\sqrt{T}}\right)$

SGD for Non-smooth Functions: Convergence Rate
Minimax optimal rate: $O\left(\frac{1}{\sqrt{T}}\right)$
Averaging: $\bar{x}_t = \left(1 - \frac{1+\eta}{t+\eta}\right)\bar{x}_{t-1} + \frac{1+\eta}{t+\eta} x_t$, $\eta \ge 0$
$\eta = 0$: standard average $\bar{x}_T = \frac{x_1 + \cdots + x_T}{T}$
Convergence rate of $\bar{x}_T$: $O\left(\eta\left(\frac{D^2}{c} + cG^2\right)\frac{1}{\sqrt{T}}\right)$

SGD for Non-smooth Strongly Convex Functions: Convergence Rate
$\lambda$-strongly convex; minimax optimal rate: $O\left(\frac{1}{T}\right)$
Step size: $\gamma_t = \frac{1}{\lambda t}$
Convergence rate of $x_T$: $O\left(\frac{G^2\log T}{\lambda T}\right)$
Averaging: $\bar{x}_t = \left(1 - \frac{1+\eta}{t+\eta}\right)\bar{x}_{t-1} + \frac{1+\eta}{t+\eta} x_t$, $\eta \ge 0$
Convergence rate of $\bar{x}_T$: $O\left(\frac{\eta G^2}{\lambda T}\right)$
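
A minimal sketch combining the two ingredients above: the $1/(\lambda t)$ step size for strongly convex objectives and the weighted averaging recursion as reconstructed on these slides (the exact averaging weights are an assumption of this sketch); `stoch_grad` is a caller-supplied, hypothetical interface.

```python
import numpy as np

def sgd_strongly_convex(stoch_grad, x0, lam, T, eta_avg=1.0, rng=None):
    """Sketch of SGD for a lam-strongly-convex objective with step size 1/(lam*t)
    and weighted averaging xbar_t = (1 - rho_t) xbar_{t-1} + rho_t x_t,
    rho_t = (1 + eta)/(t + eta); eta = 0 recovers the plain average."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    xbar = x.copy()
    for t in range(1, T + 1):
        x = x - (1.0 / (lam * t)) * stoch_grad(x, rng)   # SGD step
        rho = (1.0 + eta_avg) / (t + eta_avg)
        xbar = (1.0 - rho) * xbar + rho * x              # averaged solution
    return x, xbar
```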

SGD for Smooth Functions
Generally no further improvement upon $O\left(\frac{1}{\sqrt{T}}\right)$
Special structure may yield better convergence
Stochastic Composite Optimization: smooth + non-smooth

General Strategies for Stochastic Optimization
General strategies for Stochastic Gradient Descent.
Coverage:
SGD for various objectives
Accelerated SGD for Stochastic Composite Optimization
Variance Reduced SGD
Parallel and Distributed Optimization

Stochastic Composite Optimization
$\min_{x \in X} \underbrace{f(x)}_{\text{smooth}} + \underbrace{g(x)}_{\text{nonsmooth}}$
Example 1: $\min_w \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i w^\top x_i)) + \lambda\|w\|_1$
Example 2: $\min_w \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - w^\top x_i)^2 + \lambda\|w\|_1$

Stochastic Composite Optimization
Accelerated Stochastic Gradient:
auxiliary sequence of search solutions
maintain three solutions: $x_t$ (action solution), $\bar{x}_t$ (averaged solution), $x_t^m$ (auxiliary solution)
similar to Nesterov's deterministic method; several variants
Stochastic Dual Coordinate Ascent:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1 + \frac{\lambda}{2}\|w\|_2^2$

G. Lan's Accelerated Stochastic Approximation
Iterate $t = 1, \ldots, T$:
1. compute the auxiliary solution $x_t^m = \beta_t^{-1} x_t + (1 - \beta_t^{-1})\bar{x}_t$
2. update the solution $x_{t+1} = P_{x_t}(\gamma_t G(x_t^m; \omega_t))$ (think of a projection/prox step)
3. update the averaged solution $\bar{x}_{t+1} = (1 - \beta_t^{-1})\bar{x}_t + \beta_t^{-1} x_{t+1}$ (similar to SGD averaging)
How to choose $\gamma_t$ and $\beta_t$? What is the convergence rate?

Accelerated Stochastic Approximation
1. $\beta_t^{-1} = \frac{2}{t+1}$: $\bar{x}_{t+1} = \left(1 - \frac{2}{t+1}\right)\bar{x}_t + \frac{2}{t+1} x_{t+1}$
2. $\gamma_t = \frac{t+1}{2}\gamma$: increasing, unlike standard SGD
3. Convergence rate of $\bar{x}_T$: $O\left(\frac{L}{T^2} + \frac{M + \sigma}{\sqrt{T}}\right)$
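
A minimal sketch of the three-sequence scheme on the two slides above, specialized, as an assumption of this sketch, to an $\ell_1$ composite term handled by a soft-thresholding prox step in place of the generic $P_{x_t}$ operator; parameter names and the `stoch_grad` interface are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, used here as the composite step."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accelerated_sa(stoch_grad, x0, gamma, sigma_l1, T, rng=None):
    """Sketch of accelerated stochastic approximation with beta_t^{-1} = 2/(t+1)
    and increasing step size gamma_t = gamma*(t+1)/2."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)     # action solution x_t
    xbar = x.copy()                     # averaged solution
    for t in range(1, T + 1):
        beta_inv = 2.0 / (t + 1)
        gamma_t = gamma * (t + 1) / 2.0
        x_md = beta_inv * x + (1.0 - beta_inv) * xbar          # auxiliary solution
        g = stoch_grad(x_md, rng)                              # stochastic gradient at x_md
        x = soft_threshold(x - gamma_t * g, gamma_t * sigma_l1)  # composite (prox) update
        xbar = (1.0 - beta_inv) * xbar + beta_inv * x          # update averaged solution
    return xbar
```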

Variance Reduction: Mini-batch
Batch gradient: $g_t = \frac{1}{b}\sum_{i=1}^b \nabla f(x_{t-1}; \omega_{t_i})$
Unbiased sampling: $\mathbb{E}[g_t] = \mathbb{E}[\nabla f(x_{t-1}; \omega)]$
Variance reduced: $\mathbb{V}[g_t] = \frac{1}{b}\mathbb{V}[\nabla f(x_{t-1}; \omega)]$
Trade off batch computation for a variance decrease.
Batch size: 1. constant batch size; 2. batch size increasing proportionally to $t$
Convergence rate: e.g. $O\left(\frac{1}{b\lambda T}\right)$ for SGD on strongly convex objectives
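
A minimal sketch of forming the mini-batch gradient above; `grad_i` is a hypothetical caller-supplied per-example gradient. The average stays unbiased while its variance drops by a factor of the batch size.

```python
import numpy as np

def minibatch_grad(grad_i, x, n, batch_size, rng):
    """Average batch_size i.i.d. per-example gradients sampled with replacement."""
    idx = rng.integers(n, size=batch_size)
    return np.mean([grad_i(x, i) for i in idx], axis=0)
```

This plugs directly into the SGD loops sketched earlier by replacing the single-example gradient with `minibatch_grad`.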

Variance Reduction: Mixed-Grad (Mahdavi et al.; Johnson et al., 2013)
Iterate $s = 1, \ldots$:
  Iterate $t = 1, \ldots, m$:
    $x_t^s = x_{t-1}^s - \eta\big(\underbrace{\nabla f(x_{t-1}^s; \omega_t) - \nabla f(\tilde{x}^{s-1}; \omega_t)}_{\text{StoGrad}} + \underbrace{\nabla f(\tilde{x}^{s-1})}_{\text{Grad}}\big)$
  update $\tilde{x}^s = x_m^s$ or $\tilde{x}^s = \frac{1}{m}\sum_{t=1}^m x_t^s$
How is the variance reduced? If $\tilde{x}^{s-1} \to x_*$, then $\nabla f(\tilde{x}^{s-1}) \to 0$ and $\nabla f(x_{t-1}; \omega_t) - \nabla f(x_*; \omega_t) \to 0$
Constant learning rate; better iteration complexity: $O\left((n + \frac{1}{\lambda})\log\frac{1}{\epsilon}\right)$ for smooth & strongly convex, $O(1/\epsilon)$ for smooth
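
A minimal sketch of the stage-wise variance-reduced update above; `grad_i` and `full_grad` are hypothetical caller-supplied interfaces (per-example and full gradients), and the stage/inner-loop lengths are illustrative.

```python
import numpy as np

def svrg(grad_i, full_grad, x0, n, eta, m, S, rng=None):
    """Sketch of the Mixed-Grad / SVRG-style scheme: each stage snapshots
    x_tilde and its full gradient mu, then runs m inner steps with the
    corrected stochastic gradient grad_i(x, i) - grad_i(x_tilde, i) + mu."""
    rng = rng or np.random.default_rng(0)
    x_tilde = np.asarray(x0, dtype=float)
    for _ in range(S):
        mu = full_grad(x_tilde)          # one full pass per stage
        x = x_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            x -= eta * (grad_i(x, i) - grad_i(x_tilde, i) + mu)  # constant step size
        x_tilde = x                      # or the average of the inner iterates
    return x_tilde
```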

Distributed Optimization: Why?
Data are distributed over multiple machines; moving them to a single machine suffers from low network bandwidth and limited disk or memory
Benefits from parallel computation: clusters of machines, GPUs

A Naive Solution
Solve locally on each machine and average the local solutions: $\bar{w} = \frac{1}{k}\sum_{i=1}^k w_i$
Issue: not the optimal solution
(Figure: data split across machines producing local solutions $w_1, \ldots, w_6$.)

Distributed SGD is Simple
Mini-batch SGD with synchronization.
Good: reduced variance, faster convergence
Bad: synchronization is expensive
Solutions:
asynchronous updates (convergence is difficult to prove)
fewer synchronizations with guaranteed convergence

Distributed SDCA is Not Trivial
Issue: the data are correlated
Naive local update: $\Delta\alpha_i = \arg\max \phi_i(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_i x_i\right\|_2^2$ (Not Working)

Distributed SDCA (T. Yang, NIPS 13): The Basic Variant
mr-SDCA running on machine $k$:
Iterate for $t = 1, \ldots, T$:
  Iterate for $j = 1, \ldots, m$:
    Randomly pick $i \in \{1, \ldots, n_k\}$ and let $i_j = i$
    Find $\Delta\alpha_{k,i}$ by IncDual($w = w^{t-1}$, scl $= mK$)
    Set $\alpha_{k,i}^t = \alpha_{k,i}^{t-1} + \Delta\alpha_{k,i}$
  Reduce: $w^t \leftarrow w^{t-1} + \frac{1}{\lambda n}\sum_{j=1}^m \Delta\alpha_{k,i_j} x_{k,i_j}$ (aggregated across the $K$ machines)
IncDual: $\Delta\alpha_i = \arg\max \phi_i(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2mK}\left\|w^t + \frac{mK}{\lambda n}\Delta\alpha_i x_i\right\|_2^2$

Distributed SDCA (T. Yang, NIPS 13): The Practical Variant
Initialize: $u_t^0 = w^{t-1}$
Iterate for $j = 1, \ldots, m$:
  Randomly pick $i \in \{1, \ldots, n_k\}$ and let $i_j = i$
  Find $\Delta\alpha_{k,i_j}$ by IncDual($w = u_t^{j-1}$, scl $= K$)
  Update $\alpha_{k,i_j}^t = \alpha_{k,i_j}^{t-1} + \Delta\alpha_{k,i_j}$
  Update $u_t^j = u_t^{j-1} + \frac{1}{\lambda n_k}\Delta\alpha_{k,i_j} x_{k,i_j}$
IncDual: $\Delta\alpha_{i_j} = \arg\max \phi_{i_j}(\alpha_{i_j}^t + \Delta\alpha_{i_j}) - \frac{\lambda n}{2K}\left\|u_t^{j-1} + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j}\right\|_2^2$
Uses updated information and a larger step size.

Distributed SDCA (T. Yang, NIPS 13)
The Practical Variant is much faster than the Basic Variant.
(Figure: duality gap $\log(P(w^t) - D(\alpha^t))$ vs. iterations for the practical and basic variants.)

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

Implementations and A Library
In this section, we present some techniques for efficient implementations and a practical library.
Coverage:
efficient averaging
gradient sparsification
pre-optimization: screening
pre-optimization: data preconditioning
distributed (parallel) optimization library

Efficient Averaging
Update rules: $x_t = (1 - \gamma_t\lambda)x_{t-1} - \gamma_t g_t$, $\bar{x}_t = (1 - \alpha_t)\bar{x}_{t-1} + \alpha_t x_t$
Efficient update when $g_t$ has many zeros, i.e., $g_t$ is sparse:
$S_t = \begin{pmatrix} 1 - \lambda\gamma_t & 0 \\ \alpha_t(1 - \lambda\gamma_t) & 1 - \alpha_t \end{pmatrix} S_{t-1}, \quad S_0 = I$
$y_t = y_{t-1} - [S_t^{-1}]_{11}\gamma_t g_t$
$\tilde{y}_t = \tilde{y}_{t-1} - \big([S_t^{-1}]_{21} + [S_t^{-1}]_{22}\alpha_t\big)\gamma_t g_t$
$x_T = [S_T]_{11} y_T$, $\quad \bar{x}_T = [S_T]_{21} y_T + [S_T]_{22}\tilde{y}_T$
(Useful when the gradient is sparse.)
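
The following minimal sketch checks the lazy-averaging recursion as reconstructed above against the direct updates; the step-size and averaging schedules and the sparsity pattern are illustrative assumptions, not from the tutorial's code.

```python
import numpy as np

# Sketch: track (y_t, ytilde_t) and a 2x2 matrix S_t so that a sparse gradient
# only touches its nonzero coordinates; compare with the direct recursion.
rng = np.random.default_rng(0)
d, T, lam = 4, 50, 0.1
x, xbar = np.zeros(d), np.zeros(d)          # direct recursion
y, ytilde = np.zeros(d), np.zeros(d)        # lazy recursion, S_0 = I
S = np.eye(2)
for t in range(1, T + 1):
    gamma, alpha = 0.5 / t, 1.0 / (t + 1)   # illustrative schedules
    g = rng.standard_normal(d) * (rng.random(d) < 0.3)   # sparse gradient
    # direct updates
    x = (1 - lam * gamma) * x - gamma * g
    xbar = (1 - alpha) * xbar + alpha * x
    # lazy updates
    A = np.array([[1 - lam * gamma, 0.0],
                  [alpha * (1 - lam * gamma), 1 - alpha]])
    S = A @ S
    S_inv = np.linalg.inv(S)
    y = y - S_inv[0, 0] * gamma * g
    ytilde = ytilde - (S_inv[1, 0] + S_inv[1, 1] * alpha) * gamma * g
x_lazy = S[0, 0] * y
xbar_lazy = S[1, 0] * y + S[1, 1] * ytilde
print(np.allclose(x, x_lazy), np.allclose(xbar, xbar_lazy))   # both True
```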

Gradient Sparsification
Sparsification by importance sampling: draw $R_{ti} = \mathrm{unif}(0, 1)$ and set
$\tilde{g}_{ti} = g_{ti}\,\big[\,|g_{ti}| \ge \hat{g}_i\,\big] + \hat{g}_i\,\mathrm{sign}(g_{ti})\,\big[\,\hat{g}_i R_{ti} < |g_{ti}| < \hat{g}_i\,\big]$
Unbiased sample: $\mathbb{E}[\tilde{g}_t] = g_t$
Trade a variance increase for efficient computation; especially useful for Logistic Regression
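
A minimal sketch of the randomized sparsification rule above, with a quick empirical check of unbiasedness; the example gradient and thresholds are hypothetical.

```python
import numpy as np

def sparsify(g, g_hat, rng):
    """Keep coordinates with |g_i| >= g_hat_i; otherwise either zero the
    coordinate or inflate it to g_hat_i * sign(g_i) with probability |g_i|/g_hat_i,
    so that E[g_tilde] = g."""
    r = rng.random(g.shape)
    keep = np.abs(g) >= g_hat
    promote = (~keep) & (g_hat * r < np.abs(g))
    return np.where(keep, g, np.where(promote, g_hat * np.sign(g), 0.0))

# Unbiasedness check on a toy gradient.
rng = np.random.default_rng(0)
g = np.array([0.02, -0.4, 1.3, -0.05])
g_hat = np.full(4, 0.5)
print(np.mean([sparsify(g, g_hat, rng) for _ in range(100000)], axis=0))  # approx g
```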

Pre-optimization: Screening
Screening for Lasso (Wang et al., 2012)
Screening for SVM (Ogawa et al., 2013)

Distributed Optimization Library: Birds
The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support.
For technical details, see:
Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. Tianbao Yang. NIPS 2013.
On the Theoretical Analysis of Distributed Stochastic Dual Coordinate Ascent. Tianbao Yang et al. Tech Report.
The code is distributed under the GNU General Public License (see license.txt for details).

Distributed Optimization Library: Birds
What problems does it solve? Classification and Regression.
Loss:
1 Hinge loss (SVM): L1 hinge loss, L2 hinge loss
2 Logistic loss (Logistic Regression)
3 Least Square Regression (Ridge Regression)
Regularizer:
1 $\ell_2$ norm: SVM, Logistic Regression, Ridge Regression
2 $\ell_1$ norm: Lasso, SVM, LR with $\ell_1$ norm
Multi-class: one-vs-all

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)

More information

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

More information

GURLS: A Least Squares Library for Supervised Learning

GURLS: A Least Squares Library for Supervised Learning Journal of Machine Learning Research 14 (2013) 3201-3205 Submitted 1/12; Revised 2/13; Published 10/13 GURLS: A Least Squares Library for Supervised Learning Andrea Tacchetti Pavan K. Mallapragada Center

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

Randomized Robust Linear Regression for big data applications

Randomized Robust Linear Regression for big data applications Randomized Robust Linear Regression for big data applications Yannis Kopsinis 1 Dept. of Informatics & Telecommunications, UoA Thursday, Apr 16, 2015 In collaboration with S. Chouvardas, Harris Georgiou,

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly) STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University (Joint work with R. Chen and M. Menickelly) Outline Stochastic optimization problem black box gradient based Existing

More information

AIMS Big data. AIMS Big data. Outline. Outline. Lecture 5: Structured-output learning January 7, 2015 Andrea Vedaldi

AIMS Big data. AIMS Big data. Outline. Outline. Lecture 5: Structured-output learning January 7, 2015 Andrea Vedaldi AMS Big data AMS Big data Lecture 5: Structured-output learning January 7, 5 Andrea Vedaldi. Discriminative learning. Discriminative learning 3. Hashing and kernel maps 4. Learning representations 5. Structured-output

More information