Stochastic Optimization for Big Data Analytics: Algorithms and Libraries
Tianbao Yang, SDM 2014, Philadelphia, Pennsylvania
Collaborators: Rong Jin, Shenghuo Zhu
NEC Laboratories America; Michigan State University
February 9, 2014

Yang et al. (NEC Labs America), Tutorial for SDM 14, February 9, 2014
The updated slides are available at http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf. Thanks.
STochastic OPtimization (STOP) and Machine Learning

Outline
1. STochastic OPtimization (STOP) and Machine Learning
2. STOP Algorithms for Big Data Classification and Regression
3. General Strategies for Stochastic Optimization
4. Implementations and A Library
Introduction
- Machine learning problems and stochastic optimization
- Coverage: classification and regression in different forms
- Motivation to employ STochastic OPtimization (STOP)
- Basic convex optimization knowledge
Learning as Optimization
Least squares regression:
$$\min_{w\in\mathbb{R}^d}\ \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{empirical loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{regularization}}$$
- $x_i\in\mathbb{R}^d$: $d$-dimensional feature vector
- $y_i\in\mathbb{R}$: response variable
- $w\in\mathbb{R}^d$: unknown parameters
- $n$: number of data points
Learning as Optimization
Classification problems:
$$\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n l(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$$
- $y_i\in\{+1,-1\}$: label
- Loss function $l(z)$:
  1. SVMs: hinge loss $l(z) = \max(0, 1-z)^p$, where $p = 1, 2$
  2. Logistic regression: $l(z) = \log(1+\exp(-z))$
Learning as Optimization
Feature selection:
$$\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n l(y_i w^\top x_i) + \lambda\|w\|_1$$
- $\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^d |w_i|$
- $\lambda$ controls the sparsity level
Learning as Optimization
Feature selection using the elastic net:
$$\min_{w\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n l(y_i w^\top x_i) + \lambda\left(\|w\|_1 + \gamma\|w\|_2^2\right)$$
- The elastic-net regularizer is more robust than the $\ell_1$ regularizer alone
Big Data Challenge
Huge amounts of data are generated on the internet every day:
- Facebook users upload 3 million photos
- Google receives 3,000 million queries
- GE: 100 million hours of operational data on the world's largest gas-turbine fleet
- The global internet population is 2.1 billion people
Source: http://www.visualnews.com/2012/06/19/how-much-data-created-every-minute/
Why is Learning from Big Data Hard? Too Many Data Points
$$\min_{w\in\mathbb{R}^d}\ \underbrace{\frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$$
- Issue: can't afford to go through the data set many times
- Solution: stochastic optimization
Why is Learning from Big Data Hard? High-Dimensional Data
$$\min_{w\in\mathbb{R}^d}\ \underbrace{\frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$$
- Issue: can't afford second-order optimization (e.g., Newton's method)
- Solution: first-order methods (i.e., gradient-based methods)
Why is Learning from Big Data Hard? Distributed Data
$$\min_{w\in\mathbb{R}^d}\ \underbrace{\frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$$
- Data are distributed over many machines
- Issue: expensive (if not impossible) to move the data
- Solution: distributed optimization
Warm-up: Vector, Norm, Inner Product, Dual Norm
- Bold letters $x\in\mathbb{R}^d$, $w\in\mathbb{R}^d$: $d$-dimensional vectors
- Norm $\|x\|: \mathbb{R}^d\to\mathbb{R}_+$, e.g. $\ell_2$ norm $\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2}$, $\ell_1$ norm $\|x\|_1 = \sum_{i=1}^d |x_i|$, $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$
- Inner product $\langle x, y\rangle = x^\top y = \sum_{i=1}^d x_i y_i$
- Dual norm $\|x\|_* = \max_{\|y\|\le 1} x^\top y$; $\|\cdot\|_2 \leftrightarrow \|\cdot\|_2$, $\|\cdot\|_1 \leftrightarrow \|\cdot\|_\infty$
Warm-up: Convex Optimization
$$\min_{x\in X} f(x)$$
- $f(x)$ is a convex function, $X$ is a convex domain:
$$f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y), \quad \forall x, y\in X,\ \alpha\in[0,1]$$
Warm-up: Convergence Measure
Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$
- Iteration complexity: the number of iterations $T(\epsilon)$ needed to have
$$f(x_T) - \min_{x\in X} f(x) \le \epsilon \quad (\epsilon \ll 1)$$
- Convergence rate: after $T$ iterations, how good is the solution:
$$f(x_T) - \min_{x\in X} f(x) \le \epsilon(T)$$
(figure: objective value approaching the optimal value over iterations)
More on Convergence Measure

  Convergence rate                         Iteration complexity
  linear: $\mu^T$ ($\mu < 1$)              $O(\log(1/\epsilon))$
  sub-linear: $1/T^\alpha$ ($\alpha > 0$)  $O(1/\epsilon^{1/\alpha})$

Why are we interested in bounds?
(figure: distance to optimal objective vs. iterations for the rates $0.5^T$, $1/T^2$, $1/T$, $1/T^{0.5}$)
$$\mu^T < \frac{1}{T^2} < \frac{1}{T} < \frac{1}{T^{0.5}}
\quad\Longleftrightarrow\quad
\log\left(\frac{1}{\epsilon}\right) < \frac{1}{\sqrt{\epsilon}} < \frac{1}{\epsilon} < \frac{1}{\epsilon^2}$$
Factors that Affect Iteration Complexity
- Domain $X$: size and geometry
- Size of the problem: dimension and number of data points
- Property of the function: smoothness
Warm-up: Non-smooth Functions
Lipschitz continuous, e.g. the absolute loss $f(z) = |z|$:
$$|f(x) - f(y)| \le L\|x - y\|$$
Sub-gradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$
Iteration complexity $O(1/\epsilon^2)$, i.e., convergence rate $O(1/\sqrt{T})$
(figure: a non-smooth function and one of its sub-gradients)
Warm-up: Smooth Convex Functions
Smooth, e.g. the logistic loss $f(z) = \log(1+\exp(-z))$:
$$\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|, \quad \beta > 0$$
The second-order derivative is upper bounded.
- Full gradient descent: $O(1/\epsilon)$, i.e., convergence rate $O(1/T)$
- Stochastic gradient descent: $O(1/\epsilon^2)$, i.e., convergence rate $O(1/\sqrt{T})$
(figure: a smooth function and its gradient)
Warm-up: Strongly Convex Functions
Strongly convex, e.g. the squared $\ell_2$ norm $f(x) = \frac{1}{2}\|x\|_2^2$:
$$\|\nabla f(x) - \nabla f(y)\| \ge \lambda\|x - y\|, \quad \lambda > 0$$
The second-order derivative is lower bounded.
Iteration complexity $O(1/\epsilon)$, i.e., convergence rate $O(1/T)$
Warm-up: Smooth and Strongly Convex Functions
Smooth and strongly convex:
$$\lambda\|x - y\| \le \|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|, \quad \beta \ge \lambda > 0$$
e.g. the square loss $f(z) = \frac{1}{2}(1-z)^2$
Iteration complexity $O(\log(1/\epsilon))$, i.e., convergence rate $O(\mu^T)$
Warm-up: Stochastic Gradient
Loss: $L(w) = \frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i)$
Stochastic gradient: $\nabla l(w^\top x_{i_t}, y_{i_t})$, where $i_t\in\{1,\ldots,n\}$ is randomly sampled
Key equation: $\mathbb{E}_{i_t}\left[\nabla l(w^\top x_{i_t}, y_{i_t})\right] = \nabla L(w)$
The computation is very cheap.
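The key equation can be checked numerically. A minimal numpy sketch (toy least-squares data; all names here are illustrative, not from the tutorial's code): averaging the per-example stochastic gradients over a uniformly sampled index recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def full_gradient(w):
    # gradient of L(w) = (1/n) sum_i (1/2)(w.x_i - y_i)^2
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w, i):
    # gradient of the single-example loss (1/2)(w.x_i - y_i)^2
    return X[i] * (X[i] @ w - y[i])

# E_i[stochastic_gradient(w, i)] over uniform i equals the full gradient
expected = np.mean([stochastic_gradient(w, i) for i in range(n)], axis=0)
```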
Warm-up: Duality
Convex conjugate $l^*(\alpha)$ of $l(z)$:
$$l(z) = \max_{\alpha\in\Omega}\ \alpha z - l^*(\alpha), \quad \text{e.g.}\ \max(0, 1-z) = \max_{\alpha\in[-1,0]} (\alpha z - \alpha)$$
Dual objective (classification):
$$\min_w \frac{1}{n}\sum_{i=1}^n l(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2
= \max_{\alpha\in\Omega^n} -\frac{1}{n}\sum_{i=1}^n l^*(\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$$
STOP Algorithms for Big Data Classification and Regression

Outline
1. STochastic OPtimization (STOP) and Machine Learning
2. STOP Algorithms for Big Data Classification and Regression
3. General Strategies for Stochastic Optimization
4. Implementations and A Library
In this section, we present several well-known STOP algorithms for solving classification and regression problems.
Coverage:
- Stochastic Gradient Descent (Pegasos) for L1-SVM (primal)
- Stochastic Dual Coordinate Ascent (SDCA) for L2-SVM (dual)
- Stochastic Average Gradient (SAG) for logistic regression / regression (primal)
STOP Algorithms for Big Data Classification
Support vector machines:
$$\min_{w\in\mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i)}_{\text{loss}} + \frac{\lambda}{2}\|w\|_2^2$$
- L1-SVM: $l(w^\top x_i, y_i) = \max(0, 1 - y_i w^\top x_i)$ (non-smooth)
- L2-SVM: $l(w^\top x_i, y_i) = \max(0, 1 - y_i w^\top x_i)^2$ (smooth)
- Equivalent to the C-SVM formulation with $C = \frac{1}{n\lambda}$
(figure: non-smooth L1 hinge loss vs. smooth L2 squared hinge loss as functions of $y w^\top x$)
Pegasos (Shalev-Shwartz et al., ICML 07; primal): STOP for L1-SVM
$$\min_{w\in\mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2}_{\text{strongly convex}}$$
Pegasos (SGD for strongly convex objectives):
Stochastic sub-gradient: $\partial l(w^\top x_i, y_i) = \begin{cases} -y_i x_i, & 1 - y_i w^\top x_i > 0 \\ 0, & \text{otherwise} \end{cases}$
Stochastic gradient descent updates:
$$\nabla f(w_t; i_t) = \partial l(w_t^\top x_{i_t}, y_{i_t}) + \lambda w_t, \qquad w_{t+1} = w_t - \frac{1}{\lambda t}\nabla f(w_t; i_t)$$
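The update above can be sketched in a few lines of numpy. This is an illustrative toy implementation of the Pegasos-style update (the sampled sub-gradient plus the regularizer, step size $1/(\lambda t)$), not the reference code; the data, defaults, and names are hypothetical.

```python
import numpy as np

def pegasos(X, y, lam=0.1, T=2000, seed=0):
    """SGD on the strongly convex L1-SVM objective with step size 1/(lam*t)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        g = lam * w                    # gradient of the regularizer
        if y[i] * (w @ X[i]) < 1:      # hinge loss is active at example i
            g -= y[i] * X[i]           # sub-gradient of the loss term
        w -= g / (lam * t)             # decreasing step size 1/(lam*t)
    return w

# toy separable data: the label is the sign of the first coordinate
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] >= 0, 1.0, -1.0)
w = pegasos(X, y)
train_acc = np.mean(np.sign(X @ w) == y)
```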
SDCA (Hsieh et al., ICML 08; Shalev-Shwartz & Zhang, JMLR 12; dual): STOP for L2-SVM
Convex conjugate: $\max(0, 1 - y w^\top x)^2 = \max_{\alpha\ge 0}\left(\alpha - \frac{\alpha^2}{4} - \alpha y w^\top x\right)$
Dual SVM:
$$\max_{\alpha\ge 0} D(\alpha) = \frac{1}{n}\sum_{i=1}^n \underbrace{\left(\alpha_i - \frac{\alpha_i^2}{4}\right)}_{\phi^*(\alpha_i)} - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$$
Primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t y_i x_i$
SDCA (stochastic dual coordinate ascent) updates one dual coordinate at a time:
$$\Delta\alpha_i = \arg\max_{\alpha_i^t + \Delta\alpha_i \ge 0}\ \phi^*(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_i y_i x_i\right\|_2^2$$
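A sketch of this coordinate update in numpy. With $\phi^*(\alpha) = \alpha - \alpha^2/4$, the one-dimensional maximization above has a closed form (derived by setting the derivative in $\Delta\alpha_i$ to zero and clipping so $\alpha_i$ stays nonnegative). This is an illustrative toy implementation under those assumptions, not the papers' reference code.

```python
import numpy as np

def sdca_l2svm(X, y, lam=0.1, epochs=10, seed=0):
    """Stochastic dual coordinate ascent for the L2-SVM (squared hinge)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                     # w = (1/(lam*n)) sum_i alpha_i y_i x_i
    for _ in range(epochs * n):
        i = rng.integers(n)
        # closed-form maximizer of
        # phi*(alpha_i + delta) - (lam*n/2)||w + delta*y_i*x_i/(lam*n)||^2
        delta = (1.0 - y[i] * (w @ X[i]) - alpha[i] / 2.0) \
                / (0.5 + X[i] @ X[i] / (lam * n))
        delta = max(delta, -alpha[i])   # keep alpha_i >= 0
        alpha[i] += delta
        w += delta * y[i] * X[i] / (lam * n)
    return w, alpha

# toy separable data, as before
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] >= 0, 1.0, -1.0)
w, alpha = sdca_l2svm(X, y)
train_acc = np.mean(np.sign(X @ w) == y)
```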
SAG (Schmidt et al., 11; primal): STOP for Logistic Regression / Regression
Smooth and strongly convex objective:
$$\min_{w\in\mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{\text{smooth and strongly convex}}$$
e.g. logistic regression, least squares regression
SAG (stochastic average gradient):
$$w_{t+1} = (1 - \eta\lambda)w_t - \frac{\eta}{n}\sum_{i=1}^n g_i^t, \quad \text{where } g_i^t = \begin{cases}\nabla l(w_t^\top x_i, y_i), & \text{if } i \text{ is selected}\\ g_i^{t-1}, & \text{otherwise}\end{cases}$$
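The SAG update can be sketched as follows: store the most recent gradient of every example and step along the running average, handling the $\ell_2$ regularizer explicitly as in the update above. A toy ridge-regression instance with an illustrative planted parameter (all names and defaults are hypothetical).

```python
import numpy as np

def sag_ridge(X, y, lam=0.1, eta=0.01, T=5000, seed=0):
    """Stochastic average gradient for ridge regression (sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    g = np.zeros((n, d))               # g[i]: last stored gradient of example i
    g_sum = np.zeros(d)                # running sum of the stored gradients
    for _ in range(T):
        i = rng.integers(n)
        g_new = X[i] * (X[i] @ w - y[i])   # grad of (1/2)(w.x_i - y_i)^2
        g_sum += g_new - g[i]
        g[i] = g_new
        # w_{t+1} = (1 - eta*lam) w_t - (eta/n) sum_i g_i^t
        w = (1 - eta * lam) * w - eta * g_sum / n
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
w = sag_ridge(X, y)
# closed-form minimizer of (1/n) sum_i (1/2)(w.x_i - y_i)^2 + (lam/2)||w||^2
w_star = np.linalg.solve(X.T @ X / 50 + 0.1 * np.eye(3), X.T @ y / 50)
```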
OK, tell me which one is the best? Let us compare theoretically first.

  alg.       Pegasos           SAG                       SDCA
  f(w)       sc                sm & sc                   sm loss & sc reg.
  e.g.       SVMs, LR, LSR     L2-SVM, LR, LSR           SVMs, LR, LSR
  com. cost  O(d)              O(d)                      O(d)
  mem. cost  O(d)              O(n + d)                  O(n + d)
  iteration  O(1/(λε))         O((n + 1/λ) log(1/ε))     O((n + 1/λ) log(1/ε))

Table: sc = strongly convex; sm = smooth; LR = logistic regression; LSR = least squares regression
What about $\ell_1$ regularization?
$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \underbrace{\sigma\sum_{g=1}^K \|w_g\|_1}_{\text{Lasso or group Lasso}}$$
- Issue: not strongly convex
- Solution: add an $\ell_2^2$ regularization term
$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1 + \frac{\lambda}{2}\|w\|_2^2$$
setting $\lambda = \Theta(\epsilon)$
- Algorithms: Pegasos, SDCA
General Strategies for Stochastic Optimization

Outline
1. STochastic OPtimization (STOP) and Machine Learning
2. STOP Algorithms for Big Data Classification and Regression
3. General Strategies for Stochastic Optimization
4. Implementations and A Library
General strategies for stochastic gradient descent.
Coverage:
- SGD for various objectives
- Accelerated SGD for stochastic composite optimization
- Variance-reduced SGD
- Parallel and distributed optimization
Stochastic Gradient Descent
$$\min_{x\in X} f(x)$$
Stochastic gradient $\nabla f(x; \omega)$: $\omega$ is a random variable.
SGD updates:
$$x_t = \Pi_X\left[x_{t-1} - \gamma_t \nabla f(x_{t-1}; \omega_t)\right], \qquad \Pi_X[\hat x] = \arg\min_{x\in X}\|x - \hat x\|_2^2$$
Issue: how to determine the learning rate $\gamma_t$? Key: decreasing.
Why Does Stochastic Gradient Descent Work? Why a decreasing step size?
GD vs. SGD:
- GD: $x_t = x_{t-1} - \eta\nabla f(x_{t-1})$; near the optimum, $\nabla f(x_{t-1}) \to 0$, so a constant $\eta$ works
- SGD: $x_t = x_{t-1} - \eta_t\nabla f(x_{t-1}; \omega_t)$; the stochastic gradient $\nabla f(x_{t-1}; \omega_t) \not\to 0$, so we need $\eta_t \to 0$
Why Does Stochastic Gradient Descent Work?
(figures: step-by-step illustration of SGD iterates approaching the optimum)
How to Decrease the Step Size?
The step size depends on the properties of the function.
SGD for Non-smooth Functions: Convergence Rate
Assume $\|x - y\| \le D$ and $\|\nabla f(x; \omega)\| \le G$ for all $x, y\in X$.
Step size: $\gamma_t = \frac{c}{\sqrt{t}}$; $c$ usually needs to be tuned.
Convergence rate of the final solution $x_T$: $O\left(\left(\frac{D^2}{c} + cG^2\right)\frac{\log T}{\sqrt{T}}\right)$
Are we good enough? Close, but not yet: the minimax optimal rate is $O\left(\frac{1}{\sqrt{T}}\right)$.
SGD for Non-smooth Functions: Convergence Rate
Minimax optimal rate: $O(1/\sqrt{T})$. Averaging:
$$\bar x_t = \left(1 - \frac{1+\eta}{t+\eta}\right)\bar x_{t-1} + \frac{1+\eta}{t+\eta}x_t, \quad \eta \ge 0$$
$\eta = 0$ gives the standard average $\bar x_T = \frac{x_1 + \cdots + x_T}{T}$
Convergence rate of $\bar x_T$: $O\left(\eta\left(\frac{D^2}{c} + cG^2\right)\frac{1}{\sqrt{T}}\right)$
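The averaging recursion above is cheap to implement as a running update. A minimal sketch (toy scalar iterates, hypothetical names): $\eta = 0$ recovers the standard uniform average, while larger $\eta$ weights later iterates more heavily.

```python
import numpy as np

def weighted_average(xs, eta=0.0):
    """Running average x_bar_t = (1 - (1+eta)/(t+eta)) x_bar_{t-1}
    + ((1+eta)/(t+eta)) x_t; eta = 0 is the uniform average."""
    x_bar = np.asarray(xs[0], dtype=float)
    for t in range(2, len(xs) + 1):
        rho = (1.0 + eta) / (t + eta)
        x_bar = (1.0 - rho) * x_bar + rho * np.asarray(xs[t - 1], dtype=float)
    return x_bar

iterates = [np.array([float(t)]) for t in range(1, 11)]   # toy iterates x_t = t
uniform = weighted_average(iterates, eta=0.0)             # mean of 1..10 = 5.5
tilted = weighted_average(iterates, eta=100.0)            # weights later x_t more
```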
SGD for Non-smooth Strongly Convex Functions: Convergence Rate
$\lambda$-strongly convex; minimax optimal rate: $O(1/T)$
Step size: $\gamma_t = \frac{1}{\lambda t}$
Convergence rate of $x_T$: $O\left(\frac{G^2\log T}{\lambda T}\right)$
Averaging: $\bar x_t = \left(1 - \frac{1+\eta}{t+\eta}\right)\bar x_{t-1} + \frac{1+\eta}{t+\eta}x_t$, $\eta \ge 0$
Convergence rate of $\bar x_T$: $O\left(\frac{\eta G^2}{\lambda T}\right)$
SGD for Smooth Functions
- Generally no further improvement upon $O(1/\sqrt{T})$
- Special structure may yield better convergence
- Stochastic composite optimization: smooth + non-smooth
Stochastic Composite Optimization
$$\min_{x\in X}\ \underbrace{f(x)}_{\text{smooth}} + \underbrace{g(x)}_{\text{non-smooth}}$$
Example 1: $\min_w \frac{1}{n}\sum_{i=1}^n \log(1+\exp(-y_i w^\top x_i)) + \lambda\|w\|_1$
Example 2: $\min_w \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - w^\top x_i)^2 + \lambda\|w\|_1$
Stochastic Composite Optimization
Accelerated stochastic gradient:
- auxiliary sequence of search solutions
- maintain three solutions: $x_t$ (action solution), $\bar x_t$ (averaged solution), $x_t^m$ (auxiliary solution)
- similar to Nesterov's deterministic method; several variants
Stochastic dual coordinate ascent:
$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n l(w^\top x_i, y_i) + \sigma\sum_{g=1}^K\|w_g\|_1 + \frac{\lambda}{2}\|w\|_2^2$$
G. Lan's Accelerated Stochastic Approximation
Iterate for $t = 1,\ldots,T$:
1. compute the auxiliary solution $x_t^m = \beta_t^{-1}x_t + (1 - \beta_t^{-1})\bar x_t$
2. update the solution $x_{t+1} = P_{x_t}(\gamma_t G(x_t^m; \omega_t))$ (think of a projection)
3. update the averaged solution $\bar x_{t+1} = (1 - \beta_t^{-1})\bar x_t + \beta_t^{-1}x_{t+1}$ (similar to SGD)
How to set $\gamma_t$ and $\beta_t$? Convergence rate?
Accelerated Stochastic Approximation
1. $\beta_t^{-1} = \frac{2}{t+1}$: $\bar x_{t+1} = \left(1 - \frac{2}{t+1}\right)\bar x_t + \frac{2}{t+1}x_{t+1}$
2. $\gamma_t = \frac{t+1}{2}\gamma$: increasing, never seen before
3. Convergence rate of $\bar x_T$: $O\left(\frac{L}{T^2} + \frac{M + \sigma}{\sqrt{T}}\right)$
Variance Reduction: Mini-batch
Batch gradient: $g_t = \frac{1}{b}\sum_{i=1}^b \nabla f(x_{t-1}; \omega_{ti})$
- Unbiased sampling: $\mathbb{E}[g_t] = \mathbb{E}[\nabla f(x_{t-1}; \omega)]$
- Variance reduced: $\mathbb{V}[g_t] = \frac{1}{b}\mathbb{V}[\nabla f(x_{t-1}; \omega)]$
- Trades batch computation for a variance decrease
Batch size:
1. constant batch size
2. batch size increasing in proportion to $t$
Convergence rate: e.g. $O\left(\frac{1}{b\lambda T}\right)$ for SGD on strongly convex objectives
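The $1/b$ variance reduction can be checked numerically. A small sketch on toy least-squares data (all names hypothetical): the mean squared error of a size-$b$ batch gradient to the full gradient is roughly $b$ times smaller than that of a single stochastic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 1000, 4, 25
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

grads = X * (X @ w - y)[:, None]       # per-example gradients, shape (n, d)
full = grads.mean(axis=0)              # the full gradient

trials = 2000
# mean squared error to the full gradient: single sample vs. size-b batch
err1 = np.mean([np.sum((grads[rng.integers(n)] - full) ** 2)
                for _ in range(trials)])
errb = np.mean([np.sum((grads[rng.integers(n, size=b)].mean(axis=0) - full) ** 2)
                for _ in range(trials)])
```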
Variance Reduction: Mixed-Grad (Mahdavi et al.; Johnson et al., 13)
Iterate for $s = 1,\ldots$:
  Iterate for $t = 1,\ldots,m$:
$$x_t^s = x_{t-1}^s - \eta\big(\underbrace{\nabla f(x_{t-1}^s; \omega_t)}_{\text{StoGrad}} - \underbrace{\nabla f(\tilde x^{s-1}; \omega_t) + \nabla f(\tilde x^{s-1})}_{\text{StoGrad + Grad}}\big)$$
  update $\tilde x^s = x_m^s$ or $\tilde x^s = \frac{1}{m}\sum_{t=1}^m x_t^s$
How is the variance reduced? If $\tilde x^{s-1} \to x_*$, then $\nabla f(\tilde x^{s-1}) \to 0$ and $\nabla f(x_{t-1}^s; \omega_t) - \nabla f(\tilde x^{s-1}; \omega_t) \to 0$
- Constant learning rate
- Better iteration complexity: $O\left((n + \frac{1}{\lambda})\log\frac{1}{\epsilon}\right)$ for smooth & strongly convex, $O(1/\epsilon)$ for smooth
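The variance-reduced update above can be sketched in an SVRG-style loop: each stochastic gradient is corrected with the same example's gradient at a snapshot point plus the snapshot's full gradient, allowing a constant step size. A toy ridge-regression instance; the defaults and names are illustrative, not from the cited papers.

```python
import numpy as np

def svrg_ridge(X, y, lam=0.1, eta=0.02, epochs=20, seed=0):
    """Variance-reduced stochastic gradient (SVRG-style) for ridge regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    def grad_i(w, i):                  # gradient of example i's loss + regularizer
        return X[i] * (X[i] @ w - y[i]) + lam * w

    def full_grad(w):
        return X.T @ (X @ w - y) / n + lam * w

    x_tilde = np.zeros(d)              # snapshot
    for _ in range(epochs):
        mu = full_grad(x_tilde)        # full gradient at the snapshot
        x = x_tilde.copy()
        for _ in range(2 * n):         # m = 2n inner steps
            i = rng.integers(n)
            x -= eta * (grad_i(x, i) - grad_i(x_tilde, i) + mu)
        x_tilde = x                    # take the last iterate as the new snapshot
    return x_tilde

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
w = svrg_ridge(X, y)
grad_norm = np.linalg.norm(X.T @ (X @ w - y) / 50 + 0.1 * w)
```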
Distributed Optimization: Why?
- Data are distributed over multiple machines; moving them to a single machine suffers from:
  - low network bandwidth
  - limited disk or memory
- Benefits from parallel computation:
  - clusters of machines
  - GPUs
A Naive Solution
Partition the data across $k$ machines, solve locally for $w_1,\ldots,w_k$, and average:
$$\bar w = \frac{1}{k}\sum_{i=1}^k w_i$$
Issue: not the optimal solution.
Distributed SGD is Simple
Mini-batch SGD with synchronization:
- Good: reduced variance, faster convergence
- Bad: synchronization is expensive
Solutions:
- asynchronous updates (convergence is difficult to prove)
- fewer synchronizations with guaranteed convergence
Distributed SDCA is Not Trivial
Issue: the data are correlated, so the naive local coordinate update
$$\Delta\alpha_i = \arg\max\ -\phi_i^*\big(-(\alpha_i^t + \Delta\alpha_i)\big) - \frac{\lambda n}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_i x_i\right\|_2^2$$
is not working.
Distributed SDCA (T. Yang, NIPS 13): The Basic Variant
mr-SDCA running on machine $k$:
Iterate: for $t = 1,\ldots,T$
  Iterate: for $j = 1,\ldots,m$
    Randomly pick $i\in\{1,\ldots,n_k\}$ and let $i_j = i$
    Find $\Delta\alpha_{k,i}$ by IncDual($w = w_{t-1}$, scl $= mK$)
    Set $\alpha_{k,i}^t = \alpha_{k,i}^{t-1} + \Delta\alpha_{k,i}$
  Reduce: $w_t \leftarrow w_{t-1} + \frac{1}{\lambda n}\sum_{j=1}^m \Delta\alpha_{k,i_j}x_{k,i_j}$
IncDual solves:
$$\Delta\alpha_i = \arg\max\ -\phi_i^*\big(-(\alpha_i^t + \Delta\alpha_i)\big) - \frac{\lambda n}{2mK}\left\|w_t + \frac{mK}{\lambda n}\Delta\alpha_i x_i\right\|_2^2$$
Distributed SDCA (T. Yang, NIPS 13): The Practical Variant
Initialize: $u_t^0 = w_{t-1}$
Iterate: for $j = 1,\ldots,m$
  Randomly pick $i\in\{1,\ldots,n_k\}$ and let $i_j = i$
  Find $\Delta\alpha_{k,i_j}$ by IncDual($w = u_t^{j-1}$, scl $= K$)
  Update $\alpha_{k,i_j}^t = \alpha_{k,i_j}^{t-1} + \Delta\alpha_{k,i_j}$
  Update $u_t^j = u_t^{j-1} + \frac{1}{\lambda n_k}\Delta\alpha_{k,i_j}x_{k,i_j}$
IncDual solves:
$$\Delta\alpha_{i_j} = \arg\max\ -\phi_{i_j}^*\big(-(\alpha_{i_j}^t + \Delta\alpha_{i_j})\big) - \frac{\lambda n}{2K}\left\|u_t^{j-1} + \frac{K}{\lambda n}\Delta\alpha_{i_j}x_{i_j}\right\|_2^2$$
Using updated information and a large step size.
Distributed SDCA (T. Yang, NIPS 13)
The practical variant is much faster than the basic variant.
(figure: duality gap $\log(P(w_t) - D(\alpha_t))$ vs. iterations for DisDCA-p and DisDCA-n)
Implementations and A Library

Outline
1. STochastic OPtimization (STOP) and Machine Learning
2. STOP Algorithms for Big Data Classification and Regression
3. General Strategies for Stochastic Optimization
4. Implementations and A Library
In this section, we present some techniques for efficient implementations and a practical library.
Coverage:
- efficient averaging
- gradient sparsification
- pre-optimization: screening
- pre-optimization: data preconditioning
- a distributed (parallel) optimization library
Efficient Averaging
Update rules:
$$x_t = (1 - \gamma_t\lambda)x_{t-1} - \gamma_t g_t, \qquad \bar x_t = (1 - \alpha_t)\bar x_{t-1} + \alpha_t x_t$$
Efficient updates when $g_t$ has many zeros, i.e., $g_t$ is sparse: maintain transformed variables
$$S_t = \begin{pmatrix} 1-\lambda\gamma_t & 0\\ \alpha_t(1-\lambda\gamma_t) & 1-\alpha_t\end{pmatrix}S_{t-1}, \quad S_1 = I$$
$$y_t = y_{t-1} - [S_t^{-1}]_{11}\gamma_t g_t, \qquad \tilde y_t = \tilde y_{t-1} - \big([S_t^{-1}]_{21} + [S_t^{-1}]_{22}\alpha_t\big)\gamma_t g_t$$
and recover $x_T = [S_T]_{11}y_T$ and $\bar x_T = [S_T]_{21}y_T + [S_T]_{22}\tilde y_T$.
Gradient Sparsification
Sparsification by importance sampling: draw $R_{ti} = \mathrm{unif}(0, 1)$ and set
$$\tilde g_{ti} = g_{ti}\,\big[|g_{ti}| \ge \hat g_i\big] + \hat g_i\,\mathrm{sign}(g_{ti})\,\big[\hat g_i R_{ti} < |g_{ti}| < \hat g_i\big]$$
- Unbiased sample: $\mathbb{E}[\tilde g_t] = g_t$
- Trades a variance increase for efficient computation
- Especially useful for logistic regression
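The rule above is unbiased because a small entry survives (promoted to $\hat g_i\,\mathrm{sign}(g_{ti})$) with probability $|g_{ti}|/\hat g_i$. A minimal numpy sketch with toy values for $g$ and the thresholds $\hat g$ (names hypothetical); averaging many samples recovers the original gradient.

```python
import numpy as np

def sparsify(g, g_hat, rng):
    """Importance-sampling sparsification: entries with |g_i| >= g_hat_i pass
    through; smaller entries become g_hat_i*sign(g_i) with probability
    |g_i|/g_hat_i and zero otherwise, so E[output] = g."""
    R = rng.uniform(size=g.shape)
    keep = np.abs(g) >= g_hat
    promote = (~keep) & (g_hat * R < np.abs(g))
    return g * keep + g_hat * np.sign(g) * promote

rng = np.random.default_rng(0)
g = np.array([0.9, -0.05, 0.2, 0.0, -0.6])
g_hat = np.full(5, 0.5)                 # per-coordinate thresholds (toy values)
mean_est = np.mean([sparsify(g, g_hat, rng) for _ in range(100000)], axis=0)
```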
Pre-optimization: Screening
- Screening for Lasso (Wang et al., 2012)
- Screening for SVM (Ogawa et al., 2013)
Distributed Optimization Library: Birds
The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support. For technical details see:
- Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. Tianbao Yang. NIPS 2013.
- On the Theoretical Analysis of Distributed Stochastic Dual Coordinate Ascent. Tianbao Yang, et al. Tech Report 2013.
The code is distributed under the GNU General Public License (see license.txt for details).
Distributed Optimization Library: Birds
What problems does it solve? Classification and regression.
Loss:
1. hinge loss (SVM): L-1 hinge loss, L-2 hinge loss
2. logistic loss (logistic regression)
3. least squares (ridge regression)
Regularizer:
1. $\ell_2$ norm: SVM, logistic regression, ridge regression
2. $\ell_1$ norm: Lasso, SVM, LR with $\ell_1$ norm
Multi-class: one-vs-all
Distributed Optimization Library: Birds
What data does it support? Dense and sparse; txt and binary formats.
What environments does it support?
- Prerequisites: Boost library
- Tested on a cluster of Linux servers (up to hundreds of machines)
- Tested on Cygwin in Windows with multi-core
How About Kernel Methods?
- Stochastic/online optimization approaches (Jin et al., 2010; Orabona et al., 2012)
- Linearization + STOP for linear methods:
  - the Nyström method (Drineas & Mahoney, 2005)
  - random Fourier features (Rahimi & Recht, 2007)
  - comparison of the two (Yang et al., 2012)
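The linearization idea can be sketched with random Fourier features. For the Gaussian kernel $k(x, z) = \exp(-\gamma\|x - z\|^2)$, drawing $W \sim N(0, 2\gamma)$ and $b \sim U[0, 2\pi]$ gives features $z(x) = \sqrt{2/D}\cos(Wx + b)$ with $z(x)^\top z(z) \approx k(x, z)$, after which any linear STOP algorithm applies. A toy numerical check (names and defaults are illustrative, not the papers' code):

```python
import numpy as np

def random_fourier_features(X, n_features=4000, gamma=0.5, seed=0):
    """Random Fourier features approximating the Gaussian (RBF) kernel."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))
Z = random_fourier_features(X)
K_approx = Z @ Z.T
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * sq)             # gamma = 0.5
max_err = np.max(np.abs(K_approx - K_exact))
```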
References I
- Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153-2175, 2005.
- Jin, Rong, Hoi, Steven C. H., and Yang, Tianbao. Online multiple kernel learning: algorithms and mistake bounds. In Proceedings of the 21st International Conference on Algorithmic Learning Theory, pp. 390-404, 2010.
- Ogawa, Kohei, Suzuki, Yoshiki, and Takeuchi, Ichiro. Safe screening of non-support vectors in pathwise SVM computation. In ICML (3), pp. 1382-1390, 2013.
- Orabona, Francesco, Luo, Jie, and Caputo, Barbara. Multi kernel learning with online-batch optimization. Journal of Machine Learning Research, 13:227-253, 2012.

References II
- Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2007.
- Wang, Jie, Lin, Binbin, Gong, Pinghua, Wonka, Peter, and Ye, Jieping. Lasso screening rules via dual polytope projection. CoRR, abs/1211.3966, 2012.
- Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nyström method vs random Fourier features: a theoretical and empirical comparison. In NIPS, pp. 485-493, 2012.