Stochastic Optimization for Big Data Analytics: Algorithms and Libraries


Stochastic Optimization for Big Data Analytics: Algorithms and Libraries
Tianbao Yang; collaborators: Rong Jin, Shenghuo Zhu
NEC Laboratories America / Michigan State University
SDM 2014, Philadelphia, Pennsylvania, February 9, 2014

The updates are available here. Thanks.

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

STochastic OPtimization (STOP) and Machine Learning: Introduction
Machine learning problems and stochastic optimization.
Coverage:
Classification and regression in different forms
Motivation to employ STochastic OPtimization (STOP)
Basic convex optimization knowledge

Learning as Optimization
Least Square Regression Problem:
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{Regularization}}$
$x_i \in \mathbb{R}^d$: d-dimensional feature vector
$y_i \in \mathbb{R}$: response variable
$w \in \mathbb{R}^d$: unknown parameters
$n$: number of data points
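
As a concrete illustration of the objective above, the following minimal NumPy sketch builds a synthetic regression problem and solves the regularized least-squares objective in closed form; the data `X`, `y` and the value of `lam` are hypothetical placeholders, not from the tutorial.

```python
import numpy as np

# Minimal sketch: min_w (1/n) * sum_i (y_i - w^T x_i)^2 + (lambda/2) * ||w||_2^2
# on synthetic data (X, y, lam are illustrative placeholders).
rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def objective(w):
    return np.mean((y - X @ w) ** 2) + 0.5 * lam * np.dot(w, w)

# Setting the gradient to zero gives ((2/n) X^T X + lam I) w = (2/n) X^T y.
w_star = np.linalg.solve((2.0 / n) * X.T @ X + lam * np.eye(d),
                         (2.0 / n) * X.T @ y)
print("optimal objective:", objective(w_star))
```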

Learning as Optimization
Classification Problems:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
$y_i \in \{+1, -1\}$: label
Loss function $\ell(z)$:
1. SVMs: hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$

Learning as Optimization
Feature Selection:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \lambda \|w\|_1$
$\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^d |w_i|$
$\lambda$ controls the sparsity level

Learning as Optimization
Feature Selection using Elastic Net:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \lambda\left(\|w\|_1 + \gamma\|w\|_2^2\right)$
The elastic net regularizer is more robust than the $\ell_1$ regularizer.

Big Data Challenge
A huge amount of data is generated on the internet every day:
Facebook users upload 3 million photos
Google receives 3,000 million queries
GE: 100 million hours of operational data on the world's largest gas turbine fleet
The global internet population is 2.1 billion people
(source: how-much-data-created-every-minute/)

Why is Learning from Big Data Hard? Too Many Data Points
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$
Issue: cannot afford to go through the data set many times
Solution: Stochastic Optimization

Why is Learning from Big Data Hard? High-Dimensional Data
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$
Issue: cannot afford second-order optimization (e.g., Newton's method)
Solution: first-order methods (i.e., gradient-based methods)

Why is Learning from Big Data Hard? Distributed Data
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}$
Data are distributed over many machines
Issue: expensive (if not impossible) to move the data
Solution: Distributed Optimization

Warm-up: Vector, Norm, Inner Product, Dual Norm
Bold letters $x \in \mathbb{R}^d$, $w \in \mathbb{R}^d$: d-dimensional vectors
Norm $\|x\| : \mathbb{R}^d \to \mathbb{R}_+$, e.g. $\ell_2$ norm $\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2}$, $\ell_1$ norm $\|x\|_1 = \sum_{i=1}^d |x_i|$, $\ell_\infty$ norm $\|x\|_\infty = \max_i |x_i|$
Inner product $\langle x, y \rangle = x^\top y = \sum_{i=1}^d x_i y_i$
Dual norm $\|x\|_* = \max_{\|y\| \le 1} x^\top y$; the dual of $\|x\|_2$ is $\|x\|_2$, the dual of $\|x\|_1$ is $\|x\|_\infty$

Warm-up: Convex Optimization
$\min_{x \in X} f(x)$
$f(x)$ is a convex function and $X$ is a convex domain:
$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \forall x, y \in X, \ \alpha \in [0, 1]$

Warm-up: Convergence Measure
Most optimization algorithms are iterative: $w_{t+1} = w_t + \Delta w_t$
Iteration complexity: the number of iterations $T(\epsilon)$ needed to have $f(x_T) - \min_{x \in X} f(x) \le \epsilon$ ($\epsilon \ll 1$)
Convergence rate: after $T$ iterations, how good is the solution: $f(x_T) - \min_{x \in X} f(x) \le \epsilon(T)$
(Figure: objective value $\epsilon(t)$ decreasing toward the optimal value.)

More on Convergence Measure
Convergence rate vs. iteration complexity:
linear: rate $\mu^T$ ($\mu < 1$), iteration complexity $O(\log(1/\epsilon))$
sub-linear: rate $T^{-\alpha}$ ($\alpha > 0$), iteration complexity $O(1/\epsilon^{1/\alpha})$
Why are we interested in bounds?
(Figure: distance to the optimal objective vs. iterations for rates $\mu^T$, $1/T^2$, $1/T$, $1/\sqrt{T}$.)
$\mu^T < \frac{1}{T^2} < \frac{1}{T} < \frac{1}{\sqrt{T}}$, and correspondingly $\log\frac{1}{\epsilon} < \frac{1}{\sqrt{\epsilon}} < \frac{1}{\epsilon} < \frac{1}{\epsilon^2}$

Factors that Affect the Iteration Complexity
Domain $X$: size and geometry
Size of the problem: dimension and number of data points
Property of the function: smoothness

Warm-up: Non-smooth Functions
Lipschitz continuous, e.g. the absolute loss $f(z) = |z|$:
$|f(x) - f(y)| \le L\|x - y\|$
Subgradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$
Iteration complexity $O\left(\frac{1}{\epsilon^2}\right)$, or convergence rate $O\left(\frac{1}{\sqrt{T}}\right)$

Warm-up: Smooth Convex Functions
Smooth, e.g. the logistic loss $f(z) = \log(1 + \exp(-z))$:
$\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|$ with $\beta > 0$; the second-order derivative is upper bounded
Full gradient descent: $O\left(\frac{1}{\epsilon}\right)$, or convergence rate $O\left(\frac{1}{T}\right)$
Stochastic gradient descent: $O\left(\frac{1}{\epsilon^2}\right)$, or convergence rate $O\left(\frac{1}{\sqrt{T}}\right)$

Warm-up: Strongly Convex Functions
Strongly convex, e.g. the squared $\ell_2$ norm $f(x) = \frac{1}{2}\|x\|_2^2$:
$\|\nabla f(x) - \nabla f(y)\| \ge \lambda\|x - y\|$, $\lambda > 0$; the second-order derivative is lower bounded
Iteration complexity $O(1/\epsilon)$, or convergence rate $O(1/T)$

Warm-up: Smooth and Strongly Convex Functions
Smooth and strongly convex: $\lambda\|x - y\| \le \|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|$, $\beta \ge \lambda > 0$
e.g. the square loss $f(z) = \frac{1}{2}(1 - z)^2$
Iteration complexity $O\left(\log\frac{1}{\epsilon}\right)$, or convergence rate $O(\mu^T)$

Warm-up: Stochastic Gradient
Loss: $L(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)$
Stochastic gradient: $\nabla \ell(w^\top x_{i_t}, y_{i_t})$, where $i_t \in \{1, \ldots, n\}$ is randomly sampled
Key equation: $\mathbb{E}_{i_t}[\nabla \ell(w^\top x_{i_t}, y_{i_t})] = \nabla L(w)$
Computation is very cheap

Warm-up: Dual
Convex conjugate $\ell^*(\alpha)$: $\ell(z) = \max_{\alpha \in \Omega} \alpha z - \ell^*(\alpha)$, e.g. $\max(0, 1 - z) = \max_{\alpha \in [0, 1]} \alpha - \alpha z$
Dual objective (classification):
$\min_w \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2 = \min_w \frac{1}{n}\sum_{i=1}^n \left[\max_{\alpha_i \in \Omega} \alpha_i (y_i w^\top x_i) - \ell^*(\alpha_i)\right] + \frac{\lambda}{2}\|w\|_2^2 = \max_{\alpha \in \Omega^n} -\frac{1}{n}\sum_{i=1}^n \ell^*(\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

STOP Algorithms for Big Data Classification and Regression
In this section, we present several well-known STOP algorithms for solving classification and regression problems.
Coverage:
Stochastic Gradient Descent (Pegasos) for L1-SVM (primal)
Stochastic Dual Coordinate Ascent (SDCA) for L2-SVM (dual)
Stochastic Average Gradient (SAG) for Logistic Regression / Regression (primal)

STOP Algorithms for Big Data Classification
Support Vector Machines: L1-SVM and L2-SVM
$\min_{w \in \mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i)}_{\text{loss}} + \frac{\lambda}{2}\|w\|_2^2$
L1-SVM: $\ell(w^\top x_i, y_i) = \max(0, 1 - y_i w^\top x_i)$
L2-SVM: $\ell(w^\top x_i, y_i) = \max(0, 1 - y_i w^\top x_i)^2$
Equivalent to the C-SVM formulation with $C = 1/(n\lambda)$
(Figure: the smooth (L2) and non-smooth (L1) hinge losses as functions of $y w^\top x$.)

Pegasos (Shalev-Shwartz et al., ICML 07; primal): STOP for L1-SVM
$\min_{w \in \mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2}_{\text{strongly convex}}$
Pegasos (SGD for strongly convex objectives):
stochastic (sub)gradient: $\partial\ell(w^\top x_i, y_i) = -y_i x_i$ if $1 - y_i w^\top x_i > 0$, and $0$ otherwise
Stochastic Gradient Descent updates:
$\nabla f(w_t; i_t) = \partial\ell(w_t^\top x_{i_t}, y_{i_t}) + \lambda w_t$
$w_{t+1} = w_t - \frac{1}{\lambda t}\nabla f(w_t; i_t)$
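
A minimal sketch of the Pegasos-style update above, assuming dense NumPy inputs; the function name and arguments are illustrative, not part of the original Pegasos code.

```python
import numpy as np

def pegasos(X, y, lam, T, rng=None):
    """Sketch of SGD for the L1-SVM objective: one sampled example per step,
    step size 1/(lam*t). X: (n, d) array; y: (n,) array with labels in {+1, -1}."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        # Subgradient of the hinge loss at the sampled example.
        g_loss = -y[i] * X[i] if y[i] * X[i] @ w < 1 else np.zeros(d)
        g = g_loss + lam * w            # stochastic gradient of f(w; i_t)
        w -= (1.0 / (lam * t)) * g      # w_{t+1} = w_t - (1/(lam*t)) * grad
    return w
```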

SDCA (Hsieh et al., ICML 08; Shalev-Shwartz & Zhang, JMLR 12; dual): STOP for L2-SVM
Convex conjugate: $\max(0, 1 - y w^\top x)^2 = \max_{\alpha \ge 0}\left(\alpha - \frac{1}{4}\alpha^2 - \alpha y w^\top x\right)$
Dual SVM: $\max_{\alpha \ge 0} D(\alpha) = \frac{1}{n}\sum_{i=1}^n \underbrace{\left(\alpha_i - \frac{\alpha_i^2}{4}\right)}_{\phi_i(\alpha_i)} - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i y_i x_i\right\|_2^2$
Primal solution: $w_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t y_i x_i$
SDCA: stochastic dual coordinate ascent. Dual coordinate updates:
$\Delta\alpha_i = \arg\max_{\alpha_i^t + \Delta\alpha_i \ge 0} \phi_i(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_i y_i x_i\right\|_2^2$
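
The following is a minimal sketch of the SDCA scheme for the L2-SVM (squared hinge) dual above. The closed-form coordinate increment is derived here from $\phi_i(\alpha) = \alpha - \alpha^2/4$ and is an assumption of this sketch, not code from the tutorial.

```python
import numpy as np

def sdca_l2svm(X, y, lam, epochs, rng=None):
    """Sketch of SDCA for the L2-SVM dual: maintain alpha >= 0 and
    w = (1/(lam*n)) * sum_i alpha_i y_i x_i, updating one coordinate at a time."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    sq_norms = np.einsum('ij,ij->i', X, X)
    for _ in range(epochs):
        for i in rng.permutation(n):
            # Closed-form maximizer of the one-coordinate dual (assumed derivation),
            # then clipped so that alpha_i + delta stays nonnegative.
            delta = (1.0 - y[i] * X[i] @ w - alpha[i] / 2.0) / (0.5 + sq_norms[i] / (lam * n))
            delta = max(delta, -alpha[i])
            alpha[i] += delta
            w += (delta / (lam * n)) * y[i] * X[i]   # keep w consistent with alpha
    return w, alpha
```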

SAG (Schmidt et al., 2011; primal): STOP for Logistic Regression / Regression
Smooth and strongly convex objective:
$\min_{w \in \mathbb{R}^d} f(w) = \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{\text{smooth and strongly convex}}$
e.g. logistic regression, least square regression
SAG (stochastic average gradient):
$w_{t+1} = (1 - \eta\lambda) w_t - \frac{\eta}{n}\sum_{i=1}^n g_i^t$, where $g_i^t = \nabla\ell(w_t^\top x_i, y_i)$ if $i$ is selected, and $g_i^t = g_i^{t-1}$ otherwise
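
A minimal sketch of the SAG update above for l2-regularized logistic regression; storing one gradient per example is the point of the method, and the interface here is illustrative.

```python
import numpy as np

def sag_logistic(X, y, lam, eta, epochs, rng=None):
    """Sketch of SAG: refresh the stored gradient of one sampled example per step
    and move along the average of all stored gradients."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    G = np.zeros((n, d))        # stored per-example loss gradients g_i
    g_avg = np.zeros(d)         # running average (1/n) * sum_i g_i
    for _ in range(epochs * n):
        i = rng.integers(n)
        # Gradient of the logistic loss at example i, evaluated at the current w.
        g_new = -y[i] * X[i] / (1.0 + np.exp(y[i] * X[i] @ w))
        g_avg += (g_new - G[i]) / n
        G[i] = g_new
        # w_{t+1} = (1 - eta*lam) * w_t - eta * (average stored gradient)
        w = (1.0 - eta * lam) * w - eta * g_avg
    return w
```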

OK, tell me which one is the best? Let us compare theoretically first.

alg.      | Pegasos        | SAG                    | SDCA
f(w)      | sc             | sm & sc                | sm loss & sc reg.
e.g.      | SVMs, LR, LSR  | L2-SVM, LR, LSR        | SVMs, LR, LSR
com. cost | O(d)           | O(d)                   | O(d)
mem. cost | O(d)           | O(n + d)               | O(n + d)
Iteration | O(1/(λε))      | O((n + 1/λ) log(1/ε))  | O((n + 1/λ) log(1/ε))

Table: sc: strongly convex; sm: smooth; LR: logistic regression; LSR: least square regression

What about $\ell_1$ regularization?
$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1}_{\text{Lasso or Group Lasso}}$
Issue: not strongly convex
Solution: add an $\ell_2^2$ regularization,
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1 + \frac{\lambda}{2}\|w\|_2^2$, setting $\lambda = \Theta(\epsilon)$
Algorithms: Pegasos, SDCA

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

General Strategies for Stochastic Optimization
General strategies for Stochastic Gradient Descent.
Coverage:
SGD for various objectives
Accelerated SGD for Stochastic Composite Optimization
Variance Reduced SGD
Parallel and Distributed Optimization

Stochastic Gradient Descent
$\min_{x \in X} f(x)$
Stochastic gradient $\nabla f(x; \omega)$: $\omega$ is a random variable
SGD updates: $x_t = \Pi_X[x_{t-1} - \gamma_t \nabla f(x_{t-1}; \omega_t)]$, where $\Pi_X[\hat{x}] = \arg\min_{x \in X} \|x - \hat{x}\|_2^2$
Issue: how to determine the learning rate $\gamma_t$? Key: decreasing
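
A minimal sketch of the projected-SGD update above; `stoch_grad` and `project` are caller-supplied, hypothetical interfaces used only to make the loop concrete.

```python
import numpy as np

def projected_sgd(stoch_grad, project, x0, step_sizes, rng=None):
    """Sketch of x_t = Pi_X[x_{t-1} - gamma_t * grad f(x_{t-1}; omega_t)].
    stoch_grad(x, rng) returns an unbiased stochastic gradient,
    project(x) is the Euclidean projection onto the feasible set X."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for gamma in step_sizes:
        x = project(x - gamma * stoch_grad(x, rng))
    return x

# Example usage: minimize E[||x - z||^2]/2 over the unit l2 ball,
# z ~ N(1.5, 1) per coordinate, decreasing steps gamma_t = c/sqrt(t).
if __name__ == "__main__":
    d, T, c = 5, 2000, 0.5
    grad = lambda x, rng: x - (1.5 + rng.standard_normal(d))
    proj = lambda x: x / max(1.0, np.linalg.norm(x))
    steps = [c / np.sqrt(t) for t in range(1, T + 1)]
    print(projected_sgd(grad, proj, np.zeros(d), steps))
```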

Why Does Stochastic Gradient Descent Work? Why a Decreasing Step Size?
GD vs. SGD:
GD: $x_t = x_{t-1} - \eta\nabla f(x_{t-1})$; the gradient $\nabla f(x_{t-1}) \to 0$ near the optimum
SGD: $x_t = x_{t-1} - \eta_t\nabla f(x_{t-1}; \omega_t)$; the stochastic gradient $\nabla f(x_{t-1}; \omega_t)$ does not vanish, so we need $\eta_t \to 0$


How to Decrease the Step Size?
The step size depends on the property of the function.

SGD for Non-smooth Functions: Convergence Rate
Assume $\|x - y\| \le D$ and $\|\nabla f(x; \omega)\| \le G$ for all $x, y \in X$.
Step size: $\gamma_t = \frac{c}{\sqrt{t}}$, where $c$ usually needs to be tuned
Convergence rate of the final solution $x_T$: $O\left(\left(\frac{D^2}{c} + cG^2\right)\frac{\log T}{\sqrt{T}}\right)$
Are we good enough? Close, but not yet: the minimax optimal rate is $O\left(\frac{1}{\sqrt{T}}\right)$

SGD for Non-smooth Functions: Convergence Rate
Minimax optimal rate: $O\left(\frac{1}{\sqrt{T}}\right)$
Averaging: $\bar{x}_t = \left(1 - \frac{1+\eta}{t+\eta}\right)\bar{x}_{t-1} + \frac{1+\eta}{t+\eta} x_t$, $\eta \ge 0$
$\eta = 0$: standard average $\bar{x}_T = \frac{x_1 + \cdots + x_T}{T}$
Convergence rate of $\bar{x}_T$: $O\left(\eta\left(\frac{D^2}{c} + cG^2\right)\frac{1}{\sqrt{T}}\right)$

SGD for Non-smooth Strongly Convex Functions: Convergence Rate
$\lambda$-strongly convex; minimax optimal rate: $O\left(\frac{1}{T}\right)$
Step size: $\gamma_t = \frac{1}{\lambda t}$
Convergence rate of $x_T$: $O\left(\frac{G^2\log T}{\lambda T}\right)$
Averaging: $\bar{x}_t = \left(1 - \frac{1+\eta}{t+\eta}\right)\bar{x}_{t-1} + \frac{1+\eta}{t+\eta} x_t$, $\eta \ge 0$
Convergence rate of $\bar{x}_T$: $O\left(\frac{\eta G^2}{\lambda T}\right)$
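
A minimal sketch combining the two ingredients above: the $1/(\lambda t)$ step size for strongly convex objectives and the weighted averaging recursion as reconstructed on these slides (the exact averaging weights are an assumption of this sketch); `stoch_grad` is a caller-supplied, hypothetical interface.

```python
import numpy as np

def sgd_strongly_convex(stoch_grad, x0, lam, T, eta_avg=1.0, rng=None):
    """Sketch of SGD for a lam-strongly-convex objective with step size 1/(lam*t)
    and weighted averaging xbar_t = (1 - rho_t) xbar_{t-1} + rho_t x_t,
    rho_t = (1 + eta)/(t + eta); eta = 0 recovers the plain average."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    xbar = x.copy()
    for t in range(1, T + 1):
        x = x - (1.0 / (lam * t)) * stoch_grad(x, rng)   # SGD step
        rho = (1.0 + eta_avg) / (t + eta_avg)
        xbar = (1.0 - rho) * xbar + rho * x              # averaged solution
    return x, xbar
```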

SGD for Smooth Functions
Generally no further improvement upon $O\left(\frac{1}{\sqrt{T}}\right)$
Special structure may yield better convergence
Stochastic Composite Optimization: smooth + non-smooth

General Strategies for Stochastic Optimization
General strategies for Stochastic Gradient Descent.
Coverage:
SGD for various objectives
Accelerated SGD for Stochastic Composite Optimization
Variance Reduced SGD
Parallel and Distributed Optimization

Stochastic Composite Optimization
$\min_{x \in X} \underbrace{f(x)}_{\text{smooth}} + \underbrace{g(x)}_{\text{nonsmooth}}$
Example 1: $\min_w \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-y_i w^\top x_i)) + \lambda\|w\|_1$
Example 2: $\min_w \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - w^\top x_i)^2 + \lambda\|w\|_1$

Stochastic Composite Optimization
Accelerated Stochastic Gradient:
auxiliary sequence of search solutions
maintain three solutions: $x_t$ (action solution), $\bar{x}_t$ (averaged solution), $x_t^m$ (auxiliary solution)
similar to Nesterov's deterministic method; several variants
Stochastic Dual Coordinate Ascent:
$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma\sum_{g=1}^K \|w_g\|_1 + \frac{\lambda}{2}\|w\|_2^2$

G. Lan's Accelerated Stochastic Approximation
Iterate $t = 1, \ldots, T$:
1. compute the auxiliary solution $x_t^m = \beta_t^{-1} x_t + (1 - \beta_t^{-1})\bar{x}_t$
2. update the solution $x_{t+1} = P_{x_t}(\gamma_t G(x_t^m; \omega_t))$ (think of a projection/prox step)
3. update the averaged solution $\bar{x}_{t+1} = (1 - \beta_t^{-1})\bar{x}_t + \beta_t^{-1} x_{t+1}$ (similar to SGD averaging)
How to choose $\gamma_t$ and $\beta_t$? What is the convergence rate?

Accelerated Stochastic Approximation
1. $\beta_t^{-1} = \frac{2}{t+1}$: $\bar{x}_{t+1} = \left(1 - \frac{2}{t+1}\right)\bar{x}_t + \frac{2}{t+1} x_{t+1}$
2. $\gamma_t = \frac{t+1}{2}\gamma$: increasing, unlike standard SGD
3. Convergence rate of $\bar{x}_T$: $O\left(\frac{L}{T^2} + \frac{M + \sigma}{\sqrt{T}}\right)$
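
A minimal sketch of the three-sequence scheme on the two slides above, specialized, as an assumption of this sketch, to an $\ell_1$ composite term handled by a soft-thresholding prox step in place of the generic $P_{x_t}$ operator; parameter names and the `stoch_grad` interface are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, used here as the composite step."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accelerated_sa(stoch_grad, x0, gamma, sigma_l1, T, rng=None):
    """Sketch of accelerated stochastic approximation with beta_t^{-1} = 2/(t+1)
    and increasing step size gamma_t = gamma*(t+1)/2."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)     # action solution x_t
    xbar = x.copy()                     # averaged solution
    for t in range(1, T + 1):
        beta_inv = 2.0 / (t + 1)
        gamma_t = gamma * (t + 1) / 2.0
        x_md = beta_inv * x + (1.0 - beta_inv) * xbar          # auxiliary solution
        g = stoch_grad(x_md, rng)                              # stochastic gradient at x_md
        x = soft_threshold(x - gamma_t * g, gamma_t * sigma_l1)  # composite (prox) update
        xbar = (1.0 - beta_inv) * xbar + beta_inv * x          # update averaged solution
    return xbar
```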

Variance Reduction: Mini-batch
Batch gradient: $g_t = \frac{1}{b}\sum_{i=1}^b \nabla f(x_{t-1}; \omega_{t_i})$
Unbiased sampling: $\mathbb{E}[g_t] = \mathbb{E}[\nabla f(x_{t-1}; \omega)]$
Variance reduced: $\mathbb{V}[g_t] = \frac{1}{b}\mathbb{V}[\nabla f(x_{t-1}; \omega)]$
Trade off batch computation for a variance decrease.
Batch size: 1. constant batch size; 2. batch size increasing proportionally to $t$
Convergence rate: e.g. $O\left(\frac{1}{b\lambda T}\right)$ for SGD on strongly convex objectives
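
A minimal sketch of forming the mini-batch gradient above; `grad_i` is a hypothetical caller-supplied per-example gradient. The average stays unbiased while its variance drops by a factor of the batch size.

```python
import numpy as np

def minibatch_grad(grad_i, x, n, batch_size, rng):
    """Average batch_size i.i.d. per-example gradients sampled with replacement."""
    idx = rng.integers(n, size=batch_size)
    return np.mean([grad_i(x, i) for i in idx], axis=0)
```

This plugs directly into the SGD loops sketched earlier by replacing the single-example gradient with `minibatch_grad`.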

Variance Reduction: Mixed-Grad (Mahdavi et al.; Johnson et al., 2013)
Iterate $s = 1, \ldots$:
  Iterate $t = 1, \ldots, m$:
    $x_t^s = x_{t-1}^s - \eta\big(\underbrace{\nabla f(x_{t-1}^s; \omega_t) - \nabla f(\tilde{x}^{s-1}; \omega_t)}_{\text{StoGrad}} + \underbrace{\nabla f(\tilde{x}^{s-1})}_{\text{Grad}}\big)$
  update $\tilde{x}^s = x_m^s$ or $\tilde{x}^s = \frac{1}{m}\sum_{t=1}^m x_t^s$
How is the variance reduced? If $\tilde{x}^{s-1} \to x_*$, then $\nabla f(\tilde{x}^{s-1}) \to 0$ and $\nabla f(x_{t-1}; \omega_t) - \nabla f(x_*; \omega_t) \to 0$
Constant learning rate; better iteration complexity: $O\left((n + \frac{1}{\lambda})\log\frac{1}{\epsilon}\right)$ for smooth & strongly convex, $O(1/\epsilon)$ for smooth
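
A minimal sketch of the stage-wise variance-reduced update above; `grad_i` and `full_grad` are hypothetical caller-supplied interfaces (per-example and full gradients), and the stage/inner-loop lengths are illustrative.

```python
import numpy as np

def svrg(grad_i, full_grad, x0, n, eta, m, S, rng=None):
    """Sketch of the Mixed-Grad / SVRG-style scheme: each stage snapshots
    x_tilde and its full gradient mu, then runs m inner steps with the
    corrected stochastic gradient grad_i(x, i) - grad_i(x_tilde, i) + mu."""
    rng = rng or np.random.default_rng(0)
    x_tilde = np.asarray(x0, dtype=float)
    for _ in range(S):
        mu = full_grad(x_tilde)          # one full pass per stage
        x = x_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            x -= eta * (grad_i(x, i) - grad_i(x_tilde, i) + mu)  # constant step size
        x_tilde = x                      # or the average of the inner iterates
    return x_tilde
```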

Distributed Optimization: Why?
Data are distributed over multiple machines; moving them to a single machine suffers from low network bandwidth and limited disk or memory
Benefits from parallel computation: clusters of machines, GPUs

A Naive Solution
Solve locally on each machine and average the local solutions: $\bar{w} = \frac{1}{k}\sum_{i=1}^k w_i$
Issue: not the optimal solution
(Figure: data split across machines producing local solutions $w_1, \ldots, w_6$.)

Distributed SGD is Simple
Mini-batch SGD with synchronization.
Good: reduced variance, faster convergence
Bad: synchronization is expensive
Solutions:
asynchronous updates (convergence is difficult to prove)
fewer synchronizations with guaranteed convergence

Distributed SDCA is Not Trivial
Issue: the data are correlated
Naive local update: $\Delta\alpha_i = \arg\max \phi_i(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2}\left\|w_t + \frac{1}{\lambda n}\Delta\alpha_i x_i\right\|_2^2$ (Not Working)

Distributed SDCA (T. Yang, NIPS 13): The Basic Variant
mr-SDCA running on machine $k$:
Iterate for $t = 1, \ldots, T$:
  Iterate for $j = 1, \ldots, m$:
    Randomly pick $i \in \{1, \ldots, n_k\}$ and let $i_j = i$
    Find $\Delta\alpha_{k,i}$ by IncDual($w = w^{t-1}$, scl $= mK$)
    Set $\alpha_{k,i}^t = \alpha_{k,i}^{t-1} + \Delta\alpha_{k,i}$
  Reduce: $w^t \leftarrow w^{t-1} + \frac{1}{\lambda n}\sum_{j=1}^m \Delta\alpha_{k,i_j} x_{k,i_j}$ (aggregated across the $K$ machines)
IncDual: $\Delta\alpha_i = \arg\max \phi_i(\alpha_i^t + \Delta\alpha_i) - \frac{\lambda n}{2mK}\left\|w^t + \frac{mK}{\lambda n}\Delta\alpha_i x_i\right\|_2^2$

Distributed SDCA (T. Yang, NIPS 13): The Practical Variant
Initialize: $u_t^0 = w^{t-1}$
Iterate for $j = 1, \ldots, m$:
  Randomly pick $i \in \{1, \ldots, n_k\}$ and let $i_j = i$
  Find $\Delta\alpha_{k,i_j}$ by IncDual($w = u_t^{j-1}$, scl $= K$)
  Update $\alpha_{k,i_j}^t = \alpha_{k,i_j}^{t-1} + \Delta\alpha_{k,i_j}$
  Update $u_t^j = u_t^{j-1} + \frac{1}{\lambda n_k}\Delta\alpha_{k,i_j} x_{k,i_j}$
IncDual: $\Delta\alpha_{i_j} = \arg\max \phi_{i_j}(\alpha_{i_j}^t + \Delta\alpha_{i_j}) - \frac{\lambda n}{2K}\left\|u_t^{j-1} + \frac{K}{\lambda n}\Delta\alpha_{i_j} x_{i_j}\right\|_2^2$
Uses updated information and a larger step size.

Distributed SDCA (T. Yang, NIPS 13)
The Practical Variant is much faster than the Basic Variant.
(Figure: duality gap $\log(P(w^t) - D(\alpha^t))$ vs. iterations for the practical and basic variants.)

Outline
1 STochastic OPtimization (STOP) and Machine Learning
2 STOP Algorithms for Big Data Classification and Regression
3 General Strategies for Stochastic Optimization
4 Implementations and A Library

Implementations and A Library
In this section, we present some techniques for efficient implementations and a practical library.
Coverage:
efficient averaging
gradient sparsification
pre-optimization: screening
pre-optimization: data preconditioning
distributed (parallel) optimization library

Efficient Averaging
Update rules: $x_t = (1 - \gamma_t\lambda)x_{t-1} - \gamma_t g_t$, $\bar{x}_t = (1 - \alpha_t)\bar{x}_{t-1} + \alpha_t x_t$
Efficient update when $g_t$ has many zeros, i.e., $g_t$ is sparse:
$S_t = \begin{pmatrix} 1 - \lambda\gamma_t & 0 \\ \alpha_t(1 - \lambda\gamma_t) & 1 - \alpha_t \end{pmatrix} S_{t-1}, \quad S_0 = I$
$y_t = y_{t-1} - [S_t^{-1}]_{11}\gamma_t g_t$
$\tilde{y}_t = \tilde{y}_{t-1} - \big([S_t^{-1}]_{21} + [S_t^{-1}]_{22}\alpha_t\big)\gamma_t g_t$
$x_T = [S_T]_{11} y_T$, $\quad \bar{x}_T = [S_T]_{21} y_T + [S_T]_{22}\tilde{y}_T$
(Useful when the gradient is sparse.)
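
The following minimal sketch checks the lazy-averaging recursion as reconstructed above against the direct updates; the step-size and averaging schedules and the sparsity pattern are illustrative assumptions, not from the tutorial's code.

```python
import numpy as np

# Sketch: track (y_t, ytilde_t) and a 2x2 matrix S_t so that a sparse gradient
# only touches its nonzero coordinates; compare with the direct recursion.
rng = np.random.default_rng(0)
d, T, lam = 4, 50, 0.1
x, xbar = np.zeros(d), np.zeros(d)          # direct recursion
y, ytilde = np.zeros(d), np.zeros(d)        # lazy recursion, S_0 = I
S = np.eye(2)
for t in range(1, T + 1):
    gamma, alpha = 0.5 / t, 1.0 / (t + 1)   # illustrative schedules
    g = rng.standard_normal(d) * (rng.random(d) < 0.3)   # sparse gradient
    # direct updates
    x = (1 - lam * gamma) * x - gamma * g
    xbar = (1 - alpha) * xbar + alpha * x
    # lazy updates
    A = np.array([[1 - lam * gamma, 0.0],
                  [alpha * (1 - lam * gamma), 1 - alpha]])
    S = A @ S
    S_inv = np.linalg.inv(S)
    y = y - S_inv[0, 0] * gamma * g
    ytilde = ytilde - (S_inv[1, 0] + S_inv[1, 1] * alpha) * gamma * g
x_lazy = S[0, 0] * y
xbar_lazy = S[1, 0] * y + S[1, 1] * ytilde
print(np.allclose(x, x_lazy), np.allclose(xbar, xbar_lazy))   # both True
```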

Gradient Sparsification
Sparsification by importance sampling: draw $R_{ti} = \mathrm{unif}(0, 1)$ and set
$\tilde{g}_{ti} = g_{ti}\,\big[\,|g_{ti}| \ge \hat{g}_i\,\big] + \hat{g}_i\,\mathrm{sign}(g_{ti})\,\big[\,\hat{g}_i R_{ti} < |g_{ti}| < \hat{g}_i\,\big]$
Unbiased sample: $\mathbb{E}[\tilde{g}_t] = g_t$
Trade a variance increase for efficient computation; especially useful for Logistic Regression
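
A minimal sketch of the randomized sparsification rule above, with a quick empirical check of unbiasedness; the example gradient and thresholds are hypothetical.

```python
import numpy as np

def sparsify(g, g_hat, rng):
    """Keep coordinates with |g_i| >= g_hat_i; otherwise either zero the
    coordinate or inflate it to g_hat_i * sign(g_i) with probability |g_i|/g_hat_i,
    so that E[g_tilde] = g."""
    r = rng.random(g.shape)
    keep = np.abs(g) >= g_hat
    promote = (~keep) & (g_hat * r < np.abs(g))
    return np.where(keep, g, np.where(promote, g_hat * np.sign(g), 0.0))

# Unbiasedness check on a toy gradient.
rng = np.random.default_rng(0)
g = np.array([0.02, -0.4, 1.3, -0.05])
g_hat = np.full(4, 0.5)
print(np.mean([sparsify(g, g_hat, rng) for _ in range(100000)], axis=0))  # approx g
```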

Pre-optimization: Screening
Screening for Lasso (Wang et al., 2012)
Screening for SVM (Ogawa et al., 2013)

Distributed Optimization Library: Birds
The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support.
For technical details, see:
Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. Tianbao Yang. NIPS 2013.
On the Theoretical Analysis of Distributed Stochastic Dual Coordinate Ascent. Tianbao Yang et al. Tech Report.
The code is distributed under the GNU General Public License (see license.txt for details).

Distributed Optimization Library: Birds
What problems does it solve? Classification and Regression.
Loss:
1 Hinge loss (SVM): L1 hinge loss, L2 hinge loss
2 Logistic loss (Logistic Regression)
3 Least Square Regression (Ridge Regression)
Regularizer:
1 $\ell_2$ norm: SVM, Logistic Regression, Ridge Regression
2 $\ell_1$ norm: Lasso, SVM, LR with $\ell_1$ norm
Multi-class: one-vs-all

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify

Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)

More information

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING

ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of

More information

GURLS: A Least Squares Library for Supervised Learning

GURLS: A Least Squares Library for Supervised Learning Journal of Machine Learning Research 14 (2013) 3201-3205 Submitted 1/12; Revised 2/13; Published 10/13 GURLS: A Least Squares Library for Supervised Learning Andrea Tacchetti Pavan K. Mallapragada Center

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

Randomized Robust Linear Regression for big data applications

Randomized Robust Linear Regression for big data applications Randomized Robust Linear Regression for big data applications Yannis Kopsinis 1 Dept. of Informatics & Telecommunications, UoA Thursday, Apr 16, 2015 In collaboration with S. Chouvardas, Harris Georgiou,

More information

Server Load Prediction

Server Load Prediction Server Load Prediction Suthee Chaidaroon (unsuthee@stanford.edu) Joon Yeong Kim (kim64@stanford.edu) Jonghan Seo (jonghan@stanford.edu) Abstract Estimating server load average is one of the methods that

More information

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly) STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University (Joint work with R. Chen and M. Menickelly) Outline Stochastic optimization problem black box gradient based Existing

More information

AIMS Big data. AIMS Big data. Outline. Outline. Lecture 5: Structured-output learning January 7, 2015 Andrea Vedaldi

AIMS Big data. AIMS Big data. Outline. Outline. Lecture 5: Structured-output learning January 7, 2015 Andrea Vedaldi AMS Big data AMS Big data Lecture 5: Structured-output learning January 7, 5 Andrea Vedaldi. Discriminative learning. Discriminative learning 3. Hashing and kernel maps 4. Learning representations 5. Structured-output

More information