Beyond stochastic gradient descent for large-scale machine learning

Size: px
Start display at page:

Download "Beyond stochastic gradient descent for large-scale machine learning"

Transcription

1 Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - ECML-PKDD, September 2014

2 Big data revolution? A new scientific context Data everywhere: size does not (always) matter Science and industry Size and variety Learning from examples n observations in dimension p

3 Search engines - Advertising

4 Marketing - Personalized recommendation

5 Visual object recognition

6 Personal photos

7 Bioinformatics Protein: Crucial elements of cell life Massive data: 2 millions for humans Complex data

8 Context Machine learning for big data Large-scale machine learning: large p, large n p : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(pn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

9 Context Machine learning for big data Large-scale machine learning: large p, large n p : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(pn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

10 Context Machine learning for big data Large-scale machine learning: large p, large n p : dimension of each observation (input) n : number of observations Examples: computer vision, bioinformatics, advertising Ideal running-time complexity: O(pn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

11 Outline Introduction: stochastic approximation algorithms Supervised machine learning and convex optimization Stochastic gradient and averaging Strongly convex vs. non-strongly convex Fast convergence through smoothness and constant step-sizes Online Newton steps (Bach and Moulines, 2013) O(1/n) convergence rate for all convex functions More than a single pass through the data Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Linear (exponential) convergence rate for strongly convex functions

12 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ,φ(x) of features Φ(x) R p (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i, θ,φ(x i ) ) + µω(θ) θ R p n i=1 convex data fitting term + regularizer

13 Usual losses Regression: y R, prediction ŷ = θ,φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ,φ(x) )2

14 Usual losses Regression: y R, prediction ŷ = θ,φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ,φ(x) )2 Classification : y { 1,1}, prediction ŷ = sign( θ,φ(x) ) loss of the form l(y θ,φ(x) ) True 0-1 loss: l(y θ,φ(x) ) = 1 y θ,φ(x) <0 Usual convex losses: hinge square logistic

15 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ,φ(x) of features Φ(x) R p (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i, θ,φ(x i ) ) + µω(θ) θ R p n i=1 convex data fitting term + regularizer

16 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ,φ(x) of features Φ(x) R p (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i, θ,φ(x i ) ) + µω(θ) θ R p n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i, θ,φ(x i ) ) training cost Expected risk: f(θ) = E (x,y) l(y, θ,φ(x) ) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

17 Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ,φ(x) of features Φ(x) R p (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i, θ,φ(x i ) ) + µω(θ) θ R p n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i, θ,φ(x i ) ) training cost Expected risk: f(θ) = E (x,y) l(y, θ,φ(x) ) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

18 Smoothness and strong convexity A function g : R p R is L-smooth if and only if it is twice differentiable and θ R p, eigenvalues [ g (θ) ] L smooth non smooth

19 Smoothness and strong convexity A function g : R p R is L-smooth if and only if it is twice differentiable and θ R p, eigenvalues [ g (θ) ] L Machine learning with g(θ) = 1 n n i=1 l(y i, θ,φ(x i ) ) Hessian covariance matrix 1 n n i=1 Φ(x i) Φ(x i ) Bounded data

20 Smoothness and strong convexity A twice differentiable function g : R p R is µ-strongly convex if and only if θ R p, eigenvalues [ g (θ) ] µ convex strongly convex

21 Smoothness and strong convexity A twice differentiable function g : R p R is µ-strongly convex if and only if θ R p, eigenvalues [ g (θ) ] µ Machine learning with g(θ) = 1 n n i=1 l(y i, θ,φ(x i ) ) Hessian covariance matrix 1 n n i=1 Φ(x i) Φ(x i ) Data with invertible covariance matrix (low correlation/dimension)

22 Smoothness and strong convexity A twice differentiable function g : R p R is µ-strongly convex if and only if θ R p, eigenvalues [ g (θ) ] µ Machine learning with g(θ) = 1 n n i=1 l(y i, θ,φ(x i ) ) Hessian covariance matrix 1 n n i=1 Φ(x i) Φ(x i ) Data with invertible covariance matrix (low correlation/dimension) Adding regularization by µ 2 θ 2 creates additional bias unless µ is small

23 Iterative methods for minimizing smooth functions Assumption: g convex and smooth on R p Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for convex functions O(e ρt ) convergence rate for strongly convex functions

24 Iterative methods for minimizing smooth functions Assumption: g convex and smooth on R p Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for convex functions O(e ρt ) convergence rate for strongly convex functions Newton method: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) O ( e ρ2t ) convergence rate

25 Iterative methods for minimizing smooth functions Assumption: g convex and smooth on R p Gradient descent: θ t = θ t 1 γ t g (θ t 1 ) O(1/t) convergence rate for convex functions O(e ρt ) convergence rate for strongly convex functions Newton method: θ t = θ t 1 g (θ t 1 ) 1 g (θ t 1 ) O ( e ρ2t ) convergence rate Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages Stochastic approximation

26 Stochastic approximation Goal: Minimizing a function f defined on R p given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R p

27 Stochastic approximation Goal: Minimizing a function f defined on R p given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R p Machine learning - statistics f(θ) = Ef n (θ) = El(y n, θ,φ(x n ) ) = generalization error Loss for a single pair of observations: f n (θ) = l(y n, θ,φ(x n ) ) Expected gradient: f (θ) = Ef n(θ) = E { l (y n, θ,φ(x n ) )Φ(x n ) } Beyond convex optimization: see, e.g., Benveniste et al. (2012)

28 Convex stochastic approximation Key assumption: smoothness and/or strong convexity Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n+1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α

29 Convex stochastic approximation Key assumption: smoothness and/or strong convexity Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n+1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α Running-time = O(np) Single pass through the data One line of code among many

30 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2

31 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single algorithm with global adaptive convergence rate for smooth problems?

32 Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single algorithm for smooth problems with convergence rate O(1/n) in all situations?

33 Least-mean-square algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ),θ ) 2] with θ R p SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n ) Φ(x n ) ] = H µ Id

34 Least-mean-square algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ),θ ) 2] with θ R p SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n ) Φ(x n ) ] = H µ Id New analysis for averaging and constant step-size γ = 1/(4R 2 ) Assume Φ(x n ) R and y n Φ(x n ),θ σ almost surely No assumption regarding lowest eigenvalues of H Main result: Ef( θ n 1 ) f(θ ) 4σ2 p n + 4R2 θ 0 θ 2 n Matches statistical lower bound (Tsybakov, 2003) Non-asymptotic robust version of Györfi and Walk (1996)

35 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) - For least-squares, θ γ = θ θ n θ γ θ 0

36 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n θ θ 0

37 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n θ n θ θ 0

38 Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ),θ ) 2 θ n = θ n 1 γ ( Φ(x n ),θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n does not converge to θ but oscillates around it oscillations of order γ Ergodic theorem: Averaged iterates converge to θ γ = θ at rate O(1/n)

39 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic square log 10 [f(θ) f(θ * )] /2R 2 1/8R 2 4 1/32R 2 1/2R 2 n 1/ log 10 (n)

40 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) log 10 [f(θ) f(θ * )] alpha square C=1 test 1/R 2 1/R 2 n 1/2 SAG log (n) alpha square C=opt test C/R 2 C/R 2 n 1/2 SAG log (n) news square C=1 test 0.2 news square C=opt test log 10 [f(θ) f(θ * )] /R 2 1/R 2 n 1/2 SAG log (n) C/R 2 C/R 2 n 1/2 SAG log 10 (n)

41 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0

42 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ θ n θ n θ γ θ 0 θ

43 Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ moreover, θ θ n = O p ( γ) Ergodic theorem averaged iterates converge to θ γ θ at rate O(1/n) moreover, θ θ γ = O(γ) (Bach, 2013)

44 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic logistic 1 log 10 [f(θ) f(θ * )] 1 2 1/2R 2 3 1/8R 2 4 1/32R 2 5 1/2R 2 n 1/ log 10 (n)

45 Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions 3. Newton s method squares the error at each iteration for smooth functions 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(p) per iteration

46 Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions O(n 1 ) 3. Newton s method squares the error at each iteration for smooth functions O((n 1/2 ) 2 ) 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(p) per iteration

47 Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n, θ,φ(x n ) ) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+ f ( θ),θ θ θ θ,f ( θ)(θ θ) = f( θ)+ Ef n( θ),θ θ θ θ,ef n( θ)(θ θ) [ = E f( θ)+ f n( θ),θ θ θ θ,f n( θ)(θ θ) ]

48 Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n, θ,φ(x n ) ) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+ f ( θ),θ θ θ θ,f ( θ)(θ θ) = f( θ)+ Ef n( θ),θ θ θ θ,ef n( θ)(θ θ) [ = E f( θ)+ f n( θ),θ θ θ θ,f n( θ)(θ θ) ] Complexity of least-mean-square recursion for g is O(p) θ n = θ n 1 γ [ f n( θ)+f n( θ)(θ n 1 θ) ] f n( θ) = l (y n, θ,φ(x n ) )Φ(x n ) Φ(x n ) has rank one New online Newton step without computing/inverting Hessians

49 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity

50 Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(p/n) for logistic regression Additional assumptions but no strong convexity Update at each iteration using the current averaged iterate Recursion: θ n = θ n 1 γ [ f n( θ n 1 )+f n( θ n 1 )(θ n 1 θ n 1 ) ] No provable convergence rate (yet) but best practical behavior Note (dis)similarity with regular SGD: θ n = θ n 1 γf n(θ n 1 )

51 Simulations - synthetic examples Gaussian distributions - p = 20 0 synthetic logistic 1 0 synthetic logistic 2 log 10 [f(θ) f(θ * )] /2R 2 3 1/8R 2 3 every 2 p 4 1/32R 2 every iter. 4 2 step 5 1/2R 2 n 1/2 5 2 step dbl log 10 (n) log 10 (n) log 10 [f(θ) f(θ * )] 1

52 Simulations - benchmarks alpha (p = 500, n = ), news (p = , n = ) alpha logistic C=1 test alpha logistic C=opt test log 10 [f(θ) f(θ * )] /R /R 2 n 1/2 SAG 2 Adagrad Newton log (n) C/R C/R 2 n 1/2 SAG 2 Adagrad Newton log (n) news logistic C=1 test 0.2 news logistic C=opt test log 10 [f(θ) f(θ * )] /R /R 2 n 1/2 SAG 0.8 Adagrad 1 Newton log (n) C/R C/R 2 n 1/2 SAG 0.8 Adagrad 1 Newton log 10 (n)

53 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y, θ,φ(x) )

54 Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y, θ,φ(x) ) Machine learning practice Finite data set (x 1,y 1,...,x n,y n ) Multiple passes Minimizes training cost 1 n n i=1 l(y i, θ,φ(x i ) ) Need to regularize (e.g., by the l 2 -norm) to avoid overfitting Goal: minimize g(θ) = 1 n n f i (θ) i=1

55 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i, θ,φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1

56 Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i, θ,φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1 Stochastic gradient descent: θ t = θ t 1 γ t f i(t) (θ t 1) Sampling with replacement: i(t) random element of {1,...,n} Convergence rate in O(1/t) Iteration complexity is independent of n (step size selection?)

57 Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Stochastic average gradient (SAG) iteration Keep in memory the gradients of all functions f i, i = 1,...,n Random selection i(t) {1,...,n} with replacement { Iteration: θ t = θ t 1 γ n t f yi t with yi t n = i (θ t 1 ) if i = i(t) otherwise i=1 y t 1 i

58 Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Stochastic average gradient (SAG) iteration Keep in memory the gradients of all functions f i, i = 1,...,n Random selection i(t) {1,...,n} with replacement { Iteration: θ t = θ t 1 γ n t f yi t n with yt i = i (θ t 1 ) if i = i(t) otherwise i=1 y t 1 i Stochastic version of incremental average gradient(blatt et al., 2008) Extra memory requirement Supervised machine learning If f i (θ) = l i (y i, Φ(x i ),θ ), then f i (θ) = l i (y i, Φ(x i ),θ )Φ(x i ) Only need to store n real numbers

59 Stochastic average gradient - Convergence analysis Assumptions Each f i is L-smooth, i = 1,...,n g= 1 n n i=1 f i is µ-strongly convex (with potentially µ = 0) constant step size γ t = 1/(16L) initialization with one pass of averaged SGD

60 Stochastic average gradient - Convergence analysis Assumptions Each f i is L-smooth, i = 1,...,n g= 1 n n i=1 f i is µ-strongly convex (with potentially µ = 0) constant step size γ t = 1/(16L) initialization with one pass of averaged SGD Strongly convex case (Le Roux et al., 2012, 2013) E [ g(θ t ) g(θ ) ] ( 8σ 2 nµ + 4L θ 0 θ 2 n ) exp ( { 1 tmin 8n, µ }) 16L Linear (exponential) convergence rate with ( O(1) iteration { cost 1 After one pass, reduction of cost by exp min 8, nµ }) 16L

61 Stochastic average gradient - Convergence analysis Assumptions Each f i is L-smooth, i = 1,...,n g= 1 n n i=1 f i is µ-strongly convex (with potentially µ = 0) constant step size γ t = 1/(16L) initialization with one pass of averaged SGD Non-strongly convex case (Le Roux et al., 2013) E [ g(θ t ) g(θ ) ] 48 σ2 +L θ 0 θ 2 n n t Improvement over regular batch and stochastic gradient Adaptivity to potentially hidden strong convexity

62 spam dataset (n = , p = )

63 Conclusions Constant-step-size averaged stochastic gradient descent Reaches convergence rate O(1/n) in all regimes Improves on the O(1/ n) lower-bound of non-smooth problems Efficient online Newton step for non-quadratic problems Going beyond a single pass through the data Keep memory of all gradients for finite training sets Randomization leads to easier analysis and faster rates Relationship with Shalev-Shwartz and Zhang (2012); Mairal (2013) Extensions Non-differentiable terms, kernels, line-search, parallelization, etc. Beyond supervised learning, beyond convex problems

64 References A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. Information Theory, IEEE Transactions on, 58(5): , F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report , HAL, F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate o(1/n). Technical Report , HAL, Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations. Springer Publishing Company, Incorporated, D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29 51, L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31 61, N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report , HAL, 2013.

65 O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission. Wiley West Sussex, Julien Mairal. Optimization with first-order surrogate functions. arxiv preprint arxiv: , A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley & Sons, B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4): , H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22: , ISSN D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Technical Report , Arxiv, A. B. Tsybakov. Optimal rates of aggregation A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge Univ. press, 2000.

Machine learning and optimization for massive data

Machine learning and optimization for massive data Machine learning and optimization for massive data Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Eric Moulines - IHES, May 2015 Big data revolution?

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - April 2013 Context Machine

More information

Large-scale machine learning and convex optimization

Large-scale machine learning and convex optimization Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Eurandom - March 2014 Slides available at www.di.ens.fr/~fbach/gradsto_eurandom.pdf Context

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - April 2013 Context Machine

More information

Big learning: challenges and opportunities

Big learning: challenges and opportunities Big learning: challenges and opportunities Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure December 2013 Omnipresent digital media Scientific context Big data Multimedia, sensors, indicators,

More information

Machine learning challenges for big data

Machine learning challenges for big data Machine learning challenges for big data Francis Bach SIERRA Project-team, INRIA - Ecole Normale Supérieure Joint work with R. Jenatton, J. Mairal, G. Obozinski, N. Le Roux, M. Schmidt - December 2012

More information

Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang, Qihang Lin, Rong Jin Tutorial@SIGKDD 2015 Sydney, Australia Department of Computer Science, The University of Iowa, IA, USA Department of

More information

Large-Scale Machine Learning with Stochastic Gradient Descent

Large-Scale Machine Learning with Stochastic Gradient Descent Large-Scale Machine Learning with Stochastic Gradient Descent Léon Bottou NEC Labs America, Princeton NJ 08542, USA leon@bottou.org Abstract. During the last decade, the data sizes have grown faster than

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Introduction to Online Learning Theory

Introduction to Online Learning Theory Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

More information

A Stochastic 3MG Algorithm with Application to 2D Filter Identification

A Stochastic 3MG Algorithm with Application to 2D Filter Identification A Stochastic 3MG Algorithm with Application to 2D Filter Identification Emilie Chouzenoux 1, Jean-Christophe Pesquet 1, and Anisia Florescu 2 1 Laboratoire d Informatique Gaspard Monge - CNRS Univ. Paris-Est,

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

(Quasi-)Newton methods

(Quasi-)Newton methods (Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

More information

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski. Fabienne Comte, Celine Duval, Valentine Genon-Catalot To cite this version: Fabienne

More information

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014 Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

More information

Online Learning, Stability, and Stochastic Gradient Descent

Online Learning, Stability, and Stochastic Gradient Descent Online Learning, Stability, and Stochastic Gradient Descent arxiv:1105.4701v3 [cs.lg] 8 Sep 2011 September 9, 2011 Tomaso Poggio, Stephen Voinea, Lorenzo Rosasco CBCL, McGovern Institute, CSAIL, Brain

More information

Lecture 8 February 4

Lecture 8 February 4 ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

More information

Stochastic Optimization for Big Data Analytics: Algorithms and Libraries

Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Stochastic Optimization for Big Data Analytics: Algorithms and Libraries Tianbao Yang SDM 2014, Philadelphia, Pennsylvania collaborators: Rong Jin, Shenghuo Zhu NEC Laboratories America, Michigan State

More information

Large-Scale Similarity and Distance Metric Learning

Large-Scale Similarity and Distance Metric Learning Large-Scale Similarity and Distance Metric Learning Aurélien Bellet Télécom ParisTech Joint work with K. Liu, Y. Shi and F. Sha (USC), S. Clémençon and I. Colin (Télécom ParisTech) Séminaire Criteo March

More information

GI01/M055 Supervised Learning Proximal Methods

GI01/M055 Supervised Learning Proximal Methods GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators

More information

Summer School Machine Learning Trimester

Summer School Machine Learning Trimester Summer School Machine Learning Trimester Université Paul Sabatier, Toulouse September 1418, 2015 About the CIMI Machine Learning thematic trimester From September to December 2015, the International Centre

More information

Online Convex Optimization

Online Convex Optimization E0 370 Statistical Learning heory Lecture 19 Oct 22, 2013 Online Convex Optimization Lecturer: Shivani Agarwal Scribe: Aadirupa 1 Introduction In this lecture we shall look at a fairly general setting

More information

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

More information

CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

More information

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014 Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview

More information

Federated Optimization: Distributed Optimization Beyond the Datacenter

Federated Optimization: Distributed Optimization Beyond the Datacenter Federated Optimization: Distributed Optimization Beyond the Datacenter Jakub Konečný School of Mathematics University of Edinburgh J.Konecny@sms.ed.ac.uk H. Brendan McMahan Google, Inc. Seattle, WA 98103

More information

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richtárik School of Mathematics The University of Edinburgh Joint work with Martin Takáč (Edinburgh)

More information

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers Francis R. Bach Computer Science Division University of California Berkeley, CA 9472 fbach@cs.berkeley.edu Abstract

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Parallel & Distributed Optimization. Based on Mark Schmidt s slides Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear

More information

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

More information

How To Understand The Theory Of Probability

How To Understand The Theory Of Probability Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

More information

Pricing and calibration in local volatility models via fast quantization

Pricing and calibration in local volatility models via fast quantization Pricing and calibration in local volatility models via fast quantization Parma, 29 th January 2015. Joint work with Giorgia Callegaro and Martino Grasselli Quantization: a brief history Birth: back to

More information

Accelerated Parallel Optimization Methods for Large Scale Machine Learning

Accelerated Parallel Optimization Methods for Large Scale Machine Learning Accelerated Parallel Optimization Methods for Large Scale Machine Learning Haipeng Luo Princeton University haipengl@cs.princeton.edu Patrick Haffner and Jean-François Paiement AT&T Labs - Research {haffner,jpaiement}@research.att.com

More information

Stochastic Gradient Method: Applications

Stochastic Gradient Method: Applications Stochastic Gradient Method: Applications February 03, 2015 P. Carpentier Master MMMEF Cours MNOS 2014-2015 114 / 267 Lecture Outline 1 Two Elementary Exercices on the Stochastic Gradient Two-Stage Recourse

More information

Dirichlet forms methods for error calculus and sensitivity analysis

Dirichlet forms methods for error calculus and sensitivity analysis Dirichlet forms methods for error calculus and sensitivity analysis Nicolas BOULEAU, Osaka university, november 2004 These lectures propose tools for studying sensitivity of models to scalar or functional

More information

Scalable Machine Learning - or what to do with all that Big Data infrastructure

Scalable Machine Learning - or what to do with all that Big Data infrastructure - or what to do with all that Big Data infrastructure TU Berlin blog.mikiobraun.de Strata+Hadoop World London, 2015 1 Complex Data Analysis at Scale Click-through prediction Personalized Spam Detection

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Nonlinear Optimization: Algorithms 3: Interior-point methods

Nonlinear Optimization: Algorithms 3: Interior-point methods Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly) STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University (Joint work with R. Chen and M. Menickelly) Outline Stochastic optimization problem black box gradient based Existing

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

More information

An Introduction to Machine Learning

An Introduction to Machine Learning An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Big Data Analytics. Lucas Rego Drumond

Big Data Analytics. Lucas Rego Drumond Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

MapReduce/Bigtable for Distributed Optimization

MapReduce/Bigtable for Distributed Optimization MapReduce/Bigtable for Distributed Optimization Keith B. Hall Google Inc. kbhall@google.com Scott Gilpin Google Inc. sgilpin@google.com Gideon Mann Google Inc. gmann@google.com Abstract With large data

More information

Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery

Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery : Distribution-Oblivious Support Recovery Yudong Chen Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 7872 Constantine Caramanis Department of Electrical

More information

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

More information

CS229T/STAT231: Statistical Learning Theory (Winter 2015)

CS229T/STAT231: Statistical Learning Theory (Winter 2015) CS229T/STAT231: Statistical Learning Theory (Winter 2015) Percy Liang Last updated Wed Oct 14 2015 20:32 These lecture notes will be updated periodically as the course goes on. Please let us know if you

More information

B. Delyon. Stochastic approximation with decreasing gain: convergence and asymptotic theory. Unpublished lecture notes, Université de Rennes, 2000.

B. Delyon. Stochastic approximation with decreasing gain: convergence and asymptotic theory. Unpublished lecture notes, Université de Rennes, 2000. Some References Laetitia Andrieu, Guy Cohen and Felisa J. Vázquez-Abad. Gradient-based simulation optimization under probability constraints. European Journal of Operational Research, 212, 345-351, 2011.

More information

Lecture 6: Logistic Regression

Lecture 6: Logistic Regression Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

More information

1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09

1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09 1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Karthik Sridharan. 424 Gates Hall Ithaca, E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Contact Information

Karthik Sridharan. 424 Gates Hall Ithaca, E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Contact Information Karthik Sridharan Contact Information 424 Gates Hall Ithaca, NY 14853-7501 USA E-mail: sridharan@cs.cornell.edu http://www.cs.cornell.edu/ sridharan/ Research Interests Machine Learning, Statistical Learning

More information

Online Learning for Matrix Factorization and Sparse Coding

Online Learning for Matrix Factorization and Sparse Coding Journal of Machine Learning Research 11 (2010) 19-60 Submitted 7/09; Revised 11/09; Published 1/10 Online Learning for Matrix Factorization and Sparse Coding Julien Mairal Francis Bach INRIA - WILLOW Project-Team

More information

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

More information

i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by Statistics 580 Maximum Likelihood Estimation Introduction Let y (y 1, y 2,..., y n be a vector of iid, random variables from one of a family of distributions on R n and indexed by a p-dimensional parameter

More information

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d). 1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

More information

Monte Carlo Simulation

Monte Carlo Simulation 1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging

More information

Statistical machine learning, high dimension and big data

Statistical machine learning, high dimension and big data Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,

More information

HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

More information

NEURAL NETWORKS A Comprehensive Foundation

NEURAL NETWORKS A Comprehensive Foundation NEURAL NETWORKS A Comprehensive Foundation Second Edition Simon Haykin McMaster University Hamilton, Ontario, Canada Prentice Hall Prentice Hall Upper Saddle River; New Jersey 07458 Preface xii Acknowledgments

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

More information

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

More information

17.3.1 Follow the Perturbed Leader

17.3.1 Follow the Perturbed Leader CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.

More information

Fexible QoS Management of Web Services Orchestrations

Fexible QoS Management of Web Services Orchestrations Fexible QoS Management of Web Services Orchestrations Ajay Kattepur 1 1 Equipe ARLES, INRIA Paris-Rocquencourt, France. Collaborators: Albert Benveniste (INRIA), Claude Jard (INRIA / Uni. Nantes), Sidney

More information

Christfried Webers. Canberra February June 2015

Christfried Webers. Canberra February June 2015 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

More information

Distributed Machine Learning and Big Data

Distributed Machine Learning and Big Data Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya

More information

Operations Research and Financial Engineering. Courses

Operations Research and Financial Engineering. Courses Operations Research and Financial Engineering Courses ORF 504/FIN 504 Financial Econometrics Professor Jianqing Fan This course covers econometric and statistical methods as applied to finance. Topics

More information

Big Data: The Computation/Statistics Interface

Big Data: The Computation/Statistics Interface Big Data: The Computation/Statistics Interface Michael I. Jordan University of California, Berkeley September 2, 2013 What Is the Big Data Phenomenon? Big Science is generating massive datasets to be used

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 7, JULY 2005 2475. G. George Yin, Fellow, IEEE, and Vikram Krishnamurthy, Fellow, IEEE

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 7, JULY 2005 2475. G. George Yin, Fellow, IEEE, and Vikram Krishnamurthy, Fellow, IEEE IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 7, JULY 2005 2475 LMS Algorithms for Tracking Slow Markov Chains With Applications to Hidden Markov Estimation and Adaptive Multiuser Detection G.

More information

Advanced Signal Processing and Digital Noise Reduction

Advanced Signal Processing and Digital Noise Reduction Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

More information

Introduction to Logistic Regression

Introduction to Logistic Regression OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

More information

2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering

2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering 2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering Compulsory Courses IENG540 Optimization Models and Algorithms In the course important deterministic optimization

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Lecture Notes in Mathematics 2033

Lecture Notes in Mathematics 2033 Lecture Notes in Mathematics 2033 Editors: J.-M. Morel, Cachan B. Teissier, Paris Subseries: École d Été de Probabilités de Saint-Flour For further volumes: http://www.springer.com/series/304 Saint-Flour

More information

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION

BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION Ş. İlker Birbil Sabancı University Ali Taylan Cemgil 1, Hazal Koptagel 1, Figen Öztoprak 2, Umut Şimşekli

More information

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass. Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Supplement to Call Centers with Delay Information: Models and Insights

Supplement to Call Centers with Delay Information: Models and Insights Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290

More information

Network Security A Decision and Game-Theoretic Approach

Network Security A Decision and Game-Theoretic Approach Network Security A Decision and Game-Theoretic Approach Tansu Alpcan Deutsche Telekom Laboratories, Technical University of Berlin, Germany and Tamer Ba ar University of Illinois at Urbana-Champaign, USA

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Load Balancing and Switch Scheduling

Load Balancing and Switch Scheduling EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

More information