Machine learning and optimization for massive data



Machine learning and optimization for massive data. Francis Bach, INRIA - École Normale Supérieure, Paris, France. Joint work with Eric Moulines. IHES, May 2015.

Big data revolution? A new scientific context. Technical progress: computation, storage and sensor costs.

Moore's law - Processors

Moore's law - Storage

DNA sequencing

Big data revolution? A new scientific context. Data everywhere: size does not (always) matter. Science and industry: size and variety. Learning from examples: n observations in dimension p.

Search engines - Advertising

Marketing - Personalized recommendation

Visual object recognition

Bioinformatics. Proteins: crucial elements of cell life. Massive data: 2 million for humans. Complex data.

Astronomy. Square Kilometer Array (2024): 10^9 GB per day.

Context: machine learning for big data. Large-scale machine learning: large p, large n, with p the dimension of each observation (input) and n the number of observations. Examples: computer vision, bioinformatics, advertising. Ideal running-time complexity: O(pn). Going back to simple methods: stochastic gradient methods (Robbins and Monro, 1951). Mixing statistics and optimization; using smoothness to go beyond stochastic gradient descent.

Scaling to large problems with convex optimization: back to the sources. 1950s: computers not powerful enough (IBM 1620, 1959: CPU frequency 50 kHz, price above 100,000 dollars). 2010s: data too massive; one pass through the data (Robbins and Monro, 1951). Algorithm: $\theta_n = \theta_{n-1} - \gamma_n \, \ell'\big(y_n, \langle \theta_{n-1}, \Phi(x_n) \rangle\big)\, \Phi(x_n)$.
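
A minimal sketch of this single-pass recursion in Python/NumPy, assuming a logistic loss and a decaying step size $\gamma_n = 1/(2R^2\sqrt{n})$; the synthetic data and the function names are illustrative placeholders, not from the slides:

```python
import numpy as np

def sgd_single_pass(X, y, loss_grad, step):
    """Single pass of SGD: theta_n = theta_{n-1} - gamma_n * l'(y_n, <theta_{n-1}, Phi(x_n)>) * Phi(x_n)."""
    n, p = X.shape
    theta = np.zeros(p)
    for i in range(n):
        score = X[i].dot(theta)                      # <theta_{n-1}, Phi(x_n)>
        theta -= step(i + 1) * loss_grad(y[i], score) * X[i]
    return theta

def logistic_loss_grad(y, score):
    """Derivative of l(y, s) = log(1 + exp(-y s)) with respect to the score s, for y in {-1, +1}."""
    return -y / (1.0 + np.exp(y * score))

# Illustrative synthetic data (not from the slides).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = np.sign(X.dot(rng.standard_normal(20)) + 0.1 * rng.standard_normal(1000))

R2 = np.max(np.sum(X ** 2, axis=1))                  # R^2 bounds ||Phi(x_n)||^2
theta_hat = sgd_single_pass(X, y, logistic_loss_grad,
                            step=lambda n: 1.0 / (2.0 * R2 * np.sqrt(n)))
```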

Outline. Introduction: stochastic approximation algorithms. Supervised machine learning and convex optimization. Stochastic gradient and averaging: strongly convex vs. non-strongly convex, $O(1/(n\mu))$ vs. $O(1/\sqrt{n})$ for all non-smooth functions. Least-squares regression: constant step sizes, $O(1/n)$ convergence rate for all quadratic functions. Smooth losses: online Newton steps, $O(1/n)$ convergence rate for all convex functions.

Supervised machine learning. Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, i.i.d. Prediction as a linear function $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathbb{R}^p$. (Regularized) empirical risk minimization: find $\hat\theta$ solution of $\min_{\theta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta)$ = convex data-fitting term + regularizer.
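
As a concrete instance of this objective, a small sketch with the quadratic loss and $\Omega(\theta) = \frac{1}{2}\|\theta\|^2$ (i.e., ridge regression); the choice of loss and the function and array names are my own placeholders, not fixed by the slides:

```python
import numpy as np

def empirical_risk(theta, X, y, mu):
    """(1/n) * sum_i 0.5 * (y_i - <theta, Phi(x_i)>)^2 + mu * 0.5 * ||theta||^2 (ridge regression)."""
    residuals = X.dot(theta) - y
    return 0.5 * np.mean(residuals ** 2) + 0.5 * mu * theta.dot(theta)

def empirical_risk_grad(theta, X, y, mu):
    """Gradient of the regularized empirical risk above; one evaluation costs O(n p)."""
    residuals = X.dot(theta) - y
    return X.T.dot(residuals) / len(y) + mu * theta
```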

Usual losses. Regression: $y \in \mathbb{R}$, prediction $\hat y = \langle \theta, \Phi(x) \rangle$, quadratic loss $\frac{1}{2}(y - \hat y)^2 = \frac{1}{2}(y - \langle \theta, \Phi(x) \rangle)^2$. Classification: $y \in \{-1, 1\}$, prediction $\hat y = \mathrm{sign}(\langle \theta, \Phi(x) \rangle)$, loss of the form $\ell(y \langle \theta, \Phi(x) \rangle)$. True 0-1 loss: $\ell(y \langle \theta, \Phi(x) \rangle) = 1_{y \langle \theta, \Phi(x) \rangle < 0}$. Usual convex losses: hinge, square, logistic. [Figure: the 0-1, hinge, square and logistic losses as functions of the margin.]
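
These losses can be written as functions of the margin $u = y\,\langle\theta, \Phi(x)\rangle$; a short sketch tabulating them over the range shown in the figure (the names are mine):

```python
import numpy as np

# Classification losses as functions of the margin u = y * <theta, Phi(x)>, for y in {-1, +1}.
zero_one = lambda u: (u < 0).astype(float)      # true 0-1 loss (non-convex, non-smooth)
hinge    = lambda u: np.maximum(0.0, 1.0 - u)   # hinge loss (convex, non-smooth)
square   = lambda u: 0.5 * (1.0 - u) ** 2       # quadratic loss 0.5*(y - y_hat)^2 rewritten via the margin
logistic = lambda u: np.log1p(np.exp(-u))       # logistic loss (convex, smooth)

u = np.linspace(-3.0, 4.0, 200)                 # margin range, as on the slide's horizontal axis
losses = {"0-1": zero_one(u), "hinge": hinge(u), "square": square(u), "logistic": logistic(u)}
```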

Supervised machine learning (continued). Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$ (training cost). Expected risk: $f(\theta) = \mathbb{E}_{(x,y)}\, \ell(y, \langle \theta, \Phi(x) \rangle)$ (testing cost). Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$; they may be tackled simultaneously.

Smoothness and strong convexity. A function $g : \mathbb{R}^p \to \mathbb{R}$ is L-smooth if and only if it is twice differentiable and, for all $\theta \in \mathbb{R}^p$, $\mathrm{eigenvalues}\big[g''(\theta)\big] \le L$. [Figure: smooth vs. non-smooth functions.]

Smoothness and strong convexity. For machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$: Hessian $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$; bounded data ensures smoothness.

Smoothness and strong convexity. A twice differentiable function $g : \mathbb{R}^p \to \mathbb{R}$ is $\mu$-strongly convex if and only if, for all $\theta \in \mathbb{R}^p$, $\mathrm{eigenvalues}\big[g''(\theta)\big] \ge \mu$. [Figure: convex vs. strongly convex functions; large $\mu$ vs. small $\mu$.]

Smoothness and strong convexity. For machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$: Hessian $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$; strong convexity requires data with an invertible covariance matrix (low correlation/dimension). Adding regularization by $\frac{\mu}{2}\|\theta\|^2$ creates additional bias unless $\mu$ is small.

Iterative methods for minimizing smooth functions. Assumption: g convex and smooth on $\mathbb{R}^p$. Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1})$, with $O(1/t)$ convergence rate for convex functions and $O(e^{-\rho t})$ for strongly convex functions. Newton's method: $\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1} g'(\theta_{t-1})$, with $O(e^{-\rho 2^t})$ convergence rate. Key insights from Bottou and Bousquet (2008): (1) in machine learning, no need to optimize below the statistical error; (2) in machine learning, cost functions are averages. Hence stochastic approximation.
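
A compact sketch of the two deterministic updates for a generic smooth objective supplied through gradient and Hessian callables (the interfaces and step-size rule are my assumptions); when g is the empirical average above, each gradient evaluation already costs O(np), which is what motivates the stochastic approximation viewpoint below:

```python
import numpy as np

def gradient_descent(grad, theta0, step, n_iters):
    """theta_t = theta_{t-1} - gamma_t * g'(theta_{t-1}); O(1/t) for smooth convex g, linear rate if strongly convex."""
    theta = theta0.copy()
    for t in range(1, n_iters + 1):
        theta = theta - step(t) * grad(theta)
    return theta

def newton_method(grad, hess, theta0, n_iters):
    """theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1}); error squared at each iteration near the optimum."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - np.linalg.solve(hess(theta), grad(theta))
    return theta
```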

Stochastic approximation. Goal: minimizing a function f defined on $\mathbb{R}^p$, given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathbb{R}^p$. Machine learning / statistics: $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E}\, \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$ = generalization error; loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \langle \theta, \Phi(x_n) \rangle)$; expected gradient: $f'(\theta) = \mathbb{E} f_n'(\theta) = \mathbb{E}\big\{\ell'(y_n, \langle \theta, \Phi(x_n) \rangle)\, \Phi(x_n)\big\}$. Beyond convex optimization: see, e.g., Benveniste et al. (2012).

Convex stochastic approximation. Key assumption: smoothness and/or strong convexity. Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro), $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$, with Polyak-Ruppert averaging $\bar\theta_n = \frac{1}{n+1} \sum_{k=0}^{n} \theta_k$. Which learning-rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$. Running time O(np): a single pass through the data, one line of code among many.
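
A sketch of the Robbins-Monro recursion with Polyak-Ruppert averaging and the classical step sizes $\gamma_n = C n^{-\alpha}$; `loss_grad` is a callable such as the logistic gradient sketched earlier, and the default constants are placeholders. The averaging uses the streaming identity $\bar\theta_n = \bar\theta_{n-1} + (\theta_n - \bar\theta_{n-1})/(n+1)$:

```python
import numpy as np

def averaged_sgd(X, y, loss_grad, C=1.0, alpha=0.5):
    """theta_n = theta_{n-1} - gamma_n * f_n'(theta_{n-1}), with gamma_n = C * n^(-alpha),
    and Polyak-Ruppert averaging theta_bar_n = (1/(n+1)) * sum_{k=0}^n theta_k."""
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = theta.copy()
    for i in range(n):
        gamma = C * (i + 1) ** (-alpha)
        theta = theta - gamma * loss_grad(y[i], X[i].dot(theta)) * X[i]
        theta_bar += (theta - theta_bar) / (i + 2)   # running average over theta_0, ..., theta_{i+1}
    return theta, theta_bar
```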

Convex stochastic approximation: existing work. Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012): strongly convex, $O((\mu n)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$; non-strongly convex, $O(n^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$. Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988): all step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for smooth strongly convex problems. Question: is there a single algorithm for smooth problems with convergence rate $O(1/n)$ in all situations?

Least-mean-square algorithm. Least-squares: $f(\theta) = \frac{1}{2} \mathbb{E}\big[(y_n - \langle \Phi(x_n), \theta \rangle)^2\big]$ with $\theta \in \mathbb{R}^p$. SGD = least-mean-square algorithm (see, e.g., Macchi, 1995), usually studied without averaging and decreasing step sizes, with the strong convexity assumption $\mathbb{E}\big[\Phi(x_n) \Phi(x_n)^\top\big] = H \succcurlyeq \mu\, \mathrm{Id}$. New analysis for averaging and constant step size $\gamma = 1/(4R^2)$: assume $\|\Phi(x_n)\| \le R$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$ almost surely, with no assumption regarding the lowest eigenvalues of H. Main result: $\mathbb{E} f(\bar\theta_n) - f(\theta_*) \le \frac{4 \sigma^2 p}{n} + \frac{4 R^2 \|\theta_0 - \theta_*\|^2}{n}$. Matches the statistical lower bound (Tsybakov, 2003); non-asymptotic robust version of Györfi and Walk (1996).
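
A sketch of the constant-step-size averaged LMS recursion analyzed above, with $\gamma = 1/(4R^2)$; the synthetic data at the end is only there to make the snippet self-contained:

```python
import numpy as np

def averaged_lms(X, y):
    """Constant-step LMS, gamma = 1/(4 R^2), with Polyak-Ruppert averaging of the iterates."""
    n, p = X.shape
    gamma = 1.0 / (4.0 * np.max(np.sum(X ** 2, axis=1)))    # 1 / (4 R^2)
    theta = np.zeros(p)
    theta_bar = theta.copy()
    for i in range(n):
        theta = theta - gamma * (X[i].dot(theta) - y[i]) * X[i]
        theta_bar += (theta - theta_bar) / (i + 2)           # running average of theta_0, ..., theta_{i+1}
    return theta_bar

# Illustrative synthetic least-squares problem.
rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 20))
theta_star = rng.standard_normal(20)
y = X.dot(theta_star) + 0.5 * rng.standard_normal(10000)
theta_bar = averaged_lms(X, y)
# The guarantee above concerns the excess risk f(theta_bar) - f(theta_*), which decays as O(1/n).
```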

Markov chain interpretation of constant step sizes. The LMS recursion for $f_n(\theta) = \frac{1}{2}(y_n - \langle \Phi(x_n), \theta \rangle)^2$ is $\theta_n = \theta_{n-1} - \gamma\,\big(\langle \Phi(x_n), \theta_{n-1} \rangle - y_n\big)\, \Phi(x_n)$. The sequence $(\theta_n)_n$ is a homogeneous Markov chain: it converges to a stationary distribution $\pi_\gamma$ with expectation $\bar\theta_\gamma \stackrel{\mathrm{def}}{=} \int \theta\, \pi_\gamma(\mathrm{d}\theta)$. For least-squares, $\bar\theta_\gamma = \theta_*$: $\theta_n$ does not converge to $\theta_*$ but oscillates around it, with oscillations of order $\sqrt{\gamma}$. Ergodic theorem: the averaged iterates converge to $\bar\theta_\gamma = \theta_*$ at rate $O(1/n)$.
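
A small self-contained simulation illustrating both claims on a synthetic least-squares problem (the data and the two step sizes are my choices): the last iterate stalls at a $\gamma$-dependent level, while the averaged iterate keeps improving.

```python
import numpy as np

# Compare the last iterate (oscillates around theta_* at scale ~ sqrt(gamma))
# with the averaged iterate (converges) on a synthetic least-squares problem.
rng = np.random.default_rng(1)
X = rng.standard_normal((50000, 10))
theta_star = rng.standard_normal(10)
y = X.dot(theta_star) + rng.standard_normal(50000)
R2 = np.max(np.sum(X ** 2, axis=1))

for gamma in (1.0 / (4.0 * R2), 1.0 / (16.0 * R2)):
    theta = np.zeros(10)
    theta_bar = theta.copy()
    for i in range(len(y)):
        theta = theta - gamma * (X[i].dot(theta) - y[i]) * X[i]
        theta_bar += (theta - theta_bar) / (i + 2)
    print(gamma,
          np.linalg.norm(theta - theta_star),      # stays at a gamma-dependent level
          np.linalg.norm(theta_bar - theta_star))  # much closer to theta_*
```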

Simulations - synthetic examples. Gaussian distributions, p = 20. [Figure: "synthetic square"; log10(f(θ)−f(θ*)) versus log10(n) for constant step sizes 1/(2R^2), 1/(8R^2), 1/(32R^2) and the decaying step size 1/(2R^2 n^{1/2}).]

Simulations - benchmarks: alpha (p = 500, n = 500,000) and news (p = 1,300,000, n = 20,000). [Figure: test performance with the square loss; panels "alpha square C=1", "alpha square C=opt", "news square C=1", "news square C=opt"; log10(f(θ)−f(θ*)) versus log10(n) for constant steps C/R^2, decaying steps C/(R^2 n^{1/2}), and SAG.]

Beyond least-squares: Markov chain interpretation. The recursion $\theta_n = \theta_{n-1} - \gamma f_n'(\theta_{n-1})$ also defines a Markov chain, with a stationary distribution $\pi_\gamma$ such that $\int f'(\theta)\, \pi_\gamma(\mathrm{d}\theta) = 0$. When $f'$ is not linear, $f'\big(\int \theta\, \pi_\gamma(\mathrm{d}\theta)\big) \ne \int f'(\theta)\, \pi_\gamma(\mathrm{d}\theta) = 0$, so $\theta_n$ oscillates around the wrong value $\bar\theta_\gamma \ne \theta_*$; moreover, $\|\theta_* - \theta_n\| = O_p(\sqrt{\gamma})$. Ergodic theorem: the averaged iterates converge to $\bar\theta_\gamma \ne \theta_*$ at rate $O(1/n)$; moreover, $\|\theta_* - \bar\theta_\gamma\| = O(\gamma)$ (Bach, 2013).

Simulations - synthetic examples. Gaussian distributions, p = 20. [Figure: "synthetic logistic - 1"; log10(f(θ)−f(θ*)) versus log10(n) for constant step sizes 1/(2R^2), 1/(8R^2), 1/(32R^2) and the decaying step size 1/(2R^2 n^{1/2}).]

Restoring convergence through online Newton steps. Known facts: (1) averaged SGD with $\gamma_n \propto n^{-1/2}$ leads to the robust rate $O(n^{-1/2})$ for all convex functions; (2) averaged SGD with constant $\gamma_n$ leads to the robust rate $O(n^{-1})$ for all convex quadratic functions; (3) Newton's method squares the error at each iteration for smooth functions, i.e., $O((n^{-1/2})^2)$; (4) a single step of Newton's method is equivalent to minimizing the quadratic Taylor expansion. Online Newton step: rate $O((n^{-1/2})^2 + n^{-1}) = O(n^{-1})$, complexity O(p) per iteration.

Restoring convergence through online Newton steps. The Newton step for $f = \mathbb{E} f_n(\theta) \stackrel{\mathrm{def}}{=} \mathbb{E}\big[\ell(y_n, \langle \theta, \Phi(x_n) \rangle)\big]$ at $\tilde\theta$ is equivalent to minimizing the quadratic approximation
$g(\theta) = f(\tilde\theta) + \langle f'(\tilde\theta), \theta - \tilde\theta \rangle + \frac{1}{2} \langle \theta - \tilde\theta, f''(\tilde\theta)(\theta - \tilde\theta) \rangle = f(\tilde\theta) + \langle \mathbb{E} f_n'(\tilde\theta), \theta - \tilde\theta \rangle + \frac{1}{2} \langle \theta - \tilde\theta, \mathbb{E} f_n''(\tilde\theta)(\theta - \tilde\theta) \rangle = \mathbb{E}\big[ f(\tilde\theta) + \langle f_n'(\tilde\theta), \theta - \tilde\theta \rangle + \frac{1}{2} \langle \theta - \tilde\theta, f_n''(\tilde\theta)(\theta - \tilde\theta) \rangle \big]$.
The complexity of the least-mean-square recursion for g is O(p): $\theta_n = \theta_{n-1} - \gamma \big[ f_n'(\tilde\theta) + f_n''(\tilde\theta)(\theta_{n-1} - \tilde\theta) \big]$, since $f_n''(\tilde\theta) = \ell''(y_n, \langle \tilde\theta, \Phi(x_n) \rangle)\, \Phi(x_n) \Phi(x_n)^\top$ has rank one. This yields a new online Newton step without computing/inverting Hessians.
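
A sketch of this rank-one recursion for logistic regression around a fixed support point (function names and the step size $\gamma = 1/(4R^2)$ are my assumptions; for the logistic loss, $\ell''(y_n, s) = \sigma(s)(1-\sigma(s))$ with $\sigma$ the sigmoid):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def online_newton_fixed_support(X, y, theta_tilde):
    """Averaged LMS on the quadratic approximation of the logistic risk around theta_tilde:
    theta_n = theta_{n-1} - gamma * [ f_n'(theta_tilde) + f_n''(theta_tilde) (theta_{n-1} - theta_tilde) ],
    where f_n''(theta_tilde) = l''(y_n, <theta_tilde, x_n>) x_n x_n^T is rank one, so each step is O(p)."""
    n, p = X.shape
    gamma = 1.0 / (4.0 * np.max(np.sum(X ** 2, axis=1)))
    theta = theta_tilde.copy()
    theta_bar = theta.copy()
    for i in range(n):
        s = X[i].dot(theta_tilde)
        grad = -y[i] * sigmoid(-y[i] * s) * X[i]         # f_n'(theta_tilde)
        curvature = sigmoid(s) * (1.0 - sigmoid(s))      # l''(y_n, s), independent of y_n
        theta = theta - gamma * (grad + curvature * X[i].dot(theta - theta_tilde) * X[i])
        theta_bar += (theta - theta_bar) / (i + 2)
    return theta_bar
```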

Choice of support point for the online Newton step. Two-stage procedure: (1) run n/2 iterations of averaged SGD to obtain $\tilde\theta$; (2) run n/2 iterations of averaged constant-step-size LMS. Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000); provable convergence rate of O(p/n) for logistic regression, under additional assumptions but without strong convexity. Alternative: update at each iteration using the current averaged iterate, with recursion $\theta_n = \theta_{n-1} - \gamma \big[ f_n'(\bar\theta_{n-1}) + f_n''(\bar\theta_{n-1})(\theta_{n-1} - \bar\theta_{n-1}) \big]$. No provable convergence rate (yet), but the best practical behavior. Note the (dis)similarity with regular SGD: $\theta_n = \theta_{n-1} - \gamma f_n'(\theta_{n-1})$.
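
A sketch of the "update at each iteration" variant, where the support point is the running average $\bar\theta_{n-1}$ (logistic loss; names and the step size $\gamma = 1/(4R^2)$ are my assumptions). The two-stage procedure would instead run averaged SGD on the first half of the data to obtain $\tilde\theta$, then call the fixed-support recursion from the previous sketch on the second half.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def online_newton_running_support(X, y):
    """theta_n = theta_{n-1} - gamma * [ f_n'(theta_bar_{n-1}) + f_n''(theta_bar_{n-1}) (theta_{n-1} - theta_bar_{n-1}) ],
    with the support point taken as the current averaged iterate (logistic loss)."""
    n, p = X.shape
    gamma = 1.0 / (4.0 * np.max(np.sum(X ** 2, axis=1)))
    theta = np.zeros(p)
    theta_bar = theta.copy()
    for i in range(n):
        s = X[i].dot(theta_bar)
        grad = -y[i] * sigmoid(-y[i] * s) * X[i]         # f_n'(theta_bar_{n-1})
        curvature = sigmoid(s) * (1.0 - sigmoid(s))      # l''(y_n, s) at the support point
        theta = theta - gamma * (grad + curvature * X[i].dot(theta - theta_bar) * X[i])
        theta_bar += (theta - theta_bar) / (i + 2)
    return theta_bar
```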

Simulations - synthetic examples. Gaussian distributions, p = 20. [Figure: two panels, "synthetic logistic - 1" and "synthetic logistic - 2"; log10(f(θ)−f(θ*)) versus log10(n). Left: constant step sizes 1/(2R^2), 1/(8R^2), 1/(32R^2) and decaying step 1/(2R^2 n^{1/2}). Right: online Newton variants "every 2p", "every iter.", "2-step", "2-step dbl.".]

Simulations - benchmarks: alpha (p = 500, n = 500,000) and news (p = 1,300,000, n = 20,000). [Figure: test performance with the logistic loss; panels "alpha logistic C=1", "alpha logistic C=opt", "news logistic C=1", "news logistic C=opt"; log10(f(θ)−f(θ*)) versus log10(n) for constant steps C/R^2, decaying steps C/(R^2 n^{1/2}), SAG, Adagrad, and the online Newton method.]

Conclusions. Constant-step-size averaged stochastic gradient descent: reaches the convergence rate $O(1/n)$ in all regimes, improves on the $O(1/\sqrt{n})$ lower bound for non-smooth problems, efficient online Newton step for non-quadratic problems, robustness to step-size selection. Extensions and future work: going beyond a single pass, pre-conditioning, proximal extensions for non-differentiable terms, kernels and non-parametric estimation (Dieuleveut and Bach, 2014), line search, parallelization, non-convex problems.

References
A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249, 2012.
F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report 00804431, HAL, 2013.
A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer, 2012.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
A. Dieuleveut and F. Bach. Non-parametric stochastic approximation with large step sizes. Technical report, arXiv, 2014.
L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31-61, 1996.
O. Macchi. Adaptive Processing: The Least Mean Squares Approach with Applications in Transmission. Wiley, West Sussex, 1995.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley & Sons, 1983.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
A. B. Tsybakov. Optimal rates of aggregation. 2003.
A. W. Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.