Large-scale machine learning and convex optimization



Similar documents
Beyond stochastic gradient descent for large-scale machine learning

Machine learning and optimization for massive data

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning

Big learning: challenges and opportunities

Machine learning challenges for big data

Big Data Analytics: Optimization and Randomization

Big Data - Lecture 1 Optimization reminders

Adaptive Online Gradient Descent

Simple and efficient online algorithms for real world applications

Large-Scale Similarity and Distance Metric Learning

Introduction to Online Learning Theory

Linear Threshold Units

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

Online Learning, Stability, and Stochastic Gradient Descent

(Quasi-)Newton methods

A Stochastic 3MG Algorithm with Application to 2D Filter Identification

Large-Scale Machine Learning with Stochastic Gradient Descent

Online Convex Optimization

Accelerated Parallel Optimization Methods for Large Scale Machine Learning

Summer School Machine Learning Trimester

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

A primal-dual algorithm for group sparse regularization with overlapping groups

Stochastic Optimization for Big Data Analytics: Algorithms and Libraries

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

GI01/M055 Supervised Learning Proximal Methods

Online Learning for Matrix Factorization and Sparse Coding

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

CS229T/STAT231: Statistical Learning Theory (Winter 2015)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

CSCI567 Machine Learning (Fall 2014)

10. Proximal point method

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Statistical Machine Learning

Online Lazy Updates for Portfolio Selection with Transaction Costs

Operations Research and Financial Engineering. Courses

1 Introduction. 2 Prediction with Expert Advice. Online Learning Lecture 09

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Federated Optimization: Distributed Optimization Beyond the Datacenter

Lecture 8 February 4

Stochastic Gradient Method: Applications

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Nonlinear Optimization: Algorithms 3: Interior-point methods

STA 4273H: Statistical Machine Learning

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

AN INTRODUCTION TO NUMERICAL METHODS AND ANALYSIS

Projection-free Online Learning

Machine Learning and Pattern Recognition Logistic Regression

Karthik Sridharan. 424 Gates Hall Ithaca, sridharan/ Contact Information

The p-norm generalization of the LMS algorithm for adaptive filtering

Follow the Perturbed Leader

Summer course on Convex Optimization. Fifth Lecture Interior-Point Methods (1) Michel Baes, K.U.Leuven Bharath Rangarajan, U.

2.3 Convex Constrained Optimization Problems

t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

Parallel Selective Algorithms for Nonconvex Big Data Optimization

Consistent Binary Classification with Generalized Performance Metrics

i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

Noisy and Missing Data Regression: Distribution-Oblivious Support Recovery

Analysis of Mean-Square Error and Transient Speed of the LMS Adaptive Algorithm

Parallel & Distributed Optimization. Based on Mark Schmidt s slides

Table 1: Summary of the settings and parameters employed by the additive PA algorithm for classification, regression, and uniclass.

Efficient Projections onto the l 1 -Ball for Learning in High Dimensions

Statistical machine learning, high dimension and big data

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department sss40@eng.cam.ac.

Sparse Prediction with the k-support Norm

SMOOTHING APPROXIMATIONS FOR TWO CLASSES OF CONVEX EIGENVALUE OPTIMIZATION PROBLEMS YU QI. (B.Sc.(Hons.), BUAA)

How To Understand The Theory Of Probability

NMR Measurement of T1-T2 Spectra with Partial Measurements using Compressive Sensing

MapReduce/Bigtable for Distributed Optimization

Lecture 3: Linear methods for classification

An Introduction to Machine Learning

EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

Lecture 2: The SVM classifier

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski

Bootstrapping Big Data

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers

Introduction to Online Optimization

Sparse LMS via Online Linearized Bregman Iteration

Big Data Analytics. Lucas Rego Drumond

Interactive Machine Learning. Maria-Florina Balcan

Computing a Nearest Correlation Matrix with Factor Structure

Dirichlet forms methods for error calculus and sensitivity analysis

Convex Optimization for Big Data

Large Scale Learning to Rank

Neuro-Dynamic Programming An Overview

Direct Loss Minimization for Structured Prediction

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Numerical Analysis An Introduction

Transcription:

Large-scale machine learning and convex optimization Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Eurandom - March 2014 Slides available at www.di.ens.fr/~fbach/gradsto_eurandom.pdf

Context Machine learning for big data Large-scale machine learning: large d, large n, large k d : dimension of each observation (input) n : number of observations k : number of tasks (dimension of outputs) Examples: computer vision, bioinformatics, text processing Ideal running-time complexity: O(dn + kn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

Object recognition

Learning for bioinformatics - Proteins Crucial components of cell life Predicting multiple functions and interactions Massive data: up to 1 millions for humans! Complex data Amino-acid sequence Link with DNA Tri-dimensional molecule

Search engines - advertising

Advertising - recommendation

Context Machine learning for big data Large-scale machine learning: large d, large n, large k d : dimension of each observation (input) n : number of observations k : number of tasks (dimension of outputs) Examples: computer vision, bioinformatics, text processing Ideal running-time complexity: O(dn + kn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

Context Machine learning for big data Large-scale machine learning: large d, large n, large k d : dimension of each observation (input) n : number of observations k : number of tasks (dimension of outputs) Examples: computer vision, bioinformatics, text processing Ideal running-time complexity: O(dn + kn) Going back to simple methods Stochastic gradient methods (Robbins and Monro, 1951) Mixing statistics and optimization Using smoothness to go beyond stochastic gradient descent

Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer

Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2

Usual losses Regression: y R, prediction ŷ = θ Φ(x) quadratic loss 1 2 (y ŷ)2 = 1 2 (y θ Φ(x)) 2 Classification : y { 1,1}, prediction ŷ = sign(θ Φ(x)) loss of the form l(yθ Φ(x)) True 0-1 loss: l(yθ Φ(x)) = 1 yθ Φ(x)<0 Usual convex losses: 5 4 3 0 1 hinge square logistic 2 1 0 3 2 1 0 1 2 3 4

Usual regularizers Main goal: avoid overfitting (squared) Euclidean norm: θ 2 2 = d j=1 θ j 2 Numerically well-behaved Representer theorem and kernel methods : θ = n i=1 α iφ(x i ) See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004) Sparsity-inducing norms Main example: l 1 -norm θ 1 = d j=1 θ j Perform model selection as well as regularization Non-smooth optimization and structured sparsity See, e.g., Bach, Jenatton, Mairal, and Obozinski (2011, 2012)

Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer

Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) + µω(θ) θ R d n i=1 convex data fitting term + regularizer Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

Supervised machine learning Data: n observations (x i,y i ) X Y, i = 1,...,n, i.i.d. Prediction as a linear function θ Φ(x) of features Φ(x) R d (regularized) empirical risk minimization: find ˆθ solution of 1 n min l ( y i,θ Φ(x i ) ) such that Ω(θ) D θ R d n i=1 convex data fitting term + constraint Empirical risk: ˆf(θ) = 1 n n i=1 l(y i,θ Φ(x i )) training cost Expected risk: f(θ) = E (x,y) l(y,θ Φ(x)) testing cost Two fundamental questions: (1) computing ˆθ and (2) analyzing ˆθ May be tackled simultaneously

Analysis of empirical risk minimization Approximation and estimation errors: C = {θ R d,ω(θ) D} [ ] [ ] f(ˆθ) min θ R df(θ) = f(ˆθ) min f(θ) + minf(θ) min θ C θ C θ R df(θ) NB: may replace min by best (non-linear) predictions θ Rdf(θ) 1. Uniform deviation bounds, with ˆθ argmin θ C ˆf(θ) f(ˆθ) min θ C Typically slow rate O ( 1 n ) f(θ) 2sup θ C ˆf(θ) f(θ) 2. More refined concentration results with faster rates

Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity

Slow rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Ω(θ) = θ 2 (Euclidean norm) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} No assumptions regarding convexity With probability greater than 1 δ sup θ C ˆf(θ) f(θ) GRD n [2+ Expectated estimation error: E [ sup θ C 2log 2 δ ] ˆf(θ) f(θ) ] 4GRD n Using Rademacher averages (see, e.g., Boucheron et al., 2005) Lipschitz functions slow rate

Fast rate for supervised learning Assumptions (f is the expected risk, ˆf the empirical risk) Same as before (bounded features, Lipschitz loss) Regularized risks: f µ (θ) = f(θ)+ µ 2 θ 2 2 and ˆf µ (θ) = ˆf(θ)+ µ 2 θ 2 2 Convexity For any a > 0, with probability greater than 1 δ, for all θ R d, f µ (θ) min (η) (1+a)(ˆf µ (θ) min ˆfµ (η))+ 8(1+ 1 a )G2 R 2 (32+log 1 δ ) η R dfµ η R d µn Results from Sridharan, Srebro, and Shalev-Shwartz (2008) see also Boucheron and Massart (2011) and references therein Strongly convex functions fast rate Warning: µ should decrease with n to reduce approximation error

Complexity results in convex optimization Assumption: f convex on R d Classical generic algorithms (sub)gradient method/descent Accelerated gradient descent Newton method Key additional properties of f Lipschitz continuity, smoothness or strong convexity Key insight from Bottou and Bousquet (2008) In machine learning, no need to optimize below estimation error Key reference: Nesterov (2004)

Lipschitz continuity Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: Machine learning θ R d, θ 2 D f (θ) 2 B with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) G-Lipschitz loss and R-bounded data: B = GR

Smoothness and strong convexity A function f : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 If f is twice differentiable: θ R d, f (θ) L Id smooth non smooth

Smoothness and strong convexity A function f : R d R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 If f is twice differentiable: θ R d, f (θ) L Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) l-smooth loss and R-bounded data: L = lr 2

Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ 2 2 2 If f is twice differentiable: θ R d, f (θ) µ Id convex strongly convex

Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ 2 2 2 If f is twice differentiable: θ R d, f (θ) µ Id Machine learning with f(θ) = 1 n n i=1 l(y i,θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension)

Smoothness and strong convexity A function f : R d R is µ-strongly convex if and only if θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ 2 2 3 If f is twice differentiable: θ R d, f (θ) µ Id Machine learning with f(θ) = 1 n n i=1 l(y i, θ Φ(x i )) Hessian covariance matrix 1 n n i=1 Φ(x i)φ(x i ) Data with invertible covariance matrix (low correlation/dimension) Adding regularization by µ 2 θ 2 creates additional bias unless µ is small

Summary of smoothness/convexity assumptions Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D: θ R d, θ 2 D f (θ) 2 B Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f : θ 1,θ 2 R d, f (θ 1 ) f (θ 2 ) 2 L θ 1 θ 2 2 Strong convexity of f: The function f is strongly convex with respect to the norm, with convexity constant µ > 0: θ 1,θ 2 R d, f(θ 1 ) f(θ 2 )+f (θ 2 ) (θ 1 θ 2 )+ µ 2 θ 1 θ 2 2 2

Assumptions Subgradient method/descent f convex and B-Lipschitz-continuous on { θ 2 D} ( Algorithm: θ t = Π D θ t 1 2D ) B t f (θ t 1 ) Π D : orthogonal projection onto { θ 2 D} Bound: Three-line proof f ( 1 t t 1 k=0 θ k ) f(θ ) 2DB t Best possible convergence rate after O(d) iterations

Subgradient method/descent - proof - I Iteration: θ t = Π D (θ t 1 γ t f (θ t 1 )) with γ t = 2D B t Assumption: f (θ) 2 B and θ 2 D θ t θ 2 2 θ t 1 θ γ t f (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γ 2 t 2γ t (θ t 1 θ ) f (θ t 1 ) because f (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γ 2 t 2γ t [ f(θt 1 ) f(θ ) ] (property of subgradients) leading to f(θ t 1 ) f(θ ) B2 γ t 2 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t

Starting from Subgradient method/descent - proof - II t [ f(θu 1 ) f(θ ) ] u=1 f(θ t 1 ) f(θ ) B2 γ t 2 = = Using convexity: f t u=1 t u=1 ( 1 t t u=1 t u=1 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 B 2 γ u 2 t 1 k=0 + + t u=1 t 1 u=1 + θ k ) t 1 u=1 + 1 [ θt 1 θ 2 2γ 2 θ t θ 2 ] 2 t 1 [ θu 1 θ 2 2γ 2 θ u θ 2 ] 2 u ( θ u θ 2 1 2 1 ) θ 0 θ 2 + 2 θ t θ 2 2 2γ u+1 2γ u 2γ 1 2γ t 4D 2( 1 2γ u+1 1 2γ u ) + 4D 2 2γ 1 + 4D2 2γ t 2DB t with γ t = 2D B t f(θ ) 2DB t

Subgradient descent for machine learning Assumptions (f is the expected risk, ˆf the empirical risk) Linear predictors: θ(x) = θ Φ(x), with Φ(x) 2 R a.s. ˆf(θ) = 1 n n i=1 l(y i,φ(x i ) θ) G-Lipschitz loss: f and ˆf are GR-Lipschitz on C = { θ 2 D} Statistics: with probability greater than 1 δ ˆf(θ) f(θ) GRD [2+ n sup θ C 2log 2 δ ] Optimization: after t iterations of subgradient method ˆf(ˆθ) min η C ˆf(η) GRD t t = n iterations, with total running-time complexity of O(n 2 d)

Subgradient descent - strong convexity Assumptions f convex and B-Lipschitz-continuous on { θ 2 D} f µ-strongly convex ) 2 Algorithm: θ t = Π D (θ t 1 µ(t+1) f (θ t 1 ) Bound: f ( 2 t(t+1) t k=1 kθ k 1 ) f(θ ) 2B2 µ(t+1) Three-line proof Best possible convergence rate after O(d) iterations

Subgradient method - strong convexity - proof - I Iteration: θ t = Π D (θ t 1 γ t f (θ t 1 )) with γ t = 2 µ(t+1) Assumption: f (θ) 2 B and θ 2 D and µ-strong convexity of f θ t θ 2 2 θ t 1 θ γ t f (θ t 1 ) 2 2 by contractivity of projections θ t 1 θ 2 2+B 2 γt 2 2γ t (θ t 1 θ ) f (θ t 1 ) because f (θ t 1 ) 2 B θ t 1 θ 2 2+B 2 γt 2 [ 2γ t f(θt 1 ) f(θ )+ µ 2 θ t 1 θ 2 ] 2 (property of subgradients and strong convexity) leading to f(θ t 1 ) f(θ ) B2 γ t 2 + 1 2 B 2 µ(t+1) + µ 2 [1 γ t µ ] θ t 1 θ 2 2 1 2γ t θ t θ 2 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ 2 2 4

Subgradient method - strong convexity - proof - II From f(θ t 1 ) f(θ ) B2 µ(t+1) + µ 2 [t 1 2 ] θt 1 θ 2 2 µ(t+1) θ t θ 2 2 4 t u [ f(θ u 1 ) f(θ ) ] u=1 u t=1 B 2 u µ(u+1) + 1 4 t [ u(u 1) θu 1 θ 2 2 u(u+1) θ u θ 2 2] u=1 B2 t µ + 1 [ 0 t(t+1) θt θ 2 B 4 2] 2 t µ Using convexity: f ( 2 t(t+1) t u=1 uθ u 1 ) f(θ ) 2B2 t+1

(smooth) gradient descent Assumptions f convex with L-Lipschitz-continuous gradient Minimum attained at θ Algorithm: Bound: Three-line proof θ t = θ t 1 1 L f (θ t 1 ) f(θ t ) f(θ ) 2L θ 0 θ 2 t+4 Not best possible convergence rate after O(d) iterations

(smooth) gradient descent - strong convexity Assumptions f convex with L-Lipschitz-continuous gradient f µ-strongly convex Algorithm: Bound: θ t = θ t 1 1 L f (θ t 1 ) f(θ t ) f(θ ) (1 µ/l) t[ f(θ 0 ) f(θ ) ] Three-line proof Adaptivity of gradient descent to problem difficulty Line search

Accelerated gradient methods (Nesterov, 1983) Assumptions f convex with L-Lipschitz-cont. gradient, min. attained at θ Algorithm: θ t = η t 1 1 L f (η t 1 ) Bound: η t = θ t + t 1 t+2 (θ t θ t 1 ) f(θ t ) f(θ ) 2L θ 0 θ 2 (t+1) 2 Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) Not improvable Extension to strongly convex functions

Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t)

Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011) Gradient descent as a proximal method (differentiable functions) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+ L 2 θ θ t 2 2 θ t+1 = θ t 1 L f(θ t) Problems of the form: min θ R df(θ)+µω(θ) θ t+1 = arg min θ R df(θ t)+(θ θ t ) f(θ t )+µω(θ)+ L 2 θ θ t 2 2 Ω(θ) = θ 1 Thresholded gradient descent Similar convergence rates than smooth optimization Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)

Summary: minimizing convex functions Assumption: f convex Gradient descent: θ t = θ t 1 γ t f (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate

Summary: minimizing convex functions Assumption: f convex Gradient descent: θ t = θ t 1 γ t f (θ t 1 ) O(1/ t) convergence rate for non-smooth convex functions O(1/t) convergence rate for smooth convex functions O(e ρt ) convergence rate for strongly smooth convex functions Newton method: θ t = θ t 1 f (θ t 1 ) 1 f (θ t 1 ) O ( e ρ2t ) convergence rate Key insights from Bottou and Bousquet (2008) 1. In machine learning, no need to optimize below statistical error 2. In machine learning, cost functions are averages Stochastic approximation

Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Stochastic approximation (much) broader applicability beyond convex optimization θ n = θ n 1 γ n h n (θ n 1 ) with E [ h n (θ n 1 ) θ n 1 ] = h(θn 1 ) Beyond convex problems, i.i.d assumption, finite dimension, etc. Typically asymptotic results See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)

Stochastic approximation Goal: Minimizing a function f defined on R d given only unbiased estimates f n(θ n ) of its gradients f (θ n ) at certain points θ n R d Machine learning - statistics loss for a single pair of observations: f n (θ) = l(y n,θ Φ(x n )) f(θ) = Ef n (θ) = El(y n,θ Φ(x n )) = generalization error Expected gradient: f (θ) = Ef n(θ) = E { l (y n,θ Φ(x n ))Φ(x n ) } Non-asymptotic results Number of iterations = number of observations

Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations

Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results

Stochastic approximation Relationship to online learning Minimize f(θ) = E z l(θ,z) = generalization error of θ Using the gradients of single i.i.d. observations Batch learning Finite set of observations: z 1,...,z n Empirical risk: ˆf(θ) = 1 n n k=1 l(θ,z i) Estimator ˆθ = Minimizer of ˆf(θ) over a certain class Θ Generalization bound using uniform concentration results Online learning Update ˆθ n after each new (potentially adversarial) observation z n Cumulative loss: 1 n n k=1 l(ˆθ k 1,z k ) Online to batch through averaging (Cesa-Bianchi et al., 2004)

Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex

Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α

Convex stochastic approximation Key properties of f and/or f n Smoothness: f B-Lipschitz continuous, f L-Lipschitz continuous Strong convexity: f µ-strongly convex Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro) θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n 1 n k=0 θ k Which learning rate sequence γ n? Classical setting: γ n = Cn α Desirable practical behavior Applicable (at least) to classical supervised learning problems Robustness to (potentially unknown) constants (L,B,µ) Adaptivity to difficulty of the problem (e.g., strong convexity)

Assumptions Stochastic subgradient descent/method f n convex and B-Lipschitz-continuous on { θ 2 D} (f n ) i.i.d. functions such that Ef n = f θ global optimum of f on { θ 2 D} ( Algorithm: θ n = Π D θ n 1 2D ) B n f n(θ n 1 ) Bound: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n Same three-line proof as in the deterministic case Minimax convergence rate Running-time complexity: O(dn) after n iterations

Stochastic subgradient method - proof - I Iteration: θ n = Π D (θ n 1 γ n f n(θ n 1 )) with γ n = 2D B n F n : information up to time n f n(θ) 2 B and θ 2 D, unbiased gradients/functions E(f n F n 1 ) = f θ n θ 2 2 θ n 1 θ γ n f (θ n 1 ) 2 2 by contractivity of projections θ n 1 θ 2 2 +B 2 γ 2 n 2γ n (θ n 1 θ ) f n(θ n 1 ) because f n(θ n 1 ) 2 B E [ θ n θ 2 2 F n 1 ] θn 1 θ 2 2+B 2 γ 2 n 2γ n (θ n 1 θ ) f (θ n 1 ) θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ f(θn 1 ) f(θ ) ] (subgradient property) E θ n θ 2 2 E θ n 1 θ 2 2+B 2 γ 2 n 2γ n [ Ef(θn 1 ) f(θ ) ] leading to Ef(θ n 1 ) f(θ ) B2 γ n 2 + 1 [ ] E θn 1 θ 2 2γ 2 E θ n θ 2 2 n

Stochastic subgradient method - proof - II Starting from Ef(θ n 1 ) f(θ ) B2 γ n 2 + 1 [ ] E θn 1 θ 2 2 E θ n θ 2 2 2γ n n [ Ef(θu 1 ) f(θ ) ] u=1 n u=1 n u=1 B 2 γ u 2 B 2 γ u 2 + n u=1 1 [ E θu 1 θ 2 2γ 2 E θ u θ 2 ] 2 u + 4D2 2γ n 2DB n with γ n = 2D B n Using convexity: Ef ( 1 n n 1 k=0 θ k ) f(θ ) 2DB n

Stochastic subgradient descent - strong convexity - I Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f f µ-strongly convex on { θ 2 D} θ global optimum of f over { θ 2 D} ) 2 Algorithm: θ n = Π D (θ n 1 µ(n+1) f n(θ n 1 ) Bound: ( 2 Ef n(n+1) n ) kθ k 1 k=1 f(θ ) 2B2 µ(n+1) Same three-line proof than in the deterministic case Minimax convergence rate

Stochastic subgradient descent - strong convexity - II Assumptions f n convex and B-Lipschitz-continuous (f n ) i.i.d. functions such that Ef n = f θ global optimum of g = f + µ 2 2 2 No compactness assumption - no projections Algorithm: 2 2 [ ] θ n = θ n 1 µ(n+1) g n(θ n 1 ) = θ n 1 f µ(n+1) n (θ n 1 )+µθ n 1 ( 2 Bound: Eg n(n+1) n ) kθ k 1 k=1 Minimax convergence rate g(θ ) 2B2 µ(n+1)

Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Many contributions in optimization and online learning: Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single algorithm with global adaptive convergence rate for smooth problems?

Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems Non-asymptotic analysis for smooth problems? problems with convergence rate O(min{1/µn,1/ n})

Smoothness/convexity assumptions Iteration: θ n = θ n 1 γ n f n(θ n 1 ) Polyak-Ruppert averaging: θ n = 1 n n 1 k=0 θ k Smoothness of f n : For each n 1, the function f n is a.s. convex, differentiable with L-Lipschitz-continuous gradient f n: Smooth loss and bounded data Strong convexity of f: The function f is strongly convex with respect to the norm, with convexity constant µ > 0: Invertible population covariance matrix or regularization by µ 2 θ 2

Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C

Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Forgetting of initial conditions Robustness to the choice of C Convergence rates for E θ n θ 2 and E θ n θ 2 ( σ 2 γ ) n no averaging: O +O(e µnγ n ) θ 0 θ 2 µ averaging: trh(θ ) 1 ( +µ 1 O(n 2α +n 2+α θ0 θ 2 ) )+O n µ 2 n 2

Robustness to wrong constants for γ n = Cn α f(θ) = 1 2 θ 2 with i.i.d. Gaussian noise (d = 1) Left: α = 1/2 Right: α = 1 log[f(θ n ) f ] 5 0 α = 1/2 sgd C=1/5 ave C=1/5 sgd C=1 ave C=1 sgd C=5 ave C=5 log[f(θ n ) f ] 5 0 α = 1 sgd C=1/5 ave C=1/5 sgd C=1 ave C=1 sgd C=5 ave C=5 5 0 2 4 log(n) 5 0 2 4 log(n) See also http://leon.bottou.org/projects/sgd

Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants

Summary of new results (Bach and Moulines, 2011) Stochastic gradient descent with learning rate γ n = Cn α Strongly convex smooth objective functions Old: O(n 1 ) rate achieved without averaging for α = 1 New: O(n 1 ) rate achieved with averaging for α [1/2,1] Non-asymptotic analysis with explicit constants Non-strongly convex smooth objective functions Old: O(n 1/2 ) rate achieved with averaging for α = 1/2 New: O(max{n 1/2 3α/2,n α/2,n α 1 }) rate achieved without averaging for α [1/3,1] Take-home message Use α = 1/2 with averaging to be adaptive to strong convexity

Beyond stochastic gradient method Adding a proximal step Goal: min θ R df(θ)+ω(θ) = Ef n(θ)+ω(θ) Replace recursion θ n = θ n 1 γ n f n(θ n ) by θ n = min θ R d θ θ n 1 +γ n f n(θ n ) 2 2 +CΩ(θ) Xiao (2010); Hu et al. (2009) May be accelerated Ghadimi and Lan (2013) Related frameworks Regularized dual averaging (Nesterov, 2009; Xiao, 2010) Mirror descent (Nemirovski et al., 2009; Lan et al., 2012)

Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

Convex stochastic approximation Existing work Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012) Strongly convex: O((µn) 1 ) Attainedbyaveragedstochasticgradientdescentwithγ n (µn) 1 Non-strongly convex: O(n 1/2 ) Attained by averaged stochastic gradient descent with γ n n 1/2 Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988) All step sizes γ n = Cn α with α (1/2,1) lead to O(n 1 ) for smooth strongly convex problems A single adaptive algorithm for smooth problems with convergence rate O(min{1/µn,1/ n}) in all situations?

Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ)

Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) logistic loss

Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ Proof based on self-concordance (Nesterov and Nemirovski, 1994)

Adaptive algorithm for logistic regression Logistic regression: (Φ(x n ),y n ) R d { 1,1} Single data point: f n (θ) = log(1+exp( y n θ Φ(x n ))) Generalization error: f(θ) = Ef n (θ) Cannot be strongly convex local strong convexity unless restricted to θ Φ(x n ) M (and with constants e M ) µ = lowest eigenvalue of the Hessian at the optimum f (θ ) n steps of averaged SGD with constant step-size 1/ ( 2R 2 n ) with R = radius of data (Bach, 2013): Ef( θ n ) f(θ ) min { } 1 n, R2 (15+5R θ0 θ ) 4 nµ A single adaptive algorithm for smooth problems with convergence rate O(1/n) in all situations?

Least-mean-square (LMS) algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ) θ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n )Φ(x n ) ] = H µ Id

Least-mean-square (LMS) algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ) θ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n )Φ(x n ) ] = H µ Id New analysis for averaging and constant step-size γ = 1/(4R 2 ) Assume Φ(x n ) R and y n Φ(x n ) θ σ almost surely No assumption regarding lowest eigenvalues of H Main result: Ef( θ n ) f(θ ) 4σ2 d n + 2R2 θ 0 θ 2 n Matches statistical lower bound (Tsybakov, 2003) Non-asymptotic robust version of Györfi and Walk (1996)

Least-mean-square (LMS) algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ) θ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n )Φ(x n ) ] = H µ Id New analysis for averaging and constant step-size γ = 1/(4R 2 ) Assume Φ(x n ) R and y n Φ(x n ) θ σ almost surely No assumption regarding lowest eigenvalues of H Main result: Ef( θ n ) f(θ ) 4σ2 d n + 2R2 θ 0 θ 2 n Improvement of bias term (Flammarion and Bach, 2014): { R 2 θ 0 θ 2 min, R4 (θ 0 θ ) H 1 (θ 0 θ ) } n n 2

Least-mean-square (LMS) algorithm Least-squares: f(θ) = 1 2 E[ (y n Φ(x n ) θ) 2] with θ R d SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) usually studied without averaging and decreasing step-sizes with strong convexity assumption E [ Φ(x n )Φ(x n ) ] = H µ Id New analysis for averaging and constant step-size γ = 1/(4R 2 ) Assume Φ(x n ) R and y n Φ(x n ) θ σ almost surely No assumption regarding lowest eigenvalues of H Main result: Ef( θ n ) f(θ ) 4σ2 d n + 2R2 θ 0 θ 2 n Extension to Hilbert spaces (Dieuleveult and Bach, 2014): Achieves minimax statistical rates given decay of spectrum of H

Least-squares - Proof technique LMS recursion with ε n = y n Φ(x n ) θ : θ n θ = [ I γφ(x n )Φ(x n ) ] (θ n 1 θ )+γε n Φ(x n ) Simplified LMS recursion: with H = E [ Φ(x n )Φ(x n ) ] θ n θ = [ I γh ] (θ n 1 θ )+γε n Φ(x n ) Direct proof technique of Polyak and Juditsky (1992), e.g., θ n θ = [ I γh ] n (θ0 θ )+γ Exact computations n k=1 [ I γh ] n kεk Φ(x k ) Infinite expansion of Aguech, Moulines, and Priouret(2000) in powers of γ

Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ) θ ) 2 θ n = θ n 1 γ ( Φ(x n ) θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ)

Markov chain interpretation of constant step sizes LMS recursion for f n (θ) = 1 2 ( yn Φ(x n ) θ ) 2 θ n = θ n 1 γ ( Φ(x n ) θ n 1 y n ) Φ(xn ) The sequence (θ n ) n is a homogeneous Markov chain convergence to a stationary distribution π γ with expectation θ def γ = θπ γ (dθ) For least-squares, θ γ = θ θ n does not converge to θ but oscillates around it oscillations of order γ cf. Kaczmarz method (Strohmer and Vershynin, 2009) Ergodic theorem: Averaged iterates converge to θ γ = θ at rate O(1/n)

Simulations - synthetic examples Gaussian distributions - d = 20 0 synthetic square log 10 [f(θ) f(θ * )] 1 2 3 1/2R 2 1/8R 2 4 1/32R 2 1/2R 2 n 1/2 5 0 2 4 6 log 10 (n)

Simulations - benchmarks alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000) log 10 [f(θ) f(θ * )] 1 0.5 0 0.5 1 1.5 2 alpha square C=1 test 1/R 2 1/R 2 n 1/2 SAG 0 2 4 6 log (n) 10 1 0.5 0 0.5 1 1.5 2 alpha square C=opt test C/R 2 C/R 2 n 1/2 SAG 0 2 4 6 log (n) 10 0.2 news square C=1 test 0.2 news square C=opt test log 10 [f(θ) f(θ * )] 0 0.2 0.4 0.6 0.8 1/R 2 1/R 2 n 1/2 SAG 0 2 4 log (n) 10 0 0.2 0.4 0.6 0.8 C/R 2 C/R 2 n 1/2 SAG 0 2 4 log 10 (n)

Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0

Beyond least-squares - Markov chain interpretation Recursion θ n = θ n 1 γf n(θ n 1 ) also defines a Markov chain Stationary distribution π γ such that f (θ)π γ (dθ) = 0 When f is not linear, f ( θπ γ (dθ)) f (θ)π γ (dθ) = 0 θ n oscillates around the wrong value θ γ θ moreover, θ θ n = O p ( γ) Ergodic theorem averaged iterates converge to θ γ θ at rate O(1/n) moreover, θ θ γ = O(γ) (Bach, 2013) NB: coherent with earlier results by Nedic and Bertsekas (2000)

Simulations - synthetic examples Gaussian distributions - d = 20 0 synthetic logistic 1 log 10 [f(θ) f(θ * )] 1 2 1/2R 2 3 1/8R 2 4 1/32R 2 5 1/2R 2 n 1/2 0 2 4 6 log 10 (n)

Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions 3. Newton s method squares the error at each iteration for smooth functions 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(d) per iteration for linear predictions

Restoring convergence through online Newton steps Known facts 1. Averaged SGD with γ n n 1/2 leads to robust rate O(n 1/2 ) for all convex functions 2. Averaged SGD with γ n constant leads to robust rate O(n 1 ) for all convex quadratic functions 3. Newton s method squares the error at each iteration for smooth functions 4. A single step of Newton s method is equivalent to minimizing the quadratic Taylor expansion Online Newton step Rate: O((n 1/2 ) 2 +n 1 ) = O(n 1 ) Complexity: O(d) per iteration for linear predictions

Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n,θ Φ(x n )) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+f ( θ) (θ θ)+ 1 2 (θ θ) f ( θ)(θ θ) = f( θ)+ef n( θ) (θ θ)+ 1 2 (θ θ) Ef n( θ)(θ θ) [ = E f( θ)+f n( θ) (θ θ)+ 1 2 (θ θ) f n( θ)(θ θ) ]

Restoring convergence through online Newton steps The Newton step for f = Ef n (θ) def = E [ l(y n,θ Φ(x n )) ] at θ is equivalent to minimizing the quadratic approximation g(θ) = f( θ)+f ( θ) (θ θ)+ 1 2 (θ θ) f ( θ)(θ θ) = f( θ)+ef n( θ) (θ θ)+ 1 2 (θ θ) Ef n( θ)(θ θ) [ = E f( θ)+f n( θ) (θ θ)+ 1 2 (θ θ) f n( θ)(θ θ) ] Complexity of least-mean-square recursion for g is O(d) θ n = θ n 1 γ [ f n( θ)+f n( θ)(θ n 1 θ) ] f n( θ) = l (y n,φ(x n ) θ)φ(xn )Φ(x n ) has rank one New online Newton step without computing/inverting Hessians

Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(d/n) for logistic regression Additional assumptions but no strong convexity

Logistic regression - Proof technique Using generalized self-concordance of ϕ : u log(1+e u ): ϕ (u) ϕ (u) NB: difference with regular self-concordance: ϕ (u) 2ϕ (u) 3/2 Using novel high-probability convergence results for regular averaged stochastic gradient descent Requires assumption on the kurtosis in every direction, i.e., E(Φ(x n ) η) 4 κ [ E(Φ(x n ) η) 2] 2

Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(d/n) for logistic regression Additional assumptions but no strong convexity

Choice of support point for online Newton step Two-stage procedure (1) Run n/2 iterations of averaged SGD to obtain θ (2) Run n/2 iterations of averaged constant step-size LMS Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000) Provable convergence rate of O(d/n) for logistic regression Additional assumptions but no strong convexity Update at each iteration using the current averaged iterate Recursion: θ n = θ n 1 γ [ f n( θ n 1 )+f n( θ n 1 )(θ n 1 θ n 1 ) ] No provable convergence rate (yet) but best practical behavior Note (dis)similarity with regular SGD: θ n = θ n 1 γf n(θ n 1 )

Simulations - synthetic examples Gaussian distributions - d = 20 0 synthetic logistic 1 0 synthetic logistic 2 log 10 [f(θ) f(θ * )] 1 2 2 1/2R 2 3 1/8R 2 3 every 2 p 4 1/32R 2 every iter. 4 2 step 5 1/2R 2 n 1/2 5 2 step dbl. 0 2 4 6 0 2 4 6 log 10 (n) log 10 (n) log 10 [f(θ) f(θ * )] 1

Simulations - benchmarks alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000) alpha logistic C=1 test alpha logistic C=opt test 0.5 0.5 log 10 [f(θ) f(θ * )] 0 0.5 1 1/R 2 1.5 1/R 2 n 1/2 SAG 2 Adagrad Newton 2.5 0 2 4 log (n) 10 6 0 0.5 1 C/R 2 1.5 C/R 2 n 1/2 SAG 2 Adagrad Newton 2.5 0 2 4 log (n) 10 6 0.2 news logistic C=1 test 0.2 news logistic C=opt test log 10 [f(θ) f(θ * )] 0 0.2 0.4 1/R 2 0.6 1/R 2 n 1/2 SAG 0.8 Adagrad 1 Newton 0 2 4 log (n) 10 0 0.2 0.4 C/R 2 0.6 C/R 2 n 1/2 SAG 0.8 Adagrad 1 Newton 0 2 4 log 10 (n)

Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x))

Going beyond a single pass over the data Stochastic approximation Assumes infinite data stream Observations are used only once Directly minimizes testing cost E (x,y) l(y,θ Φ(x)) Machine learning practice Finite data set (x 1,y 1,...,x n,y n ) Multiple passes Minimizes training cost 1 n n i=1 l(y i,θ Φ(x i )) Need to regularize (e.g., by the l 2 -norm) to avoid overfitting Goal: minimize g(θ) = 1 n n f i (θ) i=1

Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1

Stochastic vs. deterministic methods Minimizing g(θ) = 1 n n f i (θ) with f i (θ) = l ( y i,θ Φ(x i ) ) +µω(θ) i=1 Batchgradientdescent:θ t = θ t 1 γ t g (θ t 1 ) = θ t 1 γ t n Linear (e.g., exponential) convergence rate in O(e αt ) Iteration complexity is linear in n (with line search) n f i(θ t 1 ) i=1 Stochastic gradient descent: θ t = θ t 1 γ t f i(t) (θ t 1) Sampling with replacement: i(t) random element of {1,...,n} Convergence rate in O(1/t) Iteration complexity is independent of n (step size selection?)

Stochastic vs. deterministic methods Goal = best of both worlds: Linear rate with O(1) iteration cost Robustness to step size log(excess cost) stochastic deterministic time

Stochastic vs. deterministic methods Goal = best of both worlds: Linear rate with O(1) iteration cost Robustness to step size log(excess cost) hybrid time stochastic deterministic

Accelerating gradient methods - Related work Nesterov acceleration Nesterov (1983, 2004) Better linear rate but still O(n) iteration cost Hybrid methods, incremental average gradient, increasing batch size Bertsekas (1997); Blatt et al. (2008); Friedlander and Schmidt (2011) Linear rate, but iterations make full passes through the data.

Accelerating gradient methods - Related work Momentum, gradient/iterate averaging, stochastic version of accelerated batch gradient methods Polyak and Juditsky (1992); Tseng (1998); Sunehag et al. (2009); Ghadimi and Lan (2010); Xiao (2010) Can improve constants, but still have sublinear O(1/t) rate Constant step-size stochastic gradient (SG), accelerated SG Kesten (1958); Delyon and Juditsky (1993); Solodov (1998); Nedic and Bertsekas (2000) Linear convergence, but only up to a fixed tolerance. Stochastic methods in the dual Shalev-Shwartz and Zhang (2012) Similar linear rate but limited choice for the f i s

Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Stochastic average gradient (SAG) iteration Keep in memory the gradients of all functions f i, i = 1,...,n Random selection i(t) {1,...,n} with replacement { Iteration: θ t = θ t 1 γ n t f yi t with yi t n = i (θ t 1 ) if i = i(t) otherwise i=1 y t 1 i

Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012) Stochastic average gradient (SAG) iteration Keep in memory the gradients of all functions f i, i = 1,...,n Random selection i(t) {1,...,n} with replacement { Iteration: θ t = θ t 1 γ n t f yi t n with yt i = i (θ t 1 ) if i = i(t) otherwise i=1 y t 1 i Stochastic version of incremental average gradient(blatt et al., 2008) Extra memory requirement Supervised machine learning If f i (θ) = l i (y i,φ(x i ) θ), then f i (θ) = l i (y i,φ(x i ) θ)φ(x i ) Only need to store n real numbers

Stochastic average gradient - Convergence analysis Assumptions Each f i is L-smooth, i = 1,...,n g= 1 n n i=1 f i is µ-strongly convex (with potentially µ = 0) constant step size γ t = 1/(16L) initialization with one pass of averaged SGD

Stochastic average gradient - Convergence analysis Assumptions Each f i is L-smooth, i = 1,...,n g= 1 n n i=1 f i is µ-strongly convex (with potentially µ = 0) constant step size γ t = 1/(16L) initialization with one pass of averaged SGD Strongly convex case (Le Roux et al., 2012, 2013) E [ g(θ t ) g(θ ) ] ( 8σ 2 nµ + 4L θ 0 θ 2 n ) exp ( { 1 tmin 8n, µ }) 16L Linear (exponential) convergence rate with ( O(1) iteration { cost 1 After one pass, reduction of cost by exp min 8, nµ }) 16L

Stochastic average gradient - Convergence analysis Assumptions Each f i is L-smooth, i = 1,...,n g= 1 n n i=1 f i is µ-strongly convex (with potentially µ = 0) constant step size γ t = 1/(16L) initialization with one pass of averaged SGD Non-strongly convex case (Le Roux et al., 2013) E [ g(θ t ) g(θ ) ] 48 σ2 +L θ 0 θ 2 n n t Improvement over regular batch and stochastic gradient Adaptivity to potentially hidden strong convexity

Convergence analysis - Proof sketch Main step: find good Lyapunov function J(θ t,y t 1,...,y t n) such that E [ J(θ t,y t 1,...,y t n) F t 1 ] < J(θt 1,y t 1 1,...,y t 1 n ) no natural candidates Computer-aided proof Parameterize function J(θ t,y t 1,...,y t n) = g(θ t ) g(θ )+quadratic Solve semidefinite program to obtain candidates (that depend on n,µ,l) Check validity with symbolic computations

Rate of convergence comparison Assume that L = 100, µ =.01, and n = 80000 Full gradient method has rate ( 1 µ L) = 0.9999 Accelerated gradient method has rate ( 1 µ ) L = 0.9900 Running n iterations of SAG for the same cost has rate ( 1 1 n 8n) = 0.8825 Fastest possible first-order method has rate ( ) 2 L µ L+ µ = 0.9608 Beating two lower bounds (with additional assumptions) (1) stochastic gradient and (2) full gradient

Stochastic average gradient Implementation details and extensions The algorithm can use sparsity in the features to reduce the storage and iteration cost Grouping functions together can further reduce the memory requirement We have obtained good performance when L is not known with a heuristic line-search Algorithm allows non-uniform sampling Possibility of making proximal, coordinate-wise, and Newton-like variants

spam dataset (n = 92 189, d = 823 470)

Summary and future work Constant-step-size averaged stochastic gradient descent Reaches convergence rate O(1/n) in all regimes Improves on the O(1/ n) lower-bound of non-smooth problems Efficient online Newton step for non-quadratic problems Robustness to step-size selection Going beyond a single pass through the data

Summary and future work Constant-step-size averaged stochastic gradient descent Reaches convergence rate O(1/n) in all regimes Improves on the O(1/ n) lower-bound of non-smooth problems Efficient online Newton step for non-quadratic problems Robustness to step-size selection Going beyond a single pass through the data Extensions and future work Pre-conditioning Proximal extensions fo non-differentiable terms kernels and non-parametric estimation line-search parallelization

Outline 1. Large-scale machine learning and optimization Traditional statistical analysis Classical methods for convex optimization 2. Non-smooth stochastic approximation Stochastic (sub)gradient and averaging Non-asymptotic results and lower bounds Strongly convex vs. non-strongly convex 3. Smooth stochastic approximation algorithms Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes 5. Finite data sets

Conclusions Machine learning and convex optimization Statistics with or without optimization? Significance of mixing algorithms with analysis Benefits of mixing algorithms with analysis Open problems Non-parametric stochastic approximation Going beyond a single pass over the data (testing performance) Characterization of implicit regularization of online methods Further links between convex optimization and online learning/bandits

References A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. Information Theory, IEEE Transactions on, 58(5):3235 3249, 2012. R. Aguech, E. Moulines, and P. Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM J. Control and Optimization, 39(3):872 899, 2000. F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Technical Report 00804431, HAL, 2013. F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Adv. NIPS, 2011. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Technical Report 00613125, HAL, 2011. F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Structured sparsity through convex optimization, 2012. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009. Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic approximations. Springer Publishing Company, Incorporated, 2012. D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913 926, 1997.

D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29 51, 2008. L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008. L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137 151, 2005. S. Boucheron and P. Massart. A high-dimensional wilks phenomenon. Probability theory and related fields, 150(3-4):405 433, 2011. S. Boucheron, O. Bousquet, G. Lugosi, et al. Theory of classification: A survey of some recent advances. ESAIM Probability and statistics, 9:323 375, 2005. N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. Information Theory, IEEE Transactions on, 50(9):2050 2057, 2004. B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 3: 868 881, 1993. J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899 2934, 2009. ISSN 1532-4435. M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. arxiv:1104.2373, 2011. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July, 2010. Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal

on Optimization, 23(4):2061 2089, 2013. L. Györfi and H. Walk. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31 61, 1996. E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169 192, 2007. Chonghai Hu, James T Kwok, and Weike Pan. Accelerated gradient methods for stochastic optimization and online learning. In NIPS, volume 22, pages 781 789, 2009. H. Kesten. Accelerated stochastic approximation. Ann. Math. Stat., 29(1):41 59, 1958. H. J. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications. Springer-Verlag, second edition, 2003. Guanghui Lan, Arkadi Nemirovski, and Alexander Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical programming, 134(2):425 458, 2012. N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Adv. NIPS, 2012. N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report 00674995, HAL, 2013. O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission. Wiley West Sussex, 1995. A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. Stochastic Optimization: Algorithms and Applications, pages 263 304, 2000.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009. A. S. Nemirovsky and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley & Sons, 1983. Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k 2 ). Soviet Math. Doklady, 269(3):543 547, 1983. Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004. Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Tech. Rep, 76, 2007. Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120 (1):221 259, 2009. Y. Nesterov and A. Nemirovski. Interior-point polynomial algorithms in convex programming. SIAM studies in Applied Mathematics, 1994. Y. Nesterov and J. P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6): 1559 1568, 2008. ISSN 0005-1098. B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992. H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statistics, 22:400 407, 1951. ISSN 0003-4851. D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report

781, Cornell University Operations Research and Industrial Engineering, 1988. M. Schmidt, N. Le Roux, and F. Bach. Optimization with approximate gradients. Technical report, HAL, 2011. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001. S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc. ICML, 2008. S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Technical Report 1209.1873, Arxiv, 2012. S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In Proc. ICML, 2007. S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In proc. COLT, 2009. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. M.V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23 35, 1998. K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. 2008. Thomas Strohmer and Roman Vershynin. A randomized kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262 278, 2009. P. Sunehag, J. Trumpf, SVN Vishwanathan, and N. Schraudolph. Variable metric stochastic approximation theory. International Conference on Artificial Intelligence and Statistics, 2009.

P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506 531, 1998. A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003. A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge Univ. press, 2000. L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 9:2543 2596, 2010. ISSN 1532-4435.