Stochastic gradient methods for machine learning




Stochastic gradient methods for machine learning
Francis Bach, INRIA - Ecole Normale Supérieure, Paris, France
Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - April 2013

Context: machine learning for big data
Large-scale machine learning: large p, large n, large k
- p: dimension of each observation (input)
- k: number of tasks (dimension of outputs)
- n: number of observations
Examples: computer vision, bioinformatics, signal processing
Ideal running-time complexity: O(pn + kn)
Going back to simple methods:
- Stochastic gradient methods (Robbins and Monro, 1951)
- Mixing statistics and optimization

Outline
- Introduction: supervised machine learning and convex optimization
- Stochastic approximation algorithms (Bach and Moulines, 2011; Bach, 2013)
  - Stochastic gradient and averaging
  - Strongly convex vs. non-strongly convex
  - Adaptivity
- Going beyond stochastic gradient (Le Roux, Schmidt, and Bach, 2012, 2013)
  - More than a single pass through the data
  - Linear (exponential) convergence rate for strongly convex functions

Supervised machine learning
- Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \ldots, n$, i.i.d.
- Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathcal{F} = \mathbb{R}^p$
- (Regularized) empirical risk minimization: find $\hat\theta$ solution of
  $\min_{\theta \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$
  (convex data-fitting term + regularizer)
- Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$ = training cost
- Expected risk: $f(\theta) = \mathbb{E}_{(x,y)}\, \ell(y, \theta^\top \Phi(x))$ = testing cost
- Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$; they may be tackled simultaneously
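As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of this objective and its gradient for the logistic loss with the $\ell_2$ regularizer $\Omega(\theta) = \frac{1}{2}\|\theta\|^2$; all function names are illustrative.

    import numpy as np

    def erm_objective(theta, Phi, y, mu):
        # (1/n) sum_i log(1 + exp(-y_i * theta^T Phi(x_i))) + (mu/2) ||theta||^2
        margins = y * (Phi @ theta)
        risk = np.mean(np.logaddexp(0.0, -margins))  # numerically stable log(1 + e^{-m})
        return risk + 0.5 * mu * theta @ theta

    def erm_gradient(theta, Phi, y, mu):
        margins = y * (Phi @ theta)
        coeffs = -y / (1.0 + np.exp(margins))        # loss derivative w.r.t. the margin, times y_i
        return Phi.T @ coeffs / len(y) + mu * theta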

Smoothness and strong convexity
- A function $g : \mathbb{R}^p \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and $\forall \theta \in \mathbb{R}^p,\ g''(\theta) \preceq L \cdot \mathrm{Id}$
- Machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$: the Hessian is bounded by the covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i)\Phi(x_i)^\top$ (up to the bound on $\ell''$)
- Bounded data implies smoothness
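For instance (a hypothetical helper, not from the slides), a smoothness bound can be computed from the data covariance; for the logistic loss the curvature bound $\ell'' \leq 1/4$ applies.

    import numpy as np

    def smoothness_bound(Phi, curvature=0.25):
        # L <= curvature * lambda_max( (1/n) sum_i Phi(x_i) Phi(x_i)^T )
        cov = Phi.T @ Phi / Phi.shape[0]
        return curvature * np.linalg.eigvalsh(cov)[-1]  # eigvalsh is ascending; take the largest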

Smoothness and strong convexity
- A function $g : \mathbb{R}^p \to \mathbb{R}$ is $\mu$-strongly convex if and only if $\forall \theta_1, \theta_2 \in \mathbb{R}^p,\ g(\theta_1) \geq g(\theta_2) + \langle g'(\theta_2), \theta_1 - \theta_2 \rangle + \frac{\mu}{2}\|\theta_1 - \theta_2\|^2$
- If $g$ is twice differentiable: $\forall \theta \in \mathbb{R}^p,\ g''(\theta) \succeq \mu \cdot \mathrm{Id}$
- Machine learning with $g(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$: Hessian related to the covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i)\Phi(x_i)^\top$
- Strong convexity from data with an invertible covariance matrix (low correlation/dimension)... or with added regularization by $\frac{\mu}{2}\|\theta\|^2$

Iterative methods for minimizing smooth functions
Assumption: $g$ convex and smooth on $\mathcal{F} = \mathbb{R}^p$
- Gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1})$
  - $O(1/t)$ convergence rate for convex functions
  - $O(e^{-\rho t})$ convergence rate for strongly convex functions
- Newton method: $\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1} g'(\theta_{t-1})$
  - $O(e^{-\rho 2^t})$ convergence rate
Key insights from Bottou and Bousquet (2008):
1. In machine learning, there is no need to optimize below the statistical error
2. In machine learning, cost functions are averages
=> Stochastic approximation (both updates are sketched below)
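An illustrative sketch of the two updates (assuming NumPy; grad and hess are user-supplied callables, not defined in the slides):

    import numpy as np

    def gradient_descent(grad, theta0, gamma, iters):
        # theta_t = theta_{t-1} - gamma * g'(theta_{t-1}); O(p) cost per step
        theta = theta0.copy()
        for _ in range(iters):
            theta -= gamma * grad(theta)
        return theta

    def newton(grad, hess, theta0, iters):
        # theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1}); O(p^3) per step
        theta = theta0.copy()
        for _ in range(iters):
            theta -= np.linalg.solve(hess(theta), grad(theta))
        return theta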

Stochastic approximation
Goal: minimize a function $f$ defined on $\mathcal{F} = \mathbb{R}^p$, given only unbiased estimates $f_n'(\theta_n)$ of its gradients $f'(\theta_n)$ at certain points $\theta_n \in \mathcal{F}$
- Observation of $f_n'(\theta_n) = f'(\theta_n) + \varepsilon_n$, with $\varepsilon_n$ an i.i.d. noise
- Also covers non-convex problems
Machine learning / statistics:
- Loss for a single pair of observations: $f_n(\theta) = \ell(y_n, \theta^\top \Phi(x_n))$
- $f(\theta) = \mathbb{E} f_n(\theta) = \mathbb{E}\, \ell(y_n, \theta^\top \Phi(x_n))$ = generalization error
- Expected gradient: $f'(\theta) = \mathbb{E} f_n'(\theta) = \mathbb{E}\big\{\ell'(y_n, \theta^\top \Phi(x_n))\, \Phi(x_n)\big\}$
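In this setting the unbiased gradient oracle is simply the gradient on one random example; a minimal sketch for the logistic loss (illustrative names, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def stochastic_gradient(theta, Phi, y):
        # f_n'(theta) = l'(y_n, theta^T Phi(x_n)) * Phi(x_n), on one random example
        n = rng.integers(len(y))
        margin = y[n] * (Phi[n] @ theta)
        return (-y[n] / (1.0 + np.exp(margin))) * Phi[n]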

Convex smooth stochastic approximation
- Key assumption: smoothness and/or strong convexity
- Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
  $\theta_n = \theta_{n-1} - \gamma_n f_n'(\theta_{n-1})$
- Polyak-Ruppert averaging: $\bar\theta_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k$ (see the sketch below)
- Which learning-rate sequence $\gamma_n$? Classical setting: $\gamma_n = C n^{-\alpha}$
Desiderata:
- Desirable practical behavior
- Applicable (at least) to least-squares and logistic regression
- Robustness to (potentially unknown) constants ($L$, $\mu$)
- Adaptivity to the difficulty of the problem (e.g., strong convexity)
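A minimal sketch of this recursion with Polyak-Ruppert averaging, assuming a one-argument oracle such as lambda th: stochastic_gradient(th, Phi, y) from above (names illustrative):

    def averaged_sgd(stoch_grad, theta0, n_steps, C=1.0, alpha=0.5):
        # theta_n = theta_{n-1} - C * n^(-alpha) * f_n'(theta_{n-1})
        theta = theta0.copy()
        theta_bar = theta0.copy()
        for n in range(1, n_steps + 1):
            theta = theta - C * n ** (-alpha) * stoch_grad(theta)
            theta_bar = theta_bar + (theta - theta_bar) / (n + 1)  # running average of iterates
        return theta_bar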

Convex stochastic approximation - existing work
Known global minimax rates of convergence (Nemirovski and Yudin, 1983; Agarwal et al., 2010):
- Strongly convex: $O((\mu n)^{-1})$, attained by averaged stochastic gradient descent with $\gamma_n \propto (\mu n)^{-1}$
- Non-strongly convex: $O(n^{-1/2})$, attained by averaged stochastic gradient descent with $\gamma_n \propto n^{-1/2}$
See also Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)

Convex stochastic approximation - existing work
Known global minimax rates of convergence (Nemirovski and Yudin, 1983; Agarwal et al., 2010):
- Strongly convex: $O((\mu n)^{-1})$, attained with $\gamma_n \propto (\mu n)^{-1}$
- Non-strongly convex: $O(n^{-1/2})$, attained with $\gamma_n \propto n^{-1/2}$
Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988):
- All step sizes $\gamma_n = C n^{-\alpha}$ with $\alpha \in (1/2, 1)$ lead to $O(n^{-1})$ for strongly convex problems
A single algorithm with a global convergence rate?

Adaptive algorithm for logistic regression
- Logistic regression: $(x_n, y_n) \in \mathbb{R}^p \times \{-1, 1\}$
- Single data point: $f_n(\theta) = \log(1 + \exp(-y_n \theta^\top x_n))$
- Generalization error: $f(\theta) = \mathbb{E} f_n(\theta)$
- Cannot be strongly convex; only local strong convexity, unless restricted to $|\theta^\top x_n| \leq M$
- $\mu$ = lowest eigenvalue of the Hessian at the optimum, $f''(\theta^*)$
- $n$ steps of averaged SGD with constant step size $1/(2R^2\sqrt{n})$, with $R$ = radius of the data (Bach, 2013):
  $\mathbb{E} f(\bar\theta_n) - f(\theta^*) \leq \min\left\{\frac{1}{\sqrt{n}},\ \frac{R^2 (15 + 5R\|\theta_0 - \theta^*\|)^4}{n\mu}\right\}$
- Proof based on generalized self-concordance (Bach, 2010); the algorithm is sketched below
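A sketch of this constant-step-size variant for logistic regression (assuming NumPy; the step size follows the $1/(2R^2\sqrt{n})$ choice above, everything else is illustrative):

    import numpy as np

    def averaged_sgd_logistic(X, y, n_steps, seed=0):
        rng = np.random.default_rng(seed)
        R2 = np.max(np.sum(X * X, axis=1))            # R^2, squared radius of the data
        gamma = 1.0 / (2.0 * R2 * np.sqrt(n_steps))   # constant step size
        theta = np.zeros(X.shape[1])
        theta_bar = np.zeros(X.shape[1])
        for t in range(1, n_steps + 1):
            i = rng.integers(len(y))
            margin = y[i] * (X[i] @ theta)
            theta = theta - gamma * (-y[i] / (1.0 + np.exp(margin))) * X[i]
            theta_bar = theta_bar + (theta - theta_bar) / (t + 1)  # Polyak-Ruppert averaging
        return theta_bar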

Adaptive algorithm for logistic regression - proof sketch
- Step 1: use the existing result $f(\bar\theta_n) \leq f(\theta^*) + \frac{R^2}{\sqrt{n}} \|\theta_0 - \theta^*\|^2 = O(1/\sqrt{n})$
- Step 2: $f_n'(\theta_{n-1}) = \frac{1}{\gamma}(\theta_{n-1} - \theta_n)$, hence $\frac{1}{n} \sum_{k=1}^n f_k'(\theta_{k-1}) = \frac{1}{n\gamma}(\theta_0 - \theta_n)$
- Step 3: $f'\!\left(\frac{1}{n} \sum_{k=1}^n \theta_{k-1}\right) - \frac{1}{n} \sum_{k=1}^n f'(\theta_{k-1}) = O\big(f(\bar\theta_n) - f(\theta^*)\big) = O(1/\sqrt{n})$, using self-concordance
- Step 4a: if $f$ is $\mu$-strongly convex, $f(\bar\theta_n) - f(\theta^*) \leq \frac{1}{2\mu} \|f'(\bar\theta_n)\|^2$
- Step 4b: if $f$ is self-concordant, this is locally true with $\mu = \lambda_{\min}(f''(\theta^*))$

Conclusions / Extensions
Stochastic approximation for machine learning:
- Mixing convex optimization and statistics
- Non-asymptotic analysis through moment computations
- Averaging with longer steps is (more) robust and adaptive
Future/current work - open problems:
- High-probability bounds through all moments $\mathbb{E}\|\bar\theta_n - \theta^*\|^{2d}$
- Including a non-differentiable term (Xiao, 2010; Lan, 2010)
- Non-random errors (Schmidt, Le Roux, and Bach, 2011)
- Line search for stochastic gradient
- Non-parametric stochastic approximation
- Going beyond a single pass through the data

Going beyond a single pass over the data
Stochastic approximation:
- Assumes an infinite data stream
- Observations are used only once
- Directly minimizes the testing cost $\mathbb{E}_{(x,y)}\, \ell(y, \theta^\top \Phi(x))$
Machine learning practice:
- Finite data set $(x_1, y_1, \ldots, x_n, y_n)$
- Multiple passes
- Minimizes the training cost $\frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$
- Need to regularize (e.g., by the $\ell_2$-norm) to avoid overfitting

Stochastic vs. deterministic methods
Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^n f_i(\theta)$ with $f_i(\theta) = \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu\Omega(\theta)$
- Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, g'(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n f_i'(\theta_{t-1})$
  - Linear (e.g., exponential) convergence rate (with strong convexity)
  - Iteration complexity is linear in $n$
- Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t\, f_{i(t)}'(\theta_{t-1})$
  - Sampling with replacement: $i(t)$ is a random element of $\{1, \ldots, n\}$
  - Convergence rate in $O(1/t)$
  - Iteration complexity is independent of $n$
(A per-iteration cost comparison is sketched below.)
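The cost trade-off in code form (illustrative sketch; grads is assumed to be a list of per-example gradient callables $f_i'$):

    import numpy as np

    rng = np.random.default_rng(0)

    def batch_step(theta, grads, gamma):
        # one batch step touches all n gradients: O(n) cost per iteration
        return theta - gamma * np.mean([g(theta) for g in grads], axis=0)

    def sgd_step(theta, grads, gamma):
        # one stochastic step touches a single random f_i: O(1) cost per iteration
        i = rng.integers(len(grads))
        return theta - gamma * grads[i](theta)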

Stochastic vs. deterministic methods
Goal = best of both worlds: a linear rate with an O(1) iteration cost
[Figure: log(excess cost) vs. time - the stochastic method drops fast initially, the deterministic method wins asymptotically, and the hybrid tracks the lower envelope of both]

Accelerating gradient methods - related work
- Nesterov acceleration (Nesterov, 1983, 2004): better linear rate, but still an O(n) iteration cost
- Hybrid methods, incremental average gradient, increasing batch size (Bertsekas, 1997; Blatt et al., 2008; Friedlander and Schmidt, 2011): linear rate, but iterations make full passes through the data

Accelerating gradient methods - related work
- Momentum, gradient/iterate averaging, stochastic versions of accelerated batch gradient methods (Polyak and Juditsky, 1992; Tseng, 1998; Sunehag et al., 2009; Ghadimi and Lan, 2010; Xiao, 2010): can improve constants, but still have a sublinear $O(1/t)$ rate
- Constant step-size stochastic gradient (SG), accelerated SG (Kesten, 1958; Delyon and Juditsky, 1993; Solodov, 1998; Nedic and Bertsekas, 2000): linear convergence, but only up to a fixed tolerance
- Stochastic methods in the dual (Shalev-Shwartz and Zhang, 2012): linear rate, but a limited choice for the $f_i$'s

Stochastic average gradient (Le Roux, Schmidt, and Bach, 2012)
Stochastic average gradient (SAG) iteration:
- Keep in memory the gradients of all functions $f_i$, $i = 1, \ldots, n$
- Random selection $i(t) \in \{1, \ldots, n\}$ with replacement
- Iteration: $\theta_t = \theta_{t-1} - \frac{\gamma_t}{n} \sum_{i=1}^n y_i^t$, with $y_i^t = f_i'(\theta_{t-1})$ if $i = i(t)$, and $y_i^t = y_i^{t-1}$ otherwise
- Stochastic version of the incremental average gradient method (Blatt et al., 2008)
- Extra memory requirement: for supervised machine learning with $f_i(\theta) = \ell_i(y_i, \Phi(x_i)^\top \theta)$, we have $f_i'(\theta) = \ell_i'(y_i, \Phi(x_i)^\top \theta)\, \Phi(x_i)$, so only $n$ real numbers need to be stored (see the sketch below)
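A sketch of the SAG iteration for logistic regression, exploiting the fact that only one scalar $\ell_i'$ per example needs to be stored (assuming NumPy; gamma can be set to the $1/(16L)$ value used in the analysis below; names are illustrative):

    import numpy as np

    def sag_logistic(X, y, n_iters, gamma, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        theta = np.zeros(p)
        scalars = np.zeros(n)      # memorized loss derivatives l'_i (one real per example)
        grad_sum = np.zeros(p)     # running value of sum_i scalars[i] * X[i]
        for _ in range(n_iters):
            i = rng.integers(n)
            margin = y[i] * (X[i] @ theta)
            new_scalar = -y[i] / (1.0 + np.exp(margin))    # refresh l'_i at theta_{t-1}
            grad_sum += (new_scalar - scalars[i]) * X[i]   # update the stored sum in O(p)
            scalars[i] = new_scalar
            theta -= gamma / n * grad_sum                  # theta_t = theta_{t-1} - (gamma/n) sum_i y_i^t
        return theta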

Stochastic average gradient - convergence analysis
Assumptions:
- Each $f_i$ is $L$-smooth, $i = 1, \ldots, n$
- $g = \frac{1}{n} \sum_{i=1}^n f_i$ is $\mu$-strongly convex (with potentially $\mu = 0$)
- Constant step size $\gamma_t = 1/(16L)$
- Initialization with one pass of averaged SGD
Strongly convex case (Le Roux et al., 2012, 2013):
$\mathbb{E}\big[g(\theta_t) - g(\theta^*)\big] \leq \left(\frac{8\sigma^2}{n} + \frac{4L\|\theta_0 - \theta^*\|^2}{n}\right) \exp\left(-t \min\left\{\frac{1}{8n}, \frac{\mu}{16L}\right\}\right)$
- Linear (exponential) convergence rate with an O(1) iteration cost
- After one pass, reduction of the cost by a factor $\exp\left(-\min\left\{\frac{1}{8}, \frac{n\mu}{16L}\right\}\right)$

Stochastic average gradient - convergence analysis
Non-strongly convex case (Le Roux et al., 2013), under the same assumptions:
$\mathbb{E}\big[g(\theta_t) - g(\theta^*)\big] \leq 48\, \frac{\sigma^2 + L\|\theta_0 - \theta^*\|^2}{\sqrt{n}}\, \frac{\sqrt{n}}{t}$
- Improvement over regular batch and stochastic gradient
- Adaptivity to potentially hidden strong convexity

Convergence analysis - proof sketch
- Main step: find a good Lyapunov function $J(\theta_t, y_1^t, \ldots, y_n^t)$ such that $\mathbb{E}\big[J(\theta_t, y_1^t, \ldots, y_n^t) \mid \mathcal{F}_{t-1}\big] < J(\theta_{t-1}, y_1^{t-1}, \ldots, y_n^{t-1})$
- No natural candidates, hence a computer-aided proof:
  - Parameterize $J(\theta_t, y_1^t, \ldots, y_n^t) = g(\theta_t) - g(\theta^*) + \text{quadratic}$
  - Solve a semidefinite program to obtain candidates (that depend on $n, \mu, L$)
  - Check validity with symbolic computations

Rate of convergence comparison
Assume that $L = 100$, $\mu = 0.01$, and $n = 80{,}000$:
- Full gradient method has rate $(1 - \mu/L) = 0.9999$
- Accelerated gradient method has rate $(1 - \sqrt{\mu/L}) = 0.9900$
- Running $n$ iterations of SAG for the same cost has rate $(1 - 1/(8n))^n = 0.8825$
- Fastest possible first-order method has rate $\left(\frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}\right)^2 = 0.9608$
Beating two lower bounds (with additional assumptions): (1) stochastic gradient and (2) full gradient
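These numbers can be checked directly (a quick verification script, not from the slides):

    import numpy as np

    L, mu, n = 100.0, 0.01, 80000
    print(1 - mu / L)              # full gradient: 0.9999
    print(1 - np.sqrt(mu / L))     # accelerated gradient: 0.9900
    print((1 - 1 / (8 * n)) ** n)  # n SAG iterations: ~0.8825
    r = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    print(r ** 2)                  # first-order lower bound: ~0.9608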

Stochastic average gradient - implementation details and extensions
- The algorithm can use sparsity in the features to reduce the storage and iteration cost
- Grouping functions together can further reduce the memory requirement
- When $L$ is not known, we have obtained good performance with a heuristic line search
- The algorithm allows non-uniform sampling
- Possibility of proximal, coordinate-wise, and Newton-like variants

Stochastic average gradient - simulation experiments
protein dataset ($n = 145{,}751$, $p = 74$), dataset split in two (training/testing)
[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing Steepest, AFG, L-BFGS, pegasos, RDA, SAG with step size 2/(L + nµ), and SAG-LS]

Stochastic average gradient - simulation experiments
cover type dataset ($n = 581{,}012$, $p = 54$), dataset split in two (training/testing)
[Figure: same training/testing cost comparison as above, on the cover type dataset]

Conclusions / Extensions
Stochastic average gradient:
- Going beyond a single pass through the data
- Keeps a memory of all gradients for finite training sets
- Linear convergence rate with O(1) iteration complexity
- Randomization leads to easier analysis and faster rates
Future/current work - open problems:
- Including a non-differentiable term
- Line search
- Using second-order information or non-uniform sampling
- Non-convex problems
- Distributed optimization
- Going beyond finite training sets (bound on the testing cost)

References

A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. Technical report, arXiv:1009.0571, 2010.
F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384-414, 2010.
F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.
F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. arXiv preprint arXiv:1303.6149, 2013.
D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913-926, 1997.
D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29-51, 2008.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS) 20, 2008.
L. Bottou and Y. Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2):137-151, 2005.
B. Delyon and A. Juditsky. Accelerated stochastic approximation. SIAM Journal on Optimization, 3:868-881, 1993.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899-2934, 2009.
M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. arXiv preprint arXiv:1104.2373, 2011.
S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optimization Online, July 2010.
E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2):169-192, 2007.
H. Kesten. Accelerated stochastic approximation. Annals of Mathematical Statistics, 29(1):41-59, 1958.
G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, pages 1-33, 2010.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report 00674995, HAL, 2012.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical Report 00674995, HAL, 2013.
A. Nedic and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 263-304, 2000.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.
Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372-376, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559-1568, 2008.
B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
M. Schmidt, N. Le Roux, and F. Bach. Optimization with approximate gradients. Technical report, HAL, 2011.
S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proc. ICML, 2008.
S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Technical Report 1209.1873, arXiv, 2012.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proc. ICML, 2007.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Conference on Learning Theory (COLT), 2009.
M. V. Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23-35, 1998.
P. Sunehag, J. Trumpf, S. V. N. Vishwanathan, and N. Schraudolph. Variable metric stochastic approximation theory. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
P. Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506-531, 1998.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 9:2543-2596, 2010.