Accelerated Gradient Methods for Stochastic Optimization and Online Learning



Chonghai Hu, James T. Kwok, Weike Pan
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Department of Mathematics, Zhejiang University, Hangzhou, China
hino.hu@gmail.com, {jamesk,weikep}@cse.ust.hk

Abstract

Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1-regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly on such problems. In this paper, we develop a novel accelerated gradient method for stochastic optimization that still preserves their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOBOS, SMIDAS and SCD. Moreover, SAGE can also be extended to online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems.

1 Introduction

Risk minimization is at the heart of many machine learning algorithms. Given a class of models parameterized by w and a loss function ℓ(·,·), the goal is to minimize E_XY[ℓ(w; X, Y)] w.r.t. w, where the expectation is over the joint distribution of the input X and the output Y. However, since the joint distribution is typically unknown in practice, a surrogate problem is to replace the expectation by its empirical average on a training sample {(x_1, y_1), ..., (x_m, y_m)}. Moreover, a regularizer Ω(·) is often added for well-posedness. This leads to the minimization of the regularized risk

    min_w (1/m) Σ_{i=1}^m ℓ(w; x_i, y_i) + λΩ(w),    (1)

where λ is a regularization parameter. In optimization terminology, the deterministic optimization problem in (1) can be considered as a sample average approximation (SAA) of the corresponding stochastic optimization problem:

    min_w E_XY[ℓ(w; X, Y)] + λΩ(w).    (2)
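To make the regularized risk (1) concrete, here is a minimal sketch (our own illustration; the function name and the choice of hinge loss with an ℓ1 regularizer are ours) that evaluates (1) on a sample:

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    # Empirical risk (1): (1/m) * sum_i loss(w; x_i, y_i) + lam * Omega(w),
    # instantiated here with the hinge loss max(0, 1 - y_i <w, x_i>) and
    # the l1 regularizer Omega(w) = ||w||_1.
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.mean() + lam * np.abs(w).sum()
```

For example, when every training point is classified with margin exactly 1, the hinge term vanishes and only the regularizer λ‖w‖₁ remains.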
Since both ℓ(·,·) and Ω(·) are typically convex, (2) is a convex optimization problem which can be conveniently solved even with standard off-the-shelf optimization packages. However, with the proliferation of data-intensive applications in the text and web domains, data sets with millions or trillions of samples are nowadays not uncommon. Hence, off-the-shelf optimization solvers are too slow to be used. Indeed, even tailor-made software for specific models, such as the sequential minimal optimization (SMO) method for the SVM, has superlinear computational

complexities and thus is not feasible for large data sets. In light of this, the use of stochastic methods has recently drawn a lot of interest, and many of these are highly successful. Most are based on (variants of) stochastic gradient descent (SGD). Examples include Pegasos [1], SGD-QN [2], FOBOS [3], and stochastic coordinate descent (SCD) [4]. The main advantages of these methods are that they are simple to implement, have low per-iteration complexity, and can scale up to large data sets. Their runtime is independent of, or even decreases with, the number of training samples [5, 6]. On the other hand, because of their simplicity, these methods have a slow convergence rate, and thus may require a large number of iterations.

While standard gradient schemes have a slow convergence rate, they can often be accelerated. This stems from the pioneering work of Nesterov in 1983 [7], which is a deterministic algorithm for smooth optimization. Recently, it has also been extended to composite optimization, where the objective has a smooth component and a non-smooth component [8, 9]. This is particularly relevant to machine learning since the loss ℓ and the regularizer Ω in (1) may be non-smooth. Examples include loss functions such as the commonly-used hinge loss in the SVM, and regularizers such as the popular ℓ1 penalty in the Lasso [10] and basis pursuit. These accelerated gradient methods have also been successfully applied in the optimization problems of multiple kernel learning [11] and trace norm minimization [12]. Very recently, Lan [13] made an initial attempt to further extend this to stochastic composite optimization, and obtained the convergence rate of

    O( L/N² + (M + σ)/√N ).    (3)

Here, N is the number of iterations performed by the algorithm, L is the Lipschitz parameter of the gradient of the smooth term in the objective, M is the Lipschitz parameter of the non-smooth term, and σ is the variance of the stochastic subgradient. Moreover, note that the first term of (3) is related to the smooth component of the objective while the second term is related to the non-smooth component.
Complexity results [14, 13] show that (3) is the optimal convergence rate for any iterative algorithm solving stochastic (general) convex composite optimization. However, as pointed out in [15], a very useful property that can improve the convergence rates of machine learning optimization problems is strong convexity. For example, (1) can be strongly convex either because of the strong convexity of ℓ (e.g., log loss, square loss) or of Ω (e.g., ℓ2 regularization). On the other hand, [13] is more interested in general convex optimization problems and so strong convexity is not utilized. Moreover, though theoretically interesting, [13] may be of limited practical use as (1) the stepsize in its update rule depends on the often unknown σ; and (2) the number of iterations performed by the algorithm has to be fixed in advance.

Inspired by the successes of Nesterov's method, we develop in this paper a novel accelerated subgradient scheme for stochastic composite optimization. It achieves the optimal convergence rate of O( L/N² + σ/√N ) for general convex objectives, and O( (L + µ)/N² + σ²/(µN) ) for µ-strongly convex objectives. Moreover, its per-iteration complexity is almost as low as that of standard (sub)gradient methods. Finally, we also extend the accelerated gradient scheme to online learning. We obtain O(√N) regret for general convex problems and O(log N) regret for strongly convex problems, which are the best regret bounds currently known for these problems.

2 Setting and Mathematical Background

First, we recapitulate a few notions in convex analysis.

(Lipschitz continuity) A function f(x) is L-Lipschitz if ‖f(x) − f(y)‖ ≤ L‖x − y‖.

Lemma 1. [14] The gradient of a differentiable function f(x) is Lipschitz continuous with Lipschitz parameter L if, for any x and y,

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖².    (4)

(Strong convexity) A function φ(x) is µ-strongly convex if φ(y) ≥ φ(x) + ⟨g(x), y − x⟩ + (µ/2)‖y − x‖² for any x, y and subgradient g(x) ∈ ∂φ(x).

Lemma 2. [14] Let φ(x) be µ-strongly convex and x* = arg min_x φ(x). Then, for any x,

    φ(x) ≥ φ(x*) + (µ/2)‖x − x*‖².    (5)

We consider the following stochastic convex optimization problem with a composite objective function:

    min_x { φ(x) ≡ E[F(x, ξ)] + ψ(x) },    (6)

where ξ is a random vector, f(x) ≡ E[F(x, ξ)] is convex and differentiable, and ψ(x) is convex but non-smooth. Clearly, this includes the optimization problem (1). Moreover, we assume that the gradient of f(x) is L-Lipschitz and φ(x) is µ-strongly convex (with µ ≥ 0). Note that when φ(x) is smooth (ψ(x) ≡ 0), µ lower bounds the smallest eigenvalue of its Hessian.

Recall that in smooth optimization, the gradient update x_{t+1} = x_t − λ∇f(x_t) on a function f(x) can be seen as proximal regularization of the linearized f at the current iterate x_t [16]. In other words, x_{t+1} = arg min_x ( ⟨∇f(x_t), x − x_t⟩ + (1/(2λ))‖x − x_t‖² ). In the presence of a non-smooth component, we have the following more general notion.

(Gradient mapping) [8] In minimizing f(x) + ψ(x), where f is convex and differentiable and ψ is convex and non-smooth,

    x_{t+1} = arg min_x ( ⟨∇f(x_t), x − x_t⟩ + (1/(2λ))‖x − x_t‖² + ψ(x) )    (7)

is called the generalized gradient update, and δ = (1/λ)(x_t − x_{t+1}) is the (generalized) gradient mapping. Note that the quadratic approximation is made to the smooth component only. It can be shown that the gradient mapping is analogous to the gradient in smooth convex optimization [14, 8]. This is also a common construct used in recent stochastic subgradient methods [3, 17].

3 Accelerated Gradient Method for Stochastic Learning

Let G(x_t, ξ_t) ≡ ∇_x F(x, ξ_t)|_{x=x_t} be the stochastic gradient of F(x, ξ_t). We assume that it is an unbiased estimator of the gradient ∇f(x_t), i.e., E_ξ[G(x, ξ)] = ∇f(x). Algorithm 1 shows the proposed algorithm, which will be called SAGE (Stochastic Accelerated GradiEnt). It involves the updating of three sequences {x_t}, {y_t} and {z_t}. Note that y_t is the generalized gradient update, and x_{t+1} is a convex combination of y_t and z_t. The algorithm also maintains two parameter sequences {α_t} and {L_t}. We will see in Section 3.1 that different settings of these parameters lead to different convergence rates.
Note that the only expensive step of Algorithm 1 is the computation of the generalized gradient update y_t, which is analogous to the subgradient computation in other subgradient-based methods. In general, its computational complexity depends on the structure of ψ(x). As will be seen in Section 3.3, it can often be obtained efficiently in many regularized risk minimization problems.

Algorithm 1 SAGE (Stochastic Accelerated GradiEnt).
Input: Sequences {L_t} and {α_t}.
Initialize: y_0 = z_0 = 0, α_0 = λ_1 = 1, L_0 = L + µ.
for t = 1 to N do
    x_t = (1 − α_t) y_{t−1} + α_t z_{t−1}.
    y_t = arg min_x { ⟨G(x_t, ξ_t), x − x_t⟩ + (L_t/2)‖x − x_t‖² + ψ(x) }.
    z_t = z_{t−1} − (L_t α_t + µ)⁻¹ [ L_t(x_t − y_t) + µ(z_{t−1} − x_t) ].
end for
Output y_N.

3.1 Convergence Analysis

Define Δ_t ≡ G(x_t, ξ_t) − ∇f(x_t). Because of the unbiasedness of G(x_t, ξ_t), E_{ξ_t}[Δ_t] = 0. In the following, we will show that the value of φ(y_t) − φ(x) can be related to that of φ(y_{t−1}) − φ(x) for any x. Let δ_t ≡ L_t(x_t − y_t) be the gradient mapping involved in updating y_t. First, we introduce the following lemma.

Lemma 3. For t ≥ 1, φ(x) is quadratically bounded from below as

    φ(x) ≥ φ(y_t) + ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + ⟨Δ_t, y_t − x⟩ + ((2L_t − L)/(2L_t²))‖δ_t‖².
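As a concrete illustration, the following is a minimal NumPy sketch of Algorithm 1 (our own code, with hypothetical names, not from the paper). It assumes ψ(x) = λ‖x‖₁, so that the generalized gradient update y_t reduces to component-wise soft-thresholding, and it uses the parameter choices L_t = b(t+1)^{3/2} + L and α_t = 2/(t+2) of Theorem 1:

```python
import numpy as np

def soft_threshold(v, tau):
    # Component-wise soft-thresholding: argmin_x (1/2)||x - v||^2 + tau*||x||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sage_l1(stoch_grad, lam, dim, n_iter, L=1.0, mu=0.0, b=1.0):
    """Sketch of Algorithm 1 (SAGE) for psi(x) = lam * ||x||_1.

    stoch_grad(x) returns an unbiased stochastic gradient G(x, xi_t) of f.
    """
    y = np.zeros(dim)  # y_{t-1}
    z = np.zeros(dim)  # z_{t-1}
    for t in range(1, n_iter + 1):
        L_t = b * (t + 1.0) ** 1.5 + L      # L_t of Theorem 1
        alpha = 2.0 / (t + 2.0)             # alpha_t of Theorem 1
        x = (1.0 - alpha) * y + alpha * z   # x_t
        g = stoch_grad(x)                   # G(x_t, xi_t)
        # Generalized gradient update y_t: soft-thresholding for psi = lam*||.||_1.
        y_new = soft_threshold(x - g / L_t, lam / L_t)
        # z_t update of Algorithm 1.
        z = z - (L_t * (x - y_new) + mu * (z - x)) / (L_t * alpha + mu)
        y = y_new
    return y
```

On a toy ℓ1-regularized least-squares problem with minibatch gradients, this sketch should approximately recover a sparse ground-truth weight vector.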

Proposition 1. Assume that for each t, ‖Δ_t‖ ≤ σ and L_t > L. Then

    φ(y_t) − φ(x) + ((L_t α_t² + µα_t)/2)‖x − z_t‖² ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] + (L_t α_t²/2)‖x − z_{t−1}‖² + σ²/(2(L_t − L)) + α_t⟨Δ_t, x − z_{t−1}⟩.    (8)

Proof. Define V_t(x) = ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (L_t α_t/2)‖x − z_{t−1}‖². It is easy to see that z_t = arg min_{x∈R^d} V_t(x). Moreover, notice that V_t(x) is (L_t α_t + µ)-strongly convex. Hence, on applying Lemmas 2 and 3, we obtain that for any x,

    V_t(z_t) ≤ V_t(x) − ((L_t α_t + µ)/2)‖x − z_t‖²
        = ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µ)/2)‖x − z_t‖²
        ≤ φ(x) − φ(y_t) − ((2L_t − L)/(2L_t²))‖δ_t‖² + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µ)/2)‖x − z_t‖² + ⟨Δ_t, x − y_t⟩.

Then, φ(y_t) can be bounded from above as

    φ(y_t) ≤ φ(x) + ⟨δ_t, x_t − z_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖² − (L_t α_t/2)‖z_t − z_{t−1}‖² + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µ)/2)‖x − z_t‖² + ⟨Δ_t, x − y_t⟩,    (9)

where the non-positive term −(µ/2)‖z_t − x_t‖² has been dropped from its right-hand side. On the other hand, by applying Lemma 3 with x = y_{t−1}, we get

    φ(y_t) ≤ φ(y_{t−1}) − ⟨δ_t, y_{t−1} − x_t⟩ − ⟨Δ_t, y_t − y_{t−1}⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖²,    (10)

where the non-positive term −(µ/2)‖y_{t−1} − x_t‖² has also been dropped from the right-hand side. On multiplying (9) by α_t and (10) by 1 − α_t, and then adding them together, we obtain

    φ(y_t) − φ(x) ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] − ((2L_t − L)/(2L_t²))‖δ_t‖² + A + B + C − (L_t α_t²/2)‖z_t − z_{t−1}‖²,    (11)

where A = ⟨δ_t, α_t(x_t − z_t) + (1 − α_t)(x_t − y_{t−1})⟩, B = α_t⟨Δ_t, x − y_t⟩ + (1 − α_t)⟨Δ_t, y_{t−1} − y_t⟩, and C = (L_t α_t²/2)‖x − z_{t−1}‖² − ((L_t α_t² + µα_t)/2)‖x − z_t‖². In the following, we upper-bound A and B. First, by using the update rule of x_t and Young's inequality¹, we have

    A = ⟨δ_t, α_t(x_t − z_{t−1}) + (1 − α_t)(x_t − y_{t−1})⟩ + α_t⟨δ_t, z_{t−1} − z_t⟩ = α_t⟨δ_t, z_{t−1} − z_t⟩ ≤ (L_t α_t²/2)‖z_t − z_{t−1}‖² + ‖δ_t‖²/(2L_t).    (12)

On the other hand, B can be bounded as

    B = ⟨Δ_t, α_t x + (1 − α_t)y_{t−1} − x_t⟩ + ⟨Δ_t, x_t − y_t⟩ = α_t⟨Δ_t, x − z_{t−1}⟩ + ⟨Δ_t, δ_t⟩/L_t ≤ α_t⟨Δ_t, x − z_{t−1}⟩ + (σ/L_t)‖δ_t‖,    (13)

where the second equality is due to the update rule of x_t, and the last step is from the Cauchy-Schwarz inequality and the boundedness of Δ_t. Hence, plugging (12) and (13) into (11),

    φ(y_t) − φ(x) ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] − ((L_t − L)/(2L_t²))‖δ_t‖² + (σ/L_t)‖δ_t‖ + α_t⟨Δ_t, x − z_{t−1}⟩ + C
        ≤ (1 − α_t)[φ(y_{t−1}) − φ(x)] + σ²/(2(L_t − L)) + α_t⟨Δ_t, x − z_{t−1}⟩ + C,

where the last step is due to the fact that −ax² + bx ≤ b²/(4a) for a, b > 0. On re-arranging terms, we obtain (8).

¹ Young's inequality states that ⟨x, y⟩ ≤ ‖x‖²/(2a) + (a/2)‖y‖² for any a > 0.
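The last step of the proof can be checked explicitly: applying −ax² + bx ≤ b²/(4a) with x = ‖δ_t‖, a = (L_t − L)/(2L_t²) and b = σ/L_t gives

```latex
-\frac{L_t-L}{2L_t^2}\,\|\delta_t\|^2 + \frac{\sigma}{L_t}\,\|\delta_t\|
\;\le\; \frac{(\sigma/L_t)^2}{4\,(L_t-L)/(2L_t^2)}
\;=\; \frac{\sigma^2}{2(L_t-L)},
```

which is exactly the variance term appearing in (8).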

Let the optimal solution of problem (6) be x*. From the update rules in Algorithm 1, we observe that the triple (x_t, y_t, z_t) depends on the random process ξ_[t] ≡ {ξ_1, ..., ξ_t} and hence is also random. Clearly, z_{t−1} and x_t are independent of ξ_t. Thus,

    E_{ξ_[t]}⟨Δ_t, x* − z_{t−1}⟩ = E_{ξ_[t−1]} E_{ξ_t}[⟨Δ_t, x* − z_{t−1}⟩ | ξ_[t−1]] = E_{ξ_[t−1]}⟨E_{ξ_t}[Δ_t], x* − z_{t−1}⟩ = 0,

where the first equality uses E_x[h(x)] = E_y E_x[h(x) | y], and the last equality is from our assumption that the stochastic gradient G(x, ξ) is unbiased. Taking expectations on both sides of (8) with x = x*, we obtain the following corollary, which will be useful in proving the subsequent theorems.

Corollary 1. E[φ(y_t)] − φ(x*) + ((L_t α_t² + µα_t)/2) E[‖x* − z_t‖²] ≤ (1 − α_t)(E[φ(y_{t−1})] − φ(x*)) + (L_t α_t²/2) E[‖x* − z_{t−1}‖²] + σ²/(2(L_t − L)).

So far, the choice of L_t and α_t in Algorithm 1 has been left unspecified. In the following, we show that with a good choice of L_t and α_t, (the expectation of) φ(y_t) converges rapidly to φ(x*).

Theorem 1. Assume that E[‖x* − z_t‖²] ≤ D² for some D. Set

    L_t = b(t + 1)^{3/2} + L,    α_t = 2/(t + 2),    (14)

where b > 0 is a constant. Then the expected error of Algorithm 1 can be bounded as

    E[φ(y_N)] − φ(x*) ≤ 3D²L/N² + ( 3D²b + 5σ²/(3b) ) / √N.    (15)

If σ were known, we could set b to the optimal choice of √5 σ/(3D), and the bound in (15) becomes 3D²L/N² + 2√5 σD/√N.

Note that so far φ(x) is only assumed to be convex. As is shown in the following theorem, the convergence rate can be further improved by assuming strong convexity. This also requires another setting of α_t and L_t which is different from that in (14).

Theorem 2. Assume the same conditions as in Theorem 1, except that φ(x) is µ-strongly convex. Set

    L_t = L + µ/λ_t for t ≥ 1;    α_t = ( √(λ_t² + 4λ_t) − λ_t )/2 for t ≥ 1,    (16)

where λ_t ≡ Π_{k=1}^{t−1} (1 − α_k) for t ≥ 2 and λ_1 = 1. Then the expected error of Algorithm 1 can be bounded as

    E[φ(y_N)] − φ(x*) ≤ (L + µ)D²/N² + 6σ²/(Nµ).    (17)

In comparison, FOBOS only converges as O(log(N)/N) for strongly convex objectives.

3.2 Remarks

As in recent studies on stochastic composite optimization [13], the error bounds in (15) and (17) consist of two terms: a faster term which is related to the smooth component and a slower term related to the non-smooth component.
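The "optimal choice" of b quoted after Theorem 1 follows by minimizing the coefficient of the 1/√N term in Theorem 1's bound over b > 0:

```latex
\frac{d}{db}\left(3D^2 b + \frac{5\sigma^2}{3b}\right)
= 3D^2 - \frac{5\sigma^2}{3b^2} = 0
\quad\Longrightarrow\quad
b^\ast = \frac{\sqrt{5}\,\sigma}{3D},
\qquad
3D^2 b^\ast + \frac{5\sigma^2}{3b^\ast} = 2\sqrt{5}\,\sigma D,
```

so the dominant term of the bound becomes 2√5 σD/√N.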
SAGE benefits from using the structure of the problem and accelerates the convergence of the smooth component. On the other hand, many stochastic (sub)gradient-based algorithms like FOBOS do not separate the smooth from the non-smooth part, but simply treat the whole objective as non-smooth. Consequently, convergence of the smooth component is also slowed down to O(1/√N).

As can be seen from (15) and (17), the convergence of SAGE is essentially encumbered by the variance of the stochastic subgradient. Recall that the variance of the average of p i.i.d. random

variables is equal to 1/p of the original variance. Hence, as in Pegasos [1], σ can be reduced by estimating the subgradient from a data subset.

Unlike the AC-SA algorithm in [13], the settings of L_t and α_t in (14) do not require knowledge of σ and the number of iterations, both of which can be difficult to estimate in practice. Moreover, with the use of a sparsity-promoting ψ(x), SAGE can produce a sparse solution (as will be experimentally demonstrated in Section 5) while AC-SA cannot. This is because in SAGE, the output y_t is obtained from a generalized gradient update. With a sparsity-promoting ψ(x), this reduces to a (soft) thresholding step, and thus ensures a sparse solution. On the other hand, in each iteration of AC-SA, the output is a convex combination of two other variables. Unfortunately, adding two vectors is unlikely to produce a sparse vector.

3.3 Efficient Computation of y_t

The computational efficiency of Algorithm 1 hinges on the efficient computation of y_t. Recall that y_t is just the generalized gradient update, and so it is not significantly more expensive than the gradient update in traditional algorithms. Indeed, the generalized gradient update is often a central component in various optimization and machine learning algorithms. In particular, Duchi and Singer [3] showed how it can be computed efficiently for various smooth and non-smooth regularizers, including the ℓ1, ℓ2, ℓ2², ℓ∞, Berhu and matrix norms. Interested readers are referred to [3] for details.

4 Accelerated Gradient Method for Online Learning

In this section, we extend the proposed accelerated gradient scheme to online learning of (1). The algorithm, shown in Algorithm 2, is similar to the stochastic version in Algorithm 1.

Algorithm 2 SAGE-based Online Learning Algorithm.
Input: Sequences {L_t} and {α_t}, where L_t > L and 0 < α_t < 1.
Initialize: z_1 = y_1.
loop
    x_t = (1 − α_t) y_{t−1} + α_t z_{t−1}.
    Output y_t = arg min_x { ⟨∇f_t(x_t), x − x_t⟩ + (L_t/2)‖x − x_t‖² + ψ(x) }.
    z_t = z_{t−1} − α_t (L_t α_t² + µα_t)⁻¹ [ L_t(x_t − y_t) + µ(z_{t−1} − x_t) ].
end loop

First, we introduce the following lemma, which plays a similar role as its stochastic counterpart, Lemma 3. Moreover, let δ_t ≡ L_t(x_t − y_t) be the gradient mapping related to the updating of y_t.

Lemma 4.
For t ≥ 1, φ_t(x) can be quadratically bounded from below as

    φ_t(x) ≥ φ_t(y_t) + ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + ((2L_t − L)/(2L_t²))‖δ_t‖².

Proposition 2. For any x and t, assume that there exists a subgradient ĝ_t(x) ∈ ∂ψ(x) such that ‖∇f_t(x) + ĝ_t(x)‖ ≤ Q. Then for Algorithm 2,

    φ_t(y_{t−1}) − φ_t(x) ≤ Q²/(2(1 − α_t)(L_t − L)) + (L_t α_t/2)‖x − z_{t−1}‖² − ((L_t α_t + µα_t)/2)‖x − z_t‖² + ((1 − α_t)L_t/2)‖z_t − y_{t−1}‖² + (α_t(1 − α_t)(L_t − L)/2)‖y_{t−1} − z_{t−1}‖² − (L_t/2)‖z_t − y_t‖².    (18)

Proof Sketch. Define τ_t ≡ L_t α_t. From the update rule of z_t, one can check that z_t = arg min_x V_t(x) ≡ ⟨δ_t, x − x_t⟩ + (µ/2)‖x − x_t‖² + (τ_t/2)‖x − z_{t−1}‖². Similar to the analysis in obtaining (9), we can obtain

    φ_t(y_t) − φ_t(x) ≤ ⟨δ_t, x_t − z_t⟩ − ((2L_t − L)/(2L_t²))‖δ_t‖² − (τ_t/2)‖z_t − z_{t−1}‖² + (τ_t/2)‖x − z_{t−1}‖² − ((τ_t + µα_t)/2)‖x − z_t‖².    (19)

On the other hand,

    ⟨δ_t, x_t − z_t⟩ − ‖δ_t‖²/(2L_t) = (L_t/2)( ‖z_t − x_t‖² − ‖z_t − y_t‖² ) ≤ (L_t α_t/2)‖z_t − z_{t−1}‖² + (L_t(1 − α_t)/2)‖z_t − y_{t−1}‖² − (L_t/2)‖z_t − y_t‖²,    (20)

on using the convexity of ‖·‖². Using (20), inequality (19) becomes

    φ_t(y_t) − φ_t(x) ≤ (L_t(1 − α_t)/2)‖z_t − y_{t−1}‖² − (L_t/2)‖z_t − y_t‖² − ((L_t − L)/(2L_t²))‖δ_t‖² + (τ_t/2)‖x − z_{t−1}‖² − ((τ_t + µα_t)/2)‖x − z_t‖².

On the other hand, by the convexity of φ_t(x) and Young's inequality, we have

    φ_t(y_{t−1}) − φ_t(y_t) ≤ ⟨∇f_t(y_{t−1}) + ĝ_t(y_{t−1}), y_{t−1} − y_t⟩    (21)
        ≤ Q²/(2(1 − α_t)(L_t − L)) + ((1 − α_t)(L_t − L)/2)‖y_{t−1} − y_t‖².    (22)

Moreover, by using the update rule of x_t and the convexity of ‖·‖², we have

    ‖y_{t−1} − y_t‖² = ‖(y_{t−1} − x_t) + (x_t − y_t)‖² = ‖α_t(y_{t−1} − z_{t−1}) + (x_t − y_t)‖² ≤ α_t‖y_{t−1} − z_{t−1}‖² + ‖δ_t‖²/((1 − α_t)L_t²).    (23)

On using (23), it follows from (22) that

    φ_t(y_{t−1}) − φ_t(y_t) ≤ Q²/(2(1 − α_t)(L_t − L)) + (α_t(1 − α_t)(L_t − L)/2)‖y_{t−1} − z_{t−1}‖² + ((L_t − L)/(2L_t²))‖δ_t‖².

Inequality (18) then follows immediately by adding this to the previous bound on φ_t(y_t) − φ_t(x).

Theorem 3. Assume that µ = 0, and ‖x* − z_t‖ ≤ D for all t. Set α_t = a and L_t = aL√t + L, where a ∈ (0, 1) is a constant. Then the regret of Algorithm 2 can be bounded as

    Σ_{t=1}^N [φ_t(y_{t−1}) − φ_t(x*)] ≤ LD²/(2a) + [ aLD²/2 + Q²/(a(1 − a)L) ] √N.

Theorem 4. Assume that µ > 0, and ‖x* − z_t‖ ≤ D for all t. Set α_t = a, and L_t = aµt/2 + L + a(µ − L)₊, where a ∈ (0, 1) is a constant. Then the regret of Algorithm 2 can be bounded as

    Σ_{t=1}^N [φ_t(y_{t−1}) − φ_t(x*)] ≤ ( ((a + a²)µ + L)/(2a) ) D² + ( Q²/(a(1 − a)µ) ) log(N + 1).

In particular, with a = 1/2, the regret bound reduces to (3µ/4 + L) D² + (4Q²/µ) log(N + 1).

5 Experiments

In this section, we perform experiments on the stochastic optimization of (1). Two data sets are used (Table 1). The first one is the pcmac data set, which is a subset of the 20-newsgroup data set from [18], while the second one is the RCV1 data set, which is a filtered collection of the Reuters RCV1 from [19]. We choose the square loss for ℓ(·,·) and the ℓ1 regularizer for Ω(·) in (1). As discussed in Section 3.3 and [3], the generalized gradient update can be efficiently computed by soft-thresholding in this case. Moreover, we do not use strong convexity and so µ = 0. We compare the proposed SAGE algorithm (with L_t and α_t in (14)) with three recent algorithms: (1) FOBOS [3]; (2) SMIDAS [4]; and (3) SCD [4]. For fair comparison, we compare their convergence

(Data sets downloaded from http://people.cs.uchicago.edu/~vikass/svmlin.html and http://www.cs.ucsb.edu/~wychen/sc.html.)

behavior w.r.t. both the number of iterations and the number of data access operations, the latter of which has been advocated in [4] as an implementation-independent measure of time. Moreover, the efficiency tricks for sparse data described in [4] are also implemented. Following [4], we set the regularization parameter λ in (1) to 10⁻⁶. The η parameter in FOBOS is searched over the range {10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}, and the one with the lowest ℓ1-regularized loss is used. As in Pegasos [1], the (sub)gradient is computed from small sample subsets. The subset size p is set to min(0.1m, 500), where m is the data set size. This is used in all the algorithms except SCD, since SCD is based on coordinate descent and is quite different from the other stochastic subgradient algorithms.³ All the algorithms are trained with the same maximum amount of time (i.e., number of data access operations).

Table 1: Summary of the data sets.

    data set | #features | #instances | sparsity
    pcmac    |     7,511 |      1,946 |    0.73%
    RCV1     |    47,236 |    193,844 |     0.1%

Results are shown in Figure 1. As can be seen, SAGE requires much fewer iterations for convergence than the others (Figures 1(a) and 1(e)). Moreover, the additional costs of maintaining x_t and z_t are small, and the most expensive step in each iteration is computing the generalized gradient update. Hence, its per-iteration complexity is comparable with the other (sub)gradient schemes, and its convergence in terms of the number of data access operations is still the fastest (Figures 1(b), 1(c), 1(f) and 1(g)). Moreover, the sparsity of its solution is comparable with those of the other algorithms (Figures 1(d) and 1(h)).
Figure 1: Performance of the various algorithms on the pcmac (upper) and RCV1 (lower) data sets. (Panels (a)-(h) plot, for each data set, the ℓ1-regularized loss against the number of iterations and against the number of data accesses, the error (%) against the number of data accesses, and the density of w against the number of data accesses.)

6 Conclusion

In this paper, we developed a novel accelerated gradient method (SAGE) for stochastic convex composite optimization. It enjoys the computational simplicity and scalability of traditional (sub)gradient methods but is much faster, both theoretically and empirically. Experimental results show that SAGE outperforms recent (sub)gradient descent methods. Moreover, SAGE can also be extended to online learning, obtaining the best regret bounds currently known.

Acknowledgment

This research has been partially supported by the Research Grants Council of the Hong Kong Special Administrative Region under grant 659.

³ For the same reason, an SCD iteration is also very different from an iteration in the other algorithms. Hence, SCD is not shown in the plots of the regularized loss versus the number of iterations.

References

[1] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pages 807-814, Corvallis, Oregon, USA, 2007.
[2] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737-1754, 2009.
[3] J. Duchi and Y. Singer. Online and batch learning using forward looking subgradients. Technical report, 2009.
[4] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929-936, Montreal, Quebec, Canada, 2009.
[5] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems. 2008.
[6] S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the 25th International Conference on Machine Learning, pages 928-935, Helsinki, Finland, 2008.
[7] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Math. Docl.), 269:543-547, 1983.
[8] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76, Catholic University of Louvain, September 2007.
[9] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[10] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58:267-288, 1996.
[11] S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems. 2009.
[12] S. Ji and J. Ye. An accelerated gradient method for trace norm minimization. In Proceedings of the International Conference on Machine Learning, Montreal, Canada, 2009.
[13] G. Lan. An optimal method for stochastic composite optimization.
Technical report, School of Industrial and Systems Engineering, Georgia Institute of Technology, 2009.
[14] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2003.
[15] S.M. Kakade and S. Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems. 2009.
[16] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[17] S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Las Vegas, Nevada, USA, March 2008.
[18] V. Sindhwani and S.S. Keerthi. Large scale semi-supervised linear SVMs. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval, pages 477-484, Seattle, WA, USA, 2006.
[19] Y. Song, W.Y. Chen, H. Bai, C.J. Lin, and E.Y. Chang. Parallel spectral clustering. In Proceedings of the European Conference on Machine Learning, pages 374-389, Antwerp, Belgium, 2008.