Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems




Aaron J. Defazio, Tibério S. Caetano, Justin Domke
NICTA and Australian National University
AARON.DEFAZIO@ANU.EDU.AU, TIBERIO.CAETANO@NICTA.COM.AU, JUSTIN.DOMKE@NICTA.COM.AU

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling without replacement scheme that in practice gives further speed-ups. We give empirical results showing state of the art performance.

1. Introduction

Many recent advances in the theory and practice of numerical optimization have come from the recognition and exploitation of structure. Perhaps the most common structure is that of finite sums. In machine learning, when applying empirical risk minimization we almost always end up with an optimization problem involving the minimization of a sum with one term per data point.

The recently developed SAG algorithm (Schmidt et al., 2013) has shown that even with this simple form of structure, as long as we have sufficiently many data points we are able to do significantly better than black-box optimization techniques in expectation for smooth strongly convex problems. In practical terms the difference is often a factor of 10 or more.

The requirement of sufficiently large datasets is fundamental to these methods. We describe the precise form of this as the big data condition. Essentially, it is the requirement that the amount of data is on the same order as the condition number of the problem. The strong convexity requirement is not as onerous. Strong convexity holds in the common case where a quadratic regularizer is used together with a convex loss.

The SAG method and the Finito method we describe in this work are similar in their form to stochastic gradient descent methods, but with one crucial difference: they store additional information about each data point during optimization. Essentially, when they revisit a data point, they do not treat it as a novel piece of information every time.

Methods for the minimization of finite sums have classically been known as incremental gradient methods (Bertsekas, 2010). The proof techniques used in SAG differ fundamentally from those used on other incremental gradient methods though. The difference hinges on the requirement that data be accessed in a randomized order. SAG does not work when data is accessed sequentially each epoch, so any proof technique which shows even non-divergence for sequential access cannot be applied.

A remarkable property of Finito is the tightness of the theoretical bounds compared to the practical performance of the algorithm. The practical convergence rate seen is at most twice as good as the theoretically predicted rate. This sets it apart from methods such as LBFGS, where the empirical performance is often much better than the relatively weak theoretical convergence rates would suggest.

The lack of tuning required also sets Finito apart from stochastic gradient descent (SGD). In order to get good performance out of SGD, substantial laborious tuning of multiple constants has traditionally been required. A multitude of heuristics have been developed to help choose these constants, or adapt them as the method progresses. Such heuristics are more complex than Finito, and do not have the same theoretical backing. SGD has application outside of convex problems of course, and we do not propose that Finito will replace SGD in those settings. Even on strongly convex problems SGD does not exhibit linear convergence like Finito does.

There are many similarities between SAG, Finito and stochastic dual coordinate descent (SDCA) methods (Shalev-Shwartz & Zhang, 2013). SDCA is only applicable to linear predictors. When it can be applied, it has linear convergence with theoretical rates similar to SAG and Finito.

2. Algorithm

We consider differentiable convex functions of the form

\[ f(w) = \frac{1}{n} \sum_i f_i(w). \]

We assume that each $f_i$ has Lipschitz continuous gradients with constant $L$ and is strongly convex with constant $s$. Clearly if we allow $n = 1$, virtually all smooth, strongly convex problems are included. So instead, we will restrict ourselves to problems satisfying the big data condition.

Big data condition: Functions of the above form satisfy the big data condition with constant $\beta$ if

\[ n \ge \beta \frac{L}{s}. \]

Typical values of $\beta$ are 1-8. In plain language, we are considering problems where the amount of data is of the same order as the condition number ($L/s$) of the problem.

2.1. Additional Notation

We superscript with $(k)$ to denote the value of the scripted quantity at iteration $k$. We omit the $n$ superscript on summations, and subscript with $i$ with the implication that indexing starts at 1. When we use separate arguments for each $f_i$, we denote them $\phi_i$. Let $\bar{\phi}^{(k)}$ denote the average $\bar{\phi}^{(k)} = \frac{1}{n} \sum_i \phi_i^{(k)}$. Our step length constant, which depends on $\beta$, is denoted $\alpha$. We use angle bracket notation for dot products $\langle \cdot, \cdot \rangle$.

2.2. The Finito algorithm

We start with a table of known $\phi_i^{(0)}$ values, and a table of known gradients $f_i'(\phi_i^{(0)})$, for each $i$. We will update these two tables during the course of the algorithm. The step for iteration $k$ is as follows:

1. Update $w$ using the step:
\[ w^{(k)} = \bar{\phi}^{(k)} - \frac{1}{\alpha s n} \sum_i f_i'(\phi_i^{(k)}). \]
2. Pick an index $j$ uniformly at random, or using without-replacement sampling as discussed in Section 3.
3. Set $\phi_j^{(k+1)} = w^{(k)}$ in the table and leave the other variables the same ($\phi_i^{(k+1)} = \phi_i^{(k)}$ for $i \ne j$).
4. Calculate and store $f_j'(\phi_j^{(k+1)})$ in the table.

Our main theoretical result is a convergence rate proof for this method.

Theorem 1. When the big data condition holds with $\beta = 2$, $\alpha = 2$ may be used. In that setting, if we have initialized all $\phi_i^{(0)}$ the same, the convergence rate is:

\[ \mathbb{E}\left[ f(\bar{\phi}^{(k)}) \right] - f(w^*) \le \frac{3}{4s} \left( 1 - \frac{1}{2n} \right)^k \left\| f'(\bar{\phi}^{(0)}) \right\|^2. \]

See Section 5 for the proof. In contrast, SAG achieves a $\left(1 - \frac{1}{8n}\right)$ rate when $\beta = 2$. Note that on a per epoch basis, the Finito rate is $\left(1 - \frac{1}{2n}\right)^n \approx \exp(-1/2) = 0.606$. To put that into context, 10 epochs will see the error bound reduced by more than 148x.

One notable feature of our method is the fixed step size.
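The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar ridge-regression objective, the synthetic data, and the helper names `make_data`, `ridge_solution` and `finito` are all hypothetical choices made for this example. The running sums are kept so that each iteration costs O(1) beyond one gradient evaluation.

```python
import random


def make_data(n, seed=0):
    """Synthetic 1-D regression data (an assumption made for this sketch)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def ridge_solution(xs, ys, s):
    """Closed-form minimiser of (1/n) sum_i [0.5*(x_i*w - y_i)**2 + 0.5*s*w**2]."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * s)


def finito(xs, ys, s, alpha=2.0, epochs=100, seed=1):
    """Finito with uniform sampling on f_i(w) = 0.5*(x_i*w - y_i)**2 + 0.5*s*w**2."""
    n = len(xs)
    rng = random.Random(seed)

    def grad(i, w):
        # Derivative of the i-th term; each term is s-strongly convex.
        return xs[i] * (xs[i] * w - ys[i]) + s * w

    phi = [0.0] * n                            # table of phi_i, all initialised the same
    grads = [grad(i, 0.0) for i in range(n)]   # table of f_i'(phi_i)
    phi_sum = sum(phi)
    grad_sum = sum(grads)

    for _ in range(epochs * n):
        w = phi_sum / n - grad_sum / (alpha * s * n)   # step 1: form w
        j = rng.randrange(n)                           # step 2: uniform random index
        phi_sum += w - phi[j]                          # step 3: phi_j <- w
        phi[j] = w
        g = grad(j, w)                                 # step 4: refresh stored gradient
        grad_sum += g - grads[j]
        grads[j] = g

    return phi_sum / n - grad_sum / (alpha * s * n)
```

With `n = 200` points drawn from `[-1, 1]` and `s = 0.1`, each term has `L <= 1.1`, so the big data condition holds comfortably with `beta = 2`, and the fixed choice `alpha = 2` needs no tuning in this toy setting.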
In typical machine learning problems the strong convexity constant is given by the strength constant of the quadratic regularizer used. Since this is a known quantity, as long as the big data condition holds $\alpha = 2$ may be used without any tuning or adjustment of Finito required. This lack of tuning is a major feature of Finito.

In cases where the big data condition does not hold, we conjecture that the step size must be reduced proportionally to the violation of the big data condition. In practice, the most effective step size can be found by testing a number of step sizes, as is usually done with other stochastic optimisation methods.

A simple way of satisfying the big data condition is to duplicate your data enough times so that it holds. This is not as effective in practice as just changing the step size, and of course it uses more memory. However it does fall within the current theory.

Another difference compared to the SAG method is that we store both gradients and points $\phi_i$. We do not actually need twice as much memory however, as they can be stored summed together. In particular we store the quantities $p_i = f_i'(\phi_i) - \alpha s \phi_i$, and use the update rule $w = -\frac{1}{\alpha s n} \sum_i p_i$. This trick does not work when step lengths are adjusted during optimization however. The storage of $\phi_i$ is also a disadvantage when the gradients $f_i'(\phi_i)$ are sparse but the $\phi_i$ are not sparse, as it can cause significant additional memory usage. We do not recommend the usage of Finito when gradients are sparse.

The SAG algorithm differs from Finito only in the $w$ update and step lengths:

\[ w^{(k)} = w^{(k-1)} - \frac{1}{16 L n} \sum_i f_i'(\phi_i^{(k)}). \]

3. Randomness is key

By far the most interesting aspect of the SAG and Finito methods is the random choice of index at each iteration. We are not in an online setting, so there is no inherent randomness in the problem. Yet it seems that a randomized method is required. Neither method works in practice when the same ordering is used each pass, or in fact with any non-random access scheme we have tried. It is hard to emphasize enough the importance of randomness here. The technique of pre-permuting the data, then doing in-order passes after that, also does not work. Reducing the step size in SAG or Finito by 1 or 2 orders of magnitude does not fix the convergence issues either.

Other methods, such as standard SGD, have been noted by various authors to exhibit speed-ups when random sampling is used instead of in-order passes, but the differences are not as extreme as convergence versus non-convergence. Perhaps the most similar problem is that of coordinate descent on smooth convex functions. Coordinate descent cannot diverge when non-random orderings are used, but convergence rates are substantially worse in the non-randomized setting (Nesterov 2010, Richtarik & Takac 2011).

Reducing the step size by a much larger amount, namely by a factor of $n$, does allow for non-randomized orderings to be used. This gives an extremely slow method however. This is the case covered by MISO (Mairal, 2013). A similar reduction in step size gives convergence under non-randomized orderings for SAG also. Convergence rates for incremental sub-gradient methods with a variety of orderings appear in the literature also (Nedic & Bertsekas, 2000).

Sampling without replacement is much faster

Other sampling schemes, such as sampling without replacement, should be considered. In detail, we mean the case where each pass over the data is a set of sampling without replacement steps, which continue until no data remains, after which another pass starts afresh. We call this the permuted case for simplicity, as it is the same as re-permuting the data after each pass. In practice, this approach does not give any speedup with SAG, however it works spectacularly well with Finito. We see speedups of up to a factor of two using this approach. This is one of the major differences in practice between SAG and Finito. We should note that we have no theory to support this case however.
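The two index schedules discussed in this section can be written as a small generator. This is a sketch only; the function name `index_stream` and its parameters are hypothetical and not taken from the paper. With `permute=True` every pass visits each index exactly once in a freshly shuffled order (the permuted case); with `permute=False` each index is drawn uniformly with replacement.

```python
import random


def index_stream(n, epochs, permute, rng):
    """Yield epochs * n indices in [0, n) under one of two sampling schemes."""
    for _ in range(epochs):
        if permute:
            # Sampling without replacement: a new random permutation per pass.
            order = list(range(n))
            rng.shuffle(order)
            yield from order
        else:
            # Uniform sampling with replacement: n independent draws per pass.
            for _ in range(n):
                yield rng.randrange(n)
```

A solver loop can then consume `for j in index_stream(n, epochs, True, random.Random(0)): ...` in place of drawing `j` itself, which keeps the sampling policy separate from the update rule.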
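We are not aware of any analysis that proves faster convergence rates of any optimization method under a sampling without replacement scheme. An interesting discussion of SGD under without-replacement sampling appears in Recht & Re (2012).

The SDCA method is also sometimes used with a permuted ordering (Shalev-Shwartz & Zhang, 2013). Our experiments in Section 7 show that this sometimes results in a large speedup over uniform random sampling, although it does not appear to be as reliable as with Finito.

4. Proximal variant

We now consider composite problems of the form

\[ f(w) = \frac{1}{n} \sum_i f_i(w) + \lambda r(w), \]

where $r$ is convex but not necessarily smooth or strongly convex. Such problems are often addressed using proximal algorithms, particularly when the proximal operator for $r$:

\[ \mathrm{prox}^r_\lambda(z) = \arg\min_x \frac{1}{2} \|x - z\|^2 + \lambda r(x) \]

has a closed form solution. An example would be the use of L1 regularization. We now describe the Finito update for this setting. First notice that when we set $w$ in the Finito method, it can be interpreted as minimizing the quantity:

\[ B(x) = \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), x - \phi_i \rangle + \frac{\alpha s}{2n} \sum_i \|x - \phi_i\|^2, \]

with respect to $x$, for fixed $\phi_i$. This is related to the upper bound minimized by MISO, where $\alpha s$ is instead $L$. It is straightforward to modify this for the composite case:

\[ B_{\lambda r}(x) = \lambda r(x) + \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), x - \phi_i \rangle + \frac{\alpha s}{2n} \sum_i \|x - \phi_i\|^2. \]

The minimizer of the modified $B_{\lambda r}$ can be expressed using the proximal operator as:

\[ w = \mathrm{prox}^r_{\lambda / \alpha s}\left( \bar{\phi} - \frac{1}{\alpha s n} \sum_i f_i'(\phi_i) \right). \]

This strongly resembles the update in the standard gradient descent setting, which for a step size of $1/L$ is

\[ w = \mathrm{prox}^r_{\lambda / L}\left( w^{(k-1)} - \frac{1}{L} f'(w^{(k-1)}) \right). \]

We have not yet developed any theory supporting the proximal variant of Finito, although empirical evidence suggests it has the same convergence rate as in the non-proximal case.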
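For the L1 example mentioned above, the proximal operator is the familiar soft-thresholding map, so the proximal update can be sketched as follows. This is an illustration under the assumption that $r$ is the L1 norm; the names `soft_threshold` and `proximal_finito_step` are hypothetical, and `phi` and `grads` stand for the two tables of Section 2.2 stored as lists of vectors.

```python
def soft_threshold(z, t):
    """prox of t * ||x||_1: argmin_x 0.5 * ||x - z||^2 + t * ||x||_1."""
    out = []
    for v in z:
        if v > t:
            out.append(v - t)
        elif v < -t:
            out.append(v + t)
        else:
            out.append(0.0)
    return out


def proximal_finito_step(phi, grads, s, alpha, lam):
    """Form w = prox_{lam / (alpha * s)}(phi_bar - (1 / (alpha * s * n)) * sum_i grads_i)."""
    n = len(phi)
    dim = len(phi[0])
    point = [
        sum(phi[i][c] for i in range(n)) / n
        - sum(grads[i][c] for i in range(n)) / (alpha * s * n)
        for c in range(dim)
    ]
    return soft_threshold(point, lam / (alpha * s))
```

Only the formation of `w` changes relative to the smooth case; the table updates in steps 2 to 4 stay as they are.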

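5. Convergence proof

We start by stating two simple lemmas. All expectations in the following are over the choice of index $j$ at step $k$. Quantities without superscripts are at their values at iteration $k$.

Lemma 1. The expected step is

\[ \mathbb{E}[w^{(k+1)}] - w = -\frac{1}{\alpha s n} f'(w). \]

I.e. the $w$ step is a gradient descent step in expectation ($\frac{1}{\alpha s n} \propto \frac{1}{L}$). A similar equality also holds for SGD, but not for SAG.

Proof.

\[ \mathbb{E}[w^{(k+1)}] - w = \mathbb{E}\left[ \frac{1}{n}(w - \phi_j) - \frac{1}{\alpha s n}\left( f_j'(w) - f_j'(\phi_j) \right) \right] = \frac{1}{n}(w - \bar{\phi}) - \frac{1}{\alpha s n} f'(w) + \frac{1}{\alpha s n^2} \sum_i f_i'(\phi_i). \]

Now simplify $\frac{1}{n}(w - \bar{\phi})$ as $-\frac{1}{\alpha s n^2} \sum_i f_i'(\phi_i)$, so the only term that remains is $-\frac{1}{\alpha s n} f'(w)$.

Lemma 2. (Decomposition of variance) We can decompose $\frac{1}{n} \sum_i \|w - \phi_i\|^2$ as

\[ \frac{1}{n} \sum_i \|w - \phi_i\|^2 = \|\bar{\phi} - w\|^2 + \frac{1}{n} \sum_i \|\bar{\phi} - \phi_i\|^2. \]

Proof.

\[ \frac{1}{n} \sum_i \|w - \phi_i\|^2 = \|\bar{\phi} - w\|^2 + \frac{1}{n} \sum_i \|\bar{\phi} - \phi_i\|^2 + \frac{2}{n} \sum_i \langle w - \bar{\phi}, \bar{\phi} - \phi_i \rangle, \]

and the last term vanishes because $\frac{1}{n} \sum_i (\bar{\phi} - \phi_i) = 0$.

Main proof

Our proof proceeds by construction of a Lyapunov function $T$; that is, a function that bounds a quantity of interest, and that decreases each iteration in expectation. Our Lyapunov function $T = T_1 + T_2 + T_3 + T_4$ is composed of the sum of the following terms:

\[ T_1 = f(\bar{\phi}), \]
\[ T_2 = -\frac{1}{n} \sum_i f_i(\phi_i) - \frac{1}{n} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle, \]
\[ T_3 = -\frac{s}{2n} \sum_i \|w - \phi_i\|^2, \]
\[ T_4 = \frac{s}{2n} \sum_i \|\bar{\phi} - \phi_i\|^2. \]

We now state how each term changes between steps $k$ and $k+1$. Proofs are found in the appendix in the supplementary material:

\[ \mathbb{E}[T_1^{(k+1)}] - T_1 \le \frac{1}{n} \langle f'(\bar{\phi}), w - \bar{\phi} \rangle + \frac{L}{2n^3} \sum_i \|w - \phi_i\|^2, \]
\[ \mathbb{E}[T_2^{(k+1)}] - T_2 \le -\frac{1}{n} T_2 - \frac{1}{n} f(w) + \frac{1}{\alpha s n^3} \sum_i \|f_i'(w) - f_i'(\phi_i)\|^2 + \frac{1}{n} \langle \bar{\phi} - w, f'(w) \rangle - \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle, \]
\[ \mathbb{E}[T_3^{(k+1)}] - T_3 = -\left( \frac{1}{n} + \frac{1}{n^2} \right) T_3 + \frac{1}{\alpha n} \langle f'(w), w - \bar{\phi} \rangle - \frac{1}{2 \alpha^2 s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2, \]
\[ \mathbb{E}[T_4^{(k+1)}] - T_4 = -\frac{s}{2n^2} \sum_i \|\bar{\phi} - \phi_i\|^2 + \frac{s}{2n} \|\bar{\phi} - w\|^2 - \frac{s}{2n^3} \sum_i \|w - \phi_i\|^2. \]

Theorem 2. Between steps $k$ and $k+1$, if $\frac{2}{\alpha} - \frac{1}{\alpha^2} - \beta + \frac{\beta}{\alpha} \le 0$, $\alpha \ge 2$ and $\beta \ge 2$, then

\[ \mathbb{E}[T^{(k+1)}] - T \le -\frac{1}{\alpha n} T. \]

Proof. We take the four bounds above and group like terms to get

\[ \mathbb{E}[T^{(k+1)}] - T \le \frac{1}{n} \langle f'(\bar{\phi}), w - \bar{\phi} \rangle + \frac{1}{n^2} \sum_i f_i(\phi_i) - \frac{1}{n} f(w) + \frac{1}{n^2} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle - \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \langle f'(w), w - \bar{\phi} \rangle \]
\[ \quad + \left( \frac{L}{n} + s \right) \frac{1}{2n^2} \sum_i \|w - \phi_i\|^2 - \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle + \left( \frac{1}{\alpha} - \frac{1}{2\alpha^2} \right) \frac{1}{s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2 \]
\[ \quad + \frac{s}{2n} \|w - \bar{\phi}\|^2 - \frac{s}{2n^2} \sum_i \|\bar{\phi} - \phi_i\|^2. \]

Next we cancel part of the first line using

\[ \frac{1}{\alpha n} \langle f'(\bar{\phi}), w - \bar{\phi} \rangle \le \frac{1}{\alpha n} f(w) - \frac{1}{\alpha n} f(\bar{\phi}) - \frac{s}{2 \alpha n} \|w - \bar{\phi}\|^2, \]

based on B3 in the Appendix. We then pull terms occurring in $-\frac{1}{\alpha n} T$ together, giving

\[ \mathbb{E}[T^{(k+1)}] - T \le -\frac{1}{\alpha n} T + \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \langle f'(\bar{\phi}) - f'(w), w - \bar{\phi} \rangle + \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \left[ -f(w) - T_2 \right] \]
\[ \quad + \left( \frac{L}{n} + s - \frac{s}{\alpha} \right) \frac{1}{2n^2} \sum_i \|w - \phi_i\|^2 - \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle + \left( \frac{1}{\alpha} - \frac{1}{2\alpha^2} \right) \frac{1}{s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2 \]
\[ \quad + \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \frac{s}{2} \|w - \bar{\phi}\|^2 - \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \frac{s}{2n} \sum_i \|\bar{\phi} - \phi_i\|^2. \]

Next we use the standard inequality (B5)

\[ \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \langle f'(\bar{\phi}) - f'(w), w - \bar{\phi} \rangle \le -\left( \frac{1}{n} - \frac{1}{\alpha n} \right) s \|w - \bar{\phi}\|^2, \]

which changes the bottom row to

\[ -\left( \frac{1}{n} - \frac{1}{\alpha n} \right) \frac{s}{2} \|w - \bar{\phi}\|^2 - \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \frac{s}{2n} \sum_i \|\bar{\phi} - \phi_i\|^2. \]

These two terms can then be grouped using Lemma 2, to give

\[ \mathbb{E}[T^{(k+1)}] - T \le -\frac{1}{\alpha n} T + \frac{L}{2n^3} \sum_i \|w - \phi_i\|^2 + \left( \frac{1}{n} - \frac{1}{\alpha n} \right) \left[ -f(w) - T_2 \right] - \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle + \left( \frac{1}{\alpha} - \frac{1}{2\alpha^2} \right) \frac{1}{s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2. \]

We use the following inequality (Corollary 6 in the Appendix) to cancel against the $\sum_i \|w - \phi_i\|^2$ term:

\[ \frac{1}{\beta n} \left[ -f(w) - T_2 \right] \le \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle - \frac{L}{2n^3} \sum_i \|w - \phi_i\|^2 - \frac{1}{2 s n^3} \sum_i \|f_i'(w) - f_i'(\phi_i)\|^2, \]

and then apply the following similar inequality (B7 in the Appendix, combined with $\frac{1}{L} \ge \frac{\beta}{s n}$ from the big data condition) to partially cancel $\sum_i \|f_i'(\phi_i) - f_i'(w)\|^2$:

\[ \left( \frac{1}{n} - \frac{1}{\alpha n} - \frac{1}{\beta n} \right) \left[ -f(w) - T_2 \right] \le -\left( \beta - \frac{\beta}{\alpha} - 1 \right) \frac{1}{2 s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2. \]

Leaving us with

\[ \mathbb{E}[T^{(k+1)}] - T \le -\frac{1}{\alpha n} T + \left( \frac{2}{\alpha} - \frac{1}{\alpha^2} - \beta + \frac{\beta}{\alpha} \right) \frac{1}{2 s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2. \]

The remaining gradient norm term is non-positive under the conditions specified in our assumptions.

Theorem 3. The Lyapunov function bounds $f(\bar{\phi}) - f(w^*)$ as follows:

\[ f(\bar{\phi}^{(k)}) - f(w^*) \le \alpha T^{(k)}. \]

Proof. Consider the following function, which we will call $R(x)$:

\[ R(x) = \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), x - \phi_i \rangle + \frac{s}{2n} \sum_i \|x - \phi_i\|^2. \]

When evaluated at its minimum with respect to $x$, which we denote $w' = \bar{\phi} - \frac{1}{s n} \sum_i f_i'(\phi_i)$, it is a lower bound on $f(w^*)$ by strong convexity. However, we are evaluating at $w = \bar{\phi} - \frac{1}{\alpha s n} \sum_i f_i'(\phi_i)$ instead in the (negated) Lyapunov function. $R$ is convex with respect to $x$, so by definition

\[ R(w) = R\left( \left( 1 - \frac{1}{\alpha} \right) \bar{\phi} + \frac{1}{\alpha} w' \right) \le \left( 1 - \frac{1}{\alpha} \right) R(\bar{\phi}) + \frac{1}{\alpha} R(w'). \]

Therefore by the lower bounding property

\[ f(\bar{\phi}) - R(w) \ge f(\bar{\phi}) - \left( 1 - \frac{1}{\alpha} \right) R(\bar{\phi}) - \frac{1}{\alpha} R(w') \ge f(\bar{\phi}) - \left( 1 - \frac{1}{\alpha} \right) f(\bar{\phi}) - \frac{1}{\alpha} f(w^*) = \frac{1}{\alpha} \left( f(\bar{\phi}) - f(w^*) \right). \]

Now note that $T \ge f(\bar{\phi}) - R(w)$. So $f(\bar{\phi}) - f(w^*) \le \alpha T$.

Theorem 4. If the Finito method is initialized with all $\phi_i^{(0)}$ the same, and the assumptions of Theorem 2 hold, then the convergence rate is:

\[ \mathbb{E}\left[ f(\bar{\phi}^{(k)}) \right] - f(w^*) \le \frac{c}{s} \left( 1 - \frac{1}{\alpha n} \right)^k \left\| f'(\bar{\phi}^{(0)}) \right\|^2, \]

with $c = 1 - \frac{1}{2\alpha}$.

Proof. By unrolling Theorem 2, we get

\[ \mathbb{E}[T^{(k)}] \le \left( 1 - \frac{1}{\alpha n} \right)^k T^{(0)}. \]

Now using Theorem 3

\[ \mathbb{E}\left[ f(\bar{\phi}^{(k)}) \right] - f(w^*) \le \alpha \left( 1 - \frac{1}{\alpha n} \right)^k T^{(0)}. \]

We need to control $T^{(0)}$ also. Since we are assuming that all $\phi_i^{(0)}$ start the same, $w^{(0)} - \bar{\phi}^{(0)} = -\frac{1}{\alpha s} f'(\bar{\phi}^{(0)})$, and we have that

\[ T^{(0)} = f(\bar{\phi}^{(0)}) - f(\bar{\phi}^{(0)}) - \langle f'(\bar{\phi}^{(0)}), w^{(0)} - \bar{\phi}^{(0)} \rangle - \frac{s}{2} \left\| w^{(0)} - \bar{\phi}^{(0)} \right\|^2 = \frac{1}{\alpha s} \left\| f'(\bar{\phi}^{(0)}) \right\|^2 - \frac{1}{2 \alpha^2 s} \left\| f'(\bar{\phi}^{(0)}) \right\|^2 = \frac{1}{\alpha s} \left( 1 - \frac{1}{2\alpha} \right) \left\| f'(\bar{\phi}^{(0)}) \right\|^2. \]

6. Lower complexity bounds and exploiting problem structure

The theory for the class of smooth, strongly convex problems with Lipschitz continuous gradients under first order optimization methods (known as $S^{1,1}_{s,L}$) is well developed. These results require the technical condition that the dimensionality of the input space $\mathbb{R}^m$ is much larger than the number of iterations we will take. For simplicity we will assume this is the case in the following discussion.

It is known that problems exist in $S^{1,1}_{s,L}$ for which the iterate convergence rate is bounded by:

\[ \left\| w^{(k)} - w^* \right\|^2 \ge \left( \frac{\sqrt{L/s} - 1}{\sqrt{L/s} + 1} \right)^{2k} \left\| w^{(0)} - w^* \right\|^2. \]

In fact, when $s$ and $L$ are known in advance, this rate is achieved up to a small constant factor by several methods, most notably by Nesterov's accelerated gradient descent method (Nesterov 1988, Nesterov 1998). In order to achieve convergence rates faster than this, additional assumptions must be made on the class of functions considered.

Recent advances have shown that all that is required to achieve significantly faster rates is a finite sum structure, such as in our problem setup. When the big data condition holds our method achieves a rate of 0.6065 per epoch in expectation. This rate only depends on the condition number indirectly, through the big data condition. For example, with $L/s = 1{,}000{,}000$, the fastest possible rate for a black box method is 0.996, whereas Finito achieves a rate of 0.6065 in expectation for $n \ge 4{,}000{,}000$, or 124x faster. The required amount of data is not unusual in modern machine learning problems. In practice, when quasi-Newton methods are used instead of accelerated methods, a speedup of 10-20x is more common.

6.1. Oracle class

We now describe the (stochastic) oracle class $FS^{1,1}_{s,L,n}(\mathbb{R}^m)$ for which SAG and Finito most naturally fit.

Function class: $f(w) = \frac{1}{n} \sum_i f_i(w)$, with $f_i \in S^{1,1}_{s,L}(\mathbb{R}^m)$.

Oracle: Each query takes a point $x \in \mathbb{R}^m$, and returns $j$, $f_j(w)$ and $f_j'(w)$, with $j$ chosen uniformly at random.

Accuracy: Find $w$ such that $\mathbb{E}\left[ \left\| w^{(k)} - w^* \right\|^2 \right] \le \epsilon$.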
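The per-epoch figures quoted in Section 6 can be checked with a few lines of arithmetic. This is a sketch; the function names are hypothetical, and the black-box rate is taken to be the per-iteration contraction factor of the lower bound above, with one black-box iteration counted as one pass over the data.

```python
import math


def finito_epoch_rate(n, alpha=2.0):
    """Contraction of the Finito bound over one epoch: (1 - 1/(alpha*n))**n."""
    return (1.0 - 1.0 / (alpha * n)) ** n


def black_box_rate(kappa):
    """Per-iteration factor ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))**2 for kappa = L/s."""
    root = math.sqrt(kappa)
    return ((root - 1.0) / (root + 1.0)) ** 2


def epoch_speedup(n, kappa, alpha=2.0):
    """Black-box iterations needed to match the contraction of one Finito epoch."""
    return math.log(finito_epoch_rate(n, alpha)) / math.log(black_box_rate(kappa))
```

For `n = 4_000_000` and `kappa = 1e6` the first two functions evaluate to roughly 0.6065 and 0.996, and their log-ratio is close to 125, in line with the speedup quoted in the text.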
The main choice made in formulating this definition is putting the random choice in the oracle. This restricts the methods allowed quite strongly. The alternative case, where the index $j$ is input to the oracle in addition to $x$, is also interesting. Assuming that the method has access to a source of true random indices, we call that class $DS^{1,1}_{s,L,n}(\mathbb{R}^m)$. In Section 3 we discuss empirical evidence that suggests that faster rates are possible in $DS^{1,1}_{s,L,n}(\mathbb{R}^m)$ than for $FS^{1,1}_{s,L,n}(\mathbb{R}^m)$.

It should first be noted that there is a trivial lower bound rate for $f \in SS^{1,1}_{s,L,n}(\mathbb{R}^m)$ of $\left(1 - \frac{1}{n}\right)$ reduction per step. It is not clear if this can be achieved for any finite $\beta$. Finito is only a factor of 2 off this rate, namely $\left(1 - \frac{1}{2n}\right)$ at $\beta = 2$, and asymptotes towards this rate for very large $\beta$. SDCA, while not applicable to all problems in this class, also achieves the rate asymptotically.

Another case to consider is the smooth convex but non-strongly convex setting. We still assume Lipschitz continuous gradients. In this setting we will show that for sufficiently high dimensional input spaces, the (non-stochastic) lower complexity bound is the same for the finite sum case and cannot be better than that given by treating $f$ as a single black box function.

The full proof is in the Appendix, but the idea is as follows: when the $f_i$ are not strongly convex, we can choose them such that they do not interact with each other, as long as the dimensionality is much larger than $k$. More precisely, we may choose them so that for any $x$ and $y$ and any $i \ne j$, $\langle f_i'(x), f_j'(y) \rangle = 0$ holds. When the functions do not interact, no optimization scheme may reduce the iterate error faster than by just handling each $f_i$ separately. Doing so in an in-order fashion gives the same rate as just treating $f$ using a black box method.

For strongly convex $f_i$, it is not possible for them to not interact in the above sense. By definition strong convexity requires a quadratic component in each $f_i$ that acts on all dimensions.

7. Experiments

In this section we compare Finito, SAG, SDCA and LBFGS. We only consider problems where the regularizer is large enough so that the big data condition holds, as this is the case our theory supports. However, in practice our method can be used with smaller step sizes in the more general case, in much the same way as SAG.

Since we do not know the Lipschitz constant for these problems exactly, the SAG method was run for a variety of step sizes, with the one that gave the fastest rate of convergence plotted. The best step-size for SAG is usually not what the theory suggests. Schmidt et al. (2013) suggest using $\frac{1}{L}$ instead of the theoretical rate $\frac{1}{16L}$. For Finito, we find that using $\alpha = 2$ is the fastest rate when the big data condition holds for any $\beta > 1$. This is the step suggested by our theory when $\beta = 2$. Interestingly, reducing $\alpha$ to 1 does not improve the convergence rate. Instead we see no further improvement in our experiments.

For both SAG and Finito we used a differing step rule than suggested by the theory for the first pass. For Finito, during the first pass, since we do not have derivatives for each $\phi_i$ yet, we simply sum over the $k$ terms seen so far

\[ w^{(k)} = \frac{1}{k} \sum_i^k \phi_i^{(k)} - \frac{1}{\alpha s k} \sum_i^k f_i'(\phi_i^{(k)}), \]

where we process data points in index order for the first pass only. A similar trick is suggested by Schmidt et al. (2013) for SAG.

Since SDCA only applies to linear predictors, we are restricted in possible test problems. We choose log loss for 3 binary classification datasets, and quadratic loss for 2 regression tasks. For classification, we tested on the ijcnn1 and covtype datasets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), as well as MNIST (http://yann.lecun.com/exdb/mnist/) classifying 0-4 against 5-9. For regression, we choose the two datasets from the UCI repository: the million song year regression dataset, and the slice-localization dataset.
The training portions of the datasets are of size $5.3 \times 10^5$, $5.0 \times 10^4$, $6.0 \times 10^4$, $4.7 \times 10^5$ and $5.3 \times 10^4$ respectively.

Figure 6 shows the results of our experiments. Firstly we can see that LBFGS is not competitive with any of the incremental gradient methods considered. Secondly, the non-permuted SAG, Finito and SDCA often converge at very similar rates. The observed differences are usually down to the speed of the very first pass, where SAG and Finito are using the above mentioned trick to speed their convergence. After the first pass, the slopes of the lines are usually comparable. When considering the methods with permutation each pass, we see a clear advantage for Finito. Interestingly, it gives very flat lines, indicating very stable convergence.

8. Related work

Traditional incremental gradient methods (Bertsekas, 2010) have the same form as SGD, but applied to finite sums. Essentially they are the non-online analogue of SGD. Applying SGD to strongly convex problems does not yield linear convergence, and in practice it is slower than the linear-converging methods we discuss in the remainder of this section.

Besides the methods that fall under the classical incremental gradient moniker, the SAG and MISO (Mairal, 2013) methods are also related. MISO methods fall into the class of upper bound minimization methods, such as EM and classical gradient descent. MISO is essentially the Finito method, but with step sizes $n$ times smaller. When using these larger step sizes, the method is no longer an upper bound minimization method. Our method can be seen as MISO, but with a step size scheme that gives neither a lower nor upper bound minimisation method. While this work was under peer review, a tech report (Mairal, 2014) was put on arXiv that establishes the convergence rate of MISO with step $\alpha = 1$ and with $\beta = 2$ as $1 - \frac{1}{3n}$ per step. This is similar but not quite as good as the rate we establish.

Stochastic Dual Coordinate descent (Shalev-Shwartz & Zhang, 2013) also gives fast convergence rates on problems for which it is applicable. It requires computing the convex conjugate of each $f_i$, which makes it more complex to implement. For the best performance it has to take advantage of the structure of the losses also. For simple linear classification and regression problems it can be effective. When using a sparse dataset, it is a better choice than Finito due to the memory requirements.
For linear predictors, its theoretical convergence rate of $\left(1 - \frac{\beta}{(1+\beta)n}\right)$ per step is a little faster than what we establish for Finito, however it does not appear to be faster in our experiments.

[Figures 1-5: full gradient norm (vertical axis) against epoch (horizontal axis). Figure 1. MNIST. Figure 2. ijcnn1. Figure 3. Covtype. Figure 4. Million Song. Figure 5. slice. Legend: SAG, Finito perm, SDCA perm, Finito, SDCA, LBFGS.]

Figure 6. Convergence rate plots for test problems

9. Conclusion

We have presented a new method for minimization of finite sums of smooth strongly convex functions, when there is a sufficiently large number of terms in the summation. We additionally develop some theory for the lower complexity bounds on this class, and show the empirical performance of our method.

References

Bertsekas, Dimitri P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Technical report, 2010.

Mairal, Julien. Optimization with first-order surrogate functions. ICML, 2013.

Mairal, Julien. Incremental majorization-minimization optimization with application to large-scale machine learning. Technical report, INRIA Grenoble Rhône-Alpes / LJK Laboratoire Jean Kuntzmann, 2014.

Nedic, Angelia and Bertsekas, Dimitri. Stochastic Optimization: Algorithms and Applications, chapter Convergence Rate of Incremental Subgradient Algorithms. Kluwer Academic, 2000.

Nesterov, Yu. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Mateaticheskie Metody, 24:509-517, 1988.

Nesterov, Yu. Introductory Lectures On Convex Programming. Springer, 1998.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. Technical report, CORE, 2010.

Recht, Benjamin and Re, Christopher. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. Technical report, University of Wisconsin-Madison, 2012.

Richtarik, Peter and Takac, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Technical report, University of Edinburgh, 2011.

Schmidt, Mark, Roux, Nicolas Le, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. Technical report, INRIA, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 2013.

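Appendix

1 Basic convexity inequalities

The following inequalities are classical. See Nesterov (1998) for proofs. They hold for all $x$ and $y$, when $f \in S^{1,1}_{s,L}$:

\[ \text{(B1)} \quad f(y) \le f(x) + \langle f'(x), y - x \rangle + \frac{L}{2} \|x - y\|^2 \]
\[ \text{(B2)} \quad f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{1}{2L} \|f'(x) - f'(y)\|^2 \]
\[ \text{(B3)} \quad f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{s}{2} \|x - y\|^2 \]
\[ \text{(B4)} \quad \langle f'(x) - f'(y), x - y \rangle \ge \frac{1}{L} \|f'(x) - f'(y)\|^2 \]
\[ \text{(B5)} \quad \langle f'(x) - f'(y), x - y \rangle \ge s \|x - y\|^2 \]

We also use variants of B2 and B3 that are summed over each $f_i$, with $x = \phi_i$ and $y = w$:

\[ f(w) \ge \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle + \frac{1}{2Ln} \sum_i \|f_i'(w) - f_i'(\phi_i)\|^2, \]
\[ f(w) \ge \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle + \frac{s}{2n} \sum_i \|w - \phi_i\|^2. \]

These are used in the following negated and rearranged form:

\[ -f(w) - T_2 = -f(w) + \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle, \]
\[ \text{(B6)} \quad -f(w) - T_2 \le -\frac{s}{2n} \sum_i \|w - \phi_i\|^2, \]
\[ \text{(B7)} \quad -f(w) - T_2 \le -\frac{1}{2Ln} \sum_i \|f_i'(w) - f_i'(\phi_i)\|^2. \]

2 Lyapunov term bounds

Simplifying each Lyapunov term is fairly straightforward. We use extensively that $\phi_i^{(k+1)} = \phi_i$ for $i \ne j$. Note also that $\phi_j^{(k+1)} = w$, that $\bar{\phi}^{(k+1)} = \bar{\phi} + \frac{1}{n}(w - \phi_j)$, and that

\[ \text{(B8)} \quad w^{(k+1)} - w = \frac{1}{n}(w - \phi_j) + \frac{1}{\alpha s n} \left[ f_j'(\phi_j) - f_j'(w) \right]. \]

Lemma 6. Between steps $k$ and $k+1$, the $T_1 = f(\bar{\phi})$ term changes as follows:

\[ \mathbb{E}[T_1^{(k+1)}] - T_1 \le \frac{1}{n} \langle f'(\bar{\phi}), w - \bar{\phi} \rangle + \frac{L}{2n^3} \sum_i \|w - \phi_i\|^2. \]

Proof. First we use the standard Lipschitz upper bound (B1):

\[ f(y) \le f(x) + \langle f'(x), y - x \rangle + \frac{L}{2} \|x - y\|^2. \]

We can apply this using $y = \bar{\phi}^{(k+1)} = \bar{\phi} + \frac{1}{n}(w - \phi_j)$ and $x = \bar{\phi}$:

\[ f(\bar{\phi}^{(k+1)}) \le f(\bar{\phi}) + \frac{1}{n} \langle f'(\bar{\phi}), w - \phi_j \rangle + \frac{L}{2n^2} \|w - \phi_j\|^2. \]

We now take expectations over $j$, giving:

\[ \mathbb{E}[f(\bar{\phi}^{(k+1)})] - f(\bar{\phi}) \le \frac{1}{n} \langle f'(\bar{\phi}), w - \bar{\phi} \rangle + \frac{L}{2n^3} \sum_i \|w - \phi_i\|^2. \]

Lemma 7. Between steps $k$ and $k+1$, the $T_2 = -\frac{1}{n} \sum_i f_i(\phi_i) - \frac{1}{n} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle$ term changes as follows:

\[ \mathbb{E}[T_2^{(k+1)}] - T_2 \le -\frac{1}{n} T_2 - \frac{1}{n} f(w) + \frac{1}{\alpha s n^3} \sum_i \|f_i'(w) - f_i'(\phi_i)\|^2 + \frac{1}{n} \langle \bar{\phi} - w, f'(w) \rangle - \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle. \]

Proof. We introduce the notation $T_{21} = -\frac{1}{n} \sum_i f_i(\phi_i)$ and $T_{22} = -\frac{1}{n} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle$. We simplify the change in $T_{21}$ first using $\phi_j^{(k+1)} = w$:

\[ T_{21}^{(k+1)} - T_{21} = \frac{1}{n} f_j(\phi_j) - \frac{1}{n} f_j(w). \]

Now we simplify the change in $T_{22}$:

\[ T_{22}^{(k+1)} - T_{22} = -\frac{1}{n} \sum_i \langle f_i'(\phi_i^{(k+1)}), w - \phi_i^{(k+1)} \rangle - T_{22} - \frac{1}{n} \sum_i \langle f_i'(\phi_i^{(k+1)}), w^{(k+1)} - w \rangle. \quad (1) \]

We now simplify the first two terms using $\phi_j^{(k+1)} = w$:

\[ -\frac{1}{n} \sum_i \langle f_i'(\phi_i^{(k+1)}), w - \phi_i^{(k+1)} \rangle - T_{22} = T_{22} + \frac{1}{n} \langle f_j'(\phi_j), w - \phi_j \rangle - \frac{1}{n} \langle f_j'(w), w - w \rangle - T_{22} = \frac{1}{n} \langle f_j'(\phi_j), w - \phi_j \rangle. \]

The last term of Equation 1 expands further:

\[ -\frac{1}{n} \sum_i \langle f_i'(\phi_i^{(k+1)}), w^{(k+1)} - w \rangle = -\left\langle \frac{1}{n} \sum_i f_i'(\phi_i), w^{(k+1)} - w \right\rangle - \frac{1}{n} \left\langle f_j'(w) - f_j'(\phi_j), w^{(k+1)} - w \right\rangle. \quad (2) \]

The second inner product term simplifies further using B8:

\[ -\frac{1}{n} \left\langle f_j'(w) - f_j'(\phi_j), w^{(k+1)} - w \right\rangle = -\frac{1}{n^2} \langle f_j'(w) - f_j'(\phi_j), w - \phi_j \rangle - \frac{1}{\alpha s n^2} \langle f_j'(w) - f_j'(\phi_j), f_j'(\phi_j) - f_j'(w) \rangle. \]

We simplify the second term:

\[ -\frac{1}{\alpha s n^2} \langle f_j'(w) - f_j'(\phi_j), f_j'(\phi_j) - f_j'(w) \rangle = \frac{1}{\alpha s n^2} \|f_j'(w) - f_j'(\phi_j)\|^2. \]

Grouping all remaining terms gives:

\[ T_2^{(k+1)} - T_2 \le \frac{1}{n} f_j(\phi_j) + \frac{1}{n} \langle f_j'(\phi_j), w - \phi_j \rangle - \frac{1}{n} f_j(w) + \frac{1}{\alpha s n^2} \|f_j'(w) - f_j'(\phi_j)\|^2 - \frac{1}{n^2} \langle f_j'(w) - f_j'(\phi_j), w - \phi_j \rangle - \left\langle \frac{1}{n} \sum_i f_i'(\phi_i), w^{(k+1)} - w \right\rangle. \]

We now take expectations of each remaining term. For the bottom inner product we use Lemma 1:

\[ -\mathbb{E}\left\langle \frac{1}{n} \sum_i f_i'(\phi_i), w^{(k+1)} - w \right\rangle = \frac{1}{\alpha s n} \left\langle \frac{1}{n} \sum_i f_i'(\phi_i), f'(w) \right\rangle = \frac{1}{n} \langle \bar{\phi} - w, f'(w) \rangle. \]

Taking expectations of the remaining terms is straightforward. We get:

\[ \mathbb{E}[T_2^{(k+1)}] - T_2 \le \frac{1}{n^2} \sum_i f_i(\phi_i) - \frac{1}{n} f(w) + \frac{1}{n^2} \sum_i \langle f_i'(\phi_i), w - \phi_i \rangle + \frac{1}{\alpha s n^3} \sum_i \|f_i'(w) - f_i'(\phi_i)\|^2 - \frac{1}{n^3} \sum_i \langle f_i'(w) - f_i'(\phi_i), w - \phi_i \rangle + \frac{1}{n} \langle \bar{\phi} - w, f'(w) \rangle. \]

Lemma 8. Between steps $k$ and $k+1$, the $T_3 = -\frac{s}{2n} \sum_i \|w - \phi_i\|^2$ term changes as follows:

\[ \mathbb{E}[T_3^{(k+1)}] - T_3 = -\left( 1 + \frac{1}{n} \right) \frac{1}{n} T_3 + \frac{1}{\alpha n} \langle f'(w), w - \bar{\phi} \rangle - \frac{1}{2 \alpha^2 s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2. \]

Proof. We expand as:

\[ T_3^{(k+1)} = -\frac{s}{2n} \sum_i \left\| w^{(k+1)} - \phi_i^{(k+1)} \right\|^2 = -\frac{s}{2n} \sum_i \left\| w^{(k+1)} - w + w - \phi_i^{(k+1)} \right\|^2 \quad (3) \]
\[ = -\frac{s}{2} \left\| w^{(k+1)} - w \right\|^2 - \frac{s}{2n} \sum_i \left\| w - \phi_i^{(k+1)} \right\|^2 - \frac{s}{n} \sum_i \left\langle w^{(k+1)} - w, w - \phi_i^{(k+1)} \right\rangle. \quad (4) \]

We expand the three terms on the right separately. For the first term:

\[ -\frac{s}{2} \left\| w^{(k+1)} - w \right\|^2 = -\frac{s}{2n^2} \|w - \phi_j\|^2 - \frac{1}{2 \alpha^2 s n^2} \|f_j'(\phi_j) - f_j'(w)\|^2 - \frac{1}{\alpha n^2} \langle f_j'(\phi_j) - f_j'(w), w - \phi_j \rangle. \quad (5) \]

For the second term of Equation 4, using $\phi_j^{(k+1)} = w$:

\[ -\frac{s}{2n} \sum_i \left\| w - \phi_i^{(k+1)} \right\|^2 = T_3 + \frac{s}{2n} \|w - \phi_j\|^2. \]

For the third term of Equation 4:

\[ -\frac{s}{n} \sum_i \left\langle w^{(k+1)} - w, w - \phi_i^{(k+1)} \right\rangle = -s \left\langle w^{(k+1)} - w, w - \bar{\phi} \right\rangle + \frac{s}{n} \left\langle w^{(k+1)} - w, w - \phi_j \right\rangle. \quad (6) \]

The second inner product term in Equation 6 becomes (using B8):

\[ \frac{s}{n} \left\langle w^{(k+1)} - w, w - \phi_j \right\rangle = \frac{s}{n^2} \|w - \phi_j\|^2 + \frac{1}{\alpha n^2} \langle f_j'(\phi_j) - f_j'(w), w - \phi_j \rangle. \]

Notice that the inner product term here cancels with the one in Equation 5. Now we can take expectations of each remaining term. Recall that $\mathbb{E}[w^{(k+1)}] - w = -\frac{1}{\alpha s n} f'(w)$, so the first inner product term in Equation 6 becomes:

\[ -s \, \mathbb{E}\left\langle w^{(k+1)} - w, w - \bar{\phi} \right\rangle = \frac{1}{\alpha n} \langle f'(w), w - \bar{\phi} \rangle. \]

All other terms do not simplify under expectations. So the result is:

\[ \mathbb{E}[T_3^{(k+1)}] - T_3 = \left( 1 + \frac{1}{n} \right) \frac{s}{2n^2} \sum_i \|w - \phi_i\|^2 + \frac{1}{\alpha n} \langle f'(w), w - \bar{\phi} \rangle - \frac{1}{2 \alpha^2 s n^3} \sum_i \|f_i'(\phi_i) - f_i'(w)\|^2. \]

Lemma 9. Between steps $k$ and $k+1$, the $T_4 = \frac{s}{2n} \sum_i \|\bar{\phi} - \phi_i\|^2$ term changes as follows:

\[ \mathbb{E}[T_4^{(k+1)}] - T_4 = -\frac{s}{2n^2} \sum_i \|\bar{\phi} - \phi_i\|^2 + \frac{s}{2n} \|\bar{\phi} - w\|^2 - \frac{s}{2n^3} \sum_i \|w - \phi_i\|^2. \]

Proof. Note that $\bar{\phi}^{(k+1)} - \bar{\phi} = \frac{1}{n}(w - \phi_j)$, so:

\[ T_4^{(k+1)} = \frac{s}{2n} \sum_i \left\| \bar{\phi} - \phi_i^{(k+1)} \right\|^2 + \frac{s}{2n^2} \|w - \phi_j\|^2 + \frac{s}{n} \left\langle \frac{1}{n}(w - \phi_j), \sum_i \left( \bar{\phi} - \phi_i^{(k+1)} \right) \right\rangle. \]

Now using $\sum_i \left( \bar{\phi} - \phi_i^{(k+1)} \right) = -(w - \phi_j)$ to simplify the inner product term:

\[ T_4^{(k+1)} = \frac{s}{2n} \sum_i \|\bar{\phi} - \phi_i\|^2 - \frac{s}{2n} \|\bar{\phi} - \phi_j\|^2 + \frac{s}{2n} \|\bar{\phi} - w\|^2 - \frac{s}{2n^2} \|w - \phi_j\|^2. \quad (7) \]

Taking expectations gives the result.

Lemma 10. Let $f \in S^{1,1}_{s,L}$. Then we have:

\[ f(x) \ge f(y) + \langle f'(y), x - y \rangle + \frac{1}{2(L - s)} \|f'(x) - f'(y)\|^2 + \frac{sL}{2(L - s)} \|y - x\|^2 + \frac{s}{L - s} \langle f'(x) - f'(y), y - x \rangle. \]

Proof. Define the function $g$ as $g(x) = f(x) - \frac{s}{2} \|x\|^2$. Then the gradient is $g'(x) = f'(x) - s x$. $g$ has a Lipschitz gradient with constant $L - s$. By convexity we have:

\[ g(x) \ge g(y) + \langle g'(y), x - y \rangle + \frac{1}{2(L - s)} \|g'(x) - g'(y)\|^2. \]

Now replacing $g$ with $f$:

\[ f(x) - \frac{s}{2} \|x\|^2 \ge f(y) - \frac{s}{2} \|y\|^2 + \langle f'(y) - s y, x - y \rangle + \frac{1}{2(L - s)} \|f'(x) - s x - f'(y) + s y\|^2. \]

Note that

\[ \frac{1}{2(L - s)} \|f'(x) - s x - f'(y) + s y\|^2 = \frac{1}{2(L - s)} \|f'(x) - f'(y)\|^2 + \frac{s^2}{2(L - s)} \|y - x\|^2 + \frac{s}{L - s} \langle f'(x) - f'(y), y - x \rangle, \]

so:

\[ f(x) \ge f(y) + \langle f'(y), x - y \rangle + \frac{1}{2(L - s)} \|f'(x) - f'(y)\|^2 + \frac{s}{2} \|x\|^2 - \frac{s}{2} \|y\|^2 + \frac{s^2}{2(L - s)} \|y - x\|^2 + \frac{s}{L - s} \langle f'(x) - f'(y), y - x \rangle - s \langle y, x - y \rangle. \]

Now using:

\[ \frac{s}{2} \|x\|^2 - s \langle y, x \rangle = \frac{s}{2} \|x - y\|^2 - \frac{s}{2} \|y\|^2, \]

we get:

\[ f(x) \ge f(y) + \langle f'(y), x - y \rangle + \frac{1}{2(L - s)} \|f'(x) - f'(y)\|^2 - s \|y\|^2 + \frac{s}{2} \|x - y\|^2 + \frac{s^2}{2(L - s)} \|x - y\|^2 + \frac{s}{L - s} \langle f'(x) - f'(y), y - x \rangle + s \langle y, y \rangle. \]

Note the norm $y$ terms cancel, and:

\[ \frac{s}{2} \|x - y\|^2 + \frac{s^2}{2(L - s)} \|x - y\|^2 = \frac{s(L - s) + s^2}{2(L - s)} \|x - y\|^2 = \frac{sL}{2(L - s)} \|x - y\|^2. \]

So:

\[ f(x) \ge f(y) + \langle f'(y), x - y \rangle + \frac{1}{2(L - s)} \|f'(x) - f'(y)\|^2 + \frac{sL}{2(L - s)} \|y - x\|^2 + \frac{s}{L - s} \langle f'(x) - f'(y), y - x \rangle. \]

Corollary 11. Take $f(x) = \frac{1}{n} \sum_i f_i(x)$, with the big data condition holding with constant $\beta$. Then for any $x$ and $\phi_i$ vectors:

\[ f(x) \ge \frac{1}{n} \sum_i f_i(\phi_i) + \frac{1}{n} \sum_i \langle f_i'(\phi_i), x - \phi_i \rangle + \frac{\beta}{2 s n^2} \sum_i \|f_i'(x) - f_i'(\phi_i)\|^2 + \frac{\beta L}{2 n^2} \sum_i \|x - \phi_i\|^2 + \frac{\beta}{n^2} \sum_i \langle f_i'(x) - f_i'(\phi_i), \phi_i - x \rangle. \]

Proof. We apply Lemma 10 to each $f_i$, but instead of using the actual constant $L$, we use $\frac{s n}{\beta} + s$, which under the big data assumption is larger than $L$:

\[ f_i(x) \ge f_i(\phi_i) + \langle f_i'(\phi_i), x - \phi_i \rangle + \frac{\beta}{2 s n} \|f_i'(x) - f_i'(\phi_i)\|^2 + \frac{\beta L}{2n} \|x - \phi_i\|^2 + \frac{\beta}{n} \langle f_i'(x) - f_i'(\phi_i), \phi_i - x \rangle. \]

Averaging over $i$ gives the result.

3 Lower complexity bounds

In this section we use the following technical assumption, as used in Nesterov (1998):

Assumption 1: An optimization method at step $k$ may only invoke the oracle with a point $x^{(k)}$ that is of the form:

\[ x^{(k)} = x^{(0)} + \sum_i a_i g^{(i)}, \]

where $g^{(i)}$ is the derivative returned by the oracle at step $i$, and $a_i \in \mathbb{R}$.

This assumption prevents an optimization method from just guessing the correct solution without doing any work. Virtually all optimization methods fall under this assumption.

Simple $\left(1 - \frac{1}{n}\right)^k$ bound

Any procedure that minimizes a sum of the form $f(w) = \frac{1}{n} \sum_i f_i(w)$ by uniform random access of $f_i$ is restricted by the requirement that it has to actually see each term at least once in order to find the minimum. This leads to a $\left(1 - \frac{1}{n}\right)^k$ rate in expectation. We now formalize such an argument. We will work in $\mathbb{R}^n$, matching the dimensionality of the problem to the number of terms in the summation.

Theorem 12. For any $f \in FS^{1,1}_{1,n,n}(\mathbb{R}^n)$, we have that a $k$ step optimization procedure gives:

\[ \mathbb{E}[f(w)] - f(w^*) \ge \left( 1 - \frac{1}{n} \right)^k \left( f(w^{(0)}) - f(w^*) \right). \]

Proof. We will exhibit a simple worst-case problem. Without loss of generality we assume that the first oracle access by the optimization procedure is at $w = 0$. In any other case, we shift our space in the following argument appropriately.

Let $f(w) = \frac{1}{n} \sum_i \left[ \frac{n}{2} (w_i - 1)^2 + \frac{1}{2} \|w\|^2 \right]$. Then clearly the solution is $w_i = \frac{1}{2}$ for each $i$, with minimum of $f(w^*) = \frac{n}{4}$. For $w = 0$ we have $f(0) = \frac{n}{2}$. Since the derivative of each $f_j$ is 0 on the $i$th component if we have not yet seen $f_i$, the value of each $w_i$ remains 0 unless term $i$ has been seen.

Let $v^{(k)}$ be the number of unique terms we have not seen up to step $k$. Between steps $k$ and $k+1$, $v$ decreases by 1 with probability $\frac{v}{n}$ and stays the same otherwise. So

\[ \mathbb{E}[v^{(k+1)} \mid v^{(k)}] = v^{(k)} - \frac{v^{(k)}}{n} = \left( 1 - \frac{1}{n} \right) v^{(k)}. \]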

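So we may define the sequence $X^{(k)} = \left( 1 - \frac{1}{n} \right)^{-k} v^{(k)}$, which is then a martingale with respect to $v$, as

\[ \mathbb{E}[X^{(k+1)} \mid v^{(k)}] = \left( 1 - \frac{1}{n} \right)^{-k-1} \mathbb{E}[v^{(k+1)} \mid v^{(k)}] = \left( 1 - \frac{1}{n} \right)^{-k} v^{(k)} = X^{(k)}. \]

Now since $k$ is chosen in advance, stopping time theory gives that $\mathbb{E}[X^{(k)}] = \mathbb{E}[X^{(0)}]$. So

\[ \left( 1 - \frac{1}{n} \right)^{-k} \mathbb{E}[v^{(k)}] = n, \quad \text{therefore} \quad \mathbb{E}[v^{(k)}] = \left( 1 - \frac{1}{n} \right)^k n. \]

By Assumption 1, the function can be at most minimized over the dimensions seen up to step $k$. The seen dimensions contribute a value of $\frac{1}{4}$ and the unseen terms $\frac{1}{2}$ to the function. So we have that:

\[ \mathbb{E}[f(w^{(k)})] - f(w^*) \ge \frac{1}{4} \left( n - \mathbb{E}[v^{(k)}] \right) + \frac{1}{2} \mathbb{E}[v^{(k)}] - \frac{n}{4} = \frac{1}{4} \mathbb{E}[v^{(k)}] = \frac{n}{4} \left( 1 - \frac{1}{n} \right)^k = \left( 1 - \frac{1}{n} \right)^k \left[ f(w^{(0)}) - f(w^*) \right]. \]

4 Minimization of non-strongly convex finite sums

It is known that the class of convex, continuous and differentiable problems with $L$-Lipschitz continuous derivatives, $F^{1,1}_L(\mathbb{R}^m)$, has the following lower complexity bound when $k < m$:

\[ f(x^{(k)}) - f(x^*) \ge \frac{L \left\| x^{(0)} - x^* \right\|^2}{8(k + 1)^2}, \]

which is proved via explicit construction of a worst-case function where it holds with equality. Let this worst case function be denoted $h^{(k)}$ at step $k$. We will show that the same bound applies for the finite-sum case, on a per pass equivalent basis, by a simple construction.

Theorem 13. The following lower bound holds for $k$ a multiple of $n$:

\[ f(x^{(k)}) - f(x^*) \ge \frac{L \left\| x^{(0)} - x^* \right\|^2}{8 \left( \frac{k}{n} + 1 \right)^2}, \]

when $f$ is a finite sum of $n$ terms $f(x) = \frac{1}{n} \sum_i f_i(x)$, with each $f_i \in F^{1,1}_L(\mathbb{R}^m)$, and with $m > kn$, under the oracle model where the optimization method may choose the index $i$ to access at each step.

Proof. Let $h_i$ be a copy of $h^{(k)}$ redefined to be on the subset of dimensions $i + (j - 1)n$, for $j = 1 \dots k$, or in other words, $h_i^{(k)}(x) = h^{(k)}([x_i, x_{i+n}, \dots, x_{i+(j-1)n}, \dots])$. Then we will use:

\[ f^{(k)}(x) = \frac{1}{n} \sum_i h_i^{(k)}(x) \]

as a worst case function for step $k$. Since the derivatives are orthogonal between $h_i$ and $h_j$ for $i \ne j$, by Assumption 1, the bound on $h_i^{(k)}(x^{(k)}) - h_i^{(k)}(x^*)$ depends only on the number of times the oracle has been invoked with index $i$, for each $i$. Let this be denoted $c_i$. Then we have that:

\[ f(x^{(k)}) - f(x^*) \ge \frac{L}{8n} \sum_i \frac{\left\| x^{(0)} - x^* \right\|^2_{(i)}}{(c_i + 1)^2}, \]

where $\| \cdot \|_{(i)}$ is the norm on the dimensions $i + (j - 1)n$ for $j = 1 \dots k$. We can combine these norms into a regular Euclidean norm:

\[ f(x^{(k)}) - f(x^*) \ge \frac{L \left\| x^{(0)} - x^* \right\|^2}{8n} \sum_i \frac{1}{(c_i + 1)^2}. \]

Now notice that $\sum_i \frac{1}{(c_i + 1)^2}$ under the constraint $\sum_i c_i = k$ is minimized when each $c_i = \frac{k}{n}$. So we have:

\[ f(x^{(k)}) - f(x^*) \ge \frac{L \left\| x^{(0)} - x^* \right\|^2}{8n} \cdot \frac{n}{\left( \frac{k}{n} + 1 \right)^2} = \frac{L \left\| x^{(0)} - x^* \right\|^2}{8 \left( \frac{k}{n} + 1 \right)^2}, \]

which is the same lower bound as for $k/n$ iterations of an optimization method on $f$ directly.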
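The counting step in the simple $\left(1 - \frac{1}{n}\right)^k$ bound, namely that the expected number of terms not yet seen after $k$ uniform draws is $n\left(1 - \frac{1}{n}\right)^k$, can be checked exactly by propagating the distribution of the unseen count. This is a small sketch; the function name `expected_unseen` is hypothetical.

```python
def expected_unseen(n, k):
    """Exact expected number of never-sampled terms after k uniform draws from n."""
    dist = [0.0] * (n + 1)   # dist[v] = probability that v terms are still unseen
    dist[n] = 1.0            # before any draw, all n terms are unseen
    for _ in range(k):
        nxt = [0.0] * (n + 1)
        for v, p in enumerate(dist):
            if p == 0.0:
                continue
            nxt[v] += p * (1.0 - v / n)      # the draw hits an already-seen term
            if v > 0:
                nxt[v - 1] += p * (v / n)    # the draw hits a new term
        dist = nxt
    return sum(v * p for v, p in enumerate(dist))
```

Comparing `expected_unseen(n, k)` with `n * (1 - 1 / n) ** k` for a few small values of `n` and `k` reproduces the closed form up to floating point rounding.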