Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems

Aaron J. Defazio (AARON.DEFAZIO@ANU.EDU.AU)
Tibério S. Caetano (TIBERIO.CAETANO@NICTA.COM.AU)
Justin Domke (JUSTIN.DOMKE@NICTA.COM.AU)
NICTA and Australian National University

Abstract

Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amenable to a sampling without replacement scheme that in practice gives further speed-ups. We give empirical results showing state of the art performance.

1 Introduction

Many recent advances in the theory and practice of numerical optimization have come from the recognition and exploitation of structure. Perhaps the most common structure is that of finite sums. In machine learning, when applying empirical risk minimization we almost always end up with an optimization problem involving the minimization of a sum with one term per data point. The recently developed SAG algorithm (Schmidt et al., 2013) has shown that even with this simple form of structure, as long as we have sufficiently many data points, we are able to do significantly better than black-box optimization techniques in expectation for smooth strongly convex problems. In practical terms, the difference is often a factor of 10 or more.

The requirement of sufficiently large datasets is fundamental to these methods. We describe the precise form of this as the big data condition. Essentially, it is the requirement that the amount of data is on the same order as the condition number of the problem. The strong convexity requirement is not as onerous: strong convexity holds in the common case where a quadratic regularizer is used together with a convex loss.

(Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the authors.)

The SAG method and the Finito method we describe in this work are similar in their form to stochastic gradient descent methods, but with one crucial difference: they store additional information about each data point during optimization. Essentially, when
they revisit a data point, they do not treat it as a novel piece of information every time.

Methods for the minimization of finite sums have classically been known as incremental gradient methods (Bertsekas, 2010). The proof techniques used in SAG differ fundamentally from those used on other incremental gradient methods though. The difference hinges on the requirement that data be accessed in a randomized order. SAG does not work when data is accessed sequentially in each epoch, so any proof technique which shows even non-divergence for sequential access cannot be applied.

A remarkable property of Finito is the tightness of the theoretical bounds compared to the practical performance of the algorithm. The practical convergence rate seen is at most twice as good as the theoretically predicted rate. This sets it apart from methods such as LBFGS, where the empirical performance is often much better than the relatively weak theoretical convergence rates would suggest.

The lack of tuning required also sets Finito apart from stochastic gradient descent (SGD). In order to get good performance out of SGD, substantial laborious tuning of multiple constants has traditionally been required. A multitude of heuristics have been developed to help choose these constants, or adapt them as the method progresses. Such heuristics are more complex than Finito, and do not have the same theoretical backing. SGD has application outside of convex problems of course, and we do not propose that Finito will replace SGD in those settings. Even on strongly convex problems, SGD does not exhibit linear convergence like Finito does.
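The storage scheme just described, one saved gradient per data point that is revisited rather than treated as novel, can be sketched in a few lines. The class and names below are hypothetical, not from any released implementation: a table of per-point gradients plus a running sum, so that replacing one entry costs O(m) while the mean of all n stored gradients also stays available in O(m).

```python
import numpy as np

# Hypothetical sketch: the per-data-point gradient table that SAG/Finito-style
# methods maintain, with a running sum so one update costs O(m), not O(nm).
class GradientTable:
    def __init__(self, n, m):
        self.g = np.zeros((n, m))   # stored gradient for each of the n points
        self.g_sum = np.zeros(m)    # running sum of all stored gradients

    def update(self, i, new_grad):
        # Revisiting point i replaces its stored gradient in place.
        self.g_sum += new_grad - self.g[i]
        self.g[i] = new_grad

    def mean(self):
        return self.g_sum / len(self.g)
```

An SGD step would use only the freshly computed gradient; methods in this class instead combine the fresh gradient with the n − 1 stored ones via `mean()`.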
There are many similarities between SAG, Finito and stochastic dual coordinate descent (SDCA) methods (Shalev-Shwartz & Zhang, 2013). SDCA is only applicable to linear predictors. When it can be applied, it has linear convergence with theoretical rates similar to SAG and Finito.

2 Algorithm

We consider differentiable convex functions of the form

f(w) = (1/n) Σ_{i=1}^{n} f_i(w).

We assume that each f_i has Lipschitz continuous gradients with constant L and is strongly convex with constant s. Clearly if we allow n = 1, virtually all smooth, strongly convex problems are included. So instead, we will restrict ourselves to problems satisfying the big data condition.

Big data condition: Functions of the above form satisfy the big data condition with constant β if

n ≥ β L/s.

Typical values of β are 1 to 8. In plain language, we are considering problems where the amount of data is on the same order as the condition number (L/s) of the problem.

Additional notation: We superscript with (k) to denote the value of the scripted quantity at iteration k. We omit the superscript on summations, and subscript with i, with the implication that indexing starts at 1. When we use separate arguments for each f_i, we denote them φ_i. Let φ̄^(k) denote the average φ̄^(k) = (1/n) Σ_i φ_i^(k). Our step length constant, which depends on β, is denoted α. We use angle bracket notation ⟨·,·⟩ for dot products.

The Finito algorithm: We start with a table of known φ_i^(0) values, and a table of known gradients f_i'(φ_i^(0)), for each i. We will update these two tables during the course of the algorithm. The step for iteration k is as follows:

1. Update w using the step:

   w^(k) = φ̄^(k) − (1/(αsn)) Σ_i f_i'(φ_i^(k)).

2. Pick an index j uniformly at random, or using without-replacement sampling as discussed in Section 3.

3. Set φ_j^(k+1) = w^(k) in the table, and leave the other variables the same (φ_i^(k+1) = φ_i^(k) for i ≠ j).

4. Calculate and store f_j'(φ_j^(k+1)) in the table.

Our main theoretical result is a convergence rate proof for this method.

Theorem 1. When the big data condition holds with β = 2, α = 2 may be used. In that setting, if we have initialized all φ_i^(0) the same, the convergence rate is:

E[f(φ̄^(k))] − f(w*) ≤ (3/(4s)) (1 − 1/(2n))^k ‖f'(φ̄^(0))‖².

See Section 5 for the proof. In contrast, SAG achieves a (1 − 1/(8n)) rate when β = 2. Note that on a per
epoch basis, the Finito rate is (1 − 1/(2n))^n ≈ exp(−1/2) ≈ 0.606. To put that in context, 10 epochs will see the error bound reduced by more than 148x.

One notable feature of our method is the fixed step size. In typical machine learning problems, the strong convexity constant s is given by the strength constant of the quadratic regularizer used. Since this is a known quantity, as long as the big data condition holds, α = 2 may be used without any tuning or adjustment of Finito required. This lack of tuning is a major feature of Finito.

In cases where the big data condition does not hold, we conjecture that the step size must be reduced proportionally to the violation of the big data condition. In practice, the most effective step size can be found by testing a number of step sizes, as is usually done with other stochastic optimization methods. A simple way of satisfying the big data condition is to duplicate your data enough times so that the condition holds. This is not as effective in practice as just changing the step size, and of course it uses more memory; however, it does fall within the current theory.

Another difference compared to the SAG method is that we store both gradients and points φ_i. We do not actually need twice as much memory however, as they can be stored summed together. In particular, we store the quantities

p_i = f_i'(φ_i) − αs φ_i,

and use the update rule w^(k) = −(1/(αsn)) Σ_i p_i^(k). This trick does not work when step lengths are adjusted during optimization however. The storage of φ_i is also a disadvantage when the gradients f_i'(φ_i) are sparse but the φ_i are not sparse, as it can cause significant additional memory usage. We do not recommend the usage of Finito when gradients are sparse.

The SAG algorithm differs from Finito only in the w update
and step lengths:

w^(k) = w^(k−1) − (1/(16Ln)) Σ_i f_i'(φ_i^(k)).

3 Randomness is key

By far the most interesting aspect of the SAG and Finito methods is the random choice of index at each iteration. We are not in an online setting, so there is no inherent randomness in the problem. Yet it seems that a randomized method is required. Neither method works in practice when the same ordering is used in each pass, or in fact with any non-random access scheme we have tried. It is hard to emphasize enough the importance of randomness here. The technique of pre-permuting the data, then doing in-order passes after that, also does not work. Reducing the step size in SAG or Finito by 1 or 2 orders of magnitude does not fix the convergence issues either.

Other methods, such as standard SGD, have been noted by various authors to exhibit speed-ups when random sampling is used instead of in-order passes, but the differences are not as extreme as convergence vs. non-convergence. Perhaps the most similar problem is that of coordinate descent on smooth convex functions. Coordinate descent cannot diverge when non-random orderings are used, but convergence rates are substantially worse in the non-randomized setting (Nesterov, 2010; Richtarik & Takac, 2011).

Reducing the step size by a much larger amount, namely by a factor of n, does allow for non-randomized orderings to be used. This gives an extremely slow method however. This is the case covered by MISO (Mairal, 2013). A similar reduction in step size gives convergence under non-randomized orderings for SAG also. Convergence rates for incremental sub-gradient methods with a variety of orderings appear in the literature also (Nedic & Bertsekas, 2000).

Sampling without replacement is much faster. Other sampling schemes, such as sampling without replacement, should be considered. In detail, we mean the case where each pass over the data is a set of sampling without replacement steps, which continue until no data remains, after which another pass starts afresh. We call this the permuted case for simplicity, as it is the same as re-permuting the data after each pass. In practice, this approach does not give any speedup with SAG; however, it works spectacularly well with Finito. We see
speedups of up to a factor of two using this approach. This is one of the major differences in practice between SAG and Finito. We should note that we have no theory to support this case however. We are not aware of any analysis that proves faster convergence rates of any optimization method under a sampling without replacement scheme. An interesting discussion of SGD under without-replacement sampling appears in Recht & Re (2012). The SDCA method is also sometimes used with a permuted ordering (Shalev-Shwartz & Zhang, 2013); our experiments in Section 7 show that this sometimes results in a large speedup over uniform random sampling, although it does not appear to be as reliable as with Finito.

4 Proximal variant

We now consider composite problems of the form

f(w) = (1/n) Σ_i f_i(w) + λ r(w),

where r is convex but not necessarily smooth or strongly convex. Such problems are often addressed using proximal algorithms, particularly when the proximal operator for r,

prox^r_λ(z) = argmin_x { (1/2) ‖x − z‖² + λ r(x) },

has a closed form solution. An example would be the use of L1 regularization. We now describe the Finito update for this setting. First notice that when we set w in the Finito method, it can be interpreted as minimizing the quantity

B(x) = (1/n) Σ_i [ f_i(φ_i) + ⟨f_i'(φ_i), x − φ_i⟩ ] + (αs/2) ‖x − φ̄‖²,

with respect to x, for fixed φ_i. This is related to the upper bound minimized by MISO, where in place of αs the constant L is instead used. It is straightforward to modify this for the composite case:

B^{λr}(x) = λ r(x) + (1/n) Σ_i [ f_i(φ_i) + ⟨f_i'(φ_i), x − φ_i⟩ ] + (αs/2) ‖x − φ̄‖².

The minimizer of the modified B^{λr} can be expressed using the proximal operator as:

w = prox^r_{λ/(αs)} ( φ̄ − (1/(αsn)) Σ_i f_i'(φ_i) ).

This strongly resembles the update in the standard proximal gradient descent setting, which for a step size of 1/L is

w = prox^r_{λ/L} ( w^(k−1) − (1/L) f'(w^(k−1)) ).

We have not yet developed any theory supporting the proximal variant of Finito, although empirical evidence suggests it has the same convergence rate as the non-proximal case.
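Putting the pieces of Sections 2 and 4 together, the update is short enough to sketch in full. Everything below is an illustrative assumption rather than the paper's code: the component losses are hypothetical ridge-regression terms f_i(w) = (1/2)(⟨a_i, w⟩ − b_i)² + (s/2)‖w‖², α = 2 is fixed, and the optional L1 term is handled by the soft-thresholding proximal operator.

```python
import numpy as np

def soft_threshold(z, t):
    # Closed-form prox of t*||.||_1, used for the optional L1 term.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def finito(A, b, s, alpha=2.0, lam=0.0, epochs=100, permute=True, seed=0):
    """Sketch of the Finito update on hypothetical components
    f_i(w) = 0.5*(a_i.w - b_i)^2 + (s/2)*||w||^2; assumes the big data
    condition n >= 2L/s holds so that alpha = 2 is admissible."""
    rng = np.random.default_rng(seed)
    n, m = A.shape

    def grad(i, w):
        return (A[i] @ w - b[i]) * A[i] + s * w

    phi = np.zeros((n, m))                              # table of phi_i
    g = np.vstack([grad(i, phi[i]) for i in range(n)])  # table of f_i'(phi_i)
    phi_sum, g_sum = phi.sum(axis=0), g.sum(axis=0)     # running sums
    for _ in range(epochs):
        # Permuted (without-replacement) or uniform-with-replacement sampling.
        order = rng.permutation(n) if permute else rng.integers(0, n, n)
        for j in order:
            # Step 1: w = mean(phi) - (1/(alpha*s*n)) * sum of stored gradients,
            # wrapped in prox_{lam/(alpha*s)} when an L1 term is present.
            w = phi_sum / n - g_sum / (alpha * s * n)
            if lam > 0.0:
                w = soft_threshold(w, lam / (alpha * s))
            # Steps 3-4: overwrite entry j in both tables, keeping sums in sync.
            phi_sum += w - phi[j]
            phi[j] = w
            gj = grad(j, w)
            g_sum += gj - g[j]
            g[j] = gj
    w = phi_sum / n - g_sum / (alpha * s * n)
    return soft_threshold(w, lam / (alpha * s)) if lam > 0.0 else w
```

With lam = 0 and this quadratic loss, the result can be checked against the closed-form ridge solution (AᵀA + nsI)⁻¹Aᵀb; the permute flag switches between the two sampling schemes discussed in Section 3.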
5 Convergence proof

We start by stating two simple lemmas. All expectations in the following are over the choice of index j at step k. Quantities without superscripts are at their values at iteration k.

Lemma 1. The expected step is

E[w^(k+1)] − w = −(1/(αsn)) f'(w),

i.e. the w step is a gradient descent step in expectation, with step length 1/(αsn). A similar equality also holds for SGD, but not for SAG.

Proof.

E[w^(k+1)] − w = E[ (1/n)(w − φ_j) − (1/(αsn)) ( f_j'(w) − f_j'(φ_j) ) ]
 = (1/n)(w − φ̄) − (1/(αsn)) f'(w) + (1/(αsn²)) Σ_i f_i'(φ_i).

Now simplify (1/n)(w − φ̄) as −(1/(αsn²)) Σ_i f_i'(φ_i), so the only term that remains is −(1/(αsn)) f'(w). ∎

Lemma 2 (decomposition of variance). We can decompose (1/n) Σ_i ‖w − φ_i‖² as

(1/n) Σ_i ‖w − φ_i‖² = ‖w − φ̄‖² + (1/n) Σ_i ‖φ̄ − φ_i‖².

Proof.

(1/n) Σ_i ‖w − φ_i‖² = (1/n) Σ_i ‖(w − φ̄) + (φ̄ − φ_i)‖²
 = ‖w − φ̄‖² + (2/n) Σ_i ⟨w − φ̄, φ̄ − φ_i⟩ + (1/n) Σ_i ‖φ̄ − φ_i‖²
 = ‖w − φ̄‖² + (1/n) Σ_i ‖φ̄ − φ_i‖²,

since Σ_i (φ̄ − φ_i) = 0. ∎

Main proof. Our proof proceeds by construction of a Lyapunov function T; that is, a function that bounds a quantity of interest, and that decreases each iteration in expectation. Our Lyapunov function T = T₁ + T₂ + T₃ + T₄ is composed of the sum of the following terms:

T₁ = f(φ̄),
T₂ = −(1/n) Σ_i [ f_i(φ_i) + ⟨f_i'(φ_i), w − φ_i⟩ ],
T₃ = −(s/2) ‖w − φ̄‖²,
T₄ = −(s/(2n)) Σ_i ‖φ̄ − φ_i‖².

We now state how each term changes between steps k and k+1. Proofs are found in the appendix in the supplementary material:

E[T₁^(k+1)] − T₁ ≤ (1/n) ⟨f'(φ̄), w − φ̄⟩ + (L/(2n³)) Σ_i ‖w − φ_i‖²,

E[T₂^(k+1)] − T₂ ≤ −(1/n) T₂ − (1/n) f(w) + (1/(2α²βsn³)) Σ_i ‖f_i'(w) − f_i'(φ_i)‖²
 + (1/n) ⟨φ̄ − w, f'(w)⟩ + (3/(2n³)) Σ_i ⟨f_i'(w) − f_i'(φ_i), w − φ_i⟩,

E[T₃^(k+1)] − T₃ = −(2/n) T₃ + (1/(αn)) ⟨f'(w), w − φ̄⟩ − (1/(2α²sn³)) Σ_i ‖f_i'(φ_i) − f_i'(w)‖²,

E[T₄^(k+1)] − T₄ = (s/(2n³)) Σ_i ‖w − φ_i‖² + (s/(2n²)) Σ_i ‖φ̄ − φ_i‖² − (s/(2n)) ‖φ̄ − w‖².

Theorem 2. Between steps k and k+1, if

1/(2α²) + 1/(2α²β) + 1/(2β) − 1/2 ≤ 0,

with α ≥ 1 and β ≥ 1, then E[T^(k+1)] − T ≤ −(1/(αn)) T.

Proof. We take the three lemmas above and group like terms to get:

E[T^(k+1)] − T ≤ (1/n) ⟨f'(φ̄), w − φ̄⟩ − (1/n) T₂ − (1/n) f(w) + (1 − 1/α)(1/n) ⟨f'(w), φ̄ − w⟩
 + ((L + s)/(2n³)) Σ_i ‖w − φ_i‖² + (3/(2n³)) Σ_i ⟨f_i'(w) − f_i'(φ_i), w − φ_i⟩
 − (1 − 1/β)(1/(2α²sn³)) Σ_i ‖f_i'(φ_i) − f_i'(w)‖²
 + (s/(2n)) ‖w − φ̄‖² + (s/(2n²)) Σ_i ‖φ̄ − φ_i‖².

Next we cancel part of the first line using

(1/(αn)) ⟨f'(φ̄), w − φ̄⟩ ≤ (1/(αn)) f(w) − (1/(αn)) f(φ̄) − (s/(2αn)) ‖w − φ̄‖²,

based on (B.3) in the Appendix. We then pull the terms occurring in −(1/(αn)) T together, giving

E[T^(k+1)] − T ≤ −(1/(αn)) T − (1 − 1/α)(1/n) [ f(w) + T₂ − ⟨f'(φ̄) − f'(w), w − φ̄⟩ ]
 + ((L + s)/(2n³)) Σ_i ‖w − φ_i‖² − (s/(2αn)) ‖w − φ̄‖²
 + (3/(2n³)) Σ_i ⟨f_i'(w) − f_i'(φ_i), w − φ_i⟩ − (1 − 1/β)(1/(2α²sn³)) Σ_i ‖f_i'(φ_i) − f_i'(w)‖²
 + (1 − 1/α)(s/(2n)) ‖w − φ̄‖² + (1 − 1/α)(s/(2n²)) Σ_i ‖φ̄ − φ_i‖².

Next we use the standard inequality (B.5)

(1 − 1/α)(1/n) ⟨f'(φ̄) − f'(w), w − φ̄⟩ ≤ −(1 − 1/α)(s/n) ‖w − φ̄‖²,

which changes the bottom row to

−(1 − 1/α)(s/(2n)) ‖w − φ̄‖² − (1 − 1/α)(s/(2n²)) Σ_i ‖φ̄ − φ_i‖².

These two terms can then be grouped using Lemma 2, and combined with the remaining s terms (which are then non-positive and may be dropped), to give

E[T^(k+1)] − T ≤ −(1/(αn)) T + (L/(2n³)) Σ_i ‖w − φ_i‖² − (1 − 1/α)(1/n) [ f(w) + T₂ ]
 + (3/(2n³)) Σ_i ⟨f_i'(w) − f_i'(φ_i), w − φ_i⟩ − (1 − 1/β)(1/(2α²sn³)) Σ_i ‖f_i'(φ_i) − f_i'(w)‖².

We use the following inequality (Corollary 6 in the Appendix) to cancel against the Σ_i ‖w − φ_i‖² term:

(1/(βn)) [ f(w) + T₂ ] + (3/(2n³)) Σ_i ⟨f_i'(w) − f_i'(φ_i), w − φ_i⟩
 ≥ (L/(2n³)) Σ_i ‖w − φ_i‖² − (1/(2sn³)) Σ_i ‖f_i'(w) − f_i'(φ_i)‖²,

and then apply the following similar inequality (B.7 in the Appendix) to partially cancel Σ_i ‖f_i'(φ_i) − f_i'(w)‖²:

(1 − 1/α)(1/(βn)) [ f(w) + T₂ ] ≥ (1 − 1/α)(1/(2β²sn³)) Σ_i ‖f_i'(φ_i) − f_i'(w)‖².

Leaving us with

E[T^(k+1)] − T ≤ −(1/(αn)) T + ( 1/(2α²) + 1/(2α²β) + 1/(2β) − 1/2 ) (1/(sn³)) Σ_i ‖f_i'(φ_i) − f_i'(w)‖².

The remaining gradient norm term is non-positive under the conditions specified in our assumptions. ∎

Theorem 3. The Lyapunov function bounds f(φ̄) − f(w*) as follows:

f(φ̄^(k)) − f(w*) ≤ α T^(k).

Proof. Consider the following function, which we will call R(x):

R(x) = (1/n) Σ_i [ f_i(φ_i) + ⟨f_i'(φ_i), x − φ_i⟩ + (s/2) ‖x − φ_i‖² ].

When evaluated at its minimum with respect to x, which we denote ŵ = φ̄ − (1/(sn)) Σ_i f_i'(φ_i), it is a lower bound on f(w*), by strong convexity. However, we are evaluating at w = φ̄ − (1/(αsn)) Σ_i f_i'(φ_i) instead in the (negated) Lyapunov function. R is convex with respect to x, so by definition

R(w) = R( (1 − 1/α) φ̄ + (1/α) ŵ ) ≤ (1 − 1/α) R(φ̄) + (1/α) R(ŵ).

Therefore, by the lower bounding property,

f(φ̄) − R(w) ≥ f(φ̄) − (1 − 1/α) R(φ̄) − (1/α) R(ŵ)
 ≥ f(φ̄) − (1 − 1/α) f(φ̄) − (1/α) f(w*)
 = (1/α) ( f(φ̄) − f(w*) ).

Now note that T = f(φ̄) − R(w). So f(φ̄) − f(w*) ≤ α T. ∎

Theorem 4. If the Finito method is initialized with all φ_i^(0) the same, and the assumptions of Theorem 2 hold, then the
convergence rate is:

E[f(φ̄^(k))] − f(w*) ≤ (c/s) (1 − 1/(αn))^k ‖f'(φ̄^(0))‖²,

with c = 1 − 1/(2α).

Proof. By unrolling Theorem 2, we get

E[T^(k)] ≤ (1 − 1/(αn))^k T^(0).

Now using Theorem 3,

E[f(φ̄^(k)) − f(w*)] ≤ α (1 − 1/(αn))^k T^(0).

We need to control T^(0) also. Since we are assuming that all φ_i^(0) start the same, we have that

T^(0) = f(φ̄^(0)) − (1/n) Σ_i [ f_i(φ_i^(0)) + ⟨f_i'(φ_i^(0)), w^(0) − φ_i^(0)⟩ ] − (s/2) ‖w^(0) − φ̄^(0)‖²
 = −⟨f'(φ̄^(0)), w^(0) − φ̄^(0)⟩ − (s/2) ‖w^(0) − φ̄^(0)‖²
 = (1/(αs)) ‖f'(φ̄^(0))‖² − (1/(2α²s)) ‖f'(φ̄^(0))‖²
 = (1/(αs)) (1 − 1/(2α)) ‖f'(φ̄^(0))‖²,

using w^(0) − φ̄^(0) = −(1/(αs)) f'(φ̄^(0)). ∎

6 Lower complexity bounds and exploiting problem structure

The theory for the class of smooth, strongly convex problems with Lipschitz continuous gradients under first order optimization methods, known as S^{1,1}_{s,L}, is well developed. These results require the technical condition that the dimensionality of the input space R^m is much larger than the number of iterations we will take. For simplicity, we will assume this is the case in the following discussions. It is known that problems exist in S^{1,1}_{s,L} for which the iterate convergence rate is bounded by

‖w^(k) − w*‖ ≥ ( (√(L/s) − 1) / (√(L/s) + 1) )^k ‖w^(0) − w*‖.

In fact, when s and L are known in advance, this rate is achieved up to a small constant factor by several methods, most notably by Nesterov's accelerated gradient descent method (Nesterov, 1988; Nesterov, 1998). In order to achieve convergence rates faster than this, additional assumptions must be made on the class of functions considered. Recent advances have shown that all that is required to achieve significantly faster rates is a finite sum structure, such as in our problem setup.

When the big data condition holds, our method achieves a rate of ≈ 0.6065 per epoch in expectation. This rate only depends on the condition number indirectly, through the big data condition. For example, with L/s = 1,000,000, the fastest possible rate for a black box method is ≈ 0.996, whereas Finito achieves a rate of 0.6065 in expectation per epoch for n = 4,000,000, or 124x faster. The required amount of data is not unusual in modern machine learning problems. In practice, when quasi-Newton methods are used instead of accelerated methods, a speedup of 10-20x is more common.

6.1 Oracle class

We now describe the (stochastic) oracle class FS^{1,1}_{s,L}(R^m) for which SAG and Finito most naturally fit.
Function class: f(w) = (1/n) Σ_{i=1}^{n} f_i(w), with f_i ∈ S^{1,1}_{s,L}(R^m).

Oracle: Each query takes a point x ∈ R^m, and returns j, f_j(x) and f_j'(x), with j chosen uniformly at random.

Accuracy: Find w such that E[‖w^(k) − w*‖²] ≤ ε.

The main choice made in formulating this definition is putting the random choice in the oracle. This restricts the methods allowed quite strongly. The alternative case, where the index j is an input to the oracle in addition to x, is also interesting. Assuming that the method has access to a source of true random indices, we call that class DS^{1,1}_{s,L}(R^m). In Section 3 we discuss empirical evidence that suggests that faster rates are possible in DS^{1,1}_{s,L}(R^m) than for FS^{1,1}_{s,L}(R^m).

It should first be noted that there is a trivial lower bound rate for f in SS^{1,1}_{s,L,β}(R^m) of a (1 − 1/n) reduction per step. It is not clear if this can be achieved for any finite β. Finito is only a factor of 2 off this rate, namely (1 − 1/(2n)) at β = 2, and asymptotes towards this rate for very large β. SDCA, while not applicable to all problems in this class, also achieves the (1 − 1/n) rate asymptotically.

Another case to consider is the smooth convex but non-strongly convex setting. We still assume Lipschitz continuous gradients. In this setting we will show that, for sufficiently high dimensional input spaces, the (non-stochastic) lower complexity bound is the same for the finite sum case, and cannot be better than that given by treating f as a single black box function. The full proof is in the Appendix, but the idea is as follows: when the f_i are not strongly convex, we can choose them such that they do not interact with each other, as long as the
dimensionality is much larger than nk. More precisely, we may choose them so that for any x and y and any i ≠ j,

⟨f_i'(x), f_j'(y)⟩ = 0

holds. When the functions do not interact, no optimization scheme may reduce the iterate error faster than by just handling each f_i separately. Doing so in an in-order fashion gives the same rate as just treating f using a black box method. For strongly convex f_i, it is not possible for them to not interact in the above sense. By definition, strong convexity requires a quadratic component in each f_i that acts on all dimensions.

7 Experiments

In this section we compare Finito, SAG, SDCA and LBFGS. We only consider problems where the regularizer is large enough so that the big data condition holds, as this is the case our theory supports. However, in practice our method can be used with smaller step sizes in the more general case, in much the same way as SAG.

Since we do not know the Lipschitz constant for these problems exactly, the SAG method was run for a variety of step sizes, with the one that gave the fastest rate of convergence plotted. The best step-size for SAG is usually not what the theory suggests; Schmidt et al. (2013) suggest using 1/L instead of the theoretical 1/(16L). For Finito, we find that using α = 2 gives the fastest rate when the big data condition holds for any β > 1. This is the step suggested by our theory when β = 2. Interestingly, reducing α to 1 does not improve the convergence rate; instead, we see no further improvement in our experiments.

For both SAG and Finito we used a differing step rule than suggested by the theory for the first pass. For Finito, during the first pass, since we do not have derivatives for each φ_i yet, we simply sum over the k terms seen so far:

w^(k) = (1/k) Σ_{i=1}^{k} φ_i^(k) − (1/(αsk)) Σ_{i=1}^{k} f_i'(φ_i^(k)),

where we process data points in index order for the first pass only. A similar trick is suggested by Schmidt et al. (2013) for SAG.

Since SDCA only applies to linear predictors, we are restricted in possible test problems. We choose log loss for 3 binary classification datasets, and quadratic loss for 2 regression tasks. For classification, we tested on the ijcnn1 and covtype datasets, as well as MNIST, classifying digits 0-4 against 5-9. For
regression, we choose two datasets from the UCI repository: the million song year regression dataset, and the slice-localization dataset. (The classification datasets were obtained from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html and http://yann.lecun.com/exdb/mnist/.) The training portions of the datasets are of size 5.3 × 10⁵, 5.0 × 10⁴, 6.0 × 10⁴, 4.7 × 10⁵ and 5.3 × 10⁴ respectively.

Figure 6 shows the results of our experiments. Firstly, we can see that LBFGS is not competitive with any of the incremental gradient methods considered. Secondly, the non-permuted SAG, Finito and SDCA often converge at very similar rates. The observed differences are usually down to the speed of the very first pass, where SAG and Finito are using the above mentioned trick to speed their convergence. After the first pass, the slopes of the lines are usually comparable. When considering the methods with permutation each pass, we see a clear advantage for Finito. Interestingly, it gives very flat lines, indicating very stable convergence.

8 Related work

Traditional incremental gradient methods (Bertsekas, 2010) have the same form as SGD, but applied to finite sums. Essentially they are the non-online analogue of SGD. Applying SGD to strongly convex problems does not yield linear convergence, and in practice it is slower than the linearly-convergent methods we discuss in the remainder of this section.

Besides the methods that fall under the classical incremental gradient moniker, the SAG and MISO (Mairal, 2013) methods are also related. The MISO method falls into the class of upper bound minimization methods, such as EM and classical gradient descent. MISO is essentially the Finito method, but with step sizes n times smaller. When using these larger step sizes, the method is no longer an upper bound minimization method. Our method can be seen as MISO, but with a step size scheme that gives neither a lower nor an upper bound minimization method. While this work was under peer review, a tech report (?)
was put on arXiv that establishes the convergence rate of MISO with step α = 1 and with β = 2 as (1 − 1/(3n)) per step. This is similar to, but not quite as good as, the rate we establish.

Stochastic dual coordinate descent (Shalev-Shwartz & Zhang, 2013) also gives fast convergence rates on problems for which it is applicable. It requires computing the convex conjugate of each f_i, which makes it more complex to implement. For the best performance, it has to take advantage of the structure of the losses also. For simple linear classification and regression problems it can be effective. When using a sparse dataset, it is a better choice than Finito due to the memory requirements. For linear predictors, its theoretical convergence rate of (1 − β/((1 + β)n)) per step is a little faster than what we establish for Finito; however, it does not appear to be faster in our experiments.
[Figure 6. Convergence rate plots for test problems. Five panels (MNIST, covtype, ijcnn1, million song and slice) plot the full gradient norm against epoch for SAG, Finito, permuted Finito, SDCA, permuted SDCA and LBFGS.]

9 Conclusion

We have presented a new method for the minimization of finite sums of smooth strongly convex functions, when there is a sufficiently large number of terms in the summation. We additionally develop some theory for the lower complexity bounds on this class, and show the empirical performance of our method.

References

Bertsekas, Dimitri P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Technical report, 2010.

Mairal, Julien. Optimization with first-order surrogate functions. ICML, 2013.

Nedic, Angelia and Bertsekas, Dimitri. Stochastic Optimization: Algorithms and Applications, chapter Convergence Rate of Incremental Subgradient Algorithms. Kluwer Academic, 2000.

Nesterov, Yu. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24:509-517, 1988.

Nesterov, Yu. Introductory Lectures On Convex Programming. Springer, 1998.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. Technical report, CORE, 2010.

Recht, Benjamin and Re, Christopher. Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. Technical report, University of Wisconsin-Madison, 2012.

Richtarik, Peter and Takac, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Technical report, University of Edinburgh, 2011.
Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. Technical report, INRIA, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 2013.