Hybrid Random/Deterministic Parallel Algorithms for Nonconvex Big Data Optimization


Amir Daneshmand, Francisco Facchinei, Vyacheslav Kungurtsev, and Gesualdo Scutari (the order of the authors is alphabetical)

arXiv:47.454v [cs.DC] Sep 4

Abstract: We propose a decomposition framework for the parallel optimization of the sum of a differentiable (possibly nonconvex) function and a nonsmooth (possibly nonseparable) convex one. The latter term is usually employed to enforce structure in the solution, typically sparsity. The main contribution of this work is a novel parallel, hybrid random/deterministic decomposition scheme wherein, at each iteration, a subset of (block) variables is updated at the same time by minimizing local convex approximations of the original nonconvex function. To tackle huge-scale problems, the (block) variables to be updated are chosen according to a mixed random and deterministic procedure, which captures the advantages of both pure deterministic and random update-based schemes. Almost sure convergence of the proposed scheme is established. Numerical results show that on huge-scale problems the proposed hybrid random/deterministic algorithm outperforms both random and deterministic schemes.

Index Terms: Nonconvex problems, Parallel and distributed methods, Random selections, Jacobi method, Sparse solution.

I. INTRODUCTION

We consider the minimization of the sum of a smooth (possibly nonconvex) function $F$ and of a nonsmooth (possibly nonseparable) convex one $G$:

$\min_{x \in X} V(x) \triangleq F(x) + G(x)$,   (1)

where $X$ is a closed convex set with a Cartesian product structure: $X = \prod_{i=1}^{N} X_i \subseteq \mathbb{R}^n$. Our focus is on problems with a huge number of variables, such as those encountered in machine learning, compressed sensing, data mining, tensor factorization and completion, network optimization, image processing, genomics, and meteorology. We refer the reader to []-[3] and the books [4], [5] as entry points to the literature. Recent years have witnessed a surge of interest in these very large scale problems, and the evocative term Big Data optimization has been coined to denote this new area of research.
(All the authors contributed equally to the paper. A. Daneshmand and G. Scutari are with the Dept. of Electrical Engineering, at the State Univ. of New York at Buffalo, Buffalo, USA. Email: <amirdane,gesualdo>@buffalo.edu. F. Facchinei is with the Dept. of Computer, Control, and Management Engineering, at Univ. of Rome La Sapienza, Rome, Italy. Email: francisco.facchinei@uniroma1.it. V. Kungurtsev is with the Agent Technology Center, Dept. of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague. Email: vyacheslav.kungurtsev@fel.cvut.cz. Part of this work has been published on arXiv on June 4.)

Block Coordinate Descent (BCD) methods rapidly emerged as a winning paradigm to attack Big Data optimization, see e.g. [3]. At each iteration of a BCD method one block of variables is updated using first-order information, while keeping all other variables fixed. This dramatically reduces the memory and computational requirements of each iteration and leads to simple and scalable methods. One of the key ingredients in a BCD method is the choice of the block of variables to update. This can be accomplished in several ways, for example using a cyclic order or some greedy/opportunistic selection strategy, which aims at selecting the block leading to the largest decrease of the objective function. The cyclic order has the advantage of being extremely simple, but the greedy strategy usually provides faster convergence, at the cost of an increased computational effort at each iteration. However, no matter which block selection rule is adopted, as the dimensions of the optimization problems increase, even BCD methods may prove inadequate. To alleviate the curse of dimensionality, three different kinds of strategies have been proposed, namely: a) parallelism, where several blocks of variables are updated simultaneously in a multicore or distributed computing environment, see e.g. [5]-[7], [7]-[], [6]-[5]; b) random selection of the block(s) of variables to update, see e.g. []-[3]; and c) use of more-than-first-order information, for example (approximated) Hessians or (parts of) the original function itself, see e.g.
[4], [8], [9], [3], [3]. Point a) is self-explanatory and rather intuitive (although the corresponding theoretical analysis is by no means trivial); here we only remark that the vast majority of parallel BCD methods apply to convex problems only. Points b) and c) need further comments.

Point b): The random selection of variables to update (also termed random sketching) is essentially as cheap as a cyclic selection while alleviating some of the pitfalls of cyclic rules. Moreover, random sketching is relevant in distributed environments where data are not available in their entirety, but are acquired either in batches over time or over a network, and not all nodes are equally responsive. In such scenarios, one might be interested in running the optimization process at a certain instant even with the limited, randomly available information. The main limitation of random selection rules is that they remain disconnected from the status of the optimization process, which is exactly the kind of behavior that greedy-based updates try to avoid, in favor of faster convergence, but at the cost of more intensive computation.

Point c): The use of more-than-first-order information also has to do with the trade-off between cost-per-iteration and overall cost of the optimization process. Using higher order or structural information may seem unreasonable, given the huge size of the problems at hand, and in fact the accepted wisdom is that at most first-order information can be used in the Big Data environment. However, recent studies, such as those mentioned above, challenge this wisdom and suggest that a judicious use of some kind of more-than-first-order information can lead to substantial overall improvements.

The above pros and cons analysis suggests that it would be desirable to design a parallel algorithm for nonconvex problems combining the benefits of random sketching and greedy updates, possibly using more-than-first-order information.

To the best of our knowledge, no such algorithm exists in the literature. In this paper, building on our previous deterministic methods [8], [9], [33], we propose a BCD-like scheme for the computation of stationary solutions of Problem (1) filling the gap and enjoying all the following features:
1) It uses a random selection rule for the blocks, followed by a deterministic subselection;
2) It can tackle the classical separable convex functions $G$, i.e., $G(x) = \sum_i G_i(x_i)$, but also nonseparable functions $G$;
3) It can deal with nonconvex functions $F$;
4) It can use both first-order and higher-order information;
5) It is parallel;
6) It can use inexact updates;
7) It converges almost surely, i.e., our convergence results are of the form "with probability one".

As far as we are aware, this is the first algorithm enjoying all these properties, even in the convex case. The combination of all the features 1-7 in one single algorithm is a major achievement in itself, which offers great flexibility to develop tailored instances of solution methods within the same framework (and thus all converging under the same unified conditions). Last but not least, our experiments show impressive performance of the proposed methods, outperforming state-of-the-art solution schemes (cf. Sec. IV). As a final remark, we underline that, at a more methodological level, the combination of all features 1-7 and, in particular, the need to reconcile random and deterministic strategies, led to the development of a new type of convergence analysis (see Appendix A), which is also of interest per se and could lead to further developments. Below we further comment on some of features 1-7, compare to existing results, and detail our contributions.

Feature 1: As far as we are aware, the idea of making a random selection and then performing a greedy subselection has been previously discussed only in [34].
However, results there i) are only for convex problems with a specific structure; ii) are based on a regularized first-order model; iii) require a very stringent spectral-radius-type condition, which severely limits the degree of parallelism (the maximum number of variables that can be simultaneously updated at each iteration while guaranteeing convergence); and iv) convergence results are in terms of expected value of the objective function. The proposed algorithmic framework expands vastly on this setting, while enjoying also all properties 1-7. In particular, it is the first hybrid random/greedy scheme for nonconvex nonseparable functions, and it allows any degree of parallelism (i.e., the update of any number of variables); and all this is achieved under much weaker convergence conditions than those in [34], satisfied by most practical problems. Numerical results show that the proposed hybrid scheme (updating greedily just some blocks within the pool of those selected by a random rule) is very effective, and seems to preserve the advantages of both random and deterministic selection rules.

Feature 2: The ability to deal with some classes of nonseparable convex functions has been documented in [35]-[37], but only for deterministic and sequential schemes; our approach extends also to parallel, random schemes.

Feature 3: The list of works dealing with BCD methods for nonconvex $F$'s is short: [], [9] for random sequential methods; and [7], [7]-[9], [38] for deterministic parallel ones. The only (very recent) paper dealing with random parallel methods for nonconvex $F$'s is the arXiv submission [38], which however does not enjoy the key properties 1, 2, and 6.

Feature 4: We want to stress the ability of the proposed algorithm to exploit in a systematic way more-than-first-order information. At each iteration of a BCD method, one block of variables is updated using a (possibly regularized) first-order model of the objective function, while keeping all other variables fixed. Our method, following the approach first explored in [8], [9], [33], provides the flexibility of using more sophisticated models.
For example, one could use a Newton-like approximation; or suppose that in (1) $F = F_1 + F_2$, where $F_1$ is convex and $F_2$ is not. Then, at iteration $k$, one could base the update of the $i$-th block on the approximant

$F_1(x_i, x_{-i}^k) + \nabla_{x_i} F_2(x^k)^T (x_i - x_i^k) + G(x_i, x_{-i}^k)$,

where $x_{-i}$ denotes the vector obtained from $x$ by deleting $x_i$. The logic here is that instead of linearizing the whole function $F$ we only linearize the difficult, nonconvex part $F_2$. In this light we can also better appreciate the importance of feature 6, since if we go for more complex approximants, the ability to deal with inexact solutions becomes important.

Feature 6: Inexact solution methods have been little studied. Papers [3], [39], [4] somewhat indirectly consider some of these issues in the specialized context of $\ell_2$-loss linear support vector machines. A more systematic treatment of inexactness of the solution of a first-order model is documented in [4], in the context of random sequential BCD methods for convex problems. Our results in this paper are based on our previous works [8], [9], [33], where both the use of more-than-first-order models and inexactness are introduced and rigorously analyzed in the context of parallel, deterministic methods. This paper extends the results in [8], [9], [33] to random, parallel schemes for nonconvex objective functions, and constitutes the first study of these issues in this setting.

As a final remark, we observe that a large portion of the works mentioned so far are interested in (global) complexity analysis. Of course this is an important topic, but it is outside the scope of this paper. Note that, with the exception of [9], all papers dealing with complexity analyses study (regularized) gradient-type methods for convex problems. Given our expanded setting, we believe it is more fruitful to concentrate on proving convergence and verifying the practical effectiveness of our algorithms.

The paper is organized as follows. Section II formally introduces the optimization problem along with the main assumptions under which it is studied and also discusses some technical points.
The proposed algorithmic framework and its convergence properties are introduced in Section III, while numerical results are presented in Section IV. Section V draws some conclusions. All proofs are given in the Appendix.

II. PROBLEM DEFINITION AND PRELIMINARIES

We consider Problem (1), where the feasible set $X = X_1 \times \cdots \times X_N$ is a Cartesian product of lower dimensional convex sets $X_i \subseteq \mathbb{R}^{n_i}$, and $x \in \mathbb{R}^n$ is partitioned accordingly: $x = (x_1, \ldots, x_N)$, with each $x_i \in \mathbb{R}^{n_i}$; we denote by

$\mathcal{N} \triangleq \{1, \ldots, N\}$ the set of the $N$ blocks. The function $F$ is smooth (and not necessarily convex and separable) and $G$ is convex, and possibly nondifferentiable and nonseparable. Some widely-used choices for $G(x)$ are $c\|x\|_1$ and $c \sum_{i=1}^{N} \|x_i\|_2$, from which one can see that Problem (1) includes many popular Big Data optimization problems, such as Lasso, group Lasso, sparse logistic regression, $\ell_2$-loss Support Vector Machine, Nuclear Norm Minimization, and Nonnegative Matrix (or Tensor) Factorization problems.

Assumptions. Given (1), we make the following blanket assumptions:
(A1) Each $X_i$ is nonempty, closed, and convex;
(A2) $F$ is $C^1$ on an open set containing $X$;
(A3) $\nabla F$ is Lipschitz continuous on $X$ with constant $L_F$;
(A4) $G$ is continuous and convex on $X$ (possibly nondifferentiable and nonseparable);
(A5) $V$ is coercive.

Note that the above assumptions are standard and are satisfied by most of the problems of practical interest. For instance, A3 holds automatically if $X$ is bounded, whereas A5 guarantees the existence of a solution.

With the advances of multi-core architectures, it is desirable to develop parallel solution methods for Problem (1) whereby operations can be carried out on some or (possibly) all (block) variables $x_i$ at the same time. The most natural parallel (Jacobi-type) method one can think of is updating all blocks simultaneously: given $x^k$, each (block) variable $x_i$ is updated by solving the following subproblem

$x_i^{k+1} \in \arg\min_{x_i \in X_i} \{ F(x_i, x_{-i}^k) + G(x_i, x_{-i}^k) \}$.   (2)

Unfortunately this method converges only under very restrictive conditions [4] that are seldom verified in practice (even in the absence of the nonsmooth part $G$). Furthermore, the exact computation of $x_i^{k+1}$ may be difficult and computationally too expensive. To cope with these issues, a natural approach is to replace the (nonconvex) function $F(\bullet, x_{-i}^k)$ by a suitably chosen local convex approximation $\tilde{F}_i(x_i; x^k)$, and solve instead the convex problems (one for each block)

$x_i^{k+1} \in \arg\min_{x_i \in X_i} \{ \tilde{h}_i(x_i; x^k) \triangleq \tilde{F}_i(x_i; x^k) + G(x_i, x_{-i}^k) \}$,   (3)

with the understanding that the minimization in (3) is simpler than that in (2).
Note that the function $G$ has not been touched; this is because i) it is generally much more difficult to find a good approximation of a nondifferentiable function than of a differentiable one; ii) $G$ is already convex; and iii) the functions $G$ encountered in practice do not make the optimization problem (3) difficult (a closed form solution is available for a large class of $G$'s, if the $\tilde{F}_i(x_i; x^k)$ are properly chosen). In this work we assume that the approximation functions $\tilde{F}_i(z; w) : X_i \times X \to \mathbb{R}$ have the following properties (we denote by $\nabla \tilde{F}_i$ the partial gradient of $\tilde{F}_i$ with respect to the first argument $z$):
(F1) $\tilde{F}_i(\bullet; w)$ is uniformly strongly convex with constant $q > 0$ on $X_i$;
(F2) $\nabla \tilde{F}_i(x_i; x) = \nabla_{x_i} F(x)$ for all $x \in X$;
(F3) $\nabla \tilde{F}_i(z; \bullet)$ is Lipschitz continuous on $X$ for all $z \in X_i$.

Such a function $\tilde{F}_i$ should be regarded as a (simple) convex approximation of $F$ at the point $x$ with respect to the block of variables $x_i$ that preserves the first order properties of $F$ with respect to $x_i$. Note that, contrary to most of the works in the literature (e.g., [37]), we do not require $\tilde{F}_i$ to be a global upper approximation of $F$, which significantly enlarges the range of applicability of the proposed solution methods.

The most popular choice for $\tilde{F}_i$ satisfying F1-F3 is

$\tilde{F}_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2$,   (4)

with $\tau_i > 0$. This is essentially the way a new iteration is computed in most (block-)BCDs for the solution of (group) LASSO problems and its generalizations. When $G \equiv 0$, this choice gives rise to a gradient-type scheme; in fact we obtain $x_i^{k+1}$ simply by a shift along the antigradient. As we discussed in the introduction, this is a first-order method, so it seems advisable, at least in some situations, to use more informative $\tilde{F}_i$'s. If $F(x_i, x_{-i}^k)$ is convex, an alternative is to take $\tilde{F}_i(x_i; x^k)$ as a second order approximation of $F(x_i, x_{-i}^k)$, i.e.,

$\tilde{F}_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{1}{2} (x_i - x_i^k)^T \left( \nabla^2_{x_i x_i} F(x^k) + qI \right) (x_i - x_i^k)$,   (5)

where $q$ is nonnegative and can be taken to be zero if $F(x_i, x_{-i}^k)$ is actually strongly convex. When $G \equiv 0$, this essentially corresponds to taking a Newton step in minimizing the reduced problem $\min_{x_i \in X_i} F(x_i, x_{-i}^k)$. Still in the case of convex $F(x_i, x_{-i}^k)$, one could also take just $\tilde{F}_i(x_i; x^k) = F(x_i, x_{-i}^k)$, which preserves the whole structure of the function. Other valuable choices tailored to specific applications are discussed in [9], [33].
As a guideline, note that our method, as we shall describe in detail shortly, is based on the iterative (approximate) solution of problem (3), and therefore a balance should be struck between the accuracy of the approximation $\tilde{F}_i$ and the ease of solution of (3). Needless to say, the option (4) is the least informative one, although it usually makes the computation of the solution of (3) a cheap task.

Best-response map: Associated with each $i$ and point $x \in X$, under F1-F3, we can define the following optimal block solution map:

$\hat{x}_i(x) \triangleq \arg\min_{x_i \in X_i} \tilde{h}_i(x_i; x)$.   (6)

Note that $\hat{x}_i(x)$ is always well-defined, since the optimization problem in (6) is strongly convex. Given (6), we can then introduce the solution map

$X \ni y \mapsto \hat{x}(y) \triangleq (\hat{x}_i(y))_{i=1}^{N}$.   (7)

Our algorithmic framework is based on solving in parallel a suitable selection of subproblems (6), converging thus to fixed points of $\hat{x}(\bullet)$ (of course the selection varies at each iteration). It is then natural to ask which relation exists between these
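As an illustration of why option (4) makes subproblem (6) cheap, the sketch below computes the best response for a single scalar block under two assumptions that are ours, not the paper's general setting: $X_i = \mathbb{R}$ and $G(x) = c\|x\|_1$. In that case the minimizer of (6) with approximant (4) reduces to a soft-thresholding step. The function names are hypothetical.

```python
def soft_threshold(v, t):
    """Scalar soft-thresholding: the minimizer over u of 0.5*(u - v)**2 + t*|u|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def best_response_prox_linear(x_i, grad_i, tau_i, c):
    """Best response for one scalar block with approximant (4) and G = c*|.|.

    Minimizes grad_i*(u - x_i) + 0.5*tau_i*(u - x_i)**2 + c*|u| over u in R.
    Assumption (ours): unconstrained scalar block and an l1 regularizer.
    """
    if tau_i <= 0:
        raise ValueError("tau_i must be positive for strong convexity (F1)")
    # Gradient step of length 1/tau_i, followed by shrinkage by c/tau_i.
    return soft_threshold(x_i - grad_i / tau_i, c / tau_i)
```

With `c = 0` the function returns the plain gradient step `x_i - grad_i / tau_i`, which matches the remark that (4) gives a gradient-type scheme when $G \equiv 0$.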

fixed points and the stationary solutions of Problem (1). To answer this key question, we first recall a few definitions.

Stationarity: A point $x^*$ is a stationary point of (1) if a subgradient $\xi \in \partial G(x^*)$ exists such that $(\nabla F(x^*) + \xi)^T (y - x^*) \geq 0$ for all $y \in X$.

Coordinate-wise stationarity: A point $x^*$ is a coordinate-wise stationary point of (1) if subgradients $\xi^i \in \partial G(x^*)$, with $i \in \mathcal{N}$, exist such that $(\nabla_{x_i} F(x^*) + \xi_i^i)^T (y_i - x_i^*) \geq 0$, for all $y_i \in X_i$ and $i \in \mathcal{N}$.

Of course, if $F$ is convex, stationary points coincide with its global minimizers. In words, a coordinate-wise stationary solution is a point for which $x^*$ is stationary w.r.t. every block of variables. It is clear that a stationary point is always a coordinate-wise stationary point; the converse however is not always true, unless extra conditions on $G$ are satisfied.

Regularity: Problem (1) is regular at a coordinate-wise stationary point $x^*$ if $x^*$ is also a stationary point of the problem.

Regularity at $x^*$ is a rather weak requirement, and is easily seen to be implied, in particular, by the following two conditions: (a) $G$ is separable (still nonsmooth), i.e., $G(x) = \sum_i G_i(x_i)$; (b) $G$ is continuously differentiable around $x^*$. Note that (a) is assumed in practically all papers dealing with deterministic/random BCD methods (with the exception of [36], [37], where however only sequential schemes are proposed). Regularity can well occur also for nonseparable functions. For instance, consider the function arising in logistic regression problems $F(x) = \sum_{j=1}^{m} \log(1 + e^{-a_j y_j^T x})$, with $X = \mathbb{R}^n$, and $y_j \in \mathbb{R}^n$ and $a_j \in \{-1, 1\}$ being given constants. Now, choose $G(x) = c\|x\|_2$; the resulting function $V$ is continuously differentiable, and therefore regular, at any stationary point but $x^* = 0$. It is easy to verify that $V$ is also regular at $x = 0$, provided that $c < \log 2$.

The following proposition is elementary and elucidates the connections between stationarity conditions of Problem (1) and fixed points of $\hat{x}(\bullet)$.

Proposition 1. Given Problem (1) under A1-A5 and F1-F3, the following hold: i) The set of fixed points of $\hat{x}(\bullet)$ coincides with the coordinate-wise stationary points of Problem (1); ii) If, in addition, Problem (1) is regular at a fixed point of $\hat{x}(\bullet)$, then such a fixed point is also a stationary point of the problem.
Other properties of the best-response map $\hat{x}(\bullet)$ that are instrumental to prove convergence of the proposed algorithm are introduced in Appendix B.

III. ALGORITHMIC FRAMEWORK

We are ready to describe our algorithmic framework. We begin by introducing a formal description of its salient characteristic, the novel hybrid random/greedy block selection rule.

The random block selection works as follows: at each iteration $k$, a random set $S^k \subseteq \mathcal{N}$ is generated, and the blocks $i \in S^k$ are the potential candidate variables to update in parallel. The set $S^k$ is a realization of a random set-valued mapping $\mathcal{S}^k$ with values in the power set of $\mathcal{N}$. To keep the proposed scheme as general as possible, we do not constrain $\mathcal{S}^k$ to any specific distribution; we only require that, at each iteration $k$, each block $i$ has a positive (possibly nonuniform) probability of being selected.

(A6) The sets $S^k$ are realizations of independent random set-valued mappings $\mathcal{S}^k$ such that $P(i \in \mathcal{S}^k) \geq p$, for all $i = 1, \ldots, N$ and $k \in \mathbb{N}_+$, and some $p > 0$.

A random selection rule $\mathcal{S}^k$ satisfying A6 will be called a proper sampling. Several proper sampling rules will be discussed in detail shortly.

As already discussed in the introduction, the random selection of blocks seems to become beneficial when the dimensions of the problem increase significantly. But recent results in [], [9], [43], [44] strongly suggest that a greedy approach (updating only the "promising" blocks) is an important ingredient of an efficient algorithm. Of course, for very large scale problems, checking whether a block is promising or not might become computationally demanding and thus time consuming. To avoid this burden while capturing the benefits of both strategies, the proposed approach consists in combining random and greedy updates in the following form. First, a random selection is performed (the set $S^k$ is generated). Second, a greedy procedure is run to select in the pool $S^k$ only the subset of blocks, say $\hat{S}^k$, that are promising according to a prescribed criterion. Finally, all the blocks in $\hat{S}^k$ are updated in parallel. To complete the description of such a hybrid random/greedy selection, the notion of "promising" block needs to be made formal, which is done next.
Since $x_i^k$ is an optimal solution of (6) if and only if $\hat{x}_i(x^k) = x_i^k$, a natural distance of $x_i^k$ from optimality is $d_i^k \triangleq \|\hat{x}_i(x^k) - x_i^k\|$. The blocks in $S^k$ to be updated can then be chosen based on such an optimality measure (e.g., opting for blocks exhibiting larger $d_i^k$'s). However, this choice requires the computation of the solutions $\hat{x}_i(x^k)$, for all $i \in S^k$, which in some applications might still be computationally too expensive. Building on the same idea, we can introduce alternative, less expensive metrics by replacing the distance $\|\hat{x}_i(x^k) - x_i^k\|$ with a computationally cheaper error bound, i.e., a function $E_i(x)$ such that

$\underline{s}_i \|\hat{x}_i(x^k) - x_i^k\| \leq E_i(x^k) \leq \bar{s}_i \|\hat{x}_i(x^k) - x_i^k\|$,   (8)

for some $0 < \underline{s}_i \leq \bar{s}_i$. Of course one can always set $E_i(x^k) = \|\hat{x}_i(x^k) - x_i^k\|$, but other choices are also possible; we refer the interested reader to [9] for more details.

The proposed hybrid random/greedy scheme capturing all the features 1-6 discussed in Sec. I is formally given in Algorithm 1. Note that in step S.3 inexact calculations of $\hat{x}_i$ are allowed, which is another noticeable and useful feature: one can reduce the cost per iteration without affecting too much (experience shows) the empirical convergence speed. In step S.5 we introduced a memory in the variable updates: the new point $x^{k+1}$ is a convex combination, via $\gamma^k$, of $x^k$ and $\hat{z}^k$. The step-size $\gamma^k$ plays a key role in the convergence, and needs to be properly tuned, as specified in Theorem 2, which summarizes the convergence properties of Algorithm 1.
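The greedy subselection and the memory update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scalar blocks stored in Python lists with 0-based indices, takes the error-bound values $E_i(x^k)$ as given, and keeps every sampled block whose error bound is at least $\rho$ times the largest one (one admissible choice, since the scheme only requires that at least one such block be included). The helper names are hypothetical.

```python
def greedy_subselect(errors, sampled, rho):
    """Keep the sampled blocks i with errors[i] >= rho * max_{j in sampled} errors[j].

    errors  : list with the error-bound value E_i(x^k) of every block
    sampled : list of block indices forming the random set S^k
    rho     : greedy threshold in (0, 1]
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    if not sampled:
        return []
    largest = max(errors[i] for i in sampled)
    return [i for i in sampled if errors[i] >= rho * largest]


def relaxed_update(x, z_hat, gamma):
    """Convex combination x + gamma * (z_hat - x), applied componentwise."""
    return [xi + gamma * (zi - xi) for xi, zi in zip(x, z_hat)]
```

For example, with `errors = [0.1, 0.9, 0.5, 0.0]`, `sampled = [0, 1, 2]` and `rho = 0.5`, the threshold is 0.45, so blocks 1 and 2 are kept. Blocks that are not selected simply keep their current value in `z_hat`, so `relaxed_update` leaves them unchanged.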

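Algorithm 1: Hybrid Random/Deterministic Flexible Parallel Algorithm (HyFLEXA)
Data: $\{\varepsilon_i^k\}$ for $i \in \mathcal{N}$, $\tau \geq 0$, $\{\gamma^k\} > 0$, $x^0 \in X$, $\rho \in (0, 1]$. Set $k = 0$.
(S.1): If $x^k$ satisfies a termination criterion: STOP;
(S.2): Randomly generate a set of blocks $S^k \subseteq \{1, \ldots, N\}$;
(S.3): Set $M^k \triangleq \max_{i \in S^k} \{E_i(x^k)\}$. Choose a subset $\hat{S}^k \subseteq S^k$ that contains at least one index $i$ for which $E_i(x^k) \geq \rho M^k$;
(S.4): For all $i \in \hat{S}^k$, solve (6) with accuracy $\varepsilon_i^k$: find $z_i^k \in X_i$ s.t. $\|z_i^k - \hat{x}_i(x^k)\| \leq \varepsilon_i^k$; set $\hat{z}_i^k = z_i^k$ for $i \in \hat{S}^k$ and $\hat{z}_i^k = x_i^k$ for $i \notin \hat{S}^k$;
(S.5): Set $x^{k+1} \triangleq x^k + \gamma^k (\hat{z}^k - x^k)$;
(S.6): $k \leftarrow k + 1$, and go to (S.1).

Theorem 2. Let $\{x^k\}$ be the sequence generated by Algorithm 1, under A1-A6. Suppose that $\{\gamma^k\}$ and $\{\varepsilon_i^k\}$ satisfy the following conditions: i) $\gamma^k \in (0, 1]$; ii) $\gamma^k \to 0$; iii) $\sum_k \gamma^k = +\infty$; iv) $\sum_k (\gamma^k)^2 < +\infty$; and v) $\varepsilon_i^k \leq \gamma^k \alpha_1 \min\{\alpha_2, 1/\|\nabla_{x_i} F(x^k)\|\}$ for all $i \in \mathcal{N}$ and some nonnegative constants $\alpha_1$ and $\alpha_2$. Additionally, if inexact solutions are used in Step 3, i.e., $\varepsilon_i^k > 0$ for some $i$ and infinitely many $k$, then assume also that $G$ is globally Lipschitz on $X$. Then, either Algorithm 1 converges in a finite number of iterations to a fixed point of $\hat{x}(\bullet)$, or there exists at least one limit point of $\{x^k\}$ that is a fixed point of $\hat{x}(\bullet)$ w.p. 1.
Proof: See Appendix C.

The convergence results in Theorem 2 can be strengthened when $G$ is separable.

Theorem 3. In the setting of Theorem 2, suppose in addition that $G(x)$ is separable, i.e., $G(x) = \sum_{i=1}^{N} G_i(x_i)$. Then, either Algorithm 1 converges in a finite number of iterations to a stationary solution of Problem (1), or every limit point of $\{x^k\}$ is a stationary solution of Problem (1) w.p. 1.
Proof: See Appendix D.

On the random choice of $S^k$. We discuss next some proper sampling rules $\mathcal{S}^k$ that can be used in Step 3 of the algorithm to generate the random sets $S^k$; for notational simplicity the iteration index $k$ will be omitted. The sampling rule $\mathcal{S}$ is uniquely characterized by the probability mass function $P(S) \triangleq P(\mathcal{S} = S)$, $S \subseteq \mathcal{N}$, which assigns probabilities to the subsets $S$ of $\mathcal{N}$. Associated with $\mathcal{S}$, define the probabilities $q_j \triangleq P(|\mathcal{S}| = j)$, for $j = 1, \ldots, N$. The following proper sampling rules, proposed in [5] for convex problems with separable $G$, are instances of rules satisfying A6, and are used in our computational experiments.

Uniform (U) sampling. All blocks get selected with the same (nonzero) probability: $P(i \in \mathcal{S}) = P(j \in \mathcal{S}) = E[|\mathcal{S}|]/N$, for all $i \neq j \in \mathcal{N}$.

Doubly Uniform (DU) sampling.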
All sets $S$ of equal cardinality are generated with equal probability, i.e., $P(S) = P(S')$, for all $S, S' \subseteq \mathcal{N}$ such that $|S| = |S'|$. The density function is then $P(S) = q_{|S|} / \binom{N}{|S|}$.

Nonoverlapping Uniform (NU) sampling. It is a uniform sampling rule assigning positive probabilities only to sets forming a partition of $\mathcal{N}$. Let $S^1, \ldots, S^P$ be a partition of $\mathcal{N}$, with each $|S^i| > 0$; the density function of the NU sampling is $P(S) = 1/P$ if $S \in \{S^1, \ldots, S^P\}$, and $P(S) = 0$ otherwise, which corresponds to $P(i \in \mathcal{S}) = 1/P$, for all $i \in \mathcal{N}$.

A special case of the DU sampling that we found very effective in our experiments is the so-called "nice sampling".

Nice Sampling (NS). Given an integer $\tau \leq N$, a $\tau$-nice sampling is a DU sampling with $q_\tau = 1$ (i.e., each subset of $\tau$ blocks is chosen with the same probability). The NS allows us to control the degree of parallelism of the algorithm by tuning the cardinality $\tau$ of the random sets generated at each iteration, which makes this rule particularly appealing in a multi-core environment. Indeed, one can set $\tau$ equal to the number of available cores/processors, and assign each block coming out from the greedy selection (if implemented) to a dedicated processor/core.

As a final remark, note that the DU/NU rules contain as special cases sequential and fully parallel updates, wherein at each iteration a single block is updated uniformly at random, or all blocks are updated, respectively.
Sequential sampling: It is a DU sampling with $q_1 = 1$, or a NU sampling with $P = N$ and $S^j = \{j\}$, for $j = 1, \ldots, P$.
Fully parallel sampling: It is a DU sampling with $q_N = 1$, or a NU sampling with $P = 1$ and $S^1 = \mathcal{N}$.
Other interesting uniform and nonuniform practical rules (still satisfying A6) can be found in [5], [45], to which we refer the interested reader for further details.

On the choice of the step-size $\gamma^k$. An example of a step-size rule satisfying Theorem 2 i)-iv) is: given $0 < \gamma^0 \leq 1$, let

$\gamma^k = \gamma^{k-1} (1 - \theta \gamma^{k-1})$, $k = 1, \ldots$,   (9)

where $\theta \in (0, 1)$ is a given constant. Numerical results in Section IV show the effectiveness of (9) on specific problems. We remark that it is possible to prove convergence of Algorithm 1 also using other step-size rules, including a standard Armijo-like line-search procedure or a (suitably small) constant step-size.
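The sampling rules and the step-size rule (9) above are simple to simulate. The sketch below is illustrative only and relies on our own conventions: blocks are indexed from 0, the random generator is passed in explicitly, and all function names are hypothetical.

```python
import random


def tau_nice_sampling(num_blocks, tau, rng):
    """tau-nice sampling: every subset of `tau` blocks is equally likely."""
    return sorted(rng.sample(range(num_blocks), tau))


def nonoverlapping_uniform_sampling(partition, rng):
    """NU sampling: pick one set of a fixed partition uniformly at random."""
    return list(rng.choice(partition))


def diminishing_step_sizes(gamma0, theta, num_iters):
    """Step sizes from rule (9): gamma^k = gamma^{k-1} * (1 - theta * gamma^{k-1})."""
    if not (0.0 < gamma0 <= 1.0 and 0.0 < theta < 1.0):
        raise ValueError("need 0 < gamma0 <= 1 and 0 < theta < 1")
    if num_iters <= 0:
        return []
    gammas = [gamma0]
    for _ in range(num_iters - 1):
        prev = gammas[-1]
        gammas.append(prev * (1.0 - theta * prev))
    return gammas
```

Setting `tau = 1` in `tau_nice_sampling` gives the sequential sampling, and `tau = num_blocks` gives the fully parallel one. The step sizes produced by `diminishing_step_sizes` are strictly decreasing and stay in (0, 1], since each factor `1 - theta * prev` lies strictly between 0 and 1.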
Note that, differently from most of the schemes in the literature, the tuning of the step-size does not require the knowledge of the problem parameters (e.g., the Lipschitz constants of $\nabla F$ and $G$).

IV. NUMERICAL RESULTS

In this section we present some preliminary experiments providing solid evidence of the viability of our approach; they clearly show that our framework leads to practical methods that exploit parallelism well and compare favorably to existing schemes, both deterministic and random.

Because of space limitations, we present results only for (synthetic) LASSO problems, one of the most studied instances of (the convex version of) Problem (1), corresponding

to $F(x) = \|Ax - b\|^2$, $G(x) = c\|x\|_1$, and $X = \mathbb{R}^n$. Extensive experiments on more varied (nonconvex) classes of Problem (1) are the subject of a separate work.

All codes have been written in C++ and use the Message Passing Interface for parallel operations. All algebra is performed by using the Intel Math Kernel Library (MKL). The algorithms were tested on the General Compute Cluster of the Center for Computational Research at the SUNY Buffalo. In particular, for our experiments we used a partition composed of 37 DELL 3x.3GHz Intel E7-483 Xeon Processor nodes with 5 GB of DDR4 main memory and QDR InfiniBand 4Gb/s network card.

Tuning of Algorithm 1: The most successful class of random and deterministic methods for LASSO problems are (proximal) gradient-like schemes, based on a linearization of $F$. As a major departure from current schemes, here we propose to better exploit the structure of $F$ and use in Algorithm 1 the following best-response: given a scalar partition of the variables (i.e., $n_i = 1$ for all $i$), let

$\hat{x}_i(x^k) \triangleq \arg\min_{x_i \in \mathbb{R}} \{ F(x_i, x_{-i}^k) + \frac{\tau_i}{2} (x_i - x_i^k)^2 + \lambda |x_i| \}$.   (10)

Note that $\hat{x}_i(x^k)$ has a closed form expression (using a soft-thresholding operator [8]).

The free parameters of Algorithm 1 are chosen as follows. The proximal gains $\tau_i$ and the step-size $\gamma$ are tuned as in [9, Sec. VI.A]. The error bound function is chosen as $E_i(x^k) = \|\hat{x}_i(x^k) - x_i^k\|$, and, for any realization $S^k$, the subsets $\hat{S}^k$ in S.3 of the algorithm are chosen as

$\hat{S}^k = \{ i \in S^k : E_i(x^k) \geq \sigma M^k \}$.   (11)

We denote by $c_{S^k}$ the cardinality of $S^k$ normalized to the overall number of variables (in our experiments, all sets $S^k$ have the same cardinality, i.e., $c_{S^k} = c_S$, for all $k$). We considered the following options for $\sigma$ and $c_S$: i) $c_S$ = ., ., ., .5, .8; ii) $\sigma = 0$, which leads to a fully parallel pure random scheme wherein at each iteration all variables in $\hat{S}^k$ are updated; and iii) different positive values of $\sigma$ ranging from . to .5, which corresponds to updating in a greedy manner only a subset of the variables in $S^k$ (the smaller the $\sigma$, the larger the number of potential variables to be updated at each iteration). We termed Algorithm 1 with $\sigma = 0$ "Random FLEXible parallel Algorithm" (RFLEXA), whereas the other instances with $\sigma > 0$ are termed "Hybrid FLEXA" (HyFLEXA).
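To make the tuning above concrete, the following serial sketch runs the scheme on a tiny dense LASSO instance. It is a toy illustration under stated assumptions, not the authors' C++/MPI code: $F(x) = \|Ax - b\|^2$ (no factor 1/2), scalar blocks, a single proximal gain `tau` shared by all blocks, exact best responses ($\varepsilon_i^k = 0$), nice sampling in place of the NU rule used in the reported experiments, and step-size rule (9). Under these assumptions, setting the derivative of the objective in (10) to zero gives $\hat{x}_i = \mathrm{soft}(x_i - g_i/d_i, \lambda/d_i)$ with $g_i = 2 a_i^T (Ax - b)$ and $d_i = 2\|a_i\|^2 + \tau$, where $a_i$ is the $i$-th column of $A$. The function names are hypothetical.

```python
import random


def soft_threshold(v, t):
    """Scalar soft-thresholding operator."""
    return max(v - t, 0.0) - max(-v - t, 0.0)


def hyflexa_lasso(A, b, lam, tau, sigma, sample_size,
                  gamma0, theta, num_iters, seed=0):
    """Serial toy version of the hybrid scheme for ||Ax - b||^2 + lam*||x||_1.

    A is a list of rows; sigma = 0 gives the purely random variant.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    m, n = len(A), len(A[0])
    rng = random.Random(seed)
    col_sq = [sum(A[r][i] ** 2 for r in range(m)) for i in range(n)]
    x = [0.0] * n
    gamma = gamma0
    for _ in range(num_iters):
        # Residual Ax - b at the current iterate.
        res = [sum(A[r][j] * x[j] for j in range(n)) - b[r] for r in range(m)]
        # Random step: sample the candidate blocks.
        sampled = rng.sample(range(n), sample_size)
        # Closed-form best response (10) for every sampled block.
        best = {}
        for i in sampled:
            grad_i = 2.0 * sum(A[r][i] * res[r] for r in range(m))
            d_i = 2.0 * col_sq[i] + tau
            best[i] = soft_threshold(x[i] - grad_i / d_i, lam / d_i)
        # Greedy step: keep blocks whose error bound is at least sigma * max.
        err = {i: abs(best[i] - x[i]) for i in sampled}
        max_err = max(err.values())
        chosen = [i for i in sampled if err[i] >= sigma * max_err]
        # Memory update on the chosen blocks, then step-size rule (9).
        for i in chosen:
            x[i] += gamma * (best[i] - x[i])
        gamma *= 1.0 - theta * gamma
    return x
```

As a sanity check, take `A` equal to the 2x2 identity, `b = [3.0, 0.1]` and `lam = 1.0`: the problem separates per coordinate and the minimizer is `[2.5, 0.0]`, which the sketch approaches when run for enough iterations.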
Algorithms in the literature: We compared our versions of HyFLEXA with the most representative parallel random and deterministic algorithms proposed in the literature to solve the convex instance of Problem (1) (and thus also LASSO). More specifically, we consider the following schemes.

PCDM & PCDM2: These are (proximal) gradient-like parallel randomized BCD methods proposed in [5] for convex optimization problems. Since the authors recommend using PCDM2 instead of PCDM for LASSO problems, we do so (indeed, our experiments show that PCDM2 outperforms PCDM). We simulated PCDM2 under different sampling rules and we set the parameters $\beta$ and $\omega$ as in [5, Table 4], which guarantees convergence of the algorithm in expected value.

Hydra & Hydra2: Hydra is a parallel and distributed random gradient-like CDM, proposed in [46], wherein different cores in parallel update a randomly chosen subset of variables from those they own; a closed form solution of the scalar updates is available. Hydra2 [] is the accelerated version of Hydra; indeed, in all our experiments, it outperformed Hydra; therefore, we will report the results only for Hydra2. The free parameter $\beta$ is set as in Eq. 5 of [46], with $\sigma$ given by the corresponding equation in [46] (according to the authors, this seems to be one of the best choices for $\beta$).

FLEXA: This is the parallel deterministic scheme we proposed in [8], [9]. We use FLEXA as a benchmark of deterministic algorithms, since it has been shown in [8], [9] that it outperforms current (parallel) first-order (accelerated) gradient-like schemes, including FISTA [8], SpaRSA [9], GRock [], parallel BCD [7], and parallel ADMM. The free parameters of FLEXA, $\tau_i$ and $\gamma$, are tuned as in [9, Sec. VI.A], whereas the set $\hat{S}^k$ is chosen as in (11).

Other algorithms: We also tested other random algorithms, including sequential random BCD-like methods and Shotgun [6]. However, since they were not competitive, and to avoid overcrowding the figures, we do not report results for these algorithms.

In all the experiments, the data matrix $A = [A_1 \cdots A_P]$ of the LASSO problem is stored in a column-block manner, uniformly across the $P$ parallel processes.
Thus the computation of each product $Ax$ (required to evaluate $\nabla F$) and of the norm $\|x\|_1$ (that is, $G$) is divided into the parallel jobs of computing $A_i x_i$ and $\|x_i\|_1$, followed by a reduce operation. Also, for all the algorithms, the initial point was set to the zero vector.

Numerical Tests: We generated synthetic LASSO problems using the random generation technique proposed by Nesterov [6], which we properly modified following [5] to generate instances of the problem with different levels of sparsity of the solution as well as density of the data matrix $A \in \mathbb{R}^{m \times n}$; we introduce the following two control parameters: $s_A$ = average % of nonzeros in each column of $A$ (out of $m$); and $s_{sol}$ = % of nonzeros in the solution (out of $n$). We tested the algorithms on two groups of LASSO problems, $A \in \mathbb{R}^{10^4 \times 10^5}$ and $A \in \mathbb{R}^{10^5 \times 10^6}$, and several degrees of density of $A$ and sparsity of the solution, namely $s_{sol}$ = .%, %, 5%, 5%, 3%, and $s_A$ = %, 3%, 5%, 7%, 9%. Because of space limitations, we report next only the most representative results; we refer to [47] for more details and experiments.

Results for the LASSO instance with 100,000 variables are reported in Fig. 1 and 2. Fig. 1 shows the behavior of HyFLEXA as a function of the design parameters $\sigma$ and $c_S$, for different values of the solution sparsity ($s_{sol}$), whereas in Fig. 2 we compare the proposed RFLEXA and HyFLEXA with FLEXA, PCDM, and Hydra, for different values of $s_{sol}$ and $s_A$ (ranging from "low" dense matrices and "high" sparse solutions to "high" dense matrices and "low" sparse solutions). Finally, in Fig. 3 we consider larger problems with 1M variables. In all the figures, we plot the relative error $re(x) \triangleq (V(x) - V^*)/V^*$ versus the CPU time, where $V^*$ is the optimal value of the objective function $V$ (in our experiments $V^*$ is known). All the curves are averaged over ten independent random realizations. Note that the CPU time includes communication times and the initial time needed by the methods to perform all pre-iteration

computations (this explains why the curves associated with Hydra start after the others; in fact Hydra requires some nontrivial computations to estimate $\beta$). Given Fig. 1-3, the following comments are in order.

[Figure legend: HyFLEXA ($\sigma$ = .5, $c_S$ = .5); HyFLEXA ($\sigma$ = ., $c_S$ = .5); HyFLEXA ($\sigma$ = ., $c_S$ = .); RFLEXA ($c_S$ = .5); RFLEXA ($c_S$ = .); Hydra ($c_S$ = .5); Hydra ($c_S$ = .); PCDM ($c_S$ = .5); PCDM ($c_S$ = .); FLEXA ($\sigma$ = .5); FLEXA ($\sigma$ = ).]

[Fig. 1: HyFLEXA for different values of $c_S$ and $\sigma$: relative error vs. time (sec); $s_{sol}$ = .%, %, 5%, $s_A$ = 7%, 100,000 variables, NU sampling, 8 cores; (a) $c_S$ = .5, and $\sigma$ = ., .5; (b) $\sigma$ = .5, and $c_S$ = ., ., .]

HyFLEXA: On the choice of $(c_S, \sigma)$, and the sampling strategy. All the experiments (including those that we cannot report here because of lack of space) show the following trend in the behavior of HyFLEXA as a function of $(c_S, \sigma)$. For "low" density problems ("low" $s_{sol}$ and $s_A$), "large" pairs $(c_S, \sigma)$ are preferable, which corresponds to updating at each iteration only some variables by performing a (heavy) greedy search over a sizable number of variables. This is in agreement with [9] (cf. Remark 5): by the greedy selection, Algorithm 1 is able to identify those variables that will be zero at a solution; therefore updating only the variables that we have strong reason to believe will not be zero at a solution is a better strategy than updating them all, especially if the solutions are very sparse. Note that this behavior can be obtained using either "large" or "small" $(c_S, \sigma)$. However, in the case of "low" dense problems, the former strategy outperforms the latter. We observed that this is mainly due to the fact that when $s_A$ is "small", estimating $\hat{x}$ (computing the products $A^T A$) is computationally affordable, and thus performing a greedy search over more variables enhances the practical convergence.
When the sparsity of the solution decreases and/or the density of $A$ increases (large $s_A$ and/or $s_{sol}$), one can see from the figures that smaller values of $(c_S, \sigma)$ are more effective than larger ones, which corresponds to using a less aggressive greedy selection while searching over a smaller pool of variables. In fact, when $A$ is dense, computing all $\hat{x}_i$ might be prohibitive and thus nullify the potential benefits of a greedy procedure. For instance, it follows from Fig. 1-3 that, as the density of the solution ($s_{sol}$) increases, the preferable choice for $(c_S, \sigma)$ progressively moves from (.5, .5) to (., .), with both $c_S$ and $\sigma$ decreasing. Interestingly, a tuning that works quite well in practice for all the classes of problems we simulated (different densities of $A$, solution sparsity, number of cores, etc.) is $(c_S, \sigma)$ = (.5, .), which seems to strike a good balance between not updating variables that are probably zero at the optimum and nevertheless updating a sizable amount of variables when needed in order to enhance convergence. As a final remark, we report that, according to our experiments, the most effective sampling rule among U, DU, NU, and NS is the NU (which is actually the one the figures refer to); NS becomes competitive only when the solutions are very sparse, see [47] for a detailed comparison of the different rules.

Fig. 2: LASSO with 100,000 variables, 8 cores; relative error vs. time (sec) for: (a1) $s_A$ = 3% and $s_{sol}$ = .% - (a2) $s_A$ = 3% and $s_{sol}$ = 5% - (b1) $s_A$ = 7% and $s_{sol}$ = .% - (b2) $s_A$ = 7% and $s_{sol}$ = 5% - (c1) $s_A$ = 9% and $s_{sol}$ = .% - (c2) $s_A$ = 9% and $s_{sol}$ = 5%.

Comparison of the algorithms. For low dense matrices $A$ and very sparse solutions, FLEXA $\sigma$ = .5 is faster than its random counterparts (RFLEXA and HyFLEXA) as well as its fully parallel version, FLEXA $\sigma = 0$ [see Fig. 2 (a1), (b1), (c1) and Fig. 3(a)]. Nevertheless, HyFLEXA [with $(c_S, \sigma)$ = (.5, .5)] remains close. As already pointed out, this is mainly due to the fact that in these scenarios estimating all $\hat{x}_i$ is computationally cheap and thus performing a greedy selection over a sizable set of variables is beneficial, see Fig. 1; and

updating only some variables at each iteration is more effective than updating all (FLEXA $\sigma$ = .5 outperforms FLEXA $\sigma = 0$). However, as the density of $A$ and/or the size of the problem increase, computing all the products $[A^T A]$ (required to estimate $\hat{x}$) becomes too costly; this is when a random selection of the variables becomes beneficial: indeed, RFLEXA and HyFLEXA consistently outperform FLEXA [see Fig. 2 (a2), (b2), (c2) and Fig. 3(b)]. Among the random algorithms, Hydra is capable of reaching low accuracy relatively fast, especially when the solution is not too sparse, but has difficulties in reaching high accuracy. RFLEXA and HyFLEXA are always much faster than current state-of-the-art schemes (PCDM and Hydra), especially if high accuracy of the solutions is required. Between RFLEXA and HyFLEXA (with the same $c_S$), the latter consistently outperforms the former (up to about five times faster), with a gap that is more significant when solutions are sparse. This provides solid evidence of the effectiveness of the proposed hybrid random/greedy selection method. In conclusion, our experiments indicate that the proposed framework leads to very efficient and practical solution methods for large and very large-scale (LASSO) problems, with the flexibility to adapt to many different problem characteristics.

Fig. 3: LASSO with 1M variables, $s_A$ = %, 6 cores; relative error vs. time (sec) for: (a) $s_{sol}$ = % - (b) $s_{sol}$ = 5%. The legend is as in Fig. 2.

Prof. Peter Richtárik for providing the C++ code of PCDM and Hydra (that we modified in order to use the MPI library). The work of Daneshmand and Scutari was supported by the USA NSF Grants CMS 877 and CAREER Award No. The work of Facchinei was supported by the MIUR project PLATINO (Grant Agreement n. PON_7). The work of Kungurtsev was supported by the European Social Fund under the Grant CZ..7/.3./

V. CONCLUSIONS

We proposed a highly parallelizable hybrid random/deterministic decomposition algorithm for the minimization of the sum of a possibly nonconvex differentiable function $F$ and a possibly nonsmooth nonseparable convex function $G$.
The proposed framework is the first scheme enjoying all the following features: i) it allows for pure greedy, pure random, or mixed random/greedy updates of the variables, all converging under the same unified set of convergence conditions; ii) it can tackle via parallel updates also nonseparable convex functions $G$; iii) it can deal with nonconvex nonseparable $F$; iv) it is parallel; v) it can incorporate both first-order or higher-order information; and vi) it can use inexact solutions. Our preliminary experiments on LASSO problems showed the superiority of the proposed scheme with respect to state-of-the-art random and deterministic algorithms. Experiments on more varied classes of problems are the subject of our current research.

VI. ACKNOWLEDGMENTS

The authors are very grateful to Prof. Peter Richtárik for his invaluable comments; we also thank Dr. Martin Takáč and

APPENDIX

We first introduce some preliminary results instrumental to prove both Theorem and Theorem 3. Given $\hat{S} \subseteq \mathcal{N}$ and $x \triangleq (x_i)_{i=1}^N$, for notational simplicity, we will denote by $(x)_{\hat{S}}$ (or interchangeably $x_{\hat{S}}$) the vector whose component $i$ is equal to $x_i$ if $i \in \hat{S}$, and zero otherwise. With a slight abuse of notation we will also use $(x_i, y_{-i})$ to denote the ordered tuple $(y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_N)$; similarly $(x_i, x_j, y_{-(i,j)})$, with $i < j$, stands for $(y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_{j-1}, x_j, y_{j+1}, \ldots, y_N)$.

A. On the random sampling and its properties

We introduce some properties associated with the random sampling rules $S^k$ satisfying assumption A6. A key role in our proofs is played by the following random set: let $\{x^k\}$ be the sequence generated by Algorithm 1, and define the set $\mathcal{K}_{mx}$ as

$i^k_{mx} \triangleq \arg\max_{i \in \{1,\ldots,N\}} \|\hat{x}_i(x^k) - x^k_i\|, \qquad \mathcal{K}_{mx} \triangleq \{k \in \mathbb{N}_+ : i^k_{mx} \in S^k\}.$ (13)

The key properties of this set are summarized in the following two lemmata.

Lemma 4 (Infinite cardinality). Given the set $\mathcal{K}_{mx}$ as in (13), it holds that $P(|\mathcal{K}_{mx}| = \infty) = 1$, where $|\mathcal{K}_{mx}|$ denotes the cardinality of $\mathcal{K}_{mx}$.

Proof: Suppose that the statement of the lemma is not true. Then, with positive probability, there must exist some $\bar{k}$ such that for $k \geq \bar{k}$, $i^k_{mx} \notin S^k$. But we can write

$P\Big(\bigcap_{k \geq \bar{k}} \{i^k_{mx} \notin S^k\}\Big) = \prod_{k \geq \bar{k}} P\big(i^k_{mx} \notin S^k \mid i^{k-1}_{mx} \notin S^{k-1}, \ldots, i^{\bar{k}}_{mx} \notin S^{\bar{k}}\big) \leq \lim_{k \to \infty} (1-p)^k = 0,$

where the inequality follows by A6 and the independence of the events.
But this obviously gives a contradiction and concludes the proof.

Lemma 5. Let $\{\gamma^k\}$ be a sequence satisfying assumptions - of Theorem. Then it holds that

$P\Big(\sum_{k \in \mathcal{K}_{mx}} \gamma^k < \infty\Big) = 0.$ (14)

Proof:

9 9 It holds that, P γ < K mx P γ < n n N K mx n NP γ < n. K mx To prove the lemma, t s then suffcent to show that P K mx γ < n, as proved next. Defne, wth N +, as the smallest ndex such that γ j n. 5 j Note that snce γ +, ˆK s well-defned for all and lm ˆK +. For any n N, t holds: m P γ < n P γ < n K mx m N K mx lm m P m γ < n K mx lm P γ < n K mx lm P γ < n, K mx [, ] < K mx + P γ < n, K mx [, ˆK ] K mx lm P K mx [, ] < ˆK } {{ } term I + P γ < n, K mx [, ˆK ]. K mx } {{ } term II 6 Let us bound next term I and term II separately. Term I: We have P K mx [, ] < ˆK a P X < P X p > ˆK b p p ˆK p ˆK ˆK p p c ˆK p 7 whe: a: X,...,X ˆK a ndependent Bernoull random varables, wth parameter p P K mx. Note that, due to A6, p p, for all ; b: t follows from Chebyshev s nequalty; c: we used the bounds p p and ˆK p p. Term II: Let us wrte term II as ˆK P K mx γ K mx [, ] < n K mx [, ] K mx [, ] ˆK P K mx [, ] ˆK ˆK a P K mx γ K mx [, ] < n K mx [, ] ˆK P K mx [, ] ˆK ˆK b P K mx γ ˆK K mx [, ] < γ c ˆK P γ ˆK X < γ ˆK P γ ˆK X γ p ˆK ˆK > γ ˆK p γ ˆK P γ ˆK X γ p > d ˆK γ p p p ˆK γ p ˆK γ p ˆK γ ˆK γ p ˆK γ, 8 whe: a: we used K mx [, ] ˆK, by the condtonng event; b: t follows from 5, and PA B PA; c: X,...,X ˆK a ndependent Bernoull random varables, wth parameter p. The bound s due to K mx [, ] ; d: t follows from the Chebyshev s nequalty. The desd sult 4 follows adly combnng 6, 7, and 8. B. On the best-sponse map x and ts propertes We ntroduce now some ey propertes of the mappng x defned n 6. We also derve some bounds nvolvng x along wth the sequence {x } generated by Algorthm. Lemma 6 [9]. Consder Problem under A-A5, and F- F3. Suppose that Gx s separable,.e., Gx G x,

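with each $G_i$ convex on $X_i$. Then the mapping $X \ni y \mapsto \hat{x}(y)$ is Lipschitz continuous on $X$, i.e., there exists a positive constant $\hat{L}$ such that

$\|\hat{x}(y) - \hat{x}(z)\| \leq \hat{L}\,\|y - z\|, \quad \forall y, z \in X.$ (19)

Lemma 7. Let $\{x^k\}$ be the sequence generated by Algorithm 1. For every $k \in \mathcal{K}_{mx}$ and $\hat{S}^k$ generated as in step S.3 of Algorithm 1, the following holds: there exists a positive constant $c$ such that

$\|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\| \geq c\,\|\hat{x}(x^k) - x^k\|.$ (20)

Proof: The following chain of inequalities holds:

$\big(\max_{i \in \mathcal{N}} \bar{s}_i\big)\,\|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\| \overset{(a)}{\geq} \bar{s}_{i_\rho}\,\|\hat{x}_{i_\rho}(x^k) - x^k_{i_\rho}\| \overset{(b)}{\geq} E_{i_\rho}(x^k) \overset{(c)}{\geq} \rho\,E_{i^k_{mx}}(x^k) \overset{(d)}{\geq} \rho\,\big(\min_{i \in \mathcal{N}} \underline{s}_i\big) \max_{i \in \mathcal{N}} \|\hat{x}_i(x^k) - x^k_i\| \geq \frac{\rho\,\min_{i \in \mathcal{N}} \underline{s}_i}{N}\,\|\hat{x}(x^k) - x^k\|,$

where: in (a) $i_\rho$ is any index in $\hat{S}^k$ such that $E_{i_\rho}(x^k) \geq \rho \max_{i \in S^k} E_i(x^k)$. Note that by definition of $\hat{S}^k$ (cf. step S.3 of Algorithm 1), such an index always exists; (b) is due to (8); (c) follows from the definition of $i_\rho$, and $\max_{i \in S^k} E_i(x^k) \geq E_{i^k_{mx}}(x^k)$, the latter due to $i^k_{mx} \in S^k$ (recall that $k \in \mathcal{K}_{mx}$); and (d) follows from (8).

Lemma 8. Let $\{x^k\}$ be the sequence generated by Algorithm 1. For every $k \in \mathbb{N}_+$, and $\hat{S}^k$ generated as in step S.3, the following holds:

$\nabla_x F(x^k)^T_{\hat{S}^k}\,(\hat{x}(x^k) - x^k)_{\hat{S}^k} \leq -q\,\|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\|^2 + \sum_{i \in \hat{S}^k} \big[G(x^k) - G(\hat{x}_i(x^k), x^k_{-i})\big].$ (21)

Proof: Optimality of $\hat{x}_i(x^k)$ for the subproblem implies

$\big(\nabla_{x_i} \tilde{F}_i(\hat{x}_i(x^k); x^k) + \xi_i(\hat{x}_i(x^k), x^k_{-i})\big)^T (y_i - \hat{x}_i(x^k)) \geq 0,$

for all $y_i \in X_i$, and some $\xi_i(\hat{x}_i(x^k), x^k_{-i}) \in \partial_{x_i} G(\hat{x}_i(x^k), x^k_{-i})$. Therefore, choosing $y_i = x^k_i$,

$0 \geq \nabla_{x_i} \tilde{F}_i(\hat{x}_i(x^k); x^k)^T (\hat{x}_i(x^k) - x^k_i) + \xi_i(\hat{x}_i(x^k), x^k_{-i})^T (\hat{x}_i(x^k) - x^k_i).$ (22)

Let us lower bound next the two terms on the RHS of (22). The uniform strong monotonicity of $\nabla_{x_i} \tilde{F}_i(\cdot; x^k)$ (cf. F1),

$\big(\nabla_{x_i} \tilde{F}_i(\hat{x}_i(x^k); x^k) - \nabla_{x_i} \tilde{F}_i(x^k_i; x^k)\big)^T (\hat{x}_i(x^k) - x^k_i) \geq q\,\|\hat{x}_i(x^k) - x^k_i\|^2,$ (23)

along with the gradient consistency condition (cf. F2) $\nabla_{x_i} \tilde{F}_i(x^k_i; x^k) = \nabla_{x_i} F(x^k)$ imply

$\nabla_{x_i} \tilde{F}_i(\hat{x}_i(x^k); x^k)^T (\hat{x}_i(x^k) - x^k_i) = \big(\nabla_{x_i} \tilde{F}_i(\hat{x}_i(x^k); x^k) - \nabla_{x_i} \tilde{F}_i(x^k_i; x^k)\big)^T (\hat{x}_i(x^k) - x^k_i) + \nabla_{x_i} \tilde{F}_i(x^k_i; x^k)^T (\hat{x}_i(x^k) - x^k_i) \geq \nabla_{x_i} F(x^k)^T (\hat{x}_i(x^k) - x^k_i) + q\,\|\hat{x}_i(x^k) - x^k_i\|^2.$ (24)

To bound the second term on the RHS of (22), let us invoke the convexity of $G(\cdot, x^k_{-i})$:

$G(x^k_i, x^k_{-i}) - G(\hat{x}_i(x^k), x^k_{-i}) \geq \xi_i(\hat{x}_i(x^k), x^k_{-i})^T (x^k_i - \hat{x}_i(x^k)),$

which yields

$\xi_i(\hat{x}_i(x^k), x^k_{-i})^T (\hat{x}_i(x^k) - x^k_i) \geq G(\hat{x}_i(x^k), x^k_{-i}) - G(x^k).$ (25)

The desired result (21) is readily obtained by combining (22) with (24) and (25), and summing over $i \in \hat{S}^k$.

Lemma 9. Let $\{x^k\}$ be the sequence generated by Algorithm 1, and $\{\gamma^k\} \downarrow 0$. For every $k \in \mathbb{N}_+$ sufficiently large, and $\hat{S}^k$ generated as in step S.3, the following holds:

$G(x^{k+1}) \leq G(x^k) + \gamma^k L_G \sum_{i \in \hat{S}^k} \varepsilon^k_i + \gamma^k \sum_{i \in \hat{S}^k} \big[G(\hat{x}_i(x^k), x^k_{-i}) - G(x^k)\big].$ (26)

Proof: Given $k$ and $\hat{S}^k$, define $\bar{x}^k \triangleq (\bar{x}^k_i)_{i=1}^N$, with $\bar{x}^k_i \triangleq x^k_i + \gamma^k (\hat{x}_i(x^k) - x^k_i)$ if $i \in \hat{S}^k$, and $\bar{x}^k_i \triangleq x^k_i$ otherwise. By the convexity and Lipschitz continuity of $G$, it follows

$G(x^{k+1}) = G(x^k) + \big(G(x^{k+1}) - G(\bar{x}^k)\big) + \big(G(\bar{x}^k) - G(x^k)\big) \leq G(x^k) + \gamma^k L_G \sum_{i \in \hat{S}^k} \varepsilon^k_i + \big(G(\bar{x}^k) - G(x^k)\big),$ (27)

where $L_G$ is a global Lipschitz constant of $G$.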
We bound next the last term on the RHS of 7. Let γ γ N, for large enough so that < γ <. Defne ˇx ˇx N, wth ˇx x f / Ŝ, and ˇx γ x x + γ x 8 otherwse. Usng the defnton of x t s not dffcult to see that x N N x + N ˇx. 9 Usng 9 and nvong the convexty of G, the followng curson holds for suffcently large : G x G N ˇx,x + N x,ˇx + N N x G N ˇx,x + N N x, N ˇx + N N x N Gˇx,x + N N G x, N ˇx + N N x N Gˇx,x + N N G N x,ˇx + N N x

11 N Gˇx,x + N N G + N N Gˇx,x + N N G N ˇx,x x,x,ˇx, ˇx,x N + N N x + N N x,x, N ˇx, + N N x, N Gˇx,x + N Gˇx,x + N N G x,x, N ˇx, + N N x,... N N Gˇx,x. 3 Usng 3, the last term on the RHS of 7 can be upper bounded for suffcently large as G x Gx [ Gˇx N,x Gx ] N a N N [ Gˇx,x Gx ] Ŝ [ γ G x x,x + γ Gx Gx ] Ŝ γ [ G x x,x Gx ], Ŝ 3 whe a s due to the convexty of G,x and the defnton of ˇx [cf. 8]. The desd nequalty 6 follows adly by combnng 7 wth 3. Lemma. [48, Lemma 3.4, p.] Let {X }, {Y }, and {Z } be the sequences of numbers such that Y for all. Suppose that X + X Y +Z,,,... and Z <. Then ether X or else {X } converges to a fnte value and Y <. C. Proof of Theom For any gven, the Descent Lemma [4] yelds: wth ẑ ẑ N and z z N defned n step S.4 of Algorthm, F x + F x +γ x F x T ẑ x γ L F + ẑ x. 3 We bound next the second and thrd terms on the RHS of 3. Denotng by Ŝ the complement of Ŝ, we have, x F x T ẑ x x F x T ẑ xx + xx x a x F x T Ŝ z xx Ŝ + x F x T Ŝ x xx Ŝ + x F x T Ŝ xx x Ŝ + x F x T Ŝ xx x Ŝ x F x T Ŝ z xx Ŝ + x F x T Ŝ xx x Ŝ b Ŝ ε x Fx + x F x T Ŝ xx x Ŝ c Ŝ ε x Fx q xx x Ŝ + [ Gx G x x,x ] Ŝ 33 whe n a we used the defnton of ẑ and of the set Ŝ ; n b we used z x x ε ; and c follows from cf. Lemma 8. The thrd term on the RHS of 3 can be bounded as ẑ x z ˆxx Ŝ + ˆxx x Ŝ + Ŝ z x x + ˆxx x Ŝ ε + ˆxx x Ŝ, Ŝ 34 whe the frst nequalty follows from the defnton of z and ẑ, and n the last nequalty we used z x x ε. Now, we combne the above sults to get the descent property of V along {x }. For suffcently large N +, t holds Vx + Fx + +Gx + V x γ q γ L F xx x Ŝ +T, 35 whe the nequalty follows from, 3, 33, and 34, and T s gven by T γ ε LG + x Fx + γ L F ε. N N By assumpton v n Theom, t s not dffcult to show that T <. Snce γ, t follows from 35 that the exst some postve constant β and a suffcently large

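$k$, say $\bar{k}$, such that

$V(x^{k+1}) \leq V(x^k) - \gamma^k \beta_1 \|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\|^2 + T^k,$ (36)

for all $k \geq \bar{k}$. Invoking Lemma 10 while using $\sum_k T^k < \infty$ and the coercivity of $V$, we deduce from (36) that

$\lim_{t \to \infty} \sum_{k=\bar{k}}^{t} \gamma^k \|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\|^2 < +\infty,$ (37)

and thus also

$\lim_{t \to \infty} \sum_{k \in \mathcal{K}_{mx},\, \bar{k} \leq k \leq t} \gamma^k \|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\|^2 < +\infty.$ (38)

Lemma 5 together with (38) imply $\liminf_{\mathcal{K}_{mx} \ni k \to \infty} \|(\hat{x}(x^k) - x^k)_{\hat{S}^k}\| = 0$, w.p.1, which by Lemma 7 implies

$\liminf_{k \to \infty} \|\hat{x}(x^k) - x^k\| = 0$, w.p.1. (39)

Therefore, the limit point of the infimum sequence is a fixed point of $\hat{x}(\cdot)$ w.p.1.

D. Proof of Theorem 3

The proof follows similar ideas as the one of Theorem in our recent work [19], but with the nontrivial complication of dealing with randomness in the block selection. Given (39), we show next that, under the separability assumption on $G$, it holds that $\lim_{k \to \infty} \|\hat{x}(x^k) - x^k\| = 0$ w.p.1. For notational simplicity, let us define $\Delta\hat{x}(x^k) \triangleq \hat{x}(x^k) - x^k$. Note first that for any finite but arbitrary sequence $\{k, k+1, \ldots, i_k - 1\}$, it holds that

$E\Big[\sum_{t \in \mathcal{K}_{mx},\, k \leq t < i_k} \gamma^t\Big] = \sum_{t=k}^{i_k-1} \gamma^t\,P(t \in \mathcal{K}_{mx}) \geq p \sum_{t=k}^{i_k-1} \gamma^t > \beta \sum_{t=k}^{i_k-1} \gamma^t,$

and thus

$P\Big(\sum_{t \in \mathcal{K}_{mx},\, k \leq t < i_k} \gamma^t > \beta \sum_{t=k}^{i_k-1} \gamma^t\Big) > 0,$

for all $k \in \mathcal{K}$ and $0 < \beta < p$. This implies that, w.p.1, there exists an infinite sequence of indexes, say $\mathcal{K}_1 \subseteq \mathcal{K}$, such that

$\sum_{t \in \mathcal{K}_{mx},\, k \leq t < i_k} \gamma^t > \beta \sum_{t=k}^{i_k-1} \gamma^t, \quad k \in \mathcal{K}_1.$ (40)

Suppose now, by contradiction, that $\limsup_{k \to \infty} \|\Delta\hat{x}(x^k)\| > 0$ with a positive probability. Then we can find a realization such that at the same time (40) holds for some $\mathcal{K}_1$ and $\limsup_{k \to \infty} \|\Delta\hat{x}(x^k)\| > 0$. In the rest of the proof we focus on this realization and get a contradiction, thus proving that $\limsup_{k \to \infty} \|\Delta\hat{x}(x^k)\| = 0$ w.p.1.

If $\limsup_{k \to \infty} \|\Delta\hat{x}(x^k)\| > 0$ then there exists a $\delta > 0$ such that $\|\Delta\hat{x}(x^k)\| > 2\delta$ for infinitely many $k$ and also $\|\Delta\hat{x}(x^k)\| < \delta$ for infinitely many $k$. Therefore, one can always find an infinite set of indexes, say $\mathcal{K}$, having the following properties: for any $k \in \mathcal{K}$, there exists an integer $i_k > k$ such that

$\|\Delta\hat{x}(x^k)\| < \delta, \quad \|\Delta\hat{x}(x^{i_k})\| > 2\delta,$ (41)

$\delta \leq \|\Delta\hat{x}(x^j)\| \leq 2\delta, \quad k < j < i_k.$ (42)

Proceeding now as in the proof of Theorem in [19], we have: for $k \in \mathcal{K}$,

$\delta \overset{(a)}{<} \|\Delta\hat{x}(x^{i_k})\| - \|\Delta\hat{x}(x^k)\| \leq \|\hat{x}(x^{i_k}) - \hat{x}(x^k)\| + \|x^{i_k} - x^k\|$ (43)

$\overset{(b)}{\leq} (1 + \hat{L})\,\|x^{i_k} - x^k\|$ (44)

$\overset{(c)}{\leq} (1 + \hat{L}) \sum_{t=k}^{i_k-1} \gamma^t \big(\|\Delta\hat{x}(x^t)_{S^t}\| + \|(z^t - \hat{x}(x^t))_{S^t}\|\big) \overset{(d)}{\leq} (1 + \hat{L})(2\delta + \varepsilon^{max}) \sum_{t=k}^{i_k-1} \gamma^t,$ (45)

where (a) follows from (41); (b) is due to Lemma 6; (c) comes from the triangle inequality, the updating rule of the algorithm and the definition of $\hat{z}^k$; and in (d) we used (41), (42), and $\|z^t - \hat{x}(x^t)\| \leq \sum_{i \in \mathcal{N}} \varepsilon^t_i$, where $\varepsilon^{max} \triangleq \max_k \sum_{i \in \mathcal{N}} \varepsilon^k_i < \infty$. It follows from (45) that

$\liminf_{\mathcal{K} \ni k \to \infty} \sum_{t=k}^{i_k-1} \gamma^t \geq \frac{\delta}{(1 + \hat{L})(2\delta + \varepsilon^{max})} > 0.$ (46)

We show next that (46) is in contradiction with the convergence of $\{V(x^k)\}$.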
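To do that, we preliminarily prove that, for sufficiently large $k \in \mathcal{K}$, it must be $\|\Delta\hat{x}(x^k)\| \geq \delta/2$. Proceeding as in (45), we have: for any given $k \in \mathcal{K}$,

$\|\Delta\hat{x}(x^{k+1})\| - \|\Delta\hat{x}(x^k)\| \leq (1 + \hat{L})\,\|x^{k+1} - x^k\| \leq (1 + \hat{L})\,\gamma^k \big(\|\Delta\hat{x}(x^k)\| + \varepsilon^{max}\big).$

It turns out that for sufficiently large $k \in \mathcal{K}$ so that $(1 + \hat{L})\gamma^k < \delta/(\delta + 2\varepsilon^{max})$, it must be

$\|\Delta\hat{x}(x^k)\| \geq \delta/2;$ (47)

otherwise the condition $\|\Delta\hat{x}(x^{k+1})\| \geq \delta$ would be violated [cf. (42)]. Hereafter we assume without loss of generality that (47) holds for all $k \in \mathcal{K}$ (in fact, one can always restrict $\{x^k\}_{k \in \mathcal{K}}$ to a proper subsequence).

We can show now that (46) is in contradiction with the convergence of $\{V(x^k)\}$. Using (36) (possibly over a subsequence), we have: for sufficiently large $k \in \mathcal{K}$,

$V(x^{i_k}) \leq V(x^k) - \beta_1 \sum_{t \in \mathcal{K}_{mx},\, k \leq t < i_k} \gamma^t \|\Delta\hat{x}(x^t)_{\hat{S}^t}\|^2 + \sum_{t=k}^{i_k-1} T^t \overset{(a)}{\leq} V(x^k) - \beta_2 \sum_{t \in \mathcal{K}_{mx},\, k \leq t < i_k} \gamma^t \|\Delta\hat{x}(x^t)\|^2 + \sum_{t=k}^{i_k-1} T^t \overset{(b)}{\leq} V(x^k) - \beta_3 \sum_{t=k}^{i_k-1} \gamma^t + \sum_{t=k}^{i_k-1} T^t,$ (48)

where (a) follows from Lemma 7 and $\beta_2 \triangleq c^2 \beta_1 > 0$; and (b) is due to (47) and (40), with $\beta_3 \triangleq \beta\,\beta_2\,\delta^2/4$.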

Since $\{V(x^k)\}$ converges and $\sum_k T^k < \infty$, it holds that $\lim_{\mathcal{K} \ni k \to \infty} \sum_{t=k}^{i_k-1} \gamma^t = 0$, contradicting (46). Therefore $\lim_{k \to \infty} \|\hat{x}(x^k) - x^k\| = 0$ w.p.1. Since $\{x^k\}$ is bounded (by the coercivity of $V$ and the convergence of $\{V(x^k)\}$), it has at least one limit point $\bar{x} \in X$. By the continuity of $\hat{x}(\cdot)$ (cf. Lemma 6) it holds that $\hat{x}(\bar{x}) = \bar{x}$. By Proposition, $\bar{x}$ is also a stationary solution of Problem (1).

REFERENCES

[1] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), 1996.
[2] Z. Qin, K. Scheinberg, and D. Goldfarb, "Efficient block-coordinate descent algorithms for the group lasso," Mathematical Programming Computation, vol. 5, June 3.
[3] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, "A comparison of optimization methods and software for large-scale l1-regularized linear classification," The Journal of Machine Learning Research, vol. 9999.
[4] K. Fountoulakis and J. Gondzio, "A Second-Order Method for Strongly Convex L1-Regularization Problems," arXiv preprint arXiv: , 3.
[5] I. Necoara and D. Clipici, "Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC," Journal of Process Control, vol. 3, no. 3, March 3.
[6] Y. Nesterov, "Gradient methods for minimizing composite functions," Mathematical Programming, vol. 4, pp. 5 6, August 3.
[7] P. Tseng and S. Yun, "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, vol. 7, no. -, March 9.
[8] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol., no., pp. 83, Jan. 9.
[9] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. on Signal Processing, vol. 57, no. 7, July 9.
[10] Z. Peng, M. Yan, and W. Yin, "Parallel and distributed sparse optimization," in Signals, Systems and Computers, 3 Asilomar Conference on. IEEE, 3.
[11] K. Slavakis and G. B. Giannakis, "Online dictionary learning from big data using accelerated stochastic approximation algorithms," in Proc.
of the IEEE 4 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 4), Florence, Italy, May 4-9, 4.
[12] K. Slavakis, G. B. Giannakis, and G. Mateos, "Modeling and optimization for big data analytics," IEEE Signal Process. Mag., vol. 3, no. 5, pp. 8 3, Sept. 4.
[13] M. De Santis, S. Lucidi, and F. Rinaldi, "A fast active set block coordinate descent algorithm for l1-regularized least squares," eprint arXiv:43.738, March 4.
[14] S. Sra, S. Nowozin, and S. J. Wright, Eds., Optimization for Machine Learning, ser. Neural Information Processing. Cambridge, Massachusetts: The MIT Press, Sept.
[15] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with Sparsity-inducing Penalties. Foundations and Trends in Machine Learning, Now Publishers Inc, Dec.
[16] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, "Parallel coordinate descent for l1-regularized loss minimization," in Proc. of the 8th International Conference on Machine Learning, Bellevue, WA, USA, June 8 - July.
[17] M. Patriksson, "Cost approximation: a unified framework of descent algorithms for nonlinear programs," SIAM Journal on Optimization, vol. 8, no., 1998.
[18] F. Facchinei, S. Sagratella, and G. Scutari, "Flexible parallel algorithms for big data optimization," in Proc. of the IEEE 4 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 4), Florence, Italy, May 4-9, 4.
[19] F. Facchinei, S. Sagratella, and G. Scutari, "Flexible parallel algorithms for big data optimization," IEEE Trans. on Signal Processing, submitted in Feb. 4. [Online]. Available:
[20] O. Fercoq, Z. Qu, P. Richtárik, and M. Takáč, "Fast distributed coordinate descent for non-strongly convex losses," arXiv preprint arXiv:45.53, 4.
[21] O. Fercoq and P. Richtárik, "Accelerated, parallel and proximal coordinate descent," arXiv preprint arXiv:3.5799, 3.
[22] Z. Lu and L. Xiao, "Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming," arXiv preprint arXiv:36.598v, 3.
[23] I. Necoara and D. Clipici, "Distributed random coordinate descent method for composite minimization," Technical Report, pp. 4, Nov. 3. [Online]. Available:
[24] Y.
Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol., no.
[25] P. Richtárik and M. Takáč, "Parallel coordinate descent methods for big data optimization," arXiv preprint arXiv:.873.
[26] S. Shalev-Shwartz and A. Tewari, "Stochastic methods for l1-regularized loss minimization," The Journal of Machine Learning Research.
[27] Z. Lu and L. Xiao, "On the complexity analysis of randomized block-coordinate descent methods," arXiv preprint arXiv:35.473, 3.
[28] I. Necoara and A. Patrascu, "A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints," Computational Optimization and Applications, vol. 57, no., 4.
[29] A. Patrascu and I. Necoara, "Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization," J. of Global Optimization, pp. 3, Feb. 4.
[30] P. Richtárik and M. Takáč, "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function," Mathematical Programming, vol. 44, no. -, pp. 38, 4.
[31] I. Dassios, K. Fountoulakis, and J. Gondzio, "A second-order method for compressed sensing problems with coherent and redundant dictionaries," arXiv preprint arXiv:45.446, 4.
[32] G.-X. Yuan, C.-H. Ho, and C.-J. Lin, "An improved GLMNET for l1-regularized logistic regression," The Journal of Machine Learning Research, vol. 3, no.
[33] G. Scutari, F. Facchinei, P. Song, D. Palomar, and J.-S. Pang, "Decomposition by Partial linearization: Parallel optimization of multi-agent systems," IEEE Trans. Signal Process., vol. 6, Feb. 4.
[34] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, "Feature clustering for accelerating parallel coordinate descent," in Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc.
[35] A. Auslender, Optimisation: méthodes numériques. Masson, 1976.
[36] P. Tseng, "Convergence of a block coordinate descent method for nondifferentiable minimization," Journal of Optimization Theory and Applications, vol. 9, no. 3.
[37] M. Razaviyayn, M. Hong, and Z.-Q.
Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM J. on Opt., vol. 3, no., pp. 6 53, 3.
[38] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J.-S. Pang, "Parallel successive convex approximation for nonsmooth nonconvex optimization," Preprint arXiv: , June 4.
[39] J. T. Goodman, "Exponential priors for maximum entropy models," Mar. 4 8, US Patent 7,34,376.
[40] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, "Coordinate descent method for large-scale l2-loss linear support vector machines," The Journal of Machine Learning Research, vol. 9, 8.
[41] R. Tappenden, P. Richtárik, and J. Gondzio, "Inexact coordinate descent: complexity and preconditioning," arXiv preprint arXiv:34.553, 3.
[42] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Athena Scientific Press, 1989.
[43] Y. Li and S. Osher, "Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm," Inverse Probl. Imaging, vol. 3, no. 3, 9.
[44] I. S. Dhillon, P. K. Ravikumar, and A. Tewari, "Nearest neighbor based greedy coordinate descent," in Advances in Neural Information Processing Systems 4 (NIPS).
[45] P. Richtárik and M. Takáč, "On optimal probabilities in stochastic coordinate descent methods," arXiv preprint arXiv:3.3438, 3.
[46] P. Richtárik and M. Takáč, "Distributed coordinate descent method for learning with big data," arXiv preprint arXiv:3.59, 3.
[47] A. Daneshmand, "Numerical Comparison of Hybrid Random/Deterministic Parallel Algorithms for nonconvex big data Optimization," Dept. of Elect. Eng., SUNY Buffalo, Tech. Rep., August 4. [Online]. Available:
[48] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Cambridge, Massachusetts: Athena Scientific Press, May.