Hybrd Random/Determnstc Parallel Algorthms for Convex and Nonconvex Bg Data Optmzaton Amr Daneshmand, Francsco Facchne, Vyacheslav Kungurtsev, and Gesualdo Scutar Abstract We propose a decomposton framewor for the parallel optmzaton of the sum of a dffentable possbly nonconvex functon and a nonsmooth possbly nonseparable, convex one. The latter term s usually employed to enforce structu n the soluton, typcally sparsty. The man contrbuton of ths wor s a novel parallel, hybrd random/determnstc decomposton scheme when, at each teraton, a subset of bloc varables s updated at the same tme by mnmzng a convex surrogate of the orgnal nonconvex functon. To tacle huge-scale problems, the bloc varables to be updated a chosen accordng to a mxed random and determnstc procedu, whch captus the advantages of both pu determnstc and random update-based schemes. Almost su convergence of the proposed scheme s establshed. Numercal sults show that on huge-scale problems the proposed hybrd random/determnstc algorthm outperforms random and determnstc schemes on both convex and nonconvex problems. Index Terms Nonconvex problems, Parallel and dstrbuted methods, Random selectons, Jacob method, Sparse soluton. I. INTRODUCTION Recent years have wtnessed a surge of ntest n very large scale optmzaton problems, and the evocatve term Bg Data optmzaton has been coned to denote ths new aa of search. Many such problems can be formulated as the mnmzaton of the sum of a smooth possbly nonconvex functon F and of a nonsmooth possbly nonseparable convex one G: mn V x F x+gx, x X whe X s a closed convex set wth a cartesan product structu: X =Π N = X R n. Our focus s on problems wth a huge number of varables, as those that can be encounted, e.g., n machne learnng, compssed sensng, data mnng, tensor factorzaton and completon, networ optmzaton, mage processng, genomcs, etc.. We fer the ader to [] [4] and the boos [5], [6] as entry ponts to the lteratu. Bloc Coordnate Descent BCD methods rapdly emerged as a wnnng paradgm to attac Bg Data optmzaton, manly due to ther low-cost per-teraton and scalablty; see e.g. [4]. At each teraton of a BCD method one bloc of varables s updated usng frst-order nformaton, whle eepng all other varables fxed. The choce of the bloc of varables to update at each teraton can accomplshed n several ways, for example usng a cyclc order or some gedy/opportunstc selecton strategy, whch ams at selectng the bloc leadng The order of the authors s alphabetcal as each author contrbuted equally. A. Daneshmand and G. Scutar a wth the Dept. of Electrcal Engneerng, at the State Unv. of New Yor at Buffalo, Buffalo, USA. Emal: <amrdane,gesualdo>@buffalo.edu. F. Facchne s wth the Dept. of Computer, Control, and Management Engneerng, at Unv. of Rome La Sapenza, Rome, Italy. Emal: francsco.facchne@unroma.t. V. Kungurtsev s wth the Agent Technology Center, Dept. of Computer Scence, Faculty of Electrcal Engneerng, Czech Techncal Unversty n Prague. Emal: vyacheslav.ungurtsev@fel.cvut.cz. Part of ths wor has been psented at the Aslomar Confence 04 []. to the largest decase of the objectve functon. The cyclc order has the advantage of beng extmely smple, but the gedy strategy usually provdes faster convergence, at the cost of an ncased computatonal effort at each teraton. However, no matter whch bloc selecton rule s adopted, as the dmensons of the optmzaton problems ncase, even BCD methods may sult nadequate. To allevate the curse of dmensonalty, the dffent nd of strateges have been proposed, namely: a parallelsm, whe several blocs of varables a updated smultaneously n a multco or dstrbuted computng envronment, see e.g. [6] [8], [8] [], [7] [6]; b random selecton of the blocs of varables to update, see e.g. [] [3]; and c use of mo-than-frstorder nformaton, for example approxmated Hessans or parts of the orgnal functon tself, see e.g. [5], [9], [0], [3], [33]. Pont a s self-explanatory and rather ntutve; he we only mar that the vast majorty of parallel BCD methods apply to convex problems only. Ponts b and c need further comments. Pont b: Random selecton-based rules a essentally as cheap as cyclc selectons whle allevatng some of the ptfalls of cyclc updates. They a also levant n dstrbuted envronments when data a not avalable n ther entty, but a acqud ether n batches or over a networ. In such scenaros, one mght be ntested n runnng the optmzaton at a certan nstant even wth the lmted, randomly avalable nformaton. The man lmtaton of random selecton rules s that they man dsconnected from the status of the optmzaton process, whch nstead s exactly the nd of behavor that gedy-based updates try to avod, n favor of faster convergence, but at the cost of mo ntensve computaton. Pont c: The use of mo-than-frst-order nformaton also has to do wth the trade-off between cost-per-teraton and overall cost of the optmzaton process. Although usng hgher order or structural nformaton may seem unasonable n Bg Data problems, cent studes, as those mentoned above, suggest that a judcous use of some nd of mo-than-frstorder nformaton can lead to substantal mprovements. The above pros & cons analyss suggests that t would be desrable to desgn a parallel algorthm for nonconvex problems combnng the benefts of random setchng and gedy updates, possbly usng mo-than-frst-order nformaton. To the best of our nowledge, no such algorthm exsts n the lteratu. In ths paper, buldng on our pvous determnstc methods [9], [0], [34], we propose a BCD-le scheme for the computaton of statonary solutons of Problem fllng the gap and enjoyng all the followng featus: It uses a random selecton rule for the blocs, followed by a determnstc subselecton; It can classcally tacle separable convex functon G,.e., Gx = G x,butalsononseparable functons G; 3 It can deal wth a nonconvex functons F ;
4 It can use both frst-order and hgher-order nformaton; 5 It s parallel; 6 It can use nexact updates; 7 It converges almost suly,.e. our convergence sults a of the form wth probablty one. As far as we a awa of, ths s the frst algorthm enjoyng all these propertes, even n the convex case. The combnaton of all the featus -7 n one sngle algorthm s a major achevement n tself, whch offers gat flexblty to develop talod nstances of solutons methods wthn the same framewor and thus all convergng under the same unfed condtons. Last but not least, our experments show mpssve performance of the proposed methods, outperformng stateof-the-art soluton scheme cf. Sec. IV. As a fnal mar, we underlne that, at mo methodologcal level, the combnaton of all featus -7 and, n partcular, the need to conclate random and determnstc strateges, led to the development of a new type of convergence analyss see Appendx A whch s also of ntest per se and could brng to further developments. Below we further comment on some of featus -7, compa to exstng sults, and detal our contrbutons. Featu : As far as we a awa of, the dea of mang a random selecton and then perform a gedy subselecton has been pvously dscussed only n [35]. However, sults then a only for convex problems wth a specfc structu; a based on a gularzed frst-order model; qu a very strngent spectral-radus-type condton to guarantee convergence, whch sevely lmts the dege of parallelsm; and v convergence sults a n terms of expected value of the objectve functon. The proposed algorthmc framewor expands vastly on ths settng, whle enjoyng also all propertes -7. In partcular, t s the frst hybrd random/gedy scheme for nonconvex nonseparable functons, and t allows any dege of parallelsm.e., the update of any number of varables; and all ths s acheved under much weaer convergence condtons than those n [35], satsfed by most of practcal problems. Numercal sults show that the proposed hybrd schemes updatng gedly just some blocs wthn the pool of those selected by a random rule s very effectve, and seems to pserve the advantages of both random and determnstc selecton rules. Featu : The ablty of dealng wth some classes of nonseparable convex functons has been documented n [36] [38], but only for determnstc and sequental schemes; our approach extends also to parallel, random schemes. Featu 3: The lst of wors dealng wth BCD methods for nonconvex F s s short: [3], [30] for random sequental methods; and [8], [8] [0] for determnstc parallel ones. Random and cyclc parallel methods for nonconvex F s not enjoyng the ey propertes,, and 6 a studed, ndependently from ths wor but drawng on our sults [9], [0], also n [39]. Note that the method developed n [39] can also be seen as a partcular case of our Algorthm and the numercal sults show that the addtonal featus we provde, n partcular Featu, s fundamental for a good numercal behavor see Sec. IV for mo detals. We observe that for certan classes of specfc addtvely separable F s, dual ADMM-le schemes have been proposed for nonconvex problems shown to be convergent under strong condtons; see, e.g., [40] and fences then. However, for the scale and generalty of problems we a ntested n, they a computatonally mpractcal. Featu 4: We want to stss the ablty of the proposed algorthm to explot n a systematc way mo-than-frstorder nformaton. Dffently from BCD methods that use at each teraton a possbly gularzed frst-order model of the objectve functon, our method provdes the flexblty of usng mo sophstcated models, ncludng Newton-le surrogates as well as mo structud functons as those descrbed n the followng example. Suppose that n F = F + F,whe F s convex and F s not. Then, at teraton, one could base the update of the -th bloc on the surrogate functon F x, x + x F x T x x +Gx, x, whe x denotes the vector obtaned from x by deletng x.the ratonale he s that nstead of lnearzng the whole functon F we only lnearze the dffcult, nonconvex part F.Inths lght we can also better appcate the mportance of featu 6, snce f we go for mo complex surrogate functons, the ablty to deal wth nexact solutons becomes mportant. Featu 6: Inexact soluton methods have been lttle studed. Papers [4], [4], [4] somewhat ndctly consder some of these ssues for l -loss lnear support vector machnes problems. A mo systematc tatment of nexactness of the soluton of a frst-order model s documented n [43], n the context of random sequental BCD methods for convex problems. Our sults n ths paper a based on our pvous wors [9], [0], [34], whe both the use of mo-than-frstorder models and nexactness a ntroduced and rgorously analyzed n the context of parallel, determnstc methods. As a fnal mar, we note that a large porton of the afomentoned wors focuses on global complexty analyss. Specfcally, wth the excepton of [30], they all studed gularzed gradent-type methods for convex problems. Complexty analyss s an mportant topc, but t s outsde the scope of ths paper. Gven our expanded settng, we beleve t s mo frutful to concentrate on provng convergence and verfyng the practcal effectveness of our algorthms. The paper s organzed as follows. Secton II formally ntroduces the optmzaton problem along wth several motvatng examples and also dscusses some techncal ponts. The proposed algorthmc framewor and ts convergence propertes a ntroduced n Secton III, whle numercal sults a psented n Secton IV. Secton V draws some conclusons. II. PROBLEM DEFINITION AND PRELIMINARIES A. Problem defnton We consder Problem, whe the feasble set X = X X N s a Cartesan product of lower dmensonal convex sets X R n,andx R n s parttoned accordngly: x = x,...,x N, wth each x R n ; we denote by N {,...,N} the set of the N blocs. The functon F s smooth and not necessarly convex and separable and G s convex, and possbly nondffentable and nonseparable. Problem s very general and ncludes many popular Bg Data formulatons; some examples a lsted next.
3 Ex.# group LASSO: F x = Ax b and Gx = c x or Gx =c N = x, X = R n, X = R n, wth A R m n, b R m,andc R ++ gven constants; group LASSO has long been used n many applcatons n sgnal processng and statstcs []. Ex.# lnear gsson: F x = 0 and Gx = Ax b, X = R n, wth A R m n, and b R m gven constants; the l norm lnear gsson s wdely used technques n statstcs [44]. Note that G s nonseparable. Ex.#3 The Fermat-Weber problem: F x =0and Gx = I = ω A x b, X = R n, wth A R m n, b R m, and ω > 0 gven constants, for all ; ths problem, whch conssts n fndng x R n such that the weghted sum of dstances between x and the I anchors ω,ω,...,ω I, was wdely nvestgated n the optmzaton as well as locaton communtes; see, e.g., [45]. Ths s another example of nonseparable G. Ex.#4 The TV mage constructon: F x = AX V and GX =c TVX, X = R m m,whe A R s m, X R t m, V R t m, c R ++, and TVX m,j= D jx p s the dscte total varatonal sem-norm of X, wth p =or and D j X beng the dscte gradent of X defned as D j X [D j X, D j X ], wth D j X = X +,j X,j f <mand D j X =0 otherwse, and D j X = X,j+ X,j f < m and D j X =0otherwse [46]. Ths s the well-nown nose-fe dscte TV model for compssng sensng mage constructon [46]; TV mnmzng models have become a successful methodology for mage processng, ncludng denosng, deconvoluton, and storaton, to name a few. Ex.#5 Dctonary learnng: F X, Y = M XY F and GY = c Y, X = {X, Y R s m R m t : Xe α, =,...,m}, whex and Y a the matrx optmzaton varables, M R s t, c>0, and α m = > 0 a gven constants, e s the m-dmensonal vector wth a n the -th coordnate and 0 s elsewhe, and X F and X denote the Frobenus norm and the l matrx norm of X, spectvely; ths s an example of the dctonary learnng problem for sparse psentaton [47] that fnds numerous applcatons n varous felds such as computer vson and sgnal and mage processng. Note that F X, Y s not jontly convex n X, Y Ex.#6 Matrx completon: F X, Y =,j Ω M j XY j +c X F + Y F, GX, Y =0, X = R s m R m t, whe Ω s a gven subset of {,...,s} {,...,t}. Matrx completon has found numerous applcatons n varous felds such as commender systems, computer vson, and system dentfcaton. Other problems of ntest that can be cast n the form nclude the Logstc Regsson, the Support Vector Machne, the Nuclear Norm Mnmzaton, the Robust Prncpal Component Analyss, the Sparse Inverse Covarance Selecton, and the Nonnegatve Tensor Factorzaton; see, e.g., [48]. Assumptons. Gven, we mae the followng blanet assumptons: A Each X s nonempty, closed, and convex; A F s C on an open set contanng X; A3 F s Lpschtz contnuous on X wth constant L F ; A4 G s contnuous and convex on X possbly nondffentable and nonseparable; A5 V s coercve,.e., lm x X, x V x =+. The above assumptons a standard and a satsfed by many practcal problems. For nstance, A3 holds automatcally f X s bounded, wheas A5 guarantees the exstence of a soluton. Wth the advances of mult-co archtectus, t s desrable to develop parallel soluton methods for Problem wheby operatons can be card out on some or possbly all bloc varables x at the same tme. The most natural parallel Jacob-type method one can thn of s updatng all blocs smultaneously: gven x, each bloc varable x s updated by solvng the followng subproblem x + argmn { F x, x +Gx, x }. x X Unfortunately ths method converges only under very strctve condtons [49] that a seldom verfed n practce even n the absence of the nonsmooth part G. Furthermo, the exact computaton of x + may be dffcult and computatonally too expensve. To cope wth these ssues, a natural approach s to place the nonconvex functon F,x by a sutably chosen local convex surrogate F x ; x, and solve nstead the convex problems one for each bloc x + argmn { h x ; x F } x ; x +Gx ; x, x X 3 wth the understandng that the mnmzaton n 3 s smpler than that n. Note that the functon G has not been touched; ths s because t s generally much mo dffcult to fnd a good surrogate of a nondffentable functon than of a dffentable one; G s alady convex; and the functons G encounted n practce do not mae the optmzaton problem 3 dffcult a closed form soluton s avalable for a large classes of G s, f F a properly chosen. In ths wor we assume that the surrogate functons F z; w :X X R, have the followng propertes: F F ; w s unformly strongly convex wth constant q> 0 on X ; F x F x ; x = x F x for all x X; F3 x F z; s Lpschtz contnuous on X for all z X ; whe x F s the partal gradent of F wth spect to w.r.t. ts frst argument z. Functon F should be garded as a smple convex surrogate of F at the pont x w.r.t. the bloc of varables x that pserves the frst order propertes of F w.r.t. x. Note that, contrary to most of the wors n the lteratu e.g., [38], we do not qu F to be a global upper surrogate of F, whch sgnfcantly enlarges the range of applcablty of the proposed soluton methods. The most popular choce for F satsfyng F-F3 s F x ; x =F x + x F x T x x +τ x x, 4 wth τ > 0. Ths s essentally the way a new teraton s computed n most bloc-bcds for the soluton of LASSO problems and ts generalzatons. When G 0, ths choce
4 gves rse to a gradent-type scheme; n fact we obtan x + smply by a shft along the antgradent. As we dscussed n Sec. I, ths s a frst-order method, so t seems advsable, at least n some stuatons, to use mo nformatve F -s. If F x, x s convex, an alternatve s to tae F x ; x as a second order expanson of F x, x around x,.e., F x ; x =F x + x F x T x x + x x T x x F x +qi x x, 5 whe q s nonnegatve and can be taen to be zero f F x, x s actually strongly convex. When G 0, ths essentally corsponds to tang a Newton step n mnmzng the duced problem mn x X F x, x. Stll n the case of a unformly strongly convex F x, x, one could also tae just F x ; x =F x, x, whch pserves the structu of the functon. Other valuable choces talod to specfc applcatons a dscussed n [0], [34]. As a gudelne, note that our method, as we shall descrbe n detals shortly, s based on the teratve approxmate soluton of problem 3 and thefo a balance should be amed at between the accuracy of the surrogate F and the ease of soluton of 3. Needless to say, the opton 4 s the less nformatve one, but usually t maes the computaton of the soluton of 3 a cheap tas. Best-sponse map: Assocated wth each and x X, under F-F3, one can defne the followng bloc soluton map: x x argmn h x ; x. 6 x X Note that x x s always well-defned, snce the optmzaton problem n 6 s strongly convex. Gven 6, we can then ntroduce the soluton map X y xy x y N =. 7 Our algorthmc framewor s based on solvng n parallel a sutable selecton of subproblems 6, convergng thus to fxedponts of x of course the selecton vas at each teraton. It s then natural to as whch laton exsts between these fxed ponts and the statonary solutons of Problem. To answer ths ey queston, we call frst two basc defntons. Statonarty: A pont x s a statonary pont of f a subgradent ξ Gx exsts such that F x +ξ T y x 0 for all y X. Coordnate-wse statonarty: A pont x s a coordnatewse statonary pont of f subgradents ξ ξ Gx, wth N, exst such that x F x + ξ T y x 0, forally X and N. In words, a coordnate-wse statonary soluton s a pont for whch x s statonary w.r.t. every bloc of varables. Coordnate-wse statonarty s a weaer form of statonarty. It s the standard property of a lmt pont of a convergent coordnate-wse scheme see, for example [36] [38]. It s clear that a statonary pont s always a coordnatewse statonary pont; the converse however s not always true, unless extra condtons on G a satsfed. Regularty: Problem s gular at a coordnate-wse statonary pont x f x s also a statonary pont of the problem. The followng two smple cases mply the gularty condton, a G s separable stll nonsmooth,.e., Gx = G x ; b G s contnuously dffentable around x. Note that a s due to ξ Gx = G x, wheas b follows from ξ Gx = Gx = Gx. Of course these two cases a not at all nclusve of stuatons for whch gularty holds. As an example of a nonseparable functon for whch gularty holds at a pont at whch G s not contnuously dffentable, consder the functon arsng n logstc gsson problems F x = m j= log + e ajyt j x, wth X = R n,andy j R n and a j {, } beng gven constants. Now, choose Gx = c x ; the sultng functon s contnuously dffentable, and thefo gular, at any statonary pont but x 0. Its easy to verfy that V s also gular at x = 0, fc<log. The algorthm we psent n the paper expands upon the lteratu n psentng the frst determnstc or random parallel coordnate-wse scheme that converges to coordnatewse statonary ponts. Under the gularty condton these ponts a also statonary, and so among the class of parallel algorthms, the method we psent enlarges the class of problems for whch convergence to statonary ponts s acheved for Problem to nclude some classes of nonseparable G. Certanly, proxmal gradent-le algorthms can converge to statonary ponts for any nonseparable G, but such schemes a nhently ncapable of parallelzaton, and thus a typcally much slower n practce. Thus, our algorthm s a step towards, f not complete fulfllment of, the desderata of a parallel algorthm that converges to statonary ponts for all classes of Problem wth arbtrary nonsmooth convex G. The followng proposton s elementary and elucdates the connectons between statonarty condtons of Problem and fxed-ponts of x. Proposton. Gven Problem under A-A5 and F-F3, the followng hold: The set of fxed-ponts of x concdes wth the coordnate-wse statonary ponts of Problem ; If, n addton, Problem s gular at a fxed-pont of x, then such a fxed-pont s also a statonary pont of the problem. III. ALGORITHMIC FRAMEWORK We begn ntroducng a formal descrpton of the salent characterstc of the proposed algorthmc framewor the novel hybrd random/gedy bloc selecton rule. The random bloc selecton wors as follows: at each teraton, a random set S N s generated, and the blocs S a the potental canddate varables to update n parallel. The set S s a alzaton of a random set-valued mappng S wth values n the power set of N. We do not constrant S to any specfc dstrbuton; we only qu that, at each teraton, each bloc has a postve probablty possbly nonunform to be selected. A6 The sets S a alzatons of ndependent random setvalued mappngs S such that P S p, forall =,...,N and N +,andsomep>0.
5 A random selecton rule S satsfyng A6 wll be called proper samplng. Several proper samplng rules wll be dscussed n detals shortly. The proposed hybrd random/gedy bloc selecton rule conssts n combnng random and gedy updates n the followng form. Frst, a random selecton s performed the set S s generated. Second, a gedy procedu s run to select n the pool S only the subset of blocs, say Ŝ,that a promsng accordng to a pscrbed crteron. Fnally all the blocs n Ŝ a updated n parallel. The noton of promsng bloc s made formal next. Snce x s an optmal soluton of 6 f and only f x x =x, a natural dstance of x from the optmalty s d x x x. The blocs n S to be updated can be then chosen based on such an optmalty measu e.g., optng for blocs exhbtng larger d s. Note that n some applcatons, ncludng some of those dscussed n Sec. II, gven a proper bloc decomposton, x x can be computed easly n closed form, see Sec. IV for the dffent examples. However, ths s not always the case, and on some problems, the computaton of x x mght be too expensve. In these cases t mght be useful to ntroduce alternatve, less expensve metrcs by placng the dstance x x x wth a computatonally cheaper error bound,.e., a functon E x such that s x x x E x s x x x, 8 for some 0 < s s. We fer the ntested ader to [0] for some mo detals, and to [50] as an entry pont to the vast lteratu on error bounds. As an example, f problem s unconstraned, Gx 0, and we a usng the surrogate functon gven by 4, a sutable error bound s the functon E x = x F x +τ x x wth s = τ and s = L F. The proposed hybrd random/gedy scheme capturng all the featus -6 dscussed n Sec. I s formally gven n Algorthm. Note that n step S.3 nexact calculatons of x a allowed, whch s another notceable and useful featu: one can duce the cost per teraton wthout affectng too much, expence shows, the emprcal convergence speed. In step S.5 we ntroduced a memory n the varable updates: the new pont x + s a convex combnaton va γ of x and ẑ. Algorthm : Hybrd Random/Determnstc Flexble Parallel Algorthm HyFLEXA Data : {ε } for N, {γ } > 0, x 0 X, ρ 0, ]. Set =0. S. : If x satsfes a termnaton crteron: STOP; S. : Randomly generate a set of blocs S {,...,N} S.3 : Set M max S {E x }. Choose a subset Ŝ S that contans at least one ndex for whch E x ρm. S.4 : For all Ŝ, solve 6 wth accuracy ε : fnd z X s.t. z x x ε ; Set ẑ = z for Ŝ and ẑ = x for Ŝ S.5 : Set x + x + γ ẑ x ; S.6 : +, and go to S.. The convergence propertes of Algorthm a gven next. Theom. Let {x } be the sequence generated by Algorthm, under A-A6. Suppose that {γ } and {ε } satsfy the followng condtons: γ 0, ]; γ 0; γ = + ; v γ < + ; and v ε γ α mn{α, / x F x } for all N and some nonnegatve constants α and α. Addtonally, f nexact solutons a used n Step 3,.e., ε > 0 for some and nfnte, then assume also that G s globally Lpschtz on X. Then, ether Algorthm converges n a fnte number of teratons to a fxed-pont of ˆx of or the exsts at least one lmt pont of {x } that s a fxed-pont of ˆx w.p.. Proof. See Appendx C. Remar 3. Note that the condtons on {ɛ } mply that ɛ 0 for all. The Theom provdes mnmal condtons under whch convergence can be guaranteed. Practcally, of course the choce of ɛ wll affect the practcal performance of the algorthm and the approprate choce s problem dependent and gven by practcal expence. The convergence sults n Theom can be stngthened when G s separable. Theom 4. In the settng of Theom, suppose n addton that Gx s separable,.e., Gx = N G x. Then, ether Algorthm converges n a fnte number of teratons to a statonary soluton of Problem or every lmt pont of {x } s a statonary soluton of Problem w.p.. Proof. See Appendx D. On the random choce of S. We dscuss next some proper samplng rules S that can be used n Step 3 of the algorthm to generate the random sets S ; for notatonal smplcty the teraton ndex wll be omtted. The samplng rule S s unquely characterzed by the probablty mass functon PS P S = S, S N, whch assgn probabltes to the subsets S of N. Assocated wth S, defne the probabltes q j P S = j, for j =,...,N. The followng proper samplng rules, proposed n [6] for convex problems wth separable G, a nstances of rules satsfyng A6, and a used n our computatonal experments. Unform U samplng. All blocs get selected wth the same non zero probablty: P S =Pj S = E [ S ], j N. N Doubly Unform DU samplng. All sets S of equal cardnalty a generated wth equal probablty,.e., PS =PS, for all S, S N such that S = S. The densty functon s then PS = q S n S Nonoverlappng Unform NU samplng. It s a unform samplng assgnng postve probabltes only to sets formng a partton of N.LetS,...,S P be a partton of N, wth each S > 0, the densty functon of the NU samplng s:.
6 PS = P, f S { S,...,S } P 0 otherwse whch corsponds to P S=N/P, forall N. A specal case of the DU samplng that we found very effectve n our experments s the so called nce samplng. Nce Samplng NS. Gven an nteger 0 τ N, aτ-nce samplng s a DU samplng wth q τ =.e., each subset of τ blocs s chosen wth the same probablty. The NS allows us to control the dege of parallelsm of the algorthm by tunng the cardnalty τ of the random sets generated at each teraton, whch maes ths rule partcularly appealng n a mult-co envronment. Indeed, one can set τ equal to the number of avalable cos/processors, and assgn each bloc comng out from the gedy selecton f mplemented to a dedcated processor/co. As a fnal mar, note that the DU/NU rules contan as specal cases fully parallel and sequental updates, when at each teraton a sngle bloc s updated unformly at random, or all blocs a updated. Sequental samplng: It s a DU samplng wth q =,or a NU samplng wth P = N and S j = j, forj =,...,P. Fully parallel samplng: It s a DU samplng wth q N =, or a NU samplng wth P =and S = N. Other ntestng unform and nonunform practcal rules stll satsfyng A6 can be found n [6], [5]. In addton see [5] for extensve numercal sults comparng the dffent samplng schemes. Onthechoceofthestep-szeγ. An example of step-sze rule satsfyng Theom -v s: gven 0 <γ 0, let γ = γ θγ, =,..., 9 whe θ 0, s a gven constant. Numercal sults n Secton IV show the effectveness of 9 on specfc problems. We mar that t s possble to prove convergence of Algorthm also usng other step-sze rules, ncludng a standard Armjo-le lne-search procedu or a sutably small constant step-sze. Note that dffently from most of the schemes n the lteratu, the tunng of the step-sze does not qu the nowledge of the problem parameters e.g., the Lpschtz constants of F and G. IV. NUMERICAL RESULTS In ths secton we psent some plmnary experments provdng sold evdence of the vablty of our approach. Tests we card out on LASSO problems, one of the most studed convex nstances of Problem ; and two nonconvex problems, namely: a nonconvex modfcaton of LASSO, and dctonary learnng problems. A complete assessment of the numercal behavor of our method s beyond the scope of ths paper, snce t would qu very extensve experments, especally for nonconvex problems, whe comparsons of methods should tae nto account the exstence of multple statonary ponts wth possbly dffent objectve values and a strong nfluence of the startng ponts. Furthermo we mar that a proper comparson should be made only wth other random, parallel methods for nonconvex problems. However, Algorthm appears to be the frst of ths nd, together wth the very cent proposal n [39] whch s a specal case of Algorthm. Thefo, whle n the convex case we could compa to some well-establshed methods, n the nonconvex case we could only compa to [39]. In order to enlarge the pool of comparsons, and also to gve a better perspectve on our approach, we decde to nclude n the comparsons also determnstc methods. Of course, one should always eep n mnd that determnstc and random BCD methods have dffent purposes. Random BCD algorthms a the methods of choce when t s not possble to process all data at the same tme, ether because the data become avalable at dffent tmes for example n a networ wth propagaton delays or, smply, because the dmensons of the problem a so huge that the use of determnstc methods, qurng access to all varables at each teraton, s not practcally possble. In spte of all caveats above, the tests convncngly show that our framewor leads to practcal methods that explot well parallelsm and compa favorably to exstng schemes, both determnstc and random. It should be mad that our method, desgned to handle also nonconvex problems, seems to perform well also when compad to schemes specfcally desgned for convex problems only, and n some cases, t outperforms even the most effcent determnstc methods, a sult that s somewhat surprsng, gven the dffent amount of nformaton qud by determnstc methods. All codes have been wrtten n C++ and use the Message Passng Interface for parallel operatons. All algebra s performed by usng the Intel Math Kernel Lbrary MKL. The algorthms we tested on the General Compute Cluster of the Center for Computatonal Research at the SUNY Buffalo. In partcular for our experments we used a partton composed of 37 DELL 3x.3GHz Intel E7-4830 Xeon Processor nodes wth 5 GB of DDR4 man memory and QDR InfnBand 40Gb/s networ card. A. LASSO problems Lasso problems we descrbed n Secton II, he we port the sults obtaned on a set of randomly generated problems. Tunng of Algorthm : The most successful class of random and determnstc methods for LASSO problem a proxmal gradent-le schemes, based on a lnearzaton of F.Asa major departu from curnt schemes, he we propose to better explot the structu of F and use n Algorthm the followng best-sponse: gven a scalar partton of the varables.e., n =for all, let x x argmn x R { F x, x + τ x x + λ x }. Note that x x has a closed form expsson usng a softthsholdng operator [9]. The fe parameters of Algorthm a chosen as follows. The proxmal gans τ and the stepsze γ a tuned as n [0, Sec. VI.A]. The error bound functon s chosen as E x = x x x, and, for any alzaton S, the subsets Ŝ n S.3 of the algorthm a chosen as Ŝ = { S : E x σm }. We denote by c S the cardnalty of S normalzed to the overall
7 number of varables; n our experments, all sets S a generated usng the NU samplng, wth c S = c S,forall. We port sults for the followng values of σ and c S : c S =0.0, 0., 0., 0.5, 0.8; σ =0, whch leads to a fully parallel pu random scheme when at each teraton all varables n Ŝ a updated; and dffent postve values of σ rangng from 0.0 to 0.5, whch corsponds to updatng n a gedy manner only a subset of the varables n Ŝ. We termed Algorthm wth σ =0 Random FLEXble parallel Algorthm RFLEXA, wheas the other nstances wth σ>0 as Hybrd FLEXA HyFLEXA. We note that RFLEXA should be garded both as a partcular nstance of our scheme and as an effcent mplementaton of the method proposed n [39], wth many nontrval algorthmc choces made accordng to our expence n [0] n order to optmze ts behavor. As we mad befo, RFLEXA s the only other random, parallel BCD method for whch convergence to statonary ponts of nonconvex problems has been shown. From another perspectve, the comparson between HyFLEXA and RFLEXA also serves to gauge the mportance of performng a determnstc, gedy bloc selecton after the random one, ths beng one of the ey new deas of HyFLEXA. Algorthms n the lteratu: We compad our versons of HyFLEXA wth the most psentatve parallel random and determnstc algorthms proposed to solve LASSO Problems. Mo specfcally, we consder the followng schemes. PCDM & PCDM: These a proxmal gradent-le parallel randomzed BCD methods proposed n [6] for convex optmzaton problems. As commended by the authors, we use PCDM nstead of PCDM ndeed, our experments show that PCDM outperforms PCDM. We set the parameters β and ω of PCDM as n [6, Table 4], whch guarantees convergence of the algorthm n expected value. Hydra & Hydra : Hydra s a parallel and dstrbuted random gradent-le CDM, proposed n [53]; a closed form soluton of the scalar updates s avalable. Hydra [] s the accelerated verson of Hydra; ndeed, n all our experments, t outperformed Hydra; thefo, we wll port the sults only for Hydra. The fe parameter β s set to β =β cf. Eq. 5 n [53], wth σ gven by Eq. n [53] accordng to the authors, ths seems one of the best choces for β. FLEXA: Ths s the parallel determnstc scheme we proposed n [9], [0]. We use FLEXA as a benchmar of determnstc algorthms, snce t has been shown n [9], [0] that for selected test problems, t numercally outperforms curnt parallel frst-order accelerated gradent-le schemes, ncludng FISTA [9], SparRSA [], GRoc [], parallel BCD [8], and ADMM. The fe parameters of FLEXA, τ and γ, a tuned as n [0, Sec. VI.A], wheas the set S s chosen as S = { I: E x σm }. Other algorthms: We tested also other random algorthms, ncludng sequental random BCD-le methods and Shotgun [7]. However, snce they we not compettve, to not overcrowd the fgus, we do not port sults for these algorthms. In all the experments, the data matrx A =[A A P ] of the LASSO problem s stod n a column-bloc manner, unformly across the P parallel processes. Thus the computaton of each product Ax qud to evaluate F and the norm x that s G s dvded nto the parallel jobs of computng A x and x, followed by a duce operaton. Also, for all the algorthms, the ntal pont was set to the zero vector. Numercal Tests: We generated synthetc LASSO problems usng the random generaton technque proposed by Nesterov [7], whch we properly modfed followng [6] to generate nstances of the problem wth dffent levels of sparsty of the soluton as well as densty of the data matrx A R m n ;we ntroduce the followng two control parameters: s A = average % of nonzeros n each column of A out of m; and s sol =% of nonzeros n the soluton out of n. We tested the algorthms on two groups of LASSO problems, A R 4 5 and A R 5 6, and several deges of densty of A and sparsty of the soluton, namely s sol =0.%, %, 5%, 5%, 30%, and s A = %, 30%, 50%, 70%, 90%. Because of the space lmtaton, we port next only the most psentatve sults; we fer to [5] for mo detals and experments. Results for the LASSO nstance wth 0,000 varables a ported n Fg. and. Fg. shows the behavor of HyFLEXA as a functon of the desgn parameters σ and c S, for dffent values of the soluton sparsty s sol, wheas n Fg. we compa the proposed RFLEXA and HyFLEXA wth FLEXA, PCDM, and Hydra, for dffent values of s sol and s A Fnally, n Fg. 3 we consder larger problems wth M varables. In all the fgus, we plot the latve error x V x V /V versus the CPU tme, whe V s the optmal value of the objectve functon V n our experments V s nown. All the curves a averaged over ten ndependent random alzatons. Note that the CPU tme ncludes communcaton tmes and the ntal tme needed by the methods to perform all p-teratons computatons ths explans why the curves assocated wth Hydra start after the others; n fact Hydra qus some nontrval computatons to estmates β. 0 3 4 5 σ =0., ssol =0.% σ =0.5, ssol =0.% σ =0., ssol =% σ =0.5, ssol =% σ =0., ssol =5% σ =0.5, ssol =5% 6 0 500 00 500 000 500 3000 3500 4000 tme sec a 0 3 4 5 cs =0.5, ssol =0.% cs =0., ssol =0.% cs =0., ssol =0.% cs =0.5, ssol =% cs =0., ssol =% cs =0., ssol =% cs =0.5, ssol =5% cs =0., ssol =5% cs =0., ssol =5% 6 0 500 00 500 000 500 3000 tme sec b Fg. : HyFLEXA for dffent values of c S and σ: Relatve error vs. tme; s sol =0.%, %, 5%, s A = 70%, 0.000 varables, NU samplng, 8 cos; a c S =0.5, and σ =0., 0.5 -bσ =0.5, and c S =0., 0., 0.5. HyFLEXA: On the choce of c S,σ, and the samplng strategy. All the experments ncludng those not ported he because of lac of space show the followng tnd n the behavor of HyFLEXA as a functon of c S,σ. For low densty problems low s sol and s A, large pars c S,σ a pferable, whch corsponds to updatng at each teraton only some varables by performng a heavy gedy search over a szable amount of varables. Ths s n agement wth [0] cf. Remar 5: by the gedy selecton, Algorthm s able to dentfy those varables that wll be zero at the
8 HyFLEXAσ =0.5, c S =0.5 HyFLEXAσ =0., c S =0.5 HyFLEXAσ =0.0, c S =0. RFLEXA c S =0.5 RFLEXA c S =0. Hydra c0 S =0.5 500 Hydra c S =0. PCDMc S =0.5 PCDMc S =0. FLEXAσ =0.5 FLEXAσ =0 0 3 4 0 3 4 0 0 0 000 4000 6000 8000 000 tme sec a 0 4 6 8 tme sec x 4 b Fg. 3: LASSO wth M varables, s A = %, 6 cos; Relatve error vs. tme for: as sol =%-bs sol =5%. The legend s as n Fg.. 3 4 5 6 0 50 0 50 00 50 300 350 tme sec a 0 3 4 5 6 0 500 00 500 000 tme sec b 0 3 4 5 6 0 00 400 600 800 00 tme sec c 3 4 5 6 0 500 00 500 000 500 tme sec a 0 3 4 5 6 0 500 00 500 000 500 3000 3500 4000 4500 tme sec b 0 3 4 5 6 0 00 000 3000 4000 5000 tme sec c Fg. : LASSO wth 0.000 varables, 8 cos; Relatve error vs. tme for: a s A = 30% and s sol =0.% - a s A = 30% and s sol =5%- b s A = 70% and s sol =0.% - b s A = 70% and s sol =5%-c s A = 90% and s sol =0.% -cs A = 90% and s sol =5%. a soluton; thefo updatng only varables that we have strong ason to beleve wll not be zero at a soluton s a better strategy than updatng them all, especally f the solutons a very sparse. Note that ths behavor can be obtaned usng ether large or small c S,σ. However, n the case of low dense problems, the former strategy outperforms the latter. We observed that ths s manly due to the fact that when s A s small, estmatng ˆx computng the products A T A s computatonally affordable, and thus performng a gedy search over mo varables enhances the practcal convergence. When the sparsty of the soluton decases and/or the densty of A ncases large s A and/or s sol, one can see from the fgus that smaller values of c S,σ a mo effectve than larger ones, whch corsponds to usng a less aggssve gedy selecton whle searchng over a smaller pool of varables. In fact, when A s dense, computng all ˆx mght be prohbtve and thus nullfy the potental benefts of a gedy procedu. For nstance, t follows from Fg. -3 that, as the densty of the soluton s sol ncases the pferable choce for c S,σ progssvely moves from 0.5, 0.5 to 0., 0.0, wth both c S and σ decasng. Intestng, a tunng that wors qute well n practce for all the classes of problems we smulated dffent denstes of A, soluton sparsty, number of cos, etc. s c S,σ=0.5, 0., whch seems to st a good balance between not updatng varables that a probably zero at the optmum and nevertheless update a szable amount of varables when needed n order to enhance convergence.. As a fnal mar, we port that, accordng to our experments, the most effectve samplng rule among U, DU, NU, and NS s the NU whch s actually the one the fgus fers to; NS becomes compettve only when the solutons a very sparse, see [5] for a detaled comparson of the dffent rules. Comparson of the algorthms. For low dense matrces A and very sparse solutons, FLEXA σ = 0.5 s faster than ts random counterparts RFLEXA and HyFLEXA as well as ts fully parallel verson, FLEXA σ = 0 [see Fg a, b c and Fg. 3a]. Nevertheless, HyFLEXA [wth c S,σ=0.5, 0.5] mans close. However, as the densty of A and/or the sze of the problem ncase, computng all the products [A T A] qud to estmate ˆx becomes too costly; ths s when a random selecton of the varables becomes benefcal: ndeed, RFLEXA and HyFLEXA consstently outperform FLEXA [see Fg a, b c and Fg. 3b]. Among the random algorthms, Hydra s capable to approach latvely fast low accuracy, especally when the soluton s not too sparse, but has dffcultes n achng hgh accuracy. RFLEXA and HyFLEXA a always much faster than curnt state-of-the-art schemes PCDM and Hydra, especally f hgh accuracy of the solutons s qud. Between RFLEXA and HyFLEXA wth the same c S, the latter consstently outperforms the former about up to fve tme faster, wth a gap that s mo sgnfcant when solutons a sparse. Ths provdes a sold evdence of the effectveness of the proposed hybrd random/gedy selecton method.
9 s.t. b x b, F x =,..., n, whe c s a postve constant chosen so that F x s no longer convex. We consded two nstances of, namely: A R90,000 0,000 s generated usng Nesterov s model as n LASSO problems n Sec. IV-A wth sa = 0% and ssol = 0.%, b =, c = 0.5, and c = 5000; and the same settng as n but wth ssol = 5% and c =. Note that the Hessan of F has the same egenvalues of the Hessan of Ax b n the orgnal LASSO problem, but translated to the left by c. In partcular, F n our problem has many negatve egenvalues; thefo, the objectve functon V n s madly nonconvex. Snce V s now unbounded from below by constructon, we added n box constrants. Comparson of the algorthms: We compad HyFLEXA wth FISTA, SpaRSA, RFLEXA, and determnstc FLEXA. The tunng of FLEXA, RFLEXA, and HyFLEXA s the same as the one used for LASSO problems cf. Sec. IV-A, but addng the extra condton τ > c, for all, so that the sultng one dmensonal subproblems 6 a convex and can be solved n closed form. Note that as far as we a awa of the only parallel random method wth convergence guarantees on nonconvex problems s RFLEXA. We thefo port also sults for SpaRSA and FLEXA for comparson sae. FISTA doesn t have any convergence guarantee for nonconvex problems but nevertheless we also added FISTA to the comparson because of ts benchmar status n the convex case. On our tests, all the algorthms always converge to the same statonary pont we checed statonarty usng the same statonarty measu used n [0]. Computed statonary solutons n the settngs and above have approxmately 0.% and.5% nonzero varables, spectvely. Results of our smulatons a ported n Fg. 4, whe we plot the objectve functon value versus the CPU tme; all the curves a obtaned usng 50 cos and averaged over four ndependent random alzaton. The CPU tme ncludes communcaton tmes and the ntal tme needed by the methods to perform all p-computatons. The tests ndcate that HyFLEXA mproves drastcally on RFLEXA also on these nonconvex problems. Furthermo HyFLEXA performs well also when compad to determnstc methods FISTA, SpaRSA, and FLEXA that use full nformaton on all varables at each teraton; HyFLEXA has a behavor very smlar to that of FLEXA, the best of these the determnstc methods, whch s very promsng. C. Dctonary learnng problems The second nonconvex nstance of Problem we consder s the dctonary learnng problems, as ntroduced n Example # 5 n Sec. II. Wth fence to the notaton then, the data matrx M Rm t s the CBCL Center for Bologcal 5 6 4 5 4 3 3 V x The frst set of nonconvex problems we test s a modfcaton of the LASSO problem, as frst proposed n [0]. Mo pcsely, we consder the followng nonconvex quadratc problems mn V x Ax b c x + c x x Gx SPARSA FISTA FLEXAσ = 0 FLEXAσ = 0.5 HyFLEXAσ = 0.5, cs = 0.5 HyFLEXAσ = 0., cs = 0.5 HyFLEXAσ = 0.0, cs = 0. RFLEXAcS = 0. RFLEXAcS = 0.5 V x B. Nonconvex quadratc problems tmesec a 3 4 tmesec b 3 Fg. 4: Nonconvex Quadratc Problem, 50 cos; objectve functon value vs. tme for: a c = 0.5, b =, c = 5000 - b c = 0., b =, c = 5000. and Computatonal Learnng at MIT Face Database # [54], whch s a set of 9x9 Grayscale PGM format face mages. Every mage s staced up column-wse to form vectors of 36 elements, and each of these ndvdual vectors, consttutes a column of M. We set α = α, =,..., s, m = 36, s = 4m, t = 4, for a total number of.999.940 varables. We consded two nstances of the problem, corspondng to the settngs c = 0.5, α =, and c =, α =. Buldng on the structu of F X, Y and GX, Y, we propose to use the followng best-sponse for FLEXA, RFLEXA, and HyFLEXA: X X, Y, Y X, Y, wth s X and Y parttoned as X X, Y x X, Y = and Y X, Y Y pq X, Y s,t p,q=, whe x s the th column of X, gven by τ argmn x F X, Y T x x + x x x : x α and Y pq X, Y, denotng the pq-th entry of Y, s gven by: τy argmn F X, Ypq, Y pq + c Ypq + pq Ypq Ypq Ypq R Both subproblems above have a closed form soluton that can be obtaned usng smple projectons and the soft-thsholdng operator; we omt the detals because of lac of space. Comparson of the algorthms. In addton to FISTA, SpaRSA, RFLEXA, and FLEXA, we compad HyFLEXA also wth the algorthm proposed n [55] cf. Algorthm specfcally for the dctonary learnng problem. In all the algorthms, the dctonary matrx X and sparse psentaton matrx Y a ntalzed wth overcomplete DCT dctonary [47] and zero matrx, spectvely. Results of our smulatons a ported n Fg.5, whe we plot the objectve value versus the CPU tme. On our tests, dffent algorthms seem to converge to dffent ponts that have been checed to be all statonary. It s ntestng that HyFLEXA aches a lower objectve value. It should also be noted that lower objectve values corspond to dffent sparsty-deges. For example, n the frst problem, RFLEXA cs = 0., SPARSA, and HyFLEXA [σ, cs = 0.5, 0.5] ach a soluton wth 33, 3648, and 4838 nonzero varables, spectvely, whch a n the same range of sparsty, but HyFLEXA has a better objectve value. Now consder the second problem, whe HyFLEXA [σ, cs = 0., 0.] and HyFLEXA [σ, cs = 0.5, 0.5] ach two dffent statonary pont wth 4 and 064 nonzero varables, spectvely, whle all the other algorthms converges to the same pont wth 4
V x 50 40 30 0 00 90 80 70 60 50 40 FISTA SPARSA Algorthm n [55] HyFLEXAσ =0.5,cS =0.5 HyFLEXAσ =0.,cS =0. RFLEXAcS =0.5 RFLEXAcS =0. FLEXAσ =0.5 FLEXAσ =0 0 00 000 3000 4000 5000 6000 7000 tmesec a V x 500 450 400 350 300 50 FISTA SPARSA Algorthm n [55] HyFLEXAσ =0.5,cS =0.5 HyFLEXAσ =0.,cS =0. RFLEXAcS =0.5 RFLEXAcS =0. FLEXAσ =0.5 FLEXAσ =0 00 0 50 0 50 00 50 300 350 tmesec b Fg. 5: Dctonary Learnng Problem, 6 cos; objectve value vs. tme for: a c =0.5,β =-bc =,β =. In the subplot b, some curves a not properly dsplayed because of overlaps, essentally they all behave le RFLEXA. 766 nonzero varables. In concluson, HyFLEXA seems very compettve, outperformng RFLEXA and comparng very well also to determnstc methods. V. CONCLUSIONS We proposed a hghly parallelzable hybrd random/determnstc decomposton algorthm for the mnmzaton of the sum of a possbly noncovex dffentable functon F and a possbly nonsmooth nonseparable convex functon G. The proposed framewor s the frst scheme enjoyng all the followng featus: t allows for pu gedy, pu random, or mxed random/gedy updates of the varables, all convergng under the same unfed set of convergence condtons; t can tacle va parallel updates also nonseparable convex functons G; t can deal wth nonconvex nonseparable F ; v t s parallel; v t can ncorporate both frst-order or hgher-order nformaton; and v t can use nexact solutons. Our plmnary experments on LASSO problems and few selected nonconvex ones showed a very promsng behavor wth spect to state-ofthe-art random and determnstc algorthms. Of course, a mo complete assessment, especally n the nonconvex case, qu much mo experments and s the subject of curnt search. VI. ACKNOWLEDGMENTS The authors a very grateful to Prof. Peter Rchtàr for hs nvaluable comments; we also than Dr. Martn Taáč and Prof. Peter Rchtàr for provdng the C++ code of PCDM and Hydra that we modfed n order to use the MPI lbrary. The wor of Daneshmand and Scutar was supported by the USA NSF Grants CMS 877 and CAREER Award No. 54739. The wor of Facchne was supported by the MIUR project PLATINO Grant Agement n. PON0_007. The wor of Kungurtsev was supported by the European Socal Fund under the Grant CZ..07/.3.00/30.0034. APPENDIX We frst ntroduce some plmnary sults nstrumental to prove both Theom and Theom 3. Gven Ŝ N and x x N, for notatonal smplcty, we wll denote by x or nterchangeably Ŝ x Ŝ the vector whose component s equal to x f Ŝ, and zero otherwse. Wth a slght abuse of notaton we wll also use x, y to denote the orded tuple y,...,y, x, y +,...,y N ; smlarly x, x j, y,j, wth < j stands for y,...,y, x, y +,...,y j, x j, y j+,...,y N. A. On the random samplng and ts propertes We ntroduce some propertes assocated wth the random samplng rules S satsfyng assumpton A6. A ey role n our proofs s played by the followng random set: let {x } be the sequence generated by Algorthm, and mx = argmax x x x, {,...,N} defne the set K mx as { K mx N + : mx S }. 3 The ey propertes of ths set a summarzed n the followng two lemmata. Lemma 5 Infnte cardnalty. Gven the set K mx as n 3, t holds that P K mx = =, whe K mx denotes the cardnalty of K mx. Proof. Suppose that the statement of the lemma s not true. Then, wth postve probablty, the must exst some such that for, mx / S. But we can wrte { P mx / S } = Π P mx / S mx / S,..., mx / S lm p =0. whe the nequalty follows by A6 and the ndependence of the events. But ths obvously gves a contradcton and concludes the proof. Lemma 6. Let {γ } be a sequence satsfyng assumptons - of Theom. Then t holds that P γ < =0. 4 K mx Proof. It holds that, P γ < P γ <n K mx n N K mx P γ <n. n N K mx To prove the lemma, t s then suffcent to show that P K mx γ <n =0, as proved next. Defne, wth N +, as the smallest ndex such that γ j n. 5 j=0 Note that snce =0 γ =+, ˆK s well-defned for all and lm ˆK =+. Foranyn N, t holds: m P γ <n = P γ <n K mx m N K mx = lm m P m γ <n = lm K mx P γ <n K mx
= lm P γ <n, K mx [0, ] < K mx + P γ <n, K mx [0, ˆK ] 6 K mx lm P K mx [0, ] < ˆK } {{ } term I + P γ <n, K mx [0, ˆK ]. K mx } {{ } term II Let us bound next term I and term II separately. Term I:Wehave P K mx [0, ] < ˆK a = P X < P X p =0 =0 > =0 ˆK b =0 p p ˆK =0 p ˆK = ˆK p =0 p c 0 ˆK p 7 whe: a: X 0,...,X ˆK a ndependent Bernoull random varables, wth parameter p P K mx. Note that, due to A6, p p, forall; b: t follows from Chebyshev s nequalty; c: we used the bounds =0 p p ˆK =0 p p. and Term II: Let us wrte term II as ˆK P K mx γ K mx [0, ] < n K mx [0, ] K mx [0, ] ˆK P K mx [0, ] ˆK ˆK a P K mx γ K mx [0, ] < n K mx [0, ] ˆK P K mx [0, ] ˆK ˆK b P K mx γ K mx [0, ] < ˆK =0 γ c ˆK =0 P γ ˆK X =0 < γ ˆK =0 P γ ˆK X ˆK ˆK =0 γ p =0 γ p ˆK =0 γ > ˆK =0 P γ ˆK X =0 γ p > ˆK d =0 γ p p p ˆK = =0 γ p ˆK =0 γ 0, 8 p ˆK =0 γ ˆK =0 γ p ˆK =0 γ whe: a: we used K mx [0, ] ˆK, by the condtonng event; b: t follows from 5, and PA B PA; c: X 0,...,X ˆK a ndependent Bernoull random varables, wth parameter p. The bound s due to K mx [0, ] ; d: t follows from the Chebyshev s nequalty. The desd sult 4 follows adly combnng 6, 7, and 8. B. On the best-sponse map x and ts propertes We ntroduce now some ey propertes of the mappng x defned n 6. We also derve some bounds nvolvng x along wth the sequence {x } generated by Algorthm. Lemma 7 [0]. Consder Problem under A-A5, and F- F3. Suppose that Gx s separable,.e., Gx = G x, wth each G convex on X. Then the mappng X y xy s Lpschtz contnuous on X,.e., the exsts a postve constant ˆL such that xy xz ˆL y z, y, z X. 9 Lemma 8. Let {x } be the sequence generated by Algorthm. For every K mx and Ŝ generated as n step S.3 of Algorthm, the followng holds: the exsts a postve constant c such that, ˆxŜx x Ŝ c ˆxx x. 0 Proof. The followng chan of nequaltes holds: max s ˆxŜx x a s N Ŝ ˆx ρ ρ x x ρ b E ρ x c ρe mx x d ρ mn s max ˆx x x N N ρ mn N s ˆxx x N
whe: n a ρ sanyndexnŝ such that E ρ x ρ max S E x. Note that by defnton of Ŝ cf. step S.3 of Algorthm, such a ndex always exsts; b s due to 8; c follows from the defnton of ρ,andmax S E x = E mx x, the latter due to mx S Ŝ call that K mx ; and d follows from 8. Lemma 9. Let {x } be the sequence generated by Algorthm. For every N +, and Ŝ generated as n step S.3, the followng holds: x F x T xx x q xx x Ŝ Ŝ Ŝ + [ Gx G x x, x ]. Ŝ Proof. Optmalty of x x for the subproblem mples x F x x ; x +ξ x x, x T y x x 0, for all y X, and some ξ x x, x x G x x, x. Thefo, 0 x F x x ; x x T x x + ξ x x, x x T x x. Let us lower bound next the two terms on the RHS of. The unform strong monotoncty of F ; x cf. F, T x F x x ; x x F x ; x x x x q x x x, 3 along wth the gradent consstency condton cf. F x F x ; x = x F x mply x F x x ; x x T x x T = x F x x ; x x F x ; x x x x + x F x ; x x T x x x F x x T x x + q x x x. 4 To bound the second term on the RHS of, let us nvoe the convexty of G, x : Gx, x G x x, x ξ x x, x T x x x, whch yelds ξ x x, x x T x x G x x, x 5 Gx. The desd sult s adly obtaned by combnng wth 4 and 5, and summng over Ŝ. Lemma. Let {x } be the sequence generated by Algorthm, and {γ } 0. For every N + suffcently large, and Ŝ generated as n step S.3, the followng holds: Gx + Gx +γ L G ε 6 Ŝ + γ [ G x x, x Gx ]. Ŝ Proof. Gven 0 and Ŝ,defne x x N, wth { x x + γ x x x, f Ŝ otherwse. x By the convexty and Lpschtz contnuty of G, t follows Gx + = Gx + Gx + G x + G x Gx Gx +γ L G Ŝ ε + G x Gx, 7 whe L G s a global Lpschtz constant of G. We bound next the last term on the RHS of 7. Let γ = γ N,for large enough so that 0 < γ <. Defne ˇx ˇx N, wth ˇx = x f / Ŝ,and ˇx γ x x + γ x 8 otherwse. Usng the defnton of x t s not dffcult to see that x = N N x + N ˇx. 9 Usng 9 and nvong the convexty of G, the followng curson holds for suffcently large : G x =G N ˇx, x + N x, ˇx N + N x = G N ˇx, x N + N x, N ˇx + N N x N G ˇx, x + N N G x, N ˇx + N N x = N G ˇx, x + N N G N x, ˇx + N N x = N G ˇx, x + N N G N ˇx, x + N = N G ˇx, x + N N G x, x, ˇx, ˇx, x N + N 3 N x + N N x, x, N ˇx, + N 3 N x, N G ˇx, x + N G ˇx, x + N N G x, x, N ˇx, + N 3 N x,... Gˇx, x N. N 30 Usng 30, the last term on the RHS of 7 can be upper bounded for suffcently large as G x Gx N [ Gˇx, x Gx ] N
3 = N [ Gˇx, x Gx ] 3 Ŝ a [ γ G x x, x N + γ Gx Gx ] Ŝ = γ [ G x x, x Gx ], Ŝ whe a s due to the convexty of G, x and the defnton of ˇx [cf. 8]. The desd nequalty 6 follows adly by combnng 7 wth 3. Lemma. [56, Lemma 3.4, p.] Let {X }, {Y }, and {Z } be the sequences of numbers such that Y 0 for all. Suppose that X + X Y + Z, =0,,... and =0 Z <. Then ether X or else {X } converges to a fnte value and =0 Y <. C. Proof of Theom For any gven 0, the Descent Lemma [49] yelds: wth ẑ ẑ N and z z N defned n step S.4 of Algorthm, F x + F x + γ x F x T ẑ x γ L F + ẑ x. 3 We bound next the second and thrd terms on the RHS of 3. Denotng by Ŝ the complement of Ŝ,wehave, x F x T ẑ x = x F x T ẑ xx + xx x a = x F x T Ŝ z xx Ŝ + x F x T Ŝ x xx Ŝ + x F x T Ŝ xx x Ŝ + x F x T Ŝ xx x Ŝ = x F x T Ŝ z xx Ŝ + x F x T b Ŝ ε Ŝ xx x Ŝ 33 x F x + x F x T Ŝ xx x Ŝ c Ŝ ε x F x q xx x Ŝ + [ Gx G x x, x ] Ŝ whe n a we used the defnton of ẑ and of the set Ŝ ;n b we used z x x ε ; and c follows from cf. Lemma 9. The thrd term on the RHS of 3 can be bounded as ẑ x z ˆxx Ŝ + ˆxx x Ŝ = + z Ŝ x x + ˆxx x Ŝ ε + ˆxx x Ŝ, Ŝ 34 whe the frst nequalty follows from the defnton of z and ẑ, and n the last nequalty we used z x x ε. Now, we combne the above sults to get the descent property of V along {x }. For suffcently large N +, t holds V x + =F x + +Gx + V x γ q γ L F xx x Ŝ + T, 35 whe the nequalty follows from, 3, 33, and 34, and T s gven by T γ ε LG + x F x + γ L F ε. N N By assumpton v n Theom, t s not dffcult to show that =0 T <. Snceγ 0, t follows from 35 that the exst some postve constant β and a suffcently large, say, such that V x + V x γ β xx x Ŝ + T, 36 for all. Invong Lemma whle usng =0 T < and the coercvty of V, we deduce from 36 that t γ xx x Ŝ < +, 37 lm t and thus also lm t = t K mx Lemma 6 together wth 38 mply γ xx x Ŝ < +. 38 lm nf xx x K mx Ŝ =0, w.p., whchbylemma8mples lm nf xx x =0, w.p.. 39 Thefo, the lmt pont of the nfmum sequence s a fxed pont of x w.p.. D. Proof of Theom 3 The proof follows smlar deas as the one of Theom n our cent wor [0], but wth the nontrval complcaton of dealng wth randomness n the bloc selecton.
4 Gven 39, we show next that, under the separablty assumpton on G, t holds that lm xx x =0 w.p.. For notatonal smplcty, let us defne xx xx x. Note frst that for any fnte but arbtrary sequence {, +,..., }, t holds that E and thus [ K mx t= P γ t ] = K mx t= t= γ t [Pt K mx ] p γ t >β γ t > 0, t= t= for all Kand 0 <β<p. Ths mples that, w.p., the exsts an nfnte sequence of ndexes, say K K, such that K mx t= γ t, γ t >β γ t, K. 40 t= Suppose now, by contradcton, that lm sup xx > 0 wth a postve probablty. Then we can fnd a alzaton such that at the same tme 40 holds for some K and lm sup xx > 0. In the st of the proof we focus on ths alzaton and get a contradcton, thus provng that lm sup xx =0 w.p.. If lm sup xx > 0 then the exsts a δ>0 such that xx > δ for nfntely many and also xx < δ for nfntely many. Thefo, one can always fnd an nfnte set of ndexes, say K, havng the followng propertes: for any K, the exsts an nteger >such that xx <δ, xx > δ 4 δ xx j δ < j <. 4 Proceedng now as n the proof of Theom n [0], we have: for K, a δ < xx xx xx xx + x x 43 b + ˆL x x 44 c + ˆL γ t xx t S t + z t xx t S t t= d + ˆLδ + ε max t= γ t, 45 whe a follows from 4; b s due to Lemma 7; c comes from the trangle nequalty, the updatng rule of the algorthm and the defnton of ẑ ; and n d we used 4, 4, and z t xx t N εt,wheεmax max N ε <. It follows from 45 that lm nf γ t δ K + ˆLδ > 0. 46 + ε t= max We show next that 46 s n contradcton wth the convergence of {V x }. To do that, we plmnary prove that, for suffcently large K,tmustbe xx δ/. Proceedng as n 45, we have: for any gven K, xx + xx + ˆL x + x + ˆLγ xx + ε max. It turns out that for suffcently large K so that + ˆLγ <δ/δ +ε max,tmustbe xx δ/; 47 otherwse the condton xx + δ would be volated [cf. 4]. Heafter we assume wthout loss of generalty that 47 holds for all K n fact, one can always strct {x } K to a proper subsequence. We can show now that 46 s n contradcton wth the convergence of {V x }. Usng 36 possbly over a subsequence, we have: for suffcently large K, V x V x β a V x β K mx t= K mx t= γ t xx t Ŝ + t γ t xx t + b V x β 3 γ t + T t, t= t= t= T t K mx t= 48 whe a follows from Lemma 8 and β = c β > 0; and b s due to 47 and 40, wth β 3 = ββ δ /4. Snce {V x } converges and =0 T <, t holds t= that lm K γt =0, contradctng 46. Thefo lm xx x =0w.p.. Snce {x } s bounded by the coercvty of V and the convergence of {V x }, thas at least one lmt pont x X. By the contnuty of x cf. Lemma 7 t holds that x x = x. By Proposton x s also a statonary soluton of Problem. REFERENCES [] A. Daneshmand, F. Facchne, V. Kungurtsev, and G. Scutar, Flexble selectve parallel algorthms for bg data optmzaton, n Proc. of the Forty-Eghth Aslomar Confence on Sgnals, Systems, and Computers, Nov. -5 04. [] R. Tbshran, Regsson shrnage and selecton va the lasso, Journal of the Royal Statstcal Socety. Ses B Methodologcal, pp. 67 88, 996. [3] Z. Qn, K. Schenberg, and D. Goldfarb, Effcent bloc-coordnate descent algorthms for the group lasso, Mathematcal Programmng Computaton, vol. 5, pp. 43 69, June 03. [4] G.-X. Yuan, K.-W. Chang, C.-J. Hseh, and C.-J. Ln, A comparson of optmzaton methods and softwa for large-scale l-gularzed lnear classfcaton, The Journal of Machne Learnng Research, vol. 9999, pp. 383 334, 0. [5] K. Fountoulas and J. Gondzo, A Second-Order Method for Strongly Convex L-Regularzaton Problems, arxv pprnt arxv:306.5386, 03. [6] I. Necoara and D. Clpc, Effcent parallel coordnate descent algorthm for convex optmzaton problems wth separable constrants: applcaton to dstrbuted MPC, Journal of Process Control, vol. 3, no. 3, pp. 43 53, March 03. [7] Y. Nesterov, Gradent methods for mnmzng composte functons, Mathematcal Programmng, vol. 40, pp. 5 6, August 03. T t
5 [8] P. Tseng and S. Yun, A coordnate gradent descent method for nonsmooth separable mnmzaton, Mathematcal Programmng, vol. 7, no. -, pp. 387 43, March 009. [9] A. Bec and M. Teboulle, A fast teratve shrnage-thsholdng algorthm for lnear nverse problems, SIAM Journal on Imagng Scences, vol., no., pp. 83 0, Jan. 009. [] S. J. Wrght, R. D. Nowa, and M. A. Fguedo, Sparse constructon by separable approxmaton, IEEE Trans. on Sgnal Processng, vol. 57, no. 7, pp. 479 493, July 009. [] Z. Peng, M. Yan, and W. Yn, Parallel and dstrbuted sparse optmzaton, n Sgnals, Systems and Computers, 03 Aslomar Confence on. IEEE, 03, pp. 659 646. [] K. Slavas and G. B. Gannas, Onlne dctonary learnng from bg data usng accelerated stochastc approxmaton algorthms, n Proc. of the IEEE 04 Internatonal Confence on Acoustcs, Speech, and Sgnal Processng ICASSP 04, Flonce, Italy, May 4-9, 04. [3] K. Slavas, G. B. Gannas, and G. Mateos, Modelng and optmzaton for bg data analytcs, IEEE Sgnal Process. Mag., vol. 3, no. 5, pp. 8 3, Sept. 04. [4] M. De Sants, S. Lucd, and F. Rnald, A fast actve set bloc coordnate descent algorthm for l -gularzed least squas, eprnt arxv:403.738, March 04. [5] S. Sra, S. Nowozn, and S. J. Wrght, Eds., Optmzaton for Machne Learnng, ser. Neural Informaton Processng. Cambrdge, Massachusetts: The MIT Pss, Sept. 0. [6] F. Bach, R. Jenatton, J. Maral, and G. Obozns, Optmzaton wth Sparsty-nducng Penaltes. Foundatons and Tnds R n Machne Learnng, Now Publshers Inc, Dec. 0. [7] J. K. Bradley, A. Kyrola, D. Bcson, and C. Guestrn, Parallel coordnate descent for l-gularzed loss mnmzaton, n Proc. of the 8th Internatonal Confence on Machne Learnng, Bellevue, WA, USA, June 8 July, 0. [8] M. Patrsson, Cost approxmaton: a unfed framewor of descent algorthms for nonlnear programs, SIAM Journal on Optmzaton, vol. 8, no., pp. 56 58, 998. [9] F. Facchne, S. Sagratella, and G. Scutar, Flexble parallel algorthms for bg data optmzaton, n Proc. of the IEEE 04 Internatonal Confence on Acoustcs, Speech, and Sgnal Processng ICASSP 04, Flonce, Italy, May 4-9, 04. [0], Flexble parallel algorthms for bg data optmzaton, IEEE Trans. on Sgnal Processng, to appear. 04. [Onlne]. Avalable: http://arxv.org/abs/40.55v5 [] O. Fercoq, Z. Qu, P. Rchtár, and M. Taáč, Fast dstrbuted coordnate descent for non-strongly convex losses, arxv pprnt arxv:405.5300, 04. [] O. Fercoq and P. Rchtár, Accelerated, parallel and proxmal coordnate descent, arxv pprnt arxv:3.5799, 03. [3] Z. Lu and L. Xao, Randomzed Bloc Coordnate Non-Monotone Gradent Method for a Class of Nonlnear Programmng, arxv pprnt arxv:306.598v, 03. [4] I. Necoara and D. Clpc, Dstrbuted random coordnate descent method for composte mnmzaton, Techncal Report, pp. 4, Nov. 03. [Onlne]. Avalable: http://arxv-web.arxv.org/abs/3.530 [5] Y. Nesterov, Effcency of coordnate descent methods on huge-scale optmzaton problems, SIAM Journal on Optmzaton, vol., no., pp. 34 36, 0. [6] P. Rchtár and M. Taáč, Parallel coordnate descent methods for bg data optmzaton, arxv pprnt arxv:.0873, 0. [7] S. Shalev-Shwartz and A. Tewar, Stochastc methods for l- gularzed loss mnmzaton, The Journal of Machne Learnng Research, pp. 865 89, 0. [8] Z. Lu and L. Xao, On the complexty analyss of randomzed bloccoordnate descent methods, arxv pprnt arxv:305.473, 03. [9] I. Necoara and A. Patrascu, A random coordnate descent algorthm for optmzaton problems wth composte objectve functon and lnear coupled constrants, Computatonal Optmzaton and Applcatons, vol. 57, no., pp. 307 337, 04. [30] A. Patrascu and I.Necoara, Effcent random coordnate descent algorthms for large-scale structud nonconvex optmzaton, J. of Global Optmzaton, pp. 3, Feb. 04. [3] P. Rchtár and M. Taáč, Iteraton complexty of randomzed bloccoordnate descent methods for mnmzng a composte functon, Mathematcal Programmng, vol. 44, no. -, pp. 38, 04. [3] I. Dassos, K. Fountoulas, and J. Gondzo, A second-order method for compssed sensng problems wth cohent and dundant dctonas, arxv pprnt arxv:405.446, 04. [33] G.-X. Yuan, C.-H. Ho, and C.-J. Ln, An mproved glmnet for l- gularzed logstc gsson, The Journal of Machne Learnng Research, vol. 3, no., pp. 999 030, 0. [34] G. Scutar, F. Facchne, P. Song, D. Palomar, and J.-S. Pang, Decomposton by Partal lnearzaton: Parallel optmzaton of mult-agent systems, IEEE Trans. Sgnal Process., vol. 6, pp. 64 656, Feb. 04. [35] C. Scherr, A. Tewar, M. Halappanavar, and D. Hagln, Featu clusterng for acceleratng parallel coordnate descent, n Advances n Neural Informaton Processng Systems NIPS0. Curran Assocates, Inc., 0, pp. 8 36. [36] A. Auslender, Optmsaton: méthodes numérques. Masson, 976. [37] P. Tseng, Convergence of a bloc coordnate descent method for nondffentable mnmzaton, Journal of optmzaton theory and applcatons, vol. 9, no. 3, pp. 475 494, 00. [38] M. Razavyayn, M. Hong, and Z.-Q. Luo, A unfed convergence analyss of bloc successve mnmzaton methods for nonsmooth optmzaton, SIAM J. on Opt., vol. 3, no., pp. 6 53, 03. [39] M. Razavyayn, M. Hong, Z.-Q. Luo, and J.-S. Pang, Parallel successve convex approxmaton for nonsmooth nonconvex optmzaton, n Proc. of the Annual Confence on Neural Informaton Processng Systems NIPS, Montal, Quebec, CA, to appear, Dec. 04. [40] M. Hong, M. Razavyayn, and Z.-Q. Luo, Convergence analyss of alternatng dcton method of multplers for a famly of nonconvex problems, arxv pprnt, arxv:4.390, Oct. 04. [4] J. T. Goodman, Exponental prors for maxmum entropy models, Mar. 4 008, us Patent 7,340,376. [4] K.-W. Chang, C.-J. Hseh, and C.-J. Ln, Coordnate descent method for large-scale l-loss lnear support vector machnes, The Journal of Machne Learnng Research, vol. 9, pp. 369 398, 008. [43] R. Tappenden, P. Rchtár, and J. Gondzo, Inexact coordnate descent: complexty and pcondtonng, arxv pprnt arxv:304.5530, 03. [44] T. Haste, R. Tbshran, and J. Fdman, The elements of statstcal learnng. Sprnger Ses n Statstcs. Sprnger, New Yor, 009. [45] H. A. Eselt and V. Maranov, Poneerng developments n locaton analyss, n Foundatons of Locaton Analyss, Internatonal Ses n Operatons Research & Management Scence, A. Eselt and V. Maranov, Eds. Sprnger, 0, ch., pp. 3. [46] A. Chambolle, An algorthm for total varaton mnmzaton and applcatons, Journal of Mathematcal Imagng and Vson, vol. 0, no., pp. 89 97, Jan. 004. [47] J. Maral, F. Bach, J. Ponce, and G. Sapro, Onlne Dctonary Learnng for Sparse Codng, n Proc. of the 6th Internatonal Confence on Machne Learnng, Montal, Quebec, Canada, June 4 8, 009. [48] D. Goldfarb, S. Ma, and K. Schenberg, Fast alternatng lnearzaton methods for mnmzng the sum of two convex functons, Mathematcal Programmng, vol. 4, pp. 349 38, Oct. 03. [49] D. P. Bertseas and J. N. Tstsls, Parallel and Dstrbuted Computaton: Numercal Methods, nd ed. Athena Scentfc Pss, 989. [50] J.-S. Pang, Error bounds n mathematcal programmng, Mathematcal Programmng, vol. 79, no. -3, pp. 99 33, 997. [5] P. Rchtár and M. Taáč, On optmal probabltes n stochastc coordnate descent methods, arxv pprnt arxv:3.3438, 03. [5] A. Daneshmand, Numercal Comparson of Hybrd Random/Determnstc Parallel Algorthms for nonconvex bg data Optmzaton, Dept. of Elect. Eng., SUNY Buffalo, Tech. Rep., August 04. [Onlne]. Avalable: http://www.eng.buffalo.edu/ amrdane/ DaneshmandTechRepNumCompAug4.pdf [53] P. Rchtár and M. Taáč, Dstrbuted coordnate descent method for learnng wth bg data, arxv pprnt arxv:3.059, 03. [54] http://cbcl.mt.edu/softwa-datasets/facedata.html. [55] M. Razavyayn, H.-W. Tseng, and Z.-Q. Luo, Dctonary learnng for sparse psentaton: Complexty and algorthms, n Proc. of the IEEE 04 Internatonal Confence on Acoustcs, Speech, and Sgnal Processng ICASSP 04, Flonce, Italy, May 4-9, 04. [56] D. P. Bertseas and J. N. Tstsls, Neuro-Dynamc Programmng. Cambrdge, Massachusetts: Athena Scentfc Pss, May. 0.