Parallel Algorithms for Big Data Optimization

Francisco Facchinei, Simone Sagratella, and Gesualdo Scutari, Senior Member, IEEE

Abstract -- We propose a decomposition framework for the parallel optimization of the sum of a differentiable function and a (block) separable nonsmooth, convex one. The latter term is usually employed to enforce structure in the solution, typically sparsity. Our framework is very flexible and includes both fully parallel Jacobi schemes and Gauss-Seidel (i.e., sequential) ones, as well as virtually all possibilities in between, with only a subset of variables updated at each iteration. Our theoretical convergence results improve on existing ones, and numerical results on LASSO and logistic regression problems show that the new method consistently outperforms existing algorithms.

Index Terms -- Parallel optimization, Distributed methods, Jacobi method, LASSO, Sparse solution.

The order of the authors is alphabetic; all the authors contributed equally to the paper. F. Facchinei and S. Sagratella are with the Dept. of Computer, Control, and Management Engineering, at Univ. of Rome La Sapienza, Rome, Italy. Emails: <facchinei,sagratella>@dis.uniroma1.it. G. Scutari is with the Dept. of Electrical Engineering, at the State Univ. of New York at Buffalo, Buffalo, USA. Email: gesualdo@buffalo.edu. His work was supported by the USA National Science Foundation under Grants CMS 1218717 and CAREER Award No. 1254739. Part of this work has been presented at the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, May 4-9 2014, [1].

I. INTRODUCTION

The minimization of the sum of a smooth function, $F$, and of a nonsmooth (block separable) convex one, $G$,

$$\min_{x \in X} V(x) \triangleq F(x) + G(x), \qquad (1)$$

is a ubiquitous problem that arises in many fields of engineering, as diverse as compressed sensing, basis pursuit denoising, sensor networks, neuroelectromagnetic imaging, machine learning, data mining, sparse logistic regression, genomics, meteorology, tensor factorization and completion, geophysics, and radio astronomy. Usually the nonsmooth term is used to promote sparsity of the optimal solution, which often corresponds to a parsimonious representation of some phenomenon at hand. Many of the aforementioned applications can give rise to extremely large problems, so that standard optimization techniques are hardly applicable. And indeed, recent years have witnessed a flurry of research activity aimed at developing solution methods that are simple (for example based solely on matrix/vector multiplications) but yet capable of converging to a good approximate solution in reasonable time. It is hardly possible here to even summarize the huge amount of work done in this field; we refer the reader to the recent works [2]-[17] and books [18]-[20] as entry points to the literature. However, with big data problems it is clearly necessary to design parallel methods able to exploit the computational power of multi-core processors in order to solve many interesting problems. It is then surprising that while sequential solution methods for Problem (1) have been widely investigated, the analysis of parallel algorithms suitable to large-scale implementations lags behind. Gradient-type methods can of course be easily parallelized, but they are known to generally suffer from slow convergence; furthermore, by linearizing $F$ they do not exploit any structure of $F$, a fact that instead has been shown to enhance convergence speed [21]. However, beyond that, and looking at recent approaches, we are only aware of very few papers that deal with parallel solution methods [9]-[16].
These papers analyze both randomized and deterministic block Coordinate Descent Methods (CDMs); one advantage of the analyses therein is that they provide an interesting (global) rate of convergence. However, (i) they are essentially still (regularized) gradient-based methods; (ii) they are not flexible enough to include, among other things, very natural Jacobi-type methods (where at each iteration a minimization of the original function is performed in parallel with respect to all blocks of variables); and (iii) except for [10], [11], [13], they cannot deal with a nonconvex $F$. We refer to Section V for a detailed discussion on current parallel and sequential solution methods for (1).

In this paper, we propose a new, broad, deterministic algorithmic framework for the solution of Problem (1). The essential, rather natural idea underlying our approach is to decompose (1) into a sequence of (simpler) subproblems whereby the function $F$ is replaced by suitable convex approximations; the subproblems can be solved in a parallel and distributed fashion. Key (new) features of the proposed algorithmic framework are: (i) it is parallel, with a degree of parallelism that can be chosen by the user and that can go from complete parallelism (every variable is updated in parallel to all the others) to the sequential (only one variable is updated at each iteration), covering virtually all the possibilities in between; (ii) it easily leads to distributed implementations; (iii) it can tackle a nonconvex $F$; (iv) it is very flexible and includes, among others, updates based on gradient- or Newton-type approximations; (v) it easily allows for inexact solution of the subproblems; (vi) it permits the update of only some (blocks of) variables at each iteration (a feature that turns out to be very important numerically); (vii) even in the case of the minimization of a smooth, convex function (i.e., $F \in C^1$ is convex and $G \equiv 0$) our theoretical results compare favorably to state-of-the-art methods. The proposed framework encompasses a gamut of novel algorithms, offering a lot of flexibility to control iteration complexity, communication overhead, and convergence speed, while converging under the same conditions; these desirable features make our schemes applicable to several different problems and scenarios. Among the variety of new updating rules for the (block) variables we propose, it is worth mentioning a combination of Jacobi and Gauss-Seidel updates, which seems particularly valuable in parallel optimization on multi-core/processor architectures; to the best of our knowledge this is the first time that such a scheme is proposed and analyzed. A further contribution of the paper is to implement our
schemes and the most representative ones in the literature over a parallel architecture, the General Compute Cluster of the Center for Computational Research at the State University of New York at Buffalo. Numerical results on LASSO and Logistic Regression problems show that our algorithms consistently outperform state-of-the-art schemes.

The paper is organized as follows. Section II formally introduces the optimization problem along with the main assumptions under which it is studied. Section III describes our novel general algorithmic framework along with its convergence properties. In Section IV we discuss several instances of the general scheme introduced in Section III. Section V contains a detailed comparison of our schemes with state-of-the-art algorithms on similar problems. Numerical results are presented in Section VI, where we focus on LASSO and Logistic Regression problems and compare our schemes with state-of-the-art alternative solution methods. Finally, Section VII draws some conclusions. All proofs of our results are given in the Appendix.

II. PROBLEM DEFINITION

We consider Problem (1), where the feasible set $X = X_1 \times \cdots \times X_N$ is a Cartesian product of lower dimensional convex sets $X_i \subseteq \mathbb{R}^{n_i}$, and $x \in \mathbb{R}^n$ is partitioned accordingly: $x = (x_1, \ldots, x_N)$, with each $x_i \in \mathbb{R}^{n_i}$; $F$ is smooth (and not necessarily convex) and $G$ is convex and possibly nondifferentiable, with $G(x) = \sum_{i=1}^N g_i(x_i)$. This formulation is very general and includes problems of great interest. Below we list some instances of Problem (1).

- $G(x) = 0$; in this case the problem reduces to the minimization of a smooth, possibly nonconvex function over convex constraints.
- $F(x) = \|Ax - b\|^2$ and $G(x) = c\|x\|_1$, $X = \mathbb{R}^n$, with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}_{++}$ given constants; this is the renowned and much studied LASSO problem [2].
- $F(x) = \|Ax - b\|^2$ and $G(x) = c \sum_{i=1}^N \|x_i\|_2$, $X = \mathbb{R}^n$, with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}_{++}$ given constants; this is the group LASSO problem [22].
- $F(x) = \sum_{j=1}^m \log(1 + e^{-a_j y_j^T x})$ and $G(x) = c\|x\|_1$ (or $G(x) = c \sum_{i=1}^N \|x_i\|_2$), with $y_j \in \mathbb{R}^n$, $a_j \in \mathbb{R}$, and $c \in \mathbb{R}_{++}$ given constants; this is the sparse logistic regression problem [23], [24].
- $F(x) = \sum_{j=1}^m \max\{0, 1 - a_j y_j^T x\}^2$ and $G(x) = c\|x\|_1$, with $a_j \in \{-1, 1\}$, $y_j \in \mathbb{R}^n$, and $c \in \mathbb{R}_{++}$ given; this is the $\ell_1$-regularized $\ell_2$-loss Support Vector Machine problem [5].

Other problems that can be cast in the form (1) include the Nuclear Norm Minimization problem, the Robust Principal Component Analysis problem, the Sparse Inverse Covariance Selection problem, and the Nonnegative Matrix (or Tensor) Factorization problem; see e.g. [25] and references therein.

Assumptions. Given (1), we make the following blanket assumptions:
(A1) Each $X_i$ is nonempty, closed, and convex;
(A2) $F$ is $C^1$ on an open set containing $X$;
(A3) $\nabla F$ is Lipschitz continuous on $X$ with constant $L_{\nabla F}$;
(A4) $G(x) = \sum_{i=1}^N g_i(x_i)$, with all $g_i$ continuous and convex on $X_i$;
(A5) $V$ is coercive.

Note that the above assumptions are standard and are satisfied by most problems of practical interest. For instance, A3 holds automatically if $X$ is bounded; the block-separability condition A4 is a common assumption in the literature of parallel methods for the class of problems (1) (it is in fact instrumental to deal with the nonsmoothness of $G$ in a parallel environment). Interestingly, A4 is satisfied by all standard $G$ usually encountered in applications, including $G(x) = \|x\|_1$ and $G(x) = \sum_{i=1}^N \|x_i\|_2$, which are among the most commonly used functions. Assumption A5 is needed to guarantee that the sequence generated by our method is bounded; we could dispense with it at the price of a more complex analysis and a cumbersome statement of the convergence results.
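As a concrete illustration of the composite structure in (1), the following minimal sketch evaluates $V = F + G$ for the LASSO instance above (illustrative code, not part of the original paper; the data and function names are ours):

```python
import numpy as np

def lasso_objective(A, b, x, c):
    """V(x) = F(x) + G(x), with F(x) = ||Ax - b||^2 (smooth) and
    G(x) = c * ||x||_1 (nonsmooth, block separable)."""
    F = np.linalg.norm(A @ x - b) ** 2
    G = c * np.abs(x).sum()
    return F + G

# Tiny illustrative instance; in the paper A and b come from Nesterov's generator.
rng = np.random.default_rng(0)
A = rng.standard_normal((90, 100))
b = rng.standard_normal(90)
print(lasso_objective(A, b, np.zeros(100), c=1.0))  # V(0) = ||b||^2
```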
III. MAIN RESULTS

We begin by introducing an informal description of our algorithmic framework along with a list of key features that we would like our schemes to enjoy; this will shed light on the core idea of the proposed decomposition technique. We want to develop parallel solution methods for Problem (1) whereby operations can be carried out on some or (possibly) all (block) variables $x_i$ at the same time. The most natural parallel (Jacobi-type) method one can think of is updating all blocks simultaneously: given $x^k$, each (block) variable $x_i$ is updated by solving the following subproblem

$$x_i^{k+1} \in \underset{x_i \in X_i}{\operatorname{argmin}} \left\{ F(x_i, x_{-i}^k) + g_i(x_i) \right\}, \qquad (2)$$

where $x_{-i}$ denotes the vector obtained from $x$ by deleting the block $x_i$. Unfortunately this method converges only under very restrictive conditions [26] that are seldom verified in practice. To cope with this issue the proposed approach introduces some "memory" in the iterate: the new point is a convex combination of $x^k$ and the solutions of (2). Building on this iterate, we would like our framework to enjoy many additional features, as described next.

Approximating $F$: Solving each subproblem as in (2) may be too costly or difficult in some situations. One may then prefer to approximate this problem, in some suitable sense, in order to facilitate the task of computing the new iteration. To this end, we assume that for all $i \in \mathcal{N} \triangleq \{1, \ldots, N\}$ we can define a function $P_i(z; w) : X_i \times X \to \mathbb{R}$, the candidate approximant of $F$, having the following properties (we denote by $\nabla P_i$ the partial gradient of $P_i$ with respect to the first argument $z$):

(P1) $P_i(\cdot; w)$ is convex and continuously differentiable on $X_i$ for all $w \in X$;
(P2) $\nabla P_i(x_i; x) = \nabla_{x_i} F(x)$ for all $x \in X$;
(P3) $\nabla P_i(z; \cdot)$ is Lipschitz continuous on $X$ for all $z \in X_i$.

Such a function $P_i$ should be regarded as a (simple) convex approximation of $F$ at the point $x$ with respect to the block of variables $x_i$ that preserves the first order properties of $F$ with respect to $x_i$. Based on this approximation we can define at any point $x^k \in X$ a regularized approximation $h_i(x_i; x^k)$ of $V$ with respect to $x_i$ wherein $F$ is replaced by $P_i$ while the nondifferentiable term is preserved, and a quadratic proximal term
is added to make the overall approximation strongly convex. More formally, we have

$$h_i(x_i; x^k) \triangleq \underbrace{P_i(x_i; x^k) + \frac{\tau_i}{2}\,(x_i - x_i^k)^T\, Q_i(x^k)\,(x_i - x_i^k)}_{\triangleq\, \tilde{h}_i(x_i;\, x^k)} + g_i(x_i),$$

where $Q_i(x^k)$ is an $n_i \times n_i$ positive definite matrix (possibly dependent on $x^k$). We always assume that the functions $\tilde{h}_i(\cdot, x^k)$ are uniformly strongly convex.

(A6) All $\tilde{h}_i(\cdot; x^k)$ are uniformly strongly convex on $X_i$ with a common positive definiteness constant $q > 0$; furthermore, $Q_i(\cdot)$ is Lipschitz continuous on $X$.

Note that an easy and standard way to satisfy A6 is to take, for any $i$ and for any $k$, $\tau_i = q > 0$ and $Q_i(x^k) = I$. However, if $P_i(\cdot; x^k)$ is already uniformly strongly convex, one can avoid the proximal term and set $\tau_i = 0$ while satisfying A6.

Associated with each $i$ and point $x^k \in X$ we can define the following optimal block solution map:

$$\hat{x}_i(x^k, \tau_i) \triangleq \underset{x_i \in X_i}{\operatorname{argmin}}\; h_i(x_i; x^k). \qquad (3)$$

Note that $\hat{x}_i(x^k, \tau_i)$ is always well-defined, since the optimization problem in (3) is strongly convex. Given (3), we can then introduce the solution map

$$X \ni y \mapsto \hat{x}(y, \tau) \triangleq (\hat{x}_i(y, \tau_i))_{i=1}^N.$$

The proposed algorithm (that we formally describe later on) is based on the computation of (an approximation of) $\hat{x}(x^k, \tau)$. Therefore the functions $P_i$ should lead to as easily computable functions $\hat{x}_i$ as possible. An appropriate choice depends on the problem at hand and on computational requirements. We discuss alternative possible choices for the approximations $P_i$ in Section IV.

Inexact solutions: In many situations (especially in the case of large-scale problems), it can be useful to further reduce the computational effort needed to solve the subproblems in (3) by allowing inexact computations $z_i^k$ of $\hat{x}_i(x^k, \tau_i)$, i.e., $\|z_i^k - \hat{x}_i(x^k, \tau_i)\| \le \varepsilon_i^k$, where $\varepsilon_i^k$ measures the accuracy in computing the solution.

Updating only some blocks: Another important feature we want for our algorithmic framework is the capability of updating at each iteration only some of the (block) variables, a feature that has been observed to be very effective numerically. In fact, our schemes are guaranteed to converge under the update of only a subset of the variables at each iteration; the only condition is that such a subset contains at least one (block) component which is within a factor $\rho \in (0, 1]$ "far away" from the optimality, in the sense explained next. Since $x_i^k$ is an optimal solution of (3) if and only if $\hat{x}_i(x^k, \tau_i) = x_i^k$, a natural distance of $x_i^k$ from the optimality is $d_i^k \triangleq \|\hat{x}_i(x^k, \tau_i) - x_i^k\|$; one could then select the blocks $x_i$'s to update based on such an optimality measure (e.g., opting for blocks exhibiting larger $d_i^k$'s). However, this choice requires the computation of all the solutions $\hat{x}_i(x^k, \tau_i)$, for $i = 1, \ldots, N$, which in some applications (e.g., huge-scale problems) might be computationally too expensive. Building on the same idea, we can introduce alternative, less expensive metrics by replacing the distance $\|\hat{x}_i(x^k, \tau_i) - x_i^k\|$ with a computationally cheaper error bound, i.e., a function $E_i(x)$ such that

$$\underline{s}_i\,\|\hat{x}_i(x^k, \tau_i) - x_i^k\| \;\le\; E_i(x^k) \;\le\; \bar{s}_i\,\|\hat{x}_i(x^k, \tau_i) - x_i^k\|, \qquad (4)$$

for some $0 < \underline{s}_i \le \bar{s}_i$. Of course one can always set $E_i(x^k) = \|\hat{x}_i(x^k, \tau_i) - x_i^k\|$, but other choices are also possible; we discuss this point further in Section IV.

Algorithmic framework: We are now ready to formally introduce our algorithm, Algorithm 1, which includes all the features discussed above; convergence to stationary solutions¹ of (1) is stated in Theorem 1.

Algorithm 1: Inexact Flexible Parallel Algorithm (FLEXA)
Data: $\{\varepsilon_i^k\}$ for $i \in \mathcal{N}$, $\tau_i \ge 0$, $\{\gamma^k\} > 0$, $x^0 \in X$, $\rho \in (0, 1]$. Set $k = 0$.
(S.1): If $x^k$ satisfies a termination criterion: STOP;
(S.2): For all $i \in \mathcal{N}$, solve (3) with accuracy $\varepsilon_i^k$: find $z_i^k \in X_i$ s.t. $\|z_i^k - \hat{x}_i(x^k, \tau_i)\| \le \varepsilon_i^k$;
(S.3): Set $M^k \triangleq \max_i \{E_i(x^k)\}$. Choose a set $S^k$ that contains at least one index $i$ for which $E_i(x^k) \ge \rho M^k$.
Set $\hat{z}_i^k = z_i^k$ for $i \in S^k$, and $\hat{z}_i^k = x_i^k$ for $i \notin S^k$;
(S.4): Set $x^{k+1} \triangleq x^k + \gamma^k (\hat{z}^k - x^k)$;
(S.5): $k \leftarrow k + 1$, and go to (S.1).

Theorem 1: Let $\{x^k\}$ be the sequence generated by Algorithm 1, under A1-A6. Suppose that $\{\gamma^k\}$ and $\{\varepsilon_i^k\}$ satisfy the following conditions: (i) $\gamma^k \in (0, 1]$; (ii) $\gamma^k \to 0$; (iii) $\sum_k \gamma^k = +\infty$; (iv) $\sum_k (\gamma^k)^2 < +\infty$; and (v) $\varepsilon_i^k \le \gamma^k \alpha_1 \min\{\alpha_2, 1/\|\nabla_{x_i} F(x^k)\|\}$ for all $i \in \mathcal{N}$ and some nonnegative constants $\alpha_1$ and $\alpha_2$. Additionally, if inexact solutions are used in Step 2, i.e., $\varepsilon_i^k > 0$ for some $i$ and infinitely many $k$, then assume also that $G$ is globally Lipschitz on $X$. Then, either Algorithm 1 converges in a finite number of iterations to a stationary solution of (1), or every limit point of $\{x^k\}$ (at least one such point exists) is a stationary solution of (1).

Proof: See Appendix B.

The proposed algorithm is extremely flexible. We can always choose $S^k = \mathcal{N}$, resulting in the simultaneous update of all the (block) variables (full Jacobi scheme); or, at the other extreme, one can update a single (block) variable per time, thus obtaining a Gauss-Southwell kind of method. More classical cyclic Gauss-Seidel methods can also be derived and are discussed in the next subsection. One can also compute inexact solutions (Step 2) while preserving convergence, provided that the error term $\varepsilon_i^k$ and the step-size $\gamma^k$ are chosen according to Theorem 1; some practical choices for these parameters are discussed in Section IV. We emphasize that the Lipschitzianity of $G$ is required only if $\hat{x}(x^k, \tau)$ is not computed exactly for infinitely many iterations. At any rate, this Lipschitz condition is automatically satisfied if $G$ is a norm (and therefore, for example, in LASSO and group LASSO problems) or if $X$ is bounded.

¹ We recall that a stationary solution $x^*$ of (1) is a point for which a subgradient $\xi \in \partial G(x^*)$ exists such that $(\nabla F(x^*) + \xi)^T (y - x^*) \ge 0$ for all $y \in X$. Of course, if $F$ is convex, stationary points coincide with global minimizers.
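To fix ideas, one FLEXA iteration with exact subproblem solutions can be sketched as follows. This is only an illustrative paraphrase of Steps 2-4; the callables best_response and error_bound stand for the problem-specific choices of Section IV and are not from the paper:

```python
def flexa_step(x, gamma, rho, best_response, error_bound):
    """One iteration of Algorithm 1 with exact subproblem solutions
    (epsilon_i^k = 0): greedy block selection plus convex combination."""
    N = len(x)
    xhat = [best_response(i, x) for i in range(N)]        # Step 2, parallelizable
    E = [error_bound(i, x, xhat) for i in range(N)]
    M = max(E)                                            # Step 3
    S = {i for i in range(N) if E[i] >= rho * M}          # blocks far from optimality
    # Step 4: move each selected block toward its best response; keep the rest
    return [x[i] + gamma * (xhat[i] - x[i]) if i in S else x[i] for i in range(N)]
```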
As a final remark, note that versions of Algorithm 1 where all (or most of) the variables are updated at each iteration are particularly amenable to implementation in distributed environments (e.g., multi-user communications systems, ad-hoc networks, etc.). In fact, in this case, not only can the calculation of the inexact solutions $z_i^k$ be carried out in parallel, but the information that the $i$-th subproblem has to exchange with the other subproblems in order to compute the next iteration is very limited. A full appreciation of the potentialities of our approach in distributed settings depends however on the specific application under consideration and is beyond the scope of this paper. We refer the reader to [21] for some examples, even if in less general settings.

A. A Gauss-Jacobi algorithm

Algorithm 1 and its convergence theory cover fully parallel Jacobi as well as Gauss-Southwell-type methods, and many of their variants. In this section we show that Algorithm 1 can also incorporate hybrid parallel-sequential (Jacobi Gauss-Seidel) schemes wherein blocks of variables are updated simultaneously by sequentially computing entries per block. This procedure seems particularly well suited to parallel optimization on multi-core/processor architectures. Suppose that we have $P$ processors that can be used in parallel and we want to exploit them to solve Problem (1) ($\mathcal{P}$ will denote both the number of processors and the set $\{1, 2, \ldots, P\}$). We assign to each processor $p$ the variables $I_p$; therefore $I_1, \ldots, I_P$ is a partition of the index set. We denote by $x_p \triangleq (x_{pi})_{i \in I_p}$ the vector of (block) variables $x_{pi}$ assigned to processor $p$, with $i \in I_p$; and $x_{-p}$ is the vector of remaining variables, i.e., the vector of those assigned to all processors except the $p$-th one. Finally, given $i \in I_p$, we partition $x_p$ as $x_p = (x_{p<i}, x_{p\ge i})$, where $x_{p<i}$ is the vector containing all variables in $I_p$ that come before $i$ (in the order assumed in $I_p$), while $x_{p\ge i}$ are the remaining variables. Thus we will write, with a slight abuse of notation, $x = (x_{p<i}, x_{p\ge i}, x_{-p})$.

Once the optimization variables have been assigned to the processors, one could in principle apply the inexact Jacobi Algorithm 1. In this scheme each processor $p$ would compute sequentially, at each iteration $k$ and for every (block) variable $x_{pi}$, a suitable $z_{pi}^k$ by keeping all variables but $x_{pi}$ fixed to $(x_{pj}^k)_{j \in I_p,\, j \ne i}$ and $x_{-p}^k$. But since we are solving the problems for each group of variables assigned to a processor sequentially, this seems a waste of resources; it is instead much more efficient to use, within each processor, a Gauss-Seidel scheme, whereby the currently calculated iterates are used in all subsequent calculations. Our Gauss-Jacobi method formally described in Algorithm 2 implements exactly this idea; its convergence properties are given in Theorem 2.

Theorem 2: Let $\{x^k\}_{k=1}^\infty$ be the sequence generated by Algorithm 2, under the setting of Theorem 1. Then, either Algorithm 2 converges in a finite number of iterations to a stationary solution of (1), or every limit point of the sequence $\{x^k\}_{k=1}^\infty$ (at least one such point exists) is a stationary solution of (1).

Proof: See Appendix C.

Algorithm 2: Inexact Gauss-Jacobi Algorithm
Data: $\{\varepsilon_{pi}^k\}$ for $p \in \mathcal{P}$ and $i \in I_p$, $\tau \ge 0$, $\{\gamma^k\} > 0$, $x^0 \in X$. Set $k = 0$.
(S.1): If $x^k$ satisfies a termination criterion: STOP;
(S.2): For all $p \in \mathcal{P}$ do (in parallel),
    For all $i \in I_p$ do (sequentially)
    a) Find $z_{pi}^k$ s.t. $\|z_{pi}^k - \hat{x}_{pi}\big((x_{p<i}^{k+1}, x_{p\ge i}^k, x_{-p}^k), \tau_{pi}\big)\| \le \varepsilon_{pi}^k$;
    b) Set $x_{pi}^{k+1} \triangleq x_{pi}^k + \gamma^k \big(z_{pi}^k - x_{pi}^k\big);$
(S.3): $k \leftarrow k + 1$, and go to (S.1).
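The hybrid structure of Algorithm 2 (Jacobi across processors, Gauss-Seidel within each processor) can be sketched as follows. This is illustrative only: the outer loop is written serially here, although in the experiments of Section VI each $p$ runs as a separate MPI process; best_response is again a placeholder and the inexactness of Step 2a is omitted:

```python
def gauss_jacobi_step(x, partition, gamma, best_response):
    """One iteration of Algorithm 2. Each processor p sweeps its indices I_p
    sequentially, reusing its own fresh updates (Gauss-Seidel within p), while
    the blocks of the other processors stay at their iteration-k values (Jacobi)."""
    x_next = list(x)
    for I_p in partition:            # in practice: one process per p, in parallel
        local = list(x)              # processor p sees stale values of other blocks
        for i in I_p:                # sequential sweep within processor p
            z_pi = best_response(i, local)             # Step 2a (exact for brevity)
            local[i] = x[i] + gamma * (z_pi - x[i])    # Step 2b
            x_next[i] = local[i]
    return x_next
```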
Although the proof of Theorem 2 is relegated to the appendix, it is interesting to point out that the gist of the proof is to show that Algorithm 2 is nothing else but an instance of Algorithm 1 with errors. By updating all variables at each iteration, Algorithm 2 has the advantage that neither the error bounds $E_i$ nor the exact solutions $\hat{x}_{pi}$ need to be computed in order to decide which variables should be updated. Furthermore, it is rather intuitive that the use of the latest available information should reduce the number of overall iterations needed to converge with respect to Algorithm 1 (assuming in the latter algorithm that all variables are updated at each iteration). However, these advantages should be contrasted with the following two facts: (i) updating all variables at each iteration might not always be the best (or a feasible) choice; and (ii) in many practical instances of Problem (1), using the latest information as dictated by Algorithm 2 may require extra calculations (e.g., to compute function information, such as the gradients) and communication overhead. These aspects are discussed on specific examples in Section VI. As a final remark, note that Algorithm 2 contains as a special case the classical cyclical Gauss-Seidel scheme (a fact that was less obvious to deduce directly from Algorithm 1); it is sufficient to set $P = 1$ (corresponding to using only one processor): the single processor updates all the (scalar) variables sequentially, using the new values of those that have already been updated.

IV. EXAMPLES AND SPECIAL CASES

Algorithms 1 and 2 are very general and encompass a gamut of novel algorithms, each corresponding to various forms of the approximant $P_i$, the error bound function $E_i$, the step-size sequence $\gamma^k$, the block partition, etc. These choices lead to algorithms that can be very different from each other, but all converging under the same conditions. These degrees of freedom offer a lot of flexibility to control iteration complexity, communication overhead, and convergence speed. In this section we outline several effective choices for the design parameters along with some illustrative examples of specific algorithms resulting from a proper combination of these choices.

On the choice of the step-size $\gamma^k$. An example of step-size rule satisfying conditions (i)-(iv) in Theorem 1 is: given
$0 < \gamma^0 \le 1$, let

$$\gamma^k = \gamma^{k-1}\left(1 - \theta\,\gamma^{k-1}\right), \qquad k = 1, \ldots, \qquad (5)$$

where $\theta \in (0, 1)$ is a given constant. Notice that while this rule may still require some tuning for optimal behavior, it is quite reliable, since in general we are not using a (sub)gradient direction, so that many of the well-known practical drawbacks associated with a (sub)gradient method with diminishing step-size are mitigated in our setting. Furthermore, this choice of step-size does not require any form of centralized coordination, which is a favorable feature in a parallel environment. Numerical results in Section VI show the effectiveness of (the customization of) (5) on specific problems. We remark that it is possible to prove convergence of Algorithm 1 also using other step-size rules, such as a standard Armijo-like line-search procedure or a (suitably small) constant step-size. We omit the discussion of these options because the former is not in line with our parallel approach while the latter is numerically less efficient.

On the choice of the error bound function $E_i(x)$. As we mentioned, the most obvious choice is to take $E_i(x) = \|\hat{x}_i(x^k, \tau_i) - x_i^k\|$. This is a valuable choice if the computation of $\hat{x}_i(x^k, \tau_i)$ can be easily accomplished. For instance, in the LASSO problem with $\mathcal{N} = \{1, \ldots, n\}$ (i.e., when each block reduces to a scalar variable), it is well-known that $\hat{x}_i(x^k, \tau_i)$ can be computed in closed form using the soft-thresholding operator [12]. In situations where the computation of $\|\hat{x}_i(x^k, \tau_i) - x_i^k\|$ is not possible or advisable, we can resort to estimates. Assume momentarily that $G \equiv 0$. Then it is known [27, Proposition 6.3.1] that, under our assumptions, $\|\Pi_{X_i}(x_i^k - \nabla_{x_i} F(x^k)) - x_i^k\|$ is an error bound for the minimization problem in (3) and therefore satisfies (4), where $\Pi_{X_i}(y)$ denotes the Euclidean projection of $y$ onto the closed and convex set $X_i$. In this situation we can choose $E_i(x^k) = \|\Pi_{X_i}(x_i^k - \nabla_{x_i} F(x^k)) - x_i^k\|$. If $G(x) \not\equiv 0$ things become more complex. In most cases of practical interest, adequate error bounds can be derived from [11, Lemma 7]. It is interesting to note that the computation of $E_i$ is only needed if a partial update of the (block) variables is performed. However, an option that is always feasible is to take $S^k = \mathcal{N}$ at each iteration, i.e., to update all (block) variables at each iteration. With this choice we can dispense with the computation of $E_i$ altogether.

On the choice of the approximant $P_i(x_i; x)$. The most obvious choice for $P_i$ is the linearization of $F$ at $x^k$ with respect to $x_i$: $P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k)$. With this choice, and taking for simplicity $Q_i(x^k) = I$,

$$\hat{x}_i(x^k, \tau_i) = \underset{x_i \in X_i}{\operatorname{argmin}} \left\{ F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{\tau_i}{2}\,\|x_i - x_i^k\|^2 + g_i(x_i) \right\}. \qquad (6)$$

This is essentially the way a new iteration is computed in most sequential (block-)CDMs for the solution of (group) LASSO problems and its generalizations. Note that, contrary to most existing schemes, our algorithm is parallel. At another extreme we could just take $P_i(x_i; x^k) = F(x_i, x_{-i}^k)$. Of course, to have P1 satisfied (cf. Section III), we must assume that $F(x_i, x_{-i}^k)$ is convex. With this choice, and setting for simplicity $Q_i(x^k) = I$, we have

$$\hat{x}_i(x^k, \tau_i) \in \underset{x_i \in X_i}{\operatorname{argmin}} \left\{ F(x_i, x_{-i}^k) + \frac{\tau_i}{2}\,\|x_i - x_i^k\|^2 + g_i(x_i) \right\}, \qquad (7)$$

thus giving rise to a parallel nonlinear Jacobi type method for the constrained minimization of $V(x)$.
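For the scalar-block LASSO instance, the linearized subproblem (6) has the closed-form soft-thresholding solution recalled above. A minimal sketch under the assumptions $F(x) = \|Ax - b\|^2$ (so that $\nabla F(x) = 2A^T(Ax - b)$), $X_i = \mathbb{R}$, and $Q_i = I$; the function names are ours:

```python
import numpy as np

def soft_threshold(u, lam):
    """Prox of lam * |.|:  S_lam(u) = sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def best_response_lasso(A, b, x, c, tau):
    """Solves (6) for all scalar blocks at once:
    xhat_i = S_{c/tau_i}( x_i - grad_i F(x) / tau_i )."""
    grad = 2.0 * A.T @ (A @ x - b)
    return soft_threshold(x - grad / tau, c / tau)
```

The best response (7) admits a similar closed form for LASSO, since keeping $F(x_i, x_{-i}^k)$ exact only changes the quadratic coefficient of the scalar subproblem.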
Between the two extreme solutions proposed above, one can consider intermediate choices. For example, if $F(x_i, x_{-i}^k)$ is convex, we can take $P_i(x_i; x^k)$ as a second order approximation of $F(x_i, x_{-i}^k)$, i.e.,

$$P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{1}{2}\,(x_i - x_i^k)^T\, \nabla^2_{x_i x_i} F(x^k)\,(x_i - x_i^k). \qquad (8)$$

When $g_i(x_i) \equiv 0$, this essentially corresponds to taking a Newton step in minimizing the "reduced" problem $\min_{x_i \in X_i} F(x_i, x_{-i}^k)$, resulting in

$$\hat{x}_i(x^k, \tau_i) = \underset{x_i \in X_i}{\operatorname{argmin}} \left\{ F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{1}{2}\,(x_i - x_i^k)^T\, \nabla^2_{x_i x_i} F(x^k)\,(x_i - x_i^k) + \frac{\tau_i}{2}\,\|x_i - x_i^k\|^2 + g_i(x_i) \right\}. \qquad (9)$$
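For scalar blocks ($n_i = 1$) the Hessian block in (9) is a scalar and the regularized subproblem is again solved by soft-thresholding. A minimal sketch, written for the logistic loss of Example #3 below, with our own function names (the gradient and diagonal second-derivative formulas of the logistic loss are standard):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_best_response_logreg(Y, a, x, c, tau):
    """Solves (9) for every scalar block of F(x) = sum_j log(1 + exp(-a_j y_j^T x)),
    G = c * ||.||_1: with scalar H_ii, the solution is
    xhat_i = S_{c/(H_ii + tau)}( x_i - grad_i / (H_ii + tau) ).
    Y stacks the rows y_j^T; a holds the labels a_j in {-1, +1}."""
    t = a * (Y @ x)                        # margins a_j * y_j^T x
    s = sigmoid(t)
    grad = Y.T @ (a * (s - 1.0))           # gradient of the logistic loss
    H = (Y ** 2).T @ (s * (1.0 - s))       # diagonal entries of the Hessian
    d = H + tau
    u = x - grad / d
    return np.sign(u) * np.maximum(np.abs(u) - c / d, 0.0)
```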
Another intermediate choice, relying on a specific structure of the objective function that has important applications, is the following. Suppose that $F$ is a sum-utility function, i.e., $F(x) = \sum_{j \in J} f_j(x_i, x_{-i})$, for some finite set $J$. Assume now that for every $j \in S_i \subseteq J$, the functions $f_j(\cdot, x_{-i})$ are convex. Then we may set

$$P_i(x_i; x^k) = \sum_{j \in S_i} f_j(x_i, x_{-i}^k) + \sum_{j \notin S_i} \nabla_{x_i} f_j(x^k)^T (x_i - x_i^k),$$

thus preserving, for each $i$, the favorable convex part of $F$ with respect to $x_i$ while linearizing the nonconvex parts. This is the approach adopted in [21] in the design of multi-user systems, to which we refer for applications in signal processing and communications.

The framework described in Algorithm 1 can give rise to very different schemes, according to the choices one makes for the many variable features it contains, some of which have been detailed above. Because of space limitations, we cannot discuss here all possibilities. We provide next just a few instances of possible algorithms that fall in our framework.

Example #1 - (Proximal) Jacobi algorithms for convex functions. Consider the simplest problem falling in our setting: the unconstrained minimization of a continuously differentiable convex function, i.e., assume that $F$ is convex, $G \equiv 0$, and $X = \mathbb{R}^n$. Although this is possibly the best studied problem in nonlinear optimization, classical parallel methods for this problem [26, Sec. 3.2.4] require very strong contraction conditions. In our framework we can take $P_i(x_i; x^k) = F(x_i, x_{-i}^k)$, resulting in a parallel Jacobi-type method which does not need any additional assumptions. Furthermore, our theory shows that we can even dispense with the convexity assumption and still get convergence of a Jacobi-type method to a stationary point. If in addition we take $S^k = \mathcal{N}$, we obtain the class of methods studied in [21], [28]-[30].

Example #2 - Parallel coordinate descent methods for LASSO. Consider the LASSO problem, i.e., Problem (1) with $F(x) = \|Ax - b\|^2$, $G(x) = c\|x\|_1$, and $X = \mathbb{R}^n$. Probably, to date, the most successful class of methods for this problem is that of CDMs, whereby at each iteration a single variable is updated using (6). We can easily obtain a parallel version of this method by taking $n_i = 1$, $S^k = \mathcal{N}$ and still using (6). Alternatively, instead of linearizing $F(x)$, we can better exploit the structure of $F(x)$ and use (7). In fact, it is well known that in LASSO problems subproblem (7) can be solved analytically. We can easily consider similar methods for the group LASSO problem as well (just take $n_i > 1$).

Example #3 - Parallel coordinate descent methods for Logistic Regression. Consider the Logistic Regression problem, i.e., Problem (1) with $F(x) = \sum_{j=1}^m \log(1 + e^{-a_j y_j^T x})$, $G(x) = c\|x\|_1$, and $X = \mathbb{R}^n$, where $y_j \in \mathbb{R}^n$, $a_j \in \{-1, 1\}$, and $c \in \mathbb{R}_{++}$ are given constants. Since $F(x_i, x_{-i}^k)$ is convex, we can take $P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{1}{2}(x_i - x_i^k)^T \nabla^2_{x_i x_i} F(x^k)(x_i - x_i^k)$, thus obtaining a fully distributed and parallel CDM that uses a second order approximation of the smooth function $F$. Moreover, by taking $n_i = 1$ and using a soft-thresholding operator, each $\hat{x}_i$ can be computed in closed form.

V. RELATED WORKS

The proposed algorithmic framework draws on Successive Convex Approximation (SCA) paradigms that have a long history in the optimization literature. Nevertheless, our algorithms and their convergence conditions (cf. Theorems 1 and 2) unify and extend current parallel and sequential SCA methods in several directions, as outlined next.

(Partially) Parallel Deterministic Methods: The roots of parallel deterministic SCA schemes (when all the variables are updated simultaneously) can be traced back at least to the work of Cohen on the so-called auxiliary principle [28], [29] and its related developments, see e.g. [9]-[16], [21], [30]-[32]. Roughly speaking these works can be divided in two groups, namely: solution methods for convex objective functions [9], [12], [14]-[16], [28], [29] and nonconvex ones [10], [11], [13], [21], [30]-[32]. All methods in the former group (and [10], [11], [13], [31], [32]) are (proximal) gradient schemes; they thus share the classical drawbacks of gradient-like schemes; moreover, by replacing the convex function $F$ with its first order approximation, they do not take any advantage of the structure of $F$, a fact that instead has been shown to enhance convergence speed [21]. Comparing with the second group of works [10], [11], [13], [21], [30]-[32], our algorithmic framework improves on their convergence properties while adding more flexibility in the selection of how many variables to update at each iteration. For instance, with the exception of [11], all the aforementioned works do not allow parallel updates of only a subset of all variables, a fact that instead can dramatically improve the convergence speed of the algorithm, as we show in Section VI. Moreover, with the exception of [30], they all require an Armijo-type line-search, which makes them not appealing for a (parallel) distributed implementation. A scheme in [30] is actually based on diminishing step-size rules, but its convergence properties are quite weak: not all the limit points of the sequence generated by this scheme are guaranteed to be stationary solutions of (1).
Our framework instead: (i) deals with nonconvex (nonsmooth) problems; (ii) allows one to use a much more varied array of approximations for $F$ and also inexact solutions of the subproblems; (iii) is fully parallelizable and distributable (it does not rely on any line-search); and (iv) leads to the first distributed convergent schemes based on very general (possibly) partial updating rules of the optimization variables. In fact, among deterministic schemes, we are aware of only the algorithms in [11], [14], [15] performing at each iteration a parallel update of only a subset of all the variables. These algorithms however are gradient-like schemes, and do not allow inexact solutions of the subproblems (in some large-scale problems the cost of computing the exact solution of all the subproblems can be prohibitive). In addition, [11] requires an Armijo-type line-search, whereas [14] and [15] are applicable only to convex objective functions and are not fully parallel. In fact, the convergence conditions therein impose a constraint on the maximum number of variables that can be simultaneously updated (linked to the spectral radius of some matrices), a constraint that in many large scale problems is likely not satisfied.

Sequential Methods: Our framework contains as special cases also sequential updates; it is then interesting to compare our results to sequential schemes too. Given the vast literature on the subject, we consider here only the most recent and general work [17]. In [17] the authors consider the minimization of a possibly nonsmooth function by Gauss-Seidel methods whereby, at each iteration, a single block of variables is updated by minimizing a global upper convex approximation of the function. However, finding such an approximation is generally not an easy task, if not impossible. To cope with this issue, the authors also proposed a variant of their scheme that does not need this requirement but uses an Armijo-type line-search, which however makes the scheme not suitable for a parallel/distributed implementation. Contrary to [17], in our framework the conditions on the approximation function (cf. P1-P3) are trivial to satisfy (in particular, $P_i$ need not be an upper bound of $F$), enlarging significantly the class of utility functions $V$ to which the proposed solution method is applicable. Furthermore, our framework gives rise to parallel and distributed methods (no line search is used) wherein all variables can be updated rather independently at the same time.

VI. NUMERICAL RESULTS

In this section we provide some numerical results offering solid evidence of the viability of our approach; they clearly show that our algorithmic framework leads to practical methods that exploit well parallelism and compare favourably to existing schemes, both parallel and sequential. The tests were carried out on LASSO and Logistic Regression problems, two of the most studied instances of Problem (1).
All codes have been written in C++ and use the Message Passing Interface for parallel operations. All algebra is performed by using the GNU Scientific Library (GSL). The algorithms were tested on the General Compute Cluster of the Center for Computational Research at the State University of New York at Buffalo. In particular, for our experiments we used a partition composed of 372 DELL 12x2.40GHz Intel Xeon E5645 Processor computer nodes with 48 GB of main memory and QDR InfiniBand 40Gb/s network card. In our experiments distributed algorithms ran on 20 parallel processes (that is, we used 2 nodes with 10 cores each), while sequential algorithms ran on a single process (thus using one single core).

A. LASSO problem

We implemented the instance of Algorithm 1 that we described in Example #2 in the previous section, using the approximating function $P_i$ as in (7). Note that in the case of LASSO problems $\hat{x}_i(x^k, \tau_i)$, the unique solution of (7), can be easily computed in closed form using the soft-thresholding operator, see e.g. [12].

Tuning of Algorithm 1: The free parameters of our algorithm are chosen as follows. The proximal gains $\tau_i$ are initially all set to $\tau_i = \operatorname{tr}(A^T A)/2n$, where $n$ is the total number of variables. This initial value, which is half of the mean of the eigenvalues of $\nabla^2 F$, has been observed to be very effective in all our numerical tests. Choosing an appropriate value of $\tau_i$ at each iteration is crucial. Note that in the description of our algorithmic framework we considered fixed values of $\tau_i$, but it is clear that varying them a finite number of times does not affect in any way the theoretical convergence properties of the algorithms. On the other hand, we found that an appropriate update of the $\tau_i$ in the early iterations can enhance considerably the performance of the algorithm. Some preliminary experiments showed that an effective option is to choose $\tau_i$ large enough to force a decrease in the objective function value, but not too large to slow down progress towards optimality. We found that the following heuristic works well in practice: (i) all $\tau_i$ are doubled if at a certain iteration the objective function does not decrease; and (ii) they are all halved if the objective function decreases for ten consecutive iterations or the relative error on the objective function $\mathrm{re}(x)$ is sufficiently small, where

$$\mathrm{re}(x) \triangleq \frac{V(x) - V^*}{V^*}, \qquad (10)$$

and $V^*$ is the optimal value of the objective function $V$ (in our experiments on LASSO $V^*$ is known, see below). In order to avoid increments in the objective function, whenever all $\tau_i$ are doubled, the associated iteration is discarded, and in Step 4 of Algorithm 1 it is set $x^{k+1} = x^k$. In any case we limited the number of possible updates of the values of $\tau_i$ to 100.

The step-size $\gamma^k$ is updated according to the following rule:

$$\gamma^k = \gamma^{k-1} \left( 1 - \min\left\{1, \frac{10^{-4}}{\mathrm{re}(x^{k-1})}\right\}\,\theta\,\gamma^{k-1} \right), \qquad k = 1, \ldots, \qquad (11)$$

with $\gamma^0 = 0.9$ and $\theta = 1\mathrm{e}{-7}$. The above diminishing rule is based on (5), while guaranteeing that $\gamma^k$ does not become too close to zero before the relative error is sufficiently small. Note that since the $\tau_i$ are changed only a finite number of times and the step-size $\gamma^k$ decreases, the conditions of Theorem 1 are all satisfied.

Finally, the error bound function is chosen as $E_i(x^k) = \|\hat{x}_i(x^k, \tau_i) - x_i^k\|$, and $S^k$ in Step 3 of the algorithm is set to $S^k = \{i : E_i(x^k) \ge \sigma M^k\}$. In our tests we consider two options for $\sigma$, namely: (i) $\sigma = 0$, which leads to a fully parallel scheme wherein at each iteration all variables are updated; and (ii) $\sigma = 0.5$, which corresponds to updating only a subset of all the variables at each iteration. Note that for both choices of $\sigma$, the resulting set $S^k$ satisfies the requirement in Step 3 of Algorithm 1; indeed, $S^k$ always contains the index $i$ corresponding to the largest $E_i(x^k)$.
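The $\tau_i$ heuristic just described can be summarized in a short sketch (our paraphrase, not the authors' C++ implementation; the relative-error test of (10) in the halving rule is omitted for brevity):

```python
def update_tau(tau, V_new, V_old, n_decreases, n_changes, max_changes=100):
    """Returns (tau, n_decreases, n_changes, accept_iterate).
    Double all tau_i when V does not decrease (and discard the iterate);
    halve them after ten consecutive decreases. Stop adapting after
    max_changes updates, so that Theorem 1 still applies."""
    if n_changes >= max_changes:
        return tau, n_decreases, n_changes, True
    if V_new > V_old:                      # no decrease: double and discard
        return 2.0 * tau, 0, n_changes + 1, False
    n_decreases += 1
    if n_decreases >= 10:                  # ten consecutive decreases: halve
        return 0.5 * tau, 0, n_changes + 1, True
    return tau, n_decreases, n_changes, True
```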
Recall also that, as we already mentioned, the computation of each $\hat{x}_i(x^k, \tau_i)$ for the LASSO problem is in closed form and thus inexpensive. We termed the above instance of our Algorithm 1 the FLEXible parallel Algorithm (FLEXA); in the sequel we will refer to the two versions of FLEXA as FLEXA $\sigma = 0$ and FLEXA $\sigma = 0.5$.

Algorithms in the literature: We compared our versions of FLEXA with the most common distributed and sequential algorithms proposed in the literature to solve the LASSO problem. More specifically, we consider the following schemes.

FISTA: The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) proposed in [12] is a first order method and can be regarded as the benchmark algorithm for LASSO problems. By taking advantage of the separability of the terms in the objective function $V$, this method can be easily parallelized and thus implemented on a parallel architecture. FISTA requires the preliminary computation of the Lipschitz constant $L_{\nabla F}$ of $\nabla F$; in our experiments we performed this computation using a distributed version of the power method that computes $\|A\|_2^2$ (see, e.g., [33]).

SpaRSA: This is the first order method proposed in [13]; it is a popular spectral projected gradient method that uses a spectral step length together with a nonmonotone line search to enhance convergence. Also this method can be easily parallelized, which is the version implemented in our tests. In all the experiments we set the parameters of SpaRSA as in [13]: $M = 5$, $\sigma = 0.01$, $\alpha_{\max} = 1\mathrm{e}30$, and $\alpha_{\min} = 1\mathrm{e}{-30}$.

GRock: This is a parallel algorithm proposed in [15] that seems to perform extremely well on sparse LASSO problems. We actually tested two instances of GRock, namely: (i) one where only one variable is updated at each iteration; and (ii) a second instance where the number of variables simultaneously updated is equal to the number of parallel processors (in our experiments we used 20 processors). It is important to remark that the theoretical convergence properties of GRock are in jeopardy as the number of variables updated in parallel increases; roughly speaking, GRock is guaranteed to converge if the columns of the data matrix $A$ in the LASSO problem are almost orthogonal, a feature enjoyed by most of our test problems, but that is not satisfied in many applications.

ADMM: This is a classical Alternating Direction Method of Multipliers (ADMM) in the form used in [34]. Applied to LASSO problems, this instance leads to a sequential scheme where
only one variable at a time can be updated (in closed form). Note that in principle ADMM can be parallelized, but it is well known that it does not scale well with the number of processors; therefore in our tests we have not implemented the parallel version.

GS: This is a classical sequential Gauss-Seidel scheme [26] computing $\hat{x}_i$ with $n_i = 1$, and then updating all $x_i$ in a sequential fashion (and using a unitary step-size).

In all the parallel algorithms we implemented (FLEXA, FISTA, SpaRSA and GRock), the data matrix $A$ of the LASSO problem is stored in a column-block distributed manner, $A = [A_1\ A_2\ \cdots\ A_P]$, where $P$ is the number of parallel processors. Thus the computation of each product $Ax$ (which is required to evaluate $\nabla F$) and of the norm $\|x\|_1$ (that is, $G$) is divided into the parallel jobs of computing $A_i x_i$ and $\|x_i\|_1$, followed by a reduce operation. Columns of $A$ were equally distributed among the processes.

Numerical Tests: We generated six groups of LASSO problems using the random generation technique proposed by Nesterov [10]; this method permits to control the sparsity of the solution. For the first five groups, we considered problems with 10,000 variables and matrix $A$ having 9,000 rows. The five groups differ in the degree of sparsity of the solution; more specifically, the percentage of non-zeros in the solution is 1%, 10%, 20%, 30%, and 40%, respectively. The last group is formed by instances with 100,000 variables and 5,000 rows for $A$, and solutions having 1% of non-zero variables. In all experiments and for all the algorithms, the initial point was set to the zero vector.

Results of our experiments for each of the 10,000-variables groups are reported in Fig. 1, where we plot the relative error as defined in (10) versus the CPU time; all the curves are averaged over ten independent random realizations. Note that the CPU time includes communication times (for distributed algorithms) and the initial time needed by the methods to perform all pre-iteration computations (this explains why the curves associated with FISTA start after the others; in fact FISTA requires some nontrivial initializations based on the computation of $\|A\|_2^2$). Results of our experiments for the LASSO instance with 100,000 variables are reported in Fig. 2. The curves are averaged over the random realizations. Note that we have not included the curves for the sequential algorithms (ADMM and GS) on this group of big problems, since we could not use the same nodes used to run all the other algorithms, due to memory limitations. However, we tested ADMM and GS on these big problems on different high-memory nodes; the obtained results (not reported here) showed that, as the dimensions of the problem increase, sequential methods perform poorly in comparison with parallel methods; therefore we excluded ADMM and GS from the tests for the LASSO instance with 100,000 variables.

[Fig. 1: Relative error vs. time (in seconds) for LASSO with 10,000 variables: (a) 1% non-zeros, (b) 10% non-zeros, (c) 20% non-zeros, (d) 30% non-zeros, (e) 40% non-zeros. Curves shown for FLEXA $\sigma = 0$, FLEXA $\sigma = 0.5$, FISTA, SpaRSA, GRock ($P = 1$ and $P = 20$), ADMM, and GS.]

[Fig. 2: Relative error vs. time (in seconds) for LASSO with 100,000 variables. Curves shown for FLEXA $\sigma = 0$, FLEXA $\sigma = 0.5$, FISTA, SpaRSA, and GRock ($P = 1$ and $P = 20$).]

Given Figs. 1 and 2, the following comments are in order. On all the tested problems, FLEXA with $\sigma = 0.5$ outperforms in a consistent manner all the other implemented algorithms. Results for FLEXA with $\sigma = 0$ are quite similar to those with $\sigma = 0.5$ on the 10,000-variables problems. However, on larger problems FLEXA $\sigma = 0$ (i.e., the version in which all variables are updated at each iteration) seems ineffective.
This result might seem surprising at first sight: why, once all the optimal solutions $\hat{x}_i(x^k, \tau_i)$ are computed, is it more convenient not to use all of them, but to update instead only a subset of variables? We briefly discuss this complex issue next.

Remark 3 (On the partial updates): It can be shown that
Algorithm 1 has the remarkable capability to identify those variables that will be zero at a solution; because of lack of space, we do not provide here the proof of this statement but only an informal description. Roughly speaking, it can be shown that, for $k$ large enough, those variables that are zero in $\hat{x}(x^k, \tau)$ will be zero also in a limiting solution $\bar{x}$. Therefore, suppose that $\bar{k}$ is large enough so that this identification property already takes place (we will say that we are in the "identification phase") and consider an index $i$ such that $\bar{x}_i = 0$. Then, if $x_i^{\bar{k}}$ is zero, it is clear, by Steps 3 and 4, that $x_i^k$ will be zero for all indices $k > \bar{k}$, independently of whether $i$ belongs to $S^k$ or not. In other words, if a variable that is zero at the solution is already zero when the algorithm enters the identification phase, that variable will be zero in all subsequent iterations; this fact, intuitively, should enhance the convergence speed of the algorithm. Conversely, if when we enter the identification phase $x_i^k$ is not zero, the algorithm will have to bring it back to zero iteratively. It should then be clear why updating only variables that we have strong reason to believe will be non-zero at a solution is a better strategy than updating them all. Of course, there may be a problem dependence and the best value of $\sigma$ can vary from problem to problem. But we believe that the explanation outlined above gives firm theoretical ground to the idea that it might be wise to "waste" some calculations and perform only a partial update of the variables.

Referring to the sequential methods (ADMM and GS), they behave strikingly well on the 10,000-variables problems, if one keeps in mind that they only use one process. However, as already observed, they cannot compete with parallel methods on larger problems. FISTA is capable of approaching relatively fast low-accuracy solutions, but has difficulties in reaching high accuracy. The version of GRock with $P = 20$ is the closest match to FLEXA, but only when the problems are very sparse. This is consistent with the fact that its convergence properties are at stake when the problems are quite dense. Furthermore, it should be clear that if the problem is very large, updating only 20 variables at each iteration, as GRock does, could slow down the convergence, especially when the optimal solution is not very sparse. From this point of view, the strategy used by FLEXA $\sigma = 0.5$ seems to strike a good balance between not updating variables that are probably zero at the optimum and nevertheless updating a sizeable amount of variables when needed in order to enhance convergence. Finally, SpaRSA seems to be very insensitive to the degree of sparsity of the solution; it is comparable to our FLEXA on 10,000-variables problems, but is much less effective on very large-scale problems. In conclusion, Fig. 1 and Fig. 2 show that while there is no algorithm in the literature performing equally well on all the simulated (large and very large-scale) problems, the proposed FLEXA is consistently the winner.

B. Logistic regression problems

The logistic regression problem is described in Example #3 (cf. Section III). For such a problem, we implemented the instance of Algorithm 1 described in the same example.

TABLE I: Test problems for logistic regression tests

  Data set          |   m   |   n   |   c
  ------------------|-------|-------|--------
  gisette (scaled)  | 6000  | 5000  | 1/1500
  colon-cancer      |   62  | 2000  |  0.01
  leukemia          |   38  | 7129  |  0.01

More specifically, the algorithm is essentially the same as the one described for LASSO, but with the following differences:
(a) The approximant $P_i$ is chosen as the second order approximation of the original function $F$;
(b) The initial $\tau_i$ are set to $\operatorname{tr}(Y^T Y)/2n$ for all $i$, where $n$ is the total number of variables and $Y \triangleq [y_1\ y_2\ \cdots\ y_m]^T$.
(c) Since the optimal value $V^*$ is not known for the logistic regression problem, we no longer use $\mathrm{re}(x)$ as a merit function but $Z(x) \triangleq \|\nabla F(x) - \Pi_{[-c,c]^n}(\nabla F(x) - x)\|_\infty$. Here the projection $\Pi_{[-c,c]^n}(z)$ can be efficiently computed; it acts component-wise on $z$, since $[-c, c]^n = [-c, c] \times \cdots \times [-c, c]$. Note that $Z(x)$ is a valid optimality measure function; indeed, $Z(x) = 0$ is equivalent to the standard necessary optimality condition for Problem (1), see [6]. Therefore, whenever $\mathrm{re}(x)$ was used for the LASSO problems, we now use $Z(x)$ [including in the step-size rule (11)].

We simulated the instances of the logistic regression problem whose essential data features are given in Table I; we downloaded the data from the LIBSVM repository http://www.csie.ntu.edu.tw/~cjlin/libsvm/, which we refer to for a detailed description of the test problems. In our implementation, the matrix $Y$ is stored in a column-block distributed manner, $Y = [Y_1\ Y_2\ \cdots\ Y_P]$, where $P$ is the number of parallel processors. We compared FLEXA $\sigma = 0$ and FLEXA $\sigma = 0.5$ with the other parallel algorithms, namely: FISTA, SpaRSA, and GRock. We do not report results for the sequential methods (GS and ADMM) because we already ascertained that they are not competitive. The tuning of the free parameters in all the algorithms is the same as in Fig. 1 and Fig. 2.

In Fig. 3 we plot the relative error vs. the CPU time (the latter defined as in Fig. 1 and Fig. 2). Note that this time, in order to plot the relative error, we had to preliminarily estimate $V^*$ (which, as we said, is not known for logistic regression problems). In order to do so, we ran FLEXA with $\sigma = 0.5$ until the merit function value $Z(x^k)$ went below $10^{-6}$, and used the corresponding value of the objective function as estimate of $V^*$. We remark that we used this value only to plot the curves in Fig. 3. The results on Logistic Regression reinforce the conclusions we drew from the experiments on LASSO problems. Actually, Fig. 3 clearly shows that on these problems both FLEXA methods significantly and consistently outperform all other solution methods. In conclusion, our experiments indicate that our algorithmic framework can lead to very efficient and practical solution methods for large-scale problems, with the flexibility to adapt to many different problem characteristics.
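For concreteness, the merit function of item (c) can be computed as in the following sketch (the choice of the $\ell_\infty$-norm here is our reading of the optimality measure; any norm yields a valid stationarity test):

```python
import numpy as np

def merit_Z(grad_F, x, c):
    """Z(x) = || grad F(x) - Pi_{[-c,c]^n}( grad F(x) - x ) ||_inf.
    Z(x) = 0 exactly at stationary points of F(x) + c * ||x||_1,
    and the projection onto the box acts component-wise (np.clip)."""
    proj = np.clip(grad_F - x, -c, c)
    return np.abs(grad_F - proj).max()
```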
[Fig. 3: Relative error vs. time (in seconds) for Logistic Regression: (a) gisette, (b) colon-cancer, (c) leukemia. Curves shown for FLEXA $\sigma = 0$, FLEXA $\sigma = 0.5$, FISTA, SpaRSA, and GRock ($P = 1$ and $P = 20$).]

VII. CONCLUSIONS

We proposed a highly parallelizable algorithmic scheme for the minimization of the sum of a possibly nonconvex differentiable function and a possibly nonsmooth but block-separable convex one. Quite remarkably, our framework leads to different (new) algorithms whose degree of parallelism can be chosen by the user, ranging from fully parallel to sequential schemes, all of them converging under the same conditions. Many well-known sequential and simultaneous solution methods in the literature are just special cases of our algorithmic framework. Our preliminary tests are very promising, showing that our algorithms consistently outperform state-of-the-art schemes. Experiments on larger and more varied classes of problems (including those listed in Section II) are the subject of our current research. We also plan to investigate asynchronous versions of Algorithm 1, the latter being a very important issue in many distributed settings.

APPENDIX

We first introduce some preliminary results instrumental to prove both Theorem 1 and Theorem 2. Hereafter, for notational simplicity, we will omit the dependence of $\hat{x}(y, \tau)$ on $\tau$ and write $\hat{x}(y)$. Given $S \subseteq \mathcal{N}$ and $x \triangleq (x_i)_{i=1}^N$, we will also denote by $(x)_S$ (or interchangeably $x_S$) the vector whose component $i$ is equal to $x_i$ if $i \in S$, and zero otherwise.

A. Intermediate results

Lemma 4: Let $H(x; y) \triangleq \sum_i \tilde{h}_i(x_i; y)$. Then, the following hold:
(i) $H(\cdot; y)$ is uniformly strongly convex on $X$ with constant $c_\tau > 0$, i.e.,

$$(x - w)^T \left( \nabla_x H(x; y) - \nabla_x H(w; y) \right) \ge c_\tau\,\|x - w\|^2, \qquad (12)$$

for all $x, w \in X$ and given $y \in X$;
(ii) $\nabla_x H(x; \cdot)$ is uniformly Lipschitz continuous on $X$, i.e., there exists a $0 < L_{\nabla H} < \infty$ independent of $x$ such that

$$\|\nabla_x H(x; y) - \nabla_x H(x; w)\| \le L_{\nabla H}\,\|y - w\|, \qquad (13)$$

for all $y, w \in X$ and given $x \in X$.

Proof: The proof is standard and thus is omitted.

Proposition 5: Consider Problem (1) under A1-A6. Then the mapping $X \ni y \mapsto \hat{x}(y)$ has the following properties:
(a) $\hat{x}(\cdot)$ is Lipschitz continuous on $X$, i.e., there exists a positive constant $\hat{L}$ such that

$$\|\hat{x}(y) - \hat{x}(z)\| \le \hat{L}\,\|y - z\|, \qquad \forall y, z \in X; \qquad (14)$$

(b) the set of the fixed-points of $\hat{x}(\cdot)$ coincides with the set of stationary solutions of Problem (1); therefore $\hat{x}(\cdot)$ has a fixed-point;
(c) for every given $y \in X$ and for any set $S \subseteq \mathcal{N}$, it holds that

$$(\hat{x}(y) - y)_S^T\, \nabla_x F(y)_S + \sum_{i \in S} g_i(\hat{x}_i(y)) - \sum_{i \in S} g_i(y_i) \le -c_\tau\,\|(\hat{x}(y) - y)_S\|^2, \qquad (15)$$

with $c_\tau \triangleq q \min_i \tau_i$.

Proof: We prove the proposition in the following order: (c), (a), (b).

(c): Given $y \in X$, by definition, each $\hat{x}_i(y)$ is the unique solution of problem (3); then it is not difficult to see that the following holds: for all $z_i \in X_i$,

$$(z_i - \hat{x}_i(y))^T\, \nabla_{x_i} \tilde{h}_i(\hat{x}_i(y); y) + g_i(z_i) - g_i(\hat{x}_i(y)) \ge 0. \qquad (16)$$

Summing and subtracting $\nabla_{x_i} P_i(y_i; y)$ in (16), choosing $z_i = y_i$, and using P2, we get

$$(y_i - \hat{x}_i(y))^T \left( \nabla_{x_i} P_i(\hat{x}_i(y); y) - \nabla_{x_i} P_i(y_i; y) \right) + (y_i - \hat{x}_i(y))^T\, \nabla_{x_i} F(y) + g_i(y_i) - g_i(\hat{x}_i(y)) - \tau_i\,(\hat{x}_i(y) - y_i)^T Q_i(y)\,(\hat{x}_i(y) - y_i) \ge 0, \qquad (17)$$

for all $i \in \mathcal{N}$. Observing that the term on the first line of (17) is non-positive and using P1, we obtain

$$(y_i - \hat{x}_i(y))^T\, \nabla_{x_i} F(y) + g_i(y_i) - g_i(\hat{x}_i(y)) \ge c_\tau\,\|\hat{x}_i(y) - y_i\|^2,$$

for all $i \in \mathcal{N}$. Summing over $i \in S$ we get (15).

(a): We use the notation introduced in Lemma 4. Given $y, z \in X$, by optimality and (16), we have, for all $v$ and $w$ in $X$,

$$(v - \hat{x}(y))^T\, \nabla_x H(\hat{x}(y); y) + G(v) - G(\hat{x}(y)) \ge 0,$$
$$(w - \hat{x}(z))^T\, \nabla_x H(\hat{x}(z); z) + G(w) - G(\hat{x}(z)) \ge 0.$$

Setting $v = \hat{x}(z)$ and $w = \hat{x}(y)$, summing the two inequalities above, and adding and subtracting $\nabla_x H(\hat{x}(y); z)$, we
obtain:

$$(\hat{x}(z) - \hat{x}(y))^T \left( \nabla_x H(\hat{x}(z); z) - \nabla_x H(\hat{x}(y); z) \right) \le (\hat{x}(y) - \hat{x}(z))^T \left( \nabla_x H(\hat{x}(y); z) - \nabla_x H(\hat{x}(y); y) \right). \qquad (18)$$

Using (12) we can lower bound the left-hand side of (18) as

$$(\hat{x}(z) - \hat{x}(y))^T \left( \nabla_x H(\hat{x}(z); z) - \nabla_x H(\hat{x}(y); z) \right) \ge c_\tau\,\|\hat{x}(z) - \hat{x}(y)\|^2, \qquad (19)$$

whereas the right-hand side of (18) can be upper bounded as

$$(\hat{x}(y) - \hat{x}(z))^T \left( \nabla_x H(\hat{x}(y); z) - \nabla_x H(\hat{x}(y); y) \right) \le L_{\nabla H}\,\|\hat{x}(y) - \hat{x}(z)\|\,\|y - z\|, \qquad (20)$$

where the inequality follows from the Cauchy-Schwartz inequality and (13). Combining (18), (19), and (20), we obtain the desired Lipschitz property of $\hat{x}(\cdot)$.

(b): Let $x^* \in X$ be a fixed point of $\hat{x}(y)$, that is, $x^* = \hat{x}(x^*)$. Each $\hat{x}_i(y)$ satisfies (16) for any given $y \in X$. For some $\xi_i \in \partial g_i(x_i^*)$, setting $y = x^*$ and using $x^* = \hat{x}(x^*)$ and the convexity of $g_i$, (16) reduces to

$$(z_i - x_i^*)^T \left( \nabla_{x_i} F(x^*) + \xi_i \right) \ge 0, \qquad (21)$$

for all $z_i \in X_i$ and $i \in \mathcal{N}$. Taking into account the Cartesian structure of $X$, the separability of $G$, and summing (21) over $i \in \mathcal{N}$, we obtain $(z - x^*)^T (\nabla F(x^*) + \xi) \ge 0$ for all $z \in X$, with $z \triangleq (z_i)_{i=1}^N$ and $\xi \triangleq (\xi_i)_{i=1}^N \in \partial G(x^*)$; therefore $x^*$ is a stationary solution of (1). The converse holds because (i) $\hat{x}(x^*)$ is the unique optimal solution of (3) with $y = x^*$, and (ii) $x^*$ is also an optimal solution of (3), since it satisfies the minimum principle.

Lemma 6 ([35, Lemma 3.4, p. 121]): Let $\{X^k\}$, $\{Y^k\}$, and $\{Z^k\}$ be three sequences of numbers such that $Y^k \ge 0$ for all $k$. Suppose that

$$X^{k+1} \le X^k - Y^k + Z^k, \qquad k = 0, 1, \ldots,$$

and $\sum_{k=0}^\infty Z^k < \infty$. Then either $X^k \to -\infty$, or else $\{X^k\}$ converges to a finite value and $\sum_{k=0}^\infty Y^k < \infty$.

Lemma 7: Let $\{x^k\}$ be the sequence generated by Algorithm 1. Then, there is a positive constant $c$ such that the following holds: for all $k \ge 1$,

$$\left( \nabla_x F(x^k) \right)_{S^k}^T \left( \hat{x}(x^k) - x^k \right)_{S^k} + \sum_{i \in S^k} g_i(\hat{x}_i(x^k)) - \sum_{i \in S^k} g_i(x_i^k) \le -c\,\|\hat{x}(x^k) - x^k\|^2. \qquad (22)$$

Proof: Let $j_k$ be an index in $S^k$ such that $E_{j_k}(x^k) \ge \rho \max_i E_i(x^k)$ (Step 3 of Algorithm 1). Then, using the aforementioned bound and (4), it is easy to check that the following chain of inequalities holds:

$$\bar{s}_{j_k}\,\|\hat{x}_{S^k}(x^k) - x^k_{S^k}\| \ge \bar{s}_{j_k}\,\|\hat{x}_{j_k}(x^k) - x^k_{j_k}\| \ge E_{j_k}(x^k) \ge \rho \max_i E_i(x^k) \ge \rho \left( \min_i \underline{s}_i \right) \max_i \{\|\hat{x}_i(x^k) - x_i^k\|\} \ge \frac{\rho}{N} \left( \min_i \underline{s}_i \right) \|\hat{x}(x^k) - x^k\|.$$

Hence we have, for any $k$,

$$\|\hat{x}_{S^k}(x^k) - x^k_{S^k}\| \ge \frac{\rho\,\min_i \underline{s}_i}{N\,\bar{s}_{j_k}}\,\|\hat{x}(x^k) - x^k\|. \qquad (23)$$

Invoking now Proposition 5(c) with $S = S^k$ and $y = x^k$, and using (23), (22) holds true with $c \triangleq c_\tau \left( \frac{\rho\,\min_i \underline{s}_i}{N\,\max_j \bar{s}_j} \right)^2$.

B. Proof of Theorem 1

We are now ready to prove the theorem. For any given $k \ge 0$, the Descent Lemma [26] yields

$$F(x^{k+1}) \le F(x^k) + \gamma^k\, \nabla F(x^k)^T (\hat{z}^k - x^k) + \frac{(\gamma^k)^2 L_{\nabla F}}{2}\,\|\hat{z}^k - x^k\|^2, \qquad (24)$$

with $\hat{z}^k \triangleq (\hat{z}_i^k)_{i=1}^N$ and $z^k \triangleq (z_i^k)_{i=1}^N$ defined in Steps 3 and 4 (Algorithm 1). Observe that

$$\|\hat{z}^k - x^k\|^2 \le \|z^k - x^k\|^2 \le 2\,\|\hat{x}(x^k) - x^k\|^2 + 2 \sum_{i \in \mathcal{N}} \|z_i^k - \hat{x}_i(x^k)\|^2 \le 2\,\|\hat{x}(x^k) - x^k\|^2 + 2 \sum_{i \in \mathcal{N}} (\varepsilon_i^k)^2, \qquad (25)$$

where the first inequality follows from the definition of $z^k$ and $\hat{z}^k$, and in the last inequality we used $\|z_i^k - \hat{x}_i(x^k)\| \le \varepsilon_i^k$. Denoting by $\bar{S}^k$ the complement of $S^k$, we also have, for $k$ large enough,

$$\begin{aligned} \nabla F(x^k)^T (\hat{z}^k - x^k) &= \nabla F(x^k)^T (\hat{z}^k - \hat{x}(x^k) + \hat{x}(x^k) - x^k)\\ &= \nabla F(x^k)_{S^k}^T (z^k - \hat{x}(x^k))_{S^k} + \nabla F(x^k)_{\bar{S}^k}^T (x^k - \hat{x}(x^k))_{\bar{S}^k} + \nabla F(x^k)_{S^k}^T (\hat{x}(x^k) - x^k)_{S^k} + \nabla F(x^k)_{\bar{S}^k}^T (\hat{x}(x^k) - x^k)_{\bar{S}^k}\\ &= \nabla F(x^k)_{S^k}^T (z^k - \hat{x}(x^k))_{S^k} + \nabla F(x^k)_{S^k}^T (\hat{x}(x^k) - x^k)_{S^k}, \end{aligned} \qquad (26)$$

where in the second equality we used the definition of $\hat{z}^k$ and of the set $S^k$. Now, using (26) and Lemma 7, we can write
$$\begin{aligned} \nabla F(x^k)^T (\hat{z}^k - x^k) &+ \sum_{i \in S^k} g_i(\hat{z}_i^k) - \sum_{i \in S^k} g_i(x_i^k)\\ &= \nabla F(x^k)^T (\hat{z}^k - x^k) + \sum_{i \in S^k} g_i(\hat{x}_i(x^k)) - \sum_{i \in S^k} g_i(x_i^k) + \sum_{i \in S^k} g_i(\hat{z}_i^k) - \sum_{i \in S^k} g_i(\hat{x}_i(x^k))\\ &\le -c\,\|\hat{x}(x^k) - x^k\|^2 + \sum_{i \in S^k} \varepsilon_i^k\,\|\nabla_{x_i} F(x^k)\| + L_G \sum_{i \in S^k} \varepsilon_i^k, \end{aligned} \qquad (27)$$

where $L_G$ is a (global) Lipschitz constant for (all) $g_i$. Finally, from the definition of $\hat{z}^k$ and of the set $S^k$, we have, for all $k$ large enough,

$$\begin{aligned} V(x^{k+1}) &= F(x^{k+1}) + \sum_i g_i(x_i^{k+1}) = F(x^{k+1}) + \sum_i g_i\big(x_i^k + \gamma^k (\hat{z}_i^k - x_i^k)\big)\\ &\le F(x^{k+1}) + \sum_i g_i(x_i^k) + \gamma^k \sum_{i \in S^k} \big( g_i(\hat{z}_i^k) - g_i(x_i^k) \big)\\ &\le V(x^k) - \gamma^k \left( c - \gamma^k L_{\nabla F} \right) \|\hat{x}(x^k) - x^k\|^2 + T^k, \end{aligned} \qquad (28)$$

where in the first inequality we used the convexity of the $g_i$'s, whereas the second one follows from (24), (25) and (27), with

$$T^k \triangleq \gamma^k \left( L_G + \|\nabla F(x^k)\| \right) \sum_{i \in S^k} \varepsilon_i^k + (\gamma^k)^2 L_{\nabla F} \sum_{i \in \mathcal{N}} (\varepsilon_i^k)^2.$$

Using assumption (v), we can bound $T^k$ as

$$T^k \le (\gamma^k)^2 \left[ N \alpha_1 (\alpha_2 L_G + 1) + (\gamma^k)^2 L_{\nabla F} (N \alpha_1 \alpha_2)^2 \right],$$

which, by assumption (iv), implies $\sum_{k=0}^\infty T^k < \infty$. Since $\gamma^k \to 0$, it follows from (28) that there exist some positive constant $\beta_1$ and a sufficiently large $k$, say $\bar{k}$, such that

$$V(x^{k+1}) \le V(x^k) - \gamma^k \beta_1\,\|\hat{x}(x^k) - x^k\|^2 + T^k, \qquad (29)$$

for all $k \ge \bar{k}$. Invoking Lemma 6 with the identifications $X^k = V(x^{k+1})$, $Y^k = \gamma^k \beta_1 \|\hat{x}(x^k) - x^k\|^2$ and $Z^k = T^k$, while using $\sum_{k=0}^\infty T^k < \infty$, we deduce from (29) that either $\{V(x^k)\} \to -\infty$ or else $\{V(x^k)\}$ converges to a finite value and

$$\lim_{k \to \infty} \sum_{t = \bar{k}}^{k} \gamma^t\,\|\hat{x}(x^t) - x^t\|^2 < +\infty. \qquad (30)$$

Since $V$ is coercive, $V(x) \ge \min_{y \in X} V(y) > -\infty$, implying that $\{V(x^k)\}$ is convergent; it follows from (30) and $\sum_k \gamma^k = \infty$ that $\liminf_{k \to \infty} \|\hat{x}(x^k) - x^k\| = 0$.

Using Proposition 5, we show next that $\lim_{k \to \infty} \|\hat{x}(x^k) - x^k\| = 0$; for notational simplicity we will write $\Delta\hat{x}(x^k) \triangleq \hat{x}(x^k) - x^k$. Suppose, by contradiction, that $\limsup_{k \to \infty} \|\Delta\hat{x}(x^k)\| > 0$. Then, there exists a $\delta > 0$ such that $\|\Delta\hat{x}(x^k)\| > 2\delta$ for infinitely many $k$ and also $\|\Delta\hat{x}(x^k)\| < \delta$ for infinitely many $k$. Therefore, one can always find an infinite set of indexes, say $K$, having the following properties: for any $k \in K$, there exists an integer $i_k > k$ such that

$$\|\Delta\hat{x}(x^k)\| < \delta, \qquad \|\Delta\hat{x}(x^{i_k})\| > 2\delta, \qquad (31)$$
$$\delta \le \|\Delta\hat{x}(x^j)\| \le 2\delta, \qquad k < j < i_k. \qquad (32)$$

Given the above bounds, the following holds: for all $k \in K$,

$$\begin{aligned} \delta &\overset{(a)}{<} \|\Delta\hat{x}(x^{i_k})\| - \|\Delta\hat{x}(x^k)\| \le \|\hat{x}(x^{i_k}) - \hat{x}(x^k)\| + \|x^{i_k} - x^k\| \qquad (33)\\ &\overset{(b)}{\le} (1 + \hat{L})\,\|x^{i_k} - x^k\| \qquad (34)\\ &\overset{(c)}{\le} (1 + \hat{L}) \sum_{t=k}^{i_k - 1} \gamma^t \left( \|(\Delta\hat{x}(x^t))_{S^t}\| + \|(z^t - \hat{x}(x^t))_{S^t}\| \right)\\ &\overset{(d)}{\le} (1 + \hat{L})\,(2\delta + \varepsilon^{\max}) \sum_{t=k}^{i_k - 1} \gamma^t, \qquad (35) \end{aligned}$$

where (a) follows from (31); (b) is due to Proposition 5(a); (c) comes from the triangle inequality, the updating rule of the algorithm and the definition of $\hat{z}^k$; and in (d) we used (31), (32), and $\|z^t - \hat{x}(x^t)\| \le \sum_{i \in \mathcal{N}} \varepsilon_i^t$, where $\varepsilon^{\max} \triangleq \max_k \sum_{i \in \mathcal{N}} \varepsilon_i^k < \infty$. It follows from (35) that

$$\liminf_{k \to \infty} \sum_{t=k}^{i_k - 1} \gamma^t \ge \frac{\delta}{(1 + \hat{L})(2\delta + \varepsilon^{\max})} > 0. \qquad (36)$$

We show next that (36) is in contradiction with the convergence of $\{V(x^k)\}$. To do that, we preliminarily prove that, for sufficiently large $k \in K$, it must be $\|\Delta\hat{x}(x^k)\| \ge \delta/2$. Proceeding as in (35), we have: for any given $k \in K$,

$$\|\Delta\hat{x}(x^{k+1})\| - \|\Delta\hat{x}(x^k)\| \le (1 + \hat{L})\,\|x^{k+1} - x^k\| \le (1 + \hat{L})\,\gamma^k \left( \|\Delta\hat{x}(x^k)\| + \varepsilon^{\max} \right).$$

It turns out that for sufficiently large $k \in K$, so that $(1 + \hat{L})\gamma^k < \delta/(\delta + 2\varepsilon^{\max})$, it must be

$$\|\Delta\hat{x}(x^k)\| \ge \delta/2; \qquad (37)$$

otherwise the condition $\|\Delta\hat{x}(x^{k+1})\| \ge \delta$ would be violated [cf. (32)]. Hereafter we assume without loss of generality that (37) holds for all $k \in K$ (in fact, one can always restrict $\{x^k\}_{k \in K}$ to a proper subsequence). We can show now that (36) is in contradiction with the convergence of $\{V(x^k)\}$. Using (29) (possibly over a subsequence), we have: for sufficiently large $k \in K$,

$$V(x^{i_k}) \le V(x^k) - \beta_2 \sum_{t=k}^{i_k - 1} \gamma^t\,\|\Delta\hat{x}(x^t)\|^2 + \sum_{t=k}^{i_k - 1} T^t \overset{(a)}{<} V(x^k) - \beta_2\,(\delta^2/4) \sum_{t=k}^{i_k - 1} \gamma^t + \sum_{t=k}^{i_k - 1} T^t, \qquad (38)$$

where in (a) we used (32) and (37), and $\beta_2$ is some positive constant. Since $\{V(x^k)\}$ converges and $\sum_{k=0}^\infty T^k < \infty$, (38) implies $\lim_{K \ni k \to \infty} \sum_{t=k}^{i_k - 1} \gamma^t = 0$, which contradicts (36).
Finally, since the sequence $\{x^{k}\}$ is bounded [due to the coercivity of $V$ and the convergence of $\{V(x^{k})\}$], it has at least one limit point $\bar{x}$ that must belong to $X$. By the continuity of $\hat{x}(\cdot)$ [Proposition 5(a)] and $\lim_{k\to\infty}\|\hat{x}(x^{k})-x^{k}\|=0$, it must be $\hat{x}(\bar{x})=\bar{x}$. By Proposition 5(b), $\bar{x}$ is also a stationary solution of Problem (1). As a final remark, note that if $\varepsilon_{i}^{k}=0$ for every $i$ and for every $k$ large enough, i.e., if eventually $\hat{x}(x^{k})$ is computed exactly, there is no need to assume that $G$ is globally Lipschitz. In fact, in (27) the term containing $L_{G}$ disappears, all the terms $T^{k}$ are zero, and all the subsequent derivations become independent of the Lipschitzianity of $G$. $\blacksquare$
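This remark matters in practice: for the LASSO instance of (1), i.e., $G(x)=\lambda\|x\|_{1}$, with scalar blocks and a diagonal quadratic model of $F$, each best response $\hat{x}_{i}(\cdot)$ is available exactly in closed form via soft-thresholding, so the exact case $\varepsilon_{i}^{k}=0$ applies. The following Python sketch illustrates this under those assumptions; the function names and the proximal weight $\tau$ are illustrative choices, not an implementation from the paper.
\begin{verbatim}
import numpy as np

# Minimal sketch (illustrative; assumes scalar blocks, G = lam*||x||_1, and a
# diagonal quadratic model of F with weight tau > 0): each best response
#   hat{x}_i(x) = argmin_z  gF[i]*(z - x[i]) + (tau/2)*(z - x[i])^2 + lam*|z|
# is solved exactly by soft-thresholding, so eps_i^k = 0 for all k.

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def exact_best_response(x, gF, tau, lam):
    # Completing the square shows the minimizer is the proximal operator of
    # (lam/tau)*|.| evaluated at x - gF/tau.
    return soft_threshold(x - gF / tau, lam / tau)

# Toy usage: F(x) = 0.5*||A x - b||^2, one exact (Jacobi) best-response step.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((30, 8)), rng.standard_normal(30)
x = np.zeros(8)
xhat = exact_best_response(x, A.T @ (A @ x - b), tau=5.0, lam=0.1)
print(xhat)
\end{verbatim}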
C. Proof of Theorem 2

We show next that Algorithm 2 is just an instance of the inexact Jacobi scheme described in Algorithm 1 satisfying the convergence conditions in Theorem 1, which proves Theorem 2. It is not difficult to see that this boils down to proving that, for all $p\in\mathcal{P}$ and $i\in I_{p}$, the sequence $z_{p,i}^{k}$ in Step 2a) of Algorithm 2 satisfies
\[
\|z_{p,i}^{k}-\hat{x}_{p,i}(x^{k})\|\;\leq\;\tilde{\varepsilon}_{p,i}^{k}, \tag{39}
\]
for some $\{\tilde{\varepsilon}_{p,i}^{k}\}$ such that $\sum_{k}\tilde{\varepsilon}_{p,i}^{k}\,\gamma^{k}<\infty$. The following holds for the LHS of (39):
\[
\begin{aligned}
\|z_{p,i}^{k}-\hat{x}_{p,i}(x^{k})\| &\leq \|\hat{x}_{p,i}(x_{p<}^{k+1},x_{p}^{k},x_{p>}^{k})-\hat{x}_{p,i}(x^{k})\|+\|z_{p,i}^{k}-\hat{x}_{p,i}(x_{p<}^{k+1},x_{p}^{k},x_{p>}^{k})\|\\
&\overset{(a)}{\leq} \|\hat{x}_{p,i}(x_{p<}^{k+1},x_{p}^{k},x_{p>}^{k})-\hat{x}_{p,i}(x^{k})\|+\varepsilon_{p,i}^{k}\\
&\overset{(b)}{\leq} \hat{L}\,\|x_{p<}^{k+1}-x_{p<}^{k}\|+\varepsilon_{p,i}^{k}\\
&\overset{(c)}{=} \hat{L}\,\gamma^{k}\,\|z_{p<}^{k}-x_{p<}^{k}\|+\varepsilon_{p,i}^{k}\\
&\leq \hat{L}\,\gamma^{k}\sum_{j<p}\Big(\|z_{j}^{k}-\hat{x}_{j}(x^{k})\|+\|\hat{x}_{j}(x^{k})-x_{j}^{k}\|\Big)+\varepsilon_{p,i}^{k}\\
&\overset{(d)}{\leq} \hat{L}\,\gamma^{k}\beta+\hat{L}\,\gamma^{k}\sum_{j<p}\tilde{\varepsilon}_{j}^{k}+\varepsilon_{p,i}^{k},
\end{aligned}
\]
where the sums run over the blocks updated by the processors preceding $p$; (a) follows from the error bound in Step 2a) of Algorithm 2; in (b) we used Proposition 5(a); (c) follows from Step 2b); and in (d) we used induction on $p$, where $\beta<\infty$ is a positive constant. It turns out that (39) is satisfied by choosing
\[
\tilde{\varepsilon}_{p,i}^{k}\;\triangleq\;\hat{L}\,\gamma^{k}\beta+\hat{L}\,\gamma^{k}\sum_{j<p}\tilde{\varepsilon}_{j}^{k}+\varepsilon_{p,i}^{k}. \qquad\blacksquare
\]
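The recursive definition of $\tilde{\varepsilon}_{p,i}^{k}$ can be unrolled numerically. The sketch below (with hypothetical constants $\hat{L}$, $\beta$, and inner errors already of order $\gamma^{k}$) shows that each $\tilde{\varepsilon}_{p,i}^{k}$ remains $O(\gamma^{k})$, so that $\sum_{k}\tilde{\varepsilon}_{p,i}^{k}\gamma^{k}<\infty$ whenever the stepsize is square-summable, as required for (39).
\begin{verbatim}
# Minimal sketch (hypothetical constants) unrolling the recursion
#   tilde_eps[p] = Lhat*gamma*beta + Lhat*gamma*sum(tilde_eps[:p]) + eps[p]
# across processors p = 0, 1, ...; each bound stays O(gamma), as needed
# for (39) and the summability condition of Theorem 1.

def recursive_error_bounds(gamma, eps, Lhat, beta):
    tilde = []
    for p in range(len(eps)):
        tilde.append(Lhat * gamma * beta + Lhat * gamma * sum(tilde) + eps[p])
    return tilde

gamma, Lhat, beta = 0.01, 2.0, 5.0
eps = [0.1 * gamma] * 4            # inner errors assumed O(gamma^k)
print(recursive_error_bounds(gamma, eps, Lhat, beta))
# every entry stays O(gamma), uniformly in p
\end{verbatim}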
REFERENCES

[1] F. Facchinei, S. Sagratella, and G. Scutari, "Flexible parallel algorithms for big data optimization," in Proc. of the IEEE 2014 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, May 4-9, 2014. [Online]. Available: http://arxiv.org/abs/1311.2444
[2] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), pp. 267-288, 1996.
[3] Z. Qin, K. Scheinberg, and D. Goldfarb, "Efficient block-coordinate descent algorithms for the group lasso," Mathematical Programming Computation, vol. 5, pp. 143-169, June 2013.
[4] A. Rakotomamonjy, "Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms," Signal Processing, vol. 91, no. 7, pp. 1505-1526, July 2011.
[5] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, "A comparison of optimization methods and software for large-scale l1-regularized linear classification," The Journal of Machine Learning Research, vol. 9999, pp. 3183-3234, 2010.
[6] R. H. Byrd, J. Nocedal, and F. Oztoprak, "An inexact successive quadratic approximation method for convex L-1 regularized optimization," arXiv preprint arXiv:1309.3529, 2013.
[7] K. Fountoulakis and J. Gondzio, "A second-order method for strongly convex L1-regularization problems," arXiv preprint arXiv:1306.5386, 2013.
[8] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341-362, 2012.
[9] I. Necoara and D. Clipici, "Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC," Journal of Process Control, vol. 23, no. 3, pp. 243-253, March 2013.
[10] Y. Nesterov, "Gradient methods for minimizing composite functions," Mathematical Programming, vol. 140, pp. 125-161, August 2013.
[11] P. Tseng and S. Yun, "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, vol. 117, no. 1-2, pp. 387-423, March 2009.
[12] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183-202, Jan. 2009.
[13] S. J. Wright, R. D. Nowak, and M. A. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. on Signal Processing, vol. 57, no. 7, pp. 2479-2493, July 2009.
[14] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, "Parallel coordinate descent for l1-regularized loss minimization," in Proc. of the 28th International Conference on Machine Learning, Bellevue, WA, USA, June 28-July 2, 2011.
[15] Z. Peng, M. Yan, and W. Yin, "Parallel and distributed sparse optimization," 2013. [Online]. Available: http://www.caam.rice.edu/optimization/disparse/
[16] P. Richtárik and M. Takáč, "Parallel coordinate descent methods for big data optimization," arXiv preprint arXiv:1212.0873, 2012.
[17] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126-1153, 2013.
[18] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data. Springer, 2011.
[19] S. Sra, S. Nowozin, and S. J. Wright, Eds., Optimization for Machine Learning, ser. Neural Information Processing. Cambridge, Massachusetts: The MIT Press, Sept. 2011.
[20] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with Sparsity-Inducing Penalties, ser. Foundations and Trends in Machine Learning. Now Publishers Inc., Dec. 2011.
[21] G. Scutari, F. Facchinei, P. Song, D. Palomar, and J.-S. Pang, "Decomposition by partial linearization: Parallel optimization of multi-agent systems," IEEE Trans. on Signal Processing, vol. 62, pp. 641-656, Feb. 2014.
[22] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49-67, 2006.
[23] S. K. Shevade and S. S. Keerthi, "A simple and efficient algorithm for gene selection using sparse logistic regression," Bioinformatics, vol. 19, no. 17, pp. 2246-2253, 2003.
[24] L. Meier, S. van de Geer, and P. Bühlmann, "The group lasso for logistic regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53-71, 2008.
[25] D. Goldfarb, S. Ma, and K. Scheinberg, "Fast alternating linearization methods for minimizing the sum of two convex functions," Mathematical Programming, vol. 141, pp. 349-382, Oct. 2013.
[26] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Athena Scientific Press, 1989.
[27] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer-Verlag, New York, 2003.
[28] G. Cohen, "Optimization by decomposition and coordination: A unified approach," IEEE Trans. on Automatic Control, vol. 23, no. 2, pp. 222-232, April 1978.
[29] ——, "Auxiliary problem principle and decomposition of optimization problems," Journal of Optimization Theory and Applications, vol. 32, no. 3, pp. 277-305, Nov. 1980.
[30] M. Patriksson, "Cost approximation: a unified framework of descent algorithms for nonlinear programs," SIAM Journal on Optimization, vol. 8, no. 2, pp. 561-582, 1998.
[31] M. Fukushima and H. Mine, "A generalized proximal point algorithm for certain non-convex minimization problems," International Journal of Systems Science, vol. 12, no. 8, pp. 989-1000, 1981.
[32] H. Mine and M. Fukushima, "A minimization method for the sum of a convex function and a continuously differentiable function," Journal of Optimization Theory and Applications, vol. 33, no. 1, pp. 9-23, Jan. 1981.
[33] Y. Saad, Numerical Methods for Large Eigenvalue Problems, ser. Classics in Applied Mathematics (Book 66). SIAM, revised edition, May 2011.
[34] Z.-Q. Luo and M. Hong, "On the linear convergence of the alternating direction method of multipliers," arXiv preprint arXiv:1208.3922, 2012.
[35] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Cambridge, Massachusetts: Athena Scientific Press, May 2011.