arXiv:1311.2444v1 [cs.DC] 11 Nov 2013

FLEXIBLE PARALLEL ALGORITHMS FOR BIG DATA OPTIMIZATION

Francisco Facchinei (1), Simone Sagratella (1), Gesualdo Scutari (2)
(1) Dept. of Computer, Control, and Management Eng., University of Rome "La Sapienza", Roma, Italy.
(2) Dept. of Electrical Eng., State University of New York (SUNY) at Buffalo, Buffalo, NY 14260, USA.

ABSTRACT

We propose a decomposition framework for the parallel optimization of the sum of a differentiable function and a (block) separable nonsmooth, convex one. The latter term is typically used to enforce structure in the solution as, for example, in Lasso problems. Our framework is very flexible and includes both fully parallel Jacobi schemes and Gauss-Seidel (Southwell-type) ones, as well as virtually all possibilities in between (e.g., gradient- or Newton-type methods) with only a subset of variables updated at each iteration. Our theoretical convergence results improve on existing ones, and numerical results show that the new method compares favorably to existing algorithms.

Index Terms: Parallel optimization, Jacobi method, Lasso, Sparse solution.

1. INTRODUCTION

The minimization of the sum of a smooth function, $F$, and of a nonsmooth (separable) convex one, $G$,

$$\min_{x \in X} \; V(x) \triangleq F(x) + G(x), \qquad (1)$$

is a ubiquitous problem that arises in many fields of engineering, as diverse as compressed sensing, basis pursuit denoising, sensor networks, neuroelectromagnetic imaging, machine learning, data mining, sparse logistic regression, genomics, meteorology, tensor factorization and completion, geophysics, and radio astronomy. Usually the nonsmooth term is used to promote sparsity of the optimal solution, which often corresponds to a parsimonious representation of some phenomenon at hand. Many of the mentioned applications can give rise to extremely large problems, so that standard optimization techniques are hardly applicable.
And indeed, recent years have witnessed a flurry of research activity aimed at developing solution methods that are simple (for example, based solely on matrix/vector multiplications) yet capable of converging to a good approximate solution in reasonable time. It is hardly possible here to even summarize the huge amount of work done in this field; we refer the reader to the recent works [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] as entry points to the literature.

It is clear, however, that if one wants to solve really large problems, parallel methods exploiting the computational power of multicore processors have to be employed. It is then surprising that, while serial solution methods for Problem (1) have been widely investigated, the analysis of parallel algorithms suitable for large-scale implementations lags behind. Gradient-type methods can of course be easily parallelized. However, beyond that, we are aware of only very few papers, all very recent, that deal with parallel solution methods [2, 6, 13, 17]. These papers analyze both randomized and deterministic block Coordinate Descent Methods (CDMs) that, essentially, are still (regularized) gradient-based methods. One advantage of the analyses in [2, 6, 13, 17] is that they provide an interesting (global) rate of convergence. On the other hand, they apply only to convex problems and are not flexible enough to include, among other things, very natural Jacobi-type methods (where at each iteration a partial minimization of the original function is performed with respect to a block variable while all other variables are kept fixed) and the possibility to deal with a nonconvex $F$.

(The work of Scutari was supported by the USA National Science Foundation under Grants CMS 1218717 and CAREER Award No. 1254739.)
In this paper, building on the approach proposed in [20, 21], we present a broad, deterministic algorithmic framework for the solution of Problem (1) with the following novel features: i) it is parallel, with a degree of parallelism that can be chosen by the user and that can go from complete parallelism (each variable is updated in parallel to all the others) to the sequential case (only one variable is updated at each iteration); ii) it can tackle a nonconvex $F$; iii) it is very flexible and includes, among others, updates based on gradient- or Newton-type methods; and iv) it easily allows for inexact solutions. Our framework allows us to define different schemes, all converging under the same conditions, that can accommodate different problem features and algorithmic requirements. Even in the most studied case in which $F$ is convex and $G(x) \equiv 0$ our results compare favourably to existing ones, and the numerical results show our approach to be very promising.

2. PROBLEM DEFINITION

We consider Problem (1), where the feasible set $X = X_1 \times \cdots \times X_N$ is a Cartesian product of lower dimensional convex sets $X_i \subseteq \mathbb{R}^{n_i}$, and $x \in \mathbb{R}^n$ is partitioned accordingly into $x = (x_1, \ldots, x_N)$, with each $x_i \in \mathbb{R}^{n_i}$. $F$ is smooth (and not necessarily convex) and $G$ is convex and possibly nondifferentiable, with $G(x) = \sum_{i=1}^N g_i(x_i)$ and $x_i \in X_i$. This format is very general and includes problems of great interest. Below we list some instances of Problem (1).

- $G(x) = 0$; in this case the problem reduces to the minimization of a smooth, possibly nonconvex function with convex constraints.
- $F(x) = \|Ax - b\|^2$ and $G(x) = c\|x\|_1$, $X = \mathbb{R}^n$, with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}_{++}$ given constants; this is the very famous and much studied Lasso problem [22].
- $F(x) = \|Ax - b\|^2$ and $G(x) = c \sum_{i=1}^N \|x_i\|_2$, $X = \mathbb{R}^n$, with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}_{++}$ given constants; this is the group Lasso problem [23].
- $F(x) = \sum_{j=1}^m \log(1 + e^{-a_j y_j^T x})$ and $G(x) = c\|x\|_1$ (or $G(x) = c \sum_{i=1}^N \|x_i\|_2$), with $y_j \in \mathbb{R}^n$, $a_j \in \mathbb{R}$, and $c \in \mathbb{R}_{++}$ given constants; this is the sparse logistic regression problem [24, 25].
- $F(x) = \sum_{j=1}^m \max\{0, 1 - a_j y_j^T x\}^2$ and $G(x) = c\|x\|_1$, with $a_j \in \{-1, 1\}$, $y_j \in \mathbb{R}^n$, and $c \in \mathbb{R}_{++}$ given; this is the $\ell_1$-regularized $\ell_2$-loss Support Vector Machine problem, see e.g. [18].

Other problems that can be cast in the form (1) include the Nuclear Norm Minimization problem, the Robust Principal Component Analysis problem, the Sparse Inverse Covariance Selection problem, and the Nonnegative Matrix (or Tensor) Factorization problem; see e.g. [16, 26] and references therein.

Given (1), we make the following standard, blanket assumptions:
(A1) Each $X_i$ is nonempty, closed, and convex;
(A2) $F$ is $C^1$ on an open set containing $X$;
(A3) $\nabla F$ is Lipschitz continuous on $X$ with constant $L_F$;
(A4) $G(x) = \sum_{i=1}^N g_i(x_i)$, with all $g_i$ continuous and convex on $X_i$;
(A5) $V$ is coercive.

3. MAIN RESULTS

We want to develop parallel solution methods for Problem (1) whereby operations can be carried out on some or (possibly) all (block) variables $x_i$ at the same time. The most natural parallel (Jacobi-type) method one can think of is updating all blocks simultaneously: given $x^k$, each (block) variable $x_i^{k+1}$ is computed as the solution of $\min_{x_i} [F(x_i, x_{-i}^k) + g_i(x_i)]$ (where $x_{-i}$ denotes the vector obtained from $x$ by deleting the block $x_i$). Unfortunately this method converges only under very restrictive conditions [27] that are seldom verified in practice. To cope with this issue we introduce some "memory" and set the new point to be a convex combination of $x^k$ and the solutions of $\min_{x_i} [F(x_i, x_{-i}^k) + g_i(x_i)]$. However, our framework has many additional features, as discussed next.

Approximating $F$: Solving each $\min_{x_i} [F(x_i, x_{-i}^k) + g_i(x_i)]$ may be too costly or difficult in some situations. One may then prefer to approximate this problem, in some suitable sense, in order to facilitate the task of computing the new iterate. To this end, we assume that for all $i \in \mathcal{N} \triangleq \{1, \ldots, N\}$ we can define a function $P_i(z; w) : X_i \times X \to \mathbb{R}$ having the following properties (we denote by $\nabla P_i$ the partial gradient of $P_i$ with respect to $z$):
(P1) $P_i(\cdot; w)$ is convex and continuously differentiable on $X_i$ for all $w \in X$;
(P2) $\nabla P_i(x_i; x) = \nabla_{x_i} F(x)$ for all $x \in X$;
(P3) $\nabla P_i(z; \cdot)$ is Lipschitz continuous on $X$ for all $z \in X_i$.

Such a function $P_i$ should be regarded as a (simple) convex approximation of $F$ at the point $x$ with respect to the block of variables $x_i$ that preserves the first order properties of $F$ with respect to $x_i$. Based on this approximation we can define at any point $x^k \in X$ a regularized approximation $\tilde h_i(x_i; x^k)$ of $V$ with respect to $x_i$, where $F$ is replaced by $P_i$ while the nondifferentiable term is preserved, and a quadratic regularization is added to make the overall approximation strongly convex. More formally, we have

$$\tilde h_i(x_i; x^k) \triangleq \underbrace{P_i(x_i; x^k) + \frac{\tau_i}{2} (x_i - x_i^k)^T Q_i(x^k) (x_i - x_i^k)}_{\triangleq\, h_i(x_i; x^k)} + g_i(x_i),$$

where $Q_i(x^k)$ is an $n_i \times n_i$ positive definite matrix (possibly dependent on $x^k$), satisfying the following conditions.
(A6) All matrices $Q_i(x^k)$ are uniformly positive definite with a common positive definiteness constant $q > 0$; furthermore, $Q_i(\cdot)$ is Lipschitz continuous on $X$.

Note that in most cases (and in all our numerical experiments) the $Q_i$ are constant and equal to the identity matrix, so that (A6) is automatically satisfied. Associated with each $i$ and point $x^k \in X$ we can define the following optimal solution map:

$$\hat x_i(x^k, \tau_i) \triangleq \arg\min_{x_i \in X_i} \tilde h_i(x_i; x^k). \qquad (2)$$

Note that $\hat x_i(x^k, \tau_i)$ is always well-defined, since the optimization problem in (2) is strongly convex. Given (2), we can then introduce

$$X \ni y \mapsto \hat x(y, \tau) \triangleq (\hat x_i(y, \tau_i))_{i=1}^N.$$

The algorithm we are about to describe is based on the computation of $\hat x$. Therefore the approximating functions $P_i$ should lead to functions $\hat x_i$ that are as easy to compute as possible. An appropriate choice depends on the problem at hand and on computational requirements. We discuss some possible choices for the approximations $P_i$ after introducing the main algorithm (Algorithm 1).

Inexact solutions: In many situations (especially in the case of large-scale problems), it can be useful to further reduce the computational effort needed to solve the subproblems in (2) by allowing inexact computations $z_i^k$ of $\hat x_i(x^k, \tau_i)$, i.e., $\|z_i^k - \hat x_i(x^k, \tau_i)\| \le \varepsilon_i^k$, where $\varepsilon_i^k$ measures the accuracy in computing the solution.

Updating only some blocks: Another important feature of our algorithmic framework is the possibility of updating only some of the variables at each iteration. Essentially, we prove convergence assuming that at each iteration only a subset of the variables is updated, under the condition that this subset contains at least one (block) component whose distance from optimality is within a factor $\rho \in (0, 1]$ of the largest such distance over all blocks, in the sense explained next. First of all, $x_i^k$ is optimal for $\tilde h_i(x_i; x^k)$ if and only if $\hat x_i(x^k, \tau_i) = x_i^k$. Ideally we would then like to select the indices to update according to the optimality measure $\|\hat x_i(x^k, \tau_i) - x_i^k\|$; but in some situations this could be computationally too expensive.
In order to be able to develop alternative choices, based on the same idea, we suppose one can compute an error bound, i.e., a function $E_i(x)$ such that

$$\underline{s}_i \|\hat x_i(x^k, \tau_i) - x_i^k\| \le E_i(x^k) \le \bar s_i \|\hat x_i(x^k, \tau_i) - x_i^k\|, \qquad (3)$$

for some $0 < \underline{s}_i \le \bar s_i$. Of course we can always set $E_i(x^k) = \|\hat x_i(x^k, \tau_i) - x_i^k\|$, but other choices are also possible; we discuss some of them after introducing the algorithm.

We are now ready to formally introduce our algorithm, Algorithm 1, which enjoys all the features discussed above. Its convergence properties are given in Theorem 1, whose proof is omitted because of space limitation, see [28].

Algorithm 1: Inexact Parallel Algorithm
Data: $\{\varepsilon_i^k\}$ for $i \in \mathcal{N}$, $\tau \ge 0$, $\{\gamma^k\} > 0$, $x^0 \in X$, $\rho \in (0, 1]$. Set $k = 0$.
(S.1): If $x^k$ satisfies a termination criterion: STOP;
(S.2): For all $i \in \mathcal{N}$, solve (2) with accuracy $\varepsilon_i^k$: find $z_i^k \in X_i$ s.t. $\|z_i^k - \hat x_i(x^k, \tau)\| \le \varepsilon_i^k$;
(S.3): Set $M^k \triangleq \max_i \{E_i(x^k)\}$. Choose a set $S^k$ that contains at least one index $i$ for which $E_i(x^k) \ge \rho M^k$. Set $\hat z_i^k = z_i^k$ for $i \in S^k$ and $\hat z_i^k = x_i^k$ for $i \notin S^k$;
(S.4): Set $x^{k+1} \triangleq x^k + \gamma^k (\hat z^k - x^k)$;
(S.5): $k \leftarrow k + 1$, and go to (S.1).

Theorem 1 Let $\{x^k\}$ be the sequence generated by Algorithm 1, under A1-A6. Suppose that $\{\gamma^k\}$ and $\{\varepsilon_i^k\}$ satisfy the following conditions: i) $\gamma^k \in (0, 1]$; ii) $\gamma^k \to 0$; iii) $\sum_k \gamma^k = +\infty$; iv) $\sum_k (\gamma^k)^2 < +\infty$; and v) $\varepsilon_i^k \le \gamma^k \alpha_1 \min\{\alpha_2, 1/\|\nabla_{x_i} F(x^k)\|\}$ for all $i \in \mathcal{N}$ and some nonnegative constants $\alpha_1$ and $\alpha_2$. Additionally, if inexact solutions are used in Step S.2, i.e., $\varepsilon_i^k > 0$ for some $i$ and infinitely many $k$, then assume also that $G$ is globally Lipschitz on $X$. Then, either Algorithm 1 converges in a finite number of iterations to a stationary solution of (1) or every limit point of $\{x^k\}$ (at least one such point exists) is a stationary solution of (1).

In the theorem we obtain convergence to stationary points $x^*$, i.e. points for which a subgradient $\xi \in \partial G(x^*)$ exists such that $(\nabla F(x^*) + \xi)^T (y - x^*) \ge 0$ for all $y \in X$. Of course, if $F$ is convex, stationary points coincide with global minimizers.

On Algorithm 1. The proposed algorithm is extremely flexible.
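The steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes scalar blocks, exact subproblem solutions ($\varepsilon_i^k = 0$), the error bound $E_i(x) = |\hat x_i - x_i|$, and the greedy choice of $S^k$ as all indices with $E_i \ge \rho M^k$. The function names and the small coupled quadratic used as a test problem, $F(x) = \frac{1}{2}(x_0 - 1)^2 + \frac{1}{2}(x_1 + 2)^2 + \frac{1}{2} x_0 x_1$ with $G \equiv 0$ and $X = \mathbb{R}^2$, are hypothetical and chosen only for illustration.

```python
def step_sizes(gamma0, theta, count):
    """Diminishing steps gamma^k = gamma^{k-1} * (1 - theta * gamma^{k-1})."""
    gammas = [gamma0]
    for _ in range(count - 1):
        g = gammas[-1]
        gammas.append(g * (1.0 - theta * g))
    return gammas


def best_response_quadratic(i, x, tau=1.0):
    """Exact minimizer of F(x_i, x_{-i}) + (tau/2) * (x_i - x_i^k)^2 for the
    illustrative quadratic F described in the lead-in (an assumption)."""
    if i == 0:
        return (1.0 - 0.5 * x[1] + tau * x[0]) / (1.0 + tau)
    return (-2.0 - 0.5 * x[0] + tau * x[1]) / (1.0 + tau)


def flexible_parallel_algorithm(x0, best_response, gammas, rho=0.5, tol=1e-10):
    """Sketch of Algorithm 1 with scalar blocks and exact best responses."""
    x = list(x0)
    for gamma in gammas:
        # (S.2) every best response is computed from the same iterate x^k,
        # so this loop could run in parallel over the blocks.
        z = [best_response(i, x) for i in range(len(x))]
        # (S.3) error bounds E_i = |x_hat_i - x_i| and their maximum M^k.
        errors = [abs(z_i - x_i) for z_i, x_i in zip(z, x)]
        m_k = max(errors)
        if m_k <= tol:  # (S.1) simple termination criterion
            break
        # Greedy selection: update every block with E_i >= rho * M^k.
        z_hat = [z_i if e >= rho * m_k else x_i
                 for z_i, x_i, e in zip(z, x, errors)]
        # (S.4) convex combination of x^k and the selected best responses.
        x = [x_i + gamma * (zh - x_i) for x_i, zh in zip(x, z_hat)]
    return x
```

For the quadratic above, the unique stationary point solves $x_0 + \frac{1}{2} x_1 = 1$ and $\frac{1}{2} x_0 + x_1 = -2$; calling `flexible_parallel_algorithm([0.0, 0.0], best_response_quadratic, step_sizes(0.9, 1e-3, 1000))` approaches it.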
We can always choose $S^k = \mathcal{N}$, resulting in the simultaneous update of all the (block) variables (full Jacobi scheme); or, at the other extreme, one can update a single (block) variable at a time, thus obtaining a Gauss-Southwell kind of method. One can also compute inexact solutions (Step 2) while preserving convergence, provided that the error terms $\varepsilon_i^k$ and the step-sizes $\gamma^k$ are chosen according to Theorem 1. We emphasize that the Lipschitzianity of $G$ is required only if $\hat x(x^k, \tau)$ is not computed exactly for infinitely many iterations. At any rate, this Lipschitz condition is automatically satisfied if $G$ is a norm (and therefore in Lasso and group Lasso problems, for example) or if $X$ is bounded.

On the choice of the stepsize $\gamma^k$. An example of a step-size rule satisfying i-iv in Theorem 1 is: given $\gamma^0 = 1$, let

$$\gamma^k = \gamma^{k-1} \left(1 - \theta \gamma^{k-1}\right), \quad k = 1, \ldots, \qquad (4)$$

where $\theta \in (0, 1)$ is a given constant; see [21] for other rules. This is actually the rule we used in our practical experiments, see next section. Notice that while this rule may still require some tuning for optimal behaviour, it is quite reliable, since in general we are not using a (sub)gradient direction, so that many of the well-known practical drawbacks associated with a (sub)gradient method with diminishing step-size are mitigated in our setting. Furthermore, this choice of step-size does not require any form of centralized coordination, which is a favourable feature in a parallel environment. We remark that it is possible to prove convergence of Algorithm 1 also using other step-size rules, such as a standard Armijo-like line-search procedure or a (suitably small) constant step-size; see [28] for more details. We omit the discussion of these options because of lack of space, but also because the former is not in line with our parallel approach while the latter is numerically less efficient.

On the choice of $E_i(x)$. As we mentioned, the most obvious choice is to take $E_i(x) = \|\hat x_i(x^k, \tau_i) - x_i^k\|$. This is a valuable choice if the computation of $\hat x_i(x^k, \tau_i)$ can be easily accomplished. For instance, in the Lasso problem with $\mathcal{N} = \{1, \ldots, n\}$ (i.e., when each block reduces to a scalar variable), it is well-known that $\hat x_i(x^k, \tau_i)$ can be computed in closed form using the soft-thresholding operator. In situations where the computation of $\|\hat x_i(x^k, \tau_i) - x_i^k\|$ is not possible or advisable, we can resort to estimates.
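As a small illustration of two ingredients mentioned above, the following Python sketch generates the step-sizes of rule (4) and implements the scalar soft-thresholding operator. The function names and default arguments are assumptions made for this example only.

```python
def diminishing_steps(gamma0=1.0, theta=0.5, count=5):
    """Return [gamma^0, ..., gamma^(count-1)] generated by rule (4)."""
    gammas = [gamma0]
    for _ in range(count - 1):
        g = gammas[-1]
        gammas.append(g * (1.0 - theta * g))
    return gammas


def soft_threshold(v, t):
    """Scalar soft-thresholding: the minimizer over u of 0.5*(u - v)^2 + t*|u|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0
```

With $\gamma^0 \in (0, 1]$ and $\theta \in (0, 1)$ the generated sequence stays in $(0, 1]$ and is strictly decreasing, and each step needs only the previous step-size, so no coordination among processes is required.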
To make the discussion simple, assume momentarily that $G \equiv 0$. Then it is known [29] that $\|\Pi_{X_i}(x_i^k - \nabla_{x_i} F(x^k)) - x_i^k\|$ is an error bound for the minimization problem in (2) and therefore satisfies (3). In this situation we can choose $E_i(x^k) = \|\Pi_{X_i}(x_i^k - \nabla_{x_i} F(x^k)) - x_i^k\|$. If $G(x) \not\equiv 0$ we can easily reduce to the case $G \equiv 0$ by a simple transformation; the details are omitted for lack of space, see [21]. It is interesting to note that the computation of $E_i$ is only needed if a partial update of the (block) variables is performed. However, an option that is always feasible is to take $S^k = \mathcal{N}$ at each iteration, i.e., update all (block) variables at each iteration. With this choice we can dispense with the computation of $E_i$ altogether.

On the choice of $P_i(x_i; x)$. The most obvious choice for $P_i$ is the linearization of $F$ at $x^k$ with respect to $x_i$: $P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k)$. With this choice, and taking for simplicity $Q_i(x^k) = I$, $\hat x_i(x^k, \tau_i)$ is given by

$$\arg\min_{x_i \in X_i} \left\{ F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2 + g_i(x_i) \right\}. \qquad (5)$$

This is essentially the way a new iterate is computed in most sequential (block-)CDMs for the solution of (group) Lasso problems and its generalizations. Note that, contrary to most existing schemes, our algorithm is parallel.

At another extreme we could just take $P_i(x_i; x^k) = F(x_i, x_{-i}^k)$. Of course, to have (P1) satisfied, we must assume that $F(x_i, x_{-i}^k)$ is convex in $x_i$. With this choice, and setting for simplicity $Q_i(x^k) = I$, $\hat x_i(x^k, \tau_i)$ is given by

$$\arg\min_{x_i \in X_i} \left\{ F(x_i, x_{-i}^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2 + g_i(x_i) \right\}, \qquad (6)$$

thus giving rise to a parallel nonlinear Jacobi type method for the constrained minimization of $V(x)$.

Between the two extreme solutions proposed above one can consider intermediate choices. For example, if $F(x_i, x_{-i}^k)$ is convex, we can take $P_i(x_i; x^k)$ as a second order approximation of $F(x_i, x_{-i}^k)$, i.e., $P_i(x_i; x^k) = F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{1}{2} (x_i - x_i^k)^T \nabla^2_{x_i x_i} F(x^k) (x_i - x_i^k)$. When $g_i(x_i) \equiv 0$, this essentially corresponds to taking a Newton step in minimizing the reduced problem $\min_{x_i \in X_i} F(x_i, x_{-i}^k)$.
The resulting $\hat x_i(x^k, \tau_i)$ is

$$\arg\min_{x_i \in X_i} \left\{ F(x^k) + \nabla_{x_i} F(x^k)^T (x_i - x_i^k) + \frac{1}{2} (x_i - x_i^k)^T \nabla^2_{x_i x_i} F(x^k) (x_i - x_i^k) + \frac{\tau_i}{2} \|x_i - x_i^k\|^2 + g_i(x_i) \right\}.$$

The framework described in Algorithm 1 can give rise to very different instances, according to the choices one makes for the many variable features it contains, some of which have been detailed above. For lack of space, we cannot fully discuss here all possibilities. We provide next just a few instances of possible algorithms that fall in our framework; more examples can be found in [28].

Example #1 (Proximal) Jacobi algorithms for convex functions: Consider the simplest problem falling in our setting: the unconstrained minimization of a continuously differentiable convex function, i.e., assume that $F$ is convex, $G(x) \equiv 0$, and $X = \mathbb{R}^n$. Although this is possibly the best studied problem in nonlinear optimization, classical parallel methods for this problem [27, Sec. 3.2.4] require very strong contraction conditions. In our framework we can take $S^k = \mathcal{N}$, $P_i(x_i; x^k) = F(x_i, x_{-i}^k)$, resulting in a fully parallel Jacobi-type method which does not need any additional assumptions. Furthermore, our theory shows that we can even dispense with the convexity assumption and still get convergence of a Jacobi-type method to a stationary point.

Example #2 Parallel coordinate descent method for Lasso: Consider the Lasso problem, i.e., $F(x) = \|Ax - b\|^2$, $G(x) = c\|x\|_1$, and $X = \mathbb{R}^n$. Probably, to date, the most successful class of methods for this problem is that of CDMs, whereby at each iteration a single variable is updated using (5). We can easily obtain a parallel version of this method by taking $n_i = 1$, $S^k = \mathcal{N}$ and still using (5). Alternatively, instead of linearizing $F(x)$, we can better exploit the convexity of $F(x)$ and use (6). Furthermore, we can easily consider similar methods for the group Lasso problem (just take $n_i > 1$).
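The parallel coordinate descent scheme of Example #2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it uses scalar blocks ($n_i = 1$), $S^k = \mathcal{N}$, the proximal-linear choice (5) with $Q_i = I$ and a common proximal weight `tau` supplied by the caller, and step-size rule (4). With $F(x) = \|Ax - b\|^2$ we have $\nabla F(x) = 2A^T(Ax - b)$, and (5) reduces to a soft-thresholding of $x_i^k - \nabla_{x_i} F(x^k)/\tau$ with threshold $c/\tau$. All function and parameter names are hypothetical.

```python
def soft_threshold(v, t):
    """Scalar soft-thresholding operator."""
    sign = 1.0 if v >= 0 else -1.0
    return max(abs(v) - t, 0.0) * sign


def lasso_objective(A, b, c, x):
    """V(x) = ||Ax - b||^2 + c * ||x||_1 for dense list-of-lists A."""
    residual = [sum(a * xj for a, xj in zip(row, x)) - bi
                for row, bi in zip(A, b)]
    return sum(r * r for r in residual) + c * sum(abs(xj) for xj in x)


def jacobi_lasso(A, b, c, tau, iters=200, gamma0=0.9, theta=1e-2):
    """Fully parallel (Jacobi) proximal-linear updates for the Lasso."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    gamma = gamma0
    for _ in range(iters):
        residual = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
                    for i in range(m)]
        grad = [2.0 * sum(A[i][j] * residual[i] for i in range(m))
                for j in range(n)]
        # Best responses (5): each coordinate uses only x^k, so the n
        # scalar problems are independent and could be solved in parallel.
        x_hat = [soft_threshold(x[j] - grad[j] / tau, c / tau)
                 for j in range(n)]
        # Step (S.4): move towards the best responses with step gamma^k.
        x = [x[j] + gamma * (x_hat[j] - x[j]) for j in range(n)]
        gamma = gamma * (1.0 - theta * gamma)  # rule (4)
    return x
```

On the toy instance $A = I$, $b = (3, 0.2)$, $c = 1$, each coordinate decouples and the minimizer is the soft-thresholding of $b_j$ with threshold $c/2$; the call `jacobi_lasso([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], 1.0, tau=2.0)` approaches that point.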
As a final remark, we observe that convergence conditions of existing (deterministic) fully distributed parallel versions of CDMs such as [2, 17] impose a constraint on the maximum number of variables that can be simultaneously updated (linked to the spectral radius of some matrices), a constraint that in many large scale problems is likely not satisfied. A key feature of the proposed scheme is that we can parallelize over (possibly) all variables while guaranteeing convergence.

4. NUMERICAL RESULTS

In this section we report some preliminary numerical results that not only show the viability of our approach, but also seem to indicate that our algorithmic framework can lead to practical methods that exploit parallelism well and compare favorably to existing schemes, both parallel and sequential. The tests were carried out on Lasso problems, the most studied case of Problem (1) and, arguably, the most important one. We generated four groups of problem instances using the random generation technique proposed by Nesterov in [7], which allows one to control the sparsity of the solution. For the first three groups, we considered problems with 10,000 variables, with the matrix $A$ having 2,000 rows. The three groups differ in the number of nonzeros of the solution, which is 20% (low sparsity), 10% (medium sparsity), and 5% (high sparsity), respectively. The last group is an instance with 100,000 variables and 5,000 rows, and solutions having 5% of nonzero variables (high sparsity).

We implemented the instance of Algorithm 1 that we described in Example #2 in the previous section, with the only difference that we used (6) instead of the proximal-linear choice (5). Note that in the case of Lasso problems, the unique solution of (6) can be computed in closed form using the soft-thresholding operator, see e.g. [30]. The free parameters of the algorithm are chosen as follows. The proximal parameters are initially set to $\tau_i = \mathrm{tr}(A^T A)/2n$ for all $i$, where $n$ is the total number of variables. Furthermore, we allowed a finite number of possible changes to $\tau_i$ according to the following rules: (i) all $\tau_i$ are doubled if at a certain iteration the objective function does not decrease; and (ii) they are all halved if the objective function decreases for ten consecutive iterations. We updated $\gamma^k$ according to (4) with $\gamma^0 = 0.9$ and $\theta = 10^{-5}$. Note that since the $\tau_i$ are changed only a finite number of times, the conditions of Theorem 1 are satisfied, and thus this instance of Algorithm 1 is guaranteed to converge. Finally, we chose not to update all variables but set $E_i(x^k) = \|\hat x_i(x^k, \tau_i) - x_i^k\|$ and $\rho = 0.5$ in Algorithm 1.

We compared our algorithm above, termed FPA (for Flexible Parallel Algorithm), with a parallel implementation of FISTA [30], which can be regarded as the benchmark algorithm for Lasso problems, and GRock, a parallel algorithm proposed in [17] that seems to perform extremely well on sparse problems.
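The parameter rules just described can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the cap `max_changes` is a hypothetical stand-in for the unspecified "finite number" of allowed changes, a single common value `tau` is used for all scalar blocks, and the selection function picks every index with $E_i \ge \rho M^k$, which is one admissible choice of $S^k$. All names are hypothetical.

```python
def initial_tau(A):
    """tau_i = tr(A^T A) / (2 n); tr(A^T A) is the sum of squared entries."""
    n = len(A[0])
    trace = sum(a * a for row in A for a in row)
    return trace / (2.0 * n)


class TauController:
    """Doubles tau when the objective fails to decrease and halves it after
    ten consecutive decreases, for at most max_changes changes in total."""

    def __init__(self, tau, max_changes=100):
        self.tau = tau
        self.changes_left = max_changes  # hypothetical cap (assumption)
        self.decrease_streak = 0
        self.prev_value = None

    def _change(self, factor):
        if self.changes_left > 0:
            self.tau *= factor
            self.changes_left -= 1

    def update(self, value):
        """Feed the objective value V(x^k); return the current tau."""
        if self.prev_value is not None:
            if value >= self.prev_value:   # rule (i): no decrease
                self.decrease_streak = 0
                self._change(2.0)
            else:                          # rule (ii): count decreases
                self.decrease_streak += 1
                if self.decrease_streak == 10:
                    self.decrease_streak = 0
                    self._change(0.5)
        self.prev_value = value
        return self.tau


def select_blocks(errors, rho=0.5):
    """Indices i with E_i >= rho * max_j E_j (one valid choice of S^k)."""
    m_k = max(errors)
    return [i for i, e in enumerate(errors) if e >= rho * m_k]
```

Capping the number of changes keeps $\tau$ fixed from some iteration on, which is the property the convergence argument above relies on.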
We actually tested two instances of GRock; in the first only one variable is updated at each iteration, while in the second the number of updated variables is equal to the number of parallel processors used (16 for the first three sets of test problems, 32 for the last). Note that the theoretical convergence properties of GRock are in jeopardy as the number of updated variables increases, and theoretical convergence conditions for this method are likely to hold only if the columns of $A$ are almost orthogonal, a feature enjoyed by our test problems, which however is not satisfied in most applications. As benchmark, we also implemented two classical sequential schemes: (i) a Gauss-Seidel (GS) method computing $\hat x_i$, and then updating $x_i$ using unitary step-size, in a sequential fashion, and (ii) a classical Alternating Method of Multipliers (ADMM) [31] in the form of [32]. Note that ADMM can be parallelized, but it is known not to scale well and therefore we did not consider this possibility.

All codes have been written in C++ and use the Message Passing Interface for parallel operations. All algebra is performed by using the GNU Scientific Library (GSL). The algorithms were tested on a cluster at the State University of New York at Buffalo. All computations were done on one 32-core node composed of four 8-core CPUs with 96GB of RAM and an InfiniBand card. The 10,000 variables problems were solved using 16 parallel processes, while for the 100,000 variables problems 32 parallel processes were used. GS and ADMM were always run on a single process. Results of our experiments are reported in Fig. 1. The curves are averaged over ten random realizations for each of the 10,000 variables groups, while for the large 100,000 variables problems the average is over 3 realizations.
Note that in Fig. 1 the CPU time includes communication times (for distributed algorithms) and the initial time needed by the methods to perform all pre-iteration computations (this explains why the plot of FISTA starts after the others; in fact FISTA requires some nontrivial initializations based on the computation of $\|A\|_2^2$).

[Fig. 1. Relative error $(V - V^*)/V^*$ vs. time (in seconds, logarithmic scale): (a) medium size and low sparsity - (b) medium size and sparsity - (c) medium size and high sparsity - (d) large size and high sparsity. Curves: FPA, FISTA, GRock P=1, GRock P=16 (P=32 in panel (d)), ADMM, GS.]

Some comments are in order. Fig. 1 shows that on the tested problems FPA consistently outperforms all other implemented algorithms. Sequential methods behave strikingly well on the 10,000 variables problems, if one keeps in mind that they only use one process; however, as expected, they cannot compete with parallel methods when the dimensions increase. FISTA is capable of approaching low accuracy solutions relatively fast, but has difficulties in reaching high accuracies. The parallel version of GRock is the closest match to FPA, but only when the problems are very sparse and the dimensions not too large. This is consistent with the fact that at each iteration GRock updates only a very limited number of variables, and also with the fact that its convergence properties are at stake when the problems are quite dense. Our experiments also suggest that, differently from what one could think (and what is often claimed in similar situations when using gradient-like methods), updating only a (suitably chosen) subset of blocks rather than all variables may lead to faster algorithms.
In conclusion, we believe the results overall indicate that our approach can lead to very efficient practical methods for the solution of large problems, with the flexibility to adapt to many different problem characteristics.

5. CONCLUSIONS

We proposed a highly parallelizable algorithmic scheme for the minimization of the sum of a differentiable function and a block-separable nonsmooth one. Our framework easily allows us to analyze parallel versions of well-known sequential methods and leads to entirely new algorithms. When applied to large-scale Lasso problems, our algorithm was shown to outperform existing methods.

6. APPENDIX: PROOF OF THEOREM 1

We first introduce some preliminary results instrumental to prove the theorem. Hereafter, for notational simplicity, we will omit the dependence of $\hat x(y, \tau)$ on $\tau$ and write $\hat x(y)$. Given $S \subseteq \mathcal{N}$ and $x \triangleq (x_i)_{i=1}^N$, we will also denote by $(x)_S$ (or interchangeably $x_S$) the vector whose component $i$ is equal to $x_i$ if $i \in S$, and zero otherwise.

6.1. Intermediate results

Lemma 2 Set $H(x; y) \triangleq \sum_i h_i(x_i; y)$. Then, the following hold:
(i) $H(\cdot; y)$ is uniformly strongly convex on $X$ with constant $c_\tau > 0$, i.e.,

$$(x - w)^T (\nabla_x H(x; y) - \nabla_x H(w; y)) \ge c_\tau \|x - w\|^2, \qquad (7)$$

for all $x, w \in X$ and given $y \in X$;
(ii) $\nabla_x H(x; \cdot)$ is uniformly Lipschitz continuous on $X$, i.e., there exists a $0 < L_H < \infty$ independent of $x$ such that

$$\|\nabla_x H(x; y) - \nabla_x H(x; w)\| \le L_H \|y - w\|, \qquad (8)$$

for all $y, w \in X$ and given $x \in X$.

Proof. The proof is standard and thus is omitted.

Proposition 3 Consider Problem (1) under (A1)-(A6). Then the mapping $X \ni y \mapsto \hat x(y)$ has the following properties:
(a) $\hat x(\cdot)$ is Lipschitz continuous on $X$, i.e., there exists a positive constant $\hat L$ such that

$$\|\hat x(y) - \hat x(z)\| \le \hat L \|y - z\|, \quad \forall y, z \in X; \qquad (9)$$

(b) the set of the fixed-points of $\hat x(\cdot)$ coincides with the set of stationary solutions of Problem (1); therefore $\hat x(\cdot)$ has a fixed-point;
(c) for every given $y \in X$ and for any set $S \subseteq \mathcal{N}$, it holds that

$$(\hat x(y) - y)_S^T \nabla_x F(y)_S + \sum_{i \in S} g_i(\hat x_i(y)) - \sum_{i \in S} g_i(y_i) \le -c_\tau \|(\hat x(y) - y)_S\|^2, \qquad (10)$$

with $c_\tau \triangleq q \min_i \tau_i$.

Proof. We prove the proposition in the following order: (c), (a), (b).

(c): Given $y \in X$, by definition, each $\hat x_i(y)$ is the unique solution of problem (2); then it is not difficult to see that the following holds: for all $z_i \in X_i$,

$$(z_i - \hat x_i(y))^T \nabla_{x_i} h_i(\hat x_i(y); y) + g_i(z_i) - g_i(\hat x_i(y)) \ge 0. \qquad (11)$$

Summing and subtracting $\nabla_{x_i} P_i(y_i; y)$ in (11), choosing $z_i = y_i$, and using (P2), we get

$$(y_i - \hat x_i(y))^T (\nabla_{x_i} P_i(\hat x_i(y); y) - \nabla_{x_i} P_i(y_i; y)) + (y_i - \hat x_i(y))^T \nabla_{x_i} F(y) + g_i(y_i) - g_i(\hat x_i(y)) - \tau_i (\hat x_i(y) - y_i)^T Q_i(y) (\hat x_i(y) - y_i) \ge 0, \qquad (12)$$

for all $i \in \mathcal{N}$. Using (12) and observing that the first term in (12) is nonpositive by (P1), we obtain

$$(y_i - \hat x_i(y))^T \nabla_{x_i} F(y) + g_i(y_i) - g_i(\hat x_i(y)) \ge c_\tau \|\hat x_i(y) - y_i\|^2, \qquad (13)$$

for all $i \in \mathcal{N}$. Summing (13) over $i \in S$ we obtain (10).

(a): We use the notation introduced in Lemma 2.
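Given $y, z \in X$, by optimality and using (11), we have

$$(v - \hat x(y))^T \nabla_x H(\hat x(y); y) + G(v) - G(\hat x(y)) \ge 0 \quad \forall v \in X,$$
$$(w - \hat x(z))^T \nabla_x H(\hat x(z); z) + G(w) - G(\hat x(z)) \ge 0 \quad \forall w \in X.$$

Setting $v = \hat x(z)$ and $w = \hat x(y)$, summing the two inequalities above, and adding and subtracting $\nabla_x H(\hat x(y); z)$, we obtain:

$$(\hat x(z) - \hat x(y))^T (\nabla_x H(\hat x(z); z) - \nabla_x H(\hat x(y); z)) \le (\hat x(y) - \hat x(z))^T (\nabla_x H(\hat x(y); z) - \nabla_x H(\hat x(y); y)). \qquad (14)$$

Using (7) we can now lower bound the left-hand side of (14) as

$$(\hat x(z) - \hat x(y))^T (\nabla_x H(\hat x(z); z) - \nabla_x H(\hat x(y); z)) \ge c_\tau \|\hat x(z) - \hat x(y)\|^2, \qquad (15)$$

whereas the right-hand side of (14) can be upper bounded as

$$(\hat x(y) - \hat x(z))^T (\nabla_x H(\hat x(y); z) - \nabla_x H(\hat x(y); y)) \le L_H \|\hat x(y) - \hat x(z)\| \, \|y - z\|, \qquad (16)$$

where the inequality follows from the Cauchy-Schwarz inequality and (8). Combining (14), (15), and (16), we obtain the desired Lipschitz property of $\hat x(\cdot)$, with $\hat L = L_H / c_\tau$.

(b): Let $x^* \in X$ be a fixed point of $\hat x(y)$, that is $x^* = \hat x(x^*)$. Each $\hat x_i(y)$ satisfies (11) for any given $y \in X$. For some $\xi_i \in \partial g_i(x_i^*)$, setting $y = x^*$ and using $x^* = \hat x(x^*)$ and the convexity of $g_i$, (11) reduces to

$$(z_i - x_i^*)^T (\nabla_{x_i} F(x^*) + \xi_i) \ge 0, \qquad (17)$$

for all $z_i \in X_i$ and $i \in \mathcal{N}$. Taking into account the Cartesian structure of $X$, the separability of $G$, and summing (17) over $i \in \mathcal{N}$ we obtain $(z - x^*)^T (\nabla_x F(x^*) + \xi) \ge 0$, for all $z \in X$, with $z \triangleq (z_i)_{i=1}^N$ and $\xi \triangleq (\xi_i)_{i=1}^N \in \partial G(x^*)$; therefore $x^*$ is a stationary solution of (1). The converse holds because i) $\hat x(x^*)$ is the unique optimal solution of (2) with $y = x^*$, and ii) $x^*$ is also an optimal solution of (2), since it satisfies the minimum principle.

Lemma 4 [33, Lemma 3.4, p.121] Let $\{X^k\}$, $\{Y^k\}$, and $\{Z^k\}$ be three sequences of numbers such that $Y^k \ge 0$ for all $k$. Suppose that

$$X^{k+1} \le X^k - Y^k + Z^k, \quad k = 0, 1, \ldots$$

and $\sum_{k=0}^\infty Z^k < \infty$. Then either $X^k \to -\infty$ or else $\{X^k\}$ converges to a finite value and $\sum_{k=0}^\infty Y^k < \infty$.

Lemma 5 Let $\{x^k\}$ be the sequence generated by Algorithm 1. Then, there is a positive constant $\tilde c$ such that the following holds: for all $k \ge 1$,

$$(\nabla_x F(x^k))_{S^k}^T (\hat x(x^k) - x^k)_{S^k} + \sum_{i \in S^k} g_i(\hat x_i(x^k)) - \sum_{i \in S^k} g_i(x_i^k) \le -\tilde c \, \|\hat x(x^k) - x^k\|^2.$$

Proof. Let $j_k$ be an index in $S^k$ such that $E_{j_k}(x^k) \ge \rho \max_i E_i(x^k)$ (Step 3 of the algorithm). Then, using the aforementioned bound and (3), it is easy to check that the following chain of inequalities holds:

$$\bar s_{j_k} \|\hat x_{S^k}(x^k) - x^k_{S^k}\| \ge \bar s_{j_k} \|\hat x_{j_k}(x^k) - x^k_{j_k}\| \ge E_{j_k}(x^k) \ge \rho \max_i E_i(x^k) \ge \left(\rho \min_i \underline{s}_i\right) \left(\max_i \{\|\hat x_i(x^k) - x_i^k\|\}\right) \ge \left(\frac{\rho \min_i \underline{s}_i}{N}\right) \|\hat x(x^k) - x^k\|.$$

Hence we have for any $k$,

$$\|\hat x_{S^k}(x^k) - x^k_{S^k}\| \ge \left(\frac{\rho \min_i \underline{s}_i}{N \bar s_{j_k}}\right) \|\hat x(x^k) - x^k\|. \qquad (18)$$

Invoking now Proposition 3(c) with $S = S^k$ and $y = x^k$, and using (18), the lemma holds, with $\tilde c \triangleq c_\tau \left(\frac{\rho \min_i \underline{s}_i}{N \max_j \bar s_j}\right)^2$.

6.2. Proof of Theorem 1

We are now ready to prove the theorem. For any given $k \ge 0$, the Descent Lemma [34] yields

$$F(x^{k+1}) \le F(x^k) + \gamma^k \nabla_x F(x^k)^T (\hat z^k - x^k) + \frac{(\gamma^k)^2 L_F}{2} \|\hat z^k - x^k\|^2, \qquad (19)$$

with $\hat z^k \triangleq (\hat z_i^k)_{i=1}^N$ and $z^k \triangleq (z_i^k)_{i=1}^N$ defined in Steps 2 and 3 (Algorithm 1). Observe that

$$\|\hat z^k - x^k\|^2 \le \|z^k - x^k\|^2 \le 2 \|\hat x(x^k) - x^k\|^2 + 2 \sum_{i \in \mathcal{N}} \|z_i^k - \hat x_i(x^k)\|^2 \le 2 \|\hat x(x^k) - x^k\|^2 + 2 \sum_{i \in \mathcal{N}} (\varepsilon_i^k)^2, \qquad (20)$$

where the first inequality follows from the definition of $z^k$ and $\hat z^k$ and in the last inequality we used $\|z_i^k - \hat x_i(x^k)\| \le \varepsilon_i^k$. Denoting by $\bar S^k$ the complement of $S^k$, we also have, for $k$ large enough,

$$\nabla_x F(x^k)^T (\hat z^k - x^k) = \nabla_x F(x^k)^T (\hat z^k - \hat x(x^k) + \hat x(x^k) - x^k)$$
$$= \nabla_x F(x^k)_{S^k}^T (z^k - \hat x(x^k))_{S^k} + \nabla_x F(x^k)_{\bar S^k}^T (x^k - \hat x(x^k))_{\bar S^k} + \nabla_x F(x^k)_{S^k}^T (\hat x(x^k) - x^k)_{S^k} + \nabla_x F(x^k)_{\bar S^k}^T (\hat x(x^k) - x^k)_{\bar S^k}$$
$$= \nabla_x F(x^k)_{S^k}^T (z^k - \hat x(x^k))_{S^k} + \nabla_x F(x^k)_{S^k}^T (\hat x(x^k) - x^k)_{S^k}, \qquad (21)$$

where in the second equality we used the definition of $\hat z^k$ and of the set $S^k$. Now, using the above identity and Lemma 5, we can write

$$\nabla_x F(x^k)^T (\hat z^k - x^k) + \sum_{i \in S^k} g_i(\hat z_i^k) - \sum_{i \in S^k} g_i(x_i^k)$$
$$= \nabla_x F(x^k)^T (\hat z^k - x^k) + \sum_{i \in S^k} g_i(\hat x_i(x^k)) - \sum_{i \in S^k} g_i(x_i^k) + \sum_{i \in S^k} g_i(\hat z_i^k) - \sum_{i \in S^k} g_i(\hat x_i(x^k))$$
$$\le -\tilde c \, \|\hat x(x^k) - x^k\|^2 + \sum_{i \in S^k} \varepsilon_i^k \|\nabla_{x_i} F(x^k)\| + L_G \sum_{i \in S^k} \varepsilon_i^k. \qquad (22)$$

Finally, from the definition of $\hat z^k$ and of the set $S^k$, we have for all $k$ large enough,

$$V(x^{k+1}) = F(x^{k+1}) + \sum_{i \in \mathcal{N}} g_i(x_i^{k+1}) = F(x^{k+1}) + \sum_{i \in \mathcal{N}} g_i(x_i^k + \gamma^k (\hat z_i^k - x_i^k))$$
$$\le F(x^{k+1}) + \sum_{i \in \mathcal{N}} g_i(x_i^k) + \gamma^k \left( \sum_{i \in S^k} (g_i(\hat z_i^k) - g_i(x_i^k)) \right)$$
$$\le V(x^k) - \gamma^k \left(\tilde c - \gamma^k L_F\right) \|\hat x(x^k) - x^k\|^2 + T^k, \qquad (23)$$

where in the first inequality we used the convexity of the $g_i$'s, whereas the second follows from (19), (20) and (22), with

$$T^k \triangleq \gamma^k \sum_{i \in S^k} \varepsilon_i^k \left(L_G + \|\nabla_{x_i} F(x^k)\|\right) + (\gamma^k)^2 L_F \sum_{i \in \mathcal{N}} (\varepsilon_i^k)^2.$$

Using assumption (v), we can bound $T^k$ as

$$T^k \le (\gamma^k)^2 \left[ N \alpha_1 (\alpha_2 L_G + 1) + (\gamma^k)^2 L_F (N \alpha_1 \alpha_2)^2 \right],$$

which, by assumption (iv), implies $\sum_{k=0}^\infty T^k < \infty$. Since $\gamma^k \to 0$, it follows from (23) that there exist some positive constant $\beta_1$ and a sufficiently large $k$, say $\bar k$, such that

$$V(x^{k+1}) \le V(x^k) - \gamma^k \beta_1 \|\hat x(x^k) - x^k\|^2 + T^k, \quad \forall k \ge \bar k. \qquad (24)$$

Invoking Lemma 4 with the identifications $X^k = V(x^k)$, $Y^k = \gamma^k \beta_1 \|\hat x(x^k) - x^k\|^2$ and $Z^k = T^k$ while using $\sum_{k=0}^\infty T^k < \infty$, we deduce from (24) that either $\{V(x^k)\} \to -\infty$ or else $\{V(x^k)\}$ converges to a finite value and

$$\lim_{k \to \infty} \sum_{t = \bar k}^{k} \gamma^t \|\hat x(x^t) - x^t\|^2 < +\infty. \qquad (25)$$

Since $V$ is coercive, $V(x) \ge \min_{y \in X} V(y) > -\infty$, implying that $\{V(x^k)\}$ is convergent; it follows from (25) and $\sum_{k=0}^\infty \gamma^k = \infty$ that $\liminf_{k \to \infty} \|\hat x(x^k) - x^k\| = 0$.

Using Prop. 3, we show next that $\lim_{k \to \infty} \|\hat x(x^k) - x^k\| = 0$; for notational simplicity we will write $\Delta \hat x(x^k) \triangleq \hat x(x^k) - x^k$. Suppose, by contradiction, that $\limsup_{k \to \infty} \|\Delta \hat x(x^k)\| > 0$. Then, there exists a $\delta > 0$ such that $\|\Delta \hat x(x^k)\| > 2\delta$ for infinitely many $k$ and also $\|\Delta \hat x(x^k)\| < \delta$ for infinitely many $k$. Therefore, one can always find an infinite set of indexes, say $\mathcal{K}$, having the following properties: for any $k \in \mathcal{K}$, there exists an integer $i_k > k$ such that

$$\|\Delta \hat x(x^k)\| < \delta, \quad \|\Delta \hat x(x^{i_k})\| > 2\delta, \qquad (26)$$
$$\delta \le \|\Delta \hat x(x^j)\| \le 2\delta, \quad k < j < i_k. \qquad (27)$$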

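Given the above bounds, the following holds: for all $k \in \mathcal{K}$,

$$\delta \overset{(a)}{<} \|\Delta \hat x(x^{i_k})\| - \|\Delta \hat x(x^k)\| \le \|\hat x(x^{i_k}) - \hat x(x^k)\| + \|x^{i_k} - x^k\| \qquad (28)$$
$$\overset{(b)}{\le} (1 + \hat L) \|x^{i_k} - x^k\| \qquad (29)$$
$$\overset{(c)}{\le} (1 + \hat L) \sum_{t=k}^{i_k - 1} \gamma^t \left( \|\Delta \hat x(x^t)_{S^t}\| + \|(z^t - \hat x(x^t))_{S^t}\| \right) \overset{(d)}{\le} (1 + \hat L)(2\delta + \varepsilon^{\max}) \sum_{t=k}^{i_k - 1} \gamma^t, \qquad (30)$$

where (a) follows from (26); (b) is due to Prop. 3(a); (c) comes from the triangle inequality, the updating rule of the algorithm and the definition of $\hat z^k$; and in (d) we used (26), (27), and $\|z^t - \hat x(x^t)\| \le \sum_{i \in \mathcal{N}} \varepsilon_i^t$, where $\varepsilon^{\max} \triangleq \max_k \sum_{i \in \mathcal{N}} \varepsilon_i^k < \infty$. It follows from (30) that

$$\liminf_{k \to \infty} \sum_{t=k}^{i_k - 1} \gamma^t \ge \frac{\delta}{(1 + \hat L)(2\delta + \varepsilon^{\max})} > 0. \qquad (31)$$

We show next that (31) is in contradiction with the convergence of $\{V(x^k)\}$. To do that, we preliminarily prove that, for sufficiently large $k \in \mathcal{K}$, it must be $\|\Delta \hat x(x^k)\| \ge \delta/2$. Proceeding as in (30), we have: for any given $k \in \mathcal{K}$,

$$\|\Delta \hat x(x^{k+1})\| - \|\Delta \hat x(x^k)\| \le (1 + \hat L) \|x^{k+1} - x^k\| \le (1 + \hat L) \gamma^k \left( \|\Delta \hat x(x^k)\| + \varepsilon^{\max} \right).$$

It turns out that for sufficiently large $k \in \mathcal{K}$ so that $(1 + \hat L) \gamma^k < \delta / (\delta + 2\varepsilon^{\max})$, it must be

$$\|\Delta \hat x(x^k)\| \ge \delta/2; \qquad (32)$$

otherwise the condition $\|\Delta \hat x(x^{k+1})\| \ge \delta$ would be violated [cf. (27)]. Hereafter we assume w.l.o.g. that (32) holds for all $k \in \mathcal{K}$ (in fact, one can always restrict $\{x^k\}_{k \in \mathcal{K}}$ to a proper subsequence).

We can show now that (31) is in contradiction with the convergence of $\{V(x^k)\}$. Using (24) (possibly over a subsequence), we have: for sufficiently large $k \in \mathcal{K}$,

$$V(x^{i_k}) \le V(x^k) - \beta_2 \sum_{t=k}^{i_k - 1} \gamma^t \|\Delta \hat x(x^t)\|^2 + \sum_{t=k}^{i_k - 1} T^t \overset{(a)}{\le} V(x^k) - \beta_2 (\delta^2/4) \sum_{t=k}^{i_k - 1} \gamma^t + \sum_{t=k}^{i_k - 1} T^t, \qquad (33)$$

where in (a) we used (27) and (32), and $\beta_2$ is some positive constant. Since $\{V(x^k)\}$ converges and $\sum_{k=0}^\infty T^k < \infty$, (33) implies $\lim_{\mathcal{K} \ni k \to \infty} \sum_{t=k}^{i_k - 1} \gamma^t = 0$, which contradicts (31).

Finally, since the sequence $\{x^k\}$ is bounded [due to the coercivity of $V$ and the convergence of $\{V(x^k)\}$], it has at least one limit point $\bar x$ that must belong to $X$. By the continuity of $\hat x(\cdot)$ [Prop. 3(a)] and $\lim_{k \to \infty} \|\hat x(x^k) - x^k\| = 0$, it must be $\hat x(\bar x) = \bar x$. By Prop. 3(b) $\bar x$ is also a stationary solution of Problem (1).

As a final remark, note that if $\varepsilon_i^k = 0$ for every $i$ and for every $k$ large enough, i.e. if eventually $\hat x(x^k)$ is computed exactly, there is no need to assume that $G$ is globally Lipschitz.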
In fact, in (22) the term containing $L_G$ disappears, and actually all the terms $T^k$ are zero, so all the subsequent derivations are independent of the Lipschitz continuity of $G$.

7. REFERENCES

[1] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, "Optimization with sparsity-inducing penalties," arXiv preprint arXiv:1108.0775, 2011.
[2] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, "Parallel coordinate descent for l1-regularized loss minimization," arXiv preprint arXiv:1105.5379, 2011.
[3] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data. Springer, 2011.
[4] R. H. Byrd, J. Nocedal, and F. Oztoprak, "An inexact successive quadratic approximation method for convex L-1 regularized optimization," arXiv preprint arXiv:1309.3529, 2013.
[5] K. Fountoulakis and J. Gondzio, "A second-order method for strongly convex L1-regularization problems," arXiv preprint arXiv:1306.5386, 2013.
[6] I. Necoara and D. Clipici, "Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC," Journal of Process Control, vol. 23, no. 3, pp. 243-253, 2013.
[7] Y. Nesterov, "Gradient methods for minimizing composite functions," Mathematical Programming, pp. 1-37, 2012.
[8] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341-362, 2012.
[9] Z. Qin, K. Scheinberg, and D. Goldfarb, "Efficient block-coordinate descent algorithms for the group lasso," Mathematical Programming Computation, pp. 1-27, 2010.
[10] A. Rakotomamonjy, "Surveying and comparing simultaneous sparse approximation (or group-lasso) algorithms," Signal Processing, vol. 91, no. 7, pp. 1505-1526, 2011.
[11] M. Razaviyayn, M. Hong, and Z.-Q. Luo, "A unified convergence analysis of block successive minimization methods for nonsmooth optimization," SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126-1153, 2013.
[12] P. Richtárik and M. Takáč, "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function," Mathematical Programming, pp. 1-38, 2012.
[13] P. Richtárik and M. Takáč, "Parallel coordinate descent methods for big data optimization," arXiv preprint arXiv:1212.0873, 2012.
[14] S. Sra, S. Nowozin, and S. J. Wright, Eds., Optimization for Machine Learning, ser. Neural Information Processing. Cambridge, Massachusetts: The MIT Press, Sept. 2011.
[15] P. Tseng and S. Yun, "A coordinate gradient descent method for nonsmooth separable minimization," Mathematical Programming, vol. 117, no. 1-2, pp. 387-423, 2009.
[16] Y. Xu and W. Yin, "A block coordinate descent method for multi-convex optimization with applications to nonnegative tensor factorization and completion," DTIC Document, Tech. Rep., 2012. [Online]. Available: http://www.caam.rice.edu/~optimization/bcu/multiconvex.html
[17] Z. Peng, M. Yan, and W. Yin, "Parallel and distributed sparse optimization," 2013. [Online]. Available: http://www.caam.rice.edu/~optimization/disparse/
[18] G.-X. Yuan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, "A comparison of optimization methods and software for large-scale l1-regularized linear classification," The Journal of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.
[19] S. J. Wright, "Accelerated block-coordinate relaxation for regularized optimization," SIAM Journal on Optimization, vol. 22, no. 1, pp. 159-186, 2012.

[20] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, "Decomposition by partial linearization in multiuser systems," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), May 4-9, 2013, pp. 4424-4428.
[21] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, "Decomposition by partial linearization: Parallel optimization of multi-agent systems," IEEE Transactions on Signal Processing, to appear, 2013.
[22] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996.
[23] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49-67, 2006.
[24] S. K. Shevade and S. S. Keerthi, "A simple and efficient algorithm for gene selection using sparse logistic regression," Bioinformatics, vol. 19, no. 17, pp. 2246-2253, 2003.
[25] L. Meier, S. van de Geer, and P. Bühlmann, "The group lasso for logistic regression," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 70, no. 1, pp. 53-71, 2008.
[26] D. Goldfarb, S. Ma, and K. Scheinberg, "Fast alternating linearization methods for minimizing the sum of two convex functions," Mathematical Programming, pp. 1-34, 2012.
[27] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Athena Scientific Press, 1989.
[28] F. Facchinei, S. Sagratella, and G. Scutari, "Flexible parallel algorithms for big data optimization," Dept. of Electrical Eng., State University of New York at Buffalo, Buffalo, NY, USA, Tech. Rep., 2013. [Online]. Available: http://www.eng.buffalo.edu/~gesualdo/fsstechrep13.pdf
[29] F. Facchinei and J.-S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer-Verlag, New York, 2003.
[30] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183-202, 2009.
[31] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[32] Z.-Q. Luo and M. Hong, "On the linear convergence of the alternating direction method of multipliers," arXiv preprint arXiv:1208.3922, 2012.
[33] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Cambridge, Massachusetts: Athena Scientific Press, May 2011.
[34] D. Bertsekas, Nonlinear Programming, 2nd ed. Belmont, MA, USA: Athena Scientific, 1999.