Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
Foundations and Trends® in Machine Learning Vol. 3, No. 1 (2010), © 2011 S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein. DOI: /

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

Stephen Boyd (1), Neal Parikh (2), Eric Chu (3), Borja Peleato (4) and Jonathan Eckstein (5)

1. Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, [email protected]
2. Computer Science Department, Stanford University, Stanford, CA 94305, USA, [email protected]
3. Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, [email protected]
4. Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA, [email protected]
5. Management Science and Information Systems Department and RUTCOR, Rutgers University, Piscataway, NJ 08854, USA, [email protected]
Contents

1 Introduction
2 Precursors
  2.1 Dual Ascent
  2.2 Dual Decomposition
  2.3 Augmented Lagrangians and the Method of Multipliers
3 Alternating Direction Method of Multipliers
  3.1 Algorithm
  3.2 Convergence
  3.3 Optimality Conditions and Stopping Criterion
  3.4 Extensions and Variations
  3.5 Notes and References
4 General Patterns
  4.1 Proximity Operator
  4.2 Quadratic Objective Terms
  4.3 Smooth Objective Terms
  4.4 Decomposition
5 Constrained Convex Optimization
  5.1 Convex Feasibility
  5.2 Linear and Quadratic Programming
6 $\ell_1$-Norm Problems
  6.1 Least Absolute Deviations
  6.2 Basis Pursuit
  6.3 General $\ell_1$ Regularized Loss Minimization
  6.4 Lasso
  6.5 Sparse Inverse Covariance Selection
7 Consensus and Sharing
  7.1 Global Variable Consensus Optimization
  7.2 General Form Consensus Optimization
  7.3 Sharing
8 Distributed Model Fitting
  8.1 Examples
  8.2 Splitting across Examples
  8.3 Splitting across Features
9 Nonconvex Problems
  9.1 Nonconvex Constraints
  9.2 Bi-convex Problems
10 Implementation
  10.1 Abstract Implementation
  10.2 MPI
  10.3 Graph Computing Frameworks
  10.4 MapReduce
11 Numerical Examples
  11.1 Small Dense Lasso
  11.2 Distributed $\ell_1$ Regularized Logistic Regression
  11.3 Group Lasso with Feature Splitting
  11.4 Distributed Large-Scale Lasso with MPI
  11.5 Regressor Selection
12 Conclusions
Acknowledgments
A Convergence Proof
References
Abstract

Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets as well as accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas-Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for $\ell_1$ problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
1 Introduction

In all applied fields, it is now commonplace to attack problems through data analysis, particularly through the use of statistical and machine learning algorithms on what are often large datasets. In industry, this trend has been referred to as 'Big Data', and it has had a significant impact in areas as varied as artificial intelligence, internet applications, computational biology, medicine, finance, marketing, journalism, network analysis, and logistics. Though these problems arise in diverse application domains, they share some key characteristics. First, the datasets are often extremely large, consisting of hundreds of millions or billions of training examples; second, the data is often very high-dimensional, because it is now possible to measure and store very detailed information about each example; and third, because of the large scale of many applications, the data is often stored or even collected in a distributed manner. As a result, it has become of central importance to develop algorithms that are both rich enough to capture the complexity of modern data, and scalable enough to process huge datasets in a parallelized or fully decentralized fashion. Indeed, some researchers [92] have suggested that even highly complex and structured problems may succumb most easily to relatively simple models trained on vast datasets.
Many such problems can be posed in the framework of convex optimization. Given the significant work on decomposition methods and decentralized algorithms in the optimization community, it is natural to look to parallel optimization algorithms as a mechanism for solving large-scale statistical tasks. This approach also has the benefit that one algorithm could be flexible enough to solve many problems.

This review discusses the alternating direction method of multipliers (ADMM), a simple but powerful algorithm that is well suited to distributed convex optimization, and in particular to problems arising in applied statistics and machine learning. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization, two earlier approaches that we review in §2. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn's method of partial inverses, Dykstra's alternating projections method, Bregman iterative algorithms for $\ell_1$ problems in signal processing, proximal methods, and many others. The fact that it has been re-invented in different fields over the decades underscores the intuitive appeal of the approach.

It is worth emphasizing that the algorithm itself is not new, and that we do not present any new theoretical results. It was first introduced in the mid-1970s by Gabay, Mercier, Glowinski, and Marrocco, though similar ideas emerged as early as the mid-1950s. The algorithm was studied throughout the 1980s, and by the mid-1990s, almost all of the theoretical results mentioned here had been established. The fact that ADMM was developed so far in advance of the ready availability of large-scale distributed computing systems and massive optimization problems may account for why it is not as widely known today as we believe it should be.

The main contributions of this review can be summarized as follows:

(1) We provide a simple, cohesive discussion of the extensive literature in a way that emphasizes and unifies the aspects of primary importance in applications.
(2) We show, through a number of examples, that the algorithm is well suited for a wide variety of large-scale distributed modern problems. We derive methods for decomposing a wide class of statistical problems by training examples and by features, which is not easily accomplished in general.

(3) We place a greater emphasis on practical large-scale implementation than most previous references. In particular, we discuss the implementation of the algorithm in cloud computing environments using standard frameworks and provide easily readable implementations of many of our examples.

Throughout, the focus is on applications rather than theory, and a main goal is to provide the reader with a kind of toolbox that can be applied in many situations to derive and implement a distributed algorithm of practical use. Though the focus here is on parallelism, the algorithm can also be used serially, and it is interesting to note that with no tuning, ADMM can be competitive with the best known methods for some problems.

While we have emphasized applications that can be concisely explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period portfolio optimization, time series analysis, network flow, or scheduling.

Outline

We begin in §2 with a brief review of dual decomposition and the method of multipliers, two important precursors to ADMM. This section is intended mainly for background and can be skimmed. In §3, we present ADMM, including a basic convergence theorem, some variations on the basic version that are useful in practice, and a survey of some of the key literature. A complete convergence proof is given in appendix A.

In §4, we describe some general patterns that arise in applications of the algorithm, such as cases when one of the steps in ADMM can
be carried out particularly efficiently. These general patterns will recur throughout our examples. In §5, we consider the use of ADMM for some generic convex optimization problems, such as constrained minimization and linear and quadratic programming. In §6, we discuss a wide variety of problems involving the $\ell_1$ norm. It turns out that ADMM yields methods for these problems that are related to many state-of-the-art algorithms. This section also clarifies why ADMM is particularly well suited to machine learning problems. In §7, we present consensus and sharing problems, which provide general frameworks for distributed optimization. In §8, we consider distributed methods for generic model fitting problems, including regularized regression models like the lasso and classification models like support vector machines. In §9, we consider the use of ADMM as a heuristic for solving some nonconvex problems. In §10, we discuss some practical implementation details, including how to implement the algorithm in frameworks suitable for cloud computing applications. Finally, in §11, we present the details of some numerical experiments.
2 Precursors

In this section, we briefly review two optimization algorithms that are precursors to the alternating direction method of multipliers. While we will not use this material in the sequel, it provides some useful background and motivation.

2.1 Dual Ascent

Consider the equality-constrained convex optimization problem
$$\text{minimize } f(x) \quad \text{subject to } Ax = b, \qquad (2.1)$$
with variable $x \in \mathbf{R}^n$, where $A \in \mathbf{R}^{m \times n}$ and $f : \mathbf{R}^n \to \mathbf{R}$ is convex.

The Lagrangian for problem (2.1) is
$$L(x,y) = f(x) + y^T(Ax - b)$$
and the dual function is
$$g(y) = \inf_x L(x,y) = -f^*(-A^T y) - b^T y,$$
where $y$ is the dual variable or Lagrange multiplier, and $f^*$ is the convex conjugate of $f$; see [20, §3.3] or [140, §12] for background. The dual
problem is
$$\text{maximize } g(y),$$
with variable $y \in \mathbf{R}^m$. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point $x^\star$ from a dual optimal point $y^\star$ as
$$x^\star = \operatorname*{argmin}_x L(x, y^\star),$$
provided there is only one minimizer of $L(x, y^\star)$. (This is the case if, e.g., $f$ is strictly convex.) In the sequel, we will use the notation $\operatorname{argmin}_x F(x)$ to denote any minimizer of $F$, even when $F$ does not have a unique minimizer.

In the dual ascent method, we solve the dual problem using gradient ascent. Assuming that $g$ is differentiable, the gradient $\nabla g(y)$ can be evaluated as follows. We first find $x^+ = \operatorname{argmin}_x L(x,y)$; then we have $\nabla g(y) = Ax^+ - b$, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates
$$x^{k+1} := \operatorname*{argmin}_x L(x, y^k) \qquad (2.2)$$
$$y^{k+1} := y^k + \alpha^k (Ax^{k+1} - b), \qquad (2.3)$$
where $\alpha^k > 0$ is a step size, and the superscript is the iteration counter. The first step (2.2) is an $x$-minimization step, and the second step (2.3) is a dual variable update. The dual variable $y$ can be interpreted as a vector of prices, and the $y$-update is then called a price update or price adjustment step. This algorithm is called dual ascent since, with appropriate choice of $\alpha^k$, the dual function increases in each step, i.e., $g(y^{k+1}) > g(y^k)$.

The dual ascent method can be used even in some cases when $g$ is not differentiable. In this case, the residual $Ax^{k+1} - b$ is not the gradient of $g$, but the negative of a subgradient of $-g$. This case requires a different choice of the $\alpha^k$ than when $g$ is differentiable, and convergence is not monotone; it is often the case that $g(y^{k+1}) \not> g(y^k)$. In this case, the algorithm is usually called the dual subgradient method [152].

If $\alpha^k$ is chosen appropriately and several other assumptions hold, then $x^k$ converges to an optimal point and $y^k$ converges to an optimal
dual point. However, these assumptions do not hold in many applications, so dual ascent often cannot be used. As an example, if $f$ is a nonzero affine function of any component of $x$, then the $x$-update (2.2) fails, since $L$ is unbounded below in $x$ for most $y$.

2.2 Dual Decomposition

The major benefit of the dual ascent method is that it can lead to a decentralized algorithm in some cases. Suppose, for example, that the objective $f$ is separable (with respect to a partition or splitting of the variable into subvectors), meaning that
$$f(x) = \sum_{i=1}^N f_i(x_i),$$
where $x = (x_1, \ldots, x_N)$ and the variables $x_i \in \mathbf{R}^{n_i}$ are subvectors of $x$. Partitioning the matrix $A$ conformably as
$$A = [A_1 \; \cdots \; A_N],$$
so $Ax = \sum_{i=1}^N A_i x_i$, the Lagrangian can be written as
$$L(x,y) = \sum_{i=1}^N L_i(x_i, y) = \sum_{i=1}^N \left( f_i(x_i) + y^T A_i x_i - (1/N) y^T b \right),$$
which is also separable in $x$. This means that the $x$-minimization step (2.2) splits into $N$ separate problems that can be solved in parallel. Explicitly, the algorithm is
$$x_i^{k+1} := \operatorname*{argmin}_{x_i} L_i(x_i, y^k) \qquad (2.4)$$
$$y^{k+1} := y^k + \alpha^k (Ax^{k+1} - b). \qquad (2.5)$$
The $x$-minimization step (2.4) is carried out independently, in parallel, for each $i = 1, \ldots, N$. In this case, we refer to the dual ascent method as dual decomposition.

In the general case, each iteration of the dual decomposition method requires a broadcast and a gather operation. In the dual update step (2.5), the equality constraint residual contributions $A_i x_i^{k+1}$ are
collected (gathered) in order to compute the residual $Ax^{k+1} - b$. Once the (global) dual variable $y^{k+1}$ is computed, it must be distributed (broadcast) to the processors that carry out the $N$ individual $x_i$ minimization steps (2.4).

Dual decomposition is an old idea in optimization, and traces back at least to the early 1960s. Related ideas appear in well known work by Dantzig and Wolfe [44] and Benders [13] on large-scale linear programming, as well as in Dantzig's seminal book [43]. The general idea of dual decomposition appears to be originally due to Everett [69], and is explored in many early references [107, 84, 117, 14]. The use of nondifferentiable optimization, such as the subgradient method, to solve the dual problem is discussed by Shor [152]. Good references on dual methods and decomposition include the book by Bertsekas [16, chapter 6] and the survey by Nedić and Ozdaglar [131] on distributed optimization, which discusses dual decomposition methods and consensus problems. A number of papers also discuss variants on standard dual decomposition, such as [129].

More generally, decentralized optimization has been an active topic of research since the 1980s. For instance, Tsitsiklis and his co-authors worked on a number of decentralized detection and consensus problems involving the minimization of a smooth function $f$ known to multiple agents [160, 161, 17]. Some good reference books on parallel optimization include those by Bertsekas and Tsitsiklis [17] and Censor and Zenios [31]. There has also been some recent work on problems where each agent has its own convex, potentially nondifferentiable, objective function [130]. See [54] for a recent discussion of distributed methods for graph-structured optimization problems.

2.3 Augmented Lagrangians and the Method of Multipliers

Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method, and in particular, to yield convergence without assumptions like strict convexity or finiteness of $f$. The augmented Lagrangian for (2.1) is
$$L_\rho(x,y) = f(x) + y^T(Ax - b) + (\rho/2)\|Ax - b\|_2^2, \qquad (2.6)$$
where $\rho > 0$ is called the penalty parameter. (Note that $L_0$ is the standard Lagrangian for the problem.) The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem
$$\text{minimize } f(x) + (\rho/2)\|Ax - b\|_2^2 \quad \text{subject to } Ax = b.$$
This problem is clearly equivalent to the original problem (2.1), since for any feasible $x$ the term added to the objective is zero. The associated dual function is $g_\rho(y) = \inf_x L_\rho(x,y)$.

The benefit of including the penalty term is that $g_\rho$ can be shown to be differentiable under rather mild conditions on the original problem. The gradient of the augmented dual function is found the same way as with the ordinary Lagrangian, i.e., by minimizing over $x$, and then evaluating the resulting equality constraint residual. Applying dual ascent to the modified problem yields the algorithm
$$x^{k+1} := \operatorname*{argmin}_x L_\rho(x, y^k) \qquad (2.7)$$
$$y^{k+1} := y^k + \rho(Ax^{k+1} - b), \qquad (2.8)$$
which is known as the method of multipliers for solving (2.1). This is the same as standard dual ascent, except that the $x$-minimization step uses the augmented Lagrangian, and the penalty parameter $\rho$ is used as the step size $\alpha^k$. The method of multipliers converges under far more general conditions than dual ascent, including cases when $f$ takes on the value $+\infty$ or is not strictly convex.

It is easy to motivate the choice of the particular step size $\rho$ in the dual update (2.8). For simplicity, we assume here that $f$ is differentiable, though this is not required for the algorithm to work. The optimality conditions for (2.1) are primal and dual feasibility, i.e.,
$$Ax^\star - b = 0, \qquad \nabla f(x^\star) + A^T y^\star = 0,$$
respectively. By definition, $x^{k+1}$ minimizes $L_\rho(x, y^k)$, so
$$0 = \nabla_x L_\rho(x^{k+1}, y^k) = \nabla_x f(x^{k+1}) + A^T\left(y^k + \rho(Ax^{k+1} - b)\right) = \nabla_x f(x^{k+1}) + A^T y^{k+1}.$$
We see that by using $\rho$ as the step size in the dual update, the iterate $(x^{k+1}, y^{k+1})$ is dual feasible. As the method of multipliers proceeds, the primal residual $Ax^{k+1} - b$ converges to zero, yielding optimality.

The greatly improved convergence properties of the method of multipliers over dual ascent comes at a cost. When $f$ is separable, the augmented Lagrangian $L_\rho$ is not separable, so the $x$-minimization step (2.7) cannot be carried out separately in parallel for each $x_i$. This means that the basic method of multipliers cannot be used for decomposition. We will see how to address this issue next.

Augmented Lagrangians and the method of multipliers for constrained optimization were first proposed in the late 1960s by Hestenes [97, 98] and Powell [138]. Many of the early numerical experiments on the method of multipliers are due to Miele et al. [124, 125, 126]. Much of the early work is consolidated in a monograph by Bertsekas [15], who also discusses similarities to older approaches using Lagrangians and penalty functions [6, 5, 71], as well as a number of generalizations.
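To make the method of multipliers concrete, the following is a minimal NumPy sketch of the iteration (2.7)-(2.8) for the special case where $f$ is a convex quadratic, so that the $x$-minimization of the augmented Lagrangian reduces to a linear solve. The quadratic form of $f$, the random test data, and the fixed iteration count are illustrative assumptions and not part of the text.

```python
import numpy as np

def method_of_multipliers(P, q, A, b, rho=1.0, iters=100):
    """Method of multipliers (2.7)-(2.8) for
       minimize (1/2) x^T P x + q^T x  subject to Ax = b,
    where the x-minimization of L_rho is an explicit linear solve."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        # x-update: grad of L_rho is Px + q + A^T y + rho*A^T(Ax - b) = 0
        x = np.linalg.solve(P + rho * A.T @ A, -(q + A.T @ y - rho * A.T @ b))
        # dual update with step size rho (2.8)
        y = y + rho * (A @ x - b)
    return x, y

# Small illustrative instance (random data, for demonstration only)
rng = np.random.default_rng(0)
n, m = 10, 4
M = rng.standard_normal((n, n)); P = M.T @ M + np.eye(n)
q = rng.standard_normal(n)
A = rng.standard_normal((m, n)); b = rng.standard_normal(m)
x, y = method_of_multipliers(P, q, A, b)
print(np.linalg.norm(A @ x - b))   # primal residual should be small
```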
3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems in the form
$$\text{minimize } f(x) + g(z) \quad \text{subject to } Ax + Bz = c \qquad (3.1)$$
with variables $x \in \mathbf{R}^n$ and $z \in \mathbf{R}^m$, where $A \in \mathbf{R}^{p \times n}$, $B \in \mathbf{R}^{p \times m}$, and $c \in \mathbf{R}^p$. We will assume that $f$ and $g$ are convex; more specific assumptions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called $x$ there, has been split into two parts, called $x$ and $z$ here, with the objective function separable across this splitting. The optimal value of the problem (3.1) will be denoted by
$$p^\star = \inf\{f(x) + g(z) \mid Ax + Bz = c\}.$$
As in the method of multipliers, we form the augmented Lagrangian
$$L_\rho(x,z,y) = f(x) + g(z) + y^T(Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2.$$
ADMM consists of the iterations
$$x^{k+1} := \operatorname*{argmin}_x L_\rho(x, z^k, y^k) \qquad (3.2)$$
$$z^{k+1} := \operatorname*{argmin}_z L_\rho(x^{k+1}, z, y^k) \qquad (3.3)$$
$$y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c), \qquad (3.4)$$
where $\rho > 0$. The algorithm is very similar to dual ascent and the method of multipliers: it consists of an $x$-minimization step (3.2), a $z$-minimization step (3.3), and a dual variable update (3.4). As in the method of multipliers, the dual variable update uses a step size equal to the augmented Lagrangian parameter $\rho$.

The method of multipliers for (3.1) has the form
$$(x^{k+1}, z^{k+1}) := \operatorname*{argmin}_{x,z} L_\rho(x, z, y^k)$$
$$y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c).$$
Here the augmented Lagrangian is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, $x$ and $z$ are updated in an alternating or sequential fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers where a single Gauss-Seidel pass [90, §10.1] over $x$ and $z$ is used instead of the usual joint minimization. Separating the minimization over $x$ and $z$ into two steps is precisely what allows for decomposition when $f$ or $g$ are separable.

The algorithm state in ADMM consists of $z^k$ and $y^k$. In other words, $(z^{k+1}, y^{k+1})$ is a function of $(z^k, y^k)$. The variable $x^k$ is not part of the state; it is an intermediate result computed from the previous state $(z^{k-1}, y^{k-1})$.

If we switch (re-label) $x$ and $z$, $f$ and $g$, and $A$ and $B$ in the problem (3.1), we obtain a variation on ADMM with the order of the $x$-update step (3.2) and $z$-update step (3.3) reversed. The roles of $x$ and $z$ are almost symmetric, but not quite, since the dual update is done after the $z$-update but before the $x$-update.
Scaled Form

ADMM can be written in a slightly different form, which is often more convenient, by combining the linear and quadratic terms in the augmented Lagrangian and scaling the dual variable. Defining the residual $r = Ax + Bz - c$, we have
$$y^T r + (\rho/2)\|r\|_2^2 = (\rho/2)\|r + (1/\rho)y\|_2^2 - (1/2\rho)\|y\|_2^2 = (\rho/2)\|r + u\|_2^2 - (\rho/2)\|u\|_2^2,$$
where $u = (1/\rho)y$ is the scaled dual variable. Using the scaled dual variable, we can express ADMM as
$$x^{k+1} := \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|Ax + Bz^k - c + u^k\|_2^2 \right) \qquad (3.5)$$
$$z^{k+1} := \operatorname*{argmin}_z \left( g(z) + (\rho/2)\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right) \qquad (3.6)$$
$$u^{k+1} := u^k + Ax^{k+1} + Bz^{k+1} - c. \qquad (3.7)$$
Defining the residual at iteration $k$ as $r^k = Ax^k + Bz^k - c$, we see that
$$u^k = u^0 + \sum_{j=1}^k r^j,$$
the running sum of the residuals.

We call the first form of ADMM above, given by (3.2)-(3.4), the unscaled form, and the second form, (3.5)-(3.7), the scaled form, since it is expressed in terms of a scaled version of the dual variable. The two are clearly equivalent, but the formulas in the scaled form of ADMM are often shorter than in the unscaled form, so we will use the scaled form in the sequel. We will use the unscaled form when we wish to emphasize the role of the dual variable or to give an interpretation that relies on the (unscaled) dual variable.

3.2 Convergence

There are many convergence results for ADMM discussed in the literature; here, we limit ourselves to a basic but still very general result that applies to all of the examples we will consider. We will make one
assumption about the functions $f$ and $g$, and one assumption about problem (3.1).

Assumption 1. The (extended-real-valued) functions $f : \mathbf{R}^n \to \mathbf{R} \cup \{+\infty\}$ and $g : \mathbf{R}^m \to \mathbf{R} \cup \{+\infty\}$ are closed, proper, and convex.

This assumption can be expressed compactly using the epigraphs of the functions: the function $f$ satisfies assumption 1 if and only if its epigraph
$$\operatorname{epi} f = \{(x,t) \in \mathbf{R}^n \times \mathbf{R} \mid f(x) \le t\}$$
is a closed nonempty convex set.

Assumption 1 implies that the subproblems arising in the $x$-update (3.2) and $z$-update (3.3) are solvable, i.e., there exist $x$ and $z$, not necessarily unique (without further assumptions on $A$ and $B$), that minimize the augmented Lagrangian. It is important to note that assumption 1 allows $f$ and $g$ to be nondifferentiable and to assume the value $+\infty$. For example, we can take $f$ to be the indicator function of a closed nonempty convex set $C$, i.e., $f(x) = 0$ for $x \in C$ and $f(x) = +\infty$ otherwise. In this case, the $x$-minimization step (3.2) will involve solving a constrained quadratic program over $C$, the effective domain of $f$.

Assumption 2. The unaugmented Lagrangian $L_0$ has a saddle point. Explicitly, there exist $(x^\star, z^\star, y^\star)$, not necessarily unique, for which
$$L_0(x^\star, z^\star, y) \le L_0(x^\star, z^\star, y^\star) \le L_0(x, z, y^\star)$$
holds for all $x$, $z$, $y$.

By assumption 1, it follows that $L_0(x^\star, z^\star, y^\star)$ is finite for any saddle point $(x^\star, z^\star, y^\star)$. This implies that $(x^\star, z^\star)$ is a solution to (3.1), so $Ax^\star + Bz^\star = c$ and $f(x^\star) < \infty$, $g(z^\star) < \infty$. It also implies that $y^\star$ is dual optimal, and the optimal values of the primal and dual problems are equal, i.e., that strong duality holds. Note that we make no assumptions about $A$, $B$, or $c$, except implicitly through assumption 2; in particular, neither $A$ nor $B$ is required to be full rank.
Convergence

Under assumptions 1 and 2, the ADMM iterates satisfy the following:

Residual convergence. $r^k \to 0$ as $k \to \infty$, i.e., the iterates approach feasibility.

Objective convergence. $f(x^k) + g(z^k) \to p^\star$ as $k \to \infty$, i.e., the objective function of the iterates approaches the optimal value.

Dual variable convergence. $y^k \to y^\star$ as $k \to \infty$, where $y^\star$ is a dual optimal point.

A proof of the residual and objective convergence results is given in appendix A. Note that $x^k$ and $z^k$ need not converge to optimal values, although such results can be shown under additional assumptions.

Convergence in Practice

Simple examples show that ADMM can be very slow to converge to high accuracy. However, it is often the case that ADMM converges to modest accuracy, sufficient for many applications, within a few tens of iterations. This behavior makes ADMM similar to algorithms like the conjugate gradient method, for example, in that a few tens of iterations will often produce acceptable results of practical use. However, the slow convergence of ADMM also distinguishes it from algorithms such as Newton's method (or, for constrained problems, interior-point methods), where high accuracy can be attained in a reasonable amount of time.

While in some cases it is possible to combine ADMM with a method for producing a high accuracy solution from a low accuracy solution [64], in the general case ADMM will be practically useful mostly in cases when modest accuracy is sufficient. Fortunately, this is usually the case for the kinds of large-scale problems we consider. Also, in the case of statistical and machine learning problems, solving a parameter estimation problem to very high accuracy often yields little to no improvement in actual prediction performance, the real metric of interest in applications.
3.3 Optimality Conditions and Stopping Criterion

The necessary and sufficient optimality conditions for the ADMM problem (3.1) are primal feasibility,
$$Ax^\star + Bz^\star - c = 0, \qquad (3.8)$$
and dual feasibility,
$$0 \in \partial f(x^\star) + A^T y^\star \qquad (3.9)$$
$$0 \in \partial g(z^\star) + B^T y^\star. \qquad (3.10)$$
Here, $\partial$ denotes the subdifferential operator; see, e.g., [140, 19, 99]. (When $f$ and $g$ are differentiable, the subdifferentials $\partial f$ and $\partial g$ can be replaced by the gradients $\nabla f$ and $\nabla g$, and $\in$ can be replaced by $=$.)

Since $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$ by definition, we have that
$$0 \in \partial g(z^{k+1}) + B^T y^k + \rho B^T (Ax^{k+1} + Bz^{k+1} - c) = \partial g(z^{k+1}) + B^T y^k + \rho B^T r^{k+1} = \partial g(z^{k+1}) + B^T y^{k+1}.$$
This means that $z^{k+1}$ and $y^{k+1}$ always satisfy (3.10), so attaining optimality comes down to satisfying (3.8) and (3.9). This phenomenon is analogous to the iterates of the method of multipliers always being dual feasible; see §2.3.

Since $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$ by definition, we have that
$$0 \in \partial f(x^{k+1}) + A^T y^k + \rho A^T (Ax^{k+1} + Bz^k - c) = \partial f(x^{k+1}) + A^T\left(y^k + \rho r^{k+1} + \rho B(z^k - z^{k+1})\right) = \partial f(x^{k+1}) + A^T y^{k+1} + \rho A^T B(z^k - z^{k+1}),$$
or equivalently,
$$\rho A^T B(z^{k+1} - z^k) \in \partial f(x^{k+1}) + A^T y^{k+1}.$$
This means that the quantity
$$s^{k+1} = \rho A^T B(z^{k+1} - z^k)$$
can be viewed as a residual for the dual feasibility condition (3.9). We will refer to $s^{k+1}$ as the dual residual at iteration $k+1$, and to $r^{k+1} = Ax^{k+1} + Bz^{k+1} - c$ as the primal residual at iteration $k+1$.
In summary, the optimality conditions for the ADMM problem consist of three conditions, (3.8)-(3.10). The last condition (3.10) always holds for $(x^{k+1}, z^{k+1}, y^{k+1})$; the residuals for the other two, (3.8) and (3.9), are the primal and dual residuals $r^{k+1}$ and $s^{k+1}$, respectively. These two residuals converge to zero as ADMM proceeds. (In fact, the convergence proof in appendix A shows $B(z^{k+1} - z^k)$ converges to zero, which implies $s^k$ converges to zero.)

Stopping Criteria

The residuals of the optimality conditions can be related to a bound on the objective suboptimality of the current point, i.e., $f(x^k) + g(z^k) - p^\star$. As shown in the convergence proof in appendix A, we have
$$f(x^k) + g(z^k) - p^\star \le -(y^k)^T r^k + (x^k - x^\star)^T s^k. \qquad (3.11)$$
This shows that when the residuals $r^k$ and $s^k$ are small, the objective suboptimality also must be small. We cannot use this inequality directly in a stopping criterion, however, since we do not know $x^\star$. But if we guess or estimate that $\|x^k - x^\star\|_2 \le d$, we have that
$$f(x^k) + g(z^k) - p^\star \le -(y^k)^T r^k + d\|s^k\|_2 \le \|y^k\|_2\|r^k\|_2 + d\|s^k\|_2.$$
The middle or righthand terms can be used as an approximate bound on the objective suboptimality (which depends on our guess of $d$).

This suggests that a reasonable termination criterion is that the primal and dual residuals must be small, i.e.,
$$\|r^k\|_2 \le \epsilon^{\mathrm{pri}} \quad \text{and} \quad \|s^k\|_2 \le \epsilon^{\mathrm{dual}}, \qquad (3.12)$$
where $\epsilon^{\mathrm{pri}} > 0$ and $\epsilon^{\mathrm{dual}} > 0$ are feasibility tolerances for the primal and dual feasibility conditions (3.8) and (3.9), respectively. These tolerances can be chosen using an absolute and relative criterion, such as
$$\epsilon^{\mathrm{pri}} = \sqrt{p}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \max\{\|Ax^k\|_2, \|Bz^k\|_2, \|c\|_2\}, \qquad \epsilon^{\mathrm{dual}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}}\|A^T y^k\|_2,$$
where $\epsilon^{\mathrm{abs}} > 0$ is an absolute tolerance and $\epsilon^{\mathrm{rel}} > 0$ is a relative tolerance. (The factors $\sqrt{p}$ and $\sqrt{n}$ account for the fact that the $\ell_2$ norms are in $\mathbf{R}^p$ and $\mathbf{R}^n$, respectively.) A reasonable value for the relative stopping
criterion might be $\epsilon^{\mathrm{rel}} = 10^{-3}$ or $10^{-4}$, depending on the application. The choice of absolute stopping criterion depends on the scale of the typical variable values.

3.4 Extensions and Variations

Many variations on the classic ADMM algorithm have been explored in the literature. Here we briefly survey some of these variants, organized into groups of related ideas. Some of these methods can give superior convergence in practice compared to the standard ADMM presented above. Most of the extensions have been rigorously analyzed, so the convergence results described above are still valid (in some cases, under some additional conditions).

Varying Penalty Parameter

A standard extension is to use possibly different penalty parameters $\rho^k$ for each iteration, with the goal of improving the convergence in practice, as well as making performance less dependent on the initial choice of the penalty parameter. In the context of the method of multipliers, this approach is analyzed in [142], where it is shown that superlinear convergence may be achieved if $\rho^k \to \infty$. Though it can be difficult to prove the convergence of ADMM when $\rho$ varies by iteration, the fixed-$\rho$ theory still applies if one just assumes that $\rho$ becomes fixed after a finite number of iterations.

A simple scheme that often works well is (see, e.g., [96, 169]):
$$\rho^{k+1} := \begin{cases} \tau^{\mathrm{incr}}\rho^k & \text{if } \|r^k\|_2 > \mu\|s^k\|_2 \\ \rho^k/\tau^{\mathrm{decr}} & \text{if } \|s^k\|_2 > \mu\|r^k\|_2 \\ \rho^k & \text{otherwise}, \end{cases} \qquad (3.13)$$
where $\mu > 1$, $\tau^{\mathrm{incr}} > 1$, and $\tau^{\mathrm{decr}} > 1$ are parameters. Typical choices might be $\mu = 10$ and $\tau^{\mathrm{incr}} = \tau^{\mathrm{decr}} = 2$. The idea behind this penalty parameter update is to try to keep the primal and dual residual norms within a factor of $\mu$ of one another as they both converge to zero.

The ADMM update equations suggest that large values of $\rho$ place a large penalty on violations of primal feasibility and so tend to produce
small primal residuals. Conversely, the definition of $s^{k+1}$ suggests that small values of $\rho$ tend to reduce the dual residual, but at the expense of reducing the penalty on primal feasibility, which may result in a larger primal residual. The adjustment scheme (3.13) inflates $\rho$ by $\tau^{\mathrm{incr}}$ when the primal residual appears large compared to the dual residual, and deflates $\rho$ by $\tau^{\mathrm{decr}}$ when the primal residual seems too small relative to the dual residual. This scheme may also be refined by taking into account the relative magnitudes of $\epsilon^{\mathrm{pri}}$ and $\epsilon^{\mathrm{dual}}$.

When a varying penalty parameter is used in the scaled form of ADMM, the scaled dual variable $u^k = (1/\rho)y^k$ must also be rescaled after updating $\rho$; for example, if $\rho$ is halved, $u^k$ should be doubled before proceeding.

More General Augmenting Terms

Another idea is to allow for a different penalty parameter for each constraint, or more generally, to replace the quadratic term $(\rho/2)\|r\|_2^2$ with $(1/2)r^T P r$, where $P$ is a symmetric positive definite matrix. When $P$ is constant, we can interpret this generalized version of ADMM as standard ADMM applied to a modified initial problem with the equality constraints $r = 0$ replaced with $Fr = 0$, where $F^T F = P$.

Over-relaxation

In the $z$- and $y$-updates, the quantity $Ax^{k+1}$ can be replaced with
$$\alpha^k Ax^{k+1} - (1 - \alpha^k)(Bz^k - c),$$
where $\alpha^k \in (0,2)$ is a relaxation parameter; when $\alpha^k > 1$, this technique is called over-relaxation, and when $\alpha^k < 1$, it is called under-relaxation. This scheme is analyzed in [63], and experiments in [59, 64] suggest that over-relaxation with $\alpha^k \in [1.5, 1.8]$ can improve convergence.

Inexact Minimization

ADMM will converge even when the $x$- and $z$-minimization steps are not carried out exactly, provided certain suboptimality measures
in the minimizations satisfy an appropriate condition, such as being summable. This result is due to Eckstein and Bertsekas [63], building on earlier results by Gol'shtein and Tret'yakov [89]. This technique is important in situations where the $x$- or $z$-updates are carried out using an iterative method; it allows us to solve the minimizations only approximately at first, and then more accurately as the iterations progress.

Update Ordering

Several variations on ADMM involve performing the $x$-, $z$-, and $y$-updates in varying orders or multiple times. For example, we can divide the variables into $k$ blocks, and update each of them in turn, possibly multiple times, before performing each dual variable update; see, e.g., [146]. Carrying out multiple $x$- and $z$-updates before the $y$-update can be interpreted as executing multiple Gauss-Seidel passes instead of just one; if many sweeps are carried out before each dual update, the resulting algorithm is very close to the standard method of multipliers [17, §3.4.4]. Another variation is to perform an additional $y$-update between the $x$- and $z$-update, with half the step length [17].

Related Algorithms

There are also a number of other algorithms distinct from but inspired by ADMM. For instance, Fukushima [80] applies ADMM to a dual problem formulation, yielding a 'dual ADMM' algorithm, which is shown in [65] to be equivalent to the 'primal Douglas-Rachford' method discussed in [57, §3.5.6]. As another example, Zhu et al. [183] discuss variations of distributed ADMM (discussed in §7, §8, and §10) that can cope with various complicating factors, such as noise in the messages exchanged for the updates, or asynchronous updates, which can be useful in a regime when some processors or subsystems randomly fail. There are also algorithms resembling a combination of ADMM and the proximal method of multipliers [141], rather than the standard method of multipliers; see, e.g., [33, 60]. Other representative publications include [62, 143, 59, 147, 158, 159, 42].
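To summarize the algorithmic ingredients so far, the following is a schematic Python sketch of a scaled-form ADMM loop that combines the updates (3.5)-(3.7), the stopping criterion (3.12), and the residual-balancing penalty update (3.13). The oracle signatures, parameter defaults, and function name are assumptions made for illustration; this is a sketch, not a reference implementation.

```python
import numpy as np

def admm(x_update, z_update, A, B, c, rho=1.0, eps_abs=1e-4, eps_rel=1e-3,
         mu=10.0, tau=2.0, max_iter=1000):
    """Scaled-form ADMM (3.5)-(3.7) with the stopping rule (3.12) and the
    residual-balancing penalty update (3.13).

    x_update(v, rho) must return argmin_x f(x) + (rho/2)*||A x - v||_2^2,
    z_update(w, rho) must return argmin_z g(z) + (rho/2)*||B z - w||_2^2,
    matching the convention v = -Bz + c - u used later in Section 4.
    """
    p, n = A.shape
    m = B.shape[1]
    x, z, u = np.zeros(n), np.zeros(m), np.zeros(p)
    for _ in range(max_iter):
        x = x_update(c - B @ z - u, rho)
        z_old = z
        z = z_update(c - A @ x - u, rho)
        r = A @ x + B @ z - c                      # primal residual r^k
        u = u + r                                  # scaled dual update (3.7)
        s = rho * A.T @ (B @ (z - z_old))          # dual residual s^k
        eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(
            np.linalg.norm(A @ x), np.linalg.norm(B @ z), np.linalg.norm(c))
        eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * A.T @ u)
        if np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual:
            break
        # keep the primal and dual residual norms within a factor mu of each other
        if np.linalg.norm(r) > mu * np.linalg.norm(s):
            rho *= tau; u /= tau                   # rescale u = y/rho when rho changes
        elif np.linalg.norm(s) > mu * np.linalg.norm(r):
            rho /= tau; u *= tau
    return x, z, rho * u                           # rho*u is the unscaled dual variable y
```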
3.5 Notes and References

ADMM was originally proposed in the mid-1970s by Glowinski and Marrocco [86] and Gabay and Mercier [82]. There are a number of other important papers analyzing the properties of the algorithm, including [76, 81, 75, 87, 157, 80, 65, 33]. In particular, the convergence of ADMM has been explored by many authors, including Gabay [81] and Eckstein and Bertsekas [63].

ADMM has also been applied to a number of statistical problems, such as constrained sparse regression [18], sparse signal recovery [70], image restoration and denoising [72, 154, 134], trace norm regularized least squares minimization [174], sparse inverse covariance selection [178], the Dantzig selector [116], and support vector machines [74], among others. For examples in signal processing, see [42, 40, 41, 150, 149] and the references therein.

Many papers analyzing ADMM do so from the perspective of maximal monotone operators [23, 141, 142, 63, 144]. Briefly, a wide variety of problems can be posed as finding a zero of a maximal monotone operator; for example, if $f$ is closed, proper, and convex, then the subdifferential operator $\partial f$ is maximal monotone, and finding a zero of $\partial f$ is simply minimization of $f$; such a minimization may implicitly contain constraints if $f$ is allowed to take the value $+\infty$. Rockafellar's proximal point algorithm [142] is a general method for finding a zero of a maximal monotone operator, and a wide variety of algorithms have been shown to be special cases, including proximal minimization (see §4.1), the method of multipliers, and ADMM. For a more detailed review of the older literature, see [57, §2].

The method of multipliers was shown to be a special case of the proximal point algorithm by Rockafellar [141]. Gabay [81] showed that ADMM is a special case of a method called Douglas-Rachford splitting for monotone operators [53, 112], and Eckstein and Bertsekas [63] showed in turn that Douglas-Rachford splitting is a special case of the proximal point algorithm. (The variant of ADMM that performs an extra $y$-update between the $x$- and $z$-updates is equivalent to Peaceman-Rachford splitting [137, 112] instead, as shown by Glowinski and Le Tallec [87].) Using the same framework, Eckstein
and Bertsekas [63] also showed the relationships between a number of other algorithms, such as Spingarn's method of partial inverses [153]. Lawrence and Spingarn [108] develop an alternative framework showing that Douglas-Rachford splitting, hence ADMM, is a special case of the proximal point algorithm; Eckstein and Ferris [64] offer a more recent discussion explaining this approach.

The major importance of these results is that they allow the powerful convergence theory for the proximal point algorithm to apply directly to ADMM and other methods, and show that many of these algorithms are essentially identical. (But note that our proof of convergence of the basic ADMM algorithm, given in appendix A, is self-contained and does not rely on this abstract machinery.) Research on operator splitting methods and their relation to decomposition algorithms continues to this day [66, 67].

A considerable body of recent research considers replacing the quadratic penalty term in the standard method of multipliers with a more general deviation penalty, such as one derived from a Bregman divergence [30, 58]; see [22] for background material. Unfortunately, these generalizations do not appear to carry over in a straightforward manner from non-decomposition augmented Lagrangian methods to ADMM: there is currently no proof of convergence known for ADMM with nonquadratic penalty terms.
4 General Patterns

Structure in $f$, $g$, $A$, and $B$ can often be exploited to carry out the $x$- and $z$-updates more efficiently. Here we consider three general cases that we will encounter repeatedly in the sequel: quadratic objective terms, separable objective and constraints, and smooth objective terms. Our discussion will be written for the $x$-update but applies to the $z$-update by symmetry. We express the $x$-update step as
$$x^+ = \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|Ax - v\|_2^2 \right),$$
where $v = -Bz + c - u$ is a known constant vector for the purposes of the $x$-update.

4.1 Proximity Operator

First, consider the simple case where $A = I$, which appears frequently in the examples. Then the $x$-update is
$$x^+ = \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|x - v\|_2^2 \right).$$
As a function of $v$, the righthand side is denoted $\operatorname{prox}_{f,\rho}(v)$ and is called the proximity operator of $f$ with penalty $\rho$ [127]. In variational
analysis,
$$\tilde f(v) = \inf_x \left( f(x) + (\rho/2)\|x - v\|_2^2 \right)$$
is known as the Moreau envelope or Moreau-Yosida regularization of $f$, and is connected to the theory of the proximal point algorithm [144]. The $x$-minimization in the proximity operator is generally referred to as proximal minimization. While these observations do not by themselves allow us to improve the efficiency of ADMM, they do tie the $x$-minimization step to other well known ideas.

When the function $f$ is simple enough, the $x$-update (i.e., the proximity operator) can be evaluated analytically; see [41] for many examples. For instance, if $f$ is the indicator function of a closed nonempty convex set $C$, then the $x$-update is
$$x^+ = \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|x - v\|_2^2 \right) = \Pi_C(v),$$
where $\Pi_C$ denotes projection (in the Euclidean norm) onto $C$. This holds independently of the choice of $\rho$. As an example, if $f$ is the indicator function of the nonnegative orthant $\mathbf{R}^n_+$, we have $x^+ = (v)_+$, the vector obtained by taking the nonnegative part of each component of $v$.

4.2 Quadratic Objective Terms

Suppose $f$ is given by the (convex) quadratic function
$$f(x) = (1/2)x^T P x + q^T x + r,$$
where $P \in \mathbf{S}^n_+$, the set of symmetric positive semidefinite $n \times n$ matrices. This includes the cases when $f$ is linear or constant, by setting $P$, or both $P$ and $q$, to zero. Then, assuming $P + \rho A^T A$ is invertible, $x^+$ is an affine function of $v$ given by
$$x^+ = (P + \rho A^T A)^{-1}(\rho A^T v - q). \qquad (4.1)$$
In other words, computing the $x$-update amounts to solving a linear system with positive definite coefficient matrix $P + \rho A^T A$ and righthand side $\rho A^T v - q$. As we show below, an appropriate use of numerical linear algebra can exploit this fact and substantially improve performance. For general background on numerical linear algebra, see [47] or [90]; see [20, appendix C] for a short overview of direct methods.
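As a concrete sketch of the update (4.1), the snippet below assumes SciPy is available and that $P + \rho A^T A$ is positive definite; the function names are ours. Splitting the factorization from the back-solve anticipates the caching discussion below: the expensive `cho_factor` call can be done once and reused for as long as $\rho$ stays fixed.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def factor(P, A, rho):
    """Factor the coefficient matrix P + rho*A^T A once (O(n^3) for dense data)."""
    return cho_factor(P + rho * (A.T @ A))

def quadratic_x_update(F, A, q, v, rho):
    """x-update (4.1): x+ = (P + rho*A^T A)^{-1} (rho*A^T v - q),
    computed as a cheap O(n^2) back-solve against the cached factorization F."""
    return cho_solve(F, rho * (A.T @ v) - q)

# Illustrative use inside an ADMM loop (data and rho assumed fixed):
# F = factor(P, A, rho)
# for k in range(max_iter):
#     ...
#     x = quadratic_x_update(F, A, q, v, rho)
```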
Direct Methods

We assume here that a direct method is used to carry out the $x$-update; the case when an iterative method is used is discussed in §4.3. Direct methods for solving a linear system $Fx = g$ are based on first factoring $F = F_1 F_2 \cdots F_k$ into a product of simpler matrices, and then computing $x = F^{-1}g$ by solving a sequence of problems of the form $F_i z_i = z_{i-1}$, where $z_1 = F_1^{-1}g$ and $x = z_k$. The solve step is sometimes also called a back-solve. The computational cost of factorization and back-solve operations depends on the sparsity pattern and other properties of $F$. The cost of solving $Fx = g$ is the sum of the cost of factoring $F$ and the cost of the back-solve.

In our case, the coefficient matrix is $F = P + \rho A^T A$ and the righthand side is $g = \rho A^T v - q$, where $P \in \mathbf{S}^n_+$ and $A \in \mathbf{R}^{p \times n}$. Suppose we exploit no structure in $A$ or $P$, i.e., we use generic methods that work for any matrix. We can form $F = P + \rho A^T A$ at a cost of $O(pn^2)$ flops (floating point operations). We then carry out a Cholesky factorization of $F$ at a cost of $O(n^3)$ flops; the back-solve cost is $O(n^2)$. (The cost of forming $g$ is negligible compared to the costs listed above.) When $p$ is on the order of, or more than $n$, the overall cost is $O(pn^2)$. (When $p$ is less than $n$ in order, the matrix inversion lemma described below can be used to carry out the update in $O(p^2 n)$ flops.)

Exploiting Sparsity

When $A$ and $P$ are such that $F$ is sparse (i.e., has enough zero entries to be worth exploiting), much more efficient factorization and back-solve routines can be employed. As an extreme case, if $P$ and $A$ are diagonal $n \times n$ matrices, then both the factor and solve costs are $O(n)$. If $P$ and $A$ are banded, then so is $F$. If $F$ is banded with bandwidth $k$, the factorization cost is $O(nk^2)$ and the back-solve cost is $O(nk)$. In this case, the $x$-update can be carried out at a cost $O(nk^2)$, plus the cost of forming $F$. The same approach works when $P + \rho A^T A$ has a more general sparsity pattern; in this case, a permuted Cholesky factorization can be used, with permutations chosen to reduce fill-in.
Caching Factorizations

Now suppose we need to solve multiple linear systems, say, $Fx^{(i)} = g^{(i)}$, $i = 1, \ldots, N$, with the same coefficient matrix but different righthand sides. This occurs in ADMM when the parameter $\rho$ is not changed. In this case, the factorization of the coefficient matrix $F$ can be computed once and then back-solves can be carried out for each righthand side. If $t$ is the factorization cost and $s$ is the back-solve cost, then the total cost becomes $t + Ns$ instead of $N(t + s)$, which would be the cost if we were to factor $F$ each iteration. As long as $\rho$ does not change, we can factor $P + \rho A^T A$ once, and then use this cached factorization in subsequent solve steps. For example, if we do not exploit any structure and use the standard Cholesky factorization, the $x$-update steps can be carried out a factor $n$ more efficiently than a naive implementation, in which we solve the equations from scratch in each iteration.

When structure is exploited, the ratio between $t$ and $s$ is typically less than $n$ but often still significant, so here too there are performance gains. However, in this case, there is less benefit to $\rho$ not changing, so we can weigh the benefit of varying $\rho$ against the benefit of not recomputing the factorization of $P + \rho A^T A$. In general, an implementation should cache the factorization of $P + \rho A^T A$ and then only recompute it if and when $\rho$ changes.

Matrix Inversion Lemma

We can also exploit structure using the matrix inversion lemma, which states that the identity
$$(P + \rho A^T A)^{-1} = P^{-1} - \rho P^{-1} A^T (I + \rho A P^{-1} A^T)^{-1} A P^{-1}$$
holds when all the inverses exist. This means that if linear systems with coefficient matrix $P$ can be solved efficiently, and $p$ is small, or at least no larger than $n$ in order, then the $x$-update can be computed efficiently as well. The same trick as above can also be used to obtain an efficient method for computing multiple updates: the factorization of $I + \rho A P^{-1} A^T$ can be cached and cheaper back-solves can be used in computing the updates.
As an example, suppose that $P$ is diagonal and that $p \le n$. A naive method for computing the update costs $O(n^3)$ flops in the first iteration and $O(n^2)$ flops in subsequent iterations, if we store the factors of $P + \rho A^T A$. Using the matrix inversion lemma (i.e., using the righthand side above) to compute the $x$-update costs $O(np^2)$ flops, an improvement by a factor of $(n/p)^2$ over the naive method. In this case, the dominant cost is forming $AP^{-1}A^T$. The factors of $I + \rho AP^{-1}A^T$ can be saved after the first update, so subsequent iterations can be carried out at cost $O(np)$ flops, a savings of a factor of $p$ over the first update.

Using the matrix inversion lemma to compute $x^+$ can also make it less costly to vary $\rho$ in each iteration. When $P$ is diagonal, for example, we can compute $AP^{-1}A^T$ once, and then form and factor $I + \rho^k AP^{-1}A^T$ in iteration $k$ at a cost of $O(p^3)$ flops. In other words, the update costs an additional $O(np)$ flops, so if $p^2$ is less than or equal to $n$ in order, there is no additional cost (in order) to carrying out updates with $\rho$ varying in each iteration.

Quadratic Function Restricted to an Affine Set

The same comments hold for the slightly more complex case of a convex quadratic function restricted to an affine set:
$$f(x) = (1/2)x^T P x + q^T x + r, \qquad \operatorname{dom} f = \{x \mid Fx = g\}.$$
Here, $x^+$ is still an affine function of $v$, and the update involves solving the KKT (Karush-Kuhn-Tucker) system
$$\begin{bmatrix} P + \rho I & F^T \\ F & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ \nu \end{bmatrix} + \begin{bmatrix} q - \rho(z^k - u^k) \\ -g \end{bmatrix} = 0.$$
All of the comments above hold here as well: factorizations can be cached to carry out additional updates more efficiently, and structure in the matrices can be exploited to improve the efficiency of the factorization and back-solve steps.
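Returning to the matrix inversion lemma of the previous subsection, here is a sketch of the update (4.1) computed that way, under the additional assumptions that $P$ is diagonal with positive entries and $p \ll n$; the helper name and the use of SciPy's Cholesky routines are ours.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def x_update_inv_lemma(p_diag, A, q, v, rho):
    """x-update (4.1) for diagonal P via the matrix inversion lemma:
    (P + rho A^T A)^{-1} = P^{-1} - rho P^{-1} A^T (I + rho A P^{-1} A^T)^{-1} A P^{-1}.
    The dominant cost is forming A P^{-1} A^T, O(n p^2) flops for p << n."""
    Pinv = 1.0 / p_diag                                   # diagonal inverse, kept as a vector
    g = rho * (A.T @ v) - q                               # righthand side of (4.1)
    small = np.eye(A.shape[0]) + rho * (A * Pinv) @ A.T   # I + rho A P^{-1} A^T  (p x p)
    F = cho_factor(small)                                 # factor can be cached across updates
    t = cho_solve(F, A @ (Pinv * g))
    return Pinv * g - rho * Pinv * (A.T @ t)
```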
4.3 Smooth Objective Terms

Iterative Solvers

When $f$ is smooth, general iterative methods can be used to carry out the $x$-minimization step. Of particular interest are methods that only require the ability to compute $\nabla f(x)$ for a given $x$, to multiply a vector by $A$, and to multiply a vector by $A^T$. Such methods can scale to relatively large problems. Examples include the standard gradient method, the (nonlinear) conjugate gradient method, and the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [113, 26]; see [135] for further details. The convergence of these methods depends on the conditioning of the function to be minimized. The presence of the quadratic penalty term $(\rho/2)\|Ax - v\|_2^2$ tends to improve the conditioning of the problem and thus improve the performance of an iterative method for updating $x$. Indeed, one method for adjusting the parameter $\rho$ from iteration to iteration is to increase it until the iterative method used to carry out the updates converges quickly enough.

Early Termination

A standard technique to speed up the algorithm is to terminate the iterative method used to carry out the $x$-update (or $z$-update) early, i.e., before the gradient of $f(x) + (\rho/2)\|Ax - v\|_2^2$ is very small. This technique is justified by the convergence results for ADMM with inexact minimization in the $x$- and $z$-update steps. In this case, the required accuracy should be low in the initial iterations of ADMM and then repeatedly increased in each iteration. Early termination in the $x$- or $z$-updates can result in more ADMM iterations, but much lower cost per iteration, giving an overall improvement in efficiency.

Warm Start

Another standard trick is to initialize the iterative method used in the $x$-update at the solution $x^k$ obtained in the previous iteration. This is called a warm start. The previous ADMM iterate often gives a good enough approximation to result in far fewer iterations (of the
iterative method used to compute the update $x^{k+1}$) than if the iterative method were started at zero or some other default initialization. This is especially the case when ADMM has almost converged, in which case the updates will not change significantly from their previous values.

Quadratic Objective Terms

Even when $f$ is quadratic, it may be worth using an iterative method rather than a direct method for the $x$-update. In this case, we can use a standard (possibly preconditioned) conjugate gradient method. This approach makes sense when direct methods do not work (e.g., because they require too much memory) or when $A$ is dense but a fast method is available for multiplying a vector by $A$ or $A^T$. This is the case, for example, when $A$ represents the discrete Fourier transform [90].

4.4 Decomposition

Block Separability

Suppose $x = (x_1, \ldots, x_N)$ is a partition of the variable $x$ into subvectors and that $f$ is separable with respect to this partition, i.e.,
$$f(x) = f_1(x_1) + \cdots + f_N(x_N),$$
where $x_i \in \mathbf{R}^{n_i}$ and $\sum_{i=1}^N n_i = n$. If the quadratic term $\|Ax\|_2^2$ is also separable with respect to the partition, i.e., $A^T A$ is block diagonal conformably with the partition, then the augmented Lagrangian $L_\rho$ is separable. This means that the $x$-update can be carried out in parallel, with the subvectors $x_i$ updated by $N$ separate minimizations.

Component Separability

In some cases, the decomposition extends all the way to individual components of $x$, i.e.,
$$f(x) = f_1(x_1) + \cdots + f_n(x_n),$$
where $f_i : \mathbf{R} \to \mathbf{R}$, and $A^T A$ is diagonal. The $x$-minimization step can then be carried out via $n$ scalar minimizations, which can in some cases be expressed analytically (but in any case can be computed very efficiently). We will call this component separability.
Soft Thresholding

For an example that will come up often in the sequel, consider $f(x) = \lambda\|x\|_1$ (with $\lambda > 0$) and $A = I$. In this case the (scalar) $x_i$-update is
$$x_i^+ := \operatorname*{argmin}_{x_i} \left( \lambda|x_i| + (\rho/2)(x_i - v_i)^2 \right).$$
Even though the first term is not differentiable, we can easily compute a simple closed-form solution to this problem by using subdifferential calculus; see [140, §23] for background. Explicitly, the solution is
$$x_i^+ := S_{\lambda/\rho}(v_i),$$
where the soft thresholding operator $S$ is defined as
$$S_\kappa(a) = \begin{cases} a - \kappa & a > \kappa \\ 0 & |a| \le \kappa \\ a + \kappa & a < -\kappa, \end{cases}$$
or equivalently,
$$S_\kappa(a) = (a - \kappa)_+ - (-a - \kappa)_+.$$
Yet another formula, which shows that the soft thresholding operator is a shrinkage operator (i.e., moves a point toward zero), is
$$S_\kappa(a) = (1 - \kappa/|a|)_+\, a \qquad (4.2)$$
(for $a \ne 0$). We refer to updates that reduce to this form as element-wise soft thresholding. In the language of §4.1, soft thresholding is the proximity operator of the $\ell_1$ norm.
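The soft thresholding operator translates directly into NumPy using the formula $S_\kappa(a) = (a - \kappa)_+ - (-a - \kappa)_+$; the function name below is ours.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Elementwise soft thresholding S_kappa(a) = (a - kappa)_+ - (-a - kappa)_+,
    i.e. the proximity operator of kappa*||.||_1 (shrinks each entry toward zero)."""
    return np.maximum(a - kappa, 0.0) - np.maximum(-a - kappa, 0.0)

# Example: the x-update for f(x) = lam*||x||_1 with A = I is
# x = soft_threshold(v, lam / rho)
```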
5 Constrained Convex Optimization

The generic constrained convex optimization problem is
$$\text{minimize } f(x) \quad \text{subject to } x \in C, \qquad (5.1)$$
with variable $x \in \mathbf{R}^n$, where $f$ and $C$ are convex. This problem can be rewritten in ADMM form (3.1) as
$$\text{minimize } f(x) + g(z) \quad \text{subject to } x - z = 0,$$
where $g$ is the indicator function of $C$.

The augmented Lagrangian (using the scaled dual variable) is
$$L_\rho(x,z,u) = f(x) + g(z) + (\rho/2)\|x - z + u\|_2^2,$$
so the scaled form of ADMM for this problem is
$$x^{k+1} := \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|x - z^k + u^k\|_2^2 \right)$$
$$z^{k+1} := \Pi_C(x^{k+1} + u^k)$$
$$u^{k+1} := u^k + x^{k+1} - z^{k+1}.$$
The $x$-update involves minimizing $f$ plus a convex quadratic function, i.e., evaluation of the proximal operator associated with $f$. The $z$-update is Euclidean projection onto $C$. The objective $f$ need not be smooth here; indeed, we can include additional constraints (i.e., beyond those represented by $x \in C$) by defining $f$ to be $+\infty$ where they are violated. In this case, the $x$-update becomes a constrained optimization problem over $\operatorname{dom} f = \{x \mid f(x) < \infty\}$.

As with all problems where the constraint is $x - z = 0$, the primal and dual residuals take the simple form
$$r^k = x^k - z^k, \qquad s^k = \rho(z^{k-1} - z^k).$$
In the following sections we give some more specific examples.

5.1 Convex Feasibility

Alternating Projections

A classic problem is to find a point in the intersection of two closed nonempty convex sets. The classical method, which dates back to the 1930s, is von Neumann's alternating projections algorithm [166, 36, 21]:
$$x^{k+1} := \Pi_C(z^k)$$
$$z^{k+1} := \Pi_D(x^{k+1}),$$
where $\Pi_C$ and $\Pi_D$ are Euclidean projection onto the sets $C$ and $D$, respectively.

In ADMM form, the problem can be written as
$$\text{minimize } f(x) + g(z) \quad \text{subject to } x - z = 0,$$
where $f$ is the indicator function of $C$ and $g$ is the indicator function of $D$. The scaled form of ADMM is then
$$x^{k+1} := \Pi_C(z^k - u^k)$$
$$z^{k+1} := \Pi_D(x^{k+1} + u^k) \qquad (5.2)$$
$$u^{k+1} := u^k + x^{k+1} - z^{k+1},$$
so both the $x$ and $z$ steps involve projection onto a convex set, as in the classical method. This is exactly Dykstra's alternating projections
method [56, 9], which is far more efficient than the classical method that does not use the dual variable $u$.

Here, the norm of the primal residual $\|x^k - z^k\|_2$ has a nice interpretation. Since $x^k \in C$ and $z^k \in D$, $\|x^k - z^k\|_2$ is an upper bound on $\operatorname{dist}(C,D)$, the Euclidean distance between $C$ and $D$. If we terminate with $\|r^k\|_2 \le \epsilon^{\mathrm{pri}}$, then we have found a pair of points, one in $C$ and one in $D$, that are no more than $\epsilon^{\mathrm{pri}}$ far apart. Alternatively, the point $(1/2)(x^k + z^k)$ is no more than a distance $\epsilon^{\mathrm{pri}}/2$ from both $C$ and $D$.

Parallel Projections

This method can be applied to the problem of finding a point in the intersection of $N$ closed convex sets $A_1, \ldots, A_N$ in $\mathbf{R}^n$ by running the algorithm in $\mathbf{R}^{nN}$ with
$$C = A_1 \times \cdots \times A_N, \qquad D = \{(x_1, \ldots, x_N) \in \mathbf{R}^{nN} \mid x_1 = x_2 = \cdots = x_N\}.$$
If $x = (x_1, \ldots, x_N) \in \mathbf{R}^{nN}$, then projection onto $C$ is
$$\Pi_C(x) = (\Pi_{A_1}(x_1), \ldots, \Pi_{A_N}(x_N)),$$
and projection onto $D$ is
$$\Pi_D(x) = (\bar x, \bar x, \ldots, \bar x),$$
where $\bar x = (1/N)\sum_{i=1}^N x_i$ is the average of $x_1, \ldots, x_N$. Thus, each step of ADMM can be carried out by projecting points onto each of the sets $A_i$ in parallel and then averaging the results:
$$x_i^{k+1} := \Pi_{A_i}(z^k - u_i^k)$$
$$z^{k+1} := \frac{1}{N}\sum_{i=1}^N \left( x_i^{k+1} + u_i^k \right)$$
$$u_i^{k+1} := u_i^k + x_i^{k+1} - z^{k+1}.$$
Here the first and third steps are carried out in parallel, for $i = 1, \ldots, N$. (The description above involves a small abuse of notation in dropping the index $i$ from $z_i$, since they are all the same.) This can be viewed as a special case of constrained optimization, as described in §4.4, where the indicator function of $A_1 \cap \cdots \cap A_N$ splits into the sum of the indicator functions of each $A_i$.
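As a concrete sketch of the parallel projections scheme just described, the snippet below takes the sets $A_i$ to be halfspaces, so that each projection has a closed form. The choice of sets, the serial loop standing in for $N$ parallel workers, and the stopping tolerance are assumptions made for illustration.

```python
import numpy as np

def project_halfspace(v, a, b):
    """Euclidean projection onto the halfspace {x : a^T x <= b}."""
    return v - max(a @ v - b, 0.0) / (a @ a) * a

def parallel_projections(a_list, b_list, n, max_iter=500, tol=1e-8):
    """ADMM parallel projections for a point in the intersection of N convex
    sets, here halfspaces A_i = {x : a_i^T x <= b_i}."""
    N = len(a_list)
    x = np.zeros((N, n))                  # local copies x_i
    u = np.zeros((N, n))                  # scaled dual variables u_i
    z = np.zeros(n)
    for _ in range(max_iter):
        # projections done in a loop here; in practice they run in parallel
        for i in range(N):
            x[i] = project_halfspace(z - u[i], a_list[i], b_list[i])
        z = np.mean(x + u, axis=0)        # averaging step
        u = u + x - z                     # dual updates, broadcast over i
        if np.linalg.norm(x - z) <= tol:  # all x_i close to z: near-feasible point
            break
    return z
```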
We note for later reference a simplification of the parallel projections algorithm above. Taking the average (over $i$) of the last equation we obtain
$$\bar u^{k+1} = \bar u^k + \bar x^{k+1} - z^{k+1};$$
combined with $z^{k+1} = \bar x^{k+1} + \bar u^k$ (from the second equation) we see that $\bar u^{k+1} = 0$. So after the first step, the average of $u_i$ is zero. This means that $z^{k+1}$ reduces to $\bar x^{k+1}$. Using these simplifications, we arrive at the simple algorithm
$$x_i^{k+1} := \Pi_{A_i}(\bar x^k - u_i^k)$$
$$u_i^{k+1} := u_i^k + (x_i^{k+1} - \bar x^{k+1}).$$
This shows that $u_i^k$ is the running sum of the discrepancies $x_i^k - \bar x^k$ (assuming $u^0 = 0$). The first step is a parallel projection onto the sets $A_i$; the second involves averaging the projected points.

There is a large literature on successive projection algorithms and their many applications; see the survey by Bauschke and Borwein [10] for a general overview, Combettes [39] for applications to image processing, and Censor and Zenios [31, §5] for a discussion in the context of parallel optimization.

5.2 Linear and Quadratic Programming

The standard form quadratic program (QP) is
$$\text{minimize } (1/2)x^T P x + q^T x \quad \text{subject to } Ax = b, \quad x \ge 0, \qquad (5.3)$$
with variable $x \in \mathbf{R}^n$; we assume that $P \in \mathbf{S}^n_+$. When $P = 0$, this reduces to the standard form linear program (LP).

We express it in ADMM form as
$$\text{minimize } f(x) + g(z) \quad \text{subject to } x - z = 0,$$
where
$$f(x) = (1/2)x^T P x + q^T x, \qquad \operatorname{dom} f = \{x \mid Ax = b\}$$
is the original objective with restricted domain and $g$ is the indicator function of the nonnegative orthant $\mathbf{R}^n_+$.
The scaled form of ADMM consists of the iterations
$$x^{k+1} := \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|x - z^k + u^k\|_2^2 \right)$$
$$z^{k+1} := (x^{k+1} + u^k)_+$$
$$u^{k+1} := u^k + x^{k+1} - z^{k+1}.$$
As described in §4.2.5, the $x$-update is an equality-constrained least squares problem with optimality conditions
$$\begin{bmatrix} P + \rho I & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ \nu \end{bmatrix} + \begin{bmatrix} q - \rho(z^k - u^k) \\ -b \end{bmatrix} = 0.$$
All of the comments on efficient computation from §4.2, such as storing factorizations so that subsequent iterations can be carried out cheaply, also apply here. For example, if $P$ is diagonal, possibly zero, the first $x$-update can be carried out at a cost of $O(np^2)$ flops, where $p$ is the number of equality constraints in the original quadratic program. Subsequent updates only cost $O(np)$ flops.

Linear and Quadratic Cone Programming

More generally, any conic constraint $x \in \mathcal{K}$ can be used in place of the constraint $x \ge 0$, in which case the standard quadratic program (5.3) becomes a quadratic conic program. The only change to ADMM is in the $z$-update, which then involves projection onto $\mathcal{K}$. For example, we can solve a semidefinite program with $x \in \mathbf{S}^n_+$ by projecting $x^{k+1} + u^k$ onto $\mathbf{S}^n_+$ using an eigenvalue decomposition.
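The following is a sketch of this ADMM method for the standard form QP (5.3): the $x$-update factors the KKT matrix once and reuses it, and the $z$-update projects onto the nonnegative orthant. The fixed $\rho$, the tolerances, and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def qp_admm(P, q, A, b, rho=1.0, max_iter=1000, tol=1e-6):
    """ADMM for the standard form QP (5.3):
       minimize (1/2) x^T P x + q^T x  subject to Ax = b, x >= 0."""
    n = P.shape[0]; p = A.shape[0]
    KKT = np.block([[P + rho * np.eye(n), A.T],
                    [A, np.zeros((p, p))]])
    F = lu_factor(KKT)                       # factor once; reuse while rho is fixed
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(max_iter):
        rhs = np.concatenate([rho * (z - u) - q, b])
        x = lu_solve(F, rhs)[:n]             # x-update: equality-constrained QP (KKT solve)
        z_old = z
        z = np.maximum(x + u, 0.0)           # z-update: projection onto the nonnegative orthant
        u = u + x - z                        # scaled dual update
        r = np.linalg.norm(x - z)            # primal residual norm
        s = rho * np.linalg.norm(z - z_old)  # dual residual norm
        if r <= tol and s <= tol:
            break
    return z
```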
6 $\ell_1$-Norm Problems

The problems addressed in this section will help illustrate why ADMM is a natural fit for machine learning and statistical problems in particular. The reason is that, unlike dual ascent or the method of multipliers, ADMM explicitly targets problems that split into two distinct parts, $f$ and $g$, that can then be handled separately. Problems of this form are pervasive in machine learning, because a significant number of learning problems involve minimizing a loss function together with a regularization term or side constraints. In other cases, these side constraints are introduced through problem transformations like putting the problem in consensus form, as will be discussed in §7.1.

This section contains a variety of simple but important problems involving $\ell_1$ norms. There is widespread current interest in many of these problems across statistics, machine learning, and signal processing, and applying ADMM yields interesting algorithms that are state-of-the-art, or closely related to state-of-the-art methods. We will see that ADMM naturally lets us decouple the nonsmooth $\ell_1$ term from the smooth loss term, which is computationally advantageous. In this section, we focus on the non-distributed versions of these problems for simplicity; the problem of distributed model fitting will be treated in the sequel.
The idea of ℓ1 regularization is decades old, and traces back to Huber's [100] work on robust statistics and a paper of Claerbout [38] in geophysics. There is a vast literature, but some important modern papers are those on total variation denoising [145], soft thresholding [49], the lasso [156], basis pursuit [34], compressed sensing [50, 28, 29], and structure learning of sparse graphical models [123].

Because of the now widespread use of models incorporating an ℓ1 penalty, there has also been considerable research on optimization algorithms for such problems. A recent survey by Yang et al. [173] compares and benchmarks a number of representative algorithms, including gradient projection [73, 102], homotopy methods [52], iterative shrinkage-thresholding [45], proximal gradient [132, 133, 11, 12], augmented Lagrangian methods [175], and interior-point methods [103]. There are other approaches as well, such as Bregman iterative algorithms [176] and iterative thresholding algorithms [51] implementable in a message-passing framework.

6.1 Least Absolute Deviations

A simple variant on least squares fitting is least absolute deviations, in which we minimize ||Ax − b||_1 instead of ||Ax − b||_2^2. Least absolute deviations provides a more robust fit than least squares when the data contains large outliers, and has been used extensively in statistics and econometrics. See, for example, [95, §10.6], [171, §9.6], and [20, §6.1.2].

In ADMM form, the problem can be written as

minimize ||z||_1
subject to Ax − z = b,

so f = 0 and g = ||·||_1. Exploiting the special form of f and g, and assuming A^T A is invertible, ADMM can be expressed as

x^{k+1} := (A^T A)^{-1} A^T (b + z^k − u^k)
z^{k+1} := S_{1/ρ}(Ax^{k+1} − b + u^k)
u^{k+1} := u^k + Ax^{k+1} − z^{k+1} − b,

where the soft thresholding operator is interpreted elementwise. As in §4.2, the matrix A^T A can be factored once; the factors are then used in cheaper back-solves in subsequent x-updates.
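As an illustration (our addition, not part of the original text), here is a small NumPy sketch of the least absolute deviations iteration just given; A, b, ρ, and the fixed iteration count are placeholders, and A is assumed to have full column rank. In practice A^T A would be factored once and the factors reused.

import numpy as np

def soft_threshold(v, kappa):
    # Elementwise soft thresholding S_kappa(v).
    return np.maximum(v - kappa, 0) - np.maximum(-v - kappa, 0)

def lad_admm(A, b, rho=1.0, iters=200):
    """ADMM sketch for least absolute deviations: minimize ||Ax - b||_1."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(m), np.zeros(m)
    AtA = A.T @ A                                   # reused every iteration
    for _ in range(iters):
        x = np.linalg.solve(AtA, A.T @ (b + z - u))   # least squares step
        z = soft_threshold(A @ x - b + u, 1.0 / rho)  # soft thresholding
        u = u + A @ x - z - b                         # scaled dual update
    return x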
The x-update step is the same as carrying out a least squares fit with coefficient matrix A and righthand side b + z^k − u^k. Thus ADMM can be interpreted as a method for solving a least absolute deviations problem by iteratively solving the associated least squares problem with a modified righthand side; the modification is then updated using soft thresholding. With factorization caching, the cost of subsequent least squares iterations is much smaller than the initial one, often making the time required to carry out least absolute deviations very nearly the same as the time required to carry out least squares.

6.1.1 Huber Fitting

A problem that lies in between least squares and least absolute deviations is Huber function fitting,

minimize g^{hub}(Ax − b),

where the Huber penalty function g^{hub} is quadratic for small arguments and transitions to an absolute value for larger values. For scalar a, it is given by

g^{hub}(a) = { a^2/2,       |a| ≤ 1
             { |a| − 1/2,   |a| > 1,

and extends to vector arguments as the sum of the Huber functions of the components. (For simplicity, we consider the standard Huber function, which transitions from quadratic to absolute value at the level 1.)

This can be put into ADMM form as above, and the ADMM algorithm is the same except that the z-update involves the proximity operator of the Huber function rather than that of the ℓ1 norm:

z^{k+1} := (ρ/(1 + ρ)) (Ax^{k+1} − b + u^k) + (1/(1 + ρ)) S_{1+1/ρ}(Ax^{k+1} − b + u^k).

When the least squares fit x^{ls} = (A^T A)^{-1} A^T b satisfies |(Ax^{ls} − b)_i| ≤ 1 for all i, it is also the Huber fit. In this case, ADMM terminates in two steps.
6.2 Basis Pursuit

Basis pursuit is the equality-constrained ℓ1 minimization problem

minimize ||x||_1
subject to Ax = b,

with variable x ∈ R^n, data A ∈ R^{m×n}, b ∈ R^m, with m < n. Basis pursuit is often used as a heuristic for finding a sparse solution to an underdetermined system of linear equations. It plays a central role in modern statistical signal processing, particularly the theory of compressed sensing; see [24] for a recent survey.

In ADMM form, basis pursuit can be written as

minimize f(x) + ||z||_1
subject to x − z = 0,

where f is the indicator function of {x ∈ R^n | Ax = b}. The ADMM algorithm is then

x^{k+1} := Π(z^k − u^k)
z^{k+1} := S_{1/ρ}(x^{k+1} + u^k)
u^{k+1} := u^k + x^{k+1} − z^{k+1},

where Π is projection onto {x ∈ R^n | Ax = b}. The x-update, which involves solving a linearly-constrained minimum Euclidean norm problem, can be written explicitly as

x^{k+1} := (I − A^T (AA^T)^{-1} A)(z^k − u^k) + A^T (AA^T)^{-1} b.

Again, the comments on efficient computation from §4.2 apply; by caching a factorization of AA^T, subsequent projections are much cheaper than the first one. We can interpret ADMM for basis pursuit as reducing the solution of a least ℓ1 norm problem to solving a sequence of minimum Euclidean norm problems. For a discussion of similar algorithms for related problems in image processing, see [2].

A recent class of algorithms called Bregman iterative methods have attracted considerable interest for solving ℓ1 problems like basis pursuit. For basis pursuit and related problems, Bregman iterative regularization [176] is equivalent to the method of multipliers, and the split Bregman method [88] is equivalent to ADMM [68].
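The following NumPy sketch (an added illustration, with placeholder data and iteration count) implements the basis pursuit iteration above using the explicit projection formula; in practice one would cache a factorization of AA^T rather than forming the projector densely.

import numpy as np

def basis_pursuit_admm(A, b, rho=1.0, iters=200):
    """ADMM sketch for basis pursuit: minimize ||x||_1 s.t. Ax = b."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    AAt = A @ A.T
    P = np.eye(n) - A.T @ np.linalg.solve(AAt, A)   # projector onto null(A)
    q = A.T @ np.linalg.solve(AAt, b)               # particular solution term
    for _ in range(iters):
        x = P @ (z - u) + q                         # project onto {x : Ax = b}
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - 1.0 / rho, 0)  # soft threshold
        u = u + x - z
    return z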
6.3 General ℓ1 Regularized Loss Minimization

Consider the generic problem

minimize l(x) + λ||x||_1,                                          (6.1)

where l is any convex loss function.

In ADMM form, this problem can be written as

minimize l(x) + g(z)
subject to x − z = 0,

where g(z) = λ||z||_1. The algorithm is

x^{k+1} := argmin_x ( l(x) + (ρ/2)||x − z^k + u^k||_2^2 )
z^{k+1} := S_{λ/ρ}(x^{k+1} + u^k)
u^{k+1} := u^k + x^{k+1} − z^{k+1}.

The x-update is a proximal operator evaluation. If l is smooth, this can be done by any standard method, such as Newton's method, a quasi-Newton method such as L-BFGS, or the conjugate gradient method. If l is quadratic, the x-minimization can be carried out by solving linear equations, as in §4.2. In general, we can interpret ADMM for ℓ1 regularized loss minimization as reducing it to solving a sequence of ℓ2 (squared) regularized loss minimization problems.

A very wide variety of models can be represented with the loss function l, including generalized linear models [122] and generalized additive models [94]. In particular, generalized linear models subsume linear regression, logistic regression, softmax regression, and Poisson regression, since they allow for any exponential family distribution. For general background on models like ℓ1 regularized logistic regression, see, e.g., [95, §4.4.4].

In order to use a regularizer g(z) other than ||z||_1, we simply replace the soft thresholding operator in the z-update with the proximity operator of g, as in §4.1.
6.4 Lasso

An important special case of (6.1) is ℓ1 regularized linear regression, also called the lasso [156]. This involves solving

minimize (1/2)||Ax − b||_2^2 + λ||x||_1,                                          (6.2)

where λ > 0 is a scalar regularization parameter that is usually chosen by cross-validation. In typical applications, there are many more features than training examples, and the goal is to find a parsimonious model for the data. For general background on the lasso, see [95, §3.4.2]. The lasso has been widely applied, particularly in the analysis of biological data, where only a small fraction of a huge number of possible factors are actually predictive of some outcome of interest; see [95, §18.4] for a representative case study.

In ADMM form, the lasso problem can be written as

minimize f(x) + g(z)
subject to x − z = 0,

where f(x) = (1/2)||Ax − b||_2^2 and g(z) = λ||z||_1. By §4.2 and §4.4, ADMM becomes

x^{k+1} := (A^T A + ρI)^{-1} (A^T b + ρ(z^k − u^k))
z^{k+1} := S_{λ/ρ}(x^{k+1} + u^k)
u^{k+1} := u^k + x^{k+1} − z^{k+1}.

Note that A^T A + ρI is always invertible, since ρ > 0. The x-update is essentially a ridge regression (i.e., quadratically regularized least squares) computation, so ADMM can be interpreted as a method for solving the lasso problem by iteratively carrying out ridge regression. When using a direct method, we can cache an initial factorization to make subsequent iterations much cheaper. See [1] for an example of an application in image processing.
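For concreteness, the following is a minimal NumPy sketch of the lasso iteration above (an added illustration; A, b, λ, ρ, and the iteration count are placeholders), with a cached Cholesky factorization of A^T A + ρI.

import numpy as np

def lasso_admm(A, b, lam, rho=1.0, iters=200):
    """ADMM sketch for the lasso: minimize (1/2)||Ax - b||_2^2 + lam*||x||_1."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    # Factor A'A + rho*I once and reuse the factor every iteration.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))
    Atb = A.T @ b
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))       # ridge step
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)   # soft threshold
        u = u + x - z
    return z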
6.4.1 Generalized Lasso

The lasso problem can be generalized to

minimize (1/2)||Ax − b||_2^2 + λ||Fx||_1,                                          (6.3)

where F is an arbitrary linear transformation. An important special case is when F ∈ R^{(n−1)×n} is the difference matrix,

F_{ij} = {  1   j = i + 1
         { −1   j = i
         {  0   otherwise,

and A = I, in which case the generalization reduces to

minimize (1/2)||x − b||_2^2 + λ Σ_{i=1}^{n−1} |x_{i+1} − x_i|.                     (6.4)

The second term is the total variation of x. This problem is often called total variation denoising [145], and has applications in signal processing. When A = I and F is a second difference matrix, the problem (6.3) is called ℓ1 trend filtering [101].

In ADMM form, the problem (6.3) can be written as

minimize (1/2)||Ax − b||_2^2 + λ||z||_1
subject to Fx − z = 0,

which yields the ADMM algorithm

x^{k+1} := (A^T A + ρF^T F)^{-1} (A^T b + ρF^T (z^k − u^k))
z^{k+1} := S_{λ/ρ}(Fx^{k+1} + u^k)
u^{k+1} := u^k + Fx^{k+1} − z^{k+1}.

For the special case of total variation denoising (6.4), A^T A + ρF^T F is tridiagonal, so the x-update can be carried out in O(n) flops [90, §4.3]. For ℓ1 trend filtering, the matrix is pentadiagonal, so the x-update is still O(n) flops.

6.4.2 Group Lasso

As another example, consider replacing the regularizer ||x||_1 with Σ_{i=1}^N ||x_i||_2, where x = (x_1, ..., x_N), with x_i ∈ R^{n_i}. When n_i = 1 and N = n, this reduces to the ℓ1 regularized problem (6.1). Here the regularizer is separable with respect to the partition x_1, ..., x_N but not fully separable. This extension of ℓ1 norm regularization is called the group lasso [177] or, more generally, sum-of-norms regularization [136].
ADMM for this problem is the same as above with the z-update replaced with block soft thresholding,

z_i^{k+1} := S_{λ/ρ}(x_i^{k+1} + u_i^k),   i = 1, ..., N,

where the vector soft thresholding operator S_κ : R^m → R^m is

S_κ(a) = (1 − κ/||a||_2)_+ a,

with S_κ(0) = 0. This formula reduces to the scalar soft thresholding operator when a is a scalar, and generalizes the expression given in (4.2).

This can be extended further to handle overlapping groups, which is often useful in bioinformatics and other applications [181, 118]. In this case, we have N potentially overlapping groups G_i ⊆ {1, ..., n} of variables, and the objective is

(1/2)||Ax − b||_2^2 + λ Σ_{i=1}^N ||x_{G_i}||_2,

where x_{G_i} is the subvector of x with entries in G_i. Because the groups can overlap, this kind of objective is difficult to optimize with many standard methods, but it is straightforward with ADMM. To use ADMM, introduce N new variables x_i ∈ R^{|G_i|} and consider the problem

minimize (1/2)||Az − b||_2^2 + λ Σ_{i=1}^N ||x_i||_2
subject to x_i − z̃_i = 0,   i = 1, ..., N,

with local variables x_i and global variable z. Here, z̃_i is the global variable z's idea of what the local variable x_i should be, and is given by a linear function of z. This follows the notation for general form consensus optimization outlined in full detail in §7.2; the overlapping group lasso problem above is a special case.

6.5 Sparse Inverse Covariance Selection

Given a dataset consisting of samples from a zero mean Gaussian distribution in R^n,

a_i ∼ N(0, Σ),   i = 1, ..., N,
consider the task of estimating the covariance matrix Σ under the prior assumption that Σ^{-1} is sparse. Since (Σ^{-1})_{ij} is zero if and only if the ith and jth components of the random variable are conditionally independent, given the other variables, this problem is equivalent to the structure learning problem of estimating the topology of the undirected graphical model representation of the Gaussian [104]. Determining the sparsity pattern of the inverse covariance matrix Σ^{-1} is also called the covariance selection problem.

For n very small, it is feasible to search over all sparsity patterns in Σ^{-1}, since for a fixed sparsity pattern, determining the maximum likelihood estimate of Σ is a tractable (convex optimization) problem. A good heuristic that scales to much larger values of n is to minimize the negative log-likelihood (with respect to the parameter X = Σ^{-1}) with an ℓ1 regularization term to promote sparsity of the estimated inverse covariance matrix [7]. If S is the empirical covariance matrix (1/N) Σ_{i=1}^N a_i a_i^T, then the estimation problem can be written as

minimize Tr(SX) − log det X + λ||X||_1,

with variable X ∈ S^n_+, where ||·||_1 is defined elementwise, i.e., as the sum of the absolute values of the entries, and the domain of log det is S^n_{++}, the set of symmetric positive definite n×n matrices. This is a special case of the general ℓ1 regularized problem (6.1) with (convex) loss function l(X) = Tr(SX) − log det X.

The idea of covariance selection is originally due to Dempster [48] and was first studied in the sparse, high-dimensional regime by Meinshausen and Bühlmann [123]. The form of the problem above is due to Banerjee et al. [7]. Some other recent papers on this problem include Friedman et al.'s graphical lasso [79], Duchi et al. [55], Lu [115], Yuan [178], and Scheinberg et al. [148], the last of which shows that ADMM outperforms state-of-the-art methods for this problem.

The ADMM algorithm for sparse inverse covariance selection is

X^{k+1} := argmin_X ( Tr(SX) − log det X + (ρ/2)||X − Z^k + U^k||_F^2 )
Z^{k+1} := argmin_Z ( λ||Z||_1 + (ρ/2)||X^{k+1} − Z + U^k||_F^2 )
U^{k+1} := U^k + X^{k+1} − Z^{k+1},
where ||·||_F is the Frobenius norm, i.e., the square root of the sum of the squares of the entries.

This algorithm can be simplified much further. The Z-minimization step is elementwise soft thresholding,

Z_{ij}^{k+1} := S_{λ/ρ}(X_{ij}^{k+1} + U_{ij}^k),

and the X-minimization also turns out to have an analytical solution. The first-order optimality condition is that the gradient should vanish,

S − X^{-1} + ρ(X − Z^k + U^k) = 0,

together with the implicit constraint X ≻ 0. Rewriting, this is

ρX − X^{-1} = ρ(Z^k − U^k) − S.                                          (6.5)

We will construct a matrix X that satisfies this condition and thus minimizes the X-minimization objective. First, take the orthogonal eigenvalue decomposition of the righthand side,

ρ(Z^k − U^k) − S = QΛQ^T,

where Λ = diag(λ_1, ..., λ_n), and Q^T Q = QQ^T = I. Multiplying (6.5) by Q^T on the left and by Q on the right gives

ρX̃ − X̃^{-1} = Λ,

where X̃ = Q^T X Q. We can now construct a diagonal solution of this equation, i.e., find positive numbers X̃_{ii} that satisfy ρX̃_{ii} − 1/X̃_{ii} = λ_i. By the quadratic formula,

X̃_{ii} = (λ_i + √(λ_i^2 + 4ρ)) / (2ρ),

which are always positive since ρ > 0. It follows that X = QX̃Q^T satisfies the optimality condition (6.5), so this is the solution to the X-minimization. The computational effort of the X-update is that of an eigenvalue decomposition of a symmetric matrix.
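A compact NumPy sketch of this algorithm (added for illustration; S, λ, ρ, and the fixed iteration count are placeholders) is given below; the X-update uses the eigendecomposition construction just described.

import numpy as np

def cov_select_admm(S, lam, rho=1.0, iters=100):
    """ADMM sketch for: minimize Tr(SX) - log det X + lam*||X||_1."""
    n = S.shape[0]
    X = Z = np.eye(n)
    U = np.zeros((n, n))
    for _ in range(iters):
        # X-update: analytical solution via eigendecomposition of rho*(Z-U) - S.
        lmbda, Q = np.linalg.eigh(rho * (Z - U) - S)
        d = (lmbda + np.sqrt(lmbda ** 2 + 4 * rho)) / (2 * rho)
        X = Q @ np.diag(d) @ Q.T
        # Z-update: elementwise soft thresholding.
        V = X + U
        Z = np.sign(V) * np.maximum(np.abs(V) - lam / rho, 0)
        U = U + X - Z
    return Z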
7 Consensus and Sharing

Here we describe two generic optimization problems, consensus and sharing, and ADMM-based methods for solving them using distributed optimization. Consensus problems have a long history, especially in conjunction with ADMM; see, e.g., Bertsekas and Tsitsiklis [17] for a discussion of distributed consensus problems in the context of ADMM from the 1980s. Some more recent examples include a survey by Nedić and Ozdaglar [131] and several application papers by Giannakis and co-authors in the context of signal processing and wireless communications, such as [150, 182, 121].

7.1 Global Variable Consensus Optimization

We first consider the case with a single global variable, with the objective and constraint terms split into N parts:

minimize f(x) = Σ_{i=1}^N f_i(x),

where x ∈ R^n, and f_i : R^n → R ∪ {+∞} are convex. We refer to f_i as the ith term in the objective. Each term can also encode constraints by assigning f_i(x) = +∞ when a constraint is violated. The goal is to
solve the problem above in such a way that each term can be handled by its own processing element, such as a thread or processor.

This problem arises in many contexts. In model fitting, for example, x represents the parameters in a model and f_i represents the loss function associated with the ith block of data or measurements. In this case, we would say that x is found by collaborative filtering, since the data sources are collaborating to develop a global model.

This problem can be rewritten with local variables x_i ∈ R^n and a common global variable z:

minimize Σ_{i=1}^N f_i(x_i)
subject to x_i − z = 0,   i = 1, ..., N.                                          (7.1)

This is called the global consensus problem, since the constraint is that all the local variables should agree, i.e., be equal. Consensus can be viewed as a simple technique for turning additive objectives Σ_{i=1}^N f_i(x), which show up frequently but do not split due to the variable being shared across terms, into separable objectives Σ_{i=1}^N f_i(x_i), which split easily. For a useful recent discussion of consensus algorithms, see [131] and the references therein.

ADMM for the problem (7.1) can be derived either directly from the augmented Lagrangian

L_ρ(x_1, ..., x_N, z, y) = Σ_{i=1}^N ( f_i(x_i) + y_i^T (x_i − z) + (ρ/2)||x_i − z||_2^2 ),

or simply as a special case of the constrained optimization problem (5.1) with variable (x_1, ..., x_N) ∈ R^{nN} and constraint set

C = {(x_1, ..., x_N) | x_1 = x_2 = ··· = x_N}.

The resulting ADMM algorithm is the following:

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − z^k) + (ρ/2)||x_i − z^k||_2^2 )
z^{k+1} := (1/N) Σ_{i=1}^N ( x_i^{k+1} + (1/ρ) y_i^k )
y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z^{k+1}).
Here, we write y_i^{kT} instead of (y_i^k)^T to lighten the notation. The first and last steps are carried out independently for each i = 1, ..., N. In the literature, the processing element that handles the global variable z is sometimes called the central collector or the fusion center. Note that the z-update is simply the projection of x^{k+1} + (1/ρ)y^k onto the constraint set C of block-constant vectors.

This algorithm can be simplified further. With the average (over i = 1, ..., N) of a vector denoted with an overline, the z-update can be written

z^{k+1} = x̄^{k+1} + (1/ρ)ȳ^k.

Similarly, averaging the y-update gives

ȳ^{k+1} = ȳ^k + ρ(x̄^{k+1} − z^{k+1}).

Substituting the first equation into the second shows that ȳ^{k+1} = 0, i.e., the dual variables have average value zero after the first iteration. Using z^k = x̄^k, ADMM can be written as

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − x̄^k) + (ρ/2)||x_i − x̄^k||_2^2 )
y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − x̄^{k+1}).

We have already seen a special case of this in parallel projections (see §5.1.2), which is consensus ADMM for the case when the f_i are all indicator functions of convex sets.

This is a very intuitive algorithm. The dual variables are separately updated to drive the variables into consensus, and quadratic regularization helps pull the variables toward their average value while still attempting to minimize each local f_i. We can interpret consensus ADMM as a method for solving problems in which the objective and constraints are distributed across multiple processors. Each processor only has to handle its own objective and constraint term, plus a quadratic term which is updated each iteration. The quadratic terms (or more accurately, the linear parts of the quadratic terms) are updated in such a way that the variables converge to a common value, which is the solution of the full problem.
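The following NumPy sketch (an added illustration, run serially rather than on separate processors) simulates this simplified consensus iteration for the special case of quadratic local terms f_i(x) = (1/2)||A_i x − b_i||_2^2; the data blocks, ρ, and the iteration count are placeholders.

import numpy as np

def consensus_admm(As, bs, rho=1.0, iters=100):
    """Serial simulation of global consensus ADMM with quadratic local terms."""
    N, n = len(As), As[0].shape[1]
    X = np.zeros((N, n))            # local variables x_i
    Y = np.zeros((N, n))            # dual variables y_i
    xbar = np.zeros(n)
    for _ in range(iters):
        for i in range(N):          # x-updates (would run in parallel)
            X[i] = np.linalg.solve(As[i].T @ As[i] + rho * np.eye(n),
                                   As[i].T @ bs[i] - Y[i] + rho * xbar)
        xbar = X.mean(axis=0)       # averaging step (the central collector)
        Y += rho * (X - xbar)       # dual updates
    return xbar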
For consensus ADMM, the primal and dual residuals are

r^k = (x_1^k − x̄^k, ..., x_N^k − x̄^k),   s^k = −ρ(x̄^k − x̄^{k−1}, ..., x̄^k − x̄^{k−1}),

so their (squared) norms are

||r^k||_2^2 = Σ_{i=1}^N ||x_i^k − x̄^k||_2^2,   ||s^k||_2^2 = Nρ^2 ||x̄^k − x̄^{k−1}||_2^2.

The first quantity, ||r^k||_2, is √N times the standard deviation of the points x_1^k, ..., x_N^k, a natural measure of (lack of) consensus.

When the original consensus problem is a parameter fitting problem, the x-update step has an intuitive statistical interpretation. Suppose f_i is the negative log-likelihood function for the parameter x, given the measurements or data available to the ith processing element. Then x_i^{k+1} is precisely the maximum a posteriori (MAP) estimate of the parameter, given the Gaussian prior distribution N(x̄^k + (1/ρ)y_i^k, ρI). The expression for the prior mean is also intuitive: it is the average value x̄^k of the local parameter estimates in the previous iteration, translated slightly by y_i^k, the price of the ith processor disagreeing with the consensus in the previous iteration. Note also that the use of different forms of penalty in the augmented term, as discussed in §3.4, will lead to corresponding changes in this prior distribution; for example, using a matrix penalty P rather than a scalar ρ will mean that the Gaussian prior distribution has covariance P rather than ρI.

7.1.1 Global Variable Consensus with Regularization

In a simple variation on the global variable consensus problem, an objective term g, often representing a simple constraint or regularization, is handled by the central collector:

minimize Σ_{i=1}^N f_i(x_i) + g(z)
subject to x_i − z = 0,   i = 1, ..., N.                                          (7.2)
The resulting ADMM algorithm is

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − z^k) + (ρ/2)||x_i − z^k||_2^2 )          (7.3)
z^{k+1} := argmin_z ( g(z) + Σ_{i=1}^N ( −y_i^{kT} z + (ρ/2)||x_i^{k+1} − z||_2^2 ) )          (7.4)
y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z^{k+1}).                                                    (7.5)

By collecting the linear and quadratic terms, we can express the z-update as an averaging step, as in consensus ADMM, followed by a proximal step involving g:

z^{k+1} := argmin_z ( g(z) + (Nρ/2)||z − x̄^{k+1} − (1/ρ)ȳ^k||_2^2 ).

In the case with nonzero g, we do not in general have ȳ^k = 0, so we cannot drop the ȳ terms from the z-update as in consensus ADMM.

As an example, for g(z) = λ||z||_1, with λ > 0, the second step of the z-update is a soft threshold operation:

z^{k+1} := S_{λ/Nρ}(x̄^{k+1} + (1/ρ)ȳ^k).

As another simple example, suppose g is the indicator function of R^n_+, which means that the g term enforces nonnegativity of the variable. In this case, the update is

z^{k+1} := (x̄^{k+1} + (1/ρ)ȳ^k)_+.

The scaled form of ADMM for this problem also has an appealing form, which we record here for convenience:

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)||x_i − z^k + u_i^k||_2^2 )          (7.6)
z^{k+1} := argmin_z ( g(z) + (Nρ/2)||z − x̄^{k+1} − ū^k||_2^2 )                   (7.7)
u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.                                         (7.8)

In many cases, this version is simpler and easier to work with than the unscaled form.
7.2 General Form Consensus Optimization

We now consider a more general form of the consensus minimization problem, in which we have local variables x_i ∈ R^{n_i}, i = 1, ..., N, with the objective f_1(x_1) + ··· + f_N(x_N) separable in the x_i. Each of these local variables consists of a selection of the components of the global variable z ∈ R^n; that is, each component of each local variable corresponds to some global variable component z_g. The mapping from local variable indices into global variable index can be written as g = G(i, j), which means that local variable component (x_i)_j corresponds to global variable component z_g.

Achieving consensus between the local variables and the global variable means that

(x_i)_j = z_{G(i,j)},   i = 1, ..., N,   j = 1, ..., n_i.

If G(i, j) = j for all i, then each local variable is just a copy of the global variable, and consensus reduces to global variable consensus, x_i = z. General form consensus is of interest in cases where n_i ≪ n, so each local vector only contains a small number of the global variables.

In the context of model fitting, the following is one way that general form consensus naturally arises. The global variable z is the full feature vector (i.e., vector of model parameters or independent variables in the data), and different subsets of the data are spread out among N processors. Then x_i can be viewed as the subvector of z corresponding to (nonzero) features that appear in the ith block of data. In other words, each processor handles only its block of data and only the subset of model coefficients that are relevant for that block of data. If in each block of data all regressors appear with nonzero values, then this reduces to global consensus.

For example, if each training example is a document, then the features may include words or combinations of words in the document; it will often be the case that some words are only used in a small subset of the documents, in which case each processor can just deal with the words that appear in its local corpus. In general, datasets that are high-dimensional but sparse will benefit from this approach.
Fig. 7.1. General form consensus optimization. Local objective terms are on the left; global variable components are on the right. Each edge in the bipartite graph is a consistency constraint, linking a local variable and a global variable component.

For ease of notation, let z̃_i ∈ R^{n_i} be defined by (z̃_i)_j = z_{G(i,j)}. Intuitively, z̃_i is the global variable's idea of what the local variable x_i should be; the consensus constraint can then be written very simply as

x_i − z̃_i = 0,   i = 1, ..., N.

The general form consensus problem is

minimize Σ_{i=1}^N f_i(x_i)
subject to x_i − z̃_i = 0,   i = 1, ..., N,                                          (7.9)

with variables x_1, ..., x_N and z (the z̃_i are linear functions of z).

A simple example is shown in Figure 7.1. In this example, we have N = 3 subsystems, global variable dimension n = 4, and local variable dimensions n_1 = 4, n_2 = 2, and n_3 = 3. The objective terms and global variables form a bipartite graph, with each edge representing a consensus constraint between a local variable component and a global variable.

The augmented Lagrangian for (7.9) is

L_ρ(x, z, y) = Σ_{i=1}^N ( f_i(x_i) + y_i^T (x_i − z̃_i) + (ρ/2)||x_i − z̃_i||_2^2 ),
with dual variable y_i ∈ R^{n_i}. Then ADMM consists of the iterations

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} x_i + (ρ/2)||x_i − z̃_i^k||_2^2 )
z^{k+1} := argmin_z ( Σ_{i=1}^N ( −y_i^{kT} z̃_i + (ρ/2)||x_i^{k+1} − z̃_i||_2^2 ) )
y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z̃_i^{k+1}),

where the x_i- and y_i-updates can be carried out independently in parallel for each i.

The z-update step decouples across the components of z, since L_ρ is fully separable in its components:

z_g^{k+1} := ( Σ_{G(i,j)=g} ( (x_i^{k+1})_j + (1/ρ)(y_i^k)_j ) ) / ( Σ_{G(i,j)=g} 1 ),

so z_g is found by averaging all entries of x_i^{k+1} + (1/ρ)y_i^k that correspond to the global index g. Applying the same type of argument as in the global variable consensus case, we can show that after the first iteration,

Σ_{G(i,j)=g} (y_i^k)_j = 0,

i.e., the sum of the dual variable entries that correspond to any given global index g is zero. The z-update step can thus be written in the simpler form

z_g^{k+1} := (1/k_g) Σ_{G(i,j)=g} (x_i^{k+1})_j,

where k_g is the number of local variable entries that correspond to global variable entry z_g. In other words, the z-update is local averaging for each component z_g rather than global averaging; in the language of collaborative filtering, we could say that only the processing elements that have an opinion on a feature z_g will vote on z_g.

7.2.1 General Form Consensus with Regularization

As in the global consensus case, the general form consensus problem can be generalized by allowing the global variable nodes to handle an
objective term. Consider the problem

minimize Σ_{i=1}^N f_i(x_i) + g(z)
subject to x_i − z̃_i = 0,   i = 1, ..., N,                                          (7.10)

where g is a regularization function. The z-update involves the local averaging step from the unregularized setting, followed by an application of the proximity operator prox_{g, k_g ρ} to the results of this averaging, just as in the global variable consensus case.

7.3 Sharing

Another canonical problem that will prove useful in the sequel is the sharing problem

minimize Σ_{i=1}^N f_i(x_i) + g(Σ_{i=1}^N x_i)                                          (7.11)

with variables x_i ∈ R^n, i = 1, ..., N, where f_i is a local cost function for subsystem i, and g is the shared objective, which takes as argument the sum of the variables. We can think of the variable x_i as being the choice of agent i; the sharing problem involves each agent adjusting its variable to minimize its individual cost f_i(x_i), as well as the shared objective term g(Σ_{i=1}^N x_i). The sharing problem is important both because many useful problems can be put into this form and because it enjoys a dual relationship with the consensus problem, as discussed below.

Sharing can be written in ADMM form by copying all the variables:

minimize Σ_{i=1}^N f_i(x_i) + g(Σ_{i=1}^N z_i)
subject to x_i − z_i = 0,   i = 1, ..., N,                                          (7.12)

with variables x_i, z_i ∈ R^n, i = 1, ..., N. The scaled form of ADMM is

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)||x_i − z_i^k + u_i^k||_2^2 )
z^{k+1} := argmin_z ( g(Σ_{i=1}^N z_i) + (ρ/2) Σ_{i=1}^N ||z_i − u_i^k − x_i^{k+1}||_2^2 )
u_i^{k+1} := u_i^k + x_i^{k+1} − z_i^{k+1}.

The first and last steps can be carried out independently in parallel for each i = 1, ..., N. As written, the z-update requires solving a problem
in Nn variables, but we will show that it is possible to carry it out by solving a problem in only n variables.

For simplicity of notation, let a_i = u_i^k + x_i^{k+1}. Then the z-update can be rewritten as

minimize g(Nz̄) + (ρ/2) Σ_{i=1}^N ||z_i − a_i||_2^2
subject to z̄ = (1/N) Σ_{i=1}^N z_i,

with additional variable z̄ ∈ R^n. Minimizing over z_1, ..., z_N with z̄ fixed has the solution

z_i = a_i + z̄ − ā,                                          (7.13)

so the z-update can be computed by solving the unconstrained problem

minimize g(Nz̄) + (Nρ/2)||z̄ − ā||_2^2

for z̄ ∈ R^n and then applying (7.13). Substituting (7.13) for z_i^{k+1} in the u-update gives

u_i^{k+1} = ū^k + x̄^{k+1} − z̄^{k+1},                                          (7.14)

which shows that the dual variables u_i^k are all equal (i.e., in consensus) and can be replaced with a single dual variable u ∈ R^n. Substituting in the expression for z_i^k in the x_i-update, the final algorithm becomes

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)||x_i − x_i^k + x̄^k − z̄^k + u^k||_2^2 )
z̄^{k+1} := argmin_{z̄} ( g(Nz̄) + (Nρ/2)||z̄ − u^k − x̄^{k+1}||_2^2 )
u^{k+1} := u^k + x̄^{k+1} − z̄^{k+1}.

The x_i-update can be carried out in parallel, for i = 1, ..., N. The z̄-update step requires gathering x_i^{k+1} to form the averages, and then solving a problem with n variables. After the u-update, the new value of x̄^{k+1} − z̄^{k+1} + u^{k+1} is scattered to the subsystems.

7.3.1 Duality

Attaching Lagrange multipliers ν_i to the constraints x_i − z_i = 0, the dual function Γ of the ADMM sharing problem (7.12) is given by

Γ(ν_1, ..., ν_N) = { −g*(ν_1) − Σ_i f_i*(−ν_i)   if ν_1 = ν_2 = ··· = ν_N
                   { −∞                           otherwise.
Letting ψ = g* and h_i(ν) = f_i*(−ν), the dual sharing problem can be written as

minimize Σ_{i=1}^N h_i(ν_i) + ψ(ν)
subject to ν_i − ν = 0,                                          (7.15)

with variables ν_i ∈ R^n, ν ∈ R^n, i = 1, ..., N. This is identical to the regularized global variable consensus problem (7.2). Assuming strong duality holds, this implies that y^k = ρu^k → ν* in ADMM, where ν* is an optimal point of (7.15).

Consider the reverse direction. Attaching Lagrange multipliers d_i ∈ R^n to the constraints ν_i − ν = 0, the dual of the regularized global consensus problem is

minimize Σ_{i=1}^N f_i(d_i) + g(Σ_{i=1}^N d_i)

with variables d_i ∈ R^n, which is exactly the sharing problem (7.11). (This follows because f_i and g are assumed to be convex and closed, so f_i** = f_i and g** = g.) Assuming strong duality holds, running ADMM on the consensus problem (7.15) gives that d_i^k → x_i*, where x_i* is an optimal point of the sharing problem (7.11).

Thus, there is a close dual relationship between the consensus problem (7.15) and the sharing problem (7.11). In fact, the global consensus problem can be solved by running ADMM on its dual sharing problem, and vice versa. This is related to work by Fukushima [80] on dual ADMM methods.

7.3.2 Optimal Exchange

Here, we highlight an important special case of the sharing problem with an appealing economic interpretation. The exchange problem is

minimize Σ_{i=1}^N f_i(x_i)
subject to Σ_{i=1}^N x_i = 0,                                          (7.16)

with variables x_i ∈ R^n, i = 1, ..., N, where f_i represents the cost function for subsystem i. This is a sharing problem where the shared objective g is the indicator function of the set {0}. The components of the vectors x_i represent quantities of commodities that are exchanged
among N agents or subsystems. When (x_i)_j is nonnegative, it can be viewed as the amount of commodity j received by subsystem i from the exchange. When (x_i)_j is negative, its magnitude |(x_i)_j| can be viewed as the amount of commodity j contributed by subsystem i to the exchange. The equilibrium constraint that each commodity clears, or balances, is simply Σ_{i=1}^N x_i = 0. As this interpretation suggests, this and related problems have a long history in economics, particularly in the theories of market exchange, resource allocation, and general equilibrium; see, for example, the classic works by Walras [168], Arrow and Debreu [4], and Uzawa [162, 163].

The exchange problem can be solved via ADMM either by applying the generic sharing algorithm above and simplifying, or by treating it as a generic constrained convex problem (5.1), with

C = {x ∈ R^{nN} | x_1 + ··· + x_N = 0}.

This gives the exchange ADMM algorithm

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2)||x_i − x_i^k + x̄^k + u^k||_2^2 )
u^{k+1} := u^k + x̄^{k+1}.

It is also instructive to consider the unscaled form of ADMM for this problem:

x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y^{kT} x_i + (ρ/2)||x_i − (x_i^k − x̄^k)||_2^2 )
y^{k+1} := y^k + ρx̄^{k+1}.

The variable y^k converges to an optimal dual variable, which is readily interpreted as a set of optimal or clearing prices for the exchange. The proximal term in the x_i-update is a penalty for x_i^{k+1} deviating from x_i^k, projected onto the feasible set.

The x_i-update in exchange ADMM can be carried out independently in parallel, for i = 1, ..., N. The u-update requires gathering the x_i^{k+1} (or otherwise averaging), and broadcasting x̄^{k+1} + u^{k+1} back to the processors handling the x_i-updates.

Exchange ADMM can be viewed as a form of tâtonnement or price adjustment process [168, 163] from Walras' theory of general equilibrium. Tâtonnement represents the mechanism of the competitive market working towards market equilibrium; the idea is that the market
acts via price adjustment, i.e., increasing or decreasing the price of each good depending on whether there is an excess demand or excess supply of the good, respectively.

Dual decomposition is the simplest algorithmic expression of tâtonnement. In this setting, each agent adjusts his consumption x_i to minimize his individual cost f_i(x_i) adjusted by the cost y^T x_i, where y is the price vector. The central collector (called the 'secretary of market' in [163]) works toward equilibrium by adjusting the prices y up or down depending on whether each commodity or good is overproduced or underproduced. ADMM differs only in the inclusion of the proximal regularization term in the updates for each agent. As y^k converges to an optimal price vector y*, the effect of the proximal regularization term vanishes. The proximal regularization term can be interpreted as each agent's commitment to help clear the market.
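As an added illustration (not part of the original text), the following NumPy sketch simulates exchange ADMM serially for quadratic agent costs f_i(x) = (1/2) x^T Q_i x + q_i^T x; the cost data, ρ, and the iteration count are placeholders, and the returned ρu approximates the clearing prices y*.

import numpy as np

def exchange_admm(Qs, qs, rho=1.0, iters=200):
    """Exchange ADMM sketch: minimize sum_i f_i(x_i) s.t. sum_i x_i = 0."""
    N, n = len(Qs), Qs[0].shape[0]
    X = np.zeros((N, n))              # x_i: net commodity flows for each agent
    u = np.zeros(n)                   # scaled prices (common dual variable)
    for _ in range(iters):
        xbar = X.mean(axis=0)
        for i in range(N):            # agent subproblems (parallelizable)
            X[i] = np.linalg.solve(Qs[i] + rho * np.eye(n),
                                   rho * (X[i] - xbar - u) - qs[i])
        u = u + X.mean(axis=0)        # price update from the new average
    return X, rho * u                 # rho*u converges to the clearing prices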
8 Distributed Model Fitting

A general convex model fitting problem can be written in the form

minimize l(Ax − b) + r(x),                                          (8.1)

with parameters x ∈ R^n, where A ∈ R^{m×n} is the feature matrix, b ∈ R^m is the output vector, l : R^m → R is a convex loss function, and r is a convex regularization function. We assume that l is additive, so

l(Ax − b) = Σ_{i=1}^m l_i(a_i^T x − b_i),

where l_i : R → R is the loss for the ith training example, a_i ∈ R^n is the feature vector for example i, and b_i is the output or response for example i. Each l_i can be different, though in practice they are usually all the same.

We also assume that the regularization function r is separable. The most common examples are r(x) = λ||x||_2^2 (called Tikhonov regularization, or a ridge penalty in statistical settings) and r(x) = λ||x||_1 (sometimes generically called a lasso penalty in statistical settings), where λ is a positive regularization parameter, though more elaborate regularizers can be used just as easily. In some cases, one or more model
parameters are not regularized, such as the offset parameter in a classification model. This corresponds to, for example, r(x) = λ||x_{1:n−1}||_1, where x_{1:n−1} is the subvector of x consisting of all but the last component of x; with this choice of r, the last component of x is not regularized.

The next section discusses some examples that have the general form above. We then consider two ways to solve (8.1) in a distributed manner, namely, by splitting across training examples and by splitting across features. While we work with the assumption that l and r are separable at the component level, we will see that the methods we describe work with appropriate block separability as well.

8.1 Examples

8.1.1 Regression

Consider a linear modeling problem with measurements of the form b_i = a_i^T x + v_i, where a_i is the ith feature vector and the measurement noises v_i are independent with log-concave densities p_i; see, e.g., [20, §7.1.1]. Then the negative log-likelihood function is l(Ax − b), with l_i(ω) = −log p_i(−ω). If r = 0, then the general fitting problem (8.1) can be interpreted as maximum likelihood estimation of x under noise model p_i. If r is taken to be the negative log prior density of x, then the problem can be interpreted as MAP estimation.

For example, the lasso follows the form above with quadratic loss l(u) = (1/2)||u||_2^2 and ℓ1 regularization r(x) = λ||x||_1, which is equivalent to MAP estimation of a linear model with Gaussian noise and a Laplacian prior on the parameters [156, §5].

8.1.2 Classification

Many classification problems can also be put in the form of the general model fitting problem (8.1), with A, b, l, and r appropriately chosen. We follow the standard setup from statistical learning theory, as described in, e.g., [8]. Let p_i ∈ R^{n−1} denote the feature vector of the ith example
and let q_i ∈ {−1, 1} denote the binary outcome or class label, for i = 1, ..., m. The goal is to find a weight vector w ∈ R^{n−1} and offset v ∈ R such that

sign(p_i^T w + v) = q_i

holds for many examples. Viewed as a function of p_i, the expression p_i^T w + v is called a discriminant function. The condition that the sign of the discriminant function and the response should agree can also be written as μ_i > 0, where μ_i = q_i(p_i^T w + v) is called the margin of the ith training example. In the context of classification, loss functions are generally written as a function of the margin, so the loss for the ith example is

l_i(μ_i) = l_i(q_i(p_i^T w + v)).

A classification error is made if and only if the margin is negative, so l_i should be positive and decreasing for negative arguments and zero or small for positive arguments.

To find the parameters w and v, we minimize the average loss plus a regularization term on the weights:

(1/m) Σ_{i=1}^m l_i(q_i(p_i^T w + v)) + r_{wt}(w).                                          (8.2)

This has the generic model fitting form (8.1), with x = (w, v), a_i = (q_i p_i, q_i), b = 0, and regularizer r(x) = r_{wt}(w). (We also need to scale l_i by 1/m.) In the sequel, we will address such problems using the form (8.1) without comment, assuming that this transformation has been carried out.

In statistical learning theory, the problem (8.2) is referred to as penalized empirical risk minimization or structural risk minimization. When the loss function is convex, this is sometimes termed convex risk minimization. In general, fitting a classifier by minimizing a surrogate loss function, i.e., a convex upper bound to 0-1 loss, is a well studied and widely used approach in machine learning; see, e.g., [165, 180, 8]. Many classification models in machine learning correspond to different choices of loss function l_i and regularization or penalty r_{wt}.
Some common loss functions are hinge loss (1 − μ_i)_+, exponential loss exp(−μ_i), and logistic loss log(1 + exp(−μ_i)); the most common regularizers are ℓ1 and ℓ2 (squared). The support vector machine (SVM) [151] corresponds to hinge loss with a quadratic penalty, while exponential loss yields boosting [78] and logistic loss yields logistic regression.

8.2 Splitting across Examples

Here we discuss how to solve the model fitting problem (8.1) with a modest number of features but a very large number of training examples. Most classical statistical estimation problems belong to this regime, with large volumes of relatively low-dimensional data. The goal is to solve the problem in a distributed way, with each processor handling a subset of the training data. This is useful either when there are so many training examples that it is inconvenient or impossible to process them on a single machine or when the data is naturally collected or stored in a distributed fashion. This includes, for example, online social network data, webserver access logs, wireless sensor networks, and many cloud computing applications more generally.

We partition A and b by rows,

A = [A_1; ...; A_N],   b = [b_1; ...; b_N],

with A_i ∈ R^{m_i×n} and b_i ∈ R^{m_i}, where Σ_{i=1}^N m_i = m. Thus, A_i and b_i represent the ith block of data and will be handled by the ith processor.

We first put the model fitting problem in the consensus form

minimize Σ_{i=1}^N l_i(A_i x_i − b_i) + r(z)
subject to x_i − z = 0,   i = 1, ..., N,                                          (8.3)

with variables x_i ∈ R^n and z ∈ R^n. Here, l_i refers (with some abuse of notation) to the loss function for the ith block of data. The problem can now be solved by applying the generic global variable consensus ADMM
algorithm described in §7.1, given here with scaled dual variable:

x_i^{k+1} := argmin_{x_i} ( l_i(A_i x_i − b_i) + (ρ/2)||x_i − z^k + u_i^k||_2^2 )
z^{k+1} := argmin_z ( r(z) + (Nρ/2)||z − x̄^{k+1} − ū^k||_2^2 )
u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

The first step, which consists of an ℓ2-regularized model fitting problem, can be carried out in parallel for each data block. The second step requires gathering variables to form the average. The minimization in the second step can be carried out componentwise (and usually analytically) when r is assumed to be fully separable.

The algorithm described above only requires that the loss function l be separable across the blocks of data; the regularizer r does not need to be separable at all. (However, when r is not separable, the z-update may require the solution of a nontrivial optimization problem.)

8.2.1 Lasso

For the lasso, this yields the distributed algorithm

x_i^{k+1} := argmin_{x_i} ( (1/2)||A_i x_i − b_i||_2^2 + (ρ/2)||x_i − z^k + u_i^k||_2^2 )
z^{k+1} := S_{λ/ρN}(x̄^{k+1} + ū^k)
u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

Each x_i-update takes the form of a Tikhonov-regularized least squares (i.e., ridge regression) problem, with analytical solution

x_i^{k+1} := (A_i^T A_i + ρI)^{-1} (A_i^T b_i + ρ(z^k − u_i^k)).

The techniques from §4.2 apply: if a direct method is used, then the factorization of A_i^T A_i + ρI can be cached to speed up subsequent updates, and if m_i < n, then the matrix inversion lemma can be applied to let us factor the smaller matrix A_i A_i^T + ρI instead.

Comparing this distributed-data lasso algorithm with the serial version in §6.4, we see that the only difference is the collection and averaging steps, which couple the computations for the data blocks. An ADMM-based distributed lasso algorithm is described in [121], with applications in signal processing and wireless communications.
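For illustration (not part of the original text), the following NumPy sketch simulates this distributed lasso serially; the data blocks A_i, b_i, λ, ρ, and the iteration count are placeholders, and each x_i-update uses the ridge-regression formula above.

import numpy as np

def lasso_consensus_admm(As, bs, lam, rho=1.0, iters=200):
    """Lasso split across examples: each block (A_i, b_i) is one 'processor'."""
    N, n = len(As), As[0].shape[1]
    X = np.zeros((N, n))                      # local variables x_i
    U = np.zeros((N, n))                      # scaled duals u_i
    z = np.zeros(n)
    for _ in range(iters):
        for i in range(N):                    # ridge-regression x-updates
            X[i] = np.linalg.solve(As[i].T @ As[i] + rho * np.eye(n),
                                   As[i].T @ bs[i] + rho * (z - U[i]))
        v = (X + U).mean(axis=0)
        z = np.sign(v) * np.maximum(np.abs(v) - lam / (N * rho), 0)  # S_{lam/(rho N)}
        U = U + X - z
    return z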
8.2.2 Sparse Logistic Regression

Consider solving (8.1) with logistic loss functions l_i and ℓ1 regularization. We ignore the intercept term for notational simplicity; the algorithm can be easily modified to incorporate an intercept. The ADMM algorithm is

x_i^{k+1} := argmin_{x_i} ( l_i(A_i x_i) + (ρ/2)||x_i − z^k + u_i^k||_2^2 )
z^{k+1} := S_{λ/ρN}(x̄^{k+1} + ū^k)
u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

This is identical to the distributed lasso algorithm, except for the x_i-update, which here involves an ℓ2 regularized logistic regression problem that can be efficiently solved by algorithms like L-BFGS.

8.2.3 Support Vector Machine

Using the notation of (8.1), the algorithm is

x_i^{k+1} := argmin_{x_i} ( 1^T (A_i x_i + 1)_+ + (ρ/2)||x_i − z^k + u_i^k||_2^2 )
z^{k+1} := (ρ / ((1/λ) + Nρ)) Σ_{i=1}^N ( x_i^{k+1} + u_i^k )
u_i^{k+1} := u_i^k + x_i^{k+1} − z^{k+1}.

Each x_i-update essentially involves fitting a support vector machine to the local data A_i (with an offset in the quadratic regularization term), so this can be carried out efficiently using an existing SVM solver for serial problems. The use of ADMM to train support vector machines in a distributed fashion was described in [74].

8.3 Splitting across Features

Now we consider the model fitting problem (8.1) with a modest number of examples and a large number of features. Statistical problems of this kind frequently arise in areas like natural language processing and bioinformatics, where there are often a large number of potential
explanatory variables for any given outcome. For example, the observations may be a corpus of documents, and the features could include all words and pairs of adjacent words (bigrams) that appear in each document. In bioinformatics, there are usually relatively few people in a given association study, but there can be a very large number of potential features relating to factors like observed DNA mutations in each individual. There are many examples in other areas as well, and the goal is to solve such problems in a distributed fashion with each processor handling a subset of the features. In this section, we show how this can be done by formulating it as a sharing problem from §7.3.

We partition the parameter vector x as x = (x_1, ..., x_N), with x_i ∈ R^{n_i}, where Σ_{i=1}^N n_i = n. Conformably partition the data matrix A as A = [A_1 ··· A_N], with A_i ∈ R^{m×n_i}, and the regularization function as r(x) = Σ_{i=1}^N r_i(x_i). This implies that Ax = Σ_{i=1}^N A_i x_i, i.e., A_i x_i can be thought of as a partial prediction of b using only the features referenced in x_i.

The model fitting problem (8.1) becomes

minimize l(Σ_{i=1}^N A_i x_i − b) + Σ_{i=1}^N r_i(x_i).

Following the approach used for the sharing problem (7.12), we express the problem as

minimize l(Σ_{i=1}^N z_i − b) + Σ_{i=1}^N r_i(x_i)
subject to A_i x_i − z_i = 0,   i = 1, ..., N,

with new variables z_i ∈ R^m. The derivation and simplification of ADMM also follows that for the sharing problem. The scaled form of ADMM is

x_i^{k+1} := argmin_{x_i} ( r_i(x_i) + (ρ/2)||A_i x_i − z_i^k + u_i^k||_2^2 )
z^{k+1} := argmin_z ( l(Σ_{i=1}^N z_i − b) + (ρ/2) Σ_{i=1}^N ||A_i x_i^{k+1} − z_i + u_i^k||_2^2 )
u_i^{k+1} := u_i^k + A_i x_i^{k+1} − z_i^{k+1}.
As in the discussion for the sharing problem, we carry out the z-update by first solving for the average z̄^{k+1}:

z̄^{k+1} := argmin_{z̄} ( l(Nz̄ − b) + (Nρ/2)||z̄ − Ax̄^{k+1} − ū^k||_2^2 )
z_i^{k+1} := z̄^{k+1} + A_i x_i^{k+1} + u_i^k − Ax̄^{k+1} − ū^k,

where Ax̄^{k+1} = (1/N) Σ_{i=1}^N A_i x_i^{k+1}. Substituting the last expression into the update for u_i, we find that

u_i^{k+1} = Ax̄^{k+1} + ū^k − z̄^{k+1},

which shows that, as in the sharing problem, all the dual variables are equal. Using a single dual variable u^k ∈ R^m, and eliminating z_i, we arrive at the algorithm

x_i^{k+1} := argmin_{x_i} ( r_i(x_i) + (ρ/2)||A_i x_i − A_i x_i^k + Ax̄^k − z̄^k + u^k||_2^2 )
z̄^{k+1} := argmin_{z̄} ( l(Nz̄ − b) + (Nρ/2)||z̄ − Ax̄^{k+1} − u^k||_2^2 )
u^{k+1} := u^k + Ax̄^{k+1} − z̄^{k+1}.

The first step involves solving N parallel regularized least squares problems in n_i variables each. Between the first and second steps, we collect and sum the partial predictors A_i x_i^{k+1} to form Ax̄^{k+1}. The second step is a single minimization in m variables, a quadratically regularized loss minimization problem; the third step is a simple update in m variables.

This algorithm does not require l to be separable in the training examples, as assumed earlier. If l is separable, then the z-update fully splits into m separate scalar optimization problems. Similarly, the regularizer r only needs to be separable at the level of the blocks of features. For example, if r is a sum-of-norms, as in §6.4.2, then it would be natural to have each subsystem handle a separate group.
8.3.1 Lasso

In this case, the algorithm above becomes

x_i^{k+1} := argmin_{x_i} ( (ρ/2)||A_i x_i − A_i x_i^k + Ax̄^k − z̄^k + u^k||_2^2 + λ||x_i||_1 )
z̄^{k+1} := (1/(N + ρ)) (b + ρAx̄^{k+1} + ρu^k)
u^{k+1} := u^k + Ax̄^{k+1} − z̄^{k+1}.

Each x_i-update is a lasso problem with n_i variables, which can be solved using any single processor lasso method. In the x_i-updates, we have x_i^{k+1} = 0 (meaning that none of the features in the ith block are used) if and only if

||A_i^T (A_i x_i^k + z̄^k − Ax̄^k − u^k)||_∞ ≤ λ/ρ.

When this occurs, the x_i-update is fast (compared to the case when x_i^{k+1} ≠ 0). In a parallel implementation, there is no benefit to speeding up only some of the tasks being executed in parallel, but in a serial setting we do benefit.

8.3.2 Group Lasso

Consider the group lasso problem with the feature groups coinciding with the blocks of features, and ℓ2 norm (not squared) regularization:

minimize (1/2)||Ax − b||_2^2 + λ Σ_{i=1}^N ||x_i||_2.

The z-update and u-update are the same as for the lasso, but the x_i-update becomes

x_i^{k+1} := argmin_{x_i} ( (ρ/2)||A_i x_i − A_i x_i^k + Ax̄^k − z̄^k + u^k||_2^2 + λ||x_i||_2 ).

(Only the subscript on the last norm differs from the lasso case.) This involves minimizing a function of the form

(ρ/2)||A_i x_i − v||_2^2 + λ||x_i||_2,

which can be carried out as follows. The solution is x_i = 0 if and only if ||A_i^T v||_2 ≤ λ/ρ. Otherwise, the solution has the form

x_i = (A_i^T A_i + νI)^{-1} A_i^T v,
for the value of ν > 0 that gives ν||x_i||_2 = λ/ρ. This value can be found using a one-parameter search (e.g., via bisection) over ν.

We can speed up the computation of x_i for several values of ν (as needed for the parameter search) by computing and caching an eigendecomposition of A_i^T A_i. Assuming A_i is tall, i.e., m ≥ n_i (a similar method works when m < n_i), we compute an orthogonal Q for which A_i^T A_i = Q diag(λ) Q^T, where λ is the vector of eigenvalues of A_i^T A_i (i.e., the squares of the singular values of A_i). The cost is O(m n_i^2) flops, dominated (in order) by forming A_i^T A_i. We subsequently compute ||x_i||_2 using

||x_i||_2 = ||diag(λ + ν1)^{-1} Q^T A_i^T v||_2.

This can be computed in O(n_i) flops, once Q^T A_i^T v is computed, so the search over ν is costless (in order). The cost per iteration is thus O(m n_i) (to compute Q^T A_i^T v), a factor of n_i better than carrying out the x_i-update without caching.
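The following NumPy sketch (an added illustration) implements this block update with the cached eigendecomposition and a bisection over ν. In the code, lam denotes the regularization parameter λ and evals holds the eigenvalues, to avoid the notational clash above; the tolerance and bracketing strategy are arbitrary choices, and A_i is assumed to have full column rank.

import numpy as np

def group_block_update(A, v, lam, rho, tol=1e-8):
    """Solve minimize (rho/2)||A x - v||_2^2 + lam*||x||_2 for one feature block."""
    Atv = A.T @ v
    if np.linalg.norm(Atv) <= lam / rho:
        return np.zeros(A.shape[1])           # the whole block is zeroed out
    evals, Q = np.linalg.eigh(A.T @ A)        # cache across iterations in practice
    w = Q.T @ Atv
    # phi(nu) = nu*||x(nu)||_2 - lam/rho is increasing in nu; bisect for its root.
    def phi(nu):
        return nu * np.linalg.norm(w / (evals + nu)) - lam / rho
    lo, hi = 0.0, 1.0
    while phi(hi) < 0:                        # grow the upper bracket
        hi *= 2
    while hi - lo > tol * (1 + hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) < 0 else (lo, mid)
    nu = 0.5 * (lo + hi)
    return Q @ (w / (evals + nu))             # x = (A'A + nu*I)^{-1} A'v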
8.3.3 Sparse Logistic Regression

The algorithm is identical to the lasso problem above, except that the z-update becomes

z̄^{k+1} := argmin_{z̄} ( l(Nz̄) + (ρ/2)||z̄ − Ax̄^{k+1} − u^k||_2^2 ),

where l is the logistic loss function. This splits to the component level, and involves the proximity operator for l. This can be very efficiently computed by a lookup table that gives the approximate value, followed by one or two Newton steps (for a scalar problem). It is interesting to see that in distributed sparse logistic regression, the dominant computation is the solution of N parallel lasso problems.

8.3.4 Support Vector Machine

The algorithm is

x_i^{k+1} := argmin_{x_i} ( (ρ/2)||A_i x_i − A_i x_i^k + Ax̄^k − z̄^k + u^k||_2^2 + λ||x_i||_2^2 )
z̄^{k+1} := argmin_{z̄} ( 1^T (Nz̄ + 1)_+ + (ρ/2)||z̄ − Ax̄^{k+1} − u^k||_2^2 )
u^{k+1} := u^k + Ax̄^{k+1} − z̄^{k+1}.

The x_i-updates involve quadratic functions, and require solving ridge regression problems. The z-update splits to the component level, and can be expressed as the shifted soft thresholding operation

z̄_i^{k+1} := { v_i − N/ρ,   v_i > −1/N + N/ρ
             { −1/N,        v_i ∈ [−1/N, −1/N + N/ρ]
             { v_i,          v_i < −1/N,

where v = Ax̄^{k+1} + u^k (and here, the subscript i denotes the entry in the vector z̄^{k+1}).

8.3.5 Generalized Additive Models

A generalized additive model has the form

b ≈ Σ_{j=1}^n f_j(a_j),

where a_j is the jth element of the feature vector a, and f_j : R → R are the feature functions. When the feature functions f_j are linear, i.e., of the form f_j(a_j) = w_j a_j, this reduces to standard linear regression. We choose the feature functions by solving the optimization problem

minimize Σ_{i=1}^m l_i( Σ_{j=1}^n f_j(a_{ij}) − b_i ) + Σ_{j=1}^n r_j(f_j),

where a_{ij} is the jth component of the feature vector of the ith example, and b_i is the associated outcome. Here the optimization variables are the functions f_j ∈ F_j, where F_j is a subspace of functions; r_j is now a regularization functional. Usually f_j is linearly parametrized by a finite number of coefficients, which are the underlying optimization variables, but this formulation can also handle the case when F_j is infinite-dimensional. In either case, it is clearer to think of the feature functions f_j as the variables to be determined.
We split the features down to individual functions, so N = n. The algorithm is

f_j^{k+1} := argmin_{f_j ∈ F_j} ( r_j(f_j) + (ρ/2) Σ_{i=1}^m ( f_j(a_{ij}) − f_j^k(a_{ij}) − z̄_i^k + f̄_i^k + u_i^k )^2 )
z̄^{k+1} := argmin_{z̄} ( Σ_{i=1}^m l_i(Nz̄_i − b_i) + (Nρ/2)||z̄ − f̄^{k+1} − u^k||_2^2 )
u^{k+1} := u^k + f̄^{k+1} − z̄^{k+1},

where f̄_i^k = (1/n) Σ_{j=1}^n f_j^k(a_{ij}), the average value of the predicted response Σ_{j=1}^n f_j^k(a_{ij}) for the ith example. The f_j-update is an ℓ2 (squared) regularized function fit. The z-update can be carried out componentwise.
9 Nonconvex Problems

We now explore the use of ADMM for nonconvex problems, focusing on cases in which the individual steps in ADMM, i.e., the x- and z-updates, can be carried out exactly. Even in this case, ADMM need not converge, and when it does converge, it need not converge to an optimal point; it must be considered just another local optimization method. The hope is that it will possibly have better convergence properties than other local optimization methods, where 'better' convergence can mean faster convergence or convergence to a point with better objective value. For nonconvex problems, ADMM can converge to different (and in particular, nonoptimal) points, depending on the initial values x^0 and y^0 and the parameter ρ.

9.1 Nonconvex Constraints

Consider the constrained optimization problem

minimize f(x)
subject to x ∈ S,
with f convex, but S nonconvex. Here, ADMM has the form

x^{k+1} := argmin_x ( f(x) + (ρ/2)||x − z^k + u^k||_2^2 )
z^{k+1} := Π_S(x^{k+1} + u^k)
u^{k+1} := u^k + x^{k+1} − z^{k+1},

where Π_S is projection onto S. The x-minimization step (which is evaluating a proximal operator) is convex since f is convex, but the z-update is projection onto a nonconvex set. In general, this is hard to compute, but it can be carried out exactly in some important special cases we list below.

Cardinality. If S = {x | card(x) ≤ c}, where card gives the number of nonzero elements, then Π_S(v) keeps the c largest magnitude elements and zeroes out the rest.

Rank. If S is the set of matrices with rank c, then Π_S(v) is determined by carrying out a singular value decomposition, v = Σ_i σ_i u_i v_i^T, and keeping the top c dyads, i.e., forming Π_S(v) = Σ_{i=1}^c σ_i u_i v_i^T.

Boolean constraints. If S = {x | x_i ∈ {0, 1}}, then Π_S(v) simply rounds each entry to 0 or 1, whichever is closer. Integer constraints can be handled in the same way.

9.1.1 Regressor Selection

As an example, consider the least squares regressor selection or feature selection problem,

minimize ||Ax − b||_2^2
subject to card(x) ≤ c,

which is to find the best fit to b as a linear combination of no more than c columns of A. For this problem, ADMM takes the form above, where the x-update involves a regularized least squares problem, and the z-update involves keeping the c largest magnitude elements of x^{k+1} + u^k. This is just like ADMM for the lasso, except that soft thresholding is replaced with hard thresholding. This close connection is hardly surprising, since lasso can be thought of as a heuristic for solving the regressor selection problem. From this viewpoint, the lasso controls the trade-off between least squares error and sparsity through the parameter λ, whereas in ADMM for regressor selection, the same trade-off is controlled by the parameter c, the exact cardinality desired.
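A short NumPy sketch of this regressor selection heuristic (an added illustration; A, b, the cardinality c, ρ, and the iteration count are placeholders) follows; it differs from the lasso sketch only in replacing soft thresholding with hard thresholding, and, like ADMM on any nonconvex problem, it need not converge to a global solution.

import numpy as np

def regressor_selection_admm(A, b, c, rho=1.0, iters=200):
    """ADMM heuristic sketch for: minimize ||Ax - b||_2^2 s.t. card(x) <= c."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    L = np.linalg.cholesky(2 * A.T @ A + rho * np.eye(n))   # cached factorization
    Atb2 = 2 * A.T @ b
    for _ in range(iters):
        rhs = Atb2 + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))   # regularized LS step
        v = x + u
        z = np.zeros(n)                                     # hard thresholding:
        keep = np.argsort(np.abs(v))[-c:]                   # keep c largest entries
        z[keep] = v[keep]
        u = u + x - z
    return z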
9.1.2 Factor Model Fitting

The goal is to approximate a symmetric matrix Σ (say, an empirical covariance matrix) as a sum of a rank-k and a diagonal positive semidefinite matrix. Using the Frobenius norm to measure approximation error, we have the problem

minimize (1/2)||X + diag(d) − Σ||_F^2
subject to X ⪰ 0, Rank(X) = k, d ≥ 0,

with variables X ∈ S^n, d ∈ R^n. (Any convex loss function could be used in lieu of the Frobenius norm.) We take

f(X) = inf_{d ≥ 0} (1/2)||X + diag(d) − Σ||_F^2 = (1/2) Σ_{i≠j} (X_{ij} − Σ_{ij})^2 + (1/2) Σ_{i=1}^n ((X_{ii} − Σ_{ii})_+)^2,

with the optimizing d having the form d_i = (Σ_{ii} − X_{ii})_+, i = 1, ..., n. We take S to be the set of positive semidefinite rank-k matrices.

ADMM for the factor model fitting problem is then

X^{k+1} := argmin_X ( f(X) + (ρ/2)||X − Z^k + U^k||_F^2 )
Z^{k+1} := Π_S(X^{k+1} + U^k)
U^{k+1} := U^k + X^{k+1} − Z^{k+1},

where Z, U ∈ S^n. The X-update is separable to the component level, and can be expressed as

(X^{k+1})_{ij} := (1/(1 + ρ)) (Σ_{ij} + ρ(Z_{ij}^k − U_{ij}^k)),   i ≠ j
(X^{k+1})_{ii} := (1/(1 + ρ)) (Σ_{ii} + ρ(Z_{ii}^k − U_{ii}^k)),   Σ_{ii} ≤ Z_{ii}^k − U_{ii}^k
(X^{k+1})_{ii} := Z_{ii}^k − U_{ii}^k,   Σ_{ii} > Z_{ii}^k − U_{ii}^k.
The Z-update is carried out by an eigenvalue decomposition, keeping only the dyads associated with the largest k positive eigenvalues.

9.2 Bi-convex Problems

Another problem that admits exact ADMM updates is the general bi-convex problem,

minimize F(x, z)
subject to G(x, z) = 0,

where F : R^n × R^m → R is bi-convex, i.e., convex in x for each z and convex in z for each x, and G : R^n × R^m → R^p is bi-affine, i.e., affine in x for each fixed z, and affine in z for each fixed x. When F is separable in x and z, and G is jointly affine in x and z, this reduces to the standard ADMM problem form (3.1). For this problem ADMM has the form

x^{k+1} := argmin_x ( F(x, z^k) + (ρ/2)||G(x, z^k) + u^k||_2^2 )
z^{k+1} := argmin_z ( F(x^{k+1}, z) + (ρ/2)||G(x^{k+1}, z) + u^k||_2^2 )
u^{k+1} := u^k + G(x^{k+1}, z^{k+1}).

Both the x- and z-updates involve convex optimization problems, and so are tractable. When G = 0 (or is simply absent), ADMM reduces to simple alternating minimization, a standard method for bi-convex minimization.

9.2.1 Nonnegative Matrix Factorization

As an example, consider nonnegative matrix factorization [110]:

minimize (1/2)||VW − C||_F^2
subject to V_{ij} ≥ 0, W_{ij} ≥ 0,

with variables V ∈ R^{p×r} and W ∈ R^{r×q}, and data C ∈ R^{p×q}. In this form of the problem, the objective (which includes the constraints) is bi-convex, and there are no equality constraints, so ADMM becomes the standard method for nonnegative matrix factorization, which is
alternately minimizing over V, with W fixed, and then minimizing over W, with V fixed.

We can also introduce a new variable, moving the bi-linear term from the objective into the constraints:

minimize (1/2)||X − C||_F^2 + I_+(V) + I_+(W)
subject to X − VW = 0,

with variables X, V, W, where I_+ is the indicator function for elementwise nonnegative matrices. With (X, V) serving the role of x, and W serving the role of z above, ADMM becomes

(X^{k+1}, V^{k+1}) := argmin_{X, V ≥ 0} ( (1/2)||X − C||_F^2 + (ρ/2)||X − VW^k + U^k||_F^2 )
W^{k+1} := argmin_{W ≥ 0} ||X^{k+1} − V^{k+1}W + U^k||_F^2
U^{k+1} := U^k + X^{k+1} − V^{k+1}W^{k+1}.

The first step splits across the rows of X and V, so it can be performed by solving a set of quadratic programs, in parallel, to find each row of X and V separately; the second splits in the columns of W, so it can be performed by solving parallel quadratic programs to find each column.
10 Implementation

This section addresses the implementation of ADMM in a distributed computing environment. For simplicity, we focus on the global consensus problem with regularization,

minimize Σ_{i=1}^N f_i(x_i) + g(z)
subject to x_i − z = 0,

where f_i is the ith objective function term and g is the global regularizer. Extensions to the more general consensus case are mostly straightforward. We first describe an abstract implementation and then show how this maps onto a variety of software frameworks.

10.1 Abstract Implementation

We refer to x_i and u_i as local variables stored in subsystem i, and to z as the global variable. For a distributed implementation, it is often more natural to group the local computations (i.e., the x_i- and u_i-updates), so we write ADMM as

u_i := u_i + x_i − z
x_i := argmin_{x_i} ( f_i(x_i) + (ρ/2)||x_i − z + u_i||_2^2 )
z := prox_{g, Nρ}(x̄ + ū).
Here, the iteration indices are omitted because in an actual implementation, we can simply overwrite previous values of these variables. Note that the u_i-update must be done before the x_i-update in order to match the ADMM iterations given above. If g = 0, then the z-update simply involves computing x̄, and the u_i are not part of the aggregation, as discussed in §7.1.

This suggests that the main features required to implement ADMM are the following:

Mutable state. Each subsystem i must store the current values of x_i and u_i.

Local computation. Each subsystem must be able to solve a small convex problem, where 'small' means that the problem is solvable using a serial algorithm. In addition, each local process must have local access to whatever data are required to specify f_i.

Global aggregation. There must be a mechanism for averaging local variables and broadcasting the result back to each subsystem, either by explicitly using a central collector or via some other approach like distributed averaging [160, 172]. If computing z involves a proximal step (i.e., if g is nonzero), this can either be performed centrally or at each local node; the latter is easier to implement in some frameworks.

Synchronization. All the local variables must be updated before performing global aggregation, and the local updates must all use the latest global variable. One way to implement this synchronization is via a barrier, a system checkpoint at which all subsystems must stop and wait until all other subsystems reach it.

When actually implementing ADMM, it helps to consider whether to take the local perspective of a subsystem performing local processing and communicating with a central collector, or the global perspective of a central collector coordinating the work of a set of subsystems. Which is more natural depends on the software framework used. From the local perspective, each node i receives z, updates u_i and then x_i, sends them to the central collector, waits, and then receives the
updated z. From the global perspective, the central collector broadcasts z to the subsystems, waits for them to finish local processing, gathers all the x_i and u_i, and updates z. (Of course, if ρ varies across iterations, then ρ must also be updated and broadcast when z is updated.) The nodes must also evaluate the stopping criteria and decide when to terminate; see below for examples.

In the general form consensus case, which we do not discuss here, a decentralized implementation is possible that does not require z to be centrally stored; each set of subsystems that share a variable can communicate among themselves directly. In this setting, it can be convenient to think of ADMM as a message-passing algorithm on a graph, where each node corresponds to a subsystem and the edges correspond to shared variables.

10.2 MPI

Message Passing Interface (MPI) [77] is a language-independent message-passing specification used for parallel algorithms, and is the most widely used model for high-performance parallel computing today. There are numerous implementations of MPI on a variety of distributed platforms, and interfaces to MPI are available from a wide variety of languages, including C, C++, and Python.

There are multiple ways to implement consensus ADMM in MPI, but perhaps the simplest is given in Algorithm 1. This pseudocode uses a single program, multiple data (SPMD) programming style, in which each processor or subsystem runs the same program code but has its own set of local variables and can read in a separate subset of the data. We assume there are N processors, with each processor i storing local variables x_i and u_i, a (redundant) copy of the global variable z, and handling only the local data implicit in the objective component f_i.

Algorithm 1 Global consensus ADMM in MPI.

    initialize N processes, along with x_i, u_i, r_i, z.
    repeat
      1. Update u_i := u_i + x_i − z.
      2. Update x_i := argmin_{x_i} ( f_i(x_i) + (ρ/2)‖x_i − z + u_i‖_2^2 ).
      3. Let w := x_i + u_i and t := ‖r_i‖_2^2.
      4. Allreduce w and t.
      5. Let z^prev := z and update z := prox_{g,Nρ}(w/N).
      6. exit if ρ√N ‖z − z^prev‖_2 ≤ ε^conv and √t ≤ ε^feas.
      7. Update r_i := x_i − z.

In step 4, Allreduce denotes using the MPI Allreduce operation to compute the global sum over all processors of the contents of the vector w, and store the result in w on every processor; the same applies to the scalar t. After step 4, then, w = Σ_{i=1}^N (x_i + u_i) = N(x̄ + ū) and t = ‖r‖_2^2 = Σ_{i=1}^N ‖r_i‖_2^2 on all processors. We use Allreduce because
its implementation is in general much more scalable than simply having each subsystem send its results directly to an explicit central collector. Next, in steps 5 and 6, all processors (redundantly) compute the z-update and perform the termination test. It is possible to have the z-update and termination test performed on just one processor, which then broadcasts the results to the other processors, but doing so complicates the code and is generally no faster.

10.3 Graph Computing Frameworks

Since ADMM can be interpreted as performing message-passing on a graph, it is natural to implement it in a graph processing framework. Conceptually, the implementation will be similar to the MPI case discussed above, except that the role of the central collector will often be handled abstractly by the system, rather than by an explicit central collector process. In addition, higher-level graph processing frameworks provide a number of built-in services that one would otherwise have to implement manually, such as fault tolerance.

Many modern graph frameworks are based on or inspired by Valiant's bulk-synchronous parallel (BSP) model [164] for parallel computation. A BSP computer consists of a set of processors networked together, and a BSP computation consists of a series of global supersteps. Each superstep consists of three stages: parallel
computation, in which the processors, in parallel, perform local computations; communication, in which the processors communicate among themselves; and barrier synchronization, in which the processes wait until all processes are finished communicating.

The first step in each ADMM superstep consists of performing the local u_i- and x_i-updates. The communication step would broadcast the new x_i and u_i values to a central collector node, or globally to each individual processor. Barrier synchronization is then used to ensure that all the processors have updated their primal variables before the central collector averages and rebroadcasts the results.

Specific frameworks directly based on or inspired by the BSP model include the Parallel BGL [91], GraphLab [114], and Pregel [119], among others. Since all three follow the general outline above, we refer the reader to the individual papers for details.

10.4 MapReduce

MapReduce [46] is a popular programming model for distributed batch processing of very large datasets. It has been widely used in industry and academia, and its adoption has been bolstered by the open source project Hadoop, inexpensive cloud computing services available through Amazon, and enterprise products and services offered by Cloudera. MapReduce libraries are available in many languages, including Java, C++, and Python, among many others, though Java is the primary language for Hadoop. Though it is awkward to express ADMM in MapReduce, the amount of cloud infrastructure available for MapReduce computing can make it convenient to use in practice, especially for large problems. We briefly review some key features of Hadoop below; see [170] for general background.

A MapReduce computation consists of a set of Map tasks, which process subsets of the input data in parallel, followed by a Reduce task, which combines the results of the Map tasks. Both the Map and Reduce functions are specified by the user and operate on key-value pairs. The Map function performs the transformation

    (k, v) → [(k_1, v_1), ..., (k_m, v_m)],
that is, it takes a key-value pair and emits a list of intermediate key-value pairs. The engine then collects all the values v_1, ..., v_r that correspond to the same output key k (across all Mappers) and passes them to the Reduce function, which performs the transformation

    (k, [v_1, ..., v_r]) → (k, R(v_1, ..., v_r)),

where R is a commutative and associative function. For example, R could simply sum the v_i. In Hadoop, Reducers can emit lists of key-value pairs rather than just a single pair.

Each iteration of ADMM can easily be represented as a MapReduce task: The parallel local computations are performed by Maps, and the global aggregation is performed by a Reduce. We will describe a simple global consensus implementation to give the general flavor and discuss the details below.

Algorithm 2 An iteration of global consensus ADMM in Hadoop/MapReduce.

    function map(key i, dataset D_i)
      1. Read (x_i, u_i, ẑ) from HBase table.
      2. Compute z := prox_{g,Nρ}((1/N)ẑ).
      3. Update u_i := u_i + x_i − z.
      4. Update x_i := argmin_{x_i} ( f_i(x_i) + (ρ/2)‖x_i − z + u_i‖_2^2 ).
      5. Emit (key central, record (x_i, u_i)).

    function reduce(key central, records (x_1, u_1), ..., (x_N, u_N))
      1. Update ẑ := Σ_{i=1}^N (x_i + u_i).
      2. Emit (key j, record (x_j, u_j, ẑ)) to HBase for j = 1, ..., N.

Here, we have the Reducer compute the sum ẑ = Σ_{i=1}^N (x_i + u_i) rather than the average, because summation is associative while averaging is not. We assume N is known (alternatively, it can be obtained by also summing a 1 from each Mapper). We have N Mappers, one for each subsystem i, and each Mapper updates u_i and x_i using the ẑ from the previous iteration. Each Mapper independently executes the proximal step to compute z, but this is usually a cheap operation like soft thresholding. It emits an intermediate key-value pair that essentially serves as a message to the central collector. There is a single Reducer, playing the role of a central collector, and its incoming values are the messages from the Mappers. The updated records are
then written out directly to HBase by the Reducer, and a wrapper program restarts a new MapReduce iteration if the algorithm has not converged. The wrapper will check whether

    ρ√N ‖z − z^prev‖_2 ≤ ε^conv   and   ( Σ_{i=1}^N ‖x_i − z‖_2^2 )^{1/2} ≤ ε^feas

to determine convergence, as in the MPI case. (The wrapper checks the termination criteria instead of the Reducer because they are not associative to check.)

The main difficulty is that MapReduce tasks are not designed to be iterative and do not preserve state in the Mappers across iterations, so implementing an iterative algorithm like ADMM requires some understanding of the underlying infrastructure. Hadoop contains a number of components supporting large-scale, fault-tolerant distributed computing applications. The relevant components here are HDFS, a distributed file system based on Google's GFS [85], and HBase, a distributed database based on Google's BigTable [32].

HDFS is a distributed filesystem, meaning that it manages the storage of data across an entire cluster of machines. It is designed for situations where a typical file may be gigabytes or terabytes in size and high-speed streaming read access is required. The base units of storage in HDFS are blocks, which are 64 MB to 128 MB in size in a typical configuration. Files stored on HDFS are comprised of blocks; each block is stored on a particular machine (though for redundancy, there are replicas of each block on multiple machines), but different blocks in the same file need not be stored on the same machine or even nearby. For this reason, any task that processes data stored on HDFS (e.g., the local datasets D_i) should process a single block of data at a time, since a block is guaranteed to reside wholly on one machine; otherwise, one may cause unnecessary network transfer of data.

In general, the input to each Map task is data stored on HDFS, and Mappers cannot access local disk directly or perform any stateful computation. The scheduler runs each Mapper as close to its input data as possible, ideally on the same node, in order to minimize network transfer of data. To help preserve data locality, each Map task should also be assigned around a block's worth of data. Note that this is very different from the implementation presented for MPI, where each process can be told to pick up the local data on whatever machine it is running on.
Since each Mapper only handles a single block of data, there will usually be a number of Mappers running on the same machine. To reduce the amount of data transferred over the network, Hadoop supports the use of combiners, which essentially Reduce the results of all the Map tasks on a given node so only one set of intermediate key-value pairs needs to be transferred across machines for the final Reduce task. In other words, the Reduce step should be viewed as a two-step process: First, the results of all the Mappers on each individual node are reduced with Combiners, and then the records across the machines are Reduced. This is a major reason why the Reduce function must be commutative and associative.

Since the input value to a Mapper is a block of data, we also need a mechanism for a Mapper to read in local variables, and for the Reducer to store the updated variables for the next iteration. Here, we use HBase, a distributed database built on top of HDFS that provides fast random read-write access. HBase, like BigTable, provides a distributed multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp. Each cell in an HBase table can contain multiple versions of the same data indexed by timestamp; in our case, we can use the iteration counts as the timestamps to store and access data from previous iterations, which is useful for checking termination criteria, for example. The row keys in a table are strings, and HBase maintains data in lexicographic order by row key. This means that rows with lexicographically adjacent keys will be stored on the same machine or nearby. In our case, variables should be stored with the subsystem identifier at the beginning of the row key, so information for the same subsystem is stored together and is efficient to access. For more details, see [32, 170].

The discussion and pseudocode above omit and gloss over many details for simplicity of exposition. MapReduce frameworks like Hadoop also support much more sophisticated implementations, which may be necessary for very large scale problems. For example, if there are too many values for a single Reducer to handle, we can use an approach analogous to the one suggested for MPI: Mappers emit pairs to regional reduce jobs, and then an additional MapReduce step is carried out that uses an identity mapper and aggregates regional results
into a global result. In this section, our goal is merely to give a general flavor of some of the issues involved in implementing ADMM in a MapReduce framework, and we refer to [46, 170, 111] for further details. There has also been some recent work on alternative MapReduce systems that are specifically designed for iterative computation, which are likely better suited for ADMM [25, 179], though the implementations are less mature and less widely available. See [37, 93] for examples of recent papers discussing machine learning and optimization in MapReduce frameworks.
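To give a concrete, self-contained picture of the pattern in Algorithm 2, the following is a minimal serial Python sketch of one consensus ADMM iteration organized as map and reduce functions. It uses no Hadoop or HBase APIs; the choices f_i(x) = (1/2)‖A_i x − b_i‖_2^2 and g = λ‖·‖_1 (so the prox is soft thresholding), as well as all names and sizes, are assumptions made here for illustration.

```python
import numpy as np

rho, lam, N = 1.0, 0.1, 4

def soft_threshold(v, kappa):
    return np.maximum(0.0, v - kappa) - np.maximum(0.0, -v - kappa)

def admm_map(record, zhat):
    """One Mapper: local u_i- and x_i-updates for f_i(x) = (1/2)||A_i x - b_i||_2^2."""
    A_i, b_i, x_i, u_i = record
    z = soft_threshold(zhat / N, lam / (N * rho))     # prox of g, computed redundantly per Mapper
    u_i = u_i + x_i - z
    x_i = np.linalg.solve(A_i.T @ A_i + rho * np.eye(A_i.shape[1]),
                          A_i.T @ b_i + rho * (z - u_i))
    return x_i, u_i

def admm_reduce(pairs):
    """The Reducer: sum the (x_i + u_i) messages into zhat for the next iteration."""
    return sum(x_i + u_i for x_i, u_i in pairs)

# toy driver emulating the iterations serially
rng = np.random.default_rng(0)
n = 10
records = []
for _ in range(N):
    A_i = rng.standard_normal((20, n)); b_i = rng.standard_normal(20)
    records.append([A_i, b_i, np.zeros(n), np.zeros(n)])
zhat = np.zeros(n)
for _ in range(30):
    outs = [admm_map(rec, zhat) for rec in records]   # Map phase
    for rec, (x_i, u_i) in zip(records, outs):
        rec[2], rec[3] = x_i, u_i
    zhat = admm_reduce(outs)                          # Reduce phase
```

In an actual Hadoop implementation, the driver loop above is replaced by the wrapper program, and the per-subsystem state (x_i, u_i, ẑ) is read from and written to HBase as described earlier.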
11 Numerical Examples

In this section we report numerical results for several examples. The examples are chosen to illustrate a variety of the ideas discussed above, including caching matrix factorizations, using iterative solvers for the updates, and using consensus and sharing ADMM to solve distributed problems. The implementations of ADMM are written to be as simple as possible, with no implementation-level optimization or tuning.

The first section discusses a small instance of the lasso problem with a dense coefficient matrix. This helps illustrate some of the basic behavior of the algorithm, and the impact of some of the linear algebra-based optimizations suggested in §4. We find, for example, that we can compute the entire regularization path for the lasso in not much more time than it takes to solve a single problem instance, which in turn takes not much more time than solving a single ridge regression problem of the same size.

We then discuss a serial implementation of the consensus ADMM algorithm applied to l_1 regularized logistic regression, where we split the problem across training examples. Here, we focus on the details of implementing consensus ADMM for this problem, rather than on actual
distributed solutions. The following section has a similar discussion of the group lasso problem, but split across features, with each regularization group corresponding to a distinct subsystem.

We then turn to a real large-scale distributed implementation using an MPI-based solver written in C. We report on the results of solving some large lasso problems on clusters hosted on Amazon EC2, and find that a fairly basic implementation is able to solve a lasso problem with 30 GB of data in a few minutes.

Our last example is regressor selection, a nonconvex problem. We compare the sparsity-fit trade-off curve obtained using nonconvex ADMM, directly controlling the number of regressors, with the same curve obtained using the lasso regularization path (with posterior least squares fit). We will see that the curves are not the same, but give very similar results. This suggests that the regressor selection method may be preferable to the lasso when the desired sparsity level is known in advance: It is much easier to set the desired sparsity level explicitly than to tune the regularization parameter λ to obtain this level.

All examples except the large-scale lasso are implemented in Matlab, and run on an Intel Core i3 processor running at 3.2 GHz. The large lasso example is implemented in C using MPI for interprocess communication and the GNU Scientific Library for linear algebra. Source code and data for these examples (and others) can be found at boyd/papers/admm_distr_stats.html, and most are extensively commented.

11.1 Small Dense Lasso

We consider a small, dense instance of the lasso problem (6.2), where the feature matrix A has m = 1500 examples and n = 5000 features.

We generate the data as follows. We first choose A_{ij} ~ N(0,1) and then normalize the columns to have unit l_2 norm. A 'true' value x^true ∈ R^n is generated with 100 nonzero entries, each sampled from an N(0,1) distribution. The labels b are then computed as b = Ax^true + v, where v ~ N(0, 10^{-3} I), which corresponds to a signal-to-noise ratio ‖Ax^true‖_2^2 / ‖v‖_2^2 of around 60.
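A minimal NumPy sketch of this data generation, under the description just given, could look as follows (random seed and variable names are incidental choices, not part of the original implementation).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1500, 5000

A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)                  # unit l2-norm columns

x_true = np.zeros(n)
idx = rng.choice(n, 100, replace=False)
x_true[idx] = rng.standard_normal(100)          # 100 nonzero entries

v = np.sqrt(1e-3) * rng.standard_normal(m)      # noise with variance 1e-3
b = A @ x_true + v
print(np.sum((A @ x_true)**2) / np.sum(v**2))   # signal-to-noise ratio, roughly 60
```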
Fig. 11.1: Norms of primal residual (top) and dual residual (bottom) versus iteration, for a lasso problem. The dashed lines show ε^pri (top) and ε^dual (bottom).

We set the penalty parameter to ρ = 1 and set termination tolerances ε^abs = 10^{-4} and ε^rel = 10^{-2}. The variables u^0 and z^0 were initialized to be zero.

Single Problem

We first solve the lasso problem with regularization parameter λ = 0.1λ_max, where λ_max = ‖A^T b‖_∞ is the critical value of λ above which the solution of the lasso problem is x = 0. (Although not relevant, this choice correctly identifies about 80% of the nonzero entries in x^true.)

Figure 11.1 shows the primal and dual residual norms by iteration, as well as the associated stopping criterion limits ε^pri and ε^dual (which vary slightly in each iteration since they depend on x^k, z^k, and y^k through the relative tolerance terms). The stopping criterion was satisfied after 15 iterations, but we ran ADMM for 35 iterations to show the continued progress. Figure 11.2 shows the objective suboptimality p^k − p*, where p^k = (1/2)‖Az^k − b‖_2^2 + λ‖z^k‖_1 is the objective value at z^k. The optimal objective value p* was independently verified using l1_ls [102].
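For reference, a minimal NumPy/SciPy sketch of the lasso ADMM iteration used in this example is given below; it is illustrative only and is not the reference Matlab code. It caches a Cholesky factorization of I + (1/ρ)AA^T via the matrix inversion lemma (discussed in the next paragraphs), and the stopping tolerances follow the ε^abs/ε^rel pattern used throughout.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lasso_admm(A, b, lam, rho=1.0, abstol=1e-4, reltol=1e-2, max_iter=1000):
    m, n = A.shape
    Atb = A.T @ b
    # cache a Cholesky factorization of the smaller m x m matrix I + (1/rho) A A^T
    L = cho_factor(np.eye(m) + (A @ A.T) / rho)
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    for _ in range(max_iter):
        # x-update via the matrix inversion lemma:
        # (A^T A + rho I)^{-1} q = q/rho - A^T (I + A A^T / rho)^{-1} (A q) / rho^2
        q = Atb + rho * (z - u)
        x = q / rho - A.T @ cho_solve(L, A @ q) / rho**2
        z_old = z
        w = x + u
        z = np.maximum(0, w - lam/rho) - np.maximum(0, -w - lam/rho)   # soft thresholding
        u = u + x - z
        # primal/dual residual norms and tolerances
        r_norm = np.linalg.norm(x - z)
        s_norm = np.linalg.norm(rho * (z - z_old))
        eps_pri = np.sqrt(n)*abstol + reltol*max(np.linalg.norm(x), np.linalg.norm(z))
        eps_dual = np.sqrt(n)*abstol + reltol*np.linalg.norm(rho * u)
        if r_norm <= eps_pri and s_norm <= eps_dual:
            break
    return z

# x_hat = lasso_admm(A, b, lam=0.1 * np.linalg.norm(A.T @ b, np.inf))
```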
Fig. 11.2: Objective suboptimality versus iteration for a lasso problem. The stopping criterion is satisfied at iteration 15, indicated by the vertical dashed line.

Since A is fat (i.e., m < n), we apply the matrix inversion lemma to (A^T A + ρI)^{-1} and instead compute the factorization of the smaller matrix I + (1/ρ)AA^T, which is then cached for subsequent x-updates. The factorization step itself takes about nm^2 + (1/3)m^3 flops, which is the cost of forming AA^T and computing the Cholesky factorization. Subsequent updates require two matrix-vector multiplications and forward-backward solves, which require approximately 4mn + 2m^2 flops. (The cost of the soft thresholding step in the z-update is negligible.) For these problem dimensions, the flop count analysis suggests a factor/solve ratio of around 350, which means that 350 subsequent ADMM iterations can be carried out for the cost of the initial factorization.

In our basic implementation, the factorization step takes about 1 second, and subsequent x-updates take around 30 ms. (This gives a factor/solve ratio of only 33, less than predicted, due to a particularly efficient matrix-matrix multiplication routine used in Matlab.) Thus the total cost of solving an entire lasso problem is around 1.5 seconds, only 50% more than the initial factorization. In terms of parameter estimation, we can say that computing the lasso estimate requires only 50%
more time than a ridge regression estimate. (Moreover, in an implementation with a higher factor/solve ratio, the additional effort for the lasso would have been even smaller.)

Fig. 11.3: Iterations needed versus λ for warm start (solid line) and cold start (dashed line).

Finally, we report the effect of varying the parameter ρ on convergence time. Varying ρ over the 100:1 range from 0.1 to 10 yields a solve time ranging between 1.45 seconds and around 4 seconds. (In an implementation with a larger factor/solve ratio, the effect of varying ρ would have been even smaller.) Over-relaxation with α = 1.5 does not significantly change the convergence time with ρ = 1, but it does reduce the worst convergence time over the range ρ ∈ [0.1, 10] to only 2.8 seconds.

Regularization Path

To illustrate computing the regularization path, we solve the lasso problem for 100 values of λ, spaced logarithmically from 0.01λ_max (where x has around 800 nonzeros) to 0.95λ_max (where x has two nonzero entries). We first solve the lasso problem as above for λ = 0.01λ_max, and for each subsequent value of λ, we then initialize (warm start) z and u at their optimal values for the previous λ. This requires only one factorization for all the computations; warm starting ADMM at the
previous value significantly reduces the number of ADMM iterations required to solve each lasso problem after the first one.

Figure 11.3 shows the number of iterations required to solve each lasso problem using this warm start initialization, compared to the number of iterations required using a cold start of z^0 = u^0 = 0 for each λ. For the 100 values of λ, the total number of ADMM iterations required is 428, which takes 13 seconds in all. By contrast, with cold starts, we need 2166 total ADMM iterations and 100 factorizations to compute the regularization path, or around 160 seconds total. This timing information is summarized in Table 11.1.

    Task                                               Time (s)
    Factorization                                      1.1
    x-update                                           0.03
    Single lasso (λ = 0.1λ_max)                        1.5
    Cold start regularization path (100 values of λ)   160
    Warm start regularization path (100 values of λ)   13

    Table 11.1: Summary of timings for the lasso example.

11.2 Distributed l_1 Regularized Logistic Regression

In this example, we use consensus ADMM to fit an l_1 regularized logistic regression model. Following §8, the problem is

    minimize  Σ_{i=1}^m log( 1 + exp(−b_i(a_i^T w + v)) ) + λ‖w‖_1,    (11.1)

with optimization variables w ∈ R^n and v ∈ R. The training set consists of m pairs (a_i, b_i), where a_i ∈ R^n is a feature vector and b_i ∈ {−1, 1} is the corresponding label.

We generated a problem instance with m = 10^6 training examples and n = 10^4 features. The m examples are distributed among N = 100 subsystems, so each subsystem has 10^4 training examples. Each feature vector a_i was generated to have approximately 10 nonzero features, each sampled independently from a standard normal distribution. We chose a 'true' weight vector w^true ∈ R^n to have 100 nonzero values, and these entries, along with the true intercept v^true, were sampled
independently from a standard normal distribution. The labels b_i were then generated using

    b_i = sign(a_i^T w^true + v^true + v_i),

where v_i ~ N(0, 0.1).

The regularization parameter is set to λ = 0.1λ_max, where λ_max is the critical value above which the solution of the problem is w = 0. Here λ_max is more complicated to describe than in the simple lasso case described above. Let θ_neg be the fraction of examples with b_i = −1 and θ_pos the fraction with b_i = 1, and let b̃ ∈ R^m be a vector with entries θ_neg where b_i = 1 and −θ_pos where b_i = −1. Then λ_max = ‖A^T b̃‖_∞ (see [103, §2.1]). (While not relevant here, the final fitted model with λ = 0.1λ_max classified the training examples with around 90% accuracy.)

Fitting the model involves solving the global consensus problem (8.3) in §8.2 with local variables x_i = (v_i, w_i) and consensus variable z = (v, w). As in the lasso example, we used ε^abs = 10^{-4} and ε^rel = 10^{-2} as tolerances and used the initialization u_i^0 = 0, z^0 = 0. We used the penalty parameter value ρ = 1 for the iterations.

We used L-BFGS to carry out the x_i-updates, specifically Nocedal's Fortran 77 implementation of L-BFGS with no tuning: We used default parameters, a memory of 5, and a constant termination tolerance across ADMM iterations (for a more efficient implementation, these tolerances would start large and decrease with the ADMM iterations). We warm started the x_i-updates.

We used a serial implementation that performs the x_i-updates sequentially; in a distributed implementation, of course, the x_i-updates would be performed in parallel. To report an approximation of the timing that would have been achieved in a parallel implementation, we report the maximum time required to update x_i among the N subsystems. This corresponds roughly to the maximum number of L-BFGS iterations required for the x_i-updates.

Figure 11.4 shows the progress of the primal and dual residual norms by iteration. The dashed line shows when the stopping criterion has been satisfied (after 19 iterations), resulting in a primal residual norm of about 1. Since the RMS consensus error can be expressed as (1/√m)‖r^k‖_2 with m = 10^6, a primal residual norm of about 1 means that on average, the elements of x_i agree with z up to the third digit.
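To illustrate the local x_i-update just described, the following is a minimal Python sketch of one such update using SciPy's L-BFGS-B solver (the original implementation uses Nocedal's Fortran code; the use of scipy.optimize.minimize, the variable packing, and all names here are assumptions for illustration). The local variable is packed as x = (v, w), with the intercept first, and z and u_i are packed the same way.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def local_update(A_i, b_i, z, u_i, rho, x0):
    """x_i-update: local logistic loss plus the ADMM quadratic penalty, via L-BFGS."""
    def fgrad(x):
        v, w = x[0], x[1:]
        margins = b_i * (A_i @ w + v)
        loss = np.sum(np.logaddexp(0.0, -margins))        # sum_j log(1 + exp(-margins_j))
        diff = x - z + u_i
        f = loss + 0.5 * rho * np.dot(diff, diff)
        s = expit(-margins)                               # sigmoid(-margins)
        g = np.empty_like(x)
        g[0] = -np.sum(b_i * s) + rho * diff[0]
        g[1:] = -(A_i.T @ (b_i * s)) + rho * diff[1:]
        return f, g
    res = minimize(fgrad, x0, jac=True, method="L-BFGS-B")
    return res.x

# warm starting corresponds to passing the previous x_i as x0 at each ADMM iteration
```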
Fig. 11.4: Progress of the primal and dual residual norms for the distributed l_1 regularized logistic regression problem. The dashed lines show ε^pri (top) and ε^dual (bottom).

Fig. 11.5: Left: Objective suboptimality of distributed l_1 regularized logistic regression versus iteration. Right: Progress versus elapsed time. The stopping criterion is satisfied at iteration 19, indicated by the vertical dashed line.

Figure 11.5 shows the suboptimality p^k − p* for the consensus variable, where

    p^k = Σ_{i=1}^m log( 1 + exp(−b_i(a_i^T w^k + v^k)) ) + λ‖w^k‖_1.
The optimal value p* was verified using l1_logreg [103]. The lefthand plot shows ADMM progress by iteration, while the righthand plot shows progress versus cumulative time in a parallel implementation. It took 19 iterations to satisfy the stopping criterion. The first 4 iterations of ADMM took 2 seconds, while the last 4 iterations (before the stopping criterion is satisfied) took less than 0.5 seconds. This is because as the iterates approach consensus, L-BFGS requires fewer iterations due to warm starting.

11.3 Group Lasso with Feature Splitting

We consider the group lasso example described in §6.4.2,

    minimize  (1/2)‖Ax − b‖_2^2 + λ Σ_{i=1}^N ‖x_i‖_2,

where x = (x_1, ..., x_N), with x_i ∈ R^{n_i}. We will solve the problem by splitting across the feature groups x_1, ..., x_N using the formulation in §8.3.

We generated a problem instance with N = 200 groups of features, with n_i = 100 features per group, for i = 1, ..., 200, for a total of n = 20,000 features and m = 200 examples. A 'true' value x^true ∈ R^n was generated with 9 nonzero groups, resulting in 900 nonzero feature values. The feature matrix A is dense, with entries drawn from an N(0,1) distribution, and its columns then normalized to have unit l_2 norm (as in the lasso example of §11.1). The outcomes b are generated by b = Ax^true + v, where v ~ N(0, 0.1I), which corresponds to a signal-to-noise ratio ‖Ax^true‖_2^2 / ‖v‖_2^2 of around 60.

We used the penalty parameter ρ = 10 and set termination tolerances ε^abs = 10^{-4} and ε^rel = 10^{-2}. The variables u^0 and z^0 were initialized to be zero. We used the regularization parameter value λ = 0.5λ_max, where

    λ_max = max{ ‖A_1^T b‖_2, ..., ‖A_N^T b‖_2 }

is the critical value of λ above which the solution is x = 0. (Although not relevant, this choice of λ correctly identifies 6 of the 9 nonzero groups in x^true and produces an estimate with 17 nonzero groups.) The stopping criterion was satisfied after 47 iterations.
Fig. 11.6: Norms of primal residual (top) and dual residual (bottom) versus iteration, for the distributed group lasso problem. The dashed lines show ε^pri (top) and ε^dual (bottom).

The x_i-updates are computed using the method described in §8.3.2, which involves computing and caching eigendecompositions of A_i^T A_i. The eigenvalue decompositions of A_i^T A_i took around 7 milliseconds; subsequent x_i-updates took around 350 microseconds, around a factor of 20 faster. For 47 ADMM iterations, these numbers predict a total runtime in a serial implementation of about 5 seconds; the actual runtime was around 7 seconds. For a parallel implementation, we can estimate the runtime (neglecting interprocess communication and data distribution) as being about 200 times faster, around 35 milliseconds.

Figure 11.6 shows the progress of the primal and dual residual norms by iteration. The dashed line shows when the stopping criterion is satisfied (after 47 iterations). Figure 11.7 shows the suboptimality p^k − p* for the problem versus iteration, where

    p^k = (1/2)‖Ax^k − b‖_2^2 + λ Σ_{i=1}^N ‖x_i^k‖_2.
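The eigendecomposition caching device used for the x_i-updates above can be sketched generically as follows; this is an illustrative NumPy sketch of the linear algebra trick only (factor A_i^T A_i once, then each solve with A_i^T A_i + ρI costs two matrix-vector products), not the reference implementation, and the class and variable names are assumptions.

```python
import numpy as np

class CachedRidgeSolver:
    """Cache an eigendecomposition of A^T A so that later solves with
    (A^T A + rho*I), for any right-hand side or new rho, are cheap."""
    def __init__(self, A):
        # A^T A is symmetric positive semidefinite: A^T A = Q diag(lam) Q^T
        self.lam, self.Q = np.linalg.eigh(A.T @ A)

    def solve(self, v, rho):
        # (A^T A + rho I)^{-1} v = Q diag(1/(lam + rho)) (Q^T v)
        return self.Q @ ((self.Q.T @ v) / (self.lam + rho))

# usage sketch: one factorization per feature block, reused at every ADMM iteration
# solver_i = CachedRidgeSolver(A_i); x_i = solver_i.solve(rhs_i, rho)
```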
Fig. 11.7: Suboptimality of the distributed group lasso versus iteration. The stopping criterion is satisfied at iteration 47, indicated by the vertical dashed line.

The optimal objective value p* was found by running ADMM for 1000 iterations.

11.4 Distributed Large-Scale Lasso with MPI

In previous sections, we discussed an idealized version of a distributed implementation that was actually carried out serially for simplicity. We now turn to a much more realistic distributed example, in which we solve a very large instance of the lasso problem (6.2) using a distributed solver implemented in C, using MPI for inter-process communication and the GNU Scientific Library (GSL) for linear algebra. In this example, we split the problem across training examples rather than features. We carried out the experiments on a cluster of virtual machines running on Amazon's Elastic Compute Cloud (EC2). Here, we focus entirely on scaling and implementation details.

The data was generated as in §11.1, except that we now solve a problem with m = 400,000 examples and n = 8000 features across N = 80 subsystems, so each subsystem handles 5000 training examples. Note
that the overall problem has a skinny coefficient matrix but each of the subproblems has a fat coefficient matrix. We emphasize that the coefficient matrix is dense, so the full dataset requires over 30 GB to store and has 3.2 billion nonzero entries in the total coefficient matrix A. This is far too large to be solved efficiently, or at all, using standard serial methods on commonly available hardware.

We solved the problem using a cluster of 10 machines. We used Cluster Compute instances, which have 23 GB of RAM and two quad-core Intel Xeon X5570 Nehalem chips, and are connected to each other with 10 Gigabit Ethernet. We used hardware virtual machine images running CentOS 5.4. Since each node had 8 cores, we ran the code with 80 processes, so each subsystem ran on its own core. In MPI, communication between processes on the same machine is performed locally via the shared-memory Byte Transfer Layer (BTL), which provides low latency and high bandwidth communication, while communication across machines goes over the network. The data was sized so all the processes on a single machine could work entirely in RAM. Each node had its own attached Elastic Block Storage (EBS) volume that contained only the local data relevant to that machine, so disk throughput was shared among processes on the same machine but not across machines. This is to emulate a scenario where each machine is only processing the data on its local disk, and none of the dataset is transferred over the network. We emphasize that usage of a cluster set up in this fashion costs under $20 per hour.

We solved the problem with a deliberately naive implementation of the algorithm, based directly on the discussion in §6.4, §8.2, and §10.2. The implementation consists of a single file of C code, under 400 lines despite extensive comments. The linear algebra (BLAS operations and the Cholesky factorization) was performed using a stock installation of the GNU Scientific Library.

We now report the breakdown of the wall-clock runtime. It took roughly 30 seconds to load all the data into memory. It then took 4-5 minutes to form and then compute the Cholesky factorizations of I + (1/ρ)A_i A_i^T. After caching these factorizations, each subsequent ADMM iteration took around a second. This includes the backsolves in the x_i-updates and all the message passing. For this problem,
ADMM converged in 13 iterations, yielding a start-to-finish runtime of under 6 minutes to solve the whole problem. Approximate times are summarized in Table 11.2.

    Total dataset size        30 GB
    Number of subsystems      80
    Total dataset dimensions  400,000 × 8000
    Subsystem dimensions      5000 × 8000
    Data loading time         30 seconds
    Factorization time        5 minutes
    Single iteration time     1 second
    Total runtime             6 minutes

    Table 11.2: Rough summary of a large dense distributed lasso example.

Though we did not compute it as part of this example, the extremely low cost of each iteration means that it would be straightforward to compute the entire regularization path for this problem using the warm-start method described in §11.1. In that example, it required 428 iterations to compute the regularization path for 100 settings of λ, while it took around 15 iterations for a single instance to converge, roughly the same as in this example. Extrapolating to this case, it is plausible that the entire regularization path, even for this very large problem, could easily be obtained in another five to ten minutes.

It is clear that by far the dominant computation is forming and computing the Cholesky factorization, locally and in parallel, of each A_i^T A_i + ρI (or I + (1/ρ)A_i A_i^T, if the matrix inversion lemma is applied). As a result, it is worth keeping in mind that the performance of the linear algebra operations in our basic implementation could be significantly improved by using LAPACK instead of GSL for the Cholesky factorization, and by replacing GSL's BLAS implementation with a hardware-optimized BLAS library produced by ATLAS, a vendor library like Intel MKL, or a GPU-based linear algebra package. This could easily lead to several orders of magnitude faster performance.

In this example, we used a dense coefficient matrix so the code could be written using a single simple math library. Many real-world examples of the lasso have larger numbers of training examples or features, but are sparse and do not have billions of nonzero entries, as we do here. The code we provide could be modified in the usual manner
to handle sparse or structured matrices (e.g., using CHOLMOD [35] for sparse Cholesky factorization), and would then also scale to very large problems. More broadly, it could also be adapted with minimal work to add constraints or otherwise modify the lasso problem, or even to solve completely different problems, like training logistic regression models or SVMs.

It is worth observing that ADMM scales well both horizontally and vertically. We could easily have solved much larger problem instances than the one described here in roughly the same amount of time: by having each subsystem solve a larger subproblem (up to the point where each machine's RAM is saturated, which it was not here); by running more subsystems on each machine (though this can lead to performance degradation in key areas like the factorization step); or simply by adding more machines to the cluster, which is mostly straightforward and relatively inexpensive on Amazon EC2. Up to a certain problem size, the solver can be implemented by users who are not expert in distributed systems, distributed linear algebra, or advanced implementation-level performance enhancements. This is in sharp contrast to what is required in many other cases. Solving extremely large problem instances requiring hundreds or thousands of machines would require a more sophisticated implementation from a systems perspective, but it is interesting to observe that a basic version can solve rather large problems quickly on standard software and hardware. To the best of our knowledge, the example above is one of the largest lasso problems ever solved.

11.5 Regressor Selection

In our last example, we apply ADMM to an instance of the (nonconvex) least squares regressor selection problem described in §9.1, which seeks the best least squares fit to a set of labels b from a combination of no more than c columns of A (regressors). We use the same A and b generated for the dense lasso example in §11.1, with m = 1500 examples and n = 5000 features, but instead of relying on the l_1 regularization heuristic to achieve a sparse solution, we explicitly constrain the cardinality of x to be at most c = 100.
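As described in §9.1 (and recalled below), the z-update in this nonconvex ADMM is a projection onto the set of vectors with at most c nonzeros: it keeps the c largest-magnitude entries of x + u and zeros the rest. A minimal NumPy sketch of that projection, written here for illustration only, is the following.

```python
import numpy as np

def keep_largest(v, c):
    """Project onto {x : card(x) <= c}: keep the c largest-magnitude entries, zero the rest.
    Assumes 0 < c <= len(v)."""
    z = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -c)[-c:]   # indices of the c largest |v_i|
    z[idx] = v[idx]
    return z

# z-update in the regressor selection ADMM: z = keep_largest(x + u, c)
```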
Fig. 11.8: Fit versus cardinality for the lasso (dotted line), the lasso with posterior least squares fit (dashed line), and regressor selection (solid line).

The x-update step has exactly the same expression as in the lasso example, so we use the same method, based on the matrix inversion lemma and caching, described in that example. The z-update step consists of keeping the c largest magnitude components of x + u and zeroing the rest. For the sake of clarity, we performed an intermediate sorting of the components, but more efficient schemes are possible. In any case, the cost of the z-update is negligible compared with that of the x-update.

Convergence of ADMM for a nonconvex problem such as this one is not guaranteed; and even when it does converge, the final result can depend on the choice of ρ and the initial values for z and u. To explore this, we ran 100 ADMM simulations with randomly chosen initial values and ρ ranging between 0.1 and 100. Indeed, some of them did not converge, or at least were converging slowly. But most of them converged, though not to exactly the same points. However, the objective values obtained by those that converged were reasonably close to each other, typically within 5%. The different values of x found had small
variations in support (choice of regressors) and value (weights), but the largest weights were consistently assigned to the same regressors.

We now compare nonconvex regressor selection with the lasso, in terms of the sparsity-fit trade-off obtained. We obtain this curve for regressor selection by running ADMM for each value of c between c = 1 and c = 120. For the lasso, we compute the regularization path for 300 values of λ; for each x^lasso found, we then perform a least squares fit using the sparsity pattern in x^lasso to get our final x. For each cardinality, we plot the best fit found among all such x.

Figure 11.8 shows the trade-off curves obtained by regressor selection and the lasso, with and without the posterior least squares fit. We see that while the results are not exactly the same, they are quite similar, and for all practical purposes, equivalent. This suggests that regressor selection via ADMM works as well as the lasso for obtaining a good cardinality-fit trade-off; it might have an advantage when the desired cardinality is known ahead of time.
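For completeness, the posterior least squares fit used above to polish each lasso solution (refit by least squares on the support of x^lasso) can be sketched in a few lines of NumPy; this is an illustrative sketch only, not the code used to produce the figure.

```python
import numpy as np

def polished_fit(A, b, x_lasso):
    """Refit by least squares on the support of a lasso solution (the posterior fit)."""
    S = np.flatnonzero(x_lasso)
    x = np.zeros(A.shape[1])
    if S.size:
        sol, *_ = np.linalg.lstsq(A[:, S], b, rcond=None)
        x[S] = sol
    return x
```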
12 Conclusions

We have discussed ADMM and illustrated its applicability to distributed convex optimization in general and to many problems in statistical machine learning in particular. We argue that ADMM can serve as a good general-purpose tool for optimization problems arising in the analysis and processing of modern massive datasets. Much as gradient descent and the conjugate gradient method are standard tools of great use when optimizing smooth functions on a single machine, ADMM should be viewed as an analogous tool in the distributed regime.

ADMM sits at a higher level of abstraction than classical optimization algorithms like Newton's method. In such algorithms, the base operations are low-level, consisting of linear algebra operations and the computation of gradients and Hessians. In the case of ADMM, the base operations include solving small convex optimization problems (which in some cases can be done via a simple analytical formula). For example, when applying ADMM to a very large model fitting problem, each update reduces to a (regularized) model fitting problem on a smaller dataset. These subproblems can be solved using any standard serial algorithm suitable for small to medium sized problems. In this sense, ADMM builds on existing algorithms for single machines, and so can be
viewed as a modular coordination algorithm that incentivizes a set of simpler algorithms to collaborate to solve much larger global problems together than they could on their own. Alternatively, it can be viewed as a simple way of bootstrapping specialized algorithms for small to medium sized problems to work on much larger problems than would otherwise be possible.

We emphasize that for any particular problem, it is likely that another method will perform better than ADMM, or that some variation on ADMM will substantially improve performance. However, a simple algorithm derived from basic ADMM will often offer performance that is at least comparable to very specialized algorithms (even in the serial setting), and in most cases, the simple ADMM algorithm will be efficient enough to be useful. In a few cases, ADMM-based methods actually turn out to be state-of-the-art even in the serial regime. Moreover, ADMM has the benefit of being extremely simple to implement, and it maps onto several standard distributed programming models reasonably well.

ADMM was developed over a generation ago, with its roots stretching far in advance of the Internet, distributed and cloud computing systems, massive high-dimensional datasets, and the associated large-scale applied statistical problems. Despite this, it appears to be well suited to the modern regime, and has the important benefit of being quite general in its scope and applicability.
Acknowledgments

We are very grateful to Rob Tibshirani and Trevor Hastie for encouraging us to write this review. Thanks also to Alexis Battle, Dimitri Bertsekas, Danny Bickson, Tom Goldstein, Dimitry Gorinevsky, Daphne Koller, Vicente Malave, Stephen Oakley, and Alex Teichman for helpful comments and discussions. Yang Wang and Matt Kraning helped in developing ADMM for the sharing and exchange problems, and Arezou Keshavarz helped work out ADMM for generalized additive models. We thank Georgios Giannakis and Alejandro Ribeiro for pointing out some very relevant references that we had missed in an earlier version. We thank John Duchi for a very careful reading of the manuscript and for suggestions that greatly improved it.

Support for this work was provided in part by AFOSR grant FA and NASA grant NNX07AEIIA. Neal Parikh was supported by the Cortlandt and Jean E. Van Rensselaer Engineering Fellowship from Stanford University and by the National Science Foundation Graduate Research Fellowship under Grant No. DGE. Eric Chu was supported by the Pan Wen-Yuan Foundation Scholarship.
A Convergence Proof

The basic convergence result given in §3.2 can be found in several references, such as [81, 63]. Many of these give more sophisticated results, with more general penalties or inexact minimization. For completeness, we give a proof here.

We will show that if f and g are closed, proper, and convex, and the Lagrangian L_0 has a saddle point, then we have primal residual convergence, meaning that r^k → 0, and objective convergence, meaning that p^k → p*, where p^k = f(x^k) + g(z^k). We will also see that the dual residual s^k = ρA^T B(z^k − z^{k−1}) converges to zero.

Let (x*, z*, y*) be a saddle point for L_0, and define

    V^k = (1/ρ)‖y^k − y*‖_2^2 + ρ‖B(z^k − z*)‖_2^2.

We will see that V^k is a Lyapunov function for the algorithm, i.e., a nonnegative quantity that decreases in each iteration. (Note that V^k is unknown while the algorithm runs, since it depends on the unknown values z* and y*.)

We first outline the main idea. The proof relies on three key inequalities, which we will prove below using basic results from convex analysis
along with simple algebra. The first inequality is

    V^{k+1} ≤ V^k − ρ‖r^{k+1}‖_2^2 − ρ‖B(z^{k+1} − z^k)‖_2^2.    (A.1)

This states that V^k decreases in each iteration by an amount that depends on the norm of the residual and on the change in z over one iteration. Because V^k ≤ V^0, it follows that y^k and Bz^k are bounded. Iterating the inequality above gives that

    ρ Σ_{k=0}^∞ ( ‖r^{k+1}‖_2^2 + ‖B(z^{k+1} − z^k)‖_2^2 ) ≤ V^0,

which implies that r^k → 0 and B(z^{k+1} − z^k) → 0 as k → ∞. Multiplying the second expression by ρA^T shows that the dual residual s^{k+1} = ρA^T B(z^{k+1} − z^k) converges to zero. (This shows that the stopping criterion (3.12), which requires the primal and dual residuals to be small, will eventually hold.)

The second key inequality is

    p^{k+1} − p* ≤ −(y^{k+1})^T r^{k+1} − ρ(B(z^{k+1} − z^k))^T ( −r^{k+1} + B(z^{k+1} − z*) ),    (A.2)

and the third inequality is

    p* − p^{k+1} ≤ (y*)^T r^{k+1}.    (A.3)

The righthand side in (A.2) goes to zero as k → ∞, because B(z^{k+1} − z*) is bounded and both r^{k+1} and B(z^{k+1} − z^k) go to zero. The righthand side in (A.3) goes to zero as k → ∞, since r^{k+1} goes to zero. Thus we have lim_{k→∞} p^k = p*, i.e., objective convergence.

Before giving the proofs of the three key inequalities, we derive the inequality (3.11) mentioned in our discussion of the stopping criterion from the inequality (A.2). We simply observe that −r^{k+1} + B(z^{k+1} − z*) = −A(x^{k+1} − x*); substituting this into (A.2) yields (3.11),

    p^{k+1} − p* ≤ −(y^{k+1})^T r^{k+1} + (x^{k+1} − x*)^T s^{k+1}.

Proof of inequality (A.3)

Since (x*, z*, y*) is a saddle point for L_0, we have

    L_0(x*, z*, y*) ≤ L_0(x^{k+1}, z^{k+1}, y*).
Using Ax* + Bz* = c, the lefthand side is p*. With p^{k+1} = f(x^{k+1}) + g(z^{k+1}), this can be written as

    p* ≤ p^{k+1} + (y*)^T r^{k+1},

which gives (A.3).

Proof of inequality (A.2)

By definition, x^{k+1} minimizes L_ρ(x, z^k, y^k). Since f is closed, proper, and convex, it is subdifferentiable, and so is L_ρ. The (necessary and sufficient) optimality condition is

    0 ∈ ∂L_ρ(x^{k+1}, z^k, y^k) = ∂f(x^{k+1}) + A^T y^k + ρA^T(Ax^{k+1} + Bz^k − c).

(Here we use the basic fact that the subdifferential of the sum of a subdifferentiable function and a differentiable function with domain R^n is the sum of the subdifferential and the gradient; see, e.g., [140, 23].)

Since y^{k+1} = y^k + ρr^{k+1}, we can plug in y^k = y^{k+1} − ρr^{k+1} and rearrange to obtain

    0 ∈ ∂f(x^{k+1}) + A^T( y^{k+1} − ρB(z^{k+1} − z^k) ).

This implies that x^{k+1} minimizes

    f(x) + ( y^{k+1} − ρB(z^{k+1} − z^k) )^T Ax.

A similar argument shows that z^{k+1} minimizes g(z) + (y^{k+1})^T Bz. It follows that

    f(x^{k+1}) + ( y^{k+1} − ρB(z^{k+1} − z^k) )^T Ax^{k+1} ≤ f(x*) + ( y^{k+1} − ρB(z^{k+1} − z^k) )^T Ax*

and that

    g(z^{k+1}) + (y^{k+1})^T Bz^{k+1} ≤ g(z*) + (y^{k+1})^T Bz*.

Adding the two inequalities above, using Ax* + Bz* = c, and rearranging, we obtain (A.2).
Proof of inequality (A.1)

Adding (A.2) and (A.3), regrouping terms, and multiplying through by 2 gives

    2(y^{k+1} − y*)^T r^{k+1} − 2ρ(B(z^{k+1} − z^k))^T r^{k+1} + 2ρ(B(z^{k+1} − z^k))^T (B(z^{k+1} − z*)) ≤ 0.    (A.4)

The result (A.1) will follow from this inequality after some manipulation and rewriting.

We begin by rewriting the first term. Substituting y^{k+1} = y^k + ρr^{k+1} gives

    2(y^k − y*)^T r^{k+1} + ρ‖r^{k+1}‖_2^2 + ρ‖r^{k+1}‖_2^2,

and substituting r^{k+1} = (1/ρ)(y^{k+1} − y^k) in the first two terms gives

    (2/ρ)(y^k − y*)^T (y^{k+1} − y^k) + (1/ρ)‖y^{k+1} − y^k‖_2^2 + ρ‖r^{k+1}‖_2^2.

Since y^{k+1} − y^k = (y^{k+1} − y*) − (y^k − y*), this can be written as

    (1/ρ)( ‖y^{k+1} − y*‖_2^2 − ‖y^k − y*‖_2^2 ) + ρ‖r^{k+1}‖_2^2.    (A.5)

We now rewrite the remaining terms, i.e.,

    ρ‖r^{k+1}‖_2^2 − 2ρ(B(z^{k+1} − z^k))^T r^{k+1} + 2ρ(B(z^{k+1} − z^k))^T (B(z^{k+1} − z*)),

where ρ‖r^{k+1}‖_2^2 is taken from (A.5). Substituting

    z^{k+1} − z* = (z^{k+1} − z^k) + (z^k − z*)

in the last term gives

    ρ‖r^{k+1} − B(z^{k+1} − z^k)‖_2^2 + ρ‖B(z^{k+1} − z^k)‖_2^2 + 2ρ(B(z^{k+1} − z^k))^T (B(z^k − z*)),

and substituting

    z^{k+1} − z^k = (z^{k+1} − z*) − (z^k − z*)

in the last two terms, we get

    ρ‖r^{k+1} − B(z^{k+1} − z^k)‖_2^2 + ρ( ‖B(z^{k+1} − z*)‖_2^2 − ‖B(z^k − z*)‖_2^2 ).
With the previous step, this implies that (A.4) can be written as

    V^k − V^{k+1} ≥ ρ‖r^{k+1} − B(z^{k+1} − z^k)‖_2^2.    (A.6)

To show (A.1), it now suffices to show that the middle term −2ρ(r^{k+1})^T (B(z^{k+1} − z^k)) of the expanded righthand side of (A.6) is nonnegative. To see this, recall that z^{k+1} minimizes g(z) + (y^{k+1})^T Bz and z^k minimizes g(z) + (y^k)^T Bz, so we can add

    g(z^{k+1}) + (y^{k+1})^T Bz^{k+1} ≤ g(z^k) + (y^{k+1})^T Bz^k

and

    g(z^k) + (y^k)^T Bz^k ≤ g(z^{k+1}) + (y^k)^T Bz^{k+1}

to get that

    (y^{k+1} − y^k)^T (B(z^{k+1} − z^k)) ≤ 0.

Substituting y^{k+1} − y^k = ρr^{k+1} gives the result, since ρ > 0.
114 References [1] M. V. Afonso, J. M. Boucas-Das, and M. A. T. Fgueredo, Fast mage recovery usng varable splttng and constraned optmzaton, IEEE Transactons on Image Processng, vol. 19, no. 9, pp , [2] M. V. Afonso, J. M. Boucas-Das, and M. A. T. Fgueredo, An Augmented Lagrangan Approach to the Constraned Optmzaton Formulaton of Imagng Inverse Problems, IEEE Transactons on Image Processng, vol. 20, pp , [3] E. Anderson, Z. Ba, C. Bschof, J. Demmel, J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarlng, A. McKenney, and D. Sorenson, LAPACK: A portable lnear algebra lbrary for hgh-performance computers. IEEE Computng Socety Press, [4] K. J. Arrow and G. Debreu, Exstence of an equlbrum for a compettve economy, Econometrca: Journal of the Econometrc Socety, vol. 22, no. 3, pp , [5] K. J. Arrow, L. Hurwcz, and H. Uzawa, Studes n Lnear and Nonlnear Programmng. Stanford Unversty Press: Stanford, [6] K. J. Arrow and R. M. Solow, Gradent methods for constraned maxma, wth weakened assumptons, n Studes n Lnear and Nonlnear Programmng, (K. J. Arrow, L. Hurwcz, and H. Uzawa, eds.), Stanford Unversty Press: Stanford, [7] O. Banerjee, L. E. Ghaou, and A. d Aspremont, Model selecton through sparse maxmum lkelhood estmaton for multvarate Gaussan or bnary data, Journal of Machne Learnng Research, vol. 9, pp ,
115 112 References [8] P. L. Bartlett, M. I. Jordan, and J. D. McAulffe, Convexty, classfcaton, and rsk bounds, Journal of the Amercan Statstcal Assocaton, vol. 101, no. 473, pp , [9] H. H. Bauschke and J. M. Borwen, Dykstra s alternatng projecton algorthm for two sets, Journal of Approxmaton Theory, vol. 79, no. 3, pp , [10] H. H. Bauschke and J. M. Borwen, On projecton algorthms for solvng convex feasblty problems, SIAM Revew, vol. 38, no. 3, pp , [11] A. Beck and M. Teboulle, A fast teratve shrnkage-thresholdng algorthm for lnear nverse problems, SIAM Journal on Imagng Scences, vol. 2, no. 1, pp , [12] S. Becker, J. Bobn, and E. J. Candès, NESTA: A fast and accurate frstorder method for sparse recovery, Avalable at edu/ emmanuel/papers/nesta.pdf, [13] J. F. Benders, Parttonng procedures for solvng mxed-varables programmng problems, Numersche Mathematk, vol. 4, pp , [14] A. Bensoussan, J.-L. Lons, and R. Temam, Sur les méthodes de décomposton, de décentralsaton et de coordnaton et applcatons, Methodes Mathematques de l Informatque, pp , [15] D. P. Bertsekas, Constraned Optmzaton and Lagrange Multpler Methods. Academc Press, [16] D. P. Bertsekas, Nonlnear Programmng. Athena Scentfc, second ed., [17] D. P. Bertsekas and J. N. Tstskls, Parallel and Dstrbuted Computaton: Numercal Methods. Prentce Hall, [18] J. M. Boucas-Das and M. A. T. Fgueredo, Alternatng Drecton Algorthms for Constraned Sparse Regresson: Applcaton to Hyperspectral Unmxng, arxv: , [19] J. Borwen and A. Lews, Convex Analyss and Nonlnear Optmzaton: Theory and Examples. Canadan Mathematcal Socety, [20] S. Boyd and L. Vandenberghe, Convex Optmzaton. Cambrdge Unversty Press, [21] L. M. Bregman, Fndng the common pont of convex sets by the method of successve projectons, Proceedngs of the USSR Academy of Scences, vol. 162, no. 3, pp , [22] L. M. Bregman, The relaxaton method of fndng the common pont of convex sets and ts applcaton to the soluton of problems n convex programmng, USSR Computatonal Mathematcs and Mathematcal Physcs, vol. 7, no. 3, pp , [23] H. Brézs, Opérateurs Maxmaux Monotones et Sem-Groupes de Contractons dans les Espaces de Hlbert. North-Holland: Amsterdam, [24] A. M. Brucksten, D. L. Donoho, and M. Elad, From sparse solutons of systems of equatons to sparse modelng of sgnals and mages, SIAM Revew, vol. 51, no. 1, pp , [25] Y. Bu, B. Howe, M. Balaznska, and M. D. Ernst, HaLoop: Effcent Iteratve Data Processng on Large Clusters, Proceedngs of the 36th Internatonal Conference on Very Large Databases, 2010.
116 References 113 [26] R. H. Byrd, P. Lu, and J. Nocedal, A Lmted Memory Algorthm for Bound Constraned Optmzaton, SIAM Journal on Scentfc and Statstcal Computng, vol. 16, no. 5, pp , [27] E. J. Candès and Y. Plan, Near-deal model selecton by l 1 mnmzaton, Annals of Statstcs, vol. 37, no. 5A, pp , [28] E. J. Candès, J. Romberg, and T. Tao, Robust uncertanty prncples: Exact sgnal reconstructon from hghly ncomplete frequency nformaton, IEEE Transactons on Informaton Theory, vol. 52, no. 2, p. 489, [29] E. J. Candès and T. Tao, Near-optmal sgnal recovery from random projectons: Unversal encodng strateges?, IEEE Transactons on Informaton Theory, vol. 52, no. 12, pp , [30] Y. Censor and S. A. Zenos, Proxmal mnmzaton algorthm wth D- functons, Journal of Optmzaton Theory and Applcatons, vol. 73, no. 3, pp , [31] Y. Censor and S. A. Zenos, Parallel Optmzaton: Theory, Algorthms, and Applcatons. Oxford Unversty Press, [32] F. Chang, J. Dean, S. Ghemawat, W. C. Hseh, D. A. Wallach, M. Burrows, T. Chandra, A. Fkes, and R. E. Gruber, BgTable: A dstrbuted storage system for structured data, ACM Transactons on Computer Systems, vol. 26, no. 2, pp. 1 26, [33] G. Chen and M. Teboulle, A proxmal-based decomposton method for convex mnmzaton problems, Mathematcal Programmng, vol. 64, pp , [34] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomc decomposton by bass pursut, SIAM Revew, vol. 43, pp , [35] Y. Chen, T. A. Davs, W. W. Hager, and S. Rajamanckam, Algorthm 887: CHOLMOD, supernodal sparse Cholesky factorzaton and update/downdate, ACM Transactons on Mathematcal Software, vol. 35, no. 3, p. 22, [36] W. Cheney and A. A. Goldsten, Proxmty maps for convex sets, Proceedngs of the Amercan Mathematcal Socety, vol. 10, no. 3, pp , [37] C. T. Chu, S. K. Km, Y. A. Ln, Y. Y. Yu, G. Bradsk, A. Y. Ng, and K. Olukotun, MapReduce for machne learnng on multcore, n Advances n Neural Informaton Processng Systems, [38] J. F. Claerbout and F. Mur, Robust modelng wth erratc data, Geophyscs, vol. 38, p. 826, [39] P. L. Combettes, The convex feasblty problem n mage recovery, Advances n Imagng and Electron Physcs, vol. 95, pp , [40] P. L. Combettes and J. C. Pesquet, A Douglas-Rachford splttng approach to nonsmooth convex varatonal sgnal recovery, IEEE Journal on Selected Topcs n Sgnal Processng, vol. 1, no. 4, pp , [41] P. L. Combettes and J. C. Pesquet, Proxmal Splttng Methods n Sgnal Processng, arxv: , [42] P. L. Combettes and V. R. Wajs, Sgnal recovery by proxmal forwardbackward splttng, Multscale Modelng and Smulaton, vol. 4, no. 4, pp , 2006.
117 114 References [43] G. B. Dantzg, Lnear Programmng and Extensons. RAND Corporaton, [44] G. B. Dantzg and P. Wolfe, Decomposton prncple for lnear programs, Operatons Research, vol. 8, pp , [45] I. Daubeches, M. Defrse, and C. D. Mol, An teratve thresholdng algorthm for lnear nverse problems wth a sparsty constrant, Communcatons on Pure and Appled Mathematcs, vol. 57, pp , [46] J. Dean and S. Ghemawat, MapReduce: Smplfed data processng on large clusters, Communcatons of the ACM, vol. 51, no. 1, pp , [47] J. W. Demmel, Appled Numercal Lnear Algebra. SIAM: Phladelpha, PA, [48] A. P. Dempster, Covarance selecton, Bometrcs, vol. 28, no. 1, pp , [49] D. L. Donoho, De-nosng by soft-thresholdng, IEEE Transactons on Informaton Theory, vol. 41, pp , [50] D. L. Donoho, Compressed sensng, IEEE Transactons on Informaton Theory, vol. 52, no. 4, pp , [51] D. L. Donoho, A. Malek, and A. Montanar, Message-passng algorthms for compressed sensng, Proceedngs of the Natonal Academy of Scences, vol. 106, no. 45, p , [52] D. L. Donoho and Y. Tsag, Fast soluton of l 1-norm mnmzaton problems when the soluton may be sparse, Tech. Rep., Stanford Unversty, [53] J. Douglas and H. H. Rachford, On the numercal soluton of heat conducton problems n two and three space varables, Transactons of the Amercan Mathematcal Socety, vol. 82, pp , [54] J. C. Duch, A. Agarwal, and M. J. Wanwrght, Dstrbuted Dual Averagng n Networks, n Advances n Neural Informaton Processng Systems, [55] J. C. Duch, S. Gould, and D. Koller, Projected subgradent methods for learnng sparse Gaussans, n Proceedngs of the Conference on Uncertanty n Artfcal Intellgence, [56] R. L. Dykstra, An algorthm for restrcted least squares regresson, Journal of the Amercan Statstcal Assocaton, vol. 78, pp , [57] J. Ecksten, Splttng methods for monotone operators wth applcatons to parallel optmzaton. PhD thess, MIT, [58] J. Ecksten, Nonlnear proxmal pont algorthms usng Bregman functons, wth applcatons to convex programmng, Mathematcs of Operatons Research, pp , [59] J. Ecksten, Parallel alternatng drecton multpler decomposton of convex programs, Journal of Optmzaton Theory and Applcatons, vol. 80, no. 1, pp , [60] J. Ecksten, Some saddle-functon splttng methods for convex programmng, Optmzaton Methods and Software, vol. 4, no. 1, pp , [61] J. Ecksten, A practcal general approxmaton crteron for methods of multplers based on Bregman dstances, Mathematcal Programmng, vol. 96, no. 1, pp , [62] J. Ecksten and D. P. Bertsekas, An alternatng drecton method for lnear programmng, Tech. Rep., MIT, 1990.