Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers


Foundations and Trends® in Machine Learning, Vol. 3, No. 1 (2010). © 2011 S. Boyd, N. Parikh, E. Chu, B. Peleato and J. Eckstein. DOI: /

Stephen Boyd (1), Neal Parikh (2), Eric Chu (3), Borja Peleato (4) and Jonathan Eckstein (5)

(1) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA
(2) Computer Science Department, Stanford University, Stanford, CA 94305, USA
(3) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA
(4) Electrical Engineering Department, Stanford University, Stanford, CA 94305, USA
(5) Management Science and Information Systems Department and RUTCOR, Rutgers University, Piscataway, NJ 08854, USA
Contents

1 Introduction
2 Precursors
   2.1 Dual Ascent
   2.2 Dual Decomposition
   2.3 Augmented Lagrangians and the Method of Multipliers
3 Alternating Direction Method of Multipliers
   3.1 Algorithm
   3.2 Convergence
   3.3 Optimality Conditions and Stopping Criterion
   3.4 Extensions and Variations
   3.5 Notes and References
4 General Patterns
   4.1 Proximity Operator
   4.2 Quadratic Objective Terms
   4.3 Smooth Objective Terms
   4.4 Decomposition
5 Constrained Convex Optimization
   5.1 Convex Feasibility
   5.2 Linear and Quadratic Programming
6 $\ell_1$ Norm Problems
   6.1 Least Absolute Deviations
   6.2 Basis Pursuit
   6.3 General $\ell_1$ Regularized Loss Minimization
   6.4 Lasso
   6.5 Sparse Inverse Covariance Selection
7 Consensus and Sharing
   7.1 Global Variable Consensus Optimization
   7.2 General Form Consensus Optimization
   7.3 Sharing
8 Distributed Model Fitting
   8.1 Examples
   8.2 Splitting across Examples
   8.3 Splitting across Features
9 Nonconvex Problems
   9.1 Nonconvex Constraints
   9.2 Bi-convex Problems
10 Implementation
   10.1 Abstract Implementation
   10.2 MPI
   10.3 Graph Computing Frameworks
   10.4 MapReduce
11 Numerical Examples
   11.1 Small Dense Lasso
   11.2 Distributed $\ell_1$ Regularized Logistic Regression
   11.3 Group Lasso with Feature Splitting
   11.4 Distributed Large-Scale Lasso with MPI
   11.5 Regressor Selection
12 Conclusions
Acknowledgments
A Convergence Proof
References
Abstract

Many problems of recent interest in statistics and machine learning can be posed in the framework of convex optimization. Due to the explosion in size and complexity of modern datasets, it is increasingly important to be able to solve problems with a very large number of features or training examples. As a result, both the decentralized collection or storage of these datasets and accompanying distributed solution methods are either necessary or at least highly desirable. In this review, we argue that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas. The method was developed in the 1970s, with roots in the 1950s, and is equivalent or closely related to many other algorithms, such as dual decomposition, the method of multipliers, Douglas-Rachford splitting, Spingarn's method of partial inverses, Dykstra's alternating projections, Bregman iterative algorithms for $\ell_1$ problems, proximal methods, and others. After briefly surveying the theory and history of the algorithm, we discuss applications to a wide variety of statistical and machine learning problems of recent interest, including the lasso, sparse logistic regression, basis pursuit, covariance selection, support vector machines, and many others. We also discuss general distributed optimization, extensions to the nonconvex setting, and efficient implementation, including some details on distributed MPI and Hadoop MapReduce implementations.
1 Introduction

In all applied fields, it is now commonplace to attack problems through data analysis, particularly through the use of statistical and machine learning algorithms on what are often large datasets. In industry, this trend has been referred to as "Big Data", and it has had a significant impact in areas as varied as artificial intelligence, internet applications, computational biology, medicine, finance, marketing, journalism, network analysis, and logistics.

Though these problems arise in diverse application domains, they share some key characteristics. First, the datasets are often extremely large, consisting of hundreds of millions or billions of training examples; second, the data is often very high-dimensional, because it is now possible to measure and store very detailed information about each example; and third, because of the large scale of many applications, the data is often stored or even collected in a distributed manner. As a result, it has become of central importance to develop algorithms that are both rich enough to capture the complexity of modern data, and scalable enough to process huge datasets in a parallelized or fully decentralized fashion. Indeed, some researchers [92] have suggested that even highly complex and structured problems may succumb most easily to relatively simple models trained on vast datasets.
Many such problems can be posed in the framework of convex optimization. Given the significant work on decomposition methods and decentralized algorithms in the optimization community, it is natural to look to parallel optimization algorithms as a mechanism for solving large-scale statistical tasks. This approach also has the benefit that one algorithm could be flexible enough to solve many problems.

This review discusses the alternating direction method of multipliers (ADMM), a simple but powerful algorithm that is well suited to distributed convex optimization, and in particular to problems arising in applied statistics and machine learning. It takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization, two earlier approaches that we review in §2. It turns out to be equivalent or closely related to many other algorithms as well, such as Douglas-Rachford splitting from numerical analysis, Spingarn's method of partial inverses, Dykstra's alternating projections method, Bregman iterative algorithms for $\ell_1$ problems in signal processing, proximal methods, and many others. The fact that it has been reinvented in different fields over the decades underscores the intuitive appeal of the approach.

It is worth emphasizing that the algorithm itself is not new, and that we do not present any new theoretical results. It was first introduced in the mid-1970s by Gabay, Mercier, Glowinski, and Marrocco, though similar ideas emerged as early as the mid-1950s. The algorithm was studied throughout the 1980s, and by the mid-1990s, almost all of the theoretical results mentioned here had been established. The fact that ADMM was developed so far in advance of the ready availability of large-scale distributed computing systems and massive optimization problems may account for why it is not as widely known today as we believe it should be.
The main contributions of this review can be summarized as follows:

(1) We provide a simple, cohesive discussion of the extensive literature in a way that emphasizes and unifies the aspects of primary importance in applications.
(2) We show, through a number of examples, that the algorithm is well suited for a wide variety of large-scale distributed modern problems. We derive methods for decomposing a wide class of statistical problems by training examples and by features, which is not easily accomplished in general.

(3) We place a greater emphasis on practical large-scale implementation than most previous references. In particular, we discuss the implementation of the algorithm in cloud computing environments using standard frameworks and provide easily readable implementations of many of our examples.

Throughout, the focus is on applications rather than theory, and a main goal is to provide the reader with a kind of "toolbox" that can be applied in many situations to derive and implement a distributed algorithm of practical use. Though the focus here is on parallelism, the algorithm can also be used serially, and it is interesting to note that with no tuning, ADMM can be competitive with the best known methods for some problems.

While we have emphasized applications that can be concisely explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period portfolio optimization, time series analysis, network flow, or scheduling.

Outline

We begin in §2 with a brief review of dual decomposition and the method of multipliers, two important precursors to ADMM. This section is intended mainly for background and can be skimmed. In §3, we present ADMM, including a basic convergence theorem, some variations on the basic version that are useful in practice, and a survey of some of the key literature. A complete convergence proof is given in appendix A.

In §4, we describe some general patterns that arise in applications of the algorithm, such as cases when one of the steps in ADMM can
be carried out particularly efficiently. These general patterns will recur throughout our examples. In §5, we consider the use of ADMM for some generic convex optimization problems, such as constrained minimization and linear and quadratic programming. In §6, we discuss a wide variety of problems involving the $\ell_1$ norm. It turns out that ADMM yields methods for these problems that are related to many state-of-the-art algorithms. This section also clarifies why ADMM is particularly well suited to machine learning problems.

In §7, we present consensus and sharing problems, which provide general frameworks for distributed optimization. In §8, we consider distributed methods for generic model fitting problems, including regularized regression models like the lasso and classification models like support vector machines.

In §9, we consider the use of ADMM as a heuristic for solving some nonconvex problems. In §10, we discuss some practical implementation details, including how to implement the algorithm in frameworks suitable for cloud computing applications. Finally, in §11, we present the details of some numerical experiments.
2 Precursors

In this section, we briefly review two optimization algorithms that are precursors to the alternating direction method of multipliers. While we will not use this material in the sequel, it provides some useful background and motivation.

2.1 Dual Ascent

Consider the equality-constrained convex optimization problem

$$\text{minimize } f(x) \quad \text{subject to } Ax = b, \tag{2.1}$$

with variable $x \in \mathbf{R}^n$, where $A \in \mathbf{R}^{m \times n}$ and $f : \mathbf{R}^n \to \mathbf{R}$ is convex. The Lagrangian for problem (2.1) is

$$L(x,y) = f(x) + y^T(Ax - b)$$

and the dual function is

$$g(y) = \inf_x L(x,y) = -f^*(-A^T y) - b^T y,$$

where $y$ is the dual variable or Lagrange multiplier, and $f^*$ is the convex conjugate of $f$; see [20, §3.3] or [140, §12] for background. The dual
problem is

$$\text{maximize } g(y),$$

with variable $y \in \mathbf{R}^m$. Assuming that strong duality holds, the optimal values of the primal and dual problems are the same. We can recover a primal optimal point $x^\star$ from a dual optimal point $y^\star$ as

$$x^\star = \operatorname*{argmin}_x L(x, y^\star),$$

provided there is only one minimizer of $L(x, y^\star)$. (This is the case if, e.g., $f$ is strictly convex.) In the sequel, we will use the notation $\operatorname{argmin}_x F(x)$ to denote any minimizer of $F$, even when $F$ does not have a unique minimizer.

In the dual ascent method, we solve the dual problem using gradient ascent. Assuming that $g$ is differentiable, the gradient $\nabla g(y)$ can be evaluated as follows. We first find $x^+ = \operatorname{argmin}_x L(x,y)$; then we have $\nabla g(y) = Ax^+ - b$, which is the residual for the equality constraint. The dual ascent method consists of iterating the updates

$$x^{k+1} := \operatorname*{argmin}_x L(x, y^k) \tag{2.2}$$

$$y^{k+1} := y^k + \alpha^k (Ax^{k+1} - b), \tag{2.3}$$

where $\alpha^k > 0$ is a step size, and the superscript is the iteration counter. The first step (2.2) is an $x$-minimization step, and the second step (2.3) is a dual variable update. The dual variable $y$ can be interpreted as a vector of prices, and the $y$-update is then called a price update or price adjustment step. This algorithm is called dual ascent since, with appropriate choice of $\alpha^k$, the dual function increases in each step, i.e., $g(y^{k+1}) > g(y^k)$.

The dual ascent method can be used even in some cases when $g$ is not differentiable. In this case, the residual $Ax^{k+1} - b$ is not the gradient of $g$, but the negative of a subgradient of $-g$. This case requires a different choice of the $\alpha^k$ than when $g$ is differentiable, and convergence is not monotone; it is often the case that $g(y^{k+1}) \not> g(y^k)$. In this case, the algorithm is usually called the dual subgradient method [152].

If $\alpha^k$ is chosen appropriately and several other assumptions hold, then $x^k$ converges to an optimal point and $y^k$ converges to an optimal
dual point. However, these assumptions do not hold in many applications, so dual ascent often cannot be used. As an example, if $f$ is a nonzero affine function of any component of $x$, then the $x$-update (2.2) fails, since $L$ is unbounded below in $x$ for most $y$.

2.2 Dual Decomposition

The major benefit of the dual ascent method is that it can lead to a decentralized algorithm in some cases. Suppose, for example, that the objective $f$ is separable (with respect to a partition or splitting of the variable into subvectors), meaning that

$$f(x) = \sum_{i=1}^N f_i(x_i),$$

where $x = (x_1, \ldots, x_N)$ and the variables $x_i \in \mathbf{R}^{n_i}$ are subvectors of $x$. Partitioning the matrix $A$ conformably as

$$A = [A_1 \ \cdots \ A_N],$$

so $Ax = \sum_{i=1}^N A_i x_i$, the Lagrangian can be written as

$$L(x,y) = \sum_{i=1}^N L_i(x_i, y) = \sum_{i=1}^N \left( f_i(x_i) + y^T A_i x_i - (1/N) y^T b \right),$$

which is also separable in $x$. This means that the $x$-minimization step (2.2) splits into $N$ separate problems that can be solved in parallel. Explicitly, the algorithm is

$$x_i^{k+1} := \operatorname*{argmin}_{x_i} L_i(x_i, y^k) \tag{2.4}$$

$$y^{k+1} := y^k + \alpha^k (Ax^{k+1} - b). \tag{2.5}$$

The $x$-minimization step (2.4) is carried out independently, in parallel, for each $i = 1, \ldots, N$. In this case, we refer to the dual ascent method as dual decomposition.

In the general case, each iteration of the dual decomposition method requires a broadcast and a gather operation. In the dual update step (2.5), the equality constraint residual contributions $A_i x_i^{k+1}$ are
collected (gathered) in order to compute the residual $Ax^{k+1} - b$. Once the (global) dual variable $y^{k+1}$ is computed, it must be distributed (broadcast) to the processors that carry out the $N$ individual $x_i$ minimization steps (2.4).

Dual decomposition is an old idea in optimization, and traces back at least to the early 1960s. Related ideas appear in well known work by Dantzig and Wolfe [44] and Benders [13] on large-scale linear programming, as well as in Dantzig's seminal book [43]. The general idea of dual decomposition appears to be originally due to Everett [69], and is explored in many early references [107, 84, 117, 14]. The use of nondifferentiable optimization, such as the subgradient method, to solve the dual problem is discussed by Shor [152]. Good references on dual methods and decomposition include the book by Bertsekas [16, chapter 6] and the survey by Nedić and Ozdaglar [131] on distributed optimization, which discusses dual decomposition methods and consensus problems. A number of papers also discuss variants on standard dual decomposition, such as [129].

More generally, decentralized optimization has been an active topic of research since the 1980s. For instance, Tsitsiklis and his co-authors worked on a number of decentralized detection and consensus problems involving the minimization of a smooth function $f$ known to multiple agents [160, 161, 17]. Some good reference books on parallel optimization include those by Bertsekas and Tsitsiklis [17] and Censor and Zenios [31]. There has also been some recent work on problems where each agent has its own convex, potentially nondifferentiable, objective function [130]. See [54] for a recent discussion of distributed methods for graph-structured optimization problems.

2.3 Augmented Lagrangians and the Method of Multipliers

Augmented Lagrangian methods were developed in part to bring robustness to the dual ascent method, and in particular, to yield convergence without assumptions like strict convexity or finiteness of $f$. The augmented Lagrangian for (2.1) is

$$L_\rho(x,y) = f(x) + y^T(Ax - b) + (\rho/2)\|Ax - b\|_2^2, \tag{2.6}$$
where $\rho > 0$ is called the penalty parameter. (Note that $L_0$ is the standard Lagrangian for the problem.) The augmented Lagrangian can be viewed as the (unaugmented) Lagrangian associated with the problem

$$\text{minimize } f(x) + (\rho/2)\|Ax - b\|_2^2 \quad \text{subject to } Ax = b.$$

This problem is clearly equivalent to the original problem (2.1), since for any feasible $x$ the term added to the objective is zero. The associated dual function is $g_\rho(y) = \inf_x L_\rho(x,y)$.

The benefit of including the penalty term is that $g_\rho$ can be shown to be differentiable under rather mild conditions on the original problem. The gradient of the augmented dual function is found the same way as with the ordinary Lagrangian, i.e., by minimizing over $x$, and then evaluating the resulting equality constraint residual. Applying dual ascent to the modified problem yields the algorithm

$$x^{k+1} := \operatorname*{argmin}_x L_\rho(x, y^k) \tag{2.7}$$

$$y^{k+1} := y^k + \rho(Ax^{k+1} - b), \tag{2.8}$$

which is known as the method of multipliers for solving (2.1). This is the same as standard dual ascent, except that the $x$-minimization step uses the augmented Lagrangian, and the penalty parameter $\rho$ is used as the step size $\alpha^k$. The method of multipliers converges under far more general conditions than dual ascent, including cases when $f$ takes on the value $+\infty$ or is not strictly convex.

It is easy to motivate the choice of the particular step size $\rho$ in the dual update (2.8). For simplicity, we assume here that $f$ is differentiable, though this is not required for the algorithm to work. The optimality conditions for (2.1) are primal and dual feasibility, i.e.,

$$Ax^\star - b = 0, \qquad \nabla f(x^\star) + A^T y^\star = 0,$$

respectively. By definition, $x^{k+1}$ minimizes $L_\rho(x, y^k)$, so

$$\begin{aligned} 0 &= \nabla_x L_\rho(x^{k+1}, y^k) \\ &= \nabla_x f(x^{k+1}) + A^T\left(y^k + \rho(Ax^{k+1} - b)\right) \\ &= \nabla_x f(x^{k+1}) + A^T y^{k+1}. \end{aligned}$$
We see that by using $\rho$ as the step size in the dual update, the iterate $(x^{k+1}, y^{k+1})$ is dual feasible. As the method of multipliers proceeds, the primal residual $Ax^{k+1} - b$ converges to zero, yielding optimality.

The greatly improved convergence properties of the method of multipliers over dual ascent come at a cost. When $f$ is separable, the augmented Lagrangian $L_\rho$ is not separable, so the $x$-minimization step (2.7) cannot be carried out separately in parallel for each $x_i$. This means that the basic method of multipliers cannot be used for decomposition. We will see how to address this issue next.

Augmented Lagrangians and the method of multipliers for constrained optimization were first proposed in the late 1960s by Hestenes [97, 98] and Powell [138]. Many of the early numerical experiments on the method of multipliers are due to Miele et al. [124, 125, 126]. Much of the early work is consolidated in a monograph by Bertsekas [15], who also discusses similarities to older approaches using Lagrangians and penalty functions [6, 5, 71], as well as a number of generalizations.
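To make the contrast between dual ascent (2.2)-(2.3) and the method of multipliers (2.7)-(2.8) concrete, the following is a minimal sketch on a toy problem that is not taken from the text: minimize $(1/2)\|x - c\|_2^2$ subject to a single constraint $a^T x = b$, so that both $x$-updates have closed forms. The function names, the data, and the values of `alpha` and `rho` are illustrative assumptions.

```python
# Toy problem (an assumption for illustration, not from the text):
#   minimize (1/2)||x - c||^2  subject to  a^T x = b,
# with one equality constraint, so the dual variable y is a scalar.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_ascent(c, a, b, alpha, iters):
    """Updates (2.2)-(2.3): minimize the Lagrangian over x, then step y."""
    y = 0.0
    x = list(c)
    for _ in range(iters):
        # x-update: the gradient of L(x, y) is x - c + y*a, set to zero.
        x = [ci - y * ai for ci, ai in zip(c, a)]
        # y-update: step of size alpha along the residual a^T x - b.
        y += alpha * (dot(a, x) - b)
    return x, y

def method_of_multipliers(c, a, b, rho, iters):
    """Updates (2.7)-(2.8): same loop with L_rho and step size rho."""
    y = 0.0
    x = list(c)
    n2 = dot(a, a)
    for _ in range(iters):
        # x-update: x - c + (y + rho*(t - b))*a = 0 with t = a^T x.
        # Taking the inner product with a gives t in closed form.
        t = (dot(a, c) - y * n2 + rho * n2 * b) / (1.0 + rho * n2)
        x = [ci - (y + rho * (t - b)) * ai for ci, ai in zip(c, a)]
        # y-update: the penalty parameter rho is the step size.
        y += rho * (dot(a, x) - b)
    return x, y

x_da, y_da = dual_ascent([3.0, 1.0], [1.0, 2.0], 1.0, alpha=0.1, iters=80)
x_mm, y_mm = method_of_multipliers([3.0, 1.0], [1.0, 2.0], 1.0, rho=10.0, iters=20)
```

In this particular example the dual ascent step size has to satisfy $\alpha < 2/\|a\|_2^2$ for the iteration to converge, whereas the method of multipliers loop contracts the dual error by a factor $1/(1 + \rho\|a\|_2^2)$ for any $\rho > 0$. This is a property of the toy problem and should not be read as a general rate.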
3 Alternating Direction Method of Multipliers

3.1 Algorithm

ADMM is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems in the form

$$\text{minimize } f(x) + g(z) \quad \text{subject to } Ax + Bz = c \tag{3.1}$$

with variables $x \in \mathbf{R}^n$ and $z \in \mathbf{R}^m$, where $A \in \mathbf{R}^{p \times n}$, $B \in \mathbf{R}^{p \times m}$, and $c \in \mathbf{R}^p$. We will assume that $f$ and $g$ are convex; more specific assumptions will be discussed in §3.2. The only difference from the general linear equality-constrained problem (2.1) is that the variable, called $x$ there, has been split into two parts, called $x$ and $z$ here, with the objective function separable across this splitting. The optimal value of the problem (3.1) will be denoted by

$$p^\star = \inf\{ f(x) + g(z) \mid Ax + Bz = c \}.$$

As in the method of multipliers, we form the augmented Lagrangian

$$L_\rho(x,z,y) = f(x) + g(z) + y^T(Ax + Bz - c) + (\rho/2)\|Ax + Bz - c\|_2^2.$$
ADMM consists of the iterations

$$x^{k+1} := \operatorname*{argmin}_x L_\rho(x, z^k, y^k) \tag{3.2}$$

$$z^{k+1} := \operatorname*{argmin}_z L_\rho(x^{k+1}, z, y^k) \tag{3.3}$$

$$y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c), \tag{3.4}$$

where $\rho > 0$. The algorithm is very similar to dual ascent and the method of multipliers: it consists of an $x$-minimization step (3.2), a $z$-minimization step (3.3), and a dual variable update (3.4). As in the method of multipliers, the dual variable update uses a step size equal to the augmented Lagrangian parameter $\rho$.

The method of multipliers for (3.1) has the form

$$(x^{k+1}, z^{k+1}) := \operatorname*{argmin}_{x,z} L_\rho(x, z, y^k)$$

$$y^{k+1} := y^k + \rho(Ax^{k+1} + Bz^{k+1} - c).$$

Here the augmented Lagrangian is minimized jointly with respect to the two primal variables. In ADMM, on the other hand, $x$ and $z$ are updated in an alternating or sequential fashion, which accounts for the term alternating direction. ADMM can be viewed as a version of the method of multipliers where a single Gauss-Seidel pass [90, §10.1] over $x$ and $z$ is used instead of the usual joint minimization. Separating the minimization over $x$ and $z$ into two steps is precisely what allows for decomposition when $f$ or $g$ are separable.

The algorithm state in ADMM consists of $z^k$ and $y^k$. In other words, $(z^{k+1}, y^{k+1})$ is a function of $(z^k, y^k)$. The variable $x^k$ is not part of the state; it is an intermediate result computed from the previous state $(z^{k-1}, y^{k-1})$.

If we switch (re-label) $x$ and $z$, $f$ and $g$, and $A$ and $B$ in the problem (3.1), we obtain a variation on ADMM with the order of the $x$-update step (3.2) and $z$-update step (3.3) reversed. The roles of $x$ and $z$ are almost symmetric, but not quite, since the dual update is done after the $z$-update but before the $x$-update.
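As an illustration of the iterations (3.2)-(3.4), here is a minimal sketch on an example chosen for this purpose rather than taken from the text: $f(x) = (1/2)\|x - b\|_2^2$, $g(z) = \lambda\|z\|_1$, and the constraint $x - z = 0$ (so $A = I$, $B = -I$, $c = 0$). Both minimizations then have elementwise closed forms. The function names, the data, and the default `rho` are assumptions.

```python
def soft_threshold(v, kappa):
    """Scalar minimizer of kappa*|z| + (1/2)*(z - v)**2."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0

def admm_l1_denoise(b, lam, rho=1.0, iters=100):
    """Unscaled ADMM for: minimize (1/2)||x - b||^2 + lam*||z||_1, x - z = 0."""
    n = len(b)
    z = [0.0] * n   # the state is (z, y); x is recomputed every iteration
    y = [0.0] * n
    x = [0.0] * n
    for _ in range(iters):
        # (3.2) x-update: x - b + y + rho*(x - z) = 0.
        x = [(b[i] - y[i] + rho * z[i]) / (1.0 + rho) for i in range(n)]
        # (3.3) z-update: completing the square leaves
        # lam*|z| + (rho/2)*(z - (x + y/rho))**2, solved by soft thresholding.
        z = [soft_threshold(x[i] + y[i] / rho, lam / rho) for i in range(n)]
        # (3.4) dual update with step size rho.
        y = [y[i] + rho * (x[i] - z[i]) for i in range(n)]
    return x, z, y

x, z, y = admm_l1_denoise([3.0, 0.5, -2.0], lam=1.0)
```

For this example the minimizer is known in closed form (soft thresholding of `b` at level `lam`), which makes it convenient for checking an implementation; the order of the two primal updates and the placement of the dual update follow (3.2)-(3.4) exactly.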
Scaled Form

ADMM can be written in a slightly different form, which is often more convenient, by combining the linear and quadratic terms in the augmented Lagrangian and scaling the dual variable. Defining the residual $r = Ax + Bz - c$, we have

$$\begin{aligned} y^T r + (\rho/2)\|r\|_2^2 &= (\rho/2)\|r + (1/\rho)y\|_2^2 - (1/2\rho)\|y\|_2^2 \\ &= (\rho/2)\|r + u\|_2^2 - (\rho/2)\|u\|_2^2, \end{aligned}$$

where $u = (1/\rho)y$ is the scaled dual variable. Using the scaled dual variable, we can express ADMM as

$$x^{k+1} := \operatorname*{argmin}_x \left( f(x) + (\rho/2)\|Ax + Bz^k - c + u^k\|_2^2 \right) \tag{3.5}$$

$$z^{k+1} := \operatorname*{argmin}_z \left( g(z) + (\rho/2)\|Ax^{k+1} + Bz - c + u^k\|_2^2 \right) \tag{3.6}$$

$$u^{k+1} := u^k + Ax^{k+1} + Bz^{k+1} - c. \tag{3.7}$$

Defining the residual at iteration $k$ as $r^k = Ax^k + Bz^k - c$, we see that

$$u^k = u^0 + \sum_{j=1}^k r^j,$$

the running sum of the residuals.

We call the first form of ADMM above, given by (3.2)-(3.4), the unscaled form, and the second form (3.5)-(3.7) the scaled form, since it is expressed in terms of a scaled version of the dual variable. The two are clearly equivalent, but the formulas in the scaled form of ADMM are often shorter than in the unscaled form, so we will use the scaled form in the sequel. We will use the unscaled form when we wish to emphasize the role of the dual variable or to give an interpretation that relies on the (unscaled) dual variable.

3.2 Convergence

There are many convergence results for ADMM discussed in the literature; here, we limit ourselves to a basic but still very general result that applies to all of the examples we will consider. We will make one
assumption about the functions $f$ and $g$, and one assumption about problem (3.1).

Assumption 1. The (extended-real-valued) functions $f : \mathbf{R}^n \to \mathbf{R} \cup \{+\infty\}$ and $g : \mathbf{R}^m \to \mathbf{R} \cup \{+\infty\}$ are closed, proper, and convex.

This assumption can be expressed compactly using the epigraphs of the functions: The function $f$ satisfies assumption 1 if and only if its epigraph

$$\operatorname{epi} f = \{ (x,t) \in \mathbf{R}^n \times \mathbf{R} \mid f(x) \le t \}$$

is a closed nonempty convex set.

Assumption 1 implies that the subproblems arising in the $x$-update (3.2) and $z$-update (3.3) are solvable, i.e., there exist $x$ and $z$, not necessarily unique (without further assumptions on $A$ and $B$), that minimize the augmented Lagrangian. It is important to note that assumption 1 allows $f$ and $g$ to be nondifferentiable and to assume the value $+\infty$. For example, we can take $f$ to be the indicator function of a closed nonempty convex set $C$, i.e., $f(x) = 0$ for $x \in C$ and $f(x) = +\infty$ otherwise. In this case, the $x$-minimization step (3.2) will involve solving a constrained quadratic program over $C$, the effective domain of $f$.

Assumption 2. The unaugmented Lagrangian $L_0$ has a saddle point.

Explicitly, there exist $(x^\star, z^\star, y^\star)$, not necessarily unique, for which

$$L_0(x^\star, z^\star, y) \le L_0(x^\star, z^\star, y^\star) \le L_0(x, z, y^\star)$$

holds for all $x$, $z$, $y$.

By assumption 1, it follows that $L_0(x^\star, z^\star, y^\star)$ is finite for any saddle point $(x^\star, z^\star, y^\star)$. This implies that $(x^\star, z^\star)$ is a solution to (3.1), so $Ax^\star + Bz^\star = c$ and $f(x^\star) < \infty$, $g(z^\star) < \infty$. It also implies that $y^\star$ is dual optimal, and the optimal values of the primal and dual problems are equal, i.e., that strong duality holds. Note that we make no assumptions about $A$, $B$, or $c$, except implicitly through assumption 2; in particular, neither $A$ nor $B$ is required to be full rank.
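The indicator-function case mentioned under assumption 1 can be sketched with scaled-form updates of the shape (3.5)-(3.7). The example below is an assumption made for illustration: $f$ is the indicator of the box $C = [lo, hi]^n$, $g(z) = (1/2)\|z - b\|_2^2$, and the constraint is $x - z = 0$. With $A = I$, the constrained quadratic program in the $x$-update reduces to a Euclidean projection onto $C$. Function names and data are hypothetical.

```python
def project_box(v, lo, hi):
    """Euclidean projection of v onto the box [lo, hi]^n."""
    return [min(max(vi, lo), hi) for vi in v]

def admm_box_scaled(b, lo, hi, rho=1.0, iters=100):
    """Scaled ADMM for: minimize I_C(x) + (1/2)||z - b||^2, x - z = 0."""
    n = len(b)
    z = [0.0] * n
    u = [0.0] * n   # scaled dual variable u = y / rho
    x = [0.0] * n
    for _ in range(iters):
        # x-update: minimize I_C(x) + (rho/2)||x - z + u||^2,
        # i.e. project z - u onto C.
        x = project_box([z[i] - u[i] for i in range(n)], lo, hi)
        # z-update: (z - b) - rho*(x - z + u) = 0.
        z = [(b[i] + rho * (x[i] + u[i])) / (1.0 + rho) for i in range(n)]
        # u-update: add the current residual r = x - z.
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return x, z, u

x, z, u = admm_box_scaled([2.0, 0.3, -5.0], lo=-1.0, hi=1.0)
```

Here $f$ is neither differentiable nor finite everywhere, yet each update is a one-line closed form. The limit is the projection of `b` onto the box, which gives a simple way to sanity-check the loop.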
Convergence

Under assumptions 1 and 2, the ADMM iterates satisfy the following:

Residual convergence. $r^k \to 0$ as $k \to \infty$, i.e., the iterates approach feasibility.

Objective convergence. $f(x^k) + g(z^k) \to p^\star$ as $k \to \infty$, i.e., the objective function of the iterates approaches the optimal value.

Dual variable convergence. $y^k \to y^\star$ as $k \to \infty$, where $y^\star$ is a dual optimal point.

A proof of the residual and objective convergence results is given in appendix A. Note that $x^k$ and $z^k$ need not converge to optimal values, although such results can be shown under additional assumptions.

Convergence in Practice

Simple examples show that ADMM can be very slow to converge to high accuracy. However, it is often the case that ADMM converges to modest accuracy, sufficient for many applications, within a few tens of iterations. This behavior makes ADMM similar to algorithms like the conjugate gradient method, for example, in that a few tens of iterations will often produce acceptable results of practical use. However, the slow convergence of ADMM also distinguishes it from algorithms such as Newton's method (or, for constrained problems, interior-point methods), where high accuracy can be attained in a reasonable amount of time. While in some cases it is possible to combine ADMM with a method for producing a high accuracy solution from a low accuracy solution [64], in the general case ADMM will be practically useful mostly in cases when modest accuracy is sufficient. Fortunately, this is usually the case for the kinds of large-scale problems we consider. Also, in the case of statistical and machine learning problems, solving a parameter estimation problem to very high accuracy often yields little to no improvement in actual prediction performance, the real metric of interest in applications.
3.3 Optimality Conditions and Stopping Criterion

The necessary and sufficient optimality conditions for the ADMM problem (3.1) are primal feasibility,

$$Ax^\star + Bz^\star - c = 0, \tag{3.8}$$

and dual feasibility,

$$0 \in \partial f(x^\star) + A^T y^\star \tag{3.9}$$

$$0 \in \partial g(z^\star) + B^T y^\star. \tag{3.10}$$

Here, $\partial$ denotes the subdifferential operator; see, e.g., [140, 19, 99]. (When $f$ and $g$ are differentiable, the subdifferentials $\partial f$ and $\partial g$ can be replaced by the gradients $\nabla f$ and $\nabla g$, and $\in$ can be replaced by $=$.)

Since $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$ by definition, we have that

$$\begin{aligned} 0 &\in \partial g(z^{k+1}) + B^T y^k + \rho B^T(Ax^{k+1} + Bz^{k+1} - c) \\ &= \partial g(z^{k+1}) + B^T y^k + \rho B^T r^{k+1} \\ &= \partial g(z^{k+1}) + B^T y^{k+1}. \end{aligned}$$

This means that $z^{k+1}$ and $y^{k+1}$ always satisfy (3.10), so attaining optimality comes down to satisfying (3.8) and (3.9). This phenomenon is analogous to the iterates of the method of multipliers always being dual feasible; see §2.3.

Since $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$ by definition, we have that

$$\begin{aligned} 0 &\in \partial f(x^{k+1}) + A^T y^k + \rho A^T(Ax^{k+1} + Bz^k - c) \\ &= \partial f(x^{k+1}) + A^T\left(y^k + \rho r^{k+1} + \rho B(z^k - z^{k+1})\right) \\ &= \partial f(x^{k+1}) + A^T y^{k+1} + \rho A^T B(z^k - z^{k+1}), \end{aligned}$$

or equivalently,

$$\rho A^T B(z^{k+1} - z^k) \in \partial f(x^{k+1}) + A^T y^{k+1}.$$

This means that the quantity

$$s^{k+1} = \rho A^T B(z^{k+1} - z^k)$$

can be viewed as a residual for the dual feasibility condition (3.9). We will refer to $s^{k+1}$ as the dual residual at iteration $k+1$, and to $r^{k+1} = Ax^{k+1} + Bz^{k+1} - c$ as the primal residual at iteration $k+1$.
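The two facts derived above, that (3.10) holds at every iterate and that $s^{k+1}$ is exactly the amount by which (3.9) fails, can be checked numerically. The sketch below uses a smooth toy problem chosen as an assumption: $f(x) = (1/2)\|x - a\|_2^2$, $g(z) = (1/2)\|z - b\|_2^2$, $x - z = 0$, so that $A^T B = -I$ and the subdifferentials are gradients. The helper names and data are hypothetical.

```python
import math

def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def admm_residual_trace(a, b, rho=1.0, iters=60):
    """Unscaled ADMM for (1/2)||x-a||^2 + (1/2)||z-b||^2 subject to x - z = 0.

    Returns the last iterates and, per iteration, the tuple
    (||r||_2, ||s||_2, violation of the (3.9) identity, violation of (3.10)).
    """
    n = len(a)
    z = [0.0] * n
    y = [0.0] * n
    x = [0.0] * n
    trace = []
    for _ in range(iters):
        z_old = z
        # x-update: (x - a) + y + rho*(x - z_old) = 0.
        x = [(a[i] - y[i] + rho * z_old[i]) / (1.0 + rho) for i in range(n)]
        # z-update: (z - b) - y - rho*(x - z) = 0.
        z = [(b[i] + y[i] + rho * x[i]) / (1.0 + rho) for i in range(n)]
        # dual update.
        y = [y[i] + rho * (x[i] - z[i]) for i in range(n)]
        # primal residual r = Ax + Bz - c = x - z.
        r = [x[i] - z[i] for i in range(n)]
        # dual residual s = rho * A^T B (z_new - z_old) with A^T B = -I.
        s = [-rho * (z[i] - z_old[i]) for i in range(n)]
        # grad f(x) + A^T y should equal s; grad g(z) + B^T y should be 0.
        gap_39 = max(abs((x[i] - a[i]) + y[i] - s[i]) for i in range(n))
        gap_310 = max(abs((z[i] - b[i]) - y[i]) for i in range(n))
        trace.append((norm2(r), norm2(s), gap_39, gap_310))
    return x, z, y, trace

x, z, y, trace = admm_residual_trace([1.0, -3.0], [3.0, 5.0])
```

In exact arithmetic both gap columns are zero at every iteration, while the first two columns (the primal and dual residual norms) decrease toward zero as the iterates approach the solution $x = z = (a + b)/2$ of this toy problem.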
In summary, the optimality conditions for the ADMM problem consist of three conditions, (3.8)-(3.10). The last condition (3.10) always holds for $(x^{k+1}, z^{k+1}, y^{k+1})$; the residuals for the other two, (3.8) and (3.9), are the primal and dual residuals $r^{k+1}$ and $s^{k+1}$, respectively. These two residuals converge to zero as ADMM proceeds. (In fact, the convergence proof in appendix A shows $B(z^{k+1} - z^k)$ converges to zero, which implies $s^k$ converges to zero.)

Stopping Criteria

The residuals of the optimality conditions can be related to a bound on the objective suboptimality of the current point, i.e., $f(x^k) + g(z^k) - p^\star$. As shown in the convergence proof in appendix A, we have

$$f(x^k) + g(z^k) - p^\star \le -(y^k)^T r^k + (x^k - x^\star)^T s^k. \tag{3.11}$$

This shows that when the residuals $r^k$ and $s^k$ are small, the objective suboptimality also must be small. We cannot use this inequality directly in a stopping criterion, however, since we do not know $x^\star$. But if we guess or estimate that $\|x^k - x^\star\|_2 \le d$, we have that

$$f(x^k) + g(z^k) - p^\star \le -(y^k)^T r^k + d\|s^k\|_2 \le \|y^k\|_2 \|r^k\|_2 + d\|s^k\|_2.$$

The middle or righthand terms can be used as an approximate bound on the objective suboptimality (which depends on our guess of $d$).

This suggests that a reasonable termination criterion is that the primal and dual residuals must be small, i.e.,

$$\|r^k\|_2 \le \epsilon^{\text{pri}} \quad \text{and} \quad \|s^k\|_2 \le \epsilon^{\text{dual}}, \tag{3.12}$$

where $\epsilon^{\text{pri}} > 0$ and $\epsilon^{\text{dual}} > 0$ are feasibility tolerances for the primal and dual feasibility conditions (3.8) and (3.9), respectively. These tolerances can be chosen using an absolute and relative criterion, such as

$$\epsilon^{\text{pri}} = \sqrt{p}\,\epsilon^{\text{abs}} + \epsilon^{\text{rel}} \max\{\|Ax^k\|_2, \|Bz^k\|_2, \|c\|_2\},$$

$$\epsilon^{\text{dual}} = \sqrt{n}\,\epsilon^{\text{abs}} + \epsilon^{\text{rel}} \|A^T y^k\|_2,$$

where $\epsilon^{\text{abs}} > 0$ is an absolute tolerance and $\epsilon^{\text{rel}} > 0$ is a relative tolerance. (The factors $\sqrt{p}$ and $\sqrt{n}$ account for the fact that the $\ell_2$ norms are in $\mathbf{R}^p$ and $\mathbf{R}^n$, respectively.) A reasonable value for the relative stopping
criterion might be $\epsilon^{\text{rel}} = 10^{-3}$ or $10^{-4}$, depending on the application. The choice of absolute stopping criterion depends on the scale of the typical variable values.

3.4 Extensions and Variations

Many variations on the classic ADMM algorithm have been explored in the literature. Here we briefly survey some of these variants, organized into groups of related ideas. Some of these methods can give superior convergence in practice compared to the standard ADMM presented above. Most of the extensions have been rigorously analyzed, so the convergence results described above are still valid (in some cases, under some additional conditions).

Varying Penalty Parameter

A standard extension is to use possibly different penalty parameters $\rho^k$ for each iteration, with the goal of improving the convergence in practice, as well as making performance less dependent on the initial choice of the penalty parameter. In the context of the method of multipliers, this approach is analyzed in [142], where it is shown that superlinear convergence may be achieved if $\rho^k \to \infty$. Though it can be difficult to prove the convergence of ADMM when $\rho$ varies by iteration, the fixed-$\rho$ theory still applies if one just assumes that $\rho$ becomes fixed after a finite number of iterations.

A simple scheme that often works well is (see, e.g., [96, 169]):

$$\rho^{k+1} := \begin{cases} \tau^{\text{incr}} \rho^k & \text{if } \|r^k\|_2 > \mu \|s^k\|_2 \\ \rho^k / \tau^{\text{decr}} & \text{if } \|s^k\|_2 > \mu \|r^k\|_2 \\ \rho^k & \text{otherwise,} \end{cases} \tag{3.13}$$

where $\mu > 1$, $\tau^{\text{incr}} > 1$, and $\tau^{\text{decr}} > 1$ are parameters. Typical choices might be $\mu = 10$ and $\tau^{\text{incr}} = \tau^{\text{decr}} = 2$. The idea behind this penalty parameter update is to try to keep the primal and dual residual norms within a factor of $\mu$ of one another as they both converge to zero.

The ADMM update equations suggest that large values of $\rho$ place a large penalty on violations of primal feasibility and so tend to produce
Person Redentfcaton by Probablstc Relatve Dstance Comparson WeSh Zheng 1,2, Shaogang Gong 2, and Tao Xang 2 1 School of Informaton Scence and Technology, Sun Yatsen Unversty, Chna 2 School of Electronc
More informationAsRigidAsPossible Shape Manipulation
AsRgdAsPossble Shape Manpulaton akeo Igarash 1, 3 omer Moscovch John F. Hughes 1 he Unversty of okyo Brown Unversty 3 PRESO, JS Abstract We present an nteractve system that lets a user move and deform
More informationDo Firms Maximize? Evidence from Professional Football
Do Frms Maxmze? Evdence from Professonal Football Davd Romer Unversty of Calforna, Berkeley and Natonal Bureau of Economc Research Ths paper examnes a sngle, narrow decson the choce on fourth down n the
More informationWho are you with and Where are you going?
Who are you wth and Where are you gong? Kota Yamaguch Alexander C. Berg Lus E. Ortz Tamara L. Berg Stony Brook Unversty Stony Brook Unversty, NY 11794, USA {kyamagu, aberg, leortz, tlberg}@cs.stonybrook.edu
More informationA Structure for General and Specc Market Rsk Eckhard Platen 1 and Gerhard Stahl Summary. The paper presents a consstent approach to the modelng of general and specc market rsk as dened n regulatory documents.
More informationJournal of International Economics
Journal of Internatonal Economcs 79 (009) 31 41 Contents lsts avalable at ScenceDrect Journal of Internatonal Economcs journal homepage: www.elsever.com/locate/je Composton and growth effects of the current
More informationThe Relationship between Exchange Rates and Stock Prices: Studied in a Multivariate Model Desislava Dimitrova, The College of Wooster
Issues n Poltcal Economy, Vol. 4, August 005 The Relatonshp between Exchange Rates and Stock Prces: Studed n a Multvarate Model Desslava Dmtrova, The College of Wooster In the perod November 00 to February
More informationStable Distributions, Pseudorandom Generators, Embeddings, and Data Stream Computation
Stable Dstrbutons, Pseudorandom Generators, Embeddngs, and Data Stream Computaton PIOTR INDYK MIT, Cambrdge, Massachusetts Abstract. In ths artcle, we show several results obtaned by combnng the use of
More informationAssessing health efficiency across countries with a twostep and bootstrap analysis *
Assessng health effcency across countres wth a twostep and bootstrap analyss * Antóno Afonso # $ and Mguel St. Aubyn # February 2007 Abstract We estmate a semparametrc model of health producton process
More informationFace Alignment through Subspace Constrained MeanShifts
Face Algnment through Subspace Constraned MeanShfts Jason M. Saragh, Smon Lucey, Jeffrey F. Cohn The Robotcs Insttute, Carnege Mellon Unversty Pttsburgh, PA 15213, USA {jsaragh,slucey,jeffcohn}@cs.cmu.edu
More informationSVO: Fast SemiDirect Monocular Visual Odometry
SVO: Fast SemDrect Monocular Vsual Odometry Chrstan Forster, Mata Pzzol, Davde Scaramuzza Abstract We propose a semdrect monocular vsual odometry algorthm that s precse, robust, and faster than current
More informationAsRigidAsPossible Image Registration for Handdrawn Cartoon Animations
AsRgdAsPossble Image Regstraton for Handdrawn Cartoon Anmatons Danel Sýkora Trnty College Dubln John Dnglana Trnty College Dubln Steven Collns Trnty College Dubln source target our approach [Papenberg
More informationDP5: A Private Presence Service
DP5: A Prvate Presence Servce Nkta Borsov Unversty of Illnos at UrbanaChampagn, Unted States nkta@llnos.edu George Danezs Unversty College London, Unted Kngdom g.danezs@ucl.ac.uk Ian Goldberg Unversty
More informationEffect of a spectrum of relaxation times on the capillary thinning of a filament of elastic liquid
J. NonNewtonan Flud Mech., 72 (1997) 31 53 Effect of a spectrum of relaxaton tmes on the capllary thnnng of a flament of elastc lqud V.M. Entov a, E.J. Hnch b, * a Laboratory of Appled Contnuum Mechancs,
More informationSupport vector domain description
Pattern Recognton Letters 20 (1999) 1191±1199 www.elsever.nl/locate/patrec Support vector doman descrpton Davd M.J. Tax *,1, Robert P.W. Dun Pattern Recognton Group, Faculty of Appled Scence, Delft Unversty
More informationAN EFFECTIVE MATRIX GEOMETRIC MEAN SATISFYING THE ANDO LI MATHIAS PROPERTIES
MATHEMATICS OF COMPUTATION Volume, Number, Pages S 5578(XX) AN EFFECTIVE MATRIX GEOMETRIC MEAN SATISFYING THE ANDO LI MATHIAS PROPERTIES DARIO A. BINI, BEATRICE MEINI AND FEDERICO POLONI Abstract. We
More informationFrom Computing with Numbers to Computing with Words From Manipulation of Measurements to Manipulation of Perceptions
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 45, NO. 1, JANUARY 1999 105 From Computng wth Numbers to Computng wth Words From Manpulaton of Measurements to Manpulaton
More information