HIGH-DIMENSIONAL REGRESSION WITH NOISY AND MISSING DATA: PROVABLE GUARANTEES WITH NONCONVEXITY
The Annals of Statistics 2012, Vol. 40, No. 3, DOI: 10.1214/12-AOS1018
© Institute of Mathematical Statistics, 2012

HIGH-DIMENSIONAL REGRESSION WITH NOISY AND MISSING DATA: PROVABLE GUARANTEES WITH NONCONVEXITY

BY PO-LING LOH 1,2 AND MARTIN J. WAINWRIGHT 2

University of California, Berkeley

Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependence, as well. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently nonconvex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing nonconvex programs, we are able to both analyze the statistical error associated with any global optimum, and more surprisingly, to prove that a simple algorithm based on projected gradient descent will converge in polynomial time to a small neighborhood of the set of all global minimizers. On the statistical side, we provide nonasymptotic bounds that hold with high probability for the cases of noisy, missing and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing close agreement with the predicted scalings.

1. Introduction. In standard formulations of prediction problems, it is assumed that the covariates are fully-observed and sampled independently from some underlying distribution. However, these assumptions are not realistic for many applications, in which covariates may be observed only partially, observed subject to corruption, or exhibit some type of dependency.
Consider the problem of modeling the voting behavior of politicians: in this setting, votes may be missing due to abstentions, and temporally dependent due to collusion or "tit-for-tat" behavior. Similarly, surveys often suffer from the missing data problem, since users fail to respond to all questions. Sensor network data also tends to be both noisy due to measurement error, and partially missing due to failures or drop-outs of sensors.

Received September 2011; revised May 2012.
1 Supported in part by a Hertz Foundation Fellowship and the Department of Defense (DoD) through a NDSEG Fellowship.
2 Supported in part by NSF Grant DMS and Air Force Office of Scientific Research Grant AFOSR-09NL184.
MSC2010 subject classifications. Primary 62F12; secondary 68W25.
Key words and phrases. High-dimensional statistics, missing data, nonconvexity, regularization, sparse linear regression, M-estimation.
There are a variety of methods for dealing with noisy and/or missing data, including various heuristic methods, as well as likelihood-based methods involving the expectation-maximization (EM) algorithm (e.g., see the book [8] and references therein). A challenge in this context is the possible nonconvexity of associated optimization problems. For instance, in applications of EM, problems in which the negative likelihood is a convex function often become nonconvex with missing or noisy data. Consequently, although the EM algorithm will converge to a local minimum, it is difficult to guarantee that the local optimum is close to a global minimum.

In this paper, we study these issues in the context of high-dimensional sparse linear regression, in particular in the case when the predictors or covariates are noisy, missing, and/or dependent. Our main contribution is to develop and study simple methods for handling these issues, and to prove theoretical results about both the associated statistical error and the optimization error. Like EM-based approaches, our estimators are based on solving optimization problems that may be nonconvex; however, despite this nonconvexity, we are still able to prove that a simple form of projected gradient descent will produce an output that is sufficiently close (as small as the statistical error) to any global optimum. As a second result, we bound the statistical error, showing that it has the same scaling as the minimax rates for the classical cases of perfectly observed and independently sampled covariates. In this way, we obtain estimators for noisy, missing, and/or dependent data that have the same scaling behavior as the usual fully-observed and independent case. The resulting estimators allow us to solve the problem of high-dimensional Gaussian graphical model selection with missing data.

There is a large body of work on the problem of corrupted covariates or errors-in-variables for regression problems (e.g., see the papers and books [3, 6, 7, 21], as well as references therein).
Much of the earlier theoretical work is classical in nature, meaning that it requires that the sample size n diverges with the dimension p fixed. Most relevant to this paper is more recent work that has examined issues of corrupted and/or missing data in the context of high-dimensional sparse linear models, allowing for p ≫ n. Städler and Bühlmann [18] developed an EM-based method for sparse inverse covariance matrix estimation in the missing data regime, and used this result to derive an algorithm for sparse linear regression with missing data. As mentioned above, however, it is difficult to guarantee that EM will converge to a point close to a global optimum of the likelihood, in contrast to the methods studied here. Rosenbaum and Tsybakov [14] studied the sparse linear model when the covariates are corrupted by noise, and proposed a modified form of the Dantzig selector (see the discussion following our main results for a detailed comparison to this past work, and also to concurrent work [15] by the same authors). For the particular case of multiplicative noise, the type of estimator that we consider here has been studied in past work [21]; however, this theoretical analysis is of the classical type, holding only for p ≪ n, in contrast to the high-dimensional models that are of interest here.
The remainder of this paper is organized as follows. We begin in Section 2 with background and a precise description of the problem. We then introduce the class of estimators we will consider and the form of the projected gradient descent algorithm. Section 3 is devoted to a description of our main results, including a pair of general theorems on the statistical and optimization error, and then a series of corollaries applying our results to the cases of noisy, missing, and dependent data. In Section 4, we present simulations to confirm that our methods work in practice, and verify the theoretically-predicted scaling laws. Section 5 contains proofs of some of the main results, with the remaining proofs contained in the supplementary Appendix [9].

NOTATION. For a matrix M, we write ‖M‖_max := max_{i,j} |m_ij| to denote the elementwise ℓ∞-norm of M. Furthermore, |||M|||_1 denotes the induced ℓ1-operator norm (maximum absolute column sum) of M, and |||M|||_op is the spectral norm of M. We write κ(M) := λ_max(M)/λ_min(M) for the condition number of M. For matrices M_1 and M_2, we write M_1 ⊙ M_2 to denote the componentwise Hadamard product, and M_1 ÷ M_2 to denote componentwise division. For functions f(n) and g(n), we write f(n) ≲ g(n) to mean that f(n) ≤ c g(n) for a universal constant c ∈ (0, ∞), and similarly, f(n) ≳ g(n) when f(n) ≥ c′ g(n) for some universal constant c′ ∈ (0, ∞). Finally, we write f(n) ≍ g(n) when f(n) ≲ g(n) and f(n) ≳ g(n) hold simultaneously.

2. Background and problem setup. In this section, we provide background and a precise description of the problem, and then motivate the class of estimators analyzed in this paper. We then discuss a simple class of projected gradient descent algorithms that can be used to obtain an estimator.

2.1. Observation model and high-dimensional framework. Suppose we observe a response variable y_i ∈ ℝ linked to a covariate vector x_i ∈ ℝ^p via the linear model

(2.1)  y_i = ⟨x_i, β*⟩ + ε_i  for i = 1, 2, ..., n.

Here, the regression vector β* ∈ ℝ^p is unknown, and ε_i ∈ ℝ is observation noise, independent of x_i.
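As a concrete instance of the linear model (2.1), the following sketch generates a k-sparse regression vector and i.i.d. observations; the dimensions, sparsity level, and noise scale are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5               # sample size, ambient dimension, sparsity

beta_star = np.zeros(p)            # k-sparse regression vector beta*
beta_star[:k] = 1.0

X = rng.normal(size=(n, p))        # covariate vectors x_i as rows (here N(0, I))
eps = 0.1 * rng.normal(size=n)     # observation noise, independent of x_i
y = X @ beta_star + eps            # y_i = <x_i, beta*> + eps_i, as in (2.1)
```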
Rather than directly observing each x_i ∈ ℝ^p, we observe a vector z_i ∈ ℝ^p linked to x_i via some conditional distribution, that is,

(2.2)  z_i ~ Q(· | x_i)  for i = 1, 2, ..., n.

This setup applies to various disturbances to the covariates, including:

(a) Covariates with additive noise: We observe z_i = x_i + w_i, where w_i ∈ ℝ^p is a random vector independent of x_i, say zero-mean with known covariance matrix Σ_w.
(b) Missing data: For some fraction ρ ∈ [0, 1), we observe a random vector z_i ∈ ℝ^p such that for each component j, we independently observe z_ij = x_ij with probability 1 − ρ, and z_ij = ∗ (missing) with probability ρ. We can also consider the case when the entries in the jth column have a different probability ρ_j of being missing.

(c) Covariates with multiplicative noise: Generalizing the missing data problem, suppose we observe z_i = x_i ⊙ u_i, where u_i ∈ ℝ^p is again a random vector independent of x_i, and ⊙ is the Hadamard product. The problem of missing data is a special case of multiplicative noise, where all the u_ij's are independent and u_ij ~ Bernoulli(1 − ρ_j).

Our first set of results is deterministic, depending on specific instantiations of the observations {(y_i, z_i)}_{i=1}^n. However, we are also interested in results that hold with high probability when the x_i's and z_i's are drawn at random. We consider both the case when the x_i's are drawn i.i.d. from a fixed distribution, and the case of dependent covariates, when the x_i's are generated according to a stationary vector autoregressive (VAR) process. We work within a high-dimensional framework that allows the number of predictors p to grow and possibly exceed the sample size n. Of course, consistent estimation when p ≫ n is impossible unless the model is endowed with additional structure, for instance, sparsity in the parameter vector β*. Consequently, we study the class of models where β* has at most k nonzero parameters, where k is also allowed to increase to infinity with p and n.

2.2. M-estimators for noisy and missing covariates. In order to motivate the class of estimators we will consider, let us begin by examining a simple deterministic problem. Let Σ_x ≻ 0 be the covariance matrix of the covariates, and consider the ℓ1-constrained quadratic program

(2.3)  β̂ ∈ arg min_{‖β‖₁ ≤ R} { (1/2) βᵀ Σ_x β − ⟨Σ_x β*, β⟩ }.

As long as the constraint radius R is at least ‖β*‖₁, the unique solution to this convex program is β̂ = β*.
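To see why β̂ = β* in the idealized program (2.3), note that the objective has gradient Σ_x β − Σ_x β*, which vanishes only at β = β* when Σ_x ≻ 0, and the ℓ₁-constraint is inactive there once R ≥ ‖β*‖₁. A minimal numerical check of this stationarity condition, with an arbitrary positive-definite Σ_x of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
beta_star = np.zeros(p)
beta_star[:3] = [1.0, -2.0, 0.5]

# An arbitrary positive-definite covariance Sigma_x (illustrative choice).
A = rng.normal(size=(p, p))
Sigma_x = A @ A.T / p + np.eye(p)

# Stationary point of 0.5 * b' Sigma_x b - <Sigma_x beta*, b>:
# solving Sigma_x b = Sigma_x beta* recovers beta* exactly.
b_hat = np.linalg.solve(Sigma_x, Sigma_x @ beta_star)
```

Solving Σ_x β = Σ_x β* recovers β* to machine precision, matching the claim that β* is the unique minimizer.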
Of course, this program is an idealization, since in practice we may not know the covariance matrix Σ_x, and we certainly do not know Σ_x β*; after all, β* is the quantity we are trying to estimate! Nonetheless, this idealization still provides useful intuition, as it suggests various estimators based on the plug-in principle. Given a set of samples, it is natural to form estimates of the quantities Σ_x and Σ_x β*, which we denote by Γ̂ ∈ ℝ^{p×p} and γ̂ ∈ ℝ^p, respectively, and to consider the modified program

(2.4)  β̂ ∈ arg min_{‖β‖₁ ≤ R} { (1/2) βᵀ Γ̂ β − ⟨γ̂, β⟩ },

or alternatively, the regularized version

(2.5)  β̂ ∈ arg min_{β ∈ ℝ^p} { (1/2) βᵀ Γ̂ β − ⟨γ̂, β⟩ + λ_n ‖β‖₁ },

where λ_n > 0 is a user-defined regularization parameter. Note that the two problems are equivalent by Lagrangian duality when the objectives are convex, but not in the case of a nonconvex objective. The Lasso [4, 19] is a special case of these programs, obtained by setting

(2.6)  Γ̂_Las := (1/n) Xᵀ X  and  γ̂_Las := (1/n) Xᵀ y,

where we have introduced the shorthand y = (y₁, ..., y_n)ᵀ ∈ ℝⁿ, and X ∈ ℝ^{n×p} with x_iᵀ as its ith row. A simple calculation shows that (Γ̂_Las, γ̂_Las) are unbiased estimators of the pair (Σ_x, Σ_x β*). This unbiasedness and additional concentration inequalities (to be described in the sequel) underlie the well-known analysis of the Lasso in the high-dimensional regime.

In this paper, we focus on more general instantiations of the programs (2.4) and (2.5), involving different choices of the pair (Γ̂, γ̂) that are adapted to the cases of noisy and/or missing data. Note that the matrix Γ̂_Las is positive semidefinite, so the Lasso program is convex. In sharp contrast, for the case of noisy or missing data, the most natural choice of the matrix Γ̂ is not positive semidefinite, hence the quadratic losses appearing in the problems (2.4) and (2.5) are nonconvex. Furthermore, when Γ̂ has negative eigenvalues, the objective in equation (2.5) is unbounded from below. Hence, we make use of the following regularized estimator:

(2.7)  β̂ ∈ arg min_{‖β‖₁ ≤ b₀√k} { (1/2) βᵀ Γ̂ β − ⟨γ̂, β⟩ + λ_n ‖β‖₁ }

for a suitable constant b₀. In the presence of nonconvexity, it is generally impossible to provide a polynomial-time algorithm that converges to a (near) global optimum, due to the presence of local minima. Remarkably, we are able to prove that this issue is not significant in our setting, and a simple projected gradient descent algorithm applied to the programs (2.4) or (2.7) converges with high probability to a vector extremely close to any global optimum. Let us illustrate these ideas with some examples. Recall that (Γ̂, γ̂) serve as unbiased estimators for (Σ_x, Σ_x β*).

EXAMPLE 1 (Additive noise).
Suppose we observe Z = X + W, where W is a random matrix independent of X, with rows w_i drawn i.i.d. from a zero-mean distribution with known covariance Σ_w. We consider the pair

(2.8)  Γ̂_add := (1/n) Zᵀ Z − Σ_w  and  γ̂_add := (1/n) Zᵀ y.

Note that when Σ_w = 0 (corresponding to the noiseless case), the estimators reduce to the standard Lasso. However, when Σ_w ≠ 0, the matrix Γ̂_add is not positive semidefinite in the high-dimensional regime (n ≪ p). Indeed, since the matrix (1/n) Zᵀ Z has rank at most n, the subtracted matrix Σ_w may cause Γ̂_add to have a large number of negative eigenvalues. For instance, if Σ_w = σ_w² I with σ_w² > 0, then Γ̂_add has at least p − n eigenvalues equal to −σ_w².

EXAMPLE 2 (Missing data). We now consider the case where the entries of X are missing at random. Let us first describe an estimator for the special case where each entry is missing at random, independently with some constant probability ρ ∈ [0, 1). (In Example 3 to follow, we will describe the extension to general missing probabilities.) Consequently, we observe the matrix Z ∈ ℝ^{n×p} with entries

Z_ij = { X_ij, with probability 1 − ρ,
       { 0, otherwise.

Given the observed matrix Z ∈ ℝ^{n×p}, we use

(2.9)  Γ̂_mis := (1/n) Z̃ᵀ Z̃ − ρ · diag((1/n) Z̃ᵀ Z̃)  and  γ̂_mis := (1/n) Z̃ᵀ y,

where Z̃_ij = Z_ij / (1 − ρ). It is easy to see that the pair (Γ̂_mis, γ̂_mis) reduces to the pair (Γ̂_Las, γ̂_Las) for the standard Lasso when ρ = 0, corresponding to no missing data. In the more interesting case when ρ ∈ (0, 1), the matrix (1/n) Z̃ᵀ Z̃ in equation (2.9) has rank at most n, so the subtracted diagonal matrix may cause the matrix Γ̂_mis to have a large number of negative eigenvalues when p ≫ n. As a consequence, the matrix Γ̂_mis is not (in general) positive semidefinite, so the associated quadratic function is not convex.

EXAMPLE 3 (Multiplicative noise). As a generalization of the previous example, we now consider the case of multiplicative noise. In particular, suppose we observe the quantity Z = X ⊙ U, where U is a matrix of nonnegative noise variables. In many applications, it is natural to assume that the rows u_i of U are drawn in an i.i.d. manner, say from some distribution in which both the vector E[u₁] and the matrix E[u₁u₁ᵀ] have strictly positive entries. This general family of multiplicative noise models arises in various applications; we refer the reader to the papers [3, 6, 7, 21] for more discussion and examples. A natural choice of the pair (Γ̂, γ̂) is given by the quantities

(2.10)  Γ̂_mul := (1/n) Zᵀ Z ÷ E[u₁u₁ᵀ]  and  γ̂_mul := (1/n) Zᵀ y ÷ E[u₁],

where ÷ denotes elementwise division.
A small calculation shows that these are unbiased estimators of Σ_x and Σ_x β*, respectively. The estimators (2.10) have been studied in past work [21], but only under classical scaling (n ≫ p). As a special case of the estimators (2.10), suppose the entries u_ij of U are independent Bernoulli(1 − ρ_j) random variables. Then the observed matrix Z = X ⊙ U corresponds to a missing-data matrix, where each element of the jth column has probability ρ_j of being missing. In this case, the estimators (2.10) become

(2.11)  Γ̂_mis = (1/n) Zᵀ Z ÷ M  and  γ̂_mis = (1/n) Zᵀ y ÷ (1 − ρ),

where M := E[u₁u₁ᵀ] satisfies

M_ij = { (1 − ρ_i)(1 − ρ_j), if i ≠ j,
       { 1 − ρ_i, if i = j,

ρ is the parameter vector containing the ρ_j's, and 1 is the vector of all 1's. In this way, we obtain a generalization of the estimator discussed in Example 2.

2.3. Restricted eigenvalue conditions. Given an estimate β̂, there are various ways to assess its closeness to β*. In this paper, we focus on the ℓ2-norm ‖β̂ − β*‖₂, as well as the closely related ℓ1-norm ‖β̂ − β*‖₁. When the covariate matrix X is fully observed (so that the Lasso can be applied), it is now well understood that a sufficient condition for ℓ2-recovery is that the matrix Γ̂_Las = (1/n) Xᵀ X satisfy a certain type of restricted eigenvalue (RE) condition (e.g., [2, 20]). In this paper, we make use of the following condition.

DEFINITION 1 (Lower-RE condition). The matrix Γ̂ satisfies a lower restricted eigenvalue condition with curvature α₁ > 0 and tolerance τ(n, p) > 0 if

(2.12)  θᵀ Γ̂ θ ≥ α₁ ‖θ‖₂² − τ(n, p) ‖θ‖₁²  for all θ ∈ ℝ^p.

It can be shown that when the Lasso matrix Γ̂_Las = (1/n) Xᵀ X satisfies this RE condition (2.12), the Lasso estimate has low ℓ2-error for any vector β* supported on any subset of size at most k ≲ τ(n, p)^{−1}. In particular, bound (2.12) implies a sparse RE condition for all k of this magnitude, and conversely, Lemma 11 in the Appendix of [9] shows that a sparse RE condition implies bound (2.12). In this paper, we work with condition (2.12), since it is especially convenient for analyzing optimization algorithms. In the standard setting (with uncorrupted and fully observed design matrices), it is known that for many choices of the design matrix X (with rows having covariance Σ), the Lasso matrix Γ̂_Las will satisfy such an RE condition with high probability (e.g., [13, 17]) with α₁ = (1/2) λ_min(Σ) and τ(n, p) ≍ (log p)/n. A significant portion of the analysis in this paper is devoted to proving that different choices of Γ̂, such as the matrices Γ̂_add and Γ̂_mis defined earlier, also satisfy condition (2.12) with high probability.
This fact is by no means obvious, since as previously discussed, the matrices Γ̂_add and Γ̂_mis generally have large numbers of negative eigenvalues. Finally, although such upper bounds are not necessary for statistical consistency, our algorithmic results make use of the analogous upper restricted eigenvalue condition, formalized in the following:

DEFINITION 2 (Upper-RE condition). The matrix Γ̂ satisfies an upper restricted eigenvalue condition with smoothness α₂ > 0 and tolerance τ(n, p) > 0 if

(2.13)  θᵀ Γ̂ θ ≤ α₂ ‖θ‖₂² + τ(n, p) ‖θ‖₁²  for all θ ∈ ℝ^p.
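A small simulation makes Definitions 1 and 2 concrete. The sketch below forms the additive-noise surrogate Γ̂_add of equation (2.8) in a regime with p > n, confirms that it is indefinite (it has eigenvalues equal to −σ_w² on the null space of ZᵀZ), and then checks the lower-RE inequality (2.12) on randomly sampled sparse and dense directions. The pair (α₁, τ) used here is an illustrative guess of ours, of the order suggested by the theory, not a certified constant.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma_w = 200, 400, 0.5

X = rng.normal(size=(n, p))                      # true covariates, Sigma_x = I here
Z = X + sigma_w * rng.normal(size=(n, p))
Gamma = Z.T @ Z / n - sigma_w**2 * np.eye(p)     # surrogate (2.8), indefinite for p > n

alpha1 = 0.25                 # illustrative curvature, of order lambda_min(Sigma_x)/2
tau = np.log(p) / n           # illustrative tolerance of order (log p)/n

# Check (2.12) on random sparse and dense directions theta.
ok = True
for _ in range(200):
    if rng.random() < 0.5:                       # a random 10-sparse direction
        theta = np.zeros(p)
        supp = rng.choice(p, size=10, replace=False)
        theta[supp] = rng.normal(size=10)
    else:                                        # a random dense direction
        theta = rng.normal(size=p)
    lhs = theta @ Gamma @ theta
    rhs = alpha1 * theta @ theta - tau * np.abs(theta).sum() ** 2
    ok = ok and (lhs >= rhs)
```

For dense directions the tolerance term τ‖θ‖₁² is large, so the bound holds trivially; for sparse directions the quadratic form concentrates near its population value, which is how an indefinite matrix can still satisfy a lower-RE condition.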
In recent work on high-dimensional projected gradient descent, Agarwal et al. [1] make use of a more general form of the lower and upper bounds (2.12) and (2.13), applicable to nonquadratic losses as well, which are referred to as the restricted strong convexity (RSC) and restricted smoothness (RSM) conditions, respectively. For various classes of random design matrices, it can be shown that the Lasso matrix Γ̂_Las satisfies the upper bound (2.13) with α₂ = 2 λ_max(Σ_x) and τ(n, p) ≍ (log p)/n; see Raskutti et al. [13] for the Gaussian case and Rudelson and Zhou [17] for the sub-Gaussian setting. We will establish similar scalings for our choices of Γ̂.

2.4. Gradient descent algorithms. In addition to proving results about the global minima of the (possibly nonconvex) programs (2.4) and (2.5), we are also interested in polynomial-time procedures for approximating such optima. In this paper, we analyze some simple algorithms for solving either the constrained program (2.4) or the Lagrangian version (2.7). Note that the gradient of the quadratic loss function takes the form ∇L(β) = Γ̂β − γ̂. In application to the constrained version, the method of projected gradient descent generates a sequence of iterates {βᵗ, t = 0, 1, 2, ...} by the recursion

(2.14)  βᵗ⁺¹ = arg min_{‖β‖₁ ≤ R} { L(βᵗ) + ⟨∇L(βᵗ), β − βᵗ⟩ + (η/2) ‖β − βᵗ‖₂² },

where η > 0 is a stepsize parameter. Equivalently, this update can be written as βᵗ⁺¹ = Π(βᵗ − (1/η) ∇L(βᵗ)), where Π denotes the ℓ2-projection onto the ℓ1-ball of radius R. This projection can be computed rapidly in O(p) time using a procedure due to Duchi et al. [5]. For the Lagrangian update, we use a slight variant of the projected gradient update (2.14), namely

(2.15)  βᵗ⁺¹ = arg min_{‖β‖₁ ≤ R} { L(βᵗ) + ⟨∇L(βᵗ), β − βᵗ⟩ + (η/2) ‖β − βᵗ‖₂² + λ_n ‖β‖₁ },

with the only difference being the inclusion of the regularization term. This update can also be performed efficiently by performing two projections onto the ℓ1-ball; see the paper [1] for details.
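The update (2.14) is straightforward to implement. The sketch below uses a sort-based O(p log p) projection onto the ℓ₁-ball (the procedure of Duchi et al. [5] achieves expected O(p) time but computes the same point), and runs the iteration on an indefinite additive-noise surrogate from Example 1 starting from two different points; problem sizes and noise levels are again illustrative choices of ours.

```python
import numpy as np

def project_l1(v, R):
    """Euclidean projection of v onto the l1-ball of radius R.

    Sort-based O(p log p) variant; the algorithm of Duchi et al. runs in
    expected O(p) time but returns the same projection.
    """
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    ranks = np.arange(1, u.size + 1)
    rho = ranks[u - (css - R) / ranks > 0].max()
    theta = (css[rho - 1] - R) / rho
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient(Gamma, gamma, R, eta, T, beta0=None):
    """Iterates (2.14) for the quadratic loss L(b) = 0.5 b' Gamma b - <gamma, b>."""
    beta = np.zeros(gamma.size) if beta0 is None else beta0.copy()
    for _ in range(T):
        beta = project_l1(beta - (Gamma @ beta - gamma) / eta, R)
    return beta

# A corrupted-covariate instance (Example 1) with p > n, so Gamma is indefinite.
rng = np.random.default_rng(5)
n, p, k, sigma_w = 100, 256, 8, 0.2
beta_star = np.zeros(p); beta_star[:k] = 1.0 / np.sqrt(k)
X = rng.normal(size=(n, p))
y = X @ beta_star + 0.1 * rng.normal(size=n)
Z = X + sigma_w * rng.normal(size=(n, p))

Gamma = Z.T @ Z / n - sigma_w**2 * np.eye(p)     # pair (2.8)
gamma = Z.T @ y / n
R = np.abs(beta_star).sum()                      # idealized radius R = ||beta*||_1
eta = 2 * np.linalg.eigvalsh(Gamma).max()        # stepsize set by the smoothness level

beta_a = projected_gradient(Gamma, gamma, R, eta, T=1500)
beta_b = projected_gradient(Gamma, gamma, R, eta, T=1500,
                            beta0=project_l1(rng.normal(size=p), R))
```

Despite the nonconvexity, runs from different starting points land at essentially the same point, consistent with the behavior described around Theorem 2.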
When the objective function is convex (equivalently, Γ̂ is positive semidefinite), the iterates (2.14) or (2.15) are guaranteed to converge to a global minimum of the objective functions (2.4) and (2.7), respectively. In our setting, the matrix Γ̂ need not be positive semidefinite, so the best generic guarantee is that the iterates converge to a local optimum. However, our analysis shows that for the family of programs (2.4) or (2.7), under a reasonable set of conditions satisfied by various statistical models, the iterates actually converge to a point extremely close to any global optimum in both ℓ1-norm and ℓ2-norm; see Theorem 2 to follow for a more detailed statement.
3. Main results and consequences. We now state our main results and discuss their consequences for noisy, missing, and dependent data.

3.1. General results. We provide theoretical guarantees for both the constrained estimator (2.4) and the Lagrangian version (2.7). Note that we obtain different optimization problems as we vary the choice of the pair (Γ̂, γ̂) ∈ ℝ^{p×p} × ℝ^p. We begin by stating a pair of general results, applicable to any pair that satisfies certain conditions. Our first result (Theorem 1) provides bounds on the statistical error, namely the quantity ‖β̂ − β*‖₂, as well as the corresponding ℓ1-error, where β̂ is any global optimum of the programs (2.4) or (2.7). Since the problem may be nonconvex in general, it is not immediately obvious that one can obtain a provably good approximation to any global optimum without resorting to costly search methods. In order to assuage this concern, our second result (Theorem 2) provides rigorous bounds on the optimization error, namely the differences ‖βᵗ − β̂‖₂ and ‖βᵗ − β̂‖₁ incurred by the iterate βᵗ after running t rounds of the projected gradient descent updates (2.14) or (2.15).

3.2. Statistical error. In controlling the statistical error, we assume that the matrix Γ̂ satisfies a lower-RE condition with curvature α₁ and tolerance τ(n, p), as previously defined (2.12). Recall that Γ̂ and γ̂ serve as surrogates to the deterministic quantities Σ_x ∈ ℝ^{p×p} and Σ_x β* ∈ ℝ^p, respectively. Our results also involve a measure of deviation in these surrogates. In particular, we assume that there is some function φ(Q, σ_ε), depending on the two sources of noise in our problem: the standard deviation σ_ε of the observation noise vector ε from equation (2.1), and the conditional distribution Q from equation (2.2) that links the covariates x_i to the observed versions z_i. With this notation, we consider the deviation condition

(3.1)  ‖γ̂ − Γ̂ β*‖∞ ≤ φ(Q, σ_ε) √((log p)/n).

To aid intuition, note that by the triangle inequality, inequality (3.1) holds (with φ rescaled by a constant) whenever the following two deviation conditions are satisfied:

(3.2)  ‖γ̂ − Σ_x β*‖∞ ≤ φ(Q, σ_ε) √((log p)/n)  and  ‖(Γ̂ − Σ_x) β*‖∞ ≤ φ(Q, σ_ε) √((log p)/n).
The pair of inequalities (3.2) clearly measures the deviation of the estimators (Γ̂, γ̂) from their population versions, and they are sometimes easier to verify theoretically. However, inequality (3.1) may be used directly to derive tighter bounds (e.g., in the additive noise case). Indeed, the bounds established via inequalities (3.2) are not sharp in the limit of low noise on the covariates, due to the second inequality. In the proofs of our corollaries to follow, we will verify the deviation conditions for various forms of noisy, missing, and dependent data, with the quantity φ(Q, σ_ε) changing depending on the model. We have the following result, which applies to any global optimum β̂ of the regularized version (2.7) with λ_n ≥ 4 φ(Q, σ_ε) √((log p)/n):

THEOREM 1 (Statistical error). Suppose the surrogates (Γ̂, γ̂) satisfy the deviation bound (3.1), and the matrix Γ̂ satisfies the lower-RE condition (2.12) with parameters (α₁, τ) such that

(3.3)  √k τ(n, p) ≤ min{ α₁ / (128 √k), (φ(Q, σ_ε) / b₀) √((log p)/n) }.

Then for any vector β* with sparsity at most k, there is a universal positive constant c₀ such that any global optimum β̂ of the Lagrangian program (2.7) with any b₀ ≥ ‖β*‖₂ satisfies the bounds

(3.4a)  ‖β̂ − β*‖₂ ≤ (c₀ √k / α₁) max{ φ(Q, σ_ε) √((log p)/n), λ_n }

and

(3.4b)  ‖β̂ − β*‖₁ ≤ (8 c₀ k / α₁) max{ φ(Q, σ_ε) √((log p)/n), λ_n }.

The same bounds (without λ_n) also apply to the constrained program (2.4) with radius choice R = ‖β*‖₁.

REMARKS. To be clear, all the claims of Theorem 1 are deterministic. Probabilistic conditions will enter when we analyze specific statistical models and certify that the RE condition (3.3) and deviation conditions are satisfied by a random pair (Γ̂, γ̂) with high probability. We note that for the standard Lasso choice (Γ̂_Las, γ̂_Las) of this matrix-vector pair, bounds of the form (3.4) for sub-Gaussian noise are well known from past work (e.g., [2, 11, 12, 23]). The novelty of Theorem 1 is in allowing for general pairs of such surrogates, which, as shown by the examples discussed earlier, can lead to nonconvexity in the underlying M-estimator. Moreover, some interesting differences arise due to the term φ(Q, σ_ε), which changes depending on the nature of the model (missing, noisy, and/or dependent), as will be clarified in the sequel. Proving that the conditions of Theorem 1 are satisfied with high probability for noisy/missing data requires some nontrivial analysis involving both concentration inequalities and random matrix theory.
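The deviation condition (3.1) can also be checked empirically. For the additive-noise pair (2.8), the sketch below computes ‖γ̂ − Γ̂β*‖∞ at two sample sizes and confirms that it shrinks at roughly the √(1/n) rate that the bound predicts; all parameter values are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
p, k, sigma_w = 200, 5, 0.5
beta_star = np.zeros(p)
beta_star[:k] = 1.0

def deviation(n):
    """||gamma_hat - Gamma_hat beta*||_inf for the additive-noise pair (2.8)."""
    X = rng.normal(size=(n, p))
    y = X @ beta_star + 0.1 * rng.normal(size=n)
    Z = X + sigma_w * rng.normal(size=(n, p))
    Gamma = Z.T @ Z / n - sigma_w**2 * np.eye(p)
    gamma = Z.T @ y / n
    return np.abs(gamma - Gamma @ beta_star).max()

# Deviation at a small and a much larger sample size.
d_small, d_large = deviation(100), deviation(10000)
```

The difference γ̂ − Γ̂β* has mean zero under this model, so the maximal coordinate shrinks as n grows, in line with the √((log p)/n) scaling.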
Note that in the presence of nonconvexity, it is possible in principle for the optimization problems (2.4) and (2.7) to have many global optima that are separated by large distances. Interestingly, Theorem 1 guarantees that this unpleasant feature does not arise under the stated conditions: given any two global optima β̂ and β̃ of the program (2.4), Theorem 1 combined with the triangle inequality guarantees that

‖β̂ − β̃‖₂ ≤ ‖β̂ − β*‖₂ + ‖β* − β̃‖₂ ≤ 2 c₀ (φ(Q, σ_ε) / α₁) √((k log p)/n)

[and similarly for the program (2.7)]. Consequently, under any scaling such that (k log p)/n = o(1), the set of all global optima must lie within an ℓ2-ball whose radius shrinks to zero.

In addition, it is worth observing that Theorem 1 makes a specific prediction for the scaling behavior of the ℓ2-error ‖β̂ − β*‖₂. In order to study this scaling prediction, we performed simulations under the additive noise model described in Example 1, using the parameter settings Σ_x = I and Σ_w = σ_w² I with σ_w = 0.2. Panel (a) of Figure 1 provides plots³ of the error ‖β̂ − β*‖₂ versus the sample size n, for problem dimensions p ∈ {128, 256, 512}. Note that for all three choices of dimensions, the error decreases to zero as the sample size n increases, showing consistency of the method. The curves also shift to the right as the dimension p increases, reflecting the natural intuition that larger problems are harder in a certain sense. Theorem 1 makes a specific prediction about this scaling behavior: in particular, if we plot the ℓ2-error versus the rescaled sample size n/(k log p), the curves should roughly align for different values of p. Panel (b) shows the same data re-plotted on these rescaled axes, thus verifying the predicted "stacking" behavior.

FIG. 1. Plots of the error ‖β̂ − β*‖₂ after running projected gradient descent on the nonconvex objective, with sparsity k ≍ √p. Plot (a) is an error plot for i.i.d. data with additive noise, and plot (b) shows the ℓ2-error versus the rescaled sample size n/(k log p). As predicted by Theorem 1, the curves align for different values of p in the rescaled plot.

³ Corollary 1, to be stated shortly, guarantees that the conditions of Theorem 1 are satisfied with high probability for the additive noise model. In addition, Theorem 2 to follow provides an efficient method of obtaining an accurate approximation of the global optimum.
Finally, as noted by a reviewer, the constraint R = ‖β*‖₁ in the program (2.4) is rather restrictive, since β* is unknown. Theorem 1 merely establishes a heuristic for the scaling expected for this optimal radius. In this regard, the Lagrangian estimator (2.7) is more appealing, since it only requires choosing b₀ to be larger than ‖β*‖₂, and the conditions on the regularizer λ_n are the standard ones from past work on the Lasso.

3.3. Optimization error. Although Theorem 1 provides guarantees that hold uniformly for any global minimizer, it does not provide guidance on how to approximate such a global minimizer using a polynomial-time algorithm. Indeed, for nonconvex programs in general, gradient-type methods may become trapped in local minima, and it is impossible to guarantee that all such local minima are close to a global optimum. Nonetheless, we are able to show that for the family of programs (2.4), under reasonable conditions on Γ̂ satisfied in various settings, simple gradient methods will converge geometrically fast to a very good approximation of any global optimum. The following theorem supposes that we apply the projected gradient updates (2.14) to the constrained program (2.4), or the composite updates (2.15) to the Lagrangian program (2.7), with stepsize η = 2α₂. In both cases, we assume that n ≳ k log p, as is required for statistical consistency in Theorem 1.

THEOREM 2 (Optimization error). Under the conditions of Theorem 1:

(a) For any global optimum β̂ of the constrained program (2.4), there are universal positive constants (c₁, c₂) and a contraction coefficient γ ∈ (0, 1), independent of (n, p, k), such that the gradient descent iterates (2.14) satisfy the bounds

(3.5)  ‖βᵗ − β̂‖₂² ≤ γᵗ ‖β⁰ − β̂‖₂² + c₁ ((log p)/n) ‖β̂ − β*‖₁² + c₂ ‖β̂ − β*‖₂²,

(3.6)  ‖βᵗ − β̂‖₁ ≤ 2√k ‖βᵗ − β̂‖₂ + 2√k ‖β̂ − β*‖₂ + 2 ‖β̂ − β*‖₁

for all t ≥ 0.
(b) Lettig φ deote the objective fuctio of Lagragia program (2.7) with global optimum β, ad applyig composite gradiet updates (2.15), there are uiversal positive costats (c 1,c 2 ) ad a cotractio coefficiet γ (0, 1), idepedet of (,p,k), such that (3.7) β t β 2 2 c 1 β β 2 2 where T := c 2 log (φ(β0 ) φ( β)) δ 2 / log(1/γ ). } {{ } δ 2 for all iterates t T, Remarks. As with Theorem 1, these claims are determiistic i ature. Probabilistic coditios will eter ito the corollaries, which ivolve provig that the surrogate matrices Ɣ used for oisy, missig ad/or depedet data satisfy the
13 HIGH-DIMENSIONAL NOISY LASSO 1649 lower- ad upper-re coditios with high probability. The proof of Theorem 2 itself is based o a extesio of a result due to Agarwal et al. [1] o the covergece of projected gradiet descet ad composite gradiet descet i high dimesios. Their result, as origially stated, imposed covexity of the loss fuctio, but the proof ca be modified so as to apply to the ocovex loss fuctios of iterest here. As oted followig Theorem 1, all global miimizers of the ocovex program (2.4) lie withi a small ball. I additio, Theorem 2 guaratees that the local miimizers also lie withi a ball of the same magitude. Note that i order to show that Theorem 2 ca be applied to the specific statistical models of iterest i this paper, a cosiderable amout of techical aalysis remais i order to establish that its coditios hold with high probability. I order to uderstad the sigificace of the bouds (3.5) ad(3.7), ote that they provide upper bouds for the l 2 -distace betwee the iterate β t at time t, which is easily computed i polyomial-time, ad ay global optimum β of the program (2.4) or(2.7), which may be difficult to compute. Focusig o boud (3.5), sice γ (0, 1), the first term i the boud vaishes as t icreases. The remaiig terms ivolve the statistical errors β β q,forq = 1, 2, which are cotrolled i Theorem 1. It ca be verified that the two terms ivolvig the statistical error o the right-had side are bouded as O( k log p ), so Theorem 2 guaratees that projected gradiet descet produce a output that is essetially as good i terms of statistical error as ay global optimum of the program (2.4). Boud (3.7) provides a similar guaratee for composite gradiet descet applied to the Lagragia versio. Experimetally, we have foud that the predictios of Theorem 2 are bore out i simulatios. Figure 2 shows the results of applyig the projected gradiet descet method to solve the optimizatio problem (2.4) i the case of additive oise (a) (b) FIG. 2. 
Plots of the optimizatio error log( β t β 2 ) ad statistical error log( β t β 2 ) versus iteratio umber t, geerated by ruig projected gradiet descet o the ocovex objective. Each plot shows the solutio path for the same problem istace, usig 10 differet startig poits. As predicted by Theorem 2, the optimizatio error decreases geometrically.
P.-L. LOH AND M. J. WAINWRIGHT

[panel (a)], and missing data [panel (b)]. In each case, we generated a random problem instance, and then applied the projected gradient descent method to compute an estimate β̂. We then reapplied the projected gradient method to the same problem instance 10 times, each time with a random starting point, and measured the error ||β^t − β̂||_2 between the iterates and the first estimate (optimization error), and the error ||β^t − β*||_2 between the iterates and the truth (statistical error). Within each panel, the blue traces show the optimization error over 10 trials, and the red traces show the statistical error. On the logarithmic scale given, a geometric rate of convergence corresponds to a straight line. As predicted by Theorem 2, regardless of the starting point, the iterates {β^t} exhibit geometric convergence to the same fixed point.⁴ The statistical error contracts geometrically up to a certain point, then flattens out.

3.2. Some consequences. As discussed previously, both Theorems 1 and 2 are deterministic results. Applying them to specific statistical models requires some additional work in order to establish that the stated conditions are met. We now turn to the statements of some consequences of these theorems for different cases of noisy, missing and dependent data. In all the corollaries below, the claims hold with probability greater than 1 − c1 exp(−c2 log p), where (c1, c2) are universal positive constants, independent of all other problem parameters. Note that in all corollaries, the triplet (n, p, k) is assumed to satisfy a scaling of the form n ≳ k log p, as is necessary for ℓ2-consistent estimation of k-sparse vectors in p dimensions.

DEFINITION 3. We say that a random matrix X ∈ R^{n×p} is sub-Gaussian with parameters (Σ, σ²) if:
(a) each row x_i^T ∈ R^p is sampled independently from a zero-mean distribution with covariance Σ, and
(b) for any unit vector u ∈ R^p, the random variable u^T x_i is sub-Gaussian with parameter at most σ.
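Definition 3 can be made concrete with a small numerical sketch; the sample size, dimension, and the Toeplitz covariance below are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50  # illustrative sample size and dimension

# An arbitrary covariance with decaying correlations (Toeplitz).
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

# Draw each row independently from N(0, Sigma); per Definition 3, the
# resulting matrix is sub-Gaussian with parameters (Sigma, ||Sigma||_op).
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T

# Sanity check: the sample covariance approximates Sigma for large n.
Sigma_hat = X.T @ X / n
```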
For instance, if we form a random matrix by drawing each row independently from the distribution N(0, Σ), then the resulting matrix X ∈ R^{n×p} is a sub-Gaussian matrix with parameters (Σ, ||Σ||_op).

3.2.1. Bounds for additive noise: i.i.d. case. We begin with the case of i.i.d. samples with additive noise, as described in Example 1.

COROLLARY 1. Suppose that we observe Z = X + W, where the random matrices X, W ∈ R^{n×p} are sub-Gaussian with parameters (Σ_x, σ_x²) and (Σ_w, σ_w²), respectively, and let ε be

⁴ To be precise, Theorem 2 states that the iterates will converge geometrically to a small neighborhood of all the global optima.
an i.i.d. sub-Gaussian vector with parameter σ_ε². Let σ_z² = σ_x² + σ_w². Then under the scaling

n ≳ max{σ_z⁴ / λ²_min(Σ_x), 1} k log p,

for the M-estimator based on the surrogates (Γ̂_add, γ̂_add), the results of Theorems 1 and 2 hold with parameters α1 = (1/2) λ_min(Σ_x) and φ(Q, σ_ε) = c0 σ_z (σ_w + σ_ε) ||β*||_2, with probability at least 1 − c1 exp(−c2 log p).

Remarks. (a) Consequently, the ℓ2-error of any optimal solution β̂ satisfies the bound

||β̂ − β*||_2 ≲ [σ_z (σ_w + σ_ε) / λ_min(Σ_x)] ||β*||_2 √(k log p / n)

with high probability. The prefactor in this bound has a natural interpretation as an inverse signal-to-noise ratio; for instance, when X and W are zero-mean Gaussian matrices with row covariances Σ_x = σ_x² I and Σ_w = σ_w² I, respectively, we have λ_min(Σ_x) = σ_x², so

σ_z (σ_w + σ_ε) / λ_min(Σ_x) = (σ_w + σ_ε) √(σ_x² + σ_w²) / σ_x² = [(σ_w + σ_ε)/σ_x] √(1 + σ_w²/σ_x²).

This quantity grows with the ratios σ_w/σ_x and σ_ε/σ_x, which measure the inverse SNR of the observed covariates and predictors, respectively. Note that when σ_w = 0, corresponding to the case of uncorrupted covariates, the bound on ℓ2-error agrees with known results. See Section 4 for simulations and further discussion of the consequences of Corollary 1.

(b) We may also compare the results in (a) with bounds from past work on high-dimensional sparse regression with noisy covariates [15]. In this work, Rosenbaum and Tsybakov derive similar concentration bounds on sub-Gaussian matrices. The tolerance parameters are all O(√(log p / n)), with prefactors depending on the sub-Gaussian parameters of the matrices. In particular, in their notation,

ν ≲ (σ_x σ_w + σ_w σ_ε + σ_w²) √(log p / n) ||β*||_1,

leading to the bound (cf. Theorem 2 of Rosenbaum and Tsybakov [15])

||β̂ − β*||_2 ≲ ν √k / λ_min(Σ_x) ≲ [σ² / λ_min(Σ_x)] √(k log p / n) ||β*||_1,

where σ² = σ_x σ_w + σ_w σ_ε + σ_w² collects the noise prefactors.

Extensions to unknown noise covariance. Situations may arise where the noise covariance Σ_w is unknown, and must be estimated from the data. One simple method is to assume that Σ_w is estimated from independent observations of the
noise. In this case, suppose we independently observe a matrix W0 ∈ R^{n×p} with i.i.d. rows of noise. Then we use Σ̂_w = (1/n) W0^T W0 as our estimate of Σ_w. A more sophisticated variant of this method (cf. Chapter 4 of Carroll et al. [3]) assumes that we observe k_i replicate measurements Z_{i1}, ..., Z_{ik_i} for each x_i, and forms the estimator

(3.8)  Σ̂_w = [Σ_{i=1}^n Σ_{j=1}^{k_i} (Z_{ij} − Z̄_i)(Z_{ij} − Z̄_i)^T] / [Σ_{i=1}^n (k_i − 1)].

Based on the estimator Σ̂_w, we form the pair (Γ̂, γ̂) with γ̂ = (1/n) Z^T y and Γ̂ = (1/n) Z^T Z − Σ̂_w. In the proofs of Section 5, we will analyze the case where Σ̂_w = (1/n) W0^T W0, and show that the result of Corollary 1 still holds when Σ_w must be estimated from the data. Note that the estimator in equation (3.8) will also yield the same result, but the analysis is more complicated.

3.2.2. Bounds for missing data: i.i.d. case. Next, we turn to the case of i.i.d. samples with missing data, as discussed in Example 3. For a missing data parameter vector ρ, we define ρ_max := max_j ρ_j, and assume ρ_max < 1.

COROLLARY 2. Let X ∈ R^{n×p} be sub-Gaussian with parameters (Σ_x, σ_x²), and let Z be the missing-data matrix with parameter ρ. Let ε be an i.i.d. sub-Gaussian vector with parameter σ_ε². If

n ≳ max{σ_x⁴ / [(1 − ρ_max)⁴ λ²_min(Σ_x)], 1} k log p,

then Theorems 1 and 2 hold with probability at least 1 − c1 exp(−c2 log p), for α1 = (1/2) λ_min(Σ_x) and φ(Q, σ_ε) = c0 [σ_x/(1 − ρ_max)] (σ_ε + σ_x/(1 − ρ_max)) ||β*||_2.

Remarks. Suppose X is a Gaussian random matrix and ρ_j = ρ for all j. In this case, the ratio σ_x²/λ_min(Σ_x) = λ_max(Σ_x)/λ_min(Σ_x) = κ(Σ_x) is the condition number of Σ_x. Then

φ(Q, σ_ε)/α1 ≲ [σ_x σ_ε / (λ_min(Σ_x)(1 − ρ)) + κ(Σ_x)/(1 − ρ)²] ||β*||_2,

a quantity that depends on both the conditioning of Σ_x and the fraction ρ ∈ [0, 1) of missing data. We will consider the results of Corollary 2 applied to this example in the simulations of Section 4.

Extensions to unknown ρ. As in the additive noise case, we may wish to consider the case when the missing data parameters ρ are not observed and must be estimated from the data.
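The additive-noise surrogates of Corollary 1, together with the simple plug-in estimate of Σ_w from an independent noise sample, can be sketched as follows; all sizes, sparsity, and noise levels are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 500, 64, 5            # illustrative sizes
sigma_w, sigma_eps = 0.2, 0.5   # illustrative noise levels

# Ground truth: k-sparse beta*, identity-covariance design.
beta_star = np.zeros(p)
beta_star[:k] = 1.0 / np.sqrt(k)
X = rng.standard_normal((n, p))
y = X @ beta_star + sigma_eps * rng.standard_normal(n)

# Example 1: covariates observed with additive noise, Z = X + W.
Sigma_w = sigma_w ** 2 * np.eye(p)
Z = X + sigma_w * rng.standard_normal((n, p))

# Surrogates of Corollary 1: unbiased for (Sigma_x, Sigma_x @ beta*).
Gamma_add = Z.T @ Z / n - Sigma_w
gamma_add = Z.T @ y / n

# If Sigma_w is unknown: the simple variant estimates it from an
# independent noise sample W0, as discussed before equation (3.8).
W0 = sigma_w * rng.standard_normal((n, p))
Sigma_w_hat = W0.T @ W0 / n
Gamma_hat = Z.T @ Z / n - Sigma_w_hat
```

Note that Γ̂_add, unlike a sample covariance, may have negative eigenvalues when n < p, which is exactly why the overall program is nonconvex.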
For each j = 1, 2, ..., p, we estimate ρ_j by ρ̂_j, the empirical fraction of missing entries in the jth column of Z. Let ρ̂ ∈ R^p denote the resulting estimator of ρ. Naturally, we use the pair of estimators (Γ̂, γ̂) defined by

(3.9)  Γ̂ = (1/n) Z^T Z ⊘ M̂  and  γ̂ = (1/n) Z^T y ⊘ (1 − ρ̂),

in which ⊘ denotes elementwise division,
where

M̂_ij = (1 − ρ̂_i)(1 − ρ̂_j) for i ≠ j,  and  M̂_ii = 1 − ρ̂_i.

We will show in Section 5 that Corollary 2 holds when ρ is estimated by ρ̂.

3.2.3. Bounds for dependent data. Turning to the case of dependent data, we consider the setting where the rows of X are drawn from a stationary vector autoregressive (VAR) process according to

(3.10)  x_{i+1} = A x_i + v_i  for i = 1, 2, ..., n − 1,

where v_i ∈ R^p is a zero-mean noise vector with covariance matrix Σ_v, and A ∈ R^{p×p} is a driving matrix with spectral norm ||A||_2 < 1. We assume the rows of X are drawn from a Gaussian distribution with covariance Σ_x, such that Σ_x = A Σ_x A^T + Σ_v. Hence, the rows of X are identically distributed but not independent, with the choice A = 0 giving rise to the i.i.d. scenario. Corollaries 3 and 4 correspond to the cases of additive noise and missing data for a Gaussian VAR process.

COROLLARY 3. Suppose the rows of X are drawn according to a Gaussian VAR process with driving matrix A. Suppose the additive noise matrix W is i.i.d. with Gaussian rows, and let ε be an i.i.d. sub-Gaussian vector with parameter σ_ε². If

n ≳ max{ζ⁴ / λ²_min(Σ_x), 1} k log p,  with ζ² = ||Σ_w||_op + 2 ||Σ_x||_op / (1 − ||A||_op),

then Theorems 1 and 2 hold with probability at least 1 − c1 exp(−c2 log p), for α1 = (1/2) λ_min(Σ_x) and φ(Q, σ_ε) = c0 (σ_ε ζ + ζ²) ||β*||_2.

COROLLARY 4. Suppose the rows of X are drawn according to a Gaussian VAR process with driving matrix A, and Z is the observed matrix subject to missing data, with parameter ρ. Let ε be an i.i.d. sub-Gaussian vector with parameter σ_ε². If

n ≳ max{ζ⁴ / λ²_min(Σ_x), 1} k log p,  with ζ² = ||Σ_x||_op / [(1 − ρ_max)² (1 − ||A||_op)],

then Theorems 1 and 2 hold with probability at least 1 − c1 exp(−c2 log p), for α1 = (1/2) λ_min(Σ_x) and φ(Q, σ_ε) = c0 (σ_ε ζ + ζ²) ||β*||_2.

REMARKS. Note that the scaling and the form of φ in Corollaries 2–4 are very similar, except with different effective variances: σ_x²/(1 − ρ_max)² in Corollary 2, or ζ² as defined in Corollaries 3 and 4, depending on the type of corruption in the data.
As we will see in Section 5, the proofs involve verifying the deviation conditions (3.2) using similar techniques. On the other hand, the proof of Corollary 1 proceeds via deviation condition (3.1), which produces a tighter bound. Note that we may extend the cases of dependent data to situations when Σ_w and ρ are unknown and must be estimated from the data. The proofs of these extensions are identical to the i.i.d. case, so we will omit them.
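The missing-data construction of equations (3.9), with ρ estimated columnwise from the observation pattern, can be sketched as follows; the sizes, the uniform missing fraction, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k, rho = 500, 64, 5, 0.2  # illustrative sizes; uniform missing fraction

beta_star = np.zeros(p)
beta_star[:k] = 1.0 / np.sqrt(k)
X = rng.standard_normal((n, p))
y = X @ beta_star + 0.5 * rng.standard_normal(n)

# Example 3: each entry observed independently with probability 1 - rho;
# missing entries are recorded as zero.
observed = rng.random((n, p)) > rho
Z = np.where(observed, X, 0.0)

# Estimate rho_j as the empirical fraction of missing entries in column j.
rho_hat = 1.0 - observed.mean(axis=0)
obs_frac = 1.0 - rho_hat

# The matrix M-hat of equation (3.9): (1-rho_i)(1-rho_j) off the diagonal
# and (1-rho_i) on the diagonal.
M_hat = np.outer(obs_frac, obs_frac)
np.fill_diagonal(M_hat, obs_frac)

# Surrogates (3.9): elementwise division corrects the zero-fill bias.
Gamma_hat = (Z.T @ Z / n) / M_hat
gamma_hat = (Z.T @ y / n) / obs_frac
```

The elementwise rescaling makes Γ̂ and γ̂ unbiased for Σ_x and Σ_x β*, at the price of Γ̂ possibly being indefinite when n < p.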
3.3. Application to graphical model inverse covariance estimation. The problem of inverse covariance estimation for a Gaussian graphical model is also related to the Lasso. Meinshausen and Bühlmann [10] prescribed a way to recover the support of the precision matrix Θ when each column of Θ is k-sparse, via linear regression and the Lasso. More recently, Yuan [22] proposed a method for estimating Θ using the Dantzig selector, and obtained error bounds on ||Θ̂ − Θ||_1 when the columns of Θ are bounded in ℓ1. Both of these results assume that X is fully-observed and has i.i.d. rows.

Suppose we are given a matrix X ∈ R^{n×p} of samples from a multivariate Gaussian distribution, where each row is distributed according to N(0, Σ). We assume the rows of X are either i.i.d. or sampled from a Gaussian VAR process. Based on the modified Lasso of the previous section, we devise a method to estimate Θ = Σ^{-1} based on a corrupted observation matrix Z, when Θ is sparse. Our method bears similarity to the method of Yuan [22], but is valid in the case of corrupted data, and does not require an ℓ1 column bound.

Let X_j denote the jth column of X, and let X_{−j} denote the matrix X with the jth column removed. By standard results on Gaussian graphical models, there exists a vector θ^j ∈ R^{p−1} such that

(3.11)  X_j = X_{−j} θ^j + ε_j,

where ε_j is a vector of i.i.d. Gaussians and ε_j ⊥ X_{−j}, for each j. If we define a_j := (Σ_jj − Σ_{j,−j} θ^j)^{-1}, we can verify that Θ_{j,−j} = −a_j θ^j. Our algorithm, described below, forms estimates θ̂^j and â_j for each j, then combines the estimates via Θ̂_{j,−j} = −â_j θ̂^j.

In the additive noise case, we observe the matrix Z = X + W. From the equations (3.11), we obtain Z_j = X_{−j} θ^j + (ε_j + W_j). Note that δ_j = ε_j + W_j is a vector of i.i.d. Gaussians, and since X ⊥ W, we have δ_j ⊥ X_{−j}. Hence, our results on covariates with additive noise allow us to recover θ^j from Z. We can verify that this reduces to solving the program (2.4) or (2.7) with the pair (Γ̂^(j), γ̂^(j)) = (Σ̂_{−j,−j}, (1/n) Z_{−j}^T Z_j), where Σ̂ = (1/n) Z^T Z − Σ_w.
When Z is a missing-data version of X, we similarly estimate the vectors θ^j via equation (3.11), using our results on the Lasso with missing covariates. Here, both covariates and responses are subject to missing data, but this makes no difference in our theoretical results. For each j, we use the pair

(Γ̂^(j), γ̂^(j)) = (Σ̂_{−j,−j}, (1/n) Z_{−j}^T Z_j ⊘ [(1 − ρ_{−j})(1 − ρ_j)]),

where Σ̂ = (1/n) Z^T Z ⊘ M, and M is defined as in Example 3.

To obtain the estimate Θ̂, we therefore propose the following procedure, based on the estimators {(Γ̂^(j), γ̂^(j))}_{j=1}^p and Σ̂.

ALGORITHM 3.1.

(1) Perform p linear regressions of the variables Z_j upon the remaining variables Z_{−j}, using the program (2.4) or (2.7) with the estimators (Γ̂^(j), γ̂^(j)), to obtain estimates θ̂^j of θ^j.
(2) Estimate the scalars a_j using the quantity â_j := (Σ̂_jj − Σ̂_{j,−j} θ̂^j)^{-1}, based on the estimator Σ̂. Form Θ̃ with Θ̃_{j,−j} = −â_j θ̂^j and Θ̃_jj = â_j.

(3) Set Θ̂ = arg min_{Θ ∈ S^p} ||Θ − Θ̃||_1, where S^p is the set of symmetric p × p matrices.

Note that the minimization in step (3) is a linear program, so is easily solved with standard methods. We have the following corollary about Θ̂:

COROLLARY 5. Suppose the columns of the matrix Θ are k-sparse, and suppose the condition number κ(Σ) is nonzero and finite. Suppose we have the deviation bounds

(3.12)  ||γ̂^(j) − Γ̂^(j) θ^j||_∞ ≤ φ(Q, σ_ε) √(log p / n)  for all j,

and suppose we have the following additional deviation condition on Σ̂:

(3.13)  ||Σ̂ − Σ||_max ≤ c φ(Q, σ_ε) √(log p / n).

Finally, suppose the lower-RE condition holds uniformly over the matrices Γ̂^(j) with the scaling (3.3). Then under the estimation procedure of Algorithm 3.1, there exists a universal constant c0 such that

||Θ̂ − Θ||_op ≤ c0 κ²(Σ) [φ(Q, σ_ε)/λ_min(Σ)] (1/λ_min(Σ) + φ(Q, σ_ε)/α1) k √(log p / n).

REMARKS. Note that Corollary 5 is again a deterministic result, with parallel structure to Theorem 1. Furthermore, the deviation bounds (3.12) and (3.13) hold for all scenarios considered in Section 3.2 above, using Corollaries 1–4 for the first two inequalities, and a similar bounding technique for ||Σ̂ − Σ||_max; and the lower-RE condition holds over all matrices Γ̂^(j) by the same technique used to establish the lower-RE condition for Γ̂. The uniformity of the lower-RE bound over all sub-matrices holds because

0 < λ_min(Σ) ≤ λ_min(Σ_{−j,−j}) ≤ λ_max(Σ_{−j,−j}) ≤ λ_max(Σ) < ∞.

Hence, the error bound in Corollary 5 holds with probability at least 1 − c1 exp(−c2 log p) when n ≳ k log p, for the appropriate values of φ and α1.

4. Simulations. In this section, we report some additional simulation results to confirm that the scalings predicted by our theory are sharp. In Figure 1 following Theorem 1, we showed that the error curves align when plotted against a suitably rescaled sample size, in the case of additive noise perturbations.
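The nodewise procedure of Algorithm 3.1 above can be sketched numerically for the additive-noise case. In this sketch, an ISTA-style composite gradient loop stands in for the program (2.7), symmetric averaging is used as one minimizer of the linear program in step (3), and all problem sizes, the step size, and the regularization level are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 30, 2000
sigma_w = 0.1  # known additive-noise standard deviation (illustrative)

# Chain-structured precision matrix: diagonal 1, adjacent links 0.1.
Theta = np.eye(p)
for i in range(p - 1):
    Theta[i, i + 1] = Theta[i + 1, i] = 0.1
Sigma = np.linalg.inv(Theta)

# Rows of X are N(0, Sigma); we observe Z = X + W.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Z = X + sigma_w * rng.standard_normal((n, p))

# Noise-corrected covariance estimate.
Sigma_hat = Z.T @ Z / n - sigma_w ** 2 * np.eye(p)

def composite_grad(Gamma, gamma, lam, step=0.1, iters=500):
    """ISTA on 0.5*t'Gamma t - gamma't + lam*||t||_1 (stand-in for (2.7))."""
    t = np.zeros(len(gamma))
    for _ in range(iters):
        g = t - step * (Gamma @ t - gamma)
        t = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return t

lam = np.sqrt(np.log(p) / n)
Theta_tilde = np.zeros((p, p))
for j in range(p):
    idx = np.arange(p) != j
    Gamma_j = Sigma_hat[np.ix_(idx, idx)]   # Gamma-hat^(j)
    gamma_j = Z[:, idx].T @ Z[:, j] / n     # gamma-hat^(j)
    theta_j = composite_grad(Gamma_j, gamma_j, lam)  # step (1)
    a_j = 1.0 / (Sigma_hat[j, j] - Sigma_hat[j, idx] @ theta_j)
    Theta_tilde[j, j] = a_j                 # step (2)
    Theta_tilde[j, idx] = -a_j * theta_j

# Step (3): symmetric averaging is one minimizer of the elementwise
# l1 projection onto symmetric matrices.
Theta_hat = 0.5 * (Theta_tilde + Theta_tilde.T)
```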
Panel (a) of Figure 3 shows these same types of rescaled curves for the case of missing data, with sparsity k ≈ √p, covariate matrix Σ_x = I, and missing fraction ρ = 0.2, whereas panel (b) shows the rescaled plots for the vector autoregressive case with additive
noise perturbations, using a driving matrix A with ||A||_op = 0.2. Each point corresponds to an average over 100 trials. Once again, we see excellent agreement with the scaling law provided by Theorem 1.

FIG. 3. Plots of the error ||β̂ − β*||_2 after running projected gradient descent on the nonconvex objective, with sparsity k ≈ √p. In all cases, we plotted the error versus the rescaled sample size n/(k log p). As predicted by Theorems 1 and 2, the curves align for different values of p when plotted in this rescaled manner. (a) Missing data case with i.i.d. covariates. (b) Vector autoregressive data with additive noise. Each point represents an average over 100 trials.

We also ran simulations to verify the form of the function φ(Q, σ_ε) appearing in Corollaries 1 and 2. In the additive noise setting for i.i.d. data, we set Σ_x = I and ε equal to i.i.d. Gaussian noise with σ_ε = 0.5. For a fixed value of the parameters p = 256 and k ∝ log p, we ran the projected gradient descent algorithm for different values of σ_w ∈ (0.1, 0.3), such that Σ_w = σ_w² I and n = 60 (1 + σ_w²)² k log p, with ||β*||_2 = 1. According to the theory, φ(Q, σ_ε)/α1 ≲ (σ_w + 0.5) √(1 + σ_w²), so that

||β̂ − β*||_2 ≲ (σ_w + 0.5) √(1 + σ_w²) √(k log p / n) ∝ (σ_w + 0.5) / √(1 + σ_w²).

In order to verify this theoretical prediction, we plotted σ_w versus the rescaled error [√(1 + σ_w²)/(σ_w + 0.5)] ||β̂ − β*||_2. As shown by Figure 4(a), the curve is roughly constant, as predicted by the theory.

Similarly, in the missing data setting for i.i.d. data, we set Σ_x = I and ε equal to i.i.d. Gaussian noise with σ_ε = 0.5. For a fixed value of the parameters p = 128 and k ∝ log p, we ran simulations for different values of the missing data parameter ρ ∈ (0, 0.3), such that n = 60 k log p / (1 − ρ)⁴. According to the theory, φ(Q, σ_ε)/α1 ≲ σ_ε/(1 − ρ) + 1/(1 − ρ)². Consequently, with our specified scalings of (n, p, k), we should expect a
FIG. 4. (a) Plot of the rescaled ℓ2-error [√(1 + σ_w²)/(σ_w + 0.5)] ||β̂ − β*||_2 versus the additive noise standard deviation σ_w, for the i.i.d. model with additive noise. (b) Plot of the rescaled ℓ2-error ||β̂ − β*||_2 / (1 + 0.5(1 − ρ)) versus the missing fraction ρ, for the i.i.d. model with missing data. Both curves are roughly constant, showing that our error bounds on ||β̂ − β*||_2 exhibit the proper scaling. Each point represents an average over 200 trials.

bound of the form

||β̂ − β*||_2 ≲ [φ(Q, σ_ε)/α1] √(k log p / n) ∝ 1 + 0.5(1 − ρ).

The plot of ρ versus the rescaled error ||β̂ − β*||_2 / (1 + 0.5(1 − ρ)) is shown in Figure 4(b). The curve is again roughly constant, agreeing with the theoretical results.

Finally, we studied the behavior of the inverse covariance matrix estimation algorithm on three types of Gaussian graphical models:

(a) Chain-structured graphs. In this case, all nodes of the graph are arranged in a linear chain. Hence, each node (except the two end nodes) has degree k = 2. The diagonal entries of Θ are set equal to 1, and all entries corresponding to links in the chain are set equal to 0.1. Then Θ is rescaled so ||Θ||_op = 1.

(b) Star-structured graphs. In this case, all nodes are connected to a central node, which has degree k ≈ 0.1p. All other nodes have degree 1. The diagonal entries of Θ are set equal to 1, and all entries corresponding to edges in the graph are set equal to 0.1. Then Θ is rescaled so ||Θ||_op = 1.

(c) Erdős–Rényi graphs. This example comes from Rothman et al. [16]. For a sparsity parameter k ≈ log p, we randomly generate the matrix Θ by first generating the matrix B such that the diagonal entries are 0, and all other entries are independently equal to 0.5 with probability k/p, and 0 otherwise. Then δ is chosen so that Θ = B + δI has condition number p. Finally, Θ is rescaled so ||Θ||_op = 1.
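The three precision-matrix ensembles above can be generated as in the following sketch; the graph sizes are illustrative, and the closed-form choice of δ in the Erdős–Rényi construction is our own reading of the condition-number requirement:

```python
import numpy as np

rng = np.random.default_rng(4)

def chain_precision(p):
    """Chain graph: diagonal 1, adjacent links 0.1, rescaled to ||.||_op = 1."""
    T = np.eye(p)
    for i in range(p - 1):
        T[i, i + 1] = T[i + 1, i] = 0.1
    return T / np.linalg.norm(T, 2)

def star_precision(p):
    """Star graph: node 0 is the hub, linked to ~0.1p nodes with weight 0.1."""
    T = np.eye(p)
    for i in range(1, max(1, int(0.1 * p)) + 1):
        T[0, i] = T[i, 0] = 0.1
    return T / np.linalg.norm(T, 2)

def erdos_renyi_precision(p, k):
    """Rothman et al. construction: Theta = B + delta*I with cond. number p."""
    upper = np.triu(np.where(rng.random((p, p)) < k / p, 0.5, 0.0), 1)
    B = upper + upper.T  # symmetric, zero diagonal
    ev = np.linalg.eigvalsh(B)
    # Solve (ev_max + delta) / (ev_min + delta) = p for delta; the resulting
    # matrix is positive definite since ev_min + delta = (ev_max - ev_min)/(p-1).
    delta = (ev.max() - p * ev.min()) / (p - 1)
    return (B + delta * np.eye(p)) / np.linalg.norm(B + delta * np.eye(p), 2)
```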
After generating the matrix X of i.i.d. samples from the appropriate graphical model, with covariance matrix Σ_x = Θ^{-1}, we generated the corrupted matrix Z = X + W with Σ_w = (0.2)² I in the additive noise case, or the missing-data matrix Z with ρ = 0.2 in the missing data case. Panels (a) and (c) in Figure 5 show the rescaled ℓ2-error (1/√k) ||Θ̂ − Θ||_op plotted against the sample size n, for a chain-structured graph. In panels (b) and (d), we have the ℓ2-error plotted against the rescaled sample size n/(k log p). Once again, we see good agreement with the theoretical predictions. We have obtained qualitatively similar results for the star and Erdős–Rényi graphs.

FIG. 5. (a) Plots of the error ||Θ̂ − Θ||_op after running projected gradient descent on the nonconvex objective, for a chain-structured Gaussian graphical model with additive noise. As predicted by Theorems 1 and 2, all curves align when the error is rescaled by 1/√k and plotted against the ratio n/(k log p), as shown in (b). Plots (c) and (d) show the results of simulations on missing-data sets. Each point represents the average over 50 trials.
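A minimal sketch of the projected gradient descent routine used throughout these experiments, applied to the nonconvex quadratic objective over an ℓ1-ball as in program (2.4); the sort-based ℓ1 projection, all problem sizes, the step size, and the oracle choice of radius ||β*||_1 are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto {x : ||x||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_gradient(Gamma, gamma, radius, step=0.05, iters=1000):
    """Projected gradient descent on the possibly nonconvex quadratic
    0.5*b'Gamma b - gamma'b over an l1-ball, as in program (2.4)."""
    b = np.zeros(len(gamma))
    for _ in range(iters):
        b = project_l1_ball(b - step * (Gamma @ b - gamma), radius)
    return b

# Illustrative run on the additive-noise surrogates.
rng = np.random.default_rng(5)
n, p, k = 800, 64, 4
beta_star = np.zeros(p)
beta_star[:k] = 0.5
X = rng.standard_normal((n, p))
Z = X + 0.2 * rng.standard_normal((n, p))  # sigma_w = 0.2
y = X @ beta_star + 0.25 * rng.standard_normal(n)
Gamma = Z.T @ Z / n - 0.2 ** 2 * np.eye(p)
gamma = Z.T @ y / n
beta_hat = projected_gradient(Gamma, gamma, radius=np.abs(beta_star).sum())
```

Even though Γ̂ may be indefinite, the iterates stay inside the ℓ1-ball, which is the mechanism behind the geometric convergence guaranteed by Theorem 2.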
More informationBASIC STATISTICS. f(x 1,x 2,..., x n )=f(x 1 )f(x 2 ) f(x n )= f(x i ) (1)
BASIC STATISTICS. SAMPLES, RANDOM SAMPLING AND SAMPLE STATISTICS.. Radom Sample. The radom variables X,X 2,..., X are called a radom sample of size from the populatio f(x if X,X 2,..., X are mutually idepedet
More informationPSYCHOLOGICAL STATISTICS
UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION B Sc. Cousellig Psychology (0 Adm.) IV SEMESTER COMPLEMENTARY COURSE PSYCHOLOGICAL STATISTICS QUESTION BANK. Iferetial statistics is the brach of statistics
More informationTHE ABRACADABRA PROBLEM
THE ABRACADABRA PROBLEM FRANCESCO CARAVENNA Abstract. We preset a detailed solutio of Exercise E0.6 i [Wil9]: i a radom sequece of letters, draw idepedetly ad uiformly from the Eglish alphabet, the expected
More informationTrigonometric Form of a Complex Number. The Complex Plane. axis. ( 2, 1) or 2 i FIGURE 6.44. The absolute value of the complex number z a bi is
0_0605.qxd /5/05 0:45 AM Page 470 470 Chapter 6 Additioal Topics i Trigoometry 6.5 Trigoometric Form of a Complex Number What you should lear Plot complex umbers i the complex plae ad fid absolute values
More information1. C. The formula for the confidence interval for a population mean is: x t, which was
s 1. C. The formula for the cofidece iterval for a populatio mea is: x t, which was based o the sample Mea. So, x is guarateed to be i the iterval you form.. D. Use the rule : p-value
More informationUC Berkeley Department of Electrical Engineering and Computer Science. EE 126: Probablity and Random Processes. Solutions 9 Spring 2006
Exam format UC Bereley Departmet of Electrical Egieerig ad Computer Sciece EE 6: Probablity ad Radom Processes Solutios 9 Sprig 006 The secod midterm will be held o Wedesday May 7; CHECK the fial exam
More informationWHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER?
WHEN IS THE (CO)SINE OF A RATIONAL ANGLE EQUAL TO A RATIONAL NUMBER? JÖRG JAHNEL 1. My Motivatio Some Sort of a Itroductio Last term I tought Topological Groups at the Göttige Georg August Uiversity. This
More informationLesson 17 Pearson s Correlation Coefficient
Outlie Measures of Relatioships Pearso s Correlatio Coefficiet (r) -types of data -scatter plots -measure of directio -measure of stregth Computatio -covariatio of X ad Y -uique variatio i X ad Y -measurig
More informationAnalyzing Longitudinal Data from Complex Surveys Using SUDAAN
Aalyzig Logitudial Data from Complex Surveys Usig SUDAAN Darryl Creel Statistics ad Epidemiology, RTI Iteratioal, 312 Trotter Farm Drive, Rockville, MD, 20850 Abstract SUDAAN: Software for the Statistical
More informationLECTURE 13: Cross-validation
LECTURE 3: Cross-validatio Resampli methods Cross Validatio Bootstrap Bias ad variace estimatio with the Bootstrap Three-way data partitioi Itroductio to Patter Aalysis Ricardo Gutierrez-Osua Texas A&M
More informationThe following example will help us understand The Sampling Distribution of the Mean. C1 C2 C3 C4 C5 50 miles 84 miles 38 miles 120 miles 48 miles
The followig eample will help us uderstad The Samplig Distributio of the Mea Review: The populatio is the etire collectio of all idividuals or objects of iterest The sample is the portio of the populatio
More informationVladimir N. Burkov, Dmitri A. Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT
Keywords: project maagemet, resource allocatio, etwork plaig Vladimir N Burkov, Dmitri A Novikov MODELS AND METHODS OF MULTIPROJECTS MANAGEMENT The paper deals with the problems of resource allocatio betwee
More informationTHE HEIGHT OF q-binary SEARCH TREES
THE HEIGHT OF q-binary SEARCH TREES MICHAEL DRMOTA AND HELMUT PRODINGER Abstract. q biary search trees are obtaied from words, equipped with the geometric distributio istead of permutatios. The average
More informationA Recursive Formula for Moments of a Binomial Distribution
A Recursive Formula for Momets of a Biomial Distributio Árpád Béyi beyi@mathumassedu, Uiversity of Massachusetts, Amherst, MA 01003 ad Saverio M Maago smmaago@psavymil Naval Postgraduate School, Moterey,
More informationAnnuities Under Random Rates of Interest II By Abraham Zaks. Technion I.I.T. Haifa ISRAEL and Haifa University Haifa ISRAEL.
Auities Uder Radom Rates of Iterest II By Abraham Zas Techio I.I.T. Haifa ISRAEL ad Haifa Uiversity Haifa ISRAEL Departmet of Mathematics, Techio - Israel Istitute of Techology, 3000, Haifa, Israel I memory
More informationLecture 2: Karger s Min Cut Algorithm
priceto uiv. F 3 cos 5: Advaced Algorithm Desig Lecture : Karger s Mi Cut Algorithm Lecturer: Sajeev Arora Scribe:Sajeev Today s topic is simple but gorgeous: Karger s mi cut algorithm ad its extesio.
More informationChapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:
Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries
More informationDiscrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 13
EECS 70 Discrete Mathematics ad Probability Theory Sprig 2014 Aat Sahai Note 13 Itroductio At this poit, we have see eough examples that it is worth just takig stock of our model of probability ad may
More information.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth
Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,
More informationSoving Recurrence Relations
Sovig Recurrece Relatios Part 1. Homogeeous liear 2d degree relatios with costat coefficiets. Cosider the recurrece relatio ( ) T () + at ( 1) + bt ( 2) = 0 This is called a homogeeous liear 2d degree
More informationThe Stable Marriage Problem
The Stable Marriage Problem William Hut Lae Departmet of Computer Sciece ad Electrical Egieerig, West Virgiia Uiversity, Morgatow, WV William.Hut@mail.wvu.edu 1 Itroductio Imagie you are a matchmaker,
More informationCOMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S 2 CONTROL CHART FOR THE CHANGES IN A PROCESS
COMPARISON OF THE EFFICIENCY OF S-CONTROL CHART AND EWMA-S CONTROL CHART FOR THE CHANGES IN A PROCESS Supraee Lisawadi Departmet of Mathematics ad Statistics, Faculty of Sciece ad Techoology, Thammasat
More informationA Combined Continuous/Binary Genetic Algorithm for Microstrip Antenna Design
A Combied Cotiuous/Biary Geetic Algorithm for Microstrip Atea Desig Rady L. Haupt The Pesylvaia State Uiversity Applied Research Laboratory P. O. Box 30 State College, PA 16804-0030 haupt@ieee.org Abstract:
More informationTHE problem of fitting a circle to a collection of points
IEEE TRANACTION ON INTRUMENTATION AND MEAUREMENT, VOL. XX, NO. Y, MONTH 000 A Few Methods for Fittig Circles to Data Dale Umbach, Kerry N. Joes Abstract Five methods are discussed to fit circles to data.
More informationPROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM
PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics
More informationCHAPTER 3 DIGITAL CODING OF SIGNALS
CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity
More information3. Greatest Common Divisor - Least Common Multiple
3 Greatest Commo Divisor - Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd
More informationChapter 7: Confidence Interval and Sample Size
Chapter 7: Cofidece Iterval ad Sample Size Learig Objectives Upo successful completio of Chapter 7, you will be able to: Fid the cofidece iterval for the mea, proportio, ad variace. Determie the miimum
More informationhp calculators HP 12C Statistics - average and standard deviation Average and standard deviation concepts HP12C average and standard deviation
HP 1C Statistics - average ad stadard deviatio Average ad stadard deviatio cocepts HP1C average ad stadard deviatio Practice calculatig averages ad stadard deviatios with oe or two variables HP 1C Statistics
More informationDAME - Microsoft Excel add-in for solving multicriteria decision problems with scenarios Radomir Perzina 1, Jaroslav Ramik 2
Itroductio DAME - Microsoft Excel add-i for solvig multicriteria decisio problems with scearios Radomir Perzia, Jaroslav Ramik 2 Abstract. The mai goal of every ecoomic aget is to make a good decisio,
More informationProject Deliverables. CS 361, Lecture 28. Outline. Project Deliverables. Administrative. Project Comments
Project Deliverables CS 361, Lecture 28 Jared Saia Uiversity of New Mexico Each Group should tur i oe group project cosistig of: About 6-12 pages of text (ca be loger with appedix) 6-12 figures (please
More informationPresent Values, Investment Returns and Discount Rates
Preset Values, Ivestmet Returs ad Discout Rates Dimitry Midli, ASA, MAAA, PhD Presidet CDI Advisors LLC dmidli@cdiadvisors.com May 2, 203 Copyright 20, CDI Advisors LLC The cocept of preset value lies
More informationMeasures of Spread and Boxplots Discrete Math, Section 9.4
Measures of Spread ad Boxplots Discrete Math, Sectio 9.4 We start with a example: Example 1: Comparig Mea ad Media Compute the mea ad media of each data set: S 1 = {4, 6, 8, 10, 1, 14, 16} S = {4, 7, 9,
More informationTIGHT BOUNDS ON EXPECTED ORDER STATISTICS
Probability i the Egieerig ad Iformatioal Scieces, 20, 2006, 667 686+ Prited i the U+S+A+ TIGHT BOUNDS ON EXPECTED ORDER STATISTICS DIMITRIS BERTSIMAS Sloa School of Maagemet ad Operatios Research Ceter
More informationAn Efficient Polynomial Approximation of the Normal Distribution Function & Its Inverse Function
A Efficiet Polyomial Approximatio of the Normal Distributio Fuctio & Its Iverse Fuctio Wisto A. Richards, 1 Robi Atoie, * 1 Asho Sahai, ad 3 M. Raghuadh Acharya 1 Departmet of Mathematics & Computer Sciece;
More informationChair for Network Architectures and Services Institute of Informatics TU München Prof. Carle. Network Security. Chapter 2 Basics
Chair for Network Architectures ad Services Istitute of Iformatics TU Müche Prof. Carle Network Security Chapter 2 Basics 2.4 Radom Number Geeratio for Cryptographic Protocols Motivatio It is crucial to
More informationA gentle introduction to Expectation Maximization
A getle itroductio to Expectatio Maximizatio Mark Johso Brow Uiversity November 2009 1 / 15 Outlie What is Expectatio Maximizatio? Mixture models ad clusterig EM for setece topic modelig 2 / 15 Why Expectatio
More informationINVESTMENT PERFORMANCE COUNCIL (IPC)
INVESTMENT PEFOMANCE COUNCIL (IPC) INVITATION TO COMMENT: Global Ivestmet Performace Stadards (GIPS ) Guidace Statemet o Calculatio Methodology The Associatio for Ivestmet Maagemet ad esearch (AIM) seeks
More informationA Mathematical Perspective on Gambling
A Mathematical Perspective o Gamblig Molly Maxwell Abstract. This paper presets some basic topics i probability ad statistics, icludig sample spaces, probabilistic evets, expectatios, the biomial ad ormal
More informationTHIN SEQUENCES AND THE GRAM MATRIX PAMELA GORKIN, JOHN E. MCCARTHY, SANDRA POTT, AND BRETT D. WICK
THIN SEQUENCES AND THE GRAM MATRIX PAMELA GORKIN, JOHN E MCCARTHY, SANDRA POTT, AND BRETT D WICK Abstract We provide a ew proof of Volberg s Theorem characterizig thi iterpolatig sequeces as those for
More information, a Wishart distribution with n -1 degrees of freedom and scale matrix.
UMEÅ UNIVERSITET Matematisk-statistiska istitutioe Multivariat dataaalys D MSTD79 PA TENTAMEN 004-0-9 LÖSNINGSFÖRSLAG TILL TENTAMEN I MATEMATISK STATISTIK Multivariat dataaalys D, 5 poäg.. Assume that
More informationEkkehart Schlicht: Economic Surplus and Derived Demand
Ekkehart Schlicht: Ecoomic Surplus ad Derived Demad Muich Discussio Paper No. 2006-17 Departmet of Ecoomics Uiversity of Muich Volkswirtschaftliche Fakultät Ludwig-Maximilias-Uiversität Müche Olie at http://epub.ub.ui-mueche.de/940/
More informationCoordinating Principal Component Analyzers
Coordiatig Pricipal Compoet Aalyzers J.J. Verbeek ad N. Vlassis ad B. Kröse Iformatics Istitute, Uiversity of Amsterdam Kruislaa 403, 1098 SJ Amsterdam, The Netherlads Abstract. Mixtures of Pricipal Compoet
More information