# HIGH DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO. By Nicolai Meinshausen and Peter Bühlmann ETH Zürich

Save this PDF as:

Size: px
Start display at page:

Download "HIGH DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO. By Nicolai Meinshausen and Peter Bühlmann ETH Zürich"

## Transcription

1 The Annals of Statistics HIGH DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO By Nicolai Meinshausen and Peter Bühlmann ETH Zürich The pattern of zero entries in the inverse covariance matrix of a multivariate normal distriution corresponds to conditional independence restrictions etween variales. Covariance selection aims at estimating those structural zeros from data. We show that neighorhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighorhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variale selection for Gaussian linear models. We show that the proposed neighorhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighorhood estimate. Controlling instead the proaility of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the numer of variales grows like the numer of oservations raised to an aritrary power. 1. Introduction. Consider the p-dimensional multivariate normal distriuted random variale X = (X 1,..., X p ) N (µ, Σ). This includes Gaussian linear models where, for example, X 1 is the response variale and {X k ; 2 k p} are the predictor variales. Assuming that the covariance matrix Σ is non-singular, the conditional independence structure of the distriution can e conveniently represented y a graphical model G = (Γ, E), where Γ = {1,..., p} is the set of nodes and E the set of edges in Γ Γ. A pair (a, ) is contained in the edge set E if and only if X a is conditionally dependent of X, given all remaining variales X Γ\{a,} = AMS 2000 suject classifications: Primary 62J07; secondary 62H20, 62F12 Keywords and phrases: Linear regression, Covariance selection, Gaussian graphical models, Penalized regression 1

2 2 N. MEINSHAUSEN AND P. BÜHLMANN {X k ; k Γ\{a, }}. Every pair of variales not contained in the edge set is conditionally independent, given all remaining variales and corresponds to a zero entry in the inverse covariance matrix (Lauritzen, 1996). Covariance selection was introduced y Dempster (1972) and aims at discovering the conditional independence restrictions (the graph) from a set of i.i.d. oservations. Covariance selection traditionally relies on the discrete optimization of an ojective function (Lauritzen, 1996; Edwards, 2000). Exhaustive search is computationally infeasile for all ut very low-dimensional models. Usually, greedy forward or ackward search is employed. In forward search, the initial estimate of the edge set is the empty set and edges are then added iteratively until a suitale stopping criterion is fulfilled. The selection (deletion) of a single edge in this search strategy requires an MLE fit (Speed and Kiiveri, 1986) for O(p 2 ) different models. The procedure is not well suited for high-dimensional graphs. The existence of the MLE is not guaranteed in general if the numer of oservations is smaller than the numer of nodes (Buhl, 1993). More disturingly, the complexity of the procedure renders even greedy search strategies impractical for modestly sized graphs. In contrast, neighorhood selection with the Lasso, proposed in the following, relies on optimization of a convex function, applied consecutively to each node in the graph. The method is computationally very efficient and is consistent even for the high-dimensional setting, as will e shown. Neighorhood selection is a suprolem of covariance selection. The neighorhood ne a of a node a Γ is the smallest suset of Γ\{a} so that, given all variales X nea in the neighourhood, X a is conditionally independent of all remaining variales. The neighorhood of a node a Γ consists of all nodes Γ\{a} so that (a, ) E. Given n i.i.d. oservations of X, neighorhood selection aims at estimating (individually) the neighorhood of any given variale (or node). The neighorhood selection can e cast into a standard regression prolem and can e solved efficiently with the Lasso (Tishirani, 1996), as will e shown in this paper. The consistency of the proposed neighorhood selection will e shown for sparse high-dimensional graphs, where the numer of variales is potentially growing like any power of the numer of oservations (high-dimensionality) whereas the numer of neighors of any variale is growing at most slightly slower than the numer of oservations (sparsity).

3 VARIABLE SELECTION WITH THE LASSO 3 A numer of studies have examined the case of regression with a growing numer of parameters as sample size increases. The closest to our setting is the recent work of Greenshtein and Ritov (2004), who study consistent prediction in a triangular setup very similar to ours (see also Juditsky and Nemirovski, 2000). However, the prolem of consistent estimation of the model structure, which is the relevant concept for graphical models, is very different and not treated in these studies. We study in section 2 under which conditions, and at which rate, the neighorhood estimate with the Lasso converges to the true neighorhood. The choice of the penalty is crucial in the high-dimensional setting. The oracle penalty for optimal prediction turns out to e inconsistent for estimation of the true model. This solution might include an unounded numer of noise variales into the model. We motivate a different choice of the penalty such that the proaility of falsely connecting two or more distinct connectivity components of the graph is controlled at very low levels. Asymptotically, the proaility of estimating the correct neighorhood converges exponentially to 1, even when the numer of nodes in the graph is growing rapidly like any power of the numer of oservations. As a consequence, consistent estimation of the full edge set in a sparse high-dimensional graph is possile (section 3). Encouraging numerical results are provided in section 4. The proposed estimate is shown to e oth more accurate than the traditional forward selection MLE strategy and computationally much more efficient. The accuracy of the forward selection MLE fit is in particular poor if the numer of nodes in the graph is comparale to the numer of oservations. In contrast, neighorhood selection with the Lasso is shown to e reasonaly accurate for estimating graphs with several thousand nodes, using only a few hundred oservations. 2. Neighorhood Selection. Instead of assuming a fixed true underlying model, we adopt a more flexile approach similar to the triangular setup in Greenshtein and Ritov (2004). Both the numer of nodes in the graphs (numer of variales), denoted y p(n) = Γ(n), and the distriution (the covariance matrix) depend in general on the numer of oservations, so that Γ = Γ(n) and Σ = Σ(n). The neighorhood ne a of a node a Γ(n) is the smallest suset of Γ(n)\{a} so that X a is conditionally independent of all

4 4 N. MEINSHAUSEN AND P. BÜHLMANN remaining variales. Denote the closure of node a Γ(n) y cl a := ne a {a}. Then X a {X k ; k Γ(n)\cl a } X nea. For details see Lauritzen (1996). The neighorhood depends in general on n as well. However, this dependence is notationally suppressed in the following. It is instructive to give a slightly different definition of a neighorhood. For each node a Γ(n), consider optimal prediction of X a, given all remaining variales. Let θ a R p(n) e the vector of coefficients for optimal prediction, (1) θ a = arg min θ:θ a=0 E(X a k Γ(n) θ k X k ) 2. As a generalization of (1), which will e of use later, consider optimal prediction of X a, given only a suset of variales {X k ; k A}, where A Γ(n)\{a}. The optimal prediction is characterized y the vector θ a,a, (2) θ a,a = arg min E(X a θ:θ k =0, k / A k Γ(n) θ k X k ) 2. The elements of θ a are determined y the inverse covariance matrix (Lauritzen, 1996). For Γ\{a} and K(n) = Σ 1 (n), it holds that θ a = K a (n)/k aa (n). The set of non-zero coefficients of θ a is identical to the set { Γ(n)\{a} : K a (n) 0} of non-zero entries in the corresponding row vector of the inverse covariance matrix and defines precisely the set of neighors of node a. The est predictor for X a is thus a linear function of variales in the set of neighors of the node a only. The set of neighors of a node a Γ(n) can hence e written as ne a = { Γ(n) : θ a 0}. This set corresponds to the set of effective predictor variales in regression with response variale X a and predictor variales {X k ; k Γ(n) \ {a}}. Given n independent oservations of X N (0, Σ(n)), neighorhood selection tries to estimate the set of neighors of a node a Γ(n). As the optimal linear prediction of X a has non-zero coefficients precisely for variales in the set of neighors of the node a, it seems reasonale to try to exploit this relation.

5 VARIABLE SELECTION WITH THE LASSO Neighorhood selection with the Lasso. It is well known that the Lasso, introduced y Tishirani (1996), and known as Basis Pursuit in the context of wavelet regression (Chen et al., 2001), has a parsimonious property (Knight and Fu, 2000). When predicting a variale X a with all remaining variales {X k ; k Γ(n)\{a}}, the vanishing Lasso coefficient estimates identify asymptotically the neighorhood of node a in the graph, as shown in the following. Let the n p(n)-dimensional matrix X contain n independent oservations of X, so that the columns X a correspond for all a Γ(n) to the vector of n independent oservations of X a. Let, e the usual inner product on R n and 2 the corresponding norm. The Lasso estimate ˆθ a,λ of θ a is given y (3) ˆθ a,λ = arg min θ:θ a=0 (n 1 X a Xθ λ θ 1 ), where θ 1 = Γ(n) θ is the l 1 -norm of the coefficient vector. Normalization of all variales to common empirical variance is recommended for the estimator in (3). The solution to (3) is not necessarily unique. However, if uniqueness fails, the set of solutions is still convex and all our results aout neighorhoods (in particular Theorems 1 and 2) hold for any solution of (3). Other regression estimates have een proposed, which are ased on the l p -norm, where p is typically in the range [0, 2] (Frank and Friedman, 1993). A value of p = 2 leads to the ridge estimate, while p = 0 corresponds to traditional model selection. It is well known that the estimates have a parsimonious property (with some components eing exactly zero) for p 1 only, while the optimization prolem in (3) is only convex for p 1. Hence l 1 -constrained empirical risk minimization occupies a unique position, as p = 1 is the only value of p for which variale selection takes place while the optimization prolem is still convex and hence feasile for high-dimensional prolems. The neighorhood estimate (parameterized y λ) is defined y the nonzero coefficient estimates of the l 1 -penalized regression, ˆne λ a = { Γ(n) : ˆθ a,λ 0}. Each choice of a penalty parameter λ specifies thus an estimate of the neighorhood ne a of node a Γ(n) and one is left with the choice of a suitale penalty parameter. Larger values of the penalty tend to shrink the size of

6 6 N. MEINSHAUSEN AND P. BÜHLMANN the estimated set, while more variales are in general included into ˆne λ a if the value of λ is diminished The prediction-oracle solution. A seemingly useful choice of the penalty parameter is the (unavailale) prediction-oracle value, λ oracle = arg min λ E(X a k Γ(n) ˆθ a,λ k X k) 2. The expectation is understood to e with respect to a new X, which is independent of the sample on which ˆθ a,λ is estimated. The prediction-oracle penalty minimizes the predictive risk among all Lasso estimates. An estimate of λ oracle is otained y the cross-validated choice λ cv. For l 0 -penalized regression it was shown y Shao (1993) that the crossvalidated choice of the penalty parameter is consistent for model selection under certain conditions on the size of the validation set. The predictionoracle solution does not lead to consistent model selection for the Lasso, as shown in the following for a simple example. Proposition 1. Let the numer of variales grow to infinity, p(n), for n, with p(n) = o(n γ ) for some γ > 0. Assume that the covariance matrices Σ(n) are identical to the identity matrix except for some pair (a, ) Γ(n) Γ(n), for which Σ a (n) = Σ a (n) = s, for some 0 < s < 1 and all n N. The proaility of selecting the wrong neighorhood for node a converges to 1 under the prediction-oracle penalty, P ( ˆne λ oracle a ne a ) 1 for n. A proof is given in the appendix. It follows from the proof of Proposition 1 that many noise variales are included into the neighorhood estimate with the prediction-oracle solution. In fact, the proaility of including noise variales with the prediction-oracle solution does not even vanish asymptotically for a fixed numer of variales. If the penalty is chosen larger than the prediction-optimal value, consistent neighorhood selection is possile with the Lasso, as demonstrated in the following Assumptions. We make a few assumptions to prove consistency of neighorhood selection with the Lasso. We always assume availaility of n independent oservations from X N (0, Σ).

7 VARIABLE SELECTION WITH THE LASSO 7 High-dimensionality. The numer of variales is allowed to grow like the numer of oservations n raised to an aritrarily high power. Assumption 1. There exists γ > 0, so that p(n) = O(n γ ) for n. It is in particular allowed for the following analysis that the numer of variales is very much larger than the numer of oservations, p(n) n. Non-singularity. matrices. We make two regularity assumptions for the covariance Assumption 2. For all a Γ(n) and n N, V ar(x a ) = 1. There exists v 2 > 0, so that for all n N and a Γ(n), V ar(x a X Γ(n)\{a} ) v 2. Common variance can always e achieved y appropriate scaling of the variales. A scaling to common (empirical) variance of all variales is desirale, as the solutions would otherwise depend on the chosen units or dimensions in which they are represented. The second part of the assumption explicitly excludes singular or nearly singular covariance matrices. For singular covariance matrices, edges are not uniquely defined y the distriution and it is hence not surprising that nearly singular covariance matrices are not suitale for consistent variale selection. Note, however, that the empirical covariance matrix is a.s. singular if p(n) > n, which is allowed in our analysis. Sparsity. The main assumption is the sparsity of the graph. This entails a restriction on the size of the neighorhoods of variales. Assumption 3. There exists some 0 κ < 1 so that max ne a = O(n κ ) for n. a Γ(n) This assumption limits the maximal possile rate of growth for the size of neighorhoods. For the next sparsity condition, consider again the definition in (2) of the optimal coefficient θ,a for prediction of X, given variales in the set A Γ(n).

8 8 N. MEINSHAUSEN AND P. BÜHLMANN Assumption 4. There exists some ϑ < so that for all neighoring nodes a, Γ(n) and all n N, θ a,ne\{a} 1 ϑ. This assumption is e.g. fulfilled if Assumption 2 holds and the size of the overlap of neighorhoods is ounded y an aritrarily large numer from aove for neighoring nodes. That is, if there exists some m < so that for all n N, (4) max ne a ne m for n, a, Γ(n), ne a then Assumption 4 is fulfilled. To see this, note that Assumption 2 gives a finite ound for the l 2 -norm of θ a,ne \{a}, while (4) gives a finite ound for the l 0 -norm. Taken together, Assumption 4 is implied. Magnitude of partial correlations. The next assumption ounds the magnitude of partial correlations from elow. The partial correlation π a etween variales X a and X is the correlation after having eliminated the linear effects from all remaining variales {X k ; k Γ(n)\{a, }}; for details see Lauritzen (1996). Assumption 5. There exists a constant δ > 0 and some ξ > κ, with κ as in Assumption 3, so that for every (a, ) E, π a δn (1 ξ)/2. It will e shown elow that Assumption 5 cannot e relaxed in general. Note that neighorhood selection for node a Γ(n) is equivalent to simultaneously testing the null hypothesis of zero partial correlation etween variale X a and all remaining variales X, Γ(n)\{a}. The null hypothesis of zero partial correlation etween two variales can e tested y using the corresponding entry in the normalised inverse empirical covariance matrix. A graph estimate ased on such tests has een proposed y Drton and Perlman (2004). Such a test can only e applied, however, if the numer of variales is smaller than the numer of oservations, p(n) n, as the empirical covariance matrix is singular otherwise. Even if p(n) = n c for some constant c > 0, Assumption 5 would have to hold with ξ = 1 to have

9 VARIABLE SELECTION WITH THE LASSO 9 a positive power of rejecting false null hypotheses for such an estimate, that is partial correlations would have to e ounded y a positive value from elow. Neighorhood staility. The last assumption is referred to as neighorhood staility. Using the definition of θ a,a in equation (2), define for all a, Γ(n), (5) S a () := sign(θ a,nea k )θ,nea k. k ne a The assumption of neighorhood staility restricts the magnitude of the quantities S a () for non-neighoring nodes a, Γ(n). Assumption 6. / ne a, There exists some δ < 1 so that for all a, Γ(n) with S a () < δ. It is shown in Proposition 3 that this assumption cannot e relaxed. We give in the following a more intuitive condition which essentially implies Assumption 6. This will justify the term neighorhood staility. Consider the definition in equation (1) of the optimal coefficients θ a for prediction of X a, For η > 0, define θ a (η) as the optimal set of coefficients under an additional l 1 -penalty, (6) θ a (η) := arg min θ:θ a=0 E(X a k Γ(n) θ k X k ) 2 + η θ 1. The neighorhood ne a of node a was defined as the set of non-zero coefficients of θ a, ne a = {k Γ(n) : θk a 0}. Define the distured neighorhood ne a (η) as ne a (η) := {k Γ(n) : θk a (η) 0}. It clearly holds that ne a = ne a (0). The assumption of neighorhood staility is fulfilled if there exists some infinitesimal small perturation η, which may depend on n, so that the distured neighorhood ne a (η) is identical to the undistured neighorhood ne a (0). Proposition 2. If there exists some η > 0 so that ne a (η) = ne a (0), then S a () 1 for all Γ(n)\ne a.

10 10 N. MEINSHAUSEN AND P. BÜHLMANN A proof is given in the appendix. In light of Proposition 2 it seems that Assumption 6 is a very weak condition. To give one example, Assumption 6 is automatically fulfilled under the much stronger assumption that the graph does not contain cycles. We give a rief reasoning for this. Consider two non-neighoring nodes a and. If the nodes are in different connectivity components there is nothing left to show as S a () = 0. If they are in the same connectivity component, then there exists one node k ne a that separates from ne a \{k}, as there is just one unique path etween any two variales in the same connectivity component if the graph does not contain cycles. Using the gloal Markov property, the random variale X is independent of X nea\{k}, given X k. The random variale E(X X nea ) is thus a function of X k only. As the distriution is Gaussian, E(X X nea ) = θ,nea k X k. By Assumption 2, V ar(x X nea ) = v 2 for some v 2 > 0. It follows that V ar(x ) = v 2 + (θ,nea k ) 2 = 1 and hence = 1 v 2 < 1, which implies that Assumption 6 is indeed fulfilled if θ,nea k the graph does not contain cycles. We mention that Assumption 6 is likewise fulfilled if the inverse covariance matrices Σ 1 (n) are for each n N diagonally dominant. A matrix is said to e diagonally dominant if and only if, for each row, the sum of the asolute values of the non-diagonal elements is smaller than the asolute value of the diagonal element. The proof of this is straightforward ut tedious and hence omitted Controlling type I errors. The asymptotic properties of Lasso-type estimates in regression have een studied in detail y Knight and Fu (2000) for a fixed numer of variales. Their results say that the penalty parameter λ should decay for an increasing numer of oservations at least as fast as n 1/2 to otain a n 1/2 -consistent estimate. It turns out that a slower rate is needed for consistent model selection in the high-dimensional case where p(n) n. However, a rate n (1 ε)/2 with any κ < ε < ξ (where κ, ξ are defined as in Assumptions 3 and 5) is sufficient for consistent neighorhood selection, even when the numer of variales is growing rapidly with the numer of oservations. Theorem 1. Let Assumptions e fulfilled. Let the penalty parameter satisfy λ n dn (1 ε)/2 with some κ < ε < ξ and d > 0. There

11 VARIABLE SELECTION WITH THE LASSO 11 exists some c > 0 so that, for all a Γ(n), P ( ˆne λ a ne a ) = 1 O(exp( cn ε )) for n. A proof is given in the appendix. Theorem 1 states that the proaility of (falsely) including any of the nonneighoring variales of the node a Γ(n) into the neighorhood estimate is vanishing exponentially fast, even though the numer of non-neighoring variales may grow very rapidly with the numer of oservations. It is shown in the following that Assumption 6 cannot e relaxed. Proposition 3. If there exists some a, Γ(n) with / ne a and S a () > 1, then, for λ = λ n as in Theorem 1, P ( ˆne λ a ne a ) 0 for n. A proof is given in the appendix. Assumption 6 of neighorhood staility is hence critical for the success of Lasso neighorhood selection Controlling type II errors. So far it was shown that the proaility of falsely including variales into the neighorhood can e controlled y the Lasso. The question arises if the proaility of including all neighoring variales into the neighorhood estimate is converging to 1 for n. Theorem 2. Let the Assumptions of Theorem 1 e fulfilled. For λ = λ n as in Theorem 1, it holds for some c > 0 that P (ne a ˆne λ a) = 1 O(exp( cn ε )) for n. A proof is given in the appendix. It may e of interest whether Assumption 5 could e relaxed, so that edges are detected even if the partial correlation is vanishing at a rate n (1 ξ)/2 for some ξ < κ. The following proposition says that ξ > ε (and thus ξ > κ as ε > κ) is a necessary condition if a stronger version of Assumption 4 holds, which is fulfilled for forests and trees, for example. Proposition 4. Let the Assumption of Theorem 1 e fulfilled with ϑ < 1 in Assumption 4. For a Γ(n), let there e some Γ(n)\{a} with π a 0 and π a = O(n (1 ξ)/2 ) for n for some ξ < ε. Then P ( ˆne λ a) 0 for n.

12 12 N. MEINSHAUSEN AND P. BÜHLMANN Theorem 2 and Proposition 4 say that edges etween nodes which partial correlation vanishes at a rate n (1 ξ)/2 are, with proaility converging to 1 for n, detected if ξ > ε and are undetected if ξ < ε. The results do not cover the case ξ = ε, which remains a challenging question for further research. All results so far have treated the distinction etween zero and non-zero partial correlations only. The signs of partial correlations of neighoring nodes can e estimated consistently under the same assumptions and with the same rates, as can e seen in the proofs. 3. Covariance Selection. It follows from section 2 that it is possile under certain conditions to estimate the neighorhood of each node in the graph consistently, e.g. P ( ˆne λ a = ne a ) 1 for n. The full graph is given y the set Γ(n) of nodes and the edge set E = E(n). The edge set contains those pairs (a, ) Γ(n) Γ(n), for which the partial correlation etween X a and X is not zero. As the partial correlations are precisely non-zero for neighors, the edge set E Γ(n) Γ(n) is given y E = {(a, ) : a ne ne a }. The first condition, a ne, implies in fact the second, ne a, and vice versa, so that the edge is as well identical to {(a, ) : a ne ne a }. For an estimate of the edge set of a graph, we can apply neighorhood selection to each node in the graph. A natural estimate of the edge set is then given y Êλ, Γ(n) Γ(n), where (7) Ê λ, = {(a, ) : a ˆne λ ˆneλ a}. Note that a ˆne λ does not necessarily imply ˆneλ a and vice versa. We can hence also define a second, less conservative, estimate of the edge set y (8) Ê λ, = {(a, ) : a ˆne λ ˆneλ a}. The discrepancies etween the estimates (7) and (8) are quite small in our experience. Asymptotically the difference etween oth estimates vanishes,

13 VARIABLE SELECTION WITH THE LASSO 13 as seen in the following corollary. We refer to oth edge set estimates collectively with the generic notation Êλ, as the following result holds true for oth of them. Corollary 1. Under the conditions of Theorem 2, for some c > 0, P (Êλ = E) = 1 O(exp( cn ε )) for n. The claim follows since Γ(n) 2 = p(n) 2 = O(n 2γ ) y Assumption 1 and neighorhood selection has an exponentially fast convergence rate as descried y Theorem 2. Corollary 1 says that the conditional independence structure of a multivariate normal distriution can e estimated consistently y comining the neighorhood estimates for all variales. Note that there are in total 2 (p2 p)/2 distinct graphs for a p-dimensional variale. However, for each of the p nodes there are only 2 p 1 distinct potential neighorhoods. By raking the graph selection prolem into a consecutive series of neighorhood selection prolems, the complexity of the search is thus reduced sustantially at the price of potential inconsistencies etween neighorhood estimates. Graph estimates that apply this strategy for complexity reduction are sometimes called dependency networks (Heckerman et al., 2000). The complexity of the proposed neighorhood selection for one node with the Lasso is reduced further to O(np min{n, p}), as the Lars procedure of (Efron et al., 2004) requires O(min{n, p}) steps, each of complexity O(np). For high-dimensional prolems as in Theorems 1 and 2, where the numer of variales grows like p(n) cn γ for some c > 0 and γ > 1, this is equivalent to O(p 2+2/γ ) computations for the whole graph. The complexity of the proposed method scales thus approximately quadratic with the numer of nodes for large values of γ. Before providing some numerical results, we discuss in the following the choice of the penalty parameter. Finite sample results and significance. It was shown aove that consistent neighorhood and covariance selection is possile with the Lasso in a high-dimensional setting. However, the asymptotic considerations give little advice on how to choose a specific penalty parameter for a given prolem. Ideally, one would like to guarantee that pairs of variales which are not contained in the edge set enter the estimate of the edge set only with very low

14 14 N. MEINSHAUSEN AND P. BÜHLMANN (pre-specified) proaility. Unfortunately, it seems very difficult to otain such a result as the proaility of falsely including a pair of variales into the estimate of the edge set depends on the exact covariance matrix, which is in general unknown. It is possile, however, to constrain the proaility of (falsely) connecting two distinct connectivity components of the true graph. The connectivity component C a Γ(n) of a node a Γ(n) is the set of nodes which are connected to node a y a chain of edges. The neighorhood ne a is clearly part of the connectivity component C a. Let Ĉλ a e the connectivity component of a in the estimated graph (Γ, Êλ ). For any level 0 < α < 1, consider the choice of the penalty (9) λ(α) = 2ˆσ a Φ 1 α ( n 2p(n) 2 ), where Φ = 1 Φ (Φ the c.d.f. of N (0, 1)) and ˆσ a 2 = n 1 X a, X a. The proaility of falsely joining two distinct connectivity components with the estimate of the edge set is ounded y the level α under the choice λ = λ(α) of the penalty parameter, as shown in the following theorem. Theorem 3. Let Assumptions 1-6 e fulfilled. Using penalty parameter λ(α), it holds for all n N that P ( a Γ(n) : Ĉλ a C a ) α. A proof is given in the appendix. This implies that if the edge set is empty (E = ) it is estimated y an empty set with high proaility, P (Êλ = ) 1 α. Theorem 3 is a finite sample result. The previous asymptotic results in Theorem 1 and 2 hold true if the level α is vanishing exponentially to zero for an increasing numer of oservations, leading to consistent edge set estimation. 4. Numerical examples. We use oth the Lasso estimate from section 3 and forward selection MLE (Lauritzen, 1996; Edwards, 2000) to estimate sparse graphs. We found it difficult to compare numerically neighorhood selection with forward selection MLE for more than thirty nodes

15 VARIABLE SELECTION WITH THE LASSO 15 Figure 1: A realization of a graph is shown on the left, generated as descried in the text. The graph consists of 1000 nodes and 1747 edges out of distinct pairs of variales. The estimated edge set, using estimate (7) at level α =.05, see equation (9), is shown in the middle. There are 2 erroneously included edges, marked y an arrow, while 1109 edges are correctly detected. For estimate (8) and an adjusted level as descried in the text, the result is shown on the right. Again two edges are erroneously included. Not a single pair of disjoint connectivity components of the true graph has een (falsely) joined y either estimate. in the graph. The high computational complexity of the forward selection MLE made the computations for such relatively low-dimensional prolems very costly already. The Lasso scheme in contrast handled with ease graphs with more than 1000 nodes, using the recent algorithm developed in Efron et al. (2004). Where comparison was feasile, the performance of the neighorhood selection scheme was etter. The difference was particularly pronounced if the ratio etween oservations to variales was low, as can e seen in Tale 1, which will e descried in more detail elow. First we give an account of the generation of the underlying graphs, which we are trying to estimate. A realization of an underlying (random) graph is given in the left panel of Figure 1. The nodes of the graph are associated with a spatial location and the location of each node is distriuted identically and uniformly on the two-dimensional square [0, 1] 2. Every pair of nodes is included initially in the edge set with a proaility of ϕ(d/ p), where d is the Euclidean distance etween the pair of variales and ϕ the density of the standard normal distriution. The maximum numer of edges

16 16 N. MEINSHAUSEN AND P. BÜHLMANN connecting to each node is limited to four to achieve the desired sparsity of the graph. Edges which connect to nodes which do not fulfill this constraint are removed randomly until the constraint is fulfilled for all edges. Initially all variales have identical conditional variance and the partial correlation etween neighors is set to (asolute values less than 0.25 guarantee positive definiteness of the inverse covariance matrix), that is Σ 1 aa all nodes a Γ, Σ 1 a = 1 for = if there is an edge connecting a and, Σ 1 a = 0 otherwise. The diagonal elements of the corresponding covariance matrix are in general larger than 1. To achieve constant variance, all variales are finally rescaled so that the diagonal elements of Σ are all unity. Using the Cholesky transformation of the covariance matrix, n independent samples are drawn from the corresponding Gaussian distriution. The average numer of edges which are correctly included into the estimate of the edge set is shown in Tale 1 as a function of the numer of edges which are falsely included. The accuracy of the forward selection MLE is comparale to the proposed Lasso neighorhood selection if the numer of nodes is much smaller than the numer of oservations. The accuracy of the forward selection MLE reaks down, however, if the numer of nodes is approximately equal to the numer of oservations. Forward selection MLE is only marginally etter than random guessing in this case. Computation of the forward selection MLE (using MIM, Edwards, 2000) took on the same desktop up to several hundred times longer than the Lasso neighorhood selection for the full graph. For more than 30 nodes, the differences are even more pronounced. The Lasso neighorhood selection can e applied to hundred- or thousanddimensional graphs, a realistic size for e.g. iological networks. A graph with 1000 nodes (following the same model as descried aove) and its estimates (7) and (8), using 600 oservations, are shown in Figure 1. A level α =.05 is used for estimate Êλ,. For etter comparison, the level α was adjusted to α =.064 for the estimate Êλ,, so that oth estimates lead to the same numer of included edges. There are two erroneous edge inclusions, while 1109 out of all 1747 edges have een correctly identified y either estimate. Of these 1109 edges, 907 are common to oth estimates while 202 are just present in either (7) or (8). To examine if results are critically dependent on the assumption of Gaus-

17 VARIABLE SELECTION WITH THE LASSO 17 Tale 1: The average numer of correctly identified edges as a function of the numer k of falsely included edges for n = 40 oservations and p = 10, 20, 30 nodes for forward selection MLE (FS), Êλ,, Êλ, and random guessing. p = 10 p = 20 p = 30 k random FS Ê λ, Ê λ, sianity, long-tailed noise is added to the oservations. Instead of n i.i.d. oservations of X N (0, Σ), n i.i.d oservations of X + 0.1Z are made, where the components of Z are independent and follow a t 2 -distriution. For 10 simulations (with each 500 oservations), the proportion of false rejections among all rejections is increasing only slightly from 0.8% (without long-tailed noise) to 1.4% (with long tailed-noise) for Êλ, and from 4.8% to 5.2% for Êλ,. Our limited numerical experience suggests that the properties of the graph estimator do not seem to e critically affected y deviations from Gaussianity. 5. Proofs Notation and useful lemmas. As a generalization of (3), the Lasso estimate ˆθ a,a,λ of θ a,a, defined in (2), is given y (10) ˆθ a,a,λ = arg min θ:θ k =0 k / A (n 1 X a Xθ λ θ 1 ). The notation ˆθ a,λ is thus just a shorthand notation for ˆθ a,γ(n)\{a},λ. Lemma 1. elements Given θ R p(n), let G(θ) e a p(n)-dimensional vector with G (θ) = 2n 1 X a Xθ, X. A vector ˆθ with ˆθ k = 0, k Γ(n)\A is a solution to (10) iff for all A, G (ˆθ) = sign(ˆθ )λ in case ˆθ 0 and G (ˆθ) λ in case ˆθ = 0. Moreover,

18 18 N. MEINSHAUSEN AND P. BÜHLMANN if the solution is not unique and G (ˆθ) < λ for some solution ˆθ, then ˆθ = 0 for all solutions of (10). Proof. Denote the sudifferential of n 1 X a Xθ λ θ 1 with respect to θ y D(θ). The vector ˆθ is a solution to (10) iff there exists an element d D(ˆθ) so that d = 0, A. D(θ) is given y {G(θ)+λe, e S}, where S R p(n) is given y S := {e R p(n) : e = sign(θ ) if θ 0 and e [ 1, 1] if θ = 0}. The first part of the claim follows. The second part follows from the proof of Theorem 3.1. in Osorne et al. (2000). Lemma 2. Let ˆθ a,nea,λ e defined for every a Γ(n) as in (10). Under the assumptions of Theorem 1, it holds for some c > 0 that for all a Γ(n), P (sign(ˆθ a,nea,λ ) = sign(θ a ), ne a) = 1 O(exp( cn ε )) for n For the sign-function, it is understood that sign(0) = 0. The Lemma says, in other words, that if one could restrict the Lasso estimate to have zero coefficients for all nodes which are not in the neighorhood of node a, then the signs of the partial correlations in the neighorhood of node a are estimated consistently under the made assumptions. Proof. Using Bonferroni s inequality, and ne a = o(n) for n, it suffices to show that there exists some c > 0 so that for for every a, Γ(n) with ne a, P (sign(ˆθ a,nea,λ ) = sign(θ a )) = 1 O(exp( cnε )) for n Consider the definition of ˆθ a,nea,λ in (10), (11) ˆθ a,nea,λ = arg min (n 1 X a Xθ λ θ 1 ). θ:θ k =0 k / ne a Assume now that component of this estimate is fixed at a constant value β. Denote this new estimate y θ a,,λ (β), (12) θ a,,λ (β) = arg min θ Θ a, (β) (n 1 X a Xθ λ θ 1 ),

19 where VARIABLE SELECTION WITH THE LASSO 19 Θ a, (β) := {θ R p(n) : θ = β; θ k = 0, k / ne a }. a,nea,λ There exists always a value β (namely β = ˆθ ) so that θ a,,λ (β) is identical to ˆθ a,nea,λ. Thus, if sign(ˆθ a,nea,λ ) sign(θ a ), there would exist some β with sign(β)sign(θ a) 0 so that θ a,,λ (β) would e a solution to (11). Using sign(θ a) 0 for all ne a, it is thus sufficient to show that for every β with sign(β)sign(θ a) < 0, θ a,,λ (β) cannot e a solution to (11) with high proaility. We focus in the following on the case where θ a > 0 for notational simplicity. The case θ a < 0 follows analogously. If θa > 0, it follows y Lemma 1 that θ a,,λ a,,λ (β) with θ (β) = β 0 can only e a solution to (11) if G ( θ a,,λ (β)) λ. Hence it suffices to show that for some c > 0 and all ne a with θ a > 0, for n, (13) P (sup{g ( θ a,,λ (β))} < λ) = 1 O(exp( cn ε )). β 0 Let in the following R λ a(β) e the n-dimensional vector of residuals, (14) R λ a(β) := X a X θ a,,λ (β). We can write X as (15) X = k ne a\{} θ,nea\{} k X k + W, where W is independent of {X k ; k ne a \{}}. By straightforward calculation, using (15), G ( θ a,,λ (β)) = 2n 1 R λ a(β), W θ,nea\{} k (2n 1 R λ a(β), X k ). k ne a\{} By Lemma 1, for all k ne a \{}, G k ( θ a,,λ (β)) = 2n 1 R λ a(β), X k λ. This together with the equation aove yields (16) G ( θ a,,λ (β)) 2n 1 R λ a(β), W + λ θ,nea\{} 1. Using Assumption 4, there exists some ϑ <, so that θ,nea\{} 1 ϑ. For proving (13) it is therefore sufficient to show that there exists for every g > 0 some c > 0 so that it holds for all ne a with θ a > 0, for n, (17) P ( inf β 0 {2n 1 R λ a(β), W } > gλ) = 1 O(exp( cn ε )).

20 20 N. MEINSHAUSEN AND P. BÜHLMANN With a little ause of notation, let W R n e the, at most ( ne a 1)- dimensional, space which is spanned y the vectors {X k, k ne a \{}} and let W e the orthogonal complement of W in R n. Split the n-dimensional vector W of oservations of W into the sum of two vectors (18) W = W + W, where W is contained in the space W R n, while the remaining part W is chosen orthogonal to this space (in the orthogonal complement W of W ). The inner product in (17) can e written as (19) 2n 1 R λ a(β), W = 2n 1 R λ a(β), W + 2n 1 R λ a(β), W. By Lemma 3 (see elow), there exists for every g > 0 some c > 0 so that, for n, P ( inf β 0 {2n 1 R λ a(β), W /(1 + β )} > gλ) = 1 O(exp( cnε )) To show (17), it is sufficient to prove that there exists for every g > 0 some c > 0 so that, for n, (20) P ( inf β 0 {2n 1 R λ a(β), W g(1 + β )λ} > gλ) = 1 O(exp( cnε )), It holds for some random variale V a, independent of X nea, that X a = k ne a θ a k X k + V a. Note that V a and W are independent normally distriuted random variales with variances σa 2 and σ 2 respectively. By Assumption 2, 0 < v2 σ 2, σ2 a 1. Note furthermore that W and X nea\{} are independent. Using θ a = θ a,nea and (15), (21) X a = (θk a + θa θ,nea\{} k )X k + θ a W + V a. k ne a\{} Using (21), the definition of the residuals in (14), and the orthogonality property of W, 2n 1 R λ a(β), W = 2n 1 (θ a β) W, W + 2n 1 V a, W, (22) 2n 1 (θ a β) W, W 2n 1 V a, W.

### Subspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity

Subspace Pursuit for Compressive Sensing: Closing the Gap Between Performance and Complexity Wei Dai and Olgica Milenkovic Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign

### THE PROBLEM OF finding localized energy solutions

600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Re-weighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,

### Fast Solution of l 1 -norm Minimization Problems When the Solution May be Sparse

Fast Solution of l 1 -norm Minimization Problems When the Solution May be Sparse David L. Donoho and Yaakov Tsaig October 6 Abstract The minimum l 1 -norm solution to an underdetermined system of linear

### Some Sharp Performance Bounds for Least Squares Regression with L 1 Regularization

Some Sharp Performance Bounds for Least Squares Regression with L 1 Regularization Tong Zhang Statistics Department Rutgers University, NJ tzhang@stat.rutgers.edu Abstract We derive sharp performance bounds

### IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289. Compressed Sensing. David L. Donoho, Member, IEEE

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006 1289 Compressed Sensing David L. Donoho, Member, IEEE Abstract Suppose is an unknown vector in (a digital image or signal); we plan to

### Decoding by Linear Programming

Decoding by Linear Programming Emmanuel Candes and Terence Tao Applied and Computational Mathematics, Caltech, Pasadena, CA 91125 Department of Mathematics, University of California, Los Angeles, CA 90095

### From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images

SIAM REVIEW Vol. 51,No. 1,pp. 34 81 c 2009 Society for Industrial and Applied Mathematics From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images Alfred M. Bruckstein David

### COSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES

COSAMP: ITERATIVE SIGNAL RECOVERY FROM INCOMPLETE AND INACCURATE SAMPLES D NEEDELL AND J A TROPP Abstract Compressive sampling offers a new paradigm for acquiring signals that are compressible with respect

### Indexing by Latent Semantic Analysis. Scott Deerwester Graduate Library School University of Chicago Chicago, IL 60637

Indexing by Latent Semantic Analysis Scott Deerwester Graduate Library School University of Chicago Chicago, IL 60637 Susan T. Dumais George W. Furnas Thomas K. Landauer Bell Communications Research 435

### Orthogonal Bases and the QR Algorithm

Orthogonal Bases and the QR Algorithm Orthogonal Bases by Peter J Olver University of Minnesota Throughout, we work in the Euclidean vector space V = R n, the space of column vectors with n real entries

### Foundations of Data Science 1

Foundations of Data Science John Hopcroft Ravindran Kannan Version /4/204 These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes

### Structured Variable Selection with Sparsity-Inducing Norms

Journal of Machine Learning Research 1 (011) 777-84 Submitted 9/09; Revised 3/10; Published 10/11 Structured Variable Selection with Sparsity-Inducing Norms Rodolphe Jenatton INRIA - SIERRA Project-team,

### Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

Seventh IEEE International Conference on Data Mining Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Robert M. Bell and Yehuda Koren AT&T Labs Research 180 Park

### An Introduction to Variable and Feature Selection

Journal of Machine Learning Research 3 (23) 1157-1182 Submitted 11/2; Published 3/3 An Introduction to Variable and Feature Selection Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 9478-151, USA

### Sketching as a Tool for Numerical Linear Algebra

Foundations and Trends R in Theoretical Computer Science Vol. 10, No. 1-2 (2014) 1 157 c 2014 D. P. Woodruff DOI: 10.1561/0400000060 Sketching as a Tool for Numerical Linear Algebra David P. Woodruff IBM

### If You re So Smart, Why Aren t You Rich? Belief Selection in Complete and Incomplete Markets

If You re So Smart, Why Aren t You Rich? Belief Selection in Complete and Incomplete Markets Lawrence Blume and David Easley Department of Economics Cornell University July 2002 Today: June 24, 2004 The

### Optimization with Sparsity-Inducing Penalties. Contents

Foundations and Trends R in Machine Learning Vol. 4, No. 1 (2011) 1 106 c 2012 F. Bach, R. Jenatton, J. Mairal and G. Obozinski DOI: 10.1561/2200000015 Optimization with Sparsity-Inducing Penalties By

### High-Rate Codes That Are Linear in Space and Time

1804 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 7, JULY 2002 High-Rate Codes That Are Linear in Space and Time Babak Hassibi and Bertrand M Hochwald Abstract Multiple-antenna systems that operate

### Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation

Statistical Inference in Two-Stage Online Controlled Experiments with Treatment Selection and Validation Alex Deng Microsoft One Microsoft Way Redmond, WA 98052 alexdeng@microsoft.com Tianxi Li Department

### Generalized compact knapsacks, cyclic lattices, and efficient one-way functions

Generalized compact knapsacks, cyclic lattices, and efficient one-way functions Daniele Micciancio University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0404, USA daniele@cs.ucsd.edu

### ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION. A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT

ON THE DISTRIBUTION OF SPACINGS BETWEEN ZEROS OF THE ZETA FUNCTION A. M. Odlyzko AT&T Bell Laboratories Murray Hill, New Jersey ABSTRACT A numerical study of the distribution of spacings between zeros

### EVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NON-LINEAR REGRESSION. Carl Edward Rasmussen

EVALUATION OF GAUSSIAN PROCESSES AND OTHER METHODS FOR NON-LINEAR REGRESSION Carl Edward Rasmussen A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate

### Bayesian Models of Graphs, Arrays and Other Exchangeable Random Structures

Bayesian Models of Graphs, Arrays and Other Exchangeable Random Structures Peter Orbanz and Daniel M. Roy Abstract. The natural habitat of most Bayesian methods is data represented by exchangeable sequences

### RECENTLY, there has been a great deal of interest in

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 47, NO. 1, JANUARY 1999 187 An Affine Scaling Methodology for Best Basis Selection Bhaskar D. Rao, Senior Member, IEEE, Kenneth Kreutz-Delgado, Senior Member,

### How to Use Expert Advice

NICOLÒ CESA-BIANCHI Università di Milano, Milan, Italy YOAV FREUND AT&T Labs, Florham Park, New Jersey DAVID HAUSSLER AND DAVID P. HELMBOLD University of California, Santa Cruz, Santa Cruz, California

### Object Detection with Discriminatively Trained Part Based Models

1 Object Detection with Discriminatively Trained Part Based Models Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan Abstract We describe an object detection system based on mixtures

### From Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose. Abstract

To Appear in the IEEE Trans. on Pattern Analysis and Machine Intelligence From Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose Athinodoros S. Georghiades Peter

### Probability in High Dimension

Ramon van Handel Probability in High Dimension ORF 570 Lecture Notes Princeton University This version: June 30, 2014 Preface These lecture notes were written for the course ORF 570: Probability in High