The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

Journal of Machine Learning Research 10 (2009) 2295-2328. Submitted 3/09; Revised 5/09; Published 10/09

Han Liu (HANLIU@CS.CMU.EDU)
John Lafferty (LAFFERTY@CS.CMU.EDU)
Larry Wasserman (LARRY@STAT.CMU.EDU)
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213, USA

Editor: Martin J. Wainwright

© 2009 Han Liu, John Lafferty and Larry Wasserman.

Abstract

Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula, or "nonparanormal", for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method's theoretical properties, and show that it works well in many examples.

Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, $\ell_1$ regularization, graphical lasso, paranormal, occult

1. Introduction

The linear model is a mainstay of statistical inference that has been extended in several important ways. An extension to high dimensions was achieved by adding a sparsity constraint, leading to the lasso (Tibshirani, 1996). An extension to nonparametric models was achieved by replacing linear functions with smooth functions, leading to additive models (Hastie and Tibshirani, 1999). These two ideas were recently combined, leading to an extension called sparse additive models (SpAM) (Ravikumar et al., 2008, 2009a). In this paper we consider a similar nonparametric extension of undirected graphical models based on multivariate Gaussian distributions in the high dimensional setting. Specifically, we use a high dimensional Gaussian copula with nonparametric marginals, which we refer to as a nonparanormal distribution.

If $X$ is a $p$-dimensional random vector distributed according to a multivariate Gaussian distribution with covariance matrix $\Sigma$, the conditional independence relations between the random variables $X_1, X_2, \ldots, X_p$ are encoded in a graph formed from the precision matrix $\Omega = \Sigma^{-1}$. Specifically, missing edges in the graph correspond to zeroes of $\Omega$. To estimate the graph from a sample of size $n$, it is only necessary to estimate $\Sigma$, which is easy if $n$ is much larger than $p$. However, when $p$ is larger than $n$, the problem is more challenging. Recent work has focused on the problem of estimating the graph in this high dimensional setting, which becomes feasible if $G$ is sparse. Yuan and Lin (2007) and Banerjee et al. (2008) propose an estimator based on regularized maximum likelihood using an $\ell_1$ constraint on the entries of $\Omega$, and Friedman et al. (2007) develop an efficient algorithm for computing the estimator using a graphical version of the lasso. The resulting estimation procedure has excellent theoretical properties, as shown recently by Rothman et al. (2008) and Ravikumar et al. (2009b).

While Gaussian graphical models can be useful, a reliance on exact normality is limiting. Our goal in this paper is to weaken this assumption. Our approach parallels the ideas behind sparse additive models for regression (Ravikumar et al., 2008, 2009a). Specifically, we replace the Gaussian with a semiparametric Gaussian copula. This means that we replace the random variable $X = (X_1, \ldots, X_p)$ by the transformed random variable $f(X) = (f_1(X_1), \ldots, f_p(X_p))$, and assume that $f(X)$ is multivariate Gaussian. This semiparametric copula results in a nonparametric extension of the normal that we call the nonparanormal distribution. The nonparanormal depends on the functions $\{f_j\}$, and a mean $\mu$ and covariance matrix $\Sigma$, all of which are to be estimated from data. While the resulting family of distributions is much richer than the standard parametric normal (the paranormal), the independence relations among the variables are still encoded in the precision matrix $\Omega = \Sigma^{-1}$. We propose a nonparametric estimator for the functions $\{f_j\}$, and show how the graphical lasso can be used to estimate the graph in the high dimensional setting.

Assumptions     Dimension   Regression              Graphical Models
parametric      low         linear model            multivariate normal
parametric      high        lasso                   graphical lasso
nonparametric   low         additive model          nonparanormal
nonparametric   high        sparse additive model   $\ell_1$-regularized nonparanormal

Figure 1: Comparison of regression and graphical models. The nonparanormal extends additive models to the graphical model setting. Regularizing the inverse covariance leads to an extension to high dimensions, which parallels sparse additive models for regression.
The relationship between linear regression models, Gaussian graphical models, and their extensions to nonparametric and high dimensional models is summarized in Figure 1.

Most theoretical results on semiparametric copulas focus on low or at least finite dimensional models (Klaassen and Wellner, 1997; Tsukahara, 2005). Models with increasing dimension require a more delicate analysis; in particular, simply plugging in the usual empirical distribution of the marginals does not lead to accurate inference. Instead we use a truncated empirical distribution. We give a theoretical analysis of this estimator, proving consistency results with respect to risk, model selection, and estimation of $\Omega$ in the Frobenius norm.

In the following section we review the basic notion of the graph corresponding to a multivariate Gaussian, and formulate different criteria for evaluating estimators of the covariance or inverse covariance. In Section 3 we present the nonparanormal, and in Section 4 we discuss estimation of the model. We present a theoretical analysis of the estimation method in Section 5, with the detailed proofs collected in an appendix. In Section 6 we present experiments with both simulated data and gene microarray data, where the problem is to construct the isoprenoid biosynthetic pathway.

2. Estimating Undirected Graphs

Let $X = (X_1, \ldots, X_p)$ denote a random vector with distribution $P = N(\mu, \Sigma)$. The undirected graph $G = (V, E)$ corresponding to $P$ consists of a vertex set $V$ and an edge set $E$. The set $V$ has $p$ elements, one for each component of $X$. The edge set $E$ consists of ordered pairs $(i, j)$ where $(i, j) \in E$ if there is an edge between $X_i$ and $X_j$. The edge between $(i, j)$ is excluded from $E$ if and only if $X_i$ is independent of $X_j$ given the other variables $X_{\setminus\{i,j\}} \equiv (X_s : 1 \le s \le p,\ s \ne i, j)$, written

$$X_i \perp\!\!\!\perp X_j \mid X_{\setminus\{i,j\}}. \tag{1}$$

It is well known that, for multivariate Gaussian distributions, (1) holds if and only if $\Omega_{ij} = 0$ where $\Omega = \Sigma^{-1}$.

Let $X^{(1)}, X^{(2)}, \ldots, X^{(n)}$ be a random sample from $P$, where $X^{(i)} \in \mathbb{R}^p$. If $n$ is much larger than $p$, then we can estimate $\Sigma$ using maximum likelihood, leading to the estimate $\hat\Omega = S^{-1}$, where

$$S = \frac{1}{n} \sum_{i=1}^{n} \left(X^{(i)} - \bar X\right)\left(X^{(i)} - \bar X\right)^T$$

is the sample covariance, with $\bar X$ the sample mean. The zeroes of $\Omega$ can then be estimated by applying hypothesis testing to $\hat\Omega$ (Drton and Perlman, 2007, 2008). When $p > n$, maximum likelihood is no longer useful; in particular, the estimate $\hat\Sigma$ is not positive definite, having rank no greater than $n$. Inspired by the success of the lasso for linear models, several authors have suggested estimating $\Sigma$ by minimizing

$$-\ell(\Omega) + \lambda \sum_{j \ne k} |\Omega_{jk}|$$

where

$$\ell(\Omega) = \frac{1}{2}\left(\log|\Omega| - \mathrm{tr}(\Omega S) - p \log(2\pi)\right)$$

is the log-likelihood with $S$ the sample covariance matrix. The estimator $\hat\Omega$ can be computed efficiently using the glasso algorithm (Friedman et al., 2007), which is a block coordinate descent algorithm that uses the standard lasso to estimate a single row and column of $\Omega$ in each iteration. Under appropriate sparsity conditions, the resulting estimator $\hat\Omega$ has been shown to have good theoretical properties (Rothman et al., 2008; Ravikumar et al., 2009b).

There are several different ways to judge the quality of an estimator $\hat\Sigma$ of the covariance or $\hat\Omega$ of the inverse covariance. We discuss three in this paper: persistency, norm consistency, and sparsistency. Persistency means consistency in risk, when the model is not necessarily assumed to be correct.
Suppose the true distribution $P$ has mean $\mu_0$, and that we use a multivariate normal $p(x; \mu_0, \Sigma)$ for prediction; we do not assume that $P$ is normal. We observe a new vector $X \sim P$ and define the prediction risk to be

$$R(\Sigma) = -E \log p(X; \mu_0, \Sigma) = -\int \log p(x; \mu_0, \Sigma)\, dP(x).$$

It follows that

$$R(\Sigma) = \frac{1}{2}\left\{ \mathrm{tr}\left(\Sigma^{-1}\Sigma_0\right) + \log|\Sigma| - p\log(2\pi) \right\}$$

where $\Sigma_0$ is the covariance of $X$ under $P$. If $\mathcal{S}$ is a set of covariance matrices, the oracle is defined to be the covariance matrix $\Sigma_*$ that minimizes $R(\Sigma)$ over $\mathcal{S}$:

$$\Sigma_* = \arg\min_{\Sigma \in \mathcal{S}} R(\Sigma).$$

Thus $p(x; \mu_0, \Sigma_*)$ is the best predictor of a new observation among all distributions in $\{p(x; \mu_0, \Sigma) : \Sigma \in \mathcal{S}\}$. In particular, if $\mathcal{S}$ consists of covariance matrices with sparse graphs, then $p(x; \mu_0, \Sigma_*)$ is, in some sense, the best sparse predictor. An estimator $\hat\Sigma_n$ is persistent if

$$R(\hat\Sigma_n) - R(\Sigma_*) \xrightarrow{P} 0$$

as the sample size $n$ increases to infinity. Thus, a persistent estimator approximates the best estimator over the class $\mathcal{S}$, but we do not assume that the true distribution has a covariance matrix in $\mathcal{S}$, or even that it is Gaussian. Moreover, we allow the dimension $p = p_n$ to increase with $n$. On the other hand, norm consistency and sparsistency require that the true distribution is Gaussian. In this case, let $\Sigma_0$ denote the true covariance matrix. An estimator is norm consistent if

$$\|\hat\Sigma_n - \Sigma_0\| \xrightarrow{P} 0$$

where $\|\cdot\|$ is a norm. If $E(\Omega)$ denotes the edge set corresponding to $\Omega$, an estimator is sparsistent if

$$P\left(E(\Omega) \ne E(\hat\Omega_n)\right) \to 0.$$

Thus, a sparsistent estimator identifies the correct graph consistently. We present our theoretical analysis on these properties of the nonparanormal in Section 5.

3. The Nonparanormal

We say that a random vector $X = (X_1, \ldots, X_p)^T$ has a nonparanormal distribution if there exist functions $\{f_j\}_{j=1}^p$ such that $Z \equiv f(X) \sim N(\mu, \Sigma)$, where $f(X) = (f_1(X_1), \ldots, f_p(X_p))$. We then write

$$X \sim NPN(\mu, \Sigma, f).$$

When the $f_j$'s are monotone and differentiable, the joint probability density function of $X$ is given by

$$p_X(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}\left(f(x) - \mu\right)^T \Sigma^{-1} \left(f(x) - \mu\right) \right\} \prod_{j=1}^{p} |f_j'(x_j)|. \tag{2}$$

Lemma 1 The nonparanormal distribution $NPN(\mu, \Sigma, f)$ is a Gaussian copula when the $f_j$'s are monotone and differentiable.

Proof By Sklar's theorem (Sklar, 1959), any joint distribution can be written as

$$F(x_1, \ldots, x_p) = C\{F_1(x_1), \ldots, F_p(x_p)\}$$

where the function $C$ is called a copula. For the nonparanormal we have

$$F(x_1, \ldots, x_p) = \Phi_{\mu,\Sigma}\left(\Phi^{-1}(F_1(x_1)), \ldots, \Phi^{-1}(F_p(x_p))\right)$$

where $\Phi_{\mu,\Sigma}$ is the multivariate Gaussian cdf and $\Phi$ is the univariate standard Gaussian cdf. Thus, the corresponding copula is

$$C(u_1, \ldots, u_p) = \Phi_{\mu,\Sigma}\left(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p)\right).$$

This is exactly a Gaussian copula with parameters $\mu$ and $\Sigma$. If each $f_j$ is differentiable then the density of $X$ has the same form as (2).

Note that the density in (2) is not identifiable; to make the family identifiable we demand that $f_j$ preserve means and variances:

$$\mu_j = E(Z_j) = E(X_j) \quad \text{and} \quad \sigma_j^2 \equiv \Sigma_{jj} = \mathrm{Var}(Z_j) = \mathrm{Var}(X_j). \tag{3}$$

Note that these conditions only depend on $\mathrm{diag}(\Sigma)$ but not the full covariance matrix. Let $F_j(x)$ denote the marginal distribution function of $X_j$. Then

$$F_j(x) = P(X_j \le x) = P(Z_j \le f_j(x)) = \Phi\left(\frac{f_j(x) - \mu_j}{\sigma_j}\right)$$

which implies that

$$f_j(x) = \mu_j + \sigma_j \Phi^{-1}(F_j(x)). \tag{4}$$

The following basic fact says that the independence graph of the nonparanormal is encoded in $\Omega = \Sigma^{-1}$, as for the parametric normal.

Lemma 2 If $X \sim NPN(\mu, \Sigma, f)$ is nonparanormal and each $f_j$ is differentiable, then $X_i \perp\!\!\!\perp X_j \mid X_{\setminus\{i,j\}}$ if and only if $\Omega_{ij} = 0$, where $\Omega = \Sigma^{-1}$.

Proof From the form of the density (2), it follows that the density factors with respect to the graph of $\Omega$, and therefore obeys the global Markov property of the graph.

Next we show that the above is true for any choice of identification restrictions.

Lemma 3 Define

$$h_j(x) = \Phi^{-1}(F_j(x)) \tag{5}$$

and let $\Lambda$ be the covariance matrix of $h(X)$. Then $X_j \perp\!\!\!\perp X_k \mid X_{\setminus\{j,k\}}$ if and only if $\Lambda^{-1}_{jk} = 0$.

Proof We can rewrite the covariance matrix as

$$\Sigma_{jk} = \mathrm{Cov}(Z_j, Z_k) = \sigma_j \sigma_k \mathrm{Cov}(h_j(X_j), h_k(X_k)).$$

Hence $\Sigma = D \Lambda D$ and

$$\Sigma^{-1} = D^{-1} \Lambda^{-1} D^{-1},$$

where $D$ is the diagonal matrix with $\mathrm{diag}(D) = \sigma$. The zero pattern of $\Lambda^{-1}$ is therefore identical to the zero pattern of $\Sigma^{-1}$.
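The argument in the proof of Lemma 3 can be checked numerically on a small example. The following is only a sketch under stated assumptions: the three-variable chain graph, the scales in `sigma`, and the helper `inverse` are hypothetical choices made for illustration and do not come from the paper.

```python
def inverse(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical chain graph 1 - 2 - 3: the (1,3) entry of Lambda^{-1} is zero.
lambda_inv = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam = inverse(lambda_inv)
sigma = [0.5, 2.0, 10.0]  # arbitrary marginal scales, the diagonal of D
# Sigma = D Lambda D
cov = [[sigma[i] * lam[i][j] * sigma[j] for j in range(3)] for i in range(3)]
omega = inverse(cov)      # should equal D^{-1} Lambda^{-1} D^{-1}
print([[round(v, 6) for v in row] for row in omega])
```

Rescaling by $D$ changes the nonzero entries of the precision matrix but leaves the zero in position (1,3) where it was, so the graph is the same whatever marginal scales are used.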

Thus, it is not necessary to estimate $\mu$ or $\sigma$ to estimate the graph.

Figure 2: Densities of three 2-dimensional nonparanormals. The component functions have the form $f_j(x) = \mathrm{sign}(x)|x|^{\alpha_j}$. Left: $\alpha_1 = 0.9$, $\alpha_2 = 0.8$; center: $\alpha_1 = 1.2$, $\alpha_2 = 0.8$; right: $\alpha_1 = 2$, $\alpha_2 = 3$. In each case $\mu = (0, 0)$ and $\Sigma = \begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}$.

Figure 2 shows three examples of 2-dimensional nonparanormal densities. In each case, the component functions $f_j(x)$ take the form

$$f_j(x) = a_j\, \mathrm{sign}(x)|x|^{\alpha_j} + b_j$$

where the constants $a_j$ and $b_j$ are set to enforce the identifiability constraints (3). The covariance in each case is $\Sigma = \begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}$ and the mean is $\mu = (0, 0)$. The exponent $\alpha_j$ determines the nonlinearity. It can be seen how the concavity of the density changes with the exponent $\alpha$, and that $\alpha > 1$ can result in multiple modes.

The assumption that $f(X) = (f_1(X_1), \ldots, f_p(X_p))$ is normal leads to a semiparametric model where only one dimensional functions need to be estimated. But the monotonicity of the functions $f_j$, which map onto $\mathbb{R}$, enables computational tractability of the nonparanormal. For more general functions $f$, the normalizing constant for the density

$$p_X(x) \propto \exp\left\{ -\frac{1}{2}\left(f(x) - \mu\right)^T \Sigma^{-1} \left(f(x) - \mu\right) \right\}$$

cannot be computed in closed form.

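4. Estimation Method

Let $X^{(1)}, \ldots, X^{(n)}$ be a sample of size $n$ where $X^{(i)} = (X_1^{(i)}, \ldots, X_p^{(i)})^T \in \mathbb{R}^p$. In light of (5) we define

$$\hat h_j(x) = \Phi^{-1}(\tilde F_j(x))$$

where $\tilde F_j$ is an estimator of $F_j$. A natural candidate for $\tilde F_j$ is the marginal empirical distribution function

$$\hat F_j(t) \equiv \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left\{X_j^{(i)} \le t\right\}.$$

Now, let $\theta$ denote the parameters of the copula. Tsukahara (2005) suggests taking $\hat\theta$ to be the solution of

$$\sum_{i=1}^{n} \phi\left(\tilde F_1(X_1^{(i)}), \ldots, \tilde F_p(X_p^{(i)}), \theta\right) = 0$$

where $\phi$ is an estimating equation and $\tilde F_j(t) = n \hat F_j(t)/(n+1)$. In our case, $\theta$ corresponds to the covariance matrix. The resulting estimator $\hat\theta$, called a rank approximate Z-estimator, has excellent theoretical properties. However, we are interested in the high dimensional scenario where the dimension $p$ is allowed to increase with $n$; the variance of $\hat F_j(t)$ is too large in this case. Instead, we use the following truncated or Winsorized [1] estimator:

$$\tilde F_j(x) = \begin{cases} \delta_n & \text{if } \hat F_j(x) < \delta_n \\ \hat F_j(x) & \text{if } \delta_n \le \hat F_j(x) \le 1 - \delta_n \\ 1 - \delta_n & \text{if } \hat F_j(x) > 1 - \delta_n, \end{cases} \tag{6}$$

where $\delta_n$ is a truncation parameter. Clearly, there is a bias-variance tradeoff in choosing $\delta_n$. Essentially the same estimator with $\delta_n = 1/n$ is studied by Klaassen and Wellner (1997) in the case of bivariate Gaussian copula. In what follows we use

$$\delta_n \equiv \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}.$$

This provides the right balance so that we can achieve the desired rate of convergence in our estimate of $\Omega$ and the associated undirected graph $G$ in the high dimensional setting.

Given this estimate of the distribution of variable $X_j$, we then estimate the transformation function $f_j$ by

$$\tilde f_j(x) \equiv \hat\mu_j + \hat\sigma_j \tilde h_j(x) \tag{7}$$

where $\tilde h_j(x) = \Phi^{-1}(\tilde F_j(x))$ and $\hat\mu_j$ and $\hat\sigma_j$ are the sample mean and the standard deviation:

$$\hat\mu_j \equiv \frac{1}{n} \sum_{i=1}^{n} X_j^{(i)} \quad \text{and} \quad \hat\sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(X_j^{(i)} - \hat\mu_j\right)^2}.$$

[1] After Charles P. Winsor, whom John Tukey credited with converting him from topology to statistics (Mallows, 1990).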
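The Winsorized estimator (6) and the estimated transformation (7) can be sketched for a single variable using only the Python standard library. The function names and the toy sample below are hypothetical, chosen for illustration; the sketch assumes $n \ge 2$ so that $\log n > 0$.

```python
import math
from statistics import NormalDist, fmean, pstdev


def truncation_level(n):
    """delta_n = 1 / (4 * n^(1/4) * sqrt(pi * log n)); requires n >= 2."""
    return 1.0 / (4.0 * n ** 0.25 * math.sqrt(math.pi * math.log(n)))


def winsorized_cdf(sample, x):
    """Equation (6): empirical cdf at x, clipped to [delta_n, 1 - delta_n]."""
    n = len(sample)
    delta = truncation_level(n)
    empirical = sum(1 for v in sample if v <= x) / n
    return min(max(empirical, delta), 1.0 - delta)


def estimated_transform(sample, x):
    """Equation (7): mu_hat + sigma_hat * Phi^{-1}(F_tilde(x))."""
    mu_hat = fmean(sample)
    sigma_hat = pstdev(sample)  # 1/n normalisation, as in the text
    return mu_hat + sigma_hat * NormalDist().inv_cdf(winsorized_cdf(sample, x))


# Hypothetical, heavily skewed sample of size n = 10.
sample = [0.1, 0.4, 0.5, 1.2, 2.0, 3.5, 7.0, 9.0, 15.0, 40.0]
print(truncation_level(len(sample)))
print([round(estimated_transform(sample, x), 3) for x in sample])
```

Without the clipping step, the empirical distribution function equals 1 at the largest observation and $\Phi^{-1}(1)$ is infinite; truncating at $1 - \delta_n$ keeps every normal score finite.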

Now, let $S_n(\tilde f)$ be the sample covariance matrix of $\tilde f(X^{(1)}), \ldots, \tilde f(X^{(n)})$; that is,

$$S_n(\tilde f) \equiv \frac{1}{n} \sum_{i=1}^{n} \left(\tilde f(X^{(i)}) - \mu_n(\tilde f)\right)\left(\tilde f(X^{(i)}) - \mu_n(\tilde f)\right)^T, \qquad \mu_n(\tilde f) \equiv \frac{1}{n} \sum_{i=1}^{n} \tilde f(X^{(i)}). \tag{8}$$

We then estimate $\Omega$ using $S_n(\tilde f)$. For instance, the maximum likelihood estimator is $\hat\Omega_n^{MLE} = S_n(\tilde f)^{-1}$. The $\ell_1$-regularized estimator is

$$\hat\Omega_n = \arg\min_{\Omega} \left\{ \mathrm{tr}\left(\Omega S_n(\tilde f)\right) - \log|\Omega| + \lambda \|\Omega\|_1 \right\} \tag{9}$$

where $\lambda$ is a regularization parameter, and $\|\Omega\|_1 = \sum_{j \ne k} |\Omega_{jk}|$. The estimated graph is then $\hat E_n = \{(j, k) : \hat\Omega_{jk} \ne 0\}$.

The nonparanormal is analogous to a sparse additive regression model (Ravikumar et al., 2009a), in the sense that both methods transform the variables by univariate functions. However, while sparse additive models use a regularized risk criterion to fit univariate transformations, our nonparanormal estimator uses a two-step procedure:

1. Replace the observations, for each variable, by their respective normal scores, subject to a Winsorized truncation.
2. Apply the graphical lasso to the transformed data to estimate the undirected graph.

The first step is non-iterative and computationally efficient, with no tuning parameters; it also makes the nonparanormal amenable to theoretical analysis. Starting with the model in (2), another possibility would be to parametrize each $f_j$ according to some parametric class of monotone functions such as the Box-Cox family, and then find the maximum likelihood estimates of $(\Omega, f_1, \ldots, f_p)$ in that class. This might lead to estimates of $f_j$ that depend on $\Omega$, and vice versa, and the estimation problem would not in general be convex. Alternatively, due to (4), the marginal information could be used to estimate the parameters. Our nonparametric approach to estimating the transformations has the advantages of making few assumptions and being easy to compute. In the following section we analyze the theoretical properties of this estimator.

5. Theoretical Results

In this section we present our theoretical results on risk consistency, model selection consistency, and norm consistency of the covariance $\Sigma$ and inverse covariance $\Omega$. From Lemma 3, the estimate of the graph does not depend on $\sigma_j$, $j \in \{1, \ldots, p\}$ and $\mu$, so we assume that $\sigma_j = 1$ and $\mu = 0$.
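The first step of the two-step procedure of Section 4, followed by the covariance computation in (8), can be sketched as follows. This is a minimal illustration under stated assumptions: it works with the normal scores $\tilde h_j$ directly (that is, $\hat\mu_j = 0$ and $\hat\sigma_j = 1$, which Lemma 3 shows does not affect the graph), the toy data are hypothetical, and the second step is not shown: the resulting matrix would be handed to whatever graphical lasso solver is available.

```python
import math
from statistics import NormalDist


def normal_scores(column):
    """Step 1 for one variable: Winsorized normal scores of the observations."""
    n = len(column)
    delta = 1.0 / (4.0 * n ** 0.25 * math.sqrt(math.pi * math.log(n)))
    scores = []
    for value in column:
        ecdf = sum(1 for other in column if other <= value) / n
        clipped = min(max(ecdf, delta), 1.0 - delta)
        scores.append(NormalDist().inv_cdf(clipped))
    return scores


def sample_covariance(columns):
    """Equation (8): covariance (1/n normalisation) of the transformed columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    return [[sum((a[i] - ma) * (b[i] - mb) for i in range(n)) / n
             for b, mb in zip(columns, means)]
            for a, ma in zip(columns, means)]


# Hypothetical toy data: x2 is a monotone (cubic) function of x1; x3 alternates in sign.
x1 = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5, -2.0, 0.0]
x2 = [v ** 3 for v in x1]
x3 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
S = sample_covariance([normal_scores(c) for c in (x1, x2, x3)])
print(S[0][1] / math.sqrt(S[0][0] * S[1][1]))  # correlation of the scores of x1 and x2
```

Because the scores depend on the data only through ranks, any strictly increasing transformation of a variable leaves its column of scores, and hence $S_n(\tilde f)$, unchanged.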
Our key technical result is an analysis of the covariance of the Winsorized estimator defined in (6), (7), and (8). In particular, we show that under appropriate conditions,

$$\max_{j,k} \left| S_n(\tilde f)_{jk} - S_n(f)_{jk} \right| = o_P(1)$$

where $S_n(\tilde f)_{jk}$ denotes the $(j, k)$ entry of the matrix. This result allows us to leverage the recent analysis of Rothman et al. (2008) and Ravikumar et al. (2009b) in the Gaussian case to obtain consistency results for the nonparanormal. More precisely, our main theorem is the following.

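Theorem 4 Suppose that $p = n^{\xi}$ and let $\tilde f$ be the Winsorized estimator defined in (7) with $\delta_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}$. Define

$$C_M \equiv \frac{48}{\sqrt{\pi}} \left(\sqrt{2M} - 1\right) M$$

for some $M \ge 2(\xi + 1)$. Then for any $\epsilon \ge C_M \sqrt{\frac{\log p \log^2 n}{n^{1/2}}}$ and sufficiently large $n$, we have

$$P\left( \max_{jk} \left| S_n(\tilde f)_{jk} - S_n(f)_{jk} \right| > 2\epsilon \right) \le \frac{1}{2\sqrt{\pi \log(np)}} + 2\exp\left( 2\log p - \frac{n^{1/2}\epsilon^2}{1232\,\pi^2 \log^2 n} \right) + 2\exp\left( 2\log p - \frac{n^{1/2}}{8\pi \log n} \right) + o(1).$$

The proof of the above theorem is given in Section 7. The following corollary is immediate, and specifies the scaling of the dimension in terms of sample size.

Corollary 5 Let $M \ge \max\{15\pi, 2(\xi + 1)\}$. Then

$$P\left( \max_{jk} \left| S_n(\tilde f)_{jk} - S_n(f)_{jk} \right| > 2 C_M \sqrt{\frac{\log p \log^2 n}{n^{1/2}}} \right) = o(1).$$

Hence,

$$\max_{j,k} \left| S_n(\tilde f)_{jk} - S_n(f)_{jk} \right| = O_P\left( \sqrt{\frac{\log p \log^2 n}{n^{1/2}}} \right).$$

The following corollary yields estimation consistency in both the Frobenius norm and the $\ell_2$-operator norm. The proof follows the same arguments as the proof of Theorem 1 and Theorem 2 from Rothman et al. (2008), replacing their Lemma 1 with our Theorem 4.

For a matrix $A = (a_{ij})$, the Frobenius norm $\|\cdot\|_F$ is defined as $\|A\|_F \equiv \sqrt{\sum_{i,j} a_{ij}^2}$. The $\ell_2$-operator norm $\|\cdot\|_2$ is defined as the magnitude of the largest eigenvalue of the matrix, $\|A\|_2 \equiv \max_{\|x\|_2 = 1} \|Ax\|_2$. In the following, we write $a_n \asymp b_n$ if there are positive constants $c$ and $C$ independent of $n$ such that $c \le a_n/b_n \le C$.

Corollary 6 Suppose that the data are generated as $X^{(i)} \sim NPN(\mu_0, \Sigma_0, f_0)$, and let $\Omega_0 = \Sigma_0^{-1}$. If the regularization parameter $\lambda_n$ is chosen as

$$\lambda_n \asymp 2 C_M \sqrt{\frac{\log p \log^2 n}{n^{1/2}}}$$

where $C_M$ is defined in Theorem 4, then the nonparanormal estimator $\hat\Omega_n$ of (9) satisfies

$$\|\hat\Omega_n - \Omega_0\|_F = O_P\left( \sqrt{\frac{(s + p)\log p \log^2 n}{n^{1/2}}} \right)$$

and

$$\|\hat\Omega_n - \Omega_0\|_2 = O_P\left( \sqrt{\frac{s \log p \log^2 n}{n^{1/2}}} \right),$$

where

$$s \equiv \mathrm{Card}\left\{ (i, j) \in \{1, \ldots, p\} \times \{1, \ldots, p\} \mid \Omega_0(i, j) \ne 0,\ i \ne j \right\}$$

is the number of nonzero off-diagonal elements of the true precision matrix.

To prove the model selection consistency result, we need further assumptions. We follow Ravikumar et al. (2009b) and let the $p^2 \times p^2$ Fisher information matrix of $\Sigma_0$ be $\Gamma \equiv \Sigma_0 \otimes \Sigma_0$ where $\otimes$ is the Kronecker matrix product, and define the support set $S$ of $\Omega_0 = \Sigma_0^{-1}$ as

$$S \equiv \left\{ (i, j) \in \{1, \ldots, p\} \times \{1, \ldots, p\} \mid \Omega_0(i, j) \ne 0 \right\}.$$

We use $S^c$ to denote the complement of $S$ in the set $\{1, \ldots, p\} \times \{1, \ldots, p\}$, and for any two subsets $T$ and $T'$ of $\{1, \ldots, p\} \times \{1, \ldots, p\}$, we use $\Gamma_{TT'}$ to denote the sub-matrix with rows and columns of $\Gamma$ indexed by $T$ and $T'$ respectively.

Assumption 1 There exists some $\alpha \in (0, 1]$, such that $\left\| \Gamma_{S^c S} \left(\Gamma_{SS}\right)^{-1} \right\|_\infty \le 1 - \alpha$.

As in Ravikumar et al. (2009b), we define two quantities $K_{\Sigma_0} \equiv \|\Sigma_0\|_\infty$ and $K_\Gamma \equiv \|(\Gamma_{SS})^{-1}\|_\infty$. Further, we define the maximum row degree as

$$d \equiv \max_{i = 1, \ldots, p} \mathrm{Card}\left\{ j \in \{1, \ldots, p\} \mid \Omega_0(i, j) \ne 0 \right\}.$$

Assumption 2 The quantities $K_{\Sigma_0}$ and $K_\Gamma$ are bounded, and there is a positive constant $C$ such that

$$\min_{(j,k) \in S} |\Omega_0(j, k)| \ge C \sqrt{\frac{\log^3 n}{n^{1/2}}}$$

for large enough $n$.

The proof of the following corollary uses our Theorem 4 in place of Equation 2 in the analysis of Ravikumar et al. (2009b).

Corollary 7 Suppose the regularization parameter is chosen as

$$\lambda_n \asymp 2 C_M \sqrt{\frac{\log p \log^2 n}{n^{1/2}}}$$

where $C_M$ is defined in Theorem 4. Then the nonparanormal estimator $\hat\Omega_n$ satisfies

$$P\left( \mathcal{G}(\hat\Omega_n, \Omega_0) \right) \ge 1 - o(1)$$

where $\mathcal{G}(\hat\Omega_n, \Omega_0)$ is the event

$$\left\{ \mathrm{sign}\left(\hat\Omega_n(j, k)\right) = \mathrm{sign}\left(\Omega_0(j, k)\right),\ \forall (j, k) \in S \right\}.$$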

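Our persistency (risk consistency) result parallels the persistency result for additive models given in Ravikumar et al. (2009a), and allows model dimension that grows exponentially with sample size. The definition in this theorem uses the fact (from Lemma 11) that $\sup_x \left|\Phi^{-1}(\tilde F_j(x))\right| \le \sqrt{2 \log n}$ when $\delta_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}$.

In the next theorem, we do not assume the true model is nonparanormal and define the population and sample risks as

$$R(f, \Omega) = \frac{1}{2}\left\{ \mathrm{tr}\left[ \Omega\, E\left(f(X) f(X)^T\right) \right] - \log|\Omega| - p\log(2\pi) \right\}$$

$$\hat R(f, \Omega) = \frac{1}{2}\left\{ \mathrm{tr}\left[ \Omega\, S_n(f) \right] - \log|\Omega| - p\log(2\pi) \right\}.$$

Theorem 8 Suppose that $p \le e^{n^{\xi}}$ for some $\xi < 1$, and define the classes

$$\mathcal{M}_n = \left\{ f : \mathbb{R} \to \mathbb{R} : f \text{ is monotone with } \|f\|_\infty \le C\sqrt{\log n} \right\}$$

$$\mathcal{C}_n = \left\{ \Omega : \|\Omega^{-1}\|_1 \le L_n \right\}.$$

Let $\hat\Omega_n$ be given by

$$\hat\Omega_n = \arg\min_{\Omega \in \mathcal{C}_n} \left\{ \mathrm{tr}\left(\Omega S_n(\tilde f)\right) - \log|\Omega| \right\}.$$

Then

$$R(\tilde f_n, \hat\Omega_n) - \inf_{(f, \Omega) \in \mathcal{M}_n^p \oplus \mathcal{C}_n} R(f, \Omega) = O_P\left( L_n \sqrt{\frac{\log n}{n^{1 - \xi}}} \right).$$

Hence the Winsorized estimator of $(f, \Omega)$ with $\delta_n = \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}$ is persistent over $\mathcal{C}_n$ when $L_n = o\left( n^{(1 - \xi)/2} / \sqrt{\log n} \right)$.

The proofs of Theorems 4 and 8 are given in Section 7.

6. Experimental Results

In this section, we report experimental results on synthetic and real data sets. We mainly compare the $\ell_1$-regularized nonparanormal and Gaussian (paranormal) models, computed using the graphical lasso algorithm (glasso) of Friedman et al. (2007). The primary conclusions are: (i) when the data are multivariate Gaussian, the performance of the two methods is comparable; (ii) when the model is correct, the nonparanormal performs much better than the graphical lasso in many cases; (iii) for a particular gene microarray data set, our method behaves differently from the graphical lasso, and may support different biological conclusions.

Note that we can reuse the glasso implementation to fit a sparse nonparanormal. In particular, after computing the Winsorized sample covariance $S_n(\tilde f)$, we pass this matrix to the glasso routine to carry out the optimization

$$\hat\Omega_n = \arg\min_{\Omega} \left\{ \mathrm{tr}\left(\Omega S_n(\tilde f)\right) - \log|\Omega| + \lambda_n \|\Omega\|_1 \right\}.$$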

6.1 Neighborhood Graphs

We begin by describing a procedure to generate graphs as in (Meinshausen and Bühlmann, 2006), with respect to which several distributions can then be defined. We generate a $p$-dimensional sparse graph $G \equiv (V, E)$ as follows: Let $V = \{1, \ldots, p\}$ correspond to variables $X = (X_1, \ldots, X_p)$. We associate each index $j$ with a point $(Y_j^{(1)}, Y_j^{(2)}) \in [0, 1]^2$ where

$$Y_1^{(k)}, \ldots, Y_n^{(k)} \sim \mathrm{Uniform}[0, 1]$$

for $k = 1, 2$. Each pair of nodes $(i, j)$ is included in the edge set $E$ with probability

$$P\left( (i, j) \in E \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{\|y_i - y_j\|_n^2}{2s} \right)$$

where $y_i \equiv (y_i^{(1)}, y_i^{(2)})$ is the observation of $(Y_i^{(1)}, Y_i^{(2)})$ and $\|\cdot\|_n$ represents the Euclidean distance. Here, $s = 0.125$ is a parameter that controls the sparsity level of the generated graph. We restrict the maximum degree of the graph to be four and build the inverse covariance matrix $\Omega_0$ according to

$$\Omega_0(i, j) = \begin{cases} 1 & \text{if } i = j \\ 0.245 & \text{if } (i, j) \in E \\ 0 & \text{otherwise,} \end{cases}$$

where the value 0.245 guarantees positive definiteness of the inverse covariance matrix.

Given $\Omega_0$, $n$ data points are sampled from

$$X^{(1)}, \ldots, X^{(n)} \sim NPN(\mu_0, \Sigma_0, f_0)$$

where $\mu_0 = (1.5, \ldots, 1.5)$, $\Sigma_0 = \Omega_0^{-1}$. For simplicity, the transformation functions for all dimensions are the same, $f_1 = \ldots = f_p = f$. To sample data from the nonparanormal distribution, we also require $g \equiv f^{-1}$; two different transformations $g$ are employed.

Definition 9 (Gaussian CDF Transformation) Let $g_0$ be a one-dimensional Gaussian cumulative distribution function with mean $\mu_{g_0}$ and the standard deviation $\sigma_{g_0}$, that is,

$$g_0(t) \equiv \Phi\left( \frac{t - \mu_{g_0}}{\sigma_{g_0}} \right).$$

We define the transformation function $g_j = f_j^{-1}$ for the $j$-th dimension as

$$g_j(z_j) \equiv \sigma_j \frac{ g_0(z_j) - \int g_0(t)\, \phi\left( \frac{t - \mu_j}{\sigma_j} \right) dt }{ \sqrt{ \int \left( g_0(y) - \int g_0(t)\, \phi\left( \frac{t - \mu_j}{\sigma_j} \right) dt \right)^2 \phi\left( \frac{y - \mu_j}{\sigma_j} \right) dy } } + \mu_j$$

where $\sigma_j = \sqrt{\Sigma_0(j, j)}$.
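The graph-generation step of Section 6.1 can be sketched with the standard library alone. Several details here are assumptions rather than the authors' procedure: the text does not say how the degree cap is enforced (this sketch simply skips pairs whose endpoints are already saturated), the sparsity parameter `s` and the off-diagonal `edge_value` are left as arguments, and the function name and seed are hypothetical.

```python
import math
import random


def neighborhood_graph(p, s, edge_value, max_degree=4, seed=0):
    """Sample a sparse graph on p nodes and the matching inverse covariance."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(p)]  # y_j in [0, 1]^2
    degree = [0] * p
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            dist_sq = ((points[i][0] - points[j][0]) ** 2
                       + (points[i][1] - points[j][1]) ** 2)
            prob = math.exp(-dist_sq / (2.0 * s)) / math.sqrt(2.0 * math.pi)
            # Assumption: the degree cap is enforced by skipping saturated nodes.
            if (degree[i] < max_degree and degree[j] < max_degree
                    and rng.random() < prob):
                edges.add((i, j))
                degree[i] += 1
                degree[j] += 1
    # Unit diagonal, edge_value on edges, zero elsewhere.
    omega = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for i, j in edges:
        omega[i][j] = omega[j][i] = edge_value
    return edges, omega


edges, omega = neighborhood_graph(p=40, s=0.125, edge_value=0.245)
print(len(edges))
```

With at most four neighbours per node and a unit diagonal, any off-diagonal value below 0.25 makes the matrix strictly diagonally dominant, which is one way to see why it is positive definite.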

[Figure 3 panels: before transform; power transform; CDF transform (top row, estimated densities); identity function; power function, alpha = 3; CDF of N(0.05, 0.4) (bottom row, transformation functions).]

Figure 3: The power and cdf transformations. The densities are estimated using a kernel density estimator with bandwidths selected by cross-validation.

Definition 10 (Symmetric Power Transformation) Let $g_0$ be the symmetric and odd transformation given by

$$g_0(t) = \mathrm{sign}(t)|t|^{\alpha}$$

where $\alpha > 0$ is a parameter. We define the power transformation for the $j$-th dimension as

$$g_j(z_j) \equiv \sigma_j \frac{ g_0(z_j - \mu_j) }{ \sqrt{ \int g_0^2(t - \mu_j)\, \phi\left( \frac{t - \mu_j}{\sigma_j} \right) dt } } + \mu_j.$$

These transformations are constructed to preserve the marginal mean and standard deviation. In the following experiments, we refer to them as the cdf transformation and the power transformation, respectively. For the cdf transformation, we set $\mu_{g_0} = 0.05$ and $\sigma_{g_0} = 0.4$. For the power transformation, we set $\alpha = 3$.

To visualize these two transformations, we sample 5000 data points from a one-dimensional normal distribution $N(0.5, 1.0)$ and then apply the above two transformations; the results are shown in Figure 3. It can be seen how the cdf and power transformations map a univariate normal distribution into a highly skewed and a bi-modal distribution, respectively.
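The power transformation of Definition 10 can be evaluated numerically. The sketch below is an illustration under assumptions: the integral is approximated with a midpoint rule over a hypothetical window of twelve standard deviations, and $\phi$ is taken to be the standard Gaussian density exactly as the formula is written (for $\sigma_j = 1$ this coincides with integrating against the $N(\mu_j, \sigma_j^2)$ density). The function names are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().pdf  # standard Gaussian density


def g0(t, alpha):
    """Symmetric, odd power function sign(t) * |t|^alpha."""
    return math.copysign(abs(t) ** alpha, t)


def power_transform(z, mu, sigma, alpha=3.0, half_width=12.0, steps=20000):
    """Definition 10, with the integral approximated by the midpoint rule."""
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    h = (hi - lo) / steps
    integral = 0.0
    for k in range(steps):
        t = lo + (k + 0.5) * h
        integral += g0(t - mu, alpha) ** 2 * PHI((t - mu) / sigma) * h
    return sigma * g0(z - mu, alpha) / math.sqrt(integral) + mu


print(power_transform(2.5, mu=1.5, sigma=1.0))  # sigma = 1, alpha = 3
```

For $\alpha = 3$ and $\sigma_j = 1$ the integral is the sixth moment of a standard Gaussian, which equals 15, so the transformation reduces to $g_j(z) = (z - \mu_j)^3/\sqrt{15} + \mu_j$; this gives a closed form against which the numerical value can be compared.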

[Figure 4 panels: glasso path and nonparanormal path for each of the cdf, power, and linear conditions, at n = 500 and n = 200.]

Figure 4: Regularization paths for the glasso and nonparanormal with $n = 500$ (top) and $n = 200$ (bottom). The paths for the relevant variables (nonzero inverse covariance entries) are plotted as solid black lines; the paths for the irrelevant variables are plotted as dashed red lines. For non-Gaussian distributions, the nonparanormal better separates the relevant and irrelevant dimensions.

To generate synthetic data, we set $p = 40$, resulting in $\binom{40}{2} + 40 = 820$ parameters to be estimated, and vary the sample sizes from $n = 200$ to $n = 1000$. Three conditions are considered, corresponding to using the cdf transform, the power transform, or no transformation. In each case, both the glasso and the nonparanormal are applied to estimate the graph.

6.1.1 COMPARISON OF REGULARIZATION PATHS

We choose a set of regularization parameters $\Lambda$; for each $\lambda \in \Lambda$, we obtain an estimate $\hat\Omega_n^{\lambda}$ which is a $40 \times 40$ matrix. The upper triangular matrix has 780 parameters; we vectorize it to get a 780-dimensional parameter vector. A regularization path is the trace of these parameters over all the regularization parameters within $\Lambda$. The regularization paths for both methods are plotted in Figure 4. For the cdf transformation and the power transformation, the nonparanormal separates the relevant and the irrelevant dimensions very well. For the glasso, relevant variables are mixed with irrelevant variables. If no transformation is applied, the paths for both methods are almost the same.

6.1.2 ESTIMATED TRANSFORMATIONS

For sample size $n = 1000$, we plot the estimated transformations for three of the variables in Figure 5. It is clear that Winsorization plays a significant role for the power transformation. This is intuitive due to the high skewness of the nonparanormal distribution in this case.

[Figure 5 panels: estimated and true transformations for variables x1, x2, x3 under the cdf, power, and linear conditions.]

Figure 5: Estimated transformations for the first three variables. Winsorization plays a significant role for the power transformation due to its high skewness.

[Figure 6 panels: boxplots of the oracle score for the nonparanormal and the glasso under the cdf, power, and linear conditions.]

Figure 6: Boxplots of the oracle scores for $n = 1000, 500, 200$ (top, center, bottom).

6.1.3 QUANTITATIVE COMPARISON

To evaluate the performance for structure estimation quantitatively, we use false positive and false negative rates. Let $G = (V, E)$ be a $p$-dimensional graph (which has at most $\binom{p}{2}$ edges) in which there are $|E| = r$ edges, and let $\hat G^{\lambda} = (V, \hat E^{\lambda})$ be an estimated graph using the regularization parameter $\lambda$. The number of false positives at $\lambda$ is

$$FP(\lambda) \equiv \text{number of edges in } \hat E^{\lambda} \text{ not in } E.$$

The number of false negatives at $\lambda$ is defined as

$$FN(\lambda) \equiv \text{number of edges in } E \text{ not in } \hat E^{\lambda}.$$

The oracle regularization level $\lambda^*$ is then

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \left\{ FP(\lambda) + FN(\lambda) \right\}.$$

The oracle score is $FP(\lambda^*) + FN(\lambda^*)$. Figure 6 shows boxplots of the oracle scores for the two methods, calculated using 100 simulations.
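The false positive and false negative counts and the oracle regularization level can be computed directly from edge sets. The sketch below assumes edges are stored as tuples `(i, j)` with `i < j`; the true graph and the three estimated edge sets are hypothetical and serve only to exercise the definitions.

```python
def false_positives(estimated, truth):
    """FP: edges in the estimated edge set that are not in the true edge set."""
    return len(estimated - truth)


def false_negatives(estimated, truth):
    """FN: edges in the true edge set that the estimate misses."""
    return len(truth - estimated)


def oracle(path, truth):
    """Return (lambda*, oracle score), minimising FP + FN over the path."""
    scores = {lam: false_positives(est, truth) + false_negatives(est, truth)
              for lam, est in path.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical path: estimated edge sets at three regularization levels.
truth = {(0, 1), (1, 2), (2, 3)}
path = {
    0.1: {(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)},  # too dense: two extra edges
    0.3: {(0, 1), (1, 2), (2, 3), (0, 3)},          # one extra edge
    0.6: {(0, 1)},                                  # too sparse: two missed edges
}
print(oracle(path, truth))
```

The oracle level uses the true graph, so it is available only in simulations; it measures the best performance a method could reach along its path rather than what a data-driven choice of $\lambda$ would achieve.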

To illustrate the overall performance of these two methods over the full paths, ROC curves are shown in Figure 7, using

$$\left( 1 - \frac{FN(\lambda)}{r},\ 1 - \frac{FP(\lambda)}{\binom{p}{2} - r} \right).$$

The curves clearly show how the performance of both methods improves with sample size, and that the nonparanormal is superior to the Gaussian model in most cases.

[Figure 7 panels: CDF Transform, Power Transform, No Transform; each plots 1-FP against 1-FN for the nonparanormal and the glasso.]

Figure 7: ROC curves for sample sizes $n = 1000, 500, 200$ (top, middle, bottom).

Let $FPE \equiv FP(\lambda^*)$ and $FNE \equiv FN(\lambda^*)$. Tables 1, 2, and 3 provide numerical comparisons of both methods on data sets with different transformations, where we repeat the experiments 100 times and report the average FPE and FNE values with the corresponding standard deviations. It is clear from the tables that the nonparanormal achieves significantly smaller errors than the glasso if the true distribution of the data is not multivariate Gaussian, and achieves performance comparable to the glasso when the true distribution is exactly multivariate Gaussian.

Figure 8 shows typical runs for the cdf and power transformations. It is clear that when the glasso estimates the graph incorrectly, the mistakes include both false positives and negatives.
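The ROC coordinates used for Figure 7 follow from the false positive and false negative counts. The sketch below is illustrative only: the counts along the path are hypothetical, as is the choice of $r = 60$ true edges for a graph on $p = 40$ nodes.

```python
from math import comb


def roc_point(fp, fn, r, p):
    """ROC coordinates (1 - FN/r, 1 - FP/(C(p,2) - r)) for one value of lambda."""
    non_edges = comb(p, 2) - r  # number of node pairs that are not true edges
    return 1.0 - fn / r, 1.0 - fp / non_edges


# Hypothetical (FP, FN) counts along a path, from heavy to light regularization.
counts = [(0, 60), (3, 20), (10, 5), (120, 0)]
print([roc_point(fp, fn, r=60, p=40) for fp, fn in counts])
```

Each coordinate is a proportion: the first is the fraction of true edges recovered, and the second is the fraction of absent edges correctly left out.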

[Table 1 columns: FPE, sd(FPE), FNE, sd(FNE) for the nonparanormal; FPE, sd(FPE), FNE, sd(FNE) for the glasso.]

Table 1: Quantitative comparison on the data set using the cdf transformation. For both FPE and FNE, the nonparanormal performs much better in general.

[Table 2 columns: FPE, sd(FPE), FNE, sd(FNE) for the nonparanormal; FPE, sd(FPE), FNE, sd(FNE) for the glasso.]

Table 2: Quantitative comparison on the data set using the power transformation. For both FPE and FNE, the nonparanormal performs much better in general.

6.1.4 COMPARISON IN THE GAUSSIAN CASE

The previous experiments indicate that the nonparanormal works almost as well as the glasso in the Gaussian case. This initially appears surprising, since a parametric method is expected to be more efficient than a nonparametric method if the parametric assumption is correct. To manifest this efficiency loss, we conducted some experiments with very small $n$ and relatively large $p$. For multivariate Gaussian models, Figure 9 shows results with $(n, p, s) = (50, 40, 1/8)$, $(50, 100, 1/15)$ and $(30, 100, 1/15)$. From the mean ROC curves, we see that the nonparanormal does indeed behave worse than the glasso, suggesting some efficiency loss. However, from the corresponding boxplots, the efficiency reduction is relatively insignificant.

[Table 3 columns: FPE, sd(FPE), FNE, sd(FNE) for the nonparanormal; FPE, sd(FPE), FNE, sd(FNE) for the glasso.]

Table 3: Quantitative comparison on the data set without any transformation. The two methods behave similarly; the glasso is slightly better.

6.1.5 THE CASE WHEN $p \gg n$

Figure 10 shows results from a simulation of the nonparanormal using cdf transformations with $n = 200$, $p = 500$ and sparsity level $s = 1/40$. The boxplot shows that the nonparanormal outperforms the glasso. A typical run of the regularization paths confirms this conclusion, showing that the nonparanormal path separates the relevant and irrelevant dimensions very well. In contrast, with the glasso the relevant variables are buried among the irrelevant variables.

[Figure 8 panels, for each of the cdf and power transformations: true graph, nonparanormal, graphical lasso, symmetric difference; all with p = 40.]

Figure 8: Typical runs for the two methods for $n = 1000$ using the cdf and power transformations. The dashed black lines in the symmetric difference plots indicate edges found by the glasso but not the nonparanormal, and vice-versa for the solid red lines.

6.2 Gene Microarray Data

In this study, we consider a data set based on Affymetrix GeneChip microarrays for the plant Arabidopsis thaliana (Wille et al., 2004). The sample size is $n = 118$. The expression levels for each chip are pre-processed by log-transformation and standardization. A subset of 40 genes from the isoprenoid pathway are chosen, and we study the associations among them using both the paranormal and nonparanormal models. Even though these data are generally treated as multivariate Gaussian in the previous analysis (Wille et al., 2004), our study shows that the results of the nonparanormal and the glasso are very different over a wide range of regularization parameters. This suggests the nonparanormal could support different scientific conclusions.

6.2.1 COMPARISON OF THE REGULARIZATION PATHS

We first compare the regularization paths of the two methods, in Figure 11. To generate the paths, we select 50 regularization parameters on an evenly spaced grid in the interval $[0.16, 1.2]$. Although the paths for the two methods look similar, there are some subtle differences. In particular, variables become nonzero in a different order, especially when the regularization parameter is in the range $\lambda \in [0.2, 0.3]$. As shown below, these subtle differences in the paths lead to different model selection behaviors.

6.2.2 COMPARISON OF THE ESTIMATED GRAPHS

Figure 12 compares the estimated graphs for the two methods at several values of the regularization parameter $\lambda$ in the range $[0.16, 0.37]$. For each $\lambda$, we show the estimated graph from the nonparanormal in the first column. In the second column we show the graph obtained by scanning the full