Network Maximal Correlation Soheil Feizi, Ali Makhdoumi, Ken Duffy, Manolis Kellis, and Muriel Medard


Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR, September 2, 2015. Network Maximal Correlation. Soheil Feizi, Ali Makhdoumi, Ken Duffy, Manolis Kellis, and Muriel Medard. Massachusetts Institute of Technology, Cambridge, MA 02139 USA.

Network Maximal Correlation

Soheil Feizi (1,3), Ali Makhdoumi (1,3), Ken Duffy (2), Manolis Kellis (1), and Muriel Médard (1)

(1) Massachusetts Institute of Technology (MIT), Cambridge, US. (2) Hamilton Institute, Maynooth University, Ireland. (3) These authors contributed equally to this work.

September 2015

Abstract

Identifying nonlinear relationships in large datasets is a daunting task, particularly when the form of the nonlinearity is unknown. Here, we introduce Network Maximal Correlation (NMC) as a fundamental measure to capture nonlinear associations in networks without the knowledge of underlying nonlinearity shapes. NMC infers, possibly nonlinear, transformations of variables with zero means and unit variances by maximizing total nonlinear correlation over the underlying network. For the case of having two variables, NMC is equivalent to the standard Maximal Correlation. We characterize a solution of the NMC optimization using geometric properties of Hilbert spaces for both discrete and jointly Gaussian variables. For discrete random variables, we show that the NMC optimization is an instance of the Maximum Correlation Problem and provide necessary conditions for its global optimal solution. Moreover, we propose an efficient algorithm based on Alternating Conditional Expectation (ACE) which converges to a local NMC optimum. For this algorithm, we provide guidelines for choosing appropriate starting points to jump out of local maximizers. We also propose a distributed algorithm to compute a $1-\epsilon$ approximation of the NMC value for large and dense graphs using graph partitioning. For jointly Gaussian variables, under some conditions, we show that the NMC optimization can be simplified to a Max-Cut problem, where we provide conditions under which an NMC solution can be computed exactly. Under some general conditions, we show that NMC can infer the underlying graphical model for functions of latent jointly Gaussian variables. These functions are unknown, bijective, and can be nonlinear. This result broadens the family of continuous distributions whose graphical models can be characterized efficiently. We illustrate the robustness of NMC in real world applications by showing its continuity with respect to small perturbations of joint distributions. We also show that sample NMC (NMC computed using empirical distributions) converges exponentially fast to the true NMC value. Finally, we apply NMC to different cancer datasets including breast, kidney and liver cancers, and show that NMC infers gene modules that are significantly associated with survival times of individuals while they are not detected using linear association measures.

1 Introduction

Identifying relationships among variables in large datasets is an increasingly important task in systems biology [1], social sciences [2], finance [3], etc. While correlation-based measures capture linear associations, they can fail to infer true nonlinear relationships among variables, which can often occur in real-world applications [4]. One family of measures to infer nonlinear associations among variables is based on mutual information [5, 6]. Although mutual information computes a measure of association strength among variables, it does not provide functions through which variables are related to each other. Moreover, reliable computation of mutual information requires an excessive number of samples, particularly for a large number of variables [7].

A classical measure to infer a nonlinear relationship between two variables is Maximal Correlation (MC), introduced by Gebelein [8] and studied in references [9-12]. MC infers, possibly nonlinear, transformations of two variables with zero means and unit variances by maximizing their pairwise correlation. MC can be computed efficiently for both discrete [13] and continuous [14] random variables. For discrete variables, under some mild conditions, MC is equal to the second largest singular value of a normalized joint probability distribution matrix [13]. In that case, transformations of variables can be characterized using right and left singular vectors of the normalized probability distribution matrix. Recently, MC has been used in different applications in information theory [15-17], information-theoretic security and privacy [18-20], and data processing [21, 22].

Many modern applications include a large number of variables with possibly nonlinear relationships among them. Using MC to capture pairwise associations can cause significant over-fitting issues because each variable can be assigned to multiple nonlinear relations. Here we propose Network Maximal Correlation (NMC) as a fundamental measure to capture nonlinear associations in networks without the knowledge of underlying nonlinearity shapes. In the NMC optimization, each variable is assigned to at most one transformation function with zero mean and unit variance. NMC infers optimal transformations of variables by maximizing their inner products over edges of the underlying graph. For the case of two variables, NMC is equivalent to MC. The NMC definition does not assume a specific relationship among node variables and the graph structure. For illustration, we consider this relationship in different NMC applications such as graphical model inference. Furthermore, the NMC optimization can be regularized to have even fewer nonlinear transformations to avoid over-fitting issues.

In this paper, we characterize a solution of the NMC optimization using geometric properties of Hilbert spaces for both discrete and continuous jointly Gaussian variables. For discrete random variables, we show that the NMC optimization is an instance of the Maximum Correlation Problem (MCP), which is NP-hard [23-26]. In this case, using results of the Multivariate Eigenvector Problem (MEP) [23], we provide necessary conditions for a global NMC optimum. We also propose an efficient algorithm based on Alternating Conditional Expectation (ACE) [13], which converges to a local NMC optimum. We also provide guidelines for choosing appropriate starting points of the algorithm to jump out of local maximizers. The proposed ACE algorithm does not require forming joint distribution matrices, which could be expensive for variables with large alphabet sizes. We also propose a distributed version of the ACE algorithm to compute a $1-\epsilon$ approximation of the NMC value for large and dense graphs using graph partitioning.

For jointly Gaussian variables, we use projections over Hermite-Chebyshev polynomials to characterize an optimal solution of the NMC optimization. Under some conditions, we show that the NMC optimization is equivalent to the Max-Cut problem, which is NP-complete [27]. However, there exist algorithms that approximate its solution using Semidefinite Programming (SDP) within a constant approximation factor [28]. In this case, we provide conditions under which an NMC solution can be computed exactly. Using these results, under some general conditions we show that NMC can infer the underlying graphical model for functions of latent jointly Gaussian variables. These functions are unknown, bijective, and can be nonlinear. This result broadens the family of continuous distributions whose graphical models can be characterized efficiently.

In real-world applications, often only noisy samples of joint distributions are available. For this case, we prove a finite sample generalization bound and error guarantees for NMC. In particular, under general conditions we prove that NMC is continuous with respect to joint probability distributions. That is, a small perturbation in the distribution results in a small change in the NMC value. Moreover, we show that Sample NMC (i.e., NMC computed using empirical distributions) converges exponentially fast to the NMC value as the sample size grows. Moreover, we use the NMC optimization to characterize a nonlinear global relevance graph with a certain complexity [29] and propose a greedy algorithm to infer such a nonlinear relevance graph approximately. Finally, we apply NMC to different cancer datasets [30] including breast, kidney and liver cancers and show that using the NMC network, we can infer gene modules that are significantly associated with survival times of individuals while they are not detected using linear association measures.

2 Maximal Correlation

In this section, we introduce notation and review prior work on maximal correlation.

2.1 Notation

Suppose $X_1$ and $X_2$ are two random variables defined on probability space $(\Omega, \mathcal{F}, P)$ taking values in $(\mathcal{X}_1, \mathcal{B}_1)$ and $(\mathcal{X}_2, \mathcal{B}_2)$, respectively. The map $X_i : (\Omega, \mathcal{F}) \to (\mathcal{X}_i, \mathcal{B}_i)$ generates the subalgebra $\mathcal{F}_i = X_i^{-1}(\mathcal{B}_i)$ of $\mathcal{F}$. Let $P_{X_i}$ be the restriction of the measure $P$ on $\mathcal{F}_i$, $i = 1, 2$. For discrete variables, $\mathcal{X}_1$ and $\mathcal{X}_2$ are their finite support sets with cardinalities $|\mathcal{X}_i|$, for $i = 1, 2$.

2.2 Definition and General Properties

Pearson's linear correlation coefficient between real-valued variables $X_1$ and $X_2$ is defined as

$$\mathrm{cor}(X_1, X_2) = \frac{E\big[(X_1 - E[X_1])\,(X_2 - E[X_2])\big]}{\sqrt{\mathrm{var}(X_1)}\,\sqrt{\mathrm{var}(X_2)}},$$

where $\mathrm{var}(X_i)$ represents the variance of random variable $X_i$, for $i = 1, 2$. Correlation-based measures capture linear associations between variables, ignoring possible nonlinear relationships.

Example 1 Suppose $X_1$ is a Gaussian variable with zero mean and unit variance. Let $X_2 = X_1^2$. In this case, even though the variables are strongly associated with each other, the correlation coefficient between them is close to zero (see e.g. Figure 1-a). This is because these variables are related through a nonlinear transformation. One way to capture such a nonlinear relationship between these variables is to quantify the maximum correlation between their, possibly nonlinear, transformations. In this example, suppose $\phi_1(X_1) = \alpha_{11} X_1^2 + \alpha_{12}$ and $\phi_2(X_2) = \alpha_{21} X_2 + \alpha_{22}$, where coefficients $\alpha_{ij}$ are selected so that $\phi_i(X_i)$ has zero mean and unit variance, for both $i = 1, 2$. In this case, the correlation coefficient between transformed variables is one (see e.g. Figure 1-b), capturing a strong nonlinear association between variables $X_1$ and $X_2$.

Maximal correlation (MC) between variables $X_1$ and $X_2$, which was introduced by Gebelein [8], captures a nonlinear association between them by selecting, possibly nonlinear, transformation functions $\phi_1(X_1)$ and $\phi_2(X_2)$ so that $\phi_1(X_1)$ and $\phi_2(X_2)$ have the highest correlation among all other transformation functions with zero means and unit variances.
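Example 1 can be reproduced numerically. The following is a minimal sketch, assuming NumPy; the sample size, the random seed, and the helper name `standardize` are illustrative choices and not part of the original text. It estimates the linear correlation between $X_1$ and $X_2 = X_1^2$ from samples, then the correlation between the standardized transformations $\phi_1(X_1)$ and $\phi_2(X_2)$ described in the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
n_samples = 100_000              # assumed sample size

x1 = rng.standard_normal(n_samples)  # X_1: zero mean, unit variance Gaussian
x2 = x1 ** 2                         # X_2 = X_1^2

def standardize(v):
    """Shift and scale v to zero sample mean and unit sample variance."""
    return (v - v.mean()) / v.std()

# Pearson correlation of the raw variables.
linear_corr = float(np.corrcoef(x1, x2)[0, 1])

# phi_1(X_1) = alpha_11 X_1^2 + alpha_12 and phi_2(X_2) = alpha_21 X_2 + alpha_22,
# with the alphas fixed by the zero-mean, unit-variance constraints.
phi1 = standardize(x1 ** 2)
phi2 = standardize(x2)
transformed_corr = float(np.mean(phi1 * phi2))

print(f"linear correlation:      {linear_corr:.3f}")
print(f"transformed correlation: {transformed_corr:.3f}")
```

The linear correlation is small because $E[X_1^3] = 0$ for a symmetric distribution, while the transformed variables coincide sample by sample, so their correlation equals one up to floating-point error.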

Figure 1: (a) Samples of variables $X_1$ and $X_2$ with the nonlinear relationship considered in Example 1. (b) $\phi_1(X_1)$ and $\phi_2(X_2)$ are transformed variables, capturing the nonlinear relationship between $X_1$ and $X_2$.

Definition 1 (Maximal Correlation) Maximal correlation between two random variables $X_1$ and $X_2$ is defined as

$$\rho(X_1, X_2) \triangleq \max_{\phi_1, \phi_2} E[\phi_1(X_1)\, \phi_2(X_2)], \tag{2.1}$$

subject to $\phi_i(X_i) : \Omega \to \mathbb{R}$ is measurable, $E[\phi_i(X_i)] = 0$, and $E[\phi_i(X_i)^2] = 1$, for $i = 1, 2$. (Here $\phi_i$ is a mapping from $\mathcal{X}_i$ to $\mathbb{R}$ and $X_i$ is a mapping from $\Omega$ to $\mathcal{X}_i$. Thus, we have $\phi_i(X_i) = \phi_i \circ X_i : \Omega \to \mathbb{R}$.)

For $i = 1, 2$, let $\phi_i^*(X_i)$ denote an optimal solution of (2.1). Maximal correlation $\rho(X_1, X_2)$ is always between 0 and 1, where a high MC value indicates a strong association between two variables [8]. The study of maximal correlation and other principal inertia components between two variables dates back to Hirschfeld [9], Gebelein [8], Sarmanov [10], Rényi [11], and Greenacre [12]. Recently, MC has been used in information theory and applied probability problems such as data processing and inference of common randomness, among others [10, 14, 22, 31, 32]. Unlike linear correlation, MC only depends on the joint distribution of variables $P_{X_1,X_2}(\cdot,\cdot)$, and not on their alphabets $\mathcal{X}_i$.

Several works have investigated different aspects of optimization (2.1) for both discrete [13] and continuous [14] random variables. In particular, the existence of an optimal solution for the MC optimization and the uniqueness of such solutions have been investigated in [13]. Reference [14] has used projections over Hilbert spaces to compute MC for Gaussian variables. We extend this approach to derive existing MC results for discrete variables. In the next section, we use a similar approach based on Hilbert projections to characterize network maximal correlation for both discrete and jointly Gaussian variables.

Definition 2 For $i = 1, 2$, we define a Hilbert space $\mathcal{H}_i$ as

$$\mathcal{H}_i = \{\phi_i(X_i) : \phi_i(X_i) \text{ is measurable},\ E[\phi_i(X_i)] = 0,\ E[(\phi_i(X_i))^2] < \infty\},$$

where the inner product is defined as $\langle \phi_i, \phi_i' \rangle \triangleq E[\phi_i(X_i)\, \phi_i'(X_i)]$.

Since every Hilbert space has an orthonormal basis (Theorem 2.4, [33]), we let $\{\psi_{1,i}\}_{i=1}^{\infty}$ and $\{\psi_{2,i}\}_{i=1}^{\infty}$ be corresponding orthonormal bases of $\mathcal{H}_1$ and $\mathcal{H}_2$, respectively. Consider the following optimization:

$$\max_{a_{1,i},\, a_{2,j}} \ \sum_{i,j} a_{1,i}\, a_{2,j}\, \rho_{ij} \tag{2.2}$$
$$\sum_{j=1}^{\infty} a_{i,j}^2 = 1, \quad i = 1, 2,$$
$$\sum_{j=1}^{\infty} a_{i,j}\, E[\psi_{i,j}(X_i)] = 0, \quad i = 1, 2,$$

where $\rho_{ij} \triangleq E[\psi_{1,i}(X_1)\, \psi_{2,j}(X_2)]$.

Proposition 1 Suppose $\phi_i^*(\cdot)$ and $a_{i,j}^*$ are optimal solutions of optimizations (2.1) and (2.2), respectively. Then, we have

$$\phi_i^*(x) = \sum_{j=1}^{\infty} a_{i,j}^*\, \psi_{i,j}(x). \tag{2.3}$$

Moreover, the joint probability distribution can be written as

$$P_{X_1 X_2}(x_1, x_2) = \sum_{i,j} \rho_{ij}\, \psi_{1,i}(x_1)\, \psi_{2,j}(x_2).$$

Proof A proof is presented in Section 10.1.

Proposition 1 provides an alternative optimization (2.2) to solve the maximal correlation problem (2.1). Selecting appropriate orthonormal bases for Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ is critical to obtaining a tractable optimization (2.2). In the following, we use Proposition 1 to solve the maximal correlation optimization for general discrete variables as well as for jointly Gaussian variables.

Example 2 (MC for Discrete Random Variables) Suppose $X_1$ and $X_2$ are two discrete random variables with a joint probability function $P_{X_1,X_2}(\cdot,\cdot)$. Let $\{1, \ldots, |\mathcal{X}_1|\}$ and $\{1, \ldots, |\mathcal{X}_2|\}$ be the alphabets of random variables $X_1$ and $X_2$, respectively. We choose the following orthonormal bases for $\mathcal{H}_1$ and $\mathcal{H}_2$:

$$\psi_{1,i}(x) = \frac{\mathbf{1}\{x = i\}}{\sqrt{P_{X_1}(i)}} \quad \text{and} \quad \psi_{2,j}(x) = \frac{\mathbf{1}\{x = j\}}{\sqrt{P_{X_2}(j)}}.$$

By these selections of bases, we have

$$\rho_{ij} = E[\psi_{1,i}(X_1)\, \psi_{2,j}(X_2)] = \frac{P_{X_1 X_2}(i, j)}{\sqrt{P_{X_1}(i)\, P_{X_2}(j)}}.$$

Moreover, we have $E[\psi_{i,j}(X_i)] = \sqrt{P_{X_i}(j)}$, $i = 1, 2$. Thus, optimization (2.2) is simplified to the following optimization:

$$\max_{a_{1,i},\, a_{2,j}} \ \sum_{i,j} a_{1,i}\, a_{2,j}\, \frac{P_{X_1,X_2}(i, j)}{\sqrt{P_{X_1}(i)\, P_{X_2}(j)}} \tag{2.4}$$
$$\sum_{j=1}^{|\mathcal{X}_i|} (a_{i,j})^2 = 1, \quad i = 1, 2,$$
$$\sum_{j=1}^{|\mathcal{X}_i|} a_{i,j}\, \sqrt{P_{X_i}(j)} = 0, \quad i = 1, 2.$$

According to Proposition 1, to solve MC optimization (2.1) it is sufficient to find an optimal solution of optimization (2.4). In the following, we show that an optimal solution of optimization (2.4) can be computed in a closed form using matrix spectral decomposition. Define the normalized joint distribution matrix as

$$Q(i, j) \triangleq \frac{P_{X_1,X_2}(i, j)}{\sqrt{P_{X_1}(i)\, P_{X_2}(j)}}, \tag{2.5}$$

whose size is $|\mathcal{X}_1| \times |\mathcal{X}_2|$. Let

$$a_1 \triangleq (a_{1,1}, a_{1,2}, \ldots, a_{1,|\mathcal{X}_1|})^T \quad \text{and} \quad a_2 \triangleq (a_{2,1}, a_{2,2}, \ldots, a_{2,|\mathcal{X}_2|})^T$$

be coefficient vectors. Moreover, let

$$\sqrt{p_1} \triangleq \big(\sqrt{P_{X_1}(1)}, \sqrt{P_{X_1}(2)}, \ldots, \sqrt{P_{X_1}(|\mathcal{X}_1|)}\big)^T \tag{2.6}$$
$$\sqrt{p_2} \triangleq \big(\sqrt{P_{X_2}(1)}, \sqrt{P_{X_2}(2)}, \ldots, \sqrt{P_{X_2}(|\mathcal{X}_2|)}\big)^T$$

be vectors of square roots of marginal probabilities. Optimization (2.4) can be re-written as follows:

$$\max_{a_1, a_2} \ a_1^T Q\, a_2 \tag{2.7}$$
$$\|a_i\|_2 = 1, \quad i = 1, 2,$$
$$a_i \perp \sqrt{p_i}, \quad i = 1, 2.$$

In the following, we show that optimal coefficient vectors $a_1$ and $a_2$ of optimization (2.7) are equal to the left and right singular vectors of the matrix $Q$ corresponding to its second largest singular value. Moreover, the optimal value (the maximal correlation between two variables $X_1$ and $X_2$) is equal to the second largest singular value of the matrix $Q$. To show this, we define random variables $Z_1$ and $Z_2$ such that

$$P\left[Z_1 = \frac{a_{1,i}}{\sqrt{P_{X_1}(i)}},\ Z_2 = \frac{a_{2,j}}{\sqrt{P_{X_2}(j)}}\right] = P_{X_1,X_2}(i, j),$$

where $\|a_1\| = 1$ and $\|a_2\| = 1$. Using the Cauchy-Schwarz inequality, we have that

$$a_1^T Q\, a_2 = E[Z_1 Z_2] \le \sqrt{E[Z_1^2]\, E[Z_2^2]} = \|a_1\|\, \|a_2\| = 1.$$

Therefore, the maximum singular value of $Q$ is at most one. Using (2.5), one can see that the left and right singular vectors of $Q$ with the singular value one are $\sqrt{p_1}$ and $\sqrt{p_2}$, respectively, since $Q\sqrt{p_2} = \sqrt{p_1}$ and $Q^T\sqrt{p_1} = \sqrt{p_2}$. Thus, the feasible set of optimization (2.7) includes unit-norm vectors orthogonal to the leading singular vectors of $Q$. Thus, the optimal value is equal to the second largest singular value, and optimal vectors $a_1$ and $a_2$ are the left and right singular vectors corresponding to the second largest singular value.

Example 3 (MC for Jointly Gaussian Random Variables) This example is studied in reference [14] to compute MC between two Gaussian variables. In Section 6, we use a similar approach to characterize network maximal correlation for jointly Gaussian variables. Suppose $(X_1, X_2)$ are jointly Gaussian variables with the correlation coefficient $\rho$. The $k$-th Hermite-Chebyshev polynomial is defined as

$$\Psi_k(x) = (-1)^k\, e^{x^2}\, \frac{d^k}{dx^k}\, e^{-x^2}. \tag{2.8}$$

These polynomials form an orthonormal basis with respect to Gaussian distributions. That is,

$$\iint \Psi_i(x_1)\, \Psi_j(x_2)\, f(x_1, x_2)\, dx_1\, dx_2 = \rho^i\, \mathbf{1}_{i=j}, \tag{2.9}$$

where $f(x_1, x_2)$ is the joint density function of Gaussian variables with correlation $\rho$, and $\mathbf{1}_{i=j}$ is one when $i = j$, otherwise it is zero. Let $\psi_{i,j}$ be the $j$-th Hermite-Chebyshev polynomial, for $i = 1, 2$. Using (2.9), we have

$$\rho_{ij} = E[\psi_{1,i}(X_1)\, \psi_{2,j}(X_2)] = \rho^i\, \mathbf{1}_{i=j}.$$

Moreover, we have

$$E[\psi_{i,j}(X_i)] = \mathbf{1}_{j=0}, \quad i = 1, 2, \tag{2.10}$$

because all of these functions for $j \ge 1$ have zero means over a Gaussian distribution. Therefore, optimization (2.2) can be written as

$$\max \ \sum_{i=0}^{\infty} a_{1,i}\, a_{2,i}\, \rho^i \tag{2.11}$$
$$\sum_{j=0}^{\infty} (a_{i,j})^2 = 1, \quad i = 1, 2,$$
$$a_{i,0} = 0, \quad i = 1, 2.$$

Since $|\rho| \le 1$, an optimal solution of optimization (2.11) is obtained when $a_{1,1}^2 = 1$ and $a_{2,1}^2 = 1$, while the other coefficients are equal to zero. The signs of $a_{i,1}$ for $i = 1, 2$ are determined so that $a_{1,1}\, a_{2,1}\, \rho = |\rho|$. This leads to the maximal correlation between the two variables being equal to the absolute value of the correlation coefficient between them when the two random variables are jointly Gaussian. Moreover, optimal transformation functions are $\phi_i^*(X_i) = a_{i,1}\, \psi_{i,1} = \pm X_i$, $i = 1, 2$, where signs are selected so that $a_{1,1}\, a_{2,1}\, \rho = |\rho|$.

3 Statistical Properties of Maximal Correlation

In many applications, often only noisy samples of joint distributions are observed. In this section, we prove a finite sample generalization bound and error guarantees for maximal correlation of discrete random variables. Specifically, under some general conditions, we prove that

• maximal correlation is a continuous measure with respect to the joint probability distribution. That is, a small perturbation in the distribution results in a small change in the MC value.

• sample maximal correlation between two variables, computed using $m$ samples from the joint distribution, converges exponentially fast to the MC value, as $m$ grows.

These properties establish maximal correlation as a robust association measure to capture nonlinear dependencies between variables in real-world applications.

Throughout this section we only consider discrete random variables and assume that all alphabet elements $x_i \in \mathcal{X}_i$ have positive probabilities (otherwise they can be neglected without loss of generality). That is, if

$$\delta_i \triangleq \min_{x_i \in \mathcal{X}_i} P_{X_i}(x_i), \quad i = 1, 2, \tag{3.1}$$

then $\delta(P) \triangleq \min\{\delta_1, \delta_2\} > 0$. The empirical distribution of these variables using $m$ observed samples is defined as

$$P^{(m)}(x_1, x_2) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{x_1^{(i)} = x_1,\ x_2^{(i)} = x_2\},$$

where $\{(x_1^{(i)}, x_2^{(i)})\}_{i=1}^{m}$ are i.i.d. samples drawn according to a distribution $P_{X_1,X_2}$. The vector of observed samples of variable $X_i$ is denoted by $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(m)})$. For any vector $v = (v_1, \ldots, v_d) \in \mathbb{R}^d$ and $p \ge 1$, we let $\|v\|_p$ represent the standard $p$-norm of the vector $v$, defined as

$$\|v\|_p = \Big(\sum_{i=1}^{d} |v_i|^p\Big)^{1/p}.$$

For $p = 2$, we drop the subscript if no confusion arises, i.e., $\|v\| = \|v\|_2$.

3.1 Continuity of Maximal Correlation

Let $P_{X_1,X_2}(\cdot,\cdot)$ and $\tilde{P}_{X_1,X_2}(\cdot,\cdot)$ be two distributions over alphabets $(\mathcal{X}_1, \mathcal{X}_2)$ with the corresponding MC values $\rho$ and $\tilde{\rho}$, respectively. In the following, we show that if the distance between $P$ and $\tilde{P}$ is small (i.e., $\|P - \tilde{P}\| \le \epsilon$), their corresponding MC values ($\rho$ and $\tilde{\rho}$) are close to each other as well.

Theorem 1 Let $\|P - \tilde{P}\| \le \epsilon$, for some $\epsilon > 0$. Then, we have

$$|\rho - \tilde{\rho}| \le \frac{2\epsilon}{\delta^2}\, D^{3/2}, \tag{3.2}$$

where $D \triangleq \max\{|\mathcal{X}_1|, |\mathcal{X}_2|\}$, and $\delta \triangleq \min(\delta(P), \delta(\tilde{P}))$.

Proof A proof is presented in Section 10.2.

The sketch of the proof is as follows: the normalized joint distribution matrix $Q$ (2.5) can be written as

$$Q = D_{X_1}(P)^{-1/2}\, P_{X_1,X_2}\, D_{X_2}(P)^{-1/2}, \tag{3.3}$$

where $D_{X_i}(P)$ denotes a diagonal matrix whose diagonal is $P_{X_i}$, for $i = 1, 2$. Since the matrix $Q$ is a continuous function of $P_{X_1,X_2}(\cdot,\cdot)$, its singular values (and therefore its second largest singular value) are continuous functions of $P$ as well.

3.2 Sample Maximal Correlation

Let $\{(x_1^{(i)}, x_2^{(i)})\}_{i=1}^{m}$ be i.i.d. samples drawn according to a joint probability distribution $P_{X_1,X_2}(\cdot,\cdot)$. Suppose $P^{(m)}(\cdot,\cdot)$ denotes the empirical distribution obtained from these samples. Maximal correlation computed using this empirical probability distribution is called Sample Maximal Correlation and is denoted by $\rho_m(X_1, X_2)$. In the following, we show that $\rho_m(X_1, X_2)$ converges to $\rho(X_1, X_2)$ exponentially fast, as $m \to \infty$.
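Sample maximal correlation for discrete data can be sketched directly from the definitions above: build the empirical distribution $P^{(m)}$, normalize it as in (2.5), and read off the second largest singular value as in Example 2. The code below is an illustrative sketch assuming NumPy; the function names `maximal_correlation` and `sample_maximal_correlation` and the 2-by-2 test distribution are hypothetical choices made for this example.

```python
import numpy as np

def maximal_correlation(P):
    """MC of a joint probability matrix P (rows index X_1, columns index X_2),
    computed as the second largest singular value of Q in (2.5)."""
    P = np.asarray(P, dtype=float)
    # Drop zero-probability symbols, matching the positivity assumption of Section 3.
    P = P[P.sum(axis=1) > 0][:, P.sum(axis=0) > 0]
    if min(P.shape) < 2:
        return 0.0  # a constant variable has no nontrivial transformation
    p1, p2 = P.sum(axis=1), P.sum(axis=0)
    Q = P / np.sqrt(np.outer(p1, p2))
    singular_values = np.linalg.svd(Q, compute_uv=False)  # sorted in descending order
    return float(singular_values[1])

def sample_maximal_correlation(x1, x2):
    """Sample MC: plug the empirical joint distribution into maximal_correlation."""
    _, i1 = np.unique(np.asarray(x1).ravel(), return_inverse=True)
    _, i2 = np.unique(np.asarray(x2).ravel(), return_inverse=True)
    counts = np.zeros((int(i1.max()) + 1, int(i2.max()) + 1))
    np.add.at(counts, (i1, i2), 1.0)  # contingency table of the observed pairs
    return maximal_correlation(counts / counts.sum())

# Illustrative joint distribution (an assumption made for this example).
P_true = np.array([[0.4, 0.1],
                   [0.1, 0.4]])
rng = np.random.default_rng(0)
for m in (100, 1_000, 10_000, 100_000):
    idx = rng.choice(4, size=m, p=P_true.ravel())
    x1, x2 = idx // 2, idx % 2
    print(m, round(sample_maximal_correlation(x1, x2), 4),
          "true:", round(maximal_correlation(P_true), 4))
```

For this symmetric 2-by-2 distribution the normalized matrix is $Q = 2P$, whose singular values are 1 and 0.6, so the true MC is 0.6; the printed sample values approach it as $m$ grows.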

Theorem 2 For any distribution $P$, and any $\epsilon > 0$,

$$P\big[\,|\rho_m(X_1, X_2) - \rho(X_1, X_2)| > \epsilon\,\big] \to 0,$$

exponentially fast. More precisely, if

$$m \ge \frac{3 D}{\delta(P)^2\, \epsilon}\, \log\left(\frac{24}{\eta}\right),$$

then

$$P\big[\,|\rho_m(X_1, X_2) - \rho(X_1, X_2)| > \epsilon\,\big] \le \eta,$$

where $D = \max\{|\mathcal{X}_1|, |\mathcal{X}_2|\}$. The bound can also be written as

$$P\big[\,|\rho_m(X_1, X_2) - \rho(X_1, X_2)| > \epsilon\,\big] \le 24 \exp\left(-\frac{m\, \delta(P)^2\, \epsilon}{3 D}\right).$$

Proof A proof is presented in Section 10.3.

The proof follows from the facts that maximal correlation is a continuous function of the input distribution according to Theorem 1, and the empirical distribution converges exponentially fast to the true distribution.

4 Network Maximal Correlation

4.1 Definition and General Properties

In this section, we introduce Network Maximal Correlation (NMC) as a fundamental measure to capture nonlinear associations over networks. Let $G = (V, E)$ be a graph with $n$ nodes and $|E|$ edges. The graph $G$ is un-weighted, does not have self-loops, and can be directed or undirected. Each node $i$ is assigned to a random variable $X_i$. Here, we introduce NMC without assuming a specific relationship among node variables and the graph structure. We discuss this relationship in different applications of NMC in Sections 6, 7, and 8. NMC infers the best nonlinear transformation functions assigned to each node variable so that the total pairwise correlation over the network is maximized.

Suppose $X_1, \ldots, X_n$ are $n$ random variables defined on probability space $(\Omega, \mathcal{F}, P)$, where $X_i$ takes values in $(\mathcal{X}_i, \mathcal{B}_i)$. The map $X_i : (\Omega, \mathcal{F}) \to (\mathcal{X}_i, \mathcal{B}_i)$ generates the subalgebra $\mathcal{F}_i = X_i^{-1}(\mathcal{B}_i)$ of $\mathcal{F}$. Let $P_{X_i}$ be the restriction of the measure $P$ on $\mathcal{F}_i$, $i = 1, \ldots, n$. For discrete variables, $\mathcal{X}_1, \ldots, \mathcal{X}_n$ are their finite support sets with cardinalities $n_i = |\mathcal{X}_i|$, for $i = 1, \ldots, n$.

Definition 3 (Network Maximal Correlation) Network maximal correlation among variables $X_1, \ldots, X_n$ connected by a graph $G = (V, E)$ is defined as

$$\rho_G(X_1, \ldots, X_n) \triangleq \max_{\phi_1, \ldots, \phi_n} \ \sum_{(i,j) \in E} E[\phi_i(X_i)\, \phi_j(X_j)], \tag{4.1}$$

subject to $\phi_i(X_i) : \Omega \to \mathbb{R}$ is measurable, $E[\phi_i(X_i)] = 0$, and $E[\phi_i(X_i)^2] = 1$, for $1 \le i \le n$.

Optimization (4.1) maximizes total pairwise correlation over the network without distinguishing between positive and negative correlations. In some applications, the strength of an association does not depend on the sign of the correlation coefficient. In those cases, one can re-write the NMC optimization (4.1) to maximize the total absolute pairwise correlations over the network as follows:

Definition 4 (Absolute Network Maximal Correlation) Consider the following optimization:

$$\max_{\phi_1, \ldots, \phi_n} \ \sum_{(i,j) \in E} \big|E[\phi_i(X_i)\, \phi_j(X_j)]\big|, \tag{4.2}$$

subject to $\phi_i(X_i) : \Omega \to \mathbb{R}$ is measurable, $E[\phi_i(X_i)] = 0$ and $E[\phi_i^2(X_i)] = 1$, for any $1 \le i \le n$. We refer to this optimization as an absolute NMC optimization.

Let $\phi_i^*(\cdot)$ be an optimal solution of the NMC optimization (4.1) (in Proposition 2 we prove the existence of such a solution). Then, an edge maximal correlation between variables $i$ and $j$ is defined as

$$\rho_G(X_i, X_j) \triangleq E[\phi_i^*(X_i)\, \phi_j^*(X_j)], \tag{4.3}$$

where $(i, j) \in E$. Unlike the maximal correlation formulation of (2.1), transformation functions in optimization (4.1) are constrained by the network structure. Therefore, an edge maximal correlation between variables $X_i$ and $X_j$ is always smaller than or equal to their maximal correlation, i.e., $\rho_G(X_i, X_j) \le \rho(X_i, X_j)$.

Computation of maximal correlation for each edge independently results in two nonlinear functions assigned to the nodes of that edge. Therefore, if the network has $|E|$ edges, it will result in inference of $2|E|$ possibly nonlinear functions. In that setup, each node can be associated with different nonlinear transformation functions, which can raise over-fitting issues, particularly for dense networks. On the other hand, in the NMC formulation of (4.1), we assign a single function to each node in the graph. Therefore, optimization (4.1) results in $n$ possibly nonlinear functions.

Lemma 1 The NMC optimization (4.1) is equivalent to the following MSE optimization:

$$\min_{\phi_1, \ldots, \phi_n} \ \frac{1}{2} \sum_{(i,j) \in E} E\big[(\phi_i(X_i) - \phi_j(X_j))^2\big], \tag{4.4}$$

where $E[\phi_i(X_i)] = 0$ and $E[\phi_i^2(X_i)] = 1$, for any $1 \le i \le n$.

Proof A proof is presented in Section 10.4.

Similarly to Definition 2, for $i = 1, 2, \ldots, n$, we define a Hilbert space $\mathcal{H}_i$ as

$$\mathcal{H}_i = \{\phi_i(X_i) : \phi_i(X_i) \text{ is measurable},\ E[\phi_i(X_i)] = 0,\ E[(\phi_i(X_i))^2] < \infty\},$$

where the inner product is defined as $\langle \phi_i, \phi_i' \rangle \triangleq E[\phi_i(X_i)\, \phi_i'(X_i)]$. The following proposition shows the existence of optimal transformations of the NMC optimization (4.1):

Proposition 2 Under the assumption that Hilbert spaces $\mathcal{H}_i$ are compact, there exist functions $\phi_i^*$ such that $E[\phi_i^*(X_i)] = 0$ and $E[\phi_i^*(X_i)^2] = 1$ for $1 \le i \le n$, that achieve the optimal value of optimization (4.1).

Proof A proof is presented in Section 10.5.

The assumption that Hilbert spaces $\mathcal{H}_i$ are compact holds when the $X_i$ are discrete random variables with finite support, or when the $X_i$ are jointly Gaussian random variables.
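The equivalence stated in Lemma 1 can be seen in one step by expanding the square and using the unit-variance constraints. The derivation below is a reading aid that relies only on (4.1) and (4.4); the proof in Section 10.4 may be organized differently.

```latex
\begin{align*}
\frac{1}{2}\sum_{(i,j)\in E} E\big[(\phi_i(X_i)-\phi_j(X_j))^2\big]
&= \frac{1}{2}\sum_{(i,j)\in E}
   \Big(E[\phi_i(X_i)^2] + E[\phi_j(X_j)^2] - 2\,E[\phi_i(X_i)\,\phi_j(X_j)]\Big) \\
&= \frac{1}{2}\sum_{(i,j)\in E}\Big(2 - 2\,E[\phi_i(X_i)\,\phi_j(X_j)]\Big) \\
&= |E| - \sum_{(i,j)\in E} E[\phi_i(X_i)\,\phi_j(X_j)].
\end{align*}
```

Since $|E|$ does not depend on the transformation functions, minimizing the objective of (4.4) over the feasible set is the same as maximizing the objective of (4.1).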

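Let $P_i$ denote the projection operation from the space $\mathcal{H}_j$ (for any $j \ne i$) onto $\mathcal{H}_i$, for any $1 \le i \le n$. According to Lemma 5, this projection can be characterized using conditional expectations as follows: for a random variable $\phi_j \in \mathcal{H}_j$, we have

$$P_i \phi_j = \arg\min_{\phi_i \in \mathcal{H}_i} E\big[(\phi_i - \phi_j)^2\big] = \frac{E[\phi_j \mid X_i]}{\|E[\phi_j \mid X_i]\|_2}.$$

The following proposition characterizes optimal NMC transformation functions using projection operators:

Proposition 3 Optimal transformation functions of NMC optimization (4.1), $\{\phi_i^*,\ 1 \le i \le n\}$, satisfy

$$\phi_i^* = \frac{\sum_{j \in N(i)} P_i \phi_j^*}{\big\|\sum_{j \in N(i)} P_i \phi_j^*\big\|} = \frac{E\big[\sum_{j \in N(i)} \phi_j^* \mid X_i\big]}{\big\|E\big[\sum_{j \in N(i)} \phi_j^* \mid X_i\big]\big\|}, \tag{4.5}$$

where $N(i)$ represents the neighbors of node $i$ in the graph $G = (V, E)$.

Proof A proof is presented in Section 10.6.

Note A similar approach can be used to characterize the absolute NMC optimization (4.2) by introducing extra variables $s_{i,j}$ to represent correlation signs of edges:

$$\max \ \sum_{(i,j) \in E} s_{i,j}\, E[\phi_i(X_i)\, \phi_j(X_j)] \tag{4.6}$$
$$E[\phi_i(X_i)] = 0, \quad E[\phi_i^2(X_i)] = 1, \quad 1 \le i \le n.$$

In this case, similarly to Proposition 3, we can write

$$\phi_i^* = \frac{\sum_{j \in N(i)} s_{i,j}\, P_i \phi_j^*}{\big\|\sum_{j \in N(i)} s_{i,j}\, P_i \phi_j^*\big\|},$$

where $s_{i,j} = \mathrm{sign}\big(E[\phi_i^*(X_i)\, \phi_j^*(X_j)]\big)$.

Proposition 3 characterizes a property of optimal transformations of NMC (4.1) using projection operations without explicitly computing the optimal NMC solution. In the following, we use orthonormal representations of the Hilbert spaces $\mathcal{H}_i$ and propose a constructive approach to solve the NMC optimization. Recall that $\{\psi_{i,j}\}_{j=1}^{\infty}$ represents an orthonormal basis for $\mathcal{H}_i$. Consider the following optimization:

$$\max \ \sum_{(i,i') \in E}\ \sum_{j,j'} a_{i,j}\, a_{i',j'}\, \rho_{i,i'}^{j,j'} \tag{4.7}$$
$$\sum_{j=1}^{\infty} a_{i,j}^2 = 1, \quad 1 \le i \le n,$$
$$\sum_{j=1}^{\infty} a_{i,j}\, E[\psi_{i,j}(X_i)] = 0, \quad 1 \le i \le n,$$

where $\rho_{i,i'}^{j,j'} \triangleq E[\psi_{i,j}(X_i)\, \psi_{i',j'}(X_{i'})]$.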

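Theorem 3 Suppose $\phi_i^*(\cdot)$ and $a_{i,j}^*$ are optimal solutions of optimizations (4.1) and (4.7), respectively. Then, we have

$$\phi_i^*(x) = \sum_{j=1}^{\infty} a_{i,j}^*\, \psi_{i,j}(x). \tag{4.8}$$

Proof A proof is presented in Section 10.7.

Similarly to the case of two variables discussed in Section 2.2, selecting appropriate orthonormal bases for the Hilbert spaces $\mathcal{H}_i$ is critical to obtaining a tractable optimization (4.7). In the following, we consider the NMC optimization for discrete variables, while the Gaussian case is discussed in Section 6.

Example 4 (NMC for Discrete Random Variables) Suppose $X_i$ is a discrete random variable with alphabet $\{1, \ldots, |\mathcal{X}_i|\}$. Similarly to Example 2, let

$$\psi_{i,j}(x) = \frac{\mathbf{1}\{x = j\}}{\sqrt{P_{X_i}(j)}}$$

be an orthonormal basis for $\mathcal{H}_i$. Thus, we have

$$\rho_{i,i'}^{j,j'} = E[\psi_{i,j}(X_i)\, \psi_{i',j'}(X_{i'})] = \frac{P_{X_i X_{i'}}(j, j')}{\sqrt{P_{X_i}(j)\, P_{X_{i'}}(j')}}.$$

Therefore, optimization (4.7) is simplified to the following optimization:

$$\max \ \sum_{(i,i') \in E}\ \sum_{j,j'} a_{i,j}\, a_{i',j'}\, \frac{P_{X_i X_{i'}}(j, j')}{\sqrt{P_{X_i}(j)\, P_{X_{i'}}(j')}} \tag{4.9}$$
$$\sum_{j=1}^{|\mathcal{X}_i|} (a_{i,j})^2 = 1, \quad 1 \le i \le n,$$
$$\sum_{j=1}^{|\mathcal{X}_i|} a_{i,j}\, \sqrt{P_{X_i}(j)} = 0, \quad 1 \le i \le n.$$

Similarly to Example 2, we define the matrix $Q_{i,i'}$ as

$$Q_{i,i'}(j, j') \triangleq \frac{P_{X_i,X_{i'}}(j, j')}{\sqrt{P_{X_i}(j)\, P_{X_{i'}}(j')}}, \tag{4.10}$$

whose size is $|\mathcal{X}_i| \times |\mathcal{X}_{i'}|$. Moreover, recall that for $i = 1, \ldots, n$, we have

$$a_i = (a_{i,1}, a_{i,2}, \ldots, a_{i,|\mathcal{X}_i|})^T,$$
$$\sqrt{p_i} = \big(\sqrt{P_{X_i}(1)}, \sqrt{P_{X_i}(2)}, \ldots, \sqrt{P_{X_i}(|\mathcal{X}_i|)}\big)^T.$$

Therefore, optimization (4.9) can be re-written as follows:

$$\max \ \sum_{(i,i') \in E} a_i^T\, Q_{i,i'}\, a_{i'} \tag{4.11}$$
$$\|a_i\|_2 = 1, \quad 1 \le i \le n,$$
$$a_i \perp \sqrt{p_i}, \quad 1 \le i \le n.$$

Optimization (4.11) is neither convex nor concave in general. In Section 5.2, we show that this optimization is an instance of the standard Maximum Correlation Problem (MCP) proposed by Hotelling [24, 25]. By making this connection, we use established techniques for solving the Multivariate Eigenvalue Problem (MEP) to solve optimization (4.11).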
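One way to read Proposition 3 in the discrete setting of Example 4 is as a fixed-point iteration on the coefficient vectors of (4.11): each $a_i$ is replaced in turn by the normalized sum of $Q_{i,i'}\, a_{i'}$ over its neighbors. The sketch below assumes NumPy and is only an illustration of that reading, not necessarily the algorithm the authors develop later. The function names, the dictionary-based graph encoding, and the iteration count are hypothetical choices. Because the objective is linear in each $a_i$ when the others are fixed, each update cannot decrease it, but the iteration may stop at a local optimum.

```python
import numpy as np

def normalized_joint(P):
    """Q(j, j') = P(j, j') / sqrt(P_row(j) * P_col(j')), as in (4.10)."""
    P = np.asarray(P, dtype=float)
    return P / np.sqrt(np.outer(P.sum(axis=1), P.sum(axis=0)))

def nmc_coordinate_ascent(joints, n_iter=500, seed=0):
    """joints maps an edge (i, k) to the joint probability matrix of (X_i, X_k),
    with rows indexed by X_i. Returns (a, value): coefficient vectors a[i] and the
    objective of (4.11) at the final iterate (a local optimum in general)."""
    rng = np.random.default_rng(seed)
    Q = {edge: normalized_joint(P) for edge, P in joints.items()}
    sqrt_p = {}
    for (i, k), P in joints.items():
        P = np.asarray(P, dtype=float)
        sqrt_p.setdefault(i, np.sqrt(P.sum(axis=1)))
        sqrt_p.setdefault(k, np.sqrt(P.sum(axis=0)))

    def feasible(v, i):
        """Project v onto the complement of sqrt(p_i), then scale to unit norm."""
        v = v - (v @ sqrt_p[i]) * sqrt_p[i]
        norm = np.linalg.norm(v)
        return v / norm if norm > 1e-12 else None

    # Random feasible starting point (assumes every alphabet has at least 2 symbols).
    a = {i: feasible(rng.standard_normal(len(sp)), i) for i, sp in sqrt_p.items()}
    for _ in range(n_iter):
        for i in a:
            s = np.zeros_like(a[i])
            for (u, v), Quv in Q.items():   # sum over the neighbors of node i
                if u == i:
                    s += Quv @ a[v]
                elif v == i:
                    s += Quv.T @ a[u]
            updated = feasible(s, i)
            if updated is not None:          # keep the old a[i] if the sum vanishes
                a[i] = updated
    value = float(sum(a[u] @ Quv @ a[v] for (u, v), Quv in Q.items()))
    return a, value

# Path graph 0 - 1 - 2 in which each variable is a copy of its neighbor.
copy_joint = np.eye(3) / 3.0
_, path_value = nmc_coordinate_ascent({(0, 1): copy_joint, (1, 2): copy_joint})
print(path_value)
```

In the path example both edges can reach correlation one with a single function per node, so the printed value should be close to 2. For a graph with a single edge, the iteration reduces to a power method on $Q$ restricted to the feasible set, and the value approaches the maximal correlation of Example 2.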

4.2 Statistical Properties of NMC

In this part, we investigate the robustness of NMC for discrete variables with finite support against small perturbations of joint probability distributions of variable pairs. Moreover, we show that sample NMC (i.e., NMC computed using empirical distributions) converges to the true NMC value exponentially fast as the sample size increases. To simplify notation, suppose $P_{i,i'}$ is the matrix representation of the joint probability distribution of variables $X_i$ and $X_{i'}$.

Theorem 4 Network maximal correlation is a continuous function of the joint probability distributions $P_{i,i'}$, for all $(i, i') \in E$. Let $\|P_{i,i'} - \tilde{P}_{i,i'}\| \le \epsilon$, for some $\epsilon > 0$, and all $(i, i') \in E$. Then, we have

$$|\rho_G - \tilde{\rho}_G| \le \epsilon\, \frac{|E|\, D}{\delta^2}, \tag{4.12}$$

where $D = \max\{|\mathcal{X}_1|, \ldots, |\mathcal{X}_n|\}$, and $\delta = \min_{1 \le i \le n} \big(\min\{\delta(P_{X_i}), \delta(\tilde{P}_{X_i})\}\big)$.

Proof A proof is presented in Section 10.8.

Next, we show that the sample NMC, denoted by $\rho_m(G)$, converges to the true NMC value $\rho_G$ exponentially fast, as the sample size $m$ increases:

Theorem 5 Sample NMC converges to NMC exponentially fast. In particular, let $\delta = \min_{1 \le i \le n} \delta(P_{X_i})$ and $D = \max\{|\mathcal{X}_1|, \ldots, |\mathcal{X}_n|\}$. Then, for

$$m \ge \left(\frac{24\, |E|^2 D^3}{\epsilon^2 \delta^2}\right) \log\left(\frac{8 \max\{|V|, |E|\}}{\eta}\right), \tag{4.13}$$

we have

$$P\big[\,|\rho_m(G) - \rho_G| > \epsilon\,\big] \le \eta. \tag{4.14}$$

Proof A proof is presented in Section 10.9.

Note that for the case of having two variables, the robustness bounds provided in Theorems 4 and 5 are looser than the bounds provided by Theorems 1 and 2, owing to the generality of the relaxations used in the NMC performance characterization.

4.3 Regularized NMC

In this section, we assume that all variables are real-valued (note that this is not a necessary condition for MC and NMC). The NMC optimization (4.1) results in $n$ possibly nonlinear transformation functions $\phi_i^*(X_i)$ whose distances from the original variables can be arbitrarily large (i.e., $E[\phi_i^*(X_i)\, X_i]$ can be arbitrarily small). In some applications, one may wish to have fewer than $n$ nonlinear transformation functions assigned to variables, or alternatively to control the distances between transformed and original variables. Here, we propose a regularized NMC optimization framework which penalizes distances between optimal transformation functions $\phi_i^*(X_i)$ and the original variables $X_i$. Suppose variables have zero mean and unit variance, i.e., $E[X_i] = 0$ and $E[X_i^2] = 1$.

Definition 5 (Regularized NMC) Regularized NMC among variables $X_1, \ldots, X_n$ connected by a graph $G = (V, E)$ is defined as the solution of the following optimization:

$$\max_{\phi_1, \ldots, \phi_n} \ (1 - \lambda) \sum_{(i,j) \in E} E[\phi_i(X_i)\, \phi_j(X_j)] + \lambda \sum_{i \in V} E[\phi_i(X_i)\, X_i], \tag{4.15}$$

where $E[\phi_i(X_i)] = 0$ and $E[\phi_i^2(X_i)] = 1$, for any $1 \le i \le n$, and $0 \le \lambda \le 1$ is the regularization parameter.

Unlike MC and NMC, which only depend on the joint distributions of variables, the regularized NMC depends on both joint distributions and alphabets of variables because of the regularization term. Moreover, one can define regularized absolute NMC similarly to optimization (4.2). Let the optimal transformation functions computed by optimization (4.15) be $\phi_{i,\lambda}^*$. If $\lambda = 0$, $\phi_{i,\lambda}^* = \phi_i^*$, while if $\lambda = 1$, $\phi_{i,\lambda}^* = X_i$. By varying $\lambda$ between 0 and 1, transformation functions vary from $\phi_i^*$ to $X_i$. Suppose

$$\rho_{G,\lambda}(X_1, \ldots, X_n) \triangleq \sum_{(i,j) \in E} E[\phi_{i,\lambda}^*(X_i)\, \phi_{j,\lambda}^*(X_j)].$$

Therefore, $\rho_{G,0} = \rho_G$ and $\rho_{G,1}$ is the total linear correlation over the network. By the definition of NMC, $\rho_{G,0} \ge \rho_{G,1}$.

5 Computation of MC and NMC

In this section, we first review an existing algorithm to compute MC and then introduce an efficient algorithm to compute NMC. We also propose a parallelizable version of the NMC algorithm based on network partitioning and show that its expected performance is $\epsilon$-away from the true NMC value.

5.1 Computation of Maximal Correlation

Given the joint distribution of variables, one can use Proposition 1 to compute the MC value and optimal transformation functions. In particular, for discrete random variables, Example 2 shows that a solution of optimization (2.1) can be characterized by the second largest singular value of the normalized joint distribution matrix $Q$ (2.5). Given samples of variables (i.e., $\{(x_1^{(i)}, x_2^{(i)})\}_{i=1}^{m}$), one can compute MC using the empirical joint distribution of variables $P^{(m)}(\cdot,\cdot)$. Robustness of MC computation using empirical distributions is discussed in Section 3.

If alphabet sizes of variables are large, forming the joint distribution matrix can be costly. An iterative approach to compute maximal correlation without forming the joint distribution function is based on Alternating Conditional Expectation (ACE) [13]. Briefly, at each iteration, the ACE algorithm computes optimal transformation functions using conditional expectations, assuming that the other transformation function is fixed (in a Gauss-Seidel manner [34]). If the correlation value does not increase by a certain value, the algorithm terminates. We describe the steps of this algorithm in Algorithm 1.

Algorithm 1 Alternating Conditional Expectation
  Initialization: $\phi_1^{(0)}(X_1)$, $\phi_2^{(0)}(X_2)$ with mean zero.
  for $k = 0, 1, \ldots$
    $\phi_1^{(k+1)}(X_1) = E[\phi_2^{(k)}(X_2) \mid X_1]$.
    update: $\phi_1^{(k+1)}(X_1) = \phi_1^{(k+1)}(X_1) \big/ \sqrt{E[(\phi_1^{(k+1)}(X_1))^2]}$
    $\phi_2^{(k+1)}(X_2) = E[\phi_1^{(k)}(X_1) \mid X_2]$.
    update: $\phi_2^{(k+1)}(X_2) = \phi_2^{(k+1)}(X_2) \big/ \sqrt{E[(\phi_2^{(k+1)}(X_2))^2]}$
    update: $\rho^{(k+1)} = E[\phi_1^{(k+1)}(X_1)\, \phi_2^{(k+1)}(X_2)]$
  end

Proposition 4 The sequence $\rho^{(k)}$ generated by Algorithm 1 converges to a local optimum of optimization (2.1). Moreover, if the starting points of Algorithm 1 are such that the vectors

$$\big(\phi_1^{(0)}(1)\sqrt{p_{X_1}(1)}, \ldots, \phi_1^{(0)}(|\mathcal{X}_1|)\sqrt{p_{X_1}(|\mathcal{X}_1|)}\big)$$

and

$$\big(\phi_2^{(0)}(1)\sqrt{p_{X_2}(1)}, \ldots, \phi_2^{(0)}(|\mathcal{X}_2|)\sqrt{p_{X_2}(|\mathcal{X}_2|)}\big)$$

are not orthogonal to the span of the left and right singular vectors corresponding to the second largest singular value of $Q$, then the ACE algorithm converges to the global optimum. Moreover, if the $Q$ matrix has unique singular vectors (left and right) corresponding to the second largest singular value, the optimal transformation functions are unique maximizers of optimization (2.1).

Proof See Theorems 5.4 and 5.5 of reference [13].

5.2 Computation of NMC

In this section, we first establish a connection between the NMC optimization (4.1) and the Maximum Correlation Problem (MCP) and the Multivariate Eigenvalue Problem (MEP) ([23-26]). Then, we deploy techniques used to solve MEP and MCP in order to compute NMC.

The Maximum Correlation Problem (MCP), proposed by Hotelling [24, 25], is to find the linear combination of one set of variables that correlates maximally with the linear combination of another set of variables. This problem is defined as

$$\max_{b} \ \sum_{i,j=1}^{n} b_i^T\, C_{i,j}\, b_j \tag{5.1}$$
$$\|b_i\| = 1, \quad 1 \le i \le n,$$

where $b_i \in \mathbb{R}^{n_i}$ and $C_{i,j} \in \mathbb{R}^{n_i \times n_j}$. Optimization (5.1) is in the standard form of the MCP problem [24, 25].
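As an aside on Section 5.1, the ACE iteration of Algorithm 1 can be sketched directly on samples of two discrete variables, with conditional expectations replaced by per-symbol sample averages so that no joint distribution matrix is formed. This is a minimal sketch assuming NumPy. The names `ace` and `conditional_mean`, the random initialization, and the stopping tolerance are illustrative assumptions. The second update uses the freshly updated $\phi_1$, following the Gauss-Seidel description in the text, which is an assumption about the intended reading of Algorithm 1.

```python
import numpy as np

def conditional_mean(values, codes, n_levels):
    """Sample estimate of E[values | code], evaluated at every sample."""
    sums = np.bincount(codes, weights=values, minlength=n_levels)
    counts = np.bincount(codes, minlength=n_levels)
    return (sums / counts)[codes]

def standardize(v):
    """Zero sample mean and unit second moment (left unscaled if v is constant)."""
    v = v - v.mean()
    scale = np.sqrt(np.mean(v ** 2))
    return v / scale if scale > 1e-12 else v

def ace(x1, x2, max_iter=200, tol=1e-10, seed=0):
    """Returns (rho, phi1, phi2); phi1 and phi2 hold the transformed samples."""
    rng = np.random.default_rng(seed)
    _, c1 = np.unique(np.asarray(x1).ravel(), return_inverse=True)
    _, c2 = np.unique(np.asarray(x2).ravel(), return_inverse=True)
    k1, k2 = int(c1.max()) + 1, int(c2.max()) + 1
    # Random mean-zero starting functions: one value per symbol, read at each sample.
    phi1 = standardize(rng.standard_normal(k1)[c1])
    phi2 = standardize(rng.standard_normal(k2)[c2])
    rho = float(np.mean(phi1 * phi2))
    for _ in range(max_iter):
        phi1 = standardize(conditional_mean(phi2, c1, k1))  # E[phi2 | X1], normalized
        phi2 = standardize(conditional_mean(phi1, c2, k2))  # E[phi1 | X2], normalized
        new_rho = float(np.mean(phi1 * phi2))
        improved = new_rho - rho >= tol
        rho = new_rho
        if not improved:   # stop once the correlation no longer increases by tol
            break
    return rho, phi1, phi2

# Discrete analogue of Example 1: X2 = X1^2 with X1 uniform on {-1, 0, 1}.
rng = np.random.default_rng(1)
x1 = rng.integers(-1, 2, size=20_000)
x2 = x1 ** 2
rho_ace, _, _ = ace(x1, x2)
print("linear:", round(float(np.corrcoef(x1, x2)[0, 1]), 3), "ACE:", round(rho_ace, 3))
```

In this example $X_2$ is a function of $X_1$, so the ACE estimate should be close to one while the linear correlation stays near zero. The memory used grows with the number of samples and the alphabet sizes, rather than with the product of the alphabet sizes.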
Upon employing Lagrange multiplier theory [34], the first-order optimality condition for optimization (5.7) is the existence of real-valued scalars, namely Lagrange multipliers $\lambda_1, \dots, \lambda_n$, such that the following system of equations is satisfied:
$$\sum_{j=1}^{n} C_{i,j} b_j = \lambda_i b_i,\ 1 \le i \le n, \qquad \|b_i\|_2 = 1,\ 1 \le i \le n. \tag{5.2}$$
This system of equations is called the Multivariate Eigenvalue Problem (MEP). We next establish the connection between NMC and MCP. To that end, we define the following notation: For each $i$,

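since $I_{|\mathcal{X}_i|} - \sqrt{p_i}\sqrt{p_i}^T$ is positive semidefinite, we take its square root (footnote 2) and write $I_{|\mathcal{X}_i|} - \sqrt{p_i}\sqrt{p_i}^T = B_i B_i^T$, where $I_{|\mathcal{X}_i|}$ is the $|\mathcal{X}_i| \times |\mathcal{X}_i|$ identity matrix. Let $b_i \triangleq B_i a_i$. Let $U_i \Sigma_i U_i^T$ be the singular value decomposition of $B_i$, where $U_i^{(j)}$ is the $j$-th column of $U_i$ and $\sigma_i^{(j)}$ is the $j$-th singular value of $B_i$. We will show that only one singular value of $B_i$ is zero, and that it corresponds to the singular vector $\sqrt{p_i}$. Without loss of generality, suppose $\sigma_i^{(1)} = 0$ for all $i$. Define $A_i$, a $|\mathcal{X}_i| \times |\mathcal{X}_i|$ matrix, as follows:
$$A_i \triangleq \left[U_i^{(2)}, \dots, U_i^{(|\mathcal{X}_i|)}\right] \mathrm{diag}\left(\frac{1}{\sigma_i^{(2)}}, \dots, \frac{1}{\sigma_i^{(|\mathcal{X}_i|)}}\right) \left[U_i^{(2)}, \dots, U_i^{(|\mathcal{X}_i|)}\right]^T. \tag{5.3}$$
Since $\sigma_i^{(j)} \neq 0$ for all $1 \le i \le n$ and $j \ge 2$, $A_i$ is well-defined according to (5.3).

Theorem 6 The NMC optimization (4.1) can be re-written as follows:
$$\max_{b} \sum_{(i,i') \in E} b_i^T A_i^T \left(Q_{i,i'} - \sqrt{p_i}\sqrt{p_{i'}}^T\right) A_{i'} b_{i'} \tag{5.4}$$
$$\text{s.t.} \quad \|b_i\|_2 = 1,\ 1 \le i \le n. \tag{5.5}$$

Proof A proof is presented in Section 10.10.

Let $C$ be a matrix consisting of submatrices $C_{i,i'}$ where, if $(i,i') \in E$,
$$C_{i,i'} \triangleq A_i^T \left(Q_{i,i'} - \sqrt{p_i}\sqrt{p_{i'}}^T\right) A_{i'}, \tag{5.6}$$
and otherwise $C_{i,i'}$ is an all-zero matrix of size $|\mathcal{X}_i| \times |\mathcal{X}_{i'}|$. Let $b \triangleq (b_1^T, \dots, b_n^T)^T \in \mathbb{R}^{M}$, where $b_i \in \mathbb{R}^{|\mathcal{X}_i|}$ and $M = \sum_{i=1}^{n} |\mathcal{X}_i|$.

Proposition 5 The NMC optimization (5.4) can be written as follows:
$$\max_{b}\ b^T C b \quad \text{s.t.} \quad \|b_i\|_2 = 1,\ 1 \le i \le n. \tag{5.7}$$

Optimization (5.7) is in the standard form of the MCP problem [24, 25]. Having shown that the NMC optimization can be reformulated as an MCP, we use existing techniques from the literature to solve it. Finding a global optimum of optimization (5.7) is computationally difficult in general, and several local maximizers may exist [23, 35]. For example, an aggregated power method that iterates on blocks of $C$ was proposed by Horst [36] as a general technique for solving the MEP numerically. Below, we summarize general algorithmic ideas to solve MCP:

(1) First, an efficient algorithm is used to solve MEP, which is the necessary first-order condition for MCP. This step is studied in references [23, 36].
(2) Next, a strategy is used to properly choose starting points of the algorithm or to jump out of local optima of optimization (5.7). This step is studied in [26, 37].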
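As a rough illustration of step (1), the following sketch runs a Gauss-Seidel style block power iteration on the MEP (5.2). The function names, the list-of-rows matrix layout, and the random positive definite test matrix are assumptions made for this example only; they are not taken from the references.

```python
import math
import random

def gauss_seidel_mep(C, sizes, b0, iters=200):
    """Block iteration for the MEP: sum_j C_ij b_j = lambda_i b_i with ||b_i||_2 = 1.

    C is a dense symmetric matrix (list of rows); sizes[i] is the length of block b_i.
    """
    starts = [sum(sizes[:i]) for i in range(len(sizes))]
    blocks = [range(s, s + m) for s, m in zip(starts, sizes)]
    b = list(b0)
    for blk in blocks:  # normalize every block of the starting point
        norm = math.sqrt(sum(b[t] ** 2 for t in blk))
        for t in blk:
            b[t] /= norm
    lam = [0.0] * len(sizes)
    for _ in range(iters):
        for i, blk in enumerate(blocks):
            # b is updated in place, so blocks j < i already hold their new values
            y = [sum(C[t][u] * b[u] for u in range(len(b))) for t in blk]
            lam[i] = math.sqrt(sum(v * v for v in y))
            for t, v in zip(blk, y):
                b[t] = v / lam[i]
    return b, lam

def objective(C, b):
    """r(b) = b^T C b."""
    n = len(b)
    return sum(b[t] * C[t][u] * b[u] for t in range(n) for u in range(n))

# Hypothetical test problem: a random symmetric positive definite C with three blocks.
random.seed(0)
sizes = [2, 3, 2]
M = sum(sizes)
G = [[random.gauss(0, 1) for _ in range(M)] for _ in range(M)]
C = [[sum(G[t][k] * G[u][k] for k in range(M)) + (1.0 if t == u else 0.0)
      for u in range(M)] for t in range(M)]
b0 = [random.gauss(0, 1) for _ in range(M)]

b_start, _ = gauss_seidel_mep(C, sizes, b0, iters=0)  # only normalizes the blocks
b_final, lam = gauss_seidel_mep(C, sizes, b0)
print(objective(C, b_start), objective(C, b_final))
```

Each block update replaces $b_i$ by the normalized vector $\sum_j C_{i,j} b_j$, so every block keeps unit norm. The point reached depends on `b0`, which is why step (2) matters.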

Footnote 2: The square root of a symmetric positive semidefinite matrix $A$ is defined as $\sqrt{A} = U \Sigma^{1/2} U^T$, where $A = U \Sigma U^T$.

Algorithm 2 Gauss-Seidel Algorithm for MEP
  Input: $C \in \mathbb{R}^{M \times M}$.
  Initialization: $b^{(0)} \in \mathbb{R}^{M}$.
  for $k = 0, 1, \dots$
    for $i = 1, \dots, n$
      $y_i^{(k)} = \sum_{j=1}^{i-1} C_{i,j} b_j^{(k+1)} + \sum_{j=i}^{n} C_{i,j} b_j^{(k)}$
      $\lambda_i^{(k)} = \|y_i^{(k)}\|_2$
      $b_i^{(k+1)} = y_i^{(k)} / \lambda_i^{(k)}$
    end
  end

An efficient algorithm to solve MEP: Algorithm 2 is a Gauss-Seidel algorithm [34] to solve MEP, proposed by [23]. This algorithm is essentially a variant of the classical power iteration method (see e.g. [38]). Let $r(b^{(k)}) = (b^{(k)})^T C b^{(k)}$, $\lambda_i(b) = b_i^T [C_{i,1}, \dots, C_{i,n}] b$, and $\Lambda(b) = \mathrm{diag}(\lambda_1(b) I_{|\mathcal{X}_1|}, \dots, \lambda_n(b) I_{|\mathcal{X}_n|})$.

Theorem 7 ([26]) Suppose the matrix $C$ is symmetric. We have
a) The sequence $\{r(b^{(k)})\}$ generated by Algorithm 2 is monotonically increasing and convergent.
b) Let $(\Lambda^*, b^*)$ be a solution of MEP. If $b^*$ is a local maximizer of (5.7), then for any $i$, we have $\lambda_i(b^*) \ge \sigma_{|\mathcal{X}_i|}(C_{i,i})$. Moreover, if $b^*$ is a global maximizer of (5.7), then for any $i$, we have $\lambda_i(b^*) \ge \sigma_1(C_{i,i})$, where $\sigma_1(C_{i,i}) \ge \dots \ge \sigma_{|\mathcal{X}_i|}(C_{i,i})$ are the eigenvalues of the matrix $C_{i,i}$.

A strategy for avoiding local optima of MCP: Let $b^*$ be a solution of MEP. Using Theorem 7, since $C_{i,i}$ is a zero matrix, in order for $b^*$ to be a global maximizer of optimization (5.7), we need to have $\lambda_i(b^*) \ge 0$. Based on this observation, we have the following strategy for choosing a starting point for Algorithm 2. Let $\tilde{b}$ be a solution of (5.2) with the corresponding $\tilde{\Lambda}$. Suppose that there exists an $1 \le i \le n$ such that $\tilde{\lambda}_i < 0$. Let $w$ be the unit vector associated with the eigenvalue of $\tilde{\lambda}_i I_{|\mathcal{X}_i|}$. Now let $\hat{b} = \tilde{b} - q$,

where $q$ is a vector with $q_j = 0$ for all $j \neq i$ and $q_i = 2 (w^T \tilde{b}_i) w$. The vector $\hat{b}$ still satisfies $\|\hat{b}_i\|_2 = 1$ for all $i = 1, \dots, n$ and gives $r(\hat{b}) = r(\tilde{b}) - 4 \tilde{\lambda}_i (w^T \tilde{b}_i)^2 > r(\tilde{b})$. We repeat this process until we have $\tilde{\lambda}_i \ge 0$ for all $i = 1, \dots, n$. Note that this is not a sufficient condition for the global maximizer of (5.7) and is only a necessary condition, as Theorem 7 shows. After repeating this procedure, the condition given in Theorem 7 holds. We then call upon Algorithm 2 to produce yet a better solution for (5.7).

Based on Algorithm 2, we introduce an algorithm to compute NMC using alternating conditional expectation. We prove that the proposed algorithm converges to a local optimum of the NMC optimization (4.1). We then use the strategy explained in this section to jump out of local maximizers. At each iteration of the algorithm, we update transformation functions as follows: Suppose at iteration $r$, the transformation functions are $\phi_j^{r}$. If we fix all variables except the transformation function of node $i$, an optimal solution $\phi_i^{r+1}$ can be written as the normalized conditional expectation of functions of its neighbors (see Proposition 3). In each update, the objective function of the NMC optimization increases or stays the same.

Algorithm 3 Network ACE to compute NMC
  Input: $G$, $X_1, \dots, X_n$.
  Initialization: $\phi_1^{(0)}(X_1), \dots, \phi_n^{(0)}(X_n)$ with mean zero.
  for $k = 0, 1, \dots$
    $\tilde{\phi}_i(X_i) = \phi_i^{(k)}(X_i)$, $1 \le i \le n$
    for $i = 1, \dots, n$
      $\tilde{\phi}_i(X_i) = E\left[\sum_{j \in N(i)} \tilde{\phi}_j(X_j) \mid X_i\right]$
      update: $\tilde{\phi}_i(X_i) = \tilde{\phi}_i(X_i) / \sqrt{E[\tilde{\phi}_i(X_i)^2]}$
    end
    $\phi_i^{(k+1)}(X_i) = \tilde{\phi}_i(X_i)$, $1 \le i \le n$
    $\rho_G^{(k+1)} = \sum_{(i,j) \in E} E\left[\phi_i^{(k+1)}(X_i)\, \phi_j^{(k+1)}(X_j)\right]$
  end

Proposition 6 The sequence $\{\rho_G^{(k)}\}_{k=0}^{\infty}$ generated by Algorithm 3 converges to a local optimum solution of the NMC optimization (4.1).

Proof A proof is presented in Section 10.11.

Figure 2 illustrates the convergence of the ACE algorithm to compute NMC of six jointly Gaussian variables connected over a complete graph (for more details, see Example 5).

Proposition 7 The computational complexity of each iteration of Algorithm 3 is
$$O(n D d_{\max} + |E|), \tag{5.8}$$
where $d_{\max}$ is the maximum node degree and $D = \max_i |\mathcal{X}_i|$.

Similarly to Algorithm 2 for solving MCP, Algorithm 3 finds a local optimum solution.
Once the algorithm terminates, we apply Theorem 7 and the strategy provided in the previous section: if the convergence point does not satisfy the necessary conditions for a global optimum, we update the starting point of Algorithm 3 and run it again to reach a yet better solution.
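The network ACE updates can be sketched for discrete samples as follows. This is a minimal illustration, not the authors' implementation: the function name `network_ace`, the dictionary representation of each $\phi_i$, the random starting point, and the toy data set are all assumptions of this example, and the restart strategy based on Theorem 7 is not included.

```python
import math
import random

def network_ace(samples, edges, iters=50, seed=0):
    """Empirical network ACE for discrete samples.

    samples: list of tuples (x_1, ..., x_n); edges: list of index pairs (i, j).
    Returns (phi, rho), where phi[i] maps each observed value of X_i to phi_i(value).
    """
    rng = random.Random(seed)
    n, m = len(samples[0]), len(samples)
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def standardize(f, i):
        # Center and scale f so that phi_i has empirical mean 0 and variance 1.
        mean = sum(f[s[i]] for s in samples) / m
        var = sum((f[s[i]] - mean) ** 2 for s in samples) / m
        if var == 0.0:
            return None
        return {v: (f[v] - mean) / math.sqrt(var) for v in f}

    phi = []
    for i in range(n):
        values = sorted({s[i] for s in samples})
        start = standardize({v: rng.gauss(0, 1) for v in values}, i)
        if start is None:
            raise ValueError("variable %d is constant in the sample" % i)
        phi.append(start)

    for _ in range(iters):
        for i in range(n):
            sums, counts = {}, {}
            for s in samples:
                total = sum(phi[j][s[j]] for j in nbrs[i])
                sums[s[i]] = sums.get(s[i], 0.0) + total
                counts[s[i]] = counts.get(s[i], 0) + 1
            cond = {v: sums[v] / counts[v] for v in sums}  # E[sum_j phi_j | X_i = v]
            updated = standardize(cond, i)
            if updated is not None:  # keep the old phi_i if the update is constant
                phi[i] = updated
    rho = sum(sum(phi[i][s[i]] * phi[j][s[j]] for s in samples) / m for i, j in edges)
    return phi, rho

# Toy data (hypothetical): X_2 = X_1^2 with X_1 uniform on {-2, ..., 2}.
samples = [(x, x * x) for x in (-2, -1, 0, 1, 2)] * 20
phi, rho = network_ace(samples, [(0, 1)])
print(rho)
```

In this toy example the linear correlation between the two variables is zero, while the transformed variables are perfectly correlated because the second variable is a function of the first. With a single edge, the procedure reduces to the two-variable ACE iteration.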

Figure 2: An illustration of the convergence of the ACE Algorithm 3 to compute NMC over a complete graph connecting six jointly Gaussian variables (horizontal axis: number of iterations; vertical axis: NMC value).

Note that we can use an algorithm similar to Algorithm 3 to compute the regularized NMC of Definition 5. The objective function of the regularized NMC optimization (4.5) can be written as follows:
$$\sum_{i \in V} E\left[\phi_i(X_i) \left((1-\lambda) \sum_{j \in N(i)} \phi_j(X_j) + \lambda X_i\right)\right]. \tag{5.9}$$
Thus, to compute the regularized NMC, one can use an ACE algorithm similar to Algorithm 3 with the following updates for the transformation functions:
$$\phi_i(X_i) = E\left[(1-\lambda) \sum_{j \in N(i)} \phi_j(X_j) + \lambda X_i \,\Big|\, X_i\right]. \tag{5.10}$$

If the variables are continuous and we only observe samples from their joint distributions, empirical computation of the conditional expectations in Algorithm 3 may be challenging owing to the lack of sufficient samples at each point. One way to overcome this issue is to discretize continuous variables by quantizing them. However, this approach can introduce significant quantization errors. An alternative approach to compute empirical conditional expectations at a point $x_0 \in \mathbb{R}$ is to use all samples in its $B$-neighborhood (i.e., $x_0 - B/2 \le x \le x_0 + B/2$). By using this approach, the computation of empirical conditional expectations in Algorithm 3 can be written as follows:
$$\phi_i(X_i = x_i) = E\left[\sum_{j \in N(i)} \phi_j(X_j) \,\Big|\, x_i - B/2 \le X_i \le x_i + B/2\right]. \tag{5.11}$$
We use this approach in our ACE implementations to compute NMC for continuous variables.

If the graph $G = (V, E)$ is sparse (i.e., $|E| = O(n)$), Proposition 7 shows that the NMC computation can be performed efficiently, in time linear in the number of nodes in the network. However, if the graph is dense or the number of nodes in the network is large, this computation may be expensive. In the following, we propose an approach to compute NMC using parallel computations.
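The windowed estimate in (5.11) can be sketched as below. The helper names and the toy numbers are assumptions for illustration. The update is left unnormalized, and the quadratic-time scan over samples is chosen for clarity rather than speed.

```python
def windowed_conditional_mean(x, target, x0, width):
    """Estimate E[target | x0 - width/2 <= X <= x0 + width/2] from paired samples."""
    inside = [t for xv, t in zip(x, target) if x0 - width / 2 <= xv <= x0 + width / 2]
    if not inside:
        raise ValueError("no samples fall inside the window")
    return sum(inside) / len(inside)

def ace_update(x_i, neighbor_phis, width):
    """One unnormalized update of phi_i at every sample point, following (5.11).

    x_i: samples of X_i; neighbor_phis: one list per neighbor j holding phi_j(x_j)
    evaluated at the same samples.
    """
    target = [sum(vals) for vals in zip(*neighbor_phis)]
    return [windowed_conditional_mean(x_i, target, x0, width) for x0 in x_i]

# Toy samples (hypothetical): two clusters of X_i values and one neighbor.
x_i = [0.0, 0.1, 0.2, 1.0, 1.1]
phi_j = [1.0, 2.0, 3.0, 10.0, 20.0]
updated = ace_update(x_i, [phi_j], width=0.5)
print(updated)
```

Each sample lies inside its own window, so the window is never empty when the update is evaluated at observed points. A larger `width` averages over more samples at the cost of a coarser estimate.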

5.3 Parallel Computation of NMC

For large and dense networks, exact computation of NMC may become computationally expensive (Proposition 7). For those cases, we propose a parallelizable algorithm which approximates NMC using network partitioning. The idea can be described as follows. For a given graph $G = (V, E)$,
(1) Partition the graph into small disjoint sets,
(2) Estimate NMC for each partition independently,
(3) Combine NMC solutions over sub-graphs to form an approximation of NMC for the original graph.

Definition 6 An $(\epsilon, k)$-partitioning of a graph $G = (V, E)$ is a distribution on finite partitions of $V$ so that for any partition $\{V_1, \dots, V_M\}$ with nonzero probability, $|V_m| \le k$ for all $1 \le m \le M$. Moreover, the probability that an edge falls across partitions is bounded by $\epsilon$: for any $e \in E$, $P[e \in E^c] \le \epsilon$, where $E^c = E \setminus \bigcup_m (V_m \times V_m)$ is the set of cut edges. The probability is with respect to the distribution on partitions.

Definition 7 A graph $G$ is poly-growth if there exist $r > 0$ and $C > 0$ such that for any vertex $v$ in the graph, $|N_v(d)| \le C d^r$, where $N_v(d)$ is the set of nodes within distance $d$ of $v$ in $G$.

Reference [39] describes the following procedure for generating an $(\epsilon, k)$-partitioning on a graph:
1. Given $G = (V, E)$, $k$, and $\epsilon > 0$, we define the truncated geometric distribution as follows:
$$P[x = l] = \begin{cases} \epsilon (1-\epsilon)^{l-1}, & l < k, \\ (1-\epsilon)^{k-1}, & l = k. \end{cases} \tag{5.12}$$
2. We then order the nodes arbitrarily $1, \dots, N$. For node $i$, we sample $R_i$ according to distribution (5.12) and assign all nodes within that distance from node $i$ to color $i$ (distance is defined as the shortest path length on the graph). If a node has already been colored, we re-color it with color $i$.
3. All nodes with the same color form a partition.

Proposition 8 If $G$ is a poly-growth graph, then by selecting $k = \Theta\left(\frac{r}{\epsilon} \log \frac{r}{\epsilon}\right)$, the above procedure results in an $(\epsilon, C k^r)$-partition.

Proof See reference [39].

Next, we use an $(\epsilon, k)$-graph partitioning to approximate NMC over large graphs using parallel computations. Consider the following approach:
(1) Given an $(\epsilon, k)$-partitioning of $G$, we sample a partition $\{V_1, \dots, V_M\}$ of $V$.

(2) For each partition $1 \le m \le M$, we compute NMC over $G_m = (V_m, E \cap (V_m \times V_m))$, denoted by $\hat{\rho}_{G_m}$.
(3) Let $\hat{\rho}_G = \sum_{m=1}^{M} \hat{\rho}_{G_m}$ be an approximation of $\rho_G$.

In the following, we bound the approximation error by bounding boundary effects:

Theorem 8 Consider an $(\epsilon, k)$-partitioning of the graph $G$. We have
$$E[\hat{\rho}_G] \ge (1 - \epsilon) \rho_G, \tag{5.13}$$
where the expectation is over $(\epsilon, k)$-partitions of graph $G$.

Proof A proof is presented in Section 10.12.

6 NMC Application in Inference of Nonlinear Gaussian Graphical Models

In this section, we discuss an application of NMC to infer graphical models for nonlinear functions of jointly Gaussian variables. Suppose $(X_1, \dots, X_n)$ are jointly Gaussian variables with zero means and unit variances. Let $\rho_{i,i'}$ be the correlation coefficient between variables $X_i$ and $X_{i'}$. Let $\psi_{i,j}$ be the $j$-th Hermite-Chebyshev polynomial (2.8), for $1 \le i \le n$. Recall that these polynomials form an orthonormal basis with respect to the Gaussian distribution (see Example 3 and reference [4] for convergence details). We have
$$\rho_{i,i'}^{j,j'} = E[\psi_{i,j}(X_i)\, \psi_{i',j'}(X_{i'})] = \rho_{i,i'}^{j}\, \mathbb{1}\{j = j'\}, \tag{6.1}$$
where $\mathbb{1}\{j = j'\}$ is equal to one if $j = j'$, and zero otherwise. Moreover, using the definition of Hermite-Chebyshev polynomials (2.8), we have
$$E[\psi_{i,j}(X_i)] = \mathbb{1}\{j = 0\}, \quad 1 \le i \le n, \tag{6.2}$$
because all of these functions for $j \ge 1$ have zero means over a Gaussian distribution. Therefore, optimization (4.7) can be written as
$$\max_{a} \sum_{(i,i') \in E} \sum_{j \ge 2} a_{i,j}\, a_{i',j}\, \rho_{i,i'}^{j} \quad \text{s.t.} \quad \sum_{j \ge 2} (a_{i,j})^2 = 1,\ 1 \le i \le n. \tag{6.3}$$
In general, solving optimization (6.3) is NP-complete. We establish this by identifying that one instance of this optimization simplifies to the max-cut problem, which is NP-complete [27].

Theorem 9 Let $s_i \in \{-1, 1\}$ for $1 \le i \le n$. Let $G = (V, E)$ be a complete graph. Suppose ( s s )ρ, 0, n, and
$$\sum_{i' \neq i} s_i s_{i'} \rho_{i,i'} \ge \sum_{i' \neq i} \rho_{i,i'}^2, \quad 1 \le i \le n.$$
Then, $a_i^* = (0, s_i, 0, \dots, 0)$, for $1 \le i \le n$, is a global maximizer of optimization (6.3).

Proof A proof is presented in Section 10.13.

Proposition 9 Under the assumptions of Theorem 9, the NMC optimization (4.1) simplifies to the following max-cut optimization:
$$\max_{s} \sum_{(i,i') \in E} s_i s_{i'} \rho_{i,i'} \quad \text{s.t.} \quad s_i \in \{-1, 1\},\ 1 \le i \le n. \tag{6.4}$$
Moreover, for all $1 \le i \le n$, we have $\phi_i^*(X_i) = s_i^* X_i$, where $\phi_i^*$ and $s_i^*$ are optimal solutions of optimizations (3) and (6.4), respectively.

Proof A proof is presented in Section 10.14.

Note 2 Under the conditions of Theorem 9, one can compute the strength of the nonlinear relationships among variables by solving multiple pairwise MC optimizations (2.1). However, one needs to solve optimization (6.4) to compute the signs of the covariance coefficients. In general, the Max-Cut optimization (6.4) is NP-complete [27]. However, there exist algorithms that approximate its solution using Semidefinite Programming (SDP) with a guaranteed approximation factor [28].

Corollary 1 Let $\phi_i^*(X_i)$ be an optimal solution of the NMC optimization (3). Under the assumptions of Theorem 9, if
$$\sum_{i' \neq i} \rho_{i,i'} \ge \sum_{i' \neq i} \rho_{i,i'}^2, \quad 1 \le i \le n, \tag{6.5}$$
then $\phi_i^*(X_i) = X_i$.

Intuitively, the assumptions of Corollary 1 consider jointly Gaussian variables with correlation coefficients that are mostly positive. However, the covariance matrix can have negative values as well. We will show that this assumption is critical in graphical model inference of nonlinear jointly Gaussian variables.

Suppose $(X_1, \dots, X_n)$ are jointly Gaussian variables with the covariance matrix $\Lambda_X$. Without loss of generality, we assume all variables have zero means and unit variances, i.e., $E[X_i] = 0$ and $E[X_i^2] = 1$ for all $1 \le i \le n$. Let $J_X$ be the information (precision) matrix [40] of these variables, where $J_X = \Lambda_X^{-1}$. Define $G_X = (V_X, E_X)$ such that $(i, j) \in E_X$ if and only if $J_X(i, j) \neq 0$.

Theorem 10 (e.g., [40]) If $(i, j) \notin E_X$, then
$$X_i \perp X_j \mid \{X_k,\ k \neq i, j\}, \tag{6.6}$$
where $\perp$ represents independence between variables.

Theorem 10 represents a way to explicitly model the joint distribution of Gaussian variables using a graphical model $G_X = (V_X, E_X)$. This result is critical in several applications involving Gaussian variables which require computation of marginal distributions, or computation of the mode of the distribution.
These computations can be performed efficiently over the graphical model using belief propagation approaches [40]. Moreover, Gaussian graphical models play an important

role in many applications such as linear regression [4], partial correlation [42], maximum likelihood estimation [43], etc. In many applications, even if variables are not jointly Gaussian, a Gaussian approximation is often used in practice, partially owing to the efficient inference of their graphical models.

In the following, under some conditions, we use the NMC optimization to characterize graphical models for functions of latent jointly Gaussian variables. These functions are unknown, bijective, and can be linear or nonlinear. More precisely, let $Y_i = f_i(X_i)$, where $f_i: \mathbb{R} \to \mathbb{R}$ is a bijective and differentiable function. Our goal is to characterize a graphical model for variables $(Y_1, Y_2, \dots, Y_n)$ without the knowledge of the $f_i(\cdot)$ functions. Consider the following NMC optimization:
$$\max \sum_{(i,i')} E[g_i(Y_i)\, g_{i'}(Y_{i'})] \quad \text{s.t.} \quad E[g_i(Y_i)] = 0,\ E[g_i^2(Y_i)] = 1,\ 1 \le i \le n. \tag{6.7}$$
Suppose $g_i^*(\cdot)$ represents an optimal solution for optimization (6.7). Define the matrix $\Lambda_{nmc}$ such that $\Lambda_{nmc}(i, j) = E[g_i^*(Y_i)\, g_j^*(Y_j)]$. Moreover, let $J_{nmc} = \Lambda_{nmc}^{-1}$. Define $G_{nmc} = (V_{nmc}, E_{nmc})$ such that $(i, j) \in E_{nmc}$ if and only if $J_{nmc}(i, j) \neq 0$. The following theorem characterizes the graphical model of variables $(Y_1, \dots, Y_n)$.

Theorem 11 Suppose the $X_i$'s satisfy the conditions of Corollary 1. If $(i, j) \notin E_{nmc}$, then
$$Y_i \perp Y_j \mid \{Y_k,\ k \neq i, j\}. \tag{6.8}$$

Proof A proof is presented in Section 10.15.

Under the conditions of Corollary 1, Theorem 11 characterizes the graphical model of variables $Y_i$ that are related to latent jointly Gaussian variables $X_i$ through unknown bijective functions $f_i$. The family of distributions considered in Theorem 11 is broad and includes many Gaussian distributions as well as distributions whose variables are bijective functions of Gaussian variables. Graphical models characterized in Theorem 11 can be used in computation of marginal distributions, computation of the mode of the joint distribution, and in other applications of estimation and prediction, similarly to the case of Gaussian graphical models. Note that Theorem 11 only considers $f_i$ functions that are bijective and differentiable.
If these functions were not bijective, the feasible set of optimization (6.7) would in fact be smaller than the feasible set of the original NMC optimization (4.1) over Gaussian variables.

Example 5 Consider six jointly Gaussian variables $X_1, \dots, X_6$, each with unit variance and mean

zero. We observe $Y_i = f_i(X_i)$, where the functions $f_i$ are given in (6.9):

0X, Y = f (X ) = 0 X, Y2 = f2 (X2 ) = e20x2, f X 0, otherwise, (6.9) Y3 = f3 (X3 ) = X3, Y4 = f4 (X4 ) = X43, e20x5, Y5 = f5 (X5 ) = 20X 5, e Y6 = f6 (X6 ) = X65. f X5 0, otherwise,

The functions $f_i(\cdot)$ remain unknown for the inference part. Relationships among the original, observed and NMC variables are depicted in Figure 3. Then, we use $\phi_i^*(Y_i)$ to infer the underlying covariance and precision matrices according to Theorem 11. As illustrated in Figure 4, the covariance and precision matrices computed using the observed variables $Y_i$ show significant errors compared to the true networks, owing to the existence of extensive nonlinear relationships. However, the covariance and precision matrices inferred using the NMC optimization closely approximate the true covariance and precision matrices, respectively (Theorem 11). Small errors in covariance coefficient estimation in this example are owing to computation of the NMC solution using empirical distributions according to the ACE Algorithm 3.

Figure 3: (a) Relationships among latent jointly Gaussian variables $X_i$ and their nonlinear observations $Y_i$ (before NMC). (b) Relationships among latent jointly Gaussian variables and inferred transformations using the NMC optimization (after NMC).
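The last step of this inference, going from a covariance matrix of transformed variables to a graph, can be sketched as follows. The function names, the tolerance `tol`, and the three-variable chain are assumptions for illustration. With empirical estimates, entries of the precision matrix are never exactly zero, so some threshold of this kind is needed in practice.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def graph_from_covariance(cov, tol=1e-8):
    """Return (J, edges): J = cov^{-1} and the pairs i < j with |J(i, j)| > tol."""
    J = invert(cov)
    n = len(cov)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if abs(J[i][j]) > tol]
    return J, edges

# Hypothetical covariance of a Gaussian chain X_1 - X_2 - X_3 with correlation 0.5
# between consecutive variables, so that the correlation of X_1 and X_3 is 0.25.
cov = [[1.0, 0.5, 0.25],
       [0.5, 1.0, 0.5],
       [0.25, 0.5, 1.0]]
J, edges = graph_from_covariance(cov)
print(edges)
```

For the chain, the precision matrix is tridiagonal, so the pair $(X_1, X_3)$ is not an edge even though the two variables are correlated. This is the conditional independence structure that Theorems 10 and 11 read off from the precision matrix.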

Figure 4: For six nonlinear functions of jointly Gaussian variables described in Example 5, panel (a) plots covariance matrix errors before and after NMC. Using NMC, estimated covariance coefficients among variables are close to the true ones. Panel (b) plots precision matrix errors before and after NMC. Using NMC, estimated elements of the precision matrix are close to the true ones.

7 NMC Application in Inference of Nonlinear Relevance Graphs

Relevance graphs (RGs) play an important role in many applications including systems biology and the social and economic sciences, as they characterize variable pairs with the highest observed similarities [29]. Let $X_1, \dots, X_n$ be $n$ random variables with zero means and unit variances. Consider a similarity measure $S(X_i, X_j)$ between variables $X_i$ and $X_j$. If the similarity measure between variables $X_i$ and $X_j$ is independent of other variables, the resulting graph is called a pairwise relevance graph (PRG).

Definition 8 Consider the following optimization:
$$G_S^* = \arg\max_{G = (V, E),\ |E| = k} \sum_{(i,j) \in E} S(X_i, X_j). \tag{7.1}$$
Then, $G_S^* = (V, E^*)$ is called a pairwise relevance graph (PRG) of variables $X_1, \dots, X_n$, with $k$ edges, corresponding to the similarity measure $S(\cdot, \cdot)$.

PRGs can be inferred efficiently in practice. In the following, we highlight two examples of such graphs:

Example 6 If $S(\cdot, \cdot)$ is a correlation-based similarity measure (e.g., $S(X_i, X_j) = |E[X_i X_j]|$), the optimization (7.1) results in a correlation-based PRG. This graph only captures the top $k$ linear associations among variables, ignoring nonlinear ones.

Example 7 If $S(\cdot, \cdot)$ is a mutual information-based similarity measure (i.e., $S(X_i, X_j) = I(X_i; X_j)$, where $I(\cdot;\cdot)$ represents the mutual information function [5]), optimization (7.1) results in an MI-based PRG which captures nonlinear associations among pairs of variables. However, it does not provide explicit forms of such nonlinear relationships.
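Because each term of (7.1) depends on a single pair, the optimization reduces to keeping the $k$ highest-scoring pairs. The sketch below illustrates this with an absolute-correlation similarity in the spirit of Example 6. The function names and the toy data are assumptions for this example.

```python
import math
from itertools import combinations

def abs_correlation(a, b):
    """Absolute Pearson correlation, i.e. |E[X_i X_j]| after standardization."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / m
    var_a = sum((u - mean_a) ** 2 for u in a) / m
    var_b = sum((v - mean_b) ** 2 for v in b) / m
    return abs(cov) / math.sqrt(var_a * var_b)

def pairwise_relevance_graph(samples, k, similarity):
    """Return the k pairs (i, j), i < j, with the largest similarity score."""
    n = len(samples[0])
    columns = [[s[i] for s in samples] for i in range(n)]
    scored = [(similarity(columns[i], columns[j]), i, j)
              for i, j in combinations(range(n), 2)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, j) for _, i, j in scored[:k]]

# Toy data (hypothetical): X_1 = 2 X_0 is a linear function of X_0, X_2 = X_0^2 is not.
samples = [(x, 2 * x, x * x) for x in (-2, -1, 0, 1, 2)]
top_edge = pairwise_relevance_graph(samples, 1, abs_correlation)
print(top_edge)
```

In this toy data set the pair $(X_0, X_2)$ has zero correlation although $X_2$ is a deterministic function of $X_0$, so the correlation-based PRG ranks it last. Any other pairwise score can be passed as `similarity` without changing the selection step.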