STATISTICAL MODELS AND ISSUES IN THE ANALYSIS OF NETWORK DATA George Michailidis Department of Statistics, University of Michigan www.stat.lsa.umich.edu/ gmichail Plenary Talk UP-STAT 13 Rochester Institute of Technology, April 2013
WHY NETWORKS?
- Relatively small field of study until the late 1990s
- Explosive growth of interest and work on networks from 2000 onward
- Main factors:
  - Development of high-throughput technologies
  - Systems-level perspective in science
  - New modeling techniques and computational advances
SOME RECENT DEVELOPMENTS I Rapid increase in publications
SOME RECENT DEVELOPMENTS II New books, courses and journals
SOME RECENT DEVELOPMENTS III Dedicated workshops
WHAT IS A NETWORK?
- A collection of interconnected entities
- Mathematically, it is convenient to represent it as a graph $G = (V, E)$, where $V$ denotes the set of nodes (vertices) and $E$ the set of edges
EXAMPLES OF NETWORKS
Networks have become an integral tool for addressing diverse problems in a number of scientific fields. For example:
- Technological (e.g. communications, transportation, energy, sensor)
- Biological (e.g. gene regulation, protein interactions, predator-prey relations)
- Social (e.g. friendship, e-mail, trade flows)
- Informational (e.g. Web, Twitter, peer-to-peer)
STATISTICS AND NETWORK ANALYSIS
- Network analysis has attracted participants from diverse scientific fields, including information scientists, statisticians, mathematicians, applied physicists, complex systems theorists, ...
- Uneven developments
- Some topics rediscover and employ established techniques; others require new models and tools
- Nevertheless, some key topics have emerged that are truly statistical in nature
STATISTICS AND NETWORK ANALYSIS
- Network characterization (e.g. importance of nodes, identification of network communities, properties of the degree distribution)
- Network sampling: novel sampling schemes to construct a network, e.g. induced subgraph sampling, incident subgraph sampling, snowball sampling, link tracing
- Network inference: identify the topology from data, e.g. link prediction, graphical modeling
- Network dynamics: (i) stochastic processes (flows) on graphs, (ii) evolution of graphs over time
- Network visualization
- Time-varying networks
- Incorporation of network information in statistical inference
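To make one of the sampling schemes listed above concrete, here is a minimal sketch of snowball sampling followed by taking the induced subgraph on the sampled nodes. The function names, the adjacency-list representation, and the toy graph are illustrative assumptions, not part of the talk.

```python
def snowball_sample(adj, seeds, waves):
    """Return the nodes reached within `waves` hops of the seed nodes."""
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        # Next wave: all not-yet-sampled neighbors of the current frontier.
        next_wave = {v for u in frontier for v in adj[u] if v not in sampled}
        if not next_wave:
            break
        sampled |= next_wave
        frontier = next_wave
    return sampled


def induced_subgraph(adj, nodes):
    """Keep only the edges whose two endpoints were both sampled."""
    return {u: [v for v in adj[u] if v in nodes] for u in sorted(nodes)}


# Hypothetical toy graph: path 1-2-3-4-5 plus the edge 2-6.
adj = {1: [2], 2: [1, 3, 6], 3: [2, 4], 4: [3, 5], 5: [4], 6: [2]}
sample = snowball_sample(adj, seeds=[1], waves=2)
subgraph = induced_subgraph(adj, sample)
```

With one seed and two waves, nodes 4 and 5 are never observed, which illustrates why statistics computed on a sampled network can differ from those of the full network.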
NETWORK INFERENCE BASED ON GRAPHICAL MODELS
Some background on graphical models:
- Represent conditional independence relationships between a set of random variables
- No edge between $X_j$ and $X_{j'}$ $\iff$ $X_j$ is independent of $X_{j'}$ conditional on all other variables
- Typically estimated from a set of $n$ iid observations on $p$ variables
[Figure: two example graphs on nodes 1-5]
EXAMPLE 1: TEXT MINING
[Figure: network whose nodes are stemmed words, e.g. address, mail, phone, offic, student, graduat, research, comput, web, scienc, univers, depart]
EXAMPLE 2: ROLL CALL DATA
[Figure: network whose nodes are U.S. senators, labeled by name from Akaka to Wyden]
EXAMPLE 3: GENE NETWORKS
GAUSSIAN GRAPHICAL MODELS
- $X_1, \dots, X_p$ jointly follow $N(0, \Sigma)$
- Dependence structure fully characterized by the covariance structure
- Let $\rho_{j,j'} = \mathrm{cor}(X_j, X_{j'} \mid \text{others})$ denote the partial correlation
PARTIAL CORRELATION
Nodes $j$ and $j'$ are connected $\iff \rho_{j,j'} \neq 0$
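GAUSSIAN GRAPHICAL MODELS (CTD)
INVERSE COVARIANCE MATRIX
Let $\Omega = \Sigma^{-1}$ denote the inverse covariance matrix. We have $\rho_{j,j'} \propto \omega_{j,j'}$ (precisely, $\rho_{j,j'} = -\omega_{j,j'} / \sqrt{\omega_{j,j}\,\omega_{j',j'}}$).
[Figure: graph on nodes 1-5 with edges 1-3, 1-4, 2-4, 3-4, 3-5, labeled $\rho_{13}$, $\rho_{14}$, $\rho_{24}$, $\rho_{34}$, $\rho_{35}$]
$$\Omega = \begin{pmatrix}
\omega_{1,1} & 0 & \omega_{1,3} & \omega_{1,4} & 0 \\
0 & \omega_{2,2} & 0 & \omega_{2,4} & 0 \\
\omega_{3,1} & 0 & \omega_{3,3} & \omega_{3,4} & \omega_{3,5} \\
\omega_{4,1} & \omega_{4,2} & \omega_{4,3} & \omega_{4,4} & 0 \\
0 & 0 & \omega_{5,3} & 0 & \omega_{5,5}
\end{pmatrix}$$
Hence, estimating a Gaussian graphical model $\iff$ estimating $\Omega$. Also, estimating the graph corresponds to identifying the zeros in $\Omega$.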
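The correspondence between zeros of $\Omega$ and missing edges can be checked numerically. The sketch below uses a hypothetical precision matrix with the same sparsity pattern as the 5-node example (the numeric values are made up for illustration), converts it to partial correlations via $\rho_{j,j'} = -\omega_{j,j'}/\sqrt{\omega_{j,j}\omega_{j',j'}}$, and reads off the edge set.

```python
import numpy as np


def partial_correlations(omega):
    """Partial correlation matrix implied by a precision matrix."""
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho


def edges_from_precision(omega, tol=1e-10):
    """Edges (1-based node labels) are the nonzero off-diagonal entries."""
    p = omega.shape[0]
    return [(j + 1, k + 1)
            for j in range(p) for k in range(j + 1, p)
            if abs(omega[j, k]) > tol]


# Hypothetical values; only the zero pattern follows the slide.
omega = np.array([
    [2.0, 0.0, 0.5, 0.5, 0.0],
    [0.0, 2.0, 0.0, 0.5, 0.0],
    [0.5, 0.0, 2.0, 0.5, 0.5],
    [0.5, 0.5, 0.5, 2.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 2.0],
])
rho = partial_correlations(omega)
edges = edges_from_precision(omega)
```

Note that the covariance matrix `np.linalg.inv(omega)` is generally dense even though `omega` is sparse, which is why the graph is read from the inverse covariance rather than from the covariance itself.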
THE CASE OF HIGH-DIMENSIONAL DATA What happens if we have few samples and many more variables? Some examples: Biological networks: samples in the hundreds (at best), molecular entities in the thousands Text mining: both documents and corpus size in the thousands, but one needs to estimate all pairwise relationships between words! Solution: impose sparsity
ESTIMATION OF A SPARSE INVERSE COVARIANCE MATRIX
- This issue was addressed in a paper by Dempster (1972) and then remained dormant for 35 years, until Meinshausen and Bühlmann (2006) developed a penalized (lasso) regression approach to solve it
- Since then, there have been over 100 papers looking at various modeling, computational and inference aspects of the problem
MAXIMUM LIKELIHOOD ESTIMATION OF A SPARSE INVERSE COVARIANCE MATRIX
This goal can be accomplished by optimizing the following objective function, where $\hat\Sigma$ is the sample covariance matrix and $\Omega \succ 0$ requires $\Omega$ to be positive definite:
$$\max_{\Omega \succ 0} \; \log\det(\Omega) - \mathrm{trace}(\hat\Sigma\,\Omega) - \lambda \sum_{j \neq j'} |\omega_{j,j'}|$$
Note that when $\lambda = 0$, $\hat\Omega = \hat\Sigma^{-1}$.
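The remark about $\lambda = 0$ follows from a one-line stationarity argument, sketched here for completeness. It uses the standard matrix derivatives of $\log\det$ and the trace for symmetric $\Omega$, and it assumes $\hat\Sigma$ is invertible.

```latex
\begin{aligned}
\ell(\Omega) &= \log\det(\Omega) - \mathrm{trace}(\hat\Sigma\,\Omega), \\
\nabla_{\Omega}\,\ell(\Omega) &= \Omega^{-1} - \hat\Sigma = 0
\;\Longrightarrow\; \hat\Omega = \hat\Sigma^{-1}.
\end{aligned}
```

When there are more variables than samples, $\hat\Sigma$ is singular, so this unpenalized solution does not exist; that is the situation the penalty term with $\lambda > 0$ is meant to handle.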
ILLUSTRATION OF SPARSITY AS A FUNCTION OF λ
[Figure, captioned: Sparse inverse covariance estimation with the graphical lasso]
ILLUSTRATION: CS WEBPAGES AT CMU
[Diagram: Computer Science Department webpages in four categories: Faculty, Student, Project, Course]
CS WEBPAGES AT CMU
Used about 1400 webpages and focused on the 100 most frequent words
[Figure: common structure of the estimated word networks over the 100 stemmed words, with four highlighted subnetworks]
- (A) Webpage: site, web, page, link, home
- (B) Research area/lab: current, research, lab, laboratori, area, member, group
- (C) Parallel programming: distribut, parallel, system, algorithm, perform, problem, high
- (D) Software development: softwar, develop, structur, data, program, algorithm, languag
ESTIMATED NETWORKS FOR FACULTY AND STUDENT WEBPAGES
[Figure: two panels, (A) Student and (B) Faculty, each showing the estimated network over the same 100 stemmed words]
INCORPORATING NETWORK INFORMATION IN STATISTICAL TESTING PROBLEMS
Rationale:
- High-throughput techniques (sequencing, profiling) have enabled comprehensive monitoring of biological systems
- Analysis of high-throughput data typically yields a list of differentially expressed genes (proteins, metabolites, etc.), obtained by statistical testing for differences between two groups, for example, normal and disease or treatment and control
- This list has the potential to provide insight into a given biological phenomenon or phenotype, but in many cases it is hard to extract meaning from it
INCORPORATING NETWORK INFORMATION IN STATISTICAL TESTING PROBLEMS (CTD)
- To reduce the complexity in the data, biomedical researchers have resorted to grouping the genes into smaller sets (pathways) of related ones, e.g. according to their function
- The number of knowledge databases that can be used for such grouping, and their content, is increasing at an accelerating pace (e.g. KEGG, GO, TRANSFAC, DIP, ...)
PROBLEM FORMULATION
Given $n_1$ samples for the control condition and $n_2$ samples for the treatment condition of expression data for $p$ genes, and the network of gene interactions (shown below), test for activation of selected subgraphs
[Figure: network of gene interactions]
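A LATENT VARIABLE MODEL FORMULATION
$$\begin{aligned}
X_1 &= \gamma_1 \\
X_2 &= \rho_{12} X_1 + \gamma_2 = \rho_{12}\gamma_1 + \gamma_2 \\
X_3 &= \rho_{23} X_2 + \gamma_3 = \rho_{23}\rho_{12}\gamma_1 + \rho_{23}\gamma_2 + \gamma_3
\end{aligned}$$
Thus $X = \Lambda\gamma$, where
$$\Lambda = \begin{pmatrix} 1 & 0 & 0 \\ \rho_{12} & 1 & 0 \\ \rho_{12}\rho_{23} & \rho_{23} & 1 \end{pmatrix}$$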
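The recursions on this slide can be written as $X = AX + \gamma$, where $A$ holds the edge weights, so that $\Lambda = (I - A)^{-1}$. The sketch below checks this for the three-node chain; the matrix form, the function name, and the numeric values of $\rho_{12}$ and $\rho_{23}$ are illustrative assumptions rather than something stated on the slide.

```python
import numpy as np


def influence_matrix(A):
    """Solve X = A X + gamma for X = Lambda gamma, i.e. Lambda = (I - A)^{-1}.

    A[i, j] is the weight of the edge from node j to node i.
    """
    p = A.shape[0]
    return np.linalg.inv(np.eye(p) - A)


# Hypothetical edge weights for the chain 1 -> 2 -> 3.
rho12, rho23 = 0.5, 0.4
A = np.zeros((3, 3))
A[1, 0] = rho12  # X2 depends on X1
A[2, 1] = rho23  # X3 depends on X2
Lam = influence_matrix(A)

# The closed form shown on the slide, for comparison.
Lam_slide = np.array([
    [1.0, 0.0, 0.0],
    [rho12, 1.0, 0.0],
    [rho12 * rho23, rho23, 1.0],
])
```

The entry `Lam[2, 0]` is the product of the weights along the path from node 1 to node 3, which is how indirect influence enters the model.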
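THE LATENT VARIABLE MODEL
- Let $Y$ be the $i$th sample in the expression data
- Let $Y = X + \varepsilon$, with $X$ the signal and $\varepsilon \sim N_p(0, \sigma_\varepsilon^2 I_p)$ the noise
- Define latent variables $\gamma \sim N_p(\mu, \sigma_\gamma^2 I_p)$
- Let the influence of the $j$th gene on the $i$th gene be $\Lambda_{ij}$; $\Lambda = [\Lambda_{ij}]$ is called the Influence Matrix of the network
$$Y = \Lambda\gamma + \varepsilon, \qquad Y \sim N_p\left(\Lambda\mu,\; \sigma_\gamma^2 \Lambda\Lambda^T + \sigma_\varepsilon^2 I_p\right)$$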
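The mean and covariance of $Y$ follow directly from the linearity of the model. A short derivation, under the additional assumption (not stated explicitly on the slide) that $\gamma$ and $\varepsilon$ are independent:

```latex
\begin{aligned}
\mathrm{E}[Y] &= \Lambda\,\mathrm{E}[\gamma] + \mathrm{E}[\varepsilon] = \Lambda\mu, \\
\mathrm{Var}(Y) &= \Lambda\,\mathrm{Var}(\gamma)\,\Lambda^{T} + \mathrm{Var}(\varepsilon)
 = \sigma_\gamma^2\,\Lambda\Lambda^{T} + \sigma_\varepsilon^2 I_p .
\end{aligned}
```

The network therefore enters both the mean and the covariance of the observed expression levels through $\Lambda$.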
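MIXED LINEAR MODEL REPRESENTATION
Let $(Y_i^C, \mu^C, \Lambda^C)$ and $(Y_i^T, \mu^T, \Lambda^T)$ represent the data under control and treatment. Then
$$Y = \Psi\beta + \Pi\gamma + \varepsilon,$$
where $\beta = (\mu^C, \mu^T)^T$,
$$\Psi = \begin{pmatrix} \Lambda^C & 0 \\ \vdots & \vdots \\ \Lambda^C & 0 \\ 0 & \Lambda^T \\ \vdots & \vdots \\ 0 & \Lambda^T \end{pmatrix}, \qquad
\Pi = \mathrm{diag}(\Lambda^C, \dots, \Lambda^C, \Lambda^T, \dots, \Lambda^T),$$
$$\mathrm{E}\begin{bmatrix} \gamma \\ \varepsilon \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad
\mathrm{Var}\begin{bmatrix} \gamma \\ \varepsilon \end{bmatrix} = \begin{bmatrix} \sigma_\gamma^2 I & 0 \\ 0 & \sigma_\varepsilon^2 I \end{bmatrix}$$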
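As a rough illustration of how the design matrices could be assembled, the sketch below stacks the $n_1$ control samples first and the $n_2$ treatment samples after them. This ordering, the function name, and the toy influence matrices are assumptions made for the example.

```python
import numpy as np


def design_matrices(Lam_C, Lam_T, n1, n2):
    """Build Psi (fixed effects) and Pi (random effects) for stacked samples."""
    p = Lam_C.shape[0]
    zeros = np.zeros((p, p))
    blocks = [Lam_C] * n1 + [Lam_T] * n2
    # Psi: one block row per sample, [Lam_C, 0] for control, [0, Lam_T] for treatment.
    Psi = np.vstack([np.hstack([Lam_C, zeros])] * n1
                    + [np.hstack([zeros, Lam_T])] * n2)
    # Pi: block-diagonal matrix with one influence matrix per sample.
    Pi = np.zeros((len(blocks) * p, len(blocks) * p))
    for i, block in enumerate(blocks):
        Pi[i * p:(i + 1) * p, i * p:(i + 1) * p] = block
    return Psi, Pi


# Hypothetical two-gene networks and means.
Lam_C = np.array([[1.0, 0.0], [0.5, 1.0]])
Lam_T = np.array([[1.0, 0.0], [0.8, 1.0]])
mu_C = np.array([1.0, 2.0])
mu_T = np.array([3.0, 4.0])
Psi, Pi = design_matrices(Lam_C, Lam_T, n1=2, n2=1)
mean_Y = Psi @ np.concatenate([mu_C, mu_T])
```

Each block of `mean_Y` equals $\Lambda^C\mu^C$ for a control sample and $\Lambda^T\mu^T$ for a treatment sample, matching the single-sample mean $\Lambda\mu$.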
INFERENCE USING MLM
Let $l$ be an estimable linear combination of fixed effects (we call $l$ a contrast vector) and consider the test
$$H_0: l\beta = 0 \quad \text{vs.} \quad H_1: l\beta \neq 0$$
Consider the Wald test statistic
$$T = \frac{l\hat\beta}{\sqrt{l\hat{Q}l'}}$$
Under the null hypothesis, $T$ has approximately a $t$ distribution with degrees of freedom estimated using Satterthwaite's approximation method:
$$\nu = \frac{2\,(l\hat{Q}l')^2}{\tau' K \tau}$$
- $\tau$ is the gradient of $lQl'$ with respect to $(\sigma_\gamma^2, \sigma_\varepsilon^2)$
- $K$ is the empirical covariance matrix of the estimates of $(\sigma_\gamma^2, \sigma_\varepsilon^2)$
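The two formulas translate directly into code. In the sketch below, `Q_hat` is taken to be the estimated covariance matrix of $\hat\beta$ (an assumption; the slide does not define it), and all numeric inputs are made up for illustration. In practice $\hat\beta$, $\hat{Q}$, $\tau$ and $K$ would come from fitting the mixed linear model.

```python
import numpy as np


def wald_statistic(l, beta_hat, Q_hat):
    """T = l beta_hat / sqrt(l Q_hat l')."""
    return float(l @ beta_hat) / np.sqrt(float(l @ Q_hat @ l))


def satterthwaite_df(l, Q_hat, tau, K):
    """nu = 2 (l Q_hat l')^2 / (tau' K tau)."""
    return 2.0 * float(l @ Q_hat @ l) ** 2 / float(tau @ K @ tau)


# Hypothetical inputs: one gene, contrast "treatment minus control".
l = np.array([-1.0, 1.0])
beta_hat = np.array([1.0, 2.0])
Q_hat = np.diag([0.25, 0.25])
tau = np.array([0.5, 0.5])    # gradient of l Q l' w.r.t. the variance components
K = np.diag([0.01, 0.01])     # covariance of the variance-component estimates

T = wald_statistic(l, beta_hat, Q_hat)
nu = satterthwaite_df(l, Q_hat, tau, K)
```

The statistic `T` would then be compared with a $t$ distribution with `nu` degrees of freedom.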
ANALYSIS OF YEAST GALACTOSE UTILIZATION DATA
EXTRACTING INTERESTING PATTERNS FROM TIME-EVOLVING NETWORKS Time-evolving network data consist of ordered sequences of graphs, e.g., network time-series
POPULAR APPROACH: TIME SERIES ANALYSIS OF NETWORK STATISTICS Extracting time series of network statistics (e.g. centrality parameter) allows direct application of time-series methods
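A minimal sketch of this approach, assuming each network in the sequence is an undirected simple graph given as a symmetric 0/1 adjacency matrix. Degree and density are used as example statistics; the slide does not prescribe which statistic to use.

```python
import numpy as np


def degree_series(adjs):
    """Row t holds the degree of every node in the graph at time t."""
    return np.array([A.sum(axis=1) for A in adjs])


def density_series(adjs):
    """Fraction of possible edges present at each time point."""
    n = adjs[0].shape[0]
    return np.array([A.sum() / (n * (n - 1)) for A in adjs])


# Hypothetical sequence on 3 nodes: one edge, then a path, then a triangle.
A1 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A3 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
degrees = degree_series([A1, A2, A3])
density = density_series([A1, A2, A3])
```

Each column of `degrees` and the vector `density` are ordinary time series, so standard time-series tools apply to them directly.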
DRAWBACKS OF NETWORK STATISTICS ANALYSIS
- Which network statistics?
  - Heavily context dependent
  - Often unknown, and the easiest statistics to compute may not be informative
- Which are the important nodes and how did they evolve over time?
  - Usually requires additional, ad-hoc analysis
DECOMPOSITION OF THE NETWORK ADJACENCY MATRICES
- Matrix decompositions achieve dimension reduction
- Preserve essential features
- Large amount of existing work that can be leveraged for the problem at hand
- Which matrix decomposition? Non-negative Matrix Factorization
NON-NEGATIVE MATRIX FACTORIZATION
Let $Y$ be an observed $n \times p$ matrix that is non-negative. NMF expresses $Y \approx UV^T$, where $U \in \mathbb{R}_+^{n \times K}$, $V \in \mathbb{R}_+^{p \times K}$, and $K \ll \min\{n, p\}$
WHY NMF?
- Better interpretability: $Y_{ij} \approx \sum_{k=1}^{K} U_{ik} V_{jk}$, where $U_{ik} V_{jk}$ measures the contribution of cluster $k$ to $Y_{ij}$
- Adjacency matrices are typically non-negative
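One common way to compute such a factorization is with the multiplicative updates of Lee and Seung for the squared Frobenius loss. The sketch below uses that generic algorithm, which is not necessarily the one used in the talk; the function name, default parameters, and toy matrix are illustrative assumptions.

```python
import numpy as np


def nmf(Y, K, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative Y (n x p) by U @ V.T with U, V >= 0."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    U = rng.random((n, K)) + eps
    V = rng.random((p, K)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        U *= (Y @ V) / (U @ (V.T @ V) + eps)
        V *= (Y.T @ U) / (V @ (U.T @ U) + eps)
    return U, V


# Hypothetical block-structured matrix with two groups.
Y = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
U, V = nmf(Y, K=2)

# contrib[i, j, k] = U[i, k] * V[j, k], the contribution of cluster k to Y[i, j].
contrib = U[:, None, :] * V[None, :, :]
```

Because every term `contrib[i, j, k]` is non-negative, the clusters can only add to an entry and never cancel each other, which is the source of the interpretability claimed above.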
MODELING NETWORK TIME SERIES
Decompose spatio-temporal (network time-series) data as
[Diagram: Space × Time data ≈ Basis × Factors, with smoothness conditions]
Intuition: Networks have short-term fluctuations, but latent factors are smooth and exhibit long-term trends.
EVOLVING FACTORIZATIONS
We observe $\{Y_t, t = 1, \dots, T\}$ (network time-series), and posit $Y_t \approx U V_t^T$ or $Y_t \approx U_t V_t^T$. The choice depends on context (different network types) and goal (clustering, heaviest element search, visual exploration).
OBTAINING ESTIMATES
Based on optimizing the following objective function:
$$O = \min_{U \ge 0,\, V_t \ge 0} \; \sum_{t=1}^{T} \|Y_t - U V_t^T\|_F^2 + \sum_{t,\tilde t = 1}^{T} W(t, \tilde t)\, \|V_t - V_{\tilde t}\|_F^2 + \lambda_g \sum_{t=1}^{T} \mathrm{Tr}(V_t^T L_t V_t)$$
- $W(t, \tilde t)$ is a weight function that is proportional to some kernel and controls sensitivity to short-term fluctuations. Similar to a Hodrick-Prescott filter.
- $\lambda_g, L_t$ form a group penalty that controls the importance of a priori clustering knowledge.
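To make the three terms of the objective concrete, the sketch below evaluates it for given factors: a fit term, a temporal smoothness term, and a group penalty. It only evaluates the objective and does not perform the minimization; the function name, argument layout, and toy inputs are illustrative assumptions.

```python
import numpy as np


def objective(Ys, U, Vs, W, Ls, lam_g):
    """Fit + temporal smoothness + group penalty for fixed U and V_1..V_T."""
    T = len(Ys)
    fit = sum(np.linalg.norm(Ys[t] - U @ Vs[t].T, "fro") ** 2 for t in range(T))
    smooth = sum(W[t, s] * np.linalg.norm(Vs[t] - Vs[s], "fro") ** 2
                 for t in range(T) for s in range(T))
    group = lam_g * sum(np.trace(Vs[t].T @ Ls[t] @ Vs[t]) for t in range(T))
    return fit + smooth + group


# Hypothetical example: 2 nodes, K = 1 factor, T = 2 time points.
U = np.array([[1.0], [0.0]])
Vs = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Ys = [np.array([[1.0, 0.0], [0.0, 0.0]]), np.zeros((2, 2))]
W = np.array([[0.0, 0.5], [0.5, 0.0]])            # weights between time points
L = np.array([[1.0, -1.0], [-1.0, 1.0]])          # both nodes in one group
value = objective(Ys, U, Vs, W, [L, L], lam_g=0.1)
```

Here the fit term contributes 1, the smoothness term 2, and the group penalty 0.2, so larger weights in `W` or a larger `lam_g` shift the balance toward smoother or more group-consistent factors.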
GROUP PENALTY
Main Idea: If nodes $i$ and $j$ belong to the same group, then they should have similar coordinates given by $V_t$.
Define the Laplacian as $L_t = D_t - G_t$, where
$$(G_t)_{ij} = \begin{cases} 1, & \text{if nodes } i \text{ and } j \text{ belong to the same group} \\ 0, & \text{otherwise} \end{cases}
\qquad
D_t = \mathrm{diag}\Big(\sum_i (G_t)_{ij},\; j = 1, \dots, n\Big)$$
LAPLACIAN SMOOTHING
Fact: For every $n \times K$ matrix $V_t$, we have
$$\lambda_g \mathrm{Tr}(V_t^T L_t V_t) = \frac{\lambda_g}{2} \sum_{k} \sum_{i,j} (G_t)_{ij} \big((V_t)_{ik} - (V_t)_{jk}\big)^2,$$
where the inner sum runs over all ordered pairs $(i, j)$.
- The group penalty $\{\lambda_g, L_t\}$ creates an abstract manifold at time $t$, and the weight function $W(t, \tilde t)$ creates an abstract manifold between times $t$ and $\tilde t$.
- The penalties utilize external information to create a topology in which we embed and view the data
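The identity behind the Laplacian penalty can be verified numerically. The sketch below builds $L = D - G$ from group labels and compares the trace form with the pairwise form; note the factor 1/2 that appears when the sum runs over all ordered pairs $(i, j)$. Function names and the toy labels are illustrative assumptions.

```python
import numpy as np


def group_laplacian(labels):
    """Return (L, G) with G[i, j] = 1 when i and j share a group, L = D - G."""
    labels = np.asarray(labels)
    G = (labels[:, None] == labels[None, :]).astype(float)
    D = np.diag(G.sum(axis=0))
    return D - G, G


def pairwise_penalty(V, G):
    """0.5 * sum_{i,j} G[i, j] * ||V[i] - V[j]||^2 over ordered pairs."""
    diff = V[:, None, :] - V[None, :, :]   # shape (n, n, K)
    return 0.5 * np.sum(G[:, :, None] * diff ** 2)


# Hypothetical example: nodes 0 and 1 share a group, node 2 is alone.
L, G = group_laplacian([0, 0, 1])
V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
trace_form = np.trace(V.T @ L @ V)
pair_form = pairwise_penalty(V, G)
```

Only the distance between nodes 0 and 1 is penalized in this example; node 2 can sit anywhere at no cost because it shares a group with no other node.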
ARXIV CITATION NETWORK
- Citation network sequence from the e-print service arXiv for the high energy physics theory section, covering papers from October 1993 to December 2002.
- There are 22750 papers (nodes) with 176602 edges (references) over 112 months.
- Since citations never die, we posit $Y_t = U V_t^T$.
[Figure: estimates of $V_t$ (time-varying paper impact scores), with panels for the 1st component, the 2nd component, and the sum of components, and citation network layouts I-V]
HIGHEST IMPACT PAPERS BY $V_t$

1993-1999
Title | Authors | In-Degree | Out-Degree | # citations (Google)
Heterotic and Type I String Dynamics from Eleven Dimensions | Horava and Witten | 783 | 18 | 2265
Five-branes And M-Theory On An Orbifold | Witten | 169 | 15 | 249
D-Branes and Topological Field Theories | Bershadsky, et al. | 271 | 15 | 457
Lectures on Superstring and M Theory Dualities | Schwarz | 274 | 68 | 483
Type IIB Superstrings, BPS Monopoles, And Three-Dimensional Gauge Dynamics | Hanany and Witten | 437 | 20 | 809

2000 onwards
Title | Authors | In-Degree | Out-Degree | # citations (Google)
The Large N Limit of Superconformal Field Theories and Supergravity | Maldacena | 1059 | 2 | 9928
Anti De Sitter Space And Holography | Witten | 766 | 2 | 6467
Gauge Theory Correlators from Non-Critical String Theory | Klebanov and Polyakov | 708 | 0 | 5592
Large N Field Theories, String Theory and Gravity | Aharony, et al. | 446 | 74 | 3131
String Theory and Noncommutative Geometry | Seiberg and Witten | 796 | 12 | 3624
STATIC CLUSTERING The degree (number of connections) of each paper over all time points, colored by a top community detection algorithm (Newman PNAS, 2006). The groupings are not interpretable in terms of the time-profile of each paper.
EIGENVECTOR CENTRALITY
[Figure: average age (0-70 months) by year (1994-2002) for the top 5, 10, 50, 100 and 500 authorities]
The average age in months of the top authority papers over time (Kleinberg, J. ACM 1999). We see evidence for a change point around year 2000, but what about paper growth and grouping structure? These need more, ad-hoc analysis.
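For reference, authority scores in the sense of Kleinberg's HITS algorithm can be computed by power iteration on $A^T A$, where $A_{ij} = 1$ if paper $i$ cites paper $j$. The sketch below is a generic illustration on a made-up four-paper citation graph, not the analysis behind the figure.

```python
import numpy as np


def authority_scores(A, n_iter=100):
    """Power iteration for the leading eigenvector of A.T @ A (unit 2-norm)."""
    a = np.ones(A.shape[0])
    M = A.T @ A
    for _ in range(n_iter):
        a = M @ a
        a /= np.linalg.norm(a)
    return a


# Hypothetical citations: papers 0, 1 and 3 cite paper 2; paper 0 also cites paper 1.
A = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
scores = authority_scores(A)
top_authority = int(np.argmax(scores))
```

Ranking papers by `scores` at each time point and averaging the age of the top-ranked ones would give a curve of the kind described in the caption.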
GENERAL REFERENCES
- Kolaczyk, E.D. (2009), Statistical Analysis of Network Data: Methods and Models, Springer.
- Fienberg, S. (2012), A Brief History of Statistical Models for Network Analysis and Open Challenges, Journal of Computational and Graphical Statistics, 20, 825-839.
- Michailidis, G. (2012), Statistical Challenges in Biological Networks, Journal of Computational and Graphical Statistics, 20, 840-855.
- Hunter, D., Krivitsky, P. and Schweinberger, M. (2012), Computational Statistical Methods for Social Network Models, Journal of Computational and Graphical Statistics, 20, 856-882.
SPECIFIC TO THIS PRESENTATION
- Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011), Joint estimation of multiple graphical models, Biometrika, 98, 1-15.
- Shojaie, A. and Michailidis, G. (2009), Analysis of Gene Sets Based on the Underlying Regulatory Network, Journal of Computational Biology, 16(3), 407-426.
- Mankad, S. and Michailidis, G. (2012), Structural and functional discovery in dynamic networks with non-negative matrix factorization, Physical Review E, forthcoming.