STATISTICAL MODELS AND ISSUES IN THE ANALYSIS OF NETWORK DATA


George Michailidis
Department of Statistics, University of Michigan
www.stat.lsa.umich.edu/ gmichail
Plenary Talk, UP-STAT 13, Rochester Institute of Technology, April 2013

WHY NETWORKS?
- Relatively small field of study until the late 1990s
- Explosive growth of interest and work on networks from 2000 onward
- Main factors:
  - Development of high-throughput technologies
  - Systems-level perspective in science
  - New modeling techniques and computational advances

SOME RECENT DEVELOPMENTS I
- Rapid increase in publications

SOME RECENT DEVELOPMENTS II
- New books, courses and journals

SOME RECENT DEVELOPMENTS III
- Dedicated workshops

WHAT IS A NETWORK?
- A collection of interconnected entities
- Mathematically, it is convenient to represent it as a graph $G = (V, E)$, where $V$ denotes the set of nodes (vertices) and $E$ the set of edges

EXAMPLES OF NETWORKS
Networks have become an integral tool for addressing diverse problems in a number of scientific fields. For example:
- Technological (e.g. communications, transportation, energy, sensor)
- Biological (e.g. gene regulation, protein interactions, predator-prey relations)
- Social (e.g. friendship, e-mail, trade flows)
- Informational (e.g. Web, Twitter, peer-to-peer)

STATISTICS AND NETWORK ANALYSIS
- Network analysis has attracted participants from diverse scientific fields, including information scientists, statisticians, mathematicians, applied physicists, complex systems theorists, ...
- Uneven developments
- Some topics rediscover and employ established techniques; others require new models and tools
- Nevertheless, some key topics have emerged that are truly statistical in nature

STATISTICS AND NETWORK ANALYSIS
- Network characterization (e.g. importance of nodes, identification of network communities, properties of the degree distribution)
- Network sampling: novel sampling schemes to construct a network, e.g. induced subgraph sampling, incident subgraph sampling, snowball sampling, link tracing
- Network inference: identify the topology from data, e.g. link prediction, graphical modeling
- Network dynamics: (i) stochastic processes (flows) on graphs, (ii) evolution of graphs over time
- Network visualization
- Time-varying networks
- Incorporation of network information in statistical inference
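The sampling schemes named above differ in what is drawn at random. The following is a minimal sketch under common textbook definitions, not code from the talk: induced subgraph sampling draws nodes and keeps edges between them, incident subgraph sampling draws edges and keeps their endpoints, and snowball sampling grows outward from seed nodes. The function names and the toy graph are illustrative assumptions.

```python
import random

def induced_subgraph_sample(nodes, edges, n_sample, rng):
    """Draw nodes uniformly; keep only edges whose BOTH endpoints were drawn."""
    chosen = set(rng.sample(sorted(nodes), n_sample))
    kept = [(u, v) for (u, v) in edges if u in chosen and v in chosen]
    return chosen, kept

def incident_subgraph_sample(edges, m_sample, rng):
    """Draw edges uniformly; keep every node incident to a drawn edge."""
    kept = rng.sample(list(edges), m_sample)
    chosen = {u for edge in kept for u in edge}
    return chosen, kept

def snowball_sample(adj, seeds, n_waves):
    """Start from the seeds and add all neighbours, wave by wave."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(n_waves):
        neighbours = set()
        for u in frontier:
            neighbours.update(adj.get(u, ()))
        frontier = neighbours - seen  # only newly reached nodes
        seen |= frontier
    return seen

# Toy undirected graph (hypothetical): path 0-1-2-3-4 plus the chord 1-3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
nodes = {0, 1, 2, 3, 4}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

rng = random.Random(0)
print(induced_subgraph_sample(nodes, edges, 3, rng))
print(incident_subgraph_sample(edges, 2, rng))
print(snowball_sample(adj, [0], 2))
```

The three schemes generally produce different degree distributions from the same underlying graph, which is one reason sampling design is treated as a statistical problem in its own right.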

NETWORK INFERENCE BASED ON GRAPHICAL MODELS
Some background on graphical models:
- Represent conditional independence relationships between a set of random variables
- No edge between $X_j$ and $X_{j'}$ $\iff$ $X_j$ is independent of $X_{j'}$ conditional on all other variables
- Typically estimated from a set of $n$ iid observations on $p$ variables

[Figure: two example graphs on nodes 1 to 5]

EXAMPLE 1: TEXT MINING
[Figure: network whose nodes are word stems: address, mail, phone, offic, inform, gener, develop, project, work, time, student, graduat, system, includ, program, email, research, group, interest, comput, engin, public, fax, home, page, link, web, scienc, univers, depart]

EXAMPLE 2: ROLL CALL DATA
[Figure: network whose nodes are U.S. senators, listed alphabetically from Akaka to Wyden]

EXAMPLE 3: GENE NETWORKS

GAUSSIAN GRAPHICAL MODELS
- $X_1, \ldots, X_p$ jointly follow $N(0, \Sigma)$
- Dependence structure fully characterized by the covariance structure
- Let $\rho_{j,j'} = \mathrm{cor}(X_j, X_{j'} \mid \text{others})$ denote the partial correlation.

PARTIAL CORRELATION
Nodes $j$ and $j'$ are connected $\iff \rho_{j,j'} \neq 0$

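GAUSSIAN GRAPHICAL MODELS (CTD)

INVERSE COVARIANCE MATRIX
- Let $\Omega = \Sigma^{-1}$ denote the inverse covariance matrix. We have $\rho_{j,j'} \propto \omega_{j,j'}$; specifically, $\rho_{j,j'} = -\omega_{j,j'} / \sqrt{\omega_{j,j}\,\omega_{j',j'}}$.

[Figure: graph on nodes 1 to 5 with edges labeled $\rho_{13}$, $\rho_{14}$, $\rho_{24}$, $\rho_{34}$, $\rho_{35}$]

$$
\Omega = \begin{pmatrix}
\omega_{1,1} & 0 & \omega_{1,3} & \omega_{1,4} & 0 \\
0 & \omega_{2,2} & 0 & \omega_{2,4} & 0 \\
\omega_{3,1} & 0 & \omega_{3,3} & \omega_{3,4} & \omega_{3,5} \\
\omega_{4,1} & \omega_{4,2} & \omega_{4,3} & \omega_{4,4} & 0 \\
0 & 0 & \omega_{5,3} & 0 & \omega_{5,5}
\end{pmatrix}
$$

- Hence, estimating a Gaussian graphical model $\iff$ estimating $\Omega$
- Also, estimating the graph corresponds to identifying the zeros in $\Omega$.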
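The link between zeros of $\Omega$ and missing edges can be checked numerically. The sketch below uses the standard identity $\rho_{j,j'} = -\omega_{j,j'}/\sqrt{\omega_{j,j}\,\omega_{j',j'}}$ and a precision matrix with the same zero pattern as the five-node example; the numeric entries are made up for illustration and are not from the talk.

```python
import numpy as np

def partial_correlations(omega):
    """Partial correlation matrix implied by a precision matrix omega."""
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

# Hypothetical precision matrix with the zero pattern of the slide's example
# (edges 1-3, 1-4, 2-4, 3-4, 3-5). Strict diagonal dominance with a positive
# diagonal guarantees that it is positive definite.
omega = np.array([
    [2.0, 0.0, 0.5, 0.4, 0.0],
    [0.0, 2.0, 0.0, 0.6, 0.0],
    [0.5, 0.0, 2.0, 0.3, 0.7],
    [0.4, 0.6, 0.3, 2.0, 0.0],
    [0.0, 0.0, 0.7, 0.0, 2.0],
])

rho = partial_correlations(omega)
p = omega.shape[0]
# Read the graph off the non-zero partial correlations (1-based node labels).
edges = {
    (j + 1, k + 1)
    for j in range(p)
    for k in range(j + 1, p)
    if abs(rho[j, k]) > 1e-12
}
print(sorted(edges))
```

In practice $\Omega$ is estimated from data, so exact zeros do not appear without some form of thresholding or penalization.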

THE CASE OF HIGH-DIMENSIONAL DATA
- What happens if we have few samples and many more variables?
- Some examples:
  - Biological networks: samples in the hundreds (at best), molecular entities in the thousands
  - Text mining: both documents and corpus size in the thousands, but one needs to estimate all pairwise relationships between words!
- Solution: impose sparsity

ESTIMATION OF A SPARSE INVERSE COVARIANCE MATRIX
- This issue was addressed in a paper by Dempster (1972) and then remained dormant for 35 years, until Meinshausen and Bühlmann (2006) developed a penalized (lasso) regression approach to solve it
- Since then, there have been over 100 papers looking at various modeling, computational and inference aspects of the problem

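MAXIMUM LIKELIHOOD ESTIMATION OF A SPARSE INVERSE COVARIANCE MATRIX
- This goal can be accomplished by optimizing the following objective function, where $\hat\Sigma$ is the sample covariance matrix and $\Omega \succ 0$ requires $\Omega$ to be positive definite:
$$
\max_{\Omega \succ 0} \; \log\det(\Omega) - \mathrm{trace}(\hat\Sigma\,\Omega) - \lambda \sum_{j \neq j'} |\omega_{j,j'}|
$$
- Note that when $\lambda = 0$, $\hat\Omega = (\hat\Sigma)^{-1}$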
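To make the objective concrete, the sketch below evaluates it for a candidate $\Omega$ and checks the $\lambda = 0$ remark on simulated data. It only evaluates the criterion and does not implement an optimizer; the function name `penalized_loglik` and the simulated data are illustrative assumptions.

```python
import numpy as np

def penalized_loglik(omega, sigma_hat, lam):
    """log det(omega) - trace(sigma_hat @ omega) - lam * sum_{j != j'} |omega_jj'|.

    Returns -inf when omega is not positive definite.
    """
    try:
        chol = np.linalg.cholesky(omega)  # fails unless omega is positive definite
    except np.linalg.LinAlgError:
        return -np.inf
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    off_diag_l1 = np.abs(omega).sum() - np.abs(np.diag(omega)).sum()
    return logdet - np.trace(sigma_hat @ omega) - lam * off_diag_l1

# Simulated data (hypothetical): n = 200 observations on p = 4 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
sigma_hat = np.cov(X, rowvar=False)
omega_mle = np.linalg.inv(sigma_hat)

# With lam = 0 the inverse sample covariance maximizes the criterion,
# so any other positive definite candidate scores lower.
print(penalized_loglik(omega_mle, sigma_hat, 0.0))
print(penalized_loglik(np.eye(4), sigma_hat, 0.0))
```

For $\lambda > 0$ the penalty makes the dense inverse $\hat\Sigma^{-1}$ less attractive, and the maximizer acquires exact zeros off the diagonal.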

ILLUSTRATION OF SPARSITY AS A FUNCTION OF λ
[Figure: from "Sparse inverse covariance estimation with the graphical lasso"]

ILLUSTRATION: CS WEBPAGES AT CMU
[Figure labels: Computer Science Department; Faculty, Project, Student, Course]

CS WEBPAGES AT CMU
- Used about 1400 webpages and focused on the 100 most frequent words

[Figure: Common Structure network over the 100 word stems; highlighted groups:]
(A) Webpage: site, web, page, link, home
(B) Research area/lab: current, research, lab, laboratori, area, member, group
(C) Parallel programming: distribut, parallel, system, algorithm, perform, problem, high
(D) Software development: softwar, develop, structur, data, program, algorithm, languag

ESTIMATED NETWORKS FOR FACULTY AND STUDENT WEBPAGES
[Figure: two panels, (A) Student and (B) Faculty, each drawn over the same 100 word stems (scienc, comput, univers, depart, ...)]

INCORPORATING NETWORK INFORMATION IN STATISTICAL TESTING PROBLEMS
Rationale:
- High-throughput techniques (sequencing, profiling) have enabled comprehensive monitoring of biological systems
- Analysis of high-throughput data typically yields a list of differentially expressed genes (proteins, metabolites, etc.), obtained by statistical testing for differences between two groups, for example, normal and disease or treatment and control
- This list has the potential to provide insight into a given biological phenomenon or phenotype, but in many cases it is hard to extract meaning from it

INCORPORATING NETWORK INFORMATION IN STATISTICAL TESTING PROBLEMS (CTD)
- To reduce the complexity of the data, biomedical researchers have resorted to grouping genes into smaller sets (pathways) of related genes, e.g. according to their function
- The number and content of knowledge databases that can be used for such grouping are growing at an accelerating pace (e.g. KEGG, GO, TRANSFAC, DIP, ...)

PROBLEM FORMULATION
- Given $n_1$ samples for the control condition and $n_2$ samples for the treatment condition of expression data for $p$ genes, and the network of gene interactions, test for activation of selected subgraphs

[Figure: network of gene interactions]

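A LATENT VARIABLE MODEL FORMULATION
$$
\begin{aligned}
X_1 &= \gamma_1 \\
X_2 &= \rho_{12} X_1 + \gamma_2 = \rho_{12}\gamma_1 + \gamma_2 \\
X_3 &= \rho_{23} X_2 + \gamma_3 = \rho_{23}\rho_{12}\gamma_1 + \rho_{23}\gamma_2 + \gamma_3
\end{aligned}
$$
Thus $X = \Lambda\gamma$, where
$$
\Lambda = \begin{pmatrix}
1 & 0 & 0 \\
\rho_{12} & 1 & 0 \\
\rho_{12}\rho_{23} & \rho_{23} & 1
\end{pmatrix}
$$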
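The matrix $\Lambda$ for this three-gene chain can be obtained mechanically. Writing the recursion as $X = AX + \gamma$, where $A$ holds the direct effects ($A_{21} = \rho_{12}$, $A_{32} = \rho_{23}$), gives $X = (I - A)^{-1}\gamma$. The closed form $\Lambda = (I - A)^{-1}$ is a step added here for illustration and is not stated on the slide; the name `A` and the numeric values of $\rho_{12}$, $\rho_{23}$ are assumptions.

```python
import numpy as np

def influence_matrix(A):
    """Lambda = (I - A)^{-1}, where A[i, j] is the direct effect of node j on node i.

    Assumes the network is a directed acyclic graph, so that I - A is invertible.
    """
    p = A.shape[0]
    return np.linalg.inv(np.eye(p) - A)

# Hypothetical edge weights for the chain 1 -> 2 -> 3.
rho12, rho23 = 0.5, 0.8
A = np.zeros((3, 3))
A[1, 0] = rho12  # gene 1 acts on gene 2
A[2, 1] = rho23  # gene 2 acts on gene 3

Lam = influence_matrix(A)
expected = np.array([
    [1.0, 0.0, 0.0],
    [rho12, 1.0, 0.0],
    [rho12 * rho23, rho23, 1.0],
])
print(np.allclose(Lam, expected))
```

The entry in row 3, column 1 is the product $\rho_{12}\rho_{23}$, the effect of gene 1 on gene 3 propagated along the path through gene 2.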

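THE LATENT VARIABLE MODEL
- Let $Y$ be the $i$th sample in the expression data
- Let $Y = X + \varepsilon$, with $X$ the signal and $\varepsilon \sim N_p(0, \sigma_\varepsilon^2 I_p)$ the noise
- Define latent variables $\gamma \sim N_p(\mu, \sigma_\gamma^2 I_p)$
- Let the influence of the $j$th gene on the $i$th gene be $\Lambda_{ij}$; $\Lambda = [\Lambda_{ij}]$ is called the Influence Matrix of the network.
- $Y = \Lambda\gamma + \varepsilon$, so $Y \sim N_p(\Lambda\mu, \sigma_\gamma^2 \Lambda\Lambda^\top + \sigma_\varepsilon^2 I_p)$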

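MIXED LINEAR MODEL REPRESENTATION
Let $(Y_i^C, \mu^C, \Lambda_C)$ and $(Y_i^T, \mu^T, \Lambda_T)$ represent the data under control and treatment. Stacking the $n_1$ control and $n_2$ treatment samples gives
$$
Y = \Psi\beta + \Pi\gamma + \varepsilon, \qquad \beta = \begin{pmatrix} \mu^C \\ \mu^T \end{pmatrix},
$$
where
$$
\Psi = \begin{pmatrix}
\Lambda_C & 0 \\
\vdots & \vdots \\
\Lambda_C & 0 \\
0 & \Lambda_T \\
\vdots & \vdots \\
0 & \Lambda_T
\end{pmatrix},
\qquad
\Pi = \mathrm{diag}(\Lambda_C, \ldots, \Lambda_C, \Lambda_T, \ldots, \Lambda_T),
$$
with $n_1$ copies of $\Lambda_C$ and $n_2$ copies of $\Lambda_T$, and
$$
E\begin{pmatrix} \gamma \\ \varepsilon \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\qquad
\mathrm{Var}\begin{pmatrix} \gamma \\ \varepsilon \end{pmatrix} = \begin{pmatrix} \sigma_\gamma^2 I & 0 \\ 0 & \sigma_\varepsilon^2 I \end{pmatrix}.
$$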

INFERENCE USING MLM
- Let $l$ be an estimable linear combination of fixed effects (we call $l$ a contrast vector) and consider the test $H_0: l\beta = 0$ vs. $H_1: l\beta \neq 0$
- Consider the Wald test statistic
$$
T = \frac{l\hat\beta}{\sqrt{l \hat Q l^\top}}
$$
- Under the null hypothesis, $T$ has approximately a $t$ distribution with degrees of freedom estimated using Satterthwaite's approximation method:
$$
\nu = \frac{2\,(l \hat Q l^\top)^2}{\tau^\top K \tau}
$$
- $\tau$ is the gradient of $l Q l^\top$ with respect to $(\sigma_\gamma^2, \sigma_\varepsilon^2)$
- $K$ is the empirical covariance matrix of $(\sigma_\gamma^2, \sigma_\varepsilon^2)$

ANALYSIS OF YEAST GALACTOSE UTILIZATION DATA

EXTRACTING INTERESTING PATTERNS FROM TIME-EVOLVING NETWORKS
- Time-evolving network data consist of ordered sequences of graphs, i.e. network time series

POPULAR APPROACH: TIME SERIES ANALYSIS OF NETWORK STATISTICS
- Extracting time series of network statistics (e.g. centrality parameter) allows direct application of time-series methods
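As a minimal sketch of this approach, the code below reduces each graph in a short sequence to two scalar summaries, edge density and maximum degree centrality, producing ordinary time series. The choice of statistics and the toy snapshots are illustrative assumptions, not taken from the talk.

```python
def degree_centrality(edges, n_nodes):
    """Degree of each node divided by the maximum possible degree (n_nodes - 1)."""
    deg = [0] * n_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [d / (n_nodes - 1) for d in deg]

def density(edges, n_nodes):
    """Fraction of possible undirected edges that are present."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))

# Hypothetical sequence of three snapshots on the same 4 nodes.
snapshots = [
    [(0, 1)],
    [(0, 1), (0, 2)],
    [(0, 1), (0, 2), (0, 3), (1, 2)],
]
n = 4

# One number per time point: these series can be fed to any time-series method.
density_series = [density(e, n) for e in snapshots]
max_centrality_series = [max(degree_centrality(e, n)) for e in snapshots]
print(density_series)
print(max_centrality_series)
```

Each snapshot is collapsed to a single number, so information about which nodes drive a change is lost unless it is tracked separately.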

DRAWBACKS OF NETWORK STATISTICS ANALYSIS
- Which network statistics?
  - Heavily context dependent
  - Often unknown, and the easiest statistics to compute may not be informative
- Which are the important nodes and how did they evolve over time?
  - Usually requires additional, ad hoc analysis

DECOMPOSITION OF THE NETWORK ADJACENCY MATRICES
- Matrix decompositions achieve dimension reduction
- Preserve essential features
- Large amount of existing work that can be leveraged for the problem at hand
- Which matrix decomposition? Non-negative Matrix Factorization

NON-NEGATIVE MATRIX FACTORIZATION
- Let $Y$ be an observed $n \times p$ matrix that is non-negative.
- NMF expresses $Y \approx UV^\top$, where $U \in \mathbb{R}_+^{n \times K}$, $V \in \mathbb{R}_+^{p \times K}$, and $K \ll \min\{n, p\}$

WHY NMF?
- Better interpretability: $Y_{ij} \approx \sum_{k=1}^{K} U_{ik} V_{jk}$, where $U_{ik} V_{jk}$ measures the contribution of cluster $k$ to $Y_{ij}$.
- Adjacency matrices are typically non-negative
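A factorization $Y \approx UV^\top$ with non-negative factors can be computed in several ways. The sketch below uses the multiplicative updates of Lee and Seung for the Frobenius loss, chosen here only because they are short; the talk does not say which algorithm was used. The function name `nmf`, its parameters, and the toy two-community matrix are illustrative assumptions.

```python
import numpy as np

def nmf(Y, K, n_iter=500, seed=0, eps=1e-10):
    """Approximate Y (n x p, non-negative) by U @ V.T with U, V >= 0.

    Uses Lee-Seung multiplicative updates for the squared Frobenius error.
    """
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    U = rng.random((n, K)) + eps
    V = rng.random((p, K)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs are preserved.
        U *= (Y @ V) / (U @ (V.T @ V) + eps)
        V *= (Y.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# Hypothetical adjacency-like matrix with two groups of three nodes each.
block = np.ones((3, 3))
Y = np.block([[block, np.zeros((3, 3))], [np.zeros((3, 3)), block]])

U, V = nmf(Y, K=2)
error = np.linalg.norm(Y - U @ V.T)
print(error)
# Contribution of component k to entry (i, j) is U[i, k] * V[j, k].
print(U[0, :] * V[0, :])
```

Because every term $U_{ik}V_{jk}$ is non-negative, components can only add to an entry and never cancel each other, which is what makes the per-component contributions readable.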

MODELING NETWORK TIME SERIES
- Decompose spatio-temporal (network time-series) data
  [Diagram labels: Space, Time, Basis, Factors, Smoothness Conditions]
- Intuition: networks have short-term fluctuations, but latent factors are smooth and exhibit long-term trends.

EVOLVING FACTORIZATIONS
- We observe $\{Y_t,\ t = 1, \ldots, T\}$ (network time series), and posit $Y_t \approx U V_t^\top$ or $Y_t \approx U_t V_t^\top$.
- The choice depends on context (different network types) and goal (clustering, heaviest element search, visual exploration).

OBTAINING ESTIMATES
Based on optimizing the following objective function:
$$
O = \min_{U \ge 0,\; V_t \ge 0} \; \sum_{t=1}^{T} \| Y_t - U V_t^\top \|_F^2
+ \sum_{t,\tilde t = 1}^{T} W(t, \tilde t)\, \| V_t - V_{\tilde t} \|_F^2
+ \lambda_g \sum_{t=1}^{T} \mathrm{Tr}(V_t^\top L_t V_t)
$$
- $W(t, \tilde t)$ is a weight function that is proportional to some kernel and controls sensitivity to short-term fluctuations. Similar to a Hodrick-Prescott filter.
- $\lambda_g$ and $L_t$ form a group penalty that controls the importance of a priori clustering knowledge.

GROUP PENALTY
- Main idea: if nodes $i$ and $j$ belong to the same group, then they should have similar coordinates given by $V_t$.
- Define the Laplacian as $L_t = D_t - G_t$, where
$$
(G_t)_{ij} = \begin{cases} 1, & \text{if nodes } i \text{ and } j \text{ belong to the same group} \\ 0, & \text{otherwise} \end{cases}
\qquad
D_t = \mathrm{diag}\Big( \sum_i (G_t)_{ij},\; j = 1, \ldots, n \Big).
$$

LAPLACIAN SMOOTHING
- Fact: for every $n \times K$ matrix $V_t$, we have
$$
\lambda_g \mathrm{Tr}(V_t^\top L_t V_t) = \frac{\lambda_g}{2} \sum_{k} \sum_{i,j} (G_t)_{ij} \big( (V_t)_{ik} - (V_t)_{jk} \big)^2,
$$
  where the inner sum runs over all ordered pairs $(i, j)$.
- The group penalty $\{\lambda_g, L_t\}$ creates an abstract manifold at time $t$, and the weight function $W(t, \tilde t)$ creates an abstract manifold between times $t$ and $\tilde t$.
- The penalties utilize external information to create a topology in which we embed and view the data
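The trace identity can be verified numerically. The sketch below builds $G$, $D$ and $L = D - G$ from a group labelling and compares $\mathrm{Tr}(V^\top L V)$ with the pairwise form; when the pairwise sum runs over all ordered pairs $(i, j)$, each unordered pair is counted twice, hence the factor $1/2$. The group labels and the random $V$ are made-up inputs.

```python
import numpy as np

def group_laplacian(groups):
    """Return (L, G) with G[i, j] = 1 if i and j share a group, and L = D - G."""
    g = np.asarray(groups)
    G = (g[:, None] == g[None, :]).astype(float)
    D = np.diag(G.sum(axis=0))
    return D - G, G

# Hypothetical labelling: nodes 0-1 in one group, nodes 2-4 in another.
groups = [0, 0, 1, 1, 1]
L, G = group_laplacian(groups)

rng = np.random.default_rng(1)
V = rng.random((5, 2))  # n = 5 nodes, K = 2 latent coordinates

lhs = np.trace(V.T @ L @ V)
diff = V[:, None, :] - V[None, :, :]  # diff[i, j, k] = V[i, k] - V[j, k]
rhs = 0.5 * np.sum(G[:, :, None] * diff ** 2)
print(np.isclose(lhs, rhs))

# Nodes in the same group with identical coordinates incur no penalty.
V_same = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0]])
print(np.trace(V_same.T @ L @ V_same))
```

The penalty therefore grows only with within-group disagreement in the coordinates and is indifferent to differences between groups.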

ARXIV CITATION NETWORK
- A citation network sequence from the e-print service arXiv for the high energy physics theory section, covering papers from October 1993 to December 2002.
- There are 22750 papers (nodes) with 176602 edges (references) over 112 months.
- Since citations never die, we posit $Y_t = U V_t^\top$.

[Figure: Estimates of $V_t$ (time-varying paper impact scores). Panels: 1st Component, 2nd Component, Sum of Components. Citation network layouts I, II, III, IV, V]

HIGHEST IMPACT PAPERS BY $V_t$

1993-1999
| Title | Authors | In-Degree | Out-Degree | # citations (Google) |
|---|---|---|---|---|
| Heterotic and Type I String Dynamics from Eleven Dimensions | Horava and Witten | 783 | 18 | 2265 |
| Five-branes And M-Theory On An Orbifold | Witten | 169 | 15 | 249 |
| D-Branes and Topological Field Theories | Bershadsky, et al. | 271 | 15 | 457 |
| Lectures on Superstring and M Theory Dualities | Schwarz | 274 | 68 | 483 |
| Type IIB Superstrings, BPS Monopoles, And Three-Dimensional Gauge Dynamics | Hanany and Witten | 437 | 20 | 809 |

2000 onwards
| Title | Authors | In-Degree | Out-Degree | # citations (Google) |
|---|---|---|---|---|
| The Large N Limit of Superconformal Field Theories and Supergravity | Maldacena | 1059 | 2 | 9928 |
| Anti De Sitter Space And Holography | Witten | 766 | 2 | 6467 |
| Gauge Theory Correlators from Non-Critical String Theory | Klebanov and Polyakov | 708 | 0 | 5592 |
| Large N Field Theories, String Theory and Gravity | Aharony, et al. | 446 | 74 | 3131 |
| String Theory and Noncommutative Geometry | Seiberg and Witten | 796 | 12 | 3624 |

STATIC CLUSTERING
[Figure: the degree (number of connections) of each paper over all time points, colored by a top community detection algorithm (Newman, PNAS, 2006)]
- The groupings are not interpretable in terms of the time profile of each paper.

EIGENVECTOR CENTRALITY
[Figure: average age in months (0 to 70) of the top 5, 10, 50, 100 and 500 authority papers, by year, 1994 to 2002]
- The average age in months of the top authority papers over time (Kleinberg, J. ACM 1999).
- We see evidence for a change point around year 2000, but what about paper growth and grouping structure? This needs more, ad hoc analysis.

GENERAL REFERENCES
- Kolaczyk, E.D. (2009), Statistical Analysis of Network Data: Methods and Models, Springer.
- Fienberg, S. (2012), A Brief History of Statistical Models for Network Analysis and Open Challenges, Journal of Computational and Graphical Statistics, 20, 825-839.
- Michailidis, G. (2012), Statistical Challenges in Biological Networks, Journal of Computational and Graphical Statistics, 20, 840-855.
- Hunter, D., Krivitsky, P. and Schweinberger, M. (2012), Computational Statistical Methods for Social Network Models, Journal of Computational and Graphical Statistics, 20, 856-882.

SPECIFIC TO THIS PRESENTATION
- Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011), Joint estimation of multiple graphical models, Biometrika, 98, 1-15.
- Shojaie, A. and Michailidis, G. (2009), Analysis of Gene Sets Based on the Underlying Regulatory Network, Journal of Computational Biology, 16(3), 407-426.
- Mankad, S. and Michailidis, G. (2012), Structural and functional discovery in dynamic networks with non-negative matrix factorization, Physical Review E, forthcoming.