A mixture model for random graphs


J-J Daudin, F. Picard, S. Robin (robin@inapg.inra.fr)
UMR INA-PG / ENGREF / INRA, Paris, Mathématique et Informatique Appliquées

Examples of networks.
Social: who knows who?
Biological: which protein interacts with which?
Internet: connections between servers or web pages.

Random graphs

Notation and definition. Given a set of $n$ vertices ($i = 1..n$), $X_{ij}$ indicates the presence or absence of a (non-oriented) edge between vertices $i$ and $j$: $X_{ij} = X_{ji} = \mathbb{1}\{i \leftrightarrow j\}$, $X_{ii} = 0$. The random graph is defined by the joint distribution of all the $\{X_{ij}\}_{i,j}$.

Typical characteristics.
Degree (connectivity) of the vertices: $K_i = \sum_{j \neq i} X_{ij}$.
Clustering coefficient: $c = \Pr\{X_{jk} = 1 \mid X_{ij} = X_{ik} = 1\}$.
Diameter: length of the longest shortest path between two vertices.
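The two first characteristics can be computed directly from an adjacency matrix. The sketch below is only an illustration: the function names and the small example graph are assumptions, and the clustering coefficient is estimated as the fraction of "V" configurations ($X_{ij} = X_{ik} = 1$) that are closed by an edge $j$-$k$.

```python
def degrees(X):
    """K_i = sum over j != i of X_ij, for a symmetric 0/1 adjacency matrix."""
    n = len(X)
    return [sum(X[i][j] for j in range(n) if j != i) for i in range(n)]


def clustering_coefficient(X):
    """Empirical Pr{X_jk = 1 | X_ij = X_ik = 1}: closed V's over all V's."""
    n = len(X)
    closed = triples = 0
    for i in range(n):
        nbrs = [j for j in range(n) if j != i and X[i][j]]
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                triples += 1                      # one V centred on i
                closed += X[nbrs[a]][nbrs[b]]     # is the V closed?
    return closed / triples if triples else 0.0


# Hypothetical example: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
X = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
K = degrees(X)
c = clustering_coefficient(X)
```

In this example three of the five V configurations are closed, so the empirical coefficient is 3/5.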

Erdős-Rényi (ER) model

Definition. The $\{X_{ij}\}_{i,j}$ are i.i.d.: $X_{ij} \sim \mathcal{B}(p)$.

Characteristics.
Degree: $K_i \sim \mathcal{B}(n-1, p) \approx \mathcal{P}(\lambda)$, with $\lambda = (n-1)p$.
Clustering coefficient: $c = p$.

Drawback. The ER model fits many real-world networks poorly. Empirical degree distributions are often very different from the Poisson distribution because a few vertices have very high degrees. Empirical clustering coefficients are generally higher than expected under ER.
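As a rough illustration of the ER model, the sketch below samples a graph and measures how close the binomial degree distribution is to its Poisson approximation. The function names, the seed, and the values $n = 1000$, $p = 0.005$ are assumptions chosen for the example.

```python
import math
import random


def sample_er(n, p, seed=0):
    """Draw a symmetric 0/1 adjacency matrix with i.i.d. Bernoulli(p) edges."""
    rng = random.Random(seed)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            X[i][j] = X[j][i] = int(rng.random() < p)
    return X


def binom_pmf(k, n, p):
    """Binomial(n, p) probability of k, computed in log space (0 < p < 1)."""
    logv = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(logv)


def poisson_pmf(k, lam):
    """Poisson(lam) probability of k, computed in log space (lam > 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def tv_binom_poisson(n, p):
    """Total variation distance between B(n-1, p) and P((n-1)p) on 0..n-1."""
    lam = (n - 1) * p
    return 0.5 * sum(abs(binom_pmf(k, n - 1, p) - poisson_pmf(k, lam))
                     for k in range(n))


X = sample_er(50, 0.1)
tv = tv_binom_poisson(1000, 0.005)
```

For a sparse graph such as this one the distance is small, which is why the Poisson form is used for the degrees.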

Erdős-Rényi mixture for graphs (ERMG)

An explicit random graph model.

Mixture population of vertices. We suppose that the vertices belong to $Q$ groups: $\alpha_q = \Pr\{i \in q\}$, $Z_{iq} = \mathbb{1}\{i \in q\}$.

Conditional distribution of the edges. The edges $\{X_{ij}\}$ are conditionally independent given the groups of the vertices: $X_{ij} \mid \{i \in q, j \in l\} \sim \mathcal{B}(\pi_{ql})$.

$\pi_{ql} = \pi_{lq}$ is the connection probability between groups $q$ and $l$. A high value of $\pi_{ql}$ reveals a preferential connectivity between groups $q$ and $l$.
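The two-stage definition (draw the groups, then draw the edges given the groups) translates into a short simulation. This is a minimal sketch; `sample_ermg`, the seed, and the example parameters are assumptions made for illustration.

```python
import random


def sample_ermg(n, alpha, pi, seed=0):
    """Sample group labels and a symmetric adjacency matrix from an ERMG.

    alpha[q]  : probability that a vertex belongs to group q
    pi[q][l]  : connection probability between groups q and l (symmetric)
    """
    rng = random.Random(seed)
    groups = rng.choices(range(len(alpha)), weights=alpha, k=n)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Edges are independent Bernoulli draws given the two groups.
            X[i][j] = X[j][i] = int(rng.random() < pi[groups[i]][groups[j]])
    return groups, X


# Hypothetical extreme case: vertices connect only within their own group.
groups, X = sample_ermg(20, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

With this extreme choice of $\pi$, an edge is present exactly when the two vertices share a group, which makes the role of $\pi_{ql}$ easy to see.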

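Some properties of the ERMG model

Conditional distribution of the degrees:
$$K_i \mid \{i \in q\} \sim \mathcal{B}(n-1, \bar\pi_q) \approx \mathcal{P}(\lambda_q), \quad \text{where } \bar\pi_q = \sum_l \alpha_l \pi_{ql}, \quad \lambda_q = (n-1)\bar\pi_q.$$

Marginal distribution of the degrees: we get a Poisson mixture,
$$K_i \sim \sum_q \alpha_q \mathcal{B}(n-1, \bar\pi_q) \approx \sum_q \alpha_q \mathcal{P}(\lambda_q).$$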
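The Poisson mixture for the degrees can be evaluated numerically from $\alpha$ and $\pi$. The sketch below assumes hypothetical parameter values and helper names; it only restates the two formulas above.

```python
import math


def group_rates(n, alpha, pi):
    """lambda_q = (n - 1) * sum_l alpha_l * pi_ql for each group q."""
    return [(n - 1) * sum(a * p for a, p in zip(alpha, row)) for row in pi]


def poisson_pmf(k, lam):
    """Poisson(lam) probability of k; lam = 0 is a point mass at 0."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def degree_pmf(k, n, alpha, pi):
    """Approximate Pr{K_i = k} as the mixture sum_q alpha_q P(lambda_q)."""
    rates = group_rates(n, alpha, pi)
    return sum(a * poisson_pmf(k, lam) for a, lam in zip(alpha, rates))


# Hypothetical two-group example.
n, alpha = 101, [0.3, 0.7]
pi = [[0.20, 0.05],
      [0.05, 0.10]]
rates = group_rates(n, alpha, pi)
total = sum(degree_pmf(k, n, alpha, pi) for k in range(200))
```

Here the two groups have mean degrees 9.5 and 8.5, and the mixture probabilities sum to one up to numerical error.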

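Between-group connectivity. $A_{ql}$ denotes the connectivity between groups $q$ and $l$: $A_{ql} = \sum_{i<j} Z_{iq} Z_{jl} X_{ij}$. In the ERMG model, its expectation is
$$\mathbb{E}(A_{ql}) = \frac{n(n-1)}{2}\, \alpha_q \alpha_l \pi_{ql}.$$

Clustering coefficient: $c = \Pr\{\nabla \cap V\} / \Pr\{V\} = \Pr\{\nabla\} / \Pr\{V\}$, where $V$ denotes the configuration $X_{ij} = X_{ik} = 1$ and $\nabla$ the triangle $X_{ij} = X_{ik} = X_{jk} = 1$. In the ERMG model, we get
$$c = \frac{\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql} \pi_{qm} \pi_{lm}}{\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql} \pi_{qm}}.$$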
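The ratio of triple sums for the clustering coefficient is easy to evaluate for small $Q$. The following sketch uses an assumed function name and hypothetical parameter values; it checks the formula on the one-group (ER) case and on a two-group example.

```python
def ermg_clustering(alpha, pi):
    """c = sum alpha_q alpha_l alpha_m pi_ql pi_qm pi_lm
           / sum alpha_q alpha_l alpha_m pi_ql pi_qm   (sums over q, l, m)."""
    Q = range(len(alpha))
    num = den = 0.0
    for q in Q:
        for l in Q:
            for m in Q:
                w = alpha[q] * alpha[l] * alpha[m] * pi[q][l] * pi[q][m]
                den += w               # probability of the V centred on q
                num += w * pi[l][m]    # ... times the closing edge l-m
    return num / den


# One group: the formula reduces to c = p.
c_er = ermg_clustering([1.0], [[0.3]])

# Two equally sized groups, strong within-group and weak between-group links.
eps = 0.1
c_clusters = ermg_clustering([0.5, 0.5], [[1.0, eps], [eps, 1.0]])
```

For the two-group example the triple sums give $(1 + 3\varepsilon^2)/(1 + \varepsilon)^2$, about 0.85 for $\varepsilon = 0.1$.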

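Independent model

The absence of preferential connection between groups corresponds to the case where $\pi_{ql} = \eta_q \eta_l$.

Distribution of degrees: $\{K_i \mid i \in q\} \approx \mathcal{P}(\lambda_q)$, where $\lambda_q = (n-1)\eta_q \bar\eta$ and $\bar\eta = \sum_l \alpha_l \eta_l$.

Between-group connectivity: $\mathbb{E}(A_{ql}) = n(n-1)(\alpha_q \eta_q)(\alpha_l \eta_l)/2$.

Clustering coefficient:
$$c = \frac{\left(\sum_q \alpha_q \eta_q^2\right)^2}{\bar\eta^2}.$$

The ER model corresponds to $Q = 1$, $\alpha_1 = 1$, $\bar\eta = \eta_1 = \sqrt{p}$ (so that $\pi_{11} = \eta_1^2 = p$), and we get the known result: $c = \eta_1^4 / \eta_1^2 = p$.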
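The closed form for the clustering coefficient follows from substituting $\pi_{ql} = \eta_q \eta_l$ into the general ERMG ratio of triple sums. The short derivation below is a reconstruction of that step, using only the symbols defined on the slides.

```latex
% Independent model: substitute \pi_{ql} = \eta_q \eta_l into the general ERMG formula.
\begin{align*}
\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql}\pi_{qm}\pi_{lm}
  &= \sum_{q,l,m} (\alpha_q \eta_q^2)(\alpha_l \eta_l^2)(\alpha_m \eta_m^2)
   = \Big(\sum_q \alpha_q \eta_q^2\Big)^3, \\
\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql}\pi_{qm}
  &= \sum_{q,l,m} (\alpha_q \eta_q^2)(\alpha_l \eta_l)(\alpha_m \eta_m)
   = \Big(\sum_q \alpha_q \eta_q^2\Big)\,\bar\eta^{\,2}, \\
c &= \frac{\big(\sum_q \alpha_q \eta_q^2\big)^2}{\bar\eta^{\,2}}.
\end{align*}
```

Each factor $\eta$ appears twice per group index in the numerator and once (for $l$ and $m$) or twice (for $q$) in the denominator, which is what makes the sums factorize.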

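Examples

Random network: $Q = 1$, $\pi = p$, clustering coefficient $c = p$.

Independent model (product connectivity): $Q = 2$, $\pi = \begin{pmatrix} a^2 & ab \\ ab & b^2 \end{pmatrix}$, $c = \dfrac{(a^2 + b^2)^2}{(a + b)^2}$.

Stars: $Q = 4$, $\pi = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$, $c = 0$.

Clusters (affiliation networks): $Q = 2$, $\pi = \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 1 \end{pmatrix}$, $c = \dfrac{1 + 3\varepsilon^2}{(1 + \varepsilon)^2}$.

The two-group values of $c$ correspond to equal group proportions, $\alpha_1 = \alpha_2 = 1/2$.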

Scale-free network model (Barabási & Albert, 99)

The network is built iteratively: the $i$-th vertex joining the network connects to one of the $(i-1)$ preceding ones with probability proportional to their current degree (busy gets busier):
$$\forall j < i, \quad \Pr\{i \leftrightarrow j\} \propto K_j^{i},$$
where $K_j^{i}$ is the degree of vertex $j$ when vertex $i$ joins. The limit marginal distribution for the degrees is then scale free: $p(k) \propto k^{-3}$.

Analogous modeling with the independent ERMG. At time $q$, $n_q = n\alpha_q$ vertices join the network. They preferentially connect to the oldest vertices:
$$\pi_{ql} = \eta_q \eta_l, \quad \eta_1 \geq \eta_2 \geq \dots \geq \eta_Q.$$
The speed at which the $\{\eta_q\}$ decrease gives the tail of the degree distribution.
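The iterative construction can be sketched in a few lines. This is a simplified illustration, not the reference algorithm: it assumes a starting graph of two linked vertices and one new edge per arriving vertex, and the function name and seed are hypothetical.

```python
import random


def preferential_attachment(n, seed=0):
    """Grow a graph where each new vertex links to one earlier vertex,
    chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    degree = [1, 1]        # start from vertices 0 and 1 joined by one edge
    endpoints = [0, 1]     # every vertex appears once per incident edge
    edges = [(0, 1)]
    for i in range(2, n):
        # Uniform draw from the endpoint list = degree-proportional draw.
        j = rng.choice(endpoints)
        edges.append((j, i))
        degree[j] += 1
        degree.append(1)
        endpoints.extend([j, i])
    return degree, edges


degree, edges = preferential_attachment(500)
```

Keeping a list of edge endpoints avoids recomputing the degree weights at every step, since sampling uniformly from that list is already proportional to degree.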

Maximum likelihood estimation via E-M

We denote $X = \{X_{ij}\}_{i,j=1..n}$ and $Z = \{Z_{iq}\}_{i=1..n,\,q=1..Q}$.

Likelihood. The conditional expectation of the complete-data log-likelihood is
$$Q(X) = \mathbb{E}\{L(X, Z) \mid X\} = \sum_i \sum_q \tau_{iq} \log \alpha_q + \sum_i \sum_q \sum_{j>i} \sum_l \theta_{ijql} \log b(X_{ij}; \pi_{ql}),$$
where $b(x; \pi) = \pi^x (1-\pi)^{1-x}$ is the Bernoulli probability and $\tau_{iq}$ and $\theta_{ijql}$ are posterior probabilities:
$$\tau_{iq} = \Pr\{Z_{iq} = 1 \mid X\}, \qquad \theta_{ijql} = \Pr\{Z_{iq} Z_{jl} = 1 \mid X\}.$$

Evaluating these probabilities is not straightforward because the $\{Z_{iq}\}$ are all dependent conditionally on $X$.

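E step. We approximate the conditional joint distribution of the $\{Z_{iq}\}$:
$$\Pr\{Z \mid X\} \approx \prod_i \Pr\{Z_i \mid X, Z^{-i}\}, \quad \text{where} \quad \Pr\{Z_{iq} = 1 \mid X, Z^{-i}\} \propto \alpha_q \prod_m b(C_{im}; N_m^{i}, \pi_{qm}).$$
Here $Z^{-i}$ denotes the group indicators of all vertices other than $i$. The slide does not define $C_{im}$ and $N_m^{i}$; they are read here as the number of edges between vertex $i$ and group $m$ and the number of vertices other than $i$ in group $m$, with $b(\cdot; N, \pi)$ the binomial probability.

The elements of $Z^{-i}$ are estimated by their conditional expectation: $\hat Z_{jl} = \tau_{jl}$. The posterior probabilities $\tau_{iq}$ must therefore satisfy
$$\tau_{iq} = \Pr\{Z_{iq} = 1 \mid X, \hat Z^{-i}\},$$
which is a fixed-point relation. The $\tau_{iq}$ are obtained by iterating it.

M step. Maximizing $Q(X)$ subject to $\sum_q \alpha_q = 1$ gives
$$\hat\alpha_q = \sum_i \tau_{iq} / n, \qquad \hat\pi_{ql} = \sum_i \sum_j \theta_{ijql} X_{ij} \Big/ \sum_i \sum_j \theta_{ijql}.$$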
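The two steps can be sketched as follows. This is one possible reading under stated assumptions, not necessarily the authors' implementation: $\theta_{ijql}$ is replaced by the product $\tau_{iq}\tau_{jl}$, the counts $C_{im}$ and $N_m^i$ are computed from the current $\tau$, the binomial coefficient is dropped because it does not depend on $q$, and probabilities are clamped by a hypothetical `EPS` to avoid taking the logarithm of zero.

```python
import math

EPS = 1e-10  # assumed clamp that keeps log() finite when pi is 0 or 1


def m_step(X, tau):
    """alpha_q = sum_i tau_iq / n; pi_ql = weighted edge density, with the
    assumption theta_ijql = tau_iq * tau_jl."""
    n, Q = len(X), len(tau[0])
    alpha = [sum(tau[i][q] for i in range(n)) / n for q in range(Q)]
    pi = [[0.0] * Q for _ in range(Q)]
    for q in range(Q):
        for l in range(Q):
            num = den = 0.0
            for i in range(n):
                for j in range(n):
                    if i != j:
                        w = tau[i][q] * tau[j][l]
                        num += w * X[i][j]
                        den += w
            pi[q][l] = num / den if den > 0 else 0.0
    return alpha, pi


def e_step(X, tau, alpha, pi, sweeps=20):
    """Iterate tau_iq proportional to
    alpha_q * prod_m pi_qm^C_im * (1 - pi_qm)^(N_im - C_im)."""
    n, Q = len(X), len(alpha)
    tau = [row[:] for row in tau]
    for _ in range(sweeps):
        for i in range(n):
            logw = []
            for q in range(Q):
                s = math.log(max(alpha[q], EPS))
                for m in range(Q):
                    p = min(max(pi[q][m], EPS), 1 - EPS)
                    c_im = sum(tau[j][m] * X[i][j] for j in range(n) if j != i)
                    n_im = sum(tau[j][m] for j in range(n) if j != i)
                    s += c_im * math.log(p) + (n_im - c_im) * math.log(1 - p)
                logw.append(s)
            top = max(logw)                       # stabilise the softmax
            w = [math.exp(v - top) for v in logw]
            total = sum(w)
            tau[i] = [v / total for v in w]
    return tau
```

A full fit would alternate `e_step` and `m_step` from some initial $\tau$ until the parameters stop changing; the choice of initialization is left open here.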

Choice of the number of groups

We propose a heuristic penalized likelihood criterion inspired by BIC. $Q(X)$ is the sum of two terms:

$\sum_i \sum_q \tau_{iq} \log \alpha_q$, which deals with $(Q-1)$ independent proportions $\alpha_q$ and involves $n$ terms;

$\sum_i \sum_q \sum_{j>i} \sum_l \theta_{ijql} \log b(X_{ij}; \pi_{ql})$, which deals with $Q(Q+1)/2$ probabilities $\pi_{ql}$ and involves $n(n-1)/2$ terms.

We therefore propose the following heuristic criterion:
$$-2Q(X) + (Q-1)\log n + \frac{Q(Q+1)}{2} \log\left[\frac{n(n-1)}{2}\right].$$
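The criterion is a one-line computation once $Q(X)$ is available. In the sketch below the function name is an assumption, the value of $Q(X)$ is supplied by the caller, and the sign convention (smaller is better) follows the usual BIC form.

```python
import math


def pseudo_bic(q_value, n, n_groups):
    """-2 Q(X) + (Q - 1) log n + Q (Q + 1) / 2 * log(n (n - 1) / 2).

    q_value  : conditional expectation Q(X) of the complete-data log-likelihood
    n        : number of vertices
    n_groups : number of groups Q
    """
    penalty = ((n_groups - 1) * math.log(n)
               + n_groups * (n_groups + 1) / 2 * math.log(n * (n - 1) / 2))
    return -2.0 * q_value + penalty


# Penalty alone (Q(X) set to 0) for n = 34 vertices and 1 to 6 groups.
penalties = [pseudo_bic(0.0, 34, Q) for Q in range(1, 7)]
```

The first penalty term counts the free proportions against the $n$ vertices, and the second counts the connection probabilities against the $n(n-1)/2$ vertex pairs, so the penalty grows roughly quadratically in the number of groups.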

Application to Karate Club Data

$n = 34$ members (vertices) of a Karate club.
2 members are connected if they have social interactions (apart from their sporting activity).
156 edges.

This dataset (Zachary, 77) has been intensively studied in the literature, generally with $Q = 4$ groups.

Parameter estimates.
$$\alpha\,(\%) = (5.9,\ 8.9,\ 36.8,\ 48.4)$$
$$\pi\,(\%) = \begin{pmatrix} 100 & 16.5 & 6.8 & 73.8 \\ 16.5 & 100 & 52.9 & 16.0 \\ 6.8 & 52.8 & 12.3 & 0.0 \\ 73.8 & 16.0 & 0.0 & 7.8 \end{pmatrix}$$
$$\lambda = (15.0,\ 12.2,\ 3.2,\ 3.2)$$

Clustering coefficient. The ERMG model gives 0.314 while the empirical $c$ is 0.256.
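The reported mean degrees can be checked against the formula $\lambda_q = (n-1)\sum_l \alpha_l \pi_{ql}$ using the rounded estimates on this slide. The snippet is only a consistency check; the variable names are assumptions, and the percentages are copied as printed (including the 52.9 / 52.8 pair).

```python
n = 34

# Estimates copied from the slide, converted from percentages.
alpha = [0.059, 0.089, 0.368, 0.484]
pi = [[1.000, 0.165, 0.068, 0.738],
      [0.165, 1.000, 0.529, 0.160],
      [0.068, 0.528, 0.123, 0.000],
      [0.738, 0.160, 0.000, 0.078]]

# lambda_q = (n - 1) * sum_l alpha_l * pi_ql
lam = [(n - 1) * sum(a * p for a, p in zip(alpha, row)) for row in pi]
lam_rounded = [round(v, 1) for v in lam]   # compare with 15.0, 12.2, 3.2, 3.2
```

Rounded to one decimal, the recomputed values agree with the $\lambda$ row of the slide.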

Dot-plot representation of the graph. A dot is present when $X_{ij} = 1$. The vertices are re-ordered according to their mean group number: $\bar q_i = \sum_q q\, \tau_{iq}$.

[Figure: dot plot of the re-ordered adjacency matrix, vertices 1 to 34 on both axes.]

Posterior probabilities $\tau_{iq}$.

[Figure: posterior probabilities $\tau_{iq}$ per vertex, for groups of sizes 2, 3, 13 and 16.]

Interpretation of the groups

Group 1: 2 persons, including the administrator, strongly connected with group 4, but not with groups 2 and 3.
Group 2: 3 persons, including the instructor, strongly connected with group 3, but not with groups 1 and 4.
Group 3: 13 ordinary members, connected with the instructor.
Group 4: 16 ordinary members, connected with the administrator.

End of the story. The instructor (group 2) eventually left the club and started another one with about half of the members (corresponding to group 3?).

Selection of the number of groups. The pseudo-BIC actually selects $Q = 6$ groups.

Comparison with the 4-group model.
Former groups 1 and 4 are conserved.
Former groups 2 and 3 are each divided into two new groups.

We do not know if the new club lasted very long...

[Figure: dot plot of the re-ordered adjacency matrix for $Q = 6$.]

Posterior probabilities $\tau_{iq}$.

[Figure: posterior probabilities $\tau_{iq}$ per vertex, for groups of sizes 1, 2, 3, 5, 7 and 16.]

Application to E. coli reaction network

$n = 605$ vertices (reactions) and 1782 edges.
2 reactions $i$ and $j$ are connected if the product of $i$ is the substrate of $j$ (or conversely).
Data provided by V. Lacroix and M.-F. Sagot (INRIA Hélix).

Number of groups. Pseudo-BIC selects $Q = 21$.

Group proportions $\alpha_q$ (%).

[Figure: bar chart of the group proportions $\alpha_q$ (%) against the group number.]

Many small groups actually correspond to cliques or pseudo-cliques.

Dot-plot representation of the graph.

[Figure: dot plot of the re-ordered adjacency matrix, vertices 1 to 605 on both axes.]

Biological interpretation: each of groups 1 to 20 gathers reactions that all involve the same compound, either as a substrate or as a product. A compound (pyruvate, ATP, etc.) can be associated with each group.

Posterior probabilities $\tau_{iq}$.

[Figure: posterior probabilities $\tau_{iq}$ per vertex, annotated with the group sizes.]

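Zoom (bottom left). Submatrix of $\pi$ for groups $q, l \in \{1, 9, 10, 16\}$ (lower triangle; the column position of each entry is not recoverable from the transcript):
$q = 1$: 1.0
$q = 9$: .11, .65
$q = 10$: .43, .67
$q = 16$: 1.0, .01, 1.0

Vertex degrees $K_i$. Mean degree in the last group: $\bar K_{21} = 2.6$.

[Figure: zoom on the bottom-left corner of the dot plot, with the vertex degrees.]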

Distribution of the degree. According to the ERMG, the degrees have a Poisson mixture distribution.

[Figure: two panels, "Histogram + mixture distribution" and "P-P plot".]

Clustering coefficient.
Empirical: 0.626
ERMG ($Q = 6$): 0.436
ERMG ($Q = 21$): 0.544
ER ($Q = 1$): 0.0098
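The ER value follows from $c = p$ with $p$ estimated by the edge density of the graph. The check below assumes that the 1782 edges are counted as undirected pairs; the function name is hypothetical.

```python
def er_density(n_vertices, n_edges):
    """Edge density p = number of edges / number of vertex pairs."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


# E. coli reaction network: 605 vertices and 1782 edges.
p_hat = er_density(605, 1782)
c_er = round(p_hat, 4)   # under ER the clustering coefficient equals p
```

With these counts the density rounds to 0.0098, which matches the ER entry and shows how far the one-group model is from the empirical 0.626.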

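Reaction graph.

[Figure: graph of the 21 groups, each node labelled with its group number and, in parentheses, its group size.]

Group number (group size): 1 (4), 2 (6), 3 (7), 4 (8), 5 (8), 6 (9), 7 (9), 8 (10), 9 (11), 10 (11), 11 (12), 12 (13), 13 (14), 14 (15), 15 (16), 16 (17), 17 (18), 18 (18), 19 (19), 20 (35), 21 (345).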

Conclusions

Past.
The ERMG model is a flexible generalization of the ER model and a promising alternative to the scale-free model.
It seems to fit several real-world networks well.
It is properly defined, so its properties can be properly studied.

Future.
Study the probabilistic properties of the ERMG model (diameter, probability for a subgraph to be connected, etc.).
Derive a relevant criterion to select the number of groups.
Extension to valued graphs: $X_{ij}$ not only 0/1, but some measure of the connection intensity.