A mixture model for random graphs

1 A mixture model for random graphs

J-J Daudin, F. Picard, S. Robin
UMR INA-PG / ENGREF / INRA, Paris, Mathématique et Informatique Appliquées

Examples of networks.
Social: who knows who?
Biological: which protein interacts with which?
Internet: connection between servers or web pages.

2 Random graphs

Notation and definition. Given a set of $n$ vertices ($i = 1..n$), $X_{ij}$ indicates the presence/absence of a (non-oriented) edge between vertices $i$ and $j$: $X_{ij} = X_{ji} = \mathbb{1}\{i \leftrightarrow j\}$, $X_{ii} = 0$. The random graph is defined by the joint distribution of all the $\{X_{ij}\}_{i,j}$.

Typical characteristics.
Degree (connectivity) of the vertices: $K_i = \sum_{j \neq i} X_{ij}$.
Clustering coefficient: $c = \Pr\{X_{jk} = 1 \mid X_{ij} = X_{ik} = 1\}$.
Diameter: length of the longest of the shortest paths between two vertices.

3 Erdős-Rényi (ER) model

Definition. The $\{X_{ij}\}_{i,j}$ are i.i.d.: $X_{ij} \sim \mathcal{B}(p)$.

Characteristics.
Degree: $K_i \sim \mathcal{B}(n-1, p) \approx \mathcal{P}(\lambda)$, with $\lambda = (n-1)p$.
Clustering coefficient: $c = p$.

Drawback. The ER model fits many real-world networks poorly.
Empirical degree distributions often differ markedly from the Poisson distribution because a few vertices have very high degrees.
Empirical clustering coefficients are generally higher than expected under ER.
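The two ER characteristics above can be checked by simulation. The following is a minimal sketch, not part of the original slides: the function names, the seed and the values n = 200, p = 0.1 are illustrative assumptions.

```python
import random
from itertools import combinations

def sample_er(n, p, rng):
    """Symmetric 0/1 adjacency matrix with i.i.d. Bernoulli(p) edges and zero diagonal."""
    X = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        X[i][j] = X[j][i] = 1 if rng.random() < p else 0
    return X

def degrees(X):
    """K_i = sum over j of X_ij (the diagonal is zero)."""
    return [sum(row) for row in X]

def clustering_coefficient(X):
    """Empirical Pr{X_jk = 1 | X_ij = X_ik = 1}: closed 'V' patterns over all 'V' patterns."""
    n = len(X)
    closed = total = 0
    for i in range(n):
        neighbours = [j for j in range(n) if X[i][j]]
        for j, k in combinations(neighbours, 2):
            total += 1
            closed += X[j][k]
    return closed / total if total else float("nan")

# Illustrative values (assumptions, not taken from the slides).
rng = random.Random(0)
n, p = 200, 0.1
X = sample_er(n, p, rng)
mean_degree = sum(degrees(X)) / n
c_hat = clustering_coefficient(X)
print("mean degree:", mean_degree, "expected (n-1)p:", (n - 1) * p)
print("clustering coefficient:", c_hat, "expected p:", p)
```

Under ER, both printed pairs should be close; real networks typically show a much larger empirical clustering coefficient than p.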

4 Erdős-Rényi mixture for graph (ERMG)

An explicit random graph model

Mixture population of vertices. We suppose that the vertices belong to $Q$ groups: $\alpha_q = \Pr\{i \in q\}$, $Z_{iq} = \mathbb{1}\{i \in q\}$.

Conditional distribution of the edges. The edges $\{X_{ij}\}$ are conditionally independent given the groups of the vertices: $X_{ij} \mid \{i \in q, j \in l\} \sim \mathcal{B}(\pi_{ql})$.
$\pi_{ql} = \pi_{lq}$ is the connection probability between groups $q$ and $l$.
A high value of $\pi_{ql}$ reveals a preferential connectivity between groups $q$ and $l$.
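The generative definition above translates directly into a two-stage sampler: draw the groups, then draw the edges given the groups. This is a minimal sketch; the name `sample_ermg` and the example parameters are assumptions made for illustration.

```python
import random
from itertools import combinations

def sample_ermg(n, alpha, pi, rng):
    """Draw one ERMG graph.

    alpha: list of Q group proportions (sums to 1).
    pi:    symmetric Q x Q list of connection probabilities.
    Returns (groups, X) where groups[i] is the group index of vertex i
    and X is the symmetric 0/1 adjacency matrix with zero diagonal.
    """
    # Stage 1: group of each vertex, Pr{i in q} = alpha[q].
    groups = rng.choices(range(len(alpha)), weights=alpha, k=n)
    # Stage 2: edges are independent Bernoulli(pi[q][l]) given the groups.
    X = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < pi[groups[i]][groups[j]]:
            X[i][j] = X[j][i] = 1
    return groups, X

# Illustrative parameters (assumptions): two groups, strong within-group connectivity.
rng = random.Random(42)
groups, X = sample_ermg(50, [0.4, 0.6], [[0.8, 0.05], [0.05, 0.3]], rng)
```

Setting Q = 1 recovers the ER model with p = pi[0][0].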

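5 Some properties of the ERMG model

Conditional distribution of the degrees:
$$K_i \mid \{i \in q\} \sim \mathcal{B}(n-1, \bar{\pi}_q) \approx \mathcal{P}(\lambda_q), \quad\text{where}\quad \bar{\pi}_q = \sum_l \alpha_l \pi_{ql}, \quad \lambda_q = (n-1)\bar{\pi}_q.$$

Marginal distribution of the degrees: we get a Poisson mixture
$$K_i \sim \sum_q \alpha_q \mathcal{B}(n-1, \bar{\pi}_q) \approx \sum_q \alpha_q \mathcal{P}(\lambda_q).$$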
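The Poisson mixture for the degrees can be evaluated numerically. The sketch below computes the mean connectivities, the rates and the mixture probabilities; the function names and the example parameters are illustrative assumptions.

```python
import math

def mean_connectivity(alpha, pi):
    """pi_bar_q = sum over l of alpha_l * pi_ql."""
    return [sum(a_l * p_ql for a_l, p_ql in zip(alpha, row)) for row in pi]

def poisson_pmf(k, lam):
    """Poisson probability of k for rate lam > 0, computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def degree_mixture_pmf(k, n, alpha, pi):
    """Approximate Pr{K_i = k} = sum over q of alpha_q * Poisson(k; lambda_q)."""
    lambdas = [(n - 1) * p for p in mean_connectivity(alpha, pi)]
    return sum(a * poisson_pmf(k, lam) for a, lam in zip(alpha, lambdas))

# Illustrative parameters (assumptions, not from the slides).
n = 50
alpha = [0.3, 0.7]
pi = [[0.5, 0.1], [0.1, 0.05]]
pmf = [degree_mixture_pmf(k, n, alpha, pi) for k in range(200)]
```

With these values the two components have clearly different rates, so the mixture is bimodal or heavy-shouldered in a way a single Poisson cannot reproduce.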

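6 Between-group connectivity. $A_{ql}$ denotes the connectivity between groups $q$ and $l$: $A_{ql} = \sum_{i<j} Z_{iq} Z_{jl} X_{ij}$. In the ERMG model, its expectation is
$$\mathbb{E}(A_{ql}) = \frac{n(n-1)}{2}\, \alpha_q \alpha_l \pi_{ql}.$$

Clustering coefficient: $c = \Pr\{\nabla \cap V\} / \Pr\{V\} = \Pr\{\nabla\} / \Pr\{V\}$, where $V$ denotes two edges sharing a vertex ($X_{ij} = X_{ik} = 1$) and $\nabla$ the triangle obtained when $X_{jk} = 1$ as well. In the ERMG model, we get
$$c = \frac{\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql} \pi_{qm} \pi_{lm}}{\sum_{q,l,m} \alpha_q \alpha_l \alpha_m \pi_{ql} \pi_{qm}}.$$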
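Both formulas above are simple sums over groups and can be evaluated directly. A minimal sketch follows; the function names are assumptions, and the triple loop is written for readability rather than speed.

```python
def expected_between_group_connectivity(n, alpha, pi):
    """E(A_ql) = n(n-1)/2 * alpha_q * alpha_l * pi_ql, returned as a Q x Q list."""
    Q = len(alpha)
    pairs = n * (n - 1) / 2
    return [[pairs * alpha[q] * alpha[l] * pi[q][l] for l in range(Q)] for q in range(Q)]

def ermg_clustering(alpha, pi):
    """c = Pr{triangle} / Pr{V}, summing over all group triples (q, l, m)."""
    Q = range(len(alpha))
    num = den = 0.0
    for q in Q:          # group of the centre vertex of the 'V'
        for l in Q:      # group of one end of the 'V'
            for m in Q:  # group of the other end
                v = alpha[q] * alpha[l] * alpha[m] * pi[q][l] * pi[q][m]
                den += v
                num += v * pi[l][m]  # closing edge between the two ends
    return num / den

# With a single group the model is ER, so c equals p.
c_er = ermg_clustering([1.0], [[0.3]])
```

For two equally sized groups with connection matrix [[1, e], [e, 1]], the same function returns (1 + 3e^2) / (1 + e)^2.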

7 Independent model

The absence of preferential connection between groups corresponds to the case where $\pi_{ql} = \eta_q \eta_l$.

Distribution of degrees: $\{K_i \mid i \in q\} \approx \mathcal{P}(\lambda_q)$, where $\lambda_q = (n-1)\eta_q \bar{\eta}$, $\bar{\eta} = \sum_l \alpha_l \eta_l$.

Between-group connectivity: $\mathbb{E}(A_{ql}) = n(n-1)(\alpha_q \eta_q)(\alpha_l \eta_l)/2$.

Clustering coefficient:
$$c = \frac{\left(\sum_q \alpha_q \eta_q^2\right)^2}{\bar{\eta}^2}.$$

The ER model corresponds to $Q = 1$, $\alpha_1 = 1$, $\bar{\eta} = \eta_1 = \sqrt{p}$, so we get the known result: $c = \eta_1^4 / \eta_1^2 = p$.
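The closed form for the independent model can be checked against the general ERMG clustering formula by plugging in pi_ql = eta_q * eta_l. The sketch below does this numerically; function names and test values are illustrative assumptions.

```python
import math

def independent_pi(eta):
    """Connection matrix of the independent model: pi_ql = eta_q * eta_l."""
    return [[e_q * e_l for e_l in eta] for e_q in eta]

def independent_clustering(alpha, eta):
    """Closed form: c = (sum_q alpha_q eta_q^2)^2 / eta_bar^2."""
    second_moment = sum(a * e * e for a, e in zip(alpha, eta))
    eta_bar = sum(a * e for a, e in zip(alpha, eta))
    return second_moment ** 2 / eta_bar ** 2

def general_clustering(alpha, pi):
    """General ERMG formula: ratio of triangle to 'V' probabilities over group triples."""
    Q = range(len(alpha))
    num = sum(alpha[q] * alpha[l] * alpha[m] * pi[q][l] * pi[q][m] * pi[l][m]
              for q in Q for l in Q for m in Q)
    den = sum(alpha[q] * alpha[l] * alpha[m] * pi[q][l] * pi[q][m]
              for q in Q for l in Q for m in Q)
    return num / den

# Illustrative values (assumptions).
alpha = [0.2, 0.3, 0.5]
eta = [0.9, 0.5, 0.1]
c_closed = independent_clustering(alpha, eta)
c_general = general_clustering(alpha, independent_pi(eta))

# ER special case: Q = 1 and eta_1 = sqrt(p) give c = p.
p = 0.25
c_er = independent_clustering([1.0], [math.sqrt(p)])
```

The agreement between `c_closed` and `c_general` follows from the numerator factorising as the cube of the sum of alpha_q eta_q^2, and the denominator as that same sum times eta_bar squared.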

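8 Examples

For each network: description, number of groups $Q$, connection matrix $\pi$, clustering coefficient $c$. The values of $c$ for $Q = 2$ correspond to equal proportions $\alpha_1 = \alpha_2 = 1/2$.

Random: $Q = 1$, $\pi = p$, $c = p$.
Independent model (product connectivity): $Q = 2$, $\pi = \begin{pmatrix} a^2 & ab \\ ab & b^2 \end{pmatrix}$, $c = \dfrac{(a^2 + b^2)^2}{(a + b)^2}$.
Stars: $Q = 4$, $\pi$ is a $4 \times 4$ matrix of zeros and ones whose first rows are $(0\ 1\ 0\ 0)$ and $(1\ 0\ 1\ 0)$, the third row starting with $(0\ 1\ \dots)$; the remaining entries are truncated.
Clusters (affiliation networks): $Q = 2$, $\pi = \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 1 \end{pmatrix}$, $c = \dfrac{1 + 3\varepsilon^2}{(1 + \varepsilon)^2}$.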

9 Scale-free network model (Barabási & Albert, 1999)

The network is built iteratively: the $i$-th vertex joining the network connects to one of the $(i-1)$ preceding ones with probability proportional to their current degree (busy gets busier):
$$\forall j < i, \quad \Pr\{i \leftrightarrow j\} \propto K_j^{i},$$
where $K_j^{i}$ is the degree of vertex $j$ when vertex $i$ joins. The limit marginal distribution of the degrees is then scale free: $p(k) \propto k^{-3}$.

Analogous modeling with the independent ERMG. At time $q$, $n_q = n\alpha_q$ vertices join the network. They preferentially connect to the oldest vertices:
$$\pi_{ql} = \eta_q \eta_l, \qquad \eta_1 \geq \eta_2 \geq \dots \geq \eta_q \geq \dots$$
The speed at which the $\{\eta_q\}$ decrease determines the tail of the degree distribution.
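The iterative construction can be sketched as follows. This is an illustrative simulation only: the slides do not specify the starting graph, so the single initial edge between vertices 0 and 1 is an assumption, as are the function name and the seed.

```python
import random

def preferential_attachment(n, rng):
    """Grow a graph in which each new vertex attaches to one earlier vertex,
    chosen with probability proportional to that vertex's current degree."""
    edges = [(0, 1)]       # assumed seed: a single edge between vertices 0 and 1
    degree = [1, 1]
    endpoints = [0, 1]     # vertex j appears in this list once per unit of degree
    for i in range(2, n):
        # A uniform draw from the endpoint list is a degree-proportional draw.
        j = rng.choice(endpoints)
        edges.append((j, i))
        degree[j] += 1
        degree.append(1)
        endpoints.extend((j, i))
    return edges, degree

rng = random.Random(0)
edges, degree = preferential_attachment(2000, rng)
print("largest degrees:", sorted(degree, reverse=True)[:5])
```

Most vertices keep degree 1 while a few early vertices accumulate many connections, which is the heavy tail that the decreasing sequence of eta_q is meant to mimic in the independent ERMG.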

10 Maximum likelihood estimation via E-M

We denote $X = \{X_{ij}\}_{i,j=1..n}$, $Z = \{Z_{iq}\}_{i=1..n,\ q=1..Q}$.

Likelihood. The conditional expectation of the complete-data log-likelihood is
$$\mathcal{Q}(X) = \mathbb{E}\{\mathcal{L}(X, Z) \mid X\} = \sum_i \sum_q \tau_{iq} \log \alpha_q + \sum_i \sum_q \sum_{j>i} \sum_l \theta_{ijql} \log b(X_{ij}; \pi_{ql}),$$
where $\tau_{iq}$ and $\theta_{ijql}$ are posterior probabilities:
$$\tau_{iq} = \Pr\{Z_{iq} = 1 \mid X\}, \qquad \theta_{ijql} = \Pr\{Z_{iq} Z_{jl} = 1 \mid X\}.$$
Evaluating these probabilities is not straightforward because the $\{Z_{iq}\}$ are all dependent conditionally on $X$.

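11 E step. We approximate the conditional joint distribution of the $\{Z_{iq}\}$:
$$\Pr\{Z \mid X\} \approx \prod_i \Pr\{Z_i \mid X, Z^{i}\}, \quad\text{where}\quad \Pr\{Z_{iq} = 1 \mid X, Z^{i}\} \propto \alpha_q \prod_m b(C_{im}; N_m^{i}, \pi_{qm}),$$
and $Z^{i}$ denotes the group indicators of all vertices other than $i$. The elements of $Z^{i}$ are estimated by their conditional expectation: $\hat{Z}_{jl} = \tau_{jl}$. The posterior probabilities $\tau_{iq}$ must therefore satisfy
$$\tau_{iq} = \Pr\{Z_{iq} = 1 \mid X, \hat{Z}^{i}\},$$
which is a fixed-point relation. The $\tau_{iq}$ are obtained by iterating it.

M step. Maximizing $\mathcal{Q}(X)$ subject to $\sum_q \alpha_q = 1$ gives
$$\alpha_q = \sum_i \tau_{iq} / n, \qquad \pi_{ql} = \sum_i \sum_{j \neq i} \theta_{ijql} X_{ij} \Big/ \sum_i \sum_{j \neq i} \theta_{ijql}.$$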
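The two steps can be sketched in code under some assumptions that the slides do not state explicitly. Here $C_{im}$ is read as the (expected) number of edges between vertex $i$ and group $m$ and $N_m^{i}$ as the (expected) size of group $m$ without $i$, both computed from the current $\tau$. The pairwise posterior is approximated by $\theta_{ijql} \approx \tau_{iq}\tau_{jl}$. The binomial coefficient is dropped because it does not depend on $q$. The function name, the clipping constant and the toy data are hypothetical.

```python
import math

def fit_ermg(X, tau0, n_iter=30, n_inner=5, eps=1e-6):
    """Approximate E-M for the ERMG model, starting from initial posteriors tau0 (n x Q)."""
    n, Q = len(X), len(tau0[0])
    tau = [row[:] for row in tau0]
    for _ in range(n_iter):
        # M step: proportions, then connection probabilities.
        alpha = [sum(tau[i][q] for i in range(n)) / n for q in range(Q)]
        pi = [[0.5] * Q for _ in range(Q)]
        for q in range(Q):
            for l in range(Q):
                num = den = 0.0
                for i in range(n):
                    for j in range(n):
                        if i != j:
                            w = tau[i][q] * tau[j][l]  # assumed theta_ijql
                            num += w * X[i][j]
                            den += w
                if den > 0:
                    # Clip away from 0 and 1 so the logarithms stay finite.
                    pi[q][l] = min(max(num / den, eps), 1 - eps)
        # E step: iterate the fixed-point relation on tau a few times.
        for _ in range(n_inner):
            new_tau = []
            for i in range(n):
                others = [j for j in range(n) if j != i]
                C = [sum(tau[j][m] * X[i][j] for j in others) for m in range(Q)]
                N = [sum(tau[j][m] for j in others) for m in range(Q)]
                logw = [
                    math.log(max(alpha[q], eps))
                    + sum(C[m] * math.log(pi[q][m])
                          + (N[m] - C[m]) * math.log(1 - pi[q][m]) for m in range(Q))
                    for q in range(Q)
                ]
                top = max(logw)  # subtract the maximum before exponentiating
                w = [math.exp(v - top) for v in logw]
                total = sum(w)
                new_tau.append([v / total for v in w])
            tau = new_tau
    return alpha, pi, tau

# Toy data (assumption): two disjoint cliques of four vertices each.
n = 8
X = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(n)] for i in range(n)]
tau0 = [[0.9, 0.1] if i < 4 else [0.1, 0.9] for i in range(n)]
alpha, pi, tau = fit_ermg(X, tau0)
labels = [max(range(2), key=lambda q: row[q]) for row in tau]
```

The result depends on the starting point `tau0`; in practice several initialisations (for example from a preliminary clustering) would be compared.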

12 Choice of the number of groups

We propose a heuristic penalized likelihood criterion inspired by BIC. $\mathcal{Q}(X)$ is the sum of two terms:
$\sum_i \sum_q \tau_{iq} \log \alpha_q$, which deals with $(Q-1)$ independent proportions $\alpha_q$ and involves $n$ terms;
$\sum_i \sum_q \sum_{j>i} \sum_l \theta_{ijql} \log b(X_{ij}; \pi_{ql})$, which deals with $Q(Q+1)/2$ probabilities $\pi_{ql}$ and involves $n(n-1)/2$ terms.

We therefore propose the following heuristic criterion:
$$-2\mathcal{Q}(X) + (Q-1)\log n + \frac{Q(Q+1)}{2} \log\left[\frac{n(n-1)}{2}\right].$$
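The criterion is straightforward to compute once the value of the expected complete-data log-likelihood is available for each candidate number of groups. The sketch below assumes the criterion is written with a leading minus sign and is to be minimised, as for BIC. The function names and the log-likelihood values in the example are hypothetical, not results from the slides.

```python
import math

def pseudo_bic(q_value, n, n_groups):
    """-2 Q(X) + (Q - 1) log n + Q(Q + 1)/2 * log[n(n - 1)/2].

    q_value:  value of the expected complete-data log-likelihood Q(X).
    n:        number of vertices.
    n_groups: number of groups Q.
    """
    penalty_alpha = (n_groups - 1) * math.log(n)
    penalty_pi = n_groups * (n_groups + 1) / 2 * math.log(n * (n - 1) / 2)
    return -2.0 * q_value + penalty_alpha + penalty_pi

def select_n_groups(q_values, n):
    """Return the number of groups with the smallest criterion (assumed convention)."""
    return min(q_values, key=lambda k: pseudo_bic(q_values[k], n, k))

# Hypothetical log-likelihood values for Q = 1, 2, 3 on a graph with n = 34 vertices.
q_values = {1: -500.0, 2: -450.0, 3: -449.0}
best = select_n_groups(q_values, 34)
```

In this made-up example the gain from two to three groups is too small to offset the extra penalty, so two groups are selected.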

13 Application to Karate Club Data

n = 34 members (vertices) of a karate club.
Two members are connected if they have social interactions (apart from their sporting activity).
156 edges.
This dataset (Zachary, 1977) has been intensively studied in the literature, generally with Q = 4 groups.

Parameter estimates. [Table of $\alpha$ (%), $\pi$ (%) and $\lambda$: values missing.]

Clustering coefficient. The ERMG model gives [...] while the empirical $c$ is [...].

14 Dot-plot representation of the graph.

A dot means $X_{ij} = 1$. The vertices are re-ordered according to their mean group number: $\bar{q}_i = \sum_q q\, \tau_{iq}$.

[Figure: dot plot of the re-ordered adjacency matrix. Table: posterior probabilities $\tau_{iq}$.]
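The re-ordering used for the dot plot can be sketched as follows, with groups numbered from 1 as on the slides. The function names and the small example are illustrative assumptions.

```python
def mean_group_number(tau):
    """q_bar_i = sum over q of q * tau_iq, with groups numbered 1..Q."""
    return [sum((q + 1) * t for q, t in enumerate(row)) for row in tau]

def dot_plot_order(tau):
    """Vertex indices sorted by increasing mean group number."""
    scores = mean_group_number(tau)
    return sorted(range(len(tau)), key=lambda i: scores[i])

def reorder(X, order):
    """Apply the same permutation to the rows and columns of the adjacency matrix."""
    return [[X[i][j] for j in order] for i in order]

# Small example (assumption): three vertices, two groups.
tau = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
X = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
order = dot_plot_order(tau)
X_sorted = reorder(X, order)
```

Vertices with ambiguous membership (such as the third one here) end up between the groups they hesitate between, which makes blocks and their boundaries visible in the plot.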

15 Interpretation of the groups

Group 1: 2 persons, including the administrator, strongly connected with group 4, but not with groups 2 and 3.
Group 2: 3 persons, including the instructor, strongly connected with group 3, but not with groups 1 and 4.
Group 3: 13 ordinary members, connected with the instructor.
Group 4: 16 ordinary members, connected with the administrator.

End of the story. The instructor (group 2) eventually left the club and started another one with about half of the members (corresponding to group 3?).

16 Selection of the number of groups.

The pseudo-BIC actually selects Q = 6 groups.

Comparison with the 4-group model.
Former groups 1 and 4 are conserved.
Former groups 2 and 3 are each divided into two new groups.
We do not know whether the new club lasted very long.

[Table: posterior probabilities $\tau_{iq}$.]

17 Application to E. coli reaction network

n = 605 vertices (reactions) and [...] edges.
Two reactions i and j are connected if the product of i is the substrate of j (or conversely).
Data provided by V. Lacroix and M.-F. Sagot (INRIA Hélix).

Number of groups. Pseudo-BIC selects Q = 21.

Group proportions. [Table of $\alpha_q$ (%): values missing.] Many small groups actually correspond to cliques or pseudo-cliques.

18 Dot-plot representation of the graph.

Biological interpretation: each of groups 1 to 20 gathers reactions that all involve the same compound, either as a substrate or as a product. A compound (pyruvate, ATP, etc.) can be associated with each group.

[Table: posterior probabilities $\tau_{iq}$.]

19 Zoom (bottom left).

Submatrix of $\pi$ (rows $q$, columns $l$): [values missing].
Vertex degrees $K_i$. Mean degree in the last group: $\bar{K}_{21} =$ [...].

20 Distribution of the degree.

According to the ERMG model, the degrees have a Poisson mixture distribution.

[Figure: histogram with the fitted mixture distribution; P-P plot.]

Clustering coefficient. [Table comparing the empirical value with ERMG (Q = 6), ERMG (Q = 21) and ER (Q = 1): values missing.]

21 Reaction graph.

[Figure: graph of the groups. Each node is labelled "group number (group size)".]
1 (4), 2 (6), 3 (7), 4 (8), 5 (8), 6 (9), 7 (9), 8 (10), 9 (11), 10 (11), 11 (12), 12 (13), 13 (14), 14 (15), 15 (16), 16 (17), 17 (18), 18 (18), 19 (19), 20 (35), 21 (345).

22 Conclusions

Past.
The ERMG model is a flexible generalization of the ER model and a promising alternative to the scale-free model.
It seems to fit several real-world networks well.
It is properly defined, so its properties can be properly studied.

Future.
Study the probabilistic properties of the ERMG model (diameter, probability for a subgraph to be connected, etc.).
Derive a relevant criterion to select the number of groups.
Extension to valued graphs: $X_{ij}$ not only 0/1, but some measure of the connection intensity.
