Mixture Models for Genomic Data


1 Mixture Models for Genomic Data
S. Robin, AgroParisTech / INRA
École de Printemps en Apprentissage automatique, Baie de Somme, May 2010

2 Outline
1 Some examples
2 Statistical inference of mixture models
3 Independent mixture model
4 Hidden Markov model
5 Mixture for random graphs
6 Variational Bayes inference
7 Some extensions

3 Some examples

4 Some examples / ChIP-chip experiments: ChIP on chip
ChIP = Chromatin Immuno-Precipitation, which aims at detecting protein-DNA interactions.
ChIP-chip: probes corresponding to different loci are spotted on a glass slide.
IP: DNA fragments interacting with the protein of interest. Input: whole genomic DNA. X = log(IP1/IP2).
A non-zero X reveals a differential protein-DNA interaction between samples 1 and 2.

5 Some examples / ChIP-chip experiments: Proposed model
Denoting X_i = log(IP1_i / IP2_i) the signal observed for probe i and Z_i its unknown status, we can assume that
the Z_i's are i.i.d.: Z_i ~ M(1; π), with π_k = Pr{Z_i = k} = Pr{Z_ik = 1};
the X_i's are independent conditionally on the Z_i's: (X_i | Z_i = k) ~ f_k(·) = f(·; γ_k), e.g. f_k = N(µ_k, σ_k²).
We have to estimate {π_k, µ_k, σ_k²}_k and Pr{Z_i = k | X_i}.
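
To make the generative scheme concrete, a minimal simulation sketch in Python/NumPy follows; the two-component setting and all parameter values below are illustrative assumptions, not those of the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    n, K = 1000, 2
    pi = np.array([0.8, 0.2])          # proportions pi_k (illustrative)
    mu = np.array([0.0, 1.5])          # component means mu_k
    sigma = np.array([0.3, 0.5])       # component standard deviations sigma_k

    Z = rng.choice(K, size=n, p=pi)    # hidden status Z_i ~ M(1; pi)
    X = rng.normal(mu[Z], sigma[Z])    # (X_i | Z_i = k) ~ N(mu_k, sigma_k^2)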

6 Some examples / Accounting for the genomic localisation: HMM
Probes are (almost) equally spaced along the genome, and probes with a large (positive or negative) ratio tend to be clustered.
Proposed model: hidden Markov model (HMM: Baum and Petrie (1966), Churchill (1992)).
The X_i's are still independent conditionally on the Z_i's: (X_i | Z_i = k) ~ f_k,
but the statuses are (Markov-)dependent: {Z_i} ~ MC(π), π_kl = Pr{Z_i = l | Z_{i-1} = k}.

7 Some examples / Regulatory network
Regulatory network = directed graph where
nodes = genes (or groups of genes, e.g. operons),
edges = regulations: {i → j} means i regulates j.
Typical questions: Do some nodes share similar connection profiles? Is there a macroscopic organisation of the network?

8 Some examples / Regulatory network: Proposed model
Denoting X_ij the presence of a regulation from operon i to operon j and Z_i the unknown status of operon i, we can assume that [Daudin et al. (2008)]
the Z_i's are i.i.d.: Z_i ~ M(1; π), with π_k = Pr{Z_i = k} = Pr{Z_ik = 1};
the X_ij's are independent conditionally on the Z's: (X_ij | Z_i = k, Z_j = l) ~ B(γ_kl).
We want to estimate θ = (π, γ) and Pr{Z_i = k | X}.

9 Statistical inference of mixture models: Statistical inference of incomplete data models

10 Statistical inference of mixture models / Model and likelihoods
Notations: X = observed data (typically X = {X_i}); Z = unobserved data; θ = the unknown parameters of the distributions of both Z and X.
Definitions:
Likelihood of the observed data (or observed likelihood): log P(X) = log P(X; θ).
Complete likelihood: log P(X, Z) = log P(X, Z; θ).

11 Statistical inference of mixture models / Maximum likelihood inference
Maximum likelihood estimate: we are looking for arg max_θ log P(X; θ).
Incomplete data model: the calculation of P(X; θ) = Σ_Z P(X, Z; θ) is not always possible, since this sum typically involves K^n terms. The calculation of P(X, Z; θ) is much easier... except that Z is unknown.

12 Statistical inference of mixture models / Variational approach
The Kullback-Leibler divergence between distributions F and G,
KL(F; G) = ∫ F(u) log[F(u)/G(u)] du ≥ 0,
is a non-symmetric dissimilarity measure; it is zero iff F = G.
Lower bound: for any distribution Q(Z), we have [Jordan et al. (1999), Jaakkola (2000)]
log P(X) ≥ log P(X) - KL[Q(Z); P(Z | X)]
= log P(X) - ∫ Q(Z) log Q(Z) dZ + ∫ Q(Z) log P(Z | X) dZ
= - ∫ Q(Z) log Q(Z) dZ + ∫ Q(Z) log P(X, Z) dZ
= H(Q) + E_Q[log P(X, Z)].
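
As a numerical sanity check of this bound, the short Python/NumPy sketch below uses a made-up joint distribution over a binary Z for one fixed observation X (all values are illustrative assumptions): for any Q, H(Q) + E_Q[log P(X, Z)] ≤ log P(X), with equality at Q(Z) = P(Z | X).

    import numpy as np

    p_xz = np.array([0.3, 0.1])        # P(X, Z = k) for the observed X (illustrative)
    log_px = np.log(p_xz.sum())        # log P(X)

    def lower_bound(q):
        """H(Q) + E_Q[log P(X, Z)] for a distribution q over Z."""
        return -(q * np.log(q)).sum() + (q * np.log(p_xz)).sum()

    q_post = p_xz / p_xz.sum()                               # P(Z | X)
    print(lower_bound(np.array([0.5, 0.5])), "<=", log_px)   # strict lower bound
    print(lower_bound(q_post), "=", log_px)                  # tight at the posterior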

13 Statistical inference of mixture models / Variational approach: Consequences
If P(Z | X) can be calculated: taking Q(Z) = P(Z | X) achieves the maximisation of log P(X) through that of E_Q[log P(X, Z)]: E-M algorithm for independent mixtures and hidden Markov models [Dempster et al. (1977), McLachlan and Peel (2000), Cappé et al. (2005)].
If P(Z | X) cannot be calculated: the best lower bound of log P(X) is obtained for Q* = arg min_{Q ∈ Q} KL[Q(Z); P(Z | X)]: mean-field approximation for random graphs.

14 Statistical inference of mixture models / Variational E-M
Expectation step (E-step): calculate P(Z | X; θ), or its approximation Q* = arg min_{Q ∈ Q} KL[Q(Z); P(Z | X; θ)].
Maximisation step (M-step): estimate θ with arg max_θ E_Q[log P(X, Z; θ)], which maximises log P(X) if Q(Z) = P(Z | X), and its lower bound otherwise.

15 Independent mixture model: Model

16 Independent mixture model / Model
Reminder [McLachlan and Peel (2000)]:
the Z_i's are i.i.d. M(1; π), π_k = Pr{Z_i = k}, k = 1..K;
the X_i's are independent conditionally on Z: (X_i | Z_i = k) ~ f(·; γ_k).
Identifiability: the model is invariant under any permutation of the labels {1, ..., K}, so the mixture model has K! equivalent definitions.
Distribution of the observed data: X_i ~ g(x) = Σ_k π_k f(x; γ_k), because Pr{X_i = x} = Σ_k Pr{X_i = x | Z_i = k} Pr{Z_i = k}.

17 Independent mixture model / Model: Dependency structure
Some properties: the {Z_i} are independent; the {X_i} are independent conditionally on Z; the couples {(X_i, Z_i)} are i.i.d.; (X_i, X_j | Z_i = Z_j) are not independent.
[Graphical representation: each hidden label Z_i points only to its own observation X_i; there are no edges between the Z's.]

18 Independent mixture model / Inference: Likelihoods
Observed likelihood:
log P(X; θ) = Σ_i log g(X_i; θ) = Σ_i log[ Σ_k π_k f(X_i; γ_k) ].
Complete likelihood:
log P(X, Z; θ) = log P(Z; θ) + log P(X | Z; θ) = Σ_i Σ_k Z_ik log π_k + Σ_i Σ_k Z_ik log f(X_i; γ_k) = Σ_i Σ_k Z_ik [log π_k + log f(X_i; γ_k)].

19 Independent mixture model / Inference: E-step
Since the couples {(X_i, Z_i)}_i are independent, we can calculate P(Z | X) = Π_i P(Z_i | X_i) = Π_i Π_k τ_ik^{Z_ik}, where τ_ik = Pr{Z_i = k | X} = E_Q[Z_ik]:
τ_ik = Pr{Z_i = k | X_i, θ} = π_k f(X_i; γ_k) / Σ_l π_l f(X_i; γ_l)  (Bayes rule).
Conditional expectation of the complete likelihood ('completed' likelihood):
E_Q[log P(X, Z)] = E_Q[ Σ_i Σ_k Z_ik log(π_k f(X_i; γ_k)) ] = Σ_i Σ_k τ_ik [log π_k + log f(X_i; γ_k)].
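
For the Gaussian case, a minimal E-step sketch in Python/NumPy/SciPy computing the posterior probabilities τ_ik by the Bayes rule above; X, pi, mu and sigma are assumed to come from the simulation sketch given earlier.

    import numpy as np
    from scipy.stats import norm

    def e_step(X, pi, mu, sigma):
        """Return the n x K matrix of posterior probabilities tau_ik."""
        dens = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)   # pi_k * f(X_i; gamma_k)
        return dens / dens.sum(axis=1, keepdims=True)

    tau = e_step(X, pi, mu, sigma)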

20 Independent mixture model / Inference: M-step
We want to maximise E_Q[log P(X, Z)] = Σ_i Σ_k τ_ik [log π_k + log f(X_i; γ_k)] over θ: a weighted version of the usual maximum likelihood estimates (MLE).
Gaussian case, γ_k = (µ_k, σ_k):
π_k = (1/n) Σ_i τ_ik = n_k / n,   µ_k = (1/n_k) Σ_i τ_ik X_i,   σ_k² = (1/n_k) Σ_i τ_ik (X_i - µ_k)².
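
The corresponding M-step and the E-M alternation, continuing the same illustrative Python/NumPy sketch (the starting values are arbitrary assumptions; e_step is the function defined above).

    import numpy as np

    def m_step(X, tau):
        """Weighted MLE updates of (pi_k, mu_k, sigma_k)."""
        n_k = tau.sum(axis=0)
        pi = n_k / len(X)
        mu = (tau * X[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((tau * (X[:, None] - mu) ** 2).sum(axis=0) / n_k)
        return pi, mu, sigma

    # E-M alternation from arbitrary starting values
    pi_h, mu_h, sigma_h = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
    for _ in range(100):
        tau = e_step(X, pi_h, mu_h, sigma_h)
        pi_h, mu_h, sigma_h = m_step(X, tau)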

21 Independent mixture model / Inference: Graphical interpretation
Distributions: g(x) = π_1 f_1(x) + π_2 f_2(x) + π_3 f_3(x).
Posterior probabilities: τ_ik = π_k f_k(x_i) / g(x_i).
[Table: posterior probabilities τ_ik (%) for observations i = 1, 2, 3 and classes k = 1, 2, 3.] Demo.

22 Independent mixture model / Precision of the estimates
In the framework of the regular MLE (i.e. when P(Z | X) can be calculated), the asymptotic variance of the estimates is given by V(θ) = I(θ)^{-1}, where I(θ) = E[ (∂ log P(X; θ)/∂θ)² ] is the Fisher information matrix.
Louis (1982) provides a convenient way to calculate I(θ) based only on the complete likelihood, using the score identity ∂ log P(X; θ)/∂θ = E[ ∂ log P(X, Z; θ)/∂θ | X ] and the fact that this observed score is 0 at the maximum of log P(X; θ).

23 Independent mixture model / Application to ChIP-chip
[Table: estimated parameters (π_k, µ_k, σ_k) for each class k, under the common-variance and the different-variances models.]

24 Independent mixture model / Application to ChIP-chip: Probe classification
[Figures: probe classification under the common-variance and the different-variances models.]
Heterogeneous variances provide a better fit to the distribution, but the classification rule is not very convenient...
Accounting for annotation: when some annotation C_i is available for each probe, the model can account for it through different prior probabilities: π_k^c = Pr{Z_i = k | C_i = c}.

25 Independent mixture model / Application to ChIP-chip: Case of IP/Input experiments
Complete DNA is often used as a reference ('Input') to detect protein-DNA interaction.
The relation IP = f(Input) is seemingly linear, but the difference IP - Input is not (always) sufficient for classification... because the shape of the relation depends on the probe status.
[Figure: histogram of log(IP/Input).]

26 Independent mixture model / Application to ChIP-chip: Mixture of regressions
This mixture states that (IP_i | Z_i = k) ~ N(a_k + b_k Input_i, σ²), where k = 0 (normal) or 1 (enriched).
The estimates of a_k and b_k are simply weighted versions of the standard OLS intercept and slope estimates.
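
A sketch (Python/NumPy) of the corresponding weighted least-squares update for one component k; the responsibilities w = tau[:, k] are assumed to come from an E-step as above, and the variable names are illustrative.

    import numpy as np

    def weighted_regression(ip, inp, w):
        """Weighted OLS estimates of the intercept a_k and slope b_k."""
        D = np.column_stack([np.ones_like(inp), inp])      # design matrix [1, Input_i]
        W = np.diag(w)                                      # weights tau_ik
        a_k, b_k = np.linalg.solve(D.T @ W @ D, D.T @ W @ ip)
        return a_k, b_k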

27 Hidden Markov model: Model

28 Hidden Markov model / Model
Reminder:
{Z_i} ~ MC(π), π_kl = Pr{Z_i = l | Z_{i-1} = k}; Z_1 ~ M(1; ν) (e.g. ν = stationary distribution of π);
the X_i's are independent conditionally on Z: (X_i | Z_i = k) ~ f(·; γ_k).
Distribution of the observed data: X_i ~ g(x) = Σ_k ν_k^i f(x; γ_k), since Z_i ~ M(1; ν^i) where ν^i = ν π^{i-1}.
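
A short simulation sketch (Python/NumPy) of this HMM with Gaussian emissions; the transition matrix, initial distribution and emission parameters below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, K = 500, 2
    Pi = np.array([[0.95, 0.05],        # pi_kl = Pr{Z_i = l | Z_{i-1} = k}
                   [0.20, 0.80]])
    nu = np.array([0.8, 0.2])           # distribution of Z_1
    mu, sigma = np.array([0.0, 1.5]), np.array([0.3, 0.5])

    Z = np.empty(n, dtype=int)
    Z[0] = rng.choice(K, p=nu)
    for i in range(1, n):
        Z[i] = rng.choice(K, p=Pi[Z[i - 1]])
    X = rng.normal(mu[Z], sigma[Z])     # (X_i | Z_i = k) ~ N(mu_k, sigma_k^2)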

29 Hidden Markov model / Model: Dependency structure
Some properties: (Z_{i-1}, Z_i) are not independent; (X_{i-1}, X_i) are not independent; (X_{i-1}, X_i) are independent conditionally on Z_i.
[Graphical representation: Markov chain ... Z_{i-1} -> Z_i -> Z_{i+1} ..., each Z_i pointing to its own observation X_i.]

30 Hidden Markov model / Inference: Likelihood
Complete likelihood:
P(X, Z) = P(Z) P(X | Z) = [ Π_k ν_k^{Z_1k} ] [ Π_{i>1} Π_{k,l} π_kl^{Z_{i-1,k} Z_{i,l}} ] [ Π_i Π_k f_k(X_i)^{Z_ik} ],
log P(X, Z) = Σ_k Z_1k log ν_k + Σ_{i>1} Σ_{k,l} Z_{i-1,k} Z_{i,l} log π_kl + Σ_i Σ_k Z_ik log f_k(X_i).
Completed likelihood:
E_Q[log P(X, Z)] = Σ_k E_Q[Z_1k] log ν_k + Σ_{i>1} Σ_{k,l} E_Q[Z_{i-1,k} Z_{i,l}] log π_kl + Σ_i Σ_k E_Q[Z_ik] log f_k(X_i).

31 Hidden Markov model / Inference: E-step
For Q(Z) = P(Z | X), we need to compute τ_ik = E_Q[Z_ik] and η_ikl = E_Q[Z_{i-1,k} Z_{i,l}].
Forward equation: denoting F_il = Pr{Z_i = l | X_1^i}, we have [Devijver (1985)]
F_il ∝ f_l(X_i) Σ_k F_{i-1,k} π_kl.
Backward equation: once we have all the F_ik, we get the τ_ik as
τ_ik = F_ik Σ_l π_kl τ_{i+1,l} / G_{i+1,l},  with  G_{i+1,l} = Σ_k π_kl F_ik.
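
A minimal forward-backward sketch in Python/NumPy/SciPy implementing these two recursions for Gaussian emissions; Pi, nu, mu and sigma are assumed to be those of the HMM simulation sketch above.

    import numpy as np
    from scipy.stats import norm

    def forward_backward(X, Pi, nu, mu, sigma):
        """Return tau[i, k] = Pr{Z_i = k | X_1, ..., X_n}."""
        n, K = len(X), len(nu)
        emis = norm.pdf(X[:, None], loc=mu, scale=sigma)    # f_k(X_i)

        F = np.zeros((n, K))                                # F[i, l] = Pr{Z_i = l | X_1^i}
        F[0] = nu * emis[0]
        F[0] /= F[0].sum()
        for i in range(1, n):
            F[i] = emis[i] * (F[i - 1] @ Pi)                # forward recursion
            F[i] /= F[i].sum()

        tau = np.zeros((n, K))
        tau[-1] = F[-1]
        for i in range(n - 2, -1, -1):                      # backward recursion
            G = F[i] @ Pi                                   # G[l] = Pr{Z_{i+1} = l | X_1^i}
            tau[i] = F[i] * (Pi @ (tau[i + 1] / G))
        return tau

    tau = forward_backward(X, Pi, nu, mu, sigma)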

32 Hidden Markov model / Inference: Forward recursion, proof
F_1l = P(Z_1l | X_1) = P(X_1 | Z_1l) P(Z_1l) / P(X_1) ∝ ν_l f_l(X_1).
F_il = P(Z_il | X_1^i) = Σ_k P(Z_{i-1,k}, Z_il | X_1^i)
= Σ_k P(Z_il, Z_{i-1,k}, X_1^i) / P(X_1^i)
= Σ_k P(X_i | Z_il) P(Z_il | Z_{i-1,k}) P(Z_{i-1,k} | X_1^{i-1}) [ P(X_1^{i-1}) / P(X_1^i) ]
∝ Σ_k f_l(X_i) π_kl F_{i-1,k}.

33 Hidden Markov model / Inference: Backward recursion, proof
τ_nk = P(Z_nk | X) = P(Z_nk | X_1^n) = F_nk.
τ_ik = P(Z_ik | X_1^n) = Σ_l P(Z_ik, Z_{i+1,l}, X_1^n) / P(X_1^n)
= Σ_l P(X_1^i) P(Z_ik | X_1^i) P(Z_{i+1,l} | Z_ik) P(X_{i+1}^n | Z_{i+1,l}) / P(X_1^n)
= F_ik Σ_l π_kl P(X_1^i) P(X_{i+1}^n | Z_{i+1,l}) / P(X_1^n),
and
P(X_1^i) P(X_{i+1}^n | Z_{i+1,l}) / P(X_1^n) = P(X_1^i) P(X_{i+1}^n | Z_{i+1,l}) P(X_1^i | Z_{i+1,l}) / [ P(X_1^n) P(X_1^i | Z_{i+1,l}) ]
= P(X_1^i) P(X_1^n | Z_{i+1,l}) / [ P(X_1^n) P(X_1^i | Z_{i+1,l}) ]
= P(Z_{i+1,l} | X_1^n) / P(Z_{i+1,l} | X_1^i)
= τ_{i+1,l} / P(Z_{i+1,l} | X_1^i),
where P(Z_{i+1,l} | X_1^i) = Σ_k P(Z_{i+1,l}, Z_ik | X_1^i) = Σ_k F_ik π_kl.

34 Hidden Markov model / Application to ChIP-chip: heterogeneous variances
[Figures: distribution fit of the log-ratio (dotted = mixture) and classification along the genome position.]

35 Hidden Markov model / Application to ChIP-chip: common variance
[Figures: distribution fit of the log-ratio (dotted = mixture) and classification along the genome position.]

36 Hidden Markov model / Application to ChIP-chip: One step further in modelling
The observed signal is actually bi-dimensional: IPwt = signal observed at each probe in the wild type, IPmut = signal observed at each probe in the mutant.
A joint modelling makes it possible to distinguish between identical probes (same signal in both lines) and non-methylated probes (no signal in either line). Source: C. Bérard.

37 Hidden Markov model / Application to ChIP-chip: Comparison with genome annotation
The probe classification provided by the HMM is more consistent with their spatial organisation.
The methylation mark in META1 is lost in the mutant, mostly in the left-end part, near the regulatory region. Legend: lost, enriched, normal.

38 Mixture model for random graphs: Model

39 Mixture model for random graphs / Model
Reminder:
the Z_i's are i.i.d. M(1; π), π_k = Pr{Z_i = k}, k = 1..K;
the X_ij's are independent conditionally on Z: (X_ij | Z_i = k, Z_j = l) ~ f(γ_kl).
Distribution of the observed data:
X_ij ~ g(x) = Σ_{k,l} π_k π_l f(x; γ_kl), e.g. B( Σ_{k,l} π_k π_l γ_kl );
(X_ij | Z_i = k) ~ Σ_l π_l f(x; γ_kl), e.g. B( Σ_l π_l γ_kl ).
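
A short Python/NumPy sketch simulating this mixture for random graphs; the group proportions and the connectivity matrix below are illustrative assumptions, and self-loops are simply set to zero.

    import numpy as np

    rng = np.random.default_rng(2)
    n, K = 100, 2
    pi = np.array([0.6, 0.4])                  # group proportions pi_k
    gamma = np.array([[0.25, 0.02],            # gamma_kl = Pr{X_ij = 1 | Z_i = k, Z_j = l}
                      [0.02, 0.15]])

    Z = rng.choice(K, size=n, p=pi)            # hidden group of each node
    X = rng.binomial(1, gamma[Z][:, Z])        # adjacency matrix, X_ij ~ B(gamma_{Z_i Z_j})
    np.fill_diagonal(X, 0)                     # no self-loops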

40 Mixture model for random graphs / Model: Dependency structure
[Figures: graphical representation of P(Z) P(X | Z), its moral graph [Lauritzen (1996)], and the conditional dependency of Z given X, P(Z | X), drawn on nodes Z_i, Z_j, Z_k and edges X_ij, X_jk, X_ik.]
The conditional dependency graph of Z given X is a clique: no factorisation can be hoped for, so P(Z | X) can only be approximated.

41 Mixture model for random graphs / Inference: Likelihood
Complete likelihood:
log P(X, Z) = Σ_{i,k} Z_ik log π_k + Σ_{i,j} Σ_{k,l} Z_ik Z_jl [ X_ij log γ_kl + (1 - X_ij) log(1 - γ_kl) ].
Completed likelihood: denoting τ_ik = E_Q(Z_ik),
E_Q[log P(X, Z)] = Σ_{i,k} τ_ik log π_k + Σ_{i,j} Σ_{k,l} τ_ik τ_jl [ X_ij log γ_kl + (1 - X_ij) log(1 - γ_kl) ].
M-step: (again) a weighted version of the MLE.

42 Mixture model for random graphs / Inference: Approximation of P(Z | X)
Problem: we are looking for Q* = arg min_{Q ∈ Q} KL[Q(Z); P(Z | X)]. The optimum over all possible distributions is Q*(Z) = P(Z | X)... which cannot be calculated.
We restrict ourselves to the set of factorisable distributions:
Q = { Q : Q(Z) = Π_i Q_i(Z_i) = Π_i Π_k τ_ik^{Z_ik} }.
Q is then characterised by the set of parameters τ_ik ≈ Pr{Z_i = k | X}. The optimal τ_ik's can be found using standard (constrained) optimisation techniques.

43 Mixture model for random graphs / Inference: Optimisation
The optimal τ_ik's must satisfy
∂/∂τ_ik { KL[Q(Z); P(Z | X)] + Σ_i λ_i ( Σ_k τ_ik - 1 ) } = 0,
which leads to the fixed-point relation
τ_ik ∝ π_k Π_{j≠i} Π_l [ γ_kl^{X_ij} (1 - γ_kl)^{1 - X_ij} ]^{τ_jl},
also known as the mean-field approximation in physics.
Intuitive interpretation:
Pr{Z_i = k | X, Z_{-i}} ∝ π_k Π_{j≠i} Π_l [ γ_kl^{X_ij} (1 - γ_kl)^{1 - X_ij} ]^{Z_jl}.
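
A sketch (Python/NumPy) of one sweep of this fixed-point update, working on the log scale for numerical stability. It follows the displayed relation literally (only the X_ij term, as written above); X, pi and gamma are assumed to be those of the simulation sketch given earlier, and tau is a current n x K matrix of variational parameters.

    import numpy as np

    def mean_field_update(X, tau, pi, gamma):
        """One fixed-point sweep over the variational parameters tau_ik."""
        n, K = tau.shape
        log_g, log_1mg = np.log(gamma), np.log(1 - gamma)
        new_tau = np.zeros_like(tau)
        for i in range(n):
            mask = np.arange(n) != i                       # exclude j = i
            A = X[i, mask][:, None, None]                  # X_ij, shaped for broadcasting
            # term[j, k, l] = tau_jl * [ X_ij log g_kl + (1 - X_ij) log(1 - g_kl) ]
            term = tau[mask][:, None, :] * (A * log_g + (1 - A) * log_1mg)
            log_t = np.log(pi) + term.sum(axis=(0, 2))
            log_t -= log_t.max()                           # stabilise before exponentiating
            new_tau[i] = np.exp(log_t) / np.exp(log_t).sum()
        return new_tau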

44 Mixture model for random graphs / Application to regulatory networks
[Table: estimated connectivities γ_kl (%) and group proportions α (%) (source: F. Picard).]

45 Variational Bayes inference: Bayesian inference

46 Variational Bayes inference / Bayesian inference
Bayesian point of view: the parameter θ itself is random, θ ~ P(θ), where P(θ) is the prior distribution of θ.
Bayesian inference: the goal is then to calculate the posterior distribution P(θ | X) = P(θ) P(X | θ) / P(X).
Its explicit calculation is possible in nice cases, e.g. exponential family with conjugate prior. Monte Carlo (e.g. MCMC) sampling is often used to estimate it.

47 Variational Bayes inference / Bayesian inference: Incomplete data model
Hierarchical modelling: the model is typically defined with
the prior distribution of θ: P(θ);
the conditional distribution of the unobserved Z: P(Z | θ);
the conditional distribution of the observed X: P(X | Z, θ).
Inference: the goal is now to calculate (or estimate) the joint conditional distribution
P(Z, θ | X) = P(θ) P(Z | θ) P(X | Z, θ) / P(X),
which is often intractable, even when P(Z | X, θ) can be calculated (e.g. independent mixture models).

48 Variational Bayes inference / VB-EM
Exponential family / conjugate prior: if
log P(θ) = φ(θ) · ν + cst,   log P(X, Z | θ) = φ(θ) · u(X, Z) + cst.
Variational optimisation: the best approximate distribution
Q* = arg min_{Q ∈ Q} KL[Q(Z, θ); P(Z, θ | X)]
within the class of factorisable distributions Q = { Q : Q(Z, θ) = Q_Z(Z) Q_θ(θ) }
can be recovered via a variational Bayes E-M algorithm (VB-EM) [Beal and Ghahramani (2003)].

49 Variational Bayes inference / VB-EM algorithm
The approximate conditional distributions Q_Z(Z) and Q_θ(θ) are alternately updated.
VB-M step: approximate posterior of θ: log Q_θ(θ) = φ(θ) · { E_{Q_Z}[u(X, Z)] + ν } + cst.
VB-E step: approximate conditional distribution of Z: log Q_Z(Z) = E_{Q_θ}[φ(θ)] · u(X, Z) + cst.
General properties: still not well understood. Consistency in some particular cases. VB-EM generally tends to underestimate the posterior variance of θ. The quality obviously depends on the approximation Q(θ, Z).
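
To illustrate the VB-E / VB-M alternation, here is a minimal sketch (Python/NumPy/SciPy) for a toy conjugate setting chosen for brevity, not the model of the lecture: a univariate Gaussian mixture with known component standard deviation, a Dirichlet prior on π and Gaussian priors on the means; all hyperparameter values are assumptions.

    import numpy as np
    from scipy.special import digamma

    def vb_em(X, K, sigma=0.5, alpha0=1.0, m0=0.0, s0=10.0, n_iter=100):
        """VB-EM for a K-component Gaussian mixture with known std sigma."""
        rng = np.random.default_rng(0)
        tau = rng.dirichlet(np.ones(K), size=len(X))        # initial Q_Z
        for _ in range(n_iter):
            # VB-M step: Q_pi = Dirichlet(alpha), Q_mu_k = N(m_k, s2_k)
            Nk = tau.sum(axis=0)
            alpha = alpha0 + Nk
            s2 = 1.0 / (1.0 / s0**2 + Nk / sigma**2)
            m = s2 * (m0 / s0**2 + (tau * X[:, None]).sum(axis=0) / sigma**2)
            # VB-E step: log tau_ik = E[log pi_k] + E[log f(X_i; mu_k)] + cst
            e_log_pi = digamma(alpha) - digamma(alpha.sum())
            e_log_f = -0.5 * ((X[:, None] - m)**2 + s2) / sigma**2
            log_tau = e_log_pi + e_log_f
            log_tau -= log_tau.max(axis=1, keepdims=True)
            tau = np.exp(log_tau)
            tau /= tau.sum(axis=1, keepdims=True)
        return alpha, m, s2, tau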

50 Variational Bayes inference / VB-EM: Application to mixtures for networks
[Figure: credibility intervals with a mixture of 2 groups of nodes, for π_1, γ_11, γ_12 and γ_22.]
For all parameters, the VB-EM posterior credibility intervals achieve the nominal level (90%) as soon as n ≥ 25: the VB-EM approximation works well, at least for graphs.

51 Variational Bayes inference / Mixture for networks: Approximate posterior distribution

52 Some extensions: Model selection

53 Some extensions / Model selection
Problem: the number of groups K is often not known a priori.
Model fit: the observed log-likelihood L_K(X) = log P(X; {θ_1, ..., θ_K}), evaluated at the estimated parameters, is not a sufficient measure of how well the model fits the data, since it always increases with K.
Penalised criteria: a penalty term has to be added to avoid over-fitting:
BIC(K) = L_K - log(n) (#param.) / 2,
ICL(K) = BIC(K) - H[P(Z | X)].
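
A small sketch (Python/NumPy/SciPy) computing these two criteria for the Gaussian mixture fitted earlier; the parameter count (K - 1) + 2K assumes free proportions, means and variances.

    import numpy as np
    from scipy.stats import norm

    def bic_icl(X, pi, mu, sigma):
        """BIC and ICL criteria for a fitted K-component Gaussian mixture."""
        n, K = len(X), len(pi)
        dens = pi * norm.pdf(X[:, None], loc=mu, scale=sigma)
        loglik = np.log(dens.sum(axis=1)).sum()             # L_K(X)
        n_param = (K - 1) + 2 * K                            # proportions + means + variances
        bic = loglik - np.log(n) * n_param / 2
        tau = dens / dens.sum(axis=1, keepdims=True)         # Pr{Z_i = k | X_i}
        entropy = -(tau * np.log(np.clip(tau, 1e-300, None))).sum()
        return bic, bic - entropy                            # (BIC, ICL)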

54 References
Baum, L. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist.
Beal, M. J. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7, Oxford University Press.
Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer.
Churchill, G. A. (1992). Hidden Markov chains and the analysis of genome structure. Computers Chem.
Daudin, J.-J., Picard, F. and Robin, S. (2008). A mixture model for random graphs. Stat. Comput. 18(2).
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B.
Devijver, P. (1985). Baum's forward-backward algorithm revisited. Pattern Recogn. Lett.
Jaakkola, T. (2000). Tutorial on variational approximation methods. In Advanced Mean Field Methods: Theory and Practice. MIT Press.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning 37(2).
Lauritzen, S. (1996). Graphical Models. Oxford Statistical Science Series. Clarendon Press.
Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. J. R. Statist. Soc. B 44(2).
McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.
