Probabilistic methods for post-genomic data integration
Dirk Husmeier
Biomathematics & Statistics Scotland (BioSS)
JCMB, The King's Buildings, Edinburgh EH9 3JZ, United Kingdom
http://www.bioss.ac.uk/~dirk
Integrated analysis of regulatory networks
Expression data alone are not sufficient
Combining multiple sources of information yields complementary constraints
Combining promoter sequences and gene expression data
Conventional approach:
Find clusters of co-expressed genes
Identify regulatory elements by searching for common over-represented motifs in the promoter regions of these genes
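The second step of the conventional approach can be sketched as follows. This is a minimal illustration, not the algorithm used in any of the cited work: it assumes the clustering has already been done, treats motifs as exact k-mers, and scores each k-mer by how much more often it occurs in the cluster's promoters than in the remaining promoters. The function names and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rank_kmers(cluster_seqs, other_seqs, k):
    """Rank k-mers by (fraction of cluster promoters containing the k-mer)
    minus (fraction of the other promoters containing it)."""
    candidates = set().union(*(kmers(s, k) for s in cluster_seqs))

    def frac(seqs, kmer):
        return sum(kmer in s for s in seqs) / len(seqs)

    scores = {m: frac(cluster_seqs, m) - frac(other_seqs, m) for m in candidates}
    # Sort by decreasing score; break ties alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data (hypothetical): promoters of two co-expressed genes share "TATA".
cluster = ["GGTATACC", "CCTATAGG"]
others = ["GGGGCCCC", "CCCCGGGG", "GCGCGCGC"]
ranked = rank_kmers(cluster, others, k=4)
```

Because the clusters are fixed before any sequence is examined, the motif search cannot correct a poor clustering, which is the kind of limitation a joint model is meant to address.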
Shortcomings of the conventional algorithm
[Figure: microarray data, model, promoter sequences]
Segal's unifying probabilistic model
[Figure: microarray data, model, promoter sequences]
Segal, Yelensky, Koller (2003) Bioinformatics 19
Revision: Motif finding
[Figure: a motif and a promoter sequence (extracted text: TAT A G T G A A T T T A T A G A)]
Position Specific Scoring Matrix (PSSM)
Search for a motif of length $W$ in binding sequences
$W \times 4$ matrix $\psi_k(l)$: probability that the nucleotide in the $k$th position, $k \in [1, \dots, W]$, is an $l \in \{A, C, G, T\}$
Background model for non-binding sequences
4-dim vector $\theta_0(l)$: probability of nucleotide $l$; this distribution is position-independent
Sequence $S_1, S_2, \dots, S_N$

Non-binding sequence: $R = 0$
\[
P(S_1, S_2, \dots, S_N \mid R = 0) = \prod_{t=1}^{N} \theta_0(S_t)
\]

Binding sequence: $R = 1$, motif starting at position $m + 1$
\[
P(S_1, S_2, \dots, S_N \mid R = 1, \mathrm{start} = m + 1)
= \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t)
= \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}
\]
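The two equivalent forms of the binding likelihood for a fixed start position can be checked numerically. The sketch below works in log space and uses a hypothetical toy PSSM for the consensus TATA with a uniform background; the function names and all numerical values are assumptions made for illustration.

```python
import math

NUCS = "ACGT"

def background_loglik(seq, theta0):
    """log P(S | R=0): every position is drawn from the background theta0."""
    return sum(math.log(theta0[c]) for c in seq)

def binding_loglik_at(seq, m, psi, theta0):
    """log P(S | R=1, start=m+1): PSSM inside the window, background outside.
    Positions are 0-based here, so the motif covers indices m .. m+W-1."""
    W = len(psi)
    total = 0.0
    for t, c in enumerate(seq):
        if m <= t < m + W:
            total += math.log(psi[t - m][c])
        else:
            total += math.log(theta0[c])
    return total

def binding_loglik_ratio_form(seq, m, psi, theta0):
    """Same quantity written as background likelihood times PSSM/background ratios."""
    W = len(psi)
    log_ratio = sum(math.log(psi[k][seq[m + k]] / theta0[seq[m + k]]) for k in range(W))
    return background_loglik(seq, theta0) + log_ratio

# Hypothetical toy parameters: uniform background, PSSM with consensus "TATA".
theta0 = {c: 0.25 for c in NUCS}
psi = [{c: (0.85 if c == best else 0.05) for c in NUCS} for best in "TATA"]

seq, m = "GGTATACC", 2
direct = binding_loglik_at(seq, m, psi, theta0)
ratio_form = binding_loglik_ratio_form(seq, m, psi, theta0)
```

The ratio form is the convenient one in practice: the background product is shared by both hypotheses, so only the $W$ ratios inside the window matter when comparing them.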
Binding sequence: $R = 1$, motif starting anywhere (uniform prior over the $N - W + 1$ possible start positions)
\[
P(S_1, S_2, \dots, S_N \mid R = 1)
= \sum_{m=0}^{N-W} P(\mathrm{start} = m + 1)\, P(S_1, S_2, \dots, S_N \mid R = 1, \mathrm{start} = m + 1)
= \frac{1}{N - W + 1} \prod_{t=1}^{N} \theta_0(S_t) \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}
\]

Objective: Prediction of binding activity from sequence: $P(R = 1 \mid S_1, S_2, \dots, S_N)$
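The marginalisation over the start position can be written out directly. The following sketch assumes a uniform prior over start positions, as in the formula above, and uses hypothetical toy parameters. As a sanity check it sums the likelihood over every sequence of a short fixed length, which should give a total of one if the expression is a properly normalised distribution.

```python
import itertools
import math

NUCS = "ACGT"

def binding_likelihood(seq, psi, theta0):
    """P(S | R=1) with the motif start averaged over all N-W+1 positions."""
    N, W = len(seq), len(psi)
    background = math.prod(theta0[c] for c in seq)
    ratio_sum = sum(
        math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]] for k in range(W))
        for m in range(N - W + 1)
    )
    return background * ratio_sum / (N - W + 1)

# Hypothetical toy parameters: non-uniform background, motif of length W = 2.
theta0 = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
psi = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]

# Sum P(S | R=1) over all 4^5 sequences of length 5.
total = sum(
    binding_likelihood("".join(s), psi, theta0)
    for s in itertools.product(NUCS, repeat=5)
)
```

For realistic promoter lengths the plain product underflows, so an implementation would normally work with log probabilities instead.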
Apply Bayes rule:
\[
\begin{aligned}
P(R = 1 \mid S_1, S_2, \dots, S_N)
&= \frac{P(S_1, \dots, S_N \mid R = 1) P(R = 1)}{P(S_1, \dots, S_N \mid R = 0) P(R = 0) + P(S_1, \dots, S_N \mid R = 1) P(R = 1)} \\
&= \left( 1 + \frac{P(R = 0) P(S_1, \dots, S_N \mid R = 0)}{P(R = 1) P(S_1, \dots, S_N \mid R = 1)} \right)^{-1} \\
&= \left( 1 + \left[ \frac{P(R = 1)}{P(R = 0)} \frac{1}{N - W + 1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})} \right]^{-1} \right)^{-1}
\end{aligned}
\]

Define:
\[
w_k(l) = \log \frac{\psi_k(l)}{\theta_0(l)}, \qquad
w_0 = \log \frac{P(R = 1)}{P(R = 0)}, \qquad
\mathrm{logit}(z) = \frac{1}{1 + \exp(-z)}
\]
\[
P(R = 1 \mid S_1, S_2, \dots, S_N)
= \mathrm{logit}\left( w_0 + \log\left[ \frac{1}{N - W + 1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k(S_{m+k}) \right) \right] \right)
\]
$4W + 1$ parameters: $w_k(l)$, $w_0$
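The closed form can be sketched in a few lines. The code below computes one score per window from the log-odds weights $w_k(l)$, averages the exponentiated scores with a log-sum-exp for numerical stability, adds $w_0$, and applies the logistic function that the slides call logit. The toy PSSM, the background and the prior $P(R = 1) = 0.3$ are hypothetical values chosen only for illustration.

```python
import math

NUCS = "ACGT"

def logistic(z):
    """1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_odds_weights(psi, theta0):
    """w_k(l) = log(psi_k(l) / theta0(l)) for every position k and nucleotide l."""
    return [{l: math.log(row[l] / theta0[l]) for l in NUCS} for row in psi]

def binding_posterior(seq, w, w0):
    """P(R=1 | S) = logistic(w0 + log mean_m exp(sum_k w_k(S_{m+k})))."""
    N, W = len(seq), len(w)
    scores = [sum(w[k][seq[m + k]] for k in range(W)) for m in range(N - W + 1)]
    top = max(scores)  # log-sum-exp trick
    log_mean = top + math.log(sum(math.exp(s - top) for s in scores)) - math.log(len(scores))
    return logistic(w0 + log_mean)

# Hypothetical toy parameters.
theta0 = {c: 0.25 for c in NUCS}
psi = [{c: (0.85 if c == best else 0.05) for c in NUCS} for best in "TATA"]
prior = 0.3  # assumed P(R = 1)

w = log_odds_weights(psi, theta0)
w0 = math.log(prior / (1.0 - prior))
posterior = binding_posterior("GGTATACC", w, w0)
```

Parameterising the model by $w_k(l)$ and $w_0$ rather than by $\psi$, $\theta_0$ and the prior leaves the weights unconstrained, which is convenient for gradient-based fitting.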
[Figure: the motif is scored at each position of the sequence (Score 1, Score 2, ..., Score t, ..., Score N); the scores are summed (+) and passed through a nonlinear transfer function to give P(R=1 | sequence)]
Wolfgang Lehrach, Biomathematics & Statistics Scotland
Ab initio prediction of protein interaction
SH3 yeast two-hybrid interaction network
Tong et al. (2002), Science 295, 321-324
285 interactions between 28 SH3 proteins and 143 binding peptides
9 binding partners per SH3 domain on average
[Figure: Final Test Set Performance. ROC curves: true positive rate (sensitivity) against false positive rate (1 - specificity). Legend: 0.61 Reiss; 0.62 None; 0.64 Naive; 0.69 Gaussian; 0.71 Laplacian with pruning; 0.73 Laplacian]
The model of Segal, Yelensky and Koller Bioinformatics 19, 2003
[Figure: graphical model for a gene g, built up in stages. Sequence nodes g.S_1, g.S_2, ..., g.S_N; motif/regulator nodes g.R_1, g.R_2 with P(g.R_2 | g.S); module node g.M (values 1, 2, 3); expression nodes g.E_1, g.E_2, g.E_3 with P(g.E_3 | g.M)]
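\[
P(g.R_i = 1 \mid g.S_1, g.S_2, \dots, g.S_N)
= \mathrm{logit}\left( w_0^i + \log\left[ \frac{1}{N - W + 1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k^i(g.S_{m+k}) \right) \right] \right)
\]
$4W + 1$ parameters per binding motif $R_i$: $w_k^i(l)$, $w_0^i$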
[Figure: graphical model with nodes g.S_1, g.S_2, ..., g.S_N; g.R_1, g.R_2; g.M (values 1, 2, 3); g.E_1, g.E_2, g.E_3]
Softmax function
\[
P(g.M = m \mid g.R_1 = r_1, g.R_2 = r_2, \dots, g.R_L = r_L)
= \frac{\exp\left( \sum_{i=1}^{L} u_{mi} r_i \right)}{\sum_{m'} \exp\left( \sum_{i=1}^{L} u_{m'i} r_i \right)}
\]
Parameter matrix: number of motifs/regulators $\times$ number of modules
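A minimal sketch of the softmax module assignment follows. The weight matrix is stored here with one row per module and one column per motif, matching the index order of $u_{mi}$; the function name and the weights for three modules and two motifs are hypothetical.

```python
import math

def module_probabilities(r, u):
    """P(g.M = m | g.R = r) for every module m.
    r: list of 0/1 motif indicators; u[m][i]: weight of motif i for module m."""
    logits = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in u]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: module 0 favours motif 0, module 1 favours motif 1,
# module 2 is indifferent to both.
u = [[2.0, 0.0],
     [0.0, 2.0],
     [0.0, 0.0]]

probs_motif0 = module_probabilities([1, 0], u)
probs_none = module_probabilities([0, 0], u)
```

With no motif present every module receives the same logit, so the assignment falls back to a uniform distribution under these particular weights.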
[Figure: graphical model with nodes g.S_1, g.S_2, ..., g.S_N; g.R_1, g.R_2; g.M (values 1, 2, 3); g.E_1, g.E_2, g.E_3, annotated with P(g.E_3 | g.M)]
Independent Gaussian distributions
\[
P(g.E_1, g.E_2, \dots, g.E_L \mid g.M = m) = \prod_{j} P(g.E_j \mid g.M = m)
\]
\[
P(g.E_j \mid g.M = m) = N(\mu_{j,m}, \sigma_{j,m})
\]
For each module $m$ and each condition $j$:
Mean: $\mu_{j,m}$
Standard deviation: $\sigma_{j,m}$
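The factorised expression likelihood amounts to a sum of univariate Gaussian log densities, one per condition. The sketch below evaluates it for one gene under two modules; the function names, the module parameters and the expression profile are hypothetical toy values.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def expression_loglik(expr, mus, sigmas):
    """log P(g.E_1, ..., g.E_L | g.M = m): conditions are independent given the module."""
    return sum(gaussian_logpdf(x, mu, s) for x, mu, s in zip(expr, mus, sigmas))

# Hypothetical parameters for two modules over three conditions.
module_means = [[1.5, 0.0, -1.0], [-1.0, 0.5, 2.0]]
module_sds = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

profile = [1.4, 0.1, -0.8]  # one gene's expression across the three conditions
logliks = [expression_loglik(profile, mu, sd) for mu, sd in zip(module_means, module_sds)]
best_module = max(range(len(logliks)), key=logliks.__getitem__)
```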
Parameter estimation
[Figure: complete graphical model. Sequence nodes g.S_1, g.S_2, ..., g.S_N; motif/regulator nodes g.R_1, g.R_2 with P(g.R_2 | g.S); module node g.M (values 1, 2, 3); expression nodes g.E_1, g.E_2, g.E_3 with P(g.E_3 | g.M)]
Bayesian approach
\[
P(\text{parameters} \mid \text{data}) = \sum_{\text{latent variables}} P(\text{parameters}, \text{latent variables} \mid \text{data})
\]
Intractable!
Gibbs sampling:
parameters $\sim P(\text{parameters} \mid \text{latent variables}, \text{data})$
latent variables $\sim P(\text{latent variables} \mid \text{parameters}, \text{data})$
[Figure: Gibbs sampling on a joint distribution P(x, y) over axes x and y, alternating between the conditionals P(x | y) and P(y | x)]
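The alternation between the two conditionals can be tried on a target where both are available in closed form. The sketch below uses a standard bivariate normal with correlation rho as a stand-in for P(x, y); this target is an assumption chosen for illustration and is not the posterior of the regulatory model. For it, x given y is normal with mean rho times y and variance 1 - rho squared, and symmetrically for y given x.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5
    x = y = 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # draw x from P(x | y)
        y = rng.gauss(rho * x, cond_sd)  # draw y from P(y | x)
        if step >= burn_in:              # discard the initial transient
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

Each sweep only moves along one axis at a time, so successive samples are correlated and many sweeps are needed when x and y are strongly dependent. The cost of running such a chain over all parameters and latent variables is what motivates cheaper alternatives.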
Still too expensive
Find one good set of parameters rather than a whole sample from the posterior distribution
Hard-assignment EM algorithm
Various heuristic simplifications
See Bioinformatics 19, 2003 for details
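The idea of hard-assignment EM can be sketched on a deliberately reduced version of the model. The code below is not the procedure of Segal et al.: it ignores the sequence and motif layers entirely, assumes a uniform prior over modules, and fits only the per-module Gaussian expression parameters. The E-step assigns each gene to its single most likely module; the M-step re-estimates each module's means and standard deviations from its assigned genes. All names and data are hypothetical.

```python
import math

def log_gauss(x, mu, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def profile_loglik(profile, mus, sigmas):
    return sum(log_gauss(x, mu, s) for x, mu, s in zip(profile, mus, sigmas))

def hard_em(profiles, init_means, n_iter=20, min_sigma=0.1):
    """Hard-assignment EM for independent Gaussians per module and condition."""
    n_modules, n_cond = len(init_means), len(profiles[0])
    means = [list(mu) for mu in init_means]
    sigmas = [[1.0] * n_cond for _ in range(n_modules)]
    assign = None
    for _ in range(n_iter):
        # E-step (hard): pick the most likely module for every gene.
        new_assign = [
            max(range(n_modules), key=lambda m: profile_loglik(p, means[m], sigmas[m]))
            for p in profiles
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        # M-step: re-estimate parameters from the genes assigned to each module.
        for m in range(n_modules):
            members = [p for p, a in zip(profiles, assign) if a == m]
            if not members:
                continue  # keep old parameters for an empty module
            for j in range(n_cond):
                vals = [p[j] for p in members]
                mu = sum(vals) / len(vals)
                var = sum((v - mu) ** 2 for v in vals) / len(vals)
                means[m][j] = mu
                sigmas[m][j] = max(math.sqrt(var), min_sigma)  # floor avoids collapse
    return assign, means, sigmas

# Toy expression profiles (hypothetical): two clearly separated groups.
profiles = [[2.0, 2.1], [1.9, 2.0], [-2.0, -1.9], [-2.1, -2.0]]
assign, means, sigmas = hard_em(profiles, init_means=[[1.0, 1.0], [-1.0, -1.0]])
```

Replacing posterior probabilities by a single best assignment makes each iteration cheap, at the price of converging to a local optimum that depends on the initialisation.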
[Figure: graphical model (g.S_1, g.S_2, ..., g.S_N; g.R_1, g.R_2; g.M; g.E_1, g.E_2, g.E_3) shown for the E-step and the M-step]
Segal, Yelensky, Koller (2003) Bioinformatics 19
Saccharomyces cerevisiae
[Figure from Segal et al., Bioinformatics 2003]
Experiment 1
173 microarrays, measuring responses to various stress conditions (Gasch et al. 2000)
Conventional algorithms: 20% of the predicted motifs are known
Unified probabilistic model: 45% of the predicted motifs are known
Experiment 2
77 microarrays, expression during the cell cycle (Spellman et al. 1998)
Conventional algorithms: 30% of the predicted motifs are known
Unified probabilistic model: 56% of the predicted motifs are known
[Figure from Segal et al., Bioinformatics 2003]