Probabilistic methods for post-genomic data integration


1 Probabilistic methods for post-genomic data integration. Dirk Husmeier, Biomathematics & Statistics Scotland (BioSS), JCMB, The King's Buildings, Edinburgh EH9 3JZ, United Kingdom. dirk

2 Integrated analysis of regulatory networks

3 Integrated analysis of regulatory networks. Expression data alone are not sufficient. Combining multiple sources of information yields complementary constraints.

4 Combining promoter sequences and gene expression data

5 Combining promoter sequences and gene expression data. Conventional approach: (1) Find clusters of co-expressed genes. (2) Identify regulatory elements by searching for common over-represented motifs in the promoter regions of these genes.
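A minimal sketch of the second step of the conventional approach, assuming the cluster of co-expressed genes has already been found. The function names, the toy promoter strings, the add-one smoothing and the `min_ratio` threshold are illustrative assumptions, not taken from the slides; fixed-length k-mers stand in for a proper motif model.

```python
from collections import Counter

def kmer_presence(promoters, k):
    """For each k-mer, count how many promoters contain it at least once."""
    counts = Counter()
    for seq in promoters:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def enriched_kmers(cluster, background, k, min_ratio=2.0):
    """Rank k-mers that occur in a larger fraction of cluster promoters
    than of background promoters (hypothetical scoring rule)."""
    in_cluster = kmer_presence(cluster, k)
    in_background = kmer_presence(background, k)
    hits = []
    for kmer, n in in_cluster.items():
        frac_cluster = n / len(cluster)
        # Add-one smoothing so unseen background k-mers do not divide by zero.
        frac_background = (in_background[kmer] + 1) / (len(background) + 1)
        ratio = frac_cluster / frac_background
        if ratio >= min_ratio:
            hits.append((kmer, ratio))
    return sorted(hits, key=lambda item: -item[1])

# Toy promoters, invented for illustration only.
cluster = ["AATATAGG", "CCTATACC", "GTATAGTT"]
background = ["ACGTACGT", "GGGGCCCC", "TTGACCAA", "CAGTCAGT"]
ranking = enriched_kmers(cluster, background, k=4)
```

Because the clustering is fixed before the motif search starts, an error in step (1) cannot be corrected by step (2).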

6 Shortcomings of the conventional algorithm

7 Microarray data Model Promoter sequences

10 Segal's unifying probabilistic model

11 Microarray data Model Promoter sequences

14 Segal, Yelensky, Koller (2003) Bioinformatics 19

15 Segal, Yelensky, Koller (2003) Bioinformatics 19. Revision: Motif finding

16 Motif: TAT A G T G A A T T T A T A G A

25 Position Specific Scoring Matrix (PSSM)
Search for a motif of length $W$ in binding sequences.
$W \times 4$ matrix $\psi_k(l)$: probability that the nucleotide in the $k$th position, $k \in [1, \ldots, W]$, is an $l \in \{A, C, G, T\}$.
Background model for non-binding sequences: 4-dim vector $\theta_0(l)$: probability of nucleotide $l$; this distribution is position-independent.
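One possible in-memory representation of the two parameter sets, as a sketch. The numerical values, the motif length W = 3 and the helper `is_distribution` are invented for illustration and do not come from the slides.

```python
ALPHABET = "ACGT"

# Hypothetical PSSM psi[k][l] for a motif of length W = 3 (toy values).
psi = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # position k = 1
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},  # position k = 2
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},  # position k = 3
]

# Position-independent background distribution theta0[l] (uniform here).
theta0 = {l: 0.25 for l in ALPHABET}

def is_distribution(p, tol=1e-9):
    """Check that p assigns non-negative probabilities summing to 1 over ACGT."""
    return (set(p) == set(ALPHABET)
            and all(v >= 0.0 for v in p.values())
            and abs(sum(p.values()) - 1.0) < tol)

W = len(psi)  # each of the W rows must itself be a distribution
```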

28 Sequence $S_1, S_2, \ldots, S_N$
Non-binding sequence: R=0
$$P(S_1, S_2, \ldots, S_N \mid R = 0) = \prod_{t=1}^{N} \theta_0(S_t)$$
Binding sequence: R=1, motif starting at position m+1
$$P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m+1) = \prod_{t=1}^{m} \theta_0(S_t) \prod_{k=1}^{W} \psi_k(S_{m+k}) \prod_{t=m+W+1}^{N} \theta_0(S_t) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$
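The fixed-start likelihood can be written either as background, motif, background, or as the full background product times a likelihood ratio over the motif window. A small sketch that computes both forms; the parameter values and function names are illustrative assumptions, and `m` is the 0-based offset so that the motif covers `seq[m:m+W]`.

```python
import math

ALPHABET = "ACGT"
# Hypothetical toy parameters (invented for illustration), motif length W = 3.
psi = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
theta0 = {l: 0.25 for l in ALPHABET}
W = len(psi)

def lik_background(seq):
    """P(S | R=0): every position drawn from the background model."""
    return math.prod(theta0[s] for s in seq)

def lik_binding_at(seq, m):
    """P(S | R=1, start=m+1) as background * motif * background."""
    left = math.prod(theta0[s] for s in seq[:m])
    motif = math.prod(psi[k][seq[m + k]] for k in range(W))
    right = math.prod(theta0[s] for s in seq[m + W:])
    return left * motif * right

def lik_binding_at_ratio(seq, m):
    """Same quantity as the background likelihood times a ratio over the window."""
    ratio = math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]] for k in range(W))
    return lik_background(seq) * ratio
```

The ratio form is the one carried forward, because the background product cancels when Bayes' rule is applied.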

31 Binding sequence: R=1, motif starting at position m+1
$$P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m+1) = \prod_{t=1}^{N} \theta_0(S_t) \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$
Binding sequence: R=1, motif starting anywhere
$$P(S_1, S_2, \ldots, S_N \mid R = 1) = \sum_{m=0}^{N-W} P(\text{start} = m+1)\, P(S_1, S_2, \ldots, S_N \mid R = 1, \text{start} = m+1) = \prod_{t=1}^{N} \theta_0(S_t)\, \frac{1}{N-W+1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})}$$
Objective: Prediction of binding activity from sequence: $P(R = 1 \mid S_1, S_2, \ldots, S_N)$
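A sketch of the marginalisation over start positions, assuming the uniform prior P(start = m+1) = 1/(N-W+1) that the formula implies. Parameter values and function names are illustrative assumptions. The helper `total_probability` enumerates all sequences of a given length to check that the result is still a proper distribution.

```python
import itertools
import math

ALPHABET = "ACGT"
# Hypothetical toy parameters (invented for illustration), motif length W = 3.
psi = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
theta0 = {l: 0.25 for l in ALPHABET}
W = len(psi)

def lik_binding(seq):
    """P(S | R=1): average over the N-W+1 equally likely start positions."""
    n_starts = len(seq) - W + 1
    background = math.prod(theta0[s] for s in seq)
    ratios = [
        math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]] for k in range(W))
        for m in range(n_starts)
    ]
    return background * sum(ratios) / n_starts

def total_probability(n):
    """Sum P(S | R=1) over all 4**n sequences of length n (should be 1)."""
    return sum(lik_binding("".join(s)) for s in itertools.product(ALPHABET, repeat=n))
```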

33 Apply Bayes' rule:
$$P(R = 1 \mid S_1, S_2, \ldots, S_N) = \frac{P(S_1, S_2, \ldots, S_N \mid R = 1)\, P(R = 1)}{P(S_1, S_2, \ldots, S_N \mid R = 0)\, P(R = 0) + P(S_1, S_2, \ldots, S_N \mid R = 1)\, P(R = 1)}$$
$$= \left( 1 + \frac{P(R = 0)\, P(S_1, S_2, \ldots, S_N \mid R = 0)}{P(R = 1)\, P(S_1, S_2, \ldots, S_N \mid R = 1)} \right)^{-1}$$
$$= \left( 1 + \left[ \frac{P(R = 1)}{P(R = 0)}\, \frac{1}{N-W+1} \sum_{m=0}^{N-W} \prod_{k=1}^{W} \frac{\psi_k(S_{m+k})}{\theta_0(S_{m+k})} \right]^{-1} \right)^{-1}$$
Define: $w_k(l) = \log \frac{\psi_k(l)}{\theta_0(l)}$, $w_0 = \log \frac{P(R=1)}{P(R=0)}$, $\text{logit}(z) = \frac{1}{1 + \exp(-z)}$

34 $$P(R = 1 \mid S_1, S_2, \ldots, S_N) = \text{logit}\left( w_0 + \log\left[ \frac{1}{N-W+1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k(S_{m+k}) \right) \right] \right)$$
$4W + 1$ parameters: $w_k(l)$, $w_0$
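A sketch that evaluates the posterior in two ways: directly from Bayes' rule, and through the weights w_k(l) and w_0 with a log-sum-exp over start positions. Parameter values, the prior of 0.1 and all function names are illustrative assumptions. The function the slides call logit is implemented here as `logistic`.

```python
import math

ALPHABET = "ACGT"
# Hypothetical toy parameters (invented for illustration), motif length W = 3.
psi = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
theta0 = {l: 0.25 for l in ALPHABET}
W = len(psi)

def logistic(z):
    """1 / (1 + exp(-z)), the transfer function named logit on the slides."""
    return 1.0 / (1.0 + math.exp(-z))

def posterior_direct(seq, prior):
    """P(R=1 | S) from the two sequence likelihoods and the prior P(R=1)."""
    n_starts = len(seq) - W + 1
    p0 = math.prod(theta0[s] for s in seq)
    ratios = [math.prod(psi[k][seq[m + k]] / theta0[seq[m + k]] for k in range(W))
              for m in range(n_starts)]
    p1 = p0 * sum(ratios) / n_starts
    return p1 * prior / (p0 * (1.0 - prior) + p1 * prior)

def posterior_logit(seq, prior):
    """Same posterior computed from w_0 and w_k(l) in log space."""
    n_starts = len(seq) - W + 1
    w0 = math.log(prior / (1.0 - prior))
    w = [{l: math.log(psi[k][l] / theta0[l]) for l in ALPHABET} for k in range(W)]
    # One additive score per start position m.
    scores = [sum(w[k][seq[m + k]] for k in range(W)) for m in range(n_starts)]
    # Numerically stable log of the sum of exponentiated scores.
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return logistic(w0 - math.log(n_starts) + log_sum)
```

Working with the additive scores avoids multiplying many small probabilities, which matters for promoter sequences of realistic length.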

42 [Figure: the motif is scanned along the sequence, giving Score 1, Score 2, ..., Score t, ..., Score N; the scores are summed (+) and passed through a nonlinear transfer function to give P(R=1 | sequence).]

44 Wolfgang Lehrach, Biomathematics & Statistics Scotland. Ab initio prediction of protein interaction

45 SH3 yeast two-hybrid interaction network. Tong et al. (2002), Science 295. Interactions between 28 SH3 proteins and 143 binding peptides. 9 binding partners per SH3 domain on average.

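47 [Figure: Final test set performance. ROC curves of true positive rate (sensitivity) against false positive rate (1 - specificity) for the methods Reiss, None, Naive, Gaussian, Laplacian with pruning and Laplacian. Legend as extracted: Reiss 062 None 064 Naive 069 Gaussian 071 Laplacian with pruning 073 Laplacian.]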

48 The model of Segal, Yelensky and Koller Bioinformatics 19, 2003

54 [Figure: graphical model for a gene $g$. Promoter sequence variables $g.S_1, g.S_2, \ldots, g.S_N$; regulator variables $g.R_1, g.R_2$ with $P(g.R_2 \mid g.S)$; module variable $g.M$; expression variables $g.E_1, g.E_2, g.E_3$ with $P(g.E_3 \mid g.M)$.]

55 $$P(g.R_i = 1 \mid g.S_1, g.S_2, \ldots, g.S_N) = \text{logit}\left( w_0^i + \log\left[ \frac{1}{N-W+1} \sum_{m=0}^{N-W} \exp\left( \sum_{k=1}^{W} w_k^i(g.S_{m+k}) \right) \right] \right)$$
$4W + 1$ parameters per binding motif $R_i$: $w_k^i(l)$, $w_0^i$

56 [Figure: graphical model; the module variable $g.M$ depends on the regulator variables $g.R_1, g.R_2$.]

57 Softmax function
$$P(g.M = m \mid g.R_1 = r_1, g.R_2 = r_2, \ldots, g.R_L = r_L) = \frac{\exp\left( \sum_{i=1}^{L} u_{mi} r_i \right)}{\sum_{m'} \exp\left( \sum_{i=1}^{L} u_{m'i} r_i \right)}$$
Parameter matrix $u$: number of motifs/regulators $\times$ number of modules
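A minimal sketch of the softmax module assignment. The weight values, the layout `u[m][i]` (one row per module, one column per regulator) and the function name are illustrative assumptions.

```python
import math

def softmax_module_probs(u, r):
    """Return P(M = m | R = r) for every module m.

    u[m][i] is the weight linking regulator i to module m,
    r[i] is 1 if regulator i binds the gene's promoter and 0 otherwise.
    """
    scores = [sum(u_mi * r_i for u_mi, r_i in zip(row, r)) for row in u]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy weights for 3 modules and 2 regulators (invented for illustration).
u = [
    [2.0, 0.0],  # module 0 favours regulator 0
    [0.0, 2.0],  # module 1 favours regulator 1
    [0.0, 0.0],  # module 2 is indifferent
]
probs = softmax_module_probs(u, [1, 0])  # only regulator 0 binds
```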

58 [Figure: graphical model; the expression variables $g.E_1, g.E_2, g.E_3$ depend on the module variable $g.M$ through $P(g.E_3 \mid g.M)$.]

59 Independent Gaussian distributions
$$P(g.E_1, g.E_2, \ldots, g.E_L \mid g.M = m) = \prod_{j} P(g.E_j \mid g.M = m)$$
$$P(g.E_j \mid g.M = m) = N(\mu_{j,m}, \sigma_{j,m})$$
For each module $m$ and each condition $j$: mean $\mu_{j,m}$, standard deviation $\sigma_{j,m}$
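A sketch of the expression likelihood as a sum of independent Gaussian log-densities. The toy means and standard deviations, the layout `mus[m][j]` and the helper `most_likely_module` (which ignores the regulator-dependent prior over modules) are illustrative assumptions.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_lik_expression(e, mus_m, sigmas_m):
    """log P(E_1..E_L | M = m): conditions are independent given the module."""
    return sum(log_normal_pdf(x, mu, sd) for x, mu, sd in zip(e, mus_m, sigmas_m))

def most_likely_module(e, mus, sigmas):
    """Module index maximising the expression likelihood alone."""
    return max(range(len(mus)), key=lambda m: log_lik_expression(e, mus[m], sigmas[m]))

# Toy parameters for 2 modules and 2 conditions (invented for illustration).
mus = [[0.0, 0.0], [2.0, 2.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
best = most_likely_module([1.9, 2.2], mus, sigmas)
```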

60 Parameter estimation

62 [Figure: the full graphical model, with $P(g.R_2 \mid g.S)$ and $P(g.E_3 \mid g.M)$ marked.]

65 Bayesian approach
$$P(\text{parameters} \mid \text{data}) = \sum_{\text{latent variables}} P(\text{parameters}, \text{latent variables} \mid \text{data})$$
Intractable!
Gibbs sampling:
parameters $\sim P(\text{parameters} \mid \text{latent variables}, \text{data})$
latent variables $\sim P(\text{latent variables} \mid \text{parameters}, \text{data})$

70 [Figure: Gibbs sampling on a joint distribution $P(x, y)$ with axes $x$ and $y$, sampling in turn from the conditionals $P(x \mid y)$ and $P(y \mid x)$.]
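A toy illustration of Gibbs sampling on two variables x and y. It assumes, purely for illustration, that P(x, y) is a standard bivariate normal with correlation `rho`, so that both conditionals are one-dimensional Gaussians; this is not the regulatory model itself, and the function names are hypothetical.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, seed=0):
    """Alternate draws from P(x | y) and P(y | x) for a standard bivariate normal."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(rho * y, sd)  # x ~ P(x | y) = N(rho * y, 1 - rho^2)
        y = rng.gauss(rho * x, sd)  # y ~ P(y | x) = N(rho * x, 1 - rho^2)
        samples.append((x, y))
    return samples

def correlation(samples):
    """Sample correlation coefficient of the (x, y) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
```

Only the conditionals are ever sampled, yet the pairs are distributed according to the joint once the chain has mixed.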

71 Still too expensive. Find one good set of parameters rather than a whole sample from the posterior distribution. Hard-assignment EM algorithm. Various heuristic simplifications. See Bioinformatics 19, 2003 for details.
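A generic sketch of hard-assignment EM on a deliberately simple stand-in problem: one-dimensional data, a fixed number of Gaussian components with equal variance, and only the means estimated. It is not the algorithm of Segal et al., whose heuristics are described in the paper; the function name and the toy data are assumptions.

```python
def hard_em_means(data, init_means, n_iter=50):
    """Alternate hard assignments (E-step) and mean updates (M-step)."""
    means = list(init_means)

    def assign(current_means):
        # E-step (hard): each point goes to its single most likely component,
        # which for equal variances is the component with the nearest mean.
        return [min(range(len(current_means)), key=lambda m: (x - current_means[m]) ** 2)
                for x in data]

    for _ in range(n_iter):
        labels = assign(means)
        # M-step: re-estimate each mean from the points assigned to it.
        new_means = []
        for m in range(len(means)):
            members = [x for x, lab in zip(data, labels) if lab == m]
            new_means.append(sum(members) / len(members) if members else means[m])
        if new_means == means:  # assignments can no longer change
            break
        means = new_means
    return means, assign(means)

# Toy data with two well-separated groups (invented for illustration).
data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
means, labels = hard_em_means(data, init_means=[-1.0, 1.0])
```

In contrast to Gibbs sampling, this returns a single parameter estimate and a single assignment rather than a sample from the posterior.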

76 [Figure: the graphical model ($g.S_1, g.S_2, \ldots, g.S_N$; $g.R_1, g.R_2$; $g.M$; $g.E_1, g.E_2, g.E_3$) during the hard-assignment EM algorithm, alternating between the E-step and the M-step.]

77 Segal, Yelensky, Koller (2003) Bioinformatics 19. Saccharomyces cerevisiae

78 From Segal et al, Bioinformatics 2003

79 Experiment 1: microarrays measuring responses to various stress conditions (Gasch et al. 2000). Conventional algorithms: 20% of the predicted motifs are known. Unified probabilistic model: 45% of the predicted motifs are known.

80 Experiment 2: 77 microarrays, expression during the cell cycle (Spellman et al. 1998). Conventional algorithms: 30% of the predicted motifs are known. Unified probabilistic model: 56% of the predicted motifs are known.

81 From Segal et al, Bioinformatics 2003
