Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon


1 Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Department of Mathematics and Statistics, University of Massachusetts Amherst. Statistics in Functional Genomics Workshop, Ascona, Switzerland, June 30, 2004.


2 Motif Discovery. Identify short patterns in DNA sequence. These patterns play a role in the control of gene expression. Finding the sites will help to: develop disease treatments; understand disease susceptibility.

3 Motif Discovery
Whole genome sequence:
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTC
TCATCTTCACATCGCATCACCAGTTCAGGATAGACACGG
ACGGCCTCGATTGACGGTGGTACAATTTACCGATGGCTG
CACTATGCCCTATCGATCGACCTCTCATGCTTCACATCG
CATCACCAGTTCAGGATAGACACGGTCACATCGCATCAC
Microarray information. Regulatory sequence upstream from genes:
GATGGCTGCACCTCATCGTATGCCCTACGACCTCTCGC
CACATCGCATCTCATCGACCAGTTCAGACACGGACGGC
GCCTCGCTCATCGGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATCTCATCGACCCACATCGAGAGCG
CGCTAGCCCTCATCGGATCTTGTTCGAGAATTGCCTAT


4 Gene Expression Control. [Figure labels: transcription factor; binding site CTCATCG; upstream DNA sequence; gene; gene expression.]

5 Transcription Factor Binding Sites. Upstream sequence of co-expressed genes (each sequence ends at the transcription start):
GATGGGGGCTCATCGACGTGTATGC...ACGATGTCTC Gene 1
CACACCCCCTCTCATCGCGTCCCTT...CGCCCCCCCG Gene 2
GCCTCCTCATCGGTGGTACTCCAGT...TACATGACTA Gene 3
TCTCATGCTCATCGCATCACGTGTA...GCAATGAGAG
CGCCTCATCGTGGATCTTGCGAATT...AGAATGGCCT Gene 100


6 1) Motif Matrix 2) Sequence Logo 3) Consensus Sequence CTCATCG
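The relationship between a motif matrix and a consensus sequence can be illustrated with a minimal sketch. The aligned sites below are hypothetical (they are not data from the talk), and the function names are invented for this illustration; a sequence logo would be drawn from the same per-position frequencies.

```python
from collections import Counter

# Hypothetical aligned binding sites of equal width (illustration only).
sites = ["CTCATCG", "CTCATCG", "CTCATGG", "CTAATCG"]

def count_matrix(sites):
    """Return one Counter of base counts per motif position."""
    width = len(sites[0])
    return [Counter(site[i] for site in sites) for i in range(width)]

def frequency_matrix(counts, pseudocount=0.0):
    """Convert per-position counts to base frequencies (optionally smoothed)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({b: (column[b] + pseudocount) / total for b in "ACGT"})
    return matrix

def consensus(counts):
    """Take the most frequent base at each position."""
    return "".join(column.most_common(1)[0][0] for column in counts)

counts = count_matrix(sites)
freqs = frequency_matrix(counts)
print(consensus(counts))
for position, column in enumerate(freqs, start=1):
    print(position, column)
```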

7 MDscan Motif Finding Algorithm. Uses the 100 most highly expressed genes to find 30 candidate motifs for each width in [5,15]. Confirms the motifs using the 500 most highly expressed genes. Repeats the procedure for the lowest expressed genes.

8 Motifs Correlated with Expression. Goal: relate global gene expression to motif matrices. For each motif: calculate a sequence score for each gene (score ≈ number of copies of the motif in each gene's upstream sequence); regress gene expression on the motif scores; determine the significant motifs.
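As a rough illustration of a copy-count score, the sketch below counts exact, possibly overlapping matches of a consensus word on both strands of an upstream sequence. This is a simplifying assumption: the slide only says the score approximates the number of motif copies, and a matrix-based score would weight imperfect matches as well. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_occurrences(seq, word):
    """Count exact matches of word in seq, allowing overlaps."""
    w = len(word)
    return sum(1 for i in range(len(seq) - w + 1) if seq[i:i + w] == word)

def sequence_score(upstream, motif):
    """Copies of motif on the forward strand plus copies on the reverse strand."""
    score = count_occurrences(upstream, motif)
    rc = reverse_complement(motif)
    if rc != motif:  # avoid double counting palindromic motifs
        score += count_occurrences(upstream, rc)
    return score

# Hypothetical upstream fragment with one forward and one reverse-strand copy.
print(sequence_score("AACTCATCGTTCGATGAGAA", "CTCATCG"))
```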


9 Single Motif Regression. [Plot: expression versus sequence score (number of motif copies).]

10 Linear Regression Model. For each motif $m$: $Y_g = \alpha + \beta_m S_{mg} + e_g$, where $Y_g$ = log2-ratio of expression of gene $g$, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $e_g$ = error.
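A single-motif fit of this form can be sketched with ordinary least squares. The data are made up, the function name is hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only when the number of genes is large (as with thousands of yeast genes).

```python
import math
from statistics import NormalDist, mean

def fit_single_motif(scores, log_ratios):
    """Least-squares fit of Y_g = alpha + beta * S_mg + e_g for one motif."""
    n = len(scores)
    s_bar, y_bar = mean(scores), mean(log_ratios)
    sxx = sum((s - s_bar) ** 2 for s in scores)
    sxy = sum((s - s_bar) * (y - y_bar) for s, y in zip(scores, log_ratios))
    beta = sxy / sxx
    alpha = y_bar - beta * s_bar
    rss = sum((y - alpha - beta * s) ** 2 for s, y in zip(scores, log_ratios))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se_beta if se_beta > 0 else math.copysign(math.inf, beta)
    # Two-sided p-value under a normal approximation (assumption, see lead-in).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return alpha, beta, t_stat, p_value

# Hypothetical sequence scores and log2 expression ratios for eight genes.
scores = [0, 0, 1, 1, 2, 2, 3, 3]
log_ratios = [0.1, -0.1, 0.6, 0.4, 1.1, 0.9, 1.6, 1.4]
alpha, beta, t_stat, p_value = fit_single_motif(scores, log_ratios)
print(alpha, beta, t_stat, p_value)
```

Ranking motifs by this p-value, one fit per motif, corresponds to the "determine significant motifs" step on the previous slide.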

11 Over-expression of a Transcription Factor Rox1p is a transcription factor in yeast that binds to the 10-mer: TCTATTGTTT (from SCPD database of transcription factor binding sites)

12 Rox1p Over-expression. Yeast expression data for Rox1p over-expression, covering 5,838 genes. 800 base pairs of upstream sequence for each gene. Use the most repressed genes to find and refine 330 candidate motifs of width [5,15]. Regress against global gene expression to calculate p-values and rank the motifs.


13 Overexpressing a Transcription Factor Known binding site: TCTATTGTTT

14 Comparison to Other Motif-Finding Algorithms. Statistically based algorithms: 1) AlignACE (Roth et al. 1998): Gibbs sampling approach. 2) MEME (Grundy et al. 1996): expectation maximization (EM). Both use iterative procedures to update random initial probability matrices. Drawback: they may be trapped in local maxima.


15 Over-expressing a Transcription Factor Known binding site: TCTATTGTTT

16 Combinatorial Effects of Motifs. Identify motifs that work together to control gene expression. Method: MDscan generates 660 motifs of width [5,15], covering motifs that enhance and motifs that inhibit expression. Remove non-significant motifs. Use stepwise regression to determine the final additive model.


17 Multiple Regression Model to Determine Motifs Working Together. $Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + e_g$, where $Y_g$ = log2-ratio of expression, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $M$ = subset of significant motifs, $e_g$ = error.
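One way to build such an additive model is forward stepwise selection. The sketch below is an illustration under stated assumptions, not the procedure used in the talk: the entry rule (a partial F statistic above a fixed threshold), the function names, and the data are all hypothetical, and the tiny solver assumes the motif score columns are not collinear.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus columns; return (coefs, rss)."""
    design = [[1.0] + [col[g] for col in columns] for g in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yg for row, yg in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    rss = sum((yg - sum(c * v for c, v in zip(coefs, row))) ** 2
              for row, yg in zip(design, y))
    return coefs, rss

def forward_stepwise(scores, y, f_threshold=4.0):
    """Add the motif that most reduces RSS while its partial F exceeds the threshold."""
    selected = []
    _, rss = fit_rss([], y)
    while len(selected) < len(scores):
        best = None
        for motif in scores:
            if motif in selected:
                continue
            _, new_rss = fit_rss([scores[m] for m in selected + [motif]], y)
            if best is None or new_rss < best[1]:
                best = (motif, new_rss)
        motif, new_rss = best
        df = len(y) - len(selected) - 2  # genes minus fitted parameters
        f_stat = (rss - new_rss) / (new_rss / df) if new_rss > 0 else float("inf")
        if f_stat < f_threshold:
            break
        selected.append(motif)
        rss = new_rss
    coefs, _ = fit_rss([scores[m] for m in selected], y)
    return selected, coefs

# Hypothetical scores for three motifs over eight genes; y depends on A and B only.
scores = {"A": [0, 1, 2, 3, 0, 1, 2, 3],
          "B": [0, 0, 0, 0, 1, 1, 1, 1],
          "C": [1, 0, 1, 0, 0, 1, 0, 1]}
y = [0.05, 1.05, 1.95, 2.95, -2.05, -0.95, 0.05, 0.95]
selected, coefs = forward_stepwise(scores, y)
print(selected, coefs)
```

In this toy example motif A enters with a positive coefficient and motif B with a negative one, mirroring the enhancing and inhibiting motifs described on the previous slide, while the uninformative motif C is left out.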


18 Yeast Amino Acid Starvation Experiment. Expression for 5,970 genes. Find motifs both enhancing and inhibiting expression. 235 significant motifs. Stepwise regression yields 25 final motifs.

19 Multiple Motifs Influencing Expression

20 Known Motifs. Positive coefficients: STRE, URS1 (respond to stress); PHO4, MET4 (nutrient scavenging); GCN4 (amino acid production). Negative coefficients: M3A, M3B, RAP1 (slow cell growth).

21 Motifs Influencing Expression over Time. Yeast cell cycle information (Spellman et al. 1998): 2 cell cycles, 18 time points, 7-minute intervals. Examine expression patterns over time.

22 Time Series Expression. Use Motif Regressor to find multiple motifs at each time point: 273 motifs in total. Each motif is regressed with the expression at all other 17 time points.

23 Motif: ACGCGTCGCG. [Plot by cell cycle phase over two cycles: M/G1, G1, S, G2, M.]

24 Motif: GCTCATCGC. [Plot by cell cycle phase over two cycles: M/G1, G1, S, G2, M.]

25 Motif Clustering. Method: hierarchically cluster the motif patterns using Euclidean distance; 20 clusters. Plot the average coefficients for each cluster.
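The clustering step can be sketched as agglomerative clustering of per-motif coefficient profiles with Euclidean distance, followed by averaging within each cluster. The slide does not state the linkage rule, so average linkage is an assumption here, and the profiles and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two coefficient profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_clusters(profiles, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of profile indices."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

def cluster_average(profiles, members):
    """Average coefficient at each time point over the motifs in one cluster."""
    n_points = len(profiles[0])
    return [sum(profiles[i][t] for i in members) / len(members)
            for t in range(n_points)]

# Hypothetical regression coefficients for four motifs at four time points.
profiles = [[1.0, 0.0, -1.0, 0.0],
            [0.9, 0.1, -1.1, 0.0],
            [-1.0, 0.0, 1.0, 0.0],
            [-0.8, 0.1, 0.9, -0.1]]
clusters = hierarchical_clusters(profiles, n_clusters=2)
for members in clusters:
    print(members, cluster_average(profiles, members))
```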

26 Cluster 1: Known Motif SCB (6 motifs). [Plot: regression coefficient by cell cycle phase over two cycles: M/G1, G1, S, G2, M.]

27 Known Cell Cycle Motifs. [Plots: regression coefficient versus cell cycle time points, by phase over two cycles: M/G1, G1, S, G2, M.]

28 Other Cell Cycle Motifs. [Plots: regression coefficient versus cell cycle time points, by phase over two cycles: M/G1, G1, S, G2, M.]

29 Non Cell Cycle Motifs. [Plots: regression coefficient versus cell cycle time points, by phase over two cycles: M/G1, G1, S, G2, M.]

30 Simulation Study. Randomly assign yeast cell cycle expression to the 5,838 genes. Use MDscan to find candidate motifs. Use simple linear regression to determine the p-values of the motifs. Repeat 100 times to generate 40,324 motifs.
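The random-assignment idea can be sketched as a permutation of expression values across genes followed by the same simple regression. This sketch omits the MDscan step entirely and reuses one fixed, hypothetical motif score vector, so it only illustrates how shuffling produces a null reference for the regression slope; the function names and data are assumptions.

```python
import random

def slope(scores, y):
    """Least-squares slope of y on scores."""
    n = len(y)
    s_bar = sum(scores) / n
    y_bar = sum(y) / n
    sxx = sum((s - s_bar) ** 2 for s in scores)
    sxy = sum((s - s_bar) * (v - y_bar) for s, v in zip(scores, y))
    return sxy / sxx

def null_slopes(scores, expression, n_repeats=100, seed=0):
    """Slopes obtained after randomly reassigning expression values to genes."""
    rng = random.Random(seed)
    shuffled = list(expression)
    results = []
    for _ in range(n_repeats):
        rng.shuffle(shuffled)
        results.append(slope(scores, shuffled))
    return results

# Hypothetical motif scores and expression values for twenty genes.
scores = [g % 4 for g in range(20)]
expression = [0.5 * s for s in scores]
observed = slope(scores, expression)
null = null_slopes(scores, expression, n_repeats=100)
# Fraction of shuffles with a slope at least as extreme as the observed one.
empirical_p = sum(abs(b) >= abs(observed) for b in null) / len(null)
print(observed, empirical_p)
```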

31 Simulation Results. [Figure panels: Motifs From Real Sequences; Motifs From Random Sequences.]

32 Summary. Microarray and sequence information are combined to find transcription factor binding sites. Stepwise regression identifies motifs working together to control expression. We find known motifs and new putative motifs in single experiments and in time course experiments.

33 Acknowledgements. X. Shirley Liu and Jun Liu, Departments of Biostatistics and Statistics, Harvard University. Jason Lieb, Department of Biology, University of North Carolina. This work was partially supported by NIH National Library of Medicine grant 1F37LM

34 Reference Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. (2003) Integrating regulatory motif discovery and genomewide expression analysis. Proc Natl Acad Sci USA 100:
