Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Department of Mathematics and Statistics, University of Massachusetts Amherst. Statistics in Functional Genomics Workshop, Ascona, Switzerland, June 30, 2004
Motif Discovery: Identify short patterns in DNA sequence. These patterns play a role in the control of gene expression. Finding sites will help to develop disease treatments and to understand disease susceptibility.
Motif Discovery. Whole genome sequence: ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTC TCATCTTCACATCGCATCACCAGTTCAGGATAGACACGG ACGGCCTCGATTGACGGTGGTACAATTTACCGATGGCTG CACTATGCCCTATCGATCGACCTCTCATGCTTCACATCG CATCACCAGTTCAGGATAGACACGGTCACATCGCATCAC. Microarray information. Regulatory sequence upstream from genes: GATGGCTGCACCTCATCGTATGCCCTACGACCTCTCGC CACATCGCATCTCATCGACCAGTTCAGACACGGACGGC GCCTCGCTCATCGGTGGTACAGTTCAAACCTGACTAAA TCTCGTTAGGACCATCTCATCGACCCACATCGAGAGCG CGCTAGCCCTCATCGGATCTTGTTCGAGAATTGCCTAT
Gene Expression Control [Diagram: a transcription factor bound at the site CTCATCG in the upstream DNA sequence of a gene, controlling gene expression]
Transcription Factor Binding Sites: upstream sequences of co-expressed genes
Gene 1: GATGGGGGCTCATCGACGTGTATGC...ACGATGTCTC
Gene 2: CACACCCCCTCTCATCGCGTCCCTT...CGCCCCCCCG
Gene 3: GCCTCCTCATCGGTGGTACTCCAGT...TACATGACTA
(unlabeled): TCTCATGCTCATCGCATCACGTGTA...GCAATGAGAG
Gene 100: CGCCTCATCGTGGATCTTGCGAATT...AGAATGGCCT
The label "Transcription Start" marks the right-hand end of the sequences.
1) Motif Matrix 2) Sequence Logo 3) Consensus Sequence CTCATCG
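The three representations listed above can be derived from one set of aligned binding sites. The following is a minimal sketch, assuming equal-length sites that are already aligned; the function names `motif_matrix` and `consensus` and the example sites are hypothetical and are not taken from the talk's data.

```python
from collections import Counter


def motif_matrix(sites):
    """Return per-position base frequencies for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({base: counts[base] / len(sites) for base in "ACGT"})
    return matrix


def consensus(matrix):
    """Pick the most frequent base at each position of the matrix."""
    return "".join(max("ACGT", key=lambda b: column[b]) for column in matrix)


# Hypothetical aligned sites, made up for illustration only.
sites = ["CTCATCG", "CTCATCG", "CTCATGG", "CTGATCG"]
matrix = motif_matrix(sites)
print(consensus(matrix))  # CTCATCG
```

A sequence logo is a graphical rendering of the same matrix, so it is not reproduced here.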
MDscan Motif Finding Algorithm: Uses the 100 most highly expressed genes to find 30 candidate motifs for each width in [5,15]. Confirms the motifs using the 500 most highly expressed genes. Repeats the procedure for the lowest expressed genes.
Motifs Correlated with Expression. Goal: relate global gene expression to motif matrices. For each motif: calculate a sequence score for each gene, where the score reflects the number of copies of the motif in each gene's upstream sequence; regress gene expression on the motif scores; determine the significant motifs.
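As an illustration of the sequence score, the sketch below counts exact copies of a motif's consensus in an upstream sequence. This is a deliberate simplification: it assumes exact matching against a consensus string rather than scoring against the motif matrix, and searching both strands is also an assumption. The names `reverse_complement` and `count_copies` and the example sequence are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def count_copies(upstream, motif, both_strands=True):
    """Count exact, possibly overlapping, matches of motif in upstream."""
    targets = {motif}
    if both_strands:
        # Assumption: a site on the opposite strand also counts as a copy.
        targets.add(reverse_complement(motif))
    width = len(motif)
    return sum(
        upstream[i:i + width] in targets
        for i in range(len(upstream) - width + 1)
    )


# Made-up upstream sequence with one forward and one reverse-strand copy.
upstream = "GATGGCTCATCGACGCGATGAGTT"
print(count_copies(upstream, "CTCATCG"))  # 2
print(count_copies(upstream, "CTCATCG", both_strands=False))  # 1
```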
Single Motif Regression [Figure: expression plotted against sequence score (# motif copies)]
Linear Regression Model. For each motif $m$: $Y_g = \alpha + \beta_m S_{mg} + \epsilon_g$, where $Y_g$ = $\log_2$-ratio of expression for gene $g$, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $\epsilon_g$ = error.
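The single-motif model can be fitted by ordinary least squares, one motif at a time. The sketch below is an illustration only: the function name `simple_regression` and the toy data are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only when there are many genes.

```python
import math
from statistics import NormalDist, fmean


def simple_regression(scores, expression):
    """Fit expression = alpha + beta * score; return (alpha, beta, p_value)."""
    n = len(scores)
    s_bar, y_bar = fmean(scores), fmean(expression)
    sxx = sum((s - s_bar) ** 2 for s in scores)
    sxy = sum((s - s_bar) * (y - y_bar) for s, y in zip(scores, expression))
    beta = sxy / sxx
    alpha = y_bar - beta * s_bar
    rss = sum((y - alpha - beta * s) ** 2 for s, y in zip(scores, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0.0:
        return alpha, beta, 0.0  # perfect fit: no residual variation
    z = beta / std_err
    # Assumption: two-sided p-value from the standard normal, not the t distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return alpha, beta, p_value


# Toy data: expression = 1 + 0.5 * score plus small noise (made up).
scores = [0, 1, 2, 3, 0, 1, 2, 3]
noise = [0.1, -0.1, -0.1, 0.1, -0.1, 0.1, 0.1, -0.1]
expression = [1 + 0.5 * s + e for s, e in zip(scores, noise)]
alpha, beta, p_value = simple_regression(scores, expression)
print(round(alpha, 3), round(beta, 3), p_value < 0.05)
```

Ranking motifs by the returned p-value gives one ordering of candidates; with only a handful of genes, an exact t-based p-value would be the safer choice.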
Over-expression of a Transcription Factor: Rox1p is a transcription factor in yeast that binds to the 10-mer TCTATTGTTT (from the SCPD database of transcription factor binding sites).
Rox1p Over-expression: Yeast expression data for Rox1p over-expression for 5,838 genes. 800 base pairs of upstream sequence for each gene. Use the most repressed genes to find and refine 330 candidate motifs of width [5,15]. Regress global gene expression on each motif's sequence score to calculate p-values and rank the motifs.
Over-expressing a Transcription Factor Known binding site: TCTATTGTTT
Comparison to Other Motif-Finding Algorithms. Statistically-based algorithms: 1) AlignACE (Roth et al. 1998): Gibbs sampling approach; 2) MEME (Grundy et al. 1996): expectation maximization (EM). Both use iterative procedures to update random initial probability matrices. Drawback: they may become trapped in local maxima.
Over-expressing a Transcription Factor Known binding site: TCTATTGTTT
Combinatorial Effects of Motifs: Identify motifs that work together to control gene expression. Method: MDscan generates 660 motifs of width [5,15], covering motifs that enhance expression and motifs that inhibit it. Remove non-significant motifs. Use stepwise regression to determine the final additive model.
Multiple Regression Model to Determine Motifs Working Together: $Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + \epsilon_g$, where $Y_g$ = $\log_2$-ratio of expression, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $M$ = subset of significant motifs (the sum runs over this subset), $\epsilon_g$ = error.
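One way to arrive at an additive model of this kind is forward selection, sketched below. This is an assumption-laden illustration, not the procedure used in the talk: it only adds motifs (no removal step), it uses a partial F statistic with a hypothetical entry threshold `f_to_enter`, and it assumes the motif score columns are not collinear. All function names and the toy data are made up.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus the given columns; return the RSS."""
    design = [[1.0] + [col[g] for col in columns] for g in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yg for row, yg in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)
    return sum(
        (yg - sum(c * v for c, v in zip(coef, row))) ** 2
        for row, yg in zip(design, y)
    )


def forward_stepwise(scores, y, f_to_enter=4.0):
    """Greedily add the motif that most reduces the RSS while its partial F passes."""
    chosen = []
    rss = fit_rss([], y)
    while len(chosen) < len(scores):
        trials = {
            name: fit_rss([scores[m] for m in chosen + [name]], y)
            for name in scores if name not in chosen
        }
        best = min(trials, key=trials.get)
        df = len(y) - len(chosen) - 2  # residual degrees of freedom of the larger model
        if df <= 0:
            break
        new_rss = trials[best]
        f_stat = math.inf if new_rss <= 1e-12 else (rss - new_rss) / (new_rss / df)
        if f_stat < f_to_enter:
            break
        chosen.append(best)
        rss = new_rss
    return chosen


# Toy data: expression depends on motifs A and B but not on C (made up).
scores = {
    "A": [0, 1, 2, 3, 0, 1, 2, 3],
    "B": [0, 0, 1, 1, 2, 2, 3, 3],
    "C": [1, 0, 0, 1, 1, 0, 0, 1],
}
noise = [1, -1, -1, 1, -1, 1, 1, -1]
y = [2 * a - b + 0.1 * e for a, b, e in zip(scores["A"], scores["B"], noise)]
chosen = forward_stepwise(scores, y)
print(chosen)  # ['A', 'B']
```

The normal-equations solver keeps the example free of external dependencies; a numerical library would be the more robust choice for hundreds of candidate motifs.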
Yeast Amino Acid Starvation Experiment: Expression for 5,970 genes. Find motifs both enhancing and inhibiting expression. 235 significant motifs. Stepwise regression yields 25 final motifs.
Multiple Motifs Influencing Expression
Known Motifs. Positive coefficients: STRE, URS1 (respond to stress); PHO4, MET4 (nutrient scavenging); GCN4 (amino acid production). Negative coefficients: M3A, M3B, RAP1 (slow cell growth).
Motifs Influencing Expression over Time. Yeast cell cycle information (Spellman et al. 1998): 2 cell cycles, 18 time points, 7-minute intervals. Examine expression patterns over time.
Time Series Expression: Use Motif Regressor to find multiple motifs at each time point (273 motifs total). Each motif is then regressed against the expression at all other 17 time points.
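Motif: ACGCGTCGCG [Figure: "Phase Test" plot over two cell cycles; phase axis: M/G1, G1, S, G2, M, repeated for each cycle]
Motif: GCTCATCGC [Figure: "Phase Test" plot over two cell cycles; phase axis: M/G1, G1, S, G2, M, repeated for each cycle]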
Motif Clustering. Method: hierarchically cluster the motif patterns using Euclidean distance, giving 20 clusters. Plot the average coefficients for each cluster.
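The clustering step can be sketched as agglomerative clustering of per-motif coefficient profiles. The slide names only the Euclidean distance, so the average-linkage rule used below is an assumption, as are the function names and the four made-up profiles (a real run would use one profile per motif across the time points and stop at 20 clusters).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between members of two clusters (assumed linkage)."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain; return index lists."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        pairs = [
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        a, b = min(
            pairs,
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], profiles),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def cluster_means(clusters, profiles):
    """Average the profiles within each cluster, position by position."""
    width = len(profiles[0])
    return [
        [sum(profiles[i][t] for i in members) / len(members) for t in range(width)]
        for members in clusters
    ]


# Made-up regression-coefficient profiles for four motifs at three time points.
profiles = [
    [1.0, 0.0, -1.0],
    [0.9, 0.1, -1.1],
    [-1.0, 0.0, 1.0],
    [-0.8, 0.0, 1.2],
]
clusters = hierarchical_clusters(profiles, n_clusters=2)
means = cluster_means(clusters, profiles)
print(clusters)  # [[0, 1], [2, 3]]
```

The per-cluster means are the quantities that would be plotted against the time points.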
Cluster 1: Known Motif SCB (6 motifs) [Figure: regression coefficient by cell cycle phase (M/G1, G1, S, G2, M) over two cycles; panel label: Phase Test]
Known Cell Cycle Motifs [Figure: regression coefficient at the cell cycle time points, phases M/G1, G1, S, G2, M over two cycles; panel label: Phase Test]
Other Cell Cycle Motifs [Figure: regression coefficient at the cell cycle time points, phases M/G1, G1, S, G2, M over two cycles; panel label: Phase Test]
Non Cell Cycle Motifs [Figure: regression coefficient at the cell cycle time points, phases M/G1, G1, S, G2, M over two cycles; panel label: Phase Test]
Simulation Study: Randomly assign yeast cell cycle expression to 5,838 genes. Use MDscan to find candidate motifs. Use simple linear regression to determine p-values of motifs. Repeat 100 times to generate 40,324 motifs.
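The random-assignment idea can be illustrated with a much smaller sketch that shuffles expression values across genes and refits the slope for one fixed motif. This is a simplification under stated assumptions: it does not re-run MDscan on each shuffled data set, it records slopes rather than p-values, and the names `slope` and `null_slopes`, the seed, and the toy data are hypothetical.

```python
import random
from statistics import fmean


def slope(scores, expression):
    """Least-squares slope of expression on sequence score."""
    s_bar, y_bar = fmean(scores), fmean(expression)
    sxx = sum((s - s_bar) ** 2 for s in scores)
    sxy = sum((s - s_bar) * (y - y_bar) for s, y in zip(scores, expression))
    return sxy / sxx


def null_slopes(scores, expression, n_repeats=100, seed=0):
    """Slopes obtained after randomly reassigning expression values to genes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(expression)
    slopes = []
    for _ in range(n_repeats):
        rng.shuffle(shuffled)
        slopes.append(slope(scores, shuffled))
    return slopes


# Toy data for eight genes (made up).
scores = [0, 1, 2, 3, 0, 1, 2, 3]
expression = [0.1, 0.4, 1.1, 1.4, -0.1, 0.6, 0.9, 1.6]
observed = slope(scores, expression)
null = null_slopes(scores, expression)
# Fraction of shuffled data sets whose slope is at least as extreme as the observed one.
fraction = sum(abs(b) >= abs(observed) for b in null) / len(null)
print(round(observed, 3), fraction)
```

Comparing the observed statistic with the shuffled ones shows how often an association of that size arises when expression carries no information about the sequence.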
Simulation Results [Figure: two panels, "Motifs From Real Sequences" and "Motifs From Random Sequences"]
Summary: Microarray and sequence information are combined to find transcription factor binding sites. Stepwise regression identifies motifs working together to control expression. We find known motifs and new putative motifs in single experiments and in time course experiments.
Acknowledgements: X. Shirley Liu and Jun Liu, Departments of Biostatistics and Statistics, Harvard University; Jason Lieb, Department of Biology, University of North Carolina. This work was partially supported by NIH National Library of Medicine grant 1F37LM07626-01.
Reference Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. (2003) Integrating regulatory motif discovery and genomewide expression analysis. Proc Natl Acad Sci USA 100:3339-3344.