Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon




Integrating DNA Motif Discovery and Genome-Wide Expression Analysis
Department of Mathematics and Statistics, University of Massachusetts Amherst
Statistics in Functional Genomics Workshop, Ascona, Switzerland, June 30, 2004

Motif Discovery. Identify short patterns in DNA sequence. These patterns play a role in the control of gene expression. Finding the sites will help develop disease treatments and understand disease susceptibility.

Motif Discovery
Whole genome sequence:
ATTTACCGATGGCTGCACTATGCCCTATCGATCGACCTC
TCATCTTCACATCGCATCACCAGTTCAGGATAGACACGG
ACGGCCTCGATTGACGGTGGTACAATTTACCGATGGCTG
CACTATGCCCTATCGATCGACCTCTCATGCTTCACATCG
CATCACCAGTTCAGGATAGACACGGTCACATCGCATCAC
Microarray information
Regulatory sequence upstream from genes:
GATGGCTGCACCTCATCGTATGCCCTACGACCTCTCGC
CACATCGCATCTCATCGACCAGTTCAGACACGGACGGC
GCCTCGCTCATCGGTGGTACAGTTCAAACCTGACTAAA
TCTCGTTAGGACCATCTCATCGACCCACATCGAGAGCG
CGCTAGCCCTCATCGGATCTTGTTCGAGAATTGCCTAT

Gene Expression Control. [Diagram labels: transcription factor; gene expression; site CTCATCG in the upstream DNA sequence; gene.]

Transcription Factor Binding Sites
Upstream sequence of co-expressed genes (each row ends at the transcription start):
GATGGGGGCTCATCGACGTGTATGC...ACGATGTCTC  Gene 1
CACACCCCCTCTCATCGCGTCCCTT...CGCCCCCCCG  Gene 2
GCCTCCTCATCGGTGGTACTCCAGT...TACATGACTA  Gene 3
TCTCATGCTCATCGCATCACGTGTA...GCAATGAGAG
CGCCTCATCGTGGATCTTGCGAATT...AGAATGGCCT  Gene 100

Motif representations: 1) motif matrix, 2) sequence logo, 3) consensus sequence (CTCATCG).
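
As a rough illustration of how a motif matrix and a consensus sequence relate, the sketch below counts bases per position over a set of aligned sites and reads off the most frequent base in each column. The five sites are hypothetical variations on CTCATCG invented for this example, and the function names are not from the talk.

```python
from collections import Counter

def motif_matrix(sites):
    """Return per-position base counts for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same width")
    return [Counter(s[i] for s in sites) for i in range(width)]

def consensus(matrix):
    """Most frequent base at each position (ties broken alphabetically)."""
    return "".join(min(col, key=lambda b: (-col[b], b)) for col in matrix)

# Hypothetical aligned sites: small variations on the slide's consensus.
sites = ["CTCATCG", "CTCATCG", "CTCATCC", "ATCATCG", "CTCTTCG"]
matrix = motif_matrix(sites)

# Print the count matrix with one row per base and one column per position.
for base in "ACGT":
    print(base, [col[base] for col in matrix])
print(consensus(matrix))
```

A sequence logo is a graphical rendering of the same matrix, so it is not reproduced here.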

MDscan Motif Finding Algorithm. Uses the 100 most highly expressed genes to find 30 candidate motifs for each width in [5,15]. Confirms the motifs using the 500 most highly expressed genes. The procedure is repeated for the lowest expressed genes.
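
The gene selection that feeds the motif search can be sketched as follows. This is not MDscan itself, only the ranking step around it, and the expression values and gene names are hypothetical. The last lines check the candidate count: 30 motifs for each of the 11 widths from 5 to 15 gives 330 per direction, which agrees with the 330 candidate motifs quoted for the Rox1p example and the 660 quoted when both directions are used.

```python
def rank_genes(expression):
    """Gene names sorted from highest to lowest log2 expression ratio."""
    return sorted(expression, key=expression.get, reverse=True)

def selection_sets(expression, n_search=100, n_confirm=500):
    """Top and bottom gene sets used to search for and then confirm motifs."""
    ranked = rank_genes(expression)
    lowest_first = ranked[::-1]
    return {
        "induced_search": ranked[:n_search],
        "induced_confirm": ranked[:n_confirm],
        "repressed_search": lowest_first[:n_search],
        "repressed_confirm": lowest_first[:n_confirm],
    }

# Hypothetical log2 expression ratios for 1,000 genes.
expression = {f"gene{i}": ((i * 37) % 101 - 50) / 10 for i in range(1000)}
sets = selection_sets(expression)

widths = range(5, 16)                  # widths 5..15 inclusive
per_direction = 30 * len(widths)       # candidates from one gene set
print(per_direction, 2 * per_direction)
```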

Motifs Correlated with Expression. Goal: relate global gene expression to motif matrices. For each motif: calculate a sequence score for each gene (the score reflects the number of copies of the motif in the gene's upstream sequence), regress gene expression on the motif scores, and determine the significant motifs.
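
A deliberately simplified version of the sequence score is shown below: it counts exact, possibly overlapping copies of a consensus string in an upstream sequence. This is an assumption made for illustration; the slides only say that the score reflects the number of copies, and a matrix-based score would weight imperfect matches as well. The first two sequences are the visible prefixes of the upstream sequences for Gene 1 and Gene 2 on the binding-site slide; the third is hypothetical.

```python
def count_copies(sequence, motif):
    """Count possibly overlapping exact occurrences of motif in sequence."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count

upstream = {
    "Gene 1": "GATGGGGGCTCATCGACGTGTATGC",
    "Gene 2": "CACACCCCCTCTCATCGCGTCCCTT",
    "hypothetical": "CTCATCGAACTCATCGTTTT",
}
scores = {gene: count_copies(seq, "CTCATCG") for gene, seq in upstream.items()}
print(scores)
```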

Single Motif Regression. [Figure: expression plotted against sequence score (# motif copies).]

Linear Regression Model. For each motif: $Y_g = \alpha + \beta_m S_{mg} + e_g$, where $Y_g$ = $\log_2$-ratio of expression, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, and $e_g$ = error.
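
The single-motif model can be fitted by ordinary least squares. The sketch below is a minimal pure-Python version that returns the intercept, the slope, and the t-statistic of the slope; converting the t-statistic to a p-value needs a t distribution, which is left out. The scores and expression values are hypothetical.

```python
import math

def simple_regression(scores, expression):
    """OLS fit of expression = alpha + beta * score; returns (alpha, beta, t)."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, expression))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_s
    # Residual sum of squares and the standard error of the slope.
    rss = sum((y - alpha - beta * s) ** 2 for s, y in zip(scores, expression))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta / se_beta if se_beta > 0 else math.copysign(math.inf, beta)
    return alpha, beta, t_stat

# Hypothetical data: motif copies per gene and log2 expression ratios.
scores = [0, 0, 1, 1, 2, 2, 3]
expression = [0.1, -0.1, 0.6, 0.4, 1.1, 0.9, 1.5]
alpha, beta, t_stat = simple_regression(scores, expression)
print(alpha, beta, t_stat)
```

Motifs would then be ranked by the significance of $\beta_m$, with a large positive or negative t-statistic indicating a motif whose copy number tracks expression.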

Over-expression of a Transcription Factor. Rox1p is a transcription factor in yeast that binds to the 10-mer TCTATTGTTT (from the SCPD database of transcription factor binding sites).

Rox1p Over-expression. Yeast expression data for Rox1p over-expression, covering 5,838 genes, with 800 base pairs of upstream sequence for each gene. The most repressed genes are used to find and refine 330 candidate motifs of width [5,15]. Regression against global gene expression is then used to calculate p-values and rank the motifs.

Over-expressing a Transcription Factor. Known binding site: TCTATTGTTT

Comparison to Other Motif-Finding Algorithms. Statistically based algorithms: 1) AlignACE (Roth et al. 1998): Gibbs sampling approach. 2) MEME (Grundy et al. 1996): expectation maximization (EM). Both use iterative procedures to update random initial probability matrices. Drawback: they may become trapped in local maxima.

Over-expressing a Transcription Factor. Known binding site: TCTATTGTTT

Combinatorial Effects of Motifs. Identify motifs that work together to control gene expression. Method: MDscan generates 660 motifs of width [5,15], covering both motifs that enhance and motifs that inhibit expression. Non-significant motifs are removed. Stepwise regression then determines the final additive model.

Multiple Regression Model to Determine Motifs Working Together. $Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + e_g$, where $Y_g$ = $\log_2$-ratio of expression, $\beta_m$ = regression coefficient, $S_{mg}$ = sequence score, $e_g$ = error, and the sum runs over the subset of $M$ significant motifs.
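
One way to arrive at an additive model of this form is greedy forward selection, sketched below in pure Python. It is a simplification of stepwise regression: motifs are only added, never removed, and the stopping rule is a hypothetical absolute threshold on the drop in residual sum of squares (`min_rss_drop`) standing in for a significance test. The motif names and data are invented for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus columns; returns (coefs, rss)."""
    design = [[1.0] + [col[g] for col in columns] for g in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yg for row, yg in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    rss = sum((yg - sum(c * v for c, v in zip(coefs, row))) ** 2
              for row, yg in zip(design, y))
    return coefs, rss

def forward_select(scores, y, min_rss_drop=0.1):
    """Add, one at a time, the motif that most reduces RSS; stop on a small drop."""
    chosen = []
    best_rss = fit_rss([], y)[1]
    while len(chosen) < len(scores):
        trials = {m: fit_rss([scores[c] for c in chosen + [m]], y)[1]
                  for m in scores if m not in chosen}
        best = min(trials, key=trials.get)
        if best_rss - trials[best] <= min_rss_drop:  # assumed stopping rule
            break
        chosen.append(best)
        best_rss = trials[best]
    return chosen

# Hypothetical sequence scores for 8 genes and 3 motifs; y depends on A and B only.
scores = {"A": [0, 1, 2, 3, 0, 1, 2, 3],
          "B": [1, 0, 1, 0, 2, 0, 2, 1],
          "C": [0, 0, 1, 1, 0, 1, 0, 0]}
noise = [0.02, -0.01, 0.01, -0.02, 0.01, 0.0, -0.01, 0.0]
y = [a - 0.5 * b + e for a, b, e in zip(scores["A"], scores["B"], noise)]
selected = forward_select(scores, y)
coefs, rss = fit_rss([scores[m] for m in selected], y)
print(selected, coefs)
```

In this toy data the fitted coefficient is positive for motif A and negative for motif B, mirroring the enhancing and inhibiting motifs that the additive model is meant to separate.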

Yeast Amino Acid Starvation Experiment. Expression for 5,970 genes. Find motifs both enhancing and inhibiting expression: 235 significant motifs. Stepwise regression yields 25 final motifs.

Multiple Motifs Influencing Expression

Known Motifs. Positive coefficients: STRE, URS1 (respond to stress); PHO4, MET4 (nutrient scavenging); GCN4 (amino acid production). Negative coefficients: M3A, M3B, RAP1 (slow cell growth).

Motifs Influencing Expression over Time. Yeast cell cycle information (Spellman et al. 1998): 2 cell cycles, 18 time points, 7-minute intervals. Examine expression patterns over time.

Time Series Expression. Use Motif Regressor to find multiple motifs at each time point (273 motifs total). Each motif is then regressed against the expression at each of the other 17 time points.
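
Regressing one motif against every time point yields a profile of regression coefficients over time. A minimal sketch, assuming the expression data are stored as one list of per-gene values per time point (a layout chosen for this example), with hypothetical numbers and four time points instead of 18:

```python
def slope(scores, expression):
    """Least-squares slope of expression on sequence score."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, expression))
    return sxy / sxx

def coefficient_profile(scores, expression_by_time):
    """One regression coefficient per time point for a single motif."""
    return [slope(scores, column) for column in expression_by_time]

# Hypothetical sequence scores for 5 genes and expression at 4 time points.
scores = [0, 1, 2, 0, 1]
expression_by_time = [
    [0.0, 1.0, 2.0, 0.0, 1.0],     # expression rises with motif copies
    [0.0, 0.0, 0.0, 0.0, 0.0],     # no relationship
    [0.0, -1.0, -2.0, 0.0, -1.0],  # expression falls with motif copies
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
profile = coefficient_profile(scores, expression_by_time)
print(profile)
```

A coefficient that swings between positive and negative across the time points is the kind of pattern that suggests a motif acting at a particular phase of the cell cycle.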

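Motif: ACGCGTCGCG. [Figure with labels "Phase" and "Test"; phase axis: M/G1, G1, S, G2, M, repeated for two cycles.]

Motif: GCTCATCGC. [Figure with labels "Phase" and "Test"; phase axis: M/G1, G1, S, G2, M, repeated for two cycles.]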

Motif Clustering. Method: hierarchically cluster the motif patterns using Euclidean distance, forming 20 clusters, and plot the average coefficients for each cluster.
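
The clustering step can be sketched as a small agglomerative procedure over coefficient profiles. The slides specify Euclidean distance but not the linkage rule, so average linkage here is an assumption, and the four profiles and motif names are hypothetical (the talk forms 20 clusters; this toy example forms 2).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two coefficient profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_clusters(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average pairwise distance between members (assumed linkage rule).
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(j)   # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters

def average_profile(cluster, profiles):
    """Mean coefficient at each time point over the motifs in a cluster."""
    length = len(profiles[cluster[0]])
    return [sum(profiles[m][t] for m in cluster) / len(cluster)
            for t in range(length)]

# Hypothetical regression-coefficient profiles over 4 time points.
profiles = {
    "m1": [1.0, 0.0, -1.0, 0.0],
    "m2": [0.9, 0.1, -1.1, 0.0],
    "m3": [-1.0, 0.0, 1.0, 0.0],
    "m4": [-0.8, 0.0, 1.2, 0.1],
}
clusters = hierarchical_clusters(profiles, n_clusters=2)
averages = [average_profile(c, profiles) for c in clusters]
print(clusters)
print(averages)
```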

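Cluster 1: Known Motif SCB (6 motifs). [Figure: regression coefficient plotted against cell cycle phase (M/G1, G1, S, G2, M, repeated for two cycles); the plot also carries a "Test" label.]

Known Cell Cycle Motifs. [Figure: regression coefficient plotted against cell cycle time points, with phases M/G1, G1, S, G2, M repeated for two cycles; the plot also carries a "Test" label.]

Other Cell Cycle Motifs. [Figure: regression coefficient plotted against cell cycle time points, with phases M/G1, G1, S, G2, M repeated for two cycles; the plot also carries a "Test" label.]

Non Cell Cycle Motifs. [Figure: regression coefficient plotted against cell cycle time points, with phases M/G1, G1, S, G2, M repeated for two cycles; the plot also carries a "Test" label.]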

Simulation Study. Randomly assign the yeast cell cycle expression values to the 5,838 genes. Use MDscan to find candidate motifs. Use simple linear regression to determine the p-values of the motifs. Repeat 100 times to generate 40,324 motifs.
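
The random-assignment idea can be illustrated with a permutation sketch: shuffle the expression values across genes, refit the single-motif regression, and collect the slopes obtained under no real association. This is only part of the study described above, since it keeps one fixed motif and skips the MDscan search that the talk reruns on each randomized data set. The data, seed, and function names are hypothetical.

```python
import random

def slope(scores, expression):
    """Least-squares slope of expression on sequence score."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, expression))
    return sxy / sxx

def permutation_slopes(scores, expression, n_repeats=100, seed=0):
    """Slopes obtained after randomly reassigning expression values to genes."""
    rng = random.Random(seed)
    shuffled = list(expression)
    null = []
    for _ in range(n_repeats):
        rng.shuffle(shuffled)
        null.append(slope(scores, shuffled))
    return null

# Hypothetical data for 40 genes: expression rises with the motif score.
scores = [i % 4 for i in range(40)]
expression = [0.5 * s + ((i * 7) % 5 - 2) / 10 for i, s in enumerate(scores)]

observed = slope(scores, expression)
null = permutation_slopes(scores, expression)
exceed = sum(b >= observed for b in null)
print(observed, exceed)
```

Comparing the observed slope with the shuffled ones shows how often a coefficient that large arises by chance, which is the comparison the real-versus-random results are meant to support.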

Simulation Results. [Figure panels: Motifs From Real Sequences; Motifs From Random Sequences.]

Summary. Microarray and sequence information are combined to find transcription factor binding sites. Stepwise regression identifies motifs working together to control expression. We find known motifs and new putative motifs, both in single experiments and in time course experiments.

Acknowledgements. X. Shirley Liu and Jun Liu, Departments of Biostatistics and Statistics, Harvard University. Jason Lieb, Department of Biology, University of North Carolina. This work was partially supported by NIH National Library of Medicine grant 1F37LM07626-01.

Reference. Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci USA 100:3339-3344.