COMPUTATIONAL FRAMEWORKS FOR UNDERSTANDING THE FUNCTION AND EVOLUTION OF DEVELOPMENTAL ENHANCERS IN DROSOPHILA

Similar documents
Genetomic Promototypes

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

In developmental genomic regulatory interactions among genes, encoding transcription factors

A Mathematical Model of a Synthetically Constructed Genetic Toggle Switch

Identifying cis-regulatory modules by combining comparative and compositional analysis of DNA

Understanding the dynamics and function of cellular networks

NOVEL GENOME-SCALE CORRELATION BETWEEN DNA REPLICATION AND RNA TRANSCRIPTION DURING THE CELL CYCLE IN YEAST IS PREDICTED BY DATA-DRIVEN MODELS

Feed Forward Loops in Biological Systems

What is the difference between basal and activated transcription?

Positive Feedback and Bistable Systems. Copyright 2008: Sauro

MORPHEUS. Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

Statistical mechanics for real biological networks

Network Analysis. BCH 5101: Analysis of -Omics Data 1/34

Genetics and Evolution: An ios Application to Supplement Introductory Courses in. Transmission and Evolutionary Genetics

Learning Predictive Models of Gene Dynamics. Nick Jones, Sumeet Agarwal

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

Hidden Markov Models

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Chapter 6: Biological Networks

The search process for small targets in cellular microdomains! Jürgen Reingruber

Protein Protein Interaction Networks

ARTICLE IN PRESS. Journal of Theoretical Biology

Bioinformatics: Network Analysis

Current Motif Discovery Tools and their Limitations

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

Supplementary Information

DnaSP, DNA polymorphism analyses by the coalescent and other methods.

Introduction to Bioinformatics 3. DNA editing and contig assembly

MATCH Commun. Math. Comput. Chem. 61 (2009)

Conference: From DNA-Inspired Physics to Physics-Inspired Biology. 1-5 June How Proteins Find Their Targets on DNA

Comparing Methods for Identifying Transcription Factor Target Genes

Additional Modeling Information

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

S1 Text. Modeling deterministic single-cell microrna-p53-mdm2 network Figure 2 Figure 2

Next Generation Sequencing: Technology, Mapping, and Analysis

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

How To Understand How Gene Expression Is Regulated

Make Better Decisions with Optimization

The College of Liberal Arts and Sciences

Activity 7.21 Transcription factors

Frequently Asked Questions Next Generation Sequencing

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

Epigenetic variation and complex disease risk

Regulated recruitment and cooperativity in the design of biological regulatory systems

A Conserved Developmental Patterning Network Produces Quantitatively Different Output in Multiple Species of Drosophila

The Making of the Fittest: Evolving Switches, Evolving Bodies

UNDERSTANDING the control of embryo development is. Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

Qualitative analysis of regulatory networks

1. Introduction Gene regulation Genomics and genome analyses Hidden markov model (HMM)

Multi-Objective Genetic Test Generation for Systems-on-Chip Hardware Verification

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

Year 9 set 1 Mathematics notes, to accompany the 9H book.

Gene Transcription in Prokaryotes

BIG Biomedicine and the Foundations of BIG Data Analysis

Boolean Network Models

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

FART Neural Network based Probabilistic Motif Discovery in Unaligned Biological Sequences

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

Systems Biology through Data Analysis and Simulation

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

GENE REGULATION. Teacher Packet

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Computing the maximum similarity bi-clusters of gene expression data

Yale Pseudogene Analysis as part of GENCODE Project

Discovery & Modeling of Genomic Regulatory Networks with Big Data

A Computer Scientist s Guide to the Regulatory Genome

Inferring Probabilistic Models of cis-regulatory Modules. BMI/CS Spring 2015 Colin Dewey

Bacterial and Phage Genetic Switches

T cell Epitope Prediction

Review. Bayesianism and Reliability. Today s Class

Program Approval Form

INSECT: In silico search for co-occurring transcription factors

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Data Analysis Tools. Tools for Summarizing Data

PolyLens: Software for Map-based Visualization and Analysis of Genome-scale Polymorphism Data

RAP: Accurate and fast motif finding based on protein binding microarray data

Bachelor of Arts in Applied Psychology

Gene Regulation -- The Lac Operon

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

TITLE MOTIVATION OBJECTIVES AUDIENCE COURSE INSTRUCTORS. Analysis of regulatory sequences controlling the expression of gene networks

Graph theoretic approach to analyze amino acid network

Modeling and Simulation of Gene Regulatory Networks

Biological Sciences Initiative. Human Genome

Prof Brian McStay Wellcome Trust Senior Investigator Award April March 2020

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

Using EXCEL Solver October, 2000

Cooperativity and Rapid Evolution of Cobound Transcription Factors in Closely Related Mammals

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

MOPAT: a graph-based method to predict recurrent cis-regulatory modules from known motifs

Evaluation & Validation: Credibility: Evaluating what has been learned

Bi-8: Introduction to Molecular Biology by Prof. Angela Stathopoulos

VISUALIZATION AND INTEGRATED DATA MINING OF DISPARATE INFORMATION

IMPROVING BUSINESS VALUE THROUGH DATA ANALYTICS - APPROACH DOCUMENT. Krysalis Analytics from Data to Insights

Adaptive Equalization of binary encoded signals Using LMS Algorithm

False Discovery Rates

QUANTIFIED THE IMPACT OF AGILE. Double your productivity. Improve Quality by 250% Balance your team performance. Cut Time to Market in half

Inferring the role of transcription factors in regulatory networks

Transcription:

COMPUTATIONAL FRAMEWORKS FOR UNDERSTANDING THE FUNCTION AND EVOLUTION OF DEVELOPMENTAL ENHANCERS IN DROSOPHILA Saurabh Sinha, Dept of Computer Science, University of Illinois

Cis-regulatory modules (enhancers) 2 Even-skipped ( eve ) gene expressed in seven stripes in the trunk region Different stripes driven by different cisregulatory sequences Eve stripe 2

Cis-regulatory modules 3 Reporter gene From Steve Small, NYU Reporter gene expression Regulatory sequence associated with eve Stripe 2

Cis-regulatory modules 4 Repressors Kr Gt Gt Kr Gt Kr Eve Stripe 2 Activators bcd bcd bcd bcd Hb bcd Regulatory sequence associated with eve Stripe 2

5 DISCOVERY FUNCTION CRM EVOLUTION

Drosophila Genome Surveyor 6 Genome Browser tracks for motifs for ~ 300 TFs. (HMM-based.) In each of 12 Drosophila genomes, as well as multi-species averages Can combine tracks for any subset of motifs (for CRM discovery)

7 DISCOVERY FUNCTION CRM EVOLUTION

CRM Function PLoS Biology 2010 PLoS Comp Bio 2010 8 How do we go from sequence to expression? ACGGATCGACA.CGACGACGATCG

Well, we ll use more than just the sequence PLoS Biology 2010 PLoS Comp Bio 2010 Assume that TF concentration profile known Segal et al. Nature 451, 535-540(31 January 2008)doi:10.1038/nature06496 ACGGATCGACA.CGACGACGATCG (eve stripe 2 CRM)

The A/P patterning regulatory network Assume that TF concentration profile known Segal et al. Nature 451, 535-540(31 January 2008)doi:10.1038/nature06496 ACGGATCGACA.CGACGACGATCG (eve stripe 2 CRM) Predicted expr Real expr

Statistical Thermodynamics-based models Shea & Ackers (1985). The OR control system of bacteriophage lambda. A physical-chemical model for gene regulation. J Mol Biol 181: 211 230. Buchler NE, Gerland U, Hwa T (2003). On schemes of combinatorial transcription logic. Proc Natl Acad Sci U S A 100: 5136 5141. Gertz J, Siggia ED, Cohen BA (2009). Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457: 215 218. This is what our framework will be based on.

Statistical Thermodynamics-based models Other quantitative models: Janssens H, Hou S, Jaeger J, Kim AR, Myasnikova E, et al. (2006). Nat Genet 38: 1159 1165. Zinzen RP, Papatsenko D (2007). PLoS Comput Biol 3: e84. A general purpose software implementation missing.

Statistical weight = relative probability of a configuration Model based on equilibrium thermodynamics 13 CRM with 3 binding sites. Two activator sites and one repressor site. 2 3 = 8 possible configurations of bound/unbound factors. Statistical weight of a bound site (q) given by sequence and TF concentration. Statistical weight of a configuration comes from product over bound sites.

Modeling Gene Expression 14 BTM may be bound (at promoter) or not Shea &Ackers, 1985 Gene expression probability of bound BTM. Gertz et al, 2009.

TF effect on gene expression 15 > 1 < 1 Each bound trans. factor interacts independently with BTM. Activators have stabilizing effect. Repressors destabilize.

Implementation Two free parameters per TF: one for TF-DNA interaction one for TF s activation or repression strength Given these parameters, relative TF concentrations and any sequence, compute the predicted expression level (fractional occupancy of BTM) in time proportional to length of sequence. Dynamic programming. Note that predicted expression levels are relative.

Implementation Given any set of enhancer sequences and their output expression profiles, learn parameter values such that model output best fits data. ~40-50 CRMs that drive A/P pattern Pattern here is the expression in each of ~100 bins along the A/P axis ~6-10 TFs that are known to be relevant

Issues: Objective function What does it mean for model output to fit data? That is, what is the objective function? Sum of squared errors Average Correlation coefficient Each has problems

Issues: Objective function What does it mean for model output to fit data? That is, what is the objective function? Objective function is really a subjective choice. Public implementation alternates between Average CC and RMSE. New implementation uses Pattern Generating Potential or PGP (from Kazemian et al 2010). We engineered an objective function that we were least unhappy about.

Issues: Objective function

Issues: Simultaneous fit to all CRMs Important that we fit the parameters to many CRMs simultaneously. Generally easy to fit to a single CRM or a handful. Therefore need an objective function that can not only tell when one prediction is a better fit than another prediction for the same CRM, But also compare the fit on one CRM to the fit on another CRM.

Issues: Simultaneous fit to all CRMs Important that we fit the parameters to many CRMs simultaneously. In practice, there will be some (or many) CRMs for which we are missing key TFs, or CRMs that are weird. But a simultaneous fit will try to find parameters that produce best fits overall. Perhaps we d like to allow the optimization to pass on some (or many) CRMs of its choice. We do that now.

Issues: Optimization algorithm Tried a few things; public implementation uses a combination of a gradient descent method and a simplex algorithm. Also tried an evolutionary strategy On real data, similar results; on realistic but simulated data, similar results. Also, a fairly exhaustive search of parameter space done before choosing 1000 best starting points

Visuals of model fits (some good ones)

Mechanistic inferences? Test if particular mechanistic aspects improve the fit of model to data. For example, the model I described vs model that includes short range repression.

Comparing model fits Compare the optimized objective function under each of the two models Sum of squared errors Average correlation coefficient PGP Same, but under cross validation, if models differ in complexity Statistical significance of the difference?

Testing mechanistic aspects Effect of cooperative DNA binding by pairs of TFs Cooperativity in DNA-binding (between adjacent bound trans. factors) contributes a term to the weight of a configuration

Testing mechanistic aspects Effect of cooperative DNA binding by pairs of TFs BCD and KNI self cooperativity helps

Testing mechanistic aspects Effect of cooperative DNA binding by pairs of TFs

Back to model comparison Typically, both models do similarly on many CRMs, one does better on some, the other does better on some others. Comparing overall quality of fit often misses the mark. Compare fits on each CRM separately, quantify as a p-value, see if a significant number of CRMs have a significant improvement under one model vs another.

Testing mechanistic aspects Effect of synergistic activation multiple bound activators simultaneously contacting the basal transcriptional machinery With two bound activators, there are two possibilities: 1) Both interact simultaneously with the BTM: leads to synergistic activation 2) Only one interacts with the BTM at a time: no synergy

Testing mechanistic aspects Effect of synergistic activation multiple bound activators simultaneously contacting the basal transcriptional machinery

Testing mechanistic aspects Effect of synergistic activation multiple bound activators simultaneously contacting the basal transcriptional machinery

Testing mechanistic aspects Effect of synergistic activation multiple bound activators simultaneously contacting the basal transcriptional machinery

Testing mechanistic aspects Effect of short range repression Repressor will work without direct interaction with BTM If bound, creates a new configuration where its locality is rendered inaccessible to other factors For KR, HB: short range repression model as effective as the baseline model

Issue: missing TFs An effect that is really due to a missing TF may be incorrectly assigned due to a mechanistic aspect in a model. So we need to be most diligent about including the relevant TFs.

<slides deleted>

Issue: Gene locus modeling Trained model can be used to predict expression pattern driven by any CRM sequence Ideally, would like to predict gene expression pattern from entire locus In this scenario, we don t know the CRMs in the locus

Issue: Gene locus modeling Given the 16 Kbp eve locus, can we predict its pattern correctly? Predicting on the entire 16 Kbp locus as one sequence will not work.

<slides deleted>

Acknowledgements 41 Collaborators Drosophila David Arnosti Michael Brodsky Marc Halfon Stas Shvartsman Scot Wolfe (MSU) (U.Mass. Med) (SUNY Buffalo) (Princeton) (U.Mass. Med) Md. A.Hassan Samee Xin He Jaebum Kim Thyago Duque Majid Kazemian Charles Blatti Students Statistical thermodynamics Eric Siggia (Rockefeller) Funding NSF CAREER NIH (NIGMS) R01 Dept. of Computer Science, UIUC. Institute for Genomic Biology, UIUC