The Segway annotation of ENCODE data




Michael M. Hoffman, Department of Genome Sciences, University of Washington

Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
4. RNA-seq

Functional genomics. ENCODE Project Consortium 2011. PLoS Biol 9:e1001046.

Chromatin immunoprecipitation (ChIP). Park PJ 2009. Nat Rev Genet 10:669.

ChIP sequence

From sequence to signal: Wiggler
- Extends tags in strand direction
- Extension length determined by cross-correlation peak
- Signal only in mappable regions
- 1-bp resolution
Anshul Kundaje, http://align2rawsignal.googlecode.com/
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
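To make the tag-extension step concrete, here is a minimal Python sketch of counting extended reads per base. It is not Wiggler's implementation: the function name `extended_read_signal`, the 0-based coordinates, the input format, and the fixed `extension` argument are assumptions for illustration. It leaves out the cross-correlation estimate of the extension length and the mappability filter.

```python
from itertools import accumulate

def extended_read_signal(reads, chrom_length, extension):
    """Return the number of extended reads covering each base.

    reads: iterable of (five_prime_position, strand) with 0-based positions
    and strand "+" or "-" (hypothetical input format).
    """
    # Difference array: +1 where an extended read starts, -1 just past its end.
    diff = [0] * (chrom_length + 1)
    for position, strand in reads:
        if strand == "+":
            start, end = position, position + extension
        else:
            start, end = position - extension + 1, position + 1
        # Clip the extended read to the chromosome.
        start, end = max(start, 0), min(end, chrom_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # The running sum of the difference array is the per-base coverage.
    return list(accumulate(diff[:-1]))

signal = extended_read_signal([(2, "+"), (6, "-")], chrom_length=10, extension=3)
```

The forward read is extended downstream of its 5' end and the reverse read upstream, so in this toy input the two reads overlap at position 4.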

[Figure: fine-scale signal tracks (extended reads per base), 300 bp scale bar. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.]

2685 data sets. Now what? Maher B 2012. Nature 489:46.

2. Semi-automated genomic annotation

Semi-automated annotation: signal tracks, annotation, pattern discovery, visualization, interpretation

Genomic segmentation

Nonoverlapping segments

Finite number of labels (labels shown in the figure: 0, 1, 0, 1, 2, 1)

Maximize similarity in labels (labels shown in the figure: 2, 0, 1, 0, 1, 1)

Bayesian network for ChIP-seq

$Q_t$: is the transcription factor (TF) present at position $t$? 0: transcription factor is not present; 1: transcription factor is present. Hidden, discrete random variable.
$X_t$: signal at position $t$. Observed, continuous random variable.
Emission probability parameters: $\mu_0$, $\sigma_0$, $\mu_1$, $\sigma_1$.
$P(X_t \mid Q_t = 0) \sim N(\mu_0, \sigma_0)$
$P(X_t \mid Q_t = 1) \sim N(\mu_1, \sigma_1)$
Arrows in the diagram denote conditional relationships.
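The Gaussian emission model can be illustrated with a small Python sketch that scores one signal value under both states and converts the scores to a posterior for $Q_t = 1$. The parameter values in `PARAMS`, the equal prior, and the reading of $\sigma$ as a standard deviation are assumptions for illustration, not values from the slides.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def posterior_present(x, params, prior_present=0.5):
    """P(Q_t = 1 | X_t = x) using a single position and ignoring its neighbours."""
    (mu0, sigma0), (mu1, sigma1) = params
    log0 = math.log(1.0 - prior_present) + gaussian_logpdf(x, mu0, sigma0)
    log1 = math.log(prior_present) + gaussian_logpdf(x, mu1, sigma1)
    # Normalize in log space for numerical stability.
    top = max(log0, log1)
    w0, w1 = math.exp(log0 - top), math.exp(log1 - top)
    return w1 / (w0 + w1)

# Hypothetical emission parameters: (mu_0, sigma_0) and (mu_1, sigma_1).
PARAMS = ((0.0, 1.0), (4.0, 1.0))

p_high = posterior_present(4.0, PARAMS)
p_mid = posterior_present(2.0, PARAMS)
```

With these assumed parameters a signal halfway between the two means is uninformative, while a signal at $\mu_1$ strongly favours the TF being present.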

Bayesian network: 2 positions

Hidden, discrete random variables $Q_t$ and $Q_{t+1}$ emit the observed, continuous random variables $X_t$ and $X_{t+1}$, each with the same emission probability parameters ($\mu_0$, $\sigma_0$, $\mu_1$, $\sigma_1$).
Transition probability parameters:
$P(Q_{t+1} = 0 \mid Q_t = 0) = 0.99$
$P(Q_{t+1} = 1 \mid Q_t = 0) = 0.01$
$P(Q_{t+1} = 0 \mid Q_t = 1) = 0.01$
$P(Q_{t+1} = 1 \mid Q_t = 1) = 0.99$
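A short worked example shows what these transition probabilities imply. With a self-transition probability of 0.99, the number of consecutive positions $L$ spent in a state before leaving it is geometrically distributed. This is a standard property of Markov chains, not a result stated on the slides.

```latex
P(L = \ell) = 0.99^{\,\ell - 1} \cdot 0.01, \qquad
\mathbb{E}[L] = \frac{1}{1 - 0.99} = 100
```

A chain with these parameters therefore stays in the same state for 100 positions on average.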

Dynamic Bayesian network (DBN)

The two-position network repeated along the genome: hidden, discrete random variables $Q_t, Q_{t+1}, Q_{t+2}, \ldots$ are linked by the same transition probability parameters, and each emits an observed, continuous random variable $X_t, X_{t+1}, X_{t+2}, \ldots$ with the same emission probability parameters ($\mu_0$, $\sigma_0$, $\mu_1$, $\sigma_1$).
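For a chain of this kind, the most probable sequence of hidden states can be found with the Viterbi algorithm. The sketch below is a generic textbook implementation for a two-state model with Gaussian emissions, not Segway's own inference code. The transition values are the ones shown for the two-position network, while the emission parameters, the uniform initial distribution, and the toy signal are assumptions for illustration.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def viterbi(signal, emissions, transitions, initial):
    """Most probable hidden state path for a Gaussian-emission chain.

    emissions[q] = (mu_q, sigma_q)
    transitions[p][q] = P(Q_{t+1} = q | Q_t = p)
    initial[q] = P(Q_1 = q)
    """
    states = range(len(emissions))
    score = [math.log(initial[q]) + gaussian_logpdf(signal[0], *emissions[q])
             for q in states]
    backpointers = []
    for x in signal[1:]:
        new_score, pointers = [], []
        for q in states:
            # Best previous state for arriving in state q at this position.
            best = max(states, key=lambda p: score[p] + math.log(transitions[p][q]))
            pointers.append(best)
            new_score.append(score[best] + math.log(transitions[best][q])
                             + gaussian_logpdf(x, *emissions[q]))
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda q: score[q])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

TRANSITIONS = [[0.99, 0.01], [0.01, 0.99]]   # values from the slides
EMISSIONS = [(0.0, 1.0), (4.0, 1.0)]         # hypothetical (mu, sigma) per state
signal = [0.1, -0.3, 0.2, 3.9, 4.2, 1.5, 4.1, 3.8, 0.0, 0.3]
path = viterbi(signal, EMISSIONS, TRANSITIONS, initial=[0.5, 0.5])
```

Because leaving a state is improbable, the dip to 1.5 inside the high-signal run does not split the segment. Under these assumed parameters a single isolated high value is likewise not enough to open a new segment.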

Dynamic BN for segmentation

At each position a hidden, discrete segment label emits the observed, continuous signal tracks (DNaseI, H3K36me3, CTCF). Transition probability parameters govern the segment label and emission probability parameters govern each track.

Heterogeneous missing data. Hoffman MM et al. 2012. Nat Methods 9:473.

Handling missing data

Each track gets an observed indicator variable, present(dnasei), present(h3k36me3), present(ctcf), taking the value 1 where the track has data and 0 where it does not. The indicator acts as a conditional switch: the observed signal (DNaseI, H3K36me3, CTCF) depends on the hidden segment label, through the emission probability parameters $\mu_0$, $\sigma_0$, $\mu_1$, $\sigma_1$, only where the track is present.
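The effect of the present(track) switch can be sketched in Python: a track contributes to the likelihood of a position only where it has data. This is an illustrative sketch, not Segway code. The function name `emission_loglik`, the use of `None` for missing values, the independence of tracks given the label, and the parameter values are all assumptions.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution with standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def emission_loglik(observations, track_params):
    """Log-likelihood of one position's observations under one segment label.

    observations: dict of track name -> value, or None where the track has no data.
    track_params: dict of track name -> (mu, sigma) for this label (hypothetical).
    """
    total = 0.0
    for track, value in observations.items():
        present = value is not None  # plays the role of present(track)
        if present:
            total += gaussian_logpdf(value, *track_params[track])
        # When present is False the track contributes log(1) = 0, which is
        # equivalent to marginalizing out the missing observation.
    return total

params = {"dnasei": (0.0, 1.0), "h3k36me3": (1.0, 2.0), "ctcf": (0.5, 0.5)}
partial = emission_loglik({"dnasei": 0.5, "h3k36me3": None, "ctcf": None}, params)
```

Positions with no data in any track then carry no evidence, and their labels are determined by the neighbouring positions through the transition probabilities.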

Length distribution

Model variables: frame index, ruler, segment countdown, segment transition, segment label, present(dnasei), present(h3k36me3), present(ctcf), DNaseI, H3K36me3, CTCF.
Controls on segment length:
- Minimum segment length
- Maximum segment length
- Trained geometric length distribution
- Dirichlet prior on segment length
- Weight of prior versus observed data
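The minimum and maximum segment length constraints can be illustrated by checking a finished labeling after the fact. The slides indicate that the model itself handles length through variables such as the segment countdown. The sketch below does not reproduce that mechanism: it only collapses a per-position label sequence into runs and tests the bounds. The function names and the toy labeling are assumptions for illustration.

```python
from itertools import groupby

def segment_runs(labels):
    """Collapse a per-position label sequence into (label, length) runs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]

def respects_length_bounds(labels, min_length, max_length):
    """True if every segment length lies in [min_length, max_length]."""
    return all(min_length <= length <= max_length
               for _, length in segment_runs(labels))

labels = [0, 0, 0, 1, 1, 2, 2, 2, 2]
runs = segment_runs(labels)
ok = respects_length_bounds(labels, min_length=2, max_length=4)
```

Here the labeling has three segments of lengths 3, 2 and 4, so it satisfies a minimum of 2 and a maximum of 4 but would violate a minimum of 3.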

Segway: a way to segment the genome. http://noble.gs.washington.edu/proj/segway/ Hoffman MM et al. 2012. Nat Methods 9:473.

3. Chromatin

Cell types: H1 hESC, embryonic stem cell; HepG2, hepatocellular carcinoma cell; HUVEC, umbilical vein endothelial cell; K562, chronic myeloid leukemia cell; GM12878, lymphoblastoid cell; HeLa-S3, cervical carcinoma cell.
[Figure: lineage tree with nodes embryoblast, mesendoderm, endoderm, mesoderm, lateral mesoderm, intermediate mesoderm, hemangioblast, liver, blood vessel endothelium, myeloid progenitor, hemocytoblast, lymphoid progenitor, lymphoblast, cervix.]

Input tracks: 49 tracks from ENCODE K562 (ChIP-seq, DNase-seq, FAIRE-seq), produced by 8 different labs.

Picking the number of labels: 25 labels, numbered 0 to 24.

Emission parameters

Each cell represents a Gaussian. Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue. The length of the black bar is proportional to the standard deviation. Columns are the 25 labels (0 to 24).
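Row normalization of this kind can be sketched as min-max scaling of each track's means across labels, so that 1 corresponds to the red end of the colour scale and 0 to the blue end. The function name `row_normalize`, the handling of constant rows, and the example values are assumptions for illustration. The exact scaling used for the original figure is not stated on the slide.

```python
def row_normalize(means):
    """Min-max scale each track's means across labels to the range [0, 1].

    means: dict of track name -> list of per-label Gaussian means.
    """
    normalized = {}
    for track, values in means.items():
        low, high = min(values), max(values)
        span = high - low
        # A constant row has no contrast; map it to 0.5 to avoid dividing by zero.
        normalized[track] = [(v - low) / span if span else 0.5 for v in values]
    return normalized

# Hypothetical means for two tracks over three labels.
scaled = row_normalize({"H3K36me3": [0.2, 1.0, 0.6], "CTCF": [5.0, 1.0, 3.0]})
```

Scaling each row separately makes tracks with very different dynamic ranges comparable in one heatmap, at the cost of hiding absolute differences between tracks.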

Label key
TSS: transcription start site
GS: gene start
GM: gene middle
GE: gene end
E: enhancer
I: insulator
R: repression
D: dead

Transcription start site (TSS). Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Rediscovering genes

Zooming out
- TSS segments occur near 5' ends of genes
- TSS/G* segments are missing in gene deserts
- R*/D* segments occur more in gene deserts

3' gene ends. Jason Ernst. Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

A puzzling region: lots of genes but very few TSS/GS segments. Why? Because these genes are not expressed in K562.

Experimental validation: testing <1000 bp sequences for promoter activity.
- Predicted + in K562
- Predicted − in K562
- Predicted + in GM12878
- Predicted − in GM12878
http://switchgeargenomics.com/products/promoter-reporter-collection/

Luciferase assay results. Hoffman MM et al. 2012. Nat Methods 9:473.

Comparison with GWAS catalog. Bob Harris, Ross Hardison. Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Summary of results

Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:
- A simple annotation with a single label for each part of the genome.
- Visualization reducing multivariate data to a comprehensible representation.
- Interpretation of the context and potential regulatory impact of variants.

Software availability

Segway: segmentation from data tracks. Hoffman MM et al. 2012. Nat Methods 9:473. http://noble.gs.washington.edu/proj/segway/
Segtools: segmentation plots and summary statistics. Buske OJ et al. 2011. BMC Bioinformatics 12:415. http://noble.gs.washington.edu/proj/segtools/
Genomedata: efficient access to numeric data anchored to genome. Hoffman MM et al. 2010. Bioinformatics 26:1458. http://noble.gs.washington.edu/proj/genomedata/

Acknowledgments

Bill Noble, Jeff Bilmes, Orion Buske, Paul Ellenbogen.
University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.
University of Massachusetts Medical School: Zhiping Weng.
SwitchGear Genomics: Patrick Collins.
Stanford University: Anshul Kundaje.
Pennsylvania State University: Ross Hardison, Bob Harris.
European Bioinformatics Institute: Ewan Birney, Ian Dunham.
University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.
Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.
CRG: Sarah Djebali.
RIKEN: Timo Lassmann.
ENCODE Project Consortium.
NIH/NHGRI: K99HG006259, U54HG004695.