The Segway annotation of ENCODE data. Michael M. Hoffman, Department of Genome Sciences, University of Washington.
Overview 1. ENCODE Project 2. Semi-automated genomic annotation 3. Chromatin 4. RNA-seq
Functional genomics ENCODE Project Consortium 2011. PLoS Biol 9:e1001046.
Chromatin immunoprecipitation (ChIP) Park PJ 2009. Nat Rev Genet 10:669.
ChIP sequence
sequence signal: Wiggler. Extends tags in strand direction. Extension length determined by cross-correlation peak. Signal only in mappable regions. 1-bp resolution. Anshul Kundaje. http://align2rawsignal.googlecode.com/ Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
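The tag-extension idea can be illustrated with a minimal sketch. This is not Wiggler's actual implementation: it omits mappability filtering and the estimation of the extension length from the cross-correlation peak, and the function name, the 0-based coordinates, and the example reads are assumptions made for illustration.

```python
from itertools import accumulate

def extended_read_signal(starts, strands, ext_len, chrom_len):
    """Count, at each base, how many extended reads cover it.

    Assumption: `starts` holds the 0-based 5' position of each tag and
    every tag is extended by `ext_len` bases in its strand direction.
    """
    # Difference array: +1 where coverage begins, -1 just past where it ends.
    diff = [0] * (chrom_len + 1)
    for start, strand in zip(starts, strands):
        if strand == "+":
            lo, hi = start, start + ext_len
        else:
            lo, hi = start - ext_len + 1, start + 1
        lo, hi = max(lo, 0), min(hi, chrom_len)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    # A running sum turns the difference array into per-base coverage.
    return list(accumulate(diff[:-1]))

# Hypothetical example: one forward tag at 2 and one reverse tag at 9.
signal = extended_read_signal([2, 9], ["+", "-"], ext_len=4, chrom_len=12)
```

The result has one value per base, which matches the 1-bp resolution named on the slide.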
(Figure: signal tracks of extended reads per base, fine-scale data, 300 bp scale bar. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.)
2685 data sets Maher B 2012. Nature 489:46.
Now what?
Semi-automated annotation. (Figure: workflow with labels signal tracks, annotation, pattern discovery, visualization, interpretation.)
Genomic segmentation
Nonoverlapping segments
Finite number of labels 0 1 0 1 2 1
Maximize similarity in labels 2 0 1 0 1 1
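The properties listed on these slides (nonoverlapping segments, a finite number of labels) can be made concrete with a small sketch that collapses a per-position label sequence into segments. The function name, the half-open coordinates, and the example labels are assumptions for illustration.

```python
from itertools import groupby

def labels_to_segments(labels):
    """Collapse per-position labels into (start, end, label) segments.

    Coordinates are 0-based and half-open, so consecutive segments
    tile the sequence without overlapping.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((start, start + length, label))
        start += length
    return segments

# Hypothetical label sequence using 2 labels.
segments = labels_to_segments([0, 0, 1, 1, 1, 0])
```

Each position falls in exactly one segment, and each segment carries exactly one label.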
Bayesian network for ChIP-seq. $X_t$: signal at position $t$ (observed random variable, continuous).
Bayesian network for ChIP-seq. $Q_t$: transcription factor present at position $t$? (hidden random variable, discrete; 0: transcription factor is not present, 1: transcription factor is present). $X_t$: signal at position $t$ (observed random variable, continuous).
Bayesian network for ChIP-seq. $Q_t$: TF present at position $t$? (hidden random variable, discrete). $X_t$: signal at position $t$ (observed random variable, continuous). The emission probability parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$ define the conditional relationship: $P(X_t \mid Q_t = 0) \sim N(\mu_0, \sigma_0)$, $P(X_t \mid Q_t = 1) \sim N(\mu_1, \sigma_1)$.
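For a single position, the Gaussian emissions can be combined with a prior on $Q_t$ to ask how likely it is that the TF is present given the observed signal. The sketch below assumes that $\sigma$ denotes a standard deviation; the parameter values and the 0.5 prior are hypothetical, not taken from the talk.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def posterior_present(x, params, prior_present=0.5):
    """P(Q_t = 1 | X_t = x) by Bayes' rule for the two-state model."""
    like0 = normal_pdf(x, *params[0]) * (1 - prior_present)
    like1 = normal_pdf(x, *params[1]) * prior_present
    return like1 / (like0 + like1)

# Hypothetical emission parameters: state -> (mu, sigma).
params = {0: (0.0, 1.0), 1: (4.0, 1.0)}
```

With these values, a signal near 4 gives a posterior close to 1, a signal near 0 gives a posterior close to 0, and a signal halfway between the two means gives 0.5.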
Bayesian network: 2 positions. Hidden discrete variables $Q_t$ and $Q_{t+1}$; observed continuous variables $X_t$ and $X_{t+1}$. Each position uses the emission probability parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$.
Bayesian network: 2 positions. Hidden discrete variables $Q_t$ and $Q_{t+1}$ are linked by the transition probability parameters: $P(Q_{t+1} = 0 \mid Q_t = 0) = 0.99$, $P(Q_{t+1} = 1 \mid Q_t = 0) = 0.01$, $P(Q_{t+1} = 0 \mid Q_t = 1) = 0.01$, $P(Q_{t+1} = 1 \mid Q_t = 1) = 0.99$. Observed continuous variables $X_t$ and $X_{t+1}$ depend on them through the emission probability parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$.
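The effect of the 0.99/0.01 transition probabilities can be seen by decoding the most probable hidden sequence with the Viterbi algorithm on a two-state chain. This is a minimal sketch, not the inference used by any particular tool: the emission parameters, the uniform initial distribution, and the example signals are assumptions.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

def viterbi(signal, emission, stay=0.99):
    """Most probable state path for a two-state chain with Gaussian emissions."""
    states = (0, 1)
    log_trans = {(i, j): math.log(stay if i == j else 1 - stay)
                 for i in states for j in states}
    # Assumed uniform initial distribution over the two states.
    score = {q: math.log(0.5) + log_normal_pdf(signal[0], *emission[q])
             for q in states}
    backpointers = []
    for x in signal[1:]:
        new_score, pointer = {}, {}
        for j in states:
            best_prev = max(states, key=lambda i: score[i] + log_trans[i, j])
            pointer[j] = best_prev
            new_score[j] = (score[best_prev] + log_trans[best_prev, j]
                            + log_normal_pdf(x, *emission[j]))
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(states, key=lambda q: score[q])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

# Hypothetical emission parameters: state -> (mu, sigma).
emission = {0: (0.0, 1.0), 1: (4.0, 1.0)}
run_path = viterbi([0, 0, 4, 4, 4, 0, 0], emission)
spike_path = viterbi([0, 0, 4, 0, 0], emission)
```

Under these assumed parameters, a run of three high values is decoded as state 1, while a single isolated high value is absorbed into state 0, because two state changes cost more than one position of emission evidence gains.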
Dynamic Bayesian network (DBN): the two-position network repeated along the sequence. Hidden discrete variables $Q_t, Q_{t+1}, Q_{t+2}, \ldots$ are linked by the transition probability parameters. Observed continuous variables $X_t, X_{t+1}, X_{t+2}, \ldots$ depend on them through the emission probability parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$.
Dynamic BN for segmentation. Hidden discrete variable: segment label. Observed continuous variables: DNaseI, H3K36me3, CTCF. Transition probability parameters govern the segment label; emission probability parameters govern each observed track.
Heterogeneous missing data Hoffman MM et al. 2012. Nat Methods 9:473.
Handling missing data. Hidden discrete variable: segment, with transition probability parameters. Observed continuous variable: DNaseI, with emission probability parameters $\mu_0, \sigma_0, \mu_1, \sigma_1$. An observed 1/0 variable acts on the DNaseI observation through conditional switching.
Handling missing data. Hidden discrete variable: segment label. Observed continuous variables: DNaseI, H3K36me3, CTCF. Observed indicators present(dnasei), present(h3k36me3), and present(ctcf) act on the corresponding track through conditional switching.
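One way to read the present(...) indicators is that a track contributes its emission term only where it has data, so a missing value leaves the likelihood unchanged. The sketch below illustrates that reading only and is not Segway's implementation; the function names, the use of `None` or NaN for missing values, and the parameter values are assumptions.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

def position_log_likelihood(observations, track_params):
    """Log likelihood of one position's observations under one segment label.

    A track whose value is None or NaN is treated as not present and
    contributes nothing (a factor of 1 in probability space).
    """
    total = 0.0
    for track, value in observations.items():
        present = value is not None and not math.isnan(value)
        if present:
            mu, sigma = track_params[track]
            total += log_normal_pdf(value, mu, sigma)
    return total

# Hypothetical emission parameters for a single label: track -> (mu, sigma).
label_params = {"DNaseI": (1.0, 0.5), "H3K36me3": (0.2, 1.0), "CTCF": (0.0, 2.0)}
with_gap = position_log_likelihood(
    {"DNaseI": 1.0, "H3K36me3": 0.5, "CTCF": None}, label_params)
two_tracks = position_log_likelihood(
    {"DNaseI": 1.0, "H3K36me3": 0.5}, label_params)
```

Here `with_gap` equals `two_tracks`: the missing CTCF value neither rewards nor penalizes the label.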
Length distribution. (Figure: segmentation network with segment label; observed tracks DNaseI, H3K36me3, CTCF; indicators present(dnasei), present(h3k36me3), present(ctcf).)
Length distribution. Added variables: frame index, ruler, segment countdown, segment transition. Controls: minimum segment length; maximum segment length; trained geometric length distribution; Dirichlet prior on segment length; weight of prior versus observed data. (Figure: these variables added to the network with segment label, DNaseI, H3K36me3, CTCF, and present(dnasei), present(h3k36me3), present(ctcf).)
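To show how a geometric length distribution interacts with hard minimum and maximum lengths, the sketch below computes a segment length distribution under one plausible reading: a segment must last at least `min_len` positions, then continues with probability `stay` at each step, and is forced to end at `max_len`. This is an illustration and not Segway's implementation; the function name and the parameter values are hypothetical, and the Dirichlet prior is not modeled.

```python
def segment_length_pmf(stay, min_len, max_len):
    """Probability of each segment length under a truncated geometric model."""
    pmf = {}
    for length in range(min_len, max_len):
        # Survive (length - min_len) optional steps, then leave.
        pmf[length] = (1 - stay) * stay ** (length - min_len)
    # All remaining probability mass ends at the forced maximum.
    pmf[max_len] = stay ** (max_len - min_len)
    return pmf

# Hypothetical values for illustration.
pmf = segment_length_pmf(stay=0.9, min_len=3, max_len=10)
mean_length = sum(length * p for length, p in pmf.items())
```

Lengths below the minimum get zero probability, and the probabilities over the allowed lengths sum to 1.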
Segway A way to segment the genome http://noble.gs.washington.edu/proj/segway/ Hoffman MM et al. 2012. Nat Methods 9:473.
(Figure: cell lineage tree with nodes embryoblast, mesendoderm, endoderm, mesoderm, lateral mesoderm, intermediate mesoderm, hemangioblast, myeloid progenitor, hemocytoblast, lymphoid progenitor, lymphoblast, liver, blood vessel endothelium, cervix.) Cell types: H1-hESC, embryonic stem cell; HepG2, hepatocellular carcinoma cell; HUVEC, umbilical vein endothelial cell; K562, chronic myeloid leukemia cell; GM12878, lymphoblastoid cell; HeLa-S3, cervical carcinoma cell.
Input tracks 49 tracks ENCODE K562 49 ChIP-seq DNase-seq FAIRE-seq 8 different labs
Picking the number of labels: 25 labels (0 to 24).
Emission parameters. Each cell represents a Gaussian. Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue. Standard deviation is proportional to the length of the black bar. (Figure: heat map with one column per label, 0 to 24.)
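The row normalization described on this slide can be sketched as a min-max rescaling of each track's means across labels, so that the lowest mean maps to 0 (blue) and the highest to 1 (red). The exact scaling used for the figure is not stated, so the formula, the function name, and the example values are assumptions.

```python
def row_normalize(means_by_track):
    """Rescale each track's per-label means to the range [0, 1]."""
    normalized = {}
    for track, means in means_by_track.items():
        lo, hi = min(means), max(means)
        span = hi - lo
        # A track with identical means across labels maps to all zeros.
        normalized[track] = [(m - lo) / span if span else 0.0 for m in means]
    return normalized

# Hypothetical means for 3 labels on 2 tracks.
scaled = row_normalize({"DNaseI": [1.0, 3.0, 5.0], "CTCF": [2.0, 2.0, 2.0]})
```

Normalizing within a row makes labels comparable for one track, but the colors are not comparable between tracks.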
TSS: transcription start site; GS: gene start; GM: gene middle; GE: gene end; E: enhancer; I: insulator; R: repression; D: dead.
Transcription start site (TSS) Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Rediscovering genes
Zooming out 10: TSS segments occur near 5′ ends of genes. TSS/G* segments are missing in gene deserts. R*/D* segments occur more in gene deserts.
3′ gene ends. Jason Ernst. Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
A puzzling region Lots of genes but very few TSS/GS segments. Why? Because these genes are not expressed in K562.
Experimental validation. Testing <1000 bp sequences for promoter activity: predicted + in K562, predicted − in K562, predicted + in GM12878, predicted − in GM12878. http://switchgeargenomics.com/products/promoter-reporter-collection/
Luciferase assay results Hoffman MM et al. 2012. Nat Methods 9:473.
Comparison with GWAS catalog. Bob Harris, Ross Hardison. Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Summary of results. Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables: a simple annotation with a single label for each part of the genome; visualization that reduces multivariate data to a comprehensible representation; and interpretation of the context and potential regulatory impact of variants.
Software availability
Segway (data tracks, segmentation). Hoffman MM et al. 2012. Nat Methods 9:473. http://noble.gs.washington.edu/proj/segway/
Segtools (segmentation, plots and summary statistics). Buske OJ et al. 2011. BMC Bioinformatics 12:415. http://noble.gs.washington.edu/proj/segtools/
Genomedata (efficient access to numeric data anchored to genome). Hoffman MM et al. 2010. Bioinformatics 26:1458. http://noble.gs.washington.edu/proj/genomedata/
Acknowledgments: Bill Noble, Jeff Bilmes, Orion Buske, Paul Ellenbogen. University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group. University of Massachusetts Medical School: Zhiping Weng. SwitchGear Genomics: Patrick Collins. Stanford University: Anshul Kundaje. Pennsylvania State University: Ross Hardison, Bob Harris. European Bioinformatics Institute: Ewan Birney, Ian Dunham. University of California, Santa Cruz: Kate Rosenbloom, Brian Raney. Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis. CRG: Sarah Djebali. RIKEN: Timo Lassmann. ENCODE Project Consortium. NIH/NHGRI: K99HG006259, U54HG004695.