Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics. Christopher Benner, PhD, Director, Integrative Genomics and Bioinformatics Core (IGC). iDASH Webinar, April 17th, 2015
Overview for Webinar: Quick introduction to the wider world of next-generation sequencing (NGS); overview of HOMER, our software for NGS analysis; using advanced NGS assays to understand B cell development and the generation of antibody repertoires; quick teaser on how innovative NGS assays and genetics can enhance our understanding of transcriptional mechanisms.
Next-Generation Sequencing. Large consortia: 1000 Genomes Project, TCGA (cancer), and many more. Illumina sequencing can sequence any DNA fragment from 0-600 bp in length.
NGS Innovation: RNA-Seq (i.e., gene expression); GRO-Seq (i.e., transcription rates); ChIP-Seq (DNA:protein interactions).
Graphic from Illumina Inc.
HOMER (Hypergeometric Optimization of Motif EnRichment), http://homer.salk.edu. Next-generation sequencing analysis for quantitative genomics. Software suite for the UNIX command-line environment (works downstream of the manufacturer's pipeline and mapping to the reference genome). Quality control for experiments. Basic and advanced analysis, annotation, and visualization capabilities. General framework handles data from different types of quantitative sequencing (ChIP-Seq, RNA-Seq, GRO-Seq, DNase-Seq, etc.). Can work with any organism. Regulatory element analysis: de novo motif discovery; sorting out spatial relationships between sequence features.
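The "hypergeometric" in HOMER's name refers to the hypergeometric distribution, which can score how surprising it is to see a motif in many target sequences given how common it is overall. The following is a minimal sketch of that kind of enrichment test, not HOMER's actual implementation; the function name, argument names, and example counts are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_enrichment_pvalue(total_seqs, total_with_motif,
                                target_seqs, target_with_motif):
    """Probability of seeing at least `target_with_motif` motif-containing
    sequences when `target_seqs` sequences are drawn without replacement
    from `total_seqs` sequences, `total_with_motif` of which contain the motif.
    """
    denominator = comb(total_seqs, target_seqs)
    # The overlap cannot exceed either the target set or the motif set.
    upper = min(total_with_motif, target_seqs)
    tail = sum(
        comb(total_with_motif, k)
        * comb(total_seqs - total_with_motif, target_seqs - k)
        for k in range(target_with_motif, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical counts: 1000 sequences overall, 100 contain the motif;
    # 50 target sequences (e.g. peaks), 20 of which contain the motif.
    p = hypergeom_enrichment_pvalue(1000, 100, 50, 20)
    print(f"enrichment p-value: {p:.3g}")
```

A small p-value here means the motif appears in the target set more often than expected from its background frequency.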
Overview of HOMER
HOMER Functionality: Any organism with a FASTA file can be analyzed with HOMER. Model organisms are preconfigured with annotation information: human, mouse, rat, zebrafish, Drosophila, C. elegans, yeast, S. pombe, Arabidopsis. Genomes annotated on the UCSC Genome Browser are easy to incorporate, but any custom genome can be added with annotation files (e.g., GTF files).
HOMER Tutorials (on website)
Best way to develop NGS Analysis methods: Do it in the context of research! Biology Bioinformatics NGS Methods Development
Interplay between epigenetics, spatial genome conformation, and transcription in B-lymphocyte development
Why study the transition from pre-pro-B to pro-B cells? Lineage commitment: pro-B cells cannot dedifferentiate back to hematopoietic stem cells, whereas pre-pro-B cells can be used to reconstitute the whole immune system. Antibody recombination: pro-B cells are paused at the exact stage when VDJ recombination is set to occur. B cell marker expression: key cell-surface markers and transcription factors are induced in pro-B cells, including CD19, Ebf1 (early B cell factor), Pax5, and Foxo1.
Mapping the Epigenome
Unbiased Discovery of Regulatory Features in pro-B cells
Relationship between Transcription Factors and Epigenetic Modifications Transcription Factors
Unbiased Discovery of Lineage-Determining Transcription Factors: Ebf1- and E2A-deficient mice fail to make pro-B cells
Hi-C: Mapping 3D interactions in the genome. GRO-Seq. Hi-C method from Lieberman-Aiden et al., Science 2009
Most significant interactions in the genome occur at epigenetically modified locations
Cell-type specific interactions often change their DNA methylation status
Genome Organization into topological domains (pre-pro-B, pro-B). TAD definition by Dixon et al. 2012
CTCF binding site is directional: CTCF only makes interactions with other CTCF sites in a specific direction along the DNA, determined by the orientation of the motif. 5′ boundary of TAD; 3′ boundary of TAD
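The directionality rule can be sketched as a search for motif pairs that face each other. This is only an illustration under stated assumptions: a '+' strand motif is assumed to look toward higher coordinates and a '-' strand motif toward lower coordinates, so candidate interactions are a '+' site upstream of a '-' site. The `CtcfSite` type, `convergent_pairs` function, `max_distance` parameter, and coordinates are all hypothetical.

```python
from typing import List, NamedTuple, Tuple


class CtcfSite(NamedTuple):
    position: int  # coordinate on one chromosome
    strand: str    # '+' or '-' orientation of the CTCF motif


def convergent_pairs(sites: List[CtcfSite],
                     max_distance: int) -> List[Tuple[CtcfSite, CtcfSite]]:
    """Return (left, right) pairs where the left motif is '+' and the
    right motif is '-', i.e. the two motifs point toward each other."""
    ordered = sorted(sites, key=lambda s: s.position)
    pairs = []
    for i, left in enumerate(ordered):
        if left.strand != "+":
            continue  # assumed: a '-' site does not look downstream
        for right in ordered[i + 1:]:
            if right.position - left.position > max_distance:
                break  # sites are sorted, so later ones are farther still
            if right.strand == "-":
                pairs.append((left, right))
    return pairs


if __name__ == "__main__":
    example = [CtcfSite(100, "+"), CtcfSite(500, "-"),
               CtcfSite(900, "+"), CtcfSite(1500, "-")]
    for left, right in convergent_pairs(example, max_distance=1000):
        print(left.position, "->", right.position)
```

Under this assumption, the '+' site of each pair would sit at the 5′ boundary of a domain and the '-' site at the 3′ boundary.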
Clusters of CTCF sites form Super Anchors (pre-pro-B, pro-B)
Clusters of CTCF sites form Super Anchors (Igh, Firre, Foxo1). Borrowing from Richard Young's Super Enhancer concept, we can define over 2500 CTCF super anchors in the data. Only 25% of CTCF sites are found at boundaries; however, nearly 50% of Super Anchors are found at the boundaries of topological domains.
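One way to picture a super-enhancer-style definition applied to CTCF sites is to stitch nearby sites into clusters, rank the clusters by total signal, and keep those past the point where the ranked curve turns sharply upward. The sketch below assumes that approach; it is not the method used to produce the numbers on the slide. The stitching distance, the slope-1 cutoff rule, the function names, and the input values are all illustrative assumptions.

```python
def stitch_sites(sites, stitch_distance):
    """Merge (position, signal) sites on one chromosome into clusters when
    consecutive sites lie within `stitch_distance` bp of each other."""
    clusters = []
    for position, signal in sorted(sites):
        if clusters and position - clusters[-1]["end"] <= stitch_distance:
            clusters[-1]["end"] = position
            clusters[-1]["signal"] += signal
            clusters[-1]["n_sites"] += 1
        else:
            clusters.append({"start": position, "end": position,
                             "signal": signal, "n_sites": 1})
    return clusters


def call_super_anchors(clusters):
    """Rank clusters by signal, scale rank and signal to [0, 1], and keep
    clusters ranked above the point farthest below the diagonal (where a
    line of slope 1 touches the ranked curve)."""
    ranked = sorted(clusters, key=lambda c: c["signal"])
    n = len(ranked)
    if n < 2 or ranked[-1]["signal"] <= 0:
        return []
    top = ranked[-1]["signal"]
    gaps = [i / (n - 1) - c["signal"] / top for i, c in enumerate(ranked)]
    cutoff_index = max(range(n), key=gaps.__getitem__)
    return ranked[cutoff_index + 1:]


if __name__ == "__main__":
    # Hypothetical input: ten isolated sites plus one dense group of five.
    isolated = [(i * 100_000, 1.0) for i in range(10)]
    dense = [(2_000_000 + j * 1_000, 5.0) for j in range(5)]
    clusters = stitch_sites(isolated + dense, stitch_distance=12_500)
    for anchor in call_super_anchors(clusters):
        print(anchor)
```

With real data, the fraction of called clusters that overlap topological domain boundaries could then be compared with the fraction for all CTCF sites.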
Overview of Immunoglobulin Heavy Chain Locus
Igh Locus in the Genome (~3 Mb) Top Super Anchor
Igh Locus in the Genome (~3 Mb): To generate full repertoires of antibodies, each V region needs to find a way to interact with the D regions to recombine. Top Super Anchor
V regions in the Igh locus are associated with CTCF sites. In addition, each CTCF site associated with V regions is in a consistent orientation.
CTCF Orientation at D/J regions
Igh Locus Model: VD recombination target; Top Super Anchor (looping backstop)
Summary: NGS is a lot more than genome sequencing. Integration of different data types empowers discovery where any given data type alone falls short. The DNA sequence (CTCF motifs and their orientation) dictates the structure of the genome to accomplish critical tasks such as VDJ recombination.
Future Directions: Leveraging Genetics
Thanks!