Using Ensembl tools for browsing ENCODE data



Similar documents
Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome

Module 3. Genome Browsing. Using Web Browsers to View Genome Annota4on. Kers4n Howe Wellcome Trust Sanger Ins4tute zfish-

GMQL Functional Comparison with BEDTools and BEDOPS

Accessing the 1000 Genomes Data. Paul Flicek European BioinformaMcs InsMtute

Comparing Methods for Identifying Transcription Factor Target Genes

GeneProf and the new GeneProf Web Services

Module 1. Sequence Formats and Retrieval. Charles Steward

The Segway annotation of ENCODE data

On-line supplement to manuscript Galaxy for collaborative analysis of ENCODE data: Making large-scale analyses biologist-friendly

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

BIO 3352: BIOINFORMATICS II HYBRID COURSE SYLLABUS

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Yale Pseudogene Analysis as part of GENCODE Project

Discovery & Modeling of Genomic Regulatory Networks with Big Data

How To Use The Assembly Database In A Microarray (Perl) With A Microarcode) (Perperl 2) (For Macrogenome) (Genome 2)

Current Motif Discovery Tools and their Limitations

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Visualisation tools for next-generation sequencing

Delivering the power of the world s most successful genomics platform

LifeScope Genomic Analysis Software 2.5

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

New solutions for Big Data Analysis and Visualization

-> Integration of MAPHiTS in Galaxy

European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute

Computational localization of promoters and transcription start sites in mammalian genomes

THE UNIVERSITY OF MANCHESTER Unit Specification

Fast. Integrated Genome Browser & DAS. Easy. Flexible. Free. bioviz.org/igb

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

A Brief Introduction on DNase-Seq Data Aanalysis

The human gene encoding Glucose-6-phosphate dehydrogenase (G6PD) is located on chromosome X in cytogenetic band q28.

Gene Models & Bed format: What they represent.

A Primer of Genome Science THIRD

Using caching and optimization techniques to improve performance of the Ensembl website

Using Galaxy for NGS Analysis. Daniel Blankenberg Postdoctoral Research Associate The Galaxy Team

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

MotifLab: a tools and data integration workbench for motif discovery and regulatory sequence analysis

Biological Sciences Initiative. Human Genome

EMBL Identity & Access Management

School of Nursing. Presented by Yvette Conley, PhD

Analysis of ChIP-seq data in Galaxy

Work Package 13.5: Authors: Paul Flicek and Ilkka Lappalainen. 1. Introduction

Prof Brian McStay Wellcome Trust Senior Investigator Award April March 2020

mygenomatix - secure cloud for NGS analysis

GenBank, Entrez, & FASTA

Genome-wide measurements of protein-dna interaction by chromatin immunoprecipitation

Technical document. Section 1 Using the website to investigate a specific B. rapa BAC.

The Moroccan American Pharmaceutical Sciences & Education Network Group (PharMaSeng), University Hassan II MohammediaCasablanca & FST Mohammedia

INSECT: In silico search for co-occurring transcription factors

Frequently Asked Questions Next Generation Sequencing

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Activity 7.21 Transcription factors

BioInterchange 2.0 CODAMONO. Integrating and Scaling Genomics Data

Lecture 1 MODULE 3 GENE EXPRESSION AND REGULATION OF GENE EXPRESSION. Professor Bharat Patel Office: Science 2, b.patel@griffith.edu.

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Outline. MicroRNA Bioinformatics. microrna biogenesis. short non-coding RNAs not considered in this lecture. ! Introduction

Human-Mouse Synteny in Functional Genomics Experiment

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

NCBI resources III: GEO and ftp site. Yanbin Yin Spring 2013

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Faculty of Medicine. Settore disciplinare: BIO/10. functional domains. Monica Soldi. IFOM-IEO Campus, Milan. Matricola n. R08407

G E N OM I C S S E RV I C ES

Gene Switches A Model

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

How many of you have checked out the web site on protein-dna interactions?

EPIGENETICS DNA and Histone Model

Processing Genome Data using Scalable Database Technology. My Background

MeDIP-chip service report

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Importance of Statistics in creating high dimensional data

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

The world of non-coding RNA. Espen Enerly

jchip: a graphical environment for exploratory ChIP-Seq data analysis

Control of Gene Expression

Linking the Epigenome to the Genome: Correlation of Different Features to DNA Methylation of CpG Islands

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Bioinformatics Grid - Enabled Tools For Biologists.

PREDA S4-classes. Francesco Ferrari October 13, 2015

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Genomes and SNPs in Malaria and Sickle Cell Anemia


Data Sharing Initiative: International Cancer Genome Consortium

GENE REGULATION. Teacher Packet

Bioinformatics Resources at a Glance

TITLE MOTIVATION OBJECTIVES AUDIENCE COURSE INSTRUCTORS. Analysis of regulatory sequences controlling the expression of gene networks

European Medicines Agency

AP BIOLOGY 2009 SCORING GUIDELINES

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

The Galaxy workflow. George Magklaras PhD RHCE

Data search and visualization tools at the Comparative Evolutionary Genomics of Cotton Web resource

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Practical Course on. Bioinformatics and Comparative Genomes Analyses. University of La Réunion 2013 May 6-18

Global Alliance. Ewan Birney Associate Director EMBL-EBI

Next generation sequencing (NGS)

BIO 182 General Biology (Majors) II with Lab. Course Package

Argos, a replicable genome information system for FlyBase, eugenes and other databases.

Data Analysis for Ion Torrent Sequencing

Transcription:

Using Ensembl tools for browsing ENCODE data Bert Overduin, Ph.D. Vertebrate Genomics Team EMBL - European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD United Kingdom

Outline Presentation Introduction to Ensembl ENCODE data hub Ensembl Regulatory Build Regulatory segmentation Adding custom tracks BioMart Variant Effect Predictor Worked examples Browser / Regulation BioMart Variation Effect Predictor Hands-on exercises

Ensembl - Goal To provide automatic annotation of completely sequenced vertebrate genomes To integrate this annotation with other available biological data To make all this information available to the scientific community

Species v71 (Apr 2013): 70 genomes (61 fully supported, 9 on Pre!) Primates Rodents etc. Laurasiatheria Afrotheria Xenartha Other mammals Birds & reptiles Amphibians Fish Other chordates Other eukaryotes On Pre! Ensembl

Data Genomic sequence Gene / transcript / protein models External references Mapped cdnas, proteins, microarray probes, BAC clones, cytogenetic bands, repeats, markers etc. etc. Variation data Sequence / short variants Structural variants Comparative data Gene trees, orthologues, paralogues, protein families Pairwise and multiple whole genome alignments, syntenic regions Regulatory data ENCODE data Regulatory Build: best guess set of regulatory data

Access to data Ensembl web site Pre! web site Archive! web site http://www.ensembl.org http://pre.ensembl.org http://archive.ensembl.org BioMart http://www.ensembl.org/biomart/martview FTP site Amazon Web Services MySQL Perl API REST API ftp://ftp.ensembl.org http://aws.amazon.com/publicdatasets http://www.ensembl.org/info/data/mysql.html http://www.ensembl.org/info/data/api.html http://beta.rest.ensembl.org

Official (cloud-based) mirrors United States West Coast http://uswest.ensembl.org United States East Coast http://useast.ensembl.org Asia http://asia.ensembl.org Geo-IP-based redirection

ENCODE data hub Data from ENCODE integrative analysis (http://genome.ucsc.edu/encode/analysis.html ) Can be attached as a TrackHub http://ftp.ebi.ac.uk/pub/databases/ensembl/encode/ integration_data_jan2011/hub.txt > 2800 tracks

Ensembl Regulatory Build Provides a single best guess set of regulatory features For human and mouse Created by overlap analysis of annotations from genome-wide data sets in a two stage cell type aware manner http://www.ensembl.org/info/docs/funcgen/index.html

Regulatory Build data Focus features (define potential binding sites) 13 cell types Open chromatin (DNase1, FAIRE) ENCODE CTCF (insulator/enhancer) binding sites Roadmap Epigenomics Binding sites for 90 transcription factors Attribute features 42 Histone modifications (methylation, acetylation) RNA Pol II and III binding sites Meta data: http://www.ensembl.org/homo_sapiens/experiment/sources Focus features (define potential binding sites) 5 cell types Open chromatin (DNase1) ENCODE CTCF (insulator/enhancer) binding sites Binding sites for 22 transcription factors Attribute features 8 Histone modifications (methylation, acetylation) RNA Pol II binding sites Meta data: http://www.ensembl.org/mus_musculus/experiment/sources

Regulatory Build procedure Regulatory feature construction: Identify core regions across all available cell types using focus features Extend core regions in a cell type specific manner using attribute features Regulatory feature annotation: Classify regulatory features Annotate the position of putative TFBSs using position weight matrices (PWMs) taken from the JASPAR database

Regulatory feature construction Focus features DNase1 Cell type 1 CTCF Cell type 1 Taf1 Cell type 1 DNase1 Cell type 2 CTCF Cell type 2 Attribute features H3K4me1 Cell type 1 H3K4me1 Cell type 3 H3K4me3 Cell type 3 H3K27ac Cell type 3 MultiCell reg feature Cell type 1 reg feature Cell type 2 reg feature Cell type 3 reg feature

Regulatory feature annotation Promoter Associated Gene Associated Non-gene Associated Polymerase III Associated Unclassified Patterns over-represented in the region of the transcription start site plus or minus 2500 bp upstream of protein coding genes, but not in the downstream gene body. Likely to be a 5' proximal promoter. Patterns over-represented in gene bodies. Often represent gene's transcriptional activity (expressed/ repressed). Patterns over-represented in non-gene regions. Likely to correspond to a distal regulatory element such as an insulator or enhancer. Patterns over-represented in regions 2500 bp upstream of PolIII transcribed regions e.g. trnas. Likely to correspond to a proximal regulatory element specifically associated to Polymerase III transcription. Patterns which are currently unclassifiable.

Regulatory feature annotation signal peak ChiP-Seq signal for transcription factor MAX regulatory feature Position Weight Matrix for MAX from JASPAR database

Regulatory segmentation Provides a summary of the functional architecture (or state ) of the human genome 6 cell types 14 assays, constituting 3 classes of data: open chromatin, transcription factors, histone modifications Segmentations produced using 2 programs: ChromHMM and Segway Segments classified into 7 classes

Regulatory segmentation

Regulatory segmentation CTCF WE T E PF R TSS CTCF enriched Predicted Weak Enhancer/Cis-reg element Predicted Transcribed Region Predicted Enhancer Predicted Promoter Flank Predicted Repressed/Low Activity Predicted Promoter with TSS

Adding custom tracks Upload data Data saved by Ensembl 5 MB limit (and therefore not possible for large file formats) Attach remote file No size limit URL-based (http or ftp) Uploaded / attached tracks can be saved (to account) Uploaded / attached tracks can be shared with other users Only trivial security, do not use for sensitive data

Adding custom tracks Possible formats: BAM sequence alignments (no upload) BED genes / features BedGraph continuous-valued data BigBed genes / features (no upload) BigWig continuous-valued data (no upload) TrackHub collection of tracks (no upload) GBrowse genes / features GFF / GTF genes / features PSL sequence alignments VCF variants (no upload) WIG continuous-valued data

BioMart Data retrieval tool Originally developed for Ensembl (EnsMart) Now used by many large data resources Integrated with several widely used software packages, e.g. Galaxy (https://main.g2.bx.psu.edu/) Central portal: http://www.biomart.org

BioMart - Principle Step 1 - Dataset Choose your dataset and species Step 2 - Filters Limit your dataset Step 3 - Attributes Specify what information you want to output Step 4 - Results Preview and output your results

Variant Effect Predictor (VEP) Predicts functional consequences of variants on Ensembl genes Annotates SNPs, indels and structural variants Accessible through web interface, REST server or Perl script Accepts tab-delimited, VCF and pileup format, variant identifiers and HGVS notations (on genome and Ensembl and RefSeq transcripts) as input SIFT and PolyPhen predictions for missense variants included for human http://www.ensembl.org/info/docs/variation/vep/index.html

Workshops Browser (0.5-2 days) and API (1-3 days) workshops Combination of lectures and hands-on exercises Advertised on http://www.ensembl.info/workshops/calendar/ You can host your own workshop! For academic institutions there is no fee, apart from the instructor s expenses You only need a computer room and participants You can get more info from helpdesk@ensembl.org or me (bert@ebi.ac.uk)

Help & Keeping in touch Helpdesk helpdesk@ensembl.org Mailing lists http://www.ensembl.org/info/about/contact/mailing.html YouTube and YouKu (优酷网) channels: http://www.youtube.com/user/ensemblhelpdesk http://u.youku.com/user_show/uid_ensemblhelpdesk Blog http://www.ensembl.info Facebook http://www.facebook.com/ensembl.org Twitter http://twitter.com/ensembl

Nucl. Acids Res. (2012) doi: 10.1093/nar/gks1236. Ensembl 2013. Paul Flicek, Ikhlak Ahmed, M. Ridwan Amode, Daniel Barrell, Kathryn Beal, Simon Brent, Denise Carvalho-Silva, Peter Clapham, Guy Coates, Susan Fairley, Stephen Fitzgerald, Laurent Gil, Carlos García-Girón, Leo Gordon, Thibaut Hourlier, Sarah Hunt, Thomas Juettemann, Andreas K. Kähäri, Stephen Keenan, Monika Komorowska, Eugene Kulesha, Ian Longden, Thomas Maurel, William M. McLaren, Matthieu Muffato, Rishi Nag, Bert Overduin, Miguel Pignatelli, Bethan Pritchard, Emily Pritchard, Harpreet Singh Riat, Graham R. S. Ritchie, Magali Ruffier, Michael Schuster, Daniel Sheppard, Daniel Sobral, Kieron Taylor, Anja Thormann, Stephen Trevanion, Simon White, Steven P. Wilder, Bronwen L. Aken, Ewan Birney, Fiona Cunningham, Ian Dunham, Jennifer Harrow, Javier Herrero, Tim J. P. Hubbard, Nathan Johnson, Rhoda Kinsella, Anne Parker, Giulietta Spudich, Andy Yates, Amonida Zadissa and Stephen M. J. Searle

Funding European Commission Framework Programme 7