Module 2: Genome Viewing
Using Genome Browsers to View Annotation of the Human Genome
  Bert Overduin, Ph.D.
  PANDA Coordination & Outreach
  EMBL - European Bioinformatics Institute
  Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Why Genome Browsers?
  Browse genes in their genomic context
  Display features in and around a particular gene
  Explore larger chromosomal regions
  Search and retrieve information on a genome-wide scale
  Compare genomes
Genome Browsers
  Ensembl Genome Browser: http://www.ensembl.org
  NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview
  UCSC Genome Browser: http://genome.ucsc.edu
What Distinguishes Ensembl?
  Automatic annotation for those species for which no manually annotated gene sets exist
  Data mining tool BioMart
  Direct database access and programmatic access via the Perl API
  Not only the data, but also the software code is open source
Ensembl - Organisation
  Joint project between the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute (WTSI)
  Started in 1999 for the Human Genome Project
  Funded primarily by the Wellcome Trust, with additional funding by EMBL, NIH-NHGRI, NIH-NIAID, BBSRC, MRC and EU
  Team of ca. 50 people, led by Ewan Birney (EBI) and Tim Hubbard (WTSI)
Ensembl - Species
  48 chordates, ranging from human to two Ciona species
  3 key eukaryote model organisms:
    Drosophila melanogaster
    Caenorhabditis elegans
    Saccharomyces cerevisiae
http://www.ensemblgenomes.org
  Aedes aegypti, Anopheles gambiae, Culex quinquefasciatus, 12 Drosophila species, 5 Caenorhabditis species, Ixodes scapularis
  Plasmodium falciparum, Plasmodium knowlesi, Plasmodium vivax
  Bacillus, Escherichia/Shigella, Mycobacterium, Neisseria, Pyrococcus, Staphylococcus, Streptococcus
  Arabidopsis lyrata, Arabidopsis thaliana, Brachypodium distachyon, Oryza sativa, Oryza sativa indica group, Populus trichocarpa, Sorghum bicolor, Vitis vinifera
  7 Aspergillus species, Neosartorya fischeri, Saccharomyces cerevisiae, Schizosaccharomyces pombe
Ensembl - Data
  Genomic sequence
  Gene / transcript / protein models
  External references
  Mapped cDNAs, proteins, microarray probes, BAC clones, cytogenetic bands, repeats, markers etc.
  Comparative data: orthologs and paralogs, protein families, whole genome alignments, syntenic regions
  Variation data: SNPs
  Regulatory data: best-guess set of regulatory elements
  Externally stored data (Distributed Annotation System)
Ensembl Gene Models
  Automatically annotated genes for the whole genome of all species (Ensembl genes)
  Manually annotated genes for part of the human and mouse genome (Vega/Havana genes)
Biological Evidence
All Ensembl gene models are based on evidence from:
  UniProtKB/Swiss-Prot: proteins, manually curated
  NCBI RefSeq: proteins and mRNAs, partially manually curated
  UniProtKB/TrEMBL: translations of EMBL-Bank CDSs, automatically annotated
  EMBL-Bank / GenBank / DDBJ: primary nucleotide sequence repositories
Ensembl Genebuild Genome assembly + Experimental evidence + Computer programs
Access to Data
  Release web site: http://www.ensembl.org
  Pre-release web site: http://pre.ensembl.org
  Archive web site: http://archive.ensembl.org
  BioMart: http://www.ensembl.org/biomart/martview and http://www.biomart.org/biomart/martview
  FTP site: ftp://ftp.ensembl.org
  Amazon Web Services: http://aws.amazon.com/publicdatasets
  MySQL: http://www.ensembl.org/info/data/mysql.html
  Perl API: http://www.ensembl.org/info/data/api.html
Ensembl Stable Identifiers
Human:
  ENSG###########    Ensembl Gene ID
  ENST###########    Ensembl Transcript ID
  ENSP###########    Ensembl Protein ID
  ENSE###########    Ensembl Exon ID
  ENSR###########    Ensembl Regulatory Feature ID
  ENSSNP###########  Ensembl SNP ID
  ENSFM###########   Ensembl Protein Family ID
Other species have a species code added after ENS:
  ENSMUSG########### A mouse (Mus musculus) gene
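The identifier scheme above can be illustrated with a small parser. This is a minimal sketch, not part of Ensembl's own software: the function name parse_stable_id and the returned dictionary keys are hypothetical, and the pattern assumes a species code of three or four capital letters (absent for human) followed by one of the feature-type codes listed on the slide and exactly 11 digits.

```python
import re

# Feature-type codes as listed on the slide.
FEATURE_TYPES = {
    "G": "gene",
    "T": "transcript",
    "P": "protein",
    "E": "exon",
    "R": "regulatory feature",
    "SNP": "SNP",
    "FM": "protein family",
}

# Assumption: optional species code of 3-4 capital letters (none for human),
# then a feature-type code, then exactly 11 digits.
STABLE_ID = re.compile(r"^ENS([A-Z]{3,4})?(FM|SNP|G|T|P|E|R)(\d{11})$")


def parse_stable_id(stable_id):
    """Split a stable ID into species code, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError("not a recognised stable ID: %r" % stable_id)
    species_code, type_code, number = match.groups()
    return {
        "species_code": species_code or "",  # empty string means human
        "feature_type": FEATURE_TYPES[type_code],
        "number": int(number),
    }


mouse_gene = parse_stable_id("ENSMUSG00000017167")
human_gene = parse_stable_id("ENSG00000139618")
print(mouse_gene)
print(human_gene)
```

The multi-letter codes (FM, SNP) are listed first in the pattern so that they are tried before the single-letter ones.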
Summary
  Genome browsers make the plain sequence more accessible
  Ensembl provides automatic genome annotation, yet is strongly based on experimental evidence from protein and cDNA sequences in public databases
  Ensembl links extensively to data sets from other species, as well as to external resources
Data Mining Ensembl with BioMart
BioMart
  Joint project between the European Bioinformatics Institute (EBI) and the Ontario Institute for Cancer Research (OICR)
  Originally developed for Ensembl (EnsMart)
  Website: http://www.biomart.org
Publicly Available Marts
  Ensembl, Ensembl Bacteria, Ensembl Metazoa, Ensembl Protists, Dictybase, Wormbase, Gramene, Europhenome, UniProt, InterPro, HGNC, Rat Genome Database, DroSpeGe, ArrayExpress DW, Eurexpress, HapMap, GermOnLine, PRIDE, PepSeeker, VectorBase, HTGT, Pancreatic Expression Database, Reactome, EU Rat Mart, Paramecium DB, International Potato Center (CIP)
  Central portal: http://www.biomart.org/biomart/martview
BioMart - Principle
  Step 1 Dataset: Choose your dataset and species
  Step 2 Filters: Limit your dataset
  Step 3 Attributes: Specify what information you want to output
  Step 4 Results: Preview and output your results
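The four steps map onto the XML query document that BioMart accepts. The following is a rough sketch that only builds such a document locally and does not contact any server. The helper build_biomart_query is hypothetical, and the dataset, filter and attribute names used (hsapiens_gene_ensembl, chromosome_name, ensembl_gene_id) are assumptions that should be checked against the mart being queried.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string from the first three steps."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",   # tab-separated results
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Step 1: choose the dataset (and thereby the species).
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # Step 2: filters limit the dataset.
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    # Step 3: attributes specify the output columns.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Example: genes on human chromosome 21 (names are assumptions, see above).
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "21"},
    attributes=["ensembl_gene_id", "chromosome_name"],
)
# Step 4 would submit this XML to the mart's web service and read the results.
print(query_xml)
```

Building the query as an element tree rather than by string concatenation keeps attribute values properly escaped.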
Summary
  BioMart is a highly flexible tool for data mining
  Queries are defined in just 4 steps: Dataset, Filters, Attributes and Results
  Genomic regions, gene identifiers, Gene Ontology terms and many other sources of information can serve as filters
  BioMart links extensively to data sets within Ensembl and provides links to external resources
Help
  Helpdesk: helpdesk@ensembl.org
  Mailing lists: ensembl-dev@ebi.ac.uk, ensembl-announce@ebi.ac.uk
  Blog: http://ensembl.blogspot.com
  YouTube channel: http://www.youtube.com/user/ensemblhelpdesk