The Ensembl Core databases and API
Useful links
- Installation instructions: http://www.ensembl.org/info/docs/api/api_installation.html
- Schema description: http://www.ensembl.org/info/docs/api/core/core_schema.html
- Tutorial: http://www.ensembl.org/info/docs/api/core/core_tutorial.html
- Documentation (Doxygen): http://www.ensembl.org/info/docs/doxygen/core-api/index.html
- Ensembl-dev mailing list: http://www.ensembl.org/info/about/contact/mailing.html
- Ensembl helpdesk: helpdesk@ensembl.org

Ensembl databases (MySQL)
- Species-specific databases:
  - core: genomic sequences and most annotation
  - variation: genetic variation
  - funcgen: regulatory elements
- Cross-species database:
  - compara: all comparative data

Public MySQL servers

            Ensembl                      Ensembl Genomes
host        ensembldb.ensembl.org        mysql.ebi.ac.uk
user        anonymous                    anonymous
password    -                            -
port        3306 (up to version 47)      4157
            5306 (version 48 onwards)

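The port rule for the Ensembl server can be written down as a small helper. This is only a sketch: ensembl_connection_args is a hypothetical name, not part of the Ensembl API, and it simply returns the host, user and port listed above in the argument style that load_registry_from_db accepts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): build the connection
# arguments for the public Ensembl server from a release number.
sub ensembl_connection_args {
    my ($release) = @_;

    # Port 3306 serves releases up to 47, port 5306 serves 48 onwards.
    my $port = $release <= 47 ? 3306 : 5306;

    return (
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
        -port => $port,
    );
}

my %args = ensembl_connection_args(66);
printf "%s:%d\n", $args{'-host'}, $args{'-port'};
```

The returned list could be passed straight to $registry->load_registry_from_db(...).
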
Ensembl Core databases
The Ensembl Core databases store:
- genomic sequence
- assembly information
- gene, transcript and protein models
- cDNA and protein alignments
- cytogenetic bands, markers, repeats, CpG islands etc.
- external references

Database naming, e.g. homo_sapiens_core_66_37:
- homo_sapiens: species
- core: group
- 66: software version (release)
- 37: assembly version

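The naming scheme can be taken apart with a regular expression. The sketch below is an illustration only: parse_db_name is a hypothetical helper, and the pattern assumes the species-specific groups named on these slides (core, variation, funcgen) followed by release and assembly version.

```perl
use strict;
use warnings;

# Hypothetical helper: split a species-specific database name into its parts.
# Returns a hash reference, or nothing if the name does not fit the pattern.
sub parse_db_name {
    my ($dbname) = @_;

    my ($species, $group, $release, $assembly) =
        $dbname =~ /^([a-z0-9_]+?)_(core|variation|funcgen)_(\d+)_(\w+)$/
        or return;

    return {
        species  => $species,
        group    => $group,
        release  => $release,
        assembly => $assembly,
    };
}

my $parts = parse_db_name('homo_sapiens_core_66_37');
print join(', ', map { "$_=$parts->{$_}" } qw(species group release assembly)), "\n";
```
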
MySQL
- Very good knowledge of the database schemas is needed
- Queries can quickly become very complex
- Not recommended (and not supported) for retrieving sequences

Ensembl Core Perl API
- Used to retrieve data from and store data in the Ensembl Core databases
- Written in object-oriented Perl
- Partly based on and compatible with BioPerl (version 1.2.3) objects (http://www.bioperl.org)
- Used by the Ensembl analysis and annotation pipeline and the Ensembl web code
- Robust, reliable and well-supported
- Forms the basis for the other Ensembl APIs

What do we need?
- Perl
- BioPerl 1.2.3 (this is not the latest BioPerl version!)
- Ensembl API: http://www.ensembl.org/info/docs/api/api_installation.html
- A text editor

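Versioning
- The API version must match the database version
- Old scripts using the API should continue working with a newer API!

[Figure: your Perl script run through API 65 against database 65 gives the output for e!65; the same script run through API 66 against database 66 gives the output for e!66]
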
Data objects
- Data objects model biological entities, e.g. genes, regulatory elements, variations
- Each data object encapsulates information from one or a few specific MySQL tables
- Name space: object modules start with Bio::EnsEMBL, e.g. Bio::EnsEMBL::Gene

Adaptors
Object adaptors
- Data objects are retrieved from and stored in the database using object adaptors
- Object adaptors are data object factories
- Each object adaptor is responsible for creating data objects of only one particular type
- Name space: object adaptor modules start with Bio::EnsEMBL::DBSQL, e.g. Bio::EnsEMBL::DBSQL::GeneAdaptor

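The two name spaces are related by a simple convention, which the following sketch spells out. adaptor_module_for is a hypothetical helper that only manipulates strings; it does not check that the resulting module actually exists in the API.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the conventional adaptor module name
# from a data object module name (string manipulation only).
sub adaptor_module_for {
    my ($object_module) = @_;

    my ($type) = $object_module =~ /^Bio::EnsEMBL::(\w+)$/
        or return;

    return "Bio::EnsEMBL::DBSQL::${type}Adaptor";
}

print adaptor_module_for('Bio::EnsEMBL::Gene'), "\n";
```

In a real script the Registry hands out the adaptor, so the module name rarely has to be typed by hand.
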
The Registry
- The Registry is an object adaptor factory
- Loads all databases of the same version as the API
- Loads lazily, so no database connections are made until they are requested

Each script should start like this

    #!/usr/bin/perl -w

    use strict;

    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    ## Load the databases into the registry
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    ## Get the object adaptor for the object you're interested in
    my $gene_adaptor  = $registry->get_adaptor('human', 'Core', 'Gene');
    my $slice_adaptor = $registry->get_adaptor('human', 'Core', 'Slice');

Coordinate systems
- Sequences stored in Ensembl are associated with sequence regions
- Sequence regions are linked to a distinct hierarchy of coordinate systems
- Coordinate systems vary from species to species:
  - human: chromosome, supercontig, clone, contig
  - zebrafish: chromosome, scaffold, contig
- Sequence information is directly stored in the database for the sequence level coordinate system
- The coordinate system of the highest level in a given region is the top level coordinate system

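The idea of a top level that depends on the region can be sketched with plain Perl data. Everything here is illustrative: the %hierarchy hash and top_level_for_region are hypothetical, and the sketch assumes the coordinate systems listed above are ordered from highest to lowest level.

```perl
use strict;
use warnings;

# Coordinate systems per species, assumed to be ordered from highest to lowest.
my %hierarchy = (
    human     => [qw(chromosome supercontig clone contig)],
    zebrafish => [qw(chromosome scaffold contig)],
);

# Hypothetical helper: given the coordinate systems in which a region is
# represented, return the highest one, i.e. its top level coordinate system.
sub top_level_for_region {
    my ($species, @present) = @_;

    my %is_present = map { $_ => 1 } @present;
    for my $coord_system (@{ $hierarchy{$species} }) {
        return $coord_system if $is_present{$coord_system};
    }
    return;
}

# A scaffold that has not been placed on a chromosome is itself top level.
print top_level_for_region('zebrafish', qw(contig scaffold)), "\n";
```
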
Coordinate systems
[Figure: a chromosome (top level coordinate system), the tiling path of clones, and the contigs (sequence level coordinate system) that carry the sequence itself]

CoordSystem object
Retrieve using CoordSystemAdaptor

Attribute   Example value(s)                       Method(s)
name        chromosome, scaffold, contig, clone    name
version     GRCh37, NCBI36, NCBIM37                version

Slices
- A slice represents an arbitrary region of a genome
- Slices are not directly stored in the database
- Slices are used to obtain sequences or features from a specific region in a specific coordinate system

Slice object
Retrieve using SliceAdaptor

Attribute                 Example value(s)                      Method(s)
coordinate system name    chromosome, scaffold, clone           coord_system_name
sequence region name      Y, Zv9_scaffold1219, AADC0109557.1    seq_region_name
start                     1                                     start
end                       59373566                              end
length                    59373566                              length
strand                    1, -1                                 strand
name                      chromosome:GRCh37:Y:1:59373566:1      name
sequence                  TGTTGTATTACGTTTCTTTGTTTAT...          seq

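The slice name packs most of the other attributes into one colon-separated string. As a sketch of how the fields line up, the hypothetical helper parse_slice_name below splits such a name without touching the database; the field order is taken from the example value above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a slice name of the form
# coord_system:version:seq_region:start:end:strand into its fields.
sub parse_slice_name {
    my ($name) = @_;

    my @fields = split /:/, $name, -1;
    return unless @fields == 6;

    my %slice;
    @slice{qw(coord_system_name version seq_region_name start end strand)} = @fields;

    # Coordinates are inclusive, so the length is end - start + 1.
    $slice{length} = $slice{end} - $slice{start} + 1;

    return \%slice;
}

my $slice = parse_slice_name('chromosome:GRCh37:Y:1:59373566:1');
print "$slice->{seq_region_name}: $slice->{length} bp\n";
```
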
Exercise 1
An easy exercise to get started:
Fetch the slice corresponding to base pairs 32890000 to 32891000 of human chromosome 13 and print its sequence.

What do you need first, when you want to retrieve a slice?

Have a look in the Doxygen documentation at the list of methods available for the object(s) you're using: http://www.ensembl.org/info/docs/doxygen/core-api/index.html

If you have time left:
Print the soft-masked and hard-masked version as well as the reverse complement of the above sequence.

Features
- Features have a defined location on the genome
- All features have a start, end, strand and slice
- The start coordinate of a feature is never greater than its end coordinate, irrespective of the strand on which it is located (exception: insertion features)
- Features are stored in a single coordinate system

Features

Object                                   Represent(s)
Gene, Transcript, Exon                   Ensembl gene models
PredictionTranscript, PredictionExon     Genscan gene models
DNAAlignFeature, ProteinAlignFeature     cDNAs, proteins
RepeatFeature                            repeats
MarkerFeature                            markers
OligoFeature                             microarray probes
KaryotypeBandFeature                     cytogenetic bands
SimpleFeature                            results of CpG, Eponine, FirstEF and tRNAscan
MiscFeature                              clones, ENCODE regions
ProteinFeature*                          protein domains

* protein relative

Inheritance
- Data objects inherit methods from their parent object
- So, for example, all methods that apply to the Feature object also apply to its children, i.e. the Gene object, the Transcript object, the Exon object etc.

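How a child object picks up its parent's methods can be shown with two toy classes. Toy::Feature and Toy::Gene are hypothetical stand-ins written for this illustration, not the real Bio::EnsEMBL modules, and the coordinates are example values only.

```perl
use strict;
use warnings;

# Toy parent class: knows about location only.
package Toy::Feature;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}
sub start  { $_[0]{start} }
sub end    { $_[0]{end} }
sub strand { $_[0]{strand} }

# Toy child class: adds a gene-specific method, inherits the rest.
package Toy::Gene;
use parent -norequire, 'Toy::Feature';

sub stable_id { $_[0]{stable_id} }

package main;

my $gene = Toy::Gene->new(
    stable_id => 'ENSG00000139618',
    start     => 32889611,    # example value
    end       => 32973805,    # example value
    strand    => 1,
);

# start, end and strand are defined in Toy::Feature but work on the gene.
printf "%s %d-%d (%d)\n",
    $gene->stable_id, $gene->start, $gene->end, $gene->strand;
```
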
Feature object
Retrieve by using FeatureAdaptor
Retrieve from Slice

Attribute       Example value(s)                            Method(s)
name            AluSp, D13S1788                             display_id
coordinates     13                                          seq_region_name
                22398                                       start (slice relative)
                22594                                       end (slice relative)
                32912008                                    seq_region_start (chromosome relative)
                32912204                                    seq_region_end (chromosome relative)
                1                                           strand
sequence        GATTGGTCAGGTAGACAGCAGCAAG...                seq
length          196                                         length
slice           returns Slice object with which             slice
                feature is associated
feature slice   returns Slice object that covers feature    feature_slice

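The relation between slice-relative and chromosome-relative coordinates is plain arithmetic. The sketch below is an illustration under assumptions: to_seq_region_coords is a hypothetical helper working on plain hashes, and the slice start of 32889611 is inferred from the example values above (32912008 - 22398 + 1), not stated on the slide.

```perl
use strict;
use warnings;

# Hypothetical helper: convert slice-relative feature coordinates to
# coordinates on the underlying sequence region.
sub to_seq_region_coords {
    my ($slice, $feature) = @_;

    my ($start, $end);
    if ($slice->{strand} == 1) {
        # Forward-strand slice: offset by the slice start.
        $start = $slice->{start} + $feature->{start} - 1;
        $end   = $slice->{start} + $feature->{end}   - 1;
    }
    else {
        # Reverse-strand slice: count back from the slice end.
        $start = $slice->{end} - $feature->{end}   + 1;
        $end   = $slice->{end} - $feature->{start} + 1;
    }

    return {
        start  => $start,
        end    => $end,
        strand => $feature->{strand} * $slice->{strand},
    };
}

# Slice start and end are assumptions chosen to reproduce the example values.
my $slice   = { start => 32889611, end => 32973805, strand => 1 };
my $feature = { start => 22398,    end => 22594,    strand => 1 };

my $abs = to_seq_region_coords($slice, $feature);
print "$abs->{start}-$abs->{end}\n";    # prints 32912008-32912204
```
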
Exercise 2
Get the repeats on the sequence you retrieved in Exercise 1. Print the name of each repeat and its relative (slice) and absolute (chromosomal) coordinates.

Is there anything that strikes you with regard to the coordinates of the repeats?

Genes, transcripts, translations
- Genes, transcripts and exons are features
- Introns are not explicitly defined in the database
- Translations are not features
- Protein sequences are not stored in the database, but computed on the fly using transcript objects

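Because introns are not stored, they have to be derived from the exons of a transcript. The following sketch shows the idea on plain hashes; introns_from_exons is a hypothetical helper and the coordinates are made up. It works in genomic order and ignores strand, so it is not a replacement for the API's own intron handling.

```perl
use strict;
use warnings;

# Hypothetical helper: derive intron coordinates as the gaps between
# consecutive exons, sorted by genomic start.
sub introns_from_exons {
    my @exons = sort { $a->{start} <=> $b->{start} } @_;

    my @introns;
    for my $i (0 .. $#exons - 1) {
        my $start = $exons[$i]{end} + 1;
        my $end   = $exons[ $i + 1 ]{start} - 1;

        # Skip adjacent or overlapping exons, which leave no gap.
        push @introns, { start => $start, end => $end } if $start <= $end;
    }
    return @introns;
}

# Made-up exon coordinates, deliberately given out of order.
my @introns = introns_from_exons(
    { start => 301, end => 400 },
    { start => 100, end => 200 },
    { start => 501, end => 600 },
);
print "$_->{start}-$_->{end}\n" for @introns;
```
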
Gene object
Retrieve by using GeneAdaptor
Retrieve from Slice

Attribute              Example value(s)                          Method(s)
stable ID              ENSG00000139618                           stable_id
name                   BRCA2                                     external_name
description            breast cancer 2, early onset              description
biotype                protein_coding, miRNA                     biotype
analysis               ensembl, havana, ensembl_havana_gene      analysis->logic_name
status                 KNOWN, NOVEL                              status
transcripts            returns listref of Transcript objects     get_all_transcripts
exons                  returns listref of Exon objects           get_all_exons
canonical transcript   returns Transcript object                 canonical_transcript

Transcript object
Retrieve by using TranscriptAdaptor
Retrieve from Slice or Gene

Attribute          Example value(s)                                Method(s)
stable ID          ENST00000380152                                 stable_id
name               BRCA2-001                                       external_name
biotype            protein_coding, nonsense_mediated_decay         biotype
analysis           ensembl, havana, ensembl_havana_transcript      analysis->logic_name
status             KNOWN, NOVEL                                    status
CDS                ATGCCTATTGGATCCAAAGAGAGGC...                    translateable_seq
UTRs               returns Seq object                              five_prime_utr, three_prime_utr
spliced sequence   GGGCTTGTGGCGCGAGCTTCTGAAA...                    spliced_seq

Transcript object (continued)

Attribute     Example value(s)                      Method(s)
translation   returns Translation object            translation
exons         returns listref of Exon objects       get_all_exons
introns       returns listref of Intron objects     get_all_introns
canonical     0, 1                                  is_canonical

Exon object
Retrieve by using ExonAdaptor
Retrieve from Slice, Gene or Transcript

Attribute   Example value(s)    Method(s)
stable ID   ENSE00001184784     stable_id

Translation object
Retrieve by using TranslationAdaptor
Retrieve from Transcript

Attribute   Example value(s)                Method(s)
stable ID   ENSP00000369497                 stable_id
length      3418                            length
sequence    MPIGSKERPTFFEIFKTRCNKADLG...    seq

Exercise 3
Write a script to retrieve the upstream sequences for a list of Ensembl Gene IDs.

The script should take as input (from the command line):
- the species
- the length of the upstream sequence
- the name of the file containing the Ensembl Gene IDs

and give as output:
- a file containing the upstream sequences in FASTA format

Take into account that a gene can be either on the forward or the reverse strand of the genome!

Use as input your own file of Ensembl Gene IDs, or use the file 100_human_genes.txt in /homes/evopadmin/ensembl.

External references
- External references (Xrefs) are cross references to identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, OMIM etc.
- External references can be on the gene, transcript or protein level

DBEntry object
Retrieve by using DBEntryAdaptor
Retrieve from Gene, Transcript or Translation

Attribute       Example value(s)                   Method(s)
database name   HGNC, Uniprot_SWISSPROT, EMBL      dbname
name            BRCA2, BRCA2_HUMAN, AF489725       display_id

DBAdaptor object
Retrieve from Registry

Attribute             Example value(s)                                       Method(s)
database name         homo_sapiens_core_66_37, danio_rerio_variation_66_9    dbname
database group        core, variation, compara, funcgen                      group
database species      homo_sapiens, danio_rerio                              species
database connection   returns DBConnection object                            dbc
object adaptors       returns ObjectAdaptor                                  get_objectadaptor

Exercise 4
Write a script that, for all Ensembl species, gets the protein sequence of the canonical transcript of the genes that have been annotated with a given gene symbol.

The script should take as input (from the command line):
- the gene symbol

and give as output:
- a file containing the protein sequences in FASTA format, with the Ensembl Gene ID and the species name in the FASTA header

There are several ways to loop through the core databases for all species in Ensembl. You can use the DBAdaptor object or, if you feel adventurous, the GenomeDB object from the Compara API.