The Ensembl Core databases and API

Useful links

- Installation instructions: http://www.ensembl.org/info/docs/api/api_installation.html
- Schema description: http://www.ensembl.org/info/docs/api/core/core_schema.html
- Tutorial: http://www.ensembl.org/info/docs/api/core/core_tutorial.html
- Documentation (Doxygen): http://www.ensembl.org/info/docs/doxygen/core-api/index.html
- Ensembl-dev mailing list: http://www.ensembl.org/info/about/contact/mailing.html
- Ensembl helpdesk: helpdesk@ensembl.org

Ensembl databases (MySQL)

Species-specific databases:
- core: genomic sequences and most annotation
- variation: genetic variation
- funcgen: regulatory elements

Cross-species database:
- compara: all comparative data

Public MySQL servers

| Server | Host | User | Password | Port |
|---|---|---|---|---|
| Ensembl | ensembldb.ensembl.org | anonymous | none | 3306 (up to version 47), 5306 (version 48 onwards) |
| Ensembl Genomes | mysql.ebi.ac.uk | anonymous | none | 4157 |

Ensembl Core databases

The Ensembl Core databases store:
- genomic sequence
- assembly information
- gene, transcript and protein models
- cDNA and protein alignments
- cytogenetic bands, markers, repeats, CpG islands etc.
- external references

Database names are built from four parts, e.g. homo_sapiens_core_66_37:
- homo_sapiens: species
- core: group
- 66: software version (release)
- 37: assembly version
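The naming scheme can be taken apart with a regular expression. This is a minimal sketch in plain Perl, not part of the Ensembl API: the helper name parse_db_name is hypothetical, and the list of groups is limited to the species-specific ones named in these slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a species-specific Ensembl database name into its four parts.
# Hypothetical helper; only the groups core, variation and funcgen are handled.
sub parse_db_name {
    my ($db_name) = @_;
    if ( $db_name =~ /^([a-z0-9_]+)_(core|variation|funcgen)_(\d+)_(\w+)$/ ) {
        return { species => $1, group => $2, release => $3, assembly => $4 };
    }
    return;    # not a recognised species-specific database name
}

for my $name (qw(homo_sapiens_core_66_37 danio_rerio_variation_66_9)) {
    my $parts = parse_db_name($name) or next;
    printf "%-28s species=%s group=%s release=%s assembly=%s\n",
        $name, @{$parts}{qw(species group release assembly)};
}
```

Names that do not follow the pattern (for example a cross-species compara database) make the helper return nothing, so callers can skip them.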

MySQL

- Very good knowledge of database schemas needed
- Queries can quickly become very complex
- Not recommended (and not supported) to retrieve sequences

Ensembl Core Perl API

- Used to retrieve data from and store data in the Ensembl Core databases
- Written in Object-Oriented Perl
- Partly based on and compatible with BioPerl (version 1.2.3) objects (http://www.bioperl.org)
- Used by the Ensembl analysis and annotation pipeline and the Ensembl web code
- Robust, reliable and well-supported
- Forms the basis for the other Ensembl APIs

What do we need?

- Perl
- BioPerl 1.2.3 (this is not the latest BioPerl version!)
- Ensembl API: http://www.ensembl.org/info/docs/api/api_installation.html
- A text editor

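Versioning

- API version must match database version
- Old scripts using the API should continue working with a newer API!

[Diagram: your Perl script, run through API 65 against the release 65 databases, gives output for e!65; run through API 66 against the release 66 databases, it gives output for e!66.]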

Data objects

- Data objects model biological entities, e.g. genes, regulatory elements, variations
- Each data object encapsulates information from one or a few specific MySQL tables
- Name space: object modules start with Bio::EnsEMBL, e.g. Bio::EnsEMBL::Gene

Adaptors

Object adaptors

- Data objects are retrieved from and stored in the database using object adaptors
- Object adaptors are data object factories
- Each object adaptor is responsible for creating data objects of only one particular type
- Name space: object adaptor modules start with Bio::EnsEMBL::DBSQL, e.g. Bio::EnsEMBL::DBSQL::GeneAdaptor

The Registry

- The Registry is an object adaptor factory
- Loads all databases of the same version as the API
- Lazy loads, so no connections are made until requested

Each script should start like this:

    #!/usr/bin/perl -w

    use strict;

    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    ## Load the databases into the registry
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    ## Get the object adaptor for the object you're interested in
    my $gene_adaptor  = $registry->get_adaptor('human', 'Core', 'Gene');
    my $slice_adaptor = $registry->get_adaptor('human', 'Core', 'Slice');

Coordinate systems

- Sequences stored in Ensembl are associated with sequence regions
- Sequence regions are linked to a distinct hierarchy of coordinate systems
- Coordinate systems vary from species to species:
  - human: chromosome, supercontig, clone, contig
  - zebrafish: chromosome, scaffold, contig
- Sequence information is directly stored in the database for the sequence level coordinate system
- The coordinate system of the highest level in a given region is the top level coordinate system

[Figure: coordinate systems. Chromosome (top level coordinate system), clones (tiling path), contigs carrying the sequence (sequence level coordinate system).]

CoordSystem object

Retrieve using CoordSystemAdaptor

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| name | chromosome, scaffold, contig, clone | name |
| version | GRCh37, NCBI36, NCBIM37 | version |
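As an illustration, the coordinate systems of a species can be listed by fetching all CoordSystem objects and calling the two methods from the table. This is a sketch, not an official example: coord_system_label is a hypothetical helper, and the query part assumes the Ensembl Core API is installed and the public server is reachable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: format a coordinate system as "name (version)".
# The version can be empty, e.g. for contigs.
sub coord_system_label {
    my ($cs) = @_;
    my $version = $cs->version;
    return ( defined $version && length $version )
        ? sprintf( '%s (%s)', $cs->name, $version )
        : $cs->name;
}

# The query part runs only where the Ensembl Core API is installed.
if ( eval { require Bio::EnsEMBL::Registry; 1 } ) {
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $cs_adaptor = $registry->get_adaptor( 'human', 'Core', 'CoordSystem' );
    print coord_system_label($_), "\n" for @{ $cs_adaptor->fetch_all() };
}
else {
    warn "Ensembl Core API not found; skipping the database query\n";
}
```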

Slices

- A slice represents an arbitrary region of a genome
- Slices are not directly stored in the database
- Slices are used to obtain sequences or features from a specific region in a specific coordinate system

Slice object

Retrieve using SliceAdaptor

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| coordinate system name | chromosome, scaffold, clone | coord_system_name |
| sequence region name | Y, Zv9_scaffold1219, AADC0109557.1 | seq_region_name |
| start | 1 | start |
| end | 59373566 | end |
| length | 59373566 | length |
| strand | 1, -1 | strand |
| name | chromosome:GRCh37:Y:1:59373566:1 | name |
| sequence | TGTTGTATTACGTTTCTTTGTTTAT... | seq |
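A short sketch of how these methods might be used together. The region on chromosome 20 is picked arbitrarily for illustration, slice_summary is a hypothetical helper, and the query part assumes the Ensembl Core API is installed and the public server is reachable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: one-line summary built from the Slice methods listed above.
sub slice_summary {
    my ($slice) = @_;
    return sprintf '%s %s:%d-%d (strand %d, %d bp)',
        $slice->coord_system_name, $slice->seq_region_name,
        $slice->start, $slice->end, $slice->strand, $slice->length;
}

# The query part runs only where the Ensembl Core API is installed.
if ( eval { require Bio::EnsEMBL::Registry; 1 } ) {
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $slice_adaptor = $registry->get_adaptor( 'human', 'Core', 'Slice' );

    # Arbitrary illustrative region: 101 bp of chromosome 20.
    my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 1_000_000, 1_000_100 );
    print slice_summary($slice), "\n";
    print $slice->name, "\n";
    print $slice->seq,  "\n";
}
else {
    warn "Ensembl Core API not found; skipping the database query\n";
}
```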

Exercise 1

An easy exercise to get started: fetch the slice corresponding to base pairs 32890000 to 32891000 of human chromosome 13 and print its sequence.

What do you need first, when you want to retrieve a slice? Have a look in the Doxygen documentation at the list of methods available for the object(s) you're using: http://www.ensembl.org/info/docs/doxygen/core-api/index.html

If you have time left: print the soft-masked and hard-masked versions as well as the reverse complement of the above sequence.

Features

- Features have a defined location on the genome
- All features have a start, end, strand and slice
- The start coordinate of a feature is always less than its end coordinate, irrespective of the strand on which it is located (exception: insertion features)
- Features are stored in a single coordinate system

Features

| Object | Represent(s) |
|---|---|
| Gene, Transcript, Exon | Ensembl gene models |
| PredictionTranscript, PredictionExon | Genscan gene models |
| DNAAlignFeature, ProteinAlignFeature | cDNAs, proteins |
| RepeatFeature | repeats |
| MarkerFeature | markers |
| OligoFeature | microarray probes |
| KaryotypeBandFeature | cytogenetic bands |
| SimpleFeature | results of CpG, Eponine, FirstEF and tRNAscan |
| MiscFeature | clones, ENCODE regions |
| ProteinFeature* | protein domains |

* protein relative

Inheritance

- Data objects inherit methods from their parent object
- So, for example, all methods that apply to the Feature object also apply to its children, i.e. the Gene object, the Transcript object, the Exon object etc.

Feature object

Retrieve by using FeatureAdaptor
Retrieve from Slice

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| name | AluSp, D13S1788 | display_id |
| sequence region name | 13 | seq_region_name |
| start (slice relative) | 22398 | start |
| end (slice relative) | 22594 | end |
| start (chromosome relative) | 32912008 | seq_region_start |
| end (chromosome relative) | 32912204 | seq_region_end |
| strand | 1 | strand |
| sequence | GATTGGTCAGGTAGACAGCAGCAAG... | seq |
| length | 196 | length |
| slice | returns Slice object with which feature is associated | slice |
| feature slice | returns Slice object that covers feature | feature_Slice |
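The relation between slice-relative and chromosome-relative coordinates can be written out as plain arithmetic. The sketch below is not API code: to_seq_region_coords is a hypothetical helper that assumes 1-based, inclusive coordinates, and the slice start of 32889611 is the value implied by the example coordinates in the table (the slice end is made up for illustration).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert slice-relative feature coordinates to seq-region (e.g. chromosome)
# coordinates. Hypothetical helper; assumes 1-based, inclusive coordinates.
sub to_seq_region_coords {
    my ( $slice_start, $slice_end, $slice_strand, $start, $end ) = @_;
    if ( $slice_strand == 1 ) {
        return ( $slice_start + $start - 1, $slice_start + $end - 1 );
    }

    # Reverse-strand slice: positions count backwards from the slice end.
    return ( $slice_end - $end + 1, $slice_end - $start + 1 );
}

# Forward-strand slice starting at 32889611 (implied by the table values);
# the slice end is an illustrative value.
my ( $chr_start, $chr_end ) =
    to_seq_region_coords( 32889611, 32989610, 1, 22398, 22594 );
print "$chr_start-$chr_end\n";
```

With these inputs the helper reproduces the chromosome-relative start and end shown in the table. In real scripts the seq_region_start and seq_region_end methods do this work.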

Exercise 2

Get the repeats on the sequence you retrieved in Exercise 1. Print the name of each repeat and its relative (slice) and absolute (chromosomal) coordinates.

Is there anything that strikes you with regard to the coordinates of the repeats?

Genes, transcripts, translations

- Genes, transcripts and exons are features
- Introns are not explicitly defined in the database
- Translations are not features
- Protein sequences are not stored in the database, but computed on the fly using transcript objects

Gene object

Retrieve by using GeneAdaptor
Retrieve from Slice

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| stable ID | ENSG00000139618 | stable_id |
| name | BRCA2 | external_name |
| description | breast cancer 2, early onset | description |
| biotype | protein_coding, miRNA | biotype |
| analysis | ensembl, havana, ensembl_havana_gene | analysis->logic_name |
| status | KNOWN, NOVEL | status |
| transcripts | returns listref of Transcript objects | get_all_Transcripts |
| exons | returns listref of Exon objects | get_all_Exons |
| canonical transcript | returns Transcript object | canonical_transcript |
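The following sketch fetches the example gene from the table by its stable ID and walks over its transcripts. gene_summary is a hypothetical helper, and the query part assumes the Ensembl Core API is installed and the public server is reachable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: "stable ID, name [biotype]" for a Gene object.
sub gene_summary {
    my ($gene) = @_;
    return sprintf '%s %s [%s]',
        $gene->stable_id, $gene->external_name // '-', $gene->biotype;
}

# The query part runs only where the Ensembl Core API is installed.
if ( eval { require Bio::EnsEMBL::Registry; 1 } ) {
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $gene_adaptor = $registry->get_adaptor( 'human', 'Core', 'Gene' );
    my $gene         = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');

    print gene_summary($gene), "\n";
    print $gene->description // '', "\n";

    # Transcripts are features too, so they share the Feature methods.
    for my $transcript ( @{ $gene->get_all_Transcripts() } ) {
        printf "  %s %s\n", $transcript->stable_id, $transcript->biotype;
    }
}
else {
    warn "Ensembl Core API not found; skipping the database query\n";
}
```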

Transcript object

Retrieve by using TranscriptAdaptor
Retrieve from Slice or Gene

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| stable ID | ENST00000380152 | stable_id |
| name | BRCA2-001 | external_name |
| biotype | protein_coding, nonsense_mediated_decay | biotype |
| analysis | ensembl, havana, ensembl_havana_transcript | analysis->logic_name |
| status | KNOWN, NOVEL | status |
| CDS | ATGCCTATTGGATCCAAAGAGAGGC... | translateable_seq |
| UTRs | returns Seq object | five_prime_utr, three_prime_utr |
| spliced sequence | GGGCTTGTGGCGCGAGCTTCTGAAA... | spliced_seq |

Transcript object (continued)

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| translation | returns Translation object | translation |
| exons | returns listref of Exon objects | get_all_Exons |
| introns | returns listref of Intron objects | get_all_Introns |
| canonical | 0, 1 | is_canonical |

Exon object

Retrieve by using ExonAdaptor
Retrieve from Slice, Gene or Transcript

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| stable ID | ENSE00001184784 | stable_id |

Translation object

Retrieve by using TranslationAdaptor
Retrieve from Transcript

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| stable ID | ENSP00000369497 | stable_id |
| length | 3418 | length |
| sequence | MPIGSKERPTFFEIFKTRCNKADLG... | seq |

Exercise 3

Write a script to retrieve the upstream sequences for a list of Ensembl Gene IDs.

The script should take as input (from the command line):
- the species
- the length of the upstream sequence
- the name of the file containing the Ensembl Gene IDs

and give as output:
- a file containing the upstream sequences in FASTA format

Take into account that a gene can be either on the forward or the reverse strand of the genome!

Use as input your own file of Ensembl Gene IDs, or use the file 100_human_genes.txt in /homes/evopadmin/ensembl.
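As a hint for the strand issue, the coordinate arithmetic can be worked out in plain Perl before touching the API. This is only a sketch of one possible approach, not the exercise solution: upstream_region, reverse_complement and fasta_record are hypothetical helpers, coordinates are assumed to be 1-based and inclusive, and the 60-column line width is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return (start, end) of the $length bp region upstream of a gene.
# On the reverse strand, "upstream" lies beyond the gene's end coordinate.
sub upstream_region {
    my ( $gene_start, $gene_end, $strand, $length ) = @_;
    if ( $strand == 1 ) {
        my $start = $gene_start - $length;
        $start = 1 if $start < 1;    # do not run off the start of the region
        return ( $start, $gene_start - 1 );
    }

    # Not clamped: the length of the sequence region is not known here.
    return ( $gene_end + 1, $gene_end + $length );
}

# Upstream sequence of a reverse-strand gene must be reverse complemented.
sub reverse_complement {
    my ($seq) = @_;
    ( my $rc = reverse $seq ) =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Format one FASTA record, wrapping the sequence at 60 characters per line.
sub fasta_record {
    my ( $id, $seq ) = @_;
    ( my $wrapped = $seq ) =~ s/(.{1,60})/$1\n/g;
    return ">$id\n$wrapped";
}

# Illustrative values only: a gene at 1000-2000 with 500 bp upstream.
printf "forward: %d-%d\n", upstream_region( 1000, 2000, 1,  500 );
printf "reverse: %d-%d\n", upstream_region( 1000, 2000, -1, 500 );
print fasta_record( 'example', reverse_complement('AACG') );
```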

External references

- External references (Xrefs) are cross references to identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, OMIM etc.
- External references can be on the gene, transcript or protein level

DBEntry object

Retrieve by using DBEntryAdaptor
Retrieve from Gene, Transcript or Translation

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| database name | HGNC, Uniprot_SWISSPROT, EMBL | dbname |
| name | BRCA2, BRCA2_HUMAN, AF489725 | display_id |
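A sketch of listing the external references of a gene, using the two DBEntry methods from the table. xref_label is a hypothetical helper; the query part assumes the Ensembl Core API is installed and the public server is reachable, and it uses get_all_DBLinks, which also collects the xrefs attached to the gene's transcripts and translations.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: "database:identifier" for a DBEntry object.
sub xref_label {
    my ($db_entry) = @_;
    return join ':', $db_entry->dbname, $db_entry->display_id;
}

# The query part runs only where the Ensembl Core API is installed.
if ( eval { require Bio::EnsEMBL::Registry; 1 } ) {
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $gene_adaptor = $registry->get_adaptor( 'human', 'Core', 'Gene' );
    my $gene         = $gene_adaptor->fetch_by_stable_id('ENSG00000139618');

    # Xrefs on the gene, its transcripts and its translations.
    print xref_label($_), "\n" for @{ $gene->get_all_DBLinks() };
}
else {
    warn "Ensembl Core API not found; skipping the database query\n";
}
```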

DBAdaptor object

Retrieve from Registry

| Attribute | Example value(s) | Method(s) |
|---|---|---|
| database name | homo_sapiens_core_66_37, danio_rerio_variation_66_9 | dbname |
| database group | core, variation, compara, funcgen | group |
| database species | homo_sapiens, danio_rerio | species |
| database connection | returns DBConnection object | dbc |
| object adaptors | returns ObjectAdaptor | get_ObjectAdaptor |
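A sketch of looping over the core DBAdaptors held by the Registry. dba_summary is a hypothetical helper that reads the database name through the DBConnection object; the query part assumes the Ensembl Core API is installed and the public server is reachable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: tab-separated species, group and database name.
sub dba_summary {
    my ($dba) = @_;
    return join "\t", $dba->species, $dba->group, $dba->dbc->dbname;
}

# The query part runs only where the Ensembl Core API is installed.
if ( eval { require Bio::EnsEMBL::Registry; 1 } ) {
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );

    # One DBAdaptor per species-specific core database.
    for my $dba ( @{ $registry->get_all_DBAdaptors( -group => 'core' ) } ) {
        print dba_summary($dba), "\n";
    }
}
else {
    warn "Ensembl Core API not found; skipping the database query\n";
}
```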

Exercise 4

Write a script that gets, for all Ensembl species, the protein sequence of the canonical transcript for the genes that have been annotated with a given gene symbol.

The script should take as input (from the command line):
- the gene symbol

and give as output:
- a file containing the protein sequences in FASTA format, with the Ensembl Gene ID and the species name in the FASTA header

There are several ways to loop through the core dbs for all species in Ensembl. You can use the DBAdaptor object or, if you feel adventurous, the GenomeDB object from the Compara API.