Module 1. Sequence Formats and Retrieval. Charles Steward


The Open Door Workshop. Module 1: Sequence Formats and Retrieval. Charles Steward

Aims. Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases. Show how to access genomic data from a variety of databases, using UniProt and GQuery (Entrez). Introduce BLAST.

Databases. Nucleotide databases: DDBJ, EMBL (ENA) and NCBI (GenBank) form the International Nucleotide Sequence Database Collaboration (INSDC) and store genomic, cDNA and EST sequences. Protein database: UniProt, comprising Swiss-Prot (manually curated) and TrEMBL (automatically annotated) sequences. Accession numbers (a unique number or combination of letters and numbers assigned to each record in a database) identify such sequences, e.g. AL034553.

Information is mirrored daily between DDBJ, EMBL and NCBI.
INSDC: International Nucleotide Sequence Database Collaboration (DDBJ/EMBL/GenBank).
DDBJ: DNA Data Bank of Japan.
CIB-DDBJ: Center for Information Biology and DNA Data Bank of Japan.
EBI: European Bioinformatics Institute.
ENA: European Nucleotide Archive, which contains the EMBL Nucleotide Sequence Database.
EMBL: European Molecular Biology Laboratory.
NCBI: National Center for Biotechnology Information.
IAM: International Advisory Meeting.
ICM: International Collaborative Meeting.

EMBL format. See the abbreviation table.

Abbreviations found in the EMBL flat file.
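As a rough illustration of how the two-letter line codes organise an EMBL flat file, the Python sketch below groups the lines of one record by code. It assumes the usual layout (a two-letter code, padding up to column 5, then the data; XX as a spacer line; a blank code for sequence lines; // as the record terminator). The record is a made-up toy, not a real database entry, and the function name is hypothetical.

```python
from collections import defaultdict

# Toy record, invented for illustration; not a real database entry.
TOY_RECORD = """\
ID   TOY00001; SV 1; linear; genomic DNA; STD; UNC; 12 BP.
XX
AC   TOY00001;
XX
DE   Made-up example sequence
DE   for parsing practice.
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                            12
//
"""


def group_by_line_code(record_text):
    """Group the lines of one EMBL-style record by their two-letter code."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in record_text.splitlines():
        if not line.strip():
            continue                  # ignore completely empty lines
        code, data = line[:2], line[5:].strip()
        if code == "//":              # end of record
            break
        if code == "XX":              # spacer line, carries no data
            continue
        if code == "  ":              # sequence lines have a blank code
            # Keep the residues, drop the trailing base count.
            sequence_parts.extend(tok for tok in data.split() if tok.isalpha())
        else:
            fields[code].append(data)
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


record = group_by_line_code(TOY_RECORD)
print(record["AC"])
print(" ".join(record["DE"]))
print(record["sequence"])
```

Multi-line fields such as DE simply accumulate in order, so joining the list gives back the full description.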

NCBI format. DDBJ format.

Sequence Read Archive (SRA) for next-generation sequencing submission. The INSDC now accepts sequence data produced by next-generation sequencing machines. This screenshot is taken from the ENA, hosted at the EBI. For further information see: http://www.ebi.ac.uk/ena/about/sra_submissions http://www.ebi.ac.uk/ena/about/sra_format

NGS file formats
BAM format: a BAM file (.bam) is the binary version of a SAM file.
BigBed format: an indexed BED file (1 line per feature and at least 3 columns).
BigWig format: used for dense continuous data and displayed as a graph (wiggle plot).
VCF format: for variants.
See here for more information: http://www.ensembl.org/info/website/upload/index.html http://www.ensembl.org/info/website/upload/large.html

Next Generation Sequencing courses
Wellcome Trust Next Generation Sequencing course, 6-13 April 2014: http://www.wellcome.ac.uk/education-resources/courses-and-conferences/
EBI, Monday, October 14, 2013 to Thursday, October 17, 2013: http://www.ebi.ac.uk/training/course/next-generation-sequencing-workshop-0
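To make the "1 line per feature and at least 3 columns" rule for BED concrete, here is a minimal Python sketch that parses one line of a plain-text BED file (not the indexed BigBed form). It assumes the standard meaning of the first three columns: chromosome name, 0-based start, and end (exclusive). The class and function names, and the example feature, are hypothetical.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int      # 0-based, inclusive
    end: int        # exclusive
    extra: tuple    # any optional columns after the first three


def parse_bed_line(line):
    """Parse one whitespace-separated BED line into a BedFeature."""
    columns = line.split()
    if len(columns) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    chrom, start, end = columns[0], int(columns[1]), int(columns[2])
    if start > end:
        raise ValueError("start must not be greater than end")
    return BedFeature(chrom, start, end, tuple(columns[3:]))


# Invented example feature, for illustration only.
feature = parse_bed_line("chrX\t100\t250\texample_feature")
print(feature.chrom, feature.end - feature.start)
```

Because the end coordinate is exclusive, the feature length is simply end minus start.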

NCBI's Entrez system

GQuery (Entrez) entry point: http://www.ncbi.nlm.nih.gov/books/nbk3837/

Reference Sequences
Goal: one sequence entry for each naturally occurring DNA, RNA and protein molecule.
chromosome                  NC_000000
contig                      NT_000000
mRNA                        NM_000000
predicted mRNA              XM_000000
protein                     NP_000000
predicted protein           XP_000000
non-coding RNA              NR_000000
predicted non-coding RNA    XR_000000
Key: curated or calculated.
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
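The accession prefixes listed above lend themselves to a simple lookup. The Python sketch below maps a RefSeq-style accession to the molecule type given on the slide; the function name is hypothetical, and only the prefixes shown above are covered.

```python
# Prefix table copied from the Reference Sequences slide.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NT": "contig",
    "NM": "mRNA",
    "XM": "predicted mRNA",
    "NP": "protein",
    "XP": "predicted protein",
    "NR": "non-coding RNA",
    "XR": "predicted non-coding RNA",
}


def classify_refseq(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession.strip().split("_", 1)[0].upper()
    try:
        return REFSEQ_PREFIXES[prefix]
    except KeyError:
        raise ValueError(f"unrecognised RefSeq prefix: {prefix!r}") from None


for example in ("NM_000000", "XP_000000", "NC_000000"):
    print(example, "->", classify_refseq(example))
```

A version suffix such as ".1" after the digits does not affect the lookup, because only the text before the underscore is used.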

CCDS. Comparison of common CDSs to form a consensus gene set. QC by UCSC (filters out possible pseudogenes). Build 104.0: 27,752 agreed CCDS IDs. The CCDS set is displayed in Ensembl/Vega/UCSC/NCBI. Non-redundant gene set agreed by all institutes.

EBI databases

EBI search: access all databases. http://www.ebi.ac.uk/ena/about/browser.html

ENA sequence window

ENA data view. See here for a clone example: http://www.ebi.ac.uk/ena/data/view/bn000065

UniProt

PE (protein existence) line
Format: PE Level: Evidence;
Values:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain
http://www.expasy.org/cgi-bin/lists?pe_criteria.txt
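The PE line format above ("PE Level: Evidence;") can be read with a short regular expression. This Python sketch is only an illustration: the function name is hypothetical, and it assumes the line starts with the code PE followed by whitespace, as in a UniProt flat file.

```python
import re

# Levels as listed on the slide.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# "PE", whitespace, a level from 1 to 5, a colon, the evidence text, a semicolon.
PE_PATTERN = re.compile(r"^PE\s+([1-5]):\s*(.+?);\s*$")


def parse_pe_line(line):
    """Return (level, evidence text) from a protein existence line."""
    match = PE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a valid PE line: {line!r}")
    return int(match.group(1)), match.group(2)


level, evidence = parse_pe_line("PE   1: Evidence at protein level;")
print(level, evidence)
```

The PE_LEVELS table can then be used to check that the evidence text matches the level number.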

TrEMBL entry. All information is automatically generated.

Swiss-Prot entry. Manually curated entry containing more information than TrEMBL.

BLAST similarity searching. Basic Local Alignment Search Tool. There are many different databases available to search against, which may vary depending on which site you start from. The most commonly used BLAST site is hosted by the NCBI: http://www.ncbi.nlm.nih.gov/blast/

BLAST output.
1) The score is a measure of the similarity of the query sequence to the subject sequence. It is calculated from the matches, substitutions and gaps in each alignment. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and subject sequence.
2) The E-value is an estimate of the number of matches with that score or better expected to occur by chance. It is calculated from the length of the query sequence, the size of the database and the score (or scoring system used), and so is specific to that search. Thus, two results from different databases may not be directly comparable. The take-home message: the smaller the E-value, the smaller the likelihood that the match has happened at random, and the more likely it is to be real. For example:
0.000001: 1 in a million searches would produce a false positive with this score
0.01: 1 in 100 searches would produce a false positive with this score
1: 1 match above threshold is likely to be a false positive (FP)
100: 100 matches above threshold are FP
For further details see Karlin & Altschul, PNAS 1990, 87:2264-8.
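The dependence of the E-value on query length, database size and score can be sketched with the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The Python below is a simplified illustration, not how BLAST itself reports values: real searches use corrected "effective" lengths, and the function names and example numbers are assumptions. The second function converts an E-value into the chance of seeing at least one such match, P = 1 - exp(-E), which is why small E-values read as "1 in N searches".

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_of_at_least_one_hit(e_value):
    """Probability of one or more chance hits: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# The same bit score is less significant against a larger database.
print(expected_chance_hits(40, 500, 1_000_000))
print(expected_chance_hits(40, 500, 1_000_000_000))

# The E-values from the examples above, converted to probabilities.
for e_value in (0.000001, 0.01, 1, 100):
    print(e_value, chance_of_at_least_one_hit(e_value))
```

For small E-values the probability is almost equal to the E-value itself, while for large ones it approaches 1, in line with the examples above.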

Worked examples. Tasks start on page 18.