RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison



Biological sequence databases
Originally, sequence databases were just a storage place for sequences. Today they are bioinformatics workbenches that provide many tools for retrieving, comparing, and analyzing sequences. There are many different types of sequence databases.
Global nucleotide sequence storage databases:
  GenBank of NCBI (National Center for Biotechnology Information)
  The European Molecular Biology Laboratory (EMBL) database
  The DNA Data Bank of Japan (DDBJ)
Genome-centered databases:
  NCBI genomes
  Ensembl Genome Browser
  UCSC Genome Bioinformatics Site
Protein databases:
  SwissProt

Web browsers for accessing genome data
  Entrez Life Sciences Search Engine (US National Institutes of Health)
  Ensembl Genome Browser (European Bioinformatics Institute)
  UCSC Genome Bioinformatics Site (University of California at Santa Cruz)

Entrez
http://www.ncbi.nlm.nih.gov/sites/gquery
A service of the US National Center for Biotechnology Information (NCBI).
29 core search pages interconnect different types of information.

Entrez
Among the most valuable tools are the nucleotide, EST, and protein sequence databases, the Human Genome Resources, the Microbial Genome Resources, the Complete Genomes tools, etc.

GenBank and PubMed
The NIH also maintains the GenBank and PubMed databases (both linked to Entrez).
GenBank is a genetic sequence database that provides an annotated collection of all genomic and cDNA sequences that are in the public domain. In August 2009, there were 106,533,156,756 bases in 108,431,692 sequence records in the traditional GenBank divisions.
PubMed is a literature search engine providing access to all published medically related articles, abstracts, and journals.

Ensembl
www.ensembl.org
Ensembl is a joint project between the European Bioinformatics Institute (EMBL-EBI) and the Sanger Institute in Cambridge. Its major focus is the genomes of animals and other eukaryotes.
The annotation of most of the genomes in the Ensembl database has been produced automatically through annotation pipelines. The genomes of five species (human, mouse, zebrafish, pig, and dog) have been carefully annotated by manual curation.

Ensembl
A nice feature is a customizable home page that keeps track of your searches even if they are done from different computers.
Ensembl has many tools for extracting and downloading user-specific aspects of the data in various formats suitable for high-end bioinformatics analysis.
(Screenshot labels: chromosome location, sequence contig, exons)

UCSC Genome Bioinformatics Site
genome.ucsc.edu
A highly sophisticated genome browser that allows visualization of many genome features:
  mapping and sequencing
  phenotype and disease association
  genes and gene predictions
  transcript evidence
  gene expression data
  comparative genomics
  sequence variations

Any of the genome databases can be used to extract simple sequence information. Depending on the question asked by a researcher, one database or another may be better suited to the specific need.

What to keep in mind when working with sequence databases
NCBI (the National Center for Biotechnology Information) accepts all submitted sequences and does not check whether a sequence is accurate.
Annotations of most genes and genomes are done by the submitting researchers and, except for specific Reference Sequences (RefSeqs), are not curated by GenBank experts. As a result, the structure of the same gene reported by different labs can be different!
Furthermore:
  All genomes are full of SNPs and other polymorphisms.
  Gene-finding algorithms are not perfect, especially when it comes to predicting splice sites.
  Alternative splicing can lead to different transcripts (different versions) of the same gene.
The implication is that there is often no single correct sequence for a given gene.

GenBank
Extracting a segment of a sequence that contains the gene of interest.
Task: find the sequence of the Yersinia pestis gene lacZ.
Point your browser to the NCBI main site: http://www.ncbi.nlm.nih.gov
Enter the search criteria, select the database ("nucleotides" for GenBank), and click the Go button.

GenBank
In the results page, find the link to the entry of interest. The list may have many pages, and the link you are looking for might not be the first entry.
A complete genome of a Yersinia pestis strain should contain the gene you are looking for.
(Screenshot labels: the genomic database; a standard GenBank entry. Both hold essentially the same information in a slightly different format.)

GenBank
A complete genome page contains a lot of information and thousands of genes. Use the find function of your browser to find lacZ.

GenBank
Read the information about your gene, then click either the "gene" or the "CDS" link to go to the gene sequence.

GenBank
Scroll to the bottom of the file to see the nucleotide sequence of the gene and the amino acid sequence of the encoded protein.
Note that this gene is encoded on the strand complementary to the one shown in the database. Therefore, the sequence shows the template strand.
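When a gene lies on the strand complementary to the one displayed, the other strand can be obtained by taking the reverse complement. The following is a minimal sketch of that operation, not part of the original slides; the function name reverse_complement is made up for this illustration, and only the bases A, C, G, T, and N are handled.

```python
# Minimal sketch: obtain the opposite DNA strand, read 5' to 3'.
# The name reverse_complement is hypothetical (chosen for this example).

# Complement of each base; N (unknown base) pairs with N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # Complement of ATGGCCTAA is TACCGGATT; reversed it reads TTAGGCCAT.
    print(reverse_complement("ATGGCCTAA"))  # prints TTAGGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when moving between strands.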

GenBank
Choose the FASTA format at the top of the entry page to get a pure sequence that you can cut and paste into another application.
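A FASTA record is a header line starting with ">" followed by one or more lines of sequence. As an illustration only (not from the original slides), here is a small parser sketch; the function name parse_fasta and the two example records are invented for this example and do not correspond to real database entries.

```python
# Minimal sketch of reading FASTA-formatted text.
# parse_fasta and the EXAMPLE records below are hypothetical.


def parse_fasta(text: str) -> dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines of one record are joined into a single string.
            records[header] += line
    return records


EXAMPLE = """>example_1 made-up record
ACGTACGTAC
GTACGT
>example_2 another made-up record
TTGACA
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE).items():
        print(name, len(seq))
```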

Extracting sequence of a human gene
How to extract the sequence of a human gene:
1. Go to the Entrez site: www.ncbi.nih.gov/sites/entrez
2. In the search window, type the name of the gene, an identifier, or a GenBank accession number.
(Screenshot label: select "nucleotides" for GenBank access.)
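The same lookup can also be scripted instead of done in the browser. The sketch below is an addition to the slides and assumes NCBI's E-utilities EFetch service with the parameters db, id, rettype, and retmode; the helper names efetch_url and fetch_record are made up here, "nuccore" is assumed as the database name for GenBank nucleotide records, and NM_000518 is used purely as an illustrative accession.

```python
# Sketch under stated assumptions: builds an NCBI E-utilities EFetch URL
# for one accession. efetch_url and fetch_record are hypothetical helpers.
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build the request URL; no network access happens here."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_record(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Download the record as text (requires an internet connection)."""
    with urlopen(efetch_url(accession, db, rettype)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only the URL is printed; call fetch_record() to actually download.
    print(efetch_url("NM_000518"))
```

Passing rettype="gb" instead of "fasta" is assumed to return the full GenBank flat file rather than the bare sequence.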

Extracting sequence of a gene
Among the several pages of matching entries, find the one you want to examine and click on its accession number.

Extracting sequence of a gene
  LOCUS: length, genomic/mRNA, date of submission
  DEFINITION: species, common gene name
  ACCESSION: GenBank accession number
  REFERENCE: authors, publication (if any), history
Feature table:
  Coding sequence (CDS) coordinates
  The encoded predicted protein
  Miscellaneous features (exon boundaries, identified protein domains, mutations, etc.)
Features can be associated with links to other databases: the protein domain database (PFAM), the Mendelian inheritance database (OMIM), etc.

Extracting sequence of a gene
The sequence of the gene is shown at the bottom. The default format is blocks of 10 nucleotides, 60 bases per row.
At the top of the page, one can choose the format in which the sequence is displayed.
The sequence can simply be copied from the web page and imported into any of the bioinformatics applications, or directly exported into some of them.
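The default layout described above (blocks of 10 nucleotides, 60 bases per row) is easy to reproduce. The sketch below is an illustration added to the slides; the function name genbank_rows is invented, and the 9-character right-justified position column and lowercase letters are assumptions modeled on how GenBank flat files usually look.

```python
# Minimal sketch of the "blocks of 10, 60 bases per row" layout.
# genbank_rows is a hypothetical name; the position column width is an assumption.


def genbank_rows(seq: str, block: int = 10, per_row: int = 60) -> list[str]:
    """Split a sequence into rows of space-separated blocks.

    Each row starts with the 1-based position of its first base.
    """
    rows = []
    for start in range(0, len(seq), per_row):
        chunk = seq[start:start + per_row].lower()
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        rows.append(f"{start + 1:>9} " + " ".join(blocks))
    return rows


if __name__ == "__main__":
    for row in genbank_rows("ACGT" * 20):  # an 80-base made-up sequence
        print(row)
```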

Homologous sequences
If the sequences of two proteins are similar, they are most likely derived from the same ancestor and therefore have similar structures and similar biological functions.
Science by analogy: if something is true for one such sequence, it is probably also true for the other. Studying a protein (a gene) in the lab takes years; searching a database takes seconds.
Proteins (genes) that have the same ancestor are called homologous. Homologous proteins (genes) usually have similar sequences, structures, and functions.
A general rule: two proteins are likely homologous if their sequences are at least 25% identical. DNA sequences are likely homologous if they are at least 70% identical.
Do not confuse homology and similarity. Homology is a binary parameter (homologous vs. non-homologous) and reflects the evolutionary relationship between the sequences. Similarity is a quantitative measure: two sequences can be more similar or less similar. Similarity does not necessarily reflect an evolutionary relationship between sequences.

Sequence comparison
A pairwise comparison of sequences is a very common task in bioinformatics and in genomics. By aligning two sequences we are testing the hypothesis that the sequences are homologous, meaning that each pair of aligned residues descends from a common ancestral residue.
  AGGRTYYFWNEDKKRGHYPLIVAGRHKSAGRYPLAVFDEKRHFFVAPPKRTYDH
  AGGKTYFFWNEEKRRGHYPIIVLGRHA-AGRYPVVVFDAKRHFAVAPPKKTYDH
If two sequences have some similarity simply by chance, this does not have any biological significance. But if two sequences are homologous, by comparing them we can gain information about their structure, function, and evolution.
The most common purpose of sequence comparison in bioinformatics is identifying homologous sequences.
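Percent identity of an alignment such as the one above can be computed by counting identical columns. The sketch below is an added illustration; the function name percent_identity is made up, and it assumes identity is measured over the full alignment length, with a gap column never counting as a match (other definitions exist).

```python
# Minimal sketch: percent identity of two already-aligned sequences.
# percent_identity is a hypothetical name. Assumption: identity is taken over
# the whole alignment length, and a gap ("-") never counts as a match.


def percent_identity(a: str, b: str) -> float:
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


if __name__ == "__main__":
    # The two aligned protein sequences from the slide.
    seq1 = "AGGRTYYFWNEDKKRGHYPLIVAGRHKSAGRYPLAVFDEKRHFFVAPPKRTYDH"
    seq2 = "AGGKTYFFWNEEKRRGHYPIIVLGRHA-AGRYPVVVFDAKRHFAVAPPKKTYDH"
    print(f"{percent_identity(seq1, seq2):.1f}% identical")
```

The result can be compared with the rule of thumb for proteins (at least 25% identical) quoted in these slides.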

Sequence comparison
Sequence comparison is not trivial. Consider two alignments of the same two sequences:
              Alignment 1   Alignment 2
  Sequence 1  ACGCTGA       ACGCTGA
  Sequence 2  A--CTGT       ACTGT--
Which alignment is better? The frequency of common evolutionary events determines the alignment parameters:
  Event         Ancestor  Descendants   Alignment
  identity      ATCG      ATCG, ATCG    ATCG  / ATCG
  substitution  ATCG      ACCG, ATCG    ACCG  / ATCG
  deletion      ATCG      ATCG, ACG     ATCG  / A-CG
  insertion     ATCG      ATTCG, ATCG   ATTCG / AT-CG
The frequency of substitutions is higher than the frequency of indels (insertions or deletions).

Sequence comparison
Computational alignment of two sequences is based on assigning a score to each possible alignment. The score is composed of bonuses (for a match) and penalties (for a mismatch or an indel).
Let us assign a bonus of +3 points for a match, a penalty of -1 for a mismatch, and a penalty of -5 for an indel.
Align two sequences: ACGCTTGA and ACCCGTTA
  ACGCTTGA
  ACCCGTTA
  Score: 3 + 3 - 1 + 3 - 1 + 3 - 1 + 3 = +12

  ACGC-TTGA
  ACCCGTT-A
  Score: 3 + 3 - 1 + 3 - 5 + 3 + 3 - 5 + 3 = +7

  AC--GCTTGA
  ACCCGTTA--
  Score: 3 + 3 - 5 - 5 + 3 - 1 + 3 - 1 - 5 - 5 = -10
The choice of parameters depends on the biological model. The choice of parameters defines the score of the alignment.
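The scoring scheme above (+3 match, -1 mismatch, -5 indel) can be checked with a few lines of code. This is a minimal sketch added for illustration; the function name alignment_score is made up, and it only scores an alignment that is already given (it does not search for the best one).

```python
# Minimal sketch: score a given pairwise alignment column by column.
# alignment_score is a hypothetical name; defaults follow the slide's
# example values (+3 match, -1 mismatch, -5 indel).


def alignment_score(a: str, b: str, match: int = 3,
                    mismatch: int = -1, indel: int = -5) -> int:
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += indel      # gap in either sequence
        elif x == y:
            score += match      # identical residues
        else:
            score += mismatch   # substitution
    return score


if __name__ == "__main__":
    print(alignment_score("ACGCTTGA", "ACCCGTTA"))      # 12
    print(alignment_score("ACGC-TTGA", "ACCCGTT-A"))    # 7
    print(alignment_score("AC--GCTTGA", "ACCCGTTA--"))  # -10
```

Changing the three parameters changes which alignment scores highest, which is the sense in which the choice of parameters encodes the biological model.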

BLAST
Once you have determined the sequence of your protein (gene), the next most important thing to do is to find sequences in the database that are similar to the one you are studying and to retrieve as much information about those sequences as possible. Then you decide which of the known information applies to your protein (gene).
Searching sequence databases for entries similar to the sequence of your protein or gene is the most frequently used bioinformatics procedure in genomic sciences.
The most frequently used tool for that is BLAST, the Basic Local Alignment Search Tool (Altschul et al., J. Mol. Biol. 1990).
BLAST is the most popular bioinformatics data-mining tool.

BLAST
BLAST is a tool that searches large sequence databases and returns sequences that have regions of similarity to a query sequence provided by the user.
BLAST finds isolated regions in sequence pairs that have high levels of similarity.
BLAST concatenates all the sequences in the database into one continuous, very long sequence and then looks in this sequence for matches to segments of the query sequence.

BLAST
BLAST comes in many different flavors. The choice of the BLAST protocol depends on your sequence and the question you ask.
For protein searches:
  blastp compares a protein sequence with a protein database ("p" for protein)
  tblastn compares a protein sequence with a translated nucleotide database ("t" for translated)
Choice of BLAST:
  I want to find out something about my protein: use blastp. Find proteins similar to yours and what is known about them.
  I want to discover genes in other organisms that code for a protein similar to mine: use tblastn to compare your protein to the nucleotide sequences translated in 6 reading frames.
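To make "translated in 6 reading frames" concrete: a nucleotide sequence is read in three frames on one strand and three on the reverse complement. The sketch below is an illustration added to the slides, not BLAST's own code; the names translate and six_frame are made up, and the standard genetic code is assumed ("*" marks a stop codon, "X" a codon with unrecognized bases).

```python
# Minimal sketch of six-frame translation, assuming the standard genetic code.
# translate and six_frame are hypothetical names, not part of BLAST.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order line up with the amino acid string above.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons starting at the first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame(dna: str) -> dict[str, str]:
    """Return translations for frames +1..+3 and -1..-3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame("ATGGCCTAA").items():
        print(frame, protein)
```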

BLAST
Several servers make BLAST accessible to researchers worldwide.
Server at NCBI (hosted in the US): http://blast.ncbi.nlm.nih.gov/
  Highly pre-configured for the most common (basic) searches. A great place to start if you are not asking a very specific question about your sequence.
Server at EMBnet-ExPASy (hosted in Europe): http://www.ch.embnet.org/software/bblast.html
  Gives many choices for configuring your search. Very useful if you ask specific bioinformatics questions.
Both servers are freely accessible to academic researchers!

blastp
Running a blastp search
Q: Are there any proteins similar to the hamster nucleolin?
Approach: find proteins similar to the hamster nucleolin protein in the Swiss-Prot database, which contains a wealth of information about all the studied proteins.
Point the browser to http://blast.ncbi.nlm.nih.gov and choose the type of BLAST you want to run (protein blast, in this case).

blastp
Enter either your protein accession number or cut and paste its sequence. Your sequence (retrieved via its accession number or written in the window) is called the query sequence.
Choose the database in which you want to search.
If needed, limit the search to a specific organism.
Click the BLAST button.

blastp
An intermediate screen will appear for a few seconds to a few minutes while the search is in progress. Then the results page opens.
The results page may have lots of information and many pages. Always examine it before you choose to print!
The results page has three important areas: a graphic display, a hit list, and the alignments.

blastp
A graphic display
(Screenshot label: the query sequence bar.)
A graphic display shows which segment of your query is similar to the sequences in the database. The similarity bars are color-coded to reflect the degree of similarity:
  red: the most similar
  magenta: still very good
  green: intermediate
  blue: twilight zone
  black: a match, but a really bad one
Some matches do not extend across the entire sequence: they may represent conserved domains in different, even unrelated proteins. For example, nucleolins, like many other proteins, contain an RNA-binding domain.

blastp
The hit list
  The sequence accession number and the name
  Description
  The bit score ("goodness" of the match). Ignore anything below 50.
  The E-value (the expectation value): the statistical significance of the match to the hit, that is, how often a match like this would be found if there were no real match in the database. Matches with an E-value below 0.001 (e-3) are worth exploring.
  Genomic link
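The two rules of thumb above (bit score of at least 50, E-value below 0.001) can be applied to a hit list programmatically. The sketch below is an added illustration; the Hit type, the function worth_exploring, and all accession names and numbers in the example are invented and do not come from a real BLAST run.

```python
# Minimal sketch: keep hits that pass the slide's rules of thumb.
# Hit, worth_exploring, and the example data are all hypothetical.
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str
    bit_score: float
    e_value: float


def worth_exploring(hits: list[Hit], min_bits: float = 50.0,
                    max_e: float = 1e-3) -> list[Hit]:
    """Return hits with bit score >= min_bits and E-value < max_e."""
    return [h for h in hits if h.bit_score >= min_bits and h.e_value < max_e]


if __name__ == "__main__":
    made_up_hits = [
        Hit("HYPO_1", 812.0, 0.0),
        Hit("HYPO_2", 64.3, 2e-9),
        Hit("HYPO_3", 41.0, 5e-4),   # bit score too low
        Hit("HYPO_4", 55.0, 0.3),    # E-value too high
    ]
    for hit in worth_exploring(made_up_hits):
        print(hit.accession, hit.bit_score, hit.e_value)
```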

blastp
The alignments
  The sequence accession number and the name (clickable!)
  The percentage of identity (anything more than 25% is worth exploring)
  The "Positives": identity plus similar residues
  The "Gaps": the nonaligned residues
  The coordinates in the query sequence and in the hit (subject) sequence
  Top line: the query; middle line: the consensus; bottom line: the hit

blastp
Once you choose which hit sequence from the list you want to work with, you can click on the name of that sequence. The browser will bring you to the database entry for that sequence.
From there you can retrieve some primary information (origin, name, length). You can find the relevant publications, or take the accession number and look for information about the protein in other databases.

Blasting DNA sequence
Two major modes of searching with a nucleotide sequence query are:
  blastn searches nucleotide sequence databases using a nucleotide sequence query
  blastx searches protein databases using a translated nucleotide query
Choice of BLAST:
  I want to find gene regulatory regions similar to mine: use blastn. Find nucleotide sequences similar to yours in other genomes.
  I want to find out whether my sequence corresponds to a protein-coding gene: use blastx. Find known proteins similar to the one encoded in your sequence.
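The four BLAST flavors covered in these slides differ only in the type of the query and the type of the database. A small lookup, added here as an illustration, summarizes the choice; the function name choose_blast and the type labels "protein" and "nucleotide" are made up for this example, and other BLAST programs are deliberately left out.

```python
# Minimal sketch: pick a BLAST flavor from the query and database types.
# choose_blast and the string labels are hypothetical; only the four
# programs named in the slides are covered.

BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",         # protein query vs. protein database
    ("protein", "nucleotide"): "tblastn",     # protein query vs. translated nucleotide database
    ("nucleotide", "nucleotide"): "blastn",   # nucleotide query vs. nucleotide database
    ("nucleotide", "protein"): "blastx",      # translated nucleotide query vs. protein database
}


def choose_blast(query_type: str, database_type: str) -> str:
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} query, "
            f"{database_type!r} database"
        ) from None


if __name__ == "__main__":
    print(choose_blast("protein", "nucleotide"))  # tblastn
    print(choose_blast("nucleotide", "protein"))  # blastx
```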