BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Similar documents
Pairwise Sequence Alignment

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Bio-Informatics Lectures. A Short Introduction

Welcome to the Plant Breeding and Genomics Webinar Series

Bioinformatics Resources at a Glance

Design Style of BLAST and FASTA and Their Importance in Human Genome.

Biological Databases and Protein Sequence Analysis

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

A Tutorial in Genetic Sequence Classification Tools and Techniques

Bioinformática BLAST. Blast information guide. Buscas de sequências semelhantes. Search for Homologies BLAST

Apply PERL to BioInformatics (II)

BIOINFORMATICS TUTORIAL

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Laboratorio di Bioinformatica

Analyzing A DNA Sequence Chromatogram

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Network Protocol Analysis using Bioinformatics Algorithms

Clone Manager. Getting Started

3. About R2oDNA Designer

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Module 1. Sequence Formats and Retrieval. Charles Steward

Molecular Databases and Tools

Error Tolerant Searching of Uninterpreted MS/MS Data

Bioinformatics Grid - Enabled Tools For Biologists.

DNA Sequencing Overview

Databases and mapping BWA. Samtools

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

CD-HIT User s Guide. Last updated: April 5,

Sequence homology search tools on the world wide web

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Using MATLAB: Bioinformatics Toolbox for Life Sciences

GenBank: A Database of Genetic Sequence Data

Module 10: Bioinformatics

Introduction to Bioinformatics AS Laboratory Assignment 6

Genome Explorer For Comparative Genome Analysis

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing

Amino Acids and Their Properties

Approximate String Matching in DNA Sequences

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

MORPHEUS. Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.

Introduction to Bioinformatics 3. DNA editing and contig assembly

Searching Nucleotide Databases

Sequence information - lectures

Library page. SRS first view. Different types of database in SRS. Standard query form

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

Linear Sequence Analysis. 3-D Structure Analysis

Guide for Bioinformatics Project Module 3

Sequencing the Human Genome

fasta July 28, 2015

Fast string matching

T cell Epitope Prediction

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

Databases indexation

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

2.3 Identify rrna sequences in DNA

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Version 5.0 Release Notes

UCHIME in practice Single-region sequencing Reference database mode

Daniel H. Huson. January 21, Contents 1. 1 Introduction 3. 2 Getting Started 5. 4 Licensing 6. 5 Program Overview 7. 7 Taxonomic Binning 9

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

Gerry Hobbs, Department of Statistics, West Virginia University

BlastReduce: High Performance Short Read Mapping with MapReduce

GenBank, Entrez, & FASTA

BIOLOMICS SOFTWARE & SERVICES GENERAL INFORMATION DOCUMENT

(A GUIDE for the Graphical User Interface (GUI) GDE)

MASCOT Search Results Interpretation

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

An agent-based layered middleware as tool integration

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

Transcription:

BLAST Anders Gorm Pedersen & Rasmus Wernersson

Database searching Using pairwise alignments to search databases for similar sequences Query sequence Database

Database searching Most common use of pairwise sequence alignments is to search databases for related sequences. For instance: find probable function of newly isolated protein by identifying similar proteins with known function. Most often, local alignment ( Smith-Waterman ) is used for database searching: you are interested in finding out if ANY domain in your protein looks like something that is known. Often, full Smith-Waterman is too time-consuming for searching large databases, so heuristic methods are used (fasta, BLAST).

Database searching: heuristic search algorithms FASTA (Pearson 1995) Uses heuristics to avoid calculating the full dynamic programming matrix Speed up searches by an order of magnitude compared to full Smith-Waterman The statistical side of FASTA is still stronger than BLAST BLAST (Altschul 1990, 1997) Uses rapid word lookup methods to completely skip most of the database entries Extremely fast One order of magnitude faster than FASTA Two orders of magnitude faster than Smith- Waterman Almost as sensitive as FASTA

BLAST flavors BLASTN BLASTP Protein query sequence Protein database BLASTX Protein database Compares all six reading frames with the database TBLASTN Protein query sequence On the fly six frame translation of database TBLASTX Compares all reading frames of query with all reading frames of the database

BLAST flavors BLASTN BLASTP Protein query sequence Protein database BLASTX Protein database Compares all six reading frames with the database TBLASTN Protein query sequence On the fly six frame translation of database TBLASTX Compares all reading frames of query with all reading frames of the database

BLAST flavors BLASTN BLASTP Protein query sequence Protein database BLASTX Protein database Compares all six reading frames with the database TBLASTN Protein query sequence On the fly six frame translation of database TBLASTX Compares all reading frames of query with all reading frames of the database

Searching on the web: BLAST at NCBI Very fast computer dedicated to running BLAST searches Many databases that are always up to date (e.g. NR and Human Genome Nice simple web interface But you still need knowledge about BLAST to use it properly

When is a database hit significant? Problem: Even unrelated sequences can be aligned (yielding a low score) How do we know if a database hit is meaningful? When is an alignment score sufficiently high? Solution: Determine the range of alignment scores you would expect to get for random reasons (i.e., when aligning unrelated sequences). Compare actual scores to the distribution of random scores. Is the real score much higher than you d expect by chance?

Distribution of random alignment scores Software simulation

Random alignment scores follow extreme value distributions Searching a database of unrelated sequences result in scores following an extreme value distribution The exact shape and location of the distribution depends on the exact nature of the database and the query sequence

Significance of a hit: one possible solution (1) Align query sequence to all sequences in database, note scores (2) Fit actual scores to a mixture of two sub-distributions: (a) an extreme value distribution and (b) a normal distribution (3) Use fitted extreme-value distribution to predict how many random hits to expect for any given score (the E-value )

Database searching: E-values in BLAST BLAST uses precomputed extreme value distributions to calculate E- values from alignment scores For this reason BLAST only allows certain combinations of substitution matrices and gap penalties This also means that the fit is based on a different data set than the one you are working on A word of caution: BLAST tends to overestimate the significance of its matches E-values from BLAST are fine for identifying sure hits One should be careful using BLAST s E-values to judge if a marginal hit can be trusted (e.g., you may want to use E-values of 10-4 to 10-5 ).

BLAST heuristics Best possible search: Do full pairwise alignment (Smith-Watermann) between the query sequence and all sequences in the database. ( ssearch does this). BLAST speeds up the search by at least two orders of magnitude, by prescreening the database sequences and only performing the full Dynamic Programming on promising sequences. This is done by indexing all databases sequences in a so-called suffix-tree which makes it very fast to search for perfect matching sub-strings. A suffix tree is the quickest possible way (so far) to search for the longest matching sub-string between two strings. When a BLAST search is run, candidate sequences from the database is picked based on perfect matches to small sub-sequences in the query sequence. (BLASTN and BLASTP does this differently - more about this in a moment). Full Smith-Waterman is then performed on these sequences.

BLASTN Alignment matrix: Perfect match: 1 Mismatch: -3 Match => word size Potential matched of length < word size (not seen by BLAST) Notice: All mismatched are equally penalized: E.g. A:G == A:C == A:A More advanced models for DNA evolution does exist. Heuristics: Perfect match word of the size: 7, 11 (default) or 15. All sequences Subset to align

BLASTP Alignment matrix: PAM and BLOSUM-series (default: BLOSUM 62) 40 aa Match => word size Notice: These alignment matrices incorporates knowledge about protein evolution. Heuristics: 2 x Near match within a windows. Default word length: 3 aa Default window length: 40 aa All sequences Subset to align