Computational Biology and Bioinformatics. 4. Searching for homologs with BLAST


What next?
Comparing sequences and searching for homologs
    Sequence alignment and substitution matrices
    Searching for sequences with BLAST
MSA and profiles
    Multiple sequence alignment
    PSSM-based profiles
Evolving sequences
    Phylogenetic trees

Finding homologs
We now know how to do alignments and how to score them.
The next question is: given a sequence q, can we find other sequences d in a database D that are homologous to q?
For instance, when q is the gene G2A, the search should find all similar genes: G1A, G1B and G2B have to be found, and they have to be at the top of the list.
The sequences found by this method can provide information about the structure and function of the protein q.

Finding homologs (2)
Simple approach: make a global alignment between q and every sequence d in the database D.
BUT:
sometimes only a segment of q is similar to a segment of the other sequence d,
or the similarity is limited to a particular pattern,
or the order of the domains in q and d is not the same, although the domains themselves are similar.
For these reasons we need to use a local alignment.

Finding homologs (3)
Local alignment: the Smith-Waterman (SW) algorithm.
BUT exhaustively applying SW takes too much time: in December 2009, UniProtKB/TrEMBL contained about 10^7 entries.
In the years 1980-1990 computational resources were limited, so there was a need for efficient techniques: FASTA and BLAST.
Dynamic programming (DP) is guaranteed to find the optimal alignment. This is no longer true for FASTA and BLAST, since they use heuristic methods.

FASTA
W.R. Pearson and D.J. Lipman (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444-2448
http://www.ebi.ac.uk/tools/fasta33/index.html

BLAST
S.F. Altschul et al. (1990) Basic local alignment search tool. J Mol Biol 215:403-410
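As a reminder of what the exhaustive approach costs, here is a minimal sketch of the Smith-Waterman score computation. The function name, the linear gap penalty and the toy match/mismatch scoring are assumptions made for illustration; they are not taken from the slides. The double loop over q and d is the O(|q| x |d|) work that would have to be repeated for every sequence d in D.

```python
def smith_waterman_score(q, d, score, gap=-10):
    """Best local alignment score of q against d (linear gap penalty).

    `score(a, b)` is any substitution function; `gap` is the (negative)
    penalty per gap position. Only two rows of the DP matrix are kept.
    """
    prev = [0] * (len(d) + 1)
    best = 0
    for a in q:
        curr = [0]
        for j, b in enumerate(d, start=1):
            cell = max(
                0,                            # start a new local alignment
                prev[j - 1] + score(a, b),    # align a with b
                prev[j] + gap,                # gap in d
                curr[j - 1] + gap,            # gap in q
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def toy_score(a, b):
    """Hypothetical scoring used only for this example: +2 match, -1 mismatch."""
    return 2 if a == b else -1


identical = smith_waterman_score("ACGT", "ACGT", toy_score, gap=-2)
shared_core = smith_waterman_score("TTACGTT", "GGACGGG", toy_score, gap=-2)
```

In the second call only the common segment ACG contributes, which is the kind of partial similarity that a global alignment would miss.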

BLAST: some initial definitions
A segment is a subsequence of a certain length in the original sequence.
A word is a segment of size w.
A maximal scoring segment pair (MSP) is the pair of aligned segments (no gaps) of the same length with the highest score in the sequences q and d.
A high scoring segment pair (HSP) is a pair of aligned segments (no gaps) whose score cannot be improved by extending the alignment at either side of the two segments.

BLAST consists of 4 steps
1. For every sequence d in D, look for the words of d that have a score of at least T when aligned with words in the sequence q.
2. Determine the HSPs: take the pairs of aligned words and try to improve the alignment by extending it at both sides.
3. Use dynamic programming to determine the gapped alignments for the HSPs.
4. Retrieve the local alignments from the HSPs obtained in the previous round.

BLAST
The structure of BLAST is equivalent to the structure of FASTA.
There are two important differences between the two systems.
In the first stage, FASTA looks for k-tuples in every sequence d that are identical to those in the sequence q. BLAST looks for k-tuples with a score above a threshold T.
In the first stage, FASTA uses a lookup table. BLAST can use the same data structure, but it was found that a finite state transducer is more efficient.

BLAST stage 1
For every sequence d in D, look for the words of d that have a score of at least T when aligned with words in the sequence q.
Every word pair that meets this condition is called a hit.
Originally the same indexing technique as in FASTA (hashing and chaining) was used,
BUT it was shown that a finite state transducer could improve the efficiency.
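The lookup-table variant of the first stage can be sketched as follows. This is only an illustration of the idea of indexing the k-tuples of q and probing the table with the k-tuples of d; the function names are hypothetical, a Python dictionary stands in for hashing with chaining, and the two short sequences are example inputs.

```python
from collections import defaultdict


def build_lookup(q, k):
    """Index every k-tuple of q by its start positions (hash table with chains)."""
    table = defaultdict(list)
    for i in range(len(q) - k + 1):
        table[q[i:i + k]].append(i)
    return table


def exact_hits(d, table, k):
    """Return (j, i) pairs: the k-tuple at position j in d equals the one at i in q."""
    return [
        (j, i)
        for j in range(len(d) - k + 1)
        for i in table.get(d[j:j + k], [])
    ]


q = "AQRQRRQARQ"
d = "RAAQQARAQR"
lookup = build_lookup(q, 2)
fasta_like_hits = exact_hits(d, lookup, 2)
```

Only identical k-tuples are found this way; the threshold T of BLAST additionally admits similar words.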

Deterministic finite state automata
A deterministic finite state automaton (DFA) is defined as a 5-tuple (Q, Σ, δ, q0, F):
Q is a finite set of states
Σ is the alphabet
δ: Q × Σ → Q is the transition function
q0 ∈ Q is the start state
F ⊆ Q is the set of end states

Finite state transducers
These are DFAs that can produce output:
the Moore machines
the Mealy machines
Mealy machines are also often used in cryptography.
There is only one start state.

Finite state transducers (2)
An FST is defined like a DFA but with three differences:
1. There are no end states anymore
2. Λ is a finite output alphabet
3. λ is an output function
   λ: Q × Σ → Λ for Mealy machines
   λ: Q → Λ for Moore machines
Therefore a finite state transducer is defined as a 6-tuple (Q, Σ, Λ, δ, λ, q0).
M. Cameron et al. (2006) A deterministic finite automaton for faster protein hit detection in BLAST. Jour Comp Biol 13(4):965-978

BLAST stage 1 (2)
Take the following sequence q (size n = 10) and w = 2:
q = AQRQRRQARQ
The alphabet is {A, Q, R}, with size α = 3.
The sequence is again partitioned into w-tuples:
AQ, QR, RQ, QR, RR, RQ, QA, AR, RQ
BLOSUM62 is used for the scores.
Look for all the words of size w (α^w = 3^2 = 9 in total) that have a score of at least T (here T = 5) when aligned to the w-tuples of q.
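The search for words scoring at least T can be sketched by brute force over all α^w candidate words. The function names are hypothetical; the six BLOSUM62 values are the standard entries for the letters A, Q and R.

```python
from itertools import product

# BLOSUM62 entries restricted to the alphabet {A, Q, R} of the example.
BLOSUM62 = {
    ("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
    ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1,
}


def sub(a, b):
    """Symmetric lookup of a substitution score."""
    return BLOSUM62[(a, b)] if (a, b) in BLOSUM62 else BLOSUM62[(b, a)]


def word_score(u, v):
    """Ungapped score of two words of equal length."""
    return sum(sub(a, b) for a, b in zip(u, v))


def neighbourhood(q, w, T, alphabet):
    """Map each candidate word to the start positions in q it hits with score >= T."""
    table = {}
    for letters in product(alphabet, repeat=w):
        word = "".join(letters)
        positions = [
            i for i in range(len(q) - w + 1)
            if word_score(word, q[i:i + w]) >= T
        ]
        if positions:
            table[word] = positions
    return table


hit_table = neighbourhood("AQRQRRQARQ", w=2, T=5, alphabet="AQR")
```

For this q, eight of the nine candidate words hit at least one position; AA is the only word that never reaches T = 5.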

Scores of the 9 candidate words (rows) against the distinct w-tuples of q (columns):

        AQ   QR   RQ   RR   QA   AR
  AA     3   -2   -2   -2    3    3
  AQ     9    0    4    0   -2    5
  AR     5    4    0    4   -2    9
  QA    -2    4    0    0    9   -2
  QQ     4    6    6    2    4    0
  QR     0   10    2    6    4    4
  RA    -2    0    4    4    5   -2
  RQ     4    2   10    6    0    0
  RR     0    6    6   10    0    4

BLAST stage 1 (3)
All words with a score bigger than four are accepted.
The elements in blue represent identical associations.
The elements in red are similar associations.
How is the Mealy machine constructed?

BLAST stage 1 (4)
Every prefix of size k-1 of a word is a state of the transducer.
Here there are three prefixes: A, Q and R.

BLAST stage 1 (5)
Every state can have α transitions to other states.

BLAST stage 1 (6)
The output alphabet corresponds to the start positions of the words in the sequence q.
position   0 1 2 3 4 5 6 7 8 9
q        = A Q R Q R R Q A R Q
First the exact matches between words.

BLAST stage 1 (7)
The output alphabet corresponds to the start positions of the words in the sequence q.
Second the similar matches between words.

BLAST stage 1 (8)
Using this Mealy machine we can look for the hits in every sequence d of D.
d = RAAQQARAQR

BLAST stage 1 (9)
For each word of d (start position j), the machine outputs the start positions i of the matching words in q:

  j   word of d   positions i in q   hits (j,i)
  0   RA          6                  (0,6)
  1   AA          /                  /
  2   AQ          0,7                (2,0),(2,7)
  3   QQ          1,2,3,5,8          (3,1),(3,2),(3,3),(3,5),(3,8)
  4   QA          6                  (4,6)
  5   AR          0,7                (5,0),(5,7)
  6   RA          6                  (6,6)
  7   AQ          0,7                (7,0),(7,7)
  8   QR          1,3,4              (8,1),(8,3),(8,4)

The (j,i)-pairs are used in the second stage of BLAST.

BLAST stage 2
As in FASTA, the diagonals are calculated.
Look for the HSPs: take every hit and try to extend it at both ends.
Stop when the score drops below S-X, where S is the best score reached so far.
Since the article published in 1997 in NAR, one first tries to combine hits that are on the same diagonal.
When the distance between two hits is less than or equal to 4, the two hits are merged.
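The (j,i) pairs that feed stage 2 can be reproduced with a small Mealy machine. The sketch below is restricted to w = 2, so that each state is a single letter (the prefix of size w-1) plus a start state; the function names and the dictionary representation of δ and λ are assumptions for illustration and not the implementation of Cameron et al.

```python
from itertools import product

# BLOSUM62 entries restricted to the alphabet {A, Q, R} of the example.
BLOSUM62 = {
    ("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
    ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1,
}
START = ""  # start state: no letter read yet


def sub(a, b):
    """Symmetric lookup of a substitution score."""
    return BLOSUM62[(a, b)] if (a, b) in BLOSUM62 else BLOSUM62[(b, a)]


def build_mealy(q, T, alphabet):
    """Build the transition function delta and the output function lam for w = 2."""
    delta, lam = {}, {}
    for a in alphabet:
        delta[(START, a)] = a      # first letter: no complete word yet
        lam[(START, a)] = []
    for state, a in product(alphabet, repeat=2):
        delta[(state, a)] = a      # the new prefix is the letter just read
        # Output: start positions in q of the words hit by the word state + a.
        lam[(state, a)] = [
            i for i in range(len(q) - 1)
            if sub(state, q[i]) + sub(a, q[i + 1]) >= T
        ]
    return delta, lam


def scan(d, delta, lam):
    """Read d once and collect the hits (j, i)."""
    state, hits = START, []
    for pos, letter in enumerate(d):
        hits.extend((pos - 1, i) for i in lam[(state, letter)])
        state = delta[(state, letter)]
    return hits


delta, lam = build_mealy("AQRQRRQARQ", T=5, alphabet="AQR")
hits = scan("RAAQQARAQR", delta, lam)
```

Each letter of d costs one transition, whatever the number of words in q, which is where the gain over probing a lookup table word by word comes from.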

BLAST stage 2 (2)
How are the hits combined?
1. Determine the diagonal for every pair (j,i), for instance diag(RA) = j - i = 0 - 6 = -6.
2. Store the start position (in q) in a table map which is indexed by the diagonal value:
   RA: map(-6) = 6
   AQ: map(2) = 0 ; map(-5) = 7
   QQ: map(2) = 1
3. If the index in map is already occupied, determine the distance between the start positions in q:
   AQ starts at 0 and QR starts at 1: 1 - 0 ≤ 4,
   so combine the hits and recalculate the score with BLOSUM62:
   Score(AQQ, AQR) = 4 + 5 + 1 = 10

BLAST stage 2 (3)
One can also include the score (S) and size (T) of the hit.
All complete hits (I = diagonal, P = start position in q, S = score, T = size):

   I    P    S    T
  -6    6    5    2
  -5    7   10    3
  -2    5   15    4
   0    3   11    6
   1    2    6    2
   2    0   10    3
   4    4    6    2
   5    0   15    5
   7    0   14    3

BLAST stage 2 (4)
Take every hit and try to extend it at both ends.
Stop when the score drops below S-X, where S is the best score reached so far.
Assume here X = 1; then only one HSP can be extended (the one on diagonal 0).
In this stage the indels are not considered.

   I    P    S    T
  -6    6    5    2
  -5    7   10    3
  -2    5   15    4
   0    3   12    7
   1    2    6    2
   2    0   10    3
   4    4    6    2
   5    0   15    5
   7    0   14    3

BLAST stage 3
Use dynamic programming to determine the gapped alignments for the HSPs.
All the gapless alignments that have a score higher than S1 are used.
Assume here that the HSPs kept are those with a score > 14.
The HSP with score > S1 is selected from the table above.
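The combination of hits on a diagonal and the ungapped extension can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a hit is merged with the previous hit on the same diagonal when their start positions in q differ by at most 4, and the extension stops when the running score falls below the best score minus X.

```python
from collections import defaultdict

BLOSUM62 = {
    ("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
    ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1,
}
Q_SEQ, D_SEQ = "AQRQRRQARQ", "RAAQQARAQR"
HITS = [(0, 6), (2, 0), (2, 7), (3, 1), (3, 2), (3, 3), (3, 5), (3, 8), (4, 6),
        (5, 0), (5, 7), (6, 6), (7, 0), (7, 7), (8, 1), (8, 3), (8, 4)]


def sub(a, b):
    """Symmetric lookup of a substitution score."""
    return BLOSUM62[(a, b)] if (a, b) in BLOSUM62 else BLOSUM62[(b, a)]


def merge_hits(hits, w=2, max_dist=4):
    """Group hits (j, i) by diagonal j - i; return (diagonal, start in q, size)."""
    by_diag = defaultdict(list)
    for j, i in hits:
        by_diag[j - i].append(i)
    merged = []
    for diag in sorted(by_diag):
        starts = sorted(by_diag[diag])
        start = prev = starts[0]
        for i in starts[1:]:
            if i - prev > max_dist:           # too far apart: close the segment
                merged.append((diag, start, prev + w - start))
                start = i
            prev = i
        merged.append((diag, start, prev + w - start))
    return merged


def segment_score(q, d, diag, start, size):
    """Ungapped score of q[start:start+size] against d on the given diagonal."""
    return sum(sub(q[i], d[i + diag]) for i in range(start, start + size))


def extend(q, d, diag, start, size, X=1):
    """Ungapped extension to the right, then to the left; returns (start, size, score)."""
    best = segment_score(q, d, diag, start, size)
    lo, hi = start, start + size              # half-open interval in q
    score, k = best, hi
    while k < len(q) and 0 <= k + diag < len(d):
        score += sub(q[k], d[k + diag])
        if score < best - X:
            break
        k += 1
        if score > best:
            best, hi = score, k
    score, k = best, lo - 1
    while k >= 0 and 0 <= k + diag < len(d):
        score += sub(q[k], d[k + diag])
        if score < best - X:
            break
        if score > best:
            best, lo = score, k
        k -= 1
    return lo, hi - lo, best


merged = merge_hits(HITS)
extended_diag0 = extend(Q_SEQ, D_SEQ, diag=0, start=3, size=6, X=1)
```

On diagonal 2 the two hits starting at 0 and 1 merge into AQR against AQQ with score 10, and the merged hit on diagonal 0 grows by one position to the right, from score 11 and size 6 to score 12 and size 7.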

BLAST stage 3 (2)
Use a banded version of SW to look for a local alignment with gaps that contains the HSP.
Like with FASTA, the distance is limited at both sides of the diagonal.
The idea is to limit the number of insertions and deletions.
Smith-Waterman with a gap penalty (g) of -10.

BLAST stage 4
Retrace starting from the highest value to get the local alignment.
In case of the banded SW algorithm we get:
QRQRR-QAR
-RAA-QQAR

BLAST: statistical significance
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Does the sequence d* with a score S* found at the top of the list really correspond to a homolog of q?
To understand the statistical significance we need to answer two questions:
1. What is the probability that a score of at least S is produced by chance?
2. How many chance associations can one expect when one searches for homologs in a database?
The following slides were adapted from INFO-F-434.

Statistical significance (2)
Using BLOSUM62, the score of an alignment is the sum of the substitution scores over the aligned positions u:
S = Σ_u s(a_u, b_u)

 R  L  A  S  V  E  T  D  M  P  L  T  L  R  Q  H
 T  L  T  S  L  Q  T  T  L  K  A  H  L  G  T  H
-1 +4 +0 +4 +1 +2 +5 -1 +2 -1 -1 -2 +4 -2 -1 +8  = 21

What is the statistical significance of this score?

Statistical significance (3)
The probability distribution of the scores is an extreme value distribution (EVD).
When one looks for homologs in a database, one is only interested in the sequences at the top.
In every alignment we always take the one that is the best.
The distribution is not Gaussian: it is an EVD.
W.R. Pearson (2000) ISMB tutorial: Protein sequence comparison and protein evolution

Statistical significance (4)
Where does this EVD come from?
Repeat the following step 1000 times (3 examples to the right):
Sample 1000 values (z) from a normal distribution.
In every run, we collect the mean and the maximum.
One run corresponds here to one alignment between two sequences.

Statistical significance (5)
Where does this EVD come from?
The final distribution of the mean values is again a normal distribution,
BUT the distribution of the maximum values is an EVD.
[Figure: distribution of the mean (normal) and distribution of the maximum (EVD)]
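The sampling experiment is easy to repeat. The sketch below uses only the Python standard library; the function names, the seed and the use of the sample skewness as a quick indicator of asymmetry are choices made for this illustration.

```python
import random
import statistics


def simulate(runs=1000, n=1000, seed=0):
    """For each run, draw n standard normal values and keep their mean and maximum."""
    rng = random.Random(seed)
    means, maxima = [], []
    for _ in range(runs):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(z))
        maxima.append(max(z))
    return means, maxima


def skewness(values):
    """Sample skewness: close to 0 for a symmetric distribution, > 0 for a long right tail."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


means, maxima = simulate()
skew_of_means = skewness(means)
skew_of_maxima = skewness(maxima)
```

The means stay symmetric around 0, while the maxima cluster a little above 3 with a longer tail to the right, which is the asymmetric shape expected of an extreme value distribution.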

Statistical significance (6)
Where does this EVD come from?
The distribution shows the probability that we find a certain score by chance: f(k,λ)

Statistical significance (7)
The theoretical distributions
The EVD represents the distribution of the scores that one can expect when one performs an alignment between two unrelated sequences.
[Figure: theoretical distributions; horizontal axis: alignment score]
Question 1: What is the probability of finding the score by chance?
Answer: P-val = 2^(-S)
Question 2: How many random matches will I find when searching the database?
Answer: E-val = N x 2^(-S) = N / 2^S
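For reference, the two answers can be related to the usual Karlin-Altschul statistics. The identification made here is an assumption and is not stated on the slides: S in the answers above is read as a bit score S', and N as the size of the search space m n (query length times database length). K and λ are the parameters of the EVD for the chosen scoring system.

```latex
\begin{align}
S' &= \frac{\lambda S - \ln K}{\ln 2}
   && \text{(bit score obtained from the raw score } S\text{)} \\
E  &= m\,n\,2^{-S'} = K\,m\,n\,e^{-\lambda S}
   && \text{(expected number of chance hits)} \\
P  &= 1 - e^{-E} \approx E
   && \text{(for small } E\text{)}
\end{align}
```

With this reading, each additional bit of score halves the number of matches expected by chance.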