
Computational Biology and Bioinformatics
4. Searching for homologs with BLAST

What next?
Comparing sequences and searching for homologs: sequence alignment and substitution matrices; searching for sequences with BLAST
MSA and profiles: multiple sequence alignment; PSSM-based profiles
Evolving sequences: phylogenetic trees

Finding homologs
We now know how to compute alignments and how to score them.
The next question is: given a sequence q, can we find other sequences d in a database D that are homologous to q?
For instance, when q is the gene G2A, the search should find all similar genes. It has to find the genes G1A, G1B and G2B, and they have to be at the top of the list.
The sequences found by this method can provide information about the structure and function of the protein q.

Finding homologs 2
Simple approach: make a global alignment between q and every sequence d in the database D.
BUT:
sometimes only a segment of q is similar to a segment of the other sequence d,
or the similarity is limited to a particular pattern,
or the order of the domains in q and d is not the same, although there are similarities between the domains.
For these reasons we need to use a local alignment.

Finding homologs 3
Local alignment: the Smith-Waterman (SW) algorithm.
BUT exhaustively applying SW takes too much time: in December 2009, UniProtKB/TrEMBL contained about 10^7 entries.
In the years 1980-1990, computational resources were limited, so there was a need for efficient techniques: FASTA and BLAST.
Dynamic programming (DP) is guaranteed to find the optimal alignment. This is no longer true for FASTA and BLAST, since they use heuristic methods.

FASTA
W. Pearson and D.J. Lipman (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85:2444-2448
http://www.ebi.ac.uk/tools/fasta33/index.html

BLAST
S.F. Altschul et al (1990) Basic local alignment search tool. J Mol Biol 215:403-410

BLAST: some initial definitions
A segment is a subsequence of a certain length in the original sequence.
A word is a segment of size w.
A maximal scoring segment pair (MSP) is the pair of aligned segments (no gaps) of the same length with the highest score in the sequences q and d.
A high scoring segment pair (HSP) is a pair of aligned segments (no gaps) whose score cannot be improved by extending the alignment at either end of the two segments.

BLAST
BLAST consists of 4 steps:
1. For every sequence d in D, look for the words of d that have a score of at least T when aligned with words in sequence q.
2. Determine the HSPs: take the pairs of aligned words and try to improve the alignment by extending it at both ends.
3. Use dynamic programming to determine the gapped alignments for the HSPs.
4. Retrieve the local alignments from the HSPs obtained in the previous step.

BLAST
The structure of BLAST is equivalent to the structure of FASTA. There are two important differences between the two systems:
In the first stage, FASTA looks for k-tuples in every sequence d that are identical to those in the sequence q. BLAST looks for k-tuples with a score above a threshold T.
In the first stage, FASTA uses a lookup table. BLAST can use the same data structure, but a finite state transducer was found to be more efficient.

BLAST stage 1
For every sequence d in D, look for the words of d that have a score of at least T when aligned with words in sequence q.
Every word pair that meets this condition is called a hit.
Originally the same indexing technique as in FASTA (hashing and chaining) was used, BUT it was later shown that a finite state transducer could improve the efficiency.

Deterministic Finite State Automata
A deterministic finite state automaton (DFA) is defined as a 5-tuple (Q, Σ, δ, q0, F):
Q is a finite set of states
Σ is the alphabet
δ: Q × Σ → Q is the transition function
q0 ∈ Q is the start state
F ⊆ Q is the set of end states

Finite state transducers
These are DFAs that can produce output. There are two kinds: Moore machines and Mealy machines.
Mealy machines are also often used in cryptography.
There is only one start state.

Finite state transducers 2
An FST is defined like a DFA but with three differences:
1. There are no end states anymore
2. Λ is a finite output alphabet
3. λ is an output function: λ: Q × Σ → Λ for Mealy machines, λ: Q → Λ for Moore machines
Therefore a finite state transducer is defined as a 6-tuple (Q, Σ, Λ, δ, λ, q0).
M. Cameron et al (2006) A deterministic finite automaton for faster protein hit detection in BLAST. Jour Comp Biol 13(4): 965-978

BLAST stage I 2
Take the following sequence q (size n = 10) and w = 2:
q = AQRQRRQARQ
The alphabet is {A,Q,R} with size α = 3.
The sequence is again partitioned into w-tuples: AQ, QR, RQ, QR, RR, RQ, QA, AR, RQ
BLOSUM 62 is used for the scores.
Look for all the words of size w (α^w = 3^2 = 9 in total) that have a score of at least T (here T = 5) when aligned to the w-tuples of q.

Scores of the 9 possible words (rows) against the distinct w-tuples of q (columns):

      AQ   QR   RQ   RR   QA   AR
AA     3   -2   -2   -2    3    3
AQ     9    0    4    0   -2    5
AR     5    4    0    4   -2    9
QA    -2    4    0    0    9   -2
QQ     4    6    6    2    4    0
QR     0   10    2    6    4    4
RA    -2    0    4    4    5   -2
RQ     4    2   10    6    0    0
RR     0    6    6   10    0    4

BLAST stage I 3
All words with a score bigger than four are accepted.
Accepted entries in which the word equals the w-tuple of q are identical associations (blue in the original table).
The other accepted entries are similar associations (red in the original table).
How is the Mealy machine constructed?

BLAST stage I 4
Every prefix of size w-1 of a word is a state of the transducer. Here there are three prefixes: A, Q and R.

BLAST stage I 5
Every state can have α transitions to other states.

BLAST stage I 6
The output alphabet corresponds to the start positions of the words in the sequence q:
position  0123456789
q       = AQRQRRQARQ
First, the exact matches between words are added as outputs.
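The score table above can be recomputed with a few lines of Python. This is only a minimal sketch: the names `word_score`, `score_table` and `BLOSUM62_AQR` are illustrative and not part of BLAST, and only the BLOSUM62 entries for the letters A, Q and R are included.

```python
from itertools import product

# BLOSUM62 entries restricted to the example alphabet {A, Q, R}.
_HALF = {("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
         ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1}
# The matrix is symmetric, so add the mirrored pairs.
BLOSUM62_AQR = {**_HALF, **{(b, a): s for (a, b), s in _HALF.items()}}


def word_score(u, v):
    """Ungapped score of two words of the same length."""
    return sum(BLOSUM62_AQR[a, b] for a, b in zip(u, v))


def score_table(q, w, alphabet="AQR"):
    """Score every possible word of size w against the distinct w-tuples of q."""
    # Distinct w-tuples of q, in order of first occurrence (the table columns).
    columns = list(dict.fromkeys(q[i:i + w] for i in range(len(q) - w + 1)))
    # All alphabet^w candidate words (the table rows).
    rows = ["".join(p) for p in product(alphabet, repeat=w)]
    scores = {(r, c): word_score(r, c) for r in rows for c in columns}
    return rows, columns, scores


T = 5
rows, columns, scores = score_table("AQRQRRQARQ", w=2)
# Entries with a score of at least T are the accepted associations.
accepted = {pair for pair, s in scores.items() if s >= T}

for r in rows:
    print(r, [scores[r, c] for c in columns])
```

Each printed row corresponds to one row of the table, with the columns in the order AQ, QR, RQ, RR, QA, AR.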

BLAST stage I 7
The output alphabet corresponds to the start positions of the words in the sequence q.
Second, the similar matches between words are added as outputs.

BLAST stage I 8
Using this Mealy machine we can look for the hits in every sequence d of D.
position  0123456789
q       = AQRQRRQARQ
d       = RAAQQARAQR

BLAST stage I 9
Reading d word by word, the machine outputs the following positions. Here j is the start position of the word in d and i is a start position in q; "/" means no output.

word in d   j   positions i in q   pairs (j,i)
RA          0   6                  (0,6)
AA          1   /                  /
AQ          2   0,7                (2,0),(2,7)
QQ          3   1,2,3,5,8          (3,1),(3,2),(3,3),(3,5),(3,8)
QA          4   6                  (4,6)
AR          5   0,7                (5,0),(5,7)
RA          6   6                  (6,6)
AQ          7   0,7                (7,0),(7,7)
QR          8   1,3,4              (8,1),(8,3),(8,4)

The (j,i)-pairs are used in the second stage of BLAST.

BLAST stage 2
As in FASTA, the diagonals are calculated.
Look for the HSPs: take every hit and try to extend its ends.
Stop when the score becomes less than S-X, where S is the best score found so far.
Since the article published in 1997 in NAR, one first tries to combine hits that are on the same diagonal.
When the distance between 2 hits is less than or equal to 4, the two hits are merged.
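The hit detection of stage 1 can be sketched as a small Mealy machine in Python. This is an illustration under simplifying assumptions, not the implementation of Cameron et al: the transition function δ and the output function λ are stored as plain dictionaries, the function names `build_mealy` and `scan` are hypothetical, and only the BLOSUM62 entries for A, Q and R are included.

```python
from itertools import product

# BLOSUM62 entries restricted to the example alphabet {A, Q, R}.
_HALF = {("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
         ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1}
SCORES = {**_HALF, **{(b, a): s for (a, b), s in _HALF.items()}}


def build_mealy(q, w, T, alphabet="AQR"):
    """Build delta (next state) and lam (output) keyed by (state, letter).

    A state is a prefix of size w-1. The output of a transition is the list
    of start positions in q whose w-tuple scores at least T against the word
    formed by the state followed by the letter.
    """
    delta, lam = {}, {}
    for prefix in map("".join, product(alphabet, repeat=w - 1)):
        for letter in alphabet:
            word = prefix + letter
            delta[prefix, letter] = word[1:]
            lam[prefix, letter] = [
                i for i in range(len(q) - w + 1)
                if sum(SCORES[a, b] for a, b in zip(word, q[i:i + w])) >= T
            ]
    return delta, lam


def scan(d, w, delta, lam):
    """Read d one letter at a time and collect the hits as (j, i) pairs."""
    hits = []
    state = d[:w - 1]                  # state reached after the first w-1 letters
    for pos in range(w - 1, len(d)):
        letter = d[pos]
        j = pos - w + 1                # start of the current word in d
        hits.extend((j, i) for i in lam[state, letter])
        state = delta[state, letter]
    return hits


delta, lam = build_mealy("AQRQRRQARQ", w=2, T=5)
hits = scan("RAAQQARAQR", 2, delta, lam)
print(hits)
```

The machine is built once per query and then reused for every sequence d of D, so scanning a database sequence costs one dictionary lookup per letter.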

BLAST stage 2 2
How are the hits combined?
The hits found in stage 1, as (j,i) pairs:
RA (0,6)
AA /
AQ (2,0),(2,7)
QQ (3,1),(3,2),(3,3),(3,5),(3,8)
QA (4,6)
AR (5,0),(5,7)
RA (6,6)
AQ (7,0),(7,7)
QR (8,1),(8,3),(8,4)

1. Determine the diagonal for every pair (j,i), for instance: diag(RA) = j-i = 0-6 = -6
2. Store the start position (in q) in a table map which is indexed by the diagonal value:
   RA: map(-6) = 6
   AQ: map(2) = 0 ; map(-5) = 7
   QQ: map(2) = 1
3. If the index in map is already occupied, determine the distance between the initial positions in q: AQ starts at 0 and QR starts at 1: 1-0 ≤ 4
4. Combine the hits and recalculate the score (BLOSUM 62): Score(AQQ,AQR) = 4+5+1 = 10

BLAST stage 2 3
One can also include the score (S) and size (T) of the hit. I is the diagonal and P is the start position in q.
All complete hits:

  I   P   S   T
 -6   6   5   2
 -5   7  10   3
 -2   5  15   4
  0   3  11   6
  1   2   6   2
  2   0  10   3
  4   4   6   2
  5   0  15   5
  7   0  14   3

BLAST stage 2 4
Take every hit and try to extend its ends.
Stop when the score becomes less than S-X, where S is the best score found so far.
Assume here X = 1; then one HSP can be extended: the one on diagonal 0, whose score goes from 11 to 12 and whose size goes from 6 to 7.
In this stage the indels are not considered.
After the extensions:

  I   P   S   T
 -6   6   5   2
 -5   7  10   3
 -2   5  15   4
  0   3  12   7
  1   2   6   2
  2   0  10   3
  4   4   6   2
  5   0  15   5
  7   0  14   3

BLAST stage 3
Use dynamic programming to determine the gapped alignments for the HSPs.
All the gapless alignments that have a score higher than S1 are used.
Assume here that only the HSPs with a score above 14 are kept. The HSPs with score > S1 in the table above are passed on to the dynamic programming step.
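The four combination steps can be sketched as follows. This is a minimal illustration, assuming that the distance is measured between the start positions in q of consecutive hits on the same diagonal and that the score of a merged hit is the plain BLOSUM62 sum over the whole merged segment; the function name `combine_hits` is hypothetical. Under these assumptions the sketch reproduces the table of complete hits, except on diagonal 5, where the plain sum over the five residue pairs gives 14 rather than the listed 15.

```python
from collections import defaultdict

# BLOSUM62 entries restricted to the example alphabet {A, Q, R}.
_HALF = {("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
         ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1}
SCORES = {**_HALF, **{(b, a): s for (a, b), s in _HALF.items()}}

q = "AQRQRRQARQ"
d = "RAAQQARAQR"
# (j, i) pairs from stage 1: j is the position in d, i the position in q.
hits = [(0, 6), (2, 0), (2, 7), (3, 1), (3, 2), (3, 3), (3, 5), (3, 8),
        (4, 6), (5, 0), (5, 7), (6, 6), (7, 0), (7, 7), (8, 1), (8, 3), (8, 4)]


def combine_hits(hits, q, d, w=2, max_dist=4):
    """Merge hits on the same diagonal; return rows (I, P, S, T)."""
    by_diag = defaultdict(list)
    for j, i in hits:
        by_diag[j - i].append(i)       # step 1 and 2: index by diagonal

    rows = []
    for diag in sorted(by_diag):
        starts = sorted(by_diag[diag])
        seg_start = prev = starts[0]
        # The trailing None flushes the last segment of the diagonal.
        for i in starts[1:] + [None]:
            if i is not None and i - prev <= max_dist:
                prev = i               # step 3: close enough, keep merging
                continue
            length = prev + w - seg_start
            q_seg = q[seg_start:seg_start + length]
            d_seg = d[seg_start + diag:seg_start + diag + length]
            score = sum(SCORES[a, b] for a, b in zip(q_seg, d_seg))  # step 4
            rows.append((diag, seg_start, score, length))
            seg_start = prev = i
    return rows


rows = combine_hits(hits, q, d)
for row in rows:
    print(row)
```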

BLAST stage 3 2
Use a banded version of SW to look for a local alignment with gaps that contains the HSP.
Like with FASTA, the distance is limited at both sides of the diagonal. The idea is to limit the number of insertions and deletions.
Smith-Waterman is used with a gap penalty (g) of -10.

BLAST stage 4
Retrace starting from the highest value to get the local alignment.
In the case of the banded SW algorithm we get:
QRQRR-QAR
-RAA-QQAR

BLAST
http://blast.ncbi.nlm.nih.gov/Blast.cgi

Statistical significance
Does the sequence d* with a score S* found at the top of the list really correspond to a homolog of q?
To understand the statistical significance we need to answer two questions:
1. What is the probability that a score of at least S is produced by chance?
2. How many chance associations can one expect when one searches for homologs in a database?
The following slides were adapted from INFO-F-434.
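The banded dynamic programming step can be sketched as below. This is a simplified illustration rather than the BLAST implementation: it assumes a linear gap penalty of -10 per gap position, a band given as a maximum distance from the diagonal of the HSP, and it returns only the best score and the cell where it is reached (the traceback is left out). The function name `banded_smith_waterman` and the band width of 2 are assumptions made for the example.

```python
# BLOSUM62 entries restricted to the example alphabet {A, Q, R}.
_HALF = {("A", "A"): 4, ("Q", "Q"): 5, ("R", "R"): 5,
         ("A", "Q"): -1, ("A", "R"): -1, ("Q", "R"): 1}
SCORES = {**_HALF, **{(b, a): s for (a, b), s in _HALF.items()}}


def banded_smith_waterman(q, d, diag, band, gap=-10):
    """Local alignment restricted to cells within `band` of diagonal `diag`.

    Cell (i, j) aligns q[i-1] with d[j-1]; its diagonal is j - i, the same
    convention as the (j, i) hit pairs. Cells outside the band keep the
    value 0 and are never filled.
    """
    n, m = len(q), len(d)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        lo = max(1, i + diag - band)
        hi = min(m, i + diag + band)
        for j in range(lo, hi + 1):
            H[i][j] = max(
                0,                                      # start a new alignment
                H[i - 1][j - 1] + SCORES[q[i - 1], d[j - 1]],  # (mis)match
                H[i - 1][j] + gap,                      # gap in d
                H[i][j - 1] + gap,                      # gap in q
            )
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return best, best_cell


# HSP on diagonal -2 of the running example, band of 2 on either side.
score, end = banded_smith_waterman("AQRQRRQARQ", "RAAQQARAQR", diag=-2, band=2)
print(score, end)
```

Restricting the inner loop to the band reduces the work from n × m cells to roughly n × (2 × band + 1) cells.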

Statistical significance 2
Using BLOSUM62, the score of an ungapped alignment is the sum of the substitution scores over the aligned positions u: S = Σ_u s(a_u, b_u)

 R  L  A  S  V  E  T  D  M  P  L  T  L  R  Q  H
 T  L  T  S  L  Q  T  T  L  K  A  H  L  G  T  H
-1 +4 +0 +4 +1 +2 +5 -1 +2 -1 -1 -2 +4 -2 -1 +8 = 21

What is the statistical significance of this score?

Statistical significance 3
When one looks for homologs in a database, one is only interested in the sequences at the top of the list.
In every alignment we always take the best one.
The distribution of the scores is therefore not Gaussian: it is an extreme value distribution (EVD).
W.R. Pearson (2000) ISMB tutorial: Protein sequence comparison and protein evolution

Statistical significance 4
Where does this EVD come from?
Repeat the following step 1000 times:
Sample 1000 values (z) from a normal distribution.
In every run, we collect the mean and the maximum.
One run corresponds here to one alignment between two sequences.

Statistical significance 5
Where does this EVD come from?
The final distribution of the mean values is again a normal distribution, BUT the distribution of the maximum values is an EVD.
[Figure: three example runs; distribution of the mean (normal); distribution of the maximum (EVD)]
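The sampling experiment can be reproduced with the Python standard library. This is a minimal sketch: the seed, the use of a standard normal distribution (mean 0, standard deviation 1) and the skewness check are assumptions added for the illustration. The skewness is used here only as a quick numerical sign that the distribution of the maxima is asymmetric, unlike the distribution of the means.

```python
import random
import statistics


def simulate(runs=1000, sample_size=1000, seed=0):
    """For each run, sample `sample_size` normal values; keep mean and maximum."""
    rng = random.Random(seed)
    means, maxima = [], []
    for _ in range(runs):
        z = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(z))
        maxima.append(max(z))
    return means, maxima


def skewness(values):
    """Sample skewness: about 0 for a symmetric distribution, > 0 for a right tail."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sd) ** 3 for v in values])


means, maxima = simulate()
print("mean of means:   ", round(statistics.fmean(means), 3))
print("mean of maxima:  ", round(statistics.fmean(maxima), 3))
print("skew of means:   ", round(skewness(means), 3))
print("skew of maxima:  ", round(skewness(maxima), 3))
```

The means cluster symmetrically around 0, while the maxima cluster around a value above 3 with a longer tail towards high values, which is the characteristic shape of an EVD.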

Statistical significance 6
Where does this EVD come from?
[Figure: distribution of the mean (normal) and distribution of the maximum (EVD)]

Statistical significance 7
The theoretical distributions
The EVD represents the distribution of scores that one can expect when one performs an alignment between two unrelated sequences.
The distribution f(k,λ) shows the probability that we find a certain alignment score by chance.
Question 1: What is the probability of finding the score by chance?
Answer: P-val = 2^{-S}
Question 2: How many random matches will I find when searching the database?
Answer: E-val = N / 2^{S}
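The two answers can be evaluated directly. The sketch below rests on two assumptions that the slides do not state: S is read as a normalized score in bits, and N is read as the size of the search space. The function names `p_value` and `e_value` are illustrative.

```python
def p_value(bit_score):
    """Probability of reaching at least this score by chance: 2^(-S)."""
    return 2.0 ** (-bit_score)


def e_value(bit_score, search_space):
    """Expected number of chance matches in the search: N / 2^S."""
    return search_space / 2.0 ** bit_score


# Hypothetical numbers, for illustration only.
S = 30
N = 2 ** 20
print(p_value(S))      # 2^-30
print(e_value(S, N))   # 2^20 / 2^30 = 2^-10
```

Each additional bit of score halves both values, while doubling the search space doubles the expected number of chance matches.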