Chapter 3. Multiple Sequence Alignment

3.1 Definitions

Let $S_1, S_2, \ldots, S_k$ be $k$ sequences over an alphabet $X$. We use $|S_i|$ to denote the length of $S_i$. An alignment of $S_1, S_2, \ldots, S_k$ is given by a $k \times n$ matrix $A$, where $n \ge |S_i|$ for every $i \le k$, such that

(a) Row $i$ contains the characters of $S_i$ in order, interspersed with $n - |S_i|$ spaces, and
(b) Each column contains at least one letter from $X$.

Example: The following is an alignment of 4 sequences.

M Q P I L L - G
M L R - L L - -
M K - I L L L -
M P P V L L I -
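
The two conditions of the definition can be checked mechanically. The following is a minimal sketch, assuming rows are given as strings and a space is written as "-"; the function name is_alignment is chosen here for illustration and does not come from the notes.

```python
GAP = "-"  # assumption: a space in the alignment is written as "-"

def is_alignment(rows, seqs):
    """Return True if `rows` is an alignment of `seqs` in the sense of Section 3.1."""
    if not rows or len(rows) != len(seqs):
        return False
    n = len(rows[0])
    # The matrix must be rectangular (k rows, n columns).
    if any(len(row) != n for row in rows):
        return False
    # Row i must spell S_i, in order, once the spaces are removed.
    for row, seq in zip(rows, seqs):
        if "".join(c for c in row if c != GAP) != seq:
            return False
    # Every column must contain at least one letter.
    return all(any(row[j] != GAP for row in rows) for j in range(n))

rows = ["MQPILL-G", "MLR-LL--", "MK-ILLL-", "MPPVLLI-"]
seqs = ["MQPILLG", "MLRLL", "MKILLL", "MPPVLLI"]
print(is_alignment(rows, seqs))
```

A matrix with an all-space column, such as the rows "A-" and "A-" for the sequences "A" and "A", is rejected by the last check.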

3.2 Biological motivation

Multiple sequence alignments are the starting point for many analyses of protein families. A good multiple alignment allows us to:

(1) Find common conserved regions (or motif patterns) among sequences.

(2) Detect members of a gene family. Proteins are categorized into families. A protein family is a class of homologous proteins with similar sequences, structure, function, and/or evolutionary history. When an unknown protein is newly sequenced, one would often like to know which family it belongs to, as this can be a clue to its function. One approach to finding the correct family for a protein is to compare its sequence to the alignment of each family.

(3) Trace evolutionary paths through sequence similarity. By counting the mutations that are necessary to explain the transformation from an ancestor sequence to a current sequence, one can estimate the time at which two sequences diverged.

3.3 Scoring a multiple sequence alignment

The score of a multiple alignment is defined as the sum of the scores of its columns. Various scoring schemes have been proposed to score a column.

1. The Sum of Pairs (SP) scoring scheme. The SP score of a column in the alignment is the sum of the scores of all pairs of characters in the column. For example, in the alignment of Section 3.1, the score $SP(P, R, \text{-}, P)$ of the 3rd column is

$$s(P, R) + s(P, \text{-}) + s(P, P) + s(R, \text{-}) + s(R, P) + s(\text{-}, P),$$

where $s(a, b)$ is the score of the pair $a$ and $b$ (including spaces) and $s(\text{-}, \text{-}) = 0$.

Exercise: Assume we score a match 4, a mismatch -2, and an indel -1. Find the SP score of the above alignment.

2. Consensus score. The consensus of a multiple alignment is the sequence of the most common characters in each column of the alignment. For example,

            M Q P I L - L
            M L R - L - L
            M K - I L L L
            M P P V L L I
consensus   M Q P I L L L

The consensus score of a column is the number of characters (including spaces) that are identical to the consensus character in the column.
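
Both column scores are easy to compute directly. The sketch below assumes the pair scores from the exercise (match 4, mismatch -2, indel -1, and 0 for two spaces) and writes a space as "-". The function names are illustrative, and ties for the consensus character are broken arbitrarily, which does not change the consensus score.

```python
from collections import Counter
from itertools import combinations

GAP = "-"  # assumption: a space in the alignment is written as "-"

def pair_score(a, b, match=4, mismatch=-2, indel=-1):
    """s(a, b) with the scores of the exercise; s(-, -) = 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return indel
    return match if a == b else mismatch

def sp_column(col):
    """Sum of s(a, b) over all unordered pairs of characters in one column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def sp_score(rows):
    """SP score of an alignment: sum of the SP scores of its columns."""
    return sum(sp_column(col) for col in zip(*rows))

def consensus_score(rows):
    """Sum over columns of the count of the most common character."""
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*rows))

print(sp_column("PR-P"))  # -3: the six pairs score -2, -1, 4, -1, -2, -1
rows = ["MQPIL-L", "MLR-L-L", "MK-ILLL", "MPPVLLI"]
print(consensus_score(rows))  # 18: column counts 4, 1, 2, 2, 4, 2, 3
```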

Multiple Sequence Alignment Problem (MSA)

Instance: A set of $k$ sequences and a scoring scheme (say, SP with the substitution matrix BLOSUM62).
Question: Find an alignment of the given sequences that has the maximum score.

Remarks:

1. The pairwise alignment problem is a special case of the MSA problem in which there are only two sequences.

2. The optimal multiple alignment of all the sequences is not necessarily optimal for a given pair. For instance, consider this optimal alignment

A T R
A - R
- T R
A T R
A T R

in which we score a mismatch and an indel -2 each. In this alignment, the 2nd and 3rd sequences are not optimally aligned. Their optimal alignment is

A R
T R

3.4 Dynamic programming algorithm for MSA

To solve the MSA problem for $k$ sequences $S_1, S_2, \ldots, S_k$, we generalize the two-sequence case. For simplicity, we assume each sequence is of length $n$. Instead of a 2-dimensional table, we have a $k$-dimensional table $T$ of size $(n+1) \times (n+1) \times \cdots \times (n+1)$ with $(n+1)^k$ entries. For each entry $(i_1, i_2, \ldots, i_k)$ of $T$, $s(i_1, i_2, \ldots, i_k)$ is the score of the optimal alignment of the length-$i_1$ prefix of $S_1$, the length-$i_2$ prefix of $S_2$, and so on, up to the length-$i_k$ prefix of $S_k$. Finally, we use $S_1[j]$ to denote the $j$th letter of $S_1$.

Recall that for the case $k = 2$, we have the recurrence relation

$$
s(i_1, i_2) = \max \begin{cases}
s(i_1 - 1, i_2 - 0) + \delta(S_1[i_1], \text{-}), & S_1[i_1] \text{ vs } \text{-} \\
s(i_1 - 0, i_2 - 1) + \delta(\text{-}, S_2[i_2]), & \text{-} \text{ vs } S_2[i_2] \\
s(i_1 - 1, i_2 - 1) + \delta(S_1[i_1], S_2[i_2]), & S_1[i_1] \text{ vs } S_2[i_2]
\end{cases}
\tag{3.1}
$$

where $S_1[i_1]$ and $S_2[i_2]$ are the last characters of the current prefixes of $S_1$ and $S_2$ respectively. The recurrence relation (3.1) can be rewritten as

$$s(\bar{i}) = \max \{\, s(\bar{i} - \bar{b}) + \delta(\bar{c}) \mid \bar{b} = (1, 0), (0, 1), (1, 1) \,\}$$

where $\bar{i} = (i_1, i_2)$ and $\bar{c} = (c_1, c_2)$ is defined as follows: $c_1 = S_1[i_1]$ if $b[1] = 1$ and - (space) otherwise, and $c_2 = S_2[i_2]$ if $b[2] = 1$ and - (space) otherwise.

The recurrence relation for $k$ sequences becomes

$$s(\bar{i}) = \max \{\, s(\bar{i} - \bar{b}) + SP(\bar{c}) \mid \bar{b} \in \{0, 1\}^k \setminus \{(0, 0, \ldots, 0)\} \,\}$$

where $\bar{i} = (i_1, i_2, \ldots, i_k)$ and $\bar{c} = (c_1, c_2, \ldots, c_k)$ with $c_j = S_j[i_j]$ if $b[j] = 1$ and - (space) otherwise, for $j = 1, 2, \ldots, k$.

Exercise: What are the base conditions for $k$ sequences?

This generalization takes $O(k^2 2^k n^k)$ time. The reason is that there are $(n+1)^k = O(n^k)$ entries in the table; for each entry, we consider $2^k - 1$ possibilities; and the naive calculation of the SP score for a column takes $O(k^2)$ steps. The space complexity is $O(n^k)$. Hence, this approach is not practical for aligning a protein family with hundreds of proteins. In addition, a polynomial-time algorithm for MSA is very unlikely to exist because of the following complexity result.

Theorem. Finding an optimal multiple alignment under the SP scoring scheme is NP-hard.
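
The recurrence for $k$ sequences can be sketched as a memoized recursion over index vectors. This is an illustration under stated assumptions, not a practical aligner: it uses the pair scores from the exercise in Section 3.3 (match 4, mismatch -2, indel -1), takes $s(0, \ldots, 0) = 0$, and skips any move $\bar{b}$ that would step before the start of a sequence, which is one possible way to handle the boundary. It returns only the optimal score and is usable only for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"  # assumption: a space in the alignment is written as "-"

def pair_score(a, b, match=4, mismatch=-2, indel=-1):
    """Hypothetical pair scores; two spaces score 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return indel
    return match if a == b else mismatch

def sp(col):
    """SP score of one column, computed naively in O(k^2)."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def msa_score(seqs):
    """Optimal SP score of a multiple alignment of `seqs` (exponential in k)."""
    k = len(seqs)
    # All 2^k - 1 non-zero vectors b in {0, 1}^k.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    @lru_cache(maxsize=None)
    def s(idx):
        if not any(idx):  # all prefixes are empty
            return 0
        best = float("-inf")
        for b in moves:
            # Skip moves that consume a letter from an empty prefix.
            if any(bj > ij for bj, ij in zip(b, idx)):
                continue
            col = [seqs[j][idx[j] - 1] if b[j] else GAP for j in range(k)]
            prev = tuple(ij - bj for ij, bj in zip(idx, b))
            best = max(best, s(prev) + sp(col))
        return best

    return s(tuple(len(x) for x in seqs))

print(msa_score(["AT", "A"]))  # 3: A aligned with A (4), T aligned with a space (-1)
```

The table of the text corresponds to the cache of `s`, which holds one entry per index vector reached.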

3.5 Progressive alignment approaches

Since the multiple sequence alignment problem is NP-hard, various heuristic approaches have been proposed. The most commonly used one is progressive alignment. This approach builds the multiple alignment through a series of pairwise alignments. Initially, two closely related sequences are aligned, and this alignment is fixed. Then a third sequence is chosen and aligned to the first alignment, and so on. This process is iterated until all sequences have been aligned. The heuristic is fast, but it does not guarantee an optimal alignment.

Star Alignment

The idea of the star alignment is to find the sequence that is most similar to all the rest, and then to use it as the center of a star to which all the other sequences are aligned. Consider 5 sequences

S1 = ATTCGGATT
S2 = ATCCGGATT
S3 = ATGGAATTTT
S4 = ATGTTGTT
S5 = AGTCAGG

To calculate the center sequence, we compute all the pairwise alignment scores. Assume these pairwise alignment scores are given in the following matrix.

(Matrix of pairwise alignment scores with rows and columns S1, S2, S3, S4, S5; the entries are missing from this transcription.)

Summing the pairwise scores in each row of the matrix, we obtain that S1 is closest to all the other sequences. Hence, S1 is selected to be at the center of the star.
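
Choosing the center amounts to computing all pairwise scores and taking the row with the largest sum. The sketch below uses a standard global alignment score with hypothetical parameters (match 1, mismatch -1, gap -2), so the center it picks need not coincide with the one in the text, whose scoring scheme is not stated. The names nw_score and star_center are illustrative.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score by dynamic programming (two rows of the table)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def star_center(seqs):
    """Return (index of the center, list of row sums of the pairwise score matrix)."""
    k = len(seqs)
    totals = [
        sum(nw_score(seqs[i], seqs[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    center = max(range(k), key=lambda i: totals[i])
    return center, totals

seqs = ["ATTCGGATT", "ATCCGGATT", "ATGGAATTTT", "ATGTTGTT", "AGTCAGG"]
center, totals = star_center(seqs)
print("center: S%d" % (center + 1), totals)
```

Computing the score matrix is the dominant cost: there are $k(k-1)/2$ pairs, each needing a quadratic-time pairwise alignment.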

1. Calculate the optimal pairwise alignments between S1 and all the other sequences. Assume they are

S1: A T T C G G A T T
S2: A T C C G G A T T

S1: A T T C G G A T T - -
S3: A T G - G A A T T T T

S1: A T T C G G A T T
S4: A T G T T G - T T

S1: A T T C G G A T T
S5: A G T C A G G - -

2. Merge all the alignments using the "once a gap, always a gap" principle. We start with the alignment of S1 and S2. Then we add S3. Since two spaces follow S1 in the alignment of S1 and S3, two spaces need to be added to the ends of S1 and S2.

S2: A T C C G G A T T - -
S1: A T T C G G A T T - -
S3: A T G - G A A T T T T

These gaps are never removed from the sequences in the alignments ("once a gap, always a gap"). Finally, we add S4 and S5 in order.

S2: A T C C G G A T T - -
S3: A T G - G A A T T T T
S1: A T T C G G A T T - -
S4: A T G T T G - T T - -

S2: A T C C G G A T T - -
S3: A T G - G A A T T T T
S4: A T G T T G - T T - -
S1: A T T C G G A T T - -
S5: A G T C A G G - - - -
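
The merging step can be sketched as a walk along the center row of the growing alignment and the center row of each new pairwise alignment. The code assumes that each pairwise alignment is given as a pair of gapped strings (center first, other sequence second) and that the center spells the same sequence in all of them. The function name merge_star is illustrative, and the output lists the center first rather than in the row order shown above.

```python
GAP = "-"  # assumption: a space in the alignment is written as "-"

def merge_star(pairwise):
    """Merge (center_row, other_row) alignments with 'once a gap, always a gap'."""
    msa = [list(pairwise[0][0]), list(pairwise[0][1])]  # msa[0] is the center row
    for c_al, o_al in pairwise[1:]:
        new_row = []
        i = j = 0  # i: column in msa, j: column in the pairwise alignment
        while i < len(msa[0]) or j < len(c_al):
            msa_gap = i < len(msa[0]) and msa[0][i] == GAP
            pair_gap = j < len(c_al) and c_al[j] == GAP
            if msa_gap and not pair_gap:
                # Existing gap column in the center: the new sequence gets a space.
                new_row.append(GAP)
                i += 1
            elif pair_gap and not msa_gap:
                # New gap in the center: open a gap column in every existing row.
                for row in msa:
                    row.insert(i, GAP)
                new_row.append(o_al[j])
                i += 1
                j += 1
            else:
                # Same center character (or a gap in both): copy the aligned character.
                new_row.append(o_al[j])
                i += 1
                j += 1
        msa.append(new_row)
    return ["".join(row) for row in msa]

pairwise = [
    ("ATTCGGATT", "ATCCGGATT"),      # S1 with S2
    ("ATTCGGATT--", "ATG-GAATTTT"),  # S1 with S3
    ("ATTCGGATT", "ATGTTG-TT"),      # S1 with S4
    ("ATTCGGATT", "AGTCAGG--"),      # S1 with S5
]
for row in merge_star(pairwise):
    print(row)
```

Each new sequence is handled in time proportional to the current alignment length times the number of rows already merged, because opening a gap column touches every row.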

Complexity Analysis: Given $k$ sequences of length $n$, the star alignment approach takes $O(k^2 n^2)$ time to calculate all the pairwise alignment scores and then find the sequence $S$ that is at the center of the star. It then takes $O(k l)$ time to merge all the pairwise alignments into a multiple alignment, where $l$ is the length of the resulting alignment.

Exercise: The star alignment approach is not guaranteed to give an optimal alignment because of the "once a gap, always a gap" principle, which tends to introduce excessive gaps. Here is an example. Consider the following three sequences

S1: TCCGAA
S2: TCGAGA
S3: TCCAGA

Optimal pairwise alignments are

S1: TCCGA-A
S2: TC-GAGA

S1: TCC-GAA
S3: TCCAGA-

When these alignments are merged, the resulting multiple alignment

S1: TCC-GA-A
S2: TC--GAGA
S3: TCCAGA--

is not optimal, since it is worse than the alignment

S1: TCCGA-A
S2: TC-GAGA
S3: TCC-AGA

ClustalW

INPUT: Sequences $S_1, S_2, \ldots, S_k$.

(1) Compute all the $\binom{k}{2}$ pairwise alignment scores and convert them into distances.
(2) Construct a guide tree $T$ from the pairwise distances using the Neighbor-Joining method.
(3) Gradually build up the multiple sequence alignment following the order in the guide tree $T$.

In Step (3), sequence-sequence alignments can be done with the dynamic programming approach.

A sequence is added to an existing group by aligning it to each sequence in the group in turn. The highest-scoring pairwise alignment is used to merge the sequence into the alignment of the group, following the principle "once a gap, always a gap". For example, consider the following group alignment

S1: A G - A T -
S2: - G A A T C

and a sequence S: CGAAATC. The highest-scoring pairwise alignment is

S2: - G - A A T C
S:  C G A A A T C

Hence, S is merged into the group alignment as

S1: A G - - A T -
S2: - G - A A T C
S:  C G A A A T C

To align a group with a group, all sequence pairs between the two groups are tried. The highest-scoring pairwise alignment determines the alignment of the two groups. For instance, consider the following two groups:

S1: A T T G C C A T T - -
S2: A T C - C A A T T T T

S3: A T G G C C A T T
S4: A T C T T C - T T

The alignment of S1 and S3,

S1: A T T G C C A T T
S3: A T G G C C A T T

has the maximum score. Thus it is used for aligning the two groups as

S2: A T C - C A A T T T T
S1: A T T G C C A T T - -
S3: A T G G C C A T T - -
S4: A T C T T C - T T - -

ClustalW suffers from the following problems:

(a) The optimal alignment may not be found.
(b) The guide tree is derived from pairwise distances and is therefore less reliable.
(c) When all the sequences are highly divergent (say, less than 25-30% identity between any pair of sequences), this progressive approach becomes less reliable.

T-Coffee

T-Coffee is another fast multiple sequence alignment method. The name stands for Tree-based Consistency Objective Function for alignment Evaluation. The progressive alignment method suffers from its greediness: errors made in the first alignments cannot be rectified later as the rest of the sequences are added. T-Coffee was proposed to minimize that effect. Another motivation for T-Coffee is to use properties of both local and global pairwise alignments of the given sequences. It has 3 steps:

Step 1: Generate a primary library of pairwise alignments.
Step 2: Extend the library.
Step 3: Progressively align all the sequences.

Step 1. Generating a library of pairwise alignments.

The primary library contains a set of pairwise alignments between all the sequences to be aligned. The global alignment library is generated using ClustalW; the local alignment library is generated using Lalign from the FASTA package.

Each pairwise alignment is considered as a list of pairwise residue matches (residue a of sequence A is aligned with residue b of sequence B). Each of these matches is a constraint. These constraints are weighted for later use, since some may come from parts of alignments that are more likely to be correct. Each aligned pair (a constraint) in a pairwise alignment receives a weight equal to the percent identity within that alignment.

To combine local and global alignment information, when a pair is duplicated between the two libraries, the two copies are merged into a single entry whose weight is the sum of the two weights.
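
Step 1 can be sketched as follows. The code assumes a pairwise alignment is given as two gapped strings, identifies a residue by a hypothetical key (sequence name, 1-based position), and takes percent identity to be the share of identical residues among the columns without a space; the notes do not fix that convention, so it is an assumption. Adding the same pair twice (once from a global and once from a local alignment) sums the weights.

```python
from collections import defaultdict

GAP = "-"  # assumption: a space in the alignment is written as "-"

def percent_identity(row_a, row_b):
    """Percent of identical residues among the columns that contain no space."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != GAP and b != GAP]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def add_to_library(library, name_a, row_a, name_b, row_b):
    """Add every aligned residue pair of one pairwise alignment to `library`."""
    weight = percent_identity(row_a, row_b)
    i = j = 0  # 1-based positions of the current residues in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != GAP:
            i += 1
        if b != GAP:
            j += 1
        if a != GAP and b != GAP:
            # A duplicated pair accumulates the sum of its weights.
            library[frozenset({(name_a, i), (name_b, j)})] += weight

library = defaultdict(float)
add_to_library(library, "A", "AC-T", "B", "ACGT")  # e.g. from the global library
add_to_library(library, "A", "AC-T", "B", "ACGT")  # the same pairs from the local library
print(library[frozenset({("A", 3), ("B", 4)})])
```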

Example: pairwise alignments of four sequences SeqA, SeqB, SeqC, and SeqD in the primary library.

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT
SeqD THE ---- FAT CAT

SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT

SeqB GARFIELD THE FAST CAT
SeqD THE FA-T CAT

SeqC GARFIELD THE VERY FAST CAT
SeqD THE ---- FA-T CAT

The alignment of SeqA and SeqB considered directly, through SeqC, and through SeqD:

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT

SeqA GARFIELD THE LAST FAT CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE FAST CAT

SeqA GARFIELD THE LAST FAT CAT
SeqD THE FAT CAT
SeqB GARFIELD THE FAST CAT

Step 2. Extending the library.

Each pair of aligned residues in the library is reassigned a weight that reflects some of the information contained in the whole library. The triplet approach is used for reassigning a weight to each aligned pair: it takes each aligned residue pair from the library and checks the alignment of the two residues with residues from the remaining sequences. The weight associated with the pair is the sum of all the weights gathered through the examination of all the triplets involving that pair. The more intermediate sequences support an aligned pair, the higher its weight.
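
The triplet extension can be sketched on a library stored as a dictionary from residue pairs to weights. One detail is an assumption here because the text above does not spell it out: following the published description of T-Coffee, a triplet x, z, y contributes the smaller of the two weights W(x, z) and W(z, y) to the pair (x, y). The residue keys and the weights 88, 77, and 100 below are made-up illustration values.

```python
from collections import defaultdict

def extend_library(library):
    """Return extended weights: W'(x, y) = W(x, y) + sum over z of min(W(x, z), W(z, y)).

    `library` maps frozenset({x, y}) to a weight, where x and y are residues
    written as (sequence name, position) and come from different sequences.
    """
    # neighbours[z] lists every residue aligned with z in the primary library.
    neighbours = defaultdict(dict)
    for pair, weight in library.items():
        x, y = tuple(pair)
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = defaultdict(float, library)  # start from the primary weights
    for z, aligned in neighbours.items():
        items = list(aligned.items())
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (x, wx), (y, wy) = items[a], items[b]
                if x[0] != y[0]:  # x and y must belong to different sequences
                    extended[frozenset({x, y})] += min(wx, wy)
    return dict(extended)

primary = {
    frozenset({("A", 1), ("B", 1)}): 88.0,
    frozenset({("A", 1), ("C", 1)}): 77.0,
    frozenset({("B", 1), ("C", 1)}): 100.0,
}
extended = extend_library(primary)
print(extended[frozenset({("A", 1), ("B", 1)})])  # 165.0 = 88 + min(77, 100)
```

In this sketch a pair that is supported only through an intermediate residue, and never aligned directly, also receives a weight; whether such pairs should be kept is a design choice not settled by the notes.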

Step 3. Progressively align all the sequences.

The weight is zero for any residue pair that never occurs in the library. (This is true for the majority of residue pairs.) Thus, for any residue a in SeqA and b in SeqB, a weight is assigned to (a, b) by Step 1 and Step 2, and all the weights together form a position-dependent scoring scheme $\delta_W$ for alignment.

Step 3.1 Pairwise alignments are first made using the scoring scheme $\delta_W$ to produce a distance matrix between all the sequences. The matrix in turn is used to produce a guide tree $T$ using the Neighbor-Joining method.

Step 3.2 Gradually build up the multiple sequence alignment following the order in the guide tree $T$.

In Steps 3.1 and 3.2, the gap penalty is set to 0. This stems from the fact that the library weights were computed from pairwise alignments in which such a penalty had already been applied.

Comparison with other methods

T-Coffee performed much better than other top alignment methods on the benchmark alignment data set BaliBase.

(Table: results of ClustalW, Prrp, and T-Coffee on the BaliBase categories Cat1 (81), Cat2 (23), Cat3, Cat4, Cat5, and All; the numeric entries are missing from this transcription.)

Questions

1. Can T-Coffee be improved by using a different weighting scheme?
2. Can T-Coffee be improved by building a library of 3-sequence alignments?

References

1. Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinements as assessed by reference to structural alignment. JMB 264.
2. Thompson, J., Plewniak, F. and Poch, O. (1999) BaliBase: A benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics 15.
3. Thompson, J., Plewniak, F. and Poch, O. (1999) A comprehensive comparison of multiple sequence alignment programs. NAR 27.
4. Notredame, C., Higgins, D. and Heringa, J. (2000) T-Coffee:... JMB 302.