BIOINFORMATICS: SEQUENCE ALIGNMENT

Carmen Nigro
Department of Computing Sciences
Villanova University
Villanova, Pennsylvania

ABSTRACT

As more DNA and protein sequence data are discovered, sequence alignment programs are becoming increasingly important for analyzing these data. Alignments can help us learn more about the functions of certain genes and proteins, and these observations could ultimately lead to the discovery of cures for certain genetic diseases or to better insight into the evolutionary process. There are many different alignment strategies, and choosing the appropriate one depends on the ultimate goal of the alignment. This paper outlines and compares the basic approaches and algorithms for sequence alignment.

KEY WORDS

sequence alignment, global alignment, local alignment, dynamic programming, progressive alignment, iterative alignment

1. INTRODUCTION

Sequence alignment is an important branch of bioinformatics that analyzes and compares the sequences that make up DNA and proteins. It is a way of comparing two or more sequences by searching for a series of individual characters that appear in the same order in the sequences [1]. As improved methods were found for collecting biological data, such as nucleotide and amino acid sequences, a need arose for databases that allow easy storage, retrieval, and revision of the data. Today, bioinformatics scientists are interested in the analysis and interpretation of those data. Because these sequences are too long to be analyzed manually, efficient and accurate alignment programs are essential for comparing DNA or protein sequences. The sequences being compared are usually represented by nitrogenous bases for DNA or by amino acids for proteins. Four different nitrogenous bases make up DNA, while 20 different amino acids make up proteins.
Through sequence alignments, attempts can be made to identify homologous sequences, that is, sequences with a common evolutionary origin [1]. The discovery of homologous sequences may help to reconstruct the evolutionary process from the segments that have mutated and the segments that have remained the same over time. Sequence alignment also has functional importance, as sequences that are alike may have the same role or code for the same entity. The drug industry has benefited from applying this notion when designing new drugs to treat certain diseases. Some diseases are caused by the lack of certain parts of a protein sequence. Sequence alignment can help to identify those regions, and the deficiency may then be compensated for by injecting the missing sequence into the protein. Sequence alignment has also been useful for analyzing protein structure. Protein molecules that are alike in sequence are also more likely to have similar structures, as many of the same bonds will form. In addition, similar protein sequences have been used to determine protein structure-function relationships [2].

2. GLOBAL AND LOCAL ALIGNMENTS

Global and local alignments are two different methods of aligning sequences. Which method to choose depends on the purpose of the alignment. Global alignments attempt to compare every residue of every sequence and are best employed when the sequences are similar and of the same size, because sequences of different sizes will produce mismatches at the ends of an alignment. However, when every element of two dissimilar sequences is aligned, many gaps are produced because of the many mismatches between the sequences, as seen in Figure 1. When comparing two long sequences, these gaps can become difficult to analyze. Local alignments are best employed for dissimilar sequences that may have similar regions [3].
Local alignments are very useful for finding a particular pattern that occurs in both sequences, as that pattern may also have a similar function. If both sequences are very similar, it should not make a difference which method is used, because the alignments should produce similar results. There is also no difference in time efficiency between the two methods. The most fundamental global and local alignment algorithms are based on dynamic programming: the Needleman-Wunsch algorithm solves the global alignment problem, while the Smith-Waterman algorithm solves the local alignment problem.
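The difference between the two dynamic programming formulations can be sketched with their textbook recurrences. The notation here is an assumption introduced for illustration and does not come from the paper: x and y are the two sequences, of lengths n and m, s(a, b) is a substitution score, and d > 0 is a fixed penalty per gap position.

```latex
% Global alignment (Needleman-Wunsch), linear gap penalty d
F(i,0) = -i\,d, \qquad F(0,j) = -j\,d
F(i,j) = \max \begin{cases}
  F(i-1,j-1) + s(x_i, y_j) \\
  F(i-1,j) - d \\
  F(i,j-1) - d
\end{cases}
\qquad \text{optimal global score} = F(n,m)

% Local alignment (Smith-Waterman), same scores, floor at zero
H(i,0) = H(0,j) = 0
H(i,j) = \max \begin{cases}
  0 \\
  H(i-1,j-1) + s(x_i, y_j) \\
  H(i-1,j) - d \\
  H(i,j-1) - d
\end{cases}
\qquad \text{optimal local score} = \max_{i,j} H(i,j)
```

The extra 0 in the local recurrence lets an alignment restart wherever the running score would become negative, which is why the local method can isolate a similar region inside otherwise dissimilar sequences. Both tables have (n+1)(m+1) cells, which is consistent with the two methods having the same time efficiency.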

Figure 1. Examples of local and global alignments

3. PAIRWISE AND MULTIPLE ALIGNMENTS

Pairwise alignments align two sequences at a time, while multiple alignments align three or more sequences at a time. Analyzing three or more sequences at a time can be useful for studying molecular evolution and analyzing sequence-structure relationships [1]. Also, a pattern common to a set of sequences may only become apparent through multiple sequence alignment [1]. While the dynamic programming techniques described above are reliable methods of alignment, they are not practical for multiple alignments. Extending the dynamic programming algorithm to multiple alignment produces an optimal alignment in time $O(n^k)$ for k sequences of length n [13]. The cost of multiple sequence alignment therefore grows exponentially with each added sequence and becomes unreasonable for comparing more than three sequences at a time [1]. Because dynamic programming is impractical for the multiple alignment problem, many heuristic algorithms have been developed, which sacrifice accuracy for time efficiency. Heuristic approaches attempt to optimize pairwise alignments rather than searching for an overall optimal alignment [15]. Over 75 methods of solving the multiple alignment problem have been identified, and the problem continues to be central to computational molecular biology [15].

4. SCORING FUNCTIONS

Scoring schemes are important for sequence alignment programs, because they are the means of comparing different alignments. An alignment algorithm needs a scoring function so that scores can be assigned to different alignments based on the number of gaps and the number of matches. Scores are assigned to each possible pair of elements based on the similarity of their chemical properties and the evolutionary probability of the mutation.
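A scoring function of this kind can be sketched in a few lines of Python. The function name and the numeric values for a match, a mismatch, and a gap are assumptions chosen for illustration; real programs draw pair scores from substitution matrices that reflect chemical similarity and mutation probability.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap.

    The default match, mismatch, and gap values are illustrative
    assumptions, not values taken from any particular program.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column of two gaps carries no information
        if a == "-" or b == "-":
            total += gap       # one element aligned against a gap
        elif a == b:
            total += match     # identical elements
        else:
            total += mismatch  # substitution
    return total
```

For example, scoring `GATTACA` against `GA-TACA` counts six matches and one gap, so candidate alignments of the same two sequences can be ranked by comparing their totals.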
Gap costs are also an important part of any sequence alignment program and have been studied extensively. Gap costs may take into account that a single mutational event may insert or delete multiple elements [2]. Gap costs must also account for aligning elements with nulls when sequences are of different lengths. Algorithms that charge a fixed penalty for each gap are popular and are easily extended to multiple alignments [2]. An example of a scoring scheme with fixed penalties can be seen in Figure 2.

Figure 2. A simple scoring function

One type of scoring for multiple alignments is the sum-of-pairs score, which increases with the number of sequences aligned correctly [3]. For a multiple alignment, the sum of pairs is the total of the alignment costs over every pair of sequences in the alignment. A column score may also be implemented in a multiple sequence alignment program; it tests the ability of the program to align all of the sequences correctly. Scoring functions are crucial to any alignment program, because they directly affect the choice of the optimal alignment.

5. SEQUENCE ALIGNMENT ALGORITHMS

Significant research into algorithmic approaches to sequence alignment has been performed over the past 20 years [10]. The most popular sequence alignment algorithms in use today fall into the following major classifications.

5.1 Dynamic Programming

Dynamic programming involves breaking a larger problem down into smaller, more manageable pieces. The basic dynamic programming approach for sequence alignment finds an optimal path through a rectangular path graph. It accomplishes this by turning one sequence into another through a series of edits. Each edit to the sequence is associated with a particular cost, and the goal is to find the series of edits with the lowest total cost [1]. This method drastically reduces the number of alignments to be considered while always producing an optimal alignment.
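The following is a minimal sketch of this approach for global alignment of two sequences, written as score maximization rather than cost minimization. It assumes a fixed penalty per gap and illustrative score values; the function name `needleman_wunsch` and its parameters are hypothetical and are not taken from any particular program.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Score values are illustrative assumptions; gaps cost a fixed
    amount per position.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap
                score[i][j - 1] + gap,      # b[j-1] against a gap
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

The table has (n+1)(m+1) cells and each cell is filled in constant time, so only that many subproblems are examined instead of every possible alignment. When several alignments tie for the best score, this sketch returns just one of them.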
Both the Needleman-Wunsch and Smith-Waterman algorithms are based on the dynamic programming method and have a time efficiency of $O(nm)$, n and m being the lengths of the two sequences. However, the basic dynamic programming algorithm runs in $O(n^k)$ time for multiple alignment, where k is the number of sequences. The Needleman-Wunsch algorithm works by maximizing the number of matches and minimizing the number of gaps needed to align the two sequences. A scoring function must exist so that scores may be assigned to the alignments based on the number of matches and the number of gaps in the alignment. The alignment with the largest score is the optimal alignment. The algorithm is implemented through the use of a scoring matrix in which the horizontal and vertical axes correspond to the two sequences. It compares every element of one sequence to every element of the other sequence and then traces back to find the optimal alignment. Execution of the Needleman-Wunsch algorithm can be seen in Figure 3.

Figure 3. Sample execution of the Needleman-Wunsch algorithm

The Smith-Waterman algorithm acts in a similar manner, but produces a local alignment by finding the region with the highest similarity. The Smith-Waterman algorithm may be obtained from the Needleman-Wunsch algorithm by adjusting the scoring function and changing the method of tracing back so that it finds the highest-scoring matching subsequences.

5.2 Progressive Algorithms

Progressive alignment is the most widely used heuristic method for aligning a large number of sequences and operates in $O(n^2 k^2)$ time [4,15]. Progressive methods, also known as hierarchical or tree methods, produce a multiple sequence alignment by first aligning the most similar sequences and then successively adding less related sequences to the alignment until the entire set of sequences has been aligned. A guide tree is produced that determines the order in which the sequences are added to the alignment. The most related sequences are aligned first [4]. The tree describing sequence relatedness is usually produced through pairwise comparisons that may include heuristic pairwise alignment methods. This technique is used in many multiple alignment programs such as MULTALIGN, ClustalW, and T-Coffee [4]. However, the results of progressive alignments depend heavily on the choice of the most related sequences, which can sometimes be difficult to determine. Also, because the alignment is built up progressively, errors made at any stage are carried through to the final result. These methods generally perform poorly on distantly related sequences.
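The ordering step of a progressive method can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: similarity is the fraction of identical positions in the unaligned sequences (a stand-in for real pairwise alignment scores), and the guide tree is replaced by a greedy ordering. The function names are hypothetical, and this is not how MULTALIGN, ClustalW, or T-Coffee build their guide trees.

```python
from itertools import combinations


def identity(a, b):
    """Crude similarity: fraction of identical positions over the
    shorter sequence. An assumption standing in for a pairwise
    alignment score."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length


def progressive_order(seqs):
    """Return sequence indices in the order a progressive method
    might add them: most similar pair first, then whichever
    remaining sequence is closest to any sequence already chosen."""
    indices = range(len(seqs))
    if len(seqs) < 2:
        return list(indices)
    sim = {}
    for i, j in combinations(indices, 2):
        sim[(i, j)] = sim[(j, i)] = identity(seqs[i], seqs[j])
    # Seed the alignment with the most similar pair.
    first = max(combinations(indices, 2), key=lambda pair: sim[pair])
    order = list(first)
    remaining = set(indices) - set(order)
    while remaining:
        # Pick the remaining sequence most similar to any chosen one.
        nxt = max(sorted(remaining),
                  key=lambda r: max(sim[(r, o)] for o in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

Because the order is fixed before any sequences are merged, a poor similarity estimate at this stage affects every later step, which illustrates why errors made early in a progressive alignment carry through to the final result.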
Most progressive methods modify their scoring function by incorporating a weighting function, which assigns scaling factors to individual sequences based on their distance from their neighbors in the guide tree. This is used to correct the order in which the sequences are added to the alignment.

5.3 Iterative Algorithms

Iterative methods have been developed to help solve the problems surrounding progressive algorithms [4]. Progressive alignments are largely dependent upon the initial alignment, because it is incorporated into the final result. In progressive methods, once a sequence has been aligned, its alignment is not revisited [4]. Iterative methods optimize an objective function based on an alignment scoring function by creating an initial global alignment and then realigning subsets of the sequences. The realigned subsets are then aligned with one another to produce the next iteration's alignment. This approach has been implemented in programs such as MUSCLE and DIALIGN [4].

5.4 Summary

While dynamic programming algorithms produce the most accurate sequence alignments, they are not always practical for multiple alignment, as their running time grows exponentially as more sequences are added to the alignment. Heuristic algorithms, such as progressive and iterative methods, generally sacrifice accuracy for the sake of time. These types of algorithms have been implemented in the most widely used alignment programs today. It is generally believed that iterative methods are more accurate than progressive methods, because they take past alignments into account each time a new sequence is added.

6. PROPOSAL

My proposal aims to identify the specific effects that iteration may have on a progressive alignment algorithm. By default, the ClustalW program performs a progressive alignment only; however, an option has been added that allows for iteration at each step of the progressive alignment.
My proposed work aims to compare the scores of multiple alignments when they are aligned with and without iteration. This work also aims to measure the effect that the number of sequences has on the scores of the iterative alignments versus the non-iterative alignments. This can be done easily by successively adding more sequences to each alignment. This proposal also aims to compare the progressive alignments produced by the MULTALIGN program with both the iterative and non-iterative alignments from the ClustalW tests. These two programs are easily comparable, as they both produce global alignments. In order to compare alignments from different programs, objective criteria are needed to determine the quality of an alignment. The BAliBASE benchmark alignment database would serve as a valuable tool for comparing alignments from the two programs. The database contains 142 reference alignments, which could be used for this project [3].

This work will help to identify more clearly the effectiveness of iterative methods compared to progressive methods. It will also help to identify the most accurate kinds of sequence alignment programs available to biologists today, which will help to ensure that biologists are using the most accurate tools available for sequence alignment. Studying and comparing these algorithms could also help us gain better insight into other optimization problems. These algorithms may also be applicable to other fields within computer science, which makes their refinement even more significant.

While the studies conducted by Julie D. Thompson et al. and Iain M. Wallace et al. both conclude that iterative methods are more accurate than progressive methods, this study intends to take a closer look at the effect of the number of sequences on the overall alignment [4,5]. It is hypothesized that iteration will have an even greater effect on multiple alignments as more sequences are added to the alignment. This proposed work has been influenced by, and hopes to extend, the works of Thompson et al. and Wallace et al. in the field of iterative multiple sequence alignment.

Both the ClustalW and MULTALIGN programs are available to download for free. MULTALIGN runs on a Unix OS, while ClustalW runs on Windows OS and has been provided with a friendly user interface called ClustalX, where it may be specified whether or not to perform iteration for a specific alignment. One of the main reasons for the success of the ClustalW program is its ease of use [6]. The BAliBASE database is also available for download. All of the components needed for this project are readily available and easily accessible from any internet connection.
The work surrounding the project will include becoming acquainted with both the ClustalW and MULTALIGN programs and their underlying algorithms, performing alignment tests with both programs, and comparing the results using the BAliBASE database. Tests will also be run on the ClustalW program with and without iteration on a series of multiple alignments containing different numbers of sequences.

My experiences as both a computer science student and a biology student will be useful for this project. The Analysis of Algorithms class will be particularly useful in analyzing the efficiency of the algorithms, while my biology class will help me to understand the needs of the biologist when analyzing alignment programs. My experiences in both fields will help me to become familiar with the terminology of a field of study that merges the two.

This project is expected to last about two months, and a tentative timetable for the project can be seen in Table 1.

Week   Task
1      Become acquainted with ClustalW
2-4    Run ClustalW tests
5-6    Become acquainted with MULTALIGN
7-8    Run MULTALIGN tests and compare results

Table 1. A tentative schedule for the project

Less time has been set aside to become acquainted with the ClustalW program because of its ease of use. It is believed that the tests described will help us gain a better understanding of the effects of iteration on progressive multiple alignment algorithms.

7. CONCLUSION

Multiple sequence alignment is the backbone of comparative and evolutionary genomics, as it allows a number of sequences to be matched against one another at the same time [13]. Although dynamic programming algorithms are the most accurate known algorithms for sequence alignment, they are inefficient for multiple sequence alignment. Currently, heuristic algorithms are implemented in the most popular sequence alignment programs because of their efficiency.
However, these algorithms sacrifice accuracy for time. Additional research must be carried out to refine these algorithms in order to increase their accuracy while also maintaining their efficiency.

REFERENCES

[1] D.G. Brown, A survey of seeding for sequence alignment, University of Waterloo, Waterloo, Ontario, Canada,
[2] D.J. Lipman, S.F. Altschul, and J.D. Kececioglu, A Tool for Multiple Sequence Alignment, Proc. Natl. Acad. Sci. USA, Vol. 86, pp , June
[3] H. Rangwala and G. Karypis, Incremental window-based protein sequence alignment algorithms, Oxford Journals: Bioinformatics, Vol. 23, pp. e17-e23,
[4] I. M. Wallace, O. Orla, and D. G. Higgins, Evaluation of Iterative Alignment Algorithms for Multiple Alignment, Oxford Journals: Bioinformatics, Vol. 21, pp ,
[5] J.D. Thompson, F. Plewniak, and O. Poch, A comprehensive comparison of multiple sequence alignment programs, Oxford Journals: Nucleic Acids Research, Vol. 27, pp ,
[6] J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, and D. G. Higgins, The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools, Oxford Journals: Nucleic Acids Research, Vol. 25, pp ,
[7] J. Hérisson, G. Payen, and R. Gherbi, A 3D pattern matching algorithm for DNA sequences, Oxford Journals: Bioinformatics, Vol. 23, pp ,
[8] J. M. Sauder, J. W. Arthur, and R. L. Dunbrack, Jr., Large-Scale Comparison of Protein Sequence Alignment Algorithms With Structure Alignments, Proteins: Structure, Function, and Genetics, Vol. 40, pp. 6-22,
[9] L. A. Newberg, Memory efficient dynamic programming backtrace and pairwise local sequence alignment, Oxford Journals: Bioinformatics, Vol. 24, pp ,
[10] L. Delcher, A. Phillippy, J. Carlton and S. L. Salzberg, Fast algorithms for large-scale genome alignment and comparison, Oxford Journals: Nucleic Acids Research, Vol. 30, pp ,
[11] M.S. Waterman, Efficient Sequence Alignment Algorithms, J. theor. Biol., Vol. 108, pp ,
[12] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, and J.D. Thompson, Multiple sequence alignment with the Clustal series of programs, Oxford Journals: Nucleic Acids Research, Vol. 31, pp ,
[13] S. Kumar, A. Filipski, Multiple Sequence Alignment: In pursuit of homologous DNA positions, Cold Spring Harbor Laboratory Press: Genome Research, Vol. 17, pp ,
[14] T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu, Compressed indexing and local alignment of DNA, Oxford Journals: Bioinformatics, Vol. 24, pp ,
[15] Y. Bilu, P. K. Agarwal, R. Kolodny, Faster Algorithms for Optimal Multiple Sequence Alignment Based on Pairwise Comparisons, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 3, pp , 2006.