
A Novel Approach to Multiple Sequence Alignment using Hadoop Data Grids

Dr G Sudha Sadasivam
Professor, CSE Department
PSG College of Technology
sudhasadhasivam@yahoo.com

Mr G Baktavatchalam
Student, ME Software Engineering
PSG College of Technology
gmbakthavatchalam@gmail.com

ABSTRACT
Multiple alignment of protein sequences is an essential tool in molecular biology. It helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are the speed and accuracy of alignment. Dynamic programming algorithms like Needleman-Wunsch and Smith-Waterman produce accurate alignments, but they are computation intensive and are limited to a small number of short sequences. In this paper we propose a time-efficient approach to sequence alignment that also produces quality alignments. The dynamic programming nature of the algorithm, coupled with the data and computational parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Further, due to the scalability of the Hadoop framework, the proposed multiple sequence alignment is also highly suited for large-scale alignment problems.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: parallel and distributed

General Terms
Algorithms, performance

Keywords
DNA sequence, global alignment, data grid, Hadoop, Needleman-Wunsch.

1. INTRODUCTION
An alignment is an arrangement of two or more sequences of nucleotides or amino acids, also termed residues, that maximizes the similarities between the sequences. Pairwise alignment deals with the alignment of two sequences. Multiple Sequence Alignment (MSA) is an extension of pairwise alignment to three or more sequences. MSA plays a pivotal role in the reconstruction of phylogenetic trees. Alignments can be global or local. Global alignment uses the entire sequences to maximize the number of matched residues, whereas the local alignment approach maximizes the alignment of similar subregions.
Global methods therefore perform better than local methods.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MDAC '10, April 26, 2010, Raleigh, NC, USA. Copyright 2010 ACM .../10/04...$10.00.

There are three methods for the alignment of multiple sequences. Dynamic Programming (DP) algorithms such as Needleman-Wunsch [1] and Smith-Waterman [2] produce accurate scores. However, these algorithms demand high computational power. As the number and size of the sequences increase, the computational cost of these algorithms grows steeply, exponentially in the number of sequences.

The second approach to MSA uses approximate heuristic algorithms like the progressive approximation method implemented in ClustalW [3] and T-COFFEE [4]. Progressive MSA aligns the closest sequences first and successively adds in more distant ones. This method is very fast and straightforward, but it can easily get caught in local optima. This is because, once a sequence has been aligned, it cannot be modified again, even if the alignment turns out to be suboptimal when other sequences are subsequently aligned. Hence information from more distantly related sequences cannot be used to correct initial misalignments.

The third method, an iteration-based approach, uses algorithms that produce an alignment and try to improve it over successive iterations. This approach includes hidden Markov models [5] and genetic algorithms [6].

To summarise, DP methods produce accurate results but are computation intensive. Progressive alignment methods are fast and deterministic but tend to get caught in local optima. Iterative methods are slow, and their results are probabilistic due to their stochastic nature.
Several new algorithms and techniques have been developed to improve the efficiency and quality of protein alignments. Methods to improve accuracy in MSA have been studied extensively. Indonesia [7] uses Bayesian Alignment (BAli) for structure-based Multiple Sequence Alignment (MSA) using a priori information. A Hidden Markov Model (HMM) is built using pre-aligned sequences, and this model is used to test whether other sequences belong to the HMM. The Basic Local Alignment Search Tool (BLAST) [8] uses sequence-sequence alignment, while PSI-BLAST [9] uses a profile-sequence method. ClustalW uses profile-profile alignment, in which a profile represents the probability of each amino acid at a position of the MSA. This method gives better accuracy. Align-m [10], which uses a non-progressive local approach, produces higher accuracy for distantly related sequences. Probalign [11] uses pairwise posterior probabilities to align sequences, which gives better accuracy than ProbCons, MAFFT [12] and MUSCLE [13]. eProbalign (Chikkagoudar, 2006) is an online implementation of Probalign. NRAlign [14] incorporates horizontal information in alignments to get better accuracy.

Gaps [15] have to be considered for MSA. Golubchik [16] analysed various MSA programs with respect to gaps and found that both DIALIGN-T and MAFFT had high accuracy in aligning gaps, whereas T-COFFEE and ClustalW were less accurate. Rosenberg [17] concluded that the bias in distance estimation is attributable to the greedy nature of progressive MSA. ISPAlign [18] modifies the pair-HMM approach in ProbCons by incorporating profiles of

intermediate sequences from the database and is found to produce more accurate results when compared to MAFFT and ProbCons.

Aligning several hundred sequences using MSA is a computation-intensive task. In MAFFT [19] the CPU time is drastically reduced by using the Fast Fourier Transform (FFT) to identify homologous regions. MAFFT's progressive heuristic is faster than that of ClustalW, and MAFFT's iterative heuristic is faster than that of T-COFFEE. PartTree [20] constructs the guide tree for MSA using an approximation in O(N log N) time, compared to the O(N^2) time complexity of guide tree construction in conventional MSA. Using Partial Order Alignment with progressive alignment can also improve the MSA process. Grammar-based sequence distance [21] has been used to improve the computational efficiency of the progressive alignment algorithm. Simultaneous Alignment and Tree Construction using Hidden Markov models (SATCHMO) produces a tree and a set of multiple sequence alignments. Progressive alignment considers all columns to be alignable, whereas an HMM identifies the portions that are alignable across sequences. COmparison of Alignments by Constructing HMM (COACH) aligns two MSAs to find their relatedness.

The Smith-Waterman algorithm for local sequence alignment is computationally costly. Field Programmable Gate Array (FPGA) hardware can be used to improve the efficiency of computation [22]. The huge computational power of graphics cards can be used to develop high-performance solutions [23]. Coarse- and fine-level parallelism can be combined using general-purpose processors for sequence homology database searches [24]. The Combinatorial Extension Algorithm in a Massively Parallel Mode (CEPAR) [25] uses parallel processor computer architectures for biological data processing. Parallelize BLAST (pblast) [26] includes query distribution, hash table segmentation, computation parallelization, and database segmentation to increase computational efficiency.
MSA is an NP-hard problem, so randomization approaches can be used to calculate distance matrices in MSA by sampling [27]. Genetic algorithms can find good alignments very efficiently when compared to pairwise dynamic programming (DP). ClustalW uses pairwise alignment, guide-tree generation and progressive alignment. ClustalW-MPI is a distributed and parallel implementation of ClustalW. Reconfigurable architectures using FPGAs can be used for fine-grained parallelization of dynamic programming calculations. The Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST) [28] enhances BLAST with the usability, fault tolerance and scalability of the Windows OS. It is an interactive tool that allows scientists to utilize available computing resources for high-throughput and comprehensive sequence analysis. Kalign2 [29] is an enhancement of Kalign with high computational efficiency and minimized memory requirements. The accuracy of MSA can improve if it is run iteratively. For this, a computationally efficient algorithm is essential, and Kalign2 is highly suited to the purpose. BAliBASE [30] provides test cases for real-world MSA problems.

Progressive and iterative approaches are computationally efficient, whereas the dynamic programming approach improves the quality of alignment. Dynamic programming algorithms guarantee a mathematically optimal alignment, but they are limited to a small number of short sequences since the computing power required for larger alignments is very high. Our proposed method uses a highly efficient algorithm executed on a Hadoop [31] data grid. The dynamic programming nature of the algorithm, coupled with the data and computation parallelism that can be achieved in the Hadoop framework, improves computational efficiency as well as accuracy. As the Hadoop framework is highly scalable, the proposed multiple sequence aligner is highly suited for large-scale alignment problems.

2. SYSTEM DESIGN
Hadoop is a software platform specifically designed to process and handle vast amounts of data. It is based on the principle that moving computation to the place of data is cheaper than moving large data blocks to the place of computation. The Hadoop framework consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming paradigm. HDFS is highly fault-tolerant and is designed to be deployed on low-cost commodity hardware. Hadoop is scalable, economical, efficient and reliable.

Hadoop implements MapReduce on top of HDFS. MapReduce divides applications into many small blocks of work that can be executed in parallel. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a master server that manages the file system namespace and regulates access to files by clients. The DataNodes manage the storage attached to the nodes that they run on. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients.

MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs.
In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a map task. The framework also distributes the map tasks across the cluster of nodes on which it operates (Fig. 1). Each map task consumes key/value (K,V) pairs from its assigned fragment and produces a set of intermediate key/value (K',V') pairs. The framework sorts the intermediate data set by key and produces a set of (K',V'*) tuples, so that all values associated with a given key appear together. In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each such tuple it invokes a user-defined reduce function that transforms the tuple into an output key/value pair (K,V).

The proposed model achieves parallelism by using Hadoop data grids for dynamic programming. In this approach, different permutations of the sequences are generated and stored in the Hadoop DFS. Each permutation executes the map/reduce phases in parallel. Further, by default, each file is split into blocks, and processing can take place in parallel on these blocks.

For example, consider three sequences S1, S2 and S3. The different permutations of these sequences are {S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1}. Each permutation can be carried out in parallel (Figure 2). Consider the permutation S1, S2, S3. S1 and S2 are read from the DFS and aligned to produce the aligned sequences A1S1 and A1S2.
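The map, sort and reduce steps described above can be sketched in memory with a few lines of Python. This is only an illustration of the data flow and not Hadoop code: the helper `run_mapreduce` and the residue-counting job are hypothetical names and a hypothetical task chosen for brevity, and a real job would be written against the Hadoop MapReduce API and run across the cluster.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> sort/group -> reduce on a single machine (illustrative only)."""
    # Map phase: every (K, V) input record yields intermediate (K', V') pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Sort/group phase: collect all values that share an intermediate key.
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)

    # Reduce phase: each (K', [V', ...]) tuple becomes one output pair.
    return [reduce_fn(k, groups[k]) for k in sorted(groups)]

def count_residues_map(fragment_id, fragment):
    # Hypothetical map function: emit (residue, 1) for each residue in a fragment.
    return [(residue, 1) for residue in fragment]

def count_residues_reduce(residue, counts):
    # Hypothetical reduce function: sum the counts for one residue.
    return (residue, sum(counts))

# Input fragments keyed by an arbitrary fragment id.
fragments = [(0, "AGTA"), (1, "ATA"), (2, "GAT")]
result = run_mapreduce(fragments, count_residues_map, count_residues_reduce)
print(result)  # [('A', 5), ('G', 2), ('T', 3)]
```

In Hadoop the fragments would be HDFS blocks and the map calls would run on the nodes holding those blocks; the grouping step corresponds to the framework's sort by key.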

These aligned sequences are then stored in the DFS. Then A1S1 is aligned with S3 and A1S2 is aligned with S3 to produce the final aligned sequences A2S1, A2S2 and A1S3. These two alignments have no dependencies and hence are carried out in parallel (Figure 2).

Figure 1: MapReduce Programming

Figure 2: Illustration of a single combination
[Figure 2 shows S1 and S2 passing through a Map/Reduce aligner to give A1S1 and A1S2; each of these is then aligned with S3 by a separate Map/Reduce aligner, giving A2S1, A2S2 and A1S3.]

3. PROPOSED METHODOLOGY
The proposed methodology for MSA uses dynamic programming. Parallelism is achieved at three levels (Fig. 3). At the first level, all possible combinations for alignment are generated, and the execution of these combinations is parallelised using the Hadoop data grid. At the second level, alignments between pairs of sequences that have no dependencies are parallelised. At the third level, the alignment of a single pair of sequences is parallelised, as the map phase can be carried out in parallel on different blocks of the sequences. As the map-reduce programming model achieves parallelism at three levels, performance efficiency is highly improved. As Needleman-Wunsch, a dynamic programming method, is used for pairwise alignment, the alignment scores generated are also accurate.

3.1 Needleman-Wunsch algorithm
Needleman-Wunsch [1] (Needleman and Wunsch, 1970) is a global alignment method that uses dynamic programming. Dynamic programming (DP) is a recursive procedure that splits a problem into a set of interdependent sub-problems, in which the next intermediate solution is a function of a prior sub-problem and depends only on its immediate neighbours. For a pairwise alignment problem, DP starts at the end of the sequences and attempts to match all possible pairs of residues according to a scoring scheme for matches, mismatches and gaps, generating a matrix of score values for all possible alignments between the two sequences. The highest score identifies an optimal alignment.
The score matrix (F) has dimensions n x m, where n and m are the lengths of the two sequences. To reach a given position F(i, j) in the matrix from a previous move, there are three possible paths:

Case 1: a diagonal move with no gap penalty from F(i-1, j-1);
Case 2: a move from F(i-1, j) to (i, j) with a gap penalty row-wise;
Case 3: a move from F(i, j-1) to (i, j) with a gap penalty along columns.

The matrix is built recursively according to Equation 1. The actual alignment is obtained from the trace-back matrix (T), which stores the moves made through the matrix; backtracking these moves recovers the path that gives the highest score.

$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) & \text{case 1} \\
F(i-1,j) - d & \text{case 2} \\
F(i,j-1) - d & \text{case 3}
\end{cases}
\qquad \text{(Equation 1)}
$$

In Equation 1, the function value F(i, j) is the highest score for position i in sequence x and position j in sequence y; d is the cost of the gap penalty (assumed to be 1), and s is the score for a match/mismatch between residues x_i and y_j (assumed to be 1). The score matrix F is initialised as follows:

$$
F(0,0) = 0, \qquad F(0,i) = -i \cdot d, \qquad F(j,0) = -j \cdot d
$$

More than one path can lead to the same value, and a choice has to be made on the best path to follow. The path to be followed is represented by the trace matrix given below:

$$
T(i,j) =
\begin{cases}
T(i-1,j-1) \ \text{(diagonal movement)} & \text{if case 1} \\
T(i-1,j) \ \text{(upwards movement)} & \text{if case 2} \\
T(i,j-1) \ \text{(left movement)} & \text{if case 3}
\end{cases}
$$

For example, consider two sequences S1 = AGTA and S2 = ATA. The score matrix (F) and trace matrix (T) are built with the residues A G T A of S1 along the columns and A T A of S2 along the rows.
[Score matrix F and trace matrix T for S1 = AGTA and S2 = ATA; the cell values are not reproduced here.]
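The recurrence, initialisation and trace-back described above can be sketched in Python. This is a minimal single-machine sketch, not the authors' Hadoop implementation: the function name `needleman_wunsch`, the +1/-1 match/mismatch scores and the unit gap cost are assumptions made for illustration, so the score it reports follows from those assumptions rather than from the matrices of the original example. Rows here index the first argument and columns the second.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=1):
    """Global alignment of x and y; returns (aligned_x, aligned_y, score).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(x), len(y)
    # F has (n+1) x (m+1) cells; row 0 and column 0 hold the gap-only prefixes.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * gap
    for j in range(1, m + 1):
        F[0][j] = -j * gap

    # Fill the matrix with the three-case recurrence (diagonal, up, left).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # case 1: align x_i with y_j
                          F[i - 1][j] - gap,     # case 2: gap in y
                          F[i][j - 1] - gap)     # case 3: gap in x

    # Trace back from the bottom-right cell, preferring the diagonal move on ties.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), F[n][m]

aligned_s1, aligned_s2, score = needleman_wunsch("AGTA", "ATA")
print(aligned_s1)
print(aligned_s2)
```

With these assumed scores, AGTA and ATA align as AGTA over A-TA, the same gapped alignment that the text reports for S1 and S2.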

Hence the trace moves as F(3,4) -> F(2,3) -> F(1,2) -> F(1,1). The score is 4 along the path, and the aligned sequences are AGTA and A-TA.

The standard Needleman-Wunsch algorithm for evaluating the similarity between two sequences of length N and M has time and space complexity O(N * M) for a pairwise alignment. The time complexity for n sequences of average length N is O(N^n). The structure of the algorithm makes it suitable for a parallel implementation on a systolic array. In particular, hardware parallelism can be exploited to perform string comparison in O(N+M) steps for a pair of sequences. For MSA, the time complexity of a parallel implementation is O(n * 2N).

A parallel implementation of the Needleman-Wunsch algorithm can be done using a BioWall [32], a mosaic of several thousand transparent electronic modules. The BioWall cannot compete in performance with existing parallel implementations of the Needleman-Wunsch algorithm, since it suffers from the typical performance limitations of a large prototyping platform. A dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform, based on a fine-grain event-driven multithreaded program execution model [33], has been suggested. Fine-grain multithreading permits efficient exploitation of parallelism in this application, both by taking advantage of asynchronous point-to-point synchronization and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. A parallel speculative computational method reduces the computation time of the iterative methods from days to minutes [33]. Ophir et al. [34] propose two parallel computational methods for analyzing biological sequences.
The first method retrieves sequences that are homologous to a query sequence, which provides important clues to the structure and function of the query sequence. The second method aligns a number of homologous sequences with each other, which helps in the prediction of the function, structure, and evolutionary history of biological sequences. A fast computation solution using a parallel version of Needleman-Wunsch and the Alchemi Grid processing engine has been proposed [35]. This is a computational grid approach, wherein the computations are parallelised. The main problem is that the number of CPUs needed to achieve true parallelism increases exponentially with the size of the sequences.

3.2 MSA using Hadoop: an illustration
For example, consider three sequences S1 = AGTA, S2 = ATA and S3 = GAT. The different possible combinations of the sequences are {S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1}. All these combinations can be executed in parallel in the HDFS.

Figure 3: Illustration of alignment of 3 sequences using Hadoop
[Figure 3 content, as recoverable from the original layout:
Combination of inputs: S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
First phase: 6 Map/Reduce alignments in parallel
Outputs of the 1st Map/Reduce phase: A1S1,A1S2; A1S1,A1S3; A1S2,A1S1; A1S2,A1S3; A1S3,A1S1; A1S3,A1S2
Second phase: 12 Map/Reduce alignments in parallel
Outputs of the 2nd Map/Reduce phase: A2S1,A2S2,A1S3; A2S1,A2S3,A1S3; A2S2,A2S1,A1S3; A2S2,A2S3,A1S3; A2S3,A2S1,A1S3; A2S3,A2S2,A1S3]
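The combinations and the two phases summarised in Figure 3 can be enumerated with a short Python sketch. The helper `alignment_schedule` is a hypothetical name introduced here; it only lists which pairwise alignments each ordering requires and in which phase, on the assumption (taken from the description of a single combination) that each newly added sequence is aligned against every sequence already aligned. It does not run any alignment or any Hadoop job.

```python
from itertools import permutations

def alignment_schedule(order):
    """List the pairwise alignments for one ordering, grouped by phase.

    Alignments inside one phase have no mutual dependencies, so they
    could run in parallel (illustrative helper, not from the paper).
    """
    first, second, *rest = order
    phases = [[(first, second)]]      # phase 1: align the first two sequences
    aligned = [first, second]
    for new in rest:
        # Later phases: align each previously aligned sequence with the new one.
        phases.append([(prev, new) for prev in aligned])
        aligned.append(new)
    return phases

names = ["S1", "S2", "S3"]
orders = list(permutations(names))

for order in orders:
    print(order, alignment_schedule(order))

# Totals across all orderings, per phase.
first_phase_total = sum(len(alignment_schedule(o)[0]) for o in orders)
second_phase_total = sum(len(alignment_schedule(o)[1]) for o in orders)
print(first_phase_total, second_phase_total)
```

For three sequences this gives six orderings, six alignments in the first phase and twelve in the second, which agrees with the counts given in Figure 3.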

Consider the combination S1,S2,S3. According to this combination, S1 and S2 are aligned first. Then S3 is aligned to the aligned S1 and S2 sequences, as shown below.

a) Alignment of sequences S1 and S2: The sequences S1 and S2 are read from the HDFS and aligned.
[Score and trace matrices for S1 = AGTA (columns) and S2 = ATA (rows); the cell values are not reproduced here.]
Hence the trace moves as F(3,4) -> F(2,3) -> F(1,2) -> F(1,1). The score is 4 along the path, and the aligned sequences A1S1 and A1S2 are AGTA and A-TA. The aligned sequences A1S1 and A1S2 are stored in the HDFS.

b) Alignment of sequences A1S1 and S3: The aligned sequence A1S1 (AGTA) and the third sequence S3 (GAT) are read from the HDFS and aligned.
[Score and trace matrices for A1S1 = AGTA (columns) and S3 = GAT (rows); the cell values are not reproduced here.]
Hence the trace moves as F(3,4) -> F(3,3) -> F(2,2) -> F(1,2). The score is -2 along the path, and the aligned sequences are AG-TA and -GAT-. The aligned sequences A2S1 and A1S3 are stored in the HDFS.

c) Alignment of sequences A1S2 and S3: The aligned sequence A1S2 (A-TA) and the third sequence S3 (GAT) are read from the HDFS and aligned. Hence the trace moves as F(3,4) -> F(3,3) -> F(2,2) -> F(1,2) -> F(1,1). The score is -5 along the path, and the aligned sequences are A--TA and -GAT-. The aligned sequences A2S2 and A1S3 are stored in the HDFS.

Finally, the aligned sequences are AG-TA, A--TA and -GAT-.

Analysis
For a three-sequence problem, six different combinations of sequences are possible. All these combinations are aligned in parallel using the Hadoop data grid. Within each combination, pairwise alignment is carried out. S1 and S2 are read from the DFS and aligned to produce the aligned sequences A1S1 and A1S2. These aligned sequences are then stored in the DFS. Then A1S1 is aligned with S3 and A1S2 is aligned with S3 to produce the final aligned sequences A2S1, A2S2 and A1S3. These two alignments have no dependencies and hence are carried out in parallel.

Let n be the number of sequences to be aligned. The number of permutations nPn is n!.
Each permutation is performed in parallel, so the overall time complexity is the complexity of a single combination. Within each permutation, there are (n-1)+(n-2)+...+1 pairwise alignments, which can be performed in parallel using maps. The total number of maps for each permutation is therefore M_p = 1 + 2 + ... + (n-1) = n(n-1)/2, of which (n-1) maps are done serially in each combination. The order of complexity of each map (between a pair of sequences of length N and M) is O(N * M). If the alignment is done in parallel in b blocks, then the time complexity is O((N * M)/b). So the time complexity for each permutation is O((n-1) * (M * N)/b). As the permutations are executed in parallel, the total complexity of MSA is also O((n-1) * (M * N)/b). Splitting a sequence into blocks in Hadoop, so that the score can be computed in parallel on all these blocks, provides fine-grained parallelism. Thus the proposed approach provides both data and compute parallelism.

4. EXPERIMENTAL RESULTS
Hadoop was installed in fully distributed mode on PIV Dual Core machines with 2 GB RAM, and the readings were taken. The time efficiency of the proposed methodology was measured as follows:

a) Varying number of sequences of the same size: The time is found to increase linearly, as the alignment depends on the number of sequences. It is also found that as the number of nodes increases, the number of alignments that can be executed in parallel also increases. Hence the time for alignment decreases.

Figure 4: Time comparison for different number of sequences each of size 842 KB
[Plot of time in seconds against number of sequences, for 2 nodes and 3 nodes.]

b) Same number of sequences of varying size: The time is found to increase with sequence size, as the pairwise alignment depends on the size of the sequences. It is also found that as the number of nodes increases, the number of alignments that can be executed in parallel also increases. Hence the time for alignment decreases.

Figure 5: Time comparison for different size of sequences for 4 sequences
[Plot of time in seconds against size of sequence in KB, for 2 nodes and 3 nodes.]

c) Different block sizes: It is found that the time decreases as the block size increases. The sequence is split into blocks in Hadoop, and sequence alignment can be done in parallel on these blocks. This is because, as the number of blocks increases, pairwise alignment can be done in parallel on these blocks. It is also found that as the number of nodes increases, the number of alignments that can be executed in parallel also increases. Hence the time for alignment decreases.

Figure 6: Time comparison for different block size for 4 sequences of 424 KB
[Plot of time in seconds against block size in KB, for 2 nodes and 3 nodes.]

5. CONCLUSION
The proposed method of MSA improves the computation time while maintaining accuracy. Hadoop data grids are used to bring about data and compute parallelism. This approach parallelises sequence alignment at three levels. The accuracy of alignment is maintained because a dynamic programming method is used. When MSA is performed on n sequences of average length N, the time complexity of dynamic programming is O(N^n). In the proposed approach the time complexity is O((n-1) * N^2/b), where b is the number of blocks processed in parallel. Further, due to the scalability of the Hadoop framework, the proposed MSA is highly suited for large-scale alignment problems.
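As a rough worked instance of the two bounds quoted above, take the hypothetical values n = 4 sequences, average length N = 1000 and b = 4. These numbers are illustrative assumptions only and are not measurements from the experiments:

$$
N^{n} = 1000^{4} = 10^{12},
\qquad
(n-1)\,\frac{N^{2}}{b} = 3 \cdot \frac{10^{6}}{4} = 7.5 \times 10^{5}
$$

Under these assumptions the bound for the proposed approach is smaller by roughly six orders of magnitude, ignoring constant factors and the resources needed to run the n! permutations concurrently.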
ACKNOWLEDGEMENT
The authors thank Mr K V Chidambaran, Director, Yahoo Software Development India Pvt Ltd, for his support in setting up the grid and cloud computing lab at PSG College of Technology. This research is a consequence of the PSG-Yahoo research collaboration on Grid and Cloud Computing.

6. REFERENCES:
[1] Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol.1, pp.
[2] Smith, T.F. and Waterman, M.S. (1983) Identification of Common Molecular Subsequences, Journal of Molecular Biology, Vol.1, pp.
[3] Li, K.B. (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing, Bioinformatics, Vol.19, No.12, pp.
[4] Notredame, C., Higgins, D.G. and Heringa, J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology, Vol.32, No.1, pp.
[5] Holmes, I. and Bruno, W.J. (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment, Bioinformatics, Vol.17, pp.
[6] Zhang, C. and Wong, A.K. (1997) A genetic algorithm for multiple molecular sequence alignment, Bioinformatics, Vol.13, pp.
[7] Madsen, P.J. and Kleywegt, G.J. (2002) Indonesia: An integrated sequence analysis system, Manual.
[8] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic Local Alignment Search Tool, Journal of Molecular Biology, Vol.215, pp.
[9] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, Vol.25, No.17, pp.
[10] Van Walle, I., Lasters, I. and Wyns, L. (2004) Align-m - a new algorithm for multiple alignment of highly divergent sequences, Bioinformatics, DOI: 10.1093/bioinformatics/bth116, 2, pp.

[11] Roshan, U. and Livesay, D.R. (2006) Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, Vol.22, pp.
[12] Katoh, K., Misawa, K., Kuma, K. and Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research, Vol.3, No.14, pp.
[13] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, Vol.32, No.5, pp.
[14] Chikkagoudar, S., Roshan, U. and Livesay, D. (2006) eProbalign: generation and manipulation of multiple sequence alignments using partition function posterior probabilities, Bioinformatics, Vol.22, pp.
[15] Lu, Y. and Sze, S.H. (2009) Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues, Nucleic Acids Research, Vol.37, No.2, pp.
[16] Golubchik, T., Wise, M.J., Easteal, S. and Jermiin, L.S. (2007) Mind the gaps: evidence of bias in estimates of multiple sequence alignments, Mol Biol Evol, Vol.35, No.13, pp.
[17] Rosenberg, M.S. (2005) Multiple sequence alignment accuracy and evolutionary distance estimation, BMC Bioinformatics, Vol.6, pp.
[18] Lu, Y. and Sze, S.H. (2008) Multiple sequence alignment based on profile alignment of intermediate sequences, Journal of Computational Biology, Vol.15, No.7, pp.
[19] Katoh, K. and Toh, H. (2007) PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences, Bioinformatics, Vol.23, No.3, pp.
[20] Grasso, C. and Lee, C. (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems, Bioinformatics, Vol.2, No.1, pp.
[21] Russell, D.J., Otu, H.H. and Sayood, K. (2008) Grammar-based distance in progressive multiple sequence alignment, BMC Bioinformatics, Vol.1, pp.
[22] Li, I.T., Shum, W. and Truong, K. (2007) 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA), BMC Bioinformatics, Vol.8, pp.
[23] Manavski, S.A. and Valle, G. (2008) CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics, Vol.9, No.2, pp.
[24] Meng, X. and Chaudhary, V. (2006) Optimised fine and coarse parallelism for sequence homology search, International Journal of Bioinformatics Research and Applications, Vol.2, No.4, pp.
[25] Pedretti, K.T., Casavant, T.L., Braun, R.C., Scheetz, T.E., Birkett, C.L. and Roberts, C.A. (1999) Three Complementary Approaches to Parallelization of Local BLAST Service on Workstation Clusters, Springer, Berlin, pp.
[26] Rajasekaran, S., Thapar, V., Dave, H. and Huang, C.H. (2005) Randomized and parallel algorithms for distance matrix calculations in multiple sequence alignment, Journal of Clinical Monitoring and Computing, Vol.19, No.4-5, pp.
[27] Oliver, T., Schmidt, B., Jakop, Y. and Maskell, D. (2008) High Speed Biological Sequence Analysis with Hidden Markov Models on Reconfigurable Platforms, IEEE Trans Information Technology Biomed., Vol.22, pp.
[28] Dowd, S.E., Zaragoza, J., Rodriguez, J.R., Oliver, M.J. and Payton, P.R. (2005) Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), BMC Bioinformatics, Vol.6, pp.
[29] Lassmann, T., Frings, O. and Sonnhammer, E.L.L. (2009) Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features, Nucleic Acids Research, Vol.37, No.3, pp.
[30] Thompson, J.D., Koehl, P., Ripp, R. and Poch, O. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark, PubMed, Vol.61, No.1, pp.
[31] Apache, Hadoop Documentation (22), Available at
[32] Gianluca, T. and Christof Teuscher (2003) Biology goes digital-biowall, BioWall, XCell Journal, equencee.html
[33] Tieng, K.Y., Ophir, F. and Robert, L.M. (1999) Parallel computation in biological sequence analysis, IEEE Transactions on Parallel and Distributed Systems, Vol.9, No.3, pp.
[34] Ophir, F., Robert, L.M., Tieng, K.Y. and Peter, J.M. (1995) Parallel multiple sequence alignment, Proceedings of Intl. Conf. on Parallel Processing.
[35] Tahir, N., Imitaz, S.S. and Shaftab, A. (2005) Parallel Needleman-Wunsch Algorithm for Grid, Proceedings of the PAK-US International Symposium on High Capacity Optical Networks and Enabling Technologies (HONET 2005), Islamabad, Pakistan, Dec 19-21, 2005.