

A Novel Approach to Multiple Sequence Alignment using Hadoop Data Grids

Dr G Sudha Sadasivam
Professor, CSE Department
PSG College of Technology
sudhasadhasivam@yahoo.com

Mr G Baktavatchalam
Student, ME Software Engineering
PSG College of Technology
gmbakthavatchalam@gmail.com

ABSTRACT
Multiple alignment of protein sequences is an essential tool in molecular biology. It helps to determine evolutionary linkage and to predict molecular structures. The two factors to be considered while aligning multiple sequences are the speed and the accuracy of the alignment. Dynamic programming algorithms like Needleman-Wunsch and Smith-Waterman produce accurate alignments, but they are computation intensive and are limited to a small number of short sequences. In this paper we propose a time-efficient approach to sequence alignment that still produces quality alignments. The dynamic programming basis of the algorithm, coupled with the data and computational parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Further, due to the scalability of the Hadoop framework, the proposed multiple sequence alignment is also well suited to large-scale alignment problems.

Categories and Subject Descriptors
D.1.3 [Concurrent Programming]: parallel and distributed

General Terms
Algorithms, performance

Keywords
DNA sequence, global alignment, data grid, Hadoop, Needleman-Wunsch.

1. INTRODUCTION
An alignment is the arrangement of two or more sequences of nucleotides or amino acids, also termed residues. Alignment maximizes the similarities between the sequences. Pairwise alignment deals with the alignment of two sequences. Multiple Sequence Alignment (MSA) is an extension of pairwise alignment to three or more sequences. MSA plays a pivotal role in the reconstruction of phylogenetic trees. Alignments can be global or local. Global alignment uses the entire sequences to maximize the number of matched residues, whereas local alignment maximizes the alignment of similar subregions.
So, global methods perform better than local methods when the sequences are related along their entire length.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MDAC '10, April 26, 2010, Raleigh, NC, USA. Copyright 2010 ACM.

There are three methods for the alignment of multiple sequences. Dynamic Programming (DP) algorithms such as Needleman-Wunsch [1] and Smith-Waterman [2] produce accurate scores. However, these algorithms demand high computational power: as the number and length of the sequences increase, their computational cost grows exponentially.

The second approach to MSA uses approximate heuristic algorithms such as the progressive approximation method implemented in ClustalW [3] and T-COFFEE [4]. Progressive MSA aligns the closest sequences first and successively adds more distant ones. This method is very fast and straightforward, but it can easily get caught in local optima. This is because, once a sequence has been aligned, it cannot be modified again, even if the result turns out to be suboptimal when other sequences are subsequently aligned. Hence information from more distantly related sequences cannot be used to correct initial misalignments.

The third method, an iteration-based approach, uses algorithms that produce an alignment and try to improve it over successive iterations. This approach includes hidden Markov models [5] and genetic algorithms [6].

In summary, DP methods produce accurate results but are computation intensive. Progressive alignment methods are fast and deterministic but tend to get caught in local optima. Iterative methods are slow, and their results are probabilistic due to their stochastic nature.
Several new algorithms and techniques have been developed to improve the efficiency and quality of protein alignments. Methods to improve the accuracy of MSA have been studied. Indonesia [7] uses Bayesian Alignment (BAli) for structure-based MSA using a priori information. A Hidden Markov Model (HMM) is built from prealigned sequences, and this model is used to test whether other sequences belong to the HMM. The Basic Local Alignment Search Tool (BLAST) [8] uses sequence-sequence alignment, while PSI-BLAST [9] uses a profile-sequence method. ClustalW uses profile-profile alignment, in which a profile represents the probability of each amino acid at a position of the MSA; this method gives better accuracy. Align-m [10], which uses a non-progressive local approach, produces higher accuracy for distantly related sequences. Probalign [11] uses pairwise posterior probabilities to align sequences, which gives better accuracy than ProbCons, MAFFT [12] and MUSCLE [13]. eProbalign (Chikkagoudar, 2006) is an online implementation of Probalign. NRAlign [14] incorporates horizontal information in alignments to obtain better accuracy. Gaps [15] have to be considered for MSA. Golubchik [16] analysed various MSA programs with respect to gaps and found that both DIALIGN-T and MAFFT had high accuracy in aligning gaps, whereas T-COFFEE and ClustalW were less accurate. Rosenberg [17] concluded that the bias in distance estimation can be attributed to the greedy nature of progressive MSA. ISPAlign [18] modifies the pair-HMM approach of ProbCons by incorporating profiles of intermediate sequences from the database and is found to produce more accurate results than MAFFT and ProbCons.

Aligning several hundred sequences using MSA is a computation-intensive task. In MAFFT [19] the CPU time is drastically reduced by using the Fast Fourier Transform (FFT) to identify homologous regions. MAFFT's progressive heuristic is faster than that of ClustalW, and MAFFT's iterative heuristic is faster than that of T-COFFEE. PartTree [20] constructs the guide tree for MSA using an approximation in O(N log N) time, compared with O(N^2) for guide tree construction in conventional MSA. Using Partial Order Alignment with progressive alignment can also improve the MSA process. Grammar-based sequence distance [21] has been used to improve the computational efficiency of the progressive alignment algorithm. Simultaneous Alignment and Tree Construction using Hidden Markov Models (SATCHMO) produces a tree and a set of multiple sequence alignments. Progressive alignment considers all columns to be alignable, whereas the HMM identifies the portions that are alignable across sequences. COmparison of Alignments by Constructing HMMs (COACH) aligns two MSAs to find their relatedness.

The Smith-Waterman algorithm for local sequence alignment is computationally costly. Field Programmable Gate Array (FPGA) hardware can be used to improve the efficiency of the computation [22]. The huge computational power of graphics cards can be used to develop high-performance solutions [23]. Coarse and fine levels of parallelism can be combined on general-purpose processors for sequence homology database searches [24]. The Combinatorial Extension Algorithm in a Massively Parallel Mode (CEPAR) [25] uses parallel processor architectures for biological data processing. Parallelized BLAST (pBLAST) [26] includes query distribution, hash table segmentation, computation parallelization and database segmentation to increase computational efficiency.
MSA is an NP-hard problem, so randomization approaches can be used to calculate distance matrices in MSA by sampling [27]. Genetic algorithms can find good alignments very efficiently compared with pairwise dynamic programming (DP). ClustalW uses pairwise alignment, guide-tree generation and progressive alignment. ClustalW-MPI is a distributed and parallel implementation of ClustalW. Reconfigurable architectures using FPGAs can be used for fine-grained parallelization of dynamic programming calculations. The Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST) [28] enhances BLAST with the usability, fault tolerance and scalability of the Windows OS. It is an interactive tool that allows scientists to utilize available computing resources for high-throughput and comprehensive sequence analysis. Kalign2 [29] is an enhancement of Kalign with high computational efficiency and minimized memory requirements. The accuracy of MSA can improve if it is run iteratively; this requires a computationally efficient algorithm, and Kalign2 is well suited to the purpose. BAliBASE [30] provides test cases for real-world MSA problems.

Progressive and iterative approaches are computationally efficient, whereas the dynamic programming approach improves the quality of the alignment. Dynamic programming algorithms guarantee a mathematically optimal alignment, but they are limited to a small number of short sequences because the computing power required for larger alignments is very high. Our proposed method uses a highly efficient algorithm executed on a Hadoop [31] data grid. The dynamic programming basis of the algorithm, coupled with the data and computation parallelism that can be achieved in the Hadoop framework, improves the computational efficiency as well as the accuracy. As the Hadoop framework is highly scalable, the proposed multiple sequence aligner is well suited to large-scale alignment problems.

2. SYSTEM DESIGN
Hadoop is a software platform specifically designed to process and handle vast amounts of data. It is based on the principle that moving computation to the place of the data is cheaper than moving large data blocks to the place of computation. The Hadoop framework consists of the Hadoop Distributed File System (HDFS) and the MapReduce programming paradigm. HDFS is highly fault-tolerant and is designed to be deployed on low-cost commodity hardware. Hadoop is scalable, economical, efficient and reliable.

Hadoop implements MapReduce using HDFS. MapReduce divides applications into many small blocks of work that can be executed in parallel. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a master server that manages the file system namespace and regulates access to files by clients. The DataNodes manage the storage attached to the nodes that they run on. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients.

MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster. A MapReduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs.
In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a map task. The framework also distributes the map tasks across the cluster of nodes on which it operates (Fig. 1). Each map task consumes key/value (K,V) pairs from its assigned fragment and produces a set of intermediate key/value (K',V') pairs. The framework sorts the intermediate data set by key and produces a set of (K',V'*) tuples. In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each such tuple it invokes a user-defined reduce function that transforms the tuple into an output key/value pair (K,V).

The proposed model achieves parallelism by using Hadoop data grids for dynamic programming. In this approach, different permutations of the sequences are generated and stored in the Hadoop DFS. Each permutation executes its map/reduce phases in parallel with the others. Further, by default, each file is split into blocks, and processing can take place in parallel on these blocks.

For example, consider three sequences S1, S2 and S3. The different permutations of these sequences are {S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1}. Each permutation can be carried out in parallel (Figure 2). Consider the permutation S1, S2, S3. S1 and S2 are read from the DFS and aligned to produce the aligned sequences A1S1 and A1S2. These aligned sequences are then stored in the DFS. Then A1S1 is aligned with S3 and A1S2 is aligned with S3 to produce the final aligned sequences A2S1, A2S2 and A1S3. These two alignments have no dependencies, and hence they are carried out in parallel (Figure 2).

[Figure 1: MapReduce Programming]

[Figure 2: Illustration of a single combination. S1 and S2 pass through a Map/Reduce aligner to give A1S1 and A1S2; two further Map/Reduce aligners then combine these with S3 to give A2S1, A2S2 and A1S3.]

3. PROPOSED METHODOLOGY
The proposed methodology for MSA uses dynamic programming. Parallelism is achieved at three levels (Fig. 3). At the first level, all possible combinations for alignment are generated, and the execution of all these combinations is parallelised using the Hadoop data grid. At the second level, alignments between pairs of sequences that have no dependencies are parallelised. At the third level, the alignment of a single pair of sequences is itself parallelised, as the map phase can be carried out in parallel on different blocks of the sequences. Because the map-reduce programming model achieves parallelism at three levels, performance improves considerably. Because Needleman-Wunsch, a dynamic programming method, is used for pairwise alignment, the alignment scores generated are also accurate.

3.1 Needleman-Wunsch algorithm
Needleman-Wunsch [1] is a global alignment method that uses dynamic programming. Dynamic programming (DP) is a recursive procedure that splits a problem into a set of interdependent sub-problems in which the next intermediate solution is a function of a prior sub-problem and depends only on its immediate neighbors. For a pairwise alignment problem, DP starts at the end of the sequences and attempts to match all possible pairs of residues according to a scoring scheme for matches, mismatches and gaps, generating a matrix of score values for all possible alignments between the two sequences. The highest score identifies an optimal alignment.
The score matrix F has dimensions n × m, where n and m are the lengths of the two sequences. There are three possible moves that reach a given position F(i, j) in the matrix:

Case 1: a diagonal move from F(i-1, j-1), with no gap penalty;
Case 2: a move from F(i-1, j) to F(i, j), with a gap penalty along the rows;
Case 3: a move from F(i, j-1) to F(i, j), with a gap penalty along the columns.

The matrix is built recursively according to Equation 1. The actual alignment is obtained from the trace-back matrix T, which stores the moves made through the matrix; backtracking through these moves recovers the path that gives the highest score.

\[
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{case 1} \\
F(i-1,\,j) - d & \text{case 2} \\
F(i,\,j-1) - d & \text{case 3}
\end{cases}
\qquad \text{(Equation 1)}
\]

In Equation 1, F(i, j) is the highest score for position i in sequence x and position j in sequence y; d is the gap penalty (assumed to be 1), and s(x_i, y_j) is the score for a match or mismatch between residues x_i and y_j (assumed to be 1). The score matrix F is initialised as follows:

\[
F(0,0) = 0, \qquad F(0,i) = -\,i \cdot d, \qquad F(j,0) = -\,j \cdot d
\]

More than one path can lead to the same value, so a choice has to be made on the best path to follow. The path to be followed is recorded in the trace matrix:

\[
T(i,j) =
\begin{cases}
T(i-1,\,j-1) \ \text{(diagonal movement)} & \text{if case 1} \\
T(i-1,\,j) \ \text{(upwards movement)} & \text{if case 2} \\
T(i,\,j-1) \ \text{(left movement)} & \text{if case 3}
\end{cases}
\]

For example, consider two sequences S1 = AGTA and S2 = ATA.

[Score matrix F and trace matrix T for S1 = AGTA (columns A, G, T, A) and S2 = ATA (rows A, T, A); the cell values are not recoverable from the source.]
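The recurrence, initialisation and trace-back described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Hadoop implementation: the function name `needleman_wunsch`, the mismatch score of -1 and the tie-breaking order (diagonal, then up, then left) are assumptions made for the example, while the match score of 1 and gap penalty d = 1 follow the values stated in the text.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y).

    Assumed scoring: +match for equal residues, +mismatch otherwise,
    and a linear gap penalty d subtracted for every gap.
    """
    n, m = len(x), len(y)
    # F holds scores, T holds the move that produced each score.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    T = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: gaps in y
        F[i][0] = -i * d
        T[i][0] = "up"
    for j in range(1, m + 1):          # first row: gaps in x
        F[0][j] = -j * d
        T[0][j] = "left"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "diag"),  # case 1
                (F[i - 1][j] - d, "up"),        # case 2
                (F[i][j - 1] - d, "left"),      # case 3
            ]
            # max() keeps the first maximum, so ties prefer the diagonal.
            F[i][j], T[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = T[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, a1, a2 = needleman_wunsch("AGTA", "ATA")
print(score, a1, a2)
```

For AGTA and ATA this sketch yields the alignment AGTA / A-TA. The numeric score depends on the assumed scoring values (here it comes out as 2), so a different scheme would change the number without changing the procedure.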

For S1 = AGTA and S2 = ATA, the trace moves as F(3,4) → F(2,3) → F(1,2) → F(1,1). The score is 4 along the path, and the aligned sequences are AGTA and A-TA.

The standard Needleman-Wunsch algorithm for evaluating the similarity between two sequences of lengths N and M takes O(N * M) steps for a pairwise alignment. Both the time complexity and the space complexity of the algorithm are O(N * M) for a pair of sequences. For n sequences of average length N, the time complexity is O(N^n). The structure of the algorithm makes it suitable for a parallel implementation on a systolic array. In particular, hardware parallelism can be exploited to perform string comparison in O(N+M) steps for a pair of sequences. For MSA, the time complexity of the parallel implementation is O(n * 2N).

A parallel implementation of the Needleman-Wunsch algorithm can be built on a BioWall [32], a mosaic of several thousand transparent electronic modules. The BioWall cannot compete in performance with existing parallel implementations of the Needleman-Wunsch algorithm, since it suffers from the typical performance limitations of a large prototyping platform. A dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform, based on a fine-grain event-driven multithreaded program execution model [33], has been suggested. Fine-grain multithreading permits efficient exploitation of parallelism in this application, both by taking advantage of asynchronous point-to-point synchronization and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. A parallel speculative computational method reduces the computation time of the iterative methods from days to minutes [33].

Ophir et al. [34] propose two parallel computational methods for analyzing biological sequences. The first method retrieves sequences that are homologous to a query sequence, which provides important clues to the structure and function of the query sequence. The second method aligns a number of homologous sequences with each other, which helps in predicting the function, structure and evolutionary history of biological sequences. A fast computation solution using a parallel version of Needleman-Wunsch and the Alchemi Grid processing engine has been proposed [35]. This is a computational grid approach in which the computations are parallelised. Its main problem is that the number of CPUs needed to achieve true parallelism increases exponentially with the size of the sequences.

3.2 MSA using Hadoop: an illustration
For example, consider three sequences S1 = AGTA, S2 = ATA and S3 = GAT. The different possible combinations of the sequences are {S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1}. All these combinations can be executed in parallel on the HDFS.

[Figure 3: Illustration of alignment of 3 sequences using Hadoop.
Combinations of inputs, processed in parallel: S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1.
First phase: 6 Map/Reduce alignments in parallel. Outputs of the 1st Map/Reduce phase: A1S1,A1S2; A1S1,A1S3; A1S2,A1S1; A1S2,A1S3; A1S3,A1S1; A1S3,A1S2.
Second phase: 12 Map/Reduce alignments in parallel. Outputs of the 2nd Map/Reduce phase: A2S1,A2S2,A1S3; A2S1,A2S3,A1S3; A2S2,A2S1,A1S3; A2S2,A2S3,A1S3; A2S3,A2S1,A1S3; A2S3,A2S2,A1S3.]
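The way the six orderings expand into first-phase and second-phase alignments can be sketched as a small planning routine. This is only an illustration of the bookkeeping, written in plain Python rather than as Hadoop jobs: the helper names `alignment_phases` and `plan` are hypothetical, and the labels S1, S2, S3 stand in for the sequences (or their aligned versions such as A1S1) that a real job would read from the HDFS.

```python
from itertools import permutations


def alignment_phases(order):
    """List the pairwise alignments needed for one ordering.

    Phase k pairs every sequence aligned so far with the next sequence in
    the ordering; the pairs inside one phase do not depend on each other.
    """
    phases = []
    aligned = [order[0]]
    for nxt in order[1:]:
        phases.append([(prev, nxt) for prev in aligned])
        aligned.append(nxt)
    return phases


def plan(names):
    """Map every permutation of the inputs to its list of phases."""
    return {order: alignment_phases(order) for order in permutations(names)}


plans = plan(["S1", "S2", "S3"])
first_phase = sum(len(p[0]) for p in plans.values())
second_phase = sum(len(p[1]) for p in plans.values())

print(len(plans), first_phase, second_phase)
print(plans[("S1", "S2", "S3")])
```

For three sequences the sketch counts six orderings, six first-phase alignments and twelve second-phase alignments, which agrees with the totals shown in Figure 3. Each ordering needs 1 + 2 = 3 pairwise alignments in total.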

Consider the combination S1,S2,S3. According to this combination, S1 and S2 are aligned first. Then S3 is aligned to the aligned S1 and S2 sequences, as shown below.

a) Alignment of sequences S1 and S2: The sequences S1 and S2 are read from the HDFS and aligned.

[Score and trace matrices for S1 = AGTA (columns) and S2 = ATA (rows); the cell values are not recoverable from the source.]

The trace moves as F(3,4) → F(2,3) → F(1,2) → F(1,1). The score is 4 along the path, and the aligned sequences A1S1 and A1S2 are AGTA and A-TA. The aligned sequences A1S1 and A1S2 are stored in the HDFS.

b) Alignment of sequences A1S1 and S3: The aligned sequence A1S1 (AGTA) and the third sequence S3 (GAT) are read from the HDFS and aligned.

[Score and trace matrices for A1S1 = AGTA (columns) and S3 = GAT (rows); the cell values are not recoverable from the source.]

The trace moves as F(3,4) → F(3,3) → F(2,2) → F(1,2). The score is -2 along the path, and the aligned sequences are AG-TA and -GAT-. The aligned sequences A2S1 and A1S3 are stored in the HDFS.

c) Alignment of sequences A1S2 and S3: The aligned sequence A1S2 (A-TA) and the third sequence S3 (GAT) are read from the HDFS and aligned.

[Score and trace matrices for A1S2 = A-TA (columns) and S3 = GAT (rows); the cell values are not recoverable from the source.]

The trace moves as F(3,4) → F(3,3) → F(2,2) → F(1,2) → F(1,1). The score is -5 along the path, and the aligned sequences are A--TA and -GAT-. The aligned sequences A2S2 and A1S3 are stored in the HDFS.

Finally, the aligned sequences are AG-TA, A--TA and -GAT-.

3.3 Analysis
For a three-sequence problem, six different combinations of the sequences are possible. All these combinations are aligned in parallel using the Hadoop data grid. Within each combination, pairwise alignment is carried out. S1 and S2 are read from the DFS and aligned to produce the aligned sequences A1S1 and A1S2, which are then stored in the DFS. Then A1S1 is aligned with S3 and A1S2 is aligned with S3 to produce the final aligned sequences A2S1, A2S2 and A1S3. These two alignments have no dependencies, and hence they are carried out in parallel.

Let n be the number of sequences to be aligned. The number of permutations nPn is n!. Each permutation is processed in parallel, so the overall time complexity equals the complexity of a single combination. Within each permutation there are (n-1) + (n-2) + ... + 1 pairwise alignments, which are performed using maps. The total number of maps for each permutation is therefore

\[
M_p = \sum_{k=1}^{n-1} k = \frac{n(n-1)}{2}.
\]

Of these, maps within the same stage run in parallel, so only (n-1) maps are executed serially in each combination. The complexity of each map (between a pair of sequences of lengths N and M) is O(N * M). If the alignment is done in parallel on b blocks, the time complexity of a map is O((N * M)/b). So the time complexity for each permutation is O((n-1) * (M * N)/b). As the permutations are processed in parallel, the total complexity of MSA is also O((n-1) * (M * N)/b). Splitting a sequence into blocks in Hadoop, so that the score can be computed in parallel on all these blocks, provides fine-grained parallelism. Thus the proposed approach provides both data and compute parallelism.

4. EXPERIMENTAL RESULTS
Hadoop was installed in fully distributed mode on PIV Dual Core machines with 2 GB RAM, and the readings were taken. The time efficiency of the proposed methodology was measured as follows:

a) Varying number of sequences of the same size: The time increases linearly, as the alignment depends on the number of sequences. As the number of nodes increases, the number of alignments that can be executed in parallel also increases. Hence the time for alignment decreases.

[Figure 4: Time comparison for different number of sequences, each of size 842 KB. Axes: time (sec) against number of sequences; series: 2 nodes, 3 nodes.]

b) Same number of sequences of varying size: The time increases, as the pairwise alignment depends on the size of the sequences. As the number of nodes increases, the number of alignments that can be executed in parallel also increases. Hence the time for alignment decreases.

[Figure 5: Time comparison for different size of sequences for 4 sequences. Axes: time (sec) against size of sequence (KB); series: 2 nodes, 3 nodes.]

c) Different block sizes: The time decreases as the block size increases. The sequence is split into blocks in Hadoop, and pairwise alignment can be carried out in parallel on these blocks; the more blocks there are, the more of the alignment runs in parallel. As the number of nodes increases, the number of alignments that can be executed in parallel also increases. Hence the time for alignment decreases.

[Figure 6: Time comparison for different block size for 4 sequences of 424 KB. Axes: time (sec) against block size (KB); series: 2 nodes, 3 nodes.]

5. CONCLUSION
The proposed method of MSA improves the computation time while maintaining accuracy. Hadoop data grids are used to bring about data and compute parallelism. The approach parallelises sequence alignment at three levels. The accuracy of the alignment is maintained because a dynamic programming method is used. When MSA is performed on n sequences of average length N, the time complexity of dynamic programming is O(N^n). In the proposed approach the time complexity is O((n-1) * N^2/b), where b is the number of blocks processed in parallel. Further, due to the scalability of the Hadoop framework, the proposed MSA is well suited to large-scale alignment problems.
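To give a feel for the two bounds quoted in the conclusion, the following worked arithmetic plugs in illustrative values. The numbers n = 4, N = 1000 and b = 8 are assumptions chosen only for this example; they are not measurements from the experiments, and constant factors are ignored.

```latex
\[
N^{n} = 1000^{4} = 10^{12},
\qquad
\frac{(n-1)\,N^{2}}{b} = \frac{3 \times 1000^{2}}{8} = 3.75 \times 10^{5}.
\]
```

Under these assumed values the parallel bound is smaller by a factor of roughly 2.7 × 10^6. The bound presumes that all n! permutations and all b blocks really do run concurrently, so it describes elapsed time on a sufficiently large grid rather than the total work performed.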
ACKNOWLEDGEMENT
The authors thank Mr K V Chidambaran, Director, Yahoo Software Development India Pvt Ltd, for his support in setting up the grid and cloud computing lab at PSG College of Technology. This research is a consequence of the PSG-Yahoo research collaboration on Grid and Cloud Computing.

6. REFERENCES
[1] Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48.
[2] Smith, T.F. and Waterman, M.S. (1981) Identification of Common Molecular Subsequences, Journal of Molecular Biology, Vol. 147.
[3] Li, K.B. (2003) ClustalW-MPI: ClustalW analysis using distributed and parallel computing, Bioinformatics, Vol. 19, No. 12.
[4] Notredame, C., Higgins, D.G. and Heringa, J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment, Journal of Molecular Biology, Vol. 302, No. 1.
[5] Holmes, I. and Bruno, W.J. (2001) Evolutionary HMMs: A Bayesian approach to multiple alignment, Bioinformatics, Vol. 17.
[6] Zhang, C. and Wong, A.K. (1997) A genetic algorithm for multiple molecular sequence alignment, Bioinformatics, Vol. 13.
[7] Madsen, P.J. and Kleywegt, G.J. (2002) Indonesia: An integrated sequence analysis system, Manual.
[8] Stephen, F.A., Warren, G., Webb, M., Eugene, W.M. and David, J.L. (1990) Basic Local Alignment Search Tool, Journal of Molecular Biology, Vol. 215.
[9] Stephen, F.A., Thomas, L.M., Alejandro, A.S., Jinghui, Z., Zheng, Z., Webb, M. and David, J.L. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, Vol. 25, No. 17.
[10] Van, W.I., Lasters, I. and Wyns, L. (2004) Align-m - a new algorithm for multiple alignment of highly divergent sequences, Bioinformatics, Vol. 20, DOI: 10.1093/bioinformatics/bth116.
[11] Roshan, U. and Livesay, D.R. (2006) Probalign: multiple sequence alignment using partition function posterior probabilities, Bioinformatics, Vol. 22.
[12] Katoh, K., Misawa, K., Kuma, K. and Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Research, Vol. 30, No. 14.
[13] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, Vol. 32, No. 5.
[14] Chikkagoudar, S., Roshan, U. and Livesay, D. (2006) eProbalign: generation and manipulation of multiple sequence alignments using partition function posterior probabilities, Bioinformatics, Vol. 22.
[15] Yue, L. and Sing-Hoi, S. (2009) Improving accuracy of multiple sequence alignment algorithms based on alignment of neighboring residues, Nucleic Acids Research, Vol. 37, No. 2.
[16] Golubchik, T., Wise, M.J., Easteal, S. and Jermiin, L.S. (2007) Mind the gaps: evidence of bias in estimates of multiple sequence alignments, Mol Biol Evol, Vol. 35, No. 13.
[17] Rosenberg, M.S. (2005) Multiple sequence alignment accuracy and evolutionary distance estimation, BMC Bioinformatics, Vol. 6.
[18] Lu, Y. and Sze, S.H. (2008) Multiple sequence alignment based on profile alignment of intermediate sequences, Journal of Computational Biology, Vol. 15, No. 7.
[19] Kazutaka, Katoh and Hiroyuki, T. (2007) PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences, Bioinformatics, Vol. 23, No. 3.
[20] Grasso, C. and Lee, C. (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems, Bioinformatics, Vol. 20, No. 10.
[21] Russell, D.J., Otu, H.H. and Sayood, K. (2008) Grammar-based distance in progressive multiple sequence alignment, BMC Bioinformatics, Vol. 9.
[22] Li, I.T., Shum, W. and Truong, K. (2007) 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA), BMC Bioinformatics, Vol. 8.
[23] Manavski, S.A. and Valle, G. (2008) CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment, BMC Bioinformatics, Vol. 9, No. 2.
[24] Meng, X. and Chaudhary, V. (2006) Optimised fine and coarse parallelism for sequence homology search, International Journal of Bioinformatics Research and Applications, Vol. 2, No. 4.
[25] Pedretti, K.T., Casavant, T.L., Braun, R.C., Scheetz, T.E., Birkett, C.L. and Roberts, C.A. (1999) Three Complementary Approaches to Parallelization of Local BLAST Service on Workstation Clusters, Springer, Berlin.
[26] Rajasekaran, S., Thapar, V., Dave, H. and Huang, C.H. (2005) Randomized and parallel algorithms for distance matrix calculations in multiple sequence alignment, Journal of Clinical Monitoring and Computing, Vol. 19, No. 4-5.
[27] Oliver, T., Schmidt, B., Jakop, Y. and Maskell, D. (2008) High Speed Biological Sequence Analysis with Hidden Markov Models on Reconfigurable Platforms, IEEE Trans Information Technology Biomed., Vol. 22.
[28] Dowd, S.E., Zaragoza, J., Rodriguez, J.R., Oliver, M.J. and Payton, P.R. (2005) Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), BMC Bioinformatics, Vol. 6.
[29] Timo, L., Oliver, F. and Erik, L.L. (2009) Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features, Nucleic Acids Research, Vol. 37, No. 3.
[30] Thompson, J.D., Koehl, P., Ripp, R. and Poch, O. (2005) BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark, PubMed, Vol. 61, No. 1.
[31] Apache, Hadoop Documentation.
[32] Gianluca, T. and Christof Teuscher (2003) Biology goes digital - BioWall, XCell Journal, equencee.html
[33] Tieng, K.Y., Ophir, F. and Robert, L.M. (1999) Parallel computation in biological sequence analysis, IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 3.
[34] Ophir, F., Robert, L.M., Tieng, K.Y. and Peter, J.M. (1995) Parallel multiple sequence alignment, Proceedings of Intl. Conf. on Parallel Processing.
[35] Tahir, N., Imitaz, S.S. and Shaftab, A. (2005) Parallel Needleman-Wunsch Algorithm for Grid, Proceedings of the PAK-US International Symposium on High Capacity Optical Networks and Enabling Technologies (HONET 2005), Islamabad, Pakistan, Dec 19-21, 2005.


More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Bio-Informatics Lectures. A Short Introduction

Bio-Informatics Lectures. A Short Introduction Bio-Informatics Lectures A Short Introduction The History of Bioinformatics Sanger Sequencing PCR in presence of fluorescent, chain-terminating dideoxynucleotides Massively Parallel Sequencing Massively

More information

http://www.paper.edu.cn

http://www.paper.edu.cn 5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

Local Alignment Tool Based on Hadoop Framework and GPU Architecture Local Alignment Tool Based on Hadoop Framework and GPU Architecture Che-Lun Hung * Department of Computer Science and Communication Engineering Providence University Taichung, Taiwan clhung@pu.edu.tw *

More information

UPS battery remote monitoring system in cloud computing

UPS battery remote monitoring system in cloud computing , pp.11-15 http://dx.doi.org/10.14257/astl.2014.53.03 UPS battery remote monitoring system in cloud computing Shiwei Li, Haiying Wang, Qi Fan School of Automation, Harbin University of Science and Technology

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Survey on Load Rebalancing for Distributed File System in Cloud

Survey on Load Rebalancing for Distributed File System in Cloud Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Distributed Framework for Data Mining As a Service on Private Cloud

Distributed Framework for Data Mining As a Service on Private Cloud RESEARCH ARTICLE OPEN ACCESS Distributed Framework for Data Mining As a Service on Private Cloud Shraddha Masih *, Sanjay Tanwani** *Research Scholar & Associate Professor, School of Computer Science &

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application 2012 International Conference on Information and Computer Applications (ICICA 2012) IPCSIT vol. 24 (2012) (2012) IACSIT Press, Singapore A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs

More information

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com.

Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. A Big Data Hadoop Architecture for Online Analysis. Suresh Lakavath csir urdip Pune, India lsureshit@gmail.com. Ramlal Naik L Acme Tele Power LTD Haryana, India ramlalnaik@gmail.com. Abstract Big Data

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Bioinformatics Grid - Enabled Tools For Biologists.

Bioinformatics Grid - Enabled Tools For Biologists. Bioinformatics Grid - Enabled Tools For Biologists. What is Grid-Enabled Tools (GET)? As number of data from the genomics and proteomics experiment increases. Problems arise for the current sequence analysis

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

An Hadoop-based Platform for Massive Medical Data Storage

An Hadoop-based Platform for Massive Medical Data Storage 5 10 15 An Hadoop-based Platform for Massive Medical Data Storage WANG Heng * (School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876) Abstract:

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,

More information

Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications

Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications Liqiang (Eric) Wang, Hong Zhang University of Wyoming Hai Huang IBM T.J. Watson Research Center Background Hadoop: Apache Hadoop

More information

Network Protocol Analysis using Bioinformatics Algorithms

Network Protocol Analysis using Bioinformatics Algorithms Network Protocol Analysis using Bioinformatics Algorithms Marshall A. Beddoe Marshall_Beddoe@McAfee.com ABSTRACT Network protocol analysis is currently performed by hand using only intuition and a protocol

More information

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,

More information

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION

A B S T R A C T. Index Terms : Apache s Hadoop, Map/Reduce, HDFS, Hashing Algorithm. I. INTRODUCTION Speed- Up Extension To Hadoop System- A Survey Of HDFS Data Placement Sayali Ashok Shivarkar, Prof.Deepali Gatade Computer Network, Sinhgad College of Engineering, Pune, India 1sayalishivarkar20@gmail.com

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM

MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM MANAGEMENT OF DATA REPLICATION FOR PC CLUSTER BASED CLOUD STORAGE SYSTEM Julia Myint 1 and Thinn Thu Naing 2 1 University of Computer Studies, Yangon, Myanmar juliamyint@gmail.com 2 University of Computer

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop. A Brief Comparison Between the Two Systems Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Detection of Distributed Denial of Service Attack with Hadoop on Live Network Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Scalable Multiple NameNodes Hadoop Cloud Storage System

Scalable Multiple NameNodes Hadoop Cloud Storage System Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

GeoGrid Project and Experiences with Hadoop

GeoGrid Project and Experiences with Hadoop GeoGrid Project and Experiences with Hadoop Gong Zhang and Ling Liu Distributed Data Intensive Systems Lab (DiSL) Center for Experimental Computer Systems Research (CERCS) Georgia Institute of Technology

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Efficient Analysis of Big Data Using Map Reduce Framework

Efficient Analysis of Big Data Using Map Reduce Framework Efficient Analysis of Big Data Using Map Reduce Framework Dr. Siddaraju 1, Sowmya C L 2, Rashmi K 3, Rahul M 4 1 Professor & Head of Department of Computer Science & Engineering, 2,3,4 Assistant Professor,

More information

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction:

ISSN:2320-0790. Keywords: HDFS, Replication, Map-Reduce I Introduction: ISSN:2320-0790 Dynamic Data Replication for HPC Analytics Applications in Hadoop Ragupathi T 1, Sujaudeen N 2 1 PG Scholar, Department of CSE, SSN College of Engineering, Chennai, India 2 Assistant Professor,

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT Gita Shah 1, Annappa 2 and K. C. Shet 3 1,2,3 Department of Computer Science & Engineering, National Institute of Technology,

More information

Autonomic Data Replication in Cloud Environment

Autonomic Data Replication in Cloud Environment International Journal of Electronics and Computer Science Engineering 38 Available Online at www.ijecse.org ISSN- 2277-1956 Autonomic Data Replication in Cloud Environment Dhananjaya Gupt, Mrs.Anju Bala

More information

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 1. E-mail: msm_eng@k-space.org BIOINFTool: Bioinformatics and sequence data analysis in molecular biology using Matlab Mai S. Mabrouk 1, Marwa Hamdy 2, Marwa Mamdouh 2, Marwa Aboelfotoh 2,Yasser M. Kadah 2 1 Biomedical Engineering Department,

More information

Distributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers

Distributed Consistency Method and Two-Phase Locking in Cloud Storage over Multiple Data Centers BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 6 Special Issue on Logistics, Informatics and Service Science Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081

More information

Big Application Execution on Cloud using Hadoop Distributed File System

Big Application Execution on Cloud using Hadoop Distributed File System Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information
