Optimizing a MapReduce Module of Preprocessing High-Throughput DNA Sequencing Data


2013 IEEE International Conference on Big Data

Optimizing a MapReduce Module of Preprocessing High-Throughput DNA Sequencing Data

Wei-Chun Chung #,1,3, Yu-Jung Chang #,2, Chien-Chih Chen 2, Der-Tsai Lee 2,3,4, Jan-Ming Ho 1,2
1 Research Center for Information Technology Innovation, 2 Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC
{wcchung, yjchang, rocky, dtlee,

Abstract—The MapReduce framework has become the de facto choice for big data analysis in a variety of applications. In the MapReduce programming model, computation is distributed to a cluster of computing nodes that run in parallel. The performance of a MapReduce application is thus affected by the system and middleware, the characteristics of the data, and the design and implementation of the algorithms. In this study, we focus on performance optimization of a MapReduce application, CloudRS, which tackles the problem of detecting and removing errors in next-generation sequencing de novo genomic data. We present three strategies of communication between the Map() and Reduce() functions: the content-exchange, content-grouping, and index-only strategies. The three strategies differ in the way messages are exchanged between the two functions. We also present experimental results comparing the performance of the three strategies.

Keywords—error correction; genome assembly; mapreduce; next-generation sequencing; optimization

I. INTRODUCTION

MapReduce [1] is a prominent distributed computing framework that provides key features for large-scale data processing on the cloud [2-4], including fault tolerance, scheduling, data replication, load balancing, and parallelization. By virtue of its scalability and simplicity of development, MapReduce and its implementations [5-7] have been widely used in applications such as Web and social network analysis, scientific simulation, financial and business data processing, and bioinformatics [8-12].
However, the performance and efficiency of MapReduce are affected by many factors and are thus challenging to optimize. Optimizing MapReduce is essential as processing data in a timely and cost-efficient manner becomes critical [13-18]. Fortunately, various techniques have been introduced to improve the performance of MapReduce [19-25], including hardware-, software-, and framework-level optimizations. One class of techniques tunes parameters of the system, middleware, and MapReduce execution by utilizing expert systems [20-22] or rule-of-thumb policies [26, 27]. Another class focuses on the design of the algorithm or the characteristics of the application's data [28, 29].

In this study, we focus on CloudRS [9], a MapReduce application for correcting errors in next-generation sequencing (NGS) data. As the cost of DNA sequencing rapidly decreases [12], the accompanying growth of genome data results in unpredictable execution time, even when the data is processed with MapReduce. Thus, to optimize the performance of CloudRS, we evaluate three message generation and transmission approaches that reduce the communication cost of MapReduce: the content-exchange, content-grouping, and index-only strategies. We also present experimental results and discuss the observations and limitations of the proposed strategies.

# These authors contributed equally to this work (co-first authors).
3 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC.
4 Department of Computer Science and Information Engineering, National Chung Hsing University, Taichung, Taiwan, ROC.
Corresponding author

II. BACKGROUND

A. The MapReduce programming model

The MapReduce programming model is composed of two primitive functions, Map and Reduce.
The input data of a MapReduce program is a list of <key, value> pairs. The Map() function is applied to each pair and generates a set of intermediate <key, value> pairs, which are grouped by key into <key, list(value)> pairs. The Reduce() function is then applied to each intermediate group, processes the values of the list, and produces the aggregated final results. In addition, the MapReduce execution model provides functions such as shuffle and sort to handle the intermediate data. The shuffle function is applied on the Map side and exchanges data by key after Map(), so that data with the same key is transmitted to a single Reduce() function. The sort function is launched on the Reduce side after the data exchange; it sorts data by the key field to group all pairs with the same key for further processing.

B. The CloudRS algorithm

The CloudRS algorithm [9] is implemented as multiple MapReduce rounds. It aims at conservatively correcting sequence errors to avoid yielding false decisions, and thus improves the quality of de novo assembly. To correct a possible mismatch, CloudRS emulates read alignment and majority voting for each set of reads with the same k-mer subsequence, denoted as a read stack. Note that a k-mer subsequence refers to a genomic subsequence of k base pairs (guanine-cytosine or adenine-thymine). Once the reads are aligned and the read stack has been built, majority voting is applied at each position of the read stack to summarize the quality value of each base. A decision is then made at each position to correct an error if necessary.

III. METHODS

The basic idea of error detection and correction is to align reads that share the same specific subsequence of length k and sort them according to the relative position of the subsequence in the read. A voting algorithm then examines the symbols and quality values at each position of the stack of reads, and detects and corrects sequencing errors if the reads and their quality values show a high level of consistency at each position. Interested readers may refer to [9] and [30] for details. In this section, we present three strategies, i.e., the content-exchange, content-grouping, and index-only strategies, for collecting reads with the same specific subsequence of length k in the error correction algorithm based on the MapReduce framework. Each strategy consists of a pair of Map() and Reduce() functions. The Map() function scans through a read for the k-mer subsequences it contains and emits <k-mer, read> pairs. The shuffle stage of MapReduce then aggregates reads with the same k-mer subsequence for further processing by a Reduce() function. The Reduce() function performs the align/sort/voting algorithm to identify and recommend corrections for sequencing errors contained in the reads it receives. Details of these strategies are presented as follows.
A. Content-exchange strategy

For each read r of length l, the Map() function of the content-exchange strategy generates a message of the form <k-mer_i, (identifier, sequence, quality value)> at each position i of r, where 1 ≤ i ≤ l-k+1, sequence and quality value are vectors of length l representing the DNA sequence and quality values of r given by the sequencer, and k-mer_i is the subsequence of r of length k starting at position i.

B. Content-grouping strategy

In this subsection, we present a content-grouping strategy in which the Map() function groups messages destined for the same Reduce() function, and thus reduces the total size of the messages transmitted during the shuffle stage. The key-value pair is defined as <group key, (list(identity key), identifier, sequence, quality value)>. In other words, we divide the original key into two parts, the group key and the identity key. Messages with the same group key are sent to the same instance of the Reduce() function, which first sorts the reads it receives according to their identity keys, and then performs the align/sort/voting algorithm to detect and recommend corrections for sequencing errors in each subset of reads with the same identity key.

C. Index-only strategy

In the third strategy, we aim at further reducing the communication overhead by using the distributed cache mechanism of Hadoop. The input data file containing the sequence and quality value of each read is replicated to each computing node of the cluster before the Map() function executes, so that the Map() function does not have to duplicate the read data (sequence and quality value) in each message generated for each k-mer subsequence. Instead, it generates key-value pairs containing only the k-mer subsequence and the read identifier, which is used later by the Reduce() function to retrieve the read data from its local file cache. Each message generated by the Map() function is formatted as <group key, (list(identity key), identifier)>.
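As a minimal illustration, the three Map()-side message formats can be contrasted in a short sketch. This is illustrative Python, not the authors' Hadoop implementation; the record layouts and helper names are our own assumptions.

```python
from collections import defaultdict

def kmers(seq, k):
    """All k-mer substrings of seq, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def map_content_exchange(read_id, seq, qual, k):
    # One message per k-mer; each carries the full read content.
    return [(km, (read_id, seq, qual)) for km in kmers(seq, k)]

def map_content_grouping(read_id, seq, qual, k, p):
    # The key is split: a group key (prefix of length p) routes the
    # message, and the identity key (remaining k-p chars) rides along.
    return [(km[:p], (km[p:], read_id, seq, qual)) for km in kmers(seq, k)]

def map_index_only(read_id, seq, k, p):
    # Only keys and the read identifier cross the network; sequence and
    # quality values are later read back from the distributed cache.
    return [(km[:p], (km[p:], read_id)) for km in kmers(seq, k)]

def shuffle(messages):
    """Emulate the shuffle stage: collect values by key for Reduce()."""
    groups = defaultdict(list)
    for key, value in messages:
        groups[key].append(value)
    return dict(groups)

reads = [("r1", "ACGTAC", "IIIIII"), ("r2", "TACGTA", "IIIIII")]
msgs = [m for rid, s, q in reads for m in map_content_exchange(rid, s, q, k=4)]
stacks = shuffle(msgs)
print(stacks["ACGT"])  # both reads contain the k-mer ACGT
```

In the content-exchange case every message repeats the whole read; the content-grouping and index-only variants progressively strip that repetition away, which is exactly the size difference quantified in the next subsection.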
Thus, the communication cost in terms of total message size is reduced, at the cost of the I/O overhead of retrieving read data from the local file cache.

D. Qualitative comparison of the three strategies

To evaluate the effectiveness of the proposed methods, we estimate the intermediate data size and its reduction rate by calculation. The input dataset consists of a set of reads; each read is composed of a read id, a DNA sequence, and a quality-value character for each base. For a dataset with r reads in which each read's sequence is composed of l characters, the quality values of a read also have length l. Let the size of the sliding window on the read sequences be k; then there are r*(l-k+1) k-mer substrings, abbreviated as k-mers in the following and denoted as t, to be processed as the input of the mappers. We also define the grouping rate of the k-mers, denoted as n, to evaluate the performance of the grouping mechanism. Hence, we can approximately estimate the number of key groups and the number of k-mers in each group. For convenience, we assume that (a) the length of a read id is a fixed size i; (b) the length of the group key is p and that of the identity key is k-p, where 1 ≤ p ≤ k (see the content-grouping strategy in Methods for definitions); and (c) the grouping rate is normalized, so 0 < n ≤ 1, and there are n*t k-mer groups and 1/n k-mers in each group.

The estimated intermediate data sizes of the three strategies are as follows. For the content-exchange strategy, the intermediate data size is at least t*(k+i+2l) bytes, since each key-value pair passed to the reducers has to carry the k-mer itself as well as the id, sequence, and quality values of its read, with total size k+i+2l. The content-grouping strategy produces (nt)*(p+1/n*(k-p)+i+2l) bytes of intermediate data, because a message contains a group key, a list of identity keys, and the id, sequence, and quality values of its read. The size of the identity-key list is 1/n*(k-p), since there are 1/n k-mers in each group and the length of each identity key is k-p. The index-only strategy generates (nt)*(p+1/n*(k-p)+i) bytes, since each key-value pair contains only the keys and the read id. Table I summarizes the intermediate data size and the storage-space complexity of the proposed strategies.
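As a quick sanity check, the three size formulas can be evaluated side by side. The sample parameter values below (r, l, k, i, p, n) are made up for illustration and do not come from the paper's datasets.

```python
# Back-of-the-envelope check of the intermediate-data-size formulas.

def sizes(r, l, k, i, p, n):
    t = r * (l - k + 1)                       # total number of k-mers
    exchange = t * (k + i + 2 * l)            # content-exchange
    grouping = (n * t) * (p + (1 / n) * (k - p) + i + 2 * l)  # content-grouping
    index_only = (n * t) * (p + (1 / n) * (k - p) + i)        # index-only
    return exchange, grouping, index_only

# e.g. one million 36-bp reads, k = 24, 8-byte ids, p = 12, grouping rate 0.5
ex, gr, io = sizes(r=1_000_000, l=36, k=24, i=8, p=12, n=0.5)
print(f"content-exchange: {ex/1e9:.2f} GB")  # content-exchange: 1.35 GB
print(f"content-grouping: {gr/1e9:.2f} GB")  # content-grouping: 0.75 GB
print(f"index-only:       {io/1e9:.2f} GB")  # index-only:       0.29 GB
```

With these (illustrative) parameters the ordering matches the qualitative claim: content-exchange is largest, and index-only, which drops the 2l bytes of sequence and quality data from every message, is smallest.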

TABLE I. APPROXIMATED INTERMEDIATE DATA SIZE PRODUCED BY THE PROPOSED STRATEGIES.

Proposed strategy | Approximated intermediate data size (bytes) | Complexity of storage space
Content-exchange  | t*(k+i+2l)                | O(rl^2)
Content-grouping  | (nt)*(p+1/n*(k-p)+i+2l)   | O(nrl^2)
Index-only        | (nt)*(p+1/n*(k-p)+i)      | O(krl)

IV. EXPERIMENTAL RESULTS

A. Environment setup and datasets

Our experiments are evaluated on a Hadoop cluster with 10 dedicated computing nodes and an isolated internal network. Each node has two quad-core Intel Xeon E5410 CPUs, 16 GB of memory, 1 TB of local storage, and a 1 Gb network connection. We use Ubuntu Linux 8.04 and Hadoop for our experimental environment. We also allow at most 7 map tasks and 7 reduce tasks to execute concurrently on each node; thus, there are at most 70 map tasks and 70 reduce tasks in a MapReduce wave. The detailed job parameter configuration is listed in Appendix Table A1. In addition, to separate the control flow and the computation flow of the Hadoop framework, we add an additional control node to our cluster. We define roles for the 11 nodes: one control node serves as the Name Node and Job Tracker, while the 10 computing nodes act as Data Nodes and Task Trackers.

We use three real datasets to evaluate the performance of CloudRS. Information on the datasets is listed in Table II. Dataset D1 is a set of short-read data from an Escherichia coli (E. coli) library (SRX000429), which consists of 20.8M 36-bp reads. Dataset D2 is released by Illumina and includes 12M paired-end 150-bp reads; it contains sequences from a well-characterized E. coli strain K-12 MG1655 library sequenced on the Illumina MiSeq platform. Dataset D3 consists of Illumina reads from an African male (NA18507). Note that we set the k-mer size to 24 characters in our experiments. Parameter settings of the evaluations are bounded by the physical limitations of each computing node, i.e., 8 cores and 16 GB of memory.
B. Evaluation Results

We use dataset D1 to demonstrate the effect of parameter settings on a MapReduce program by evaluating the content-exchange strategy. As shown in Table III, the execution time is reduced by nearly 23% when comparing the first and the third rows. We observe that the execution time in the second row is longer than in the first row, and that multiple mapper/reducer waves also increase the total execution time, as shown in the last three rows. The setting of 70 mappers, 70 reducers, and 950 MB of memory achieves the shortest execution time in our experiments; thus, we use this setting and the parameters listed in Appendix Table A1 for the rest of the evaluations.

To demonstrate the efficiency of the content-grouping strategy, we evaluate the strategy on dataset D2 with various partitions of the keys. As shown in Table IV, the intermediate data size and the execution time decrease with the grouping mechanism. We also compare the performance with the content-exchange strategy by setting the key partition to 24:0. However, we encounter an error during execution when the key partition is 6:18 or below.

For the index-only strategy, we use datasets D2 and D3 with 12:12 as the key partition. The results are listed in Table V. On dataset D2, the execution time is reduced by about 37% with the index-only strategy compared to the content-grouping one. However, we encounter an unexpectedly long execution on dataset D3.

V. DISCUSSIONS

A. A brief summary of the three strategies

The three versions of the error correction algorithm basically consider a read as an object and each k-mer subsequence of the read as a feature of the object. In the content-exchange strategy, the Map() functions generate, for each feature of each object, a message containing both the feature and the object. The shuffle stage then collects objects with the same feature for further processing by an instance of the Reduce() function.

The content-grouping strategy defines features with the same prefix as belonging to the same feature group. The Map() functions generate a message for each feature group of each object, and the shuffle stage collects objects belonging to the same feature group at an instance of the Reduce() function for further processing. Note that the total size of the messages generated by the content-grouping strategy is smaller than that generated by the content-exchange strategy. However, the Reduce() function of the content-grouping strategy may suffer a Java exception due to an insufficient amount of physical memory, and thus terminate the execution. The index-only strategy incorporates the grouping mechanism and further reduces the message size; it is thus the least time-consuming of the three. Unfortunately, although the strategy works well on small datasets, it fails when the input data is large.

B. Overhead of the index-only strategy

The index-only strategy utilizes the grouping mechanism and the distributed cache to successfully reduce the size of the data transmitted by the Map() function, and thus also reduces the communication cost in the shuffle stage. However, since the data, i.e., the sequence and quality values, is read from the local cache, the performance bottleneck shifts to disk I/O in the Reduce() functions. The overhead increases rapidly as the input data grows, mainly because reads with the same key, the k-mer, are usually scattered throughout the local cache. Furthermore, multiple tasks run concurrently on a single computing node. The lack of cache hits results in a high page-fault rate, especially when physical memory is exhausted by the running tasks. This phenomenon is known as thrashing; when it occurs, the application may run indefinitely.

TABLE II. LIST OF DATASETS USED TO EVALUATE THE PROPOSED STRATEGIES.

Dataset | SRA accession number | Reference genome | NCBI reference sequence accession number | Genome length (MB) | Read length | Number of reads (M) | Genome coverage | Data size (GB)
D1 | SRX000429 | E. coli | NC_ | ~ | 36 bp | ~20.8 | ~ x | ~1.59
D2 | - | E. coli | NC_ | ~ | 150 bp | ~12 | ~ x | ~3.50
D3 | SRA | African male (NA18507) | - | ~ | ~ bp | ~ | ~ x | ~17.3

VI. CONCLUSION

In this era of big data, it is critical to process large amounts of data timely and efficiently. MapReduce is one of the prominent solutions to this end: it provides scalability and fault tolerance for big data applications. However, the share-nothing nature of MapReduce also motivates research on applications with a high degree of data dependency. An error detection and correction algorithm based on processing reads with the same k-mer subsequence is such an application, especially when it is applied to a large genome. In this paper, we present three strategies for handling data communication between the Map() and Reduce() functions of the MapReduce framework in a bioinformatics application that detects and corrects sequencing errors in NGS data. Note that the NGS data consists of fixed-length reads, each associated with a sequence and quality values. The first strategy replicates the read data for each k-mer subsequence and transmits the entire set of data from Map() to Reduce(). The second strategy groups the k-mer subsequences of a read by their prefix, and thus transmits a smaller amount of data through the network. The third strategy, the index-only strategy, pre-caches the read data directly on each node and transmits only the indices of reads as messages. The index-only strategy has been shown to be the most efficient for small genomes; however, for large genomes, our current implementation may suffer from the thrashing problem. Our future research will focus on improving the performance of the index-only strategy.
We will also look into other problems of a similar nature, e.g., de novo assembly, and develop applications based on the MapReduce framework.

ACKNOWLEDGMENT

The authors wish to thank the anonymous reviewers for their helpful suggestions, and Dr. Wen-Lian Hsu and Dr. Chung-Yen Lin for their valuable discussions and comments. They also wish to thank Chunghwa Telecom Co. and the National Communication Project of Taiwan for providing the cloud computing resources. The research is partially supported by the Digital Culture Center, Academia Sinica, and the National Science Council under grant NSC E MY3.

TABLE III. RESULTS OF DIFFERENT PARAMETER SETTINGS WITH THE CONTENT-EXCHANGE STRATEGY ON DATASET D1.

Parameters (mapred.*)
map.tasks | reduce.tasks | child.java.opts | Run time (s)
 | | -Xmx4000m | 2,
 | | -Xmx950m | 2,
 | | -Xmx950m | 1,
 | | -Xmx950m | 1,
 | | -Xmx950m | 1,
 | | -Xmx950m | 1,991

TABLE IV. RESULTS OF THE CONTENT-GROUPING STRATEGY ON DATASET D2. b

Partition of keys (group:identity) | Intermediate data size (bytes) | Run time (s)
24:0 a | 393,610,577,668 | 22,738
20:4   | 393,596,531,558 | 22,454
12:12  | 393,439,082,022 | 21,949
8:16   | 391,616,362,031 | 21,396
6:18   | 379,682,068,879 | GC overhead exceeded
3:21   | 160,805,782,820 | GC overhead exceeded

a Content-grouping with a key partition of 24:0 is the same as the content-exchange method.
b The input data size is 3,062,609,572 bytes.

TABLE V. RESULTS OF THE INDEX-ONLY METHOD ON DATASETS D2 AND D3. a

Dataset | Method | Run time (s)
D2 | Content-grouping | 21,728
D2 | Index-only | 13,691
D3 | Content-grouping | 16,345
D3 | Index-only | > 8 hr

a Both methods use 12:12 as the partition of keys (group:identity).

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51.
[2] Amazon. Amazon Elastic Compute Cloud (Amazon EC2). Available:
[3] Amazon. Amazon Simple Storage Service (Amazon S3). Available:
[4] Amazon. Amazon Elastic MapReduce (Amazon EMR). Available:
[5] Apache. Welcome to Apache™ Hadoop! Available:
[6] Nokia. Disco MapReduce. Available:
[7] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal.
[8] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho, "A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework," BMC Genomics, vol. 13, pp. 1-17.
[9] C.-C. Chen, Y.-J. Chang, W.-C. Chung, D.-T. Lee, and J.-M. Ho, "CloudRS: an error correction algorithm of high-throughput sequencing data based on scalable framework," in IEEE International Conference on Big Data, in press.
[10] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biol., vol. 11, p. R83.
[11] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, Jun. 2009.
[12] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biol., vol. 11, p. 207.
[13] E. Anderson and J. Tucek, "Efficiency matters!," SIGOPS Oper. Syst. Rev., vol. 44.
[14] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, "MAD skills: new analysis practices for big data," Proc. VLDB Endow., vol. 2.
[15] D. J. DeWitt and M. Stonebraker. (2008). MapReduce: a major step backwards. Available:
[16] B. Irving. Big data and the power of Hadoop.
[17] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, et al., "A comparison of approaches to large-scale data analysis," in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, Rhode Island, USA.
[18] M. Schroepfer. Inside large-scale analytics at Facebook.
[19] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, et al., "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois.
[20] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, "A simulation approach to evaluating design decisions in MapReduce setups," in IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS '09), 2009.
[21] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, et al., "Starfish: a self-tuning system for big data analytics," in Proceedings of the 5th Conference on Innovative Data Systems Research.
[22] E. Jahani, M. J. Cafarella, and C. Ré, "Automatic optimization for MapReduce programs," Proc. VLDB Endow., vol. 4.
[23] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii, "Filtering: a method for solving graph problems in MapReduce," in Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, California, USA.
[24] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: a survey," SIGMOD Rec., vol. 40.
[25] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[26] Cloudera. Optimizing MapReduce Job Performance. Available:
[27] T. White, Hadoop: The Definitive Guide. O'Reilly Media.
[28] J. Lin and C. Dyer, Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers.
[29] J. D. Ullman, "Designing good MapReduce algorithms," XRDS, vol. 19.
[30] S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, et al., "High-quality draft assemblies of mammalian genomes from massively parallel sequence data," Proceedings of the National Academy of Sciences.

APPENDIX

A. Rule-of-thumb policy for configurations

Table A1 lists our Hadoop configuration parameters, chosen according to a rule-of-thumb policy that aims at ensuring the values do not exceed the physical limitations of each computing node. Assume that we have 10 computing nodes, and that each node has 8 CPU cores and 16 GB of memory and acts as a Data Node and Task Tracker. We demonstrate the calculation of the first three parameter values of the Hadoop framework in Table A1. To achieve best-effort CPU utilization, two processes are generally assigned to each CPU core; thus, 16 processes execute simultaneously on a node. However, to preserve the functionality of the underlying operating system, we reserve one CPU core for system routines and I/O operations. Thus, at most 7 CPU cores are available to the Hadoop framework, and we decide to run at most 7 map tasks and 7 reduce tasks concurrently. Since in-memory processing is faster than operating with swap space or context switching, we keep the memory usage of each node within its physical bound. Furthermore, we reserve around 500 MB for operating system processes, 1 GB for Data Node operations, and 1 GB for the Task Tracker. Utilizing the remaining 13 GB of memory with at most 14 concurrent tasks, we can assign 950 MB of memory to each task.

TABLE A1. A SUBSET OF HADOOP JOB CONFIGURATION PARAMETERS THAT SIGNIFICANTLY AFFECT JOB PERFORMANCE.

Parameter Name in Hadoop | Description and Use | Default Value | Our Setting
mapred.child.java.opts | Java options for the task tracker child processes | -Xmx200m | -Xmx950m
mapred.tasktracker.map.tasks.maximum | Maximum number of map tasks run simultaneously by a task tracker | 2 | 7
mapred.tasktracker.reduce.tasks.maximum | Maximum number of reduce tasks run simultaneously by a task tracker | 2 | 7
mapred.reduce.slowstart.completed.maps | Fraction of the completed map tasks required to start reduce tasks in a job | |
mapred.reduce.parallel.copies | Number of parallel transfers | 5 | 15
io.sort.mb | Map-side buffer size (in megabytes) for buffering and sorting key-value pairs | |
io.sort.record.percent | Fraction of io.sort.mb used to store metadata of key-value pairs | |
io.sort.factor | Number of sorted streams to merge at once when sorting files | |
mapred.map.tasks | Default number of map tasks per job | 2 | 10
mapred.reduce.tasks | Default number of reduce tasks per job | |
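The slot and memory budget behind the settings in Table A1 can be reproduced in a few lines. This is a sketch of the appendix's stated arithmetic, not an official Hadoop sizing formula.

```python
# Rule-of-thumb slot and memory budget for one computing node.
cores_per_node = 8
reserved_cores = 1                       # one core kept for OS routines and I/O
slots = cores_per_node - reserved_cores  # 7 -> at most 7 map + 7 reduce tasks

total_mem_mb = 16 * 1024
reserved_mb = 500 + 1024 + 1024          # OS + Data Node + Task Tracker
usable_mb = 13 * 1024                    # remainder, rounded down to 13 GB
per_task_mb = usable_mb // (2 * slots)   # 14 concurrent tasks per node

print(slots, per_task_mb)  # 7 950
```

Dividing the rounded-down 13 GB across the 14 concurrent task slots yields the 950 MB heap (-Xmx950m) used throughout the experiments.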


More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti and Keijo Heljanko Abstract

More information

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute

More information

Report 02 Data analytics workbench for educational data. Palak Agrawal

Report 02 Data analytics workbench for educational data. Palak Agrawal Report 02 Data analytics workbench for educational data Palak Agrawal Last Updated: May 22, 2014 Starfish: A Selftuning System for Big Data Analytics [1] text CONTENTS Contents 1 Introduction 1 1.1 Starfish:

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING

CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING Journal homepage: http://www.journalijar.com INTERNATIONAL JOURNAL OF ADVANCED RESEARCH RESEARCH ARTICLE CURTAIL THE EXPENDITURE OF BIG DATA PROCESSING USING MIXED INTEGER NON-LINEAR PROGRAMMING R.Kohila

More information

Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

More information

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients

More information

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment

Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute

More information

Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis

Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis 2012 4th International Conference on Bioinformatics and Biomedical Technology IPCBEE vol.29 (2012) (2012) IACSIT Press, Singapore Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Research Article Cloud Computing for Protein-Ligand Binding Site Comparison

Research Article Cloud Computing for Protein-Ligand Binding Site Comparison BioMed Research International Volume 213, Article ID 17356, 7 pages http://dx.doi.org/1.1155/213/17356 Research Article Cloud Computing for Protein-Ligand Binding Site Comparison Che-Lun Hung 1 and Guan-Jie

More information

COST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS

COST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS COST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS Ms. T. Cowsalya PG Scholar, SVS College of Engineering, Coimbatore, Tamilnadu, India Dr. S. Senthamarai Kannan Assistant

More information

Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems

Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems paper:38 Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems Edson Ramiro Lucsa Filho 1, Ivan Luiz Picoli 2, Eduardo Cunha de Almeida 2, Yves Le Traon 1 1 University of Luxembourg

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Master s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs

Master s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs paper:24 Master s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs Tiago R. Kepe 1, Eduardo Cunha de Almeida 1 1 Programa de Pós-Graduação

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace

More information

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2

Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Jongwook Woo Computer Information Systems Department California State University Los Angeles jwoo5@calstatela.edu Abstract As the web, social networking,

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

The Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs

The Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs The Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs Jagmohan Chauhan, Dwight Makaroff and Winfried Grassmann Dept. of Computer Science, University of Saskatchewan Saskatoon, SK, CANADA

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

More information

Securing Health Care Information by Using Two-Tier Cipher Cloud Technology

Securing Health Care Information by Using Two-Tier Cipher Cloud Technology Securing Health Care Information by Using Two-Tier Cipher Cloud Technology M.Divya 1 J.Gayathri 2 A.Gladwin 3 1 B.TECH Information Technology, Jeppiaar Engineering College, Chennai 2 B.TECH Information

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

Java Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload

Java Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload Java Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload Shrinivas Joshi, Software Performance Engineer Vasileios Liaskovitis, Performance Engineer 1. Introduction

More information

DESIGN AND ESTIMATION OF BIG DATA ANALYSIS USING MAPREDUCE AND HADOOP-A

DESIGN AND ESTIMATION OF BIG DATA ANALYSIS USING MAPREDUCE AND HADOOP-A DESIGN AND ESTIMATION OF BIG DATA ANALYSIS USING MAPREDUCE AND HADOOP-A 1 DARSHANA WAJEKAR, 2 SUSHILA RATRE 1,2 Department of Computer Engineering, Pillai HOC College of Engineering& Technology Rasayani,

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Performance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services

Performance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services RESEARCH ARTICLE Adv. Sci. Lett. 4, 400 407, 2011 Copyright 2011 American Scientific Publishers Advanced Science Letters All rights reserved Vol. 4, 400 407, 2011 Printed in the United States of America

More information

A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework

A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework Konstantina Palla E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science School of Informatics University of Edinburgh

More information

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Introduction to Hadoop

Introduction to Hadoop 1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

More information

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.

More information

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

A Case for Flash Memory SSD in Hadoop Applications

A Case for Flash Memory SSD in Hadoop Applications A Case for Flash Memory SSD in Hadoop Applications Seok-Hoon Kang, Dong-Hyun Koo, Woon-Hak Kang and Sang-Won Lee Dept of Computer Engineering, Sungkyunkwan University, Korea x860221@gmail.com, smwindy@naver.com,

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

A Study on Big Data Integration with Data Warehouse

A Study on Big Data Integration with Data Warehouse A Study on Big Data Integration with Data Warehouse T.K.Das 1 and Arati Mohapatro 2 1 (School of Information Technology & Engineering, VIT University, Vellore,India) 2 (Department of Computer Science,

More information

Hadoop and Map-reduce computing

Hadoop and Map-reduce computing Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

Local Alignment Tool Based on Hadoop Framework and GPU Architecture Local Alignment Tool Based on Hadoop Framework and GPU Architecture Che-Lun Hung * Department of Computer Science and Communication Engineering Providence University Taichung, Taiwan clhung@pu.edu.tw *

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

HADOOP IN THE LIFE SCIENCES:

HADOOP IN THE LIFE SCIENCES: White Paper HADOOP IN THE LIFE SCIENCES: An Introduction Abstract This introductory white paper reviews the Apache Hadoop TM technology, its components MapReduce and Hadoop Distributed File System (HDFS)

More information

Resource Scalability for Efficient Parallel Processing in Cloud

Resource Scalability for Efficient Parallel Processing in Cloud Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

MATE-EC2: A Middleware for Processing Data with AWS

MATE-EC2: A Middleware for Processing Data with AWS MATE-EC2: A Middleware for Processing Data with AWS Tekin Bicer Department of Computer Science and Engineering Ohio State University bicer@cse.ohio-state.edu David Chiu School of Engineering and Computer

More information

MapReduce Design of K-Means Clustering Algorithm

MapReduce Design of K-Means Clustering Algorithm MapReduce Design of K-Means Clustering Algorithm Prajesh P Anchalia Department of CSE, Email: prajeshanchalia@yahoo.in Anjan K Koundinya Department of CSE, Email: annjank@gmail.com Srinath N K Department

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data : High-throughput and Scalable Storage Technology for Streaming Data Munenori Maeda Toshihiro Ozawa Real-time analytical processing (RTAP) of vast amounts of time-series data from sensors, server logs,

More information

Optimizing Cost and Performance Trade-Offs for MapReduce Job Processing in the Cloud

Optimizing Cost and Performance Trade-Offs for MapReduce Job Processing in the Cloud Optimizing Cost and Performance Trade-Offs for MapReduce Job Processing in the Cloud Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs lucy.cherkasova@hp.com

More information