Optimizing a MapReduce Module of Preprocessing High-Throughput DNA Sequencing Data
2013 IEEE International Conference on Big Data

Wei-Chun Chung #,1,3, Yu-Jung Chang #,2, Chien-Chih Chen 2, Der-Tsai Lee 2,3,4, Jan-Ming Ho 1,2
1 Research Center for Information Technology Innovation, 2 Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC
{wcchung, yjchang, rocky, dtlee, hoho}@iis.sinica.edu.tw

Abstract

The MapReduce framework has become the de facto choice for big data analysis in a variety of applications. In the MapReduce programming model, computation is distributed to a cluster of computing nodes that run in parallel. The performance of a MapReduce application is thus affected by the system and middleware, the characteristics of the data, and the design and implementation of the algorithms. In this study, we focus on performance optimization of a MapReduce application, CloudRS, which tackles the problem of detecting and removing errors in next-generation sequencing de novo genomic data. We present three strategies of communication between the Map() and Reduce() functions: the content-exchange, content-grouping, and index-only strategies. The three strategies differ in the way messages are exchanged between the two functions. We also present experimental results comparing the performance of the three strategies.

Keywords: error correction; genome assembly; mapreduce; next-generation sequencing; optimization

I. INTRODUCTION

MapReduce [1] is a prominent distributed computing framework that provides key features for large-scale data processing on the cloud [2-4], including fault tolerance, scheduling, data replication, load balancing, and parallelization. By virtue of its scalability and simplicity of development, MapReduce and its implementations [5-7] have been widely used in applications such as Web and social network analysis, scientific simulation, financial and business data processing, and bioinformatics [8-12].
However, the performance and efficiency of MapReduce are affected by many factors and are thus challenging to optimize. Optimizing MapReduce is essential as processing data in a timely and cost-efficient manner becomes critical [13-18]. Fortunately, various techniques have been introduced to improve the performance of MapReduce [19-25], including hardware-, software-, and framework-level optimization. One optimization technique is tuning the parameters of the system, middleware, and MapReduce execution by utilizing expert systems [20-22] or rule-of-thumb policies [26, 27]. Another type of optimization focuses on the design of the algorithm or the characteristics of the data of the application [28, 29].

# These authors contributed equally to this work (co-first authors). 3 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, ROC. 4 Department of Computer Science and Information Engineering, National Chung Hsing University, Taichung, Taiwan, ROC. Corresponding author.

In this study, we focus on CloudRS [9], a MapReduce application for correcting errors in next-generation sequencing (NGS) data. As the cost of DNA sequencing rapidly decreases [12], the accompanying growth of genome data results in unpredictable execution time, even when the data is processed with MapReduce. Thus, to optimize the performance of CloudRS, we evaluate three kinds of message generation and transmission approaches to reduce the communication cost of MapReduce: the content-exchange, content-grouping, and index-only strategies. We also present experimental results and discuss the observations and limitations of our proposed strategies.

II. BACKGROUND

A. The MapReduce programming model

The MapReduce programming model is composed of two primitive functions, Map and Reduce.
The input data of a MapReduce program is a list of <key, value> pairs; the Map() function is applied to each pair and generates a set of intermediate pairs, e.g., <key, list(value)>. The Reduce() function is then applied to each intermediate pair, processes the values of the list, and produces aggregated final results. In addition, the MapReduce execution model provides functions such as shuffle and sort to handle intermediate data. The shuffle function is applied on the Map side and performs data exchange by key after Map(); data with the same key are thus transmitted to a single Reduce() function. The sort function is launched on the Reduce side after the data exchange. It sorts data by the key field to group all the pairs with the same key for further processing.

B. The CloudRS algorithm

The CloudRS algorithm [9] is implemented with multiple MapReduce rounds. It aims at conservatively correcting sequence errors to avoid yielding false decisions, and thus improves the quality of de novo assembly. To correct a possible mismatch, CloudRS emulates read alignment and majority voting for each set of reads, denoted as a read stack,
with the same k-mer subsequence. Note that a k-mer subsequence refers to a genomic subsequence of k base pairs, each being either guanine-cytosine or adenine-thymine. Once the reads are aligned and the read stack has been built, majority voting can be applied at each position of the read stack to summarize the quality value of each base. A decision is then made at each position to correct an error if necessary.

III. METHODS

The basic idea of error detection and correction is to align reads having the same specific subsequence of length k and sort them according to the relative position of the subsequence in the read. A voting algorithm is then used to examine the symbols and quality values at each position of the stack of reads, and to detect and correct sequencing errors if the reads and their quality values show a high level of consistency at each position. Interested readers may refer to [9] and [30] for details.

In this section, we present three strategies, i.e., the content-exchange, content-grouping, and index-only strategies, to collect reads with the same specific subsequence of length k in the error correction algorithm based on the MapReduce framework. Note that each strategy consists of a pair of Map() and Reduce() functions. The Map() function scans through a read for the k-mer subsequences it contains and emits the <k-mer, read> pairs. The shuffle stage of MapReduce then aggregates reads with the same k-mer subsequence for further processing by a Reduce() function. The Reduce() function thus performs the align/sort/voting algorithm to identify and recommend corrections for sequencing errors contained in the reads it receives. Details of these strategies are presented as follows.

A.
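The Map/shuffle/Reduce flow described above can be sketched in a few lines. The following Python fragment is an illustrative toy, not the CloudRS implementation (which runs on Hadoop in Java); the record layout and function names are assumptions made for the example.

```python
from collections import defaultdict

def map_reads(reads, k):
    """Map(): emit a <k-mer, read> pair for every k-mer position in a read."""
    for read_id, seq in reads:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k], (read_id, seq)

def shuffle(pairs):
    """Shuffle/sort: group all values sharing the same key, as the
    framework does between Map() and Reduce()."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stack(kmer, stack):
    """Reduce(): receives the read stack sharing one k-mer. The real
    algorithm aligns the stack and votes per position; here we only
    report the stack size as a placeholder."""
    return kmer, len(stack)

# Toy input: two overlapping reads.
reads = [("r1", "ACGTAC"), ("r2", "CGTACG")]
stacks = shuffle(map_reads(reads, k=4))
results = [reduce_stack(kmer, stack) for kmer, stack in sorted(stacks.items())]
```

With these two reads, the k-mer "CGTA" occurs in both, so its read stack has two members; reads sharing a k-mer end up at the same reducer, which is the property the three strategies below exploit.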
Content-exchange strategy

For each read r of length l, the Map() function of the content-exchange strategy generates a message of the form <k-mer_i, (identifier, sequence, quality value)> at each position i of r, where 1 ≤ i ≤ l-k+1, sequence and quality value are vectors of length l representing the DNA sequence and quality value of r given by the sequencer, and k-mer_i is the subsequence of r of length k starting at position i.

B. Content-grouping strategy

In this subsection, we present a content-grouping strategy in which the Map() function groups messages destined for the same Reduce() function and thus reduces the total size of messages transmitted during the shuffle stage. That is, the key-value pair is defined as <group key, (list(identity key), identifier, sequence, quality value)>. In other words, we divide the original key into two parts, the group key and the identity key. Messages with the same group key are sent to the same instance of the Reduce() function, which first sorts the reads it receives according to their identity keys, then performs the align/sort/voting algorithm to detect and recommend corrections for sequencing errors in each subset of reads with the same identity key.

C. Index-only strategy

In the third strategy, we aim at further reducing the communication overhead by using the distributed cache mechanism of Hadoop. The input data file containing the sequence data and quality value of each read is replicated to each computing node of the cluster before executing the Map() function, so that the Map() function does not have to duplicate the read data, including sequence data and quality value, in each message generated for each k-mer subsequence. Instead, it generates messages in key-value pairs containing only the k-mer subsequence and the read identifier, which will be used later by the Reduce() function to retrieve the read data from its local file cache. Each message generated by the Map() function is formatted as <group key, (list(identity key), identifier)>.
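The message formats of the three strategies can be contrasted with a small sketch. The following Python functions are hypothetical stand-ins for the three Map()-side emitters; the record layouts mirror the key-value formats given above, but all names and the exact tuple shapes are illustrative.

```python
def messages_content_exchange(read_id, seq, qual, k):
    """One <k-mer, (id, sequence, quality)> message per k-mer position:
    the read data is replicated into every message."""
    return [(seq[i:i + k], (read_id, seq, qual))
            for i in range(len(seq) - k + 1)]

def messages_content_grouping(read_id, seq, qual, k, p):
    """Split each k-mer into a p-character group key and a (k-p)-character
    identity key; k-mers of the read sharing a group key travel in a
    single message, so the read data is shipped once per group."""
    groups = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        groups.setdefault(kmer[:p], []).append(kmer[p:])
    return [(gkey, (ids, read_id, seq, qual))
            for gkey, ids in groups.items()]

def messages_index_only(read_id, seq, qual, k, p):
    """Same grouping, but only the read identifier is shipped; sequence
    and quality values are retrieved later from the node-local cache."""
    return [(gkey, (ids, read_id))
            for gkey, (ids, _id, _seq, _qual)
            in messages_content_grouping(read_id, seq, qual, k, p)]
```

For a read whose k-mers share prefixes, the content-grouping emitter produces fewer (and the index-only emitter both fewer and smaller) messages than the content-exchange emitter, which is exactly the shuffle-volume reduction quantified in the next subsection.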
Thus, the communication cost in terms of total message size is reduced at the cost of the I/O overhead of retrieving read data from the local file cache.

D. Qualitative comparison of the three strategies

To evaluate the effectiveness of the proposed methods, we estimate the intermediate data size and its reduction rate by calculation. The input dataset consists of a set of reads; each read is composed of a read id, a DNA sequence, and a quality-value character for each base. For a dataset with r reads, each read's sequence composed of l characters, the quality value of a read is also of length l. Let the size of the sliding window on read sequences be k; then there are r*(l-k+1) k-mer substrings, abbreviated as k-mers in the following and denoted as t, to be processed as the input of the mappers. We also define the grouping rate of the k-mers, denoted as n, to evaluate the performance of the grouping mechanism. Hence, we can approximately estimate the number of key groups and the number of k-mers in each group. For convenience, we use the following notation: (a) the length of a read id is a fixed size i; (b) the length of the group key is p and that of the identity key is k-p, where 1 ≤ p ≤ k (see the content-grouping strategy in Methods for definitions); (c) the k-mers are assumed to be evenly distributed among the key groups, with grouping rate 0 < n ≤ 1, so that there are n*t k-mer groups and 1/n k-mers in each group. The estimated sizes of intermediate data for the three strategies are as follows. For the content-exchange strategy, the intermediate data size is at least t*(k+i+2l) bytes, since each key-value pair of a k-mer passed to the reducers has to carry the k-mer itself and the id, sequence, and quality values of its read, with total size k+i+2l. The content-grouping strategy produces (nt)*(p+(1/n)*(k-p)+i+2l) bytes of intermediate data, because a message contains a group key, a list of identity keys, and the id, sequence, and quality values of its read.
The size of the identity key list is (1/n)*(k-p), since there are 1/n k-mers in each group and the length of each identity key is k-p. The index-only strategy generates (nt)*(p+(1/n)*(k-p)+i) bytes, since its key-value pair omits the sequence and quality values and carries only the keys and the read id. Table I summarizes the intermediate data size and the storage-space complexity of our proposed strategies.
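The size estimates above can be turned into a small calculator. The following Python sketch evaluates the three formulas; the sample parameter values are illustrative, not taken from the experiments.

```python
def intermediate_size(strategy, r, l, k, i, p, n):
    """Approximate intermediate data size in bytes, following the
    qualitative comparison above.

    r: number of reads, l: read length, k: k-mer length, i: read-id
    length, p: group-key length, n: grouping rate (0 < n <= 1).
    """
    t = r * (l - k + 1)                 # total k-mers fed to the mappers
    if strategy == "content-exchange":
        return t * (k + i + 2 * l)      # k-mer + (id, sequence, quality)
    if strategy == "content-grouping":
        return int(n * t * (p + (k - p) / n + i + 2 * l))
    if strategy == "index-only":
        return int(n * t * (p + (k - p) / n + i))
    raise ValueError(strategy)

# Illustrative parameters (assumed, not measured): 1M 150-bp reads,
# k = 24 as in the experiments, 8-byte ids, key partition 12:12, n = 0.5.
sizes = {s: intermediate_size(s, 10**6, 150, 24, 8, 12, 0.5)
         for s in ("content-exchange", "content-grouping", "index-only")}
```

For any parameters with n < 1, the calculator reproduces the ordering of Table I: content-exchange produces the most intermediate data and index-only the least.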
TABLE I. APPROXIMATE INTERMEDIATE DATA SIZE PRODUCED BY THE PROPOSED STRATEGIES.

Proposed strategy | Approximate intermediate data size (bytes) | Complexity of storage space
Content-exchange | t*(k+i+2l) | O(rl^2)
Content-grouping | (nt)*(p+(1/n)*(k-p)+i+2l) | O(nrl^2)
Index-only | (nt)*(p+(1/n)*(k-p)+i) | O(krl)

IV. EXPERIMENTAL RESULTS

A. Environment setup and datasets

Our experiments are evaluated on a Hadoop cluster with 10 dedicated computing nodes and an isolated internal network. Each node has two quad-core Intel Xeon E5410 CPUs, 16 GB of memory, 1 TB of local storage, and a 1 Gb network connection. We use Ubuntu Linux 8.04 and Hadoop for our experimental environment. We also set at most 7 map tasks and 7 reduce tasks to execute concurrently on each node; thus, there are at most 70 map tasks and 70 reduce tasks in a MapReduce wave. The detailed job parameter configuration is listed in Appendix Table A1. In addition, to separate the control flow and the computation flow of the Hadoop framework, we add an additional control node to our cluster. We define roles for the 11 nodes: one control node serves as the Name Node and Job Tracker, while the 10 computing nodes act as Data Nodes and Task Trackers.

We use three real datasets to evaluate the performance of CloudRS; their information is listed in Table II. Dataset D1 is a set of short read data from an Escherichia coli (E. coli) library (SRX000429), consisting of 20.8M 36-bp reads. Dataset D2, released by Illumina, includes 12M paired-end 150-bp reads; it contains sequences from the well-characterized E. coli strain K-12 MG1655 sequenced on the Illumina MiSeq platform. Dataset D3 consists of Illumina reads from an African male (NA18507). Note that we set the k-mer size to 24 characters in our experiments. Parameter settings of the evaluations are bounded by the physical limitations of each computing node, i.e., 8 cores and 16 GB of memory.

B.
Evaluation Results

We use dataset D1 to demonstrate the effect of parameters on a MapReduce program by evaluating the content-exchange strategy. As shown in Table III, the execution time is reduced by nearly 23% when comparing the first and the third rows. We observed that the execution time in the second row is longer than in the first row. We also observed that multiple mapper/reducer waves increase total execution time, as shown in the last three rows. The parameter setting of 70 mappers, 70 reducers, and 950 MB of memory achieves the shortest execution time in our experiments; thus, we use this setting and the parameters listed in Appendix Table A1 for the rest of the evaluations.

To demonstrate the efficiency of the content-grouping strategy, we evaluate the strategy with dataset D2 and various partitions of keys. As shown in Table IV, the intermediate data size and the execution time decrease with the grouping mechanism. We also evaluate the performance against the content-exchange strategy by setting the key partition to 24:0. However, we encounter an error during execution when the key partition is set to 6:18 or below.

For the index-only strategy, we use datasets D2 and D3 with 12:12 as the key partition. The results are listed in Table V. On dataset D2, the execution time is reduced by about 37% with the index-only strategy compared with the content-grouping one. However, we encounter an unexpectedly longer execution with dataset D3.

V. DISCUSSIONS

A. A brief summary of the three strategies

The three versions of the error correction algorithm basically consider a read as an object and each k-mer subsequence of the read as a feature of the object. In the content-exchange strategy, the Map() function generates a message for each feature of each object, containing the feature as well as the object. The shuffle stage then collects objects with the same feature for further processing by an instance of the Reduce() function.
The content-grouping strategy defines features with the same prefix as belonging to the same feature group. The Map() function generates a message for each feature group of each object; the shuffle stage thus collects objects belonging to the same feature group at an instance of the Reduce() function for further processing. Note that the total size of messages generated by the content-grouping strategy is smaller than that generated by the content-exchange strategy. However, the Reduce() function of the content-grouping strategy may suffer a Java exception due to an insufficient amount of physical memory and thus terminate the execution. The index-only strategy incorporates the grouping mechanism to further reduce the message size, and is thus the least time-consuming of the three. Unfortunately, though the strategy works well on small datasets, it fails when the input data is large.

B. Overhead of the index-only strategy

The index-only strategy utilizes the grouping mechanism and the distributed cache to successfully reduce the size of data transmitted by the Map() function, and thus also reduces the communication cost in the shuffle stage. However, since the data, i.e., the sequence and quality value, is read from the local cache, the performance bottleneck shifts to disk I/O among the Reduce() functions. This overhead increases rapidly as the input data grows, mainly because reads with the same key, the k-mer, are usually scattered in the local cache. Furthermore, multiple tasks run concurrently on a single computing node. The lack of cache hits results in a high page-fault rate, especially when physical memory is exhausted by the running tasks. This phenomenon is known as thrashing; when it occurs, the application may run indefinitely.
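The reduce-side cache lookup that causes this bottleneck can be illustrated as follows. This Python sketch emulates the index-only Reduce() with an in-memory dictionary standing in for the node-local file cache; the tab-separated cache format and all names are assumptions made for the example, not the CloudRS implementation.

```python
local_cache = {}   # read_id -> (sequence, quality), loaded at task startup

def load_cache(lines):
    """Emulate distributed-cache loading: parse hypothetical
    'id<TAB>sequence<TAB>quality' records into the node-local store."""
    for line in lines:
        read_id, seq, qual = line.rstrip("\n").split("\t")
        local_cache[read_id] = (seq, qual)

def reduce_index_only(group_key, messages):
    """Each incoming message is (identity_keys, read_id); the sequence and
    quality values are fetched locally instead of arriving over the
    network. In the real system every lookup may be a random disk read,
    which is where the I/O bottleneck described above arises."""
    stack = []
    for identity_keys, read_id in messages:
        seq, qual = local_cache[read_id]   # local lookup, not shuffle data
        stack.append((identity_keys, read_id, seq, qual))
    return stack
```

The sketch makes the trade-off visible: the shuffle carries only identifiers, but each reducer must dereference every identifier against the cache, and reads sharing a k-mer are scattered across it.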
TABLE II. LIST OF DATASETS USED TO EVALUATE OUR PROPOSED STRATEGIES.

Dataset | SRA accession number | Reference genome | NCBI reference sequence accession number | Genome length (MB) | Read length | Number of reads (M) | Genome coverage | Data size (GB)
D1 | SRX000429 | E. coli | NC_ | ~ | 36 bp | 20.8 | ~ x | ~1.59
D2 | - | E. coli | NC_ | ~ | 150 bp | 12 | ~ x | ~3.50
D3 | SRA | African male (NA18507) | - | ~ | ~ bp | ~ | ~ x | ~17.3

VI. CONCLUSION

In this era of big data, it is critical to process large amounts of data timely and efficiently. MapReduce is one of the prominent solutions to this end: it provides scalability and fault tolerance for big data applications. However, the share-nothing nature of MapReduce has also prompted research on applications with a high degree of data dependency. An error detection and correction algorithm based on processing reads with the same k-mer subsequence is such an application, especially when it is applied to a large genome.

In this paper, we present three strategies for handling data communication between the Map() and Reduce() functions of the MapReduce framework in a bioinformatics application that detects and corrects sequencing errors in NGS data. Note that the NGS data consists of fixed-length reads, each associated with a sequence and quality value. The first strategy replicates the read data for each k-mer subsequence and transmits the entire set of data from Map() to Reduce(). The second strategy groups the k-mer subsequences of a read by their prefix and thus transmits a smaller amount of data through the network. The third strategy, the index-only strategy, pre-caches the read data directly on each node and transmits only the indices of reads as messages. The index-only strategy has been shown to be the most efficient for small genomes. However, for large genomes, our current implementation may suffer from the thrashing problem. Our future research will focus on improving the performance of the index-only strategy.
We will also look into other problems of a similar nature, e.g., de novo assembly, and develop applications based on the MapReduce framework.

ACKNOWLEDGMENT

The authors wish to thank the anonymous reviewers for their helpful suggestions, and Dr. Wen-Liang Hsu and Dr. Chung-Yen Lin for their valuable discussions and comments. They also wish to thank Chunghwa Telecom Co. and the National Communication Project of Taiwan for providing the cloud computing resources. The research is partially supported by the Digital Culture Center, Academia Sinica, and the National Science Council under grant NSC E MY3.

TABLE III. RESULTS OF DIFFERENT PARAMETER SETTINGS WITH THE CONTENT-EXCHANGE STRATEGY ON DATASET D1.

map.tasks (mapred.*) | reduce.tasks (mapred.*) | child.java.opts (mapred.*) | Run time (s)
 | | -Xmx4000m | 2,
 | | -Xmx950m | 2,
 | | -Xmx950m | 1,
 | | -Xmx950m | 1,
 | | -Xmx950m | 1,
 | | -Xmx950m | 1,991

TABLE IV. RESULTS OF THE CONTENT-GROUPING STRATEGY ON DATASET D2.

Partition of keys (group:identity) | Intermediate data size (bytes) (b) | Run time (s)
24:0 (a) | 393,610,577,668 | 22,738
20:4 | 393,596,531,558 | 22,454
12:12 | 393,439,082,022 | 21,949
8:16 | 391,616,362,031 | 21,396
6:18 | 379,682,068,879 | GC overhead exceeded
3:21 | 160,805,782,820 | GC overhead exceeded

(a) The content-grouping method with a key partition of 24:0 is the same as the content-exchange method.
(b) The input data size is 3,062,609,572 bytes.

TABLE V. RESULTS OF THE INDEX-ONLY METHOD ON DATASETS D2 AND D3.

Dataset | Method (a) | Run time (s)
D2 | Content-grouping | 21,728
D2 | Index-only | 13,691
D3 | Content-grouping | 16,345
D3 | Index-only | > 8 hr

(a) Both methods use 12:12 as the partition of keys (group:identity).
REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51.
[2] Amazon. Amazon Elastic Compute Cloud (Amazon EC2). Available:
[3] Amazon. Amazon Simple Storage Service (Amazon S3). Available:
[4] Amazon. Amazon Elastic MapReduce (Amazon EMR). Available:
[5] Apache. Welcome to Apache™ Hadoop! Available:
[6] Nokia. Disco MapReduce. Available:
[7] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal.
[8] Y.-J. Chang, C.-C. Chen, C.-L. Chen, and J.-M. Ho, "A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework," BMC Genomics, vol. 13, pp. 1-17.
[9] C.-C. Chen, Y.-J. Chang, W.-C. Chung, D.-T. Lee, and J.-M. Ho, "CloudRS: an error correction algorithm of high-throughput sequencing data based on scalable framework," in IEEE International Conference on Big Data, in press.
[10] B. Langmead, K. D. Hansen, and J. T. Leek, "Cloud-scale RNA-sequencing differential expression analysis with Myrna," Genome Biol, vol. 11, p. R83.
[11] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, Jun.
[12] L. D. Stein, "The case for cloud computing in genome informatics," Genome Biol, vol. 11, p. 207.
[13] E. Anderson and J. Tucek, "Efficiency matters!," SIGOPS Oper. Syst. Rev., vol. 44.
[14] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, "MAD skills: new analysis practices for big data," Proc. VLDB Endow., vol. 2.
[15] D. J. DeWitt and M. Stonebraker. (2008). MapReduce: A major step backwards. Available:
[16] B. Irving. Big data and the power of Hadoop.
[17] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, et al., "A comparison of approaches to large-scale data analysis," in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, Rhode Island, USA.
[18] M. Schroepfer. Inside large-scale analytics at Facebook.
[19] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, et al., "Twister: a runtime for iterative MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, Chicago, Illinois.
[20] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, "A simulation approach to evaluating design decisions in MapReduce setups," in IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS '09), 2009.
[21] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, et al., "Starfish: a self-tuning system for big data analytics," in Proceedings of the 5th Conference on Innovative Data Systems Research.
[22] E. Jahani, M. J. Cafarella, and C. Ré, "Automatic optimization for MapReduce programs," Proc. VLDB Endow., vol. 4.
[23] S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii, "Filtering: a method for solving graph problems in MapReduce," in Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, San Jose, California, USA.
[24] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: a survey," SIGMOD Rec., vol. 40.
[25] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[26] Cloudera. Optimizing MapReduce Job Performance. Available:
[27] T. White, Hadoop: The Definitive Guide. O'Reilly Media.
[28] J. Lin and C. Dyer, Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers.
[29] J. D. Ullman, "Designing good MapReduce algorithms," XRDS, vol. 19.
[30] S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, et al., "High-quality draft assemblies of mammalian genomes from massively parallel sequence data," Proceedings of the National Academy of Sciences.

APPENDIX

A. Rule-of-thumb policy for configurations

Table A1 lists our Hadoop configuration parameters, which follow a rule-of-thumb policy aimed at ensuring that the values do not exceed the physical limitations of each computing node. Assume that we have 10 computing nodes, each with 8 CPU cores and 16 GB of memory, acting as a Data Node and Task Tracker. We demonstrate the calculation of the first three parameter values of the Hadoop framework in Table A1. To achieve best-effort CPU utilization, two processes are generally assigned to each CPU core; thus, up to 16 processes execute simultaneously on a node. However, to preserve the responsiveness of the underlying operating system, we reserve one CPU core for system routines and I/O operations. Thus, at most 7 CPU cores remain for the Hadoop framework, and we decide to set at most 7 map tasks and 7 reduce tasks to run concurrently. Since in-memory processing is faster than operating with swap space or context switching, we keep the memory usage of each node within its physical boundary. Furthermore, we reserve around 500 MB for operating system processes, 1 GB for Data Node operations, and 1 GB for the Task Tracker. To utilize the remaining 13 GB of memory with at most 14 concurrent tasks, we can assign 950 MB of memory to each task.
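The slot and heap-size arithmetic above can be written out explicitly. This is a sketch of the rule-of-thumb calculation, with the reservation sizes taken from the text; the function name and default-argument style are our own.

```python
def per_task_heap_mb(total_gb=16, os_mb=500, datanode_mb=1024,
                     tasktracker_mb=1024, map_slots=7, reduce_slots=7):
    """Split the memory left after the OS, Data Node, and Task Tracker
    reservations evenly across the concurrent map and reduce task slots."""
    free_mb = total_gb * 1024 - os_mb - datanode_mb - tasktracker_mb
    return free_mb // (map_slots + reduce_slots)
```

With the node described above this yields 988 MB per task, consistent with the paper's conservative choice of a 950 MB (-Xmx950m) heap for each of the 14 concurrent tasks.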
TABLE A1. A SUBSET OF HADOOP JOB CONFIGURATION PARAMETERS THAT SIGNIFICANTLY AFFECT JOB PERFORMANCE.

Parameter name in Hadoop | Description and use | Default value | Our setting
mapred.child.java.opts | Java options for the task tracker child processes | -Xmx200m | -Xmx950m
mapred.tasktracker.map.tasks.maximum | Maximum number of map tasks run simultaneously by a task tracker | 2 | 7
mapred.tasktracker.reduce.tasks.maximum | Maximum number of reduce tasks run simultaneously by a task tracker | 2 | 7
mapred.reduce.slowstart.completed.maps | Fraction of the completed map tasks to start reduce tasks in a job | |
mapred.reduce.parallel.copies | Number of parallel transfers | 5 | 15
io.sort.mb | Map-side buffer size (in megabytes) for buffering and sorting key-value pairs | |
io.sort.record.percent | Fraction of io.sort.mb used to store metadata of key-value pairs | |
io.sort.factor | Number of sorted streams to merge at once when sorting files | |
mapred.map.tasks | Default value of map tasks per job | 2 | 10
mapred.reduce.tasks | Default value of reduce tasks per job | |
More informationEnergy-Saving Cloud Computing Platform Based On Micro-Embedded System
Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationClassification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt
More informationRECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE
RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients
More informationAnalysis and Optimization of Massive Data Processing on High Performance Computing Architecture
Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National
More informationAccelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
More informationA Framework for Performance Analysis and Tuning in Hadoop Based Clusters
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish
More informationCloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis
2012 4th International Conference on Bioinformatics and Biomedical Technology IPCBEE vol.29 (2012) (2012) IACSIT Press, Singapore Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance
More informationThe Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationHow To Analyze Log Files In A Web Application On A Hadoop Mapreduce System
Analyzing Web Application Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environment Sayalee Narkhede Department of Information Technology Maharashtra Institute
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationChameleon: The Performance Tuning Tool for MapReduce Query Processing Systems
paper:38 Chameleon: The Performance Tuning Tool for MapReduce Query Processing Systems Edson Ramiro Lucsa Filho 1, Ivan Luiz Picoli 2, Eduardo Cunha de Almeida 2, Yves Le Traon 1 1 University of Luxembourg
More informationResearch Article Cloud Computing for Protein-Ligand Binding Site Comparison
BioMed Research International Volume 213, Article ID 17356, 7 pages http://dx.doi.org/1.1155/213/17356 Research Article Cloud Computing for Protein-Ligand Binding Site Comparison Che-Lun Hung 1 and Guan-Jie
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationCOST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS
COST MINIMIZATION OF RUNNING MAPREDUCE ACROSS GEOGRAPHICALLY DISTRIBUTED DATA CENTERS Ms. T. Cowsalya PG Scholar, SVS College of Engineering, Coimbatore, Tamilnadu, India Dr. S. Senthamarai Kannan Assistant
More informationIntroduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
More informationDeveloping Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
More informationMaster s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs
paper:24 Master s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs Tiago R. Kepe 1, Eduardo Cunha de Almeida 1 1 Programa de Pós-Graduação
More informationAnalysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationLARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationMarket Basket Analysis Algorithm on Map/Reduce in AWS EC2
Market Basket Analysis Algorithm on Map/Reduce in AWS EC2 Jongwook Woo Computer Information Systems Department California State University Los Angeles jwoo5@calstatela.edu Abstract As the web, social networking,
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationFP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
More informationThe Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs
The Impact of Capacity Scheduler Configuration Settings on MapReduce Jobs Jagmohan Chauhan, Dwight Makaroff and Winfried Grassmann Dept. of Computer Science, University of Saskatchewan Saskatoon, SK, CANADA
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationIntroduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationCLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,
More informationA Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems
A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down
More informationA Cost-Benefit Analysis of Indexing Big Data with Map-Reduce
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
More informationReducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
More informationSecuring Health Care Information by Using Two-Tier Cipher Cloud Technology
Securing Health Care Information by Using Two-Tier Cipher Cloud Technology M.Divya 1 J.Gayathri 2 A.Gladwin 3 1 B.TECH Information Technology, Jeppiaar Engineering College, Chennai 2 B.TECH Information
More informationJava Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload
Java Garbage Collection Characteristics and Tuning Guidelines for Apache Hadoop TeraSort Workload Shrinivas Joshi, Software Performance Engineer Vasileios Liaskovitis, Performance Engineer 1. Introduction
More informationBSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,
More informationPerformance Optimization of a Distributed Transcoding System based on Hadoop for Multimedia Streaming Services
RESEARCH ARTICLE Adv. Sci. Lett. 4, 400 407, 2011 Copyright 2011 American Scientific Publishers Advanced Science Letters All rights reserved Vol. 4, 400 407, 2011 Printed in the United States of America
More informationA Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework
A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework Konstantina Palla E H U N I V E R S I T Y T O H F G R E D I N B U Master of Science School of Informatics University of Edinburgh
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationDESIGN AND ESTIMATION OF BIG DATA ANALYSIS USING MAPREDUCE AND HADOOP-A
DESIGN AND ESTIMATION OF BIG DATA ANALYSIS USING MAPREDUCE AND HADOOP-A 1 DARSHANA WAJEKAR, 2 SUSHILA RATRE 1,2 Department of Computer Engineering, Pillai HOC College of Engineering& Technology Rasayani,
More informationExploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationEnhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
More informationRecognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
More informationAnalysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationA Study on Big Data Integration with Data Warehouse
A Study on Big Data Integration with Data Warehouse T.K.Das 1 and Arati Mohapatro 2 1 (School of Information Technology & Engineering, VIT University, Vellore,India) 2 (Department of Computer Science,
More informationHadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationEnhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
More informationWhat is Analytic Infrastructure and Why Should You Care?
What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,
More informationHADOOP IN THE LIFE SCIENCES:
White Paper HADOOP IN THE LIFE SCIENCES: An Introduction Abstract This introductory white paper reviews the Apache Hadoop TM technology, its components MapReduce and Hadoop Distributed File System (HDFS)
More informationLocal Alignment Tool Based on Hadoop Framework and GPU Architecture
Local Alignment Tool Based on Hadoop Framework and GPU Architecture Che-Lun Hung * Department of Computer Science and Communication Engineering Providence University Taichung, Taiwan clhung@pu.edu.tw *
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationResource Scalability for Efficient Parallel Processing in Cloud
Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the
More informationInternational Journal of Advance Research in Computer Science and Management Studies
Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationMATE-EC2: A Middleware for Processing Data with AWS
MATE-EC2: A Middleware for Processing Data with AWS Tekin Bicer Department of Computer Science and Engineering Ohio State University bicer@cse.ohio-state.edu David Chiu School of Engineering and Computer
More informationLarge-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationStreamStorage: High-throughput and Scalable Storage Technology for Streaming Data
: High-throughput and Scalable Storage Technology for Streaming Data Munenori Maeda Toshihiro Ozawa Real-time analytical processing (RTAP) of vast amounts of time-series data from sensors, server logs,
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationA Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More informationDetection of Distributed Denial of Service Attack with Hadoop on Live Network
Detection of Distributed Denial of Service Attack with Hadoop on Live Network Suchita Korad 1, Shubhada Kadam 2, Prajakta Deore 3, Madhuri Jadhav 4, Prof.Rahul Patil 5 Students, Dept. of Computer, PCCOE,
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationOptimizing Cost and Performance Trade-Offs for MapReduce Job Processing in the Cloud
Optimizing Cost and Performance Trade-Offs for MapReduce Job Processing in the Cloud Zhuoyao Zhang University of Pennsylvania zhuoyao@seas.upenn.edu Ludmila Cherkasova Hewlett-Packard Labs lucy.cherkasova@hp.com
More information