Analysis and Modeling of MapReduce's Performance on Hadoop YARN


Analysis and Modeling of MapReduce's Performance on Hadoop YARN

Qiuyi Tang
Dept. of Mathematics and Computer Science
Denison University
tang_j3@denison.edu

Dr. Thomas C. Bressoud
Dept. of Mathematics and Computer Science
Denison University
bressoud@denison.edu

ABSTRACT
With the rapid growth of technology, scientists have recognized the challenge of efficiently analyzing large data sets since the beginning of the 21st century. Increases in data volume and data complexity have shifted scientists' focus to parallel, distributed algorithms running on clusters. In 2004, Jeffrey Dean and Sanjay Ghemawat of Google introduced a new programming model for storing and processing large data sets, called MapReduce [2]. Apache Hadoop, an open-source software framework that uses MapReduce as its data-processing layer, was developed at Yahoo as early as 2006 and later evolved into a stable platform. Although Hadoop has been widely used in industry, its performance characteristics are not well understood. This paper, following Hadoop's workflow, analyzes the factors that influence the running time of each phase of Hadoop execution. Given those factors, our goal is to model the performance of MapReduce applications.

1. INTRODUCTION
MapReduce provides an abstraction for distributed processing of large data sets on clusters of computers by hiding the details of scheduling, resource management, and fault-tolerance; it is therefore easier for software engineers to develop applications by focusing on just the map and reduce tasks (explained below). As a realization of the MapReduce model, Hadoop YARN (Yet Another Resource Negotiator) was first introduced into Apache Hadoop on November 11, 2011, to address scalability and other issues of the earlier realization.
This version became generally available in October. However, the performance characteristics of the runtime system are not well studied, because they depend on complicated interactions among the network, the computing nodes, and the distributed file system, as well as on a tremendous number of system configuration parameters. Further, Hadoop may perform differently as clusters grow in size and faults become more frequent. Our objective was to analyze the performance of Hadoop applications on a scalable local cluster and ultimately build a model that predicts the running time based on input size, system properties, and configuration parameters. This paper describes the factors that affect the performance of each MapReduce phase and suggests methods to aggregate them into a model.

2. BACKGROUND

2.1 Hadoop YARN
Hadoop YARN is the second generation of Hadoop. Compared to the first generation, YARN performs better as the input scales up, because it decouples the centralized resource manager and scheduler, formerly combined with application task control, into separate

components: a resource manager and an application master. The resource manager takes care of job scheduling and resource allocation, in cooperation with distributed node-local management, and the application master monitors task progress for a single application. The new system can run multiple application masters to enable multiple concurrent jobs. In addition to the resource manager and application masters, YARN uses HDFS (the Hadoop Distributed File System) to move data from node to node.

2.2 Hadoop Workflow
A Hadoop job can be partitioned into three phases: the map phase, the shuffle/merge phase, and the reduce phase. Prior to running a job, the user must ensure that the job's input data has been transferred into the distributed file system (HDFS). The data set is broken into blocks, and the data is spread among the computers of the cluster. HDFS performs this distribution regardless of the number or size of the files. Before the application may begin, the resource manager considers the input data and associates a map task with each block-size share of the input data. The input for a map task is called a split. Each map task interprets its split as a collection of key-value pairs and runs a map algorithm written by the user. The outputs of the map function, usually referred to as intermediate output, are a set of key-value pairs, which are stored locally in a buffer or on disk. To enable aggregate analysis of these results in a structured way, all intermediate data with the same key is processed by a single reduce task, to be discussed below. In most discussions of MapReduce, shuffle and merge are treated as a single phase, but they actually have different responsibilities, and hence we consider them as two separate phases. The shuffle phase copies the intermediate output from the map tasks to the node of the reducer responsible for each key value.
As the copying goes on and the intermediate outputs are gathered on the reducer's node, a background thread merges the intermediate outputs for a given reducer into a large file sorted by key. This process is known as the merge phase; some refer to it as the sort phase. Note that the shuffle phase can start while the map phase is running, but the merge phase initiates after the shuffle completes and must terminate before the reduce phase can start, so that the reducers know they have the entire set of intermediate outputs associated with their keys. The last part of the workflow is the reduce phase. The reduce task processes the large, sorted intermediate output file, which is now locally stored, according to the user's reduce algorithm, and emits final output key-value pairs. These final outputs are stored back in HDFS, where the user can examine them. The phases of this process are illustrated in Figure 1. The input is on the far left, where each split is given to a separate map task. Each map task generates intermediate output, which is then distributed by key to the set of reducers. Each reducer processes its part of the key set and produces the final output.

Figure 1: Hadoop Workflow
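The workflow above can be mimicked in miniature outside of Hadoop. The following sketch is purely illustrative (it uses an in-memory word count, not Hadoop's API), but it follows the same map, shuffle/merge, and reduce structure:

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Tiny in-memory imitation of the map / shuffle-merge / reduce phases."""
    # Map phase: each split yields intermediate key-value pairs.
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))

    # Shuffle/merge phase: group the pairs by key and sort, as a reducer's
    # merged, key-sorted input file would be.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reduce call per key, emitting the final pairs.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Word count, the canonical MapReduce example.
def wc_map(split):
    return [(word, 1) for word in split.split()]

def wc_reduce(key, values):
    return sum(values)

result = run_mapreduce(["a rose is a rose", "is a rose"], wc_map, wc_reduce)
print(result)  # {'a': 3, 'is': 2, 'rose': 3}
```

In Hadoop itself, of course, each phase runs on distributed tasks and the grouped data flows over the network, which is precisely what the remainder of this paper measures.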

3. METHODS

3.1 Assumptions
In this paper, we make several assumptions to simplify our analysis by controlling variables. These assumptions are not always realistic, and the plan is to relax them in future research.

(1) The Hadoop job is fault-free. Hadoop is designed to be fault-tolerant, which means a job can complete correctly even in the presence of failed map or reduce tasks. However, instances of failure increase as the size of the cluster increases. Although the fault-tolerance mechanism is completely hidden from the programmer's view, it has a significant impact on the running time, since it involves re-execution of map or reduce tasks and a time-out mechanism. This is the subject of future study.

(2) All configuration properties not mentioned in this paper take their default values. Since our goal is to identify the significant performance factors, we only consider the configuration parameters that impact those factors. Any configuration parameters that we do not experiment with are left at the default values defined by the Hadoop system. One thing to notice is short-circuit local read, in which the client reads a file directly without going through the datanode, reducing data transfer time. In our experiments, we leave dfs.client.read.shortcircuit at its default, which means short-circuit reads are not allowed.

(3) The network and cluster operate under non-exceptional conditions during application execution. Saturation of the network, resulting in switch congestion, packet loss, exponential backoff, timeouts, and other exceptional conditions, will cause extreme variation in data transfer times and will also result in fault-tolerance mechanisms being invoked. These performance effects, by their nature, are difficult to model. Thus, we assume application execution under non-exceptional network conditions.
(4) The set of intermediate keys and the number of intermediate key-value pairs destined for reduce tasks result in a balanced workload among the reduce tasks. Keys are assigned to reducers through a Hadoop class called a partitioner. If the distribution of intermediate keys is not uniform, or if the number of key-value pairs for a given key varies widely, some reduce tasks will have considerably more work than others and can take significantly more time. Hence, we assume an effective partitioner that spreads the intermediate outputs evenly over all the reduce tasks, in which case we can model the running time of the reducers statistically by using expected values.

(5) The size of the input data is large enough that the input for any given map task has high probability of being a full block in size. We want to use the block size as the input size of each map task. If the capacity of a block is not fully employed, the estimated running time can be longer than the actual running time. Since we are interested in large-data applications, this assumption is easily met.

3.2 System Equipment
In this research, we use a Beowulf cluster of 32 computers running Ubuntu. 16 of the computers have Intel quad-core Core i-series CPUs with processor speeds of 3.20GHz, 451GB of disk storage, and 8GB of memory. The other 16 computers have Intel Core 2 Quad CPUs with processor speeds of 2.66GHz, 451GB of disk storage, and 4GB of memory. The network switch is a gigabit-speed Cisco switch. Each node runs only the OS and Hadoop, with no monitoring

software, in order to measure performance accurately.

4. MAP PHASE
As described earlier, the map phase is the first phase of a Hadoop application, during which the input data is partitioned and processed by the set of map tasks. Due to the potentially large volume of input, the map phase can take a long time to finish. To account for the significant performance factors, our goal is to model the running time of each map task. From this we can aggregate map task executions to arrive at predictions for the full map phase. So we wish to measure, and then model, the time between a map task's launch and its finish. From experimentation on our Beowulf cluster, we identified the most significant factors in map task execution time as:

Block size: the size of the input split for each map task.

Locality of the input split: When possible, Hadoop executes a map task on the same node where its input data resides in HDFS. When this is not possible, the data must be transferred to the location of the map task.

Level of concurrency: With each cluster node running four hyperthreaded cores, the resource allocation and scheduling should execute multiple map tasks on any given cluster node. But there is a tradeoff between too few concurrent activities and too many, and these affect the performance of the map tasks.

4.1. Randomness of the Running Time
In the real world, the running time of map tasks can vary over a significant range, even under the same configuration and provisioning. But we may model the performance by a probability distribution. By conducting a series of experiments, we found that the running-time distribution (exclusive of the above performance factors) is similar to a normal distribution. However, the distribution has a long right tail, indicating that there are outliers taking much more time than the average. This right tail skews the mean and standard deviation.
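Such a shape (a roughly normal body plus occasional stragglers) can be reproduced with a simple generative sketch. The mixture below, and all of its numeric parameters, are illustrative assumptions rather than measured values from our experiments:

```python
import random

def sample_map_task_time(rng, mean_ms=9000.0, stdev_ms=1400.0,
                         straggler_prob=0.05, tail_scale_ms=8000.0):
    """Sample one map-task run time: a normal body with a long right tail.

    All parameter values here are illustrative, not fitted to our data.
    """
    t = rng.gauss(mean_ms, stdev_ms)
    if rng.random() < straggler_prob:        # occasional straggler
        t += rng.expovariate(1.0 / tail_scale_ms)
    return max(t, 0.0)

rng = random.Random(42)
times = [sample_map_task_time(rng) for _ in range(10000)]
mean = sum(times) / len(times)
# The tail pulls the sample mean to the right of the 9000 ms body.
print(round(mean))
```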
By shifting the mean to the left and using a smaller standard deviation, it is possible to fit a normal distribution to the observed running-time distribution.

Figure 2: Map Task Running Time Distribution

Figure 2 illustrates an example of an experimental map task run-time distribution. This is a histogram obtained from running 4096 map tasks with a block size of 128 MB and a concurrency level of 6. In the graph, there is clearly a peak and a long right tail. If we consider only the peak part of the graph, we can fit a normal distribution to it. In this graph, the fitted normal distribution has a mean of ms and a standard deviation of 1382 ms.

4.2. Block Size
In Hadoop, the block size, which is the size of the input to a map task, can be altered by changing configuration parameters for the execution of HDFS. In practice, people tend to use a 64 MB or 128 MB block size, because too small a block size can cause significant overhead, while too large a block size can cause long-running map tasks, with greater variation in their run times, which can result

in performance degradation when the fault-tolerance mechanism is invoked. As one might expect, our benchmark shows that the running time of a map task increases linearly with the block size. This indicates that the runtime's overhead in initiating the transfer of a block is small compared to the transfer time for blocks of this size. In a series of experiments investigating this performance factor, we created 1024 files in HDFS, where each file is 1024 MB in size. To keep the level of concurrency fixed, the memory available on each node is 10 GB while each map task takes 2 GB, allowing five concurrent map tasks to execute on any given cluster node. One thing to notice is that the memory of each node (10 GB) is physical memory, while the memory of each map task (2 GB) is virtual memory. The parameter yarn.nodemanager.vmem-pmem-ratio converts physical memory to virtual memory for each nodemanager. Hence, we leave space for the nodemanager and datanode in virtual memory and still have enough space to run the map tasks. In Table 1, we summarize the results of these experiments, showing that each time we double the block size, the average map task time also doubles.

Block Size (MB) | Average map time (sec)

Table 1: Map task time and block size

4.3. Locality of an Input Split
During execution of the map phase, the resource manager, in cooperation with the application master and HDFS, must schedule map tasks on cluster nodes with sufficient memory and sufficient CPU resources. Priority is given to scheduling map tasks on the node on which their required input data resides.¹ Such a case of the map task and its data being co-located on the same cluster node is known as data-local. However, depending on the memory and CPU resources, such co-location is not always possible. Instead of deferring map tasks, Hadoop will then schedule a map task on a different node in the cluster.
In this case, the data must be transferred from the cluster node where it resides to the cluster node chosen for the execution of the map task. Assuming that, in the network topology, the location of the map task and the location of the data are on the same network rack, this case is known as rack-local. The process of transferring data in the rack-local case is a significant performance factor and must be modeled. We refer to the transfer time as the communication gap. Once a map task obtains its input, it processes it in the normal way. To address the performance factor of locality, we measure and model the communication gap of data-local map tasks versus rack-local map tasks. Consider the graph of map task execution in Figure 4. The x-axis is time, and the y-axis is the set of distinct map tasks, sorted by ID. This graph shows the flow of the map phase on a single node. If one were to draw a vertical line at a particular time t, one could see the number of concurrent map tasks. In this example, the concurrency level is constrained to four map tasks. As long as Hadoop can schedule data-local map tasks, it will do so. Thus, on the graph we see many data-local map tasks before a rack-local map task. The graph is annotated with the first such instance, along with the interval of time we call the communication gap.

¹ This is an example of moving the processing to the data, rather than moving the data to the location of the processing.

Figure 4: Map Task Flow on a Node

By conducting a series of experiments and measuring the communication gap preceding rack-local map tasks, we found that this time does not change appreciably with the concurrency level, as long as the network is not saturated. In addition, we found that data-local map tasks also have a small gap before them, which we attribute to the overhead of the runtime system in supplying the local data. A normal distribution fits the rack-local gap well over a range of other configuration setups. We model the gap time of rack-local map tasks using a normal distribution with mean = 13000 ms and standard deviation = 3000 ms. For data-local map tasks, we use a constant 300 ms.

4.4. Concurrency Level
The maximum number of tasks in execution on a given cluster node, which we refer to as the concurrency level, may be controlled by specifying Hadoop parameters that inform the resource manager of the total memory on each compute node and the memory required for any given task. We expected that in an underutilized system, one in which the concurrency level is low and more could be done without adversely affecting the performance of the already-executing tasks, individual map task times might be fast but overall execution time should suffer. Our experimentation bore out these expectations. When the concurrency level is 3 or lower, individual map task time is at its best, but the aggregate time of the map phase suffers. But we wanted to understand the effect of further increasing the concurrency level. With greater levels of concurrency, the mean time for individual map tasks may increase, but, with more tasks running in parallel, what is the effect on the aggregate time of the map phase? By manipulating the Hadoop configuration parameters as described above, we were able to experiment with concurrency levels of 1 through 8.
For concurrency levels of 4 through 8, the aggregate map phase time remained essentially unchanged for a fixed input size, meaning that the average map task time was directly proportional to the concurrency level. For example, an individual map task at concurrency level 8 takes twice as long as one at concurrency level 4, but since there are twice as many of them at any given point in time, the aggregate time remains the same. Consider the graph of Figure 5. Each line represents a certain level of concurrency. We can see that as the input scales up, the average time to process each split tends toward a constant for each concurrency level. The gap between consecutive lines approaches a fixed value as well.

Figure 5: Average Map Task Time
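The observations of Sections 4.1 through 4.4 can be combined into a rough back-of-the-envelope model of the map phase. The gap times below use our measured models (rack-local: normal with mean 13000 ms; data-local: 300 ms constant), while the base per-task time and the rack-local fraction are illustrative assumptions:

```python
import math

def mean_task_time_ms(concurrency, base_ms=9000.0):
    """Above ~4 concurrent tasks per node, individual map-task time grows
    roughly in proportion to the concurrency level; base_ms (the per-task
    time at concurrency 4) is an illustrative value."""
    return base_ms * max(concurrency, 4) / 4.0

def expected_gap_ms(rack_local_frac=0.1):
    """Expected pre-task gap using the paper's models: 13000 ms mean for
    rack-local tasks, 300 ms constant for data-local ones. The rack-local
    fraction is an illustrative assumption."""
    return rack_local_frac * 13000.0 + (1.0 - rack_local_frac) * 300.0

def map_phase_time_ms(n_tasks, concurrency, base_ms=9000.0):
    # A balanced scheduler runs the tasks in ~n_tasks/concurrency "waves".
    waves = math.ceil(n_tasks / concurrency)
    return waves * (mean_task_time_ms(concurrency, base_ms) + expected_gap_ms())

# Aggregate map-phase time stays roughly flat for concurrency 4..8:
for c in (4, 6, 8):
    print(c, round(map_phase_time_ms(256, c)))
```

Because the per-task slowdown cancels the extra parallelism, the aggregate estimate varies only by the (non-scaling) gap term across concurrency levels 4 through 8, matching Figure 6's flat curves.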

In Figure 6 below, we relate the total running time of a job to the concurrency level. Surprisingly, once the input scales up, the job running time is almost the same at every level. When the concurrency level is low, the overhead becomes significant and affects performance.

Figure 6: Total Running Time of a Job

5. SHUFFLE PHASE
In the shuffle phase, the runtime must transfer the intermediate key-value pairs from all of the map tasks to the particular reduce task responsible for a given set of keys. Note that any of the map tasks might produce any particular key, so this transfer could potentially be all-to-all between map tasks and reduce tasks. Ultimately, the key-value pairs for a given reduce task must be on its local cluster node. The merge phase, which combines into sorted order the set of keys for a reducer, cannot start before the shuffle phase completes. For the shuffle phase to start, reduce tasks must be scheduled for execution on a cluster node in the system. Due to resource limitations, the scheduler might not launch all the reduce tasks at the same time, but once a reduce task starts, it will check the intermediate outputs from all the map tasks for keys that it is responsible for. Hence, the analysis in this section is based on the shuffling performance of each individual reduce task, not the aggregate. Reduce tasks can be scheduled before the map phase completes, and the point during the map phase at which such scheduling is permitted is governed by a Hadoop parameter known as slowstart. This parameter gives the fraction of map tasks that must have completed before the first reduce task is scheduled. So if slowstart equals 1.0, no reduce task may be scheduled until all map tasks have completed.
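The slowstart rule reduces to a simple threshold test on the completed fraction of map tasks, sketched below (in Hadoop 2 the threshold is set through the mapreduce.job.reduce.slowstart.completedmaps property, whose default is 0.05):

```python
def may_schedule_reducers(completed_maps, total_maps, slowstart=1.0):
    """Reduce tasks become schedulable once the completed fraction of map
    tasks reaches the slowstart threshold."""
    return completed_maps / total_maps >= slowstart

# With slowstart = 1.0 (as in our experiments), reducers wait for every
# map task; with the default of 0.05 they start almost immediately.
print(may_schedule_reducers(99, 100, slowstart=1.0))   # False
print(may_schedule_reducers(5, 100, slowstart=0.05))   # True
```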
In practice, a slowstart of 1.0 is inefficient and can significantly increase the time of the shuffle phase, because the data for all reduce tasks must then be transferred over the network at the same time, possibly using up the full capacity of the network. On the other hand, if slowstart is too low, it is likely that there is no data to copy yet, and the scheduled reduce tasks will simply occupy the resources of the computing node without doing any work. Figure 7 illustrates how slowstart works.

Figure 7: Slowstart

When slowstart is not 1.0, a portion of the shuffle copying work overlaps with the remaining map tasks of the map phase. Measuring the time from the completion of the map phase to the start of the merge phase is thus not representative of the actual work. So, contrary to normal practice, we set slowstart to 1.0 in our experiments to simplify the problem and allow measurement of the entirety of the shuffle phase. We leave it to future research to model the overlap and its effect on the overall performance of a MapReduce application. We expected that the time for the shuffle phase would depend on the volume of data encompassed by the aggregate of all the

key-value pairs in the intermediate data. We constructed benchmarks to control both the number of key-value records produced by map tasks and the size of those records. In addition, we wanted to see whether the number of distinct keys in the domain of output keys would have a performance impact, so we allowed our benchmark to control that as well. Finally, we wanted to control the number of reduce tasks employed by the reduce phase. After exploring all these potential factors, we found through our experimentation that the only factor significantly affecting the shuffle time was the total number of bytes involved in the shuffle phase (i.e., the combination of the number of key-value records and their size). In our experiments, the rate for shuffling the data was almost constant across our combinations of experimental setups varying the parameters defined above. The mean shuffle rate was ms/byte.

6. MERGE PHASE
The merge phase begins after all of the reduce tasks finish their shuffling and must complete before the reduce phase may begin. Each reduce task performs its merging independently, so the following analysis applies to the merge phase of every single reduce task. We use the same benchmark as in the shuffle phase to control the number and size of the key-value pairs, the key distribution, and the number of reducers. The result is similar to the shuffle phase: the major factor affecting the merge phase running time is the volume of data, under the assumption that the workload is evenly spread over all the reducers. In the merge phase, we do not need to consider the effect of slowstart, so we use linear regression to approximate the merge time. Using the experimental results, the linear regression is y = x, where x is the number of bytes and y is the merge time in milliseconds.
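Both the shuffle and merge models thus take the simple linear form time = a·bytes (+ b). As a sketch, they can be expressed directly; the coefficients below are hypothetical placeholders standing in for values fitted on a particular cluster:

```python
def shuffle_time_ms(total_bytes, rate_ms_per_byte=2.0e-5):
    """Shuffle time modeled as bytes times a constant rate, per the finding
    that total byte volume is the dominant factor. The rate here is a
    hypothetical placeholder; measure it on your own cluster."""
    return total_bytes * rate_ms_per_byte

def merge_time_ms(total_bytes, slope_ms_per_byte=1.0e-5, intercept_ms=500.0):
    """Merge time modeled by a linear regression y = a*x + b on byte count,
    with hypothetical coefficients in place of the fitted values."""
    return slope_ms_per_byte * total_bytes + intercept_ms

gb = 1 << 30
print(round(shuffle_time_ms(4 * gb)))  # shuffle estimate for 4 GB, in ms
print(round(merge_time_ms(4 * gb)))    # merge estimate for 4 GB, in ms
```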
The r value of this linear regression is .

Figure 8: Summation of Merge Time

7. REDUCE PHASE
The workflow of the reduce phase parallels the map phase in many ways; for example, both use a customized deterministic function as their work function. Hence, we use the same method to analyze the reduce phase. The input of the reduce phase is the output of the map phase, which has already been transferred to the computing node by the shuffle and merge phases. Hence, unlike in the analysis of the map phase, all the data are local, and so locality is not a factor. Another difference between the map phase and the reduce phase is that the user can specify the number of reduce tasks, whereas in the map phase the number of map tasks depends on the input size and the block size. As we discuss in the Methods section, we assume the partitioner works properly, which means all the reduce tasks have a similar amount of work, and so we can use the expected value to estimate the workload of each reduce task. The factors we find to affect a reduce task's performance are the following: (1) the expected size of the input, and (2) the number of reducers.

We use the benchmark mentioned in the previous section to experiment with these two factors. However, instead of analyzing them separately, we choose to examine the correlation between the volume of input and the sum of reduce task times (looking at reduce time in aggregate).

Figure 9: Summation of Reduce Time

According to our experiments, the sum of all reduce task times is directly proportional to the number of bytes entering the reduce phase. Hence, we argue that the time to process one byte is fixed. With fewer reduce tasks, each reduce task can expect to handle more bytes, and thus takes longer to finish. Therefore, the running time is inversely proportional to the number of reducers. Figure 9 shows the relationship between input size and the sum of reduce time, and clearly a linear approximation applies in this case. The linear regression we obtained is y = x, where x is the number of bytes and y is the running time in milliseconds. The r value is .

8. RELATED WORK
Previous research gives us a big picture of MapReduce's performance. Wottrich [5] focused on how input size and output size influence the running time of the MapReduce process, and analyzed the effect of overhead, which can significantly reduce performance. Bressoud and Kozuch [1] investigated the performance characteristics in the presence of failures and built a discrete event simulator, CFTsim (Cluster Fault Tolerance simulator).

9. CONCLUSION
In this project, we studied the major factors that impact the running time of each Hadoop phase. Due to the differences in the functionality and mechanism of each phase, those factors can differ. By aggregating these analyses, we obtain a big picture of how Hadoop performs when running a job.
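Aggregating the per-phase analyses into one rough end-to-end estimate can be sketched as follows; every rate, the slot count, and the spill ratio below are hypothetical placeholders that would have to be fitted on a real cluster:

```python
import math

def estimate_job_time_ms(input_bytes, block_bytes=128 << 20, parallel_slots=16,
                         n_reducers=8, map_ms_per_block=9000.0,
                         shuffle_ms_per_byte=2.0e-5, merge_ms_per_byte=1.0e-5,
                         reduce_ms_per_byte=1.5e-5, spill_ratio=1.0):
    """Chain the per-phase models into one job-time estimate, with
    slowstart = 1.0 so the phases run back to back. All numeric defaults
    are hypothetical placeholders, not measured values."""
    n_maps = math.ceil(input_bytes / block_bytes)        # one map task per block
    map_waves = math.ceil(n_maps / parallel_slots)       # tasks run in "waves"
    map_time = map_waves * map_ms_per_block
    inter_bytes = input_bytes * spill_ratio              # intermediate volume
    shuffle_time = inter_bytes * shuffle_ms_per_byte     # linear in bytes shuffled
    per_reducer = inter_bytes / n_reducers               # balanced partitioner assumed
    merge_time = per_reducer * merge_ms_per_byte         # reducers merge in parallel
    reduce_time = per_reducer * reduce_ms_per_byte       # inversely prop. to n_reducers
    return map_time + shuffle_time + merge_time + reduce_time

print(round(estimate_job_time_ms(8 << 30)))  # estimate for an 8 GB input, in ms
```

A model of this shape lets a user compare configurations (block size, reducer count, concurrency) before running the job, which is precisely the use case argued for below.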
Although the input size has a huge influence on the total job time, by tuning configuration parameters such as block size and level of concurrency, a user can minimize the overheads and fully use the capacity of a Hadoop system, and thus improve the performance of a job. Using the statistical models built on our analysis, a user can estimate a Hadoop job's running time before actually running the job, as well as experiment with different configurations to find an optimal configuration for the job.

10. FUTURE DIRECTION
As the next step of this project, we are going to relax the assumptions stated in Section 3.1. For instance, we will allow varying workloads among reducers and see how that affects our analysis. We considered two future directions to extend our work. The first is embedding fault tolerance into both the model and the discrete event simulation being built to model those aspects that are not amenable to individual analysis. Hadoop is designed to be fault-tolerant, but the fault-tolerance mechanism has a large impact on performance because it involves re-execution and time-outs. We expect different failure

rates will impact application performance in non-linear ways, and we want to understand the limits of the Hadoop mechanisms for fault-tolerance. Another direction of investigation is to explore how Hadoop performance is impacted when scaling to very large systems. The prevailing trend is for businesses to use cloud computing for hosting large parallel applications, and these may perform differently than our work has shown for local clusters. Additional overheads in the underlying storage and in the use of virtual hosts employed in such cloud-based systems will impact any predictions of performance.

11. ACKNOWLEDGEMENTS
This research was supported by funds from the Laurie Bukovac and David Hodgson Endowed Fund of Denison University. Denison students Yifu Zhao and Wei Peng helped build and configure the Beowulf cluster used for this research.

12. REFERENCES
[1] Thomas C. Bressoud and Michael A. Kozuch. Cluster Fault-Tolerance: An Experimental Evaluation of Checkpointing and MapReduce through Simulation. In IEEE International Conference on Cluster Computing (Cluster '09).
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating System Design and Implementation, Volume 6, OSDI '04. USENIX Association.
[3] Tom White. Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. O'Reilly, Sebastopol, CA, 3rd edition, 2012.
[4] Apache Hadoop. September.
[5] K. Wottrich and T. Bressoud. The Performance Characteristics of MapReduce Applications on Scalable Clusters. In Proceedings of the Midstates Conference for Undergraduate Research in Computer Science and Mathematics (MCURCSM).
[6] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10. IEEE Computer Society.
[7] Jared Gray and Thomas C. Bressoud. Towards a MapReduce Application Performance Model. In Proceedings of the 2012 Midstates Conference on Undergraduate Research in Computer Science and Mathematics (MCURCSM 2012), Delaware, OH.


More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Hadoop Scheduler w i t h Deadline Constraint

Hadoop Scheduler w i t h Deadline Constraint Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

An improved task assignment scheme for Hadoop running in the clouds

An improved task assignment scheme for Hadoop running in the clouds Dai and Bassiouni Journal of Cloud Computing: Advances, Systems and Applications 2013, 2:23 RESEARCH An improved task assignment scheme for Hadoop running in the clouds Wei Dai * and Mostafa Bassiouni

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Efficient Data Replication Scheme based on Hadoop Distributed File System

Efficient Data Replication Scheme based on Hadoop Distributed File System , pp. 177-186 http://dx.doi.org/10.14257/ijseia.2015.9.12.16 Efficient Data Replication Scheme based on Hadoop Distributed File System Jungha Lee 1, Jaehwa Chung 2 and Daewon Lee 3* 1 Division of Supercomputing,

More information

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13 How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).

More information

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Benchmarking Hadoop & HBase on Violin

Benchmarking Hadoop & HBase on Violin Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT

METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT METHOD OF A MULTIMEDIA TRANSCODING FOR MULTIPLE MAPREDUCE JOBS IN CLOUD COMPUTING ENVIRONMENT 1 SEUNGHO HAN, 2 MYOUNGJIN KIM, 3 YUN CUI, 4 SEUNGHYUN SEO, 5 SEUNGBUM SEO, 6 HANKU LEE 1,2,3,4,5 Department

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

MapReduce and Hadoop Distributed File System V I J A Y R A O

MapReduce and Hadoop Distributed File System V I J A Y R A O MapReduce and Hadoop Distributed File System 1 V I J A Y R A O The Context: Big-data Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009) Google collects 270PB data in a month (2007), 20000PB

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

A Comparison of Join Algorithms for Log Processing in MapReduce

A Comparison of Join Algorithms for Log Processing in MapReduce A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel Computer Sciences Department University of Wisconsin-Madison {sblanas,jignesh}@cs.wisc.edu Vuk Ercegovac,

More information

MAPREDUCE [1] is proposed by Google in 2004 and

MAPREDUCE [1] is proposed by Google in 2004 and IEEE TRANSACTIONS ON COMPUTERS 1 Improving MapReduce Performance Using Smart Speculative Execution Strategy Qi Chen, Cheng Liu, and Zhen Xiao, Senior Member, IEEE Abstract MapReduce is a widely used parallel

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

The Improved Job Scheduling Algorithm of Hadoop Platform

The Improved Job Scheduling Algorithm of Hadoop Platform The Improved Job Scheduling Algorithm of Hadoop Platform Yingjie Guo a, Linzhi Wu b, Wei Yu c, Bin Wu d, Xiaotian Wang e a,b,c,d,e University of Chinese Academy of Sciences 100408, China b Email: wulinzhi1001@163.com

More information

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Storage and Retrieval of Data for Smart City using Hadoop

Storage and Retrieval of Data for Smart City using Hadoop Storage and Retrieval of Data for Smart City using Hadoop Ravi Gehlot Department of Computer Science Poornima Institute of Engineering and Technology Jaipur, India Abstract Smart cities are equipped with

More information

MAPREDUCE Programming Model

MAPREDUCE Programming Model CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

Matchmaking: A New MapReduce Scheduling Technique

Matchmaking: A New MapReduce Scheduling Technique Matchmaking: A New MapReduce Scheduling Technique Chen He Ying Lu David Swanson Department of Computer Science and Engineering University of Nebraska-Lincoln Lincoln, U.S. {che,ylu,dswanson}@cse.unl.edu

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu

MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,

More information

A Study on Data Analysis Process Management System in MapReduce using BPM

A Study on Data Analysis Process Management System in MapReduce using BPM A Study on Data Analysis Process Management System in MapReduce using BPM Yoon-Sik Yoo 1, Jaehak Yu 1, Hyo-Chan Bang 1, Cheong Hee Park 1 Electronics and Telecommunications Research Institute, 138 Gajeongno,

More information

Big Data in the Enterprise: Network Design Considerations

Big Data in the Enterprise: Network Design Considerations White Paper Big Data in the Enterprise: Network Design Considerations What You Will Learn This document examines the role of big data in the enterprise as it relates to network design considerations. It

More information

Research on Job Scheduling Algorithm in Hadoop

Research on Job Scheduling Algorithm in Hadoop Journal of Computational Information Systems 7: 6 () 5769-5775 Available at http://www.jofcis.com Research on Job Scheduling Algorithm in Hadoop Yang XIA, Lei WANG, Qiang ZHAO, Gongxuan ZHANG School of

More information

Use of Hadoop File System for Nuclear Physics Analyses in STAR

Use of Hadoop File System for Nuclear Physics Analyses in STAR 1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

More information

Survey on Load Rebalancing for Distributed File System in Cloud

Survey on Load Rebalancing for Distributed File System in Cloud Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Virtual Machine Based Resource Allocation For Cloud Computing Environment

Virtual Machine Based Resource Allocation For Cloud Computing Environment Virtual Machine Based Resource Allocation For Cloud Computing Environment D.Udaya Sree M.Tech (CSE) Department Of CSE SVCET,Chittoor. Andra Pradesh, India Dr.J.Janet Head of Department Department of CSE

More information

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks

Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Network-Aware Scheduling of MapReduce Framework on Distributed Clusters over High Speed Networks Praveenkumar Kondikoppa, Chui-Hui Chiu, Cheng Cui, Lin Xue and Seung-Jong Park Department of Computer Science,

More information

HDFS Space Consolidation

HDFS Space Consolidation HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

More information

What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

PART III. OPS-based wide area networks

PART III. OPS-based wide area networks PART III OPS-based wide area networks Chapter 7 Introduction to the OPS-based wide area network 7.1 State-of-the-art In this thesis, we consider the general switch architecture with full connectivity

More information

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System

Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Energy-Saving Cloud Computing Platform Based On Micro-Embedded System Wen-Hsu HSIEH *, San-Peng KAO **, Kuang-Hung TAN **, Jiann-Liang CHEN ** * Department of Computer and Communication, De Lin Institute

More information

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray ware 2 Agenda The Hadoop Journey Why Virtualize Hadoop? Elasticity and Scalability Performance Tests Storage Reference

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information