Analysis and Modeling of MapReduce's Performance on Hadoop YARN

Qiuyi Tang
Dept. of Mathematics and Computer Science, Denison University
tang_j3@denison.edu

Dr. Thomas C. Bressoud
Dept. of Mathematics and Computer Science, Denison University
bressoud@denison.edu

ABSTRACT
With the rapid growth of technology, scientists have recognized the challenge of efficiently analyzing large data sets since the beginning of the 21st century. Increases in data volume and data complexity have shifted scientists' focus to parallel, distributed algorithms running on clusters. In 2004, Jeffrey Dean and Sanjay Ghemawat of Google introduced a new programming model for storing and processing large data sets, called MapReduce [2]. Apache Hadoop, an open-source software framework that uses MapReduce as its data-processing layer, was developed at Yahoo as early as 2006 and evolved into a stable platform by 2011. Although Hadoop has been widely used in industry, its performance characteristics are not well understood. This paper, following Hadoop's workflow, analyzes the factors that influence the running time of each phase of a Hadoop execution. Given those factors, our goal is to model the performance of MapReduce applications.

1. INTRODUCTION
MapReduce provides an abstraction for distributed processing of large data sets on clusters of computers by hiding the details of scheduling, resource management, and fault tolerance; it is therefore easier for software engineers to develop applications by focusing on just the map and reduce tasks (explained below). As a realization of the MapReduce model, Hadoop YARN (Yet Another Resource Negotiator) was first introduced in Apache Hadoop 0.23.0 on November 11, 2011, in order to address scalability and other issues of the earlier realization. This version became generally available in October 2013. However, the performance characteristics of the runtime system are not well studied, because they depend on complicated interactions among the network, the computing nodes, and the distributed file system, as well as on a tremendous number of system configuration parameters. Further, Hadoop may perform differently as clusters grow in size and faults become more frequent. Our objective was to analyze the performance of Hadoop applications on a scalable local cluster and ultimately build a model that predicts the running time based on input size, system properties, and configuration parameters. This paper describes the factors that affect the performance of each MapReduce phase and suggests methods to aggregate them into a model.

2. BACKGROUND

2.1 Hadoop YARN
Hadoop YARN is the second generation of Hadoop. Compared to the first generation, YARN performs better as the input scales up because it decouples the centralized resource manager and scheduler, formerly combined with application task control, into separate components: a resource manager and a per-application application master.
The resource manager takes care of job scheduling and resource allocation, in cooperation with distributed node-local management, while the application master monitors task progress for a single application. The new system can run multiple application masters to enable multiple concurrent jobs. In addition to the resource manager and application masters, Hadoop provides HDFS (the Hadoop Distributed File System), which is used to store data and to move it from node to node.

2.2 Hadoop Workflow
A Hadoop job can be partitioned into three phases: the map phase, the shuffle/merge phase, and the reduce phase. Prior to running a job, the user must ensure that the job's data has been transferred into the distributed file system (HDFS). The data set is broken into blocks, and the data is spread among the computers of the cluster. HDFS performs this distribution regardless of the number or size of the files.

Before the application may begin, the resource manager considers the input data and associates a map task with each block-size share of the input data. The input for a map task is called a split. Each map task interprets its split as a collection of key-value pairs and runs a map algorithm written by the user. The outputs of the map function, usually referred to as intermediate output, are a set of key-value pairs, which are stored locally in a buffer or on disk. To enable aggregation of these results in a structured way, all intermediate data with the same key is processed by a single reduce task, to be discussed below.

In most discussions of MapReduce, shuffle and merge are treated as a single phase, but they actually have different responsibilities, and hence we consider them as two separate phases. The shuffle phase copies the intermediate output from the map tasks to the node of the reducer responsible for each key. As the copying proceeds and the intermediate outputs are gathered on the reducer's node, a background thread launches to merge the intermediate outputs for a given reducer into a large file sorted by key. This process is known as the merge phase, and some refer to it as the sort phase. One thing to notice is that the shuffle phase can start while the map phase is running, but the merge phase initiates after the shuffle completes and must terminate before the reduce phase can start, so that every reducer knows that it has the entire set of intermediate outputs associated with its keys.

The last part of the workflow is the reduce phase. The reduce task processes the large, sorted intermediate output file, which is now locally stored, according to the customized reduce algorithm, and emits final output key-value pairs. Those final outputs are stored back in HDFS, where the user can analyze them. The phases of this process are illustrated in Figure 1. The input is on the far left, where each split is given to a separate map task. Each map task generates intermediate output, which is then distributed by key to the set of reducers. Each reducer processes its part of the key set and produces the final output.

Figure 1: Hadoop Workflow
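To make the map and reduce roles concrete, the following is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary programs reading key-value pairs on standard input and writing them on standard output. The word-count task, the script names, and the use of Streaming (rather than the native Java API) are illustrative assumptions and not the benchmark used in this paper.

```python
#!/usr/bin/env python3
"""Word-count sketch in the Hadoop Streaming style.

In practice mapper() and reducer() would live in two separate scripts
(e.g. mapper.py and reducer.py) passed to the hadoop-streaming jar with
-mapper and -reducer; Hadoop sorts the mapper output by key before the
reducer sees it.
"""
import sys


def mapper():
    # Emit one (word, 1) pair per word; text before the first tab is the key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")


def reducer():
    # Input arrives sorted by key, so all counts for a word are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word == current:
            count += int(value)
        else:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, int(value)
    if current is not None:
        print(f"{current}\t{count}")
```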
3. METHODS

3.1 Assumptions
In this paper, we make several assumptions to simplify our analysis by controlling variables. These assumptions are not always realistic, and the plan is to relax them in future research.

(1) The Hadoop job is fault-free. Hadoop is designed to be fault-tolerant, which means a job can complete correctly even in the presence of failed map or reduce tasks. However, failures become more frequent as the size of the cluster increases. Although the fault-tolerance mechanism is completely hidden from the programmer's view, it has a significant impact on the running time, since it involves re-execution of map or reduce tasks and a time-out mechanism. This is the subject of future study.

(2) All configuration properties not mentioned in this paper take their default values. Since our goal is to identify the significant performance factors, we only consider the configuration parameters that impact those factors. Any configuration parameter that we do not experiment with is left at the default value defined by the Hadoop system. One property to notice is short-circuit local read, which lets a client read a file directly without going through the datanode, reducing data transfer time. In our experiments, we leave dfs.client.read.shortcircuit at its default, which means short-circuit reads are not allowed.

(3) The network and cluster operate under non-exceptional conditions during application execution. Saturation of the network, resulting in switch congestion, packet loss, exponential backoff, timeouts, and other exceptional conditions, causes extreme variation in data transfer times and also results in the fault-tolerance mechanisms being invoked. These performance effects, by their nature, are difficult to model. Thus, we assume application execution under non-exceptional network conditions.

(4) The set of intermediate keys and the number of intermediate key-value pairs destined for reduce tasks result in a balanced workload among the reduce tasks. Keys are assigned to reducers through a Hadoop class called a partitioner. If the distribution of intermediate keys is not uniform, or if the number of key-value pairs for a given key varies widely, some reduce tasks will have considerably more work than others and can take significantly more time. Hence, we assume an effective partitioner that spreads the intermediate outputs evenly over all the reduce tasks, in which case we can model the running time of the reducers statistically using expected values.

(5) The size of the input data is large enough that the input for any given map task has a high probability of being a full block in size. We want to use the block size as the input size of each map task. If the capacity of a block is not fully used, the estimated running time can be longer than the actual running time. Since we are interested in large data applications, this assumption is easily met.

3.2 System Equipment
In this research, we use a Beowulf cluster of 32 computers running Ubuntu 14.04. Sixteen of the computers have Intel Quad Core i5-4570 CPUs with processor speeds of 3.20 GHz, 451 GB of disk storage, and 8 GB of memory. The other sixteen computers have Intel Core 2 Quad CPUs with processor speeds of 2.66 GHz, 451 GB of disk storage, and 4 GB of memory. The network switch is a gigabit Cisco switch. Each node runs only the OS and Hadoop, with no monitoring software, in order to measure the performance accurately.
4. MAP PHASE
As described earlier, the map phase is the first phase of a Hadoop application, during which the input data is partitioned and processed by the set of map tasks. Due to the potentially large volume of input, the map phase can take a long time to finish. To account for significant performance factors, our goal is to model the running time of each map task. From this we can aggregate map task execution to arrive at predictions for the full map phase. So we wish to measure and then model the time between map task launch and map task finish. From experimentation on our Beowulf cluster, we identified the most significant factors in map task execution time as:

Block size: the size of the input split for each map task.

Locality of the input split: When possible, Hadoop executes a map task on the same node where its input data resides in HDFS. When this is not possible, the data must be transferred to the location of the map task.

Level of concurrency: With each cluster node running four hyperthreaded cores, the resource allocation and scheduling should execute multiple map tasks on any given cluster node. But there is a tradeoff between too few concurrent activities and too many, and both affect the performance of the map tasks.

4.1. Randomness of the Running Time
In the real world, the running time of map tasks can vary over a significant range, even under the same configuration and provisioning, but we may model the performance by a probability distribution. A series of experiments showed that the running time distribution (exclusive of the above performance factors) is similar to a normal distribution. However, the running time distribution has a long right tail, indicating that there are outliers taking much more time than the average. This right tail skews the mean and standard deviation. By shifting the mean to the left and using a smaller standard deviation, it is possible to fit a normal distribution to the observed running time distribution.

Figure 2: Map Task Running Time Distribution

Figure 2 illustrates an example of an experimental map task run time distribution. This is a histogram obtained from running 4096 map tasks with a block size of 128 MB and a concurrency level of 6. In the graph, there is clearly a peak and a long right tail. If we consider only the peak part of the graph, we can fit a normal distribution to it. In this graph, the fitted normal distribution has a mean of 12963 ms and a standard deviation of 1382 ms.
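The peak-only fitting procedure of Section 4.1 can be sketched as follows. The 95th-percentile cutoff used to discard the right tail is our own illustrative choice; any similar rule that removes the outliers gives a comparable fit.

```python
import numpy as np


def fit_peak_normal(times_ms, tail_cutoff_pct=95):
    """Fit a normal distribution to observed map task times after discarding
    the long right tail, mirroring the peak-only fit of Section 4.1.

    times_ms        -- observed map task running times in milliseconds
    tail_cutoff_pct -- percentile above which observations are treated as
                       outliers (an assumed, illustrative threshold)
    """
    times = np.asarray(times_ms, dtype=float)
    cutoff = np.percentile(times, tail_cutoff_pct)
    peak = times[times <= cutoff]              # keep only the histogram peak
    return peak.mean(), peak.std(ddof=1)       # mean and stdev of the fit

# For the 4096-task run of Figure 2 (128 MB blocks, concurrency 6), this kind
# of fit yields roughly mean = 12963 ms and stdev = 1382 ms.
```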
4.2. Block Size
In Hadoop, the block size, which is the size of the input to a map task, can be altered by changing configuration parameters for the execution of HDFS. In practice, people tend to use a 64 MB or 128 MB block size, because too small a block size causes significant overhead, while too large a block size causes long-running map tasks with greater variation in their run times, which can result in performance degradation when the fault-tolerance mechanism is invoked.

As one might expect, our benchmark shows that the running time of a map task increases linearly with the block size. This indicates that the overhead of the runtime in initiating the transfer of a block is small compared to the transfer time for blocks of this size. In a series of experiments investigating this performance factor, we created 1024 files in HDFS, each 1024 MB in size. To keep the level of concurrency fixed, the memory available on each node is 10 GB while each map task takes 2 GB, allowing five concurrent map tasks on any given cluster node. One thing to notice is that the memory of each node (10 GB) is physical memory, while the memory of each map task (2 GB) is virtual memory. The parameter yarn.nodemanager.vmem-pmem-ratio converts the physical memory to virtual memory for each node manager. Hence, we leave space for the node manager and datanode in virtual memory and still have enough space to run the map tasks. Table 1 summarizes the results of these experiments, showing that every time we double the block size, the average map time approximately doubles.

Block Size (MB)    Average map time (sec)
64                 4.56
128                11.25
256                24.30

Table 1: Map task time and block size
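A least-squares line through the three points of Table 1 makes the roughly proportional relationship explicit. This is a sketch over the Table 1 data only, not a general model.

```python
import numpy as np

# Table 1: average map task time versus block size.
block_mb = np.array([64.0, 128.0, 256.0])
avg_sec = np.array([4.56, 11.25, 24.30])

# Least-squares fit: avg_sec ~= slope * block_mb + intercept.
slope, intercept = np.polyfit(block_mb, avg_sec, 1)


def predict_map_time_sec(block_size_mb):
    """Predicted average map task time (seconds) for a block size in MB,
    under the linear model fit to Table 1; block sizes outside 64-256 MB
    are an extrapolation."""
    return slope * block_size_mb + intercept
```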
4.3. Locality of an Input Split
During execution of the map phase, the resource manager, in cooperation with the application master and HDFS, must schedule map tasks on cluster nodes with sufficient memory and CPU resources. Priority is given to scheduling map tasks on the node on which their required input data resides.¹ Such a case, where the map task and its data are co-located on the same cluster node, is known as data-local. However, depending on the memory and CPU resources, such co-location is not always possible. Instead of deferring map tasks, Hadoop then schedules a map task on a different node in the cluster. In this case, the data must be transferred from the cluster node where the data resides to the cluster node chosen for the execution of the map task. Assuming that, in the network topology, the location of the map task and the location of the data are on the same network rack, this case is known as rack-local. The data transfer in the rack-local case is a significant performance factor and must be modeled. We refer to the transfer time as the communication gap. Once a map task obtains its input, it processes it in the normal way. To address the performance factor of locality, we measure and model the communication gap of data-local map tasks versus rack-local map tasks.

Consider the graph of map task execution in Figure 4. The x-axis is time, and the y-axis is the set of distinct map tasks, sorted by ID. This graph shows the flow of the map phase on a single node. If one were to draw a vertical line at a particular time t, one could see the number of concurrent map tasks. In this example, the concurrency level is constrained to four map tasks. As long as Hadoop can schedule data-local map tasks, it will do so. Thus, on the graph we see many data-local map tasks before a rack-local map task. The graph is annotated with the first such instance, along with the interval of time we call the communication gap.

¹ This is an example of moving the processing to the data, rather than moving the data to the location of processing.

Figure 4: Map Task Flow on a Node

By conducting a series of experiments and measuring the communication gap preceding rack-local map tasks, we found that this time does not change appreciably with changes in concurrency level as long as the network is not saturated. In addition, we found that data-local map tasks also have a small gap before them, which we attribute to the overhead of the runtime system in supplying the local data. A normal distribution fits the rack-local gap well over a range of other configuration setups. We model the gap time of rack-local map tasks by a normal distribution with mean = 13000 ms and standard deviation = 3000 ms. For data-local map tasks, we use a constant 300 ms.

4.4. Concurrency Level
The maximum number of tasks in execution on a given cluster node, which we refer to as the concurrency level, may be controlled by specifying Hadoop parameters that inform the resource manager of the total memory on each compute node and the memory required for any given task. We expected that, in an underutilized system, one in which the concurrency level is low and more could be done without adversely affecting the performance of already-executing tasks, individual map task times would be fast but overall execution time would suffer. Our experimentation bore out these expectations. When the concurrency level is 3 or lower, individual map task time is at its best, but the aggregate time of the map phase suffers. But we wanted to understand the effect of further increasing the concurrency level. With greater levels of concurrency, the mean time for individual map tasks may increase, but, with more tasks running in parallel, what is the effect on the aggregate time of the map phase?

By manipulating the Hadoop configuration parameters as described above, we were able to experiment with concurrency levels of 1 through 8. For concurrency levels of 4 through 8, the aggregate map phase time remained essentially unchanged for a fixed input size, meaning that the average map task time had a directly proportional relationship with the concurrency level. For example, individual map task time at concurrency level 8 is twice as long as at concurrency level 4, but since twice as many tasks run at any given point in time, the aggregate time remains the same. Consider the graph of Figure 5. Each line represents a certain level of concurrency. We can see that as the input scales up, the average time to process each split tends to a constant for each concurrency level. The gap between consecutive lines approaches a fixed value as well.

Figure 5: Average Map Task Time
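Putting the pieces of the map phase together, the following sketch samples a single map task time as a normally distributed processing time (parameters fit as in Section 4.1 for a given block size and concurrency level) plus the locality gap measured above, and estimates the aggregate map phase time by assuming tasks run in idealized waves of nodes × concurrency. The wave assumption and the function names are ours, not part of the Hadoop runtime.

```python
import math
import numpy as np

rng = np.random.default_rng(0)


def sample_map_task_ms(base_mean_ms, base_std_ms, rack_local):
    """One map task time (ms): processing time ~ Normal(base_mean, base_std),
    plus the communication gap of Section 4.3 -- a constant 300 ms when
    data-local, Normal(13000, 3000) ms when rack-local."""
    gap_ms = rng.normal(13000.0, 3000.0) if rack_local else 300.0
    return gap_ms + rng.normal(base_mean_ms, base_std_ms)


def estimate_map_phase_ms(n_tasks, mean_task_ms, n_nodes, concurrency):
    """Rough aggregate map phase time, assuming tasks run in waves of
    n_nodes * concurrency tasks at a time (an idealized scheduling model)."""
    waves = math.ceil(n_tasks / (n_nodes * concurrency))
    return waves * mean_task_ms
```

Under this wave model, doubling the concurrency level while the mean task time also doubles leaves the aggregate unchanged, which matches the behavior observed for concurrency levels 4 through 8.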
In Figure 6 below, we relate the total running time of a job to the concurrency level. Surprisingly, once the input scales up, the job running time is almost the same across concurrency levels. When the concurrency level is low, the overhead becomes significant and affects the performance.

Figure 6: Total Running Time of a Job

5. SHUFFLE PHASE
In the shuffle phase, the runtime must transfer the intermediate key-value pairs from all of the map tasks to the particular reduce task responsible for a given set of keys. Note that any of the map tasks might produce any particular key, so this transfer is potentially an all-to-all exchange between map tasks and reduce tasks. Ultimately, the key-value pairs for a given reduce task must reside on its local cluster node. The merge phase, which combines the set of keys for a reducer into sorted order, cannot start before the shuffle phase completes.

For the shuffle phase to start, reduce tasks must be scheduled for execution on a cluster node in the system. Due to the limitation of resources, the scheduler might not launch all the reduce tasks at the same time, but once a reduce task starts, it checks the intermediate outputs from all the map tasks to see whether there are keys it is responsible for. Hence, the analysis in this section is based on the shuffling performance of each individual reduce task, and not the aggregate.

Reduce tasks can be scheduled before the map phase completes, and the point during the map phase when such scheduling is permitted is governed by a Hadoop parameter known as slowstart. This parameter gives the fraction of map tasks that must have completed before the first reduce task is scheduled. So if slowstart equals 1.0, no reduce task may be scheduled until all map tasks have completed. In practice, a slowstart of 1.0 is inefficient and can significantly increase the time of the shuffle phase, because the data for all reduce tasks must be transferred over the network at the same time, possibly using up the full capacity of the network. On the other hand, if slowstart is too low, it is likely that there is no data to copy yet, and the scheduled reduce tasks will occupy the resources of the computing node without doing any work. Figure 7 illustrates how slowstart works.

Figure 7: Slowstart

When slowstart is not 1.0, a portion of the shuffle copying overlaps with the still-executing map tasks of the map phase. Measuring the time from the completion of the map phase to the start of the merge phase is thus not representative of the actual work. So, contrary to normal practice, we set slowstart to 1.0 in our experiments to simplify the problem and allow measurement of the entirety of the shuffle phase. We leave it to future research to model the overlap and its effect on the overall performance of a MapReduce application.
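In current Hadoop releases the slowstart parameter is exposed as the job property mapreduce.job.reduce.slowstart.completedmaps. The sketch below shows how our setting of 1.0 could be passed to a Streaming job; the jar path, input and output paths, and script names are illustrative assumptions.

```python
import subprocess

# The streaming jar location varies by installation; this path is illustrative.
STREAMING_JAR = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"

cmd = [
    "hadoop", "jar", STREAMING_JAR,
    # slowstart: fraction of map tasks that must complete before reduce tasks
    # are scheduled.  1.0 defers all reducers until the map phase finishes,
    # as in our shuffle experiments; the Hadoop default is much lower.
    "-D", "mapreduce.job.reduce.slowstart.completedmaps=1.0",
    "-input", "/benchmark/input",      # illustrative HDFS paths
    "-output", "/benchmark/output",
    "-mapper", "mapper.py",
    "-reducer", "reducer.py",
    "-file", "mapper.py",              # ship the scripts to the cluster nodes
    "-file", "reducer.py",
]
subprocess.run(cmd, check=True)
```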
We expected that the time for the shuffle phase would depend on the volume of data encompassed by the aggregate of all the key-value pairs in the intermediate data. We constructed benchmarks to control both the number of key-value records produced by map tasks and the size of such records. In addition, we wanted to see whether the number of distinct keys in the domain of output keys would have a performance impact, so we allowed our benchmark to control that as well. Finally, we wanted to control the number of reduce tasks employed by the reduce phase. After exploring all these potential factors, we found through our experimentation that the only factor significantly affecting the shuffle time was the total number of bytes involved in the shuffle phase (i.e., the combination of the number of key-value records and their size). In our experiments, the rate (ms/byte) for shuffling the data was almost constant across the combinations of experimental setups varying the parameters defined above. The mean shuffle rate was 0.0000293983 ms/byte.

6. MERGE PHASE
The merge phase begins after all of the reduce tasks finish their shuffling and must complete before the reduce phase may begin. Each reduce task performs its merging independently, so the following analysis applies to the merge phase of every single reduce task. We use the same benchmark as in the shuffle phase to control the number and size of key-value pairs, the key distribution, and the number of reducers. The result is similar to the shuffle phase, in that the major factor affecting the merge phase running time is the volume of data, under the assumption that the workload is evenly spread over all the reducers. In the merge phase we do not need to consider the effect of slowstart, so we use linear regression to approximate the merge time. From the experimental results, the linear regression is

y = 1.345 × 10⁻⁵ x − 570163

where x is the number of bytes and y is the merge time in milliseconds. The r value of this linear regression is 0.96.

Figure 8: Summation of Merge Time
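The shuffle rate and the merge regression can be combined into a simple estimator, sketched below. The clamp at zero is our own guard: the negative intercept of the merge fit means the regression is meaningful only for the large data volumes used in our experiments.

```python
SHUFFLE_RATE_MS_PER_BYTE = 0.0000293983   # mean shuffle rate (Section 5)
MERGE_SLOPE_MS_PER_BYTE = 1.345e-5        # merge regression slope (Section 6)
MERGE_INTERCEPT_MS = -570163.0            # merge regression intercept


def estimate_shuffle_ms(shuffled_bytes):
    """Shuffle time (ms) for a reduce task, proportional to the bytes it
    copies from the map tasks."""
    return SHUFFLE_RATE_MS_PER_BYTE * shuffled_bytes


def estimate_merge_ms(merged_bytes):
    """Merge time (ms) from the Section 6 regression, fit to the summed merge
    times of Figure 8; clamped at zero for inputs below the fitted range."""
    return max(0.0, MERGE_SLOPE_MS_PER_BYTE * merged_bytes + MERGE_INTERCEPT_MS)
```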
7. REDUCE PHASE
The workflow of the reduce phase parallels the map phase in many ways; for example, both use a customized deterministic function as their work function. Hence, we use the same method to analyze the reduce phase. The input for the reduce phase is the output of the map phase, which has already been transferred to the computing node by the shuffle and merge phases. Hence, unlike in the analysis of the map phase, all the data are local, and locality is not a factor. Another difference between the map phase and the reduce phase is that the user can specify the number of reduce tasks, whereas the number of map tasks depends on the input size and the block size. As we discuss in the Methods section, we assume the partitioner works properly, which means all the reduce tasks have a similar amount of work, so we can use the expected value to estimate the workload of each reduce task. The factors we find that affect a reduce task's performance are the following:

(1) Expected size of input
(2) Number of reducers

We use the benchmark mentioned in the previous section to experiment with these two factors. However, instead of analyzing them separately, we choose to examine the correlation between the volume of input and the sum of reduce task time (that is, looking at reduce time in aggregate).

Figure 9: Summation of Reduce Time

According to our experiments, the sum of all reduce task times is directly proportional to the number of bytes entering the reduce phase. Hence, we argue that the time to process one byte is fixed. With fewer reduce tasks, each reduce task is expected to handle more bytes, and thus takes longer to finish. Therefore, the running time of a single reduce task is inversely proportional to the number of reducers. Figure 9 shows the relationship between input size and the sum of reduce time, and clearly a linear approximation applies in this case. The linear regression we obtained is

y = 8.865 × 10⁻⁵ x − 2911708

where x is the number of bytes and y is the running time in milliseconds. The r value is 0.97.
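A corresponding sketch for the reduce phase applies the regression above to the total input volume of the reduce phase and, under the balanced-partitioner assumption (4) of Section 3.1, divides by the number of reducers to obtain a per-reducer estimate. The zero clamp is again our own guard for inputs below the fitted range.

```python
REDUCE_SLOPE_MS_PER_BYTE = 8.865e-5   # regression slope from Figure 9
REDUCE_INTERCEPT_MS = -2911708.0      # regression intercept


def estimate_total_reduce_ms(reduce_input_bytes):
    """Sum of reduce task times (ms) over all reducers, from the Section 7
    regression; meaningful only for the large inputs of our experiments."""
    return max(0.0,
               REDUCE_SLOPE_MS_PER_BYTE * reduce_input_bytes
               + REDUCE_INTERCEPT_MS)


def estimate_per_reducer_ms(reduce_input_bytes, num_reducers):
    """With a balanced partitioner each reducer handles an equal share, so
    per-reducer time is inversely proportional to the number of reducers."""
    return estimate_total_reduce_ms(reduce_input_bytes) / num_reducers
```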
8. RELATED WORK
Previous research helps us gain a big-picture view of MapReduce's performance. Wottrich [5] focused on how input size and output size influence the running time of the MapReduce process, and also analyzed the overheads that can significantly degrade performance. Bressoud and Kozuch [1] investigated the performance characteristics in the presence of failures and built a discrete event simulator, CFTsim (Cluster Fault Tolerance simulator).

9. CONCLUSION
In this project, we studied the major factors that impact the running time of each Hadoop phase. Due to the differences in functionality and mechanism of each phase, those factors can differ. By aggregating these analyses, we obtain a big picture of how Hadoop performs when running a job. Although the input size has a large influence on the total job time, by tuning configuration parameters such as block size and level of concurrency, a user can minimize the overheads, fully use the capacity of a Hadoop system, and thus improve the performance of a job. Using the statistical models built on our analysis, a user can estimate a Hadoop job's running time before actually running the job, as well as experiment with different configurations to find an optimized configuration for the job.

10. FUTURE DIRECTION
For the next step of this project, we are going to relax the assumptions that we made in Section 3.1. For instance, we will allow varying workloads among reducers and see how that affects our analysis. We considered two further directions to extend our work. The first is embedding fault tolerance into both the model and the discrete event simulation being built to capture those aspects that are not amenable to individual analysis. Hadoop is designed to be fault-tolerant, but the fault-tolerance mechanism has a large impact on performance because it includes re-execution and time-outs. We expect different failure rates to impact application performance in non-linear ways, and we want to understand the limits of the Hadoop mechanisms for fault tolerance. Another direction of investigation is to explore how Hadoop performance is impacted when scaling to very large systems. The prevailing trend is for businesses to use cloud computing for hosting large parallel applications, and these may perform differently than our work has shown for local clusters. Additional overheads in the underlying storage and in the use of virtual hosts employed in such cloud-based systems will impact any predictions of performance.

11. ACKNOWLEDGEMENTS
This research was supported by funds from the Laurie Bukovac and David Hodgson Endowed Fund of Denison University. Denison students Yifu Zhao and Wei Peng helped build and configure the Beowulf cluster used for this research.

12. REFERENCES
[1] Thomas C. Bressoud and Michael A. Kozuch. Cluster Fault-Tolerance: An Experimental Evaluation of Checkpointing and MapReduce through Simulation. In Proceedings of the IEEE International Conference on Cluster Computing (Cluster '09), September 2009.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating System Design and Implementation, Volume 6, OSDI '04. USENIX Association, 2004.
[3] Tom White. Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. O'Reilly, Sebastopol, CA, 3rd Edition, 2012.
[4] Apache Hadoop. http://hadoop.apache.org, September 2015.
[5] K. Wottrich and T. Bressoud. The Performance Characteristics of MapReduce Applications on Scalable Clusters. In Proceedings of the Midstates Conference for Undergraduate Research in Computer Science and Mathematics (MCURCSM), 2011.
[6] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10. IEEE Computer Society, 2010.
[7] Jared Gray and Thomas C. Bressoud. Towards a MapReduce Application Performance Model. In Proceedings of the 2012 Midstates Conference on Undergraduate Research in Computer Science and Mathematics (MCURCSM 2012), Delaware, OH.