Do You Feel the Lag of Your Hadoop?




Yuxuan Jiang, Zhe Huang, and Danny H. K. Tsang
Department of Electronic and Computer Engineering
The Hong Kong University of Science and Technology, Hong Kong
Email: {yjiangad, ecefelix, eetsang}@ust.hk

Abstract. The configuration of a Hadoop cluster is critically important to its performance, because an improper configuration can greatly deteriorate job execution performance. Unfortunately, systematic guidelines on how to configure a Hadoop cluster are still missing. In this paper, we undertake an empirical study of the key operations and mechanisms of Hadoop job execution, including the task assignment strategy and speculative execution. Based on the experiments, we provide suggestions on system configuration, particularly on matching the hardware resource partitioning scheme to the job splitting granularity.

I. INTRODUCTION

Recent years have witnessed a rapidly increasing demand for large-scale data processing, such as webpage indexing, data mining, scientific simulation, spam detection and the like. For example, according to one study, Facebook processed several hundred terabytes of new data every day [1]. MapReduce [2] has emerged as a promising parallel processing framework for big data analytics, and Apache Hadoop [3] is its de facto open-source standard implementation. Hadoop has been adopted by numerous organizations throughout the world, including Twitter, eBay, Yahoo, Facebook, and Hulu [4]. Its popularity is illustrated by a recent report showing that a production Hadoop cluster operated by Yahoo successfully processed thousands of jobs from various users over a period of ten months [5].

Hadoop carries out enormous data analysis jobs on computing clusters in a scale-out manner. The framework parallelizes data analysis by separating the processing into two parts: Map tasks, which perform filtering and sorting, and Reduce tasks, which perform a summary operation. The performance of a Hadoop cluster depends on how Map and Reduce tasks are scheduled onto the separate nodes of the cluster. On the one hand, job schedulers [], [], [] have been proposed to address different performance issues in allocating resources to jobs. On the other hand, enhancing data locality through task-level scheduling inside jobs is also important [8]-[11]. However, resource allocation in Hadoop is complicated and depends on many system parameters and design decisions. Without careful fine-tuning, system performance can be far from optimal. Unfortunately, the resource allocation mechanisms in Hadoop are not documented in detail; for most users, they operate as black boxes. This motivates us to study these mechanisms empirically through extensive experiments.

The goal of this paper is to shed some light on how to properly configure the Hadoop system so as to improve job execution performance. To understand how Hadoop interacts with its system configuration, we investigate its detailed behavior. In particular, we are interested in the following issues: (1) the Map and Reduce task assignment preference; (2) the Hadoop speculative execution mechanism for tasks; (3) the hardware granularity of Hadoop task slots; and (4) the granularity of job splitting. More specifically, the task assignment preference determines which Map or Reduce task is assigned to which node for execution.
The speculative execution mechanism determines when and which task should be proactively duplicated as a backup for fault tolerance. The hardware granularity of Hadoop task slots decides how resources are partitioned and shared among multiple tasks, and the granularity of job splitting determines the size of each task of a job. These aspects are closely entangled, and an improper configuration can easily create a severe performance bottleneck. The official documents discuss the above issues only at a very high level []. Currently, users mainly rely on their own experience to come up with the configuration parameters. From the experiments in this paper, our key observations and conclusions include:

- Hadoop task assignment depends only on data locality. When data locality is taken out of consideration, performance bottlenecks emerge due to the lack of workload balancing.
- The Hadoop speculative execution mechanism is heuristic and simple. Imperative backup task execution may be delayed or prevented, while unnecessary backups are likely to be created.
- Matching the hardware resource partitioning granularity with the job splitting granularity significantly improves job execution performance. However, this matching requires the user to have detailed knowledge of both the jobs and the hardware configuration of the cluster.

The points above provide a general guideline for configuring the system parameters. More importantly, they open up new directions for improving the current Hadoop implementation.

The rest of this paper is organized as follows. Key factors in Hadoop job execution are presented in Section II. Our experimental settings are described in Section III. Experimental results for task assignment, speculative execution and granularity matching are reported and analyzed in Sections IV, V and VI, respectively. Finally, Section VII concludes the paper.

II. BACKGROUND OF HADOOP JOB EXECUTION

A. Hadoop Resource Provisioning and Job Splitting

Computing task parallelization lies at the core of the MapReduce framework.

A particular Hadoop job is parallelized into multiple Map and Reduce tasks. Each Map task is responsible for processing a portion of the data stored in the Hadoop Distributed File System (HDFS). Reduce tasks then sort and combine the Map results.

Many factors influence the execution performance of a Hadoop job. One such factor is the job splitting scheme. The data associated with a job are split into blocks and stored in the HDFS. The number of Map tasks of one job should be no smaller than the number of data blocks associated with the job; by default, the two are exactly equal, and only one Reduce task is generated. The total number of tasks of a job should be carefully determined. If a job is split into a large number of tasks, each task requires less time and fewer resources to process, and task scheduling becomes more flexible; but an excessive number of tasks introduces unnecessary queuing delay and extra task initialization overhead. However, if the number of tasks is too small, scheduling becomes cumbersome: once all the tasks have been scheduled, the Hadoop cluster is unable to start new tasks even if it has spare capacity.

Another key factor that influences job execution performance is the hardware resource partitioning granularity. In classical Hadoop (i.e., before Hadoop 2.0), the computing resources of a Hadoop cluster are divided into capacity units called task slots, each of which can process only one Hadoop task at a time. In the framework after Hadoop 2.0, Apache YARN [12] is introduced to manage hardware resources, and the basic capacity units are called containers. To simplify the discussion, in this paper we refer to the capacity required to execute one job task as a task slot. There is a trade-off between the number of simultaneous task executions and the efficiency of program parallelization. When the number of tasks that can be executed in parallel is too small, jobs experience unnecessary queueing delay because unscheduled tasks must wait for vacant task slots. However, if the cluster configuration allows an excessive number of simultaneous task executions, multi-thread scheduling overhead and resource-sharing bottlenecks among slots emerge and deteriorate the performance.

B. Speculative Execution

Speculative execution is a fault-tolerance mechanism in Hadoop. It constantly monitors the progress of tasks; if a task falls behind in progress and is in danger of failure, the mechanism proactively creates an identical backup task in an available slot. Because backup tasks consume extra resources, unnecessary backups can potentially slow down the execution of other active jobs. As a result, the speculative execution mechanism should be carefully designed.
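To make these knobs concrete, the sketch below sets the per-node slot count, the split size that drives the number of Map tasks, and the speculative execution switches through Hadoop's Configuration API. It is only an illustrative sketch: the property names follow the classic (pre-2.0, MRv1) naming scheme, and every value shown is a placeholder rather than a recommendation derived from this paper.

    import org.apache.hadoop.conf.Configuration;

    public class ConfigSketch {
        public static Configuration build() {
            Configuration conf = new Configuration();

            // Resource partitioning granularity: how many Map tasks one slave node
            // (TaskTracker) may run concurrently, i.e., its number of Map slots.
            conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);        // placeholder value

            // Job splitting granularity: with the default input formats, the number of
            // Map tasks follows the number of input splits, so the minimum split size
            // (or the HDFS block size) controls how finely a job is divided.
            conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);      // placeholder: 64 MB

            // Speculative execution: allow (or forbid) backup copies of slow tasks.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

            return conf;
        }
    }

In YARN-based releases the slot model is replaced by container resource settings, while the split-size and speculative-execution properties carry over under newer mapreduce.* names.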
III. EXPERIMENTAL SETUP

In the following experiments, Hadoop is deployed onto a computing cluster. Because this paper focuses on the Hadoop operations and mechanisms for a single job, a relatively small cluster with one master node and six slave nodes is sufficient. The master node is hosted on an HP Compaq desktop, and the six slave nodes are hosted by homogeneous virtual machines (VMs) in our private computing cloud, which consists of Dell PowerEdge servers. Each VM is allocated four virtual processing cores from Intel Xeon CPUs, together with a fixed amount of memory. The HDFS is built upon the local storage of each server rather than on a network-attached storage system.

In the experiments, the word-count job is adopted as the benchmark application. According to a trace study of a Yahoo production Hadoop cluster [5], a large fraction of submitted jobs are map-only or map-mostly, and the word-count job falls into the map-mostly category. The program is also widely used in Hadoop performance analysis experiments, for example by MIT [13] and Intel [14]. Since data locality in the Hadoop system is a well-studied subject [8]-[11], we eliminate the data locality issue and investigate the other factors that influence performance. In all of the following experiments, the HDFS replication factor is set to the number of slave nodes, so that every slave node maintains a complete copy of all the data. The Hadoop capacity scheduler [] is used to submit jobs, and only one job is submitted at a time, so that the microscopic performance of a single job can be studied in detail.
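The replication setting described above can be reproduced with a short sketch. This is hypothetical glue code rather than part of the paper's tooling: the input path is invented for illustration, and the replication factor of six simply mirrors the six slave nodes in the setup above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Replicate every block to all six slave nodes so that each node holds a
            // complete copy of the input, which removes data locality as a factor.
            conf.setInt("dfs.replication", 6);
            // Route job submissions through a capacity scheduler queue (queue name is a placeholder).
            conf.set("mapred.job.queue.name", "default");

            // For data that is already stored in HDFS, raise its replication factor explicitly.
            FileSystem fs = FileSystem.get(conf);
            fs.setReplication(new Path("/data/wiki-dump"), (short) 6);    // hypothetical path
        }
    }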

IV. HADOOP TASK ASSIGNMENT PREFERENCE

The task assignment preference of Hadoop determines how the workload is balanced among slave nodes. In this part of the experiments, we investigate how Hadoop assigns job tasks to active slave nodes. An English Wikipedia dump data set [15] is split into fixed-size data blocks in the HDFS, and one Map task is created per block; the number of blocks is chosen deliberately so that the Map tasks cannot be spread evenly across the slave nodes. One Reduce task is used to summarize the Map results.

To study the task assignment behavior, we take Map task assignment as an example. The experiment is repeated four times, and in each run the hardware resources of the Hadoop cluster are partitioned into a different number of task slots. Figure 1(a) shows the execution time of each Map task when every slave node is configured with the same number of Map slots; job tasks assigned to the same slave node are drawn in the same color. The results indicate that several slave nodes are left completely idle, one slave node hosts only a single task, and the remaining slave nodes are fully loaded with all of the other tasks. Figures 1(b), 1(c) and 1(d) show similar results when each slave node is configured with other numbers of Map slots.

[Fig. 1. Execution time distribution for Map tasks. Panels (a)-(d) correspond to four runs with different total numbers of Map slots; tasks are grouped by the slave node that hosts them.]

In the above experiments, the results clearly indicate that job tasks are assigned to slave nodes one by one: the task scheduler assigns tasks to another slave node only when the current slave node's task slots are fully occupied. As a result, there is only one under-utilized slave node, which hosts the remaining job tasks. More importantly, the tasks hosted by this under-utilized slave node complete much earlier than the tasks running on fully utilized slave nodes, as can be seen in every panel of Figure 1. This observation contradicts the belief that Hadoop slots are homogeneous computational resources. The performance of a task slot on a slave node depends on the total workload on that node, which can be explained by the resource bottleneck caused by sharing among all the task slots on a single slave node; for example, the storage bottleneck of a slave node prevents a large number of Map tasks from reading data at the same time.

We learn that Hadoop task assignment mainly depends on data locality, and that the lack of workload balancing can seriously degrade performance. Load balancing is therefore suggested to be built into Hadoop task assignment; for example, a simple heuristic strategy would be to distribute tasks evenly among all active slave nodes while still taking data locality into account.
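As an illustration of such a heuristic (and not of Hadoop's actual assignment logic), the sketch below picks, among the nodes that hold a local replica of a task's input block, the one that is currently running the fewest tasks, and falls back to the globally least-loaded node otherwise. The method and data structures are invented for this example.

    import java.util.Map;
    import java.util.Set;

    public class BalancedPlacementSketch {
        /** Choose a slave node for one task: locality first, then the lightest load. */
        static String pickNode(Set<String> nodesWithReplica,
                               Map<String, Integer> runningTasksPerNode) {
            String best = null;
            int bestLoad = Integer.MAX_VALUE;
            // First pass: the least-loaded node among those holding a replica of the input block.
            for (String node : nodesWithReplica) {
                int load = runningTasksPerNode.getOrDefault(node, 0);
                if (load < bestLoad) { bestLoad = load; best = node; }
            }
            if (best != null) return best;
            // Fallback: no replica holder is known, so pick the least-loaded node overall.
            for (Map.Entry<String, Integer> e : runningTasksPerNode.entrySet()) {
                if (e.getValue() < bestLoad) { bestLoad = e.getValue(); best = e.getKey(); }
            }
            return best;
        }
    }

Spreading tasks in this way would avoid the situation in Figure 1, where a few nodes are saturated while others sit idle even though every node holds a full copy of the data.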
V. HADOOP SPECULATIVE EXECUTION

The speculative execution mechanism launches backups for active tasks with slow progress. However, the official Hadoop documents [] give no clear description of the criterion that triggers a backup task. In this section, we study the triggering conditions of backup tasks in speculative execution.

To simplify the experiment, a small ebook downloaded from the Gutenberg project [16] is submitted to the cluster, and several Map tasks are created in total. An artificial delay is programmed into the Map function of the Hadoop word-count job; the delay is injected evenly across a Map task's progress at a granularity of a fraction of one percent, which allows us to freely control the duration of each individual Map task. The cluster is configured with enough Map slots for all of these tasks to run in a single wave. Without any injected delay, our measurements show that a Map task completes in a short, roughly constant amount of time.

Figure 2(a) shows the progress of the Map tasks over time when the same amount of delay is introduced for all tasks. In this case, all tasks have similar progress rates and no backup task is triggered. Figure 2(b) shows that two backup tasks are generated when extra delay is introduced for two of the tasks, slowing their progress. In Figure 2(c), two tasks again receive extra delay, but one receives less than the other (while still more than the rest); a backup is then triggered only for the more heavily delayed task. These results imply that one necessary condition for the speculative execution mechanism to launch a backup is the existence of a slow-running task whose progress rate is below a certain threshold relative to the other, healthy tasks.

The progress rate threshold is not the only triggering criterion, however. In the next part of the experiment, artificial delay is introduced for only one task, and the injected delay is increased continuously from zero so that the time condition for triggering a backup can be located. Figure 2(d) shows that no backup task is created if the delayed task finishes early enough, while Figure 2(e) shows that one backup task is launched once the delayed task's execution time exceeds a certain value. Note that the delayed task has a much slower progress rate than its peer tasks in both Figures 2(d) and 2(e); if the progress rate threshold were the only condition, backup tasks would have been triggered in both cases. More interestingly, whenever the backup is triggered, it is launched at around the same absolute time, roughly one minute into the job. This observation suggests that speculative execution starts monitoring existing tasks and creating backups only after the job has been running for about one minute.

In Figure 2(f), each slave node is configured with only one Map task slot, so the Map tasks cannot all be scheduled within a single wave of the Map phase. Artificial delay is introduced for all the Map tasks in the same way as in Figure 2(a). In this case, backup tasks are launched for the tasks scheduled in the second wave right after those tasks start. Combined with the observations from Figures 2(d) and 2(e), we conclude that backup tasks can be launched only after the job has been running for some time, an absolute period of around one minute in our experiment. Additionally, Hadoop checks the absolute progress of the targeted normal task before launching its backup: if the targeted task is approaching completion, the backup is not launched.

Figure 2(f) also illustrates another feature of speculative execution. In this figure, the execution times of all Map tasks are close to one another, yet for the tasks scheduled in the second wave of the Map phase, backup tasks are launched almost immediately after the corresponding normal tasks are created. These tasks are thus considered to be at risk of failure from the very beginning of their execution. The phenomenon implies that it is the absolute progress of a task at each point in time, not its progress relative to the time at which it was scheduled, that is compared with its peer tasks. This strategy obviously increases the probability that backup tasks are launched.

[Fig. 2. Transient progressions of Map tasks: (a) uniform delay; (b) extra delay for two tasks; (c) higher delay for one task; (d) delay introduced for only one task; (e) more delay introduced for that task; (f) uniform delay but fewer Map slots.]

However, this design is reasonable in terms of fully utilizing the computing resources of the cluster. In Hadoop, unscheduled normal tasks have a higher priority for free task slots than backup tasks. If free slots remain after all normal tasks have been scheduled, it is better to use them for backup tasks that proactively guard against failure than to leave them idle. Nevertheless, and more importantly, a potential design weakness of slow-running task detection is exposed here: genuinely slow-running tasks, in terms of their relative progress rates, may find no free slots for their backups if those slots are already occupied by unnecessary backups of tasks that are judged slow by their absolute progress but are healthy by their relative progress.

We summarize our observations into the following conditions for speculative execution to launch a backup task:

- All normal tasks have already been scheduled, and there exists at least one free task slot in the cluster.
- There exists a slow-running task whose progress rate is behind some progress rate threshold.
- The job has lasted for at least around one minute.
- The slow-running task is not approaching completion at the time its backup would be triggered.

Our observations agree with the description of the speculative execution mechanism by Dinu et al. [17]. In summary, the current implementation of speculative execution is simple and heuristic. Unnecessary backups are likely to be generated, while genuinely slow-running tasks (in terms of relative progress rates) may not be granted resources for backups. The creation of backup tasks may also be delayed by the condition that backups can only be launched after the job has lasted for about one minute. To improve the speculative execution mechanism, we suggest applying more advanced algorithms (e.g., machine learning techniques) to the detection of slow-running tasks.
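The observed behavior can be condensed into a small decision routine. The sketch below encodes the four conditions as we inferred them from the experiments; the thresholds, parameter names and the one-minute constant are assumptions made for illustration, not values read from Hadoop's source code.

    /** Hypothetical reconstruction of the observed backup-triggering logic. */
    public class SpeculationSketch {
        static final long MIN_JOB_AGE_MS = 60_000;   // about one minute, as observed
        static final double PROGRESS_GAP = 0.2;      // assumed slowness threshold
        static final double NEARLY_DONE  = 0.9;      // assumed "approaching completion"

        static boolean shouldLaunchBackup(boolean allNormalTasksScheduled,
                                          int freeSlots,
                                          long jobAgeMs,
                                          double taskProgress,        // absolute progress in [0, 1]
                                          double averagePeerProgress) {
            if (!allNormalTasksScheduled || freeSlots == 0) return false;  // normal tasks come first
            if (jobAgeMs < MIN_JOB_AGE_MS) return false;                   // job too young for backups
            if (taskProgress >= NEARLY_DONE) return false;                 // task is about to finish anyway
            // Absolute progress is compared against the peers, which is exactly what
            // penalizes healthy tasks scheduled in a later wave.
            return averagePeerProgress - taskProgress > PROGRESS_GAP;
        }
    }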
VI. RESOURCE PARTITION AND JOB SPLITTING SCHEMES

The hardware resource partitioning and the job splitting granularity determine how efficiently the workload is parallelized and distributed inside the cluster. Given fixed hardware resources, we aim to find the optimal number of task slots per slave node, together with the job splitting strategy that best matches the hardware resource partitioning scheme. In the following experiments, the same English Wikipedia dump data set is used, and the number of Map task slots on each slave node is varied over a range of values. We mainly focus on the performance of the Map phase, because it accounts for most of the total job execution time in our experiment. To quantitatively measure the effect of job splitting, the job execution time is measured when the job is split into different numbers of Map tasks, for each of several total numbers of Map task slots. The measured job execution times are reported in Table I.

Reading Table I by columns, for a fixed job splitting scheme the job execution time follows a generic trend as the total number of Map slots increases: it first goes down, reaches an optimum (the shortest time), and then goes up again. Insufficient partitioning of task slots on a node leads to severe queueing delay for tasks because the degree of parallelization is too low. On the other hand, an excessive number of task slots on one node incurs extra multi-thread scheduling overhead; in this situation, the heavy workload on the node also causes the resource-sharing bottleneck among task slots indicated in Section IV, which further retards task execution.

[Table I. Measurement results of word-count job execution times (sec), indexed by the number of job splits and the total number of Map task slots.]

We speculate from Table I that, in general, the optimal total number of task slots in a cluster is slightly larger than the total number of processing cores in that cluster; our cluster has six slave nodes with four cores each, i.e., 24 cores in total. Given the total number of Map slots, the job execution time also varies with the job splitting scheme, which can be observed along the rows of Table I; an appropriate matching between the total number of task slots and the total number of job splits is therefore required. Table I shows that splitting a job into exactly as many Map tasks as there are Map slots in the cluster achieves the best performance (execution times highlighted in red in the table). According to the observations in Section IV, this job splitting strategy ensures that every slot is assigned a normal task and the overall Map execution takes just one wave. If the total number of Map tasks is smaller than the total number of Map slots, job execution performance degrades because of resource underutilization: some slots, or even whole nodes, are left idle or used for backups. Likewise, splitting a job into a number of tasks that is an integral multiple of the total number of Map slots achieves sub-optimal performance (execution times highlighted in blue in the table), because on the whole each slot is assigned one normal task in each wave. However, splitting a job into too many Map tasks deteriorates the execution performance and should be avoided: the data processing is then divided into tiny units and initialization overheads dominate the task execution. In practice, tasks with longer execution times have a higher probability of failure, so a job containing a huge volume of data is better split into a larger number of tasks to reduce the cost of failure recovery. For such jobs, the number of tasks is encouraged to be an integral multiple of the total number of slots in the cluster.
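As a rough illustration of this guideline (it is not a formula given in the paper), the sketch below derives a candidate split count from a cluster description: the choice of one slot per core is an assumption made for the example, while the six slaves with four cores each mirror our experimental cluster.

    public class SplitPlannerSketch {
        /** Suggest a number of Map tasks as a multiple of the total slot count. */
        static int suggestNumSplits(int slaveNodes, int slotsPerNode, int waves) {
            int totalSlots = slaveNodes * slotsPerNode;
            return totalSlots * waves;   // one task per slot in each wave
        }

        public static void main(String[] args) {
            // Cluster as in this paper: 6 slaves, 4 cores each; assume one slot per core.
            System.out.println(suggestNumSplits(6, 4, 1));   // 24 splits: a single wave
            System.out.println(suggestNumSplits(6, 4, 3));   // 72 splits: three full waves
        }
    }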
VII. CONCLUSION

How to configure the Hadoop system according to its job execution operations and mechanisms is of great significance to execution performance. In this paper, we performed extensive experiments to gain insights from a practical perspective. Based on the experimental observations, we provide suggestions on system configuration, particularly on determining the resource partitioning and job splitting granularity.

REFERENCES

[1] A. Menon, "Big data @ Facebook," in Proc. ACM Workshop on Management of Big Data Systems.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[3] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc.
[4] Hadoop Wiki: Powered By. [Online]. Available: https://wiki.apache.org/hadoop/poweredby
[5] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production MapReduce cluster," in Proc. IEEE/ACM Int. Conf. Cluster, Cloud and Grid Computing (CCGrid).
[6] K. Kc and K. Anyanwu, "Scheduling Hadoop jobs to meet deadlines," in Proc. IEEE Int. Conf. Cloud Computing Technology and Science (CloudCom).
[7] T. Sandholm and K. Lai, "Dynamic proportional share scheduling in Hadoop," in Job Scheduling Strategies for Parallel Processing. Springer.
[8] Z. Guo, G. Fox, and M. Zhou, "Investigation of data locality in MapReduce," in Proc. IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing (CCGrid).
[9] X. Zhang, Z. Zhong, S. Feng, B. Tu, and J. Fan, "Improving data locality of MapReduce by scheduling in homogeneous computing environments," in Proc. IEEE Int. Symp. Parallel and Distributed Processing with Applications (ISPA).
[10] J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong, "BAR: An efficient data locality driven task scheduling algorithm for cloud computing," in Proc. IEEE/ACM Int. Symp. Cluster, Cloud and Grid Computing (CCGrid).
[11] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang, "A throughput optimal algorithm for map task scheduling in MapReduce with data locality," ACM SIGMETRICS Performance Evaluation Review.
[12] Hadoop YARN. [Online]. Available: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/yarn.html
[13] Y. Mao, R. Morris, and M. F. Kaashoek, "Optimizing MapReduce for multicore architectures," Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep.
[14] Optimizing Hadoop Deployments. [Online]. Available: http://www.intel.com/content/dam/doc/white-paper/cloud-computing-optimizing-hadoop-deployments-paper.pdf
[15] Wiki Dump. [Online]. Available: http://dumps.wikimedia.org/enwiki/
[16] Free eBooks: Project Gutenberg. [Online]. Available: http://www.gutenberg.org/
[17] F. Dinu and T. Ng, "Understanding the effects and implications of compute node related failures in Hadoop," in Proc. ACM Int. Symp. High-Performance Parallel and Distributed Computing (HPDC).