Hadoop Performance Monitoring Tools




Kai Ren, Lianghong Xu, Zongwei Zhou

Abstract. Diagnosing performance problems in large-scale, data-intensive programs is inherently difficult because of their scale and distributed nature. This paper presents a non-intrusive, effective, distributed monitoring tool that facilitates performance diagnosis on Hadoop, an open-source implementation of the well-known MapReduce programming framework. The tool extracts data from Hadoop logs and aggregates per-job/per-node operating-system-level metrics on each cluster node. It provides multi-view visualization and view-zooming functionality to help application developers reason about job execution progress and system behavior. Experimental results show that, thanks to the extra levels of information it reveals and its effective visualization scheme, our tool is able to diagnose some kinds of problems that previous tools fail to detect.

I. INTRODUCTION

Data-intensive applications are gaining popularity and importance in fields ranging from scientific computation to large-scale Internet services. Several software programming frameworks have been proposed for such applications on clusters of commodity machines, among which MapReduce [3] is the most well known and widely understood. Hadoop [5] is an open-source Java implementation of MapReduce that is widely used by Internet services companies such as Yahoo! and Facebook. Monitoring Hadoop and diagnosing MapReduce performance problems, however, are inherently difficult because of Hadoop's large scale and distributed nature. Debugging Hadoop programs by inspecting their various logs is painful: the logs can grow excessively large, which makes handling them manually impractical. Most current Hadoop monitoring tools, such as Mochi [13], are purely log-based and lack important operating-system-level metrics that could significantly aid diagnosis. Cluster monitoring tools such as Ganglia [7] expose per-node and cluster-wide OS metrics but provide no MapReduce-specific information.

We present a non-intrusive, effective, distributed Hadoop monitoring tool that aims to make Hadoop program debugging simpler. It extracts a number of useful MapReduce and Hadoop File System (HDFS) level metrics from Hadoop logs and correlates them with operating-system-level metrics, such as CPU and disk I/O utilization, on a per-job/per-task basis. To present these metrics effectively, we propose a flexible multi-view visualization scheme with four views (cluster view, node view, job view, and task view), each representing a different level of granularity, together with a view-zooming model that helps application developers reason about their MapReduce job execution progress and program behavior. To the best of our knowledge, our tool, with its multi-view visualization and view-zooming model, is the first monitoring tool to correlate per-job/per-task operating system metrics with MapReduce-specific metrics and to support both online monitoring and offline analysis. We give several case studies to show how our tool can effectively diagnose performance problems in MapReduce programs, some of which cannot easily be diagnosed by other tools.

In the rest of this paper, Section II introduces the architecture of our monitoring tool, and Section III describes implementation details.
Section IV presents evaluation results, and Section V provides several examples that demonstrate the effectiveness of our tool in diagnosing MapReduce programs. Section VI compares our tool with related work, and Section VII concludes the paper.

II. ARCHITECTURE

Our non-intrusive monitoring tool extracts data from Hadoop logs and collects OS-level information about job-related processes on each cluster node. We construct four different views of the collected metrics, namely cluster view, node view, job view, and task view, and propose a view-zooming model to help system administrators and application developers reason about system behavior and job execution progress.

A. Metrics aggregation

To expose as much useful information as possible, our monitoring tool collects a number of metrics at the MapReduce, Hadoop File System (HDFS), and operating system levels. MapReduce and HDFS metrics are extracted from the logs of the Hadoop jobtracker, tasktrackers, and HDFS datanodes, while operating system metrics are gathered on each node in the cluster.

1) MapReduce and HDFS metrics: Hadoop produces various logs detailing MapReduce job execution as well as the underlying datanode activity. These logs grow excessively large after a MapReduce job has run for a while. However, we observe that much of the information in the logs is redundant or overly trivial, so there is a good opportunity to filter only the useful information out of the massive logs. A close examination of these logs enables us to extract a number of MapReduce- and HDFS-level metrics that, to our knowledge, are of interest to the potential users of our tool.

MapReduce-level metrics are mainly obtained from the jobtracker and tasktracker logs. The jobtracker log provides rich information about job execution: it records job duration, the lifetime span of each task, and various job-related counters, including the total number of launched maps and reduces, the number of bytes read from and written to HDFS by each task, map input records, reduce shuffle bytes, and so on. Tasktracker logs are distributed over every node and are sent to a central collector for analysis. They reveal detailed information about task execution state, such as the Java Virtual Machine ID associated with each task and the shuffle data flow, including total shuffle bytes and source and destination node locations. HDFS datanode logs mainly describe data flow at the underlying file system level. A typical datanode log entry contains the operation type (read or write), the source and destination network addresses of the two nodes involved, the number of bytes transferred, and the operation timestamp. We do not look into the HDFS master node (namenode) log because it contains only metadata-related information, which we believe is of little interest to cluster administrators and application developers.

2) Operating system metrics: A distinguishing feature of our monitoring tool is its ability to aggregate per-node operating-system-level metrics within a cluster. Currently, the exported metrics include CPU, memory, disk, and network utilization on each node, which we believe satisfies most of our users' needs for now.
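As an illustration of the log-side extraction described above, the following sketch parses a datanode-style transfer record into the fields we care about (operation type, source and destination, bytes, timestamp). The sample line and regular expression are loosely modeled on Hadoop's clienttrace entries and are assumptions for illustration only, since the exact layout varies across Hadoop versions.

import re
from datetime import datetime

# Hypothetical datanode transfer line, loosely modeled on Hadoop "clienttrace" entries.
# The exact format is an assumption for illustration; real layouts differ across versions.
SAMPLE = ('2010-04-21 17:45:54,665 INFO datanode.clienttrace: '
          'src: /10.1.1.5:50010, dest: /10.1.1.7:40231, bytes: 67108864, op: HDFS_READ')

PATTERN = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*'
    r'src: (?P<src>\S+), dest: (?P<dst>\S+), '
    r'bytes: (?P<bytes>\d+), op: (?P<op>\w+)')

def parse_transfer(line):
    """Extract (timestamp, op, src, dst, bytes) from one datanode log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        'time': datetime.strptime(m.group('ts'), '%Y-%m-%d %H:%M:%S'),
        'op': m.group('op'),          # e.g. HDFS_READ or HDFS_WRITE
        'src': m.group('src'),
        'dst': m.group('dst'),
        'bytes': int(m.group('bytes')),
    }

if __name__ == '__main__':
    print(parse_transfer(SAMPLE))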
B. Multi-view visualization

Our monitoring tool provides both coarse-grained and fine-grained views of MapReduce job execution. Specifically, four levels of views (cluster view, node view, job view, and task view) are presented according to users' needs.

1) Cluster view: The cluster view constructs an overall picture of a cluster's running state by visualizing the resource utilization of multiple jobs on each node within the cluster. For example, for every job executing within a given time window, the cluster view can present the change over time of the average CPU usage across the entire cluster. The cluster view offers useful information to system administrators and helps them obtain a comprehensive perspective on cluster resource utilization. MapReduce program developers may also benefit from the cluster view, since it lets them examine the jobs running concurrently with their own: when a large number of concurrent jobs compete for cluster resources, it is not surprising if job execution performance is worse than expected.

2) Node view: The node view is similar to the cluster view, but it exposes only information related to a specific node. Because all of its information is gathered on a single node, the node view reveals more fine-grained data than the cluster view, in which data is averaged across all nodes and may hide deviations among them. An example of the node view is shown in Figure 1. The CPU percentage of the Hadoop processes is plotted against the host percentage over the same period. From this figure we can clearly see how the CPU is shared among the processes running concurrently on this node, as well as how large a share a specific process takes of the total aggregated percentage.

Figure 1. An example of node view: CPU percentage over time for job411, job41, the tasktracker, the datanode, and the host.

3) Job view: The job view provides coarse-grained information about the execution progress of a specific job. It focuses on high-level, job-centric information, including three main statistical distributions: a task-based distribution, a node-based distribution, and a time-based distribution. Figure 2 shows an example task-based distribution of file read bytes for reducer tasks. Figure 3 gives an example node-based distribution of shuffle bytes sent and received by each node. The time-based distribution is illustrated by Figure 4, which presents the durations of all the tasks involved in a job execution. The job view provides rich information about job execution status and is probably the most important view for application programmers. It makes job execution abnormalities easy to diagnose; for instance, it is not difficult to detect that a task runs for much longer than its peer tasks, or that a reducer task shuffles many more bytes than other reducers.
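To make the three distributions concrete, the sketch below groups hypothetical per-task records into the task-based, node-based, and time-based summaries that the job view plots, and flags a straggler. The record fields and the 2x-median threshold are illustrative assumptions, not our tool's actual schema or rule.

from collections import defaultdict
from statistics import median

# Illustrative per-task records; field names are assumptions, not the tool's schema.
tasks = [
    {'task': 'r_000001', 'host': 'node1', 'start': 12.0, 'end': 95.0, 'read_mb': 140.0},
    {'task': 'r_000002', 'host': 'node2', 'start': 13.0, 'end': 260.0, 'read_mb': 155.0},
    {'task': 'r_000003', 'host': 'node1', 'start': 14.0, 'end': 99.0, 'read_mb': 150.0},
]

# Task-based distribution: one value per task (e.g., bytes read by each reducer).
task_dist = {t['task']: t['read_mb'] for t in tasks}

# Node-based distribution: the same metric aggregated per host.
node_dist = defaultdict(float)
for t in tasks:
    node_dist[t['host']] += t['read_mb']

# Time-based distribution: task durations, useful for spotting stragglers.
durations = {t['task']: t['end'] - t['start'] for t in tasks}
typical = median(durations.values())
# Flag tasks that take much longer than a typical peer; the 2x threshold is arbitrary.
stragglers = [name for name, d in durations.items() if d > 2 * typical]

print(task_dist)
print(dict(node_dist))
print(durations, 'suspects:', stragglers)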

4) Task view: Sometimes the job view is not sufficient to reveal all the underlying details about the execution state of a specific task. Our tool therefore provides a task view, which presents fine-grained information about a specific task's execution. It can be viewed as part of the job view, but with an extra level of detail, in order to reveal all the information that, to our knowledge, application developers may care about for a given task. The task view breaks the task duration down into several periods, each representing a specific task execution state, as shown in Figure 5. To help users understand exactly which execution state causes the skew, it also shows the average and standard deviation of the execution time of all peer tasks for each period. For the shuffle period of a reduce task, for instance, the task view reconstructs the shuffle data flow, including the source and destination node locations as well as the volume of transferred bytes, as shown in Figure 6.

C. View zooming

Above we have described the four views and what they look like. We also propose a view-zooming model that ties these views together and helps application developers reason about performance degradation, as illustrated in Figure 7. Users are not restricted to this model, however, and may use the views provided by our tool in whatever way suits their needs. We arrange the four views along two dimensions: the physical dimension indicates whether a view's scope is a single node or the entire cluster, while the MR dimension represents the granularity of the view in terms of the number of jobs and tasks.

Before launching a MapReduce job, application developers may want to look at the cluster view, which tells them how many jobs are running concurrently and how system resources are distributed among them, and then decide whether it is a good time to join the contention. After a job finishes, they may want to zoom in from the cluster view to the job view and focus on the execution progress of their own job. The three distributions presented by the job view provide ample information for detecting performance problems, if any exist. For example, from the time-based or task-based distribution they may find a suspect task that takes much longer to finish than its peers; at that point they will most likely want to zoom in to the task view of the misbehaving task to gain insight into the root cause of the problem. Alternatively, developers examining the node-based distribution in the job view may notice that some node exhibits abnormal behavior, and zoom in to the node view to see detailed information about that node during the job execution.

To make the discussion concrete, we also consider how cluster administrators may benefit from these views. Because our monitoring tool correlates MapReduce metrics with OS metrics, it can expose how cluster resources are shared among the jobs running in the cluster, which administrators may use for service-oriented charging and which traditional system monitoring tools such as Ganglia cannot provide. In our view-zooming model, a typical path taken by cluster administrators is to zoom between the cluster view and the node view, which show resource utilization on the entire cluster and on a single node, respectively.

III. IMPLEMENTATION

A. Metric Aggregation

To diagnose the performance of Hadoop MapReduce jobs, we gather and analyze OS-level performance metrics for each MapReduce task, without requiring any modifications to Hadoop or the OS. The basic idea is to infer a task's OS-level metrics from its JVM process: Hadoop runs each map/reduce task in its own Java Virtual Machine to isolate it from the other running tasks.
Figure 2. Task-based distribution of file read bytes for reducer tasks.
Figure 3. Node-based distribution of shuffle bytes sent and received by each node.

In Linux, OS-level performance metrics for each process are exposed as text files in the /proc/[pid]/ pseudo file system, which provides many aggregate counters for specific machine resources.

For example, write_bytes in /proc/[pid]/io gives the number of bytes a particular process has sent to the storage layer over its lifetime. If these aggregate counters are collected periodically, the throughput of a resource such as disk I/O can be approximated. This motivates us to collect metric data from the /proc file system and correlate it with the corresponding map/reduce task. The difficulty lies in attributing process metrics to each map/reduce task: one task can correspond to one or several OS processes, because a task may spawn subprocesses, which is common in streaming jobs. To solve this problem, we build the process tree of each JVM process using the parent id exposed in the /proc file system and add the metrics of every subprocess to the counters of the corresponding map/reduce task.

Figure 4. Time-based distribution of task durations (map and reduce tasks over time).
Figure 5. Task duration breakdown by phase for reduce task no. 28, compared with the average over its peer tasks.
Figure 6. Data flow (map-input and map-output bytes per machine) for map task no. 18.
Figure 7. View-zooming model.

Another difficulty is that Hadoop allows JVM reuse. With JVM reuse enabled, multiple tasks of one job can run sequentially in the same JVM process in order to reduce the overhead of starting a new JVM. From the OS-level metrics alone, however, we can identify only the first task that runs in a JVM process; the OS provides no additional information about the tasks that later reuse it. Examining the Hadoop logs, we found that Hadoop records the JVM id of each MapReduce task as well as the event of creating a new JVM. We can therefore identify the JVM id of a particular JVM process from the Hadoop log and, using the JVM id together with the timestamps at which a MapReduce task is created and destroyed, infer which process the task ran in.
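To make the per-task aggregation concrete, the following is a minimal sketch assuming a Python collector: it reads the write_bytes counter from /proc/[pid]/io, discovers the descendants of a task's JVM process through the PPid field in /proc/[pid]/status, and sums the counter over the whole process tree. The /proc paths and field names follow the standard Linux procfs layout; the function names and structure are illustrative rather than our actual implementation.

import os

def read_io_counter(pid, field='write_bytes'):
    """Return one counter from /proc/<pid>/io, or 0 if the process is gone."""
    try:
        with open(f'/proc/{pid}/io') as f:
            for line in f:
                key, _, value = line.partition(':')
                if key == field:
                    return int(value)
    except OSError:
        pass
    return 0

def parent_pid(pid):
    """Return the PPid of a process from /proc/<pid>/status, or None."""
    try:
        with open(f'/proc/{pid}/status') as f:
            for line in f:
                if line.startswith('PPid:'):
                    return int(line.split()[1])
    except OSError:
        pass
    return None

def process_tree(root_pid):
    """All live pids whose ancestor chain reaches root_pid (including root_pid)."""
    tree = {root_pid}
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        pid, chain = int(entry), []
        while pid is not None and pid not in tree and pid > 1:
            chain.append(pid)
            pid = parent_pid(pid)
        if pid in tree:
            tree.update(chain)
    return tree

def task_write_bytes(jvm_pid):
    """Aggregate write_bytes over the task's JVM process and all of its descendants."""
    return sum(read_io_counter(p) for p in process_tree(jvm_pid))

Sampling task_write_bytes() periodically and taking differences between samples yields the approximate per-task disk write throughput described above.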

Figure 8. Online monitoring system architecture: gmond daemons on each node collect metrics (file access, network, etc.) and report to a central gmetad backed by the RRDtool database.

B. Online Reporting

To report job-centric metrics online, we adapt the Ganglia source code. Ganglia is a distributed monitoring system with good scalability and portability. It runs a daemon called gmond on each cluster node, which periodically collects metric data from the local machine and reports it directly to a central node running gmetad; the central node stores the metric data using RRDtool databases. Gmond supports Python modules for collecting additional metrics, so it is easy for us to plug in a new metric-collection module (a sketch of such a module follows below). However, gmetad is limited in how it organizes metric data: its internal data structures organize metrics by hostname and by the cluster a host belongs to. To organize metrics by MapReduce job instead, we modify its internal data structures as well as its storage layout in the RRD database.
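As an illustration of this plug-in path, a minimal gmond Python metric module might look like the following sketch. It follows the gmond Python-module convention (metric_init returning metric descriptors with per-metric callbacks, plus metric_cleanup); the metric name and the 'attempt_'-in-cmdline heuristic for spotting Hadoop task JVMs are illustrative assumptions rather than our actual collection logic.

import os

DESCRIPTORS = []

def _hadoop_task_write_bytes(name):
    """Callback: sum write_bytes over processes whose cmdline looks like a Hadoop task JVM.
    The 'attempt_' heuristic is illustrative, not our tool's real attribution logic."""
    total = 0
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open(f'/proc/{entry}/cmdline', 'rb') as f:
                if b'attempt_' not in f.read():
                    continue
            with open(f'/proc/{entry}/io') as f:
                for line in f:
                    if line.startswith('write_bytes:'):
                        total += int(line.split()[1])
        except OSError:
            continue
    return total

def metric_init(params):
    """gmond calls this once; return the metric descriptors this module exports."""
    global DESCRIPTORS
    DESCRIPTORS = [{
        'name': 'hadoop_task_write_bytes',
        'call_back': _hadoop_task_write_bytes,
        'time_max': 90,
        'value_type': 'double',
        'units': 'bytes',
        'slope': 'both',
        'format': '%f',
        'description': 'Bytes written by Hadoop task JVMs on this node',
        'groups': 'hadoop',
    }]
    return DESCRIPTORS

def metric_cleanup():
    """gmond calls this on shutdown; nothing to release in this sketch."""
    pass

if __name__ == '__main__':
    # Standalone test harness, mimicking how gmond exercises a module.
    for d in metric_init({}):
        print(d['name'], d['call_back'](d['name']))

In a real deployment the module is registered with gmond through a .pyconf file; the __main__ harness above merely stands in for gmond when testing the module by hand.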
IV. EVALUATION

The effectiveness of our monitoring tool depends on its low overhead, which lets it collect metrics continuously without slowing down the cluster, and on its broad coverage of monitored events, which keeps potentially interesting behavior from flying under the radar. Our evaluation therefore has two parts: we first measure the overhead of the monitoring tool, and then present case studies showing how it helps cluster administrators and Hadoop programmers trace interesting events in a Hadoop system. We perform our experiments on a 64-node testbed cluster. Each node has two quad-core Intel E544 CPUs, 16GB of RAM, four 1TB SATA disks, and a QLogic 1GE NIC, and runs Debian GNU/Linux 5.0 with Linux kernel 2.6.32.

A. Benchmarks

We use the following two Hadoop programs to evaluate the overhead of our monitoring tool and to show that, through it, users can infer the root causes of performance problems in their programs.

TeraSort: The TeraSort benchmark sorts 10^10 records (approximately 1TB), each including a 10-byte key. The Hadoop implementation of TeraSort distributes the record dataset to a number of mappers that generate key-value pairs; reducers then fetch these records from the mappers and sort them by key. We run TeraSort multiple times and measure the overhead caused by our monitoring tool.

Identity Change Detection: This is a data-mining job that analyzes a telephone call graph and detects changes in the identity of the owner of a particular telephone number. The program runs over a very large data set and consists of multiple MapReduce jobs; we use one of its sub-jobs as our case study. A notable characteristic of this job is that it generates very large output in its map phase: if the input consists of M key-value pairs, the intermediate result emitted by the mapper function contains about M^2 values. For the largest data set in our experiment, the job writes nearly 3TB of data into the Hadoop file system.

B. Overhead

From the information provided by the /proc file system, we observe that the local per-node overhead of our monitoring tool accounts for less than 0.1% of the CPU. Its peak virtual memory usage is about 55MB. We also measured the disk I/O throughput of the logging version of our monitoring tool: during normal operation of the cluster, its disk write throughput is about 0.5KB/s on average, and the write rate increases with the number of MapReduce jobs.

V. CASE STUDY

A. Detecting a disk-hog

In this case study, we inject a disk-hog, a C program that repeatedly writes large files to disk, onto certain machines in the experimental cluster. We use this disk-hog to simulate a real-world disk I/O hotspot and study how the hotspot influences our MapReduce job. We run the identity change detection program described in Section IV, with an experimental setup otherwise identical to that of Section IV. We show that extracting only MapReduce-level information from the Hadoop logs is not sufficient to detect this simple disk-hog, but that correlating OS-level metrics with MapReduce-level information lets us detect it quickly.
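The disk-hog itself is just a C program that repeatedly writes large files; purely for illustration, a rough Python stand-in with assumed parameters (target path, file size) could look like the following.

import os

# Rough Python stand-in for the disk-hog fault injector (the original is a C program).
# Target directory and file size are illustrative assumptions.
TARGET_DIR = '/tmp/diskhog'
FILE_SIZE = 1 << 30          # write 1 GiB per iteration
CHUNK = b'\0' * (1 << 20)    # 1 MiB buffer

def hog_forever():
    os.makedirs(TARGET_DIR, exist_ok=True)
    path = os.path.join(TARGET_DIR, 'hog.dat')
    while True:
        with open(path, 'wb') as f:
            for _ in range(FILE_SIZE // len(CHUNK)):
                f.write(CHUNK)
            f.flush()
            os.fsync(f.fileno())   # force the data to disk so the I/O pressure is real
        os.remove(path)            # reclaim space, then start over

if __name__ == '__main__':
    hog_forever()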

Figure 9 shows the running-time distribution of map tasks in the normal case and the disk-hog case. In each case the map tasks run for varying amounts of time, and even though the disk-hog makes some map tasks run somewhat longer, it does not change the overall running-time distribution significantly. The distributions in the two cases look quite similar, so it is difficult to detect the anomaly caused by the disk-hog.

Figure 9. Running-time distribution of map tasks: (a) normal case, (b) disk-hog case.
Figure 10. Running-time distribution of reduce tasks: (a) normal case, (b) disk-hog case.

Figure 10 shows the running-time distribution of reduce tasks in the normal case and the disk-hog case. All reduce tasks run more slowly in the disk-hog case than in the normal case, because each reduce task in our MapReduce job has to wait for shuffle bytes from the map tasks running on the disk-hog machines. However, the distribution of reduce task running times in the disk-hog case is still quite similar to that of the normal case; the influence of the disk-hog on the distribution is not significant enough for anomaly detection.

By correlating OS-level metrics with MapReduce-level information, however, we can detect the disk-hog. One graph in our job view (see Figure 11) compares the average job I/O write throughput with the machine I/O write throughput on each cluster machine. It is easy to see that on machines no. 1 and 23 the machine I/O write throughput is much larger than the job I/O write throughput, even though no other MapReduce job is running on the cluster. It is therefore quite likely that there are I/O write anomalies on these two machines. The programmer can then zoom in to see the actual I/O write throughput variation on the suspected machines and ask the cluster administrator to check the I/O status on those machines.
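The comparison behind Figure 11 boils down to a simple per-machine check. The following sketch, with made-up throughput numbers and an arbitrary 2x threshold, flags machines whose total write throughput far exceeds what the job's own tasks account for; it illustrates the reasoning rather than reproducing our exact rule.

# Per-machine write throughput in KB/s: what the job's tasks wrote vs. the whole machine.
# The numbers and the 2x threshold are illustrative assumptions.
job_write = {1: 120.0, 2: 115.0, 3: 130.0, 23: 110.0}
host_write = {1: 610.0, 2: 125.0, 3: 140.0, 23: 580.0}

def suspect_machines(job_kbs, host_kbs, ratio=2.0):
    """Flag machines whose total write throughput far exceeds what the job accounts for."""
    return [m for m in job_kbs
            if host_kbs.get(m, 0.0) > ratio * job_kbs[m]]

print(suspect_machines(job_write, host_write))   # -> [1, 23]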

Figure 11. Comparing MapReduce job and machine I/O write throughput on each machine: (a) normal case (job461), (b) disk-hog case (job462).
Figure 12. How two jobs influence each other in the node view: CPU percentage over time for jobs 411 and 41, and for jobs 422 and 423, with their average CPU usage.

B. Concurrent Job Influence

This case study shows how the zooming methodology and fine-grained information help with debugging MapReduce programs when multiple jobs run on the cluster and you want to evaluate how they influence one another. From the system-wide resource utilization in the cluster view, you may see that two jobs consume almost the same amount of CPU over an extended period and conclude that they influence each other. However, when we zoom into the node view, we obtain more fine-grained information, such as the exact temporal variation of CPU usage. From this fine-grained information we get a precise picture of how concurrent jobs interleave and how significantly they affect each other. As Figure 12 shows, jobs 411 and 41 are scheduled to run concurrently and influence each other's CPU usage, whereas jobs 422 and 423 are completely interleaved; yet the average CPU usage over the period shown is quite similar in the two cases.

VI. RELATED WORK

Mochi [13], a visual log-analysis-based tool, partially solves some of the problems mentioned above. It parses the logs generated by Hadoop in debugging mode, uses the SALSA technique [12] to infer the causal relations among recorded events, and then reports visualized metrics such as per-task execution time and the workload of each node. However, it does not correlate per-job/per-task MapReduce-specific behavior with OS-level metrics, and it does not provide any root-cause analysis for the performance problems of MapReduce programs.

Also, its analysis runs offline and thus cannot provide instant visualized monitoring information to users. Ganesha [9] and Kahuna [14] combine black-box OS metrics with MapReduce state information to diagnose performance problems in MapReduce programs. However, they expose OS metrics only at the per-node level, whereas our work traces OS metrics to the per-job or even per-task level, enabling more fine-grained performance diagnosis. Java debugging/profiling tools (jstack [11], HPROF [10]) focus on debugging local code-level errors rather than distributed problems across the cluster. Path-tracing tools (e.g., X-Trace [4], Pinpoint [1]), although they report fine-grained data such as RPC call graphs, fail to provide insight at the higher level of abstraction (maps and reduces) that is more natural to application programmers; they also track information using instrumented middleware or libraries, which introduces additional overhead. Ganglia [7], LiveRAC [8], and ENaVis [6] are cluster monitoring tools that collect per-node as well as system-wide data for several high-level variables (e.g., CPU utilization, I/O throughput, and free disk space on each monitored node). However, they focus on high-level variables, track only system-wide totals, and are typically used to flag coarse misbehavior (e.g., a node going down). Our system correlates OS metrics with the high-level MapReduce abstraction and is therefore more powerful for debugging MapReduce performance problems. Artemis [2] is a pluggable framework for distributed log collection, analysis, and visualization. Our system collects and analyzes MapReduce-level information from Hadoop logs as well as OS-level metrics on cluster machines, and could be built as a set of Artemis plugins for online monitoring.

VII. CONCLUSION

We have built a non-intrusive, effective, distributed monitoring tool that facilitates the Hadoop program debugging process. The tool correlates rich MapReduce metrics extracted from Hadoop logs with per-job/per-task operating system metrics. We have proposed a multi-view visualization scheme to present these metrics effectively, as well as a view-zooming model to help programmers reason about job execution progress and system behavior. The preliminary results are promising: our tool can successfully diagnose several performance problems that previous monitoring tools cannot. As future work, we plan to instrument the Hadoop framework in the hope of extracting more information from the Hadoop metrics APIs. Another potential direction is online automatic data analysis based on the aggregated metrics. Instead of storing temporary data in an RRD database, we would also like to find an effective way to maintain long-term storage of the collected metrics.

REFERENCES

[1] Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic Internet services. In DSN '02: Proceedings of the 2002 International Conference on Dependable Systems and Networks, 2002.

[2] Gabriela F. Cretu-Ciocarlie, Mihai Budiu, and Moises Goldszmidt. Hunting for problems with Artemis. In USENIX Workshop on Analysis of System Logs (WASL), 2008.

[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04, pages 137-150, December 2004.

[4] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-Trace: A pervasive network tracing framework. In NSDI '07, 2007.

[5] The Apache Software Foundation. Hadoop. http://hadoop.apache.org/core.
[6] Qi Liao, Andrew Blaich, Aaron Striegel, and Douglas Thain. ENaVis: Enterprise network activities visualization. In Large Installation System Administration Conference (LISA), 2008.

[7] Matthew L. Massie, Brent N. Chun, and David E. Culler. The Ganglia distributed monitoring system: Design, implementation and experience. Parallel Computing, 30, 2004.

[8] Peter McLachlan, Tamara Munzner, Eleftherios Koutsofios, and Stephen North. LiveRAC: Interactive visual exploration of system management time-series data. In CHI '08: Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 1483-1492, 2008.

[9] Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Ganesha: Black-box diagnosis of MapReduce systems. SIGMETRICS Perform. Eval. Rev., 37(3):8-13, 2009.

[10] Oracle Sun. HPROF: A heap/CPU profiling tool in J2SE 5.0. http://java.sun.com/developer/technicalarticles/Programming/HPROF.html.

[11] Oracle Sun. jstack - Stack trace for Sun Java Real-Time System. http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstack.html.

[12] Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. SALSA: Analysing logs as state machines. In USENIX Workshop on Analysis of System Logs (WASL), 2008.

[13] Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Mochi: Visual log-analysis based tools for debugging Hadoop. In HotCloud '09, San Diego, CA, June 2009.

[14] Jiaqi Tan, Xinghao Pan, Soila Kavulya, Rajeev Gandhi, and Priya Narasimhan. Kahuna: Problem diagnosis for MapReduce-based cloud computing environments. In IEEE/IFIP Network Operations and Management Symposium (NOMS), 2010.