Theius: A Streaming Visualization Suite for Hadoop Clusters

Jon Tedesco, Roman Dudko, Abhishek Sharma, Reza Farivar, Roy Campbell
{tedesco1, dudko1, sharma17, farivar2, rhc} @ illinois.edu
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Abstract: As cloud computing clusters continue to grow, maintaining the health of these clusters becomes increasingly challenging. Recent work has studied how to efficiently monitor the status of machines in these clusters and how to detect problems or predict them before they occur, yet little work has focused on the bottleneck between when these failures occur and when they are fixed: the system administrators themselves. As monitoring and failure detection systems mature, we can extract tremendous amounts of information about the status of the system in real time. However, this volume of data is difficult for humans to digest, especially for those inexperienced with the particular cluster. In this paper, we introduce Theius, a web-based visualization suite that allows system administrators to quickly understand the state of a cloud system as a whole. We outline the key features of this visualization tool and show that it is more intuitive and easier to use than Ganglia, a state-of-the-art visualization tool for clusters. We also demonstrate that our tool scales, presenting a use case in which our visualization displays a 5000-node cluster. Although our tool is implemented for Hadoop clusters, our contribution generalizes to any cloud computing system.

Keywords: visualization, failure detection, failure prediction, monitoring, Hadoop, cloud computing, cluster computing.

I. INTRODUCTION

As cloud computing clusters grow in size, failure detection and prediction become increasingly critical challenges in cloud computing research. Naturally, developments in these technologies require reliable and efficient methods of monitoring large clusters. Recent work has introduced strategies for efficiently monitoring cloud systems and statistical methods for predicting failures, increasing the power and detail of the information available to system administrators. However, the bottleneck in responding to system problems still lies in the delay between when a problem is detected and when the system administrators respond. Thus, this data is only useful to system administrators if we can communicate it quickly and effectively. While much work has been done on collecting data and on detecting and predicting failures, little work has focused on how to effectively visualize this data for system administrators so that problems can be fixed or prevented. State-of-the-art systems typically present the state of the cluster as a whole, or of particular nodes, as an abundance of data rendered through basic plots. These tools are not designed to support real-time data, and they provide little control to users, limiting their ability to customize the visualizations. Offering only basic graphs, they do not draw the attention of system administrators to potential problems with the cluster as quickly as they could. Likewise, these visualizations do not gracefully handle clusters containing several hundred or several thousand nodes, suffering performance problems or conveying too little detail when relying on graphs alone. We design our suite with these shortcomings in mind, aiming to provide an intuitive, succinct visualization of the cluster as a whole while still allowing the user to drill down to the details of a particular rack or node.
Likewise, our tool is designed to support streaming data and provides a highly interactive, controllable interface. Further, we demonstrate our visualization running successfully on a large cluster of approximately 5000 nodes. We compare our tool with the visualizations of Ganglia, a state-of-the-art monitoring system, through a user study of five graduate students at the University of Illinois at Urbana-Champaign, and show that our system allows users to identify problems more quickly and intuitively than Ganglia, supports analyses that are impractical or impossible with current visualization tools, and scales better than existing tools. We implement Theius for Hadoop clusters specifically, but our assumptions about the underlying monitoring and prediction systems are general enough to apply to other types of clusters.

II. RELATED WORK

To motivate our research and demonstrate its value, we first give an overview of related work on visualizing the health of cloud systems. Data management is well known to be a key challenge for cloud monitoring systems. The sheer amount of data produced by logs is challenging to handle effectively, and varying log formats on different platforms make a unified solution difficult to develop. In addition, distributed systems can mask problems that would otherwise be evident on smaller clusters. Even if we can effectively detect failures of individual machines, finding relationships between failures and determining their root causes is difficult. An effective monitoring system must be scalable, robust, extensible, manageable, and portable, and must impose minimal performance overhead [1], but achieving all of these properties simultaneously is often impossible. Further, the expectations and restrictions placed on monitoring systems vary with the class of
distributed system being monitored, making a ubiquitous solution to cloud monitoring extremely difficult to obtain [2]. Much research has studied how to effectively monitor large cloud systems, including Hadoop systems specifically. In this paper, we focus primarily on Ganglia as the state-of-the-art cloud monitoring system. Ganglia is widely used [1] and scales well to large clusters and grids by using a hierarchical tree to manage data. Further, it does not require a priori knowledge of the layout of a cluster [1], [2].

As distributed monitoring technologies have improved, researchers have proposed new ways to leverage log information and statistical methods to predict and prevent machine failures before they occur. Recent contributions have shown that support vector machines (SVMs) can accurately predict failures, including the specific type of failure, up to two days in advance [3], [4]. Similar work verifies this and has even shown that black-box failure prediction can be effective [5].

Although much work has been done to improve monitoring and prediction strategies in cloud systems, the data produced by these systems is often overwhelming for human administrators. For errors that require human intervention, research in cloud system visualization helps minimize the delay between failure detection and the system administrator understanding the error. As Bodik et al. articulate in [6], since humans are excellent at visual pattern recognition, visualizations play a critical role in cloud systems. In particular, visual representations of the state of the system can drastically improve how system administrators respond to and analyze failures, allowing them to easily form a high-level picture of the system and distinguish false alarms from real problems. Existing work such as MR-Scope, SALSA, and X-Trace focuses on visualization of Hadoop systems, offering features such as real-time MapReduce task status, visual log analysis, and performance analysis tools [7], [8], [9], [10]. Likewise, general cloud visualization applications such as Artemis provide a platform for problem-diagnosis plug-ins [11]. Sigelman et al. describe a large-scale production tracing infrastructure at Google, similar to X-Trace, but again do not focus on the role of visualizations in diagnosing large-scale problems [12]. Work by Gregg shows promise for scalable cloud visualizations using DTrace, but focuses only on static visualizations with a single basic network topology view and does not offer a principle-based framework similar to the one we introduce here [13].

Although it is best known for its contributions to cloud monitoring, Ganglia also provides visualizations of the cluster and represents a state-of-the-art visualization suite. Graphs shown in Ganglia are primarily time based, plotting metrics such as CPU usage against time for particular nodes or for the cluster as a whole. A screenshot of the node-level visualization is shown in Figure 1 [1], [2]. In the design of our system, we leverage the fact that these monitoring and failure prediction technologies exist for general clusters, and we use streaming data containing both the states of nodes and the failure prediction data of nodes throughout our visualization.

Fig. 1: Screenshot of the node-level view of Ganglia for the demonstration Triton cluster [14]
III. MOTIVATION

Despite advances in cloud monitoring and prediction technologies, for problems that require the involvement of system administrators, the time between when the monitoring system detects a problem and when an administrator recognizes it remains the bottleneck between when a problem arises and when it is addressed. Succinctly displaying the wealth of information produced by modern monitoring and prediction systems is a non-trivial task, but it is necessary in order to address this delay. Our visualization therefore attempts to communicate the entire state of the system quickly and concisely using non-traditional visualizations, allowing administrators to respond to potential problems in a timely manner.

Although some work has studied how to visualize data for cloud systems, the visualizations bundled with Ganglia are widely used in practice and represent a state-of-the-art monitoring and visualization system. However, these visualizations are largely limited to time-based graphs, plotting data such as resource usage over time for the cluster and for particular nodes. To the best of our knowledge, no previous work has attempted to visualize both the status data and the prediction data of a cluster, and to do so with non-obvious visualizations that provide a concise, holistic view of the cluster. Our suite also attempts to visualize the topology of the cluster, which most existing visualizations do not. Likewise, current visualizations are not designed to support streaming data. For example, Ganglia requires users to manually click a button to refresh the current data rather than automatically updating the visualization. Neither Ganglia nor, to the best of our knowledge, any other current visualization allows users to control the metrics driving the visualizations or supports very large clusters.

IV. VISUALIZATION

To address these problems with current cloud visualization systems, we present Theius, an interactive web-based visualization suite for Hadoop clusters. We first discuss the key design principles behind the interface of Theius and its basic architecture and implementation, and then present the features of each visualization it contains.
A. Design Principles

The interface of Theius was driven by current work in cluster monitoring and prediction, and by the room for improvement we saw in existing cloud visualization work. Based on this motivation, we identified five properties that we strove to achieve in our interface design and implementation:

1) Interactive: The visualization should be responsive to user interaction and controllable by the user. Users should be able to specify the metrics on which the visualization is based if the defaults do not suit their needs.
2) Real-time: The visualization should be built to support streaming data. It should allow the user to control the active stream of data and gracefully update the visualization as new data arrives from the cluster.
3) Informative: The visualization should display the cluster's topology and draw the user's attention to potential problems quickly.
4) Intuitive: The visualization should assume only general knowledge of cloud systems; it should not require experience with a particular cluster.
5) Scalable: The visualization should handle large clusters without falling behind the stream of data from the cluster or sacrificing usability.

As mentioned above, these principles are derived from problems we found with existing visualization systems.

B. Architecture

Theius is built using a simple client-server model, in which the client displays the visualizations presented in this paper and the server simulates a cloud system. We chose to simulate the cluster, rather than use a physical system or a log trace, for ease of implementation and because data collection was not our primary focus. Specifically, the data we stream to the client includes the CPU usage, memory usage, and context switch rate of each node, the predicted time until failure for each node, the probability of encountering problems of particular severity levels, and the status of MapReduce tasks and jobs in the system. We selected these data types from those gathered in previous work; each was shown to be efficiently collectable in real time from large cloud systems, verifying that the data we use in our visualizations could be gathered on a real system. We also compute a heuristic called health based on the severity of the simulated log events. The data values are generated randomly during execution, using heuristics based on the results of the previous work we studied.

The web client is written using d3.js, a JavaScript, HTML5-based visualization framework. The client is designed with responsiveness and scalability in mind, motivating our extensive use of asynchronous client-server requests and of d3.js, which visualizes data efficiently by building a DOM of HTML5 elements that mirrors the underlying data [15]. Using this framework effectively gives us a performance edge that allows the client to handle visualizations of large clusters.
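To make the shape of this stream concrete, the following is a minimal sketch, not the actual Theius simulator, of the kind of per-node update the server could generate. All field names, value ranges, and the health formula here are illustrative assumptions based on the description above.

```javascript
// Sketch of a simulated per-node state update (field names are illustrative).
function simulateNodeState(name) {
  const lastEventSeverity = Math.random();                // 0 = benign log event, 1 = fatal
  return {
    name: name,
    cpuUsage: Math.random(),                              // fraction of CPU in use
    memoryUsage: Math.random(),                           // fraction of memory in use
    contextSwitchRate: Math.floor(Math.random() * 10000), // context switches per second
    predictedSecondsToFailure: Math.floor(Math.random() * 172800), // up to two days out
    failureProbability: {                                 // chance of problems by severity
      warn: Math.random() * 0.5,
      error: Math.random() * 0.2,
      fatal: Math.random() * 0.05
    },
    // "Health" heuristic: degrades with the severity of recent simulated log events.
    health: 1.0 - lastEventSeverity
  };
}

// Example: emit fresh state for every node once per second, as a server might
// stream it to the web client for re-rendering.
const nodeNames = ["rack1-node1", "rack1-node2", "rack2-node1"];
setInterval(() => {
  const update = nodeNames.map(simulateNodeState);
  console.log(JSON.stringify(update));
}, 1000);
```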
Fig. 3: A tree visualization of a large, 5000-node cluster, showing nodes for two randomly selected racks

C. Visualization Features

1) Overview: Figure 2 shows the main page of our visualization interface. There are three main panels and a navigation bar at the top for easy access to the various visualizations. The left-hand side holds a panel of visualization controls, the middle is the main visualization, which changes depending on which visualization is selected in the top navigation bar, and the right sidebar is a panel showing information about the cluster as a whole. In particular visualizations, a bar graph also appears at the bottom, highlighting the selected metrics over time. Note that none of the figures shown here are mockups; each is a screenshot of the running application.

2) Primary Visualizations: There are seven visualizations in total, which present data to users in diverse and novel ways. A visualization is selected from the top navigation bar and always appears in the center panel. The control panel on the left changes depending on which visualization is chosen, and each visualization updates in real time as new information arrives in the system.

a) Tree Visualization: The Tree Visualization shows a hierarchy of nodes organized by rack. The color and shape of nodes can be associated with particular metrics using the left panel. As an example, in Figure 2 we see the Tree Visualization with color representing health (red indicates poorer health) and size representing the amount of memory in use on each node. From the figure, it is clear that rack 1 is performing poorly, probably due to thrashing, since node memory usage is also high. The Tree Visualization can also hide and expand parts of the tree. Figure 3 shows an example in which 5000 nodes are present in the system, but the user has selected only 20 of them to compare. This allows a system administrator to quickly find the root of a problem without having to examine all node statistics simultaneously. The system remains responsive even with 5000 nodes, which is important for system administrators supervising large clusters with thousands of machines.
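As an illustration of how such a view can be driven by the streaming data, the following d3.js fragment is a minimal sketch of the update pattern, not the actual Theius implementation. It assumes d3 version 7's join API (Theius itself was built against the d3 release available at the time), and the element id, layout positions, and field names (matching the earlier simulator sketch) are illustrative.

```javascript
// Map the "health" heuristic to a red-to-green color and memory usage to a radius.
const color = d3.scaleLinear().domain([0, 1]).range(["#d62728", "#2ca02c"]);
const radius = d3.scaleSqrt().domain([0, 1]).range([4, 20]);

const svg = d3.select("#cluster-view"); // an <svg> element assumed to exist in the page

// Called whenever a new batch of node states arrives from the server.
function render(nodes) {
  svg.selectAll("circle")
    .data(nodes, d => d.name)            // key by node name so updates stay stable
    .join("circle")
    .attr("cx", d => d.x)                // positions computed by a tree layout elsewhere
    .attr("cy", d => d.y)
    .transition().duration(500)          // animate smoothly toward the new state
    .attr("r", d => radius(d.memoryUsage))
    .attr("fill", d => color(d.health));
}
```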
b) MapReduce Visualization: The MapReduce Visualization is very similar to the Tree Visualization, except that instead of displaying nodes it displays MapReduce tasks. In Figure 4, we observe MapReduce tasks clustered by job. In this figure, blue denotes a map task, green denotes a reduce task, and the size of each circle indicates the estimated time remaining for that particular task.

Fig. 2: The main page of Theius
Fig. 4: A tree visualization based on MapReduce jobs and tasks
Fig. 5: A TreeMap visualization of nodes (squares) by CPU usage (size) and rack (color)
Fig. 6: A scatterplot visualization showing resource usage for the nodes of the cluster

c) TreeMap Visualization: The TreeMap Visualization allows the user to observe various data characteristics through the size of squares. Figure 5 shows the TreeMap Visualization in which nodes with the same color belong to the same rack and the size of each square represents its CPU usage. Clearly, the rack shown in orange carries far more load than the other racks.

d) Scatterplot Visualization: The Scatterplot Visualization allows the user to compare trends between metrics. Figure 6 shows three data sets plotted against one another: CPU usage, memory usage, and context switch rate. For example, in the box in the second column of the first row, the x-axis represents memory usage and the y-axis represents CPU usage. There is clearly no correlation in this box, but the box in the first row, third column shows a linear correlation between CPU usage and context switch rate. Using this visualization, the user can draw conclusions about relationships between metrics, which may help diagnose an issue. As a simple example, in some circumstances a user could plot the node page fault rate against memory usage, and a positive correlation between the two may indicate a memory thrashing problem on the corresponding nodes.

e) Circle Packing Visualization: The Circle Packing Visualization is very similar to the TreeMap Visualization mentioned previously, using circles instead of rectangles. The size of each circle represents a user-selected metric, and circles are nested based on the network topology. However, this visualization allows circles to be nested beyond two levels, so deeper hierarchical data can be displayed than in the TreeMap. See Figure 7 for an example.

f) Individual Node Visualization: In all of the visualizations above, clicking on a particular node brings up a view that presents all the data known about that machine and displays a graph of a statistic that is configurable by the user.
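Returning briefly to the Scatterplot Visualization: the pairwise relationships it exposes visually can also be summarized numerically. The snippet below is illustrative only and is not part of Theius; it computes a Pearson correlation coefficient between two metrics across the current node states, using the field names from the earlier sketch.

```javascript
// Pearson correlation between two equally sized arrays of metric values.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX, dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// e.g. correlate CPU usage with context switch rate across the current nodes:
// const r = pearson(nodes.map(d => d.cpuUsage), nodes.map(d => d.contextSwitchRate));
// A value of r near 1 corresponds to the linear trend visible in Figure 6.
```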
Fig. 7: A circle packing visualization of nodes by CPU usage and rack
Fig. 8: The timeline of log events

3) Information Panel: The Information Panel is always displayed on the right-hand side, regardless of which main visualization is chosen, as can be seen in Figure 2. The Information Panel has three tabs: Events, Nodes, and General. The Events tab displays log events as they arrive in the system, the Nodes tab ranks nodes by a chosen data characteristic, and the General tab displays cumulative statistics about the cluster.

4) Timeline: As shown in Figure 2, the controls sidebar has an option to pause and resume the current visualization. When any visualization is paused, it freezes at that moment in time, and a bar graph called the timeline appears at the bottom of the screen. As shown in Figure 8, the timeline graphs a user-selected metric across time, overlaying a window that represents the time interval currently displayed in the main interface. This interval can be expanded or shifted by the user to display historical data in any of the visualizations. We designed the timeline to allow system administrators to view the past state of the system or observe its behavior over time.

V. USER STUDY

To test the effectiveness of our system, we conducted a user study with five graduate students from the University of Illinois at Urbana-Champaign, each of whom had experience with cloud computing but none of whom had used our system before. We designed four tasks to be completed in both Theius and the state-of-the-art visualization tool Ganglia [1], [2]. We also designed six tasks to be performed exclusively on Theius, since they were difficult to perform on Ganglia. We timed the users as they completed the tasks, and also collected subjective feedback from them after they had gained some experience using Theius.

A. Comparative Tasks

The four comparative tasks we asked the users to perform on both Theius and Ganglia involved identifying basic information about nodes and the cluster as a whole:

1) Find the CPU usage of a particular node
2) Find the node(s) with the highest CPU usage in the cluster
3) Find all nodes that were using all of their memory
4) Find the cumulative CPU usage of the cluster

We realize that these tasks are fairly simple and may not always be representative of the work of system administrators. However, for the sake of the users in this study, we kept the tasks simple, and we believe they still give a good measure of the effectiveness of the user interface. Figure 9 shows the average time taken by the five users to complete each of the four tasks in both Theius and Ganglia. For the scenarios presented to them, the users took less time to find the answers in Theius than in Ganglia.

Task                      1    2    3    4
Theius time (seconds)     5    21   48   7
Ganglia time (seconds)    41   36   54   18

Fig. 9: Timings for performing comparative tasks 1-4 in both Theius and Ganglia

B. Theius Tasks

Six additional scenarios were presented to the users to perform only on Theius. These tasks were intended to determine whether users could identify trends between different metrics in a reasonable amount of time. Users were not expected to be able to perform these tasks on Ganglia, because Ganglia does not offer the comparative visualizations that Theius does.
The six scenarios, shown in Figure 10, were:

1) Users were presented with a heterogeneous cluster in which one rack had an abnormally large number of nodes compared to the other racks; we asked the user to filter out all other racks and focus on that particular rack.
2) Users were asked to find the rack that showed abnormal CPU usage.
3) Users were asked to find the machine that received the last fatal log event.
4) Users were asked to find machines with abnormally high CPU usage, memory usage, or context switch rate.
5) Users were asked to find the rack with abnormally high CPU usage, memory usage, or context switch rate.
6) Users were asked to find the correlation between context switch rate and CPU usage.

Task             1    2    3     4     5    6
Time (seconds)   2.2  6.2  10.0  67.4  1.2  7.8

Fig. 10: Average time per task for the six tasks performed only on Theius

The users also gave us useful feedback on what they liked about the system and how we could improve it. Users commented that the interface was intuitive, and mentioned that they liked that Theius allowed them to see correlations between different metrics.

VI. CONCLUSION AND FUTURE WORK

Maintaining the health of cloud computing clusters is a current and challenging problem in cloud computing research. Although work has studied how to efficiently monitor large clusters, little work has focused on how to minimize the true bottleneck in addressing problems when they arise: the time between when a monitoring system detects a potential problem and when a system administrator responds to it. Although monitoring and prediction strategies have improved, to realize the full benefit of these systems we must be able to communicate the state of the cluster quickly to system administrators. To address this problem, we introduce Theius, an interactive, real-time, informative, intuitive, and scalable web-based application for visualizing the status and prediction data of a cluster. We focus our efforts on Hadoop clusters specifically, but the contributions we present generalize to any cluster. To the best of our knowledge, Theius is the first cloud visualization system to implement an interactive and informative interface, to be designed to support streaming data, and to run with a large cluster (5000 nodes) without sacrificing usability.

To verify that this system is indeed an improvement over existing technology, we performed a user study with five graduate students at the University of Illinois at Urbana-Champaign, who performed four tasks on both Theius and Ganglia. We found that, on average, users required less time to perform the same task on Theius. Likewise, we had users perform several tasks not possible on Ganglia, and gathered subjective feedback. Overall, users found our system intuitive to use and gave us several points of constructive criticism, many of which we have already addressed in this paper.

Although we believe Theius is a significant step forward in cloud system visualization research, much work remains. We present a preliminary user study here, but a more formal study should be conducted in the future with a larger group of users, and should include system administrators, since they represent the primary user base for Theius and can help establish the relevance and impartiality of the tasks we chose for evaluation. Likewise, we quantify the quality of the insights offered by our system by the time required to complete a particular task; further work should verify that this is an appropriate metric. In future work, we also hope to provide a more thorough study of the scalability of Theius and to introduce additional MapReduce-specific visualizations. The Theius source code is available on GitHub at https://github.com/jtedesco/theius.

VII. ACKNOWLEDGEMENTS

We would like to thank the graduate students who participated in our user study and made our evaluation possible.
This material is based on research sponsored by the Air Force Research Laboratory and the Air Force Office of Scientific Research, under agreement number FA8750-11-2-0084. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] F. Sacerdoti, M. Katz, M. Massie, and D. Culler, "Wide area cluster monitoring with Ganglia," in Proceedings of the 2003 IEEE International Conference on Cluster Computing, Dec. 2003, pp. 289-298.
[2] M. Massie, "The Ganglia distributed monitoring system: design, implementation, and experience," Parallel Computing, vol. 30, no. 7, pp. 817-840, Jul. 2004. [Online]. Available: http://dx.doi.org/10.1016/j.parco.2004.04.001
[3] E. W. Fulp, G. A. Fink, and J. N. Haack, "Predicting computer system failures using support vector machines," in Proceedings of the First USENIX Conference on Analysis of System Logs (WASL '08), Berkeley, CA, USA: USENIX Association, 2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855886.1855891
[4] Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo, "Failure prediction in IBM BlueGene/L event logs," in Proceedings of the Seventh IEEE International Conference on Data Mining, Washington, DC, USA: IEEE Computer Society, 2007, pp. 583-588. [Online]. Available: http://dl.acm.org/citation.cfm?id=1441428.1442122
[5] A. W. Williams, S. M. Pertet, and P. Narasimhan, "Tiresias: Black-box failure prediction in distributed systems," in IPDPS, 2007, pp. 1-8.
[6] P. Bodik, G. Friedman, L. Biewald, H. Levine, G. Candea, K. Patel, G. Tolle, J. Hui, A. Fox, M. I. Jordan, and D. Patterson, "Combining visualization and statistical analysis to improve operator confidence and efficiency for failure detection and localization," in Proceedings of the 2nd IEEE International Conference on Autonomic Computing (ICAC 2005), IEEE Computer Society, 2005, pp. 89-100.
[7] D. Huang, X. Shi, S. Ibrahim, L. Lu, H. Liu, S. Wu, and H. Jin, "MR-Scope: a real-time tracing tool for MapReduce," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), New York, NY, USA: ACM, 2010, pp. 849-855. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851598
[8] J. Boulon et al., "Chukwa, a large-scale monitoring system," in Cloud Computing and its Applications, 2008, pp. 1-5. [Online]. Available: http://www.cca08.org/papers/paper-13-ariel-rabkin.pdf
[9] J. Tan, X. Pan, S. Kavulya, R. Gandhi, and P. Narasimhan, "Mochi: visual log-analysis based tools for debugging Hadoop," in Proceedings of the 2009 Conference on Hot Topics in Cloud Computing (HotCloud '09), Berkeley, CA, USA: USENIX Association, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855533.1855551
[10] A. Konwinski and M. Zaharia, "Finding the elephant in the data center: Tracing Hadoop," 2008, pp. 1-27.
[11] G. F. Cretu-Ciocarlie, M. Budiu, and M. Goldszmidt, "Hunting for problems with Artemis."
[12] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a large-scale distributed systems tracing infrastructure," 2010. [Online]. Available: http://research.google.com/archive/papers/dapper-2010-1.pdf
[13] B. Gregg, "Visualizing the cloud." [Online]. Available: http://dtrace.org/blogs/brendan/2011/10/04/visualizing-the-cloud/
[14] [Online]. Available: http://meta.rocksclusters.org/ganglia/
[15] d3.js. [Online]. Available: http://d3js.org/