Hadoop Kelvin: An Overview

Aviad Pines and Lev Faerman, HUJI LAWA Group

Introduction:

This document outlines the Hadoop Kelvin monitoring system and the benefits it brings to a Hadoop cluster.

Why Hadoop is (currently) sub-optimal:

At present, Hadoop has only a simple concept of network locality. At best, it can be configured to know that a specific group of machines is located in the same rack and therefore enjoys a better communication link than machines in general. This allows it to schedule computation tasks close to the data.

However, this notion of locality is limited. In a larger data center environment, a single rack houses only a few machines compared to the whole data center inventory of tens, hundreds or even thousands of machines. Once a task cannot be scheduled on the same rack, it is scheduled on any of the remaining available machines in the cluster. Since the quality of the link between machines on different racks depends on whether those racks are adjacent or not, this leads to sub-optimal task assignment.

This issue is compounded when a Hadoop cluster is deployed in a cloud environment, where the administrator cannot know the rack topology of the machines he is currently working on. As such, a need arises for Hadoop to be able to detect strong and weak network links between machines in order to improve its scheduling mechanism. Even more detailed information will be required to manage a Hadoop workload when it is distributed across two or more distinct clusters, as is expected to become a common cloud configuration in the future.

What is Hadoop Kelvin, and What It Enables:

Hadoop Kelvin is a network monitoring system designed for the Hadoop Map-Reduce framework. It monitors data (not control) traffic between Hadoop nodes and provides multiple ways to store, visualize and access the collected monitoring data. It is designed to be flexible and easily extensible, and to operate with a minimal effect on the running time of Hadoop jobs.

Because Hadoop Kelvin is tightly integrated into Hadoop by instrumenting some of the Hadoop source code, it is only available as part of the Hadoop-LAWA version of Hadoop. Once Hadoop Kelvin has amassed a certain amount of monitoring data, it presents any agent utilizing it with data on the network links in the cluster. This data can later be used to improve Hadoop's decision-making in its scheduling process by placing computation tasks as close to the data as possible, expanding on the rack concept used in the regular Hadoop scheduler.

Method:

Hadoop Kelvin collects data about the following data transfers:

- HDFS reads (regardless of who is performing the read).
- HDFS writes (regardless of who is the origin of the data).
- Data transfers between Mappers and Reducers during a Map-Reduce job execution.

The data collected about each transfer includes:

- Source machine.
- Destination machine.
- Starting timestamp.
- Duration of the transfer, in milliseconds.
- Size of the transferred data, in bytes.

The data is collected by a statistics server, several of which may be present in a sufficiently large Hadoop cluster. The number of statistics servers is configurable, and each machine in the cluster is configured to report to a single statistics server.

This method of operation was chosen because it offers a complete view of the heavy network traffic in the system, which is the motion of data. It ignores the light-weight management traffic (such as requests for blocks, heartbeats, and other traffic that would be caught by external monitoring tools).

The single-location (or, more correctly, few-location) statistics storage is designed to give the future scheduler quick access to the stored data. If this data were stored locally on each cluster machine, it would slow down the scheduler's decision-making process while it waited for the measurement data relevant to it at that particular moment because the collection

Architectural Overview:

High-Level Design:

There are two main parts in Hadoop Kelvin: the Statistics Server and the Statistics Client.

The Statistics Server is a program which runs on a single machine in the cluster and serves as a sink for all the traffic reports arriving from the cluster nodes. If a single Statistics Server is present, it typically runs on one of the master machines of the cluster; alternatively, a set of slave machines can be used if several such servers are required. The server operates a set of user-configurable (via XML) data storers (which are write-only), data retrievers (which are read-only) and data manipulators (which provide read-and-write access). Measurement data is stored into these handlers, and queries about past measurement data are answered from them. Currently, Hadoop Kelvin provides a log-based information store, which stores all traffic reports in plaintext form, and a data manipulator based on an SQL database, which collects the traffic reports.

The Statistics Client allows a third-party program to access the data stored inside the data storers of the Statistics Server and also to submit reports of its own. All Hadoop Kelvin traffic uses HTTP as its protocol.
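The traffic reports that cluster nodes send to the Statistics Server carry the five fields listed in the Method section. As a minimal sketch of such a report, the class below holds those fields and derives an average throughput from them. The class, field and method names are hypothetical and are not taken from the Hadoop-LAWA source; the throughput calculation is an illustration of how the fields could be combined, not something the original describes.

```java
import java.time.Instant;

// Hypothetical sketch: these names are illustrative and are not taken from
// the Hadoop-LAWA source code.
class TransferReportSketch {

    /** One monitored data transfer, holding the five fields listed in the Method section. */
    static final class TransferReport {
        final String sourceMachine;
        final String destinationMachine;
        final Instant start;
        final long durationMillis;
        final long sizeBytes;

        TransferReport(String sourceMachine, String destinationMachine,
                       Instant start, long durationMillis, long sizeBytes) {
            this.sourceMachine = sourceMachine;
            this.destinationMachine = destinationMachine;
            this.start = start;
            this.durationMillis = durationMillis;
            this.sizeBytes = sizeBytes;
        }

        /** Average throughput in bytes per second, or 0 if the duration is not positive. */
        double bytesPerSecond() {
            if (durationMillis <= 0) {
                return 0.0;
            }
            return sizeBytes * 1000.0 / durationMillis;
        }
    }

    public static void main(String[] args) {
        // A 64 MiB transfer that took two seconds.
        TransferReport report = new TransferReport(
                "node-a", "node-b", Instant.parse("2012-01-01T00:00:00Z"),
                2000, 64L * 1024 * 1024);
        System.out.println(report.sourceMachine + " -> " + report.destinationMachine
                + ": " + report.bytesPerSecond() + " bytes/s");
    }
}
```

A derived figure such as bytes per second is one plausible way for a consumer of the reports to compare a strong link with a weak one.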

Data reports and data requests are abstracted into Packets and Queries respectively. The system includes a Packet which contains the information described above in the Method section, and a Query which allows a user to retrieve information from the Statistics Server's SQL database.

The system additionally includes two hook points for web applications deployed on top of an embedded web server (Jetty). They are usually used by the following:

1. Kelvin In Action: a web application designed for displaying the monitoring data collected by Kelvin about the Hadoop cluster in a convenient, easy-to-understand, web-based fashion. Kelvin In Action is discussed in more detail in Appendix A.
2. Typically, this hook point is occupied by additional visualization tools which are accessible via any Internet browser.

Data Storers, Data Retrievers and Data Manipulators

Why Hadoop Kelvin is More than a Logger:

As briefly described above, the system incorporates the notions of a Data Storer, a Data Retriever and a Data Manipulator; we refer to them all as Data Handlers. The first two each define a Java interface which can be implemented by anyone seeking to expand upon the functionality of Kelvin, while the latter is simply an entity implementing both of these interfaces at once. Adding such elements does not require recompiling Hadoop (they just need to be packaged in a JAR file on the classpath and enabled in the XML configuration files), but it does require a restart of the statistics server(s).

The default Kelvin implementation supplies one Data Storer (LogStatisticStore) and one Data Manipulator (H2DBManipulator), which is also a Storer.

The LogStatisticStore logs all traffic reports to a log4j log file. This is the simplest form of a Data Storer and should mainly be used for debugging or research purposes. The log files tend to grow very large rather quickly, so this store is not suited for long-term, constant deployment in a production environment.

The H2DBManipulator stores the traffic reports in an H2 (SQL) database. This database provides the basic building block for the future Hadoop scheduler, as it allows other code to access the traffic reports collected over a period of time.

The existence of the SQL database and the extensibility of Kelvin set it apart from a simple logger interface (in fact, it contains just one such logger as a default).
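The relationship between the three kinds of Data Handler can be illustrated with a small sketch. It assumes heavily simplified interfaces that exchange plain strings; the real DataStorer and DataRetriever interfaces are not shown in this document, so every name and signature below is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: simplified stand-ins, not the real Kelvin interfaces.
class DataHandlerSketch {

    /** Write-only handler. */
    interface Storer {
        void store(String report);
    }

    /** Read-only handler. */
    interface Retriever {
        List<String> retrieveAll();
    }

    /** A write-only storer, loosely in the spirit of a log-based store. */
    static final class LineCountingStorer implements Storer {
        int lines = 0;

        @Override
        public void store(String report) {
            lines++; // a real logger would append the report to a file here
        }
    }

    /** A manipulator is simply a class implementing both interfaces at once. */
    static final class InMemoryManipulator implements Storer, Retriever {
        private final List<String> reports = new ArrayList<>();

        @Override
        public void store(String report) {
            reports.add(report);
        }

        @Override
        public List<String> retrieveAll() {
            return new ArrayList<>(reports); // defensive copy
        }
    }

    /** Hands one report to every configured storer, as a server-side sink might. */
    static void storeEverywhere(List<Storer> storers, String report) {
        for (Storer storer : storers) {
            storer.store(report);
        }
    }

    public static void main(String[] args) {
        LineCountingStorer log = new LineCountingStorer();
        InMemoryManipulator db = new InMemoryManipulator();
        List<Storer> storers = new ArrayList<>();
        storers.add(log);
        storers.add(db);

        storeEverywhere(storers, "node-a -> node-b, 1024 bytes");
        System.out.println("log lines: " + log.lines + ", stored: " + db.retrieveAll());
    }
}
```

Only the manipulator can answer questions about past reports, which is the property that separates Kelvin from a plain logger.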

Data Flow Through Kelvin: Packets and Queries

Submitting and Requesting Information to/from Kelvin:

Communication with Kelvin is done via two serializable Java types.

Objects inheriting the StatisticQuery abstract class can be sent via the Statistics Client to the Statistics Server in order to obtain a response from a retriever (the specific response depends on the data retriever the query is addressed to). These queries allow access to the data stored within Kelvin. A specific type of query is created for each target retriever (and more than one is possible per retriever). After the query has been processed by the target Retriever, the resulting data is sent back to the client. Currently, the NetworkMatrixStatisticQuery exists to allow retrieving matrices of traffic reports from the H2DBManipulator.

Objects inheriting the StatisticPacket abstract class can also be sent via the Statistics Client to the Statistics Server, in order to store information in all the storers which support the specific packet. This is in contrast to queries, each of which is addressed to one specific retriever. Currently, the NetworkMatrixStatisticsPacket stores traffic reports in the H2DBManipulator, LogStatisticsStore and the DebugStore.

Sending data to the Statistics Server

Inheriting the StatisticsPacket class

Sending data to the Statistics Server is done by sending objects which inherit the StatisticPacket abstract class. Inheriting this class compels the user to implement the method addto(DataStorer collector). Since the implementation follows the Visitor design pattern, the method has to be implemented as follows for the process to work:

    public void addto(DataStorer collector) {
        collector.accept(this);
    }
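The reason each packet must implement this one-line method itself is double dispatch: inside a concrete packet class, `this` has that class's static type, so the compiler selects the matching `accept` overload. The following is a self-contained sketch of the pattern with hypothetical stand-in types (`Packet`, `TrafficPacket`, `DebugPacket`, `Storer`); it is not the Kelvin code.

```java
// Hypothetical sketch of the Visitor-style dispatch; all names are stand-ins.
class PacketVisitorSketch {

    /** Every storer declares one accept overload per packet type. */
    interface Storer {
        void accept(TrafficPacket packet);
        void accept(DebugPacket packet);
    }

    abstract static class Packet {
        abstract void addTo(Storer collector);
    }

    static final class TrafficPacket extends Packet {
        final long sizeBytes;

        TrafficPacket(long sizeBytes) {
            this.sizeBytes = sizeBytes;
        }

        @Override
        void addTo(Storer collector) {
            collector.accept(this); // resolves to accept(TrafficPacket)
        }
    }

    static final class DebugPacket extends Packet {
        final String message;

        DebugPacket(String message) {
            this.message = message;
        }

        @Override
        void addTo(Storer collector) {
            collector.accept(this); // resolves to accept(DebugPacket)
        }
    }

    /** A storer that supports traffic packets and ignores debug packets. */
    static final class TrafficOnlyStorer implements Storer {
        long totalBytes = 0;

        @Override
        public void accept(TrafficPacket packet) {
            totalBytes += packet.sizeBytes;
        }

        @Override
        public void accept(DebugPacket packet) {
            // unsupported packet type: intentionally empty
        }
    }

    public static void main(String[] args) {
        TrafficOnlyStorer storer = new TrafficOnlyStorer();
        Packet[] incoming = { new TrafficPacket(10), new DebugPacket("hello"), new TrafficPacket(5) };
        for (Packet packet : incoming) {
            packet.addTo(storer); // the server never needs instanceof checks
        }
        System.out.println("total bytes stored: " + storer.totalBytes);
    }
}
```

If `addTo` were implemented once in the abstract base class, `this` would have the base type there and no overload would match, which is why the pattern pushes the method into every subclass.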

Other than that, the user is free to choose how to construct the Packet and what methods to implement inside it, to be later used inside the Storers and Manipulators of his choosing.

Retrieving data from the Statistics Server

Inheriting the StatisticsQuery class

Retrieving data from the Statistics Server is done by sending objects which inherit the StatisticsQuery abstract class. This compels the user to implement the following methods:

    public Verifiable perform(DataRetriever retriever)
    public Verifiable query()
    public Verifiable query(httpstatisticclient client)

Again due to the restrictions of the Visitor design pattern, the perform method must be implemented as:

    public Verifiable perform(DataRetriever retriever) {
        return retriever.retrieve(this);
    }

A Verifiable object is an object whose data can be assessed to be valid or not via the boolean isdatavalid() method. If a query result is always valid, the method should always return true.

As with the Statistics Packet, other than the specified methods the user is free to choose how to construct the Query and what methods to implement inside it, to be later used inside the Retrievers and Manipulators of his choosing.

Data Storers, Retrievers and Manipulators

Statistic Packets and Queries are sent from the Statistics Client to the Statistics Server, where they are processed by the Storers (Packets), Retrievers (Queries) and Manipulators (both Packets and Queries).

Data Storers are objects which implement the DataStorer interface. A Data Storer is designed to store the data in a particular way, such as in a log file or a database. A Data Storer can support multiple kinds of Statistic Packets. For each Statistic Packet supported by the Storer, an accept method needs to be implemented:

    public void accept(<packettype> packet)

where <packettype> extends the StatisticPacket class.

Looking at the previously mentioned H2DBManipulator class, it has two accept methods:

    public void accept(debugpacket packet);
    public void accept(networkstatisticpacket packet);
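The query side described earlier (a perform method that hands the query to a retriever, and a Verifiable result) follows the same dispatch idea. Below is a minimal sketch under stated assumptions: `CountQuery`, `CountResult` and `MapRetriever` are invented for illustration, and the rule that an unknown host yields an invalid result is this sketch's own choice rather than documented Kelvin behaviour.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the query-side dispatch; all names are stand-ins.
class QueryVisitorSketch {

    /** A result whose validity can be checked by the caller. */
    interface Verifiable {
        boolean isDataValid();
    }

    static final class CountResult implements Verifiable {
        final long count;

        CountResult(long count) {
            this.count = count;
        }

        @Override
        public boolean isDataValid() {
            return count >= 0; // negative marks "nothing known" in this sketch
        }
    }

    /** A retriever declares one retrieve overload per supported query type. */
    interface Retriever {
        Verifiable retrieve(CountQuery query);
    }

    abstract static class Query {
        abstract Verifiable perform(Retriever retriever);
    }

    /** Asks how many transfers were recorded for one host. */
    static final class CountQuery extends Query {
        final String host;

        CountQuery(String host) {
            this.host = host;
        }

        @Override
        Verifiable perform(Retriever retriever) {
            return retriever.retrieve(this); // resolves to retrieve(CountQuery)
        }
    }

    static final class MapRetriever implements Retriever {
        private final Map<String, Long> transfersByHost = new HashMap<>();

        void record(String host) {
            transfersByHost.merge(host, 1L, Long::sum);
        }

        @Override
        public Verifiable retrieve(CountQuery query) {
            Long count = transfersByHost.get(query.host);
            return new CountResult(count == null ? -1 : count);
        }
    }

    public static void main(String[] args) {
        MapRetriever retriever = new MapRetriever();
        retriever.record("node-a");
        retriever.record("node-a");

        Verifiable known = new CountQuery("node-a").perform(retriever);
        Verifiable unknown = new CountQuery("node-z").perform(retriever);
        System.out.println("node-a valid: " + known.isDataValid()
                + ", node-z valid: " + unknown.isDataValid());
    }
}
```

Callers are expected to check the validity flag before using a result, which lets a retriever signal "no usable data" without throwing.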

There is one accept method for each supported Packet. Note that due to limitations of the Visitor design pattern, each accept method needs to be declared in DataStorer itself as well, and implemented inside all the classes deriving from it. If a class does not support a specific packet, it should implement an empty method.

Data Retrievers are objects which implement the DataRetriever interface; their job is to retrieve data that was stored on the server and send it back to the user. The H2DBManipulator, for example, retrieves data from an H2 database according to the parameters given by the user. Similar to the DataStorer, for each supported Statistic Query a method

    Verifiable retrieve(<QueryType> query)

where <QueryType> is a class which inherits the StatisticsQuery class, needs to be declared in DataRetriever itself and implemented in all the classes deriving from it. The H2DBManipulator, for example, implements the method Verifiable retrieve(NetworkMatrixStatisticQuery query) in order to be able to respond to NetworkMatrixStatisticQueries.

User API:

The following section details how someone using Kelvin can access the information it currently provides. In order to access Kelvin from user code, the Hadoop HUJI Common jar needs to be on the classpath, since all Kelvin classes are located there. This file is part of the standard Hadoop HUJI distribution.

Retrieving H2DBManipulator Data:

In order to retrieve information from the Kelvin database, the user code needs to create a NetworkMatrixStatisticQuery object. This class has three constructors:

    public NetworkMatrixStatisticQuery(String querytarget)
    public NetworkMatrixStatisticQuery(String querytarget, Date timestamp)
    public NetworkMatrixStatisticQuery(String querytarget, Date timestart, Date timeend)

The first receives only the class name of the retriever it tries to access. This needs to be a full class name, for example org.apache.hadoop.statistics.waldoes.H2DBManipulator. In this case, the data returned will be only the latest measurements between all nodes, one for each node pair (or none, if no traffic passed between these particular nodes). The second takes an additional Date object; only measurements that took place at this particular second will be returned. The final constructor requires two dates and returns the results of all measurements between all nodes falling into this time period. Aggregation is currently done by the user as he sees fit.

After the query object has been created, its query() method needs to be called. This retrieves an instance of the Statistic Client singleton, performs the query and returns the

result as a NetworkStatisticsMatrix object which can then be used to access the query results.

Appendix A - Kelvin In Action

Kelvin In Action is a web front-end for the H2 Database data manipulator (although it is designed to be easily extensible for displaying other data). It gives the user a web-based means of accessing the traffic reports collected by Kelvin by allowing him to generate NetworkMatrixStatisticQueries from his browser.

The query interface allows the user to specify the time frame for which the measurement data will be returned (the default is the latest measurements) and also to specify the aggregation method (sum, mean, max, min and so on) applied to the results. Aggregation is needed because the heat-map shows only a single square for each pair of nodes. The aggregated results are then displayed as a heat-map, showing the current hot-spots and cold-spots of the cluster. The interface also shows a list of the cluster's machines and their IPs. A hover-over tooltip allows the user to expand the heat-map in order to obtain additional information about the color-represented data transfer. The legend of the heat-map is highly configurable as well: the user can define the color scheme and the ranges of the legend. A screenshot illustrates further:

The Kelvin In Action WAR file is included in the LAWA CDH3 Hadoop release. Additionally, it can be found at: / username: lawa, password: thisislawa!
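The per-pair aggregation that the heat-map depends on can be sketched as follows: all measurements for the same (source, destination) pair are reduced to a single value using the chosen method. This is an illustration only; the `Measurement` type, the `Aggregation` enum and the string key format are assumptions of this sketch and do not reflect how Kelvin In Action is implemented.

```java
import java.util.ArrayList;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: reduces many measurements to one heat-map cell per node pair.
class HeatMapAggregationSketch {

    enum Aggregation { SUM, MEAN, MAX, MIN }

    static final class Measurement {
        final String source;
        final String destination;
        final double value;

        Measurement(String source, String destination, double value) {
            this.source = source;
            this.destination = destination;
            this.value = value;
        }
    }

    /** Returns one aggregated value per "source->destination" pair. */
    static Map<String, Double> aggregate(List<Measurement> measurements, Aggregation how) {
        Map<String, DoubleSummaryStatistics> perPair = new TreeMap<>();
        for (Measurement m : measurements) {
            String pair = m.source + "->" + m.destination;
            perPair.computeIfAbsent(pair, k -> new DoubleSummaryStatistics()).accept(m.value);
        }
        Map<String, Double> cells = new TreeMap<>();
        for (Map.Entry<String, DoubleSummaryStatistics> entry : perPair.entrySet()) {
            cells.put(entry.getKey(), pick(entry.getValue(), how));
        }
        return cells;
    }

    private static double pick(DoubleSummaryStatistics stats, Aggregation how) {
        switch (how) {
            case SUM:
                return stats.getSum();
            case MEAN:
                return stats.getAverage();
            case MAX:
                return stats.getMax();
            default:
                return stats.getMin();
        }
    }

    public static void main(String[] args) {
        List<Measurement> measurements = new ArrayList<>();
        measurements.add(new Measurement("node-a", "node-b", 10));
        measurements.add(new Measurement("node-a", "node-b", 30));
        measurements.add(new Measurement("node-b", "node-a", 5));
        System.out.println(aggregate(measurements, Aggregation.MEAN));
    }
}
```

Note that the pair is directional in this sketch: traffic from node-a to node-b and traffic from node-b to node-a land in different cells.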

Appendix B - Deployment Instructions:

To configure Hadoop Kelvin, you first need to configure your LAWA Hadoop cluster (as a regular Hadoop cluster). In a fashion similar to Hadoop itself, Hadoop Kelvin uses XML configuration files which are added to the Hadoop conf directory. This section describes the various configuration files and their possible parameters. Please note that parameters which are marked as mandatory must be specified, or the system will fail to load.

statistics-site.xml:

Each entry below gives the property name, followed by its description / valid values and whether it is mandatory.

server.hostname
    The host on which the Statistics Server is running. Mandatory: Yes.

aggregation.threshold
    The minimum number of traffic reports which are submitted to the Statistics Server at once. Mandatory: Yes.

sleep.cycles
    The number of sleep cycles done by the statistics client before submitting all the waiting reports even if their number does not reach the aggregation threshold. Mandatory: Yes.

sleep.cycle.duration
    The duration, in milliseconds, for which the thread sleeps in each sleep cycle. Mandatory: Yes.

statistics.port
    The port on which the Statistics Server is listening for traffic reports and data queries. Mandatory: Yes.

statistics.webapp.port
    The port on which the Statistics Server web application is accessible. Mandatory: Yes.

server.statistics.kia.port
    The port on which the Kelvin In Action application is accessible. Mandatory: Yes.

request.timeout
    The timeout, in milliseconds, for HTTP transactions directed at the Statistics Server. Mandatory: Yes.

server.statistics.sub.url
    The sub-url of the Statistics Server on server.hostname. Mandatory: Yes.

server.statistics.webapp.sub.url
    The sub-url of the Statistics Server web application on server.hostname. Mandatory: Yes.

server.statistisics.kia.sub.url
    The sub-url of the Kelvin In Action application. Mandatory: Yes.

visualizer.war
    The path to the web application WAR file. Mandatory: Yes.

kelvin.in.action.war
    The path to Kelvin In Action's WAR file. Mandatory: Yes.

statistics.stores
    The list of fully-qualified Java class names of the data stores and data retrievers to be used by the Statistics Server. This is a comma-separated list. Mandatory: Yes.

mapreduce.fetcher.enable.repor
    Whether reporting from the Fetch phase is enabled. Fetching is the transmission of data from the Mappers to the Reducers during a Map-Reduce job. Mandatory: Yes.

hdfs.blockreader.enable.report
    Whether reporting of network flows which are HDFS reads is enabled. If enabled, any read from the HDFS will be monitored. Mandatory: Yes.

hdfs.blockreceiver.enable.report
    Whether reporting of HDFS writes is enabled. If enabled, any write to the HDFS will be monitored. Mandatory: Yes.

hdfs.blockreader.aggregation.factor
    The level of local aggregation between reports of HDFS reads. If set to X, then every X reports will be aggregated into a single report. This is used to reduce the report load in cases where many small reads are performed (as opposed to a single large read), which often occurs while processing text in a Map-Reduce job. Mandatory: Yes.

dbmanipulator-site.xml:

database.xml.definition
    The full path and file name of the database definitions file used for the DB manipulator. Mandatory: No. Required if H2DBManipulator is to be used with the system.

database.name
    The name of the database, as defined in the configuration file specified in database.xml.definition, to use for the DB manipulator. Mandatory: No. Required if H2DBManipulator is to be used with the system.

Database configuration file (the one referenced by database.xml.definition):

Each database used in the statistics package needs to be defined in a database configuration file. Each database definition consists of four properties:

name
    The name by which the database will be referred to in the code via the get method. If no name is supplied, the default name will be the full path supplied. Mandatory: No.

path
    The full path to the database file. Mandatory: Yes.

username
    The username to access the database, if needed. Mandatory: No.

Password
    The password to access the database, if needed. Mandatory: No.

Example Hadoop Kelvin configuration files are included with the Hadoop LAWA release, in the conf directory.
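The interplay of aggregation.threshold, sleep.cycles and sleep.cycle.duration in statistics-site.xml can be read as a batching rule: waiting reports are submitted once enough of them have accumulated, or once they have waited for the configured number of sleep cycles. The sketch below is one possible interpretation of that rule, not the actual client logic, which this document does not show. The `Batcher` class is hypothetical, and `onCycleEnd` stands for whatever the client does after sleeping for sleep.cycle.duration milliseconds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of threshold-or-timeout batching; not the real Statistics Client.
class ReportBatchingSketch {

    static final class Batcher {
        private final int aggregationThreshold; // cf. aggregation.threshold
        private final int sleepCycles;          // cf. sleep.cycles
        private final List<String> waiting = new ArrayList<>();
        private int cyclesWaited = 0;

        Batcher(int aggregationThreshold, int sleepCycles) {
            this.aggregationThreshold = aggregationThreshold;
            this.sleepCycles = sleepCycles;
        }

        void add(String report) {
            waiting.add(report);
        }

        /**
         * Called once after each sleep cycle. Returns the batch to submit,
         * or an empty list if the client should keep waiting.
         */
        List<String> onCycleEnd() {
            if (waiting.isEmpty()) {
                cyclesWaited = 0;
                return Collections.emptyList();
            }
            cyclesWaited++;
            boolean thresholdReached = waiting.size() >= aggregationThreshold;
            boolean waitedLongEnough = cyclesWaited >= sleepCycles;
            if (!thresholdReached && !waitedLongEnough) {
                return Collections.emptyList();
            }
            List<String> batch = new ArrayList<>(waiting);
            waiting.clear();
            cyclesWaited = 0;
            return batch;
        }
    }

    public static void main(String[] args) {
        Batcher batcher = new Batcher(3, 2);
        batcher.add("report-1");
        System.out.println("after cycle 1: " + batcher.onCycleEnd());
        System.out.println("after cycle 2: " + batcher.onCycleEnd());
    }
}
```

Under this reading, a higher threshold reduces the number of HTTP submissions, while the sleep-cycle limit bounds how stale a waiting report can become.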

Starting the Statistics Server:

After configuring your Hadoop cluster, you need to start the Hadoop Kelvin Statistics Server. This is done by executing the regular hadoop script (located in the bin subdirectory of your Hadoop location) with the stats parameter (bin/hadoop stats). The Hadoop cluster itself will function even without the Statistics Server running, and the Statistics Server can be left running while you take the cluster itself down (this can be used to leave its data accessible even during maintenance periods). Once the Statistics Server is running, it will begin receiving traffic reports from all cluster nodes as data is moved around, according to the reporting points which are enabled in the configuration.

Accessing the Web Frontend:

Similar to the HDFS and Map-Reduce web front-ends, Hadoop Kelvin's information storage can be accessed via a regular browser. Just point your browser to the following URL:

Where <server.hostname> and <statistics.webapp.port> are the values you've specified in the configuration.