A Survey Study on Monitoring Service for Grid

Erkang You
erkyou@indiana.edu

ABSTRACT

A grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide unified and coherent access to distributed computing, data storage and analysis, instruments, and other resources to advance scientific exploration. Monitoring services are of great importance to a grid system. Studying monitoring services can help us understand performance limitations, guide the deployment of the system, and support the evaluation of future development work, performance prediction, and job scheduling. Several monitoring systems have already been proposed, and some have been put into production environments. This paper describes a set of such monitoring systems and their major components. In addition, the common publish/subscribe architecture and the broker mechanism are discussed in detail.

1. Introduction

Grid systems require and depend on monitoring services to support the discovery, monitoring, and management of distributed resources for a variety of tasks. For example, a system administrator may want to be notified of any unusual network or CPU activity, a user may want to determine the best platform on which to run an application, a client program may want to collect a stream of data to help steer an application, and a system architect may want to analyze the performance of the system and locate bottlenecks. Because of this range of application requirements, it is helpful to study the existing designs of monitoring systems. Moreover, monitoring a grid requires collecting and analyzing performance information. Publish/subscribe systems are often introduced to solve this problem, which in turn brings the broker mechanism to our attention.
For our term project, we will implement a monitoring system for a grid using publish/subscribe and brokering. There are many different approaches to distributed monitoring. For example, MonALISA [1] provides a centralized, systems-level view of Grid resources by aggregating and displaying data from existing cluster or system administration tools. Inca [2] provides user-level Grid monitoring through periodic, automated user-level testing of the software and services required to support Grid operation. Other systems use publish/subscribe (pub/sub) solutions [3-5] to implement Grid monitoring. In a pub/sub system, publishers publish data and subscribers receive the data they are interested in, and the two work independently; data are discovered through the middleware. Because the information providers and consumers in a monitoring system are themselves distributed, pub/sub systems are well suited to distributed monitoring [6].

This paper is organized as follows: Section 2 describes a set of existing monitoring systems, including a messaging-based approach built on JMS and NaradaBrokering, and discusses their advantages and disadvantages; Section 3 discusses the general idea of publish/subscribe systems and the broker mechanism; Section 4 summarizes the paper; and Section 5 discusses potential future work, including the term project.

2. Monitoring Services

2.1 Grid Monitoring Architecture

The Grid Monitoring Architecture (GMA) [3] was proposed by the Global Grid Forum (GGF) to facilitate the development of interoperable, high-performance monitoring middleware. GMA consists of three types of components, as shown in Figure 1:

Directory Service: supports information publication and discovery.
Producer: makes performance data available (performance event source).
Consumer: receives performance data (performance event sink).

Figure 1: Grid Monitoring Architecture Components (adapted from [3])

The GMA is designed to handle performance data transmitted as time-stamped events. An event is a typed collection of data with a specific structure defined by an event schema. Performance event data are always sent directly from a producer to a consumer. The GMA supports both a streaming publish/subscribe model and a query/response model. In both models, producers or consumers that accept connections publish their existence in the directory service. Consumers can use the directory service to discover producers of interest, and producers can use it to discover consumers of interest. Either a producer or a consumer may initiate the connection to a discovered component. Control messages and performance data are then exchanged directly between each consumer/producer pair, without further involvement of the directory service.

The Relational Grid Monitoring Architecture (R-GMA) [5] is an implementation of GMA. R-GMA presents a large virtual database (Figure 2) that looks and operates like a conventional relational database and supports a subset of standard SQL: data are published with SQL INSERT statements and queried with SQL SELECT statements. Unlike a conventional relational database, however, the virtual database has no central storage; the data are distributed across the network. Data are discovered through a registry and a schema, and producers and consumers register their addresses in the registry. Data always pass through a producer and a consumer on their way to the destination, and data transfer between a consumer and the destination is query/response only. R-GMA conforms to the Web Services Architecture: it uses SOAP messaging over HTTP/HTTPS and Java Servlet technology for request/response exchanges (data streaming is implemented by a more efficient mechanism). R-GMA APIs are available in Java, C, C++, and Python.

Figure 2: R-GMA virtual database (adapted from [6])

2.2 Globus Toolkit's Monitoring and Discovery Service

The Globus Toolkit uses the Monitoring and Discovery Service (MDS) [7, 8] as its grid monitoring service. MDS is an extensible framework for managing static and dynamic information about the status of a computational Grid and all its components: networks, compute nodes, storage systems, instruments, and so on. MDS is built on top of the Lightweight Directory Access Protocol (LDAP). It is primarily used to address the resource selection problem: how does a user identify the nodes on which to run an application?
To make this decision, the user needs access to performance, hardware, and software information about the nodes. MDS provides this information through a standard mechanism for publishing and discovering resource status and configuration information. The information is collected by low-level information providers, which interact with MDS through a uniform, flexible interface. Because MDS has a decentralized structure, it scales well while handling static or dynamic data about resources, queues, and so on. MDS is equipped with a security mechanism called GSI (Grid Security Infrastructure).

MDS has a hierarchical structure (see Figure 3) consisting of three main components. A Grid Index Information Service (GIIS) provides an aggregate directory of lower-level data. A Grid Resource Information Service (GRIS) runs on a resource and acts as a modular content gateway for that resource. Information Providers (IPs) interface with a data collection service and pass the information on to a GRIS. Each service registers with the others using a soft-state protocol, which allows dead resources to be cleaned up dynamically. Each level also caches data to avoid retransferring information that is not yet stale and to reduce network overhead.

Figure 3: The MDS architecture (adapted from [9])

2.3 Inca 2

Inca 2 [2] is a system that provides user-level monitoring of Grid functionality and performance. It was designed to be general, flexible, scalable, and secure, as well as easy to deploy and maintain. Inca benefits Grid operators who oversee the day-to-day operation of a Grid, system administrators who provide and manage resources, and users who run applications on a Grid. The Inca system collects a wide variety of user-level monitoring results, ranging from simple test data to more complex performance benchmark output. It captures the context of a test or benchmark as it executes, so that system administrators have enough information to understand the result and troubleshoot system problems without having to know the internals of Inca. Inca 2 makes it easy to write tests or benchmarks and deploy them into Inca installations, and it provides means for sharing tests and benchmarks among Inca users. It can also be easily adapted to new resources and monitoring requirements, which facilitates maintenance of a running Inca deployment. Inca 2 stores and archives monitoring results (especially error messages) so that the behavior of a Grid can be understood over time, and the results are available through a flexible querying interface. Inca 2 also provides some level of security by managing short-term proxies for testing Grid services with MyProxy [10]. Finally, the system impact of tests and benchmarks executing on the monitored resources is measured and analyzed, so that their execution frequency can be tuned to reduce the impact on resources as needed.
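The reporter/depot data flow described above can be sketched in a few lines of Python. This is a minimal, single-process illustration under stated assumptions, not Inca's implementation: in Inca 2 the reporters, reporter managers, and depot are separate components, and every name used here (`Result`, `run_reporter`, `Depot`, the reporter names, and `siteA`) is hypothetical. The sketch shows two ideas from the text: each result keeps the context in which its test ran, and failures are archived as ordinary results so that they can be queried later.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Result:
    reporter: str                                # name of the user-level test or benchmark
    resource: str                                # Grid resource the reporter ran on
    timestamp: float                             # when the reporter ran
    ok: bool                                     # True if the test passed
    context: dict = field(default_factory=dict)  # execution context for troubleshooting
    error: str = ""                              # error message captured on failure

def run_reporter(name: str, resource: str, test: Callable[[], dict],
                 now: Optional[float] = None) -> Result:
    """Hypothetical reporter: run one test and record its outcome with its context."""
    stamp = time.time() if now is None else now
    context = {"reporter": name, "resource": resource}
    try:
        context.update(test())  # the test returns its own measurements
        return Result(name, resource, stamp, True, context)
    except Exception as exc:    # a failure becomes a result, not a crash
        return Result(name, resource, stamp, False, context, str(exc))

class Depot:
    """Hypothetical archive that keeps every result for later queries."""

    def __init__(self) -> None:
        self._archive: List[Result] = []

    def store(self, result: Result) -> None:
        self._archive.append(result)

    def query(self, resource: Optional[str] = None,
              failed_only: bool = False) -> List[Result]:
        """Return archived results, in arrival order, that match the filters."""
        return [r for r in self._archive
                if (resource is None or r.resource == resource)
                and (not failed_only or not r.ok)]

def broken_transfer() -> dict:
    raise RuntimeError("connection refused")

depot = Depot()
depot.store(run_reporter("version.check", "siteA", lambda: {"version": "4.0"}, now=0.0))
depot.store(run_reporter("file.transfer", "siteA", broken_transfer, now=60.0))
failures = depot.query(failed_only=True)
print([(r.reporter, r.error) for r in failures])
# [('file.transfer', 'connection refused')]
```

Keeping failed runs as records instead of raising them is what makes a later question such as "which tests failed on siteA, and when?" answerable from the archive alone.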

Figure 4 shows the architecture of Inca 2, which incorporates three core components (highlighted in the figure): the agent, the depot, and the reporter manager. The agent and the reporter managers coordinate the execution of tests and performance measurements on the Grid resources, and the depot stores and archives the results. The inputs to Inca 2 are one or more reporter repositories, which contain user-level tests and benchmarks (called reporters), and a configuration file describing how to execute them on the Grid resources. This configuration is normally created with an administration GUI tool called incat (Inca Administration Tool). The results collected from the resources are queried by data consumers and displayed to users.

The following steps describe how an Inca administrator deploys user-level tests and/or performance measurements to their resources:

(1) The administrator either writes reporters to monitor the user-level functionality and performance of their Grid or uses existing reporters from a published repository.
(2) The administrator uses incat to create a deployment configuration file that describes the user-level monitoring for their Grid, and submits it to the agent.
(3) The agent fetches reporters from the reporter repository, creates a reporter manager on each resource, and sends the reporters and instructions for executing them to each reporter manager.
(4) Each reporter manager executes its reporters according to its schedule and sends the data to the depot.
(5) Data consumers display the collected data by querying the depot.

Figure 4: Inca architecture (adapted from [2])

2.4 JMS and NaradaBrokering

Java Message Service (JMS) [11] is a widely adopted industry standard that aims to simplify the use of Message Oriented Middleware (MOM) by applications. JMS is a set of Java APIs that lets programmers send and receive messages via MOM in a uniform, vendor-neutral way, regardless of the actual underlying middleware. A central notion in JMS is the destination, of which there are two kinds: queues and topics. JMS uses messages to transport data and supports two dissemination modes: point-to-point (PTP), in which each message sent to a queue is consumed by a single receiver, and publish/subscribe, in which each message published to a topic is delivered to every subscriber of that topic. Both synchronous and asynchronous delivery are supported. In the synchronous mode, the subscriber either polls for the next message or blocks until it arrives. In the asynchronous mode, the subscriber registers a listener object, and the JMS provider delivers each message by invoking the listener's onMessage method (a callback).

NaradaBrokering [4] is an open-source, distributed messaging middleware designed to fully support JMS. It can handle SOAP messages, JMS messages, and complex events. Both the PTP and pub/sub dissemination modes are implemented in NaradaBrokering, and both synchronous and asynchronous transfer are supported. NaradaBrokering can be used with a number of Web Services and Grid Services specifications, for example WS-Resource Framework, WS-Notification, and WS-Eventing. A Broker Network Map (BNM) is composed of several brokers (see Figure 5 for an example). A Broker Discovery Node (BDN) is a specialized node used to discover new brokers. By introducing a hierarchical layout into the BNM, NaradaBrokering can efficiently find the shortest route for delivering events to their destinations. It disseminates messages quickly and has been used successfully for audio/video teleconferencing. NaradaBrokering supports a number of underlying transport protocols, including blocking and non-blocking TCP, UDP, multicast, SSL, HTTP, HTTPS, and parallel TCP streams.
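To make the two JMS destination types and the two delivery modes concrete, here is a plain-Python analogue. It is not the `javax.jms` API (whose counterparts are `MessageConsumer.receive()`, `MessageConsumer.receiveNoWait()`, and a `MessageListener` with `onMessage()`); the class and method names below are hypothetical, and everything runs in a single process. A queue hands each message to exactly one receiver, while a topic copies each message to every subscriber, which either polls its own inbox (synchronous) or is called back (asynchronous).

```python
import queue
from typing import Any, Callable, List, Optional

class PtpQueue:
    """Point-to-point destination: each message is consumed by exactly one receiver."""

    def __init__(self) -> None:
        self._q = queue.Queue()  # thread-safe FIFO of pending messages

    def send(self, msg: Any) -> None:
        self._q.put(msg)

    def receive(self, timeout: Optional[float] = None) -> Any:
        """Synchronous receive: block for the next message; None if the timeout expires."""
        try:
            return self._q.get(timeout=timeout)
        except queue.Empty:
            return None

    def receive_no_wait(self) -> Any:
        """Synchronous poll: return a waiting message, or None if there is none."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

class Topic:
    """Pub/sub destination: each message is delivered to every subscriber."""

    def __init__(self) -> None:
        self._inboxes: List[PtpQueue] = []                 # synchronous subscribers
        self._listeners: List[Callable[[Any], None]] = []  # asynchronous subscribers

    def subscribe(self) -> PtpQueue:
        """Register a synchronous subscriber with its own private inbox."""
        inbox = PtpQueue()
        self._inboxes.append(inbox)
        return inbox

    def add_listener(self, callback: Callable[[Any], None]) -> None:
        """Register an asynchronous subscriber as a callback."""
        self._listeners.append(callback)

    def publish(self, msg: Any) -> None:
        for inbox in self._inboxes:
            inbox.send(msg)  # kept until the subscriber polls or waits for it
        for callback in self._listeners:
            callback(msg)    # pushed immediately (on the publisher's thread in this sketch)

jobs = PtpQueue()
jobs.send("job-1")
print(jobs.receive_no_wait(), jobs.receive_no_wait())  # job-1 None

cpu = Topic()
poller = cpu.subscribe()
pushed: List[Any] = []
cpu.add_listener(pushed.append)
cpu.publish("load=0.93")
print(poller.receive(timeout=1.0), pushed)  # load=0.93 ['load=0.93']
```

Calling listeners on the publisher's thread keeps the sketch short; a real JMS provider invokes `onMessage()` on its own delivery thread, so a slow listener does not hold up the publisher.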
Figure 5: NaradaBrokering Network Map (adapted from [6])

Although neither JMS nor NaradaBrokering is itself a monitoring system for a grid, together they can readily be used to build a monitoring service for a grid system [6]. In that scenario, the publishers are the low-level data collectors, which publish performance data at fixed time intervals. Subscribers, on the other hand, are users who are interested in particular topics about the performance of the system. NaradaBrokering provides the messaging middleware that delivers the published messages to the corresponding subscribers.
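The monitoring scenario above can be sketched as follows, assuming a minimal in-process broker in place of NaradaBrokering. Topic names such as `nodeA/cpu`, the event fields, and the 0.9 CPU-load threshold are illustrative assumptions, and the collector uses a simulated clock instead of sleeping for the interval so that the example is deterministic. Publishers and subscribers share only topic names; neither holds a reference to the other.

```python
from typing import Callable, Dict, List

# A subscriber callback receives the topic name and the event payload.
Callback = Callable[[str, dict], None]

class Broker:
    """Minimal in-process stand-in for messaging middleware (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = {}

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver to every subscriber of this topic; topics nobody wants are dropped.
        for callback in self._subscribers.get(topic, []):
            callback(topic, event)

def run_collector(broker: Broker, host: str, samples: List[float], interval: int) -> None:
    """Low-level collector: publish one CPU-load sample per fixed interval."""
    for i, load in enumerate(samples):
        # Simulated clock: the i-th sample is stamped i * interval seconds.
        broker.publish(f"{host}/cpu", {"time": i * interval, "load": load})

alerts: List[tuple] = []
history: List[float] = []

def cpu_alarm(topic: str, event: dict, threshold: float = 0.9) -> None:
    """Subscriber that records unusually high CPU load."""
    if event["load"] > threshold:
        alerts.append((topic, event["time"]))

broker = Broker()
broker.subscribe("nodeA/cpu", cpu_alarm)
broker.subscribe("nodeA/cpu", lambda topic, event: history.append(event["load"]))
run_collector(broker, "nodeA", [0.2, 0.95, 0.5], interval=10)
run_collector(broker, "nodeB", [0.99], interval=10)  # nobody subscribes to nodeB/cpu
print(alerts)   # [('nodeA/cpu', 10)]
print(history)  # [0.2, 0.95, 0.5]
```

A subscriber like `cpu_alarm` covers the administrator use case from the introduction (being notified of unusual CPU usage), while a second subscriber on the same topic simply records history; adding it required no change to the collector.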

3. Publish/Subscribe and Broker

As Section 2 shows, pub/sub systems and brokers play an important role in providing monitoring services for a grid system. Hence, we briefly discuss these topics in this section.

3.1 Publish/Subscribe Approach

A publish/subscribe (pub/sub) system is a many-to-many data dissemination system. Publishers publish data and subscribers receive the data they are interested in. Publishers and subscribers are independent and need to know nothing about each other; the middleware delivers the data to its destination. The middleware does more than forward data from source to destination: it provides advanced functions such as data discovery, dissemination, filtering, persistence, and reliability. Data are discovered through the middleware and can be transferred either directly from publisher to subscriber or via a broker. Subscribers can be notified automatically when new data become available. Compared with the traditional centralized client/server communication model, a pub/sub system is asynchronous and is usually distributed and scalable.

3.2 Broker

A broker is the smallest unit of messaging middleware, which provides message forwarding and the other services described in the previous section. Brokers can process and route messages intelligently while working with multiple underlying communication protocols. Broker-based messaging middleware should usually meet the following requirements: scalability, efficient dissemination, guaranteed delivery mechanisms, location independence, support for interactions between nodes, interoperability with other messaging clients, communication through proxies and firewalls, an extensible transport framework, the ability to monitor the performance of the system, and a security infrastructure.

4. Summary

In this paper, we described four different monitoring services for grid systems.
These different approaches to implementing monitoring services help us better understand the architecture of the grid and some useful designs for monitoring services. Looking at the monitoring systems in more detail, we noticed that many of them support pub/sub messaging. We therefore also discussed the pub/sub messaging mechanism, together with brokers, which provide the messaging middleware between publishers and subscribers. Through this survey study, we gained a better understanding of distributed systems, as well as more insight into the term project we are working on.

5. Future Work

With the knowledge gained from this survey study, we will move on to implement a pub/sub-based performance monitoring service on FutureGrid. A broker will be used as the messaging middleware, which will spare us from dealing directly with the complicated communication between publishers and subscribers. A simple GUI will be provided to display the monitoring data graphically to users.

REFERENCES

[1] I. Legrand, et al., "MonALISA: A Distributed Monitoring Service Architecture," Computer Physics Communications, vol. 180, pp. 2472-2498, 2009.
[2] S. Smallen, et al., "User-level Grid Monitoring with Inca 2," presented at the GMW, Monterey, California, USA, 2007.
[3] R. Aydt, et al., "A Grid Monitoring Architecture," presented at the GWD, 2002.
[4] S. Pallickara and G. Fox, "NaradaBrokering: a distributed middleware framework and architecture for enabling durable peer-to-peer grids," in Middleware, 2003.
[5] A. W. Cooke, et al., "The Relational Grid Monitoring Architecture: Mediating Information about the Grid," Journal of Grid Computing, vol. 2, pp. 323-339, 2004.
[6] C. Huang, et al., "A Study of Publish/Subscribe Systems for Real-Time Grid Monitoring," presented at the IEEE International Parallel and Distributed Processing Symposium, Long Beach, CA, USA, 2007.
[7] K. Czajkowski, et al., "Grid Information Services for Distributed Resource Sharing," in IEEE International Symposium on High Performance Distributed Computing, San Francisco, California, 2001.
[8] MDS. Available: http://www.globus.org/mds/
[9] X. Zhang, et al., "A Performance Study of Monitoring and Information Services for Distributed Systems," presented at the IEEE International Symposium on High Performance Distributed Computing, Seattle, Washington, 2003.
[10] MyProxy Credential Management Service, 2007. Available: http://myproxy.ncsa.uiuc.edu
[11] Java Message Service. Available: http://java.sun.com/products/jms/