A Monitoring Sensor Management System for Grid Environments

Brian Tierney, Brian Crowley, Dan Gunter, Mason Holding, Jason Lee, Mary Thompson
Computing Sciences Directorate, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA

Abstract

Large distributed systems such as Computational Grids require that a large amount of monitoring data be collected for a variety of tasks, such as fault detection, performance analysis, performance tuning, performance prediction, and scheduling. Ensuring that all necessary monitoring is turned on and that data is being collected can be a very tedious and error-prone task. We have developed an agent-based system to automate the execution of monitoring sensors and the collection of event data.

1.0 Introduction

The ability to monitor and manage distributed computing components is critical for enabling high-performance distributed computing. Monitoring data is needed to determine the source of performance problems and to tune the system for better performance. Fault detection and recovery mechanisms need monitoring data to determine if a server is down, and whether to restart the server or redirect service requests elsewhere. A performance prediction service might use monitoring data as inputs for a prediction model, which would in turn be used by a scheduler to determine which resources to use. As distributed systems such as Computational Grids become bigger, more complex, and more widely distributed, it becomes important that this monitoring and management be automated. Our approach to this problem is to use a collection of software agents as an event management system designed for use in a Grid environment. For this discussion, an agent is an autonomous, adaptable entity that is capable of monitoring and managing distributed system components. Our system is based on a collection of such agents, which can be updated or reconfigured on the fly.
Agents work with events; we use the term event here to mean time-stamped data about the state of any system component, such as current CPU usage, or whether a server is no longer running. Agents perform a number of tasks. They can securely start monitoring programs on any host and manage their resulting data. They can provide a standard interface to host monitoring sensors for CPU load, interrupt rate, TCP retransmissions, TCP window size, and so on. Agents can also independently perform various administrative tasks, such as restarting servers or monitoring processes. This agent-based monitoring architecture is sufficiently general that it could be adapted for use in distributed environments other than the Grid. For example, it could be used in large compute farms or clusters that require constant monitoring to ensure all nodes are running correctly.

2.0 JAMM Monitoring System

The automated agent-based architecture that we have developed is called Java Agents for Monitoring and Management (JAMM). The agents, whose implementation is based on Java and Java Remote Method Invocation (RMI), can be used to launch a wide range of system and network monitoring tools, and then to extract, summarize, and publish the results. The JAMM system is designed to facilitate the execution of monitoring programs, such as netstat, iostat, and vmstat, by triggering or adapting their execution based on actual client usage. On-demand monitoring reduces the total amount of data collected, which in turn simplifies data management. For example, an ftp client connecting to an ftp server could automatically trigger netstat and vmstat monitoring on both the client and server for the duration of the ftp connection. Application activity is detected by a port monitor agent running on the client and server hosts, which monitors traffic on a configurable set of ports. We use the JAMM system to generate monitoring data for analysis by NetLogger, a toolkit for performance analysis of distributed systems.
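The notion of an event used throughout this paper, a time-stamped datum about the state of some system component, can be sketched as a small record type. This is a minimal Python illustration only (JAMM itself is Java/RMI-based, and the field names here are assumptions, not the actual JAMM event schema):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Event:
    """A time-stamped datum about the state of a system component."""
    name: str            # e.g. "cpu.load" or "server.status" (illustrative names)
    value: object        # the measured value or reported status
    host: str            # host the measurement refers to
    timestamp: float = field(default_factory=time.time)

# Example events of the two kinds mentioned above: a measurement
# (current CPU usage) and a fault condition (a server no longer running).
cpu_event = Event("cpu.load", 0.42, "dpss1.lbl.gov")
fault_event = Event("server.status", "not running", "dpss2.lbl.gov")
```

Both measurements and fault conditions fit the same shape, which is what lets a single event pipeline carry data for performance analysis and for fault detection alike.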
When performing a NetLogger analysis of a complex distributed system such as a Computational Grid, activating all desired host and network monitoring can be a very tedious task, and may
require an account on the server host. The JAMM system greatly facilitates this task.

2.1 JAMM Architecture

The JAMM architecture uses a producer/consumer model, similar to several existing event service systems, such as the CORBA Event Service. JAMM sensors (producers) publish their existence in a directory service. Clients (consumers) look up the sensors in the directory service, and then subscribe (i.e., connect) to sensor data via an event gateway. The JAMM system consists of the following components, shown in Figure 1, and described in detail below: sensors, sensor managers, event gateways, a directory service, event consumers, and event archives.

[Figure 1: JAMM Components]

There are many existing systems with an event model similar to the JAMM model. The CORBA Event Service has a rich set of features, including the ability to push or pull, and the ability for the consumer to pass a filter to the event supplier. However, we believe the JAMM architecture is much more scalable, as described in section 2.3 below. JINI has a Distributed Event Specification, which is a simple specification for how an object in one Java virtual machine (JVM) registers interest in an event occurring in an object in some other JVM, and how it receives notification when that event occurs. There are several systems with alternative event models, such as the Common Component Architecture; many of these are summarized in Peng's survey on event services. JAMM's goals are also similar to some SNMP-based monitoring packages such as HP OpenView or SGI's Performance Co-Pilot. However, SNMP is designed for managing and monitoring network-attached devices and hosts, not distributed systems.
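The publish/lookup/subscribe flow just described can be sketched as follows. This is a hypothetical in-memory illustration: all class and method names are invented for this sketch, a real JAMM deployment publishes into LDAP and delivers events through an RMI-based gateway.

```python
class Directory:
    """Stand-in for the directory service (LDAP in the real system)."""
    def __init__(self):
        self._entries = {}          # sensor name -> event gateway

    def publish(self, sensor_name, gateway):
        self._entries[sensor_name] = gateway

    def lookup(self, sensor_name):
        return self._entries.get(sensor_name)

class Gateway:
    """Stand-in for an event gateway."""
    def __init__(self):
        self._subscribers = {}      # sensor name -> list of callbacks

    def subscribe(self, sensor_name, callback):
        self._subscribers.setdefault(sensor_name, []).append(callback)

    def deliver(self, sensor_name, event):
        # Event data flows producer -> gateway -> consumer; nothing is
        # delivered unless some consumer has subscribed.
        for callback in self._subscribers.get(sensor_name, []):
            callback(event)

# A sensor publishes its location; a consumer finds it and subscribes.
directory = Directory()
gw = Gateway()
directory.publish("hostA/cpu", gw)

received = []
directory.lookup("hostA/cpu").subscribe("hostA/cpu", received.append)
gw.deliver("hostA/cpu", {"name": "cpu.load", "value": 0.1})
```

The key structural point is that only publication (location) data goes through the central directory; event data itself flows directly from the producer's gateway to the subscribed consumer.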
Integrating application monitoring and arbitrary sensors into an SNMP-based system would be quite difficult. We believe that none of the previously described systems is a perfect match for a Grid monitoring system. Therefore we have tried to combine the relevant strengths of each. Other related systems include the Pablo system, which has had the notion of application sensors for several years. Wolski et al. also describe an architecture with characteristics similar to the service described here.

2.2 JAMM Components

sensors: The JAMM system is designed to control the execution of sensors. A sensor is any program which generates a time-stamped performance monitoring event. For example, we have sensors that monitor CPU usage, memory usage, and network usage. Sensors are also used to monitor error conditions, such as a server process crashing, or CRC errors on a router. We have implemented the following types of sensors:

host sensors: These sensors perform host monitoring tasks, such as monitoring CPU load, available memory, or TCP retransmissions. The monitoring can be configured to run all the time, or be triggered by detecting network activity on a specified port. Host sensors may be layered on top of SNMP-based tools, and therefore run remotely from the host being monitored.

network sensors: These sensors perform SNMP queries to a network device, typically a router or switch. Information on which device statistics are being monitored is published in the directory service.

process sensors: Process sensors generate events when there is a change in process status (for example, when a process starts, dies normally, or dies abnormally). They might also generate an event if some dynamic threshold is reached (for example, if the average number of users over a certain time period exceeds a given threshold).

application sensors: Autonomous sensors can also be embedded inside of applications.
These sensors might generate events if a static threshold is reached (for example, if the number of locks taken exceeds a threshold), upon user connect/disconnect or change of password, upon receipt of a UNIX signal, or upon any other user-defined event. Application sensors can also be used to collect detailed monitoring data about the application to be used for performance analysis. These
types of sensors would not be directly under JAMM control, but could still feed their results to the JAMM system.

sensor manager: The sensor manager agent is responsible for starting and stopping the sensors, and for keeping the sensor directory up to date. Sensors to be run are specified by a configuration file, which may be local or on a remote HTTP server. Sensors can be configured to run always, when requested by a sensor manager GUI, or when requested by the port monitor agent. There is typically one sensor manager per host. An important component of the sensor manager is the port monitor agent. This agent monitors traffic on specified ports, and starts sensors only when network traffic on a port is detected. Using the port monitor agent, one is able to customize which sensors are run based on which applications are currently active, assuming that the applications use well-known ports. For example, we can turn on network monitoring when network-intensive applications are running, CPU monitoring when CPU-intensive applications run, and memory monitoring when memory-intensive applications run. This assumes that the applications get triggered via a message from the network, which is usually true for Grid applications.

event gateway: Event gateways are responsible for listening for requests from event consumers. Event gateways can service streaming or query requests from consumers. In streaming mode the consumer opens an event channel, and events are returned in a stream. In query mode the consumer does not open an event channel, but only requests the most recent event. Event gateways can accept several types of requests from consumers. The consumer may request all event data, or may ask only to be notified of certain types of events. For example, the netstat sensor may output the value of the TCP retransmission counter every second, but most consumers only want to be notified when the counter changes, and not every second.
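The change-only notification just described, collapsing a once-per-second counter stream into only the samples where the value changes, can be sketched in a few lines (a hypothetical illustration of the gateway's filtering, not JAMM code):

```python
def changes_only(samples):
    """Yield only samples whose value differs from the previous one.

    Sketch of gateway-side filtering: a sensor may emit the TCP
    retransmission counter every second, but most consumers only
    want to hear about changes.
    """
    last = object()          # sentinel distinct from any real value
    for value in samples:
        if value != last:
            yield value
            last = value

# Ten once-per-second counter readings collapse to just the changes:
readings = [0, 0, 0, 1, 1, 1, 1, 3, 3, 3]
filtered = list(changes_only(readings))
```

Applying this filter at the gateway, rather than at each consumer, means the raw per-second stream never has to leave the monitored host's gateway at all.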
A consumer can also request that an event be sent only if its value crosses a certain threshold. Examples of such a threshold would be if CPU load becomes greater than 50%, or if load changes by more than 20%. The event gateway can also be configured to compute summary data. For example, it can compute 1, 10, and 60 minute averages of CPU usage, and make this information available to consumers. The event gateways can also be used to provide access control to the sensors, allowing different access to different classes of users. Some sites may only allow internal access to real-time sensor streams, with only summary data being available off-site. These types of policies would be enforced by the gateways. This mechanism is especially important for monitoring clusters or compute farms, where there may be a large amount of internal monitoring, but only a limited amount of monitoring data accessible to the Grid.

directory service: The directory service is used to publish the location of all sensors and their associated gateways. This allows consumers to discover which sensors are currently active, and which gateway to contact to subscribe to a given sensor's output. Query-optimized directory services such as LDAP, the Globus MDS, the Legion Information Base, and Novell NDS all provide the necessary base functionality for this. We are currently using LDAP, because it is a simple, standard solution. LDAP servers can be hierarchical, with referrals to other LDAP servers which contain the directory service information for each site. LDAP also supports the notion of replicated servers, providing fault tolerance. Replication is critical to JAMM: otherwise, failure of the sensor directory server could take down the entire system. Current implementations of LDAP servers are optimized for read access, and do not work well in an environment with many updates. However, it is possible to use only the naming and communications portions of LDAP, without the underlying database.
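The two gateway computations described above, threshold-crossing notification and windowed summary averages, can be sketched as follows (a minimal illustration with hypothetical names; the real gateway would apply these per subscription, over timestamped events):

```python
from collections import deque

def crossings(values, threshold):
    """Indices at which a sampled series crosses the given threshold
    in either direction, e.g. CPU load passing 50%."""
    out = []
    for i in range(1, len(values)):
        if (values[i - 1] < threshold) != (values[i] < threshold):
            out.append(i)
    return out

class RollingAverage:
    """Summary statistic over the most recent `window` samples,
    e.g. a 10-minute CPU average from once-per-minute readings."""
    def __init__(self, window):
        self._samples = deque(maxlen=window)

    def add(self, value):
        self._samples.append(value)

    def value(self):
        return sum(self._samples) / len(self._samples)

loads = [0.30, 0.40, 0.80, 0.70, 0.20]
alerts = crossings(loads, 0.50)     # consumer asked for the 50% threshold
avg = RollingAverage(window=3)
for load in loads:
    avg.add(load)
```

Because the gateway holds the rolling window, a consumer that only wants summary data (as in the off-site access policy above) never needs to see the raw samples.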
For example, the Globus system uses its own optimized database underneath the LDAP communications protocol to improve the performance of updates. We are also interested in exploring the event notification service of LDAPv3 as soon as it is available. This service lets a client register interest in an entry (e.g., a given sensor running) with the LDAP server, and LDAP will notify the client when that entry becomes available or is updated.

event consumers: An event consumer is any program that requests data from a sensor. There are many possible types of consumers. The current JAMM system includes the following:

event collector: This consumer is used to collect monitoring data in real time for use by real-time analysis tools. It checks the directory service to see what data is available, and then subscribes, via the event gateway, to all the sensors it is interested in. The sensors then send the event data to the consumer as it is generated. Data from many sensors, as well as streams of data from application sensors, is then merged into a file for use by programs such as nlv, the NetLogger real-time event visualization tool.
archiver agent: This consumer is used to collect data for an archive service. It subscribes to the logging agents, collects the event data, and places it in the archive. It also creates an archive directory service entry indicating the contents of the archive.

process monitor: This consumer can be used to trigger an action based on an event from a server process. For example, it might run a script to restart the processes, notify a system administrator, call a pager, etc.

overview monitor: This consumer collects information from sensors on several hosts, and uses the combined information to make some decision that could not be made on the basis of data from only one host. For example, one may want to trigger a page to a system administrator at 2 A.M. only if both the primary and backup servers are down.

event archives: It is important to archive event data in order to provide the ability to do historical analysis of system performance, and to determine when and where changes occurred. While it may not be desirable to archive all monitoring data, it is necessary to archive a good sampling of both normal and abnormal system operation, so that when problems arise it is possible to compare the current system to a previously working system. Archives might also be used by performance prediction systems, such as the Network Weather Service (NWS). The JAMM architecture provides a flexible method for selecting what gets archived, because the archive is just another consumer. In some environments very little will be monitored, and in others it may be desirable to archive everything. With this approach the archive need not get in the way of any real-time monitoring.

2.3 Scalability Issues

One of the biggest issues in defining a monitoring architecture for use in a Grid environment is scalability. It is critical that the act of monitoring does not affect the systems being monitored.
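The overview-monitor decision described above, combining status events from several hosts into one judgment no single host could make, reduces to a simple rule over the latest events. A sketch (host names and status strings are hypothetical):

```python
def should_page(statuses, primary, backup):
    """Overview-monitor rule: wake someone at 2 A.M. only if BOTH the
    primary and the backup server are down.

    `statuses` maps host name -> latest status event for that host's
    server process ("up" or "down" in this sketch).
    """
    return statuses.get(primary) == "down" and statuses.get(backup) == "down"

# One host down is tolerable; both down is not.
one_down = {"srv1": "down", "srv2": "up"}
both_down = {"srv1": "down", "srv2": "down"}
```

The consumer gets each host's status via its own gateway subscription; only the combining logic lives in the overview monitor.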
In the JAMM model, one can add additional event gateways and additional sensor directories as needed, reducing the load where necessary. In the case where many consumers are requesting the same event data, the use of an event gateway reduces the amount of work on, and the amount of network traffic from, the host being monitored. An event gateway would typically be run on a separate host from the Grid resources, to ensure that the load from the gateway does not affect what is being monitored. In particular, we believe that JAMM is more scalable than the CORBA Event Service. In the JAMM architecture, event data is not sent anywhere unless it is requested by a consumer. Many of the current event service systems, including CORBA, send all event data to a central component, which consumers then contact. In the JAMM model, only sensor publication (location) data is sent to a central directory server. Event data goes directly from producer to consumer. We believe this model will scale much better in a Grid environment.

3.0 JAMM Implementation

The JAMM sensor managers, event gateways, and some of the consumers are implemented as Java "activatable" Remote Method Invocation (RMI) objects. RMI provides a number of features, including the ability to add, remove, or reconfigure sensors on the fly. RMI also facilitates code development by making the network communication transparent to the programmer. Activatable RMI objects can be loaded and run simply by invoking one of their methods, and will unload themselves automatically after a period of inactivity. RMI objects can be dynamically downloaded from an HTTP server every time the RMI daemon is restarted, making software updates trivial and easing code deployment. The use of Java also facilitates porting the agents to the large number of platforms in a heterogeneous Grid environment. JAMM event data can be in either XML format or ULM (Universal Logger Message), a simple ASCII-based format which is used by NetLogger.
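ULM-style messages have the general shape of whitespace-separated FIELD=value pairs on a single line. The sketch below shows a round trip through that shape; the specific field names (DATE, HOST, PROG, NL.EVNT, VAL) are illustrative of the NetLogger usage, not the normative ULM field set:

```python
def format_ulm(fields):
    """Serialize an event as whitespace-separated FIELD=value pairs,
    the general shape of a ULM message. Assumes simple values with no
    embedded whitespace."""
    return " ".join(f"{key}={value}" for key, value in fields.items())

def parse_ulm(line):
    """Inverse of format_ulm for such simple values."""
    return dict(pair.split("=", 1) for pair in line.split())

event = {
    "DATE": "20000510120000.123456",   # timestamp with sub-second precision
    "HOST": "dpss1.lbl.gov",
    "PROG": "vmstat",
    "NL.EVNT": "VMSTAT_SYS_TIME",      # event name, as plotted by nlv
    "VAL": "42",
}
line = format_ulm(event)
```

The appeal of such a flat ASCII format is exactly what the surrounding text notes: it is trivial to generate from shell-level sensors like vmstat, at the cost of parsing overhead for high-throughput streams.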
We are looking into adding a binary format option for high-throughput event data that cannot tolerate the parsing overhead of ASCII formats. We hope that the use of standard formats like XML will make it possible for JAMM to interoperate with other monitoring sensors on the Grid. We are also considering supplementing RMI with Sun's Java Management Extension (JMX). JMX has a good communication model and a rich set of SNMP tools, and requires less memory than our current RMI-based implementation. However, the only good implementation of JMX currently is a commercial product from Sun, JDMK, which may not be affordable in an academic environment.

4.0 JAMM Usage

We often use JAMM to collect monitoring data for use with the NetLogger Toolkit. NetLogger (short for Networked Application Logger) is designed to monitor, under actual operating conditions, the behavior of all the elements of the application-to-application communication path in order to determine exactly where time is spent within a complex system. Distributed application components are instrumented to perform precision time-stamping and event logging at every critical point (e.g., all I/O and any significant computational routines). These events are correlated with host and network monitoring to characterize the performance of all aspects of the system
in detail. NetLogger includes a library that makes it easy for distributed applications to log interesting events at every critical point. NetLogger is designed to facilitate identification of bottlenecks and to help with performance tuning. An example use of JAMM is shown in Figure 2. Monitoring data is collected at both the client and server host, and at all network routers between them. All event data is sent to a real-time monitor consumer for real-time visualization and NetLogger analysis. Server and router data is also sent to the archive.

[Figure 2: Sample JAMM Usage]

Adding new sensors to JAMM is quite simple. If the sensor is an RMI object, one just copies the Java object to an HTTP-accessible directory and edits the central configuration file. Every few minutes the sensor managers check for updates to the configuration file, and activate new sensors if necessary, publishing them in the sensor directory. If the sensor is not Java-based, it will have to be copied to each host by some other means. Converting existing monitoring tools to JAMM sensors is quite easy using the JAMM Perl sensor modules and the JAMM Java sensor class.

5.0 Results

As an example of how JAMM may be used to monitor a Grid application, we describe its use in the DARPA MATISSE project, an NGI project whose goal is to enable MEMS (micro-electro-mechanical systems) researchers to efficiently access, manipulate, and view high-resolution, high-frame-rate video data of MEMS devices remotely over the DARPA Supernet, an OC-48 (2.4 Gbits/second) NGI testbed network. A demonstration of the Matisse application was performed in May 2000 in Arlington, VA, the configuration of which is shown in Figure 3.
Data was stored on a Distributed Parallel Storage System (DPSS) at LBNL in Berkeley, CA. Data was transferred on demand across Supernet to a Linux compute cluster, which performed the data analysis and then sent the results to a visualization workstation.

[Figure 3: Matisse Application Environment]

JAMM was used to monitor all system components, and to verify that all required hardware and software was running properly. JAMM was also used to do real-time performance monitoring and analysis, and proved to be very helpful in tracking down some performance problems. The following sensors, shown in Figure 4, were used: CPU and memory sensors on every host, DPSS process monitors, clock synchronization monitors, TCP monitors (retransmits and window size), and network switch/router monitors. The TCP sensor we are using is a version of tcpdump modified to generate NetLogger events when it detects a TCP retransmission or a change in window size.

[Figure 4: Use of JAMM with the Matisse Application]

Performance from the point of view of the client was quite bursty. Sometimes images arrived at 6 frames/sec, and other times only 1-2 frames/sec. As is typical in this type of environment, it was not clear what the source of the problem was. Was the source of the burstiness the data server, the compute server, the receiving host, the network, or the viewing application? To do the performance analysis, an event collector agent was started, and sensors on all components in use by the application were located in the
directory service and subscribed to. This agent collected all the monitoring data, which was then used as input to nlv, the NetLogger visualization tool. nlv results are shown in Figure 5.

[Figure 5: NetLogger real-time analysis of JAMM-managed sensor data]

The graph shows time on the X axis and event names on the Y axis. The vertical sloping lines show key events from a trace of a data frame request through the entire system. The greater the slope of the line, the more time a particular event took. The graph also shows the results of the CPU, memory, and TCP sensors. Note the correlation between the TCP retransmit and the large gap with no data being received by the application. Also of interest is the high level of system CPU usage on the receiving host, shown by the line for event VMSTAT_SYS_TIME. From this we were able to narrow down the problem to either the network or the receiving host. The next question was why there were TCP retransmissions. SNMP errors on the end switches and routers were also monitored by JAMM, but no errors were reported. This, along with the high system CPU load, pointed us to the receiving host as the probable source of the problems. The client was reading data from four DPSS servers, so we next used the Iperf network performance test tool to compare TCP performance of a single TCP input stream versus four parallel streams. To our surprise the aggregate throughput for four streams was only 30 Mbits/sec, compared to 140 Mbits/sec for a single stream. This again pointed to the receiving host as the source of the problem. By using a single DPSS server instead of four servers (and thus one data socket instead of four), we were able to increase the throughput to 140 Mbits/sec.
The system CPU load with only one data socket was much lower as well. We are not yet certain why the performance using four sockets is so much worse than using one socket, but we believe it has something to do with the amount of load the Gigabit Ethernet card and device driver place on the system. Interestingly, this behavior is only observed with wide-area transfers; LAN throughput for both one and four data streams is 200 Mbits/second. We plan to perform more experiments to track down the cause of this behavior. Note that this type of analysis could have been done without the use of JAMM, but it would have been much more difficult. One would need to have an account on every system, with superuser privileges (to run the tcpdump sensor), log into every system (13 in this example) and start every sensor by hand, and then copy the results to one place for analysis. Clearly this is more work than most application users are willing to do. Using JAMM, all that is required is for the application user to start up a consumer and subscribe to the relevant sensor data. JAMM also simplifies the job of the system administrator who must configure all the necessary sensors. The NetLogger tools make analysis of this data quite straightforward.
6.0 Current Status and Future Work

At the present time all basic functionality of the main JAMM components (sensor manager, port monitor, event gateway, sensor directory, event collector, and several sensors) is complete. We are currently adding the ability to do filtering and summary data computation to the event gateway, and working on security, access control, and policy mechanisms as well. We are also developing a ULM-to-XML filter for the gateway, so a consumer can request either format for event data. We are also developing a summary data service and client API, as shown in Figure 4. For example, network sensors publish summary throughput and latency data in the directory service, which is used by a network-aware client to optimally set its TCP buffer size. The summary data service might be part of the sensor directory, could be a separate LDAP server, or could be built into the gateways. We are exploring the performance and scalability issues of each of these approaches.

6.1 Security Issues

A distributed agent system such as JAMM creates a number of security vulnerabilities which must be analyzed and addressed before such a system can be safely deployed on a production Grid. The users of such a system are likely to be remote from the machines being monitored and to belong to different organizations. Users want to find out what sensors are running and how to subscribe to their event data; users may need to cause sensor programs to be started or to generate a higher level of data collection; and finally, users want to subscribe to sensor data via an event gateway. In each case the domain that is being monitored is likely to want to control which users may perform which actions. Discovering what sensors are running and what their gateways are is done via an LDAP lookup. Starting new sensors is done by a request to a gateway, which then contacts a sensor manager. Subscribing to an event stream is done by establishing a network connection to an event gateway.
Our goal is to provide a single method for user identification and authorization for each of these steps. Current LDAP servers provide user/password-style protection for regions of the LDAP tree, which can be used to protect the lookup and publishing functions. Normally these passwords are sent in clear text, but there are SSL-enabled versions of LDAP, e.g. Netscape's, where the LDAP server has a key which can be used to encrypt the connection to the client. Each LDAP server manages its own set of users and passwords. This same user/password scheme could be used to control the other access points. However, this approach becomes awkward if the sensors that a user is interested in are in multiple domains, because each domain must assign user names and passwords. Public-key-based X.509 identity certificates are a recognized solution for cross-realm identification of users. When the certificate is presented through a secure protocol such as SSL (Secure Sockets Layer), the server side can be assured that the connection is indeed to the legitimate user named in the certificate. SSL libraries exist for HTTP connections, RMI connections, or simply as C or Java libraries to be used with socket code. There are two existing packages that use identity credentials for client-side authentication and authorization for remote resources. One is the Globus implementation of the GSS-API (GSI), which uses a protocol such as SSL to provide the server side with assurance that the connection is to the legitimate user named in the certificate. A server-side map file is used to map the Globus X.509 user identities to local user IDs which can be used by existing access control mechanisms. The second existing package is Akenti, which uses vanilla X.509 identity certificates and the SSL protocol to authenticate remote users.
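Whatever the credential mechanism, the server side ultimately needs a check from an authenticated identity and a resource name to a set of permitted actions. A minimal sketch of such a check follows; the policy table, distinguished names, resource names, and action names are all hypothetical, and a real deployment would derive the identity from an X.509 certificate presented over SSL:

```python
# Hypothetical policy table: (identity, resource) -> allowed actions.
POLICY = {
    ("CN=Alice,O=LBNL", "hostA/sensors"): ["lookup", "subscribe", "start"],
    ("CN=Bob,O=Remote", "hostA/sensors"): ["lookup"],
}

def authorize(identity, resource):
    """Return the list of actions this identity may perform on the
    resource, or an empty list to deny access entirely."""
    return POLICY.get((identity, resource), [])

# A local user may start sensors; an unknown identity is denied.
alice_actions = authorize("CN=Alice,O=LBNL", "hostA/sensors")
mallory_actions = authorize("CN=Mallory,O=Evil", "hostA/sensors")
```

Centralizing the lookup behind one function is the point: both an LDAP wrapper and an event gateway can call the same check, so policy lives in one place per domain.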
In addition, Akenti provides a way for the resource stakeholders to remotely determine the authorization for resource use based on components of the user's distinguished name or attribute certificates. These two mechanisms can be combined to allow Globus clients in a Grid environment to present Globus proxy IDs, and non-Globus clients to provide standard X.509 identity certificates. At the server side, a domain can decide whether it wants to use locally maintained access control lists or the more distributed Akenti policy certificates. User (consumer) access at each of the points mentioned above (LDAP lookup and subscription to a gateway) would require an identity certificate passed through a secure protocol, e.g. SSL. A wrapper to the LDAP server and the gateway could both call the same authorization interface with the user's identity and the name of the resource the user wants to access. This authorization interface could return a list of allowed actions, or simply deny access if the user is unauthorized. Communication between the gateway and the sensor managers also needs to be controlled, so that a malicious user can't communicate directly with the sensor manager. This is a simpler problem, since a sensor manager only needs to communicate with a small known set of gateway agents, and thus can simply keep a list of the identity certificates of the agents to which it will allow a connection. We plan to add credential-based security to the JAMM system in the near future.

7.0 Conclusions

Monitoring is critical to providing a robust, high-performance Grid environment. We have presented a flexible, scalable architecture for managing monitoring
sensors for all components of the Grid, including hosts, networks, and applications. The ability to access the event data collected by the monitoring sensors will enhance or enable a wide range of other Grid services, such as scheduling, network tuning, performance analysis, QoS, and so on.

8.0 Acknowledgments

We are greatly indebted to many members of the Grid Forum (http://www.gridforum.org), from whom many of the ideas in this paper came. In particular, discussions with Rich Wolski, University of Tennessee; Ruth Aydt, University of Illinois; Ian Foster, Steve Tuecke, and Darcy Quesnel, Argonne National Lab; Dennis Gannon, University of Indiana; and Warren Smith, NASA Ames, all contributed to the architecture described here. We also thank Michael Amabile from the Sarnoff Corporation for help with the NetLoggerized MEMS data viewing application. This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, Mathematical, Information, and Computational Sciences Division, under U.S. Department of Energy Contract No. DE-AC03-76SF. This is report no. LBNL.

References

Abela, J., T. Debeaupuis, "Universal Format for Logger Messages", IETF Internet Draft.
Case, J., R. Mundy, D. Partain, B. Stewart, "Introduction to Version 3 of the Internet-standard Network Management Framework", IETF RFC 2570, April.
CORBA, "Systems Management: Event Management Service", X/Open Document Number P437.
DeRose, L., D. Reed, "SvPablo: A Multi-Language Architecture-Independent Performance Analysis System", Proceedings of the International Conference on Parallel Processing (ICPP'99), Fukushima, Japan, September.
Genesereth, M., S. Ketchpel, "Software Agents", Communications of the ACM, July.
Fitzgerald, S., I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke, "A Directory Service for Configuring High-Performance Distributed Computations", in Proc. 6th IEEE Symp.
on High Performance Distributed Computing, August.
Globus:
Housley, R., W. Ford, W. Polk, D. Solo, "Internet X.509 Public Key Infrastructure", IETF RFC, Jan.
Iperf:
JDMK:
Jini Distributed Event Specification.
JMX:
Matisse:
Pablo Scalable Performance Tools.
Peng, X., "Survey on Event Service".
Performance Co-Pilot:
Supernet:
tcpdump, NetLogger version: /projects/enable/tcpdump/
Thompson, M., W. Johnston, S. Mudumbai, G. Hoo, K. Jackson, A. Essiari, "Certificate-based Access Control for Widely Distributed Resources", Proceedings of the Eighth Usenix Security Symposium, Aug.
Tierney, B., J. Lee, B. Crowley, M. Holding, J. Hylton, F. Drake, "A Network-Aware Distributed Storage Cache for Data Intensive Environments", Proceedings of the IEEE High Performance Distributed Computing conference (HPDC-8), August 1999, LBNL.
Tierney, B., W. Johnston, B. Crowley, G. Hoo, C. Brooks, D. Gunter, "The NetLogger Methodology for High Performance Distributed Systems Performance Analysis", Proceedings of the IEEE High Performance Distributed Computing conference, July 1998, LBNL.
Wahl, M., T. Howes, S. Kille, "Lightweight Directory Access Protocol (v3)", IETF RFC 2251, Dec.
Wolski, R., N. Spring, J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing", Future Generation Computing Systems.
Wolski, R., M. Swany, S. Fitzgerald, "White Paper: Developing a Dynamic Performance Information Infrastructure for Grid Systems", GridForum/Perf-WG/white.PDF.
The Grid: Blueprint for a New Computing Infrastructure, edited by Ian Foster and Carl Kesselman, Morgan Kaufmann, August. ISBN.