PDM: Programmable Monitoring For Distributed Applications

Michael D. Rogers, James E. Lumpp Jr., and James Griffioen
Department of Computer Science / Department of Electrical Engineering
University of Kentucky, Lexington, KY 40506

Abstract

We have constructed PDM, a Programmable, Distributed Monitoring system for building performance monitoring/analysis, steering, and adaptive systems for large-scale, parallel applications. PDM takes advantage of mobile languages to reduce monitoring overheads and to offer new monitoring capabilities. PDM monitoring code is currently written in Tcl. PDM can decompose and distribute the monitoring code to the nodes in the system where the performance data is generated, thereby reducing the amount of network traffic required to collect monitoring information. Furthermore, PDM scripts can monitor and bound their own resource usage, further reducing monitoring overheads.

Keywords: Performance Monitoring, Distributed Computing, Parallel Applications, Steering, Reactive Systems

This research was supported by the National Science Foundation under Grant CDA-9502645, the Advanced Research Projects Agency under Grant DAAH04-96-1-0327, and the Experimental Program to Stimulate Competitive Research under Grant EPS-9874764.

1 Introduction

Multi-processor systems have been used to solve many computationally intensive problems [3, 14, 15, 16, 19]. However, these systems often fail to achieve the expected speedup. Monitoring systems have been used to debug and optimize performance. The PDM system extends the abilities and features of past systems with efficient, mobile, extensible monitoring code that can be decomposed and migrated to any node in the system. As a result, PDM offers the following new features:

Reduced Monitoring Overhead. Because the monitoring code can be decomposed and migrated to collect and analyze performance data where it originates, network monitoring traffic can be reduced.

Extensibility.
Because monitoring code is written in an expressive and extensible language, new application-specific monitoring functions can be added to the system to collect application-specific information.

Bounded Resource Usage. Because the monitoring code can make local decisions about what data to record and transmit, monitoring overhead can be limited by dynamically adjusting the amount of information recorded by the system.

Dynamic Buffering. PDM can adjust network buffering based on monitoring-overhead feedback, providing more buffering when network utilization is high and less when it is low.

The remainder of this paper is organized as follows. Section 2 discusses related work on performance monitoring tools. Section 3 describes the PDM monitoring abstractions and introduces PDM's programming language. Section 4 describes the implementation of PDM. Section 5 discusses how programming PDM with movable scripts provides opportunities to enhance monitoring performance. Lastly, Section 6 describes experimental results.

2 Related Work

A variety of performance monitoring tools exist to aid programmers in determining the source of performance problems. One class of monitoring tools is software profilers, such as gprof [4], prof [7], and Quartz [1], that collect information via sampling. Unfortunately, profiling tools record a limited number of metrics, which restricts the amount of insight a user can gain from such tools [6]. Performance visualization and instrumentation tools such as Pablo [13], Paradyn [11], and AIMS [10] allow programmers to instrument the application code to collect
runtime information. Interactive steering tools such as Falcon [2, 5] and Progress [17] take monitoring a step further by allowing the user to modify a running application. However, users cannot insert new code into the system to analyze performance once the application is running. Meta [8, 9] and Magellan [18] allow programmers to associate conditions based on performance parameters with one or more actions. When a condition becomes TRUE, the corresponding action executes. As we describe in Section 5.1, Meta cannot locally evaluate conditions that are used to compute global state. The expression parser in Magellan is somewhat more sophisticated. If a Magellan condition and its associated actions reference steering objects local to a machine, Magellan migrates the code to that machine. However, all of the steering objects referenced by a condition and its actions must be local to a particular machine for Magellan to take advantage of this optimization. Both Magellan and Meta use languages specific to the domain of performance analysis and steering instead of general programming languages, and, thus, these languages have limited flexibility and extensibility.

3 Programmable Monitoring

PDM collects and presents information about the distributed state of the operating system (distributed shared memory or message passing) and the application. Examples of operating system state include the average CPU load across all machines, the longest time that any process waits at a particular barrier, and the total network traffic caused by a shared-memory region. Examples of application state include the average time spent in a particular function or the average number of times a particular loop iterates. PDM defines a set of monitoring abstractions for collecting this distributed state and provides a programming language to query those abstractions. The following sections discuss these abstractions and introduce the PDM programming language.
3.1 PDM Monitoring Abstractions

PDM has four basic abstractions: events, samples, metrics, and resources.

Events occur during program execution and fall into two types: system events and application events. System events are operating-system actions such as transmitting a network packet, triggering a system alarm, and faulting in a virtual memory page. Application events depend on the application and include such things as ending a computation phase, invoking a procedure, or referencing a variable. A programmer must instrument application code to generate application events.

Samples are measures of system state. That is, PDM uses a statistical sampling approach to infer the state of the system from periodic instantaneous measurements. Examples of samples include CPU ready-queue length, memory utilization, and other system state.

Metrics represent performance information computed from event and sample data. PDM maintains two types of metrics: system metrics and application metrics. System metrics, which are predefined by PDM, can be sample-based or event-based. PDM computes sample-based metrics by incorporating (averaging) samples into the associated system metrics. For example, PDM updates the average CPU load metric by averaging samples of the system's ready-queue length. PDM maintains event-based system metrics through instrumented event points in the operating system code; the instrumentation code updates the metric each time it is invoked. For example, the network driver may generate a packet-sent event containing the size of the packet being sent, and PDM then updates its bytes-transferred metric. Unlike system metrics, application metrics are computed only from application events (not sampling). An application writer instruments the application to generate application events; when PDM receives an application event, it updates the metric associated with the event.
A resource is an application abstraction, such as a function, a data structure, or a code block, or a system abstraction, such as a lock, barrier, or shared-memory region. PDM associates every metric with a resource, and every metric is uniquely identified by a (resource, metric id) pair. For example, (CPU, LOAD) represents the load metric associated with the CPU resource. Assigning metrics to resources allows PDM to attribute usage to each resource. For example, when a distributed shared-memory system sends updates for a particular segment, PDM can record the number of bytes sent for that segment by updating the bytes-transferred metric associated with the segment.

3.2 The PDM Programming Language

An application communicates with PDM via expressions specified in the Tcl [12] language. Expressions can be either queries or predicates. Figure 1 lists some example queries and predicates. An application can submit a query to PDM to request information about the state of the distributed system, and can submit a predicate so that the application will be notified when the state of the distributed system changes. A query can evaluate to a numeric value, a string value, a (Tcl) list, or a boolean value (TRUE or FALSE). A predicate is a boolean expression that evaluates to TRUE or FALSE. In the next section, we describe how PDM uses a two-level architecture to evaluate queries and predicates.
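The metric bookkeeping described in Section 3.1 can be sketched as follows: sample-based metrics fold periodic measurements into a running average, event-based metrics accumulate totals, and both are keyed by (resource, metric id) pairs. This is an illustrative Python sketch under assumed names (MetricTable, sample, event); PDM's own monitoring code is written in Tcl, and its real data structures are not described at this level of detail in the paper.

```python
class MetricTable:
    """Hypothetical metric store keyed by (resource, metric_id) pairs."""

    def __init__(self):
        self._avg = {}   # (resource, metric_id) -> (running mean, sample count)
        self._acc = {}   # (resource, metric_id) -> accumulated total

    def sample(self, resource, metric_id, value):
        """Incorporate one instantaneous measurement (e.g. ready-queue length)."""
        mean, n = self._avg.get((resource, metric_id), (0.0, 0))
        self._avg[(resource, metric_id)] = (mean + (value - mean) / (n + 1), n + 1)

    def event(self, resource, metric_id, amount):
        """Update an event-based metric (e.g. the size in a packet-sent event)."""
        key = (resource, metric_id)
        self._acc[key] = self._acc.get(key, 0) + amount

    def get(self, resource, metric_id):
        """Return the current value of a metric, sample- or event-based."""
        key = (resource, metric_id)
        if key in self._avg:
            return self._avg[key][0]
        return self._acc.get(key, 0)

m = MetricTable()
for load in (1, 2, 3):                   # periodic CPU ready-queue samples
    m.sample("cpu", "load", load)
m.event("segment3", "bytesxfered", 1024) # packet-sent event for a DSM segment
```

Here m.get("cpu", "load") yields the averaged load (2.0) and m.get("segment3", "bytesxfered") the accumulated byte count (1024), mirroring the (CPU, LOAD) and segment examples above.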
Query Examples

1) [worker 6 {$metric(cpu,load)}]
   "What is the CPU load on process 6?"

2) [average [worker {1 2 3} {$metric(segment3,bytesxfered)}]]
   "What is the average number of bytes transferred for processes 1, 2, and 3 for segment 3?"

3) [worker 3 {$metric(worktree,numnodes)}]
   "How many nodes does process 3 have in its work tree?"

Predicate Examples

4) [worker 6 {$metric(barrier1,lastwait)}] > [avg [worker {7 8} {$metric(barrier1,lastwait)}]] * 1.10
   "Did process 6 wait on barrier 1 at least ten percent longer than the average time processes 7 and 8 waited on that barrier?"

5) [worker 6 {$metric(cpu,load) > 1}]
   "Is the average CPU load on process 6 greater than 1?"

6) [any [worker ALL {$metric(worktree,numnodes) > 50}]]
   "Does any process have more than 50 nodes in its work tree?"

Figure 1: Example queries and predicates and their corresponding English statements. $metric is a reference to a particular metric, given by a (resource, metric id) pair. The worker clause takes a list of processes and a subexpression. When the predicate or query is evaluated, the worker clause is replaced by a list of the values of the subexpression evaluated at each process.

4 The PDM Monitoring Architecture

PDM collects distributed state constructed from local state information at each machine. However, continuously updating distributed state from local state can impose a heavy load on the network. PDM reduces network load by filtering out information (at the local site) that is not important to the distributed state being queried. The PDM architecture is shown in Figure 2.

Figure 2: Layered implementation of PDM. The Distributed State Layer (DSL) collects information about the state of the distributed system, and the Local State Layer (LSL) on each machine collects information about that individual machine.

PDM consists of two layers: a Distributed State Layer (DSL) and a Local State Layer (LSL). The DSL is responsible for collecting information about the distributed system and presenting it to the application. The DSL collects information about the distributed system by aggregating information from the LSL (see Figure 3). In Figure 3, the DSL's Tcl interpreter parses the predicate and decomposes it into subexpressions that will be sent, installed, and eventually evaluated locally on each machine. The LSL does not immediately evaluate a subexpression. Instead, the LSL stores the subexpression for future evaluations. The LSL continuously gathers performance data about the underlying system. If the state of a local machine changes, the LSL re-evaluates, on that machine, all subexpressions that depend on the changed state. If the value of a subexpression changes, the LSL sends a notification containing its new value to the DSL. When the DSL receives a notification from the LSL, it re-evaluates all registered predicates that have terms containing that subexpression, using the value the LSL sent in the notification. If a predicate fires, the DSL invokes a callback in the application.

5 Controlling Monitoring Overhead

PDM's integration of a flexible, mobile programming language provides the opportunity to reduce monitoring overhead. Specifically, migrating subexpressions can reduce network traffic by eliminating unimportant monitoring information. Furthermore, PDM scripts can monitor and bound their own resource usage, thus further reducing monitoring overheads.

5.1 Migrating Subexpressions

PDM's ability to evaluate expressions with terms local to a particular machine differentiates it from other monitoring subsystems, such as Meta [8, 9]. PDM hides the decomposition and distribution tasks from the programmer.
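The two-layer protocol described above can be sketched as follows. The class and method names (Dsl, Lsl, register_predicate) are hypothetical, not PDM's API; the sketch only illustrates the mechanism: a subexpression is stored at each LSL, re-evaluated whenever local state changes, and a notification crosses to the DSL only when the subexpression's value changes.

```python
class Lsl:
    """Local State Layer: holds one installed subexpression per sketch."""

    def __init__(self, dsl, worker_id):
        self.dsl, self.worker_id = dsl, worker_id
        self.subexpr, self.last = None, None

    def install(self, subexpr):
        self.subexpr = subexpr            # stored, not evaluated immediately

    def local_state_changed(self, state):
        value = self.subexpr(state)       # re-evaluate on every local change
        if value != self.last:            # but notify only on a value change
            self.last = value
            self.dsl.notify(self.worker_id, value)

class Dsl:
    """Distributed State Layer: aggregates per-worker subexpression values."""

    def __init__(self):
        self.values, self.callback = {}, None

    def register_predicate(self, workers, subexpr, callback):
        self.callback = callback
        lsls = [Lsl(self, w) for w in workers]
        for lsl in lsls:
            lsl.install(subexpr)
        return lsls

    def notify(self, worker_id, value):
        self.values[worker_id] = value
        if any(self.values.values()):     # an "anytrue"-style aggregation
            self.callback()               # predicate fired: invoke callback

fired = []
dsl = Dsl()
# "Is the CPU load on any worker greater than 3?"
lsls = dsl.register_predicate([1, 2, 3],
                              lambda s: s["cpu_load"] > 3,
                              lambda: fired.append(True))
lsls[0].local_state_changed({"cpu_load": 1})  # False: one initial notification
lsls[0].local_state_changed({"cpu_load": 2})  # still False: no notification
lsls[1].local_state_changed({"cpu_load": 4})  # True: predicate fires
```

Note that the second state change generates no network traffic at all, even though the raw metric changed; this is exactly the filtering that Section 5.1 credits for the reduced overhead.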
Because the language is easily decomposable and mobile, PDM provides this service automatically. Local subexpression evaluation has two major benefits. First, the subexpressions can be computed in parallel, reducing the amount of time required to evaluate the parent query or predicate. Second, processes send notifications only when the value of one of the subexpressions changes, reducing both network traffic and the number of times a predicate needs to be evaluated.

Figure 3: PDM registering and evaluating a predicate. (a) The DSL parses the predicate "Is the CPU load on any worker greater than 3?" and passes the local subexpression (CPU Load > 3?) to the LSL on each machine. (b) Each LSL evaluates the local subexpression against its own CPU load and returns the answer (True or False) to the DSL.

As an example, the expression labeled (a) in Figure 4 shows a representation of the condition part of a Meta guarded command that references sensors on more than one process and thus cannot take advantage of locality. Each time the load changes on process 1 or on process 2, the new load is sent to the process that registered the predicate. Not only does the changing load generate network traffic, but the registering process re-evaluates the predicate each time, which wastes CPU cycles. The predicate labeled (b) is a similar expression rewritten for PDM that takes full advantage of locality. The process that registered the predicate will only re-evaluate it when the value of the expression $metric(cpu,load) > 2 on worker 1 or the value of the expression $metric(cpu,load) < 2 on worker 2 changes.

a) p.load > 2 && q.load < 2
b) [worker 1 {$metric(cpu,load) > 2}] && [worker 2 {$metric(cpu,load) < 2}]

Figure 4: The expression in (a) represents the condition part of a guarded command that references the Load sensor in two different contexts, labeled p and q. Expression (b) is the PDM expression restructured to take full advantage of locality.

5.2 Bounding Resource Usage and Controlling Buffering

Each time a subexpression is evaluated on a local machine, it consumes CPU, memory, and, possibly, network resources.
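One way to account for that consumption, roughly along the lines of the self-monitoring flow of control PDM uses, is per-subexpression budgets: stop evaluating once a CPU budget (evaluations per interval) is exhausted, and buffer rather than send notifications once a network budget (bytes per interval) is exhausted. The class name, budget units, and notification cost below are illustrative assumptions, not PDM's actual accounting.

```python
class BoundedSubexpr:
    """Sketch of a subexpression that bounds its own resource usage."""

    def __init__(self, fn, max_evals, max_bytes, notify_cost=64):
        self.fn = fn                      # the subexpression to evaluate
        self.max_evals, self.max_bytes = max_evals, max_bytes
        self.evals = self.bytes_sent = 0  # per-subexpression usage counters
        self.notify_cost = notify_cost    # assumed bytes per notification
        self.last = None
        self.sent, self.buffered = [], []

    def evaluate(self, state):
        if self.evals >= self.max_evals:  # CPU budget exhausted:
            return self.last              # skip evaluation, return stale value
        self.evals += 1
        value = self.fn(state)
        if value != self.last:            # only a changed value is reported
            self.last = value
            if self.bytes_sent + self.notify_cost <= self.max_bytes:
                self.bytes_sent += self.notify_cost
                self.sent.append(value)   # network budget allows: send now
            else:
                self.buffered.append(value)  # over budget: buffer instead
        return value

sub = BoundedSubexpr(lambda s: s["load"] > 1, max_evals=3, max_bytes=64)
sub.evaluate({"load": 0})   # False: first notification consumes the budget
sub.evaluate({"load": 2})   # True, but network budget spent: buffered
sub.evaluate({"load": 2})   # unchanged value: nothing sent or buffered
sub.evaluate({"load": 9})   # CPU budget exhausted: not evaluated at all
```

After these four calls only one notification has actually gone out; the later change sits in the buffer and the final evaluation cost nothing, which is the behavior the flow of control below formalizes.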
Because PDM's programming language is flexible, programmers can write predicates that monitor their own resource consumption and limit it to some predefined maximum monitoring overhead. Consider the flow of control, depicted in Figure 5, for evaluating a subexpression that monitors its own resources. First, PDM updates the current CPU and memory resource consumption for the subexpression. Next, PDM determines whether the CPU and memory resource utilizations for the subexpression are below given thresholds. If not, the flow of control exits and the subexpression is not evaluated, so no further resources are spent on it. If resource consumption is below the thresholds, the subexpression is evaluated.

Figure 5: Flow of control for evaluating a subexpression that monitors its own resources: update the CPU/memory statistics; if no CPU/memory threshold is violated, evaluate the subexpression; if its value changed and no network threshold is violated, send a notification, otherwise buffer it; finally, update the network statistics. The required Tcl predicate is augmented with optional Tcl code to bound resource usage and optional Tcl code to track specific resource usage.

Note that the value of the subexpression may or may not have changed since its last evaluation. If it has not, the subexpression evaluation code terminates. However, if the subexpression's value has changed, PDM then determines whether the network resources consumed by this subexpression are below a given threshold. If so, a notification is immediately sent to the DSL. However, if network resources have exceeded the threshold, the notification is buffered and no
network traffic is generated. Lastly, if needed, PDM updates the network resource utilization for the subexpression. This flow of control provides both resource monitoring and buffering based upon feedback information.

PDM provides the necessary hooks to update per-subexpression memory utilization and to control buffering. PDM also automatically updates per-subexpression CPU and network utilization. Therefore, a predicate that does its own resource monitoring and dynamic buffering can be constructed simply, as shown in Figure 6. Note that the ?: conditional operator in Tcl is similar to that of the C programming language. The subexpression in this predicate contains an outer conditional that evaluates the inner conditional only if the CPU resource usage, i.e., the $perf($this,evalspersec) variable, is below the given evaluations-per-second threshold. The inner conditional evaluates the predicate and sets the $buffer($this) variable if the network usage, stored in the $perf($this,bytespersec) variable, violates the given bytes-per-second threshold. PDM buffers the notification if the $buffer($this) variable is set and the value of the expression has changed.

[worker ALL {$perf($this,evalspersec) < $ethresh ?
    [expr {$perf($this,bytespersec) < $bthresh ?
        [expr $subex] :
        [expr {[set buffer($this) 1] ? [expr {$subex}] : 0}]}] :
    $Lastval($this)}]

Figure 6: This script monitors its own CPU and network performance. The script will not evaluate the subexpression ($subex) once the evaluations-per-second threshold is exceeded. Furthermore, PDM buffers the notification for this subexpression if the network threshold is exceeded.

6 Experimentation

We evaluated network overhead by measuring the network traffic generated by PDM during runs of a Successive Over-Relaxation (SOR) benchmark written for a DSM system that has been instrumented to interact with PDM. We modified the benchmark to register an alarm that was triggered whenever a machine performed its task exceptionally fast or exceptionally slow.
The pseudo-code for registering the alarm is shown in Figure 7. Specifically, the SOR object registers alarms that are triggered when the amount of time any process spends waiting at the barrier is above or below a certain threshold. The SOR object initially obtains an estimate of how long a process should wait at the barrier by recording the maximum, minimum, and average wait times over the first ten iterations. It then sets a lowval and a highval threshold to detect outliers: machines that took too long or completed too fast (see Figure 7). The system then registers an alarm to be triggered when an outlier is detected. When the alarm occurs, the system adjusts the load.

for each worker i
    Answer[i] = Query("How much time did worker i spend waiting on the barrier")
max = maximum of all Answer[i], where i = 1 to number of workers
min = minimum of all Answer[i], where i = 1 to number of workers
avg = (sum of all Answer[i], where i = 1 to number of workers) / number of workers
RegisterAlarm("Tell me if any worker's barrier wait time is less than lowval")
RegisterAlarm("Tell me if any worker's barrier wait time is greater than highval")

Figure 7: Pseudo-code for the SOR predicate.

To show the benefits of subexpression locality, we performed two sets of runs. In the first set, the SOR application registers alarms with predicates that allow PDM to evaluate subexpressions locally. In the second set, the SOR application registers alarms with predicates that do not take advantage of subexpression locality. The predicates for the two sets of runs are shown in Figure 8.

Central Evaluation
[smallest [worker ALL {$metric(barrier,WaitTime)}]] < $lowval
[largest [worker ALL {$metric(barrier,WaitTime)}]] > $highval

Local Evaluation
[anytrue [worker ALL {$metric(barrier,WaitTime) < $lowval}]]
[anytrue [worker ALL {$metric(barrier,WaitTime) > $highval}]]

Figure 8: Predicates used in the SOR runs.

The results of the runs are shown in Figure 9. The graph shows the number of packets PDM transferred as we increased the number of processes for each new run. PDM transmits substantially fewer network packets for the second set of runs, i.e., when PDM evaluates subexpressions locally. Less network traffic is generated because the values of the subexpressions change more slowly than the metrics themselves. Therefore, the processes evaluating the subexpressions send fewer notifications to the process that registered the alarm. For eight processes, PDM network traffic was reduced by an order of magnitude. In addition, Figure 9 shows that as the number of processes increases, evaluating expressions locally scales better than the centralized approach. Note that these results are for a single metric. For applications with many metrics the results would be even more dramatic: network traffic for the centralized approach would increase in proportion to the number of metrics, while traffic for the local approach would grow relatively slowly.
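The threshold setup sketched in Figure 7 can be made concrete as follows. The outlier margins used here (50% below and 50% above the average wait time) are assumptions for illustration only, since the paper does not give the exact formula used to derive lowval and highval from the recorded statistics.

```python
def barrier_thresholds(wait_times, low_factor=0.5, high_factor=1.5):
    """Derive lowval/highval outlier thresholds from observed per-worker
    barrier wait times (e.g. from the first ten iterations).
    low_factor/high_factor are assumed margins, not PDM's actual values."""
    avg = sum(wait_times) / len(wait_times)
    lowval = low_factor * avg    # a worker finishing much earlier than average
    highval = high_factor * avg  # a worker finishing much later than average
    return lowval, highval

# Wait times observed during the calibration iterations (made-up numbers).
lowval, highval = barrier_thresholds([10, 12, 8, 10])

# The registered alarms fire when any worker's wait time falls outside
# [lowval, highval]; the alarm handler then rebalances the load.
outliers = [w for w in (10, 3, 25) if w < lowval or w > highval]
```

With an average wait of 10, the assumed margins give lowval = 5 and highval = 15, so the workers with wait times 3 and 25 would trigger the low and high alarms respectively.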
Figure 9: Number of PDM packets transferred during runs of SOR as the number of workers increases from 2 to 8, with and without local evaluation.

7 Summary

PDM is a generalized toolkit for creating monitoring/analysis, steering, and reactive systems. PDM collects system and application performance metrics that an application can use to improve its performance. Applications can construct metric expressions to query the state of the application and the operating system. Applications can also register alarms that PDM triggers when the value of a metric expression (predicate) becomes true. What differentiates PDM from other monitoring systems is its mobile scripting language. The scripts are portable, i.e., PDM migrates scripts to the machine where performance data is generated. We have presented performance results showing that migrating scripts can reduce network traffic, thereby reducing monitoring overhead. Furthermore, PDM scripts can monitor and bound their own resource usage, further reducing monitoring overheads.

References

[1] T. E. Anderson and E. D. Lazowska. Quartz: A tool for tuning parallel program performance. In 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, volume 18, pages 22-25, May 1990.

[2] G. Eisenhaur, W. Gu, K. Schwan, and N. Mattavarupu. Falcon - Toward Interactive Parallel Programs: The Online Steering of a Molecular Dynamics Application. High Performance Distributed Computing, August 1994.

[3] A. Geist, A. Beguelin, J. Dongarra, R. Manchek, W. Jiang, and V. Sunderam. PVM: A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[4] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: A call graph execution profiler. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, volume 17, pages 120-126, 1982.

[5] W. Gu, G. Eisenhaur, and K. Schwan.
Falcon: On-line monitoring and steering of parallel programs. IEEE Transactions on Parallel and Distributed Systems, November 1994.

[6] J. K. Hollingsworth and B. P. Miller. Parallel Program Performance Metrics: A Comparison and Validation. In Supercomputing '92 Proceedings, Minneapolis, MN, November 1992. ACM and IEEE.

[7] Unix Programmer's Manual. prof command. Technical Report section 1, Bell Laboratories, January 1979.

[8] K. Marzullo and M. D. Wood. Tools for Constructing Distributed Reactive Systems. Technical report, Cornell University, Department of Computer Science, Ithaca, New York 14853, February 1991.

[9] K. Marzullo and M. D. Wood. Tools for Monitoring and Controlling Distributed Applications. In IEEE Computer, volume 24, pages 42-51, 1991.

[10] P. Mehra, B. VanVoorst, and J. C. Yan. Automated instrumentation, monitoring and visualization of PVM programs. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, pages 832-837, 1995.

[11] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn Parallel Performance Measurement Tools. In IEEE Computer, volume 28, pages 37-46, November 1995.

[12] J. K. Ousterhout. Tcl: An embeddable command language. In 1990 Winter USENIX Conference Proceedings, 1990.

[13] D. A. Reed, R. A. Aydt, T. M. Madhyastha, R. J. Noe, K. A. Shields, and B. W. Schwartz. An Overview of the Pablo Performance Analysis Environment. Technical report, University of Illinois, November 1992.

[14] J. Reuther. Aerodynamic shape optimization of supersonic aircraft configurations via an adjoint formulation on distributed memory parallel computers. Computers and Fluids, 28(4-5):675-700, May-June 1999.

[15] I. Rosenblum. Multi-processor molecular dynamics using the Brenner potential: Parallelization of an implicit multibody potential. International Journal of Modern Physics C, 10(1):189-203, February 1999.

[16] V. N. Vatsa.
Parallelization of a multiblock flow code: an engineering implementation. Computers and Fluids, 38(4-5):603-614, May-June 1999.

[17] J. Vetter and K. Schwan. Progress: A Toolkit for Interactive Program Steering. Technical report, Georgia Institute of Technology, August 1995.

[18] J. Vetter and K. Schwan. Techniques for high-performance computational steering. IEEE Concurrency, pages 63-74, October-December 1999.

[19] P. Wang. Parallel multigrid finite volume computation of three-dimensional thermal convection. Computers and Mathematics with Applications, 37(9):49-60, May 1999.