Performance Monitoring and Visualization of Large-Sized and Multi-Threaded Applications with the Pajé Framework

Mehdi Kessis, France Télécom R&D, {Mehdi.kessis}@rd.francetelecom.com
Jean-Marc Vincent, Laboratoire ID-IMAG, {Jean-Marc.vincent}@imag.fr

Abstract

Performance is a critical issue for massively parallel programs. In practice, it is almost impossible to observe and understand the behavior of such programs without automated tools that give the developer insight into the execution behavior of complex applications. Most of these tools rely on event-based tracing techniques. As the size of the parallel system grows, so does the number of generated events, and visualizing and animating the gathered traces often produces complex, hard-to-understand diagrams and displays. The Pajé visualization framework was recently developed in our laboratory, ID (Informatique et Distribution). It provides interactive and scalable behavioral visualizations of parallel and distributed applications. This paper presents an empirical performance monitoring and visualization study of two large-scale applications: JOnAS (200,000 lines of code, LOC) and JBoss (400,000 LOC). We developed an event-based tracing framework for monitoring large-scale applications, and we used the Pajé framework to observe and visualize the large number of harvested events.

1. Introduction

Performance monitoring is a major topic for today's massively parallel and distributed programs. Understanding the behavior of complex applications or networks requires automated tools for instrumentation, tracing, visualization, analysis, etc. Such tools give the developer and the network administrator insight into the execution behavior of the system (program or network). Most of these tools rely on an event-based tracing technique [12]. Events are intercepted while the monitored application runs.
Then, they are stored in log files called trace files. As the size of the parallel system grows, the amount of event data becomes huge and difficult to interpret textually. Visualizing these events can be an efficient way to identify performance problems rapidly; the role of visualization tools is to animate these traces. Traces are generally analyzed off-line, with monitoring and visualization tools for multithreaded applications [2, 3, 4, 5, 10]. One of the major challenges for these tools is visualization scalability [7, 9]: most performance monitoring tools produce hard-to-understand diagrams when the number of observed events grows. This typically happens when we monitor large-scale systems (applications, networks, clusters, etc.).

The Pajé visualization framework [1] was recently developed in our laboratory, ID (Informatique & Distribution, IMAG). It provides interactive and scalable behavioral visualizations of parallel and distributed applications, helping to capture the dynamics of their executions. Pajé was initially designed to ease the performance debugging of MPI parallel programs by visualizing their executions.

The aim of this paper is to present the results of our experiments in performance monitoring of large-sized distributed applications. To evaluate our visualization tool, we tested two real-world application servers: the JOnAS (http://jonas.objectweb.org) and JBoss (http://sourceforge.net/projects/jboss) J2EE servers. During our experiments, we used the Pajé framework to visualize the behavior of these programs. We focus on three main aspects of multithreaded systems: (i) thread activity and state, (ii) lock and semaphore states, and (iii) the activity of the garbage collector (GC).

The paper is organized as follows. Section 2 presents the Pajé visualization framework. Section 3 outlines the global architecture of our performance
monitoring infrastructure. Section 4 presents our monitoring use cases, and Section 5 discusses their results.

2. The Pajé visualization framework

Pajé [1] is a trace-based visualization tool. It takes as input trace files of timestamped events and produces scalable visualizations of these traces. Figure 1 shows an example of a visualization with Pajé.

Fig 1: Multidimensional diagram of Pajé (horizontal time axis, vertical location axis; threads and monitors; thread running or blocked, monitor waiting or blocked, thread-monitor communication)

Typically, a Pajé visualization is a multidimensional (time and space) diagram in which the horizontal axis represents time while threads are displayed along the vertical axis. Threads are grouped by JVM, but they can also be grouped by machine or by application. A particularity of Pajé diagrams is that they combine the representation of the states of each thread with the representation of its communications [7]. Communication events are shown as method-call triangles, and links symbolize interactions between threads. Thread states are displayed as colored rectangles, the color indicating the activity of the thread. Another interesting feature of Pajé is the visualization of threads together with monitors (also called semaphores or locks). Each monitor is associated with the threads that lock and unlock it, and monitor states are represented in the same way as thread states. Moreover, Pajé is highly interactive. Moving the mouse pointer over the representation of a blocked thread highlights the monitor that will unblock it. Similarly, all threads blocked on a semaphore are highlighted when the pointer is moved over the corresponding state of the semaphore. The administrator or tester can thus navigate through these displays to understand the behavior of a multithreaded program and analyze causality between events.
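To make the input side concrete, the following Python sketch writes a tiny Pajé trace in which thread containers are grouped under JVM containers and each thread changes state over time. It assumes the generic Pajé trace file format (an %EventDef header that declares each event type, followed by one event per line); the numeric event ids, the aliases (JVM, T, TS, R, W, B), the colors and the helper name paje_trace are illustrative assumptions, not the format produced by the authors' agent. Monitors could be added the same way, as a second container type with its own state type.

```python
# Hypothetical sketch: generate a minimal Pajé trace of thread states grouped
# by JVM. Event ids, aliases and colors are illustrative assumptions.

HEADER = """\
%EventDef PajeDefineContainerType 0
% Alias string
% Type string
% Name string
%EndEventDef
%EventDef PajeDefineStateType 1
% Alias string
% Type string
% Name string
%EndEventDef
%EventDef PajeDefineEntityValue 2
% Alias string
% Type string
% Name string
% Color color
%EndEventDef
%EventDef PajeCreateContainer 3
% Time date
% Alias string
% Type string
% Container string
% Name string
%EndEventDef
%EventDef PajeSetState 4
% Time date
% Type string
% Container string
% Value string
%EndEventDef"""

# Thread states drawn as colored rectangles: (alias, label, "R G B").
STATES = [("R", "running", "0.0 0.8 0.0"),
          ("W", "waiting", "1.0 1.0 0.0"),
          ("B", "blocked", "1.0 0.0 0.0")]


def paje_trace(jvms, samples):
    """Build a Pajé trace as a string.

    jvms:    {jvm_alias: [thread_alias, ...]}; thread aliases must be unique.
    samples: [(time_in_seconds, thread_alias, state_alias), ...]
    """
    lines = [HEADER]
    # Type hierarchy: root "0" > JVM > Thread, with one state type per thread.
    lines.append('0 JVM 0 "JVM"')
    lines.append('0 T JVM "Thread"')
    lines.append('1 TS T "Thread state"')
    for alias, label, color in STATES:
        lines.append(f'2 {alias} TS "{label}" "{color}"')
    # Containers are created at time 0, before any state change.
    for jvm, threads in jvms.items():
        lines.append(f'3 0.000000 {jvm} JVM 0 "{jvm}"')
        for thread in threads:
            lines.append(f'3 0.000000 {thread} T {jvm} "{thread}"')
    # The trace must list events in chronological order.
    for time, thread, state in sorted(samples):
        lines.append(f"4 {time:.6f} TS {thread} {state}")
    return "\n".join(lines) + "\n"


trace = paje_trace({"jvm1": ["main", "gc"]},
                   [(0.5, "main", "B"), (0.1, "main", "R"), (0.2, "gc", "W")])
print(trace)
```

Keeping the type declarations separate from the events reflects Pajé's generic design: the tool does not know about JVMs or threads as such, only about the container and state types declared at the top of the trace.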
Progress of the visualization is entirely driven by user-controlled time displacements: at any time during a simulation, it is possible to move forward or backward in time. In addition, programmers using the tool can define what they wish to visualize and how it should be represented (masking, grouping, expanding, getting information, getting statistics, etc.).

3. The monitoring infrastructure

Pajé is a visualization framework: before traces can be visualized, they must be generated. Traditionally, observing and monitoring a system requires instrumenting it. In this section, we introduce our instrumentation approach. Instrumentation can be done at different levels (OS, middleware, JVM, application, etc.). In this study, we chose to place our probe at the JVM level, and we developed a C++ instrumentation agent based on the JVMPI [13] technology. JVMPI is a native Java interface that provides facilities to observe events occurring inside the JVM. At runtime, a JVMPI agent receives the events fired by the JVM and decides whether or not to intercept them. Figure 2 illustrates the global architecture of our monitoring infrastructure.

Fig 2: Global performance monitoring architecture (layers, from bottom to top: JVM instrumentation and event gathering; post-processing such as filtering and merging traces; trace database; trace visualization)

Intercepting events causes a computation overhead. To minimize this perturbation, we limit the number of observed events. We focus only
on events related to threads, monitors and the GC. At runtime, gathered and filtered events are written to disk through a buffer that is periodically flushed to trace files. This gathering process produces trace files that must be post-processed into a coherent global trace before higher-level tools can use them. First, the traces are merged. Then, they are filtered. Finally, dependencies between events, monitored objects and threads are correlated. The resulting global traces are stored in a trace database, from which they can be visualized with Pajé.

4. Performance monitoring and visualization experiments

In the following subsections, we focus on the performance monitoring of JOnAS and JBoss. We studied several scenarios: the initialization scenario (starting the J2EE services), the src-scenario (deploying a web application) and the cluster-demo scenario (running several clients that send queries to JOnAS). In these scenarios, we observe the thread activity and the GC activity.

4.1 Performance monitoring of large-scale applications

To evaluate Pajé's scalability on large-scale systems, we tested two real industrial large-scale applications: the JOnAS and JBoss J2EE servers. J2EE servers are generally used to deploy and host web applications. They offer various services (naming, persistence, EJB, web services, etc.) to the applications they host. These servers are generally deployed in very critical contexts (e-business, telecommunications, e-government, etc.), and once deployed they are usually solicited by hundreds to thousands of users. Improving the performance of these applications is therefore a critical issue in this context.
However, it is complex for two reasons: (a) these programs are very large (JOnAS has more than 200,000 LOC and JBoss more than 400,000 LOC) and (b) they are composed of hundreds of threads and monitors.

4.2 Thread monitoring

The first scenario that we evaluate is the JOnAS initialization scenario, which consists of starting the JOnAS services. Figure 4 shows the two visualizations (JOnAS and JBoss initialization) that we obtained with Pajé. At the top of both panels, we visualize the monitors' activity. Light blue rectangles show an active state of a monitor, while pink rectangles show a blocked state. Black lines represent the interactions between threads and monitors. At any time, we can analyze and visualize which thread blocked a resource: moving the mouse pointer over the representation of a blocked thread state highlights the corresponding monitor or thread state. In addition, Pajé makes it possible to inspect the displayed objects or to relate a given visualization to the source code, so that we can immediately find the code related to an identified bottleneck.

Analyzing the visualizations we obtained, we noticed a contention region during the JOnAS initialization period, that is, a period of massive interaction between threads and monitors. Such behavior may slow down JOnAS when it is deployed on clusters or when it runs large web applications. To analyze this phenomenon, we used Pajé's zooming functionality. As Fig 3 illustrates, we can zoom in on the contention region and visualize the evolution of the threads' states, which lets us identify the individual thread states involved. Zooming thus makes it easier to observe the behavior of complex systems: threads and monitors become navigable, and we can easily identify the code involved in a synchronization bottleneck.

Fig 3: Zooming on the contention region (Step 1, Step 2, Step 3)
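The contention region above was identified visually. As a complement, a simple sweep over blocked intervals can flag candidate regions automatically before zooming in. The sketch below is a hypothetical helper, not a Pajé feature: it assumes that post-processing yields one (start, end) interval per episode in which a thread is blocked on a monitor, and it reports the periods during which at least `threshold` threads are blocked at the same time. The threshold value is an arbitrary assumption.

```python
def contention_regions(blocked, threshold):
    """Return (start, end) periods in which >= threshold threads are blocked.

    blocked: iterable of (start, end) intervals, one per blocking episode;
             empty or inverted intervals (end <= start) are ignored.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    edges = []
    for start, end in blocked:
        if end > start:
            edges.append((start, +1))  # a thread becomes blocked
            edges.append((end, -1))    # the same thread is unblocked
    # Tuples sort by time, then by delta: at equal timestamps, unblock events
    # (-1) come before block events (+1), so intervals act as [start, end).
    edges.sort()
    regions, count, region_start = [], 0, None
    for time, delta in edges:
        count += delta
        if count >= threshold and region_start is None:
            region_start = time
        elif count < threshold and region_start is not None:
            regions.append((region_start, time))
            region_start = None
    return regions


episodes = [(0.0, 4.0), (1.0, 5.0), (2.0, 3.0), (6.0, 7.0)]
print(contention_regions(episodes, threshold=2))
```

The reported periods would only be starting points for Pajé's zoom, not conclusions; in practice the threshold would have to be tuned to the number of threads of each application.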

Fig 4: JOnAS and JBoss initialization scenario (rows: monitors, and threads including the GC, database and servlet-engine threads; legend: thread running, waiting or blocked; monitor blocked or free; thread-monitor interaction; contention region marked)

4.3 Cluster-Demo example: thread visualization

This scenario consists of simulating the deployment of JOnAS on a cluster and testing it with n clients running m loops of tasks. Figure 5 shows the interactions between monitors and threads during the Cluster-Demo scenario. As we can see, with large-scale applications and a high number of threads, the visualization becomes hard to understand. The navigation and zooming features of Pajé let us better observe and understand thread scheduling and identify bottlenecks.

Fig 5: Thread visualization with 100 JOnAS clients performing 100 loops (monitor activity, thread activity, contention region)

4.4 GC activity visualization

One of the most frequent performance problems in Java applications is memory management. The garbage collector (GC) is a form of
automatic memory management inside the Java virtual machine. It attempts to reclaim garbage, that is, memory used by objects that will never be accessed again by the application. The GC is known to be very CPU-intensive. Moreover, when it starts, it blocks the execution of all threads and slows down the application. The aim of this subsection is to observe the real effect of the GC. Visualizing the traces of GC activity obtained during JOnAS and JBoss initialization gives Fig 6 and Fig 7. First, we notice that the GC activity of JOnAS is very intensive at start-up (Fig 7) and tends to stabilize afterwards, whereas the JBoss server requires fewer GC activations when starting its services (Fig 6). A massive activation of the GC may degrade the performance of the program.

Fig 6: JBoss GC thread during server initialization (GC thread waiting or running)
Fig 7: JOnAS GC thread during server initialization (GC thread waiting or running)

Comparing the GC activity of the JOnAS and JBoss J2EE servers shows that JOnAS requires a massive activation of the GC at start-up. These results led the JOnAS team to improve the performance of the JOnAS server by reducing the number of objects loaded at start-up. Finally, we are able to propose GC strategies that depend on the application behavior. In addition, some corrections have been made to optimize the interaction with the garbage collector of the Sun HotSpot JVM.

5. Results and discussion

During our experiments, we observed traces containing from 2,000 up to 81,000 events. The overhead caused by JVMPI instrumentation varies between 10.3% and 92.17%: activating JVMPI radically slows down program execution. Table 1 shows that our JVMPI agent causes a high computation overhead during the JOnAS starting scenario.
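The overhead percentages can be recomputed from the runtimes of Table 1. The sketch below assumes that overhead is the relative increase of the instrumented runtime over the uninstrumented one; the paper does not spell out this definition, and the helper name overhead_percent is hypothetical. Under this assumption, the initialization, src and 10-client/10-loop rows give about 92.18%, 18.89% and 85.71%, in line with the reported values, whereas the stopping row gives about 5.44% and the 10-client/100-loop row about 21.67%, so those two reported figures (10.3% and 17.80%) appear to rest on a different baseline.

```python
# Runtimes in seconds from Table 1: (without instrumentation, with instrumentation).
RUNTIMES = {
    "Initialization": (14.449, 27.768),
    "Src-scenario": (27.57, 32.778),
    "Stopping": (1.892, 1.995),
    "ClusterDemo 10 clients, 10 loops": (81.311, 151.0),
    "ClusterDemo 10 clients, 100 loops": (462.677, 562.92),
}


def overhead_percent(t_without, t_with):
    """Assumed definition: slowdown due to instrumentation, in % of the baseline."""
    if t_without <= 0:
        raise ValueError("baseline runtime must be positive")
    return 100.0 * (t_with - t_without) / t_without


for scenario, (base, instrumented) in RUNTIMES.items():
    print(f"{scenario:35s} {overhead_percent(base, instrumented):6.2f}%")
```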
Improving the GC activity and the thread lock (and unlock) policy will reduce this overhead by reducing the number of traced methods. Although the overhead caused by the JVMPI agent was high, we identified very critical bottlenecks. In future work, we plan to investigate the JVMTI profiling technology (http://java.sun.com/j2se/1.5.0/docs/guide/jvmti/). We hope that replacing our JVMPI agent with a JVMTI agent will reduce the instrumentation overhead.

Table 1: Instrumentation overhead

Scenario                             Total    Runtime without   Runtime with      Overhead
                                     events   instrumentation   instrumentation
                                              (s)               (s)
Initialization scenario              2,081    14.449            27.768            92.17%
Src-scenario                         4,008    27.57             32.778            18.89%
Stopping scenario                    2,081    1.892             1.995             10.3%
ClusterDemo: 10 clients, 10 loops    26,253   81.311            151               85.7%
ClusterDemo: 10 clients, 100 loops   81,077   462.677           562.92            17.80%

6. Related works

Several visualization tools for concurrent programs were proposed in the 1990s [3, 10, 11]. Many of them are not interactive: they only allow the user to adjust the simulation speed, and it is not possible to interact with the displayed objects or to move backwards in time. In addition, to our knowledge, there is no other performance tool that proposes visualizations of communications and synchronizations. Previous studies addressed the performance of the Tomcat servlet engine with Paraver [6]; this tool proposes neither zooming nor navigation capabilities. Recently, several studies have concentrated on object-oriented concurrent programs, and several UML-based (www.uml.org) performance visualization tools [4, 5] have been proposed. The Pajé visualization framework can be used in this context; however, we do not offer any UML-based diagrams. Attali et al. [8] proposed Jitan, a visualization framework for object-oriented programs. However, Jitan is a simple research prototype that cannot test
more than 100 objects, whereas with Pajé we can debug very large, real industrial applications. The Paradyn performance debugging environment [11] was designed to identify performance errors automatically, whereas Pajé does not offer any assistance for automatically identifying deadlocks or failures. Another major limitation of Pajé is that it does not support online visualization.

7. Conclusion and future work

In this paper, we presented our visualization and performance monitoring of large-scale Java applications. A performance monitoring infrastructure was proposed, implemented and evaluated. During our experiments, we monitored some key performance elements of two application servers, and we identified very critical bottlenecks (contention regions, initialization problems, a massive GC activation policy, etc.). These results have been used by the JOnAS team to improve its performance. In the future, we plan to extend Pajé with online monitoring functionalities. Pajé has been used for the performance monitoring of parallel programs deployed on clusters, and ongoing work is studying its use for network monitoring.

8. References

[1] J. C. de Kergommeaux and B. de O. Stein, "Flexible performance visualization of parallel and distributed applications", Future Generation Computer Systems, vol. 19, 2003, pp. 735-747.
[2] W. De Pauw, O. Gruber, E. Jensen, R. Konuru, N. Mitchell, G. Sevitsky, J. Vlissides and J. Yang, "Jinsight: Visualizing the execution of Java programs", IBM Research Report, February 2000.
[3] L. De Rose, Y. Zhang and D. A. Reed, "SvPablo: A Multi-language Performance Analysis System", Lecture Notes in Computer Science, vol. 1469, Sept. 1998, pp. 352-355.
[4] T. Souder, S. Mancoridis and M. Salah, "A Framework for Creating Views of Program Execution", Proceedings of the IEEE International Conference on Software Maintenance (ICSM 01), 2001.
[5] H. Leroux, C. Mingins and A. Réquilé-Romanczuk, "JACOT: A UML-Based Tool for the Run-Time Inspection of Concurrent Java Programs", 2nd International Conference on the Principles and Practice of Programming in Java, Kilkenny City, Ireland, 2003.
[6] D. Carrera, J. Guitart, J. Torres, E. Ayguadé and J. Labarta, "An Instrumentation Tool for Threaded Java Application Servers", XIII Jornadas de Paralelismo, Spain, Sept. 2002, pp. 205-210.
[7] M. T. Heath, A. D. Malony and D. T. Rover, "Parallel Performance Visualization: From Practice to Theory", IEEE Parallel & Distributed Technology: Systems & Technology, vol. 3, no. 4, December 1995, pp. 44-60.
[8] I. Attali, D. Caromel and M. Russo, "Graphical Visualization of Java Objects, Threads, and Locks", IEEE Distributed Systems Online, vol. 2, no. 1, January 2001.
[9] O. Naím D'Paola, C. E. Wu, A. Bolmarcich, M. Snir, D. Wootton, F. Parpia, A. Chan, E. Lusk and W. Gropp, "Performance Visualization of Parallel Programs", Proceedings of SC2000: High-Performance Networking and Computing, November 2000.
[10] M. Heath and J. A. Etheridge, "Visualizing the Performance of Parallel Programs", IEEE Software, vol. 8, no. 5, Sept. 1991, pp. 29-39.
[11] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam and T. Newhall, "The Paradyn Parallel Performance Measurement Tool", Computer, vol. 28, no. 11, Nov. 1995.
[12] S. P. Reiss and M. Renieris, "Generating Java Trace Data", Proceedings of the ACM 2000 Java Grande Conference, San Francisco, CA, June 2000, ACM Press.
[13] Java Virtual Machine Profiler Interface (JVMPI), http://java.sun.com/j2se/1.4.2/docs/guide/jvmpi/jvmpi.html