Uppsala Master's Theses in Computing Science (216)
Examensarbete DV3
2002-06-07
ISSN 1100-1836

Java Profiler for TelORB

Johnny Yakoub

Information Technology
Computing Science Department
Uppsala University
Box 311
S-751 05 Uppsala
Sweden

This work has been carried out at Ericsson Utvecklings AB, Box 1505, 125 25 Älvsjö, Sweden

Abstract
Producing fast software is a non-trivial task. Performance isn't a single step in the software development process; it needs to be part of the overall software development plan. Profiling a Java program running on Solaris, Linux or Windows is an easy task using one of the commercial or non-commercial tools available. This is not the case when trying to profile an application running on TelORB. Why this is, and the solution to the problem, are some of the topics discussed in this MSc Thesis report.

Supervisor: Martin Sköld
Examiner: Mikael Pettersson
Passed:
Contents

1 Introduction
  1.1 What is TelORB
2 Profiling
  2.1 How does it work
    2.1.1 JVM
    2.1.2 JVMPI
    2.1.3 Profiler Agent (PA)
    2.1.4 Profiler Front-end (PF)
3 TelORB
  3.1 What is TelORB
  3.2 TelORB SW Basics
  3.3 Commercial Java profilers with TelORB
4 Profiler Products
  4.1 Commercial Products
    4.1.1 OptimizeIt
    4.1.2 Rational Quantify
    4.1.3 DevPartner
    4.1.4 JProbe
  4.2 Non-Commercial Products
    4.2.1 Jinsight
    4.2.2 PerfAnal
    4.2.3 Profiler
    4.2.4 Apache (Jakarta-JMeter)/HPjmeter
    4.2.5 JProf (for TelORB)
5 Implementation of the new EJProf
  5.1 Rewrite the Profiler in Java
    5.1.1 How is JProf Implemented
    5.1.2 A New JProf in Java
    5.1.3 EJProf implementation
  5.2 What can be done in the future
Appendix A: User's Manual
Appendix B: Some tuning techniques
References
1 Introduction

Java programs are becoming increasingly complex; they execute on different machines with different platforms, and in different physical locations. Examining a local program during runtime is difficult. Examining multiple programs executing on different virtual machines, in different locations, is even more difficult. Most difficult of all is examining many programs running on a Java Virtual Machine (JVM) of TelORB, the distributed operating system.

Producing good quality software is non-trivial. Quality isn't a single step in the software development process; it needs to be part of the overall software development plan. An important part of defining software quality is to ensure that it conforms to the specifications defined for the resulting product. Part of producing good quality software includes controlling resource usage, since running out of resources might cause the software to fail or perform poorly and thus not meet the requirements. Two important resources that need to be controlled through "performance profiling" are CPU usage and memory usage.

The well known object oriented software development process has four phases:
1- Analysis
2- Design
3- Coding
4- Testing
In order to get better and faster software we need to add one more phase:
5- Performance profiling.

Figure 1: The software development phases with profiling added (Analysis, Design, Coding, Testing, then Profile until the performance is accepted, then Done).

1.1 What is TelORB
TelORB is a distributed shared-nothing cluster running on off-the-shelf hardware (Intel Pentium processors). Applications are defined as processes running anywhere on a subset
of the available processors (this could be between two and all processors in the cluster). This is achieved by defining logical groups of processors, called processor pools, that application process types are declared to be allocated to. The operating system of TelORB then decides on what specific processor a process should run at a given point in time. This can vary depending on the availability of the physical processors in the pool.

TelORB is developed at Ericsson. It is suitable for controlling telecom applications. It can be loaded into a group of processors which behave like one single system. Applications can run on top of TelORB. TelORB is described as an open, fault tolerant and truly scalable system. It can run Java and C++ code; moreover, IDL (Interface Definition Language) is supported as part of the CORBA standard.

TelORB contains (among other things):
1 a distributed soft real-time operating system
2 a distributed main-memory database system
3 support for up to 40 processors in one cluster
4 an optimized and reliable inter-process communication protocol (IPC)
5 a Hot-Spot JVM
6 support for on-line upgrades

The TelORB run-time environment is complemented with a development environment using off-the-shelf compilers (gcc and javac) as well as script languages. A specific specification language has been developed for defining processes, process communication, and persistent data. From the name TelORB it is obvious that CORBA is supported, and compilers for both IDL to C++ and IDL to Java are also part of the development environment.

Part of the requirement of the work presented in this thesis was to evaluate commercial profiling tools and determine whether any of them can be used with TelORB. If no commercial products were found suitable, the goal was to create a tool that would send stack traces of a JVM to a remote location at a fixed interval. This stack trace data could be used in several different ways: interactive developer observation of the entire system, visualization of the control flow in the system, or collection of the data for future analysis. This tool would have to run on different operating systems, architectures and JVMs, with minimal performance impact.

What is profiling then?...
2 PROFILING

Many systems do not meet all of the performance requirements on the first try. Once a performance problem has been discovered, there is a need to begin profiling. Profiling means the ability to monitor and trace events that occur during run time, the ability to track the cost of these events, as well as the ability to attribute the cost of the events to specific parts of the program. It determines what areas of the system are consuming the most resources, and pinpoints where bottlenecks or other performance problems such as memory leaks might occur. Once the developer knows where potential trouble spots are located, he/she can change the code to remove or reduce their impact.

2.1 How does it work
Figure 2.1 shows the four important parts that make it possible to get profile information from the JVM. These are:

Figure 2.1: The JVM and the Profiler Agent communicate through JVMPI events and controls; the Profiler Agent and the Profiler front-end communicate over a wire protocol.

2.1.1 JVM (the Java Virtual Machine)
Where the profiled application is executed.

2.1.2 JVMPI (JVM Profiler Interface)
This is a two-way function call interface between the JVM and the profiler agent. The JVM notifies the agent of various events, corresponding to heap allocation, thread start, etc.

2.1.3 The Profiler Agent (PA)
The profiler agent issues controls and requests for more information through the JVMPI. In EJProf we use HPROF, which is part of the JDK.

The HPROF Profiler Agent (PA)
The JDK provides an interface known as the JVMPI (the one mentioned above). The HPROF agent is a dynamically-linked library shipped with JDK 1.x. It interacts with the JVMPI and presents profiling information either to the user directly or through a profiler front-end. We can invoke the PA by passing a special option to the JVM:

java -Xrunhprof:<tags> ProgName

where ProgName is the Java application which we want to profile. Depending on the type of profiling requested (the tags), the PA instructs the JVM to send it the relevant profiling events. It gathers the event data into profiling information and outputs the result by default to a file. For example, the following command obtains the CPU samples:
java -Xrunhprof:cpu=samples ProgName

The HPROF Stack Traces
The stack trace in each frame contains class name, method name, source file name, and the line number. The user can set the maximum number of frames collected by the HPROF agent. The stack trace can be used to reveal which methods performed heap allocation and which were ultimately responsible for making calls that resulted in memory allocation.

HPROF CPU Profiling
A CPU time profiler collects data about how much CPU time is spent in different parts of the program. Using this information, the programmer can find ways to reduce the total execution time. There are two ways to obtain profiling information: either by using statistical sampling, which is less disruptive to program execution but cannot provide completely accurate information; or through code instrumentation, which may be more disruptive but allows the profiler to record all the events it is interested in. The latter method can be implemented by injecting profiling code directly into the profiled program.

2.1.4 The Profiler Front-end (PF)
This may or may not run in the same process as the PA. It can execute as another process on the same machine as the PA, or on a remote machine connected via the network. The only thing the user can see is the PF. He/she defines some values such as the depth of the traces, which port the PF and the PA are going to communicate through, etc., and then starts the PF. The PA starts together with the JVM and connects to the PF (which of course has to be defined when we start the application). Depending on how the PF works, the information received from the PA will be organised and later either visualized or saved to a file.

In this MSc Thesis, I am going to test and analyse some of the existing products on the market, trying to find out whether one/some/none of them:
1- has a PF which supports the Solaris and Linux platforms primarily, and Windows secondarily, and
2.a- has a PA which can be started as a TelORB process on the TelORB platform, or
2.b- can communicate with the HPROF agent, which in turn is a part of TelORB.
If none of them meets these criteria, the next step will be to develop a PF, and the protocol between it and the HPROF agent. The best solution, of course, would be to build a new profiler from scratch, that is, developing both a PA and a PF, something which may take a long, long time. For example, the OptimizeIt profiler took 5 years to develop.

There are several good commercial profiling tools on the market, but before presenting them I have to say some words about TelORB, and why it is so hard or even impossible to find a commercial profiler which works as it should in TelORB.
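As a side note before moving on, a typical HPROF invocation often combines several tags in one call. The example below is only illustrative (the class name is a placeholder); the individual tags are the standard HPROF options summarized later in Table 1, and java -Xrunhprof:help lists them all. It asks for CPU samples with a stack depth of 6, thread information and ASCII output to a file:

java -Xrunhprof:cpu=samples,depth=6,thread=y,format=a,file=java.hprof.txt MyTestApp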
3 TelORB

3.1 Software Development for TelORB
TelORB processors execute code contained in load modules. These are files found on the I/O processor in the UNIX computer, containing the compiled and linked program and special functions for communication between the I/O processor and the TelORB processors. When the system is booted the load modules are loaded onto the TelORB processors through a multi-cast load. This means that all processors receive all load modules, but only the configured ones will be kept.

Figure 3.1: Producing Load modules and Site Configurations

Producing Load modules and Site Configurations
Load modules are produced by using the TelORB design environment. It is also the environment for building site specific configurations. A number of command based tools, EPTOOLS, are provided (see figure 3.1). These tools make it possible to generate code from Delos and IDL specifications. The designers add implementation code to complete the functionality of the application. The code is compiled into load modules. A site configuration is specified in the epct file. The epct command generates a site configuration that can be loaded onto a system.

3.2 Commercial Java Profilers with TelORB
The JVM Profiling Interface (JVMPI) is a standard protocol that allows different vendors to provide profiling tools for JVMs that support the protocol. The JVMPI protocol is, however, not a network protocol where the JVM and the Profiler Agent (PA) can be separated over the network. The PA is linked together with the JVM and must be recompiled (if available with source code) or be delivered as a relocatable library for the target OS. Since no vendors give away PA source code or currently support generating libraries for Dicos, it is not possible to use commercial Java profilers with the TelORB JVM.

TelORB provides a Java profiler, JProf, for Vega simulators running on SPARC processors. By setting profiler options developers can obtain:
- Snapshots of all live objects in the Java heap
- Information on the most heavily allocated objects
- Profiles of program execution based on CPU data
- Snapshots of monitors in the system and information on monitor contention.
JProf will be described in more detail in 4.2.5.
4 Profiler Products

There are many performance analysis and tuning tools. Not surprisingly, most of them, and the best of them, are commercial and have a free demo version for a 15-day period or one week. I have tested some of these products to decide whether they work with TelORB or not, and here are the results:

4.1 Commercial products

4.1.1 OptimizeIt

What is OptimizeIt:
OptimizeIt is a Java profiling tool which enables developers to test and improve the performance of their Java applications, applets, servlets, Java beans, Enterprise Java beans and Java Server Pages. OptimizeIt is plug and play: there is no need to recompile the program with a custom compiler or to modify class files before execution. One simply runs the program from OptimizeIt to start testing its performance. Because no code modification is required, any Java code that the program uses is included in the profile.

OptimizeIt has two main components:
- The OptimizeIt user interface (the front end) is a window that displays profiles and controls for refining the profiles and viewing source code.
- The OptimizeIt audit system (the profiler agent) is a real-time detective that reports the activity on the JVM back to the OptimizeIt user interface.

OptimizeIt main features

Memory Profiler (figure 4.1)
- Provides real-time display of all classes used by the test program and of the number of allocated instances
- Graphically indicates where instances are allocated
- Filters class lists to make it possible to focus on relevant classes
- Automatically highlights lines of code that allocate object instances
- Gives control over garbage collection
- Displays incoming and outgoing object references in real time
- Displays string representations of allocated instances
- Provides Java API calls to make it easy to invoke the memory profiler from inside the test program
- Computes reduced reference graph for incoming references
- Displays references from reference graph roots

CPU Profiler (figure 4.2)
- Allows starting or stopping profiling at any time
- Displays data for pure CPU use or for elapsed time (pure CPU and inactive phases)
- Graphically represents thread activities for the sampling period
- Displays profiling information for each thread or thread group
- Finds frequently used methods using hot spot detectors
- Provides Java API calls so that one can invoke the CPU profiler from inside the program
- Provides both a sampler based profiler and an instrumentation based profiler
- Can provide millisecond or microsecond precision
- Displays invocation count
- Provides a filter to remove fast methods

Figure 4.1: GUI of the memory profiler

Other features (figure 4.3)
- Displays and exports charts showing high level VM information including heap size, heap used, number of threads, number of busy threads and number of loaded classes
- Starts any Java application, applet or servlet directly from the OptimizeIt user interface
- Saves snapshots of a profiling session at any time. Snapshots can be reloaded later for analysis or comparison of profiles.
- Provides compatibility with JRun 2.3 and 3.0, WebLogic 4.5, 5.0 and 5.1, iPlanet Web Server (NES 4.0 and 4.1), JServ 1.1, Java Web Server 1.1.3 and 2.0, ServletExec 2.2 and 3.0, Jakarta Tomcat, IBM WebSphere, Apple WebObjects, Netscape Application Server (NAS), ATG Dynamo, SilverStream, GemStone
- Executes the test program remotely while analyzing performance
- Exports any information in HTML or ASCII
- Provides Java API calls so that one can invoke the OptimizeIt audit from any Java code

Tests and Analysis
This software has both the profiler agent (audit) and the front end (OptimizeIt). To start the agent we have to make a call to -Xrunoii with some options, just like when we call -Xrunhprof, and some paths have to be set:

CLASSPATH: <InstallDir>/OptimizeIt/lib/optit.jar
LD_LIBRARY_PATH: <InstallDir>/OptimizeIt/lib
Figure 4.2: CPU Profiler

It was a little bit complicated to make it work, not just in TelORB, but with the usual Java JDK 1.2 too. Some problems worth mentioning are:

Figure 4.3: The VM Information window

Tests done in a Java 1.2 environment
- When the CLASSPATH is set as above, the JVM could not find the test program. Therefore I had to save the program in the same directory as
the optit.jar.
- When that was done and I tried to start the application with the following call:
java -Xrunoii mytestapi
an error message was returned saying that a JVMPI doesn't exist.
- When I tried again, executing the application with the jre1.2.2 which is enclosed with the OptimizeIt profiler:
../jre1.2.2/bin/java -Xrunoii mytestapi
it worked. The profiler agent and the front end were connected to each other on port 1470, and the front end began showing the activity it was supposed to show.

Tests done in TelORB
- In the .epct file, the following row has to be added:
VariableName = -Xrunoii -port <portno> -dllpath <LD_LIBRARY>
This, of course, cannot be done, because such a variable name does not exist in TelORB. To make it possible, we would need the source code of the PA of this product, in order to adapt it and recompile it for TelORB.
- The agent can be started from the application code. But when the application starts it needs to load some libraries which have to be TelORB compatible, and these do not exist, which makes this impossible.

Summary
This tool works with its own JVMPI agent and needs to link many libraries. This implies that it cannot be run on TelORB processors without having access to its source code, or to Dicos(1)-compatible versions of the libraries it needs. All the other profilers mentioned below that had problems with linking libraries have exactly the same problem as OptimizeIt.

4.1.2 Rational Quantify
This profiler tool is available for Windows and Unix, not Linux. No tests were made, due to the fact that it had to be installed by the system administrator.

4.1.3 DevPartner
The DevPartner Remote Agent is available for Linux, Solaris and Windows. No test has been done using this agent. The reason is that it works just like hprof and JProf, and it can communicate only with one type of PF called TrueTime, which in turn is available only for Windows. Another reason is that even if there were a compatible version of TrueTime working with Unix, the agent has many libraries to be linked at start, which makes it impossible to work with TelORB.

(1) One of the interface specification languages in TelORB. The Dicos source code is fed into the Dicos compiler, which generates C++/Java stub and skeleton files. The application-specific code is then manually added to the skeleton files before final compilation (with a C++ or Java compiler) together with other application code.
4.1.4 JProbe
This was just like OptimizeIt. It has an agent and a front-end; the PF works on its own JVM, and the agent runs on the server's JVM. The reason that it cannot work with TelORB is the libraries that have to be linked.

4.2 Non-commercial products

4.2.1 Jinsight
This product is developed by IBM and it supports only Windows, AIX and OS/340.

4.2.2 PerfAnal
This is a tool to visualize the files created by hprof. It is very useful and makes it very easy to understand the profile file of hprof. PerfAnal has an easy-to-understand GUI. It makes a tree of the heap traces and visualizes all information about the currently active and passive threads in the system. The problem with this product is that it can only read one type of file, that is, the binary file created when we call hprof with the cpu=samples tag; i.e. PerfAnal can only visualize the CPU-profiler data, but not the memory-profiler data. Another thing is that it can only read the file once, i.e. it cannot be a useful tool for a runtime profiler. It took about four hours to make this work as described. The reason was that it could not read files created by the HPROF of TelORB, but the files created in Linux and Solaris could be visualized. Something more to say about PerfAnal/TelORB is that we cannot let hprof in TelORB create binary files, just ASCII files, which are useless with PerfAnal.

4.2.3 Profiler
There is almost nothing to mention here; it just takes some data from the JVMPI and prints it to standard out.

4.2.4 Apache (Jakarta-JMeter)/HPjmeter
These two products have exactly the same properties as PerfAnal; they are just more advanced than PerfAnal and can read ASCII files, but they cannot be used at runtime.

4.2.5 JProf (for TelORB)
JProf is a runtime profiler front-end (PF) developed at Ericsson Utvecklings AB. It has a lot of functionality and a non-graphical user interface programmed in C. JProf works with hprof as the profiler agent (PA). It sends commands to hprof and receives profile information coming from the JVMPI through HPROF. This is done when the user sends a command to the PA; the PA in turn sends another command to the JVM, which suspends all processes currently running on it and gets either the heap dump or some other useful information, such as memory and/or CPU profile data, number of threads, CPU samples, etc. It then sends this back to the PA, which takes the information, edits it and sends it to the PF, which just prints out everything in its buffer to a file.
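JProf itself is written in C, but the command/response cycle described above can be sketched in a few lines of Java. The sketch below is illustrative only: the class name, port number and output file are made up, and the single byte written to the agent merely stands in for the full command buffer described later in section 5.1.2.

import java.io.*;
import java.net.*;

// Minimal sketch of a JProf-style profiler front-end: wait for the HPROF
// agent to connect, forward a command number read from standard input and
// append whatever the agent sends back to a file.
public class MiniJProf {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(30331);      // the PA connects here
        Socket agent = server.accept();
        OutputStream toAgent = agent.getOutputStream();
        InputStream fromAgent = agent.getInputStream();
        BufferedReader stdin =
                new BufferedReader(new InputStreamReader(System.in));
        FileOutputStream log = new FileOutputStream("jprof.out", true);
        String line;
        while ((line = stdin.readLine()) != null && !line.equals("exit")) {
            // A real front-end builds the full command buffer (figure 5.1);
            // here a single byte stands in for it.
            toAgent.write(Integer.parseInt(line.trim()));
            toAgent.flush();
            byte[] buf = new byte[4096];
            int n = fromAgent.read(buf);                    // first chunk of the reply
            if (n > 0) {
                log.write(buf, 0, n);                       // a real PF keeps reading
            }                                               // until the record is complete
        }
        log.close();
        agent.close();
        server.close();
    }
}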
Using JProf, we can get the following information:
- Heap dump
- Allocation sites
- Newly created traces
- Monitor dump
We can also configure it to change the information properties. Available configurations are:
- Force GC (this does not work as it should; TelORB just terminates).
- Change some settings:
  - Alloc traces on/off
  - CPU sampling on/off/clear. (This configuration does not work either. Sun Microsystems say that CPU sampling cannot be collected at runtime, but when I made some tests with other tags than the tags I must use, it worked. The negative effect was that the memory profiler stopped working. The test I made was as follows: to make hprof (the client) connect to the PF (the server), one first starts the PF, and then hprof with this command:

java -Xrunhprof:net=<machine>:<port#> <class to be profiled>

This means that we start hprof by connecting it to the PF, which runs on the defined machine at the defined port, and all the commands will be sent from the PF to the PA (hprof) via the net. I.e. if we want to get the allocation sites, we have to send a command to hprof telling it to send us the information we want. However, we cannot get the CPU samples at runtime via the net, according to Sun Microsystems. Now, what I did was to start hprof with this command:

java -Xrunhprof:cpu=samples,thread=on,net=<M>:<P> <class>

telling the JVMPI that we want the CPU samples sent over the net, and they will be sent to us every time we send a command from the PF to the PA. The undesired effect of this is that we will not be able to collect information about the memory.)
  - Change max stack depth in CPU samples and alloc traces.

JProf is the only PF which is TelORB compatible; users can get many types of profiler information from it. We can of course not get as much information as with OptimizeIt, but on the other hand, we can rewrite it in Java and develop a graphical user interface to make it easier for the user to understand the results collected from the JVM.

Profiler Options
The options initially used by the profiler are set by including a JavaRunHProf parameter entry in the Initial Data section of the .epct file (the system configuration file of TelORB). The parameter takes a string of Xrunhprof options and some tags consisting of an option and a value:

JavaRunHProf = "-Xrunhprof:<option=value>,..."
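As a concrete illustration (the host name and port are only examples; uabs90i1c19 is the machine name used in Appendix A), such an entry could look like this, where the extra tags are optional options from Table 1 below:

JavaRunHProf = "-Xrunhprof:net=uabs90i1c19:30331,depth=8,thread=y"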
Table 1 below summarizes the available options and their possible values:

Table 1: available options

- heap (values: dump | sites | all; default: all)
- cpu (values: samples | times | old; default: turned off)
  samples: do not use this value, it does not give dependable results on TelORB.
  times: a profile of program execution obtained by measuring the time spent in individual methods, as well as by counting the number of times each method is called (see CPU TIME). Output is generated when the profiler exits; use the jprof command to obtain it.
  old: a profile of program execution obtained by listing the number of times a method has been called by another method. Output is generated when the profiler exits; use the jprof command to obtain it.
- monitor (values: y | n; default: n)
  Set this option to enable output of MONITOR DUMP and MONITOR TIME. Output is generated when the profiler exits; use the jprof command to obtain it.
- format (values: a | b; default: a)
  Sets the output format to ascii (a) or binary (b).
- file (value: <filename>; default: write output to loadinggroup01/java.hprof for binary output, or loadinggroup01/java.txt for ascii output)
- net (value: <hostname>:<port>; default: write to file)
  Set this option to write output over a TCP socket. This option is required when using jprof.
- depth (value: <size>; default: 4)
  Specifies the stack trace depth.
- cutoff (value: <value>; default: 0.0001)
  Specifies the output cutoff point.
- lineno (values: y | n; default: y)
  Enables printing of line numbers in traces.
- thread (values: y | n; default: n)
  Enables printing of thread numbers in traces.
- doe (values: y | n; default: y)
  This option must be set to obtain output from the options heap=dump, heap=sites, cpu=samples (not to be used on TelORB), cpu=times, cpu=old and monitor=y.

Help for the Xrunhprof command is available by typing the following at a terminal prompt:

java -Xrunhprof:help

The only option that we have to define in the .epct file is the net option; the other options can be sent via JProf to hprof by typing the command number. This is described in more detail later on.
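Outside TelORB, the same options can also be given directly on the command line. The example below (the class name is a placeholder) writes a memory profile, with allocation sites and a stack depth of 8, to an ASCII file when the program exits:

java -Xrunhprof:heap=sites,depth=8,doe=y,format=a,file=java.txt MyTestApp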
5 Implementation of the new EJProf

No commercial product worked with TelORB, but I still had three choices:
1- Design and implement a new profiler containing both a PA and a PF. This, of course, could not happen in such a short time (6 months); it took more than five years to develop OptimizeIt, with more than 7 developers designing and coding it.
2- Develop a new PF with a protocol to work with HPROF, or, better than this,
3- Rewrite JProf, which already is TelORB and HPROF compatible, in Java, and design a new GUI to make it easier to understand the output data received from HPROF.

I chose the third one, due to the short time it would take. The goal of this implementation was to build a PF which fulfils the following criteria:
1- Profiles the memory
2- Profiles the CPU
3- Gives information about the VM
4- Is a multi-threaded server which HPROF can connect to and which understands its protocol
5- Executes on different platforms
6- And, most important, can visualize the information received from HPROF.

Many of these criteria can be met by rewriting JProf in Java, but it is not certain that they all work correctly in JProf either. The most important source of inspiration was the OptimizeIt PF (OPF). I tried to develop a GUI which visualizes the information in as easily understandable a way as the OPF does. The idea about the VM information also comes from this source, and of course from many others. However, not all of these goals could be realized, only some of them. To explain which ones, let me begin by describing the implementation process I went through.

5.1 Rewrite the profiler in Java
I divided this process into three steps: 1- understanding the implementation of JProf, 2- redesigning and implementing JProf in Java (without a GUI), and 3- implementing JProf (EJProf) in Java with a GUI.

5.1.1 How is JProf implemented
This PF is implemented as a server. It starts and the PA connects to it. When this is done the profiling can be started, and we exit by ending the PF or the PA. A preview of the implementation is presented in pseudo code:

START THE SERVER AT PORT <port>
WAIT FOR CONNECTION
LOOP WHILE NOT EXIT
  READ FROM STDIN
  BUILD THE COMMAND
  SEND COMMAND
  WAIT FOR RETURNED DATA
  RECEIVE DATA
  WRITE DATA TO A FILE
END LOOP
CLOSE SOCKET
EXIT

5.1.2 A new JProf in Java
Rewriting this program in Java is an easy task, except that the protocol between the client and the server was difficult to understand and to implement in Java. JProf has a 15-byte buffer (figure 5.1) which can contain different values depending on what command we want to execute. All possible values are presented in table 2.

Figure 5.1: The structure of the buffer sent to hprof from JProf. It consists of the command (a 1-byte int between 1 and 8), a 4-byte int holding a serial number, a 4-byte int holding the number of bytes that follow, a 2-byte int for sorting, and then either a 2-byte int holding an id or a 4-byte float holding a cutoff ratio.

The most difficult job was to save an int in either a 2- or 4-byte part of the buffer. I had to make a distinction between these two situations by writing a function which takes the index in the buffer where the int has to be saved, the int itself, and a flag which can be 2 or 4. This function converts the int to either two or four digits and saves them in the buffer, starting at the position given by the index argument, as follows:

void inttobyte(int index, int i2b, int flag){
    int tus, hun, tio, en, i = i2b;
    switch(flag){
    case 4:
        if(i>999){
            tus = i/1000;
            i = i-(tus*1000);
            hun = i/100;
            i = i-(hun*100);
            tio = i/10;
            en = i-(tio*10);
        }
        else if (i>99){
            tus = 0;
            hun = i/100;
            i = i-(hun*100);
            tio = i/10;
            en = i-(tio*10);
        }
        else if(i>9){
            tus = 0;
            hun = 0;
            tio = i/10;
            en = i-(tio*10);
        }
        else{
            tus = 0;
            hun = 0;
            tio = 0;
            en = i;
        }
        cmdbuf[index] = (byte)tus;
        cmdbuf[index+1] = (byte)hun;
        cmdbuf[index+2] = (byte)tio;
        cmdbuf[index+3] = (byte)en;
        break;
    default:
        if(i>9){
            tio = i/10;
            en = i-(tio*10);
        }
        else{
            tio = 0;
            en = i;
        }
        cmdbuf[index] = (byte)tio;
        cmdbuf[index+1] = (byte)en;
    }
    return;
}

Another issue was to represent a float between 0.0 and 1.0 in 4 bytes. I solved this problem by reading the float value from standard in and saving it directly in a buffer consisting of 4 bytes. Then I just made a function which copies this 4-byte buffer into the buffer that is to be sent:

// READ THE FLOAT NUMBER FROM STDIN
// AND SAVE IT AS A BYTE ARRAY
byte[] tmp = new byte[zbuflen];
try{
    int i = System.in.read(tmp);
    ...
}
catch(Exception e){
    ...
}
// LATER, CALL copyArray TO SAVE THE
// FLOAT IN THE BUFFER

// copyArray
void copyArray(byte [] arr1, byte [] arr2, int start1, int start2, int len){
    for(int i=0;i<len;i++)
        arr1[i+start1] = arr2[i+start2];
}

SITES BEGIN (ordered by allocated bytes) Wed Apr 24 15:05:42 2002
          percent          live        alloc'ed   stack class
 rank   self  accum      bytes objs    bytes objs trace name
    1 92.02% 92.02%     102744   40  3843344 1067   278 [C
    2  0.96% 92.98%        520   13    40160 1004    23 [C
    3  0.79% 93.77%      32800    2    32800    2    78 [C
    4  0.61% 94.38%        936   39    25560 1065   281 java/lang/String
    5  0.58% 94.96%      23552    2    24144    6    70 [C
...
   35  0.01% 99.04%        264    2      616    6   115 [C
   36  0.01% 99.05%          0    0      528    1   235 [B
   37  0.01% 99.06%        528    1      528    1   226 [B
   38  0.01% 99.08%          0    0      480   20   317 java/lang/StringBuffer
   39  0.01% 99.09%        440    7      440    7   196 [C
...
Totals: 100.00% 254944 1284 4176624 7283
SITES END

Figure 5.2
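As an aside, the copyArray helper shown just before Figure 5.2 does the same job as the standard System.arraycopy method (which the DynamicIntArray class in section 5.1.3 already uses), so its loop could be replaced by a single call:

// equivalent to copyArray(arr1, arr2, start1, start2, len)
System.arraycopy(arr2, start2, arr1, start1, len);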
Table 2: BUFFER LAYOUT

- HPROF_CMD_GC (0x01) + 8
  Forces a GC.
- HPROF_CMD_DUMP_HEAP (0x02) + 8
  Obtains a heap dump.
- HPROF_CMD_ALLOC_SITES (0x03) + 8
  Obtains allocation sites. Followed by 2 bytes which can be:
  0x0001 incremental vs. complete
  0x0002 sorted by alloc vs. live
  0x0003 whether to force GC
  followed by a 4-byte cutoff ratio, which is a float with a value between 0.0 and 1.0.
- HPROF_CMD_HEAP_SUMMARY (0x04) + 8
  Obtains a heap summary.
- HPROF_CMD_DUMP_TRACES (0x06) + 8
  Obtains all newly created traces.
- HPROF_CMD_CPU_SAMPLES (0x07) + 8
  Obtains a HPROF_CPU_SAMPLES record. Followed by a 2-byte int (ignored by Sun), followed by a 4-byte cutoff ratio, which is a float with a value between 0.0 and 1.0.
- HPROF_CMD_CONTROL (0x08) + 8
  Changes settings. Followed by a 2-byte int which has one of the following values:
  0x0001: alloc traces on
  0x0002: alloc traces off
  0x0003: CPU sampling on, followed by a 2-byte int holding the thread object id (NULL for all)
  0x0004: CPU sampling off, followed by a 2-byte int holding the thread object id (NULL for all)
  0x0005: CPU sampling clear
  0x0006: clear alloc sites info
  0x0007: set max stack depth in CPU samples and alloc traces, followed by a 2-byte int holding the new depth.

(8) = a 4-byte int (serial #) followed by a 4-byte int (the remaining bytes).
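To make the layout above more concrete, the sketch below assembles and sends the 15-byte buffer for HPROF_CMD_ALLOC_SITES (command 3), using the cmdbuf field and the inttobyte and copyArray helpers from section 5.1.2. It is illustrative only: the serial number, the sorting flag and the cutoff bytes are arbitrary example values, and out is assumed to be the OutputStream of the socket connected to the agent.

// Sketch: build and send the command-3 buffer following figure 5.1 and Table 2.
void sendAllocSites(int serial, byte[] cutoffBytes, java.io.OutputStream out)
        throws java.io.IOException {
    cmdbuf[0] = 3;                              // 1-byte command: allocation sites
    inttobyte(1, serial, 4);                    // 4-byte serial number
    inttobyte(5, 6, 4);                         // 4-byte count of the bytes that follow (2 + 4)
    inttobyte(9, 2, 2);                         // 2-byte sorting flag (0x0002, sorted by alloc vs. live)
    copyArray(cmdbuf, cutoffBytes, 11, 0, 4);   // 4-byte cutoff ratio, already read as bytes
    out.write(cmdbuf, 0, 15);                   // 15 bytes in total, as in figure 5.1
    out.flush();
}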
5.1.3 EJProf implementation
The old JProf is a PF which does not have a graphical user interface; therefore I had to design a PF which has a GUI and acts exactly like the JProf programmed in Java. In order to do that, I divided the job into several steps:

Analysis
In this step, I started testing all the profiling possibilities I could test (the ones that can be collected from TelORB), shown in table 3, with their output data in table 4. The results I got from this were:

Table 3: COMMANDS SUPPORTED

Command 1 - Force a GC: runs the garbage collector (does not work).
Command 2 - Obtain a heap dump: see HEAP DUMP.
Command 3 - Obtain allocation sites: see SITES.
Command 4 - Obtain heap summary: generates a summary of the number of allocated bytes and instances.
Command 5 - Exit: the profiler will exit and produce a dump consisting of the following (if the options have been specified in the .epct file configuration parameter):
  - Contended monitor usage, see MONITOR TIME
  - Java heap, see HEAP DUMP
  - Allocation sites, see SITES
  - CPU samples (does not give dependable results on TelORB)
  - CPU timing, see CPU TIME
Command 6 - Obtain all newly created traces: shows all new traces, see TRACE.
Command 7 - Obtain a HPROF_CPU_SAMPLES record: do not use this option, it does not give dependable results on TelORB.
Command 8 - Change settings: allows you to change the stack trace depth from the value specified with depth in the .epct file, as well as changing allocation trace settings.
Command 9 - Obtain a monitor dump: shows a snapshot of monitors and threads in the system.
Command 255 - Help: shows jprof help information.
Table 4: THE PROFILER OUTPUT DATA

THREAD START / THREAD END: these words in the printout mark the lifetime of Java threads.

TRACE: represents a Java stack trace. Each trace consists of a series of stack frames. Other records refer to numbered TRACEs to identify:
  - where object allocations have taken place
  - the frames in which GC roots were found
  - frequently executed methods.

HEAP DUMP: a complete snapshot of all live objects in the Java heap. The following distinctions are made:
  ROOT - the root set as determined by GC
  CLS - classes
  OBJ - instances
  ARR - arrays

SITES: a sorted list of allocation sites. This identifies the most heavily allocated object types, and the TRACE at which those allocations occurred.

CPU SAMPLES: not to be used on TelORB.

CPU TIME: a profile of program execution obtained by measuring the time spent in individual methods (excluding the time spent in callees), as well as by counting the number of times each method is called. The "count" field indicates the number of times each TRACE is invoked.

MONITOR TIME: a profile of monitor contention obtained by measuring the time spent by a thread waiting to enter a monitor. Entries in this record are TRACEs ranked by the percentage of total monitor contention time, together with a brief description of the monitor. The "count" field indicates the number of times the monitor was contended at that TRACE.

MONITOR DUMP: a complete snapshot of all the monitors and threads in the system.
Memory profile:
By adding the row

JavaRunHProf = "-Xrunhprof:net=<machine>:<port>"

to the .epct file, and sending command number 2 through the PF, we could get the heap dump. That is, all the traces, all the created objects and their creators, the allocation sites, all threads both active and suspended, and some information about the heap, such as the heap size (minimum, current and maximum), the number of allocated and free KBytes, and the overhead. One important thing to mention about this profiling information is that it amounts to a roughly 5 MB file, which slows the system down when it is sent over the net every one, two or three seconds between the PA and the PF (at a one-second interval, 5 MB corresponds to roughly 40 Mbit/s of sustained traffic). Due to this, we have to forget command 2 (the heap dump). It gives us very useful and important information, but it takes a lot of resources, and that is exactly what we are trying to avoid.

The second command is number 3, obtain allocation sites. By executing this command, the PA returned information about the sites: which of them, and how many (bytes or number of objects), are live or allocated, and in which trace they were created, as shown in figure 5.2. This information was about 300 lines at the very start of the profiled application, and after a few seconds it decreased to about 7-10 lines. It contains very useful information about the memory.

Command number 1, forcing the Garbage Collection (GC), was useless, because the GC of the TelORB JVM cannot be touched. When I tried to execute the command, TelORB simply stopped completely.

Obtaining newly created traces can be done by executing command number 6.

We can make many configurations by calling command number 8. We can for example change the max depth of the stack, change the sorting of the allocation sites, etc.

CPU SAMPLES BEGIN (27 samples, 7 ticks, depth 6) Mon Feb 4 15:13:04 2002
rank   self  accum   method
   1 28.57%  0.00%   java/io/PrintStream.print (Ljava/lang/String;)V
   2 28.57% 14.29%   java/io/OutputStreamWriter.flushBuffer ()V
   3 28.57% 14.29%   Str.main ([Ljava/lang/String;)V
   4 28.57% 14.29%   java/io/PrintStream.write (Ljava/lang/String;)V
.....
  20 14.29% 71.43%   java/io/PrintStream.write ([BII)V
  21 14.29% 71.43%   java/util/jar/JarFile.<init> (Ljava/lang/String;)V
  24 14.29% 71.43%   java/util/jar/JarFile.<init> (Ljava/lang/String;Z)V
CPU SAMPLES END

Figure 5.3: The output representing the CPU samples
CPU profile:
There is only one command we can use here, that is number 7. Actually, it does not really work either. When we use the net tag in the call to the PA, and if we change the call, as mentioned before, by adding the cpu=samples tag, we can get the information. But we can still get the information only once, i.e. the information stays the same even if we ask for new information every second. This means that this command also has to be left out. The CPU samples give very important information about the CPU: the PA returns information about every method, what percentage of the CPU resources it uses, and which other method is responsible for this, as shown in figure 5.3.

VM information:
The data about the VM can be collected in just one way. That is to execute the heap dump command, i.e. command #2, and read the last 3 lines. Of course this is not a good solution, because of the size of the information that has to be transferred between the PA and the PF. Therefore we have to skip this one too.

Figure 5.4: EJProf PF server architecture. Several JVMs (TelORB VM 1, Java VM 2, ..., JVM N), each with its own HPROF agent, connect to the EJProf profiler front-end, which opens one GUI per connection (GUI 1, GUI 2, ..., GUI N).

After these analyses, one can see that there is not much profiling information we can get from HPROF running on TelORB. However, this little information is still useful to get some idea about the memory, some more information about the traces, which classes create a specific one, and how many allocated classes exist in the system.

Design
Using the analysis discussed above, one gets an idea about how the EJProf PF server is going to proceed, what functionality it is going to have and how the protocol between
the PA and this PF is constructed. The only thing we have to design is the GUI: how it should be designed to get the user's commands, pass them to the protocol, and later take the received data from the receiver and visualize it for us. Another thing which can be implemented in this server is that it can be multi-threaded, that is, it can support the use of multiple analysis tools, which can be added or removed at runtime, as shown in figure 5.4.

Figure 5.5: EJProf functionality (flowchart): define a port, start listening, accept a connection, define the sorting and the cutoff ratio, run the memory profiler, and then stop, continue or exit; several such listener threads (1 to N) can run in parallel.

As shown, EJProf can be executed on any platform. It has been tested with the PA running on TelORB, Windows, Linux and Solaris, while EJProf itself was running on Windows, Linux and Solaris. The functionality of EJProf is shown in figure 5.5, and in pseudo code it can be expressed as:

START EJPROF
DEFINE PORT
LOOP WHILE NOT EXIT SERVER
  CLICK ON LISTEN
  WAIT FOR CONNECTION
  WHILE NOT EXIT LOOP
    DEFINE THE SORTING METHOD (IF NEEDED)
    DEFINE THE CUTOFF RATIO (DEFAULT 0.0)
    CLICK ON MEMORY PROFILER
      a new GUI starts presenting the profiler data
    TO STOP TEMPORARILY, CLICK ON STOP
    TO CONTINUE PROFILING, CLICK ON CONTINUE
    TO EXIT, CLICK ON EXIT
  TO EXIT, CLICK ON EXIT
END LOOP
EXIT SERVER

Implementation
Three windows had to be implemented:

1- The EJProf server window (figure 5.6). In this window the user can see the following:
- A field to define the port number the server will listen to, in order to establish a connection with a PA client.

Figure 5.6: The server window of EJProf

- A button for starting to listen.
- A button for exiting the program and all currently running PF threads in the system.
- A menu consisting of the same functions as mentioned above, and a help menu for understanding how the PF works.

2- The second window (figure 5.7) is opened by clicking on the button "Listen to port", whereupon the server starts listening to the port defined in the first window and opens the PF window which belongs to the specified port. This window has the following components:
- Three subareas:
  - The Memory profiler area. In this area we have three check boxes. The first and second box are for choosing how the information about the allocation sites will be presented in the table in the next window. The third box is for starting and stopping the GC, but this does not work with TelORB and has to be deactivated. There is also a slider which we can use to define the cutoff ratio. It has a percentage format, 0~100%, representing the cutoff ratio which is between 0.0~1.0. The last component in this area is a button to start the memory profiler.
  - The CPU profiler area. It is not activated in this GUI, but exists only as a shell
program. However, this does not work with TelORB or via the net anyway.
  - Finally, a current state area. Here we can read what is happening at this moment in the system. Actually, this only gives us information about whether the system is listening or connected. The idea was that it would give full information about every detail currently happening in the system, but time was too short to realize this.
- An exit button to stop this PF thread.
- A menu where we can start the Memory/CPU profiler and get the same help window we could get from the first window.

Figure 5.7: Profiler front-end window. We can distinguish between it and another such window by checking the name of the window; it has the same name as the port number.

3- The third window is the one which gives us the profiler information collected from the JVMPI, sent through HPROF to the EJProf server, and visualized in this window (figure 5.8). This frame contains:
- An area presenting the total number of all objects, live or allocated, in the JVM. These numbers change value every 3 seconds, and to have some idea about what is happening in the system they have to be saved somehow; I have also chosen to visualize them in a diagram, which makes them easier to see. The PF gets the values from the PA, saves them in an array, and then traverses the array and visualizes them in the two diagrams shown in this window.
- There is also a table consisting of three columns. The first one contains the name of the class, the second is the number of live objects of this class, and the third is
the number of allocated objects of this class.
- A stop and a continue button, which can be used to temporarily stop visualizing data received from the PA; i.e. the data is still received even if we stop this thread, but it is never visualized. This property can be used in the future to save the data in some data structure for reuse if wanted, to create a history file.
- Another important thing here is that we can get information about the callers of the classes. By clicking on a specific row (class name), a new window opens containing information about the methods that have created the specific object we clicked on (figure 5.9), and on which line of each method's code the call was made.

Figure 5.8: The profiler data window. By clicking on a specific row we can get information about which methods have created this class, i.e. the traces, and on which line of the code.

The difference between the old version of this PF, JProf, and this one is that JProf just had to read the transferred data from hprof and save it in a file once. With "once" I mean that JProf didn't send a new command to the PA if the user did not command it to
do that. The new application sends the same command automatically every 3 seconds. This leads to a large amount of data being transferred every three seconds, automatically. Another thing to consider is that the data received by JProf was just written to a file; no organization or sorting of this data was needed. In EJProf, instead, the data has to be organized in groups, sorted and visualized. Some data also has to be saved and used again and again.

Figure 5.9: From this tree we can see the methods that have created the class [C (char array).

Depending on the platform the profiled application is executing on (TelORB, Windows, Solaris, Linux, etc.), the same command from the PF can lead the PA to return profile information with a different structure. E.g. the memory profiler in TelORB returns the allocation sites as a string in table format where the separator between the columns is not a tab but space characters, whereas in Solaris the separators are tabs. The numbers in TelORB are centered, but in Solaris they are aligned to the right. These two examples are enough to make us handle the received data one character at a time.

The second difficulty is that the number of classes/methods created when the profiled application starts differs a lot from the number after, let us say, 2 minutes. E.g. in TelORB we cannot profile just one class or just one application; instead all the classes will be profiled. That is, when we start TelORB we have to take care of about 2000 classes in the first few minutes; later this number can decrease to about 40 classes. In Solaris, when I started my test application - which is just 10 rows of code - over 200
rows of information about the classes were received, and after about 10 seconds the number decreased to 3 classes.

For these reasons I had to add some new classes, dynamic arrays, to the JProf that was implemented in Java (part 5.1.2). Such classes do not really exist in Java; arrays exist, but they are not dynamic. At the same time the received data has to be handled one character at a time, to make it possible to execute the PF on any platform. The allocation sites contain strings and numbers, so we need two types of arrays, DynamicStringArray and DynamicIntArray. These work like this:

CREATE AN ARRAY OF 10 CELLS
FILL IT WITH DATA
IF THE ARRAY IS FULL
  CREATE A NEW ARRAY WITH 10 * RESIZING RATIO CELLS
  COPY THE OLD ONE TO THE 10 FIRST CELLS
  REMOVE THE OLD ONE
  RENAME THE NEW ONE TO THE OLD ONE'S NAME

And here is the code of this class:

public class DynamicIntArray{
    private int[] intArr;
    // resizing ratio used when the array is full
    int resize;

    // Creates a new dynamic array of integers with a length of 10.
    DynamicIntArray(){
        intArr = new int[10];
        resize = 2;
    }

    // Creates a new dynamic array of integers with a length of size.
    DynamicIntArray(int size){
        intArr = new int[size];
        resize = 2;
    }

    // Creates a new dynamic array of integers with a length of size.
    // Changes the resizing factor to news.
    DynamicIntArray(int size, int news){
        intArr = new int[size];
        resize = news;
    }

    // Adds an element to the array at the specified position.
    public void addElement(int pos, int elem){
        // check if the array is full
        if (pos >= intArr.length) {
            // The specified position is outside the actual size of
            // the intArr array. Double the size, or if that still does
            // not include the specified position, set the new size
            // to resize*position.
            int newsize = resize * intArr.length;
            if (pos >= newsize)
                newsize = resize * pos;
            int[] newdata = new int[newsize];
            System.arraycopy(intArr, 0, newdata, 0, intArr.length);
            intArr = newdata;
        }
        intArr[pos] = elem;
    }

    // Returns the element at position pos.
    public int getElement(int pos) {
        // Get the value from the specified position in the array.
        // Since all array positions are initially zero, when the
        // specified position lies outside the actual physical size
        // of the intArr array, a value of 0 is returned.
        if (pos >= intArr.length)
            return 0;
        else
            return intArr[pos];
    }

    // Returns the array as a normal integer array.
    public int[] getIntArr(){
        return intArr;
    }

    // Returns the length of the array.
    public int length(){
        return intArr.length;
    }
}

The DynamicStringArray has exactly the same members, constructors and methods, but supports strings instead of ints.

Organizing the data received from the PA is done as follows:
- The data starts with some informational text, which is thrown away. The second part of this data is the traces. These are read and saved in a file. This file is created when we start profiling, and it has a name consisting of the string ejproffile followed by the port number, i.e. ejproffile30331 for the session shown in figure 5.8. If the file already exists it is replaced by the new one. It is used later when we want the callers tree in figure 5.9.
- Allocation sites come next. This data is received as a string in table format, as shown in figure 5.2. It is organized by the program in dynamic arrays, and is later visualized as a table (figure 5.8). This data, as mentioned before, has to be read one character at a time, and as we build each word, we save it in the respective array, as described in the following example.
Example: Let's say we have got these rows of data:

SITES BEGIN (ordered by allocated bytes) Wed Apr 24 15:05:42 2002
          percent          live        alloc'ed   stack class
 rank   self  accum      bytes objs    bytes objs trace name
    1 92.02% 92.02%     102744   40  3843344 1067   278 [C
    2  0.96% 92.98%        520   13    40160 1004    23 [C
Totals: 100.00% 254944 1284 4176624 7283
SITES END

When the buffer reader finds the occurrence of SITES BEGIN, it reads and throws away the two following rows. Afterwards, it reads all characters following this strategy:

START
LOOP WHILE STRING != "Totals:"
  READ UNTIL THE FIRST INT AND THROW IT AWAY (1)
  READ UNTIL THE FIRST INT, BUILD THE FLOAT UNTIL % (92.02)
  SAVE IT IN A VARIABLE accum
  READ THE EMPTY CHARS AND THROW THEM AWAY (tabs or spaces)
  READ AND THROW AWAY THE FIRST INT (102744)
  READ THE EMPTY CHARS AND THROW THEM AWAY (tabs or spaces)
  READ THE INTEGER AND SAVE IT IN VARIABLE live (40)
  READ THE EMPTY CHARS AND THROW THEM AWAY (tabs or spaces)
  READ THE INTEGER AND SAVE IT IN VARIABLE alloc (1067)
  READ THE EMPTY CHARS AND THROW THEM AWAY (tabs or spaces)
  READ THE INTEGER AND SAVE IT IN VARIABLE trace (278)
  READ THE EMPTY CHARS AND THROW THEM AWAY (tabs or spaces)
  READ THE STRING AND SAVE IT IN VARIABLE name ([C) (*)
  SAVE THE VARIABLES IN THE DYNAMIC ARRAYS (described later)
END LOOP
READ EMPTY CHARS, FLOAT, %, EMPTY CHARS, INT, EMPTY CHARS
READ INT AND SAVE IT IN THE VARIABLE lives (1284)
READ EMPTY CHARS, INT, EMPTY CHARS
READ INT AND SAVE IT IN VARIABLE allocs (7283)
END

(*) Now that we have the variables name, live, alloc, and trace, we check if this method is already in our array. If so, we just add the number of live objects to the number stored at the same position as this method in the array liveobjs, and do the same with the number of allocated objects. Then we append the trace to the string in the array traces. We do the same thing for every row between SITES BEGIN and Totals, and later this becomes the table presented in the profiler window in figure 5.8. The total numbers of live objects and allocated objects are saved in the dynamic arrays lives and allocs respectively. Later, using these two arrays, we will be able to draw the diagrams shown in figure 5.8.
The code presenting this function follows below:

void organisedata(String str){
    int rank, liveobj, allocobj;
    float self, accum;
    String name=null, trace=null, bool = null;
    int live=0, alloc=0, counter = 0;
    try{
        namearr = new DynamicStringArray();
        tracearr = new DynamicStringArray();
        boolarr = new DynamicStringArray();
        liveobjarr = new DynamicIntArray();
        allocobjarr = new DynamicIntArray();
        if(str.startsWith("SITES BEGIN")){
            for(int i=0;i<3;i++)
                str = SocketIput.readLine();
            int count = 0;
            while(!(str.startsWith("SITES END"))){
                if(str.startsWith("Totals"))
                    break;
                int i = 0;
                while((str.charAt(i)) != '%'){ i++; }
                self = Float.parseFloat((str.substring(i-6,i)).trim());
                i++;
                while((str.charAt(i)) != '%'){ i++; }
                accum = Float.parseFloat((str.substring(i-6,i)).trim());
                i++;
                while((str.charAt(i)) == ' '){ i++; }
                while((str.charAt(i)) != ' '){ i++; }
                while((str.charAt(i)) == ' '){ i++; }
                int lives = i;
                while((str.charAt(i)) != ' '){ i++; }
                liveobj = Integer.parseInt((str.substring(lives,i)).trim());
                while((str.charAt(i)) == ' '){ i++; }
                while((str.charAt(i)) != ' '){ i++; }
                while((str.charAt(i)) == ' '){ i++; }
                int allocs = i;
                while((str.charAt(i)) != ' '){ i++; }
                allocobj = Integer.parseInt((str.substring(allocs,i)).trim());
                while((str.charAt(i)) == ' '){ i++; }
                int traces = i;
                while((str.charAt(i)) != ' '){ i++; }
                trace = (str.substring(traces,i)).trim();
                trace = trace.concat("\n");
                while((str.charAt(i)) == ' '){ i++; }
                name = (str.substring(i)).trim();
                for(int index = 0; index <= counter; index++){
                    if((counter > 0) && (name.equalsIgnoreCase(namearr.getElement(index)))){
                        liveobjarr.addElement(index, ((liveobjarr.getElement(index)) + liveobj));
                        live = live + liveobj;
                        allocobjarr.addElement(index, ((allocobjarr.getElement(index)) + allocobj));
                        alloc = alloc + allocobj;
                        tracearr.addElement(index, trace.concat(tracearr.getElement(index)));
                        index = counter;
                        counter--;
                    }
                    else if (index == counter){
                        namearr.addElement(counter, name);
                        liveobjarr.addElement(counter, liveobj);
                        allocobjarr.addElement(counter, allocobj);
                        tracearr.addElement(counter, trace);
                        alloc = alloc + allocobj;
                        live = live + liveobj;
                    }
                }
                while(!(SocketIput.ready())){
                    //-- create a timer which loops in 1-2 minutes,
                    //-- because if the profiler agent crashes then this
                    //-- will execute forever without a timer stopping it
                }
                str = SocketIput.readLine();
                counter++;
            }
            if(relative == -1){
                relative = (alloc * 1000);
                if (relative > 1000000)
                    relative = 1000000;
                tot[3] = relative;
                relative = (live * 100);
                tot[2] = relative;
            }
            while(SocketIput.ready()){
                SocketIput.readLine();
            }
            tot[0] = live;
            tot[1] = alloc;
            // Start the GUI.
            if(!(sitesflag)){
                memrcv = new MemoryRCV(srv, this);
                memrcv.memthread.start();   // create a new GUI
                // new Sites(the arrays of the data).
                sitesflag = true;
            }
        }
        else if(str.startsWith("CPU SAMPLES BEGIN")){
            // To be constructed in the future, if the behaviour of
            // the CPU samples function over the net changes.
        }
        else if(str.startsWith("HEAP DUMP BEGIN")){
            // To be constructed in the future,
            // to get VM information.
        }
    }
    catch(Exception e){
        System.out.println("Error Receiving in Organise: " + e);
    }
}

5.2 Summary and future work
To profile applications running on the TelORB OS of Ericsson we need a TelORB compatible profiler. Such a profiler did not exist; therefore an analysis had to be done, investigating whether a suitable commercial profiler exists or not. The result of this analysis was negative. That was because we would need either the source code of the commercial product, to be able to recompile it for TelORB, or a linkable library which is compatible with TelORB, and neither of them could be arranged.

The solution to this was to build a new PF, with HPROF as the PA. This PF would be able to profile the memory and the CPU, give information about the VM and visualize the profiler data it gets from the PA. Unfortunately, not all of these goals were fulfilled. The reasons were the time limit I had for implementing it, and that TelORB had its own requirements and HPROF some others. This PF was called EJProf (Ericsson Java Profiler).

EJProf can be developed further in many ways; it can become almost like the OptimizeIt PF. Some things to be added are:
- Getting the heap dump and trying to organize it in some way, to collect the information about the threads, i.e. if they are active or passive, how long each thread has executed or been suspended, and at the same time trying to get some more information about the type of every object and the instances of each object.
- Trying to solve the problem with forcing and stopping the GC.
- Trying to get information about the virtual machine. We already have information about the classes, the number of them and how many are activated, but we still need information about:
  * Heap graph (size, min, max, used)
  * Threads (number, active)
  * GC activity
- Trying to get around the problem of collecting the CPU profiler information. When this is solved, the time spent in each thread can be visualized in different ways, as follows:
  1- CalleeInclusive
  2- CallerInclusive
  3- LineExclusive
  4- LineInclusive
- When finding some bottleneck we should be able to click on this information, and a window containing the corresponding code is displayed, where we can edit it directly in that frame (editor).

Of course these things are not so easily accomplished, but they are not impossible. Take for example the CPU samples: this can be solved by redefining the code of this function in hprof, which is open source, and making it work. So nothing is impossible if we just have the time we need to do it.
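For the VM information point above, one possible shortcut, which has not been tested on TelORB and is only a sketch, would be to let a small helper thread run inside the profiled JVM itself and poll the standard runtime API instead of going through HPROF. The class name and output format below are made up for illustration:

// Sketch: sample basic VM information from inside the profiled JVM.
public class VmInfoSampler implements Runnable {
    public void run() {
        Runtime rt = Runtime.getRuntime();
        while (true) {
            long total = rt.totalMemory();         // current heap size in bytes
            long free = rt.freeMemory();           // unused part of the heap
            int threads = Thread.activeCount();    // threads in the current thread group
            System.out.println("heap used=" + (total - free)
                    + " heap total=" + total
                    + " threads=" + threads);
            try {
                Thread.sleep(3000);                // same 3-second interval as EJProf
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

// Started from the profiled application, e.g.:
//     new Thread(new VmInfoSampler()).start();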
Appendix A: User Manual for EJProf

The application is started by following these steps:

1- Build the application:
- Move to the directory where the Makefile is and type make (i.e. uabs31i1c18> make).

2- Start the program EJProf:
- Move to the directory classes/: cd classes
- Type java EJProfPackage/EJProf

3- Redefine the .epct file so that the agent connects to the profiler server. Add the following data under INITIAL DATA:
JavaRunHProf = "-Xrunhprof:net=<machine name>:<port number>"
For example, if your machine is uabs90i1c19 and you have started the profiler server at port 30333, the line will look like:
JavaRunHProf = "-Xrunhprof:net=uabs90i1c19:30333"
Be careful not to use the same port as one of TelORB's ports (the ports defined in the .epct file). The default port is 30331.

4- Start the server:
- Type the port number in the server window (figure 5.6); it has to be the same as the port number defined in the file mysystem.epct.
- Click on "Listen to port" to start a server which listens to the defined port.
- A new frame will be opened, and the only thing you can do now is either wait for the client to connect to the server, or click on Exit to terminate the server and close the frame.

5- Start TelORB (or the application in Solaris/Linux):
- Start TelORB as usual.
- If you start the application in a Solaris/Linux environment, type:
java -Xrunhprof:net=<machine name>:<port number> <file to be profiled>

6- When the button "Memory Profiler" becomes activated, you can click on it to start the PF, which will visualize the profiling information.

7- If you want to see which methods are creating a specific class, click on that class and a tree presenting the traces will be shown. You can change the values of the radio buttons and the slider at runtime. One thing you should not do is mark the checkbox that lets the PF force the GC; this will stop TelORB completely.

A small standalone sketch that can be used to check that the chosen host name and port are reachable from the profiled machine is given below.
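If the "Memory Profiler" button never becomes active, the connection between the agent and the server may have failed. The sketch below is not part of EJProf and its name (PortCheck) is invented for the example; it simply listens on a port, accepts one connection and prints the first lines it receives, which is enough to confirm that the -Xrunhprof:net=<host>:<port> setting can reach the machine running the PF.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            int port = Integer.parseInt(args[0]);       // e.g. 30333
            ServerSocket server = new ServerSocket(port);
            System.out.println("Listening on port " + port + " ...");
            Socket client = server.accept();            // blocks until the agent connects
            System.out.println("Connected from " + client.getInetAddress());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream()));
            for (int i = 0; i < 10; i++) {              // show the first lines the agent sends
                String line = in.readLine();
                if (line == null) break;
                System.out.println(line);
            }
            client.close();
            server.close();
        }
    }

Run it as java PortCheck 30333 before starting the real server, then start the profiled application; if nothing is printed, check the host name and port given in the .epct file.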
Appendix B: Some tuning techniques

The following tuning techniques are covered in the Java Performance Tuning book [7]:

adding RAM to improve system memory limitations
altering network parameters to reduce unnecessary communication overheads
altering process priorities to increase the amount of time allocated by the CPU to selected processes
altering specifications to eliminate performance conflicts forced by inappropriate specifications
altering the data structure to improve the performance of a class without changing its external interface
anticipating data requirements to transfer extra data together with required data in distributed applications
applying loop optimizations to speed up runtime processing
applying many small optimizations to speed up overall execution
avoiding growing files to reduce operating system overheads
avoiding access control to speed up method invocations
avoiding blocking the paint() method to eliminate interface blocking and maintain responsiveness
avoiding casts to speed up runtime processing
avoiding creating copies to reduce object creation and garbage collection overheads
avoiding decompression to speed up runtime processing
avoiding dependencies to eliminate blocking and maintain responsiveness
avoiding garbage collection to reduce garbage collection overheads
avoiding initialization to eliminate unnecessary overheads
avoiding locks to reduce synchronization overheads and to eliminate blocking and maintain responsiveness
avoiding logging to eliminate unnecessary i/o and method call overheads
avoiding method calls to reduce runtime overheads
avoiding object creation to reduce object creation and garbage collection overheads
avoiding serialized execution to support multiple processors and reduce synchronization overheads
avoiding speculative casts to reduce runtime overheads
avoiding String creation to reduce object creation and garbage collection overheads
avoiding synchronization to reduce synchronization overheads
avoiding system paging to improve runtime performance
avoiding temporary objects to reduce object creation and garbage collection overheads
avoiding too much parallelism that might incur excessive runtime overheads
avoiding tuning: one of the simplest tuning techniques
avoiding unnecessary assignments to reduce runtime overheads
avoiding wrapping primitives to reduce runtime overheads, object creation overheads and garbage collection overheads
basic tuning techniques: the simplest techniques to try first
batching to reduce unnecessary communication overheads in distributed applications and to improve performance by combining activities
batching data to reduce unnecessary communication overheads in distributed applications
buffering i/o to reduce i/o overheads
bypassing serialization overheads to eliminate unnecessary runtime overheads
bypassing shared resources to reduce performance costs
caching to improve the performance of repeated access and calculations
caching distributed data to reduce unnecessary communication overheads in distributed applications and to speed up repeated access of distributed data
caching frequently accessed elements to improve the performance of repeated access and calculations
caching i/o to reduce unnecessary communication overheads in distributed applications and to speed up repeated access of distributed objects
caching InetAddress to reduce unnecessary communication overheads in address lookups
caching intermediate results to improve the performance of calculations
canonicalizing objects to reduce object creation and garbage collection overheads and to speed up comparisons of objects
choosing faster collections to improve the performance of collection objects
clustering files together to reduce i/o and operating system overheads
clustering objects to reduce i/o and conversion overheads
combining messages to reduce communication overheads in distributed applications
comparing Strings by identity to speed up comparisons
compressing data to speed up network transfers
converting recursion to iteration to speed up runtime processing
cutting dead code to eliminate unnecessary runtime overheads
decoupling i/o from other activities to eliminate blocking and maintain responsiveness
designing applets to improve applet download time
desynchronizing classes to reduce synchronization overheads
duplicating data to reduce communication overheads in distributed applications
eliminating common expressions to reduce runtime overheads
eliminating error checking to reduce runtime overheads
eliminating logging to reduce i/o and runtime overheads
eliminating null tests to reduce runtime overheads
eliminating prints to reduce i/o and runtime overheads
eliminating unnecessary variables to reduce runtime overheads
enumerating constants to speed up comparisons of objects and improve memory requirements
externalizing instead of serializing to speed up serialization
faster conversions to strings to improve runtime performance
faster data conversion to improve runtime performance and speed up serialization
faster formatting to improve runtime performance
faster hostname translation to reduce unnecessary communication overheads in address lookups
faster i/o using cached filesystems
faster manipulation of array elements to improve runtime performance
faster manipulation of variables to improve runtime performance
faster startup from cached filesystems
faster startup from disk sweet spots
faster tests to improve runtime performance
flattening objects to reduce object creation and garbage collection overheads
flexible method entry points to support faster methods
focusing on object creation to reduce garbage collection and object creation overheads
focusing on shared resources to eliminate blocking and maintain responsiveness
identifying performance limitations to eliminate performance conflicts
improving case-insensitive searches to speed up comparisons of objects
improving low level connections to reduce unnecessary communication overheads
improving search strategies to speed up runtime processing
improving the user interface to improve the user's perception of application performance
increasing swap space to improve system memory limitations
initializing variables once only, to eliminate unnecessary runtime overheads
inlining to speed up runtime execution
inlining in bottlenecks for targeted speed up of runtime processing
inserting delays to stabilize the user's perception of the performance
isolating swap to improve system i/o
journaling to speed up i/o
keeping files open to reduce i/o overheads
keeping spare capacity to reduce system overheads
load balancing to improve runtime performance
load balancing TCP/IP to improve network server performance
locking memory to specify the amount of memory allocated by the system to selected processes
managing threads to reduce runtime overheads
measuring network speeds to improve the user's perception of the performance
minimizing communication to reduce unnecessary communication overheads in distributed applications
minimizing CPU contention to reduce operating system overheads
minimizing server down-time to improve the user's perception of the performance
minimizing transaction time to reduce blocking and maintain responsiveness and to improve i/o throughput
monitoring the application to identify performance changes
monitoring the system to identify performance changes
monitoring threads to improve performance
moving loops to native routines to speed up runtime processing
moving object creation time to speed up runtime processing
multiplexing to reduce unnecessary communication overheads in distributed applications
multiplexing i/o using select() to reduce runtime overheads
multithreading stateful singletons to support multiple processors and reduce synchronization overheads, and to reduce object creation and garbage collection overheads
optimizing array matching algorithms to speed up comparisons of objects
optimizing collections to improve the performance of a class without changing its external interface
optimizing comparisons in sorts to speed up the sort
optimizing CPU utilization
optimizing for update or access to improve the performance of a class without changing its external interface
optimizing load balancing to improve runtime performance
optimizing loop termination tests to improve runtime performance
optimizing network packet sizes to reduce unnecessary communication overheads
optimizing sorting
overriding default serialization to speed up serialization
packaging classfiles to reduce i/o overheads
parallelizing i/o to speed up i/o
partially reading objects to speed up i/o and data conversions
partitioning applications to reduce unnecessary communication overheads in distributed applications
partitioning data to reduce unnecessary communication overheads in distributed applications
partitioning system resources to allocate determinate resources
preallocating objects to speed up runtime processing
predicting performance for analysis and design phases
pre-sizing collections to speed up runtime processing and to reduce object creation and garbage collection overheads
putting i/o in the background to reduce blocking, maintain responsiveness and to improve i/o throughput
reading forwards to speed up i/o
recording all changes to identify performance changes
redesigning for less communications to reduce unnecessary communication overheads in distributed applications
reducing dropped packets to reduce network retransmissions
reducing features to reduce performance overheads
reducing method call frequency to speed up runtime processing
reducing overheads at the design stage to improve performance
reducing total transmissions to reduce unnecessary communication overheads in distributed applications
reducing unnecessary communication overheads in distributed applications
removing unnecessary transactions to reduce blocking, maintain responsiveness and to improve i/o throughput
removing unused fields to reduce object creation and garbage collection overheads
renaming to shorter names to reduce class loading and network transfer times
replacing classes to eliminate extraneous overheads
replacing object collections with arrays to reduce object creation and garbage collection overheads
replacing objects with primitives to reduce object creation and garbage collection overheads
replacing primitives with ints to speed up runtime processing
reusable object pools to reduce object creation and garbage collection overheads
reusing collections to reduce object creation and garbage collection overheads
reusing exceptions to speed up runtime processing, reduce object creation and garbage collection overheads
reusing linked list nodes to reduce object creation and garbage collection overheads
reusing objects to reduce object creation and garbage collection overheads
reusing parameters to reduce object creation and garbage collection overheads
rewriting switch statements to speed up runtime processing
scheduling recently used threads to improve runtime performance
searching compressed data directly to speed up runtime processing and improve memory requirements
shrinking classfiles to reduce class loading and network transfer times
sorting approximately to speed up runtime processing
sorting directly on a field to speed up runtime processing
sorting linked lists faster
sorting twice to speed up runtime processing
specifying and eliminating environments for analysis and design phases
specifying performance at the analysis and design phases
speculative optimization by adaptive compilers
speeding applet downloads to improve applet download time
speeding network transfers to speed up serialization
speeding object creation time
speeding up array copying
splitting transfers to eliminate blocking and maintain responsiveness
striping disks to speed up i/o
stubbing to reduce unnecessary communication overheads in distributed applications
targeting easier fixes to speed up tuning
threading class loading to improve startup time
threading data structures to speed up intense calculations
threading slow operations to improve runtime performance
threading strategies to improve runtime performance
tightly specifying SQL to speed up SQL queries
timing out processes to reduce operating system overheads
timing out transactions to reduce blocking and maintain responsiveness
transferring blame to improve the user's perception of the performance
tuning disks to speed up i/o
unrolling loops to make them faster
upgrading disks to speed up i/o and system performance
using a local DNS server to reduce unnecessary communication overheads in address lookups
using array lookups to replace runtime processing
using asynchronous communications to eliminate blocking and maintain responsiveness
using asynchronous i/o to reduce blocking and maintain responsiveness and to improve i/o throughput
using atomic operations to reduce synchronization overheads
using batch processing to improve performance by combining activities
using better hardware to speed up i/o, CPU, system performance and network bandwidth
using bigger buffers to improve buffer effects
using change-objects to improve transaction times by moving changes into change-objects
using char arrays instead of Strings to reduce object creation and garbage collection overheads, and to speed up character processing and avoid String performance limitations
using code motion to speed up runtime processing
using CollationKeys instead of Collators to speed up sorting
using comparison by identity for faster comparisons of objects
using comparisons to 0 to speed up comparisons of data
using compression for large transfers to speed up data transfers and reduce communication overheads
using compression to speed up i/o
using data specific comparison algorithms to speed up comparisons of objects
using dummy objects to speed up method invocations
using exception terminated loops to speed up loops
using extra method parameters to reduce method invocations
using HashMap instead of Hashtable to improve performance
using hybrid structures to improve the performance of a class without changing its external interface
using immutable objects to reduce object creation and garbage collection overheads and to speed up object access
using int data types to speed up runtime processing
using interfaces to provide implementation flexibility
using JDBC optimizations to improve runtime performance
using lazy initialization to reduce object creation and garbage collection overheads, and to speed up runtime processing and improve memory requirements
using locks to reduce conflicts generated by simultaneous access to shared resources
using memory mapped files to speed up i/o
using native method calls to improve performance
using parallelism to improve performance
using plain arrays to improve performance
using prepared statements to speed up SQL queries
using prime numbers for hashing functions to improve performance
using raw partitions to speed up i/o
using shared memory to speed up i/o
using singletons to speed up runtime processing and improve memory requirements, and to reduce object creation and garbage collection overheads
using slack system time to improve performance
using specialized keys to improve lookup times
using specialized Maps to improve access and updates for particular data types
using specialized sorts to speed up sorting
using stateless objects to reduce blocking and maintain responsiveness
using static fields instead of instance fields to reduce object creation and garbage collection overheads
using statically defined queries to speed up database queries
using strength reduction to speed up runtime processing
using string canonicalization to speed up comparisons of objects and to reduce object creation and garbage collection overheads
using String methods for comparisons to speed up comparisons
using StringBuffer instead of string concatenation to reduce object creation and garbage collection overheads (see the example after this list)
using thread pools to reduce runtime overheads
using transactionless modes to reduce blocking and maintain responsiveness and to improve i/o throughput and performance
using transient fields to avoid serialization and speed up serialization
using two collections to improve speeds for different types of access and update
using type specific sorting to speed up runtime processing
using weakly referenced objects to reduce object creation and garbage collection overheads
violating encapsulation to speed up field access
wrapping objects to provide implementation flexibility
wrapping objects in sorts to speed up comparisons of objects
unwrapping synchronized wrapped classes to improve performance
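As a small illustration of one of the techniques above ("using StringBuffer instead of string concatenation"), the sketch below builds the same string both ways. The class name is invented for the example. Concatenating inside the loop creates a new String (and a temporary buffer) on every iteration, while the explicit StringBuffer appends into one growing buffer and converts to a String once.

    // Illustration of "using StringBuffer instead of string concatenation".
    public class ConcatExample {

        // Creates a new String object on every iteration.
        static String withConcat(String[] words) {
            String result = "";
            for (int i = 0; i < words.length; i++) {
                result = result + words[i] + " ";
            }
            return result;
        }

        // Appends into one growing buffer and converts to a String once.
        static String withBuffer(String[] words) {
            StringBuffer buf = new StringBuffer();
            for (int i = 0; i < words.length; i++) {
                buf.append(words[i]).append(' ');
            }
            return buf.toString();
        }

        public static void main(String[] args) {
            String[] words = { "profiling", "java", "on", "TelORB" };
            System.out.println(withConcat(words));
            System.out.println(withBuffer(words));
        }
    }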
References

[1] Nathan Meyers: PerfAnal: A Performance Analysis Tool. Sun Microsystems, March 09, 2000.
http://developer.java.sun.com/developer/technicalArticles/Programming/perfanal/index.html
[2] Sun Microsystems: Java Heap Analysis Tool (HAT).
http://java.sun.com/people/billf/heap/index.html
[3] Sun Microsystems: Java Virtual Machine Profiler Interface (JVMPI).
http://java.sun.com/products/jdk/1.2/docs/guide/jvmpi/jvmpi.html
[4] Sun Microsystems: Advanced Programming for the Java 2 Platform, Chapter 8 Continued: Performance Analysis.
http://developer.java.sun.com/developer/onlinetraining/programming/jdcbook/perf3.html
[5] Steve Wilson, Jeff Kesselman: Java Platform Performance: Strategies and Tactics, Chapter 3. Sun Microsystems.
http://java.sun.com/docs/books/performance/1st_edition/html/JPTitle.fm.html
[6] Sun Microsystems: Tech Tips.
http://developer.java.sun.com/developer/jdctechtips/
[7] Jack Shirazi: Java Performance Tuning. O'Reilly, September 2000.
http://www.oreilly.com/catalog/javapt/
[8] Ericsson Utvecklings AB: TelORB Introduction, February 2001.
http://www.ericsson.com/about/publications/review/1999_03/article86.shtml