How to Spice Up Java with the VTune Performance Analyzer?
by Levent Akyil

1. Introduction

Managed environments enable developers to bring their products to market quickly while reducing, if not eliminating, the need to spend valuable resources on porting efforts. One of the key advantages of managed environments is platform independence, which allows many different systems to run the same software (Figure 1). In this article, we will focus on the Java programming language and platform. Java is an object-oriented programming language that was designed to be small, simple, and portable across platforms and operating systems at both the source and binary levels. It is easy to use and enables developers to write platform-independent applications. On the flip side, applications written in Java have a reputation for being slower and requiring more memory than those written in natively compiled languages such as Fortran, C or C++.

Figure 1

The execution speed of Java programs has improved significantly over the years thanks to advances in Just-In-Time (JIT) compilation, adaptive optimization techniques, and language features that support better code analysis. The Java Virtual Machine (JVM) itself is continuously optimized. Because platform-independent Java applications depend heavily on the JVM to provide optimal performance for the platform, the efficiency with which the JVM handles code generation, thread management, memory allocation and garbage collection is critical in determining the performance of Java applications. There is no easy way to offer definitive advice on the performance of Java applications, because applications exhibit diverse performance characteristics across different Java development tools, such as compilers and virtual machines, and across operating systems. The Java programming language is still evolving, and its performance continues to improve.
The aim of this article is to promote awareness of Java performance issues and to present a methodology that helps developers make appropriate choices when analyzing the performance of their applications.

2. Scope

This article provides a top-down methodology for analyzing applications written in the Java programming language, with a special focus on micro-architectural optimizations. I'll show how the Intel VTune Performance Analyzer can be used to analyze Java applications. This article is not an in-depth look at the expected performance of managed environments, their associated runtime engines, or system architectures. Nor does it attempt to address every performance issue or to discuss all the types of tools available for Java performance analysis.
3. Top-down Approach

Software optimization is the process of improving software by eliminating bottlenecks so that it operates more efficiently on a given system and uses resources optimally. Identifying the bottlenecks in the target application and eliminating them appropriately is the key to efficient optimization. There are many optimization methodologies that help developers answer the questions of why, what and how to optimize, and these methods aid developers in reaching their performance requirements. In this article, I'll use a top-down approach (Figure 2): I'll start at a very high level, looking at the overall environment, and then successively drill down into more detail as I begin to tune the individual components within the system. This approach is targeted at Java server applications, but it can be applied to client applications as well. In a nutshell, the performance of a Java application depends on:

- the database and I/O configuration, if used;
- the choice of operating system;
- the choice of JVM and JVM parameters;
- the algorithms used;
- the choice of hardware.

Figure 2

System and Application Level Analysis

If I/O and database accesses are part of a Java application, then the constraints introduced by I/O devices, such as bandwidth and latency, have a bigger impact on performance than the constraints introduced by the micro-architecture. Although tuning and optimizing system-level parameters is critical, database, I/O and OS tuning are outside the scope of this article. Java code, or managed code more generally speaking, is a very specific concept referring to an executable image that runs under the supervision of a runtime execution engine. The top-down approach is reasonable because of some of these unique language features (e.g.
dynamic class loading and validation, runtime exception checking, automatic garbage collection, multithreading, etc.) in addition to memory footprint and the choice of JVM configuration.
JVMs and Just-in-Time Compilers

Just-in-time (JIT) compilation, also known as dynamic translation, is a technique for improving the runtime performance of a program by converting bytecode to native code at runtime, before executing it natively (Figure 3). Initially, JVMs interpret the bytecode; based on certain criteria, they then dynamically compile it. JIT-compiled code generally offers far better performance than interpretation. In many cases it can even offer better performance than static compilation, since many optimizations are only possible at runtime; this resembles the profile-guided optimization support provided by static compilers. With JIT compilation, the code can be recompiled and re-optimized for the target CPU and operating system on which the application actually runs. At runtime, the JIT can choose to generate SSE (Streaming SIMD Extensions) instructions whenever the underlying CPU supports them. With static compilers, if the code is compiled with SSE support, the generated binary might not execute on target processors that don't support the appropriate SSE level. On the other hand, the nature of dynamic translation introduces a slight delay in the initial execution of an application, simply due to bytecode compilation. This start-up delay is usually not a big concern for server Java applications, but it can be for client applications. In general, the more optimization a JIT compiler performs, the better the code it generates, but the longer the start-up delay. In client mode, less compilation and optimization is performed, to minimize start-up time. In server mode, since server applications are usually started once and run for extended periods of time, more compilation and optimization is performed, to maximize performance.
Figure 3

More recent Java platforms have introduced many performance improvements, including faster memory allocation, improved dynamic code generation, improved garbage collection, and reduced class sizes. These improvements help Java applications significantly, but understanding and tuning key parameters in the JVM will get you closer to optimal performance.

Tuning Java Virtual Machine Parameters

Many JVM options can significantly impact performance. Improved performance is possible with proper configuration of JVM parameters, particularly those related to memory usage and garbage collection. It is not possible to cover all the parameters and their usage here, so I'll introduce a few useful ones.
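Before tuning any flags, it helps to confirm what the JVM was actually started with. As a minimal sketch (the class name JvmParams is mine, not from the article), the standard java.lang.management API can report the launch arguments and the effective heap limit from inside the application:

```java
import java.lang.management.ManagementFactory;

class JvmParams {
    // Prints the flags this JVM was launched with and the effective heap
    // limit, so you can verify that options such as -Xmx took effect.
    public static void main(String[] args) {
        System.out.println("JVM args: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.println("Max heap (bytes): "
                + Runtime.getRuntime().maxMemory());
    }
}
```

Running this with, say, -Xmx256m should show the flag in the argument list and a correspondingly bounded maximum heap.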
Heap Size and Garbage Collection

All objects created by an executing Java program are stored in the JVM's heap, whereas all local variables live on the Java stack, and each thread has its own stack. Objects are created with the new operator, and their memory is allocated from the heap. Garbage collection is the process of cleaning up unused (unreferenced) objects that were created on the heap. An object is considered garbage when it can no longer be reached. The garbage collector (GC) must therefore detect unreferenced objects, free the heap space, and make it available again to the application. This functionality, however, doesn't come for free: the JVM has to keep track of which objects are being referenced, and then finalize and free unreferenced objects at runtime. This activity steals precious CPU time. Therefore, having an optimal heap size and garbage collection strategy is vital for optimal application performance. JVMs generally provide different GC strategies (or combinations of strategies), and choosing the correct GC type for the type of application is important (Figure 4):

- Stop-the-world
- Concurrent
- Parallel (stop-the-world)

Figure 4

The problems associated with garbage collection (GC) can be summarized as:

- Increased latency: the application is paused during GC.
- Decreased throughput: GC's sequential overhead (serialization) leads to low throughput and decreases efficiency and scalability.
- Non-deterministic behaviour: GC pauses make application behaviour non-deterministic.

All of these problems affect both client- and server-side Java applications. Client Java applications generally require rapid response times and low pause times, whereas server applications require increased throughput in addition to the client-side requirements. Server applications usually have big heap sizes and, as a result, longer GC pause times. Excessive object allocation increases the pressure on the heap and memory.
Reducing the allocation rate will help performance, in addition to observing GC behaviour and identifying the proper heap size. To observe garbage collection behaviour:

- -verbose:gc: This option will log valuable information about GC pause times, GC frequency, application run times, the size of objects created and destroyed, the memory recycled at each GC, and the rate of object creation.
- The -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps options will give further information about GC.
- For applications requiring low pause times, use -Xconcgc.
- For throughput applications, use -XX:+AggressiveHeap.
- Avoid calling System.gc(); use -XX:+DisableExplicitGC.
- Avoid undersized old-generation heaps: a small heap reduces collection time, but leads to a lot of other problems such as fragmentation.

Since all objects live in the JVM's heap, and the heap size affects the execution of GC, tuning the heap size has a strong impact on performance. Heap size affects GC frequency and collection times, the number of short- and long-lived objects, fragmentation and locality.

- A starting size (-Xms) that is too small causes resizing.
- A maximum size (-Xmx) that is too small causes GC to run often without recovering much heap.
- A maximum size that is too large makes GC run longer and less efficiently.
- Identify the proper young generation size with -Xmn, -XX:NewSize and -XX:MaxNewSize.

Also note that options such as Trace, Verbose, VerboseGC, NoJIT, NoClassGC, NoAsyncGC, MaxJStack and Verify (depending on the JVM), which are usually used for debugging Java applications, will hurt performance.

Tools for GC and Heap analysis

Some useful tools for tuning the JVM are JStack, VisualGC, GCPortal, JConsole, jstat, jps, NetBeans Profiler and HeapAnalyzer, among many others. I am planning to write more on this later.

Java Programming Tips

Some basic programming tips can be given as follows[2]:

- Choose the right algorithm(s) and data structures. Algorithms with O(N^2) complexity will be slower than algorithms with O(N) or O(N log N) complexity.
- Use the fastest JVM available, and one that takes advantage of the underlying processor architecture.
- Compile with the optimization flag: javac -O.
- Use multithreading on multi-core and multi-processor systems. For single-threaded applications, avoid synchronized methods (e.g. prefer ArrayList over Vector), and keep synchronized methods outside of loops.
- Use private and static methods, and final classes, to encourage inlining.
- Use local variables as much as possible. Local variables are faster than instance variables, which are in turn faster than array elements.
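The synchronization tip above can be illustrated with a small sketch (class and method names are mine, and the measured numbers will vary by JVM and machine): Vector synchronizes every add() call, while ArrayList does not, so a single-threaded append loop pays locking overhead only with Vector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

class ListTip {
    // Appends n integers to the given list and returns the elapsed nanoseconds.
    static long timeAppends(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Vector.add() is synchronized; ArrayList.add() is not, so for
        // single-threaded use ArrayList avoids the locking overhead.
        System.out.println("Vector:    " + timeAppends(new Vector<Integer>(), n) + " ns");
        System.out.println("ArrayList: " + timeAppends(new ArrayList<Integer>(), n) + " ns");
    }
}
```

A crude micro-benchmark like this is only indicative (the JIT itself skews short runs), but it makes the cost of unneeded synchronization visible.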
Memory Usage: Significant time can be spent in memory allocation. The new operator uses a generic variable-size allocator that is much slower than more specialized memory allocators, and variable-size memory allocators degrade under heavy use because of memory fragmentation. Complex memory usage can also result in delayed or missed opportunities for object reclamation by the garbage collector. If significant time is being spent in memory allocation for a class or structure, replace the generic allocation for that class with a more appropriate memory allocation routine. Choose an algorithm that reuses freed space and reduces memory fragmentation. Use fixed-size allocation to manage blocks of a fixed size, such as objects of a single class. A fixed-size allocator maintains a linked list of blocks of a fixed size: allocation takes a block off the list, and deallocation adds a block back to the list. Allocation and deallocation are very fast with fixed-size allocators and do not degrade under heavy use.
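The fixed-size allocator described above can be sketched in a few lines of Java as an object pool; the BlockPool and Block names are mine, and a real pool would add capacity limits and thread safety as needed:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal fixed-size allocator sketch: a free list of reusable blocks.
// "Block" stands in for any fixed-size object; substitute your own class.
class BlockPool {
    static class Block {
        final byte[] data = new byte[64]; // every block has the same size
    }

    private final Deque<Block> freeList = new ArrayDeque<>();

    // Allocation: take a block off the free list, or create one if it is empty.
    Block allocate() {
        Block b = freeList.poll();
        return (b != null) ? b : new Block();
    }

    // Deallocation: return the block to the free list for reuse.
    void free(Block b) {
        freeList.push(b);
    }
}
```

Because allocation and deallocation are just a list pop and push, they stay fast under heavy use, and reusing blocks keeps allocation pressure off the garbage collector.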
Reuse obsolete objects whenever possible to avoid allocating new ones. Make it clear to the garbage collector that an object is no longer being used by assigning null (or another object) to each object reference after its last use.

Object Creation: Avoid creating objects in frequently used routines; creating objects frequently leads to object cycling, which negatively impacts overall performance. Group frequently accessed fields together so that they end up in a minimum number of cache lines, often together with the object header. Experience shows that scalar fields should be grouped together, separately from object-reference fields. Do not declare an object twice.

Strings Usage: Consider declaring a single StringBuffer object once at the beginning of the program, which can then be reused every time concatenation is required. The StringBuffer methods setLength(), append(), and toString() can then be used to initialize it, to append one or more Strings, and to convert the result back to a String, each time concatenation is needed. If concatenation is being used to format strings for stream output, avoid concatenation altogether by writing each string to the I/O buffer separately. For instance, if the result is being printed using System.out.println(), print each operand individually using System.out.print(), with System.out.println() for the last one. Using a general-purpose StringBuffer saves the time needed to allocate and free a temporary buffer every time concatenation is used. Writing each string to the output buffer separately avoids any concatenation overhead altogether, resulting in somewhat faster program execution; the benefit is partially offset, however, by the added overhead of the extra I/O calls.

I/O: Use buffered I/O whenever possible. If the amount of data being moved is large, consider using a buffering I/O class instead, for increased performance through data buffering.
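As a sketch of the bulk-read advice (the ReadFullyDemo class is mine, and an in-memory stream stands in for a real device so the example is self-contained), DataInputStream.readFully() fills an entire buffer in one call instead of looping over many small reads:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class ReadFullyDemo {
    // Fills a buffer of the source's size with a single readFully() call
    // instead of looping over many small reads. A real program would wrap
    // a FileInputStream; an in-memory stream keeps the sketch runnable.
    static byte[] readAll(byte[] source) throws IOException {
        byte[] buf = new byte[source.length];
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(source))) {
            in.readFully(buf); // blocks until the whole buffer is filled
        }
        return buf;
    }
}
```

With a file-backed stream, sizing the buffer to the data keeps the number of accesses to the physical device low, which is exactly the benefit the next paragraph quantifies.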
Data buffering can save a substantial amount of execution time by making fewer total accesses to the physical I/O device, while also allowing parallel operations to occur through multithreading. The more data being moved, the greater the benefit. Using readFully() instead of buffered I/O can significantly improve the performance of programs: the synchronization overhead of the buffered I/O routines can be avoided by calling readFully() into a large buffer, and then managing and interpreting the data yourself.

Micro-architectural optimization

Performance tuning at the micro-architecture level usually focuses on reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time; therefore, reducing the elapsed time of completing a workload is equivalent to reducing the measured processor cycles (clockticks). The Intel VTune Performance Analyzer is one of the most powerful tools available to software developers interested in this type of performance analysis. The easiest way to identify the hotspots in a given application is to sample the application with processor cycles. The VTune analyzer provides two profiling techniques, sampling and call graph, to help the developer identify where most of the clockticks are spent, in addition to many other processor events[1]. Sampling comes in two forms: processor event-based and time-based. Event-based sampling (EBS) relies on the performance monitoring unit (PMU) of the processor. From this point forward, event-based sampling (EBS) will be referred to simply as sampling.
VTune Performance Analyzer basics

In a compatible Java development environment, the VTune Performance Analyzer can be used to monitor and sample JIT-compiled Java code. The VTune analyzer gets the names of the active methods, along with their load addresses, sizes, and debug information, from the JVM, and keeps this data in an internal file for later processing when viewing results. When your Java application executes, the Just-in-Time (JIT) compiler converts your VM bytecode to native machine code. Depending on your Java environment, either the VM or the JIT provides the VTune analyzer with information about active Java classes and methods, such as their memory addresses, sizes, and symbol information. The VTune analyzer uses this information to keep track of all the classes and methods loaded into memory and the processes that are executed, and it uses this information for its final analysis. In summary, the VTune analyzer can identify performance bottlenecks in the code emitted by the JIT compiler (there is no bytecode support with EBS) and analyze the control flow.

How to start analyzing with the VTune analyzer

There are basically two ways the VTune analyzer can analyze your Java application: either the VTune analyzer starts the JVM and the application to analyze, or the VTune analyzer starts the analysis without launching anything, and the user starts the application separately, outside the VTune analyzer. The latter method is available only for sampling, and is useful for analyzing applications such as daemons, services, long-running applications, etc.

Analyzing applications started with the VTune analyzer

The first and easiest way for the VTune analyzer to analyze your application is to start the application from within the VTune analyzer (Figure 6). The VTune analyzer allows three types of applications or configurations: Application (.class or .jar), Script, and Applet. The following steps show how to set up the VTune analyzer to analyze a .class application. Start the VTune Performance Analyzer.
- Click the Sampling or Call Graph wizard (please see the following sections for more detail on these methods).
- Select Java Profiling.
- Select one of the following: Application (.class or .jar), Script, or Applet. We'll assume Application is selected for the following steps. Application: the VTune(TM) Performance Analyzer will launch a Java application; you must specify the Java launcher and the application. Script: a launching script invokes a specified Java application. Applet: the VTune analyzer invokes a Java applet; you must specify the applet viewer and the applet.
- Select the Java launcher and enter any other special JVM arguments.
- Select the main class or jar file. If any command line arguments are used, enter them. Also select any components (.jar files/directories) needed in the classpath.
- Click Finish. The VTune analyzer will now launch the application.
Figure 5

Analyzing applications started outside the VTune Analyzer

Sometimes it is not possible, or not desired, to start the Java application (e.g. daemons, services, etc.) from the VTune analyzer; it is still possible, however, to perform analysis on such applications (Figure 7).

- Start the VTune Performance Analyzer.
- Click the Sampling or Call Graph wizard (please see the following sections for more detail on these methods).
- Select Windows*/Windows* CE/Linux Profiling.
- Uncheck Automatically generate tuning advice if it is selected.
- Uncheck (de-select) No application to launch.
- Check (select) Modify default configuration when done with wizard, and then click Finish.
- In the Advanced Activity Configuration window, select Start data collection paused if it is not desired to start collection right away. Resume the collection before starting the application, or while the application is running.
- Start the application in the usual manner.
- Wait patiently for your software to complete, and/or run your software until you have executed the code path(s) of interest.

Note 1: Selecting Windows*/Windows* CE/Linux Profiling is not a mistake; it is the only option that tells the VTune analyzer not to launch any application.
Note 2: If the Java application is executed outside the VTune analyzer, please make sure to pass the correct argument to the JVM.
- For Java version 1.4.x, use -Xrunjavaperf.
- For Java version 1.5 and higher, use -agentlib:javaperf.

Figure 6

For Java applications (running on BEA, Sun or IBM JVMs), all the Java methods are combined and displayed as java.exe.jit on Windows and java.jit on Linux in the Module view. You can view the individual methods and drill down to the Hotspot view by double-clicking. The IBM and Sun JVMs use both interpreted and jitted modes of Java code execution. When sampling, only jitted-code profiles are associated with the executing Java methods; when the JVM interprets the code, the samples are attributed to the JVM itself. You can use the call graph collector to obtain a complete view of all executed methods.

Identifying hotspots, Using Event Based Sampling

I used SciMark2[3] for this example (Figure 7). SciMark2 is a Java benchmark for scientific and numerical computing. It measures several computational kernels and reports a composite score in approximate MFLOPS.
Figure 7

After the analysis, the VTune analyzer displays information about the processes and modules (Figure 8).

Figure 8

When the sampling wizard is used, the VTune analyzer by default uses processor cycles (clockticks) and instructions retired[4] to analyze the application. The count of cycles, also known as clockticks, forms the fundamental basis for measuring how long a program takes to execute; the total cycle measurement is the start-to-finish view of the total number of cycles needed to complete the application of interest. In typical performance tuning situations, the metric Total Cycles can be measured with the event CPU_CLK_UNHALTED.CORE[5]. The instructions retired event indicates the number of instructions that retired, i.e. executed completely; this does not include partially processed instructions discarded due to branch mis-predictions. The ratio of (non-halted) clockticks to instructions retired is called cycles per instruction (CPI), and it can be a good indicator of performance problems, reflecting the efficiency of the instructions generated by the compiler and/or low CPU utilization. It is also possible to change the processor performance events used for sampling[6].
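The CPI metric above is just the quotient of the two default event counts. As a minimal sketch (the Cpi class and the sample counts are mine, for illustration):

```java
class Cpi {
    // CPI = non-halted clockticks / instructions retired, i.e. the ratio of
    // the CPU_CLK_UNHALTED.CORE and instructions retired sample counts that
    // the VTune analyzer reports for a module or function.
    static double cpi(long clockticks, long instructionsRetired) {
        return (double) clockticks / (double) instructionsRetired;
    }

    public static void main(String[] args) {
        // Hypothetical counts: Core-architecture processors can retire up to
        // four instructions per cycle, so the best achievable CPI is 0.25.
        System.out.println("CPI = " + cpi(1_000_000L, 2_000_000L));
    }
}
```

A CPI well above the machine's theoretical minimum suggests stalls or inefficient code; the false-sharing example later in the article shows one way such stalls arise.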
The java.exe.jit module is of interest to us. If we drill down further (double-click on the module or click the Hotspot View button), the hotspot view will show us all the functions executed during the benchmark for which we have samples.

Figure 9

From the hotspot view it is clear that the jitted bytecode produced very efficient code, considering that the theoretically achievable CPI ratio is 0.25 (Figure 9). This is natural when you consider that the benchmark, run without any command line arguments, uses problem sizes that fit into the cache. It is also possible to drill down further and see the source with the associated sample or total event counts: simply double-clicking on a function of interest takes you to the Java source of that method (Figure 10).

Figure 10
Identifying hotspots, Using Call Graph Analysis

Creating a call graph activity is similar to creating a sampling activity, and one can follow a similar process. In this step it is possible to change the Java launcher and enter special JVM arguments as well. For call graph analysis, the -agentlib:javaperf=cg JVM parameter will be picked up automatically by the VTune analyzer. Please note that the VTune analyzer enables you to distinguish between JIT-compiled, interpreted and inlined Java methods, so you can examine the timing differences between each type of method execution. JIT-compiled methods are grouped into modules with the .jit extension, interpreted methods into modules with the .interpreted extension, and inlined methods into modules with the .inlined extension. Also note that call site[7] information is not collected for Java call graphs.

Figure 11

The red arrows show us the critical path of the analysis; in other words, the flow path that consumed most of the time (Figure 11).

When do JVMs decide to JIT?

Call graph analysis gives us very valuable information. In addition to the control flow of the Java application, it is also easy to see how many times a particular method was interpreted before getting jitted by the JVM. To see this information, simply group the view by Class and then sort by Function (Figure 12). After these arrangements, one can easily see that the matmult function in SparseCompRow.java was interpreted 3 times before it was jitted. It is also possible to see that the jitted version of matmult took ~524 microseconds (Self Time / Calls) whereas the interpreted version took ~ microseconds. Similar calculations can also be done. We see an even more dramatic difference in the inverse function: while the interpreted version of inverse runs in 73 microseconds, the jitted version runs in only 7 microseconds.
Figure 12

Figure 13

From the same information, we can also find out how many times a certain function has to be executed before getting jitted (Figure 13).

Identifying Memory Problems

No single issue affects software performance more than the speed of memory. Slow memory or inefficient memory accesses hurt performance by forcing the processor to wait for instruction operands. Identifying memory access issues is the first step in analyzing memory-related performance problems. Before going into memory-related issues, it is time to give a basic formula used in micro-architecture performance analysis. It is accurate to say that the total number of cycles an application takes is the sum of the cycles spent dispatching μops[7] and the cycles not dispatching μops (stalls). This can be formulated with the Intel Core architecture processor event names as shown below. The formula is explained in greater detail in the Intel 64 and IA-32 Intel Architecture Optimization Reference Manual; for a more complete analysis, please refer to that manual. In this approach,

Total Cycles = Cycles dispatching μops + Cycles not dispatching μops
CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE
Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event, while cycles in which no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event; the equation given earlier in Formula 1 can therefore be re-written as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE tells you the percentage of cycles wasted due to stalls. These stalls can turn the execution unit of a processor into a major bottleneck. The execution unit is, by definition, always the bottleneck, because it defines the throughput, and an application will only perform as fast as its bottleneck. Consequently, it is extremely critical to identify the causes of the stall cycles and remove them, if possible. Our goal is to minimize the causes of the stalls and let the bottleneck (i.e. the execution unit) do what it is designed to do; in short, the execution unit should not sit idle. There are many contributing factors to stall cycles and sub-optimal usage of the execution unit: memory accesses (e.g. cache misses), branch mis-predictions (and the resulting pipeline flushes), long-latency floating-point operations (such as division and FP control word changes), and μops not retiring due to the out-of-order (OOO) engine, to name a few. I will focus on memory-related issues and how to identify them with the VTune analyzer. The memory-related issues that can occur in Java programs are no different from those in other programming languages; however, the JVM can overcome some data locality issues through the runtime optimizations it performs while jitting. There are three causes for the processor to access main memory (i.e. cache-line loads): conflict, capacity and compulsory. Capacity loads occur when data that was already in the cache has been evicted and must be reloaded.
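The stall ratio just described is a one-line calculation over the two event counts. As a sketch (the StallRatio class and the counts are mine, for illustration):

```java
class StallRatio {
    // Percentage of cycles in which no μops were dispatched (stalls):
    // 100 * RS_UOPS_DISPATCHED.CYCLES_NONE / CPU_CLK_UNHALTED.CORE.
    static double stallPercent(long cyclesNone, long totalCycles) {
        return 100.0 * cyclesNone / totalCycles;
    }

    public static void main(String[] args) {
        // Hypothetical counts matching the false-sharing example later on,
        // where stalls amount to roughly 88% of the cycles.
        System.out.println(stallPercent(880_000L, 1_000_000L) + "%");
    }
}
```

A high percentage here says only that the execution unit is starved; the sampling drill-down is still needed to find out why.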
Using a smaller working set of data can reduce capacity loads. Conflict loads occur because each cache row can hold only specific memory addresses; these can be avoided by changing the alignment. Compulsory loads occur when data is loaded for the first time; their number can be reduced but not eliminated (this should be taken care of by the hardware prefetchers or prefetch instructions). For this section, I decided to focus on a sneaky memory-related problem which usually occurs in multi-threaded applications running on SMP systems. I wrote a Java application called FalseShare.java for this purpose (Figure 14 shows the section of interest). This Java application is multi-threaded (it creates threads based on the number of physical cores) and runs the same algorithm with two different data distribution models. The platform used for this example is an Intel 45nm Core 2 Quad processor running Red Hat Enterprise Linux 5.

startindx = tid;
for (loops = 0; loops < Constants.ITERS; loops++)
    for (p = startindx; p < Constants.NUMPOINTS; p += Constants.THREADS) {
        double d = Math.sqrt(points[p].x * points[p].x
                           + points[p].y * points[p].y
                           + points[p].z * points[p].z);
        points[p].x /= d;
        points[p].y /= d;
        points[p].z /= d;
    }

Figure 14

If we recall the formula given earlier and collect the related events, we can estimate the number of stall cycles. After running the first version of the program, the stall cycles amount to ~88% of the cycles. Most of the time, stall cycles are symptoms of something going wrong in the execution stage.
Figure 15

If we drill down to the source code, we can see the lines contributing to the stall cycles. With careful analysis of the code section contributing heavily to the stall cycles, we see a phenomenon called false sharing in Version 1. Unlike true sharing, where threads share variables, in false sharing threads do not share a global variable but rather share a cache line.

Figure 16

False sharing occurs when multiple threads write to the same cache line over and over. In many cases, modified data sharing means that two or more threads race to use and modify data in one cache line, which is 64 bytes in size. Frequent occurrences of modified data sharing cause demand misses that have a high penalty; when false sharing is removed, code performance can improve dramatically. Having said this, if we look carefully at the source code, we can see that access to points[] follows a cyclic distribution of the data over each iteration. If we re-write it as shown in Figure 18 (case VERSION2) and compare it to the first version, we see that Version 1 runs roughly ~6.19 times slower (simply comparing the clockticks).

Figure 17
switch (testtype) {
case VERSION1:
    startindx = tid;
    for (loops = 0; loops < Constants.ITERS; loops++)
        for (p = startindx; p < Constants.NUMPOINTS; p += Constants.THREADS) {
            double d = Math.sqrt(points[p].x * points[p].x
                               + points[p].y * points[p].y
                               + points[p].z * points[p].z);
            points[p].x /= d;
            points[p].y /= d;
            points[p].z /= d;
        }
    break;
case VERSION2:
    indx = tid;
    int delta = (int) ((float) Constants.NUMPOINTS / Constants.THREADS);
    startindx = indx * delta;
    endindx = startindx + delta;
    if (indx == Constants.THREADS - 1)
        endindx = Constants.NUMPOINTS;
    for (loops = 0; loops < Constants.ITERS; loops++)
        for (p = startindx; p < endindx; p++) {
            double d = Math.sqrt(points[p].x * points[p].x
                               + points[p].y * points[p].y
                               + points[p].z * points[p].z);
            points[p].x /= d;
            points[p].y /= d;
            points[p].z /= d;
        }
    break;
}

Figure 18

Figure 19

So let's look at L2 cache misses to see the impact. After sampling both versions with two key processor events, MEM_LOAD_RETIRED.L2_LINE_MISS[9] and L2_LINES_IN.SELF[10], we can clearly see the difference between the two versions. The second version uses a block distribution of the data to eliminate the cache-line sharing, i.e. the false sharing.

Version 1 / Version 2
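The fix in Version 2 comes down to how indices are partitioned across threads: a contiguous block per thread instead of interleaved ("cyclic") indices, so neighbouring array elements, which share a cache line, are written by the same thread. The partition arithmetic can be isolated in a small sketch (the Partition class is mine; numPoints and threads mirror Constants.NUMPOINTS and Constants.THREADS):

```java
class Partition {
    // Block distribution as in Version 2: thread tid owns the contiguous
    // index range [start, end); the last thread absorbs any remainder.
    static int[] blockRange(int tid, int numPoints, int threads) {
        int delta = numPoints / threads;
        int start = tid * delta;
        int end = (tid == threads - 1) ? numPoints : start + delta;
        return new int[] { start, end };
    }
}
```

Under the cyclic scheme of Version 1, thread 0 touches indices 0, T, 2T, ..., so every cache line holding several points is written by several threads; under the block scheme, each line is (except at range boundaries) written by exactly one thread.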
Figure 20

Clearly, the memory access issues in this example will prevent the application from scaling as the number of cores increases. Plotting the non-false-sharing and false-sharing versions shows the impact of false sharing on scaling.

Figure 21 (seconds vs. number of threads on a Core 2 Extreme QX9650, with false sharing (FS) and without (NOFS))

Identifying SIMD and Vectorization Usage

Leveraging the SIMD and SSE (Streaming SIMD Extensions) support available on target processors is one of the key optimization techniques JVMs use (or should use). The question is how to identify which jitted methods use SSE; this has been a common question from many Java developers. The VTune analyzer's event-based sampling can help users pinpoint exactly which methods are optimized to use SSE. One can simply use the SIMD_INST_RETIRED.ANY event (Retired Streaming SIMD instructions (precise event)) to count the overall number of SIMD instructions retired. The following events give a further breakdown of this one event; these are the events available on Core architecture based processors.

Symbol Name[11] and Description:
FP_MMX_TRANS.TO_FP: Transitions from MMX (TM) instructions to floating-point instructions.
FP_MMX_TRANS.TO_MMX: Transitions from floating-point to MMX (TM) instructions.
SIMD_ASSIST: SIMD assists invoked.
SIMD_COMP_INST_RETIRED.PACKED_DOUBLE: Retired computational Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
SIMD_COMP_INST_RETIRED.PACKED_SINGLE: Retired computational Streaming SIMD Extensions (SSE) packed-single instructions.
SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE: Retired computational Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
SIMD_COMP_INST_RETIRED.SCALAR_SINGLE: Retired computational Streaming SIMD Extensions (SSE) scalar-single instructions.
SIMD_INSTR_RETIRED: SIMD instructions retired.
SIMD_INST_RETIRED.ANY: Retired Streaming SIMD instructions (precise event).
SIMD_INST_RETIRED.PACKED_DOUBLE: Retired Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
SIMD_INST_RETIRED.PACKED_SINGLE: Retired Streaming SIMD Extensions (SSE) packed-single instructions.
SIMD_INST_RETIRED.SCALAR_DOUBLE: Retired Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
SIMD_INST_RETIRED.SCALAR_SINGLE: Retired Streaming SIMD Extensions (SSE) scalar-single instructions.
SIMD_INST_RETIRED.VECTOR: Retired Streaming SIMD Extensions 2 (SSE2) vector integer instructions.
SIMD_SAT_INSTR_RETIRED: Saturated arithmetic instructions retired.
SIMD_SAT_UOP_EXEC: SIMD saturated arithmetic micro-ops executed.
SIMD_UOPS_EXEC: SIMD micro-ops executed (excluding stores).
SIMD_UOP_TYPE_EXEC.ARITHMETIC: SIMD packed arithmetic micro-ops executed.
SIMD_UOP_TYPE_EXEC.LOGICAL: SIMD packed logical micro-ops executed.
SIMD_UOP_TYPE_EXEC.MUL: SIMD packed multiply micro-ops executed.
SIMD_UOP_TYPE_EXEC.PACK: SIMD pack micro-ops executed.
SIMD_UOP_TYPE_EXEC.SHIFT: SIMD packed shift micro-ops executed.
SIMD_UOP_TYPE_EXEC.UNPACK: SIMD unpack micro-ops executed.

Figure 22

If we collect some of these events on SciMark2 again (Figures 22 and 23), we'll see that all of our benchmarks are actually using the SIMD (Single Instruction Multiple Data) unit. It is also important to note that SIMD_INST_RETIRED.ANY should be equal to the total of all its sub-events.
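As a concrete experiment, one can profile a loop of the kind HotSpot's JIT is able to auto-vectorize and check which of the events above fire. The kernel below is a sketch of my own (the class and method names are invented); whether the jitted code actually uses packed SSE depends on the JVM version and flags, and sampling it with SIMD_INST_RETIRED.PACKED_SINGLE versus SIMD_INST_RETIRED.SCALAR_SINGLE shows which form was generated.

```java
// Hypothetical kernel: a unit-stride, dependence-free loop, which is a
// natural candidate for packed SSE code generation by the JIT.
public class VectorCandidate {
    public static float[] saxpy(float a, float[] x, float[] y) {
        float[] r = new float[x.length];
        for (int i = 0; i < x.length; i++)
            r[i] = a * x[i] + y[i];   // r = a*x + y, element-wise
        return r;
    }

    public static void main(String[] args) {
        float[] x = new float[4096], y = new float[4096];
        java.util.Arrays.fill(x, 2.0f);
        java.util.Arrays.fill(y, 1.0f);
        float[] r = null;
        for (int rep = 0; rep < 10_000; rep++)   // keep the method hot for sampling
            r = saxpy(3.0f, x, y);
        System.out.println(r[0]);                // prints 7.0 (3*2 + 1)
    }
}
```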
Figure 23

Parallelization, threads and more

As we mentioned earlier, Java uses the heap to allocate memory for objects; in multi-threaded applications, however, it quickly becomes clear that access to the heap can turn into a significant concurrency bottleneck, as every allocation would involve acquiring a lock that guards the heap. Luckily, JVMs use thread-local allocation blocks (TLAB), where each thread allocates a larger chunk of memory from the heap and services small allocation requests sequentially out of that thread-local block. This greatly enhances scalability and reduces the number of times a thread must acquire the shared heap lock, improving concurrency. A TLAB[12] enables a thread to do object allocation using thread-local top and limit pointers, which is faster than serialized access to the heap shared across threads. However, as the number of threads exceeds the number of processors, the cost of committing memory to local-allocation buffers becomes a challenge, and sophisticated sizing policies must be employed.

The single-threaded copying collector can also become a bottleneck in an application that is parallelized to take advantage of multiple processors. To take full advantage of all available CPUs on a multiprocessor machine, the HotSpot JVM offers an optional multithreaded collector[13]. The parallel collector tries to keep related objects together to improve memory locality and cache utilization, which is accomplished by copying objects in depth-first order.

It is hard not to think about multi-threading when the current generation of processors has more and more cores. As multi-core processors become widely available, utilizing all of these cores becomes increasingly important. Luckily, the Java language and JVMs are inherently multi-threaded, and threading can help the user increase both throughput and determinism (GC threads can find more hardware threads/cores available).
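A quick way to feel the TLAB effect is an allocation-heavy multithreaded loop. The sketch below is illustrative only (the names are mine): with TLABs on, which is the HotSpot default (-XX:+UseTLAB), each thread's small allocations come out of its own buffer; running the same program with -XX:-UseTLAB forces the threads back onto the shared heap for comparison.

```java
// Hypothetical demo: many small, short-lived allocations on several threads.
public class AllocDemo {
    // Allocates 'objects' small arrays; returns a checksum so the work
    // cannot be optimized away.
    public static long churn(int objects) {
        long sum = 0;
        for (int i = 0; i < objects; i++)
            sum += new int[8].length;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[4];
        long t0 = System.nanoTime();
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> churn(2_000_000));
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.printf("4 allocating threads: %d ms%n",
                          (System.nanoTime() - t0) / 1_000_000);
    }
}
```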
Now we'll look at how to increase performance and improve resource utilization by leveraging threading. I'll use the Java Grande benchmark suite[14] v1.0 to demonstrate the necessity of parallelism and multithreading for leveraging the potential of any multi-core platform. Some of the benchmark results can be found in the graph below.
Figure 24 (Java Grande results, seconds vs. threads, for Section3:RayTracer:Total:SizeA, Section3:MonteCarlo:Total:SizeA, Section3:MolDyn:Total:SizeA, Section2:SparseMatmult:Kernel:SizeB, Section2:SOR:Kernel:SizeB and Section2:Crypt:Kernel:SizeB)

How to use VTune Analyzer Programmatic APIs?

"How do I use the VTune analyzer programmatic APIs?" is a very common question, and the solution is already provided by the VTune analyzer. Before showing how to use the programmatic APIs, let's quickly cover what they are and what they are good for. The VTune analyzer provides Pause and Resume APIs (VTPause() and VTResume()) to allow the user to analyze only certain parts of the application. By using these APIs, the user can skip the uninteresting parts, such as initialization, GUI interaction, etc. These APIs are also available for Java and are implemented in VTuneAPI.class, which is located in the <installdir>\analyzer\bin\com\intel\vtune directory. The following Pause/Resume APIs enable pausing and resuming data collection for Java applications:

com.intel.vtune.VTuneAPI.VTPause(): pauses sampling and call graph data collection for a Java application.
com.intel.vtune.VTuneAPI.VTResume(): resumes sampling and call graph data collection for a Java application.
com.intel.vtune.VTuneAPI.VTNameThread(String threadName): names the current thread for call graph data collection for a Java application.
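Since calling these classes directly creates a compile-time dependency on VTuneAPI.class, one convenient pattern (my own sketch, not part of the product) is to resolve the class reflectively, so the same binary runs with or without the analyzer installed; the calls silently become no-ops when the class is absent.

```java
import java.lang.reflect.Method;

// Sketch: reflective wrapper around the VTune Pause/Resume APIs.
// Assumes the static methods VTPause()/VTResume() on
// com.intel.vtune.VTuneAPI described above; degrades to no-ops otherwise.
public class Profiler {
    private static Method pause, resume;
    static {
        try {
            Class<?> api = Class.forName("com.intel.vtune.VTuneAPI");
            pause  = api.getMethod("VTPause");
            resume = api.getMethod("VTResume");
        } catch (ReflectiveOperationException e) {
            // VTune classes not on the classpath: leave the calls as no-ops.
        }
    }

    public static boolean available() { return pause != null && resume != null; }
    public static void pause()  { invoke(pause); }
    public static void resume() { invoke(resume); }

    private static void invoke(Method m) {
        if (m == null) return;
        try { m.invoke(null); } catch (Exception ignored) { }
    }

    public static void main(String[] args) {
        Profiler.pause();    // skip e.g. initialization
        // ... uninteresting startup work ...
        Profiler.resume();   // collect data for the region of interest
        System.out.println("VTune API on classpath: " + Profiler.available());
    }
}
```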
Figure 25

In between the calls to VTPause() and VTResume(), performance data is not collected. One can simply select "Start with data collection paused" during the configuration and resume the collection at any time with a VTResume() call. If you run a Java application using these APIs, the VTune analyzer console will report the API usage during the analysis:

Thu Sep 25 15:32 Data collection started...
Thu Sep 25 15:34 Data collection paused...
Thu Sep 25 15:34 Data collection resumed...
Thu Sep 25 15:34 Data collection paused...
Thu Sep 25 15:34 Data collection finished...

However, if you are one of those people who would like to do it their own way and implement your own wrapper for these APIs, JNI (Java Native Interface) can always help.

/* VTApis.java */
class VTApis {
    public native void assignThreadName();
    public native void resumeVTune();
    public native void pauseVTune();
    static {
        System.loadLibrary("VTApis");
    }
}

/* VTApis.c */
#include <stdio.h>
#include <vtuneapi.h>
#include "VTApis.h"   // this header file was generated by javah

/*
 * Class:     VTApis
 * Method:    resumeVTune
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_VTApis_resumeVTune(JNIEnv *env, jobject obj)
{
    VTResume();
    printf("Resuming VTune collection.\n");
}

/*
 * Class:     VTApis
 * Method:    pauseVTune
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_VTApis_pauseVTune(JNIEnv *env, jobject obj)
{
    VTPause();
    printf("Pausing VTune collection.\n");
}

Figure 26

How to compile and generate the wrapper using JNI:

Compile the VTune analyzer API class:
$> javac.exe VTApis.java

Generate the header file VTApis.h:
$> javah.exe -jni VTApis

Compile and generate the .dll:
$> cl /Zi -I"C:\Program Files\Java\jdk1.6.0_04\include" -I"C:\Program Files\Java\jdk1.6.0_04\include\win32" -I"C:\Program Files\Intel\VTune\Analyzer\Include" -LD VTApis.c -fixed:no -FeVTApis.dll VtuneApi.lib

Optimizing how the JVM generates jitted code is not the goal of this article, but it is worth mentioning a few things about this topic. Tune JIT code generation to match the underlying processor architecture:
- Take advantage of new architectural features; for example, use efficient SSE instructions to move data.
- Improve decode/allocation efficiency: avoid length-changing prefixes to improve decode efficiency, and align branch targets.
- Eliminate inherent stalls in generated code.
- Tune register allocation to reduce memory traffic: better register allocation can reduce stack operations, and the additional registers afforded by SSE or Intel 64 can help.

It is also important to note that while 64-bit JVMs enable heaps larger than 4 GB, they are usually slower than 32-bit JVMs, simply due to the extra memory needed and the system pressure caused by using and moving 64-bit pointers. Using 32-bit offsets from a Java heap base address instead of 64-bit pointers can significantly improve performance; on Intel Xeon platforms, the resulting 64-bit JVM can be faster than its 32-bit equivalent.

VTune Analyzer's JIT Profiling API
In addition to the Intel VTune Performance Analyzer's normal Java support (Figure 27), starting with the VTune Performance Analyzer 9.1 the JIT Profiling API became public, providing further functionality and control when profiling runtime-generated code. JIT compilers or JVMs can use this API to insert API calls that gather more detailed information about interpreted or dynamically generated code.
Instructions to include JIT Profiling Support

Include the JITProfiling.h file, located under C:\Program Files\Intel\VTune\Analyzer\include on Microsoft* operating systems and under /opt/intel/vtune/analyzer/include on Linux* operating systems. This header file provides all API function prototypes and type definitions.

Link the virtual machine (or any code using these APIs) with JITProfiling.lib, located under C:\Program Files\Intel\VTune\Analyzer\lib on Windows*, and with JITProfiling.a, located under /opt/intel/vtune/analyzer/bin on Linux* operating systems. On Linux*, also link with the standard libraries libdl.so and libpthread.so.

Note: the JITProfiling.a that comes with the VTune analyzer is compiled with g++ and not with gcc; therefore either compile your code with g++, or compile with gcc and link with the -lstdc++ library.

In order to function properly, a VM that uses the JITProfiling API should implement a mode-change callback function and register it using iJIT_RegisterCallbackEx. The callback function is executed every time the profiling mode changes; this ensures that the VM issues the appropriate notifications when mode changes happen.

To enable JIT profiling support, set the environment variable ENABLE_JITPROFILING=1.
On Windows: set ENABLE_JITPROFILING=1
On Linux: export ENABLE_JITPROFILING=1

On Linux, JIT profiling can only be used with the command line interface (vtl), and the jitprofiling option needs to be used.
For call graph analysis: vtl activity jitcg -c callgraph -o jitprofiling -app ./jitprof run
For sampling analysis: vtl activity jitsamp -c sampling -o jitprofiling -app ./jitprof run

If you wish to perform JIT profiling on a remote Linux OS system, define the BISTRO_COLLECTORS_DO_JIT_PROFILING environment variable in the shell where vtserver executes:
export BISTRO_COLLECTORS_DO_JIT_PROFILING=1
End Notes and References

1. Please note that options that begin with -X are non-standard, while -XX options are not stable. These options are not guaranteed to be supported on all VM implementations, and are subject to change without notice in subsequent releases of the JDK. Please check for more information.
3. Instructions Retired: Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Instructions executed to completion are referred to as instructions retired. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2.
4. For these examples, I have used Intel Core micro-architecture based systems.
5. The Intel Core micro-architecture is capable of reaching a CPI as low as 0.25 in ideal situations. A greater CPI value for a given workload indicates that there are more opportunities for code tuning to improve performance. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2.
6. A call site is defined as the location in the caller function from where a call is made to a callee.
7. Micro-operations, also known as micro-ops or μops, are simple, RISC-like microprocessor instructions used by some CISC processors to implement more complex instructions. Wikipedia.
8. Counts the number of retired load operations that missed the L2 cache.
9. Counts the number of cache lines allocated in the L2 cache. Cache lines are allocated in the L2 cache as a result of requests from the L1 data and instruction caches and the L2 hardware prefetchers to cache lines that are missing in the L2 cache.
10. Taken from the Intel VTune Performance Analyzer help. Intel VTune Performance Analyzer.
The Java Grande Forum is a community initiative to promote the use of Java for so-called Grande applications.
A Grande application is an application which has large requirements for any or all of: memory, bandwidth, and processing power. You can download the benchmark suite from
Inside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
BridgeWays Management Pack for VMware ESX
Bridgeways White Paper: Management Pack for VMware ESX BridgeWays Management Pack for VMware ESX Ensuring smooth virtual operations while maximizing your ROI. Published: July 2009 For the latest information,
Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices
Sawmill Log Analyzer Best Practices!! Page 1 of 6 Sawmill Log Analyzer Best Practices! Sawmill Log Analyzer Best Practices!! Page 2 of 6 This document describes best practices for the Sawmill universal
on an system with an infinite number of processors. Calculate the speedup of
1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements
Lua as a business logic language in high load application. Ilya Martynov [email protected] CTO at IPONWEB
Lua as a business logic language in high load application Ilya Martynov [email protected] CTO at IPONWEB Company background Ad industry Custom development Technical platform with multiple components Custom
General Introduction
Managed Runtime Technology: General Introduction Xiao-Feng Li ([email protected]) 2012-10-10 Agenda Virtual machines Managed runtime systems EE and MM (JIT and GC) Summary 10/10/2012 Managed Runtime
An Easier Way for Cross-Platform Data Acquisition Application Development
An Easier Way for Cross-Platform Data Acquisition Application Development For industrial automation and measurement system developers, software technology continues making rapid progress. Software engineers
Mobile Application Development Android
Mobile Application Development Android MTAT.03.262 Satish Srirama [email protected] Goal Give you an idea of how to start developing Android applications Introduce major Android application concepts
WebSphere Performance Monitoring & Tuning For Webtop Version 5.3 on WebSphere 5.1.x
Frequently Asked Questions WebSphere Performance Monitoring & Tuning For Webtop Version 5.3 on WebSphere 5.1.x FAQ Version 1.0 External FAQ1. Q. How do I monitor Webtop performance in WebSphere? 1 Enabling
Virtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
The Fundamentals of Tuning OpenJDK
The Fundamentals of Tuning OpenJDK OSCON 2013 Portland, OR Charlie Hunt Architect, Performance Engineering Salesforce.com sfdc_ppt_corp_template_01_01_2012.ppt In a Nutshell What you need to know about
IBM Software Group. SW5706 JVM Tools. 2007 IBM Corporation 4.0. This presentation will act as an introduction to JVM tools.
SW5706 JVM Tools This presentation will act as an introduction to. 4.0 Page 1 of 15 for tuning and problem detection After completing this topic, you should be able to: Describe the main tools used for
JBoss Cookbook: Secret Recipes. David Chia Senior TAM, JBoss May 5 th 2011
JBoss Cookbook: Secret Recipes David Chia Senior TAM, JBoss May 5 th 2011 Secret Recipes Byteman Cluster and Load Balancing Configuration Generator Troubleshooting High CPU Mocking a JBoss Hang State Byte
RTI Quick Start Guide
RTI Quick Start Guide This is the RTI Quick Start guide for new users or evaluators. It will help you get RTI installed and collecting data on your application quickly in an environment where you develop
Java DB Performance. Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860
Java DB Performance Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860 AGENDA > Java DB introduction > Configuring Java DB for performance > Programming tips > Understanding Java DB performance
System Structures. Services Interface Structure
System Structures Services Interface Structure Operating system services (1) Operating system services (2) Functions that are helpful to the user User interface Command line interpreter Batch interface
Berlin Mainframe Summit. Java on z/os. 2006 IBM Corporation
Java on z/os Martina Schmidt Agenda Berlin Mainframe Summit About the mainframe Java runtime environments under z/os For which applications should I use a mainframe? Java on z/os cost and performance Java
DMS Performance Tuning Guide for SQL Server
DMS Performance Tuning Guide for SQL Server Rev: February 13, 2014 Sitecore CMS 6.5 DMS Performance Tuning Guide for SQL Server A system administrator's guide to optimizing the performance of Sitecore
Measuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
Building Applications Using Micro Focus COBOL
Building Applications Using Micro Focus COBOL Abstract If you look through the Micro Focus COBOL documentation, you will see many different executable file types referenced: int, gnt, exe, dll and others.
Objectives. Chapter 2: Operating-System Structures. Operating System Services (Cont.) Operating System Services. Operating System Services (Cont.
Objectives To describe the services an operating system provides to users, processes, and other systems To discuss the various ways of structuring an operating system Chapter 2: Operating-System Structures
Ready Time Observations
VMWARE PERFORMANCE STUDY VMware ESX Server 3 Ready Time Observations VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified
Contents. 2. cttctx Performance Test Utility... 8. 3. Server Side Plug-In... 9. 4. Index... 11. www.faircom.com All Rights Reserved.
c-treeace Load Test c-treeace Load Test Contents 1. Performance Test Description... 1 1.1 Login Info... 2 1.2 Create Tables... 3 1.3 Run Test... 4 1.4 Last Run Threads... 5 1.5 Total Results History...
11.1 inspectit. 11.1. inspectit
11.1. inspectit Figure 11.1. Overview on the inspectit components [Siegl and Bouillet 2011] 11.1 inspectit The inspectit monitoring tool (website: http://www.inspectit.eu/) has been developed by NovaTec.
Java Virtual Machine: the key for accurated memory prefetching
Java Virtual Machine: the key for accurated memory prefetching Yolanda Becerra Jordi Garcia Toni Cortes Nacho Navarro Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain
Why Threads Are A Bad Idea (for most purposes)
Why Threads Are A Bad Idea (for most purposes) John Ousterhout Sun Microsystems Laboratories [email protected] http://www.sunlabs.com/~ouster Introduction Threads: Grew up in OS world (processes).
