How to Spice Up Java with the VTune Performance Analyzer?
by Levent Akyil

1. Introduction

Managed environments enable developers to bring their products to market quickly while reducing, if not eliminating, the need to spend valuable resources on porting efforts. One of the key advantages of managed environments is platform independence, which allows many different systems to run the same software (Figure 1). In this article, we will focus on the Java programming language and platform. Java is an object-oriented programming language that was designed to be small, simple, and portable across platforms and operating systems at both the source and binary levels. It is easy to use and enables developers to write platform-independent applications. On the flip side, applications written in Java have a reputation for being slower and requiring more memory than those written in natively compiled languages such as Fortran, C or C++.

Figure 1

The execution speed of Java programs has improved significantly over the years thanks to advances in Just-In-Time (JIT) compilation, adaptive optimization techniques, and language features that support better code analysis. The Java Virtual Machine (JVM) itself is continuously optimized. Because platform-independent Java applications depend heavily on the JVM to provide optimal performance for the platform, the efficiency with which the JVM handles code generation, thread management, memory allocation and garbage collection is critical in determining the performance of Java applications. There is no easy way to offer definitive advice on the performance of Java applications, because applications exhibit diverse performance characteristics across different Java development tools, such as compilers and virtual machines, and across operating systems. The Java programming language is still evolving, and its performance continues to improve.
The aim of this article is to promote awareness of Java performance issues and to present a methodology that helps developers make appropriate choices when analyzing the performance of their applications.

2. Scope

This article provides a top-down methodology for analyzing applications written in the Java programming language, with a special focus on micro-architectural optimizations. I'll show how the Intel VTune Performance Analyzer can be used to analyze Java applications. This article is not an in-depth look at the expected performance of managed environments, their associated runtime engines, or system architectures. Nor does it attempt to address every performance issue or to discuss all the types of tools available for Java performance analysis.
3. Top-down Approach

Software optimization is the process of improving software by eliminating bottlenecks so that it operates more efficiently on a given system and uses resources optimally. Identifying the bottlenecks in the target application and eliminating them appropriately is the key to efficient optimization. There are many optimization methodologies that help developers answer the questions of why, what and how to optimize, and these methods aid developers in reaching their performance requirements. In this article, I'll use a top-down approach (Figure 2): I'll start at a very high level, looking at the overall environment, and then successively drill down into more detail as I begin to tune the individual components within the system. This approach is targeted at Java server applications, but it can be applied to client applications as well. In a nutshell, the performance of a Java application depends on:

- the database and I/O configuration, if used;
- the choice of operating system;
- the choice of JVM and JVM parameters;
- the algorithms used;
- the choice of hardware.

Figure 2

System and Application Level Analysis

If I/O and database accesses are part of a Java application, then the constraints introduced by I/O devices, such as bandwidth and latency, have a bigger impact on performance than the constraints introduced by the micro-architecture. Although tuning and optimizing system-level parameters is critical, database, I/O and OS tuning are outside the scope of this article. Java code, or managed code more generally speaking, is a very specific concept referring to an executable image that runs under the supervision of a runtime execution engine. The top-down approach is reasonable because of some of these unique language features (e.g.
dynamic class loading and validation, runtime exception checking, automatic garbage collection, multithreading, etc.) in addition to memory footprint and the choice of JVM configuration.
JVMs and Just-in-Time Compilers

Just-in-time (JIT) compilation, also known as dynamic translation, is a technique for improving the runtime performance of a program by converting bytecode to native code at runtime, before executing it natively (Figure 3). Initially, JVMs interpret the bytecode; based on certain criteria, they then dynamically compile it. JIT-compiled code generally offers far better performance than interpretation. In many cases it can even offer better performance than static compilation, since many optimizations are only possible at runtime; this resembles the profile-guided optimization support provided by static compilers. With JIT compilation, the code can be recompiled and re-optimized for the target CPU and operating system on which the application actually runs. At runtime, the JIT can choose to generate SSE (Streaming SIMD Extensions) instructions whenever the underlying CPU supports them. With static compilers, if the code is compiled with SSE support, the generated binary might not execute on target processors that don't support the appropriate SSE level. On the other hand, the nature of dynamic translation introduces a slight delay in the initial execution of an application, simply due to bytecode compilation. This start-up delay is usually not a big concern for server Java applications, but it can be for client applications. In general, the more optimization a JIT compiler performs, the better the code it generates, but the longer the start-up delay. In client mode, less compilation and optimization is performed, to minimize start-up time. In server mode, since server applications are usually started once and run for extended periods of time, more compilation and optimization is performed, to maximize performance.
Figure 3

More recent Java platforms have introduced many performance improvements, including faster memory allocation, improved dynamic code generation, improved garbage collection, and reduced class sizes. These improvements help Java applications significantly, but understanding and tuning key parameters in the JVM will get you closer to optimal performance.

Tuning Java Virtual Machine Parameters

Many JVM options can significantly impact performance. Improved performance is possible with proper configuration of JVM parameters, particularly those related to memory usage and garbage collection. It is not possible to cover all the parameters and their usage here, so I'll introduce a few useful ones.
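Before tuning any flags, it helps to confirm what the JVM was actually started with. As a minimal sketch (the class name JvmParams is mine, not from the article), the standard java.lang.management API can report the launch arguments and the effective heap limit from inside the application:

```java
import java.lang.management.ManagementFactory;

class JvmParams {
    // Prints the flags this JVM was launched with and the effective heap
    // limit, so you can verify that options such as -Xmx took effect.
    public static void main(String[] args) {
        System.out.println("JVM args: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.println("Max heap (bytes): "
                + Runtime.getRuntime().maxMemory());
    }
}
```

Running this with, say, -Xmx256m should show the flag in the argument list and a correspondingly bounded maximum heap.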
Heap Size and Garbage Collection

All objects created by an executing Java program are stored in the JVM's heap, whereas all local variables live on the Java stack, and each thread has its own stack. Objects are created with the new operator, and their memory is allocated from the heap. Garbage collection is the process of cleaning up unused (unreferenced) objects that were created on the heap. An object is considered garbage when it can no longer be reached. The garbage collector (GC) must therefore detect unreferenced objects, free the heap space, and make it available again to the application. This functionality, however, doesn't come for free: the JVM has to keep track of which objects are being referenced, and then finalize and free unreferenced objects at runtime. This activity steals precious CPU time. Therefore, having an optimal heap size and garbage collection strategy is vital for optimal application performance. JVMs generally provide different GC strategies (or combinations of strategies), and choosing the correct GC type for the type of application is important (Figure 4):

- Stop-the-world
- Concurrent
- Parallel (stop-the-world)

Figure 4

The problems associated with garbage collection (GC) can be summarized as:

- Increased latency: the application is paused during GC.
- Decreased throughput: GC's sequential overhead (serialization) leads to low throughput and decreases efficiency and scalability.
- Non-deterministic behaviour: GC pauses make application behaviour non-deterministic.

All of these problems affect both client- and server-side Java applications. Client Java applications generally require rapid response times and low pause times, whereas server applications require increased throughput in addition to the client-side requirements. Server applications usually have big heap sizes and, as a result, longer GC pause times. Excessive object allocation increases the pressure on the heap and memory.
Reducing the allocation rate will help performance, in addition to observing GC behaviour and identifying the proper heap size. To observe garbage collection behaviour:

- -verbose:gc: This option will log valuable information about GC pause times, GC frequency, application run times, the size of objects created and destroyed, the memory recycled at each GC, and the rate of object creation.
- The -XX:+PrintGCDetails and -XX:+PrintGCTimeStamps options will give further information about GC.
- For applications requiring low pause times, use -Xconcgc.
- For throughput applications, use -XX:+AggressiveHeap.
- Avoid calling System.gc(); use -XX:+DisableExplicitGC.
- Avoid undersized old-generation heaps: a small heap reduces collection time, but leads to a lot of other problems such as fragmentation.

Since all objects live in the JVM's heap, and the heap size affects the execution of GC, tuning the heap size has a strong impact on performance. Heap size affects GC frequency and collection times, the number of short- and long-lived objects, fragmentation and locality.

- A starting size (-Xms) that is too small causes resizing.
- A maximum size (-Xmx) that is too small causes GC to run often without recovering much heap.
- A maximum size that is too large makes GC run longer and less efficiently.
- Identify the proper young generation size with -Xmn, -XX:NewSize and -XX:MaxNewSize.

Also note that options such as Trace, Verbose, VerboseGC, NoJIT, NoClassGC, NoAsyncGC, MaxJStack and Verify (depending on the JVM), which are usually used for debugging Java applications, will hurt performance.

Tools for GC and Heap analysis

Some useful tools for tuning the JVM are JStack, VisualGC, GCPortal, JConsole, jstat, jps, NetBeans Profiler and HeapAnalyzer, among many others. I am planning to write more on this later.

Java Programming Tips

Some basic programming tips can be given as follows[2]:

- Choose the right algorithm(s) and data structures. Algorithms with O(N^2) complexity will be slower than algorithms with O(N) or O(N log N) complexity.
- Use the fastest JVM available, and one that takes advantage of the underlying processor architecture.
- Compile with the optimization flag: javac -O.
- Use multithreading on multi-core and multi-processor systems. For single-threaded applications, avoid synchronized methods (e.g. prefer ArrayList over Vector), and keep synchronized methods outside of loops.
- Use private and static methods, and final classes, to encourage inlining.
- Use local variables as much as possible. Local variables are faster than instance variables, which are in turn faster than array elements.
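The synchronization tip above can be illustrated with a small sketch (class and method names are mine, and the measured numbers will vary by JVM and machine): Vector synchronizes every add() call, while ArrayList does not, so a single-threaded append loop pays locking overhead only with Vector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

class ListTip {
    // Appends n integers to the given list and returns the elapsed nanoseconds.
    static long timeAppends(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Vector.add() is synchronized; ArrayList.add() is not, so for
        // single-threaded use ArrayList avoids the locking overhead.
        System.out.println("Vector:    " + timeAppends(new Vector<Integer>(), n) + " ns");
        System.out.println("ArrayList: " + timeAppends(new ArrayList<Integer>(), n) + " ns");
    }
}
```

A crude micro-benchmark like this is only indicative (the JIT itself skews short runs), but it makes the cost of unneeded synchronization visible.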
Memory Usage: Significant time can be spent in memory allocation. The new operator uses a generic variable-size allocator that is much slower than more specialized memory allocators, and variable-size memory allocators degrade under heavy use because of memory fragmentation. Complex memory usage can also result in delayed or missed opportunities for object reclamation by the garbage collector. If significant time is being spent in memory allocation for a class or structure, replace the generic allocation for that class with a more appropriate memory allocation routine. Choose an algorithm that reuses freed space and reduces memory fragmentation. Use fixed-size allocation to manage blocks of a fixed size, such as objects of a single class. A fixed-size allocator maintains a linked list of blocks of a fixed size: allocation takes a block off the list, and deallocation adds a block back to the list. Allocation and deallocation are very fast with fixed-size allocators and do not degrade under heavy use.
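The fixed-size allocator described above can be sketched in a few lines of Java as an object pool; the BlockPool and Block names are mine, and a real pool would add capacity limits and thread safety as needed:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal fixed-size allocator sketch: a free list of reusable blocks.
// "Block" stands in for any fixed-size object; substitute your own class.
class BlockPool {
    static class Block {
        final byte[] data = new byte[64]; // every block has the same size
    }

    private final Deque<Block> freeList = new ArrayDeque<>();

    // Allocation: take a block off the free list, or create one if it is empty.
    Block allocate() {
        Block b = freeList.poll();
        return (b != null) ? b : new Block();
    }

    // Deallocation: return the block to the free list for reuse.
    void free(Block b) {
        freeList.push(b);
    }
}
```

Because allocation and deallocation are just a list pop and push, they stay fast under heavy use, and reusing blocks keeps allocation pressure off the garbage collector.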
Reuse obsolete objects whenever possible to avoid allocating new ones. Make it clear to the garbage collector that an object is no longer being used by assigning null (or another object) to each object reference after its last use.

Object Creation: Avoid creating objects in frequently used routines; creating objects frequently leads to object cycling, which negatively impacts overall performance. Group frequently accessed fields together so that they end up in a minimum number of cache lines, often together with the object header. Experience shows that scalar fields should be grouped together, separately from object-reference fields. Do not declare an object twice.

Strings Usage: Consider declaring a single StringBuffer object once at the beginning of the program, which can then be reused every time concatenation is required. The StringBuffer methods setLength(), append(), and toString() can then be used to initialize it, to append one or more Strings, and to convert the result back to a String, each time concatenation is needed. If concatenation is being used to format strings for stream output, avoid concatenation altogether by writing each string to the I/O buffer separately. For instance, if the result is being printed using System.out.println(), print each operand individually using System.out.print(), with System.out.println() for the last one. Using a general-purpose StringBuffer saves the time needed to allocate and free a temporary buffer every time concatenation is used. Writing each string to the output buffer separately avoids any concatenation overhead altogether, resulting in somewhat faster program execution; the benefit is partially offset, however, by the added overhead of the extra I/O calls.

I/O: Use buffered I/O whenever possible. If the amount of data being moved is large, consider using a buffering I/O class instead, for increased performance through data buffering.
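As a sketch of the bulk-read advice (the ReadFullyDemo class is mine, and an in-memory stream stands in for a real device so the example is self-contained), DataInputStream.readFully() fills an entire buffer in one call instead of looping over many small reads:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class ReadFullyDemo {
    // Fills a buffer of the source's size with a single readFully() call
    // instead of looping over many small reads. A real program would wrap
    // a FileInputStream; an in-memory stream keeps the sketch runnable.
    static byte[] readAll(byte[] source) throws IOException {
        byte[] buf = new byte[source.length];
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(source))) {
            in.readFully(buf); // blocks until the whole buffer is filled
        }
        return buf;
    }
}
```

With a file-backed stream, sizing the buffer to the data keeps the number of accesses to the physical device low, which is exactly the benefit the next paragraph quantifies.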
Data buffering can save a substantial amount of execution time by making fewer total accesses to the physical I/O device, while also allowing parallel operations to occur through multithreading. The more data being moved, the greater the benefit. Using readFully() instead of buffered I/O can significantly improve the performance of programs: the synchronization overhead of the buffered I/O routines can be avoided by calling readFully() into a large buffer, and then managing and interpreting the data yourself.

Micro-architectural optimization

Performance tuning at the micro-architecture level usually focuses on reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time; therefore, reducing the elapsed time of completing a workload is equivalent to reducing the measured processor cycles (clockticks). The Intel VTune Performance Analyzer is one of the most powerful tools available to software developers interested in this type of performance analysis. The easiest way to identify the hotspots in a given application is to sample the application with processor cycles. The VTune analyzer provides two profiling techniques, sampling and call graph, to help the developer identify where most of the clockticks are spent, in addition to many other processor events[1]. Sampling comes in two forms: processor event-based and time-based. Event-based sampling (EBS) relies on the performance monitoring unit (PMU) of the processor. From this point forward, event-based sampling (EBS) will be referred to simply as sampling.
VTune Performance Analyzer basics

In a compatible Java development environment, the VTune Performance Analyzer can be used to monitor and sample JIT-compiled Java code. The VTune analyzer gets the names of the active methods, along with their load addresses, sizes, and debug information, from the JVM, and keeps this data in an internal file for later processing when viewing results. When your Java application executes, the Just-in-Time (JIT) compiler converts your VM bytecode to native machine code. Depending on your Java environment, either the VM or the JIT provides the VTune analyzer with information about active Java classes and methods, such as their memory addresses, sizes, and symbol information. The VTune analyzer uses this information to keep track of all the classes and methods loaded into memory and the processes that are executed, and it uses this information for its final analysis. In summary, the VTune analyzer can identify performance bottlenecks in the code emitted by the JIT compiler (there is no bytecode support with EBS) and analyze the control flow.

How to start analyzing with the VTune analyzer

There are basically two ways the VTune analyzer can analyze your Java application: either the VTune analyzer starts the JVM and the application to analyze, or the VTune analyzer starts the analysis without launching anything, and the user starts the application separately, outside the VTune analyzer. The latter method is available only for sampling, and is useful for analyzing applications such as daemons, services, long-running applications, etc.

Analyzing applications started with the VTune analyzer

The first and easiest way for the VTune analyzer to analyze your application is to start the application from within the VTune analyzer (Figure 6). The VTune analyzer allows three types of applications or configurations: Application (.class or .jar), Script, and Applet. The following steps show how to set up the VTune analyzer to analyze a .class application. Start the VTune Performance Analyzer.
- Click the Sampling or Call Graph wizard (please see the following sections for more detail on these methods).
- Select Java Profiling.
- Select one of the following: Application (.class or .jar), Script, or Applet. We'll assume Application is selected for the following steps. Application: the VTune(TM) Performance Analyzer will launch a Java application; you must specify the Java launcher and the application. Script: a launching script invokes a specified Java application. Applet: the VTune analyzer invokes a Java applet; you must specify the applet viewer and the applet.
- Select the Java launcher and enter any other special JVM arguments.
- Select the main class or jar file. If any command line arguments are used, enter them. Also select any components (.jar files/directories) needed in the classpath.
- Click Finish. The VTune analyzer will now launch the application.
Figure 5

Analyzing applications started outside the VTune Analyzer

Sometimes it is not possible, or not desired, to start the Java application (e.g. daemons, services, etc.) from the VTune analyzer; it is still possible, however, to perform analysis on such applications (Figure 7).

- Start the VTune Performance Analyzer.
- Click the Sampling or Call Graph wizard (please see the following sections for more detail on these methods).
- Select Windows*/Windows* CE/Linux Profiling.
- Uncheck Automatically generate tuning advice if it is selected.
- Uncheck (de-select) No application to launch.
- Check (select) Modify default configuration when done with wizard, and then click Finish.
- In the Advanced Activity Configuration window, select Start data collection paused if it is not desired to start collection right away. Resume the collection before starting the application, or while the application is running.
- Start the application in the usual manner.
- Wait patiently for your software to complete, and/or run your software until you have executed the code path(s) of interest.

Note 1: Selecting Windows*/Windows* CE/Linux Profiling is not a mistake; it is the only option that tells the VTune analyzer not to launch any application.
Note 2: If the Java application is executed outside the VTune analyzer, please make sure to pass the correct argument to the JVM.
- For Java version 1.4.x, use -Xrunjavaperf.
- For Java version 1.5 and higher, use -agentlib:javaperf.

Figure 6

For Java applications (running on BEA, Sun or IBM JVMs), all the Java methods are combined and displayed as java.exe.jit on Windows and java.jit on Linux in the Module view. You can view the individual methods and drill down to the Hotspot view by double-clicking. The IBM and Sun JVMs use both interpreted and jitted modes of Java code execution. When sampling, only jitted-code profiles are associated with the executing Java methods; when the JVM interprets the code, the samples are attributed to the JVM itself. You can use the call graph collector to obtain a complete view of all executed methods.

Identifying hotspots, Using Event Based Sampling

I used SciMark2[3] for this example (Figure 7). SciMark2 is a Java benchmark for scientific and numerical computing. It measures several computational kernels and reports a composite score in approximate MFLOPS.
Figure 7

After the analysis, the VTune analyzer displays information about the processes and modules (Figure 8).

Figure 8

When the sampling wizard is used, the VTune analyzer by default uses processor cycles (clockticks) and instructions retired[4] to analyze the application. The count of cycles, also known as clockticks, forms the fundamental basis for measuring how long a program takes to execute; the total cycle measurement is the start-to-finish view of the total number of cycles needed to complete the application of interest. In typical performance tuning situations, the metric Total Cycles can be measured with the event CPU_CLK_UNHALTED.CORE[5]. The instructions retired event indicates the number of instructions that retired, i.e. executed completely; this does not include partially processed instructions discarded due to branch mis-predictions. The ratio of (non-halted) clockticks to instructions retired is called cycles per instruction (CPI), and it can be a good indicator of performance problems, reflecting the efficiency of the instructions generated by the compiler and/or low CPU utilization. It is also possible to change the processor performance events used for sampling[6].
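The CPI metric above is just the quotient of the two default event counts. As a minimal sketch (the Cpi class and the sample counts are mine, for illustration):

```java
class Cpi {
    // CPI = non-halted clockticks / instructions retired, i.e. the ratio of
    // the CPU_CLK_UNHALTED.CORE and instructions retired sample counts that
    // the VTune analyzer reports for a module or function.
    static double cpi(long clockticks, long instructionsRetired) {
        return (double) clockticks / (double) instructionsRetired;
    }

    public static void main(String[] args) {
        // Hypothetical counts: Core-architecture processors can retire up to
        // four instructions per cycle, so the best achievable CPI is 0.25.
        System.out.println("CPI = " + cpi(1_000_000L, 2_000_000L));
    }
}
```

A CPI well above the machine's theoretical minimum suggests stalls or inefficient code; the false-sharing example later in the article shows one way such stalls arise.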
The java.exe.jit module is of interest to us. If we drill down further (double-click on the module or click the Hotspot View button), the hotspot view will show us all the functions executed during the benchmark for which we have samples.

Figure 9

From the hotspot view it is clear that the jitted bytecode produced very efficient code, considering that the theoretically achievable CPI ratio is 0.25 (Figure 9). This is natural when you consider that the benchmark, run without any command line arguments, uses problem sizes that fit into the cache. It is also possible to drill down further and see the source with the associated sample or total event counts: simply double-clicking on a function of interest takes you to the Java source of that method (Figure 10).

Figure 10
Identifying hotspots, Using Call Graph Analysis

Creating a call graph activity is similar to creating a sampling activity, and one can follow a similar process. In this step it is possible to change the Java launcher and enter special JVM arguments as well. For call graph analysis, the -agentlib:javaperf=cg JVM parameter will be picked up automatically by the VTune analyzer. Please note that the VTune analyzer enables you to distinguish between JIT-compiled, interpreted and inlined Java methods, so you can examine the timing differences between each type of method execution. JIT-compiled methods are grouped into modules with the .jit extension, interpreted methods into modules with the .interpreted extension, and inlined methods into modules with the .inlined extension. Also note that call site[7] information is not collected for Java call graphs.

Figure 11

The red arrows show us the critical path of the analysis; in other words, the flow path that consumed most of the time (Figure 11).

When do JVMs decide to JIT?

Call graph analysis gives us very valuable information. In addition to the control flow of the Java application, it is also easy to see how many times a particular method was interpreted before getting jitted by the JVM. To see this information, simply group the view by Class and then sort by Function (Figure 12). After these arrangements, one can easily see that the matmult function in SparseCompRow.java was interpreted 3 times before it was jitted. It is also possible to see that the jitted version of matmult took ~524 microseconds (Self Time / Calls) whereas the interpreted version took ~ microseconds. Similar calculations can also be done. We see an even more dramatic difference in the inverse function: while the interpreted version of inverse runs in 73 microseconds, the jitted version runs in only 7 microseconds.
Figure 12

Figure 13

From the same information, we can also find out how many times a certain function has to be executed before getting jitted (Figure 13).

Identifying Memory Problems

No single issue affects software performance more than the speed of memory. Slow memory or inefficient memory accesses hurt performance by forcing the processor to wait for instruction operands. Identifying memory access issues is the first step in analyzing memory-related performance problems. Before going into memory-related issues, it is time to give a basic formula used in micro-architecture performance analysis. It is accurate to say that the total number of cycles an application takes is the sum of the cycles spent dispatching μops[7] and the cycles not dispatching μops (stalls). This can be formulated with the Intel Core architecture processor event names as shown below. The formula is explained in greater detail in the Intel 64 and IA-32 Intel Architecture Optimization Reference Manual; for a more complete analysis, please refer to that manual. In this approach,

Total Cycles = Cycles dispatching μops + Cycles not dispatching μops
CPU_CLK_UNHALTED.CORE ~ RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE
Cycles dispatching μops can be counted with the RS_UOPS_DISPATCHED.CYCLES_ANY event, while cycles in which no μops were dispatched (stalls) can be counted with the RS_UOPS_DISPATCHED.CYCLES_NONE event; the equation given earlier in Formula 1 can therefore be re-written as given in Formula 2. The ratio of RS_UOPS_DISPATCHED.CYCLES_NONE to CPU_CLK_UNHALTED.CORE tells you the percentage of cycles wasted due to stalls. These stalls can turn the execution unit of a processor into a major bottleneck. The execution unit is, by definition, always the bottleneck, because it defines the throughput, and an application will only perform as fast as its bottleneck. Consequently, it is extremely critical to identify the causes of the stall cycles and remove them, if possible. Our goal is to minimize the causes of the stalls and let the bottleneck (i.e. the execution unit) do what it is designed to do; in short, the execution unit should not sit idle. There are many contributing factors to stall cycles and sub-optimal usage of the execution unit: memory accesses (e.g. cache misses), branch mis-predictions (and the resulting pipeline flushes), long-latency floating-point operations (such as division and FP control word changes), and μops not retiring due to the out-of-order (OOO) engine, to name a few. I will focus on memory-related issues and how to identify them with the VTune analyzer. The memory-related issues that can occur in Java programs are no different from those in other programming languages; however, the JVM can overcome some data locality issues through the runtime optimizations it performs while jitting. There are three causes for the processor to access main memory (i.e. cache-line loads): conflict, capacity and compulsory. Capacity loads occur when data that was already in the cache has been evicted and must be reloaded.
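The stall ratio just described is a one-line calculation over the two event counts. As a sketch (the StallRatio class and the counts are mine, for illustration):

```java
class StallRatio {
    // Percentage of cycles in which no μops were dispatched (stalls):
    // 100 * RS_UOPS_DISPATCHED.CYCLES_NONE / CPU_CLK_UNHALTED.CORE.
    static double stallPercent(long cyclesNone, long totalCycles) {
        return 100.0 * cyclesNone / totalCycles;
    }

    public static void main(String[] args) {
        // Hypothetical counts matching the false-sharing example later on,
        // where stalls amount to roughly 88% of the cycles.
        System.out.println(stallPercent(880_000L, 1_000_000L) + "%");
    }
}
```

A high percentage here says only that the execution unit is starved; the sampling drill-down is still needed to find out why.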
Using a smaller working set of data can reduce capacity loads. Conflict loads occur because each cache row can hold only specific memory addresses; these can be avoided by changing the alignment. Compulsory loads occur when data is loaded for the first time; their number can be reduced but not eliminated (this should be taken care of by the hardware prefetchers or prefetch instructions). For this section, I decided to focus on a sneaky memory-related problem which usually occurs in multi-threaded applications running on SMP systems. I wrote a Java application called FalseShare.java for this purpose (Figure 14 shows the section of interest). This Java application is multi-threaded (it creates threads based on the number of physical cores) and runs the same algorithm with two different data distribution models. The platform used for this example is an Intel 45nm Core 2 Quad processor running Red Hat Enterprise Linux 5.

startindx = tid;
for (loops = 0; loops < Constants.ITERS; loops++)
    for (p = startindx; p < Constants.NUMPOINTS; p += Constants.THREADS) {
        double d = Math.sqrt(points[p].x * points[p].x
                           + points[p].y * points[p].y
                           + points[p].z * points[p].z);
        points[p].x /= d;
        points[p].y /= d;
        points[p].z /= d;
    }

Figure 14

If we recall the formula given earlier and collect the related events, we can estimate the number of stall cycles. After running the first version of the program, the stall cycles amount to ~88% of the cycles. Most of the time, stall cycles are symptoms of something going wrong in the execution stage.
Figure 15

If we drill down to the source code, we can see the lines contributing to the stall cycles. With careful analysis of the code section contributing heavily to the stall cycles, we see a phenomenon called false sharing in Version 1. Unlike true sharing, where threads share variables, in false sharing threads do not share a global variable but rather share a cache line.

Figure 16

False sharing occurs when multiple threads write to the same cache line over and over. In many cases, modified data sharing means that two or more threads race to use and modify data in one cache line, which is 64 bytes in size. Frequent occurrences of modified data sharing cause demand misses that have a high penalty; when false sharing is removed, code performance can improve dramatically. Having said this, if we look carefully at the source code, we can see that access to points[] follows a cyclic distribution of the data over each iteration. If we re-write it as shown in Figure 18 (case VERSION2) and compare it to the first version, we see that Version 1 runs roughly ~6.19 times slower (simply comparing the clockticks).

Figure 17
switch (testtype) {
case VERSION1:
    startindx = tid;
    for (loops = 0; loops < Constants.ITERS; loops++)
        for (p = startindx; p < Constants.NUMPOINTS; p += Constants.THREADS) {
            double d = Math.sqrt(points[p].x * points[p].x
                               + points[p].y * points[p].y
                               + points[p].z * points[p].z);
            points[p].x /= d;
            points[p].y /= d;
            points[p].z /= d;
        }
    break;
case VERSION2:
    indx = tid;
    int delta = (int) ((float) Constants.NUMPOINTS / Constants.THREADS);
    startindx = indx * delta;
    endindx = startindx + delta;
    if (indx == Constants.THREADS - 1)
        endindx = Constants.NUMPOINTS;
    for (loops = 0; loops < Constants.ITERS; loops++)
        for (p = startindx; p < endindx; p++) {
            double d = Math.sqrt(points[p].x * points[p].x
                               + points[p].y * points[p].y
                               + points[p].z * points[p].z);
            points[p].x /= d;
            points[p].y /= d;
            points[p].z /= d;
        }
    break;
}

Figure 18

Figure 19

So let's look at L2 cache misses to see the impact. After sampling both versions with two key processor events, MEM_LOAD_RETIRED.L2_LINE_MISS[9] and L2_LINES_IN.SELF[10], we can clearly see the difference between the two versions. The second version uses a block distribution of the data to eliminate the cache-line sharing, i.e. the false sharing.

Version 1 / Version 2
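The fix in Version 2 comes down to how indices are partitioned across threads: a contiguous block per thread instead of interleaved ("cyclic") indices, so neighbouring array elements, which share a cache line, are written by the same thread. The partition arithmetic can be isolated in a small sketch (the Partition class is mine; numPoints and threads mirror Constants.NUMPOINTS and Constants.THREADS):

```java
class Partition {
    // Block distribution as in Version 2: thread tid owns the contiguous
    // index range [start, end); the last thread absorbs any remainder.
    static int[] blockRange(int tid, int numPoints, int threads) {
        int delta = numPoints / threads;
        int start = tid * delta;
        int end = (tid == threads - 1) ? numPoints : start + delta;
        return new int[] { start, end };
    }
}
```

Under the cyclic scheme of Version 1, thread 0 touches indices 0, T, 2T, ..., so every cache line holding several points is written by several threads; under the block scheme, each line is (except at range boundaries) written by exactly one thread.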
Figure 20

Clearly, the memory access issues in this example will prevent the application from scaling as the number of cores increases. Plotting the non-false-sharing and false-sharing versions shows the impact of false sharing on scaling.

Figure 21 (seconds vs. number of threads on a Core 2 Extreme QX9650, with false sharing (FS) and without (NOFS))

Identifying SIMD and Vectorization Usage

Leveraging the SIMD and SSE (Streaming SIMD Extensions) support available on target processors is one of the key optimization techniques JVMs use (or should use). The question is how to identify which jitted methods use SSE; this has been a common question from many Java developers. The VTune analyzer's event-based sampling can help users pinpoint exactly which methods are optimized to use SSE. One can simply use the SIMD_INST_RETIRED.ANY event (Retired Streaming SIMD instructions (precise event)) to count the overall number of SIMD instructions retired. The following events give a further breakdown of this one event; these are the events available on Core architecture based processors.

Symbol Name[11] and Description:
FP_MMX_TRANS.TO_FP: Transitions from MMX (TM) instructions to floating-point instructions.
FP_MMX_TRANS.TO_MMX: Transitions from floating-point to MMX (TM) instructions.
SIMD_ASSIST: SIMD assists invoked.
SIMD_COMP_INST_RETIRED.PACKED_DOUBLE: Retired computational Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
SIMD_COMP_INST_RETIRED.PACKED_SINGLE: Retired computational Streaming SIMD Extensions (SSE) packed-single instructions.
SIMD_COMP_INST_RETIRED.SCALAR_DOUBLE: Retired computational Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
SIMD_COMP_INST_RETIRED.SCALAR_SINGLE: Retired computational Streaming SIMD Extensions (SSE) scalar-single instructions.
SIMD_INSTR_RETIRED: SIMD instructions retired.
SIMD_INST_RETIRED.ANY: Retired Streaming SIMD instructions (precise event).
SIMD_INST_RETIRED.PACKED_DOUBLE: Retired Streaming SIMD Extensions 2 (SSE2) packed-double instructions.
SIMD_INST_RETIRED.PACKED_SINGLE: Retired Streaming SIMD Extensions (SSE) packed-single instructions.
SIMD_INST_RETIRED.SCALAR_DOUBLE: Retired Streaming SIMD Extensions 2 (SSE2) scalar-double instructions.
SIMD_INST_RETIRED.SCALAR_SINGLE: Retired Streaming SIMD Extensions (SSE) scalar-single instructions.
SIMD_INST_RETIRED.VECTOR: Retired Streaming SIMD Extensions 2 (SSE2) vector integer instructions.
SIMD_SAT_INSTR_RETIRED: Saturated arithmetic instructions retired.
SIMD_SAT_UOP_EXEC: SIMD saturated arithmetic micro-ops executed.
SIMD_UOPS_EXEC: SIMD micro-ops executed (excluding stores).
SIMD_UOP_TYPE_EXEC.ARITHMETIC: SIMD packed arithmetic micro-ops executed.
SIMD_UOP_TYPE_EXEC.LOGICAL: SIMD packed logical micro-ops executed.
SIMD_UOP_TYPE_EXEC.MUL: SIMD packed multiply micro-ops executed.
SIMD_UOP_TYPE_EXEC.PACK: SIMD pack micro-ops executed.
SIMD_UOP_TYPE_EXEC.SHIFT: SIMD packed shift micro-ops executed.
SIMD_UOP_TYPE_EXEC.UNPACK: SIMD unpack micro-ops executed.

Figure 22

If we collect some of these events on SciMark2 again (Figures 22 and 23), we'll see that all of our benchmarks are actually using the SIMD (Single Instruction Multiple Data) unit. It is also important to note that SIMD_INST_RETIRED.ANY should be equal to the total of all its sub-events.
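As a concrete experiment, one can profile a loop of the kind HotSpot's JIT is able to auto-vectorize and check which of the events above fire. The kernel below is a sketch of my own (the class and method names are invented); whether the jitted code actually uses packed SSE depends on the JVM version and flags, and sampling it with SIMD_INST_RETIRED.PACKED_SINGLE versus SIMD_INST_RETIRED.SCALAR_SINGLE shows which form was generated.

```java
// Hypothetical kernel: a unit-stride, dependence-free loop, which is a
// natural candidate for packed SSE code generation by the JIT.
public class VectorCandidate {
    public static float[] saxpy(float a, float[] x, float[] y) {
        float[] r = new float[x.length];
        for (int i = 0; i < x.length; i++)
            r[i] = a * x[i] + y[i];   // r = a*x + y, element-wise
        return r;
    }

    public static void main(String[] args) {
        float[] x = new float[4096], y = new float[4096];
        java.util.Arrays.fill(x, 2.0f);
        java.util.Arrays.fill(y, 1.0f);
        float[] r = null;
        for (int rep = 0; rep < 10_000; rep++)   // keep the method hot for sampling
            r = saxpy(3.0f, x, y);
        System.out.println(r[0]);                // prints 7.0 (3*2 + 1)
    }
}
```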
Figure 23

Parallelization, threads and more

As we mentioned earlier, Java uses the heap to allocate memory for objects; in multi-threaded applications, however, it quickly becomes clear that access to the heap can turn into a significant concurrency bottleneck, as every allocation would involve acquiring a lock that guards the heap. Luckily, JVMs use thread-local allocation blocks (TLAB), where each thread allocates a larger chunk of memory from the heap and services small allocation requests sequentially out of that thread-local block. This greatly enhances scalability and reduces the number of times a thread must acquire the shared heap lock, improving concurrency. A TLAB[12] enables a thread to do object allocation using thread-local top and limit pointers, which is faster than serialized access to the heap shared across threads. However, as the number of threads exceeds the number of processors, the cost of committing memory to local-allocation buffers becomes a challenge, and sophisticated sizing policies must be employed.

The single-threaded copying collector can also become a bottleneck in an application that is parallelized to take advantage of multiple processors. To take full advantage of all available CPUs on a multiprocessor machine, the HotSpot JVM offers an optional multithreaded collector[13]. The parallel collector tries to keep related objects together to improve memory locality and cache utilization, which is accomplished by copying objects in depth-first order.

It is hard not to think about multi-threading when the current generation of processors has more and more cores. As multi-core processors become widely available, utilizing all of these cores becomes increasingly important. Luckily, the Java language and JVMs are inherently multi-threaded, and threading can help the user increase both throughput and determinism (GC threads can find more hardware threads/cores available).
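A quick way to feel the TLAB effect is an allocation-heavy multithreaded loop. The sketch below is illustrative only (the names are mine): with TLABs on, which is the HotSpot default (-XX:+UseTLAB), each thread's small allocations come out of its own buffer; running the same program with -XX:-UseTLAB forces the threads back onto the shared heap for comparison.

```java
// Hypothetical demo: many small, short-lived allocations on several threads.
public class AllocDemo {
    // Allocates 'objects' small arrays; returns a checksum so the work
    // cannot be optimized away.
    public static long churn(int objects) {
        long sum = 0;
        for (int i = 0; i < objects; i++)
            sum += new int[8].length;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[4];
        long t0 = System.nanoTime();
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> churn(2_000_000));
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.printf("4 allocating threads: %d ms%n",
                          (System.nanoTime() - t0) / 1_000_000);
    }
}
```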
Now we'll look at how to increase performance and improve resource utilization by leveraging threading. I'll use the Java Grande benchmark suite[14] v1.0 to demonstrate the necessity of parallelism and multithreading for leveraging the potential of any multi-core platform. Some of the benchmark results can be found in the graph below.
Figure 24 (Java Grande results, seconds vs. threads, for Section3:RayTracer:Total:SizeA, Section3:MonteCarlo:Total:SizeA, Section3:MolDyn:Total:SizeA, Section2:SparseMatmult:Kernel:SizeB, Section2:SOR:Kernel:SizeB and Section2:Crypt:Kernel:SizeB)

How to use VTune Analyzer Programmatic APIs?

"How do I use the VTune analyzer programmatic APIs?" is a very common question, and the solution is already provided by the VTune analyzer. Before showing how to use the programmatic APIs, let's quickly cover what they are and what they are good for. The VTune analyzer provides Pause and Resume APIs (VTPause() and VTResume()) to allow the user to analyze only certain parts of the application. By using these APIs, the user can skip the uninteresting parts, such as initialization, GUI interaction, etc. These APIs are also available for Java and are implemented in VTuneAPI.class, which is located in the <installdir>\analyzer\bin\com\intel\vtune directory. The following Pause/Resume APIs enable pausing and resuming data collection for Java applications:

com.intel.vtune.VTuneAPI.VTPause(): pauses sampling and call graph data collection for a Java application.
com.intel.vtune.VTuneAPI.VTResume(): resumes sampling and call graph data collection for a Java application.
com.intel.vtune.VTuneAPI.VTNameThread(String threadName): names the current thread for call graph data collection for a Java application.
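Since calling these classes directly creates a compile-time dependency on VTuneAPI.class, one convenient pattern (my own sketch, not part of the product) is to resolve the class reflectively, so the same binary runs with or without the analyzer installed; the calls silently become no-ops when the class is absent.

```java
import java.lang.reflect.Method;

// Sketch: reflective wrapper around the VTune Pause/Resume APIs.
// Assumes the static methods VTPause()/VTResume() on
// com.intel.vtune.VTuneAPI described above; degrades to no-ops otherwise.
public class Profiler {
    private static Method pause, resume;
    static {
        try {
            Class<?> api = Class.forName("com.intel.vtune.VTuneAPI");
            pause  = api.getMethod("VTPause");
            resume = api.getMethod("VTResume");
        } catch (ReflectiveOperationException e) {
            // VTune classes not on the classpath: leave the calls as no-ops.
        }
    }

    public static boolean available() { return pause != null && resume != null; }
    public static void pause()  { invoke(pause); }
    public static void resume() { invoke(resume); }

    private static void invoke(Method m) {
        if (m == null) return;
        try { m.invoke(null); } catch (Exception ignored) { }
    }

    public static void main(String[] args) {
        Profiler.pause();    // skip e.g. initialization
        // ... uninteresting startup work ...
        Profiler.resume();   // collect data for the region of interest
        System.out.println("VTune API on classpath: " + Profiler.available());
    }
}
```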
Figure 25

In between the calls to VTPause() and VTResume(), performance data is not collected. One can simply select "Start with data collection paused" during the configuration and resume the collection at any time with a VTResume() call. If you run a Java application using these APIs, the VTune analyzer console will report the API usage during the analysis:

Thu Sep 25 15:32 Data collection started...
Thu Sep 25 15:34 Data collection paused...
Thu Sep 25 15:34 Data collection resumed...
Thu Sep 25 15:34 Data collection paused...
Thu Sep 25 15:34 Data collection finished...

However, if you are one of those people who would like to do it their own way and implement your own wrapper for these APIs, JNI (Java Native Interface) can always help.

/* VTApis.java */
class VTApis {
    public native void assignThreadName();
    public native void resumeVTune();
    public native void pauseVTune();
    static {
        System.loadLibrary("VTApis");
    }
}

/* VTApis.c */
#include <stdio.h>
#include <vtuneapi.h>
#include "VTApis.h"   // this header file was generated by javah

/*
 * Class:     VTApis
 * Method:    resumeVTune
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_VTApis_resumeVTune(JNIEnv *env, jobject obj)
{
    VTResume();
    printf("Resuming VTune collection.\n");
}

/*
 * Class:     VTApis
 * Method:    pauseVTune
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_VTApis_pauseVTune(JNIEnv *env, jobject obj)
{
    VTPause();
    printf("Pausing VTune collection.\n");
}

Figure 26

How to compile and generate the wrapper using JNI:

Compile the VTune analyzer API class:
$> javac.exe VTApis.java

Generate the header file VTApis.h:
$> javah.exe -jni VTApis

Compile and generate the .dll:
$> cl /Zi -I"C:\Program Files\Java\jdk1.6.0_04\include" -I"C:\Program Files\Java\jdk1.6.0_04\include\win32" -I"C:\Program Files\Intel\VTune\Analyzer\Include" -LD VTApis.c -fixed:no -FeVTApis.dll VtuneApi.lib

Optimizing how the JVM generates jitted code is not the goal of this article, but it is worth mentioning a few things about this topic. Tune JIT code generation to match the underlying processor architecture:
- Take advantage of new architectural features; for example, use efficient SSE instructions to move data.
- Improve decode/allocation efficiency: avoid length-changing prefixes to improve decode efficiency, and align branch targets.
- Eliminate inherent stalls in generated code.
- Tune register allocation to reduce memory traffic: better register allocation can reduce stack operations, and the additional registers afforded by SSE or Intel 64 can help.

It is also important to note that while 64-bit JVMs enable heaps larger than 4 GB, they are usually slower than 32-bit JVMs, simply due to the extra memory needed and the system pressure caused by using and moving 64-bit pointers. Using 32-bit offsets from a Java heap base address instead of 64-bit pointers can significantly improve performance; on Intel Xeon platforms, the resulting 64-bit JVM can be faster than its 32-bit equivalent.

VTune Analyzer's JIT Profiling API
In addition to the Intel VTune Performance Analyzer's normal Java support (Figure 27), starting with the VTune Performance Analyzer 9.1 the JIT Profiling API became public, providing further functionality and control when profiling runtime-generated code. JIT compilers or JVMs can use this API to insert API calls that gather more detailed information about interpreted or dynamically generated code.
Instructions to include JIT Profiling Support

Include the JITProfiling.h file, located under C:\Program Files\Intel\VTune\Analyzer\include on Microsoft* operating systems and under /opt/intel/vtune/analyzer/include on Linux* operating systems. This header file provides all API function prototypes and type definitions.

Link the virtual machine (or any code using these APIs) with JITProfiling.lib, located under C:\Program Files\Intel\VTune\Analyzer\lib on Windows*, and with JITProfiling.a, located under /opt/intel/vtune/analyzer/bin on Linux* operating systems. On Linux*, also link with the standard libraries libdl.so and libpthread.so.

Note: the JITProfiling.a that comes with the VTune analyzer is compiled with g++ and not with gcc; therefore either compile your code with g++, or compile with gcc and link with the -lstdc++ library.

In order to function properly, a VM that uses the JITProfiling API should implement a mode-change callback function and register it using iJIT_RegisterCallbackEx. The callback function is executed every time the profiling mode changes; this ensures that the VM issues the appropriate notifications when mode changes happen.

To enable JIT profiling support, set the environment variable ENABLE_JITPROFILING=1.
On Windows: set ENABLE_JITPROFILING=1
On Linux: export ENABLE_JITPROFILING=1

On Linux, JIT profiling can only be used with the command line interface (vtl), and the jitprofiling option needs to be used.
For call graph analysis: vtl activity jitcg -c callgraph -o jitprofiling -app ./jitprof run
For sampling analysis: vtl activity jitsamp -c sampling -o jitprofiling -app ./jitprof run

If you wish to perform JIT profiling on a remote Linux OS system, define the BISTRO_COLLECTORS_DO_JIT_PROFILING environment variable in the shell where vtserver executes:
export BISTRO_COLLECTORS_DO_JIT_PROFILING=1
End Notes and References

1. Please note that options that begin with -X are non-standard, while -XX options are not stable. These options are not guaranteed to be supported on all VM implementations, and are subject to change without notice in subsequent releases of the JDK. Please check for more information.
3. Instructions Retired: Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Instructions executed to completion are referred to as instructions retired. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2.
4. For these examples, I have used Intel Core micro-architecture based systems.
5. The Intel Core micro-architecture is capable of reaching a CPI as low as 0.25 in ideal situations. A greater CPI value for a given workload indicates that there are more opportunities for code tuning to improve performance. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2.
6. A call site is defined as the location in the caller function from where a call is made to a callee.
7. Micro-operations, also known as micro-ops or μops, are simple, RISC-like microprocessor instructions used by some CISC processors to implement more complex instructions. Wikipedia.
8. Counts the number of retired load operations that missed the L2 cache.
9. Counts the number of cache lines allocated in the L2 cache. Cache lines are allocated in the L2 cache as a result of requests from the L1 data and instruction caches and the L2 hardware prefetchers to cache lines that are missing in the L2 cache.
10. Taken from the Intel VTune Performance Analyzer help. Intel VTune Performance Analyzer.
The Java Grande Forum is a community initiative to promote the use of Java for so-called Grande applications.
A Grande application is an application which has large requirements for any or all of: memory, bandwidth, and processing power. You can download the benchmark suite from
Inside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
BridgeWays Management Pack for VMware ESX
Bridgeways White Paper: Management Pack for VMware ESX BridgeWays Management Pack for VMware ESX Ensuring smooth virtual operations while maximizing your ROI. Published: July 2009 For the latest information,
Sawmill Log Analyzer Best Practices!! Page 1 of 6. Sawmill Log Analyzer Best Practices
Sawmill Log Analyzer Best Practices!! Page 1 of 6 Sawmill Log Analyzer Best Practices! Sawmill Log Analyzer Best Practices!! Page 2 of 6 This document describes best practices for the Sawmill universal
on an system with an infinite number of processors. Calculate the speedup of
1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements
Lua as a business logic language in high load application. Ilya Martynov [email protected] CTO at IPONWEB
Lua as a business logic language in high load application Ilya Martynov [email protected] CTO at IPONWEB Company background Ad industry Custom development Technical platform with multiple components Custom
General Introduction
Managed Runtime Technology: General Introduction Xiao-Feng Li ([email protected]) 2012-10-10 Agenda Virtual machines Managed runtime systems EE and MM (JIT and GC) Summary 10/10/2012 Managed Runtime
An Easier Way for Cross-Platform Data Acquisition Application Development
An Easier Way for Cross-Platform Data Acquisition Application Development For industrial automation and measurement system developers, software technology continues making rapid progress. Software engineers
Mobile Application Development Android
Mobile Application Development Android MTAT.03.262 Satish Srirama [email protected] Goal Give you an idea of how to start developing Android applications Introduce major Android application concepts
WebSphere Performance Monitoring & Tuning For Webtop Version 5.3 on WebSphere 5.1.x
Frequently Asked Questions WebSphere Performance Monitoring & Tuning For Webtop Version 5.3 on WebSphere 5.1.x FAQ Version 1.0 External FAQ1. Q. How do I monitor Webtop performance in WebSphere? 1 Enabling
Virtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
The Fundamentals of Tuning OpenJDK
The Fundamentals of Tuning OpenJDK OSCON 2013 Portland, OR Charlie Hunt Architect, Performance Engineering Salesforce.com sfdc_ppt_corp_template_01_01_2012.ppt In a Nutshell What you need to know about
IBM Software Group. SW5706 JVM Tools. 2007 IBM Corporation 4.0. This presentation will act as an introduction to JVM tools.
SW5706 JVM Tools This presentation will act as an introduction to. 4.0 Page 1 of 15 for tuning and problem detection After completing this topic, you should be able to: Describe the main tools used for
JBoss Cookbook: Secret Recipes. David Chia Senior TAM, JBoss May 5 th 2011
JBoss Cookbook: Secret Recipes David Chia Senior TAM, JBoss May 5 th 2011 Secret Recipes Byteman Cluster and Load Balancing Configuration Generator Troubleshooting High CPU Mocking a JBoss Hang State Byte
RTI Quick Start Guide
RTI Quick Start Guide This is the RTI Quick Start guide for new users or evaluators. It will help you get RTI installed and collecting data on your application quickly in an environment where you develop
Java DB Performance. Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860
Java DB Performance Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860 AGENDA > Java DB introduction > Configuring Java DB for performance > Programming tips > Understanding Java DB performance
System Structures. Services Interface Structure
System Structures Services Interface Structure Operating system services (1) Operating system services (2) Functions that are helpful to the user User interface Command line interpreter Batch interface
Berlin Mainframe Summit. Java on z/os. 2006 IBM Corporation
Java on z/os Martina Schmidt Agenda Berlin Mainframe Summit About the mainframe Java runtime environments under z/os For which applications should I use a mainframe? Java on z/os cost and performance Java
DMS Performance Tuning Guide for SQL Server
DMS Performance Tuning Guide for SQL Server Rev: February 13, 2014 Sitecore CMS 6.5 DMS Performance Tuning Guide for SQL Server A system administrator's guide to optimizing the performance of Sitecore
Measuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
Building Applications Using Micro Focus COBOL
Building Applications Using Micro Focus COBOL Abstract If you look through the Micro Focus COBOL documentation, you will see many different executable file types referenced: int, gnt, exe, dll and others.
Objectives. Chapter 2: Operating-System Structures. Operating System Services (Cont.) Operating System Services. Operating System Services (Cont.
Objectives To describe the services an operating system provides to users, processes, and other systems To discuss the various ways of structuring an operating system Chapter 2: Operating-System Structures
Ready Time Observations
VMWARE PERFORMANCE STUDY VMware ESX Server 3 Ready Time Observations VMware ESX Server is a thin software layer designed to multiplex hardware resources efficiently among virtual machines running unmodified
Contents. 2. cttctx Performance Test Utility... 8. 3. Server Side Plug-In... 9. 4. Index... 11. www.faircom.com All Rights Reserved.
c-treeace Load Test c-treeace Load Test Contents 1. Performance Test Description... 1 1.1 Login Info... 2 1.2 Create Tables... 3 1.3 Run Test... 4 1.4 Last Run Threads... 5 1.5 Total Results History...
11.1 inspectit. 11.1. inspectit
11.1. inspectit Figure 11.1. Overview on the inspectit components [Siegl and Bouillet 2011] 11.1 inspectit The inspectit monitoring tool (website: http://www.inspectit.eu/) has been developed by NovaTec.
Java Virtual Machine: the key for accurated memory prefetching
Java Virtual Machine: the key for accurated memory prefetching Yolanda Becerra Jordi Garcia Toni Cortes Nacho Navarro Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain
Why Threads Are A Bad Idea (for most purposes)
Why Threads Are A Bad Idea (for most purposes) John Ousterhout Sun Microsystems Laboratories [email protected] http://www.sunlabs.com/~ouster Introduction Threads: Grew up in OS world (processes).
