MPI Application Tune-Up: Four Steps to Performance


Abstract

Cluster systems continue to grow in complexity and capability, and getting optimal performance out of them can be challenging. Making sense of the MPI communication of an application, whether determining the load balance or finding platform bandwidth limitations across the processes/ranks, can be daunting. A four-step process is outlined using a Poisson solver implemented as an MPI application; the tools used for the profiling and analysis are Intel Trace Analyzer and Collector and Intel VTune Amplifier XE. This paper is a first introduction and overview of the methodology, with emphasis on the most important features that Intel Trace Analyzer and Collector (1) (2) and VTune Amplifier XE (3) offer for the analysis of pure MPI applications on HPC clusters, using the Poisson solver as an illustrative example. We concentrate on a fixed-size workload to perform a strong-scaling analysis by varying the number of MPI ranks, each of them mapped to a single physical core. Besides showing a sample analysis, the reader will learn detailed command lines and GUI usage information demonstrating how to use the above-mentioned tools effectively in a cluster environment. We used the Intel Endeavor cluster, housed at Intel's Customer Response Team (CRT) Datacenter in New Mexico. Each node of this cluster comprises two 12-core Intel Xeon E5 v2 processors (Intel microarchitecture code name Ivy Town (4)) and is connected through Mellanox* InfiniBand.

Methodology

Parallel High Performance Computing (HPC) applications often rely on the multi-node architectures of modern clusters. Performance tuning of such applications must involve analysis of cross-node application behavior as well as single-node performance analysis. Two performance analysis tools, Intel Trace Analyzer and Collector and VTune Amplifier, can provide important insights to help in this analysis. For example, message passing interface (MPI) communication hotspots, synchronization bottlenecks, load balancing, and other complex issues can be investigated using Intel Trace Analyzer and Collector. At the same time, VTune Amplifier can be used to understand intra-node performance issues. We will apply both of these tools to the pure MPI version of the Poisson solver. The more complex case of hybrid applications is left for later studies. The methodology presented below represents broad recommendations to combine global application performance metrics, such as speedup and parallel efficiency, with more detailed measurements, such as message passing rates and memory bandwidth. For detailed analysis beyond simple scaling profiling, software tools such as Intel Trace Analyzer and Collector and VTune Amplifier are necessary. Our goal is not to achieve better performance per se, but to show how to use Intel Trace Analyzer and Collector and VTune Amplifier to better understand performance problems on a specific hardware platform as a prerequisite for subsequent tuning. We focus on cluster-level performance optimizations, assuming that the application has already gone through single-core performance tuning and hence achieved some level of maturity. Performance issues related to scalability have to be evaluated at the level of concurrency at which the application is used in real-life situations. For example, by performing analysis using realistic workloads running at scale on the cluster, we investigate the application under relevant memory footprint conditions.
The methodology consists of 4 phases depicted in Figure 1 (a more detailed version is shown in Appendix C):

Figure 1: Flow chart of the tuning methodology. The chart comprises four boxes: (1) Global Analysis — characteristic values for the whole program: timing, speedup, efficiencies, plus the imbalance diagram; (2) Algorithmic Investigations — change the algorithm (some workloads provide alternatives), analyze and change the MPI pattern, analyze the load imbalance of the computation; (3) MPI Runtime Tuning — set Intel MPI environment variables, use a faster communication network, apply an optimized rank-to-node mapping; (4) Single node/process tuning — hotspot analysis for MPI routines, tuning of routines showing a load imbalance, bandwidth analysis for single-node scalability. If the imbalance time exceeds the interconnect time, the flow proceeds to (2), otherwise to (3); whenever a step reduces the runtime, the analysis iterates from (1). The central decision is taken based on the relation of imbalance time vs. interconnect time; these times can be determined by ITAC (see below). A large imbalance time means that there is much waiting time in MPI routines and the program cannot get much faster even with an ideal communication network. A large interconnect time means that any improvement of the network performance speeds up the program significantly.

Scaling using different MPI rank distributions on the computational grid

The experiments were conducted on the Intel Endeavor cluster with two-socket nodes of Xeon E5 v2 processors (Intel microarchitecture code name Ivy Town), providing 24 physical cores per node. Intel Hyper-Threading Technology was enabled on the cluster nodes; however, in our case of pure MPI runs, we used no more than 24 ranks per node, mapping one MPI rank to one physical core. We investigate our Poisson solver on a square 3200x3200 computational grid that is large enough to run into bandwidth limitations. The bandwidth limitations can be remedied by undersubscribing computational nodes, by tuning hardware and software prefetchers, by algorithmic changes, and by other means. The 3200x3200 grid points can be distributed to MPI ranks using a 2D process grid, e.g., in the case of 4 ranks, one can use 2 rows x 2 columns of processes, or a 1D distribution with 4 rows x 1 column or 1 row x 4 columns (Figure 2).

Figure 2: Mapping of the computational grid onto MPI processes using a 2D (2x2) and a 1D (1x4) process grid, resulting in 1600x1600 and 3200x800 local grid points per MPI rank, respectively.

According to the proposed methodology, the first stage is Global Analysis, and we start it with a scaling investigation by running the application with different numbers of processes (p), recording the timings T[p], and then measuring the speedup defined as S[p] = T[1]/T[p]. The complementary metric, parallel efficiency, defined as E[p] = S[p]/p, will be used later as well. The speedup curves for the 2D quadratic and 1D process grids (1D 1xN and 1D Nx1) show some differences in scaling (Figure 3). Indeed, the 48x32 = 1536 process grid delivers a speedup of 284, and the 1536x1 process grid gives a 288x speedup, but the 1x1536 process decomposition only produces a 144x scaling (see Appendix A for the cluster configuration used in these experiments).

Figure 3: Speedup for 2D and 1D process grids (benchmark configuration in Appendix A). Note the logarithmic scale of the Y axis. A single node contains 24 cores (IVT). The Ideal curve is simply speedup == number of ranks; a small additional dent in the Ideal curve results from the fact that the rank counts are not powers of 2. The 2D distribution uses a square process grid NxM with n == m; if this is not possible, the nearest approximation with n > m is chosen, e.g., for 384 ranks NxM == 24x16.

From the timing data it is not obvious whether the scaling degradation is due to MPI performance or to single-core compute performance. We can use the Intel Trace Analyzer and Collector function profile functionality (ITAC: Charts->Function Profile) to separate and further analyze the impact of different rank placements on inter-node and intra-node performance.

Before we do that, we may have a look at the message passing performance under the different process mappings.

Message Passing Profile

The Message Passing Profile is a display of various characteristics of message passing in a sender/receiver matrix that can be obtained through Charts->Message Profile. Because dealing with 1536 ranks generates a huge matrix, we may fuse all ranks of each node: Advanced->Process Aggregation->All Nodes. The diagonal then shows the intra-node performance characteristics, while the off-diagonal entries show the inter-node statistics. Without process aggregation, the diagonal would only be filled if messages were sent from a rank n to the same rank n, which is usually not a good idea. Several attributes may be displayed using Right click->Attribute to show. Most interesting is the attribute Average Transfer Rate, which displays the message passing rate including all waiting times, but other attributes such as the total volume (in MB) and the number of messages may also be of interest.

Figure 4: The Message Passing Profile for the 1536 = 48x32 rank process placement on 64 nodes. The rows and columns represent senders and receivers, respectively, with the squares color-coded (see the legend) by the chosen attribute and representing the messages between different sender/receiver pairs. In this case all ranks inside each node are fused, and the attribute is the Average Transfer Rate [MB/s]. The rates are quite low, which gives a first indication that we should investigate the message passing in more detail.

Figure 5: The Message Passing Profile for the 1536x1 rank 1D process placement on 64 nodes. The rates are much better compared to the 2D case. In particular, the intra-node communication has much better throughput in the 1D case, about 1.38 GB/s compared to about 30 MB/s for 2D.

By looking at the other attributes offered by the charts in Figure 4 and Figure 5, one can see that we have fewer but larger messages in the 1D case, which in turn leads to a higher average transfer rate; on the other hand, the 1D case also transfers a larger amount of data in total. These two MPI rank placements, 2D (48x32) and 1D (1536x1), whose message profiles are illustrated in Figure 4 and Figure 5, result in similar performance. While the simple 1D message passing pattern does not offer much potential for optimization, we may apply some optimizations to the 2D pattern, such as reordering the MPI_Isend, MPI_Irecv, and MPI_Waitall calls for several messages. There are other optimization ideas that can be worked out using Intel Trace Analyzer and Collector, and they are left for future work.

Application and compute part parallel efficiency

To shed light on the issues related to scalability at the cluster level, we first look at the MPI communication and compute performance breakdown of the total runtime, T[p] = T_comp[p] + T_mpi[p], which can be accessed through the Trace Analyzer's Function Profile (Intel Trace Analyzer displays the Function Profile Chart when opening a trace file) (Figure 6). The trace file for Trace Analyzer and Collector can be generated by adding the flag -trace to the Intel MPI mpirun or mpiexec.hydra command. The Trace Analyzer API was used to time just 100 of the 1653 iterations.

Figure 6: Intel Trace Analyzer and Collector Function Profile Chart for a 768-process run. VT_API is paused time; the timing is accumulated over ranks, and the Application time is T_comp. The column showing the average time per process can be added via right click and Function Profile Settings. A breakdown of MPI functions can be seen by right-clicking on Group MPI and selecting Ungroup MPI. This reveals that MPI_Allreduce and MPI_Waitall are the main hotspots in the MPI library.

Speedup and parallel efficiency can then be calculated and plotted for the compute time of the application separately (Figure 7). First, one can see that MPI time is insignificant up to 48

cores (the equivalent of two nodes). Above 96 ranks (4 nodes), the pure computational application performance also yields super-linear scaling. However, at the same data point, around 96 ranks, MPI time becomes the main reason for low efficiency.

Figure 7: Compute vs. total parallel efficiency as a function of the number of MPI ranks for the 2D distribution (benchmark configuration in Appendix A). Compute + MPI is the whole application. This plot is still part of the Global Analysis: it shows that the efficiency is determined by the computational efficiency for small rank counts and by the network overhead for more than 2 nodes. The MPI hotspot functions were also determined with the Function Profile.

To investigate why this is the case, we may need to look beyond the flat profile, since it is not clear whether the poor timings shown in calls to MPI routines are caused by slow network performance or by algorithmic inefficiencies causing unnecessary wait time; this is the first decision branch in the proposed methodology chart (Figure 1).

Interconnect time vs. Imbalance time

To understand the relative impact of MPI application imbalance vs. the interconnect (hardware and software stack) on application scalability (see the flow chart of the tuning methodology, Figure 1), we can start by employing the ideal network simulator (invoked through the Advanced->Idealization menu). This allows us to separate the network stack performance impact on total

MPI performance from algorithmic inefficiencies such as imbalance and dependencies. A simple network model for the transfer time as a function of the message volume V is T_trans[V] = L + (1/BW)*V, where the latency L is defined as the time needed to transfer a 0-byte message, and the bandwidth BW is the transfer rate for asymptotically large messages. The ideal network may be simulated by setting all transfer times to 0, which means L = 0 and BW = ∞. The process of ideal trace generation is automated in Intel Trace Analyzer and Collector and can be invoked through the Advanced->Idealization menu. The analyzer's imbalance diagrams (Advanced->Application Imbalance Diagram menu) can then be generated using the real and idealized traces. The imbalance diagrams are represented as stacked column charts for the different process distributions (Figure 8).

Figure 8: Poisson solver imbalance diagram for different process (rank) placements in the case of 1536 ranks (= 64 24-core nodes); benchmark configuration in Appendix A. The timings are looked up automatically by Intel Trace Analyzer and Collector from the original and simulated traces. The 2D distribution and the 1536x1 distribution look quite similar although their MPI exchange patterns are very different. It was expected that the compute performance of the 1x1536 distribution is much worse, because each row will only contain one or two elements. The traces were collected for a reduced run time of only 100 iterations; we suggest using the minimum number of iterations and ranks that still captures the main performance features of the application, to minimize analysis time. The Y axis is time in seconds summed over all ranks. This graph is not the original imbalance diagram delivered by Intel Trace Analyzer and Collector, because we wanted to combine three different experiments in a single plot.

In theory, the imbalance diagram can be dominated either by transfer times (the algorithm is balanced, but we have to improve the network performance by a different process placement or new network hardware) or by waiting times (the algorithm has to be revisited for better load balancing and removal of dependencies). In the case of the Poisson solver, the imbalance diagram (Figure 8) shows that the application suffers predominantly from high transfer times. Therefore, following the tuning methodology chart (Figure 1), the first decision after Global Analysis should be to concentrate on reducing the transfer times (interconnect) by MPI runtime tuning before investigating the imbalance.

Algorithmic Investigations

In the last chapter we determined that investigations related to imbalance are not the most efficient next step for this Poisson solver. However, to illustrate the Trace Analyzer and Collector capabilities, we briefly describe the root causes of the observed wait times, since each application is unique and the imbalance impact may prevail in other applications. One common reason behind imbalance issues is the inability to perfectly map the process grid onto the computational grid. For example, if we map 1536 processes onto a computational grid of 3200x3200 points, we have in the 2D case with the 48x32 process grid a local size of 3200/48 x 3200/32 points, which leads to 32x32 processes with 67x100 grid points and 16x32 processes with 66x100 grid points. The difference of less than 2% in the number of grid points may not be observable. In the 1D case with a 1536x1 process grid, we get 128 processes with 3x3200 grid points and 1408 processes with 2x3200 local grid points. The differences in run time are observable by clicking on the Trace Analyzer's Load Balance tab, available next to the Flat Profile tab (Figure 9).

Figure 9: Intel Trace Analyzer and Collector load balance information for each MPI process (the annotation marks one additional row of grid points for a process).

Another possible cause of load imbalance might be not algorithmic inefficiencies but a phenomenon called OS jitter (OS for Operating System). OS jitter describes OS events that can slow down compute performance: some processes may run slower for some iterations, and they will cause imbalance and MPI wait time. For applications running on thousands of nodes (a common scenario these days), this noise can become crucial, because a single event on a single process may slow down the whole application. The reduction of this noise by using a specialized minimal OS is a current HPC research topic (5).

The other source of MPI waiting time is dependency [1] in the MPI coding techniques used in the Poisson solver. A closer look at the MPI hotspots in the idealized trace file reveals that 85% of the wait time is due to MPI_Allreduce; the rest is due to MPI_Waitall. This shows that

[1] Dependency means that part B of an application can only be started when part A has been finished. Each message introduces a dependency, because the program on the receiver rank can only proceed after the message from part A has been sent and part B on the receiver rank has received it. If the receiver has already arrived at part B while part A has not yet sent the message, we see the receive routine in waiting mode. The time is, however, reported as plain MPI time, and only the idealization can tell whether it is waiting time (imbalance time).

practically all message passing dependencies have been removed. In a previous version of the Poisson solver, the exchange was programmed with blocking MPI_Send/MPI_Recv; that version clearly showed substantial wait time in MPI_Recv in the idealized trace file. Changing the algorithm to MPI_Isend/MPI_Irecv/MPI_Waitall successfully reduced these dependencies.

MPI Runtime tuning

This is the third phase of the methodology, and it is necessary when we observe high transfer times, as shown in the imbalance diagram. We can often improve MPI performance without changing the source code. This can be done by using Intel MPI environment variables or by changing the mapping of ranks to compute nodes. The process-to-node mapping can be altered by advanced mechanisms such as machine or configuration files, or by reordering the ranks inside a communicator. The MPI standard also contains support for Cartesian topologies, as described in chapter 4 of (6). For applications with high transfer times it is also beneficial to use faster communication hardware, in contrast to high wait times, which cannot be removed even by an ideal network. We may start the tuning by concentrating on global operations. The simple Poisson solver uses only MPI_Allreduce with an MPI_SUM operation for a global sum: just a single double precision (8-byte) value is summed over all processes. This is necessary for building the residual, which is a measure of the difference between the result arrays computed in two subsequent iterations; the solver iterations stop when the residual falls below a predefined threshold. It is always good advice to set the environment variable I_MPI_DEBUG to the integer value 5; this prints valuable information about the used variables, network fabrics, and process placement. Setting I_MPI_DEBUG to 6 will additionally reveal the default algorithms used for collective operations. The Intel MPI reference guide reveals that we can select among 8 different algorithms for MPI_Allreduce.
Some of these algorithms may not be appropriate for single 8-byte values, but we can simply test all of them for our application and find out whether a non-default value provides better performance. The algorithm can easily be changed by setting the environment variable I_MPI_ADJUST_ALLREDUCE to an integer value in the range 1-8, corresponding to the algorithms listed in the Intel MPI reference manual. A comparison of run times for each algorithm is shown in Figure 10.

Figure 10: Poisson speedup for the 8 different MPI_Allreduce algorithms (benchmark configuration in Appendix A). Algorithms 1 and 5 are quite close, but #5 delivers the best performance for 1536 ranks. The default is algorithm #1 (recursive doubling) for 1-48 processors and algorithm #5 (binomial gather + scatter) above that. We see that we can slightly optimize the performance for some processor counts by choosing algorithm #1.

Analyzing Intra-Node performance using Intel VTune Amplifier XE

Hotspot Analysis of MPI functions

VTune Amplifier was designed for single-node analysis, including threading. Many performance events can be read from the Performance Monitoring Unit (PMU) for a detailed analysis of Intel processor core and uncore behavior under a specific program. For a complete analysis of parallel programs, Intel Trace Analyzer and Collector alone is not sufficient, due to its primary focus on MPI performance. This becomes even more obvious when we start analyzing hybrid codes that combine parallel MPI processes with threading for a more efficient exploitation of the computing resources. We show first how to further analyze MPI hotspots with VTune Amplifier; then we measure the bandwidth in search of a better understanding of the efficiency curve plotted in Figure 7. For a VTune Amplifier hotspot analysis, we may launch the VTune Amplifier command line interface as the parallel MPI program distributed over N MPI ranks. The Poisson solver invocation comes as a

parameter to the amplxe-cl command line:

mpirun -n N amplxe-cl -result-dir hotspots_n -collect hotspots -- poisson.x

This command line runs poisson.x on the N ranks and produces for each rank a result directory containing a hotspot analysis for that rank; the result directory for rank m will be named hotspots_n.m. The hotspot analysis for a chosen rank is usually done by transferring the result directory, the executable, and the sources to a workstation for further inspection using the VTune Amplifier GUI. After unpacking the results on the workstation, we open the VTune Amplifier Bottom-up tab and select the Call Stack Mode "user functions +1". This will show MPI functions (prefixed with P for the profile version) in the VTune Amplifier GUI (Figure 11). From the Intel Trace Analyzer and Collector analysis we know which MPI functions are the hotspots, but we do not know which occurrence of the MPI_Waitall function actually has the largest contribution to the application runtime. By revealing call stack information, VTune Amplifier can point to the specific MPI_Waitall call dominating the runtime. This is a useful starting point for implementing code changes to improve application performance.

Figure 11: MPI functions in the VTune Amplifier GUI. One of the MPI_Waitall call stacks dominates with 57.7% of the runtime.

As a result we see that the last MPI_Waitall call stack (Appendix D, poisson.c, line 226, function call #1) dominates with 57.7% of the runtime. The second call stack (Appendix D, call #2) gets 34.6%, and the first call stack (Appendix D, #3) has only 7.7%. The corresponding source can be found in Appendix D. The reason is that the first exchange starts almost at the same time on all processes but generates an imbalance; the second call is slowed down because of this imbalance, and this gets worse in the last exchange, until all ranks are synchronized again by MPI_Allreduce and a new iteration starts. The corresponding Intel Trace Analyzer and Collector snapshot for a single iteration is depicted in Figure 12.

Figure 12: Intel Trace Analyzer and Collector snapshot for a single iteration of the Poisson solver, with call stacks #3, #2, and #1 marked. A single iteration could be detected by using more advanced user function instrumentation; for this simple implementation we just know that there is an MPI_Allreduce at the end of each iteration, so we zoom in after an Allreduce, including the following Allreduce. The numbering of the call stacks is done by VTune Amplifier XE; the first call stack is associated with the largest time fraction.

Such a VTune Amplifier hotspot analysis of the MPI hotspot functions may be conducted for all ranks, but usually we can begin with a single rank, at least in the case of homogeneous clusters (clusters consisting of Intel Xeon or Xeon Phi only, but not a mixture of both). Since the speedup curve for a single node (Figure 3) shows saturation on the node (24 cores per node),

we may anticipate some bandwidth saturation issues. Fortunately, VTune Amplifier XE provides a bandwidth analysis collection to verify this assumption.

Bandwidth Analysis

To use VTune Amplifier bandwidth analysis in conjunction with MPI, we can use the same trick as in the previous example for interposing VTune Amplifier into the MPI invocation of poisson.x, but with an added wrinkle to restrict VTune Amplifier to a single rank. Here is an example of a command to start 60 ranks as usual, with VTune Amplifier bandwidth analysis data being collected only for the first rank:

mpirun -n 1 amplxe-cl -start-paused -result-dir snb-bandwidth_60 -collect snb-bandwidth \
  -- poisson.x : -n 59 poisson.x

We have used the snb-bandwidth analysis type above, which in the current release of VTune Amplifier XE 2013 Update 16 also incorporates the bandwidth analysis for the microarchitecture code name Ivy Town (as follows from <vtune_installation_dir>/config/analysis_type/snb_bandwidth.cfg). Since bandwidth analysis employs hardware sampling collection, the SEP (sampling) driver must be installed on each node where data will be collected. We also had to disable the NMI watchdog to enable collection with hardware counters (see Appendix B). Since we are not interested in analyzing the MPI startup or the data initialization section of the application, we would like to collect VTune Amplifier data only for a specific time period, when the application runs its computational kernel. The -start-paused command line option starts the collection in paused mode; it was used in conjunction with the VTune Amplifier API functions __itt_resume() and __itt_pause() surrounding the computational kernel in the source code. It should be understood that even though we specified only one rank to invoke VTune Amplifier in the command above, that invocation will collect everything running on the node executing that rank.
When collecting event-based sampling data such as LLC (Last Level Cache) misses, those events can be linked to the appropriate process under which they occurred. In the case of a few MPI ranks on a single node, we will see the event-based data (LLC misses) divided up among these MPI processes in the VTune Amplifier GUI, as well as the total number of LLC misses. However, the bandwidth data are reported per memory channel and package (not per rank/process) and then summed up for the whole node. A summary of the collected bandwidth data is written to standard output after the collection is done, and it gets redirected by the LSF scheduler into the job report. Alternatively, one can use the command line tool to generate a summary report to standard output: amplxe-cl -R summary -r <results-directory>.

The total bandwidth, in GB/s, is reported separately for each package on the node; we subsequently summed up the reported bandwidths of both packages to obtain the total bandwidth for the node. By plotting the result of the bandwidth analysis and the parallel efficiency together as we scale out (Figure 13), we observe an inverse correlation between them. The bandwidth in our experiments saturates at about 87 GB/s. This is about 88% of the STREAM benchmark (7) bandwidth result of 98 GB/s, measured during the same job on the same node. The STREAM bandwidth result is in turn ~80% of the theoretical peak for the quad-channel RAM installed on the nodes used for this test.

Figure 13: Parallel efficiency vs. bandwidth (GB/s) on the first node as a function of the number of MPI ranks (benchmark configuration in Appendix A); 1-24 ranks are located on a single IVT node.

Summary

We present a methodology for performing analysis of HPC applications using Intel Trace Analyzer and Collector and VTune Amplifier. The methodology starts from the whole application (or the part of the program that should be tuned), followed by detailed analysis focusing on the communication patterns and single MPI routines. A key role is played by the Intel Trace Analyzer and Collector's idealizer, which can simulate program execution on an ideal network with infinitely fast communication but the same processor speed. The outcome of this simulation guides us to the next steps: algorithmic investigations or tuning of the MPI library. While algorithmic investigations may lead to code

changes, we can also choose to use Intel MPI environment variables or MPI process placement on specific physical cores to reduce the MPI communication runtime. Intel VTune Amplifier XE can be used for call stack analysis of hotspot MPI functions (MPI_Waitall in the Poisson case). VTune Amplifier based bandwidth analysis has been shown to be useful in finding performance bottlenecks of the Poisson solver application on an HPC cluster; it can clearly explain the reasons behind the scaling saturation on a single node.

References

1. Intel Trace Collector. Reference Guide. [Online] e/index.htm.
2. Intel Trace Analyzer. Reference Guide. [Online] e/index.htm.
3. Intel VTune Amplifier XE. [Online]
4. Intel Xeon Processor E v2 (30M Cache, 2.40 GHz). [Online] GHz.
5. OS Jitter Mitigation Techniques. [Online] d_4b40_9d82_446ebc23c550/page/os%20jitter%20mitigation%20techniques.
6. William Gropp, Ewing L. Lusk, Anthony Skjellum. Using MPI, 2nd Edition: Portable Parallel Programming with the Message Passing Interface. s.l.: The MIT Press.
7. STREAM benchmark. [Online]
8. Intel MPI Library. Reference Manual for Linux* OS. [Online] al/index.htm.
9. Intel Xeon Processor E5 v2 and E7 v2 Product Families. Uncore Performance Monitoring Reference Manual. [Online] v2-uncore-manual.pdf.

Appendix A Benchmark Environment

Intel Xeon E5 v2 processors (Ivy Town) with 12 cores. Frequency: 2.7 GHz.
2 processors per node (24 cores per node).
Mellanox QDR InfiniBand.
Operating system: RedHat EL 6.4.
Intel MPI.

Appendix B Disabling of the Non Maskable Interrupt in a cluster environment

The Non Maskable Interrupt (NMI) watchdog has to be disabled for VTune Amplifier to function properly. NMI can be used in the Linux kernel to periodically detect whether a CPU is locked up. However, the NMI watchdog needs to use a hardware performance counter, so other performance tools, including VTune Amplifier, cannot use PMU event-based sampling data collection while it is active.

To permanently disable the nmi_watchdog interrupt:
1. Under the root account, edit /boot/grub/grub.conf by adding the nmi_watchdog=0 parameter to the kernel line so that it looks like: /boot/vmlinuz el6.x86_64 ro root=/dev/sda8 panic=60 nmi_watchdog=0
2. Reboot the system.
3. After rebooting, enter the following command to verify whether nmi_watchdog is disabled: grep NMI /proc/interrupts. If you see zeroes, nmi_watchdog is successfully disabled.

To temporarily disable the nmi_watchdog interrupt, enter: echo 0 > /proc/sys/kernel/nmi_watchdog

On the Endeavor cluster, disabling of the NMI interrupt is implemented through setting a run-time variable NMI_WATCHDOG=OFF. This runtime variable defines a new behavior in the modified LSF job manager, where VTune Amplifier is treated as a resource enabled by a user request in the cluster manager prologue. In the LSF prologue, this resource gets cleaned up once the job is done.

Appendix C Detailed Methodology

The methodology consists of four main phases:

1. Global analysis of the whole application, which gives first indications of performance issues and can be subdivided into:
   a. Run time and scaling analysis
   b. Message passing performance analysis on an inter-/intra-node level, including finding MPI hotspots
   c. Network Idealization, which yields an imbalance diagram providing guidance on how to proceed: either to phase 2 below if significant wait time is found, or, in case of high transfer times, directly to phase 3, skipping phase 2
2. Algorithmic investigation: source code changes to implement better message passing practices or improve the load balance of the application by:
   a. Fixing imbalances in communication patterns of MPI and non-MPI routines. For example, slow sequential I/O often causes imbalances.
   b. Removing unnecessary synchronization. For example, message passing patterns using blocking send and receive may cause a send/receive order that increases wait times. This may be resolved by using non-blocking MPI_Isend/MPI_Irecv pairs.
3. MPI run-time tuning: Intel MPI can be tuned without changing the source code using:
   a. Environment variables for tuning collective operations, e.g., I_MPI_ADJUST_ALLREDUCE
   b. Environment variables for changing the message passing characteristics, e.g., I_MPI_DAPL_DIRECT_COPY_THRESHOLD
   c. Changing the MPI process/rank-to-node mapping for a better inter-/intra-node communication balance
4. Single process/node tuning, which is necessary for serial performance optimizations. Furthermore, single node tuning is important for improving overall application scalability and reducing load imbalance.
   a. We suggest conducting a hotspot analysis for each rank, or for the critical ranks identified in phases 1 and 2. The call stack information for a specific MPI routine may also help refine the analysis in phase 1(b).
   b. Bandwidth analysis on the node is important for understanding deficiencies in cluster-level scaling. This technique is used in the paper to explain the drop in the parallel efficiency curve of our Poisson solver.

After each tuning step the analysis can be repeated, starting with phase 1. At least steps 1(a-c) should be conducted again to get new guidance for the next tuning actions.
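As a sketch of phase 3, the settings below illustrate the kind of run-time tuning Intel MPI allows without source changes. The variable values, the rank count, and the ./poisson binary name are illustrative and should be chosen from your own measurements:

```shell
# 3a: select a specific MPI_Allreduce algorithm; the algorithm
# numbers are listed in the Intel MPI Library Reference Manual.
export I_MPI_ADJUST_ALLREDUCE=1

# 3b: change message passing characteristics, e.g. the DAPL
# direct-copy threshold separating the eager and rendezvous paths.
export I_MPI_DAPL_DIRECT_COPY_THRESHOLD=65536

# 3c: pin ranks to specific physical cores for a better
# inter-/intra-node communication balance.
export I_MPI_PIN_PROCESSOR_LIST=0-11

mpirun -n 24 ./poisson
```

Because each variable affects a different mechanism, change one at a time and re-run the phase 1 analysis to see whether the message passing profile actually improved.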

Appendix D Compute Part Source Code

The following source code shows the iteration loop. The enumeration of the exchange function calls (#1 through #3) corresponds to their weight in the MPI functions hotspot analysis (the hottest function is marked as #1 below and in the text surrounding Figure 11). The annotations in the listing are:

Iteration index it
ITAC API cutting iterations 100 to 200
Copy array for later residuum calculation
CALL #3: exchange routine at line 205, contains MPI_Waitall
Update of red points
CALL #2: exchange routine at line 216
Update of black points
CALL #1: exchange routine at line 226
Residuum calculation, contains MPI_Allreduce
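A trace exposing this call ordering can be gathered with Intel Trace Collector and examined in Intel Trace Analyzer along these lines; the compiler driver, source and binary names, and rank count are illustrative. Restricting collection to iterations 100 to 200 is done in the source with the ITAC trace-control API (VT_traceoff/VT_traceon around the loop):

```shell
# Build with tracing support and run; the run writes an .stf
# trace file that Intel Trace Analyzer can open.
mpiifort -g -trace -o poisson poisson.f90
mpirun -n 24 ./poisson
traceanalyzer poisson.stf
```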

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Cilk Plus, and Intel VTune are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.


white paper Capacity and Scaling of Microsoft Terminal Server on the Unisys ES7000/600 Unisys Systems & Technology Modeling and Measurement white paper Capacity and Scaling of Microsoft Terminal Server on the Unisys ES7000/600 Unisys Systems & Technology Modeling and Measurement 2 This technical white paper has been written for IT professionals

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

ECLIPSE Performance Benchmarks and Profiling. January 2009

ECLIPSE Performance Benchmarks and Profiling. January 2009 ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster

More information

High Performance Tier Implementation Guideline

High Performance Tier Implementation Guideline High Performance Tier Implementation Guideline A Dell Technical White Paper PowerVault MD32 and MD32i Storage Arrays THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

MAGENTO HOSTING Progressive Server Performance Improvements

MAGENTO HOSTING Progressive Server Performance Improvements MAGENTO HOSTING Progressive Server Performance Improvements Simple Helix, LLC 4092 Memorial Parkway Ste 202 Huntsville, AL 35802 sales@simplehelix.com 1.866.963.0424 www.simplehelix.com 2 Table of Contents

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

Interactive comment on A parallelization scheme to simulate reactive transport in the subsurface environment with OGS#IPhreeqc by W. He et al.

Interactive comment on A parallelization scheme to simulate reactive transport in the subsurface environment with OGS#IPhreeqc by W. He et al. Geosci. Model Dev. Discuss., 8, C1166 C1176, 2015 www.geosci-model-dev-discuss.net/8/c1166/2015/ Author(s) 2015. This work is distributed under the Creative Commons Attribute 3.0 License. Geoscientific

More information

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems Applied Technology Abstract By migrating VMware virtual machines from one physical environment to another, VMware VMotion can

More information

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014 VMware SAN Backup Using VMware vsphere Table of Contents Introduction.... 3 vsphere Architectural Overview... 4 SAN Backup

More information

Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces

Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces Measuring MPI Send and Receive Overhead and Application Availability in High Performance Network Interfaces Douglas Doerfler and Ron Brightwell Center for Computation, Computers, Information and Math Sandia

More information

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB bankmark UG (haftungsbeschränkt) Bahnhofstraße 1 9432 Passau Germany www.bankmark.de info@bankmark.de T +49 851 25 49 49 F +49 851 25 49 499 NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB,

More information

Accelerating Server Storage Performance on Lenovo ThinkServer

Accelerating Server Storage Performance on Lenovo ThinkServer Accelerating Server Storage Performance on Lenovo ThinkServer Lenovo Enterprise Product Group April 214 Copyright Lenovo 214 LENOVO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER

More information

Application of Predictive Analytics for Better Alignment of Business and IT

Application of Predictive Analytics for Better Alignment of Business and IT Application of Predictive Analytics for Better Alignment of Business and IT Boris Zibitsker, PhD bzibitsker@beznext.com July 25, 2014 Big Data Summit - Riga, Latvia About the Presenter Boris Zibitsker

More information

Advanced Memory and Storage Considerations for Provisioning Services

Advanced Memory and Storage Considerations for Provisioning Services Advanced Memory and Storage Considerations for Provisioning Services www.citrix.com Contents Introduction... 1 Understanding How Windows Handles Memory... 1 Windows System Cache... 1 Sizing Memory for

More information

Sun Constellation System: The Open Petascale Computing Architecture

Sun Constellation System: The Open Petascale Computing Architecture CAS2K7 13 September, 2007 Sun Constellation System: The Open Petascale Computing Architecture John Fragalla Senior HPC Technical Specialist Global Systems Practice Sun Microsystems, Inc. 25 Years of Technical

More information

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons

More information