Auto-Tuning of Data Communication on Heterogeneous Systems


Auto-Tuning of Data Communication on Heterogeneous Systems

Marc Jordà 1, Ivan Tanasic 1, Javier Cabezas 1, Lluís Vilanova 1, Isaac Gelado 1, and Nacho Navarro 1,2
1 Barcelona Supercomputing Center
2 Universitat Politècnica de Catalunya
{first.last}@bsc.es

Abstract — Heterogeneous systems formed by traditional CPUs and compute accelerators, such as GPUs, are becoming widely used to build modern supercomputers. However, many different system topologies, i.e., ways in which CPUs, accelerators, and I/O devices are interconnected, are being deployed. Each system organization presents different trade-offs when transferring data between CPUs, accelerators, and nodes within a cluster, and each requires a different software implementation to achieve optimal data communication bandwidth. Hence, there is great potential for auto-tuning applications to match the constraints of the system where the code is being executed. In this paper we explore the potential impact of two key optimizations to achieve optimal data transfer bandwidth: topology-aware process placement policies, and double-buffering. We design a set of experiments to evaluate all possible alternatives, and run each of them on different hardware configurations. We show that the optimal data transfer mechanism depends not only on the hardware topology, but also on the application dataset. Our experimental results show that auto-tuning applications to match the hardware topology, and using double-buffering for large data transfers, can improve the data transfer bandwidth of local communication. We also show that double-buffering large data transfers is key to achieving optimal bandwidth on remote communication for data transfers larger than 128KB, while topology-aware policies produce minimal benefits.

I. INTRODUCTION

Heterogeneous computing systems that couple general-purpose CPUs and massively parallel accelerators are becoming commonplace in High Performance Computing (HPC) environments. For example, Titan, the #1 supercomputer in the Top500 list [1] as of November 2012, is formed by 18,688 nodes that each integrate one AMD Opteron CPU and one NVIDIA K20x GPU. This architecture enables applications to benefit both from the very short single-threaded latency of CPUs, key to achieve performance in control-intensive application phases, and from the high multi-threaded throughput of GPUs, which boosts the performance of application phases where massive amounts of data parallelism exist. Besides the performance benefits, heterogeneous architectures also provide higher energy efficiency than traditional homogeneous systems. For instance, the top four supercomputers in the Green500 list [2], as of November 2012, are heterogeneous systems where each node includes one or several massively parallel compute accelerators (i.e., Intel Xeon Phi or AMD and NVIDIA GPUs).

Most programming models for heterogeneous systems, such as NVIDIA CUDA [3] and OpenCL [4], assume that the code running on the CPU (i.e., host) and on the accelerator (i.e., device) has access to separate virtual address spaces. Therefore, any data communication between the CPU and accelerators, or between accelerators, requires explicit data transfer calls (e.g., cudaMemcpy in CUDA, and clEnqueueCopyBuffer in OpenCL) to copy data across address spaces. The optimal implementation of these data transfer calls heavily depends on the underlying hardware organization. Hence, the behaviour of applications has to be adapted to the hardware topology where they are being executed.
Often, the design of a heterogeneous node includes one multi-core CPU chip connected to one accelerator through a PCIe bus, where each chip is connected to a separate physical memory. In this design, each memory is only accessible from the device physically connected to it, so Direct Memory Access (DMA) copies (i.e., cudaMemcpy in CUDA) are used to move data between the CPU and the accelerator. A similar organization attaches several accelerators to the same PCIe bus. In this case, each accelerator is typically attached to its own physical memory, and it is possible to transfer data across accelerators using DMA transfers without intervention of the CPU. Another approach to integrate several accelerators in the same node is to build a Symmetric Multi-Processor (SMP) system by connecting several CPU chips through a cache-coherent fabric (e.g., HyperTransport or QuickPath Interconnect), and attaching one accelerator to each CPU chip through separate PCIe buses. In this scenario, data communication across accelerators requires the intervention of the CPU, since DMA transfers cannot interface with the cache-coherent interconnect.

Besides the effects due to the hardware topology, the performance of data transfers also depends on the software implementation. A widespread technique to hide the cost of data transfers is double-buffering. However, this technique requires extra synchronization points that might actually harm the overall performance. Similarly, the performance of data transfers might be boosted by using pre-pinned memory in user applications, which removes the need for intermediate copies from user buffers to DMA buffers. However, pre-pinned memory is a scarce resource which cannot be used for very large data sets. Hence, the amount of pre-pinned memory used by each application has to be carefully chosen to balance the data transfer bandwidth and the overall system performance.
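As a concrete illustration of the difference, the following minimal sketch (our own example, not code from the paper; buffer sizes and variable names are placeholders) contrasts an explicit host-to-device transfer from a regular pageable buffer with one from a pre-pinned buffer, which the DMA engine can read directly and which allows the copy to be issued asynchronously.

/* A hypothetical microbenchmark sketch: pageable vs. pre-pinned host buffers. */
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 32 << 20;      /* 32 MB, an arbitrary example size */
    void *pageable = malloc(bytes);     /* regular user allocation */
    void *pinned, *dev;
    cudaStream_t stream;

    cudaMalloc(&dev, bytes);
    cudaStreamCreate(&stream);

    /* Pageable source: the runtime stages the data through an internal
       pinned buffer before the DMA engine can move it to the device. */
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

    /* Pre-pinned source: the DMA engine reads the user buffer directly and
       the copy can overlap with other work queued on the stream. */
    cudaMallocHost(&pinned, bytes);
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
    cudaFree(dev);
    free(pageable);
    return 0;
}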

Figure 1: GPU architectures. (a) Discrete GPU. (b) Integrated GPU.

The large number of design choices, and the strong dependence of CPU-accelerator data transfers on the hardware, make this code an ideal candidate for auto-tuning. In this paper we study the amenability of auto-tuning techniques for implementing data transfers in heterogeneous systems, by analyzing the performance characteristics of different data transfer techniques on different heterogeneous architectures. Our experimental results show that auto-tuning can achieve significant speed-ups on data transfers. The first contribution of this paper is a description of different multi-GPU systems and the techniques needed to achieve the highest data communication throughput. The second contribution of this paper is a comprehensive set of benchmarks to measure the performance of the most common data communication patterns in HPC codes. They allow us to determine the best resource allocation strategy and the best transfer plan for different GPU topologies.

This paper is organized as follows. Section II presents the necessary background on heterogeneous system topologies and data transfer mechanisms. We present our experimental systems and methodology in Section III. The optimization opportunities of implementing topology-aware process placement are investigated in Section IV, while the potential performance improvements of data transfers when using double-buffering are presented in Section V. Section VI concludes this paper.

II. BACKGROUND AND MOTIVATION

In this section we introduce different GPU-based system organizations and how current programming models expose their different characteristics.

A. GPU-based systems

The first programmable GPUs (e.g., the NVIDIA G80) appeared in the form of discrete devices attached to the system through an I/O bus like PCIe (Figure 1a). A discrete GPU has its own high-bandwidth RAM memory (e.g., GDDR5). Data must be copied from host (i.e., CPU) memory to the GPU memory before any code on the GPU accesses it, and results must be copied back to be accessed by the CPU code. Although last-generation I/O buses deliver up to 16 GBps (PCIe 3.0 [5]), CPU-GPU transfers can easily become a bottleneck. Later, designs that integrate the CPU and GPU in the same die have been proposed (e.g., NVIDIA ION, AMD Fusion, Intel Sandy/Ivy Bridge). In these designs, CPUs and GPUs share the same physical memory (Figure 1b), thus eliminating the need for data transfers. However, the memory bandwidth delivered by the host memory is an order of magnitude lower than that of the memories used in discrete GPUs and, therefore, integrated GPUs are not used in HPC environments. Consequently, in this paper we focus on systems with discrete GPUs.

HPC systems commonly include several GPUs in each node to further accelerate computations and handle large datasets. In these kinds of systems, data transfers between GPU memories are also possible. Typically, each CPU socket in the system has an I/O controller to (1) avoid remote accesses through inter-CPU interconnects (e.g., HyperTransport or QPI), and (2) deliver a higher aggregate bandwidth by providing independent interconnects for different groups of GPUs. An I/O controller contains a PCIe root complex that manages a tree of devices. Devices are connected to a bus, and buses can be connected to the PCIe root complex or to a PCIe switch.
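To make these topology constraints concrete, the short sketch below (our illustration, not part of the paper) enumerates the GPUs of a node, prints their PCIe bus IDs, and queries which pairs support direct peer-to-peer copies; pairs of devices placed under different root complexes typically report that staging through the host is required.

/* A hypothetical topology-discovery sketch using the CUDA runtime API. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n;
    char bus[32];

    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceGetPCIBusId(bus, sizeof(bus), i);
        printf("GPU %d at PCI %s\n", i, bus);
    }
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int p2p = 0;
            cudaDeviceCanAccessPeer(&p2p, i, j);
            printf("GPU %d -> GPU %d: %s\n", i, j,
                   p2p ? "peer-to-peer" : "staged through host");
        }
    return 0;
}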
Communication between devices connected to different PCIe root complexes (e.g., in a multi-socket system) is not currently supported (due to limitations in the CPU interconnect), and intermediate copies to host memory are needed. Figure 2a shows a single-socket system in which all GPUs are connected to the same PCIe bus. In this system, all memory transfers go through the PCIe root complex found in the I/O controller, which may become a bottleneck. Figure 2b shows a more complex topology that uses two PCIe switches. While CPU-GPU transfers still suffer from the same contention problems, GPU-GPU transfers can greatly benefit, since transfers between different pairs of GPUs can proceed in parallel. Figure 2c shows a dual-socket system with one I/O controller per socket. In this configuration, two CPU-GPU memory transfers can execute in parallel (if the GPU involved in each transfer is connected to a different I/O controller and the source/destination address in host memory is located in a memory module connected to the same CPU socket). However, the NUMA (Non-Uniform Memory Access) nature of the system makes the management of memory allocations used in data transfers more difficult [6].

Traditionally, the communication of GPUs with I/O devices (disks, network interfaces, etc.) has been managed by privileged CPU code (e.g., the Operating System) that orchestrates the necessary memory transfers across the memories in the system. The simplest implementation is to make an intermediate copy to host memory and perform the communication like regular CPU code. This guarantees compatibility with all device drivers at the cost of extra memory transfers. NVIDIA introduced the GPUDirect [7] technology to reuse host memory buffers across different device drivers and avoid unnecessary copies; it was mainly adopted by Infiniband network interface vendors. The second version of the technology allows GPUs to directly communicate with devices connected to the same PCIe root complex, thus completely eliminating intermediate copies to host memory.
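The following minimal sketch (our own, not code from the paper; device numbers and sizes are placeholders) shows how such a direct GPU-to-GPU copy is typically expressed: peer access is enabled when the topology allows it, and the commonly documented behavior is that the copy otherwise falls back to staging through host memory.

/* A hypothetical GPU-to-GPU copy helper using CUDA peer-to-peer transfers. */
#include <cuda_runtime.h>

void gpu_to_gpu(void *dst, int dst_dev, const void *src, int src_dev,
                size_t bytes) {
    int can = 0;
    cudaDeviceCanAccessPeer(&can, dst_dev, src_dev);
    if (can) {
        /* Map the peer's memory so the copy can go straight over PCIe. */
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);
    }
    /* With peer access enabled this is a single DMA transfer between the
       GPUs; without it the runtime stages the data through host memory. */
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
    cudaDeviceSynchronize();
}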

Figure 2: Multi-GPU system topologies. (a) System with two GPUs. (b) System with four GPUs. (c) System with four GPUs in a NUMA system.

Figure 3: Communication between GPUs and I/O devices. (a) Disk storage. (b) Network interface.

B. Programming multi-GPU systems

Programming models for GPUs provide a host programming interface to manage all the memories in the system. CUDA [3] allows programmers to allocate memory in any GPU in the system and provides functions to copy data between them. Since copies between host and GPU memories are performed using DMA transfers, pinned host memory buffers must be used. This ensures that pages are not swapped out to disk by the Operating System during the transfer. Memory transfers that use regular (non-pinned) user buffers are internally copied to pinned memory buffers by the runtime. However, the CUDA host API allows programmers to manage pinned host buffers to avoid these intermediate copies. OpenCL [4] uses a higher-level model based on contexts. Users can create a context associated to a group of devices, and memory is allocated on a per-context basis. However, the memory is allocated in the memory of a specific GPU lazily, when the first operation (e.g., a data transfer or kernel execution) on the memory allocation is detected. Transfers across contexts are not allowed and, in that case, intermediate copies to host memory are needed. Furthermore, OpenCL does not provide a standard mechanism to allocate pinned memory buffers, and each vendor relies on different combinations of flags.

Both CUDA (since version 4) and OpenCL allow programmers to manage all GPUs in the system from a single CPU thread. However, many HPC programs have already been ported to multi-threaded schemes using pthreads (POSIX threads [8]) or OpenMP [9] to take advantage of multiple CPUs. When porting these codes to GPUs, each thread manages one GPU and inter-thread synchronization mechanisms are used to ensure the correctness of the computations. Applications that run on multi-node systems usually use the MPI (Message Passing Interface [10]) programming model to communicate across nodes. MPI functions take regular pointers as input/output buffers and, therefore, transferring data across GPUs in different nodes requires intermediate copies to host memory. However, the latest versions of some MPI implementations are able to identify GPU allocations (thus removing the need for explicit memory transfers to/from host buffers) and to take advantage of the GPUDirect technology. If the communicating GPUs reside in the same node, direct GPU-to-GPU transfers can be performed. On the other hand, when data is transferred through the network, it can be copied directly from GPU memory to the NIC memory if they reside in the same PCIe root complex.

III. EXPERIMENTAL METHODOLOGY

A. Experiments

We design experiments to measure the impact of locality and double-buffering on the performance of data communication across CPUs, GPUs, and I/O devices. We cover three possible scenarios: communication inside a node (intra-node), communication between nodes (inter-node), and I/O performance. Communication inside the node can happen between the CPU (host) and a GPU, or between two GPUs. For transfers between host and GPU memories, either standard memory allocations or pinned memory allocations can be used on the host side. In the case of pinned memory, asynchronous copies can be performed, which gives the opportunity to overlap the transfer with kernel execution or with other transfers. A common use case found in applications is transferring the inputs of the next computation iteration to the GPU while transferring the results of the previous one to the CPU. We label this as a two-way transfer in our experiments.
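A minimal sketch of such a two-way transfer (our assumption of how the microbenchmark could look, not the authors' code; the 64 MB size is arbitrary) issues the host-to-device and device-to-host copies on two different streams so that the hardware can run them concurrently.

/* A hypothetical two-way transfer sketch: concurrent H2D and D2H copies. */
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 64 << 20;            /* 64 MB per direction */
    void *h_in, *h_out, *d_in, *d_out;
    cudaStream_t up, down;

    cudaMallocHost(&h_in, bytes);             /* pinned host buffers so the */
    cudaMallocHost(&h_out, bytes);            /* copies can be asynchronous */
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    /* The next iteration's input goes up while the previous result comes down. */
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);
    cudaStreamSynchronize(up);
    cudaStreamSynchronize(down);

    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}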
We pin CPU threads, forcing the allocation of host memory buffers to the closest memory partition, and measure the throughput of data transfers between a CPU and a GPU connected to the same PCIe hub, and between a CPU and a GPU connected to different I/O hubs.
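A minimal sketch of this pinning step (assuming Linux, a first-touch NUMA allocation policy, and machine-specific core numbering) could look as follows.

/* A hypothetical helper that pins the calling thread before allocating. */
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <stdlib.h>

void *alloc_near_socket(int first_cpu_of_socket, size_t bytes) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(first_cpu_of_socket, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* pin the calling thread */

    void *buf = malloc(bytes);
    /* First touch maps the pages in the memory partition local to the
       pinned thread, so later DMA transfers use the closest memory. */
    memset(buf, 0, bytes);
    return buf;
}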

Another possible type of communication inside the node is transfers between GPUs. Besides transfers from one GPU to the other (one-way transfers), we also measure the throughput of concurrent bidirectional transfers between two GPUs, to simulate the pattern commonly found in multi-GPU applications, such as the halo exchange in stencil codes. The GPU-to-GPU experiments are performed using two GPUs sharing the same PCIe bus and using two GPUs installed in different PCIe buses, to quantify the advantage of peer-to-peer data transfers, when available, over transfers requiring intermediate copies to host memory. With the MPI standard being extensively used to develop cluster-level applications, MPI programs are usually ported to GPU clusters by spawning one process per GPU and performing the computation, or a part of it, on the GPU. For that reason we also measure the performance of intra-node GPU-to-GPU copies when each GPU is managed by a different MPI process and the transfer is performed using MPI primitives instead of CUDA copies directly. To test the recently added GPU-awareness of some MPI implementations, we use two different versions of the OpenMPI library [11] (1.4.5 and 1.7) to perform the inter-node transfers. The latter is GPU-aware and allows the direct use of pointers to GPU memory allocations when calling MPI's data transfer functions.

Communication between nodes (inter-node) is performed over the network. In our experiments, we focus on MPI over an Infiniband network as the most representative case in multi-node GPU machines. We study the performance of transfers between GPUs in different nodes and between the CPU of one node and a GPU of the other.

Many applications do file I/O, either for the input data and results or as a backing store for big datasets. We measure the performance of transfers between a GPU and the disk for both traditional Hard Disk Drives (HDD) and Solid State Drives (SSD). To avoid the interference of the Linux operating system's file caches in our throughput measurements, we use the O_DIRECT flag of the POSIX open function to ensure that data is actually read from and written to the disk instead of the cache.

Depending on the scenario, different types of transfers are possible. In experiments that transfer data between any CPU and GPU, or between two GPUs on the same PCIe bus (including transfers using the GPU-aware version of the OpenMPI library), the DMA engines of the GPU are in charge of transferring the data directly to the destination. Whenever these direct transfers are not possible, staging is performed through CPU memory, either by copying the whole memory structure to the host and then to the destination memory, or by performing pipelined transfers through the host (or hosts, in the case of inter-node transfers) using the double-buffering technique.
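For the disk experiments, the simple (non-pipelined) staged path can be sketched as below (our illustration, not the benchmark code; the file name, sizes, and error handling are assumed, and the transfer size is assumed to be block-aligned as O_DIRECT requires): the file is opened with O_DIRECT to bypass the page cache, read into a page-aligned host buffer that is pinned with cudaHostRegister, and then copied to GPU memory.

/* A hypothetical staged disk-to-GPU read that bypasses the OS file cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int read_file_to_gpu(const char *path, void *d_dst, size_t bytes) {
    void *h_buf;
    /* O_DIRECT requires the buffer, offset, and size to be block-aligned. */
    posix_memalign(&h_buf, 4096, bytes);
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);   /* pin it */

    int fd = open(path, O_RDONLY | O_DIRECT);
    ssize_t got = read(fd, h_buf, bytes);   /* read from the disk, not the cache */
    close(fd);

    if (got > 0)
        cudaMemcpy(d_dst, h_buf, (size_t)got, cudaMemcpyHostToDevice);

    cudaHostUnregister(h_buf);
    free(h_buf);
    return got == (ssize_t)bytes ? 0 : -1;
}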
B. Evaluation Systems

Table I shows the systems used in our evaluation.

Table I: System configurations

         System A                  System B                     System C
CPUs     2x Intel Xeon E56         2x Intel Xeon E5649          1x Intel Core i7
GPUs     4x NVIDIA Tesla C         2x NVIDIA Tesla M            4x NVIDIA Tesla C
Disk     WD HDD (SATA 3, 6 Gbps)   Micron SSD (SATA 3, 6 Gbps)  WD HDD (SATA 3, 6 Gbps)
DRAM     1333 MHz                  1333 MHz                     1333 MHz
NICs     n/a                       Mellanox QDR                 n/a

System A is a single-node NUMA machine with two quad-core Intel Xeon E56 CPUs running at 2.4 GHz and four NVIDIA Tesla C GPUs. The CPUs are interconnected with Intel QPI and each one is connected to an I/O hub shared by two GPUs (Figure 2c). The system has a Western Digital Caviar Black 7200 rpm hard disk that is connected to the I/O controller of one of the CPUs. Thus, GPUs connected to the same PCIe controller can directly communicate through peer-to-peer transfers, whereas communications across PCIe controllers must perform intermediate copies to host memory.

System B is a GPU cluster. Each node has two six-core Intel Xeon E5649 CPUs running at 2.53 GHz, also connected with a QPI bus. The main memory is split in two 12 GB modules, like in the first system. Each node has two NVIDIA Tesla M GPUs and two Mellanox Technologies MT26428 QDR Infiniband network adapters. The system has two PCIe 2.0 buses, each shared by one GPU and one network adapter, and connected to one of the CPUs (Figure 3b). Contrary to System A, this system uses an SSD disk, which is able to deliver much better throughput (within the limitations of the SATA 3 interface).

System C is a single-node machine with one quad-core Intel Core i7 CPU clocked at 3.6 GHz. The main difference of this system is that it has 4 GPUs that are connected in pairs to different PCIe switches. This allows peer-to-peer communication between all the GPUs in the system. The organization of this machine is similar to the one shown in Figure 2b.

IV. EVALUATION OF LOCALITY

This section presents the results of our experiments that measure the impact of locality-aware policies on the performance of data transfers. To compare the different locality configurations, we normalize the results to the best achieved bandwidth for each data size. This allows us to show the percentage of the optimal bandwidth achieved by each configuration.

A. Locality-aware Intra-node GPU Communication

First, we study how locality affects data transfers between a GPU and host memory. We measure the bandwidth of data transfers to/from a GPU connected to the PCIe hub local to the CPU, and a GPU connected to the PCIe hub of the other CPU (remote). We measure both single data transfers (one-way) and two concurrent transfers in opposite directions (two-way). Figure 4 shows the results obtained in systems A and B. System C is not considered for this test, because all its GPUs are connected to the same PCIe hub. Both systems follow a similar trend, where the impact of locality increases with the transfer size until all the configurations reach a steady state.

Figure 4: Optimality of different locality configurations when transferring data between a GPU and host memory (2-way: two concurrent transfers in opposite directions). (a) System A. (b) System B.

As expected, one-way local transfers are the best performing in both systems, while one-way remote transfers reach only a fraction of the optimal bandwidth for large transfer sizes, with the exact fraction depending on the direction of the copy. In System A, host-to-GPU copies are faster than the corresponding copies from GPU to host, whereas in System B GPU-to-host copies achieve better bandwidth. Even though the two systems have similar GPU and CPU models, they have different hardware characteristics that can favor one direction or the other. Two-way data transfers are the slowest, with the local ones obtaining up to 61% of the optimal bandwidth in System A, and 69% in System B. Remote two-way transfers reach an even smaller fraction of the optimal bandwidth.

Figure 5 shows the results obtained for intra-node GPU-to-GPU data transfers in systems A and C. Since System B has only two GPUs, it cannot be used to study the effects of locality for GPU-to-GPU transfers. In this case, the meaning of local and remote is not the same in both systems. In System A, local transfers are between GPUs installed in the same PCIe hub, and remote transfers are done using two GPUs connected to different PCIe hubs. Thus, in this system peer-to-peer copies are only available for local data transfers. In System C, all four GPUs are connected to the same PCIe hub, allowing peer-to-peer copies between all the GPUs. In this case, we consider transfers between two GPUs in the same PCIe bridge to be local, and transfers across bridges to be remote. Besides the GPU-to-GPU copy functions provided by CUDA, we also measure the bandwidth of our own GPU-to-GPU copy implementation, which uses a double-buffering technique through host memory.

The advantage of peer-to-peer copies is clear in both systems. In System A, the best bandwidth obtained by remote transfers is only 65% of the optimal one, because they have to be implemented as a two-step copy with intermediate staging in host memory. In System C, since all the transfers can use peer-to-peer copies, the differences in bandwidth due to locality are smaller than in the previous system. However, these results show that locality differences inside the same PCIe hub can still cause an overhead of up to 15%.

Most HPC applications are programmed using MPI to run on modern clusters. To model this scenario, we also measure how intra-node transfers perform when the GPUs are managed by different MPI processes. Figure 6 shows the percentage of the optimal bandwidth achieved for local MPI process communication in two different hardware configurations. In both configurations, locality-aware policies provide the optimal bandwidth when MPI processes are running in the same node. These experimental results show that if processes are not correctly placed (i.e., each MPI process does not use a CPU and GPU in the same PCIe domain), local MPI inter-process communication can be up to 75% slower than when topology-aware placement is used. The results in Figure 6 also show that ensuring MPI process locality provides large benefits for MPI inter-process communication even when each process is assigned to a GPU from a different PCIe domain.
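A topology-aware placement policy of this kind can be sketched as follows (our own illustration, not the benchmark code; it assumes a Linux sysfs layout, and the function and variable names are hypothetical): each GPU is mapped to the NUMA node of its PCIe bus, so an MPI process can pick the GPU attached to the socket it is running on.

/* A hypothetical topology-aware GPU selection helper. */
#include <stdio.h>
#include <cuda_runtime.h>

int gpu_numa_node(int dev) {
    char bus[32], path[128];
    int node = -1;

    cudaDeviceGetPCIBusId(bus, sizeof(bus), dev);
    for (char *p = bus; *p; ++p)                 /* sysfs uses lowercase ids */
        if (*p >= 'A' && *p <= 'F') *p += 'a' - 'A';
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bus);

    FILE *f = fopen(path, "r");
    if (f) {
        fscanf(f, "%d", &node);
        fclose(f);
    }
    return node;   /* -1 if unknown or on a single-socket system */
}

int pick_local_gpu(int my_numa_node) {
    int n, best = 0;
    cudaGetDeviceCount(&n);
    for (int d = 0; d < n; ++d)
        if (gpu_numa_node(d) == my_numa_node) { best = d; break; }
    cudaSetDevice(best);        /* use the GPU in the local PCIe domain */
    return best;
}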
There is a potential bandwidth penalty if MPI communication requires crossing PCIe domains several times on each data communication operation. These experimental results show the viability of auto-tuning applications to match the system topology where they are executed. Moreover, Figure 6 also shows the benefits of favoring MPI inter-process communication that happens within the same domain, which is the optimal case. This shows the importance of placing the MPI processes that most frequently exchange data in the same PCIe domain.

B. Locality-aware Disk I/O

We measure the performance of disk reads and disk writes in several configurations. Figure 7 shows the results for System A (which has an HDD) and System B (which has an SSD). System C is not studied because it can be treated as a subset of System A for these experiments. Each transfer is labeled with a pair of letters (L or R) that describes the locality of the tested transfer. All the transfers are staged through host memory, either by performing two whole-buffer transfers or through a pipelined transfer, and the fastest one is shown.

Figure 5: Optimality of different locality configurations for GPU-to-GPU data transfers (2-way: two concurrent transfers in opposite directions; DB: hand-written double-buffered implementation). (a) System A. (b) System C.

Figure 6: Optimality of different locality configurations for intra-node GPU-to-GPU data transfers using MPI (each GPU managed by a different MPI process). (a) System A, where a domain is a PCIe hub. (b) System C, where a domain is a PCIe bridge.

Figure 7: Optimality of different locality configurations for file reads and writes.

The first letter of the pair denotes the locality of the disk with respect to the CPU memory used for staging the transfers. The second letter denotes the locality of the GPU with respect to the CPU memory used for staging. For both file reads and writes, both systems show a small impact of disk locality on performance for reasonably big files. For smaller file writes, considerable variability is seen, due to the inability to isolate the disk traffic generated by our experiments from the traffic generated by operating system services. Larger writes, as well as reads, are close to the optimal bandwidth in both cases.

V. EVALUATION OF DOUBLE-BUFFERING OF DATA TRANSFERS

A. Intra-node GPU to GPU communication

Figure 8 shows the two-way transfers between remote GPUs in System B. The configuration labeled DB is our hand-written implementation of the double-buffered transfer, while the other uses the CUDA-provided call, which is also internally double-buffered, but with a fixed buffer size. Comparing the two methods, we conclude that for large transfer sizes the CUDA-provided implementation achieves from 85% to 95% of the auto-tuned transfer performance, while for small transfer sizes it achieves higher performance than our double-buffered implementation, due to the extra overhead of synchronization and the inefficiency of issuing small transfers. Figure 9 shows the efficiency of double-buffering with various buffer sizes for the largest data transfer we have measured. The best buffer size is large enough to hide the overheads of the extra function calls and synchronization, but small enough to hide the fact that the transfer is staged.

B. Disk I/O

The bandwidth provided by disks is much lower than that of the PCIe interconnect. However, our results show that using double-buffered transfers is still beneficial. Figure 10a shows how double-buffering performs against simple transfers on a Hard Disk Drive. The results show that double-buffering is slower for small transfers. After that inflection point, double-buffering delivers 10% or more additional throughput over simple transfers.
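The double-buffered staging pattern evaluated in this section can be sketched as below for the GPU-to-GPU case (a minimal sketch of the general technique under our own assumptions, not the authors' implementation; the chunk size and device numbers are parameters): two pinned host buffers are alternated so that the download of one chunk from the source GPU overlaps with the upload of the previous chunk to the destination GPU.

/* A hypothetical double-buffered GPU-to-GPU copy staged through host memory,
   for devices that cannot use peer-to-peer transfers. */
#include <stddef.h>
#include <cuda_runtime.h>

void staged_copy(void *dst, int dst_dev, const void *src, int src_dev,
                 size_t bytes, size_t chunk) {
    void *stage[2];
    cudaStream_t s_src, s_dst;
    cudaEvent_t d2h_done[2], h2d_done[2];

    cudaMallocHost(&stage[0], chunk);           /* pinned staging buffers */
    cudaMallocHost(&stage[1], chunk);
    cudaSetDevice(src_dev);
    cudaStreamCreate(&s_src);
    cudaEventCreateWithFlags(&d2h_done[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&d2h_done[1], cudaEventDisableTiming);
    cudaSetDevice(dst_dev);
    cudaStreamCreate(&s_dst);
    cudaEventCreateWithFlags(&h2d_done[0], cudaEventDisableTiming);
    cudaEventCreateWithFlags(&h2d_done[1], cudaEventDisableTiming);

    for (size_t off = 0, k = 0; off < bytes; off += chunk, ++k) {
        int b = k & 1;
        size_t n = (bytes - off < chunk) ? bytes - off : chunk;

        /* Download chunk k from the source GPU into staging buffer b,
           but only after the buffer's previous contents were uploaded. */
        cudaSetDevice(src_dev);
        if (k >= 2) cudaStreamWaitEvent(s_src, h2d_done[b], 0);
        cudaMemcpyAsync(stage[b], (const char *)src + off, n,
                        cudaMemcpyDeviceToHost, s_src);
        cudaEventRecord(d2h_done[b], s_src);

        /* Upload the same chunk to the destination GPU once the download
           has finished; this overlaps with the next chunk's download. */
        cudaSetDevice(dst_dev);
        cudaStreamWaitEvent(s_dst, d2h_done[b], 0);
        cudaMemcpyAsync((char *)dst + off, stage[b], n,
                        cudaMemcpyHostToDevice, s_dst);
        cudaEventRecord(h2d_done[b], s_dst);
    }
    cudaSetDevice(dst_dev);
    cudaStreamSynchronize(s_dst);
    cudaSetDevice(src_dev);
    cudaStreamSynchronize(s_src);
    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
}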

Figure 10: Optimality of different data transfer implementations for disk I/O transfers. (a) HDD (System A). (b) SSD (System B).

Figure 8: Efficiency of the auto-tuned and non-tuned double-buffered transfers between GPUs.

Figure 9: Efficiency of the double-buffered transfers between GPUs for different buffer sizes.

Tests on the SSD disk (Figure 10b) exhibit the same behavior. For large transfers, in both configurations, the double-buffered implementation yields much better performance than the naive transfer. The effect of the size of the buffers used in the implementation of the double-buffered transfer is shown in Figure 11a and Figure 11b. The results show the optimality of the implementation of a 128 MB memory transfer for different buffer sizes. The first figure shows that relatively small buffers are enough to achieve the maximum performance on the HDD disk. This is due to the low transfer rate of the disk. On the other hand, the SSD disk needs larger buffers due to its higher bandwidth compared to the HDD.

C. Inter-node GPU Communication

Figure 12a shows the performance of inter-process MPI communication for data transfers across CPU and GPU memories. These results show that topology-aware placement of MPI processes barely affects the performance of remote MPI data transfers. The results in Figure 12a also show the benefits of double-buffering remote communication; in all cases double-buffering provides the optimal bandwidth for data transfer sizes larger than 128KB. The only case where double-buffering performs worse than direct transfers is when communicating data smaller than 128KB from the network to the GPU memory, if both the network interface and the GPU are attached to the same PCIe bus. Figure 12b shows the performance of MPI communication for different buffer sizes when data transfers are double-buffered. These results show that the optimal buffer size in all cases is 2MB. When larger buffers are used, the performance of data transfers slowly decreases, while smaller buffers also provide sub-optimal bandwidth.

VI. CONCLUSIONS

In this paper we have analyzed the importance of auto-tuning applications to match the system topology and select the most adequate data transfer mechanism to achieve optimal bandwidth on data transfers between CPUs, GPUs, and nodes within a cluster. To evaluate the potential impact of each of these optimizations in different scenarios, we have developed and executed synthetic benchmarks that stress each data transfer mechanism under different circumstances. We have presented the experimental results of running these benchmarks on three different systems that represent most of the existing heterogeneous architectures currently used in HPC environments.

Figure 11: Optimality of different buffer sizes for a 128MB disk I/O transfer. (a) HDD (System A). (b) SSD (System B).

Figure 12: Performance of MPI transfers. (a) MPI transfer bandwidth. (b) Double-buffer performance for different buffer sizes.

Our experimental results highlight the benefits of auto-tuning applications to match the node topology. By ensuring that applications run on a CPU and GPU connected to the same PCIe bus, the data transfer bandwidth can be significantly improved. We have also shown that locality-aware policies have little impact when the communicating MPI processes run on different nodes. We have also shown that double-buffering is key to achieving optimal performance on inter-process communication, both when processes run on the same node and when they run on separate nodes. Double-buffering provides little performance improvement on GPU-to-GPU communication when direct data transfers across GPUs are supported by the hardware. Moreover, double-buffering only provides marginal gains when transferring data between GPUs and I/O devices such as disks, due to the low read/write bandwidth of existing I/O devices.

The results presented in this paper show that auto-tuning is key to optimize data transfers in existing heterogeneous HPC systems. Our experimental data also shows that simple auto-tuning policies, such as different code paths depending on the data transfer size, can provide impressive performance gains.

REFERENCES

[1] TOP500 list, November 2012. [Online]. Available: http://top500.org/list/2012/11/
[2] The Green500 list, November 2012 edition, 2012. [Online].
[3] CUDA C Programming Guide, NVIDIA, 2012.
[4] The OpenCL Specification, Khronos Group, 2009.
[5] PCI Express Base 3.0 Specification, PCI-SIG, 2010.
[6] V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W.-M. Hwu, "GPU clusters for high-performance computing," in Cluster Computing and Workshops, 2009 (CLUSTER '09), IEEE International Conference on, 2009.
[7] NVIDIA GPUDirect, NVIDIA. [Online]. Available: nvidia.com/gpudirect
[8] IEEE Standard for Information Technology - Portable Operating System Interface (POSIX) Base Specifications, Issue 7, IEEE Std 1003.1-2008, 2008.
[9] OpenMP Application Program Interface Version 3.1, OpenMP Architecture Review Board, 2011.

[10] MPI-3: A Message-Passing Interface Standard, Message Passing Interface Forum, 2012.
[11] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, "Open MPI: Goals, concept, and design of a next generation MPI implementation," in Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.


More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid

THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING. José Daniel García Sánchez ARCOS Group University Carlos III of Madrid THE EXPAND PARALLEL FILE SYSTEM A FILE SYSTEM FOR CLUSTER AND GRID COMPUTING José Daniel García Sánchez ARCOS Group University Carlos III of Madrid Contents 2 The ARCOS Group. Expand motivation. Expand

More information

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer kklemperer@blackboard.com Agenda Session Length:

More information

benchmarking Amazon EC2 for high-performance scientific computing

benchmarking Amazon EC2 for high-performance scientific computing Edward Walker benchmarking Amazon EC2 for high-performance scientific computing Edward Walker is a Research Scientist with the Texas Advanced Computing Center at the University of Texas at Austin. He received

More information

Pedraforca: ARM + GPU prototype

Pedraforca: ARM + GPU prototype www.bsc.es Pedraforca: ARM + GPU prototype Filippo Mantovani Workshop on exascale and PRACE prototypes Barcelona, 20 May 2014 Overview Goals: Test the performance, scalability, and energy efficiency of

More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

enabling Ultra-High Bandwidth Scalable SSDs with HLnand

enabling Ultra-High Bandwidth Scalable SSDs with HLnand www.hlnand.com enabling Ultra-High Bandwidth Scalable SSDs with HLnand May 2013 2 Enabling Ultra-High Bandwidth Scalable SSDs with HLNAND INTRODUCTION Solid State Drives (SSDs) are available in a wide

More information

HP SN1000E 16 Gb Fibre Channel HBA Evaluation

HP SN1000E 16 Gb Fibre Channel HBA Evaluation HP SN1000E 16 Gb Fibre Channel HBA Evaluation Evaluation report prepared under contract with Emulex Executive Summary The computing industry is experiencing an increasing demand for storage performance

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER

CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER CORRIGENDUM TO TENDER FOR HIGH PERFORMANCE SERVER Tender Notice No. 3/2014-15 dated 29.12.2014 (IIT/CE/ENQ/COM/HPC/2014-15/569) Tender Submission Deadline Last date for submission of sealed bids is extended

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

HPC Wales Skills Academy Course Catalogue 2015

HPC Wales Skills Academy Course Catalogue 2015 HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Advancing Applications Performance With InfiniBand

Advancing Applications Performance With InfiniBand Advancing Applications Performance With InfiniBand Pak Lui, Application Performance Manager September 12, 2013 Mellanox Overview Ticker: MLNX Leading provider of high-throughput, low-latency server and

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

Department of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012

Department of Computer Sciences University of Salzburg. HPC In The Cloud? Seminar aus Informatik SS 2011/2012. July 16, 2012 Department of Computer Sciences University of Salzburg HPC In The Cloud? Seminar aus Informatik SS 2011/2012 July 16, 2012 Michael Kleber, mkleber@cosy.sbg.ac.at Contents 1 Introduction...................................

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Solid State Storage in Massive Data Environments Erik Eyberg

Solid State Storage in Massive Data Environments Erik Eyberg Solid State Storage in Massive Data Environments Erik Eyberg Senior Analyst Texas Memory Systems, Inc. Agenda Taxonomy Performance Considerations Reliability Considerations Q&A Solid State Storage Taxonomy

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Towards Efficient GPU Sharing on Multicore Processors

Towards Efficient GPU Sharing on Multicore Processors Towards Efficient GPU Sharing on Multicore Processors Lingyuan Wang ECE Department George Washington University lennyhpc@gmail.com Miaoqing Huang CSCE Department University of Arkansas mqhuang@uark.edu

More information

ACANO SOLUTION VIRTUALIZED DEPLOYMENTS. White Paper. Simon Evans, Acano Chief Scientist

ACANO SOLUTION VIRTUALIZED DEPLOYMENTS. White Paper. Simon Evans, Acano Chief Scientist ACANO SOLUTION VIRTUALIZED DEPLOYMENTS White Paper Simon Evans, Acano Chief Scientist Updated April 2015 CONTENTS Introduction... 3 Host Requirements... 5 Sizing a VM... 6 Call Bridge VM... 7 Acano Edge

More information

Xeon+FPGA Platform for the Data Center

Xeon+FPGA Platform for the Data Center Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system

More information