Auto-Tuning of Data Communication on Heterogeneous Systems



Marc Jordà 1, Ivan Tanasic 1, Javier Cabezas 1, Lluís Vilanova 1, Isaac Gelado 1, and Nacho Navarro 1, 2
1 Barcelona Supercomputing Center   2 Universitat Politècnica de Catalunya
{first.last}@bsc.es

Abstract. Heterogeneous systems formed by traditional CPUs and compute accelerators, such as GPUs, are widely used to build modern supercomputers. However, many different system topologies, i.e., ways in which CPUs, accelerators, and I/O devices are interconnected, are being deployed. Each system organization presents different trade-offs when transferring data between CPUs, accelerators, and nodes within a cluster, and each requires a different software implementation to achieve optimal data communication bandwidth. Hence, there is great potential for auto-tuning applications to match the constraints of the system where the code is executed. In this paper we explore the potential impact of two key optimizations for achieving optimal data transfer bandwidth: topology-aware process placement policies, and double-buffering. We design a set of experiments to evaluate all possible alternatives, and run each of them on different hardware configurations. We show that optimal data transfer mechanisms depend not only on the hardware topology, but also on the application dataset. Our experimental results show that auto-tuning applications to match the hardware topology, and using double-buffering for large data transfers, can improve the data transfer bandwidth by up to % for local communication. We also show that double-buffering large data transfers is key to achieving optimal bandwidth for remote communication with transfers larger than 128 KB, while topology-aware policies produce minimal benefits there.

I. INTRODUCTION

Heterogeneous computing systems that couple general-purpose CPUs and massively parallel accelerators are becoming commonplace in High Performance Computing (HPC) environments.
For example, Titan, the #1 supercomputer in the Top500 list [1] as of November 2012, is formed by 18,688 nodes that each integrate one AMD Opteron CPU and one NVIDIA K20x GPU. This architecture enables applications to benefit both from the short single-threaded latency of CPUs, key to performance in control-intensive application phases, and from the high multi-threaded throughput of GPUs, which boosts the performance of application phases with massive amounts of data parallelism. Besides the performance benefits, heterogeneous architectures also provide higher energy efficiency than traditional homogeneous systems. For instance, the top four supercomputers in the Green500 list [2], as of November 2012, are heterogeneous systems where each node includes one or several massively parallel compute accelerators (i.e., Intel Xeon Phi, or AMD and NVIDIA GPUs). Most programming models for heterogeneous systems, such as NVIDIA CUDA [3] and OpenCL [4], assume that the code running on the CPU (i.e., host) and on the accelerator (i.e., device) has access to separate virtual address spaces. Therefore, any data communication between the CPU and accelerators, or between accelerators, requires explicit data transfer calls (e.g., cudaMemcpy in CUDA and clEnqueueCopyBuffer in OpenCL) to copy data across address spaces. The optimal implementation of these data transfer calls heavily depends on the underlying hardware organization. Hence, the behaviour of applications has to be adapted to the hardware topology where they are executed. Often, the design of a heterogeneous node includes one multi-core CPU chip connected to one accelerator through a PCIe bus, where each chip is connected to a separate physical memory. In this design, each memory is only accessible from the chip physically connected to it, so Direct Memory Access (DMA) copies (i.e., cudaMemcpy in CUDA) are used to move data between the CPU and the accelerator.
A similar organization attaches several accelerators to the same PCIe bus. In this case, each accelerator is typically attached to its own physical memory, and it is possible to transfer data across accelerators using DMA transfers without intervention of the CPU. Another approach to integrating several accelerators in the same node is to build a Symmetric Multi-Processor (SMP) system by connecting several CPU chips through a cache-coherent fabric (e.g., HyperTransport or QuickPath Interconnect), and attaching one accelerator to each CPU chip through separate PCIe buses. In this scenario, data communication across accelerators requires the intervention of the CPU, since DMA transfers cannot interface with the cache-coherent interconnect. Besides the effects of the hardware topology, the performance of data transfers also depends on the software implementation. A widespread technique to hide the cost of data transfers is double-buffering. However, this technique requires extra synchronization points that might actually harm overall performance. Similarly, the performance of data transfers can be boosted by using pre-pinned memory in user applications, which removes the need for intermediate copies from user buffers to DMA buffers. However, pre-pinned memory is a scarce resource that cannot be used for very large data sets. Hence, the amount of pre-pinned memory used by each application has to be carefully chosen to balance the data transfer bandwidth and the system performance.

Figure 1: GPU architectures. (a) Discrete GPU. (b) Integrated GPU.

The large number of design choices, and the dependence of CPU-accelerator data transfers on the hardware, make this code an ideal candidate for auto-tuning. In this paper we study the suitability of auto-tuning techniques for implementing data transfers in heterogeneous systems, by analyzing the performance characteristics of different data transfer techniques on different heterogeneous architectures. Our experimental results show that auto-tuning can achieve up to % speed-ups on data transfers. The first contribution of this paper is a description of different multi-GPU systems and the techniques needed to achieve the highest data communication throughput. The second contribution is a comprehensive set of benchmarks to measure the performance of the most common data communication patterns in HPC codes. They allow us to determine the best resource allocation strategy and the best transfer plan for different GPU topologies. This paper is organized as follows. Section II presents the necessary background on heterogeneous system topologies and data transfer mechanisms. We present our experimental systems and methodology in Section III. The optimization opportunities of topology-aware process placement are investigated in Section IV, while the potential performance improvements of double-buffered data transfers are presented in Section V. Section VI concludes this paper.

II. BACKGROUND AND MOTIVATION

In this Section we introduce different GPU-based system organizations and how current programming models expose their characteristics.

A. GPU-based systems

The first programmable GPUs (e.g., NVIDIA G) appeared in the form of discrete devices attached to the system through an I/O bus like PCIe (Figure 1a). A discrete GPU has its own high-bandwidth RAM memory (e.g., GDDR5).
Data must be copied from host (i.e., CPU) memory to GPU memory before any code on the GPU accesses it, and results must be copied back to be accessed by CPU code. Although the latest generation of I/O buses delivers up to 16 GBps (PCIe 3.0 [5]), CPU-GPU transfers can easily become a bottleneck. Later, designs that integrate the CPU and GPU in the same die were proposed (e.g., NVIDIA ION, AMD Fusion, Intel Sandy/Ivy Bridge). In these designs, CPUs and GPUs share the same physical memory (Figure 1b), thus eliminating the need for data transfers. However, the bandwidth delivered by host memory is an order of magnitude lower than that of the memories used in discrete GPUs ( GBps vs 0 GBps) and, therefore, integrated GPUs are not used in HPC environments. Hence, in this paper we focus on systems with discrete GPUs. HPC systems commonly include several GPUs in each node to further accelerate computations and handle large datasets. In these kinds of systems, data transfers between GPU memories are also possible. Typically, each CPU socket in the system has an I/O controller to (1) avoid remote accesses through inter-CPU interconnects (e.g., HyperTransport or QPI), and (2) deliver a higher aggregate bandwidth by providing independent interconnects for different groups of GPUs. An I/O controller contains a PCIe root complex that manages a tree of devices. Devices are connected to a bus, and buses can be connected to the PCIe root complex or to a PCIe switch. Communication between devices connected to different PCIe root complexes (e.g., in a multi-socket system) is not currently supported (due to limitations in the CPU interconnect), and intermediate copies to host memory are needed. Figure 2a shows a single-socket system in which all GPUs are connected to the same PCIe bus. In this system, all memory transfers go through the PCIe root complex found in the I/O controller, which may become a bottleneck. Figure 2b shows a more complex topology that uses two PCIe switches.
While CPU-GPU transfers still suffer from the same contention problems, GPU-GPU transfers can greatly benefit, since transfers between different pairs of GPUs can proceed in parallel. Figure 2c shows a dual-socket system with one I/O controller per socket. In this configuration, two CPU-GPU memory transfers can execute in parallel (if the GPUs involved in the transfers are connected to different I/O controllers, and the source/destination addresses in host memory are located in memory modules connected to the corresponding CPU sockets). However, the NUMA (Non-Uniform Memory Access) nature of the system makes the management of memory allocations used in data transfers more difficult [6]. Traditionally, the communication of GPUs with I/O devices (disks, network interfaces, ...) has been managed by privileged CPU code (e.g., the Operating System) that orchestrates the necessary memory transfers across the memories in the system. The simplest implementation is to make an intermediate copy to host memory and perform the communication like regular CPU code. This guarantees compatibility with all device drivers at the cost of extra memory transfers. NVIDIA introduced the GPUDirect [7] technology, mainly adopted by Infiniband network interface vendors, to reuse host memory buffers across different device drivers and thus avoid unnecessary copies. The second version of the technology allows GPUs to communicate directly with devices connected to the same PCIe root complex, thus completely eliminating intermediate copies to host memory.

Figure 2: Multi-GPU system topologies. (a) System with two GPUs. (b) System with four GPUs. (c) System with four GPUs in a NUMA system.

Figure 3: Communication between GPUs and I/O devices. (a) Disk storage. (b) Network interface.

B. Programming multi-GPU systems

Programming models for GPUs provide a host programming interface to manage all the memories in the system. CUDA [3] allows programmers to allocate memory in any GPU in the system, and provides functions to copy data between them. Since copies between host and GPU memories are performed using DMA transfers, pinned host memory buffers must be used. This ensures that pages are not swapped out to disk by the Operating System during the transfer. Memory transfers that use regular (non-pinned) user buffers are internally copied to pinned memory buffers by the runtime. However, the CUDA host API allows programmers to manage host pinned buffers to avoid these intermediate copies. OpenCL [4] uses a higher-level model based on contexts. Users can create a context associated with a group of devices, and memory is allocated on a per-context basis. However, the memory is allocated in the memory of a specific GPU lazily, when the first operation (e.g., a data transfer or kernel execution) on the memory allocation is detected. Transfers across contexts are not allowed and, in that case, intermediate copies to host memory are needed. Furthermore, OpenCL does not provide a standard mechanism to allocate pinned memory buffers, and each vendor relies on different combinations of flags. Both CUDA (since version 4) and OpenCL allow programmers to manage all GPUs in the system from a single CPU thread. However, many HPC programs have already been ported to multi-threaded schemes using pthreads (POSIX threads [8]) or OpenMP [9] to take advantage of multiple CPUs. When porting these codes to GPUs, each thread manages one GPU, and inter-thread synchronization mechanisms are used to ensure the correctness of the computations. Applications that run on multi-node systems usually use the MPI (Message Passing Interface [10]) programming model to communicate across nodes.
MPI functions take regular pointers as input/output buffers and, therefore, transferring data across GPUs in different nodes requires intermediate copies to host memory. However, the latest versions of some MPI implementations are able to identify GPU allocations (thus removing the need for explicit memory transfers to/from host buffers) and to take advantage of the GPUDirect technology. If the communicating GPUs reside in the same node, a direct GPU-GPU transfer can be performed. On the other hand, when data is transferred through the network, it can be copied directly from GPU memory to the NIC memory if they reside in the same PCIe root complex.

III. EXPERIMENTAL METHODOLOGY

A. Experiments

We design experiments to measure the impact of locality and double-buffering on the performance of data communication across CPUs, GPUs, and I/O devices. We cover three possible scenarios: communication inside a node (intra-node), communication between nodes (inter-node), and I/O performance. Communication inside the node can happen between the CPU (host) and a GPU, or between two GPUs. For transfers between host and GPU memories, standard or pinned memory allocations can be used on the host side. With pinned memory, asynchronous copies can be performed, which gives the opportunity to overlap the transfer with kernel execution or with other transfers. A common use case found in applications is transferring the inputs of the next computation iteration to the GPU while transferring the results of the previous one to the CPU. We label this a two-way transfer in our experiments. We pin CPU threads, forcing the allocation of host memory buffers to the closest memory partition, and measure the throughput of data transfers between a CPU and a GPU connected to the same PCIe hub, and between a CPU and a GPU connected to different I/O hubs.
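The thread-pinning step described above can be sketched with Linux's CPU-affinity API. This is a minimal illustration, not the paper's code: the helper name `pin_to_cpu` is ours, and a real topology-aware benchmark would derive the CPU id from the PCIe/NUMA topology instead of taking it as a plain parameter.

```c
#define _GNU_SOURCE     /* for CPU_SET and sched_getcpu */
#include <sched.h>

/* Pin the calling thread to a single CPU and return 0 on success.
 * Once pinned, first-touch allocation places the thread's host buffers
 * in the memory partition closest to that CPU, which is the effect the
 * experiments rely on. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

In the benchmarks, the chosen CPU would be the socket whose I/O hub hosts the GPU under test (local case) or the other socket (remote case).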

Another possible type of communication inside the node is transfers between GPUs. Besides transfers from one GPU to the other (one-way), we also measure the throughput of concurrent bidirectional transfers between two GPUs, to simulate the pattern commonly found in multi-GPU applications, such as the halo exchange in stencil codes. The GPU-to-GPU experiments are performed using two GPUs sharing the same PCIe bus, and using two GPUs installed on different PCIe buses, to quantify the advantage of peer-to-peer data transfers, when available, over transfers requiring intermediate copies to host memory. With the MPI standard being extensively used to develop cluster-level applications, MPI programs are usually ported to GPU clusters by spawning one process per GPU and performing the computation, or a part of it, on the GPU. For that reason, we also measure the performance of intra-node GPU-to-GPU copies when each GPU is managed by a different MPI process and the transfer is performed using MPI primitives instead of CUDA copies directly. To test the recently added GPU-awareness of some MPI implementations, we use two different versions of the OpenMPI library [11] (1.4.5 and 1.7) to perform the inter-node transfers. The latter is GPU-aware and allows the direct use of pointers to GPU memory allocations when calling MPI's data transfer functions. Communication between nodes (inter-node) is performed over the network. In our experiments, we focus on MPI over an Infiniband network, as the most representative case in multi-node GPU machines. We study the performance of transfers between GPUs in different nodes, and between a CPU in one node and a GPU in the other. Many applications do file I/O, either for the input data and results or as a backing store for big datasets. We measure the performance of transfers between the GPU and the disk for both traditional Hard Disk Drives (HDD) and Solid State Drives (SSD).
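The bidirectional halo-exchange pattern mentioned earlier in this section can be sketched in plain C. This is our own illustrative model, not the benchmark code: two host arrays stand in for the two GPUs' memories, and `memcpy` stands in for the DMA copies that would stage each boundary through host memory.

```c
#include <string.h>

#define N 8      /* interior cells per "device" */
#define HALO 1   /* ghost cells exchanged each iteration */

/* Each device buffer has layout: [ghost | N interior cells | ghost].
 * Exchange boundary cells in both directions through host staging
 * buffers, mimicking the two concurrent one-way transfers of the
 * two-way experiments (memcpy models the DMA copies). */
void halo_exchange(double *dev_a, double *dev_b)
{
    double stage_ab[HALO], stage_ba[HALO];

    /* stage each device's boundary cell in host memory */
    memcpy(stage_ab, dev_a + N, HALO * sizeof(double));    /* A's last interior cell  */
    memcpy(stage_ba, dev_b + HALO, HALO * sizeof(double)); /* B's first interior cell */

    /* deliver into the neighbor's ghost region */
    memcpy(dev_b, stage_ab, HALO * sizeof(double));             /* B's left ghost  */
    memcpy(dev_a + N + HALO, stage_ba, HALO * sizeof(double));  /* A's right ghost */
}
```

On hardware with peer-to-peer support, both staging copies would collapse into direct GPU-to-GPU transfers, which is exactly the difference the experiments quantify.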
To avoid interference from the Linux operating system's file caches in our throughput measurements, we use the O_DIRECT flag of the POSIX open function to ensure that data is actually read from and written to the disk instead of the cache. Depending on the scenario, different types of transfers are possible. In experiments that transfer data between any CPU and GPU, or between two GPUs on the same PCIe bus (including transfers using the GPU-aware version of the OpenMPI library), the DMA engines of the GPU transfer the data directly to the destination. Whenever these direct transfers are not possible, staging is performed through CPU memory, either by copying the whole memory structure to the host and then to the destination memory, or by performing pipelined transfers through the host (or hosts, in the case of inter-node transfers) using the double-buffering technique.

B. Evaluation Systems

Table I shows the systems used in our evaluation.

Table I: System configurations

                System A        System B        System C
CPUs            2x Intel E56    2x Intel E5649  1x Intel i7 38
GPUs            4x Nvidia C     2x Nvidia M     4x Nvidia C
GPUs per PCIe   2               1               1
Disk            WD HDD          Micron SSD      WD HDD
Disk interface  SATA 3 6 Gbps   SATA 3 6 Gbps   SATA 3 6 Gbps
DRAM            1333 MHz        1333 MHz        1333 MHz
NICs            n/a             Mellanox QDR    n/a
CUDA Driver     4.88            285.05.09       4.88
Linux Kernel    3.2.41          2.6.32          3.2.41
GCC             4.6.3           4.6.1           4.6.3
NVCC            4.2             4.1             5.0

System A is a single-node NUMA machine with two quad-core Intel Xeon E56 CPUs running at 2.4 GHz and four NVIDIA Tesla C GPUs. The CPUs are interconnected with Intel QPI, and each one is connected to an I/O hub shared by two GPUs (Figure 2c). The system has a Western Digital Caviar Black WD0 7200 rpm hard disk connected to the I/O controller of one of the CPUs. GPUs connected to the same PCIe controller can communicate directly through peer-to-peer transfers, but communications across PCIe controllers must perform intermediate copies to host memory. System B is a GPU cluster.
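The O_DIRECT technique described above can be sketched as follows. This is an illustrative helper of our own (`read_direct` is not from the paper): O_DIRECT requires the buffer, file offset, and transfer size to be aligned to the device's logical block size, and some filesystems (e.g., tmpfs) reject the flag, so the sketch falls back to a cached read in that case.

```c
#define _GNU_SOURCE   /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <unistd.h>

/* Read `size` bytes from `path`, bypassing the page cache when the
 * filesystem allows it. `buf` must be block-aligned (e.g., from
 * posix_memalign with a 4096-byte alignment) and `size` a multiple of
 * the block size, or the O_DIRECT read itself will fail. */
ssize_t read_direct(const char *path, void *buf, size_t size)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        fd = open(path, O_RDONLY);  /* cached fallback (e.g., tmpfs) */
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, size);
    close(fd);
    return n;
}
```

In the disk benchmarks this read would feed the host staging buffer that is subsequently transferred to GPU memory.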
Each node has two six-core Intel Xeon E5649 CPUs running at 2.53 GHz, also connected with a QPI bus. The main memory is split into two 12 GB modules, as in the first system. The nodes have two NVIDIA Tesla M GPUs and two Mellanox Technologies MT26428 QDR Infiniband network adapters. The system has two PCIe 2.0 buses, each shared by one GPU and one network adapter, and connected to one of the CPUs (Figure 3b). Contrary to System A, this system uses an SSD, which delivers much better throughput (within the limitations of the SATA 3 interface). System C is a single-node machine with one quad-core Intel Core i7 38 CPU clocked at 3.6 GHz. The main difference of this system is that its 4 GPUs are connected in pairs to different PCIe switches. This allows peer-to-peer communication between all the GPUs in the system. The organization of this machine is similar to the one shown in Figure 2b.

IV. EVALUATION OF LOCALITY

This section presents the results of our experiments measuring the impact of locality-aware policies on the performance of data transfers. To compare the different locality configurations, we normalize the results to the best bandwidth achieved for each data size. This allows us to show the percentage of the optimal achieved by each configuration.

A. Locality-aware Intra-node GPU Communication

First, we study how locality affects data transfers between a GPU and host memory. We measure the bandwidth of data transfers to/from a GPU connected to the PCIe hub local to the CPU, and a GPU connected to the PCIe hub of the other CPU (remote). We measure both single data transfers (one-way) and two concurrent transfers in opposite directions (two-way). Figure 4 shows the results obtained on systems A and B. System C is not considered for this test, because all its GPUs are connected to the same PCIe hub.
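The normalization used for all the locality figures can be written down directly; this small helper is our own naming, not code from the paper.

```c
/* Normalize the bandwidths of `n` configurations (all measured at the
 * same data size) to the best one, as done for the locality figures.
 * Results are percentages of the optimal; the best configuration
 * gets 100. */
void normalize_to_optimal(const double *bw, double *pct, int n)
{
    double best = bw[0];
    for (int i = 1; i < n; i++)
        if (bw[i] > best)
            best = bw[i];
    for (int i = 0; i < n; i++)
        pct[i] = 100.0 * bw[i] / best;
}
```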
Both systems follow a similar trend, where the impact of locality increases with the transfer size until a certain size, where all the configurations reach a steady state. As expected, one-way local transfers are the best performing in both systems, while one-way remote transfers reach from % to % of the optimal bandwidth for large transfer sizes, depending on the direction of the copy.

Figure 4: Optimality of different locality configurations when transferring data between a GPU and host memory (2-way: two concurrent transfers in opposite directions). (a) System A. (b) System B.

In System A, host-to-GPU copies are faster than the respective GPU-to-host copies; in System B, however, GPU-to-host copies achieve better bandwidth. Even though the two systems have similar GPU and CPU models, they have different hardware characteristics that can favor one direction or the other. Two-way data transfers are the slowest, with the local ones obtaining up to 61% of the optimal bandwidth in System A, and 69% in System B. Remote two-way transfers reach only around % of the optimal bandwidth. Figure 5 shows the results obtained for intra-node GPU-to-GPU data transfers in systems A and C. Since System B has only two GPUs, it cannot be used to study the effects of locality on GPU-to-GPU transfers. The meaning of local and remote is not the same in the two systems. In System A, local transfers are between GPUs installed on the same PCIe hub, and remote transfers use two GPUs connected to different PCIe hubs; thus, in this system, peer-to-peer copies are only available for local data transfers. In System C, all four GPUs are connected to the same PCIe hub, allowing peer-to-peer copies between all the GPUs. In this case, we consider transfers between two GPUs on the same PCIe bridge to be local, and transfers across bridges to be remote. Besides the GPU-to-GPU copy functions provided by CUDA, we also measure the bandwidth of our own GPU-to-GPU copy implementation, which uses a double-buffering technique through host memory.
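The hand-written double-buffered copy through host memory can be sketched in plain C. This is a simplified model, not the paper's implementation: `memcpy` stands in for the DMA copies, and the two stages run sequentially here, whereas the real CUDA version would issue asynchronous copies on separate streams so that chunk i+1's device-to-host copy overlaps chunk i's host-to-device copy.

```c
#include <string.h>

#define CHUNK 256  /* staging buffer size; the paper tunes this parameter */

/* Copy `size` bytes from one "GPU" buffer to another through two host
 * staging buffers, alternating between them as a double-buffered
 * pipeline does. Alternating buffers is what lets the two DMA stages
 * overlap once the copies are asynchronous. */
void staged_copy(char *dst, const char *src, size_t size)
{
    char stage[2][CHUNK];
    int cur = 0;
    size_t off = 0;

    while (off < size) {
        size_t n = size - off < CHUNK ? size - off : CHUNK;
        memcpy(stage[cur], src + off, n);  /* device -> host */
        memcpy(dst + off, stage[cur], n);  /* host -> device */
        off += n;
        cur ^= 1;                          /* switch staging buffer */
    }
}
```

The per-chunk bookkeeping and synchronization are exactly the overheads that make double-buffering unprofitable for small transfers, as Section V shows.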
The advantage of peer-to-peer copies is clear in both systems. In System A, the best bandwidth obtained by remote transfers is only 65% of the optimal, because they have to be implemented as a two-step copy with intermediate staging in host memory. In System C, since all transfers can use peer-to-peer copies, the differences in bandwidth due to locality are smaller than in the previous system. However, these results show that locality differences within the same PCIe hub can still cause an overhead of up to 15%.

Figure 5: Optimality of different locality configurations for GPU-to-GPU data transfers (2-way: two concurrent transfers in opposite directions, DB: hand-written double-buffered implementation). (a) System A. (b) System C.

Most HPC applications are programmed using MPI to run on modern clusters. To model this scenario, we also measure how intra-node transfers perform when the GPUs are managed by different MPI processes. Figure 6 shows the percentage of the optimal bandwidth achieved for local MPI process communication in two different hardware configurations. In both configurations, locality-aware policies provide the optimal bandwidth when the MPI processes are running in the same node. These experimental results show that if processes are not correctly placed (i.e., each MPI process does not use a CPU and GPU in the same PCIe domain), local MPI inter-process communication can be up to 75% slower than with topology-aware placement. The results in Figure 6 also show that ensuring MPI process locality provides large benefits even when each process is assigned a GPU from a different PCIe domain: there is a potential bandwidth penalty of up to % if MPI communication requires crossing PCIe domains several times on each data communication operation. These experimental results show the viability of auto-tuning applications to match the system topology where they are executed. Moreover, Figure 6 also shows the benefit of favoring MPI inter-process communication within the same domain, which is the optimal case. This shows the importance of placing the MPI processes that exchange data most frequently in the same PCIe domain.

Figure 6: Optimality of different locality configurations for intra-node GPU-to-GPU data transfers using MPI (each GPU managed by a different MPI process). (a) System A, where a domain is a PCIe hub. (b) System C, where a domain is a PCIe bridge.

B. Locality-aware Disk I/O

We measure the performance of disk reads and writes in several configurations. Figure 7 shows the results for System A (which has an HDD) and System B (which has an SSD). System C is not studied because it can be treated as a subset of System A for these experiments. Each transfer is labeled with a two-letter set (L or R) describing the locality of the tested transfer: the first letter denotes the locality of the disk with respect to the CPU memory used for staging, and the second letter denotes the locality of the GPU with respect to the CPU memory used for staging. All transfers are staged through host memory, either by performing two transfers or through a pipelined transfer, and the fastest one is shown.

Figure 7: Optimality of different locality configurations for file reads and writes (L/R denote local/remote disk and GPU; HDD and SSD series).

For both file reads and writes, both systems show a small impact of disk locality on performance for reasonably big files. For smaller file writes, considerable variability is seen, due to the inability to isolate the disk traffic generated by our experiments from the traffic generated by operating system services. Larger writes are within % of the optimal in both cases, while reads are within % of the optimal in both cases.

V. EVALUATION OF DOUBLE-BUFFERING OF DATA TRANSFERS

A. Intra-node GPU-to-GPU communication

Figure 8 shows the two-way transfers between remote GPUs in System B.
The one labeled DB is our hand-written implementation of the double-buffered transfer, while the other uses the CUDA-provided call, which is also internally double-buffered but with a fixed buffer size. Comparing the two methods, we conclude that for transfer sizes of or more, the CUDA-provided implementation achieves from 85% to 95% of the auto-tuned transfer performance, while for small transfer sizes it achieves higher performance than double-buffering, due to the extra overhead of synchronization and the inefficiency of issuing small transfers. Figure 9 shows the efficiency of double-buffering with various buffer sizes for the largest data transfer we have measured (12). A buffer of is large enough to hide the overheads of the extra function calls and synchronization, but small enough to hide the fact that the transfer is staged.

B. Disk I/O

The bandwidth provided by disks is much lower than that of the PCIe interconnect. However, results show that using double-buffered transfers is still beneficial. Figure 10a shows how double-buffering performs against simple transfers on a hard disk drive. Results show that double-buffering is slower for transfers smaller than . After that inflection point, double-buffering delivers 10-% more throughput than simple transfers, both for and transfers. Tests on the SSD disk (Figure 10b) exhibit the same behavior. For transfers, in both configurations, the double-buffered implementation yields much better performance than the naive transfer.

Figure 10: Optimality of different data transfer implementations of disk I/O transfers. (a) HDD (System A). (b) SSD (System B).

Figure 8: Efficiency of the auto-tuned and non-tuned double-buffered transfers between GPUs.

Figure 9: Efficiency of the double-buffered transfers between GPUs for different buffer sizes.

The effect of the size of the buffers used in the implementation of the double-buffered transfer is shown in Figure 11a and Figure 11b. The results show the optimality of the implementation of a 128MB memory transfer for different buffer sizes. The first figure shows that buffers of are enough to achieve the maximum performance on the HDD disk, due to the low transfer rate of the disk. On the other hand, the SSD disk needs buffers of , due to its higher bandwidth compared to the HDD.

C. Inter-node GPU Communication

Figure 12a shows the performance of inter-process MPI communication for data transfers across CPU and GPU memories. These results show that topology-aware placement of MPI processes barely affects the performance of remote MPI data transfers. The results in Figure 12a also show the benefits of double-buffering remote communication; in all cases, double-buffering provides the optimal bandwidth for data transfer sizes larger than .
The only case where double-buffering performs worse than direct transfers is when communicating data smaller than 128KB from the network to the GPU memory, when both the network interface and the GPU are attached to the same PCIe bus. Figure 12b shows the performance of MPI communication for different buffer sizes when data transfers are double-buffered. These results show that the optimal buffer size in all cases is 2MB. When larger buffers are used, the performance of data transfers slowly decreases, while smaller buffers also provide sub-optimal bandwidth.

VI. CONCLUSIONS

In this paper we have analyzed the importance of auto-tuning applications to match the system topology and to select the most adequate data transfer mechanism, in order to achieve optimal bandwidth on data transfers between CPUs, GPUs, and nodes within a cluster. To evaluate the potential impact of each of these optimizations in different scenarios, we have developed and executed synthetic benchmarks that stress each data transfer mechanism under different circumstances. We have presented the experimental results of running these benchmarks on three different systems that represent most of the existing heterogeneous architectures currently used in HPC environments.
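The double-buffered staging scheme evaluated throughout Section V can be sketched in plain C. This is a minimal illustration of the pipelining structure, not the benchmark code: memcpy stands in for the asynchronous DMA transfers (e.g., cudaMemcpyAsync on two CUDA streams) that a real implementation would overlap, and the chunk size is the tunable staging-buffer size studied in Figures 9 and 11.

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Double-buffered staged copy: the transfer is split into chunks that
 * alternate between two staging buffers, so that in a real implementation
 * the upload of chunk k can overlap the drain of chunk k-1.  Here both
 * halves of the pipeline are plain memcpy stand-ins for async DMA. */
static void double_buffered_copy(char *dst, const char *src,
                                 size_t total, size_t chunk)
{
    char *staging[2] = { malloc(chunk), malloc(chunk) };
    size_t nchunks = (total + chunk - 1) / chunk;

    for (size_t k = 0; k <= nchunks; k++) {
        if (k < nchunks) {              /* stage chunk k into buffer k%2 */
            size_t off = k * chunk;
            size_t len = (off + chunk <= total) ? chunk : total - off;
            memcpy(staging[k % 2], src + off, len);
        }
        if (k > 0) {                    /* drain previously staged chunk */
            size_t off = (k - 1) * chunk;
            size_t len = (off + chunk <= total) ? chunk : total - off;
            memcpy(dst + off, staging[(k - 1) % 2], len);
        }
        /* a real implementation would synchronize the two streams here */
    }
    free(staging[0]);
    free(staging[1]);
}
```

With truly asynchronous transfers, the two copies at step k touch different staging buffers, so they can proceed concurrently; the buffer size trades per-chunk synchronization overhead against how well the staging is hidden.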
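A simple auto-tuned dispatch policy, selecting a code path by transfer size, can be captured in a few lines of C. The threshold and buffer values below are hypothetical placeholders, loosely based on the 128KB crossover and 2MB optimal buffer reported above; a real auto-tuner would measure them on the target system.

```c
#include <stddef.h>

/* Hypothetical per-system tuning record, filled in by a calibration run. */
struct transfer_tuning {
    size_t db_threshold;   /* below this size, direct transfers win     */
    size_t db_chunk;       /* staging-buffer size for double-buffering  */
};

enum transfer_path { PATH_DIRECT, PATH_DOUBLE_BUFFERED };

/* Pick the code path for one transfer: small transfers go direct, since
 * the synchronization overhead of double-buffering dominates; large
 * transfers are double-buffered with the calibrated chunk size. */
static enum transfer_path pick_path(const struct transfer_tuning *t,
                                    size_t nbytes)
{
    return (nbytes < t->db_threshold) ? PATH_DIRECT : PATH_DOUBLE_BUFFERED;
}

/* Example calibration values (hypothetical, not measured): */
static const struct transfer_tuning mpi_tuning = {
    .db_threshold = 128 * 1024,      /* 128KB crossover               */
    .db_chunk     = 2 * 1024 * 1024  /* 2MB buffer, as in Figure 12b  */
};
```

The point of the sketch is that once the crossover point and buffer size are measured for a given topology, the runtime decision itself is trivial and cheap enough to make per transfer.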

Figure 11: Optimality of different buffer sizes for a 128MB disk I/O transfer. (a) HDD (System A). (b) SSD (System B).

Figure 12: Performance of MPI transfers. (a) MPI transfer bandwidth. (b) Double-buffer performance for different buffer sizes.

Our experimental results highlight the benefits of auto-tuning applications to match the node topology. By ensuring that applications run on the CPU and GPU connected to the same PCIe bus, the data transfer bandwidth can improve by up to 8 %. We have also shown that locality-aware policies have little impact when communicating MPI processes are running on different nodes.

We have also shown that double-buffering is key to achieving optimal performance on inter-process communication, both when processes are running on the same node and when they run on separate nodes. Double-buffering provides little performance improvement on GPU-to-GPU communication when direct data transfers across GPUs are supported by the hardware. Moreover, double-buffering only provides marginal gains when transferring data between GPUs and I/O devices such as disks, due to the low read/write bandwidth of existing I/O devices.

The results presented in this paper show that auto-tuning is key to optimizing data transfers in existing heterogeneous HPC systems. Our experimental data also shows that simple auto-tuning policies, such as different code paths depending on the data transfer size, can provide impressive performance gains.
