Technical Report WM-CS-2010-03
College of William & Mary, Department of Computer Science

Implementing the Dslash Operator in OpenCL

Andy Kowalski, Xipeng Shen
{kowalski,xshen}@cs.wm.edu
Department of Computer Science, The College of William and Mary

February 3, 2010

Abstract

The Dslash operator is used in Lattice Quantum Chromodynamics (LQCD) applications to implement a Wilson-Dirac sparse matrix-vector product. Typically the Dslash operation has been implemented as a parallel program. Today's Graphics Processing Units (GPUs) are designed to perform highly parallel numerical calculations for 3D graphics rendering. This design works well with scientific applications such as LQCD's implementation of the Dslash operator. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has implemented the Dslash operator for execution on GPUs using NVIDIA's Compute Unified Device Architecture (CUDA). CUDA applications, however, will only run on NVIDIA hardware. OpenCL (Open Computing Language) is a new open standard for developing parallel programs across CPUs, GPUs and other processors. This paper describes the implementation of the Dslash operator using OpenCL, its performance on NVIDIA GPUs compared with CUDA, and its performance on other hardware platforms.

General Terms: Performance, Languages.

Keywords: GPU, OpenCL, CUDA, LQCD, Parallel Programming

1. Introduction

Computer simulations of Quantum Chromodynamics (QCD) using a space-time lattice (also known as lattice QCD or LQCD) play an important role in calculations of quantities involving the strong nuclear force, and are vital underpinnings of research into High Energy Physics, Nuclear Physics and the search for physics beyond the Standard Model of particle interactions. These calculations are highly computationally demanding, often running on the largest supercomputing facilities in the world. The efficiency of the algorithms can determine both the accuracy of the calculation and the scale of the problems that can be tackled. The constant demand for higher efficiency has stimulated a large body of research in both algorithm design and software optimization [4,6,7,8,9,10].

Recent developments in Graphics Processing Units (GPUs) offer many new, remarkable opportunities for high performance computing. Thanks to their tremendous throughput and very high memory bandwidth, GPUs have emerged as an appealing alternative to CPUs for scientific applications, bringing speedups of up to several orders of magnitude [3,5,12]. The goal of this project is to examine the potential of GPUs for enhancing the computing efficiency of a specific implementation of QCD, called Lattice Quantum Chromodynamics (LQCD). LQCD uses a Dslash operator to implement a Wilson-Dirac sparse matrix-vector product. The base implementation used in this project comes from the USQCD community [14]. It has gone through a decade of evolution over many generations of CPU architectures. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has recently implemented the Dslash operator with NVIDIA's Compute Unified Device Architecture (CUDA), and showed clear performance improvement over execution on CPUs. The existing CUDA implementation, however, is specific to the NVIDIA GPU architecture. OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming across various CPUs, GPUs and other processors [13]. Unlike CUDA, OpenCL code can be compiled for execution on NVIDIA and ATI graphics cards as well as general purpose CPUs. It has rapidly drawn attention across the industry. For instance, Apple has provided support for it in Mac OS X 10.6.
In this project, we attempt to implement the Dslash operator in OpenCL and evaluate its programmability, portability and performance on a variety of hardware.

2. Dslash and LQCD

In LQCD, the Dslash operator forms part of the Wilson fermion matrix that describes the interaction of quark fields and the vacuum gluon fields between two points in space-time. The inverse of the fermion matrix is required in several places in lattice QCD simulations, including the computation of (valence) quark propagators, the evaluation of the Hamiltonian, and the Molecular Dynamics forces in Hybrid Molecular Dynamics Monte Carlo algorithms [6,9].

At the start of a simulation, the fermions and the gauge fields between the fermions are in a random state. Over time, the system settles into a state of equilibrium. Physicists can deduce physical properties from the equilibrium configurations. The resulting configurations can themselves be saved to conduct other calculations; in that case a valence particle is introduced into the configuration and the simulation predicts interactions.

The Dslash operator is used in LQCD to efficiently solve Dirac equations. In this case, the implementation solves the equation shown in Figure 1 below.

Figure 1. [2]
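Figure 1 itself is not reproduced here. For reference, a standard textbook form of the Wilson Dslash hopping term, assembled from the term-by-term description that follows rather than copied from the original figure, is:

```latex
% Hedged reconstruction: the per-direction structure matches the description
% in the text, but the exact normalization and sign conventions of Figure 1
% are assumed rather than taken from the original.
D_{x,y} \;=\; \sum_{\mu=1}^{4} \Big[
      (1-\gamma_\mu)\, U_{x,\mu}\, \delta_{x+\hat\mu,\,y}
  \;+\; (1+\gamma_\mu)\, U^{\dagger}_{x-\hat\mu,\,\mu}\, \delta_{x-\hat\mu,\,y}
\Big]
```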

Here µ iterates over the space-time dimensions of the lattice (x, y, z and t). The γ_µ are 4x4 Dirac spin matrices. Each term of the sum has a forward and a backward direction from each site. In the forward direction, the link between two sites x and x+µ̂ has an associated gauge link field U_{x,µ}, which is a 3x3 complex matrix. Each site also has an associated fermion field ψ. The δ_{x+µ̂,y} operator denotes a shift of the ψ field from site x+µ̂ to site x, which is required to form numerical derivative terms. Similarly, in the backward direction, each link between two sites x and x−µ̂ has an associated gauge link field U_{x−µ̂,µ} and a δ_{x−µ̂,y} operator. Visually, the formula looks like Figure 2 below.

Figure 2.

In Figure 2, the spinor fields ψ from the Dirac equation are the 4x1 matrices containing elements of 3x1 matrices of color (c). The 4x4 matrices containing elements of U (3x3 matrices) are the product of the projectors (1−γ or 1+γ, depending on direction) and U.

The algorithm for Dslash is rather compute intensive. It requires 1320 floating point operations (flops) to perform a Dslash operation on a single site. Because the same calculation is done for each site, it is a good candidate for parallelization and execution on Single Instruction Multiple Data (SIMD) processors like today's GPUs.

3. OpenCL

OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming across CPUs, GPUs and other processors [13]. It is based on C99 and abstracts the specifics of the underlying hardware [11]. OpenCL programs are portable in that they can be recompiled with little to no change to the source code and executed on different hardware. As an example, a program can be created and tested on a CPU. Once working, the same code can be compiled and executed on a GPU.

In OpenCL, as in CUDA, the functions executed in parallel are called kernels. Kernels can be either data or task parallel, and can either be compiled from source during program execution or precompiled in advance. Kernels are executed as work items (individual threads) within a work group. A work group is a 1-, 2- or 3-dimensional index of work items executed on a single compute unit [13].

The memory model in OpenCL consists of private, local, constant, and global memory. Private memory is accessible only by an individual work item; it is the fastest memory to access, but the smallest in size. Local memory is shared among all the work items within a single work group; access to local memory is faster than to global memory, but its size is also limited. Constant memory is a region of global memory that does not change during the execution of a kernel. Global memory is the largest of all the memory types and is accessible by all work items within a context.

There are five main steps to executing an OpenCL kernel: initialization, allocation of resources, creation of the program and kernel, execution, and cleanup.

During initialization, compute devices are chosen and contexts and command queues are created. Command queues hold the commands to be executed on a compute device. They can be configured to submit the commands to the compute device either in order or out of order. The commands can be kernels to execute or reads/writes from/to device memory. The allocation of resources involves creating the input and output buffers on the device and copying any input data into its buffer.
Creation of the program and kernel involves loading the kernel source code or a precompiled binary into a program object. If the kernel is loaded as source code, it must be compiled; a kernel object is then created from the compiled program. Execution involves setting arguments for the kernel and inserting the kernel into a command queue for execution; any result buffers are then read back into host memory. Lastly, cleanup is done by releasing (freeing) memory objects, kernels, programs, command queues, and contexts.
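The following minimal host-code sketch illustrates these five steps in C. It is not the Jefferson Lab Dslash code: the kernel body, the "dslash" kernel name, the buffer sizes and the work-group size are placeholders chosen only to make the sequence of API calls concrete, and error checking is omitted for brevity.

```c
/* Minimal OpenCL host-program sketch of the five steps described above.
 * On Mac OS X, include <OpenCL/opencl.h> instead of <CL/cl.h>.        */
#include <CL/cl.h>
#include <stdlib.h>

int main(void) {
    cl_int err;

    /* 1. Initialization: pick a device, create a context and a command queue. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* 2. Allocation of resources: device buffers, copy input data to the device. */
    size_t n = 1024;                               /* placeholder problem size */
    float *host_in  = calloc(n, sizeof(float));
    float *host_out = calloc(n, sizeof(float));
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), host_in, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  n * sizeof(float), NULL, &err);

    /* 3. Creation of the program and kernel (compiled from source at run time).
     *    The kernel body is a stand-in copy, not the real Dslash stencil.      */
    const char *src =
        "__kernel void dslash(__global const float *in, __global float *out) {"
        "    int i = get_global_id(0);"
        "    out[i] = in[i];"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "dslash", &err);

    /* 4. Execution: set arguments, enqueue the kernel, read the results back. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    size_t global = n, local = 64;     /* 64 matches the work group size used later */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), host_out,
                        0, NULL, NULL);

    /* 5. Cleanup: release all objects and host memory. */
    clReleaseMemObject(d_in);     clReleaseMemObject(d_out);
    clReleaseKernel(kernel);      clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    free(host_in); free(host_out);
    return 0;
}
```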

4. CUDA and OpenCL Language Differences

CUDA and OpenCL differ in terminology, but they share the same concepts and design in terms of how you lay out your work. Table 1 shows the general differences in terminology used to describe the work models of CUDA and OpenCL.

    CUDA               OpenCL
    Thread             Work-item
    Thread Block       Work-group
    Global Memory      Global Memory
    Constant Memory    Constant Memory
    Shared Memory      Local Memory
    Local Memory       Private Memory

Table 1. [1]

There are also differences in accessing index information within kernels. CUDA uses predefined variables where OpenCL uses function calls [1]. Table 2 shows the equivalent function calls in CUDA and OpenCL for accessing index information.

    CUDA                                             OpenCL
    gridDim                                          get_num_groups()
    blockDim                                         get_local_size()
    blockIdx                                         get_group_id()
    threadIdx                                        get_local_id()
    No direct equivalent; combine blockDim,          get_global_id()
      blockIdx, and threadIdx to calculate a
      global index.
    No direct equivalent; combine gridDim and        get_global_size()
      blockDim to calculate the global size.

Table 2. [1]

Other naming differences exist between the APIs for similar object definitions and functions, but functionally CUDA does not provide anything equivalent to OpenCL's command queues, which hold the commands to be executed on a compute device. Additionally, OpenCL provides task parallelism capabilities by allowing dependencies to be declared between tasks executing on a device [1].

5. Code Development

Mac OS X

Apple has included support for OpenCL in Mac OS X 10.6 and in the development tools that come with Mac OS X 10.6. The development tools are not installed by default, but are included on the distribution DVD and freely available from the Apple website. To compile OpenCL programs, one only needs to pass the -framework OpenCL argument to the Apple-provided gcc compiler. Apple's implementation of OpenCL also includes support for executing OpenCL kernels on both GPUs and the system's CPU. Apple's Xcode editor recognizes OpenCL source code and indents and colors different parts of the code, which makes development easier because you can quickly identify types, variable names, comments, etc.

NVIDIA SDK on Linux

Linux distributions do not include OpenCL support. To support OpenCL on NVIDIA GPU devices, version 190.29 of the NVIDIA device driver must be installed on the system, and CUDA must also be installed. NVIDIA's Software Development Kit (SDK) for GPU programming, version 2.3b, includes support for OpenCL. To compile OpenCL code, one must install the SDK, set the LD_LIBRARY_PATH environment variable to include the path to the library included with the SDK, and set the PATH environment variable to include the path to the CUDA bin directory. NVIDIA's OpenCL SDK is integrated with CUDA, so OpenCL kernels compiled using the NVIDIA SDK are said to perform as well as their equivalent CUDA kernels.

ATI SDK on Linux

At the time of this writing, ATI provides beta versions of its device driver and SDK for OpenCL support. To compile OpenCL code, one must install the SDK, set the ATISTREAMSDKROOT environment variable to the path of the SDK installation, set the LD_LIBRARY_PATH environment variable to include the path to the library included with the SDK, and set the PATH environment variable to include the path to the bin directory in the SDK.

Getting the SDK installed such that one could compile was straightforward and successful. Executing the kernels on the ATI GPUs, however, was not successful. The issue seems to be ATI's required access to the X11 server for communicating with the GPU: remote logins use the X11 server on the remote machine and thus cannot find the GPU. There were also problems getting X11 to work with the device driver. Due to those problems, this project demonstrates the ability to compile OpenCL code with ATI's SDK, but demonstrated portability by executing the code on the CPU of a MacBook Pro.

6. Performance

The OpenCL implementation of Dslash was not fully optimized at the time of this writing. It does not take advantage of red-black labeling, a technique for implementing a complete Dslash operation using half the number of sites. The CUDA implementation, however, does use red-black labeling. Memory coalescing is used for the storage of both the inputs and outputs by the OpenCL implementation, but it does this using global memory for both. Of the memory available for the storage of inputs and outputs on GPUs in OpenCL, global memory has the worst performance; it proved, however, to be easier to use for getting this first OpenCL implementation working.

The performance of the algorithm is measured in effective flops. Since the existing CUDA implementation uses the red-black labeling technique, we measure effective flops using the following formula:

    (flops * num_sites/2 * num_loops) / seconds

Here, flops is the 1320 flops required per site; num_sites is the number of sites; num_loops is 200 (the number of loops performed); and seconds is the recorded time in seconds. Again, we divide the number of sites processed by 2 because the existing CUDA version utilizes red-black labeling to perform the calculation.

Performance testing was conducted on the processors/systems shown in Table 3 below.

    System         Operating System     GPU/Processor
    MacBook Pro    Mac OS X 10.6.2      NVIDIA 9400M GT (16 cores, 256MB)
    MacBook Pro    Mac OS X 10.6.2      Intel 2.4GHz Core 2 Duo (2GB)
    SuperMicro     CentOS 5.3           NVIDIA Tesla C1060 (240 cores, 4GB)

Table 3.

NVIDIA 9400M GT

The performance on the NVIDIA 9400M GT was not that impressive. This was to be expected because the graphics card is in use for the display and because of its limited number of cores and memory. Figure 3 below shows the observed performance.

Figure 3.

One can clearly see that the operation tops out at around 388 Mflops with a work group size of 64 and 512 threads. The program actually caused the screen to flicker and errored out when executed with more than 512 threads.

Intel 2.4GHz Core 2 Duo

Executing the same code on the Intel 2.4GHz Core 2 Duo CPU of the MacBook was better. Figure 4 below shows the observed performance.

Figure 4.

The performance peaks at 1.0 Gflops. As the number of sites increases, one can see that performance begins to degrade around 16384 sites. It is believed that it is at this point that cache misses and context switching begin to degrade performance.
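As a concrete illustration of the effective-flops formula above, the short C sketch below computes the figure of merit from a measured run time. The constants (1320 flops per site, 200 loops, and the division by 2 for red-black labeling) follow the text; the example site count and timing value are hypothetical placeholders, not measurements from this study.

```c
#include <stdio.h>

/* Effective flops as defined above: (flops * num_sites/2 * num_loops) / seconds.
 * The division by 2 matches the red-black labeling convention of the CUDA code. */
static double effective_flops(long num_sites, double seconds) {
    const double flops_per_site = 1320.0;   /* flops per Dslash site (from the text) */
    const double num_loops      = 200.0;    /* loops per timed measurement           */
    return (flops_per_site * (num_sites / 2.0) * num_loops) / seconds;
}

int main(void) {
    /* Hypothetical example: 16384 sites timed at 2.16 s is roughly 1.0 Gflops. */
    printf("%.2f Gflops\n", effective_flops(16384, 2.16) / 1e9);
    return 0;
}
```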

NVIDIA Tesla C1060

Next, the program was executed on an NVIDIA Tesla C1060, which has 240 cores and 4 GBytes of memory. As expected, the Tesla was the top performer in our test group. It performed best with 64 threads per work group and peaked at about 27 Gflops. Although the Tesla C1060 can deliver even better performance, this is clearly a big gain for the application compared to its performance on a CPU. Figure 5 shows the observed performance.

Figure 5.

OpenCL vs CUDA Implementation

The last test was to compare the performance of the CUDA implementation with that of the OpenCL implementation of Dslash. Figure 6 shows the observed performance of each with a work group size (block size in CUDA) of 64 threads.

Figure 6.

Although the CUDA implementation outperformed the OpenCL implementation, the curves of the two graphs are very similar. Both reach their peak when processing at least 32768 sites. However, recall that the CUDA implementation uses red-black labeling and actually performs the Dslash operation using only half the sites, and thus launches half the number of threads. The OpenCL implementation currently uses one thread for each site to do the Dslash operation. So one improvement we can make to increase the performance of the OpenCL implementation is to adjust the algorithm such that it only has to execute on half the sites. (Anecdotally, when the OpenCL implementation was executed using half the sites, the performance was comparable. Those results are not included in this paper because the result of the Dslash operator was incorrect.) Additionally, the CUDA implementation utilizes texture memory for the inputs, which provides better performance than the global memory used by the OpenCL implementation. So another performance gain can be expected if the OpenCL implementation is modified to use image memory (called texture memory in CUDA).

7. Future Work

The OpenCL implementation of Dslash discussed here does not make use of several of the performance optimizations applied to the CUDA implementation. In particular, utilizing local memory instead of global memory is a known performance enhancement. In terms of the algorithm itself, the CUDA implementation utilizes a trick known as red-black labeling to perform the calculation with only half the sites; hence, it launches half the number of threads of the current OpenCL implementation. Additionally, performance testing of Dslash on an ATI GPU and comparing it to NVIDIA is of interest. The portability of OpenCL was demonstrated using CentOS, Mac OS X, two different GPUs and a traditional CPU. However, performance testing on ATI GPUs is of interest to the Scientific Computing group at Jefferson Lab. If performance is comparable, future procurements of GPUs for use by LQCD applications by the Scientific Computing group can be competed between NVIDIA and ATI hardware without huge development efforts to port Dslash.

8. Conclusions

Implementing Dslash in OpenCL for the first time was a bit tedious. It is important to understand the different types of memory available and the limitations associated with each type. Once those are fully understood, however, the steps required to execute a kernel are mostly clear-cut. Additionally, once the algorithm for the Dslash operator was completely understood, porting the existing CUDA kernel to OpenCL proved to be time consuming but straightforward. The kernels are similar enough that developers using CUDA should not have a hard time porting them over to OpenCL.
The portability of OpenCL and its ability to execute on heterogeneous processor architectures are a big advantage.

It was very easy to move and execute the Dslash implementation from Mac OS X to Linux and from CPU to GPU. This flexibility will allow the LQCD community to take advantage of advances in GPU designs from multiple vendors without the major software development efforts traditionally required when migrating to a new processor architecture. Although the Dslash operator in CUDA outperformed the OpenCL implementation, it is important to remember that the CUDA implementation includes optimizations, such as red-black labeling and the use of shared and texture memory on the GPU. It is expected that the OpenCL implementation will perform similarly with the same optimizations.

It is clear that scientific applications like LQCD can take advantage of modern GPU designs. The performance observed on a single high-end graphics card compared to a traditional CPU makes such development efforts worthwhile. Implementing such applications in OpenCL adds value by providing portability, reduced development costs for use with heterogeneous processor architectures, and the ability to utilize new hardware platforms sooner.

Acknowledgments

We thank Jie Chen for instruction on the existing CUDA implementation of Dslash, and Balint Joo for instruction on the Dslash operator, the physics, and LQCD. Discussions with Kostas Orginos triggered the exploration of the application of GPUs to LQCD.

References

[1] AMD, Inc. OpenCL and the ATI Stream SDK v2.0 (2009). http://developer.amd.com/documentation/articles/pages/opencl-and-the-ati-stream-v2.0-beta.aspx

[2] Balint Joo (private communication), 2009.

[3] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 225-234, 2008.

[4] M. A. Clark, R. Babich, K. Barros, R. C. Brower, and C. Rebbi. Solving lattice QCD systems of equations using mixed precision solvers on GPUs. Nov 2009. arXiv:0911.3191 [hep-lat].

[5] Y. Dotsenko, N. K. Govindaraju, P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 205-213, 2008.

[6] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Phys. Lett., B195:216-222, 1987.

[7] G. I. Egri et al. Lattice QCD as a video game. Comput. Phys. Commun., 177:631-639, 2007.

[8] J. Foley, K. J. Juge, A. O'Cais, M. Peardon, S. Ryan, and J.-I. Skullerud. Practical all-to-all propagators for lattice QCD. Comput. Phys. Commun., 172:145-162, 2005.

[9] S. A. Gottlieb, W. Liu, D. Toussaint, R. L. Renken, and R. L. Sugar. Hybrid molecular dynamics algorithms for the numerical simulation of quantum chromodynamics. Phys. Rev., D35:2531-2542, 1987.

[10] K. Ibrahim, F. Bodin, and O. Pène. Fine-grained parallelization of lattice QCD kernel routine on GPUs. Journal of Parallel and Distributed Computing, 68(10):1350-1359, 2008.

[11] Khronos Group. OpenCL - Parallel Computing for Heterogeneous Devices (2009). http://www.khronos.org/developers/library/overview/opencl_overview.pdf

[12] Y. Liu, E. Z. Zhang, and X. Shen. A cross-input adaptive framework for GPU programs optimization. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2009.

[13] A. Munshi et al. The OpenCL specification, version 1.0, revision 48. Technical report, Khronos OpenCL Working Group (2009).
[14] USQCD. http://www.usqcd.org/software.html