3.0 Graphics Processing Units (GPU)

Today's commodity graphics cards are built upon a programmable, data-parallel architecture that in many cases can outperform the CPU in raw computational rate. For example, the Pentium 4 processor, when using the SIMD-based SSE instruction set, has a theoretical peak performance of around 6 GFLOPS. In comparison, the NVIDIA GeForce FX 5900 card has a theoretical rate of approximately 20 GFLOPS (see footnote 1). In this section we provide a brief high-level overview of the graphics processing unit (GPU) architecture and its programmable features.

The main purpose of the GPU architecture is to convert an internal representation of a three-dimensional scene into a two-dimensional image that can be displayed on screen. This is achieved with a series of coordinate transformations. Figure 1 presents the basic architecture used to carry out these transformations. In the first stage, the CPU sends a series of commands and data/geometry to the GPU via the system bus (e.g. the AGP interface). From this point, the vertex engines transform the three-dimensional geometry representation into the two-dimensional space of the physical screen coordinates. Finally, the pixel engines produce each of the individual pixels that are visible on the screen (see footnote 2). The latest graphics cards use GPUs that can process up to eight pixels per clock cycle; this is shown as the SIMD stages in Figure 1.

The next key feature for our proposed effort is the local memory (texture memory) available directly on the graphics card. To use graphics hardware as a powerful coprocessor for computational algorithms, the hardware must have access to the scientific data. While the texture memory of the graphics card is primarily intended for texture mapping in computer games, it can also be used to store scientific data in a 32-bit floating-point format.
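As a rough illustration of how scientific data might be laid out in texture memory, the following Python sketch packs a flat list of values into four-channel (RGBA) texels at 32-bit floating-point precision. The function names `pack_rgba_float32` and `unpack_rgba_float32`, the row-major layout, and the zero padding are assumptions made for illustration only; the proposal does not specify a layout, and real code would upload the packed data through a graphics API.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pack_rgba_float32(values, width):
    """Pack a flat sequence into rows of RGBA texels (hypothetical layout).

    Each texel holds four consecutive values and the last row is
    zero-padded. Returns (texels, height), where texels is a list of
    rows and each row is a list of 4-tuples.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    per_row = 4 * width
    height = max(1, -(-len(values) // per_row))  # ceiling division
    padded = [to_float32(v) for v in values]
    padded += [0.0] * (height * per_row - len(padded))
    texels = []
    for r in range(height):
        row = padded[r * per_row:(r + 1) * per_row]
        texels.append([tuple(row[i:i + 4]) for i in range(0, per_row, 4)])
    return texels, height


def unpack_rgba_float32(texels, count):
    """Flatten the texel rows and drop the zero padding."""
    flat = [c for row in texels for texel in row for c in texel]
    return flat[:count]


# Ten sample values stored in a texture that is two texels wide.
data = [0.5 * i for i in range(10)]
texels, height = pack_rgba_float32(data, width=2)
restored = unpack_rgba_float32(texels, len(data))
```

Values that are not exactly representable in single precision, such as 0.1, are rounded on the way in, so a round trip through this layout preserves only about seven significant decimal digits.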
While the amount of texture memory is generally small in comparison to the main memory available to CPUs, the texture memory is based on a 256-bit wide bus and is capable of bandwidths of up to 30.4 GB/sec (see footnote 3).

The most powerful feature of the latest generation of GPUs is that the exact steps taken by the vertex and pixel engines can be programmed directly by the user. This is the key to our proposed effort. Via this mechanism it is possible to use GPUs in many ways beyond those for which they were originally designed. Many examples of the latest research efforts can be found at the General-Purpose Computation Using Graphics Hardware web site.

Finally, it is important to note that GPU programming is extremely cost effective relative to the computational power available. Most graphics cards capable of the performance and features discussed above cost only $300 to $500. In addition, the cards have a technology update cycle of about six months, allowing faster and faster technology to be leveraged at a rate that has outpaced Moore's Law for the last several years.

Footnote 1: There have been some unverified claims of performance rates as high as 200 GFLOPS, but we have taken a conservative approach in this proposal.
Footnote 2: Note that there are several steps and details missing from this description, but for the purposes of this discussion, these are the basic concepts that are important to the proposed effort.
Footnote 3: Based on the current specifications of the NVIDIA GeForce FX 5900 hardware.

Figure 1. A high-level diagram of the GPU architecture.

Given the high-bandwidth texture memory on the card and the ability to program the pixel engines directly, it is possible to write small programs that manipulate data and/or compute derived values. However, the GPU programming environment is limited in terms of features. In spite of the limited instruction set available on GPUs, and the manufacturer-specific coding required to achieve results, the scientific community has made several important advances towards using the GPU as a general-purpose scientific coprocessor. Several articles have appeared in recent years reporting use of the GPU for a small set of scientific computations. These include numerical solution of PDEs [1-3], implementations of sparse and dense linear algebra algorithms [4-6], and conjugate gradient and multigrid solvers on graphics hardware [7,8]. However, in most cases these applications were programmed in assembler and painstakingly hand-optimized to obtain the published results. In addition, the Brook project at Stanford is researching the use of GPUs for general-purpose computation by introducing a new streaming programming language. The majority of other efforts have focused on computer graphics related topics.

The power of the GPU can also be used to perform in situ manipulation of data before displaying it on screen. For example, we have shown that, given two three-dimensional data sets that represent the density and pressure values computed by a simulation, it is possible to write a program that computes entropy from these two values and runs entirely on the GPU. Our recent studies have shown that such computations can run five to six times faster on the GPU than on the CPU.
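The entropy example follows a pattern that maps naturally onto the pixel engines: one small program is applied independently to every element of the input data. The Python sketch below imitates that pattern on the CPU. The entropy expression ln(p / rho^gamma), the adiabatic index of 5/3, and the names `entropy_kernel` and `map_kernel` are assumptions chosen for illustration; the proposal does not state which formula or interface was actually used, and the three-dimensional data sets are represented here as flattened lists.

```python
import math

GAMMA = 5.0 / 3.0  # assumed adiabatic index (monatomic ideal gas)


def entropy_kernel(density, pressure, gamma=GAMMA):
    """Per-element kernel: assumed entropy measure ln(p / rho**gamma)."""
    return math.log(pressure / density ** gamma)


def map_kernel(kernel, *streams):
    """Apply a kernel independently to each element of equal-length streams.

    This stands in for the GPU running the same small program on every
    pixel; no element depends on any other element.
    """
    n = len(streams[0])
    if any(len(s) != n for s in streams):
        raise ValueError("all input streams must have the same length")
    return [kernel(*args) for args in zip(*streams)]


# Flattened sample fields; a real data set would hold one value per grid cell.
density = [1.0, 2.0, 0.5]
pressure = [1.0, 2.0 ** GAMMA, 4.0]
entropy = map_kernel(entropy_kernel, density, pressure)
```

Because each output value depends only on the matching density and pressure values, the work can be spread across however many pixel pipelines the hardware provides.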

Although the field is young, it is moving rapidly. The previous efforts mentioned above represent only a small portion of the effort required to use GPUs for general-purpose computing. At the time of this writing, there is no general-purpose high-level programming tool available that leverages the power of the GPU. Given this state of affairs, we think the time is right to investigate the potential benefits of GPU computing for the complex, large-scale numerical simulations of interest to the Los Alamos community.

Computational science at Los Alamos has a rich history. We have a number of codes that are critical for programmatic work. Any increase in the efficiency of these codes directly benefits the mission of the Laboratory.

We propose to develop a compiler and runtime framework that will enable applications to use GPUs for computationally intensive operations. In addition, we will identify a set of computational kernels common to a number of Los Alamos applications and, using the framework, evaluate the performance of the kernels, both in isolation and within increasingly realistic application environments. The framework will provide an Application Program Interface (API) that will enable programming in a higher-level language, such as Fortran or C/C++. The back end of the calculation will be handled by a set of libraries that will first probe the hardware to determine what GPU, if any, is present, and then offload instructions to the GPU to reduce the load on the CPU. The net result is that potentially any code at Los Alamos could benefit from our work.

3.1 GPU Research Plan

It is important to note that the current GPU programming environment can be very limited in terms of the features that many codes and developers are used to relying on:

- Each card is unique in its instruction set.
- Only one manufacturer, NVIDIA, has released a compiler containing general-purpose calls for the GPU.
- There are no branching instructions.
- There are limited numbers of registers (known as temporaries).
- There are limits on the total number of instructions that a program may contain.
- The internal architecture is usually based entirely on single-precision 32-bit floats.

Although the GPU may seem too restrictive, each new generation of graphics hardware either raises these limits or removes them entirely. It is also worth noting that several projects have successfully used the GPU for tasks that were once thought too difficult. It is therefore important for us to keep up to date with the latest hardware technology and trends in order to provide optimized support for the graphics card architectures.

One other area of key concern is the limitation of the supporting PC architecture in terms of memory bandwidth. In general, data transfers over the AGP bus have been plagued by unbalanced read and write performance. AGP graphics hardware is theoretically capable of rates reaching up to 1 GB/sec. Write performance has been measured at rates in the neighborhood of 700 MB/sec while reading data back from the graphics card usually runs at only MB/sec. New driver software enhancements and the introduction of the PCI Express bus architecture, due in the Spring of 2004, show promising improvements in this area that will hopefully balance the read/write bandwidths at rates approaching 2 GB/sec. We believe that in spite of the current limitations, we can still develop an efficient system for leveraging the GPU. However, it will be critical to refresh hardware regularly to keep up with the fast-paced growth of GPU technology and obtain the best results from this proposed effort.

In addition to keeping up with the latest graphics card technologies, computationally relevant kernels and algorithms will be implemented using the co-processing framework described above, and hence will influence the development of the framework. Implementation and evaluation of the kernels and algorithms will occur in a staged fashion.

Stage 1:

1. Applications will be analyzed to identify computationally intensive kernels that could benefit from implementation on GPUs. Likely examples are dense and sparse matrix-vector multiplication, symmetric weighted Gauss-Seidel (SSOR), computational geometry operations such as intersection calculations, etc.
2. These kernels will be implemented, and their performance evaluated, on available GPU hardware.

At this point the results will be analyzed to determine whether further efforts are warranted. Assuming promising results are obtained, we will proceed to Stage 2. Note that at this point it may also be appropriate to consider alternative or restructured kernels or algorithms for implementation, particularly if results are disappointing.

Stage 2:

1. Larger components of applications will be chosen for implementation on GPUs, again utilizing and driving development of the co-processing framework. Likely candidates for implementation and evaluation are preconditioned Krylov subspace iterative methods for solution of linear systems of equations, Sweep3D, and more complex computational geometry operations.
2. As above, these components will be implemented, and their performance evaluated, on available GPU hardware.

Again, results at this point will need to be assessed to determine whether to proceed. If the results continue to be promising, we should be poised to use one of the components implemented above in a relevant, realistic application. This will represent Stage 3. One possible path is to use UbikSolve, a modern Fortran library of preconditioned Krylov subspace iterative methods, in Stage 2 above. Since this component represents a large portion of the computational effort in the Truchas casting simulation code (part of the Telluride Project), we could evaluate the performance of a modified Truchas that uses a GPU-enabled UbikSolve implementation.
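To make the Stage 1 and Stage 2 targets more concrete, the Python sketch below shows a sparse matrix-vector product in compressed sparse row (CSR) form and an unpreconditioned conjugate gradient iteration built on top of it. This is a CPU reference written for illustration, not the UbikSolve interface or a GPU implementation. The function names and the CSR argument layout are assumptions, and the preconditioner that the proposal mentions is deliberately left out to keep the sketch short.

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A (no preconditioner).

    Returns (x, iterations). Iteration stops once the residual 2-norm
    is at most tol or max_iter steps have been taken.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for iteration in range(max_iter):
        if rs_old ** 0.5 <= tol:
            return x, iteration
        ap = csr_matvec(indptr, indices, data, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


# 3x3 tridiagonal test matrix [[4, -1, 0], [-1, 4, -1], [0, -1, 4]].
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
b = csr_matvec(indptr, indices, data, [1.0, 2.0, 3.0])
solution, iterations = conjugate_gradient(indptr, indices, data, b)
```

Almost all of the arithmetic sits in `csr_matvec`, `dot`, and the element-wise vector updates, each of which is data parallel. That is why a Stage 1 kernel such as sparse matrix-vector multiplication is a natural building block for the Stage 2 solver components.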

Detailed tasks and timelines follow.

FY04/Q3
1. Study GPU architecture and optimization
2. Identify application kernels to target (matrix-vector multiplication, etc.)
3. Investigate GPU development environments for application kernel implementation (e.g. Cg, Brook, OpenGL Shading Language, etc.)

FY04/Q4
1. Implement kernels
2. Evaluate kernel performance
3. Write report summarizing FY04 results

FY05
1. Start co-processing framework (CPF) incorporating parsers, compilers, runtime systems, optimizations
2. Integrate kernels into larger components such as Sweep3D, preconditioned Krylov subspace iterative methods (e.g. UbikSolve), multigrid solvers (e.g. the MG kernel from the NAS benchmark suite), etc.
3. Evaluate component performance
4. Begin investigation of parallel scaling issues
5. Write report summarizing FY05 results

FY06
1. Use co-processing framework to demonstrate improved performance of a full-scale scientific simulation of interest to the Los Alamos computational physics community.
2. Make framework available to scientific community.
3. Provide recommendations on viability of GPU co-processing for next-generation scientific computing architectures.

To summarize, the goal of this proposed effort is to explore the power of commodity GPUs as co-processors for computational science. This effort will contribute significantly to the current state of the art, provide standardization, and position Los Alamos as a leader in scientific computing using GPUs.