Conjugate Gradients on Multiple GPUs


Serban Georgescu 1, Hiroshi Okuda 2

1 The University of Tokyo, Department of Quantum Engineering and Systems Science, Hongo, Bunkyo-ku, Tokyo, Japan
2 The University of Tokyo, Research into Artifacts, Center for Engineering (RACE), 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba, Japan

Abstract. A GPU-accelerated Conjugate Gradient solver is tested on eight matrices with different structural and numerical characteristics. The first four matrices are obtained by discretizing the 3D Poisson equation, which arises in many fields such as computational fluid dynamics, heat transfer and so on. Their relatively low bandwidth and low condition numbers make them ideal targets for GPU acceleration. We chose another four matrices from the other end of the spectrum, both ill-conditioned and with very large bandwidth. This paper concentrates on the computational aspects of running the solver on multiple GPUs. We develop a fast distributed sparse matrix-vector multiplication routine using optimized data formats which allow the overlapping of communication with computation and, at the same time, the sharing of some of the work with the CPU. Through a thorough analysis of the time spent in communication and computation we show that the proposed overlapped implementation outperforms the non-overlapped one by a large margin and provides almost perfect strong scalability for large Poisson-type matrices. We then benchmark the performance of the entire solver, using both double precision and single precision combined with iterative refinement, and report up to 22x acceleration when using three GPUs as compared to one of the most powerful Intel Nehalem CPUs available today. Finally, we show that using GPUs as accelerators not only brings an order of magnitude speedup but also up to a 5x increase in power efficiency and over a 10x increase in cost effectiveness.

Keywords: GPGPU, conjugate gradients, communication-computation overlapping, Poisson's equation, mixed precision

1 INTRODUCTION

Krylov subspace solvers are widely employed for solving large and sparse systems of linear equations. The usual source of such matrices is the discretization of partial differential equations (PDE) over geometric domains, represented as meshes, employed in fields such as structural analysis or computational fluid dynamics (CFD). For matrices which are symmetric and positive definite, the Conjugate Gradient (CG) method [1] is the usual choice. Composed of vector-vector (BLAS1) and sparse matrix-vector multiplication (SPMV) operations, with most of the time spent in the latter, CG solvers are highly memory bound and have historically been able to achieve only a small fraction of CPU peak [2]. To answer the need for more computing power and, more importantly, more bandwidth, the High Performance Computing (HPC) community has turned its attention to accelerators like Graphics Processing Units (GPU), CELL processors and Field Programmable Gate Arrays (FPGAs). Among these, being powerful, mass produced and already present in most machines, GPUs are arguably the most promising. The current generation of GPUs from both NVIDIA and AMD provides close to 2 TFlop/s of single precision performance, for dual-GPU models like the NVIDIA GeForce 295GTX or the AMD Radeon HD 4870 X2, and sustained bandwidths in the range of GB/s. Moreover, both performance and bandwidth seem to be doubling with every new generation.
To make this possible, GPUs feature vector-like architectures, with hundreds of cores grouped into SIMD units, connected to a global memory of up to 4GB in size via wide memory buses. GPUs use concurrency, supporting tens of thousands of threads in flight, to cover the large latency to the global memory. In the last few years, thanks to technologies like CUDA [3], GPUs have become extremely programmable and can be treated like general purpose processors. Furthermore, last-generation high-end models provide native support for double precision operations, albeit much slower than the single precision ones. Since around the early 2000s, when GPUs became programmable enough to be used for tasks other than rendering, researchers have been using GPUs to accelerate applications ranging from matrix solvers to database applications (see [4,5] and the references therein for a review).

In the context of iterative solvers, various implementations have been reported since, with CG solvers appearing first in [6] and later in [7,8,10]. In particular, [8] is the first to investigate accuracy issues resulting from the use of single precision on GPUs, with focus on quasi-double precision (emulated double precision) and iterative refinement.

In this paper we present a GPU-accelerated CG solver running on multicore processors and multiple GPUs. The performance and accuracy are benchmarked by solving Poisson's equation in 3D, at various resolutions, plus some additional matrices taken from the University of Florida Sparse Matrix Collection (UFSPARSE) [9]. The need to solve Poisson's equation arises in many fields, such as CFD, where semi-implicit or explicit formulations of the incompressible Navier-Stokes equations lead to the pressure-Poisson equation, in steady-state heat dissipation, transport phenomena, electrostatics, particle-based flows used in computer graphics, and many others. For structured meshes, Krylov subspace solvers are usually outperformed by multigrid (MG) solvers [11]. However, MG solvers have had limited success on unstructured meshes [12]. Unstructured meshes have become quite popular in fields such as CFD or structural analysis, due to the possibility of expressing complex geometries directly exported from CAD software using fully automatic unstructured triangular and tetrahedral grid generators. In such cases, as the resulting matrix is large, sparse and symmetric positive definite, CG solvers have been successfully utilized [10,12]. In [8], Poisson's equation is solved with both CG and MG, with both quasi-double precision and iterative refinement. Compared to [8], this paper focuses on optimizing the SPMV implementation using unstructured matrix storage and, when more than one GPU is employed, on minimizing the impact of communication by fully or partially overlapping it with useful computation. A CG-based Poisson solver running on a GPU was also reported in [10]. However, much has changed in both GPU hardware and software since then. Furthermore, their shader-based implementation is limited to one GPU and uses single precision alone.

This paper makes the following contributions:
- Provides an updated performance benchmark for CG solvers run on the most powerful CPU and GPUs to date;
- Proposes a way of implementing an optimized data format which allows the overlapping of communication with computation while performing SPMV on multiple GPUs and, additionally, provides the opportunity of offloading some of the computation to the CPU;
- Analyses the efficiency of using iterative refinement as a way of accelerating solvers on GPUs which support native double precision;
- Shows not only improvements in raw performance but also in power efficiency and cost effectiveness.

This paper is structured as follows. In Section 2 we define our testing problem and hardware environment. Detailed implementation issues, such as the proposed distributed sparse matrix-vector multiplication (SPMV) implementation, are discussed in Section 3. After showing the performance results in Section 4, we analyze the power efficiency and cost effectiveness in Section 5. Finally, we draw the conclusions in Section 6.

2 TEST SCENARIO

We test the performance of the CG solver for Poisson's equation of the form

    -Δu = f

where Δ represents the Laplace operator and u and f denote the unknown and the given right-hand side (e.g., an external force), respectively.
Zero Dirichlet boundary conditions are considered on the entire boundary. To test both the correctness of the implementation and the accuracy of the solver we start from a known analytical solution

    u(x, y, z) = e^{x+y+z} x(1-x) y(1-y) z(1-z)

which leads to the RHS

    f(x, y, z) = e^{x+y+z} xyz (y(z-5) - 5z + x(3zy + y + z - 5) + 9)
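For completeness, the expression for f can be checked (this short derivation is ours, not part of the original text) by noting that u factorizes into identical one-dimensional factors:

\[
u(x,y,z) = p(x)\,p(y)\,p(z), \qquad p(t) = e^{t}\,t(1-t), \qquad p''(t) = -e^{t}\,t(t+3),
\]
\[
-\Delta u = e^{x+y+z}\,xyz\,\bigl[(x+3)(1-y)(1-z) + (y+3)(1-x)(1-z) + (z+3)(1-x)(1-y)\bigr],
\]

and expanding the bracket gives 3xyz + xy + xz + yz - 5x - 5y - 5z + 9, which is exactly the expression for f above.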

The equation is discretized on a uniform grid of increasing size (64^3, 128^3, 192^3 and 256^3) using the standard 7-point stencil. For a spatial discretization of order N per dimension, the order of the matrix is N^3. Let us define the diagonal of order d of an n-by-n matrix A = (a_ij), 1 ≤ i,j ≤ n, as the sequence of elements of the form a_{i,i+d} with 1 ≤ i ≤ n. The 3D Poisson matrix is composed of seven such diagonals, with orders 0 (the main diagonal), -1, 1, -N, N, -N^2 and N^2. For the case N = 3, the matrix is shown in Fig. 1 (right), while the general form of the matrix is shown in Fig. 1 (left). As the resulting matrix is structured, one could in fact use a specialized (diagonal) storage format and a solver like MG. However, for the purpose of this study we choose to ignore these properties and treat the matrix as unstructured instead. This allows us to build the matrix directly rather than having to start from a mesh. Moreover, as such a matrix has a clear structure, our experiments become trivial to reproduce. This is unlike the unstructured case, where the results strongly depend on the mesh being used.

Fig. 1. Left: non-zero profile of Poisson's matrix resulting from a 3D finite difference discretization. Right: the actual matrix for N = 3 (empty squares represent zero values), ignoring the 1/h^2 factor.

Further tests are conducted on four additional unstructured matrices chosen from the University of Florida Sparse Matrix Collection. Their structure is very different from that of the banded and diagonally dominant Poisson matrix, while their condition numbers are much higher. These properties make them much more challenging to solve, both from a computational point of view (e.g., poor cache behavior, increased communication when using multiple GPUs) and with regard to convergence. The pattern of non-zero elements is shown in Fig. 2. For these matrices, since the analytical solution is not known, we start from a prescribed desired solution x = 1, from which we compute the right-hand side (RHS) b as b = Ax. We then run the solver for this RHS and compare the obtained solution with the desired one. The important properties of the matrices considered in this paper are summarized in Table 1. The memory required to run the solver has been computed by adding to the memory required to store the matrix the memory needed by the solution, the RHS and the temporary vectors that appear in the CG solver, as formulated in Fig. 3. The condition number has been estimated directly from the CG solver, via the Lanczos connection [14].

The hardware used in our tests is listed in Table 2. On the CPU side we used one of the fastest Intel Nehalem processors on the market today, the Core i7 975 Extreme Edition. Nehalem processors, with their integrated triple-channel memory controller and support for DDR3 memory, provide a very large jump in performance over previous multi-core CPUs (e.g., the Core2 Duo or Core2 Quad CPUs) for bandwidth-limited applications like the CG solver. On the GPU side we used the NVIDIA 280GTX and the NVIDIA 295GTX, two powerful GPUs with native double precision support, and the lower-priced but less powerful NVIDIA GeForce 9800GTX+, which lacks support for double precision. The NVIDIA 295GTX, currently the most powerful GPU from NVIDIA, is composed of two GPUs sandwiched together. All kernels were compiled using the Intel Compiler 11.0 and NVIDIA CUDA 2.3 and run on 64-bit Linux.
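As an illustration of how such a matrix can be built directly, with no mesh involved, the following is a minimal C++ sketch of assembling the 7-point Poisson matrix in CSR format. The CsrMatrix layout and all names are ours, not the authors' code.

#include <vector>

struct CsrMatrix {
    int n;                       // matrix order (N^3)
    std::vector<int> row_ptr;    // size n+1
    std::vector<int> col_idx;    // column indices of non-zeros
    std::vector<double> val;     // non-zero values
};

// Assemble the 7-point stencil for an N x N x N grid with spacing h.
// Zero Dirichlet boundaries: neighbours falling outside the grid are simply dropped.
CsrMatrix build_poisson7(int N, double h)
{
    CsrMatrix A;
    A.n = N * N * N;
    A.row_ptr.assign(A.n + 1, 0);
    const double diag = 6.0 / (h * h), off = -1.0 / (h * h);
    auto id = [N](int i, int j, int k) { return (k * N + j) * N + i; };

    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            int row = id(i, j, k);
            // neighbours listed in order of increasing column: -N^2, -N, -1, 0, +1, +N, +N^2
            int di[7] = { i, i, i - 1, i, i + 1, i, i };
            int dj[7] = { j, j - 1, j, j, j, j + 1, j };
            int dk[7] = { k - 1, k, k, k, k, k, k + 1 };
            for (int s = 0; s < 7; ++s) {
                if (di[s] < 0 || di[s] >= N || dj[s] < 0 || dj[s] >= N ||
                    dk[s] < 0 || dk[s] >= N)
                    continue;                           // Dirichlet boundary
                A.col_idx.push_back(id(di[s], dj[s], dk[s]));
                A.val.push_back(s == 3 ? diag : off);   // s == 3 is the grid point itself
            }
            A.row_ptr[row + 1] = (int)A.col_idx.size();
        }
    return A;
}

The seven diagonals listed above appear here as the seven candidate neighbours per row; dropping the 1/h^2 factor (as in Fig. 1) amounts to calling the routine with h = 1.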
We tested the performance of the CG solver when run in native double precision (64-bit) and in single precision (32-bit) combined with iterative refinement.

Fig. 2. Unstructured matrices from UFSPARSE: pwtk, inline_1, bmwcra_1 and ldoor.

Table 1. Properties of the matrices considered in this study

Name | Order | Non-zeros | Solver memory, double (MB) | Solver memory, single (MB) | Condition number
Poisson 64 | 262,144 | 1,826,… | | | 1,708
Poisson 128 | 2,097,152 | 14,647,… | | | 6,740
Poisson 192 | 7,077,888 | 49,471,… | | | 15,092
Poisson 256 | 16,777,216 | …,308,926 | 2,174 | 1,343 | 26,765
pwtk | 217,918 | 46,522,… | | | …E+07
inline_1 | 503,712 | 36,816,… | | | …E+08
bmwcra_1 | 148,770 | 10,644,… | | | …E+06
ldoor | 952,203 | 46,522,… | | | …E+06

Table 2. Hardware specifications

Hardware | Model | Motherboard | Memory
CPU | Nehalem 3.2GHz | Asus P6T WS Rev. | TR3X6G1600C8D (DDR3 @1600)
GPU | 295GTX: EVGA GeForce GTX 295 CO-OP | Asus P6T WS Rev. | 2x1GB GDDR3
GPU | 280GTX: MSI N280GTX-T2D1G-OC | Asus P6T WS Rev. | 1GB GDDR3
GPU | 9800GTX+: GALAXY GF P98GTX+ | Asus P6T WS Rev. | 512MB GDDR3

Iterative refinement is a method by which double precision accuracy can be obtained by combining an inner solver working in single precision, in this case a CG solver, with a small number of outer correction iterations performed in double precision. Iterative refinement works for matrices which are not too ill-conditioned (i.e., with a condition number of at most O(10^8)). Iterative refinement can be used even when native double precision support is available, in order to speed up the solver [15]. For detailed information regarding this method we direct the reader to [8].

3 IMPLEMENTATION DETAILS

In this section we describe implementation details for the left-preconditioned CG method, using the formulation shown in Fig. 3. For each iteration (lines 7-16), the method uses two dot product (DOT) operations (lines 8 and 13), three vector updates (two of AXPY type in lines 9 and 10 and one of AYPX type in line 15), one SPMV operation in line 7 and a preconditioning operation in line 11. The SPMV is by far the most expensive operation and, depending on which kind of preconditioner is being used, a fair amount of time can also be spent in preconditioning (e.g., a significant percentage for a SOR or ILU preconditioner). For preconditioning, in this paper we use only diagonal scaling. Although not a very efficient preconditioner, diagonal scaling is straightforward to implement, embarrassingly parallel and works acceptably well for not-so-difficult problems.

01. i = 0
02. r = b - Ax
03. d = M^{-1} r
04. δ_new = r^T d
05. δ_0 = δ_new
06. while i < i_max and δ_new > ε^2 δ_0 do
07.   q = Ad
08.   α = δ_new / (d^T q)
09.   x = x + α d
10.   r = r - α q
11.   s = M^{-1} r
12.   δ_old = δ_new
13.   δ_new = r^T s
14.   β = δ_new / δ_old
15.   d = s + β d
16.   i = i + 1
17. end

Fig. 3. The PCG algorithm.

In order to port a CG solver to the GPU, one needs to write kernels for DOT, AXPY, AYPX, diagonal scaling and SPMV. The DOT and AXPY kernels, in both single and double precision, are directly available from NVIDIA's CUBLAS library. AYPX and diagonal scaling are not included in CUBLAS, but they are straightforward to implement (see the sketch below). The only difficulty lies in implementing the SPMV kernel which, when using diagonal scaling alone, represents 90% or more of the total time spent in the CG solver. On the CPU side we use the functions provided by Intel MKL 11.0, except for the AYPX kernel and diagonal scaling, which we implement ourselves.

Depending on the structure of the matrix and the architecture of the processor the SPMV operation is executed on, one may choose from a wide range of sparse matrix formats (we refer to [2,16] for an extensive review). The importance of the sparse format employed in GPU implementations cannot be overstated. A benchmark of the SPMV operation on a large collection of matrices has shown that a fivefold increase in performance over the unmodified CSR format, the format of choice for most codes dealing with unstructured meshes, can be achieved by using a format optimized for the GPU. Such a format, which combines the very structured ELLPACK/ITPACK (ELL) format with the flexible Coordinate (COO) format, has been proposed in [16]. The ELL format, originally created for vector machines, is intended for matrices which have a relatively constant number of non-zeros per row.
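Returning for a moment to the BLAS1-type kernels mentioned above: AYPX (y = x + β·y, line 15 of Fig. 3) and diagonal scaling (line 11) reduce to one operation per vector element. A minimal CUDA sketch, ours rather than the authors' code, could look as follows.

// AYPX: y <- x + beta * y (double precision)
__global__ void aypx(int n, double beta, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + beta * y[i];
}

// Diagonal (Jacobi) scaling: s <- M^{-1} r, with M = diag(A) stored as its inverse
__global__ void diag_scale(int n, const double* inv_diag, const double* r, double* s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        s[i] = inv_diag[i] * r[i];
}

// example launch: aypx<<<(n + 255) / 256, 256>>>(n, beta, d_s, d_d);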

If this is the case, then all non-zero values can be stored in one large block, which makes the accesses to them regular, hence easy to coalesce and thus efficient. However, the format incurs very large fill for matrices where the distribution of non-zeros per row has large variations. The idea behind the format proposed in [16] is to store the regular part of the matrix in ELL format and the irregular part in a more flexible format, in that case the COO format.

Fig. 4. The ELL-PCSR format. The rectangles represent matrix rows, with the hashed part showing non-zero elements. At the cost of some fill, marked with F, the regular part of the matrix is stored as a block, in ELL format. The remaining non-zeros are stored in PCSR format.

We implemented a variation of this format, in which the COO part is replaced by padded CSR (PCSR), in order to make it usable on older GPUs that lack support for the atomic operations required for an efficient implementation of the COO format. The CSR format is padded so that the number of elements in each row is a multiple of four, which increases the degree of coalescing. Hereafter, we will refer to this sparse format as the ELL-PCSR format. An illustration of the format is shown in Fig. 4; a kernel-level sketch is given after Fig. 5. On average, over the many matrices we have tested, the double precision SPMV performance for large matrices (i.e., with O(10^7) non-zeros or more) reached only 2 GFlop/s for the unmodified CSR format, close to 5 GFlop/s for PCSR and around 11 GFlop/s for the ELL-PCSR format on a 280GTX GPU. These results are shown in Fig. 5.

Fig. 5. Average double precision SPMV performance for the unmodified CSR format, PCSR and ELL-PCSR for matrices taken from UFSPARSE, on a 280GTX GPU. Small matrices have fewer than 10^5 non-zeros, medium ones have between 10^5 and 10^7, while large ones have more than 10^7.
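To make the storage scheme concrete, here is a minimal CUDA sketch of the two kernels behind an ELL-PCSR SPMV. The array names, the column-major ELL layout and the zero-padding convention are our assumptions, not the paper's code.

// ELL part: a dense num_rows x max_cols block of values and column indices, stored
// column-major so that consecutive threads (rows) read consecutive addresses.
__global__ void spmv_ell(int num_rows, int max_cols,
                         const int* ell_col, const double* ell_val,
                         const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int c = 0; c < max_cols; ++c) {
            int idx = c * num_rows + row;   // column-major layout for coalesced loads
            double v = ell_val[idx];        // padded slots hold the value 0 and a valid column
            sum += v * x[ell_col[idx]];
        }
        y[row] = sum;                       // overwrite: the ELL kernel runs first
    }
}

// PCSR part: the leftover irregular non-zeros, in CSR padded to a multiple of 4 per row.
__global__ void spmv_pcsr(int num_rows, const int* row_ptr, const int* col_idx,
                          const double* val, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];  // padding entries are explicit zeros
        y[row] += sum;                      // accumulate on top of the ELL result
    }
}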

The SPMV operation is performed by multiplying all non-zero elements of the matrix with their corresponding elements of the source vector. In the case of distributed SPMV using GPUs, the matrix, the source vector and the result vector are distributed across multiple GPUs. The part of the result vector belonging to each GPU is filled by multiplying all non-zero matrix elements owned by the respective GPU with their corresponding source vector elements. For each GPU, although all of its non-zero matrix elements are available, some elements of the source vector are not. Before SPMV can take place, these elements must be received from the GPU that owns them. In Fig. 6 we show how a distributed SPMV (y = Ax) is performed in the case of two GPUs and N = 3. The 27-by-27 matrix A is partitioned between the two GPUs, with 13 rows on the first and 14 rows on the second. The source and result vectors are partitioned to match. In order to perform its share of the SPMV operation, GPU1 needs elements x_14 to x_22 from GPU2 for processing rows 5 to 13. In the same way, GPU2 needs elements x_5 to x_13 from GPU1 for processing rows 14 to 22. SPMV can only be performed after these elements are exchanged. Implementing SPMV on a multi-GPU system must therefore follow the three steps below (a host-side sketch is given after Fig. 6):

1. Copy the data to be sent to other GPUs to host memory;
2. Copy the needed data from host memory to the GPUs;
3. Perform SPMV on each GPU.

Steps 1 and 2 are necessary since CUDA currently does not allow peer-to-peer communication between GPUs and, therefore, all transfers must be made indirectly via the memory of the host. As data must be copied to and from each GPU for every CG iteration, this can become a bottleneck. Fortunately, for most problems coming from the discretization of PDEs, communication can be overlapped with useful computation. This is possible since, after performing domain decomposition, the number of vector elements that do not require communication (local elements) scales with the volume of the partition, while the number of vector elements that do require communication (shared elements, i.e., the ones placed on the boundary between domains) scales with its surface. Thus, for large matrix sizes, the number of local elements is an order of magnitude larger than the number of shared elements. As such, if asynchronous communication is used, one can do the computations for the local nodes while exchanging data for the shared nodes, and sum the results when both operations are finished. For the Poisson matrix considered here, the number of elements that must be exchanged between two PEs is equal to the half-bandwidth of the matrix. For the type of partitioning we employ, all GPUs but the first and the last must exchange this amount of elements with their two nearest neighbors. The first and the last GPU have only one closest neighbor, with whom they exchange the same amount of data.

Fig. 6. 3D Poisson matrix for N = 3, partitioned between two GPUs. Missing source vector elements must be exchanged indirectly, in two steps, via the host memory.
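A host-side sketch of these three steps, without any overlapping, is shown below. The bookkeeping (per-GPU device pointers, send/receive offsets and counts) is hypothetical, the host-side routing of gathered entries between buffers is omitted, and driving all GPUs from a single host thread via cudaSetDevice assumes a more recent CUDA runtime than the CUDA 2.3 used in the paper (which required one host thread per GPU).

// Step 1: for every GPU, download the source-vector entries its neighbours will need.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpy(h_send[g], d_x[g] + send_off[g],
               send_cnt[g] * sizeof(double), cudaMemcpyDeviceToHost);
}
// ... host-side routing of h_send into the neighbours' h_recv buffers ...

// Step 2: upload the missing ("halo") entries; d_x[g] is assumed to have room for
// the locally owned entries followed by the halo entries.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpy(d_x[g] + halo_off[g], h_recv[g],
               recv_cnt[g] * sizeof(double), cudaMemcpyHostToDevice);
}

// Step 3: only now can each GPU run its share of the SPMV (ELL-PCSR kernels from above).
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    spmv_ell <<<grid[g], threads>>>(n_rows[g], max_cols[g], ell_col[g], ell_val[g], d_x[g], d_y[g]);
    spmv_pcsr<<<grid[g], threads>>>(n_rows[g], row_ptr[g], col_idx[g], csr_val[g], d_x[g], d_y[g]);
}

Every CG iteration pays for steps 1 and 2 before any useful GPU work starts, which is exactly the bottleneck the overlapped scheme described next removes.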

We wish to point out that the second source of communication in the CG method, the DOT operation, cannot be overlapped with useful computation without modifications to the CG method itself. Alternative formulations where this becomes possible have been proposed but, in general, they are less robust than the classical formulation. For details we refer to [17]. However, when all GPUs are inside one node, there is no additional communication cost, as the result of the DOT operation will always be on the host, where the final reduction takes place.

In order to overlap communication with computation, the processing of the local elements must be done in parallel with the exchange and processing of the shared ones. Fortunately, recent versions of CUDA allow one to copy data to/from the GPU memory while a kernel is executing, using streams. However, CUDA does not allow the simultaneous execution of two kernels on the same GPU, which makes it impossible to process both local and shared data on the GPU at the same time. Thus, the only way a complete overlap can be achieved is by processing the shared elements on the CPU. To achieve this, we implemented a data structure where only the non-zeros corresponding to the local elements are stored on the GPU, in ELL-PCSR format, while the rest of the non-zeros, corresponding to shared elements, are left in the host memory, stored in CSR format. Using this data structure, the distributed SPMV is performed in the following way (a sketch using CUDA streams follows this list):

1. Process local data (in parallel):
   (a) Perform SPMV on the local elements, on all GPUs.
2. Process shared data (in parallel):
   (a) Copy the needed source vector elements from the GPUs to host memory;
   (b) Perform SPMV on the shared elements, using the CPU;
   (c) Copy the result of the CPU computation from host memory to the GPUs.
3. When 1 and 2 are both finished, sum up the results.

The process is also illustrated in Fig. 7. We show the N = 3 case in Fig. 8.

Fig. 7. Overlapping communication with computation in SPMV.

The sequence of steps above works well provided that the time spent computing the SPMV on the GPU is larger than the time spent communicating and computing the SPMV on the CPU. For a good partitioning, this happens if the problem is large enough and the CPU processes the shared elements fast enough. For the Nehalem processor used in this study, for large problem sizes, we have obtained speeds of 2-3 GFlop/s for the unmodified CSR format. As this is around 15x slower than the performance of 3x280GTX GPUs, the part of the matrix stored on the GPUs must be at least 20x larger than the part left on the host to allow for a full overlap. For the matrices considered here, the distribution of the non-zero elements is shown in Fig. 9. While the format of choice on the host is always unmodified CSR, for the part of the matrix stored on the GPU we also show the ratio between the elements stored in ELL format and the ones stored in PCSR.

We note that the proposed implementation has the additional advantage of offloading some of the GPU computation to the CPU. Although not used here, one could in principle assign, depending on the performance of the CPU, as many non-zero elements to the host as needed to obtain almost the same computation time on the CPU and the GPU. In this way the total processing time could be further reduced.
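A sketch of one overlapped SPMV call is given below. It follows the numbered steps above, but the names (streams, buffers, offsets), the two-streams-per-GPU arrangement and the host CSR routine are our assumptions rather than the paper's code. Pinned host buffers are assumed so that cudaMemcpyAsync can actually overlap with the running kernel, the boundary entries of each d_x[g] are assumed to be contiguous (true for the band partitioning used here), and, as before, a single host thread driving all GPUs assumes a recent CUDA runtime.

// summation kernel for step 3: y[row_ids[i]] += y_shared[i]
__global__ void scatter_add(int n, const int* row_ids, const double* y_shared, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[row_ids[i]] += y_shared[i];
}

// 2(a): start downloading the shared source-vector entries immediately, in a copy stream...
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpyAsync(h_x_shared[g], d_x[g] + shared_off[g],
                    shared_cnt[g] * sizeof(double), cudaMemcpyDeviceToHost, copy_stream[g]);
}
// 1(a): ...while the SPMV on the local part runs concurrently in a compute stream.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    spmv_ell <<<grid[g], threads, 0, compute_stream[g]>>>(n_local[g], max_cols[g],
                                                          ell_col[g], ell_val[g], d_x[g], d_y[g]);
    spmv_pcsr<<<grid[g], threads, 0, compute_stream[g]>>>(n_local[g], row_ptr[g], col_idx[g],
                                                          csr_val[g], d_x[g], d_y[g]);
}
// 2(b): once the downloads are done, the CPU multiplies the host-resident CSR blocks
// (the non-zeros referencing shared elements) by the gathered entries.
for (int g = 0; g < num_gpus; ++g) { cudaSetDevice(g); cudaStreamSynchronize(copy_stream[g]); }
host_csr_spmv(h_A_shared, h_x_shared, h_y_shared);   // e.g. MKL or a hand-written CSR loop

// 2(c): upload the CPU partial results back to each GPU.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpyAsync(d_y_shared[g], h_y_shared[g],
                    shared_rows[g] * sizeof(double), cudaMemcpyHostToDevice, copy_stream[g]);
}
// 3: when both the local kernels and the upload have finished, add the partial results.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaStreamSynchronize(compute_stream[g]);
    cudaStreamSynchronize(copy_stream[g]);
    scatter_add<<<grid_sh[g], threads>>>(shared_rows[g], d_row_ids[g], d_y_shared[g], d_y[g]);
}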

Fig. 8. 3D Poisson matrix for N = 3, partitioned between two GPUs for overlapping.

Fig. 9. The distribution of non-zero elements for the matrices used in the benchmarks. The top-most plot shows the distribution for the one-GPU case, or when not storing any elements on the host. The remaining two plots are for overlapped SPMV using two and three GPUs.

4 PERFORMANCE RESULTS

In this section we test the performance of the solver on the matrices shown in Table 1, on the hardware shown in Table 2, using the testing procedure described in Section 2. The CG solver was run in double precision and in single precision with iterative refinement. The stopping criterion was set to 10^{-8} in the case of the Poisson matrices and to a looser tolerance for the rest. To compute the speedup, for both the SPMV operation and the entire solver, we take as reference the performance of the Nehalem CPU in double precision when running on all four cores. In Fig. 10 we show how the SPMV performance increases with the number of cores. Since for the double precision solver we are using Intel MKL, the number of cores is changed via the OMP_NUM_THREADS environment variable. As expected, the best performance is achieved when running on all four cores.

Fig. 10. Double precision SPMV performance on the Nehalem CPU for the matrices used in the benchmarks, with increasing number of cores.

Next, in Fig. 11 we show the performance of the SPMV operation on all types of hardware used in this paper and for all matrices in the collection, in both single and double precision. For the cases where multiple GPUs are used (two and three 280GTX GPUs, and a full 295GTX using both GPUs on board) we report the performance for both the non-overlapped and the overlapped case, with the latter denoted by OV. Entries which are missing from the graphs correspond to matrices which did not fit in the memory of the respective device. We note both the large difference between the performance of the Nehalem CPU and that of the GPUs and the large gap between the non-overlapped and overlapped implementations, especially for the very unstructured UFSPARSE matrices. An additional thing to notice is the almost double performance achieved when running in single precision, which is the reason why we expect to obtain a speedup when using iterative refinement.

One can already notice from Fig. 11 the large difference in performance between the non-overlapped and overlapped SPMV implementations. The reason for this difference is explained in detail in Fig. 12 and Fig. 13, where we show the breakdown of the time spent in one SPMV operation into computation (on both CPU and GPU in the overlapped case, and only on the GPU in the non-overlapped one), communication (denoted by COMM) and the final summation operation (denoted by ADD). While in the non-overlapped case the total time is the sum of the computation time and the communication time, in the overlapped case it is reduced to the sum of the maximum computing time, which can be either on the GPU or on the CPU, and the summation time. The results show that, although the difference in performance is significant even for matrices with low bandwidth, as in the case of the Poisson matrices, there is a huge performance difference for the more unstructured ones, as is the case with the four matrices chosen from UFSPARSE. We note however that using the proposed implementation for such extreme cases can lead to a very large number of non-zero elements being processed by the much slower CPU, which then becomes the bottleneck.

The strong scalability of both the overlapped and non-overlapped SPMV implementations is shown in Fig. 14. Since dividing the same matrix among multiple GPUs reduces the size of the matrix assigned to each GPU and hence degrades performance, we only consider here the largest matrices.
An exception is the largest Poisson matrix, which can only be solved on three GPUs and hence cannot be used to test scalability.

Fig. 11. Single precision and double precision SPMV performance. OV denotes overlapped SPMV, as proposed in this paper.

Fig. 12. SPMV time breakdown for 2x280GTX, double precision. N/OV and OV stand for non-overlapped and overlapped, respectively. The time is for one SPMV call and is expressed in ms.

Fig. 13. SPMV time breakdown for 3x280GTX, double precision. N/OV and OV stand for non-overlapped and overlapped, respectively. The time is for one SPMV call and is expressed in ms.

The overlapped implementation shows very good scalability for the Poisson matrices, coming close to ideal for the largest Poisson case. On the other hand, for the pwtk and inline_1 matrices the scalability is much worse, especially when using three GPUs, because of the large amount of time spent both in communication and in computation on the CPU. However, in both cases the scalability is much better than in the non-overlapped case.

Fig. 14. Strong scalability for one, two and three 280GTX GPUs.

We now move from the SPMV operation to the entire CG solver. Fig. 15 shows the speedup obtained over the Nehalem CPU, running with all four cores and in double precision, for all the GPUs considered here. For the GPUs which support native double precision we run the solver both in native double precision and in single precision combined with iterative refinement, while only the latter mode is used for the 9800GTX+ GPU, which lacks native double precision support. Single precision with iterative refinement is used on the CPU as well. Up to 22x speedup is obtained when using three GPUs in combination with iterative refinement. Detailed information on the time elapsed until convergence is shown in Table 3.

The difference between double precision and single precision combined with iterative refinement, when run on the same hardware and to the same level of accuracy, is further investigated in Fig. 16, where we plot the ratio between the time spent using the former and the time spent using the latter. Hence, a value larger than unity shows that a speedup is obtained when using iterative refinement, even though the hardware supports native double precision. As expected, iterative refinement provides a moderate speedup for well-conditioned matrices like Poisson. On the contrary, for ill-conditioned matrices iterative refinement results in a substantial slowdown, due to the large increase in the number of iterations caused by the loss of direction information during solver restarts.
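To make the last point concrete, the outer correction loop of mixed-precision iterative refinement has roughly the following shape. This is a host-side sketch under our own naming (it reuses the CsrMatrix type from the earlier assembly sketch, and cg_single() stands in for the single precision GPU solver); it is not the authors' implementation. Every outer iteration restarts the inner CG from scratch on the new residual, which is exactly where the search-direction information is lost.

#include <cmath>
#include <vector>

// CsrMatrix as defined in the assembly sketch earlier in this paper's Section 2.

// r = b - A x in double precision; returns ||r||_2.
double csr_residual_norm(const CsrMatrix& A, const double* x, const double* b, double* r)
{
    double nrm = 0.0;
    for (int i = 0; i < A.n; ++i) {
        double axi = 0.0;
        for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            axi += A.val[j] * x[A.col_idx[j]];
        r[i] = b[i] - axi;
        nrm += r[i] * r[i];
    }
    return std::sqrt(nrm);
}

// Stand-in for the single precision GPU CG solver; assumed to keep (or build)
// a single precision copy of A internally.
void cg_single(const CsrMatrix& A, const float* rhs, float* sol, float inner_tol);

// x is improved in place until ||b - Ax|| <= tol * ||b|| or max_outer corrections are spent.
void cg_iterative_refinement(const CsrMatrix& A, const double* b, double* x,
                             double tol, double norm_b, int max_outer)
{
    int n = A.n;
    std::vector<double> r(n);            // double precision residual
    std::vector<float>  r32(n), c32(n);  // demoted residual and single precision correction
    for (int k = 0; k < max_outer; ++k) {
        double res = csr_residual_norm(A, x, b, r.data());  // correction equation RHS
        if (res <= tol * norm_b) break;                     // converged in double precision
        for (int i = 0; i < n; ++i) r32[i] = (float)r[i];   // demote the residual
        cg_single(A, r32.data(), c32.data(), 1e-5f);        // inner solve A c = r (restart!)
        for (int i = 0; i < n; ++i) x[i] += (double)c32[i]; // accumulate correction in double
    }
}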

Fig. 15. Speedup over the Nehalem CPU running in double precision on all cores.

Table 3. Elapsed solver time (in ms) to solution. The convergence criterion was 10^{-8} for the Poisson matrices and a looser tolerance for the rest. IR stands for iterative refinement. Empty cells indicate either failed iterative refinement or a lack of sufficient memory to run the solver.

Fig. 16. Comparison between the time to solution for the solver run in double precision and in single precision with iterative refinement.

5 POWER AND COST EFFICIENCY

Having looked at raw performance in the previous sections, we now turn our attention to the equally important issues of power efficiency and cost effectiveness. In order to compute these values, we make use of the power consumption and cost figures shown in Table 4.

Table 4. Power and cost for the tested hardware

Hardware | Power (W) | Cost (USD)
Core i7 975 | |
295GTX | |
280GTX | 236 |
9800GTX+ | |

First, in Fig. 17 and Fig. 18 we show the power efficiency and cost effectiveness, for both single and double precision, computed in terms of MFlop/s/Watt and MFlop/s/USD, respectively. For the case of multiple-GPU computing, including running on both GPUs inside the 295GTX card, since our overlapping implementation needs to use the CPU as well, we include in the total power (cost) the power (cost) of the Nehalem CPU. For the case when only a single device is being used, the power (cost) of the CPU is not considered. Thus, for example, the power assumed to be consumed when using 2x280GTX GPUs is 2x236 Watt plus the power of the Nehalem CPU, while the power assumed to be consumed for 1x280GTX is 236 Watt.

Fig. 17. Power efficiency of the CPU and GPUs, in double and single precision, expressed in MFlop/s/Watt.

Fig. 18. Cost effectiveness of the CPU and GPUs, in double and single precision, expressed in MFlop/s/USD.

However, since the sheer number of GFlop/s does not give an accurate view of the performance of the solver, especially when considering the increase in iterations due to using iterative refinement, we also compute the increase in power efficiency and cost effectiveness. These quantities, shown in Fig. 19 and Fig. 20, represent the savings (or increase) in power consumption and money associated with the speedup (or slowdown) given by a piece of hardware over the Nehalem CPU running in double precision. Thus, for example, we compute the increase in power efficiency when using 2x280GTX GPUs by the following formula:

    I_{2x280GTX} = (t_Nehalem / t_{2x280GTX}) * Watt_Nehalem / (2 * Watt_280GTX + Watt_Nehalem)

The results show that close to a 5x increase in power efficiency, when using iterative refinement for Poisson-like matrices, and up to an 18x increase in cost effectiveness, when using iterative refinement on cheap GPUs like the 9800GTX+, again on well-structured and well-conditioned matrices like Poisson, can be achieved.

Fig. 19. Power efficiency increase over the Nehalem running in double precision, expressed in Speedup/Watt.
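As a purely illustrative reading of the formula above (the actual figures are those of Table 4): with Watt_280GTX = 236 W as quoted earlier, an assumed Watt_Nehalem of roughly 130 W and a hypothetical 10x speedup, the increase in power efficiency would be

\[
I_{2\times280GTX} \approx 10 \cdot \frac{130}{2 \cdot 236 + 130} = 10 \cdot \frac{130}{602} \approx 2.2,
\]

i.e., about 2.2x less energy for the same solve, even though the instantaneous power draw of the 2-GPU configuration is higher.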

Fig. 20. Cost effectiveness increase over the Nehalem running in double precision, expressed in Speedup/USD.

6 CONCLUSIONS

With a view to running a CG solver efficiently on multiple GPUs, we have proposed a distributed SPMV implementation that keeps on the host all matrix elements for which communication needs to be performed, in this way offering the possibility of overlapping communication with useful computation. Furthermore, the fact that part of the matrix is processed, in parallel, on the CPU brings an additional speedup. In order to assess the performance of the CG solver on GPUs, we ran benchmarks on eight sparse matrices, four of which are 3D Poisson matrices at increasing resolutions, while the other four were chosen from UFSPARSE. In a way, these eight matrices sit at opposite ends of the performance spectrum, with the former group being both GPU- and solver-friendly and the latter being ill-conditioned and requiring a lot of communication. Using these matrices we have shown that the overlapped implementation is noticeably faster even for Poisson-type matrices, while providing orders of magnitude of speedup for the very irregular cases, where the non-overlapped implementation was practically unusable. In particular, for large Poisson-like matrices, the overlapped implementation has shown almost perfect strong scalability for up to three 280GTX GPUs. We note however that such an implementation requires a powerful multi-core CPU to be present on the host.

Using this efficient implementation, we measured the time required by the solver to converge to the desired accuracy on one or more GPUs, using both double precision and single precision in combination with iterative refinement. The results showed that up to 22x speedup can be achieved when using three 280GTX GPUs as compared to one of the fastest quad-core Nehalem processors available today. Moreover, we have shown that, in a limited number of cases, using iterative refinement is a good idea even on hardware which supports native double precision. However, this is usually not the case for ill-conditioned matrices, where the superior single precision performance is more than canceled by the very large increase in the number of iterations.

We have also looked at two other equally important performance metrics: power efficiency and cost effectiveness. Our results have shown that close to 5x savings in power and close to 20x savings in cost can be obtained by using GPUs as accelerators. In particular, the most cost-effective solution was running the CG solver on low-end graphics hardware like the 9800GTX+. The fact that such cards do not currently support native double precision proves the usefulness of iterative refinement.

Finally, we note that in this framework we do not discuss preconditioning, which is crucial for solving real-world problems. Indeed, in such cases the reduction in iterations provided by a good preconditioner on the CPU can more than compensate for the difference in performance when compared to a GPU. However, even in the presence of a good preconditioner, the fact that the SPMV operation dominates the computation time does not change, and thus efficient implementations retain the same importance. As future work we aim to extend the current implementation to clusters accelerated by multi-GPU nodes and to implement a way of tuning the part of the matrix computed locally, in order to optimally utilize both the GPUs and the CPU present on the host.

References

1. Hestenes, M. R., Stiefel, E. Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards 1952; 49.
2. Vuduc, R. Automatic performance tuning of sparse matrix kernels. PhD Thesis, University of California at Berkeley.
3. NVIDIA Corporation. CUDA programming guide, version 2.0.
4. Strzodka, R., Doggett, M., Kolb, A. Scientific computation for simulations on programmable graphics hardware. Simulation Modelling Practice and Theory 2005; 13.
5. Owens, J. D. et al. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 2007; 26.
6. Bolz, J. et al. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics 2003; 22.
7. Buatois, L. et al. Concurrent number cruncher: an efficient sparse linear solver on the GPU. Lecture Notes in Computer Science 2007; 4782.
8. Göddeke, D., Strzodka, R., Turek, S. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. International Journal of Parallel, Emergent and Distributed Systems 2007; 22.
9. Davis, T. University of Florida Sparse Matrix Collection.
10. Menon, S., Perot, J. B. Implementation of an efficient Conjugate Gradients algorithm for Poisson solutions on graphics processors. Proceedings of CFD.
11. Göddeke, D. et al. GPU acceleration of an unmodified parallel finite element Navier-Stokes solver. In High Performance Computing & Simulation 2009, Logos Verlag, Berlin, 2009.
12. Aubry, R. et al. Deflated preconditioned conjugate gradient solvers for the Pressure-Poisson equation. Journal of Computational Physics.
13. Göddeke, D. et al. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing.
14. Saad, Y. Iterative Methods for Sparse Linear Systems, 2nd Edition. SIAM.
15. Langou, J. et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.
16. Bell, N., Garland, M. Efficient Sparse Matrix-Vector Multiplication on CUDA. Technical report, NVIDIA Corporation.
17. Dongarra, J. et al. Numerical Linear Algebra for High Performance Computers. SIAM.


Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001,

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Kashif Iqbal - PhD Kashif.iqbal@ichec.ie

Kashif Iqbal - PhD Kashif.iqbal@ichec.ie HPC/HTC vs. Cloud Benchmarking An empirical evalua.on of the performance and cost implica.ons Kashif Iqbal - PhD Kashif.iqbal@ichec.ie ICHEC, NUI Galway, Ireland With acknowledgment to Michele MicheloDo

More information

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications A Load Balancing Tool for Structured Multi-Block Grid CFD Applications K. P. Apponsah and D. W. Zingg University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada Email:

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Solution of Linear Systems

Solution of Linear Systems Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Retargeting PLAPACK to Clusters with Hardware Accelerators

Retargeting PLAPACK to Clusters with Hardware Accelerators Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.

More information

Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms

Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering

More information

The sparse matrix vector product on GPUs

The sparse matrix vector product on GPUs The sparse matrix vector product on GPUs F. Vázquez, E. M. Garzón, J. A. Martínez, J. J. Fernández {f.vazquez, gmartin, jamartine, jjfdez}@ual.es Dpt Computer Architecture and Electronics. University of

More information

GPGPU Computing. Yong Cao

GPGPU Computing. Yong Cao GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008 A tutorial on: Iterative methods for Sparse Matrix Problems Yousef Saad University of Minnesota Computer Science and Engineering CRM Montreal - April 30, 2008 Outline Part 1 Sparse matrices and sparsity

More information

A Pattern-Based Approach to. Automated Application Performance Analysis

A Pattern-Based Approach to. Automated Application Performance Analysis A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,

More information

HSL and its out-of-core solver

HSL and its out-of-core solver HSL and its out-of-core solver Jennifer A. Scott j.a.scott@rl.ac.uk Prague November 2006 p. 1/37 Sparse systems Problem: we wish to solve where A is Ax = b LARGE Informal definition: A is sparse if many

More information

Experiences With Mobile Processors for Energy Efficient HPC

Experiences With Mobile Processors for Energy Efficient HPC Experiences With Mobile Processors for Energy Efficient HPC Nikola Rajovic, Alejandro Rico, James Vipond, Isaac Gelado, Nikola Puzovic, Alex Ramirez Barcelona Supercomputing Center Universitat Politècnica

More information

Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology

Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology Dimitrios Sofialidis Technical Manager, SimTec Ltd. Mechanical Engineer, PhD PRACE Autumn School 2013 - Industry

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum

More information

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 AMD PhenomII Architecture for Multimedia System -2010 Prof. Cristina Silvano Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 Outline Introduction Features Key architectures References AMD Phenom

More information

Toward a New Metric for Ranking High Performance Computing Systems

Toward a New Metric for Ranking High Performance Computing Systems SANDIA REPORT SAND2013-4744 Unlimited Release Printed June 2013 Toward a New Metric for Ranking High Performance Computing Systems Jack Dongarra, University of Tennessee Michael A. Heroux, Sandia National

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW

TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW Rajesh Khatri 1, 1 M.Tech Scholar, Department of Mechanical Engineering, S.A.T.I., vidisha

More information

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental

More information

Accelerating Wavelet-Based Video Coding on Graphics Hardware

Accelerating Wavelet-Based Video Coding on Graphics Hardware Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink. Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA. In Proc. 6th International Symposium on Image and Signal Processing

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions

More information

Dense Linear Algebra Solvers for Multicore with GPU Accelerators

Dense Linear Algebra Solvers for Multicore with GPU Accelerators Dense Linear Algebra Solvers for Multicore with GPU Accelerators Stanimire Tomov, Rajib Nath, Hatem Ltaief, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee,

More information

An Overview of the Finite Element Analysis

An Overview of the Finite Element Analysis CHAPTER 1 An Overview of the Finite Element Analysis 1.1 Introduction Finite element analysis (FEA) involves solution of engineering problems using computers. Engineering structures that have complex geometry

More information