Conjugate Gradients on Multiple GPUs


Serban Georgescu 1, Hiroshi Okuda 2

1 The University of Tokyo, Department of Quantum Engineering and Systems Science, Hongo, Bunkyo-ku, Tokyo, Japan
2 The University of Tokyo, Research into Artifacts, Center for Engineering (RACE), 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba, Japan

Abstract. A GPU-accelerated Conjugate Gradient solver is tested on eight matrices with different structural and numerical characteristics. The first four matrices are obtained by discretizing the 3D Poisson equation, which arises in many fields such as computational fluid dynamics, heat transfer and so on. Their relatively low bandwidth and low condition numbers make them ideal targets for GPU acceleration. We chose another four matrices from the other end of the spectrum, both ill-conditioned and with very large bandwidth. This paper concentrates on the computational aspects of running the solver on multiple GPUs. We develop a fast distributed sparse matrix-vector multiplication routine using optimized data formats which allow the overlapping of communication with computation and, at the same time, the sharing of some of the work with the CPU. Through a thorough analysis of the time spent in communication and computation we show that the proposed overlapped implementation outperforms the non-overlapped one by a large margin and provides almost perfect strong scalability for large Poisson-type matrices. We then benchmark the performance of the entire solver, using both double precision and single precision combined with iterative refinement, and report up to 22x acceleration when using three GPUs as compared to one of the most powerful Intel Nehalem CPUs available today. Finally, we show that using GPUs as accelerators not only brings an order of magnitude speedup but also up to a 5x increase in power efficiency and over a 10x increase in cost effectiveness.

Keywords: GPGPU, conjugate gradients, communication-computation overlapping, Poisson's equation, mixed precision

1 INTRODUCTION

Krylov subspace solvers are widely employed for solving large and sparse systems of linear equations. The usual source of such matrices is the discretization of partial differential equations (PDE) over geometric domains, represented as meshes, employed in fields such as structural analysis or computational fluid dynamics (CFD). For matrices which are symmetric and positive definite, the Conjugate Gradient (CG) method [1] is the usual choice. Composed of vector-vector (BLAS1) and sparse matrix-vector multiplication (SPMV) operations, with most of the time spent in the latter, CG solvers are highly memory bound and have historically been able to achieve only a small fraction of CPU peak [2]. To answer the need for more computing power and, more importantly, more bandwidth, the High Performance Computing (HPC) community has turned its attention to accelerators like Graphics Processing Units (GPU), CELL processors and Field Programmable Gate Arrays (FPGAs). Among these, being powerful, mass produced and already present in most machines, GPUs are arguably the most promising. The current generation of GPUs from both NVIDIA and AMD provides close to 2 TFlop/s of single precision performance, for dual-GPU models like the NVIDIA GeForce 295GTX or the AMD Radeon HD 4870 X2, and sustained bandwidths in the range of GB/s. Moreover, both performance and bandwidth seem to be doubling with every new generation.
To make this possible, GPUs feature vector-like architectures, with hundreds of cores grouped into SIMD units, connected to a global memory of up to 4GB in size via wide memory buses. GPUs use concurrency, supporting tens of thousands of threads in flight, to cover the large latency to the global memory. In the last few years, thanks to technologies like CUDA [3], GPUs have become extremely programmable and can be treated like general purpose processors. Furthermore, last-generation high-end models provide native support for double precision operations, albeit much slower than the single precision ones. Since around the early 2000s, when GPUs became programmable enough to be used for tasks other than rendering, researchers have been using GPUs to accelerate applications ranging from matrix solvers to database applications (see [4,5] and the references therein for a review).

In the context of iterative solvers, various implementations have been reported since, with CG solvers appearing first in [6] and later in [7,8,10]. In particular, [8] is the first to investigate accuracy issues resulting from the use of single precision on GPUs, with focus on quasi-double precision (emulated double precision) and iterative refinement.

In this paper we present a GPU-accelerated CG solver running on multicore processors and multiple GPUs. The performance and accuracy are benchmarked by solving Poisson's equation in 3D, at various resolutions, plus some additional matrices taken from the University of Florida Sparse Matrix Collection (UFSPARSE) [9]. The need to solve Poisson's equation arises in many fields, such as CFD, where semi-implicit or explicit formulations of the incompressible Navier-Stokes equations lead to the pressure-Poisson equation, in steady-state heat dissipation, transport phenomena, electrostatics, particle-based flows used in computer graphics, and many others. For structured meshes, Krylov subspace solvers are usually outperformed by multigrid (MG) solvers [11]. However, MG solvers have had limited success on unstructured meshes [12]. Unstructured meshes have become quite popular in fields such as CFD or structural analysis, due to the possibility of expressing complex geometries directly exported from CAD software using fully automatic unstructured triangular and tetrahedral grid generators. In such cases, as the resulting matrix is large, sparse and symmetric positive definite, CG solvers have been successfully utilized [10,12]. In [8], Poisson's equation is solved with both CG and MG, with both quasi-double precision and iterative refinement. Compared to [8], this paper focuses on optimizing the SPMV implementation using unstructured matrix storage and, when more than one GPU is employed, on minimizing the impact of communication by fully or partially overlapping it with useful computation. A CG-based Poisson solver running on a GPU was also reported in [10]. However, much has changed in both GPU hardware and software since then. Furthermore, their shader-based implementation is limited to one GPU and uses single precision alone.

This paper makes the following contributions:
- Provides an updated performance benchmark for CG solvers run on the most powerful CPU and GPUs to date;
- Proposes a way of implementing an optimized data format which allows the overlapping of communication with computation while performing SPMV on multiple GPUs and, additionally, provides the opportunity of offloading some of the computation to the CPU;
- Analyses the efficiency of using iterative refinement as a way of accelerating solvers on GPUs which support native double precision;
- Shows not only improvements in raw performance but also in power efficiency and cost effectiveness.

This paper is structured as follows. In Section 2 we define our testing problem and hardware environment. Detailed implementation issues, such as the proposed distributed sparse matrix-vector multiplication (SPMV) implementation, are discussed in Section 3. After showing the performance results in Section 4, we analyze the power efficiency and cost effectiveness in Section 5. Finally, we draw the conclusions in Section 6.

2 TEST SCENARIO

We test the performance of the CG solver for Poisson's equation of the form

    -Δu = f

where Δ represents the Laplace operator and u and f denote the unknown and the given right-hand side (e.g., an external force), respectively.
Zero Dirichlet boundary conditions are considered on the entire boundary. To test both the correctness of the implementation and the accuracy of the solver we start from a known analytical solution

    u(x, y, z) = e^{x+y+z} x(1-x) y(1-y) z(1-z)

which leads to the RHS

    f(x, y, z) = e^{x+y+z} xyz (y(z-5) - 5z + x(3zy + y + z - 5) + 9)
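For completeness, the expression for f can be checked (this short derivation is ours, not part of the original text) by noting that u factorizes into identical one-dimensional factors:

\[
u(x,y,z) = p(x)\,p(y)\,p(z), \qquad p(t) = e^{t}\,t(1-t), \qquad p''(t) = -e^{t}\,t(t+3),
\]
\[
-\Delta u = e^{x+y+z}\,xyz\,\bigl[(x+3)(1-y)(1-z) + (y+3)(1-x)(1-z) + (z+3)(1-x)(1-y)\bigr],
\]

and expanding the bracket gives 3xyz + xy + xz + yz - 5x - 5y - 5z + 9, which is exactly the expression for f above.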

The equation is discretized on a uniform grid of increasing size (64^3, 128^3, 192^3 and 256^3) using the standard 7-point stencil. For a spatial discretization of order N per dimension, the order of the matrix is N^3. Let us define the diagonal of order d of an n-by-n matrix A = (a_ij), 1 ≤ i,j ≤ n, as the sequence of elements of the form a_{i,i+d} with 1 ≤ i ≤ n. The 3D Poisson matrix is composed of seven such diagonals, with orders 0 (the main diagonal), -1, 1, -N, N, -N^2 and N^2. For the case N = 3, the matrix is shown in Fig. 1 (right), while the general form of the matrix is shown in Fig. 1 (left). As the resulting matrix is structured, one could in fact use a specialized (diagonal) storage format and a solver like MG. However, for the purpose of this study we choose to ignore these properties and treat the matrix as unstructured instead. This allows us to build the matrix directly rather than having to start from a mesh. Moreover, as such a matrix has a clear structure, our experiments become trivial to reproduce. This is unlike the unstructured case, where the results strongly depend on the mesh being used.

Fig. 1. Left: non-zero profile of Poisson's matrix resulting from a 3D finite difference discretization. Right: the actual matrix for N = 3 (empty squares represent zero values), ignoring the 1/h^2 factor.

Further tests are conducted on four additional unstructured matrices chosen from the University of Florida Sparse Matrix Collection. Their structure is very different from that of the banded and diagonally dominant Poisson matrix, while their condition numbers are much higher. These properties make them much more challenging to solve, both from a computational point of view (e.g., poor cache behavior, increased communication when using multiple GPUs) and with regard to convergence. The pattern of non-zero elements is shown in Fig. 2. For these matrices, since the analytical solution is not known, we start from a prescribed desired solution x = 1, from which we compute the right-hand side (RHS) b as b = Ax. We then run the solver for this RHS and compare the obtained solution with the desired one. The important properties of the matrices considered in this paper are summarized in Table 1. The memory required to run the solver has been computed by adding to the memory required to store the matrix the memory needed by the solution, the RHS and the temporary vectors that appear in the CG solver, as formulated in Fig. 3. The condition number has been estimated directly from the CG solver, via the Lanczos connection [14].

The hardware used in our tests is listed in Table 2. On the CPU side we used one of the fastest Intel Nehalem processors on the market today, the Core i7 975 Extreme Edition. Nehalem processors, with their integrated triple-channel memory controller and support for DDR3 memory, provide a very large jump in performance over previous multi-core CPUs (e.g., the Core2 Duo or Core2 Quad CPUs) for bandwidth-limited applications like the CG solver. On the GPU side we used the NVIDIA 280GTX and the NVIDIA 295GTX, two powerful GPUs with native double precision support, and the lower-priced but less powerful NVIDIA GeForce 9800GTX+, which lacks support for double precision. The NVIDIA 295GTX, currently the most powerful GPU from NVIDIA, is composed of two GPUs sandwiched together. All kernels were compiled using the Intel Compiler 11.0 and NVIDIA CUDA 2.3 and run on 64-bit Linux.
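As an illustration of how such a matrix can be built directly, with no mesh involved, the following is a minimal C++ sketch of assembling the 7-point Poisson matrix in CSR format. The CsrMatrix layout and all names are ours, not the authors' code.

#include <vector>

struct CsrMatrix {
    int n;                       // matrix order (N^3)
    std::vector<int> row_ptr;    // size n+1
    std::vector<int> col_idx;    // column indices of non-zeros
    std::vector<double> val;     // non-zero values
};

// Assemble the 7-point stencil for an N x N x N grid with spacing h.
// Zero Dirichlet boundaries: neighbours falling outside the grid are simply dropped.
CsrMatrix build_poisson7(int N, double h)
{
    CsrMatrix A;
    A.n = N * N * N;
    A.row_ptr.assign(A.n + 1, 0);
    const double diag = 6.0 / (h * h), off = -1.0 / (h * h);
    auto id = [N](int i, int j, int k) { return (k * N + j) * N + i; };

    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            int row = id(i, j, k);
            // neighbours listed in order of increasing column: -N^2, -N, -1, 0, +1, +N, +N^2
            int di[7] = { i, i, i - 1, i, i + 1, i, i };
            int dj[7] = { j, j - 1, j, j, j, j + 1, j };
            int dk[7] = { k - 1, k, k, k, k, k, k + 1 };
            for (int s = 0; s < 7; ++s) {
                if (di[s] < 0 || di[s] >= N || dj[s] < 0 || dj[s] >= N ||
                    dk[s] < 0 || dk[s] >= N)
                    continue;                           // Dirichlet boundary
                A.col_idx.push_back(id(di[s], dj[s], dk[s]));
                A.val.push_back(s == 3 ? diag : off);   // s == 3 is the grid point itself
            }
            A.row_ptr[row + 1] = (int)A.col_idx.size();
        }
    return A;
}

The seven diagonals listed above appear here as the seven candidate neighbours per row; dropping the 1/h^2 factor (as in Fig. 1) amounts to calling the routine with h = 1.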
We tested the performance of the CG solver when run in native double precision (64-bit) and in single precision (32-bit) combined with iterative refinement.

Fig. 2. Unstructured matrices from UFSPARSE: pwtk, inline_1, bmwcra_1 and ldoor.

Table 1. Properties of the matrices considered in this study

Name | Order | Non-zeros | Solver memory, double (MB) | Solver memory, single (MB) | Condition number
Poisson 64 | 262,144 | 1,826,… | | | 1,708
Poisson 128 | 2,097,152 | 14,647,… | | | 6,740
Poisson 192 | 7,077,888 | 49,471,… | | | 15,092
Poisson 256 | 16,777,216 | …,308,926 | 2,174 | 1,343 | 26,765
pwtk | 217,918 | 46,522,… | | | …E+07
inline_1 | 503,712 | 36,816,… | | | …E+08
bmwcra_1 | 148,770 | 10,644,… | | | …E+06
ldoor | 952,203 | 46,522,… | | | …E+06

Table 2. Hardware specifications

Hardware | Model | Motherboard | Memory
CPU | Nehalem 3.2GHz | Asus P6T WS Rev. | TR3X6G1600C8D (DDR3 @1600)
GPU | 295GTX: EVGA GeForce GTX 295 CO-OP | Asus P6T WS Rev. | 2x1GB GDDR3
GPU | 280GTX: MSI N280GTX-T2D1G-OC | Asus P6T WS Rev. | 1GB GDDR3
GPU | 9800GTX+: GALAXY GF P98GTX+ | Asus P6T WS Rev. | 512MB GDDR3

Iterative refinement is a method by which double precision accuracy can be obtained by combining an inner solver working in single precision, in this case a CG solver, with a small number of outer correction iterations performed in double precision. Iterative refinement works for matrices which are not too ill-conditioned (i.e., with a condition number of at most O(10^8)). Iterative refinement can be used even when native double precision support is available, in order to speed up the solver [15]. For detailed information regarding this method we direct the reader to [8].

3 IMPLEMENTATION DETAILS

In this section we describe implementation details for the left-preconditioned CG method, using the formulation shown in Fig. 3. For each iteration (lines 7-16), the method uses two dot product (DOT) operations (lines 8 and 13), three vector updates (two of AXPY type in lines 9 and 10 and one of AYPX type in line 15), one SPMV operation in line 7 and a preconditioning operation in line 11. The SPMV is by far the most expensive operation and, depending on which kind of preconditioner is being used, a fair amount of time can also be spent in preconditioning (e.g., a significant percentage for a SOR or ILU preconditioner). For preconditioning, in this paper we use only diagonal scaling. Although not a very efficient preconditioner, diagonal scaling is straightforward to implement, embarrassingly parallel and works acceptably well for not-so-difficult problems.

01. i = 0
02. r = b - Ax
03. d = M^{-1} r
04. δ_new = r^T d
05. δ_0 = δ_new
06. while i < i_max and δ_new > ε^2 δ_0 do
07.   q = Ad
08.   α = δ_new / (d^T q)
09.   x = x + α d
10.   r = r - α q
11.   s = M^{-1} r
12.   δ_old = δ_new
13.   δ_new = r^T s
14.   β = δ_new / δ_old
15.   d = s + β d
16.   i = i + 1
17. end

Fig. 3. The PCG algorithm.

In order to port a CG solver to the GPU, one needs to write kernels for DOT, AXPY, AYPX, diagonal scaling and SPMV. The DOT and AXPY kernels, in both single and double precision, are directly available from NVIDIA's CUBLAS library. AYPX and diagonal scaling are not included in CUBLAS, but they are straightforward to implement (see the sketch below). The only difficulty lies in implementing the SPMV kernel which, when using diagonal scaling alone, represents 90% or more of the total time spent in the CG solver. On the CPU side we use the functions provided by Intel MKL 11.0, except for the AYPX kernel and diagonal scaling, which we implement ourselves.

Depending on the structure of the matrix and the architecture of the processor the SPMV operation is executed on, one may choose from a wide range of sparse matrix formats (we refer to [2,16] for an extensive review). The importance of the sparse format employed in GPU implementations cannot be overstated. A benchmark of the SPMV operation on a large collection of matrices has shown that a fivefold increase in performance over the unmodified CSR format, the format of choice for most codes dealing with unstructured meshes, can be achieved by using a format optimized for the GPU. Such a format, which combines the very structured ELLPACK/ITPACK (ELL) format with the flexible Coordinate (COO) format, has been proposed in [16]. The ELL format, originally created for vector machines, is intended for matrices which have a relatively constant number of non-zeros per row.
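Returning for a moment to the BLAS1-type kernels mentioned above: AYPX (y = x + β·y, line 15 of Fig. 3) and diagonal scaling (line 11) reduce to one operation per vector element. A minimal CUDA sketch, ours rather than the authors' code, could look as follows.

// AYPX: y <- x + beta * y (double precision)
__global__ void aypx(int n, double beta, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + beta * y[i];
}

// Diagonal (Jacobi) scaling: s <- M^{-1} r, with M = diag(A) stored as its inverse
__global__ void diag_scale(int n, const double* inv_diag, const double* r, double* s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        s[i] = inv_diag[i] * r[i];
}

// example launch: aypx<<<(n + 255) / 256, 256>>>(n, beta, d_s, d_d);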

If this is the case, then all non-zero values can be stored in one large block, which makes the accesses to them regular, hence easy to coalesce and thus efficient. However, the format incurs very large fill for matrices where the distribution of non-zeros per row has large variations. The idea behind the format proposed in [16] is to store the regular part of the matrix in ELL format and the irregular part in a more flexible format, in that case the COO format.

Fig. 4. The ELL-PCSR format. The rectangles represent matrix rows, with the hashed part showing non-zero elements. At the cost of some fill, marked with F, the regular part of the matrix is stored as a block, in ELL format. The remaining non-zeros are stored in PCSR format.

We implemented a variation of this format, in which the COO part is replaced by padded CSR (PCSR), in order to make it usable on older GPUs that lack support for the atomic operations required for an efficient implementation of the COO format. The CSR format is padded so that the number of elements in each row is a multiple of four, which increases the degree of coalescing. Hereafter, we will refer to this sparse format as the ELL-PCSR format. An illustration of the format is shown in Fig. 4; a kernel-level sketch is given after Fig. 5. On average, over the many matrices we have tested, the double precision SPMV performance for large matrices (i.e., with O(10^7) non-zeros or more) reached only 2 GFlop/s for the unmodified CSR format, close to 5 GFlop/s for PCSR and around 11 GFlop/s for the ELL-PCSR format on a 280GTX GPU. These results are shown in Fig. 5.

Fig. 5. Average double precision SPMV performance for the unmodified CSR format, PCSR and ELL-PCSR for matrices taken from UFSPARSE, on a 280GTX GPU. Small matrices have fewer than 10^5 non-zeros, medium ones have between 10^5 and 10^7, while large ones have more than 10^7.
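To make the storage scheme concrete, here is a minimal CUDA sketch of the two kernels behind an ELL-PCSR SPMV. The array names, the column-major ELL layout and the zero-padding convention are our assumptions, not the paper's code.

// ELL part: a dense num_rows x max_cols block of values and column indices, stored
// column-major so that consecutive threads (rows) read consecutive addresses.
__global__ void spmv_ell(int num_rows, int max_cols,
                         const int* ell_col, const double* ell_val,
                         const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int c = 0; c < max_cols; ++c) {
            int idx = c * num_rows + row;   // column-major layout for coalesced loads
            double v = ell_val[idx];        // padded slots hold the value 0 and a valid column
            sum += v * x[ell_col[idx]];
        }
        y[row] = sum;                       // overwrite: the ELL kernel runs first
    }
}

// PCSR part: the leftover irregular non-zeros, in CSR padded to a multiple of 4 per row.
__global__ void spmv_pcsr(int num_rows, const int* row_ptr, const int* col_idx,
                          const double* val, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];  // padding entries are explicit zeros
        y[row] += sum;                      // accumulate on top of the ELL result
    }
}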

The SPMV operation is performed by multiplying all non-zero elements of the matrix with their corresponding elements of the source vector. In the case of distributed SPMV using GPUs, the matrix, the source vector and the result vector are distributed across multiple GPUs. The part of the result vector belonging to each GPU is filled by multiplying all non-zero matrix elements owned by the respective GPU with their corresponding source vector elements. For each GPU, although all of its non-zero matrix elements are available, some elements of the source vector are not. Before SPMV can take place, these elements must be received from the GPU that owns them. In Fig. 6 we show how a distributed SPMV (y = Ax) is performed in the case of two GPUs and N = 3. The 27-by-27 matrix A is partitioned between the two GPUs, with 13 rows on the first and 14 rows on the second. The source and result vectors are partitioned to match. In order to perform its share of the SPMV operation, GPU1 needs elements x_14 to x_22 from GPU2 for processing rows 5 to 13. In the same way, GPU2 needs elements x_5 to x_13 from GPU1 for processing rows 14 to 22. SPMV can only be performed after these elements are exchanged. Implementing SPMV on a multi-GPU system must therefore follow the three steps below (a host-side sketch is given after Fig. 6):

1. Copy the data to be sent to other GPUs to host memory;
2. Copy the needed data from host memory to the GPUs;
3. Perform SPMV on each GPU.

Steps 1 and 2 are necessary since CUDA currently does not allow peer-to-peer communication between GPUs and, therefore, all transfers must be made indirectly via the memory of the host. As data must be copied to and from each GPU for every CG iteration, this can become a bottleneck. Fortunately, for most problems coming from the discretization of PDEs, communication can be overlapped with useful computation. This is possible since, after performing domain decomposition, the number of vector elements that do not require communication (local elements) scales with the volume of the partition, while the number of vector elements that do require communication (shared elements, i.e., the ones placed on the boundary between domains) scales with its surface. Thus, for large matrix sizes, the number of local elements is an order of magnitude larger than the number of shared elements. As such, if asynchronous communication is used, one can do the computations for the local nodes while exchanging data for the shared nodes, and sum the results when both operations are finished. For the Poisson matrix considered here, the number of elements that must be exchanged between two PEs is equal to the half-bandwidth of the matrix. For the type of partitioning we employ, all GPUs but the first and the last must exchange this amount of elements with their two nearest neighbors. The first and the last GPU have only one closest neighbor, with whom they exchange the same amount of data.

Fig. 6. 3D Poisson matrix for N = 3, partitioned between two GPUs. Missing source vector elements must be exchanged indirectly, in two steps, via the host memory.
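A host-side sketch of these three steps, without any overlapping, is shown below. The bookkeeping (per-GPU device pointers, send/receive offsets and counts) is hypothetical, the host-side routing of gathered entries between buffers is omitted, and driving all GPUs from a single host thread via cudaSetDevice assumes a more recent CUDA runtime than the CUDA 2.3 used in the paper (which required one host thread per GPU).

// Step 1: for every GPU, download the source-vector entries its neighbours will need.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpy(h_send[g], d_x[g] + send_off[g],
               send_cnt[g] * sizeof(double), cudaMemcpyDeviceToHost);
}
// ... host-side routing of h_send into the neighbours' h_recv buffers ...

// Step 2: upload the missing ("halo") entries; d_x[g] is assumed to have room for
// the locally owned entries followed by the halo entries.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpy(d_x[g] + halo_off[g], h_recv[g],
               recv_cnt[g] * sizeof(double), cudaMemcpyHostToDevice);
}

// Step 3: only now can each GPU run its share of the SPMV (ELL-PCSR kernels from above).
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    spmv_ell <<<grid[g], threads>>>(n_rows[g], max_cols[g], ell_col[g], ell_val[g], d_x[g], d_y[g]);
    spmv_pcsr<<<grid[g], threads>>>(n_rows[g], row_ptr[g], col_idx[g], csr_val[g], d_x[g], d_y[g]);
}

Every CG iteration pays for steps 1 and 2 before any useful GPU work starts, which is exactly the bottleneck the overlapped scheme described next removes.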

We wish to point out that the second source of communication in the CG method, the DOT operation, cannot be overlapped with useful computation without modifications to the CG method itself. Alternative formulations where this becomes possible have been proposed but, in general, they are less robust than the classical formulation. For details we refer to [17]. However, when all GPUs are inside one node, there is no additional communication cost, as the result of the DOT operation will always be on the host, where the final reduction takes place.

In order to overlap communication with computation, the processing of the local elements must be done in parallel with the exchange and processing of the shared ones. Fortunately, recent versions of CUDA allow one to copy data to/from the GPU memory while a kernel is executing, using streams. However, CUDA does not allow the simultaneous execution of two kernels on the same GPU, which makes it impossible to process both local and shared data on the GPU at the same time. Thus, the only way a complete overlap can be achieved is by processing the shared elements on the CPU. To achieve this, we implemented a data structure where only the non-zeros corresponding to the local elements are stored on the GPU, in ELL-PCSR format, while the rest of the non-zeros, corresponding to shared elements, are left in the host memory, stored in CSR format. Using this data structure, the distributed SPMV is performed in the following way (a sketch using CUDA streams follows this list):

1. Process local data (in parallel):
   (a) Perform SPMV on the local elements, on all GPUs.
2. Process shared data (in parallel):
   (a) Copy the needed source vector elements from the GPUs to host memory;
   (b) Perform SPMV on the shared elements, using the CPU;
   (c) Copy the result of the CPU computation from host memory to the GPUs.
3. When 1 and 2 are both finished, sum up the results.

The process is also illustrated in Fig. 7. We show the N = 3 case in Fig. 8.

Fig. 7. Overlapping communication with computation in SPMV.

The sequence of steps above works well provided that the time spent computing the SPMV on the GPU is larger than the time spent communicating and computing the SPMV on the CPU. For a good partitioning, this happens if the problem is large enough and the CPU processes the shared elements fast enough. For the Nehalem processor used in this study, for large problem sizes, we have obtained speeds of 2-3 GFlop/s for the unmodified CSR format. As this is around 15x slower than the performance of 3x280GTX GPUs, the part of the matrix stored on the GPUs must be at least 20x larger than the part left on the host to allow for a full overlap. For the matrices considered here, the distribution of the non-zero elements is shown in Fig. 9. While the format of choice on the host is always unmodified CSR, for the part of the matrix stored on the GPU we also show the ratio between the elements stored in ELL format and the ones stored in PCSR.

We note that the proposed implementation has the additional advantage of offloading some of the GPU computation to the CPU. Although not used here, one could in principle assign, depending on the performance of the CPU, as many non-zero elements to the host as needed to obtain almost the same computation time on the CPU and the GPU. In this way the total processing time could be further reduced.
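A sketch of one overlapped SPMV call is given below. It follows the numbered steps above, but the names (streams, buffers, offsets), the two-streams-per-GPU arrangement and the host CSR routine are our assumptions rather than the paper's code. Pinned host buffers are assumed so that cudaMemcpyAsync can actually overlap with the running kernel, the boundary entries of each d_x[g] are assumed to be contiguous (true for the band partitioning used here), and, as before, a single host thread driving all GPUs assumes a recent CUDA runtime.

// summation kernel for step 3: y[row_ids[i]] += y_shared[i]
__global__ void scatter_add(int n, const int* row_ids, const double* y_shared, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[row_ids[i]] += y_shared[i];
}

// 2(a): start downloading the shared source-vector entries immediately, in a copy stream...
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpyAsync(h_x_shared[g], d_x[g] + shared_off[g],
                    shared_cnt[g] * sizeof(double), cudaMemcpyDeviceToHost, copy_stream[g]);
}
// 1(a): ...while the SPMV on the local part runs concurrently in a compute stream.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    spmv_ell <<<grid[g], threads, 0, compute_stream[g]>>>(n_local[g], max_cols[g],
                                                          ell_col[g], ell_val[g], d_x[g], d_y[g]);
    spmv_pcsr<<<grid[g], threads, 0, compute_stream[g]>>>(n_local[g], row_ptr[g], col_idx[g],
                                                          csr_val[g], d_x[g], d_y[g]);
}
// 2(b): once the downloads are done, the CPU multiplies the host-resident CSR blocks
// (the non-zeros referencing shared elements) by the gathered entries.
for (int g = 0; g < num_gpus; ++g) { cudaSetDevice(g); cudaStreamSynchronize(copy_stream[g]); }
host_csr_spmv(h_A_shared, h_x_shared, h_y_shared);   // e.g. MKL or a hand-written CSR loop

// 2(c): upload the CPU partial results back to each GPU.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaMemcpyAsync(d_y_shared[g], h_y_shared[g],
                    shared_rows[g] * sizeof(double), cudaMemcpyHostToDevice, copy_stream[g]);
}
// 3: when both the local kernels and the upload have finished, add the partial results.
for (int g = 0; g < num_gpus; ++g) {
    cudaSetDevice(g);
    cudaStreamSynchronize(compute_stream[g]);
    cudaStreamSynchronize(copy_stream[g]);
    scatter_add<<<grid_sh[g], threads>>>(shared_rows[g], d_row_ids[g], d_y_shared[g], d_y[g]);
}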

Fig. 8. 3D Poisson matrix for N = 3, partitioned between two GPUs for overlapping.

Fig. 9. The distribution of non-zero elements for the matrices used in the benchmarks. The top-most plot shows the distribution for the one-GPU case, or when not storing any elements on the host. The remaining two plots are for overlapped SPMV using two and three GPUs.

4 PERFORMANCE RESULTS

In this section we test the performance of the solver on the matrices shown in Table 1, on the hardware shown in Table 2, using the testing procedure described in Section 2. The CG solver was run in double precision and in single precision with iterative refinement. The stopping criterion was set to 10^{-8} in the case of the Poisson matrices and to a looser tolerance for the rest. To compute the speedup, for both the SPMV operation and the entire solver, we take as reference the performance of the Nehalem CPU in double precision when running on all four cores. In Fig. 10 we show how the SPMV performance increases with the number of cores. Since for the double precision solver we are using Intel MKL, the number of cores is changed via the OMP_NUM_THREADS environment variable. As expected, the best performance is achieved when running on all four cores.

Fig. 10. Double precision SPMV performance on the Nehalem CPU for the matrices used in the benchmarks, with increasing number of cores.

Next, in Fig. 11 we show the performance of the SPMV operation on all types of hardware used in this paper and for all matrices in the collection, in both single and double precision. For the cases where multiple GPUs are used (two and three 280GTX GPUs, and a full 295GTX using both GPUs on board) we report the performance for both the non-overlapped and the overlapped case, with the latter denoted by OV. Entries which are missing from the graphs correspond to matrices which did not fit in the memory of the respective device. We note both the large difference between the performance of the Nehalem CPU and that of the GPUs and the large gap between the non-overlapped and overlapped implementations, especially for the very unstructured UFSPARSE matrices. An additional thing to notice is the almost double performance achieved when running in single precision, which is the reason why we expect to obtain a speedup when using iterative refinement.

One can already notice from Fig. 11 the large difference in performance between the non-overlapped and overlapped SPMV implementations. The reason for this difference is explained in detail in Fig. 12 and Fig. 13, where we show the breakdown of the time spent in one SPMV operation into computation (on both CPU and GPU in the overlapped case, and only on the GPU in the non-overlapped one), communication (denoted by COMM) and the final summation operation (denoted by ADD). While in the non-overlapped case the total time is the sum of the computation time and the communication time, in the overlapped case it is reduced to the sum of the maximum computing time, which can be either on the GPU or on the CPU, and the summation time. The results show that, although the difference in performance is significant even for matrices with low bandwidth, as in the case of the Poisson matrices, there is a huge performance difference for the more unstructured ones, as is the case with the four matrices chosen from UFSPARSE. We note however that using the proposed implementation for such extreme cases can lead to a very large number of non-zero elements being processed by the much slower CPU, which then becomes the bottleneck.

The strong scalability of both the overlapped and non-overlapped SPMV implementations is shown in Fig. 14. Since dividing the same matrix among multiple GPUs reduces the size of the matrix assigned to each GPU and hence degrades performance, we only consider here the largest matrices.
An exception is the largest Poisson matrix, which can only be solved on three GPUs and hence cannot be used to test scalability.

Fig. 11. Single precision and double precision SPMV performance. OV denotes overlapped SPMV, as proposed in this paper.

Fig. 12. SPMV time breakdown for 2x280GTX, double precision. N/OV and OV stand for non-overlapped and overlapped, respectively. The time is for one SPMV call and is expressed in ms.

Fig. 13. SPMV time breakdown for 3x280GTX, double precision. N/OV and OV stand for non-overlapped and overlapped, respectively. The time is for one SPMV call and is expressed in ms.

The overlapped implementation shows very good scalability for the Poisson matrices, coming close to ideal for the largest Poisson case. On the other hand, for the pwtk and inline_1 matrices the scalability is much worse, especially when using three GPUs, because of the large amount of time spent both in communication and in computation on the CPU. However, in both cases the scalability is much better than in the non-overlapped case.

Fig. 14. Strong scalability for one, two and three 280GTX GPUs.

We now move from the SPMV operation to the entire CG solver. Fig. 15 shows the speedup obtained over the Nehalem CPU, running with all four cores and in double precision, for all the GPUs considered here. For the GPUs which support native double precision we run the solver both in native double precision and in single precision combined with iterative refinement, while only the latter mode is used for the 9800GTX+ GPU, which lacks native double precision support. Single precision with iterative refinement is used on the CPU as well. Up to 22x speedup is obtained when using three GPUs in combination with iterative refinement. Detailed information on the time elapsed until convergence is shown in Table 3.

The difference between double precision and single precision combined with iterative refinement, when run on the same hardware and to the same level of accuracy, is further investigated in Fig. 16, where we plot the ratio between the time spent using the former and the time spent using the latter. Hence, a value larger than unity shows that a speedup is obtained when using iterative refinement, even though the hardware supports native double precision. As expected, iterative refinement provides a moderate speedup for well-conditioned matrices like Poisson. On the contrary, for ill-conditioned matrices iterative refinement results in a substantial slowdown, due to the large increase in the number of iterations caused by the loss of direction information during solver restarts.
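To make the last point concrete, the outer correction loop of mixed-precision iterative refinement has roughly the following shape. This is a host-side sketch under our own naming (it reuses the CsrMatrix type from the earlier assembly sketch, and cg_single() stands in for the single precision GPU solver); it is not the authors' implementation. Every outer iteration restarts the inner CG from scratch on the new residual, which is exactly where the search-direction information is lost.

#include <cmath>
#include <vector>

// CsrMatrix as defined in the assembly sketch earlier in this paper's Section 2.

// r = b - A x in double precision; returns ||r||_2.
double csr_residual_norm(const CsrMatrix& A, const double* x, const double* b, double* r)
{
    double nrm = 0.0;
    for (int i = 0; i < A.n; ++i) {
        double axi = 0.0;
        for (int j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            axi += A.val[j] * x[A.col_idx[j]];
        r[i] = b[i] - axi;
        nrm += r[i] * r[i];
    }
    return std::sqrt(nrm);
}

// Stand-in for the single precision GPU CG solver; assumed to keep (or build)
// a single precision copy of A internally.
void cg_single(const CsrMatrix& A, const float* rhs, float* sol, float inner_tol);

// x is improved in place until ||b - Ax|| <= tol * ||b|| or max_outer corrections are spent.
void cg_iterative_refinement(const CsrMatrix& A, const double* b, double* x,
                             double tol, double norm_b, int max_outer)
{
    int n = A.n;
    std::vector<double> r(n);            // double precision residual
    std::vector<float>  r32(n), c32(n);  // demoted residual and single precision correction
    for (int k = 0; k < max_outer; ++k) {
        double res = csr_residual_norm(A, x, b, r.data());  // correction equation RHS
        if (res <= tol * norm_b) break;                     // converged in double precision
        for (int i = 0; i < n; ++i) r32[i] = (float)r[i];   // demote the residual
        cg_single(A, r32.data(), c32.data(), 1e-5f);        // inner solve A c = r (restart!)
        for (int i = 0; i < n; ++i) x[i] += (double)c32[i]; // accumulate correction in double
    }
}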

Fig. 15. Speedup over the Nehalem CPU running in double precision on all cores.

Table 3. Elapsed solver time (in ms) to solution. The convergence criterion was 10^{-8} for the Poisson matrices and a looser tolerance for the rest. IR stands for iterative refinement. Empty cells indicate either failed iterative refinement or a lack of sufficient memory to run the solver.

Fig. 16. Comparison between the time to solution for the solver run in double precision and in single precision with iterative refinement.

5 POWER AND COST EFFICIENCY

Having looked at raw performance in the previous sections, we now turn our attention to the equally important issues of power efficiency and cost effectiveness. In order to compute these values, we make use of the power consumption and cost figures shown in Table 4.

Table 4. Power and cost for the tested hardware

Hardware | Power (W) | Cost (USD)
Core i7 975 | |
295GTX | |
280GTX | 236 |
9800GTX+ | |

First, in Fig. 17 and Fig. 18 we show the power efficiency and cost effectiveness, for both single and double precision, computed in terms of MFlop/s/Watt and MFlop/s/USD, respectively. For the case of multiple-GPU computing, including running on both GPUs inside the 295GTX card, since our overlapping implementation needs to use the CPU as well, we include in the total power (cost) the power (cost) of the Nehalem CPU. For the case when only a single device is being used, the power (cost) of the CPU is not considered. Thus, for example, the power assumed to be consumed when using 2x280GTX GPUs is 2x236 Watt plus the power of the Nehalem CPU, while the power assumed to be consumed for 1x280GTX is 236 Watt.

Fig. 17. Power efficiency of the CPU and GPUs, in double and single precision, expressed in MFlop/s/Watt.

Fig. 18. Cost effectiveness of the CPU and GPUs, in double and single precision, expressed in MFlop/s/USD.

However, since the sheer number of GFlop/s does not give an accurate view of the performance of the solver, especially when considering the increase in iterations due to using iterative refinement, we also compute the increase in power efficiency and cost effectiveness. These quantities, shown in Fig. 19 and Fig. 20, represent the savings (or increase) in power consumption and money associated with the speedup (or slowdown) given by a piece of hardware over the Nehalem CPU running in double precision. Thus, for example, we compute the increase in power efficiency when using 2x280GTX GPUs by the following formula:

    I_{2x280GTX} = (t_Nehalem / t_{2x280GTX}) * Watt_Nehalem / (2 * Watt_280GTX + Watt_Nehalem)

The results show that close to a 5x increase in power efficiency, when using iterative refinement for Poisson-like matrices, and up to an 18x increase in cost effectiveness, when using iterative refinement on cheap GPUs like the 9800GTX+, again on well-structured and well-conditioned matrices like Poisson, can be achieved.

Fig. 19. Power efficiency increase over the Nehalem running in double precision, expressed in Speedup/Watt.
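As a purely illustrative reading of the formula above (the actual figures are those of Table 4): with Watt_280GTX = 236 W as quoted earlier, an assumed Watt_Nehalem of roughly 130 W and a hypothetical 10x speedup, the increase in power efficiency would be

\[
I_{2\times280GTX} \approx 10 \cdot \frac{130}{2 \cdot 236 + 130} = 10 \cdot \frac{130}{602} \approx 2.2,
\]

i.e., about 2.2x less energy for the same solve, even though the instantaneous power draw of the 2-GPU configuration is higher.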

Fig. 20. Cost effectiveness increase over the Nehalem running in double precision, expressed in Speedup/USD.

6 CONCLUSIONS

With a view to running a CG solver efficiently on multiple GPUs, we have proposed a distributed SPMV implementation that keeps on the host all matrix elements for which communication needs to be performed, in this way offering the possibility of overlapping communication with useful computation. Furthermore, the fact that part of the matrix is processed, in parallel, on the CPU brings an additional speedup. In order to assess the performance of the CG solver on GPUs, we ran benchmarks on eight sparse matrices, four of which are 3D Poisson matrices at increasing resolutions, while the other four were chosen from UFSPARSE. In a way, these eight matrices sit at opposite ends of the performance spectrum, with the former group being both GPU- and solver-friendly and the latter being ill-conditioned and requiring a lot of communication. Using these matrices we have shown that the overlapped implementation is noticeably faster even for Poisson-type matrices, while providing orders of magnitude of speedup for the very irregular cases, where the non-overlapped implementation was practically unusable. In particular, for large Poisson-like matrices, the overlapped implementation has shown almost perfect strong scalability for up to three 280GTX GPUs. We note however that such an implementation requires a powerful multi-core CPU to be present on the host.

Using this efficient implementation, we measured the time required by the solver to converge to the desired accuracy on one or more GPUs, using both double precision and single precision in combination with iterative refinement. The results showed that up to 22x speedup can be achieved when using three 280GTX GPUs as compared to one of the fastest quad-core Nehalem processors available today. Moreover, we have shown that, in a limited number of cases, using iterative refinement is a good idea even on hardware which supports native double precision. However, this is usually not the case for ill-conditioned matrices, where the superior single precision performance is more than canceled by the very large increase in the number of iterations.

We have also looked at two other equally important performance metrics: power efficiency and cost effectiveness. Our results have shown that close to 5x savings in power and close to 20x savings in cost can be obtained by using GPUs as accelerators. In particular, the most cost-effective solution was running the CG solver on low-end graphics hardware like the 9800GTX+. The fact that such cards do not currently support native double precision proves the usefulness of iterative refinement.

Finally, we note that in this framework we do not discuss preconditioning, which is crucial for solving real-world problems. Indeed, in such cases the reduction in iterations provided by a good preconditioner on the CPU can more than compensate for the difference in performance when compared to a GPU. However, even in the presence of a good preconditioner, the fact that the SPMV operation dominates the computation time does not change, and thus efficient implementations retain the same importance. As future work we aim to extend the current implementation to clusters accelerated by multi-GPU nodes and to implement a way of tuning the part of the matrix computed locally, in order to optimally utilize both the GPUs and the CPU present on the host.

References

1. Hestenes, M. R., Stiefel, E. Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the National Bureau of Standards 1952; 49.
2. Vuduc, R. Automatic performance tuning of sparse matrix kernels. PhD Thesis, University of California at Berkeley.
3. NVIDIA Corporation. CUDA programming guide, version 2.0.
4. Strzodka, R., Doggett, M., Kolb, A. Scientific computation for simulations on programmable graphics hardware. Simulation Modelling Practice and Theory 2005; 13.
5. Owens, J. D. et al. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 2007; 26.
6. Bolz, J. et al. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Transactions on Graphics 2003; 22.
7. Buatois, L. et al. Concurrent number cruncher: an efficient sparse linear solver on the GPU. Lecture Notes in Computer Science 2007; 4782.
8. Göddeke, D., Strzodka, R., Turek, S. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. International Journal of Parallel, Emergent and Distributed Systems 2007; 22.
9. Davis, T. University of Florida Sparse Matrix Collection.
10. Menon, S., Perot, J. B. Implementation of an efficient Conjugate Gradients algorithm for Poisson solutions on graphics processors. Proceedings of CFD.
11. Göddeke, D. et al. GPU acceleration of an unmodified parallel finite element Navier-Stokes solver. In High Performance Computing & Simulation 2009, Logos Verlag, Berlin, 2009.
12. Aubry, R. et al. Deflated preconditioned conjugate gradient solvers for the Pressure-Poisson equation. Journal of Computational Physics.
13. Göddeke, D. et al. Exploring weak scalability for FEM calculations on a GPU-enhanced cluster. Parallel Computing.
14. Saad, Y. Iterative Methods for Sparse Linear Systems, 2nd Edition. SIAM.
15. Langou, J. et al. Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing.
16. Bell, N., Garland, M. Efficient Sparse Matrix-Vector Multiplication on CUDA. Technical report, NVIDIA Corporation.
17. Dongarra, J. et al. Numerical Linear Algebra for High Performance Computers. SIAM.


Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001,

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Kashif Iqbal - PhD Kashif.iqbal@ichec.ie

Kashif Iqbal - PhD Kashif.iqbal@ichec.ie HPC/HTC vs. Cloud Benchmarking An empirical evalua.on of the performance and cost implica.ons Kashif Iqbal - PhD Kashif.iqbal@ichec.ie ICHEC, NUI Galway, Ireland With acknowledgment to Michele MicheloDo

More information

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications

A Load Balancing Tool for Structured Multi-Block Grid CFD Applications A Load Balancing Tool for Structured Multi-Block Grid CFD Applications K. P. Apponsah and D. W. Zingg University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada Email:

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Solution of Linear Systems

Solution of Linear Systems Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

Retargeting PLAPACK to Clusters with Hardware Accelerators

Retargeting PLAPACK to Clusters with Hardware Accelerators Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.

More information

Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms

Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms Load Balancing Algorithms for Sparse Matrix Kernels on Heterogeneous Platforms Thesis submitted in partial fulfillment of the requirements for the degree of MS by Research in Computer Science and Engineering

More information

The sparse matrix vector product on GPUs

The sparse matrix vector product on GPUs The sparse matrix vector product on GPUs F. Vázquez, E. M. Garzón, J. A. Martínez, J. J. Fernández {f.vazquez, gmartin, jamartine, jjfdez}@ual.es Dpt Computer Architecture and Electronics. University of

More information

GPGPU Computing. Yong Cao

GPGPU Computing. Yong Cao GPGPU Computing Yong Cao Why Graphics Card? It s powerful! A quiet trend Copyright 2009 by Yong Cao Why Graphics Card? It s powerful! Processor Processing Units FLOPs per Unit Clock Speed Processing Power

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE

SUBJECT: SOLIDWORKS HARDWARE RECOMMENDATIONS - 2013 UPDATE SUBJECT: SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE KEYWORDS:, CORE, PROCESSOR, GRAPHICS, DRIVER, RAM, STORAGE SOLIDWORKS RECOMMENDATIONS - 2013 UPDATE Below is a summary of key components of an ideal SolidWorks

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008

Yousef Saad University of Minnesota Computer Science and Engineering. CRM Montreal - April 30, 2008 A tutorial on: Iterative methods for Sparse Matrix Problems Yousef Saad University of Minnesota Computer Science and Engineering CRM Montreal - April 30, 2008 Outline Part 1 Sparse matrices and sparsity

More information

A Pattern-Based Approach to. Automated Application Performance Analysis

A Pattern-Based Approach to. Automated Application Performance Analysis A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,

More information

HSL and its out-of-core solver

HSL and its out-of-core solver HSL and its out-of-core solver Jennifer A. Scott j.a.scott@rl.ac.uk Prague November 2006 p. 1/37 Sparse systems Problem: we wish to solve where A is Ax = b LARGE Informal definition: A is sparse if many

More information

Experiences With Mobile Processors for Energy Efficient HPC

Experiences With Mobile Processors for Energy Efficient HPC Experiences With Mobile Processors for Energy Efficient HPC Nikola Rajovic, Alejandro Rico, James Vipond, Isaac Gelado, Nikola Puzovic, Alex Ramirez Barcelona Supercomputing Center Universitat Politècnica

More information

Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology

Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology Express Introductory Training in ANSYS Fluent Lecture 1 Introduction to the CFD Methodology Dimitrios Sofialidis Technical Manager, SimTec Ltd. Mechanical Engineer, PhD PRACE Autumn School 2013 - Industry

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum

More information

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923

AMD PhenomII. Architecture for Multimedia System -2010. Prof. Cristina Silvano. Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 AMD PhenomII Architecture for Multimedia System -2010 Prof. Cristina Silvano Group Member: Nazanin Vahabi 750234 Kosar Tayebani 734923 Outline Introduction Features Key architectures References AMD Phenom

More information

Toward a New Metric for Ranking High Performance Computing Systems

Toward a New Metric for Ranking High Performance Computing Systems SANDIA REPORT SAND2013-4744 Unlimited Release Printed June 2013 Toward a New Metric for Ranking High Performance Computing Systems Jack Dongarra, University of Tennessee Michael A. Heroux, Sandia National

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW

TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW TWO-DIMENSIONAL FINITE ELEMENT ANALYSIS OF FORCED CONVECTION FLOW AND HEAT TRANSFER IN A LAMINAR CHANNEL FLOW Rajesh Khatri 1, 1 M.Tech Scholar, Department of Mechanical Engineering, S.A.T.I., vidisha

More information

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental

More information

Accelerating Wavelet-Based Video Coding on Graphics Hardware

Accelerating Wavelet-Based Video Coding on Graphics Hardware Wladimir J. van der Laan, Andrei C. Jalba, and Jos B.T.M. Roerdink. Accelerating Wavelet-Based Video Coding on Graphics Hardware using CUDA. In Proc. 6th International Symposium on Image and Signal Processing

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions

More information

Dense Linear Algebra Solvers for Multicore with GPU Accelerators

Dense Linear Algebra Solvers for Multicore with GPU Accelerators Dense Linear Algebra Solvers for Multicore with GPU Accelerators Stanimire Tomov, Rajib Nath, Hatem Ltaief, and Jack Dongarra Department of Electrical Engineering and Computer Science, University of Tennessee,

More information

An Overview of the Finite Element Analysis

An Overview of the Finite Element Analysis CHAPTER 1 An Overview of the Finite Element Analysis 1.1 Introduction Finite element analysis (FEA) involves solution of engineering problems using computers. Engineering structures that have complex geometry

More information