Graphic Processing Units: a possible answer to High Performance Computing?

1 4th ABINIT Developer Workshop, Résidence L'Escandille, Autrans. HPC & GPU. Graphic Processing Units: a possible answer to High Performance Computing? Luigi Genovese, ESRF - Grenoble, 26 March. Luigi Genovese, Theory Group - ESRF

2 The code. Interpolating Daubechies. Different numerical operations are performed: for each wavefunction (Hamiltonian application) and between wavefunctions (linear algebra). Numerical operations: convolutions with short filters, BLAS routines, FFT (Poisson solver).

3 Separable convolutions. We must calculate

\[
F(I_1,I_2,I_3) = \sum_{j_1,j_2,j_3=0}^{L} h_{j_1} h_{j_2} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)
= \sum_{j_1=0}^{L} h_{j_1} \sum_{j_2=0}^{L} h_{j_2} \sum_{j_3=0}^{L} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)
\]

Application of three successive one-dimensional convolutions:

1. \( A_3(I_3,i_1,i_2) = \sum_j h_j\, G(i_1,i_2,I_3-j) \quad \forall\, i_1,i_2 \);
2. \( A_2(I_2,I_3,i_1) = \sum_j h_j\, A_3(I_3,i_1,I_2-j) \quad \forall\, I_3,i_1 \);
3. \( F(I_1,I_2,I_3) = \sum_j h_j\, A_2(I_2,I_3,I_1-j) \quad \forall\, I_2,I_3 \).

Main routine: convolution + transposition

\[
F(I,a) = \sum_j h_j\, G(a,I-j) \quad \forall\, a
\]
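
The three-pass scheme of slide 3 can be sketched in a few lines of plain Python. This is only an illustration, not the code presented in the talk: the function names, the flat row-major storage and the periodic wrap-around at the boundaries are all assumptions made for the example. Each pass applies the "convolution + transposition" routine F(I,a) to the fastest-running index, so after three passes the original index order is restored.

```python
from itertools import product


def conv_transpose(h, G, m, n):
    """One pass: F(I, a) = sum_j h[j] * G(a, I - j).

    G is a flat list read as G[a * n + i]; the result is written in the
    transposed layout F[I * m + a]. Periodic wrap-around is an assumption.
    """
    F = [0.0] * (m * n)
    for a in range(m):
        for I in range(n):
            acc = 0.0
            for j, hj in enumerate(h):
                acc += hj * G[a * n + (I - j) % n]
            F[I * m + a] = acc
    return F


def separable_conv3d(h, G, dims):
    """Three successive 1D passes; G is stored with i3 running fastest."""
    n1, n2, n3 = dims
    A3 = conv_transpose(h, G, n1 * n2, n3)    # layout A3(I3, i1, i2)
    A2 = conv_transpose(h, A3, n3 * n1, n2)   # layout A2(I2, I3, i1)
    return conv_transpose(h, A2, n2 * n3, n1)  # layout F(I1, I2, I3)


def direct_conv3d(h, G, dims):
    """Reference: the full triple sum, used only to check the result."""
    n1, n2, n3 = dims
    F = [0.0] * (n1 * n2 * n3)
    for I1, I2, I3 in product(range(n1), range(n2), range(n3)):
        acc = 0.0
        for j1, j2, j3 in product(range(len(h)), repeat=3):
            src = (((I1 - j1) % n1) * n2 + (I2 - j2) % n2) * n3 + (I3 - j3) % n3
            acc += h[j1] * h[j2] * h[j3] * G[src]
        F[(I1 * n2 + I2) * n3 + I3] = acc
    return F
```

For a filter with L + 1 coefficients, the direct sum costs (L + 1)^3 multiplications per grid point, while the three passes cost 3 (L + 1), which is the reason for the decomposition.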

4 Evolution of accelerators. Computing power has been increasing in recent years. The price (and the power consumption) of a GFlops keeps getting lower!

5 Why not use GPUs? How can scientific computations be coded on a GPU? The hardware is designed for graphics calculations: texture shaders, rendering, single-precision calculations. Which programming language?

6 The CUDA programming language. NVIDIA: the CUDA programming language. The API is an extension to ANSI C(++), with a low learning curve. The hardware is designed for a lightweight runtime: high performance with moderate optimisation costs. A number of SDK examples help the beginner. CUBLAS and CUFFT: NVIDIA provides the CUDA user with pre-built libraries for BLAS and FFT, which can really be used as black boxes. Since July 2008, NVIDIA cards fully support double precision (IEEE compliant).

7 Example: BLAS in double-precision calculations for N_orb × m matrices (m = 300kB). [Figure: speedup (double precision) of DGEMM and DSYRK versus number of orbitals.]

8 Performances, free BC. [Figure: axes Percent and Seconds (log. scale) versus Number of atoms; legend: Time (sec), Comm (%), Other, Precond, HamApp, PSolver, sumrho, LinAlg.]

9 Performances, periodic BC (8 atoms/core). [Figure: axes Percent and Seconds (log. scale) versus Number of cores; legend: Time (sec), Comm, Other, PureCPU, precond, locham, locden, LinAlg.]

10 The convolutions on GPU. Main intensive routine: convolution + transposition

\[
F(I,a) = \sum_j h_j\, G(a,I-j) \quad \forall\, a
\]

Combined set of 1D convolutions: easy to parallelize (in the GPU sense). Short filters: loop unrolling, fewer registers. Optimal for hiding memory latency behind arithmetic. The GPU code makes it possible to access hybrid CPU-GPU supercomputers: CEA/GENCI Titane machine, hybrid section with 192 Nvidia Tesla over 800 Intel Nehalem cores.
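
Why the routine F(I,a) is "easy to parallelize" can be illustrated with a small emulation in plain Python. This is a sketch under assumptions, not the CUDA kernel from the talk: the names `conv_kernel` and `launch`, the one-element-per-thread mapping and the periodic wrap-around are hypothetical choices made for the example. Each emulated thread writes exactly one output element and reads only the input array, so the threads can run in any order.

```python
import random


def conv_kernel(tid, h, G, F, m, n):
    """Work of one hypothetical thread: a single output element F(I, a)."""
    I, a = divmod(tid, m)       # output layout is F[I * m + a]
    acc = 0.0
    for j, hj in enumerate(h):  # short filter: small, fixed trip count
        acc += hj * G[a * n + (I - j) % n]
    F[tid] = acc


def launch(h, G, m, n, order=None):
    """Run every emulated thread once, in the given (arbitrary) order."""
    F = [0.0] * (m * n)
    tids = range(m * n) if order is None else order
    for tid in tids:
        conv_kernel(tid, h, G, F, m, n)
    return F


# Usage: a shuffled execution order gives the same result as a sequential one.
m, n = 4, 6
h = [0.5, 0.25, 0.25]
G = [float(k % 5) for k in range(m * n)]
order = list(range(m * n))
random.Random(0).shuffle(order)
same = launch(h, G, m, n) == launch(h, G, m, n, order)
```

Because the filter length is small and known in advance, the inner loop is the part a real kernel would unroll.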

11 One-dimensional convolutions, June 2008. Preliminary results (internship of M. Ospici, LIG - Bull). [Figure: speedup versus data size (Mb) for G80, GT200 (single precision) and GT200 (double precision).]

12 All the three-dimensional convolution operators. Now: double-precision calculations, full 3D operators. [Figure: speedup (double precision) versus wavefunction size (MB) for locden, locham and precond.]

13 Full hybrid code. We can insert it in the full code, in parallel. [Figure: CPU code and hybrid code (rel.); axes Percent and Seconds (log. scale) versus Number of cores; legend: Time (sec), Comm, Other, PureCPU, precond, locham, locden, LinAlg.]

14 Around 7 times faster. A lot can still be improved. [Figure: axes Percent and Seconds (log. scale) versus Number of cores; legend: Time (sec), Comm, Other, PureCPU, precond, locham, locden, LinAlg.]

15 Summary and outlook. Considerations: in our experience, GPUs may represent a real alternative for speeding up the [...]. Production [...] are accessible, not only prototypes... but... Can one draw general conclusions? Probably not... How can we estimate the benefit/cost ratio? Nature of the numerical operations. Hot-spot (> 80% of the overall time). Multi-GPU?
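
One common way to reason about the benefit side of that ratio is Amdahl's law. It is not named on the slide and is brought in here only as an illustration. The sketch below assumes that a fraction `hot_fraction` of the run time (the hot-spot) is accelerated by a factor `kernel_speedup` while the rest is unchanged; both parameter names are hypothetical.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's law: only the hot-spot share of the run time gets faster."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


# Usage: an 80% hot-spot accelerated by various factors.
for s in (5.0, 20.0, 100.0):
    print(f"kernel x{s:g} -> whole code x{overall_speedup(0.8, s):.2f}")
```

Under this model a hot-spot of exactly 80% caps the overall gain at a factor of 5 however fast the accelerated part becomes, which is why the size of the hot-spot matters so much for the estimate.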
