Graphic Processing Units: a possible answer to High Performance Computing?

4th ABINIT Developer Workshop, Résidence L'Escandille, Autrans
Luigi Genovese, Theory Group, ESRF - Grenoble
26 March 2009
http://inac.cea.fr/l_sim/

The code: different numerical operations performed
Interpolating Daubechies
- For each wavefunction (Hamiltonian application): numerical convolutions with short filters
- Between wavefunctions (linear algebra): BLAS routines
- FFT (Poisson solver)

Separable convolutions
We must calculate
$$F(I_1,I_2,I_3) = \sum_{j_1,j_2,j_3=0}^{L} h_{j_1} h_{j_2} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3) = \sum_{j_1=0}^{L} h_{j_1} \sum_{j_2=0}^{L} h_{j_2} \sum_{j_3=0}^{L} h_{j_3}\, G(I_1-j_1,\, I_2-j_2,\, I_3-j_3)$$
Application of three successive operations:
1. $A_3(I_3,i_1,i_2) = \sum_j h_j\, G(i_1,i_2,I_3-j) \quad \forall\, i_1,i_2$;
2. $A_2(I_2,I_3,i_1) = \sum_j h_j\, A_3(I_3,i_1,I_2-j) \quad \forall\, I_3,i_1$;
3. $F(I_1,I_2,I_3) = \sum_j h_j\, A_2(I_2,I_3,I_1-j) \quad \forall\, I_2,I_3$.
Main routine: convolution + transposition
$$F(I,a) = \sum_j h_j\, G(a,\, I-j) \quad \forall\, a$$
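The three successive one-dimensional passes can be sketched in a few lines of Python. This is only an illustration, not the code presented in the talk: the function names (conv_transpose, separable_conv3d, direct_conv3d), the flat row-major storage, and the periodic wrap-around of the index I - j are all assumptions made for the example. Each pass convolves along the last index and writes the result transposed, so that the next pass again works on the last index.

```python
def conv_transpose(h, g, ndat, n):
    """One pass of F(I, a) = sum_j h[j] * G(a, I - j).

    g is a flat list holding G(a, i) at index a * n + i (ndat rows of
    length n). The result is stored transposed, F(I, a) at I * ndat + a.
    Assumption: the index I - j wraps around periodically.
    """
    out = [0.0] * (n * ndat)
    for a in range(ndat):
        for big_i in range(n):
            acc = 0.0
            for j, hj in enumerate(h):
                acc += hj * g[a * n + (big_i - j) % n]
            out[big_i * ndat + a] = acc
    return out


def separable_conv3d(h, g, n1, n2, n3):
    """Three successive 1D passes; g holds G(i1, i2, i3) in row-major order."""
    a3 = conv_transpose(h, g, n1 * n2, n3)    # A3(I3, i1, i2)
    a2 = conv_transpose(h, a3, n3 * n1, n2)   # A2(I2, I3, i1)
    return conv_transpose(h, a2, n2 * n3, n1)  # F(I1, I2, I3)


def direct_conv3d(h, g, n1, n2, n3):
    """Reference implementation: the full triple sum, for checking only."""
    out = [0.0] * (n1 * n2 * n3)
    for i1 in range(n1):
        for i2 in range(n2):
            for i3 in range(n3):
                acc = 0.0
                for j1, h1 in enumerate(h):
                    for j2, h2 in enumerate(h):
                        for j3, h3 in enumerate(h):
                            src = (((i1 - j1) % n1) * n2
                                   + (i2 - j2) % n2) * n3 + (i3 - j3) % n3
                            acc += h1 * h2 * h3 * g[src]
                out[(i1 * n2 + i2) * n3 + i3] = acc
    return out
```

After the third pass the data is back in its original (I1, I2, I3) ordering, so no extra transposition is needed. For a filter of length L + 1, the direct sum costs (L + 1)^3 multiplications per grid point, while the three passes cost 3(L + 1).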

Evolution of accelerators
- Computing power has increased in recent years
- The price (and the power consumption) per GFlops keeps decreasing!

Why not use GPUs?
How to write scientific applications for a GPU?
- GPU hardware is designed for graphics calculations: texture shaders, rendering
- Single precision calculations
- Which programming language?

The CUDA programming language
Nvidia: the CUDA programming language
- The API is an extension to ANSI C(++)
- Low learning curve
- The hardware is designed for a lightweight runtime
- High performance with moderate optimisation costs
- The SDK helps the beginner
CUBLAS and CUFFT
- Nvidia provides the CUDA user with pre-built libraries for BLAS and FFT
- They can really be used as black boxes
Since July 2008, Nvidia cards fully support double precision (IEEE compliant)

Example: BLAS routines used in the code
Double precision calculations for N_orb × m matrices (m = 300 kB)
[Figure: speedup (double precision) versus number of orbitals, for DGEMM and DSYRK]

Performances, free BC
[Figure: percentage of time spent in each code section (Comm, Other, Precond, HamApp, PSolver, sumrho, LinAlg) and total time in seconds (log. scale), versus number of atoms]

Performances, periodic BC (8 atoms/core)
[Figure: percentage of time spent in each code section (Comm, Other, PureCPU, precond, locham, locden, LinAlg) and total time in seconds (log. scale), versus number of cores]

Convolutions on the GPU
Main compute-intensive routine: convolution + transposition
$$F(I,a) = \sum_j h_j\, G(a,\, I-j) \quad \forall\, a$$
- Combined set of 1D convolutions: easy to parallelize (in the GPU sense)
- Short filters: loop unrolling, fewer registers
- Optimal for hiding memory latency with arithmetic
The GPU code makes it possible to use hybrid CPU-GPU supercomputers
- CEA/GENCI Titane machine, hybrid section: 192 Nvidia Tesla GPUs over 800 Intel Nehalem cores

One-dimensional convolutions, June 2008
Preliminary results (internship of M. Ospici, LIG - Bull)
[Figure: speedup versus data size (Mb), for G80, GT200 (single precision) and GT200 (double precision)]

All the three-dimensional convolution operators
Now: double precision calculations, full 3D operators
[Figure: speedup (double precision) versus wavefunction size (MB), for locden, locham and precond]

Full hybrid code
We can insert it in the full code, in parallel
[Figure: CPU code versus hybrid code (relative): percentage of time spent in each code section (Comm, Other, PureCPU, precond, locham, locden, LinAlg) and total time in seconds (log. scale), versus number of cores]

Around 7 times faster
A lot can still be improved
[Figure: percentage of time spent in each code section (Comm, Other, PureCPU, precond, locham, locden, LinAlg) and total time in seconds (log. scale), versus number of cores]

Summary and outlook
Considerations:
- In our experience, GPUs may represent a real alternative for speeding up the code
- Production-level use is accessible, not only prototypes
... but ...
- Can one draw general conclusions? Probably not
- How can we estimate the benefit/cost ratio?
  - Nature of the numerical operations
  - Hot spot (> 80% of the overall time)
  - Multi-GPU?
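One common way to get a first estimate of the benefit side of that ratio is Amdahl's law, which bounds the overall gain by the fraction of the runtime spent in the accelerated hot spot. The slides do not name this formula; it is brought in here as an assumption, and the function names below (overall_speedup, speedup_limit) are hypothetical. The sketch also ignores host-device transfer costs, which would lower the result.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's law: whole-code speedup when only the hot spot is accelerated.

    hot_fraction   : share of the original runtime spent in the hot spot (0..1)
    kernel_speedup : speedup obtained on the hot spot alone (> 0)
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must lie between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


def speedup_limit(hot_fraction):
    """Upper bound on the overall speedup for an infinitely fast hot spot."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must lie between 0 and 1")
    if hot_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - hot_fraction)
```

For example, with 80% of the runtime in the hot spot and a kernel speedup of 20, overall_speedup(0.8, 20.0) gives about 4.2, and speedup_limit(0.8) shows that the gain can never exceed 5 however fast the kernel becomes. This is why the size of the hot spot matters as much as the raw kernel speedup.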