PRACE 2IP WP8 Materials Science Activity Report
March 6, 2013
Codes involved in this task:
- ABINIT (M. Torrent)
- Quantum ESPRESSO (F. Affinito)
- YAMBO + Octopus (F. Nogueira)
- SIESTA (G. Huhs)
- EXCITING/ELK (A. Kozhevnikov)
ABINIT

PRACE 2IP-WP8 developer groups:
- CEA (TGCC, Paris): Marc Torrent, Florent Dahm
- CEA (INAC, Grenoble): Luigi Genovese, Brice Videau
- UCL (Louvain-la-Neuve): Xavier Gonze, Matteo Giantomassi
- BSC (Barcelona): Georg Huhs

PRACE 2IP-WP8 task list:
- Ground state + plane waves: introduce shared-memory parallelism; improve load balancing; automate the processor topology and the use of libraries
- Ground state + wavelets: implement a complete automatic code-generation solution for the convolutions
- Excited states: implement a hybrid MPI-OpenMP approach; use ScaLAPACK for the inversion of the dielectric matrix; implement a new MPI distribution for the orbitals; use MPI-IO routines to read the wave functions
- Response function: remove bottlenecks and distribute wave functions; parallelize several sections that were still sequential; parallelize the outer loop over perturbations

Everything is done and committed in the trunk of the development version (v7.3.0).
ABINIT: hybrid MPI-OpenMP version, Fourier transforms

[Figure: FFT performance before and after the work, comparing the MKL and native implementations; TGCC Curie, Intel Sandy Bridge]
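The hybrid scheme keeps the existing MPI distribution and adds threads inside each rank's FFTs. A minimal sketch of the pattern in C, using threaded FFTW as a stand-in library (ABINIT actually uses its own FFT and the MKL back-end; the grid size and layout here are illustrative):

```c
/* Minimal sketch: each MPI rank performs its local 3D FFTs with a
 * thread-parallel FFT library (FFTW shown as a stand-in).  Compile
 * with e.g.  mpicc -fopenmp fft_hybrid.c -lfftw3_threads -lfftw3 -lm  */
#include <mpi.h>
#include <omp.h>
#include <fftw3.h>
#include <stddef.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int n = 64;                               /* hypothetical grid size */
    fftw_init_threads();                      /* enable threaded FFTW   */
    fftw_plan_with_nthreads(omp_get_max_threads());

    fftw_complex *box = fftw_alloc_complex((size_t)n * n * n);
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, box, box,
                                      FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill `box` with this rank's share of wave-function data ... */
    fftw_execute(plan);                       /* threads used internally */

    fftw_destroy_plan(plan);
    fftw_free(box);
    fftw_cleanup_threads();
    MPI_Finalize();
    return 0;
}
```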
ABINIT: hybrid MPI-OpenMP version, non-local operator (Hamiltonian)

Specific kernels were (re-)written.

# threads   Speedup
 1           1.00
 4           3.85
 8           6.81

- Up to 8 threads: 85% efficiency
- Beyond 8 threads: thread-synchronization issues
- Under development: an OpenACC version

Test case: 107 gold atoms, 128 MPI processes (TGCC Curie, Intel Sandy Bridge)
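The non-local operator is separable, Vnl = sum_i |p_i> e_i <p_i|, which makes the band index a natural OpenMP loop. A minimal sketch of that threading pattern, with a hypothetical flat data layout (not ABINIT's actual kernels):

```c
/* Minimal sketch: apply a separable non-local operator
 * Vnl = sum_i |p_i> e_i <p_i|  to nband vectors, threading over bands. */
#include <stddef.h>

void apply_vnl(int nband, int npw, int nproj,
               const double *proj,   /* nproj x npw projectors          */
               const double *eig,    /* nproj coupling strengths        */
               const double *psi,    /* nband x npw input vectors       */
               double *vpsi)         /* nband x npw output              */
{
    #pragma omp parallel for schedule(static)
    for (int ib = 0; ib < nband; ++ib) {
        for (int ig = 0; ig < npw; ++ig)
            vpsi[(size_t)ib * npw + ig] = 0.0;
        for (int ip = 0; ip < nproj; ++ip) {
            double dot = 0.0;                 /* <p_ip | psi_ib>         */
            for (int ig = 0; ig < npw; ++ig)
                dot += proj[(size_t)ip * npw + ig]
                     * psi[(size_t)ib * npw + ig];
            dot *= eig[ip];
            for (int ig = 0; ig < npw; ++ig)  /* add e * |p><p|psi>      */
                vpsi[(size_t)ib * npw + ig]
                    += dot * proj[(size_t)ip * npw + ig];
        }
    }
}
```

Bands are independent of each other here, so the loop needs no synchronization; the reported issues beyond 8 threads arise in kernels where threads do share accumulators.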
ABINIT: ground-state plane-wave section, load balancing

- Load balancing on bands: improved!
- Load balancing on plane waves: needs a small communication

The waiting time has decreased!

[Figure: timelines before, with load balancing on bands, and with load balancing on plane waves; TGCC Curie, Intel Sandy Bridge; test case: 107 gold atoms]
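Balancing the bands amounts to never letting two ranks' shares differ by more than one band. A minimal sketch of such a block distribution (illustrative, not ABINIT's routine):

```c
/* Minimal sketch: balanced block distribution of nband bands over
 * nproc processes -- the first (nband % nproc) ranks get one extra
 * band, so no rank waits on a neighbour with a much larger share.     */
void band_range(int nband, int nproc, int rank, int *first, int *count)
{
    int base = nband / nproc;
    int rem  = nband % nproc;
    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}
```

With 120 bands on 16 processes, every rank gets 7 or 8 bands, so the slowest rank waits on at most one extra band's worth of work.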
ABINIT: response-function section

New parallelization level over perturbations:
- Mostly reuses work done for the image parallelization (NEB, path-integral MD)
- Divide-and-conquer scheme prepared

Improved wave-function initialization:
- Orthogonalization suppressed
- New MPI-IO routines (see the sketch below)

# MPI proc.   Speedup vs 32
  32           1.0
  64           2.0
 256           7.9
 512          16.0

Test case: BaTiO3, 29 atoms, 16 irreducible perturbations, 16 k-points, 120 bands (TGCC Curie, Intel Sandy Bridge)
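Replacing a read-and-broadcast initialization with MPI-IO lets every rank pull its own slice of the wave-function file collectively. A minimal sketch (the file name, offsets, and layout are hypothetical):

```c
/* Minimal sketch: every rank reads its own contiguous block of
 * wave-function coefficients collectively with MPI-IO, instead of
 * rank 0 reading everything and broadcasting.                         */
#include <mpi.h>

void read_wf_block(const char *fname, MPI_Offset offset,
                   int nlocal, double *buf)
{
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, (char *)fname, /* cast: pre-MPI-3 ABI */
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* collective read: all ranks participate, so the MPI library can
       aggregate the requests into large contiguous accesses            */
    MPI_File_read_at_all(fh, offset, buf, nlocal,
                         MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```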
ABINIT: response-function section

[Figure: execution profile before and after the optimizations]

Remaining functions are under study.
ABINIT: automatic process distribution among parallelization levels

Adapt the process topology to the problem size and the architecture:
- First level: a simple heuristic is used to predict the scaling factor
- Second level: micro-benchmarks are used to choose libraries (CUDA, ScaLAPACK, ...) and adjust the process topology

[Figure: real vs. predicted speedup; TGCC Curie, Intel Sandy Bridge; test case: 107 gold atoms]
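To give the flavour of the first-level heuristic: enumerate the factorizations of the process count over two parallelization levels and keep the grid with the best predicted speedup. The scoring model below is purely illustrative and is not ABINIT's actual formula:

```c
/* Minimal sketch of a first-level heuristic (the weights are purely
 * illustrative): enumerate factorizations nproc = np_band * np_pw and
 * score each with a crude speedup model, keeping the best grid.       */
static double predicted_speedup(int np_band, int np_pw,
                                int nband, int npw)
{
    /* assumed model: band parallelism scales well up to nband procs,
       plane-wave parallelism saturates earlier (communication)         */
    double s_band  = np_band <= nband ? (double)np_band : (double)nband;
    double eff_pw  = np_pw  <= npw   ? (double)np_pw  : (double)npw;
    double s_pw    = eff_pw / (1.0 + 0.05 * eff_pw);  /* illustrative   */
    return s_band * s_pw;
}

void best_grid(int nproc, int nband, int npw, int *np_band, int *np_pw)
{
    double best = -1.0;
    for (int nb = 1; nb <= nproc; ++nb) {
        if (nproc % nb) continue;                     /* exact factors  */
        double s = predicted_speedup(nb, nproc / nb, nband, npw);
        if (s > best) { best = s; *np_band = nb; *np_pw = nproc / nb; }
    }
}
```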
ABINIT-BigDFT: automatic code generator for convolutions (wavelet basis)

BOAST (Bringing Optimization through Automatic Source-to-Source Transformations): a new parametrized generator that can optimize the BigDFT convolutions, which compilers optimize and vectorize poorly.

- Simple reference code: 0.3 GFLOPS
- Hand-vectorized code: 6.5 GFLOPS
- BOAST is used to generate multiple versions of the reference convolution
- Architecture-dependent optimization parameters: optimal unroll degree, resource-usage pattern, ...

Ported on Tibidabo (PRACE prototype at BSC).
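The generator's output is ordinary source code specialized for one point of the parameter space. A minimal sketch of the kind of kernel it might emit: a 1D filter convolution unrolled by 4, where the filter length and unroll degree are the tuning parameters being swept (values here are illustrative; the real BigDFT magic filters differ):

```c
/* Minimal sketch of generator output: 1D filter convolution with the
 * outer loop unrolled by 4.  `x` must hold n + LFIL - 1 elements.     */
#define LFIL 16   /* filter length (illustrative) */

void conv1d_unroll4(int n, const double *x, const double *f, double *y)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {           /* unrolled by 4 */
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int j = 0; j < LFIL; ++j) {
            double fj = f[j];
            s0 += fj * x[i     + j];
            s1 += fj * x[i + 1 + j];
            s2 += fj * x[i + 2 + j];
            s3 += fj * x[i + 3 + j];
        }
        y[i] = s0; y[i+1] = s1; y[i+2] = s2; y[i+3] = s3;
    }
    for (; i < n; ++i) {                  /* remainder loop */
        double s = 0.0;
        for (int j = 0; j < LFIL; ++j) s += f[j] * x[i + j];
        y[i] = s;
    }
}
```

The generator emits and times many such variants per architecture and keeps the fastest, which is how the same source tree stays efficient across machines like Tibidabo and Curie.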
Quantum ESPRESSO

- Parallelization on bands (PW, CP, GIPAW, PHonon)
- OpenMP parallelization
- EPW parallelization
- ELPA implementation
- PP testing (G2 test set)
- Improvement of portability (MIC)

Groups involved: CINECA (Italy), ICHEC (Ireland), University of Sofia (Bulgaria), IPB (Serbia)
Quantum ESPRESSO - ELPA

Implemented the 1-stage ELPA solver for symmetric matrices (gamma point).
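ELPA's solvers are drop-in replacements for the ScaLAPACK eigensolvers and work on the same 2D block-cyclic matrix distribution. As a point of reference, a minimal sketch of the ScaLAPACK call that a 1-stage ELPA solve substitutes (descriptor and workspace setup are assumed done elsewhere):

```c
/* Minimal sketch: the ScaLAPACK divide-and-conquer eigensolver call
 * that an ELPA 1-stage solve substitutes.  Both operate on the same
 * 2D block-cyclic distribution (descriptors desca/descz), so swapping
 * solvers leaves the surrounding code untouched.  Hidden Fortran
 * string lengths are omitted, as is common when calling from C.      */
extern void pdsyevd_(const char *jobz, const char *uplo, const int *n,
                     double *a, const int *ia, const int *ja,
                     const int *desca, double *w,
                     double *z, const int *iz, const int *jz,
                     const int *descz, double *work, const int *lwork,
                     int *iwork, const int *liwork, int *info);

void diag_block_cyclic(int n, double *a, int *desca,
                       double *w, double *z, int *descz,
                       double *work, int lwork, int *iwork, int liwork)
{
    int one = 1, info = 0;
    pdsyevd_("V", "U", &n, a, &one, &one, desca,
             w, z, &one, &one, descz,
             work, &lwork, iwork, &liwork, &info);
    /* An ELPA build replaces this single call; eigenvalues land in w,
       eigenvectors in the distributed matrix z.                       */
}
```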
Quantum ESPRESSO - MIC/GPUs

- Several modules had already been ported to GPUs with phiGEMM (outside the PRACE activity)
- CP and PW have now been ported in native mode (no offload) to KNC
Quantum ESPRESSO - PHonon

- Introducing a new level of MPI parallelism into the PHonon code, which will allow PHonon to scale on petascale machines
- Parallelizing over bands, with an implementation similar to PWscf and GIPAW (see the communicator sketch below); PHonon is more complicated than GIPAW, with more dependencies
- So far, implemented for the gamma and non-gamma parts of the code
- Currently debugging, testing, and benchmarking; this will rely on more input data sets from the community
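Band parallelism of this kind is typically set up by splitting the world communicator into band groups. A minimal sketch (variable names are illustrative, not the actual PHonon ones; the total rank count is assumed divisible by the number of groups):

```c
/* Minimal sketch: carve MPI_COMM_WORLD into nbgrp band groups.  Within
 * a group the previous distribution (plane waves, pools) is reused;
 * different groups treat disjoint sets of bands.                      */
#include <mpi.h>

void make_band_groups(int nbgrp, MPI_Comm *intra, MPI_Comm *inter)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int per_grp = size / nbgrp;
    MPI_Comm_split(MPI_COMM_WORLD, rank / per_grp, rank, intra);
    /* ranks holding the same slot in every band group, used to sum
       band-resolved contributions across groups                       */
    MPI_Comm_split(MPI_COMM_WORLD, rank % per_grp, rank, inter);
}
```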
Quantum ESPRESSO - EPW
Quantum ESPRESSO - benchmarking

Benchmarking and testing on the G2 test set: checking the accuracy of DFT calculations using the B3LYP functional.

Differences in the B3LYP total electronic energies (with QE) between single-point calculations at the MP2-optimized geometry and structures optimized with B3LYP (QE), for selected molecules:

Molecule   Neutrals (eV)   Cations (eV)
H2O        0.01            0.03
CH4        0.01            0.05
NH3        0.00            0.02
PH3        0.01            0.05
SiH4       0.02            0.06
H2S        0.00            0.01
CO         0.10            0.17
C2H2       0.06            0.11
C2H4       0.02            0.09
C2H6       0.01            0.02
Average    0.02            0.06
Yambo

The work was shared between two partners: the University of Coimbra and CINECA. The University of Coimbra was not able to fulfil its commitment for administrative reasons, so CINECA took over with the involvement of the Italian community of developers.
Yambo

The code needed a deep refactoring to be able to run on Tier-0 architectures:
- Multilevel MPI parallelization
- OpenMP parallelization
- Distribution of the data structures (sketched below)

Before this work, scaling was strongly limited to a few hundred cores, mainly due to memory limitations.
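Distributing the data structures is what removes the per-core memory ceiling: a matrix that was replicated on every rank is instead split into row blocks. A minimal sketch of that idea with MPI_Scatterv (names and layout are illustrative, not Yambo's actual structures):

```c
/* Minimal sketch: scatter contiguous row blocks of a matrix that was
 * previously replicated, so the per-rank footprint shrinks with the
 * number of processes.  `full` needs to be valid on the root only.    */
#include <mpi.h>
#include <stdlib.h>

double *scatter_rows(const double *full, int nrow, int ncol,
                     MPI_Comm comm, int *nloc)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *counts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int r = 0, off = 0; r < size; ++r) {
        int rows = nrow / size + (r < nrow % size ? 1 : 0);
        counts[r] = rows * ncol;          /* elements per rank          */
        displs[r] = off;
        off += counts[r];
    }
    *nloc = counts[rank] / ncol;          /* local row count            */
    double *local = malloc(counts[rank] * sizeof(double));
    MPI_Scatterv(full, counts, displs, MPI_DOUBLE,
                 local, counts[rank], MPI_DOUBLE, 0, comm);
    free(counts); free(displs);
    return local;
}
```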
SIESTA

Work done:
- MAGMA solver implemented; its range of applications is limited
- Sakurai-Sugiura algorithm: prototype implemented and tested
- FEAST library tested
- The Sakurai-Sugiura and FEAST methods show a load-balancing problem, so these algorithms were dropped
- New method, PEXSI:
  - prototype phase finished
  - implementation into SIESTA started
SIESTA - PEXSI

PEXSI:
- Two levels of parallelization:
  - independent nodes (a good number: 80)
  - per node: e.g. 16, 64, 256, ... processes
  -> thousands of cores are still used efficiently
- Favorable computational complexity, without additional simplifications:
  - O(n^2) for 3D systems
  - O(n^(3/2)) for quasi-2D systems
  - O(n) for 1D systems
- The targets are huge systems (tens of thousands of atoms)
- Cooperation with a group working on layered systems of this size
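The favorable complexity comes from avoiding diagonalization altogether: PEXSI builds the density matrix from a pole expansion of the Fermi operator and evaluates each term with selected inversion, touching only the sparsity pattern of H and S. The central formula, paraphrased from the PEXSI literature:

```latex
\Gamma \;\approx\; \operatorname{Im}\!\left(\sum_{l=1}^{P}
    \frac{\omega_l}{H-(z_l+\mu)\,S}\right)
```

Here H and S are the sparse Hamiltonian and overlap matrices, mu is the chemical potential, and (z_l, omega_l) are the P pole shifts and weights, with P typically a few tens. The poles are independent, which gives the first parallelization level; the selected inversion of each shifted matrix gives the second, and its sparse-factorization cost is where the dimension-dependent O(n), O(n^(3/2)), O(n^2) complexities come from.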
SIESTA - PEXSI results

PEXSI outperforms ScaLAPACK when applied to ADNA-10, but not for a small C-BN-C (layered system) example. When increasing the problem size by stacking unit cells:
- the effort grows with O(n^(3/2)), then even linearly (see table; times in seconds)
- an example with 8 unit cells, meaning more than 20,000 atoms, becomes solvable
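To make the quoted exponent concrete (my arithmetic, not from the slides): for a quasi-2D layered system with runtime T(n) proportional to n^(3/2), stacking k unit cells multiplies the runtime by k^(3/2):

```latex
\frac{T(kn)}{T(n)} = k^{3/2}
\quad\Rightarrow\quad
\frac{T(8n)}{T(n)} = 8^{3/2} \approx 22.6,
\qquad \text{versus } 8^{3} = 512 \text{ for a cubic-scaling solver.}
```

A factor of roughly 20 instead of 500 is what makes the 8-unit-cell, 20,000-atom example tractable.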
Conclusions so far:
- Work has been successfully accomplished for all the codes involved in this task
- The meetings held during the work package have been useful for comparing and validating the obtained results and for sharing ideas and perspectives
- We all agree that, rather than working towards a common code, we should look for common rules that facilitate interchange between the different codes
Validation strategy:
- We need to show that the obtained results are relevant to the communities
- We need feedback on the interplay between the communities and the PRACE computing centers
- We want to highlight the importance of initiatives like WP8, where communities work together with scientists to improve the most relevant codes

We propose to include in the final deliverable a short section (1-2 pages) written by one representative per code from the scientific communities. In this section the work done in WP8 will be assessed by stressing the importance of the obtained results for the community.
Work to do:
- Complete the documentation on the HPC-Forge wiki
- Identify people from the communities for the writing of the deliverable
- Complete the WP8 work (synchronization of the repositories, documentation, benchmarking, reintegration) where needed