2IP WP8 Materials Science Activity Report, March 6, 2013




Codes involved in this task:
- ABINIT (M. Torrent)
- Quantum ESPRESSO (F. Affinito)
- YAMBO + Octopus (F. Nogueira)
- SIESTA (G. Huhs)
- EXCITING/ELK (A. Kozhevnikov)

ABINIT

PRACE 2IP-WP8 developer groups:
- CEA (TGCC, Paris): Marc Torrent, Florent Dahm
- CEA (INAC, Grenoble): Luigi Genovese, Brice Videau
- UCL (Louvain-la-Neuve): Xavier Gonze, Matteo Giantomassi
- BSC (Barcelona): Georg Huhs

PRACE 2IP-WP8 task list:
- Ground state + plane waves
  - Introduce shared-memory parallelism
  - Improve load balancing
  - Automate the choice of process topology and use of libraries
- Ground state + wavelets
  - Implement a complete solution for automatic code generation of the convolutions
- Excited states
  - Implement a hybrid MPI-OpenMP approach
  - Use ScaLAPACK for the inversion of the dielectric matrix
  - Implement a new MPI distribution for the orbitals
  - Use MPI-IO routines to read the wave functions
- Response function
  - Remove bottlenecks, distribute wave functions
  - Parallelize several sections that were still sequential
  - Parallelize the outer loop over perturbations

Everything is done and committed in the trunk of the development version (v7.3.0).

ABINIT - Hybrid MPI-OpenMP version: Fourier transforms

[Figure: Fourier-transform performance before and after, comparing the native implementation and MKL; TGCC-CURIE, Intel Sandy Bridge]

A generic sketch of a threaded FFT call is given below.
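The hybrid version relies on a threaded FFT library (MKL on Curie). As a rough illustration of what a multi-threaded 3D FFT call looks like, the sketch below uses FFTW's threaded interface as a generic stand-in for MKL's native implementation; the grid size and thread count are arbitrary choices, not values from the report.

```c
/* Hedged sketch of a multi-threaded 3D FFT.  The slide refers to Intel MKL's
 * native threaded FFTs; FFTW's threaded interface is used here only as a
 * generic stand-in.  Link with: -lfftw3_threads -lfftw3 -lm */
#include <fftw3.h>
#include <stdio.h>

int main(void) {
    const int n = 64;                      /* placeholder FFT grid size */

    fftw_init_threads();                   /* enable the threaded planner */
    fftw_plan_with_nthreads(8);            /* e.g. 8 threads per MPI task */

    fftw_complex *data = fftw_malloc(sizeof(fftw_complex) * n * n * n);
    for (int i = 0; i < n * n * n; i++) { data[i][0] = 1.0; data[i][1] = 0.0; }

    /* In-place forward 3D transform, executed with the threads set above. */
    fftw_plan plan = fftw_plan_dft_3d(n, n, n, data, data,
                                      FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan);
    printf("DC component after FFT: %g\n", data[0][0]);  /* should be n^3 */

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_cleanup_threads();
    return 0;
}
```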

ABINIT - Hybrid MPI-OpenMP version: non-local operator (Hamiltonian)

  # threads   Speedup
  1           1.00
  4           3.85
  8           6.81

- Specific kernels (re-)written (see the sketch of the hybrid pattern below)
- Up to 8 threads: 85% efficiency
- More than 8 threads: thread synchronization issues
- Under development: OpenACC version

TGCC-CURIE, Intel Sandy Bridge; test case: 107 gold atoms, 128 MPI processes
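Layering OpenMP threads inside each MPI process lets kernels such as the non-local operator use shared memory within a node while MPI handles the distribution across nodes. The C sketch below shows this hybrid pattern in a generic form; the array sizes and the dummy kernel are placeholders, not ABINIT code.

```c
/* Minimal hybrid MPI + OpenMP sketch (illustrative only, not ABINIT code).
 * Each MPI rank owns a block of bands; OpenMP threads share the work on
 * that block, mirroring the threaded non-local-operator kernels.
 * Compile e.g. with: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NBANDS_PER_RANK 64   /* placeholder problem size */
#define NPW 4096             /* placeholder number of plane waves */

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double *psi = malloc((size_t)NBANDS_PER_RANK * NPW * sizeof(double));
    for (long i = 0; i < (long)NBANDS_PER_RANK * NPW; i++) psi[i] = 1.0;

    double local_sum = 0.0;
    /* Threads split the bands owned by this rank (shared-memory level). */
    #pragma omp parallel for reduction(+:local_sum) schedule(static)
    for (int ib = 0; ib < NBANDS_PER_RANK; ib++) {
        double acc = 0.0;
        for (int ig = 0; ig < NPW; ig++)   /* stand-in for the operator kernel */
            acc += psi[(long)ib * NPW + ig] * 0.5;
        local_sum += acc;
    }

    /* MPI level: combine results across the distributed bands. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks=%d threads=%d sum=%g\n",
               nranks, omp_get_max_threads(), global_sum);

    free(psi);
    MPI_Finalize();
    return 0;
}
```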

ABINIT - Ground-state + plane-wave section: load balancing

- Load balancing on bands: improved!
- Load balancing on plane waves: needs a small communication

[Figure: waiting time before, with load balancing on bands, and with load balancing on plane waves; TGCC-CURIE, Intel Sandy Bridge, test case: 107 gold atoms]

Waiting time has decreased!

ABINIT - Response function section

New parallelization level over perturbations:
- Mostly re-uses work done for the image parallelization (NEB, path-integral MD)
- Divide-and-conquer scheme prepared

Improved wave-function initialization:
- Orthogonalization suppressed
- New MPI-IO routines (see the sketch below)

  # MPI proc.   Speedup vs 32
  32            1.0
  64            2.0
  256           7.9
  512           16.0

Test case: 29-atom BaTiO3, 16 irreducible perturbations, 16 k-points, 120 bands; TGCC-CURIE, Intel Sandy Bridge
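Reading the wave functions with MPI-IO lets every rank pull its own slice of a single file collectively, instead of funnelling the data through one process. The C sketch below shows this pattern in generic form; the file name, record layout, and slice size are invented for illustration and do not reflect the actual ABINIT wave-function format.

```c
/* Illustrative MPI-IO collective read of per-rank wave-function slices.
 * File name, layout, and sizes are hypothetical, not the ABINIT format. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const MPI_Offset ncoeff_per_rank = 1 << 16;     /* placeholder slice size */
    double *buf = malloc((size_t)ncoeff_per_rank * sizeof(double));

    MPI_File fh;
    /* "wfk.dat" is a placeholder file name. */
    int err = MPI_File_open(MPI_COMM_WORLD, "wfk.dat",
                            MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    if (err == MPI_SUCCESS) {
        /* Each rank reads its contiguous slice at its own offset,
         * collectively, so the MPI-IO layer can aggregate requests. */
        MPI_Offset offset = (MPI_Offset)rank * ncoeff_per_rank * sizeof(double);
        MPI_File_read_at_all(fh, offset, buf, (int)ncoeff_per_rank,
                             MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        if (rank == 0) printf("read %lld doubles per rank on %d ranks\n",
                              (long long)ncoeff_per_rank, nranks);
    } else if (rank == 0) {
        printf("wfk.dat not found (this is only a sketch)\n");
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```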

ABINIT - Response function section

[Figure: timing profile before and after the optimizations]

Remaining functions under study.

ABINIT - Automatic process distribution among parallelization levels

Adapt the process topology to the problem size and the architecture:
- First level: a simple heuristic is used to predict the scaling factor (a hypothetical sketch follows below)
- Second level: a micro-benchmark is used to choose libraries (CUDA, ScaLAPACK, ...) and adjust the process topology

[Figure: real vs. predicted speedup; TGCC-CURIE, Intel Sandy Bridge, test case: 107 gold atoms]
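One simple way to realize such a first-level heuristic is to estimate, for every candidate distribution of processes over the parallelization levels, a predicted speedup from assumed per-level efficiencies and keep the best candidate. The C sketch below is a hypothetical stand-in for ABINIT's actual heuristic; the efficiency model, level names, and numbers are assumptions.

```c
/* Hypothetical first-level heuristic: pick a (k-point x band x plane-wave)
 * process grid that maximizes a crude predicted speedup.  The efficiency
 * model below is an assumption, not ABINIT's real cost model. */
#include <stdio.h>

/* Assumed parallel efficiency of one level, given how many processes it
 * gets and how many work items it exposes. */
static double level_eff(int nproc, int work_items) {
    if (nproc <= 1) return 1.0;
    if (nproc > work_items) return 0.0;            /* over-decomposed */
    double imbalance = (double)(work_items % nproc) / work_items;
    double e = 1.0 - imbalance - 0.02 * nproc;     /* crude comm. penalty */
    return e > 0.0 ? e : 0.0;
}

int main(void) {
    const int total = 128;                         /* MPI processes available */
    const int nkpt = 16, nband = 120, npw = 4096;  /* placeholder problem size */
    double best = 0.0; int bk = 1, bb = 1, bp = 1;

    for (int pk = 1; pk <= total; pk++) {
        if (total % pk) continue;
        for (int pb = 1; pb <= total / pk; pb++) {
            if ((total / pk) % pb) continue;
            int pp = total / (pk * pb);
            double s = pk * level_eff(pk, nkpt)
                     * pb * level_eff(pb, nband)
                     * pp * level_eff(pp, npw);
            if (s > best) { best = s; bk = pk; bb = pb; bp = pp; }
        }
    }
    printf("predicted best grid: kpt=%d band=%d pw=%d (speedup ~%.0f)\n",
           bk, bb, bp, best);
    return 0;
}
```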

ABINIT-BigDFT - Automatic code generator for convolutions (wavelet basis)

A new parametrized generator that can optimize the BigDFT convolutions (badly optimized and vectorized by compilers):
- Simple implementation: 0.3 GFLOPS; hand-vectorized: 6.5 GFLOPS
- BOAST (Bringing Optimization Through Automatic Source-to-Source Transformations) is used to generate multiple versions of the reference convolution
- Architecture-dependent optimization parameters: optimal unroll degree, resource usage pattern, ...

Ported on Tibidabo (PRACE prototype at BSC). A sketch of the kind of transformation involved is given below.
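To illustrate the kind of transformation such a generator explores, the C sketch below compares a straightforward 1D filter convolution with a variant whose outer loop is unrolled by a fixed factor; an autotuner like BOAST would generate many such variants (different unroll degrees, vectorization strategies) and time them on the target machine. The filter length and unroll factor here are arbitrary, not BigDFT's actual parameters.

```c
/* Illustrative 1D convolution: reference version vs. a 4x-unrolled variant
 * of the kind an automatic generator would emit and benchmark. */
#include <stdio.h>

#define N 1024
#define FLEN 8           /* placeholder filter length */

void conv_ref(const double *in, double *out, const double *f) {
    for (int i = 0; i < N - FLEN; i++) {
        double acc = 0.0;
        for (int j = 0; j < FLEN; j++)
            acc += f[j] * in[i + j];
        out[i] = acc;
    }
}

/* Same computation with the outer loop unrolled by 4: fewer loop-control
 * instructions and more independent accumulators for the compiler to schedule. */
void conv_unroll4(const double *in, double *out, const double *f) {
    int i = 0;
    for (; i + 4 <= N - FLEN; i += 4) {
        double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
        for (int j = 0; j < FLEN; j++) {
            a0 += f[j] * in[i + j];
            a1 += f[j] * in[i + 1 + j];
            a2 += f[j] * in[i + 2 + j];
            a3 += f[j] * in[i + 3 + j];
        }
        out[i] = a0; out[i + 1] = a1; out[i + 2] = a2; out[i + 3] = a3;
    }
    for (; i < N - FLEN; i++) {          /* remainder loop */
        double acc = 0.0;
        for (int j = 0; j < FLEN; j++) acc += f[j] * in[i + j];
        out[i] = acc;
    }
}

int main(void) {
    static double in[N], out1[N], out2[N], f[FLEN];
    for (int i = 0; i < N; i++) in[i] = (double)i / N;
    for (int j = 0; j < FLEN; j++) f[j] = 1.0 / FLEN;
    conv_ref(in, out1, f);
    conv_unroll4(in, out2, f);
    printf("out1[10]=%g out2[10]=%g\n", out1[10], out2[10]);
    return 0;
}
```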

Quantum ESPRESSO

- Parallelization over bands (PW, CP, GIPAW, PHonon)
- OpenMP parallelization
- EPW parallelization
- ELPA implementation
- PP testing (G2 test set)
- Improvement of portability (MIC)

Groups involved: CINECA (Italy), ICHEC (Ireland), University of Sofia (Bulgaria), IPB (Serbia)

Quantum ESPRESSO - ELPA

Implemented the 1-stage ELPA solver for symmetric matrices (gamma point).

Quantum ESPRESSO - MIC/GPUs

- Several modules have already been ported to GPUs with phiGEMM (outside the PRACE activity)
- CP and PW have now been ported in native mode (no offload) to KNC

Quantum ESPRESSO - PHonon

- Introducing a new level of MPI parallelism into the PHonon code, which will allow PHonon to scale on petascale machines
- Parallelizing over bands, with an implementation similar to PWscf and GIPAW; PHonon is more complicated than GIPAW (more dependencies)
- So far, implemented for both the gamma and non-gamma parts of the code
- Currently debugging, testing and benchmarking; will rely on more input data sets from the community

A sketch of how an extra band-level communicator can be introduced is shown below.
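Adding a band level of MPI parallelism typically means splitting the existing pool of processes into band groups with their own communicators, so that bands can be distributed within each group and partial results summed across groups. The C sketch below shows this splitting pattern in general terms; the group count and naming are illustrative, not the actual Quantum ESPRESSO implementation.

```c
/* Illustrative creation of band-group communicators on top of an existing
 * pool of MPI processes.  Group count and names are placeholders, not the
 * actual Quantum ESPRESSO implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int nbgrp = 4;                       /* assumed number of band groups */
    int color = world_rank % nbgrp;            /* which band group this rank joins */

    /* Intra-group communicator: ranks in the same band group share a
     * subset of the bands and communicate only among themselves. */
    MPI_Comm band_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &band_comm);

    /* Inter-group communicator: links ranks holding the same local position
     * across band groups, used to sum partial results over the groups. */
    MPI_Comm inter_comm;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / nbgrp, world_rank, &inter_comm);

    int band_rank, band_size;
    MPI_Comm_rank(band_comm, &band_rank);
    MPI_Comm_size(band_comm, &band_size);
    printf("world %d/%d -> band group %d, local rank %d/%d\n",
           world_rank, world_size, color, band_rank, band_size);

    MPI_Comm_free(&band_comm);
    MPI_Comm_free(&inter_comm);
    MPI_Finalize();
    return 0;
}
```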

Quantum ESPRESSO - EPW

[Figure-only slide; no text transcribed]

Quantum ESPRESSO - benchmarking

Benchmarking and testing on the G2 test set; checking the accuracy of DFT calculations using the B3LYP functional.

Differences in the B3LYP total electronic energies (with QE) between single-point calculations at the MP2-optimized geometry and structures optimized with B3LYP (QE), for selected molecules:

  Molecule   Difference (eV), neutrals   Difference (eV), cations
  H2O        0.01                        0.03
  CH4        0.01                        0.05
  NH3        0.00                        0.02
  PH3        0.01                        0.05
  SiH4       0.02                        0.06
  H2S        0.00                        0.01
  CO         0.10                        0.17
  C2H2       0.06                        0.11
  C2H4       0.02                        0.09
  C2H6       0.01                        0.02
  Average    0.02                        0.06

Yambo

Work was shared between two partners: the University of Coimbra and CINECA.
- The University of Coimbra was not able to fulfill its commitment due to administrative issues
- CINECA took part with the involvement of the Italian community of developers

Yambo

The code needed a deep refactoring to be able to run on Tier-0 architectures:
- Multilevel parallelization
- OpenMP parallelization
- Distribution of data structures

Before this work, scaling was strongly limited to a few hundred cores (mainly due to memory limitations).

Yambo

[Figure-only slides; no text transcribed]

SIESTA

Work done:
- MAGMA solver implemented; its range of applications is limited
- Sugiura-Sakurai algorithm: prototype implemented and tested
- FEAST library tested
- The Sugiura-Sakurai and FEAST methods show a load balancing problem, so these algorithms were dropped
- New method: PEXSI
  - Prototype phase finished
  - Implementation into SIESTA started

SIESTA - PEXSI

PEXSI:
- Two levels of parallelization (see the sketch below):
  - Independent nodes (a good number: 80)
  - Per node, e.g. 16, 64, 256, ... processes → thousands of cores still efficient
- Favorable computational complexity, without additional simplifications:
  - O(n^2) for 3D systems
  - O(n^3/2) for quasi-2D systems
  - O(n) for 1D systems
- Targets are huge systems (tens of thousands of atoms)
- Cooperation with a group working on layered systems of this size
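A two-level scheme of this kind can be realized by first splitting the global communicator into independent groups (the outer level) and then letting all processes inside each group cooperate on one distributed task (the inner level), with the group results combined at the end. The C sketch below shows only this communicator layout; the group count and the dummy inner computation are illustrative and not PEXSI's actual API.

```c
/* Illustrative two-level parallel layout: independent outer groups, each
 * using all of its processes for an inner distributed computation.
 * Group count and the inner task are placeholders, not the PEXSI API. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int ngroups = 4;                 /* e.g. 80 in production; 4 for a demo */
    if (size % ngroups != 0) {
        if (rank == 0) printf("run with a multiple of %d processes\n", ngroups);
        MPI_Finalize();
        return 0;
    }

    /* Outer level: split into independent groups. */
    int group = rank / (size / ngroups);
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &group_comm);

    /* Inner level: all processes of a group cooperate on one task
     * (stand-in: a reduction over per-process partial results). */
    double partial = 1.0 + rank, group_result = 0.0;
    MPI_Allreduce(&partial, &group_result, 1, MPI_DOUBLE, MPI_SUM, group_comm);

    /* Combine the independent group results on the global root. */
    int group_rank;
    MPI_Comm_rank(group_comm, &group_rank);
    double contrib = (group_rank == 0) ? group_result : 0.0, total = 0.0;
    MPI_Reduce(&contrib, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("combined result over %d groups: %g\n", ngroups, total);

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}
```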

SIESTA - PEXSI results

PEXSI outperforms ScaLAPACK when applied to ADNA-10, but not for a small C-BN-C (layered system) example.

When increasing the problem size by stacking unit cells:
- The effort grows with O(n^3/2), then even linearly (see table, times in seconds)
- The example with 8 unit cells, i.e. more than 20,000 atoms, becomes solvable

Conclusions so far:
- Work has been successfully accomplished for all the codes involved in this task
- Meetings during the work package have been useful to compare and validate the obtained results and to share ideas and perspectives
- We all agree that, rather than working towards a common code, we should look for common rules that facilitate exchange between the different codes

Validation strategy:
- We need to show that the obtained results are relevant to the communities
- We need feedback on the interplay between the communities and the PRACE computing centers
- We want to highlight the importance of initiatives like WP8, where the communities work together with scientists to produce improvements to the most relevant codes

We propose to include in the final deliverable a short section (1-2 pages) written by one representative per code from the scientific communities. In this section the work done in WP8 will be assessed, stressing the importance of the obtained results for the community.

Work to do:
- Complete the documentation on the HPC-Forge wiki
- Identify people from the communities for the writing of the deliverable
- Complete the WP8 work (synchronization of the repositories, documentation, benchmarking, reintegration) where needed