Equipping Sparse Solvers for Exascale A Survey of the DFG Project ESSEX




DLR.de Chart 1 DFG Project ESSEX: Equipping Sparse Solvers for Exascale. A Survey of the DFG Project ESSEX. Achim Basermann, German Aerospace Center (DLR), Simulation and Software Technology, Linder Höhe, Cologne, Germany

DLR.de Chart 2 DLR German Aerospace Center Research Institution Space Agency Project Management Agency

DLR.de Chart 3 DLR Locations and Employees Approx. 8000 employees across 33 institutes and facilities at 16 sites. Offices in Brussels, Paris, Tokyo and Washington. Stade Hamburg Bremen Trauen Berlin Braunschweig Goettingen Juelich Cologne Bonn Neustrelitz Lampoldshausen Stuttgart Augsburg Oberpfaffenhofen Weilheim

DLR.de Chart 4 DLR Institute Simulation and Software Technology Scientific Themes and Working Groups Distributed Software Systems Departments Distributed Systems and Component Software Software for Space Systems and Interactive Visualization High-Performance Computing Software Engineering Embedded Systems Modeling and Simulation Scientific Visualization Working Groups 3D Interaction

DLR.de Chart 5 Survey: ESSEX motivation; the ESSEX software infrastructure; holistic view: application, algorithm and performance; algorithmic developments: JADA, FEAST, CARP-CG; application results; conclusions; the future: ESSEX II.

DLR.de Chart 6 ESSEX Motivation: Requirements for Exascale. Hardware: fault tolerance, energy efficiency, new levels of parallelism. Quantum physics applications: extremely large sparse matrices (eigenvalues, spectral properties, time evolution). The ESSEX Exascale Sparse Solver Repository (ESSR): GHOST / PHIST; FT concepts and programming for extreme parallelism; sparse eigensolvers, preconditioners, spectral methods. Quantum physics/chemistry ESSEX applications: graphene, topological insulators, ...

DLR.de Chart 7 ESSEX: Physical Motivation and the Sparse Eigenvalue Problem. Starting from the time-dependent Schrödinger equation i ∂/∂t ψ(r, t) = H ψ(r, t), solve the large sparse eigenvalue problem H x_i = λ_i x_i for the eigenpairs (λ_i, x_i).

DLR.de Chart 8 ESSEX Motivation: Programming Heterogeneous HPC Systems. Alternatives: flat MPI + off-loading; runtime systems (e.g. MAGMA, OmpSs) with dynamic scheduling of small tasks for good load balancing; Kokkos (Trilinos) with a high level of abstraction (C++11). The MPI+X strategy in ESSEX: X = OpenMP, CUDA, SIMD intrinsics (e.g. AVX); tasking for bigger asynchronous functions (functional parallelism); experts implement the required kernels.

DLR.de Chart 9 ESSEX Motivation: Application-Driven Fault Tolerance (FT). The application asynchronously writes checkpoints (CP) to a local disk or to the memory of a neighbor node. A dedicated process performs health checks (HC) of all nodes; GASPI/GPI is used rather than MPI. If a node fails, a pool of substitute processes takes over and the computation rolls back to the last checkpoint. Lanczos application benchmark: 256 nodes (processes), 12 threads/process, 4 spare processes; recovery overhead ca. 18 s plus the computations to be repeated.
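The checkpoint/rollback cycle on this slide can be illustrated in a few lines. This is a hypothetical, serial Python stand-in (not the GASPI/GPI implementation): a dict stands in for the neighbor node's memory, and a simulated failure triggers a rollback to the last committed checkpoint.

```python
# Hypothetical sketch of the checkpoint/rollback protocol described above.
# ESSEX writes checkpoints asynchronously; this sketch commits them inline.
def run_with_checkpoints(step, state, nsteps, interval, fail_at=None):
    cp = {"iter": 0, "state": state}           # last committed checkpoint
    i = 0
    while i < nsteps:
        if i == fail_at:                       # simulated node failure
            i, state = cp["iter"], cp["state"] # rollback: lost work is repeated
            fail_at = None
            continue
        state = step(state)
        i += 1
        if i % interval == 0:                  # commit a new checkpoint
            cp = {"iter": i, "state": state}
    return state
```

The "computations to be repeated" overhead mentioned on the slide corresponds to the steps between the last checkpoint and the failure.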

DLR.de Chart 10 The ESSEX Software Infrastructure

DLR.de Chart 11 The ESSEX Software Infrastructure: Test-Driven Algorithm Development

DLR.de Chart 12 Optimized ESSEX Kernel Library GHOST: General, Hybrid, and Optimized Sparse Toolkit. MPI + OpenMP + SIMD + CUDA; sparse matrix-(block-)vector multiplication; dense block-vector operations; task queue for functional parallelism; asynchronous checkpoint-restart. Status: beta version, suitable for experienced HPC C programmers. http://bitbucket.org/essex/ghost (BSD License)

DLR.de Chart 13 The Iterative Solver Library PHIST: Pipelined Hybrid-parallel Iterative Solver Toolkit. Iterative solvers for sparse matrices: eigenproblems (Jacobi-Davidson, FEAST) and systems of linear equations (GMRES, MINRES, CARP-CG). PHIST provides some abstraction from data layout, process management, tasking etc.; adapts algorithms to use block operations (e.g. orthogonalization); implements asynchronous and fault-tolerant solvers; offers a simple functional interface (C, Fortran, Python); systematically tests kernel libraries for correctness and performance; and supports various possibilities for integration into applications. Status: beta version with extensive test framework. http://bitbucket.org/essex/phist (BSD License)

DLR.de Chart 14 Integration of PHIST into Applications. The selection of the kernel library depends on the required flexibility (low to high) and hardware awareness (low to high): with no easy access to matrix elements, the PHIST builtin kernels (CPU only, Fortran 2003 + OpenMP, CRS format) suffice; for various architectures, or a large C++ code base with its own data structures, an adapter of ca. 1000 lines of code is needed.

DLR.de Chart 15 Interoperability of PHIST and Trilinos. In the ESSEX project, the iterative solvers build on basic operations through PHIST (via the PHIST builtin kernels or a C wrapper); PHIST can use the Trilinos packages Anasazi (eigenproblems) and Belos (linear equation systems) on top of Epetra or Tpetra.

DLR.de Chart 16 Application, Algorithm and Performance: Kernel Polynomial Method (KPM), a Holistic View. Compute an approximation to the complete eigenvalue spectrum of a large sparse matrix A (with X = I).

DLR.de Chart 17 The Kernel Polynomial Method (KPM). Optimal performance: exploit knowledge from all software layers! Basic algorithm: compute Chebyshev polynomials/moments. Application: loop over random initial states. Algorithm: loop over moments. Building blocks ((sparse) linear algebra library): sparse matrix-vector multiply, scaled vector addition, vector scale, scaled vector addition, vector norm, dot product.

DLR.de Chart 18 The Kernel Polynomial Method (KPM). Optimal performance: exploit knowledge from all software layers! Basic algorithm: compute Chebyshev polynomials/moments, fused into an augmented sparse matrix-vector multiply.

DLR.de Chart 19 The Kernel Polynomial Method (KPM). Optimal performance: exploit knowledge from all software layers! Basic algorithm: compute Chebyshev polynomials/moments. The building blocks (sparse matrix-vector multiply, scaled vector addition, vector scale, vector norm, dot product) are fused into an augmented sparse matrix multiple-vector multiply.
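The KPM building blocks listed on these slides boil down to a three-term Chebyshev recurrence. A minimal pure-Python sketch (dense stand-ins for the GHOST sparse kernels; it assumes the matrix has been scaled so its spectrum lies in [-1, 1]):

```python
# Hypothetical stand-in for the KPM kernels: matvec, axpby and dot are the
# building blocks from the slide; kpm_moments is the loop over moments.
def matvec(A, v):
    # dense stand-in for the sparse matrix-vector multiply
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def axpby(a, x, b, y):
    # scaled vector addition: a*x + b*y
    return [a * xi + b * yi for xi, yi in zip(x, y)]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def kpm_moments(A, r, num_moments):
    """Moments mu_m = r^T T_m(A) r via the Chebyshev recurrence
    w_{m+1} = 2 A w_m - w_{m-1}; spectrum of A must lie in [-1, 1]."""
    w_prev, w = r, matvec(A, r)
    moments = [dot(r, r), dot(r, w)]
    for _ in range(2, num_moments):
        w_prev, w = w, axpby(2.0, matvec(A, w), -1.0, w_prev)
        moments.append(dot(r, w))
    return moments
```

Fusing the matvec, axpby and dot calls of one recurrence step into a single kernel is exactly the "augmented sparse matrix-vector multiply" optimization on these slides.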

DLR.de Chart 20 KPM: Heterogeneous Node Performance. NVIDIA K20X vs. Intel Xeon E5-2670 (SNB); heterogeneous efficiency. Topological insulator application, double-complex computations, data-parallel static workload distribution.

DLR.de Chart 21 KPM: Large-Scale Heterogeneous Node Performance. 0.53 PF/s (11% of LINPACK) on the CRAY XC30 Piz Daint* (5272 nodes; peak: 7.8 PF/s; LINPACK: 6.3 PF/s; largest system in Europe). Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems, M. Kreutzer, A. Pieper, G. Hager, A. Alvermann, G. Wellein and H. Fehske, IEEE IPDPS 2015. *Thanks to CSCS/T. Schulthess for granting access and compute time.

DLR.de Chart 22 Algorithmic Developments: Blocked Jacobi-Davidson (JADA) Method. Compute l extreme eigenvalues/-vectors of a sparse matrix A: A v_i = λ_i v_i, i = 1, ..., l.

DLR.de Chart 23 Algorithmic Developments: Blocked JADA exploits the benefit of block spMVM. Blocked JADA method: solve n_b correction equations at the same time. The basic blocked JADA operator (for j = 1, ..., n_b) combines a scalar, a dense (tall & skinny) matrix and the sparse matrix. The blocked JADA operation is available in GHOST for the CPU; GPGPU & Xeon Phi: work in progress.
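The benefit of block spMVM comes from reusing each matrix entry for all n_b vectors in a single pass over the matrix. A hypothetical pure-Python sketch with a CRS matrix and row-major block vectors (GHOST's actual kernels are vectorized C; this only illustrates the access pattern):

```python
# Hypothetical sketch of a sparse matrix times block-vector product (spMMVM),
# the kernel blocked JADA exploits: the matrix is streamed once for all
# n_b right-hand sides instead of once per vector.
def spmmvm(vals, cols, rowptr, X):
    """CRS matrix (vals, cols, rowptr) times a row-major block vector X,
    where X[j][k] is component j of vector k. Returns Y = A @ X."""
    nb = len(X[0])
    Y = []
    for i in range(len(rowptr) - 1):
        acc = [0.0] * nb
        for idx in range(rowptr[i], rowptr[i + 1]):
            a, j = vals[idx], cols[idx]
            for k in range(nb):   # one matrix entry updates all nb vectors
                acc[k] += a * X[j][k]
        Y.append(acc)
    return Y
```

The row-major block-vector layout keeps the n_b values touched by one matrix entry contiguous in memory, which is the locality argument behind the speed-ups reported on the next slide.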

DLR.de Chart 24 Algorithmic Developments: Blocked JADA, Performance of the Basic Operation (GHOST vs. Epetra*). Matrix: dimension D = 10^7, n_nzr = 14; Intel Xeon E5-2660 v2; 120 JADA operations. A 3.3x speed-up overcompensates the numerical overhead of blocking; 2.5x vs. Trilinos building blocks. Increasing the Performance of the Jacobi-Davidson Method by Blocking, M. Röhrig-Zöllner, J. Thies, A. Basermann et al., SIAM SISC, in print. *http://trilinos.sandia.gov/packages/epetra/

DLR.de Chart 25 Algorithmic Developments: FEAST Method and CARP-CG Solver. Compute l interior eigenvalues/-vectors of a sparse matrix A: A v_i = λ_i v_i, i = 1, ..., l.

DLR.de Chart 26 Algorithmic Developments: FEAST, Progress towards Large Scale. FEAST = numerical integration + Rayleigh-Ritz for the eigenvalues in a given interval; the numerical integration requires the solution of many large linear systems. Achievements: estimation of the eigenvalue count (also with KPM); integration of the linear solver CARP-CG; graphene eigenvalue problems; substitution of the linear solver by polynomials. Few inner eigenvalues of a graphene problem of size 10^8, compared with state-of-the-art FEAST: size 10^5 using a direct sparse solver. On the parallel iterative solution of linear systems arising in the FEAST algorithm for computing inner eigenvalues, J. Thies, A. Basermann, B. Lang et al., Parallel Computing 49 (2015) 153-163.
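The numerical integration underlying FEAST approximates the spectral projector onto the eigenspace enclosed by a contour Γ around the target interval; each quadrature node z_j contributes one of the large linear systems mentioned on this slide (w_j and N_q denote the quadrature weights and node count):

```latex
P \;=\; \frac{1}{2\pi i}\oint_{\Gamma} (zI - A)^{-1}\,dz
  \;\approx\; \sum_{j=1}^{N_q} w_j\,(z_j I - A)^{-1}
```

Applying the approximate projector to a block of random vectors and performing Rayleigh-Ritz yields the eigenpairs inside the contour; the shifted systems with matrix (z_j I - A) are exactly the indefinite problems that CARP-CG targets.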

DLR.de Chart 27 Algorithmic Developments: CARP-CG Preconditioner for Inner Eigenproblems. The FEAST eigensolver yields challenging linear systems (indefinite, random entries, small diagonal elements). CARP-CG, a Conjugate-Gradient-accelerated Kaczmarz method, is numerically very robust. Its sparse kernel performs successive row projections, x^{k+1} = x^k + (b_k - a_{k,:} x^k) / ||a_{k,:}||^2 a_{k,:}^T, where a_{k,:} is the k-th row of A. Data dependencies are resolved by node-local graph coloring; component averaging between nodes recovers the global Kaczmarz sweep. Not yet fully optimized in GHOST; weak scaling results shown.
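The row-projection kernel can be sketched as one forward Kaczmarz sweep. This is a hypothetical pure-Python stand-in with a relaxation parameter omega; the coloring and component-averaging machinery of the actual CARP-CG kernel is deliberately left out.

```python
# Hypothetical sketch of one Kaczmarz sweep: each row k of A defines a
# hyperplane a_k . x = b_k, and x is projected onto it in turn.
def kaczmarz_sweep(A, b, x, omega=1.0):
    """One forward sweep of row projections:
    x <- x + omega * (b_k - a_k . x) / ||a_k||^2 * a_k  for k = 0, 1, ..."""
    for k, row in enumerate(A):
        r = b[k] - sum(a * xi for a, xi in zip(row, x))   # row residual
        nrm2 = sum(a * a for a in row)                    # ||a_k||^2
        x = [xi + omega * r / nrm2 * a for xi, a in zip(x, row)]
    return x
```

Because each projection depends on the result of the previous one, parallelizing the sweep needs the node-local graph coloring mentioned on the slide; the per-row projection itself never divides by a (possibly tiny) diagonal entry, which is why the method tolerates the ill-conditioned FEAST systems.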

DLR.de Chart 28 Application results Graphene nanoribbon (GNR) with gate-defined quantum dots

DLR.de Chart 29 Application Results: GNR with 5 Gate-Defined Quantum Dots. The conductivity G is controlled by the dot potential V/t; a small change in V/t yields a large change in G, so the GNR may realize a very sensitive switch. A superlattice opens a band gap: vanishing conductance.

DLR.de Chart 30 Conclusions. The holistic performance engineering strategy is successful for developing highly scalable solutions, cf. KPM. PHIST provides a pragmatic, flexible and hardware-aware programming model for heterogeneous systems; it includes highly scalable sparse iterative solvers for eigenproblems and systems of linear equations and is well suited for iterative solver development and solver integration into applications. Block operations distinctly increase the performance of building blocks for iterative eigensolvers like KPM or JADA. CARP-CG with node-level multi-coloring parallelization is suitable for the robust iterative solution of the nearly singular equations, and is an appropriate iterative solver for FEAST to find interior eigenpairs, in particular for problems from graphene design. First convincing results with quantum physics applications.

DLR.de Chart 31 The Future: ESSEX II. The DFG confirmed the ESSEX extension to 2018, with additional partners from Japan: Kengo Nakajima (Computer Science, University of Tokyo) and Tetsuya Sakurai (Applied Mathematics, University of Tsukuba). Main objectives: enabling exascale through software co-design across fault tolerance, performance engineering, applications, computational algorithms, building blocks, numerical reliability and scalability; an established exascale sparse solver repository.

DLR.de Chart 32 Project Evolution. ESSEX-I: conservative quantum systems (A† = A) and dissipative quantum systems (A† ≠ A), standard eigenproblems A x = λ x; the ESSR as blueprints. ESSEX-II: generalized eigenproblems A x = λ B x and nonlinear eigenproblems A(λ) x = 0; an interoperable library.

DLR.de Chart 33 Programming: Building Blocks, Parallelization, and Performance Engineering (extended and new topics). Holistic performance and power engineering; advanced building-blocks engineering. Fault tolerance: from prototype to application software; asynchronous checkpointing & I/O; automatically fault-tolerant applications. Numerical reliability: performance aspects; silent data corruption / skeptical programming; high-precision reduction operations.

DLR.de Chart 34 Computational Algorithms (extended and new topics). Non-Hermitian: ChebTP / CFET / JaDa; extreme-scale simulations for dissipative quantum systems; numerical range computation & matrix balancing. Chebyshev filter diagonalization: >10^3 interior eigenvalues at >10^9 matrix dimension; simple, hardware-efficient & low synchronization cost. Preconditioning & communication hiding: asynchronous JaDa (pipelining & preconditioning); AMG preconditioning for blocked JaDa & FEAST. Leveraging FEAST techniques + GHOST: nonlinear Sakurai-Sugiura method (NSSM), with Kengo Nakajima (University of Tokyo) and Tetsuya Sakurai (University of Tsukuba).

DLR.de Chart 35 Applications (extended and new topics). Quantum state encoding (QSE): complex (non-stencil) matrix structure encoding; dissipative systems: sparse to dense. Matrix reordering strategies (REO): application-specific and general techniques, e.g. PMRSB. Quantum physics/information applications: topological materials (graphene & topological insulators, A† = A) and dissipative quantum systems (light-harvesting molecules & optomechanics, A† ≠ A); a rich collection of quantum physics problems.

DLR.de Chart 36 Thanks. Thanks to all partners of the ESSEX project and to the DFG for support through the Priority Programme 1648 Software for Exascale Computing: Computer Science, Univ. Erlangen; Applied Computer Science, Univ. Wuppertal; Institute for Physics, Univ. Greifswald; Erlangen Regional Computing Center. International contacts: Sandia (Trilinos project), Tennessee (Dongarra), Japan (Tsukuba, Tokyo), The Netherlands (Groningen, Utrecht).

DLR.de Chart 37 Many thanks for your attention! Questions? Dr.-Ing. Achim Basermann German Aerospace Center (DLR) Simulation and Software Technology Department Distributed Systems and Component Software Team High Performance Computing Achim.Basermann@dlr.de http://www.dlr.de/sc