Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes


Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014

EXA2CT Consortium

WPs Organization
- Proto-Applications: extract representative proto-applications from HPC applications
- Enhance Numerical Algorithms: scalable and robust numerical solvers
- Enhance Programming Models: improve the efficiency, robustness and scalability of GPI/GASPI, Shark and PATUS, and leverage task parallelism

The Proto-Application Concept Also known as a mini-app or proxy-app (NERSC Trinity, Argonne CESAR). Our main tool in EXA2CT. Objectives:
- Reproduce at scale the behavior of a set of HPC applications and support the development of optimizations that can be translated back into the original applications
- Easier to execute, modify and re-implement: if you cannot make the application open-source, you can at least open-source the problems
- Support community engagement
- Reproducible and comparable results
- Interface with application developers

Exascale Challenges for RT and Applications Increasing numbers of nodes, increasing numbers of cores, accelerators.
- Some resources do not scale: memory per core, bandwidth, coherence protocol, network interconnect, fault tolerance
- Frontiers are becoming fuzzier => heterogeneity: distributed or shared? software or hardware? what is a ''core''? compute capabilities, imbalance
- Multiplication of hierarchical levels => non-uniformity: different scales (bandwidth, memory size, performance); global events (barrier, broadcast, memory coherency)

Exascale Challenges for RT and Applications Evolutions are required in applications, runtimes and programming models.
- Express the usage of the new resources: accelerators, multi-(many-)core, large numbers of nodes => PROGRAMMABILITY and PORTABILITY (including performance)
- Adapt to the hardware design constraints: more concurrent resources, limited bandwidth and memory budget, increasing need for fault tolerance
- Architecture oblivious or architecture aware?

Exascale Challenges for RT and Applications We need both concurrency and locality for performance scalability.
- More concurrency: enough independent tasks; communication overlap; privatize memory to avoid communication (and synchronization). Remember Amdahl: the more cores, the higher the relative cost of the sequential part
- More locality: in memory (core level, socket level including HWA, network level), but also in communications (synchronization, data)
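The Amdahl reminder above can be made concrete with the classic bound: a program with sequential fraction s run on p cores cannot be sped up beyond 1 / (s + (1 - s) / p). A minimal sketch (illustrative function name, not from the talk):

```cpp
// Amdahl's law: upper bound on the speedup of a program whose
// sequential fraction is s when it runs on p cores.
double amdahl_speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}
```

Even a 10% sequential part caps the speedup below 10x regardless of the core count, which is why the D&C approach later in the talk works to shrink the critical path.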

Programming Models and Runtime Support for Extreme Scaling Objectives:
- Express fine-grain dependencies of computations and communications using tasks, to efficiently exploit the hundreds of cores of one Exascale node
- Benefit from the automatic load balancing and tolerance to machine jitter of task-parallel runtimes and asynchronous one-sided communication
- Leverage divide-and-conquer algorithms to increase parallelism and data locality
- Improve the productivity of parallel programming

Divide and Conquer Parallelization of Finite Element Method Assembly Objectives:
- Expose parallelism: expose many independent tasks for many cores
- Make efficient use of shared memory: save bandwidth (locality); save memory (avoid MPI buffers); more efficient synchronization
- Prepare scaling to a larger number of cores: keep synchronization local; use shared memory instead of MPI
- Improve the load balancing: fine-grain parallelism; work stealing / work sharing

Finite Element Method Assembly Basic FEM iteration:
- Assemble matrix A from the mesh
- Solve Ax = b
- Update the mesh (values, geometry)
Building the linear system from the mesh: the element-node connectivity is turned into a sparse matrix storage, with a reduction of element values on the edges and/or the nodes.

Finite Element Method Assembly 2D illustration: the reduction is done on each edge from all neighboring elements, so the edge updates (+= reduction) must be sequential. The entry x_ij is non-zero only if there is an edge between nodes i and j, which yields a (very) sparse and symmetric matrix.

MPI Domain Decomposition Distributed-memory parallelism: the interface elements are communicated between sub-domains. Efficient on current architectures, but sub-optimal on future architectures because of data duplication and synchronizations.

Coloring Approach Elements sharing an edge are given different colors; the colors are computed sequentially, and all elements of a same color can then be computed in parallel. Simple to implement, but: bad temporal locality, high memory bandwidth requirements, global synchronizations.
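A minimal sketch of the greedy coloring described above (hypothetical adjacency representation, assumed): colors are assigned one element at a time so that elements sharing an edge never get the same color, after which each color class can be assembled in parallel.

```cpp
#include <algorithm>
#include <vector>

// Greedy coloring of the element adjacency graph: adj[e] lists the
// elements sharing an edge with e. Returns one color per element such
// that adjacent elements never share a color.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    std::vector<int> color(adj.size(), -1);
    for (size_t e = 0; e < adj.size(); ++e) {
        std::vector<bool> used(adj.size() + 1, false);
        for (int n : adj[e])
            if (color[n] >= 0) used[color[n]] = true;  // colors taken by neighbors
        color[e] = static_cast<int>(                   // first free color
            std::find(used.begin(), used.end(), false) - used.begin());
    }
    return color;
}
```

The sequential color assignment and the barrier between color classes are precisely the global synchronizations the slide criticizes.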

The D&C approach The left and right sub-domains are independent; the elements at the frontier form a separator, which must be treated after the left and right sub-domains. Applied recursively to all sub-domains, this yields a recursion tree.

The D&C approach
function compute(subdomain)
    if subdomain is not a leaf
        spawn compute(subdomain.left)
        compute(subdomain.right)
        sync
        compute(subdomain.sep)
    else
        FEM_assembly(subdomain)
    end
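The pseudocode above can be sketched as runnable C++ (a simplified stand-in, not the project's code: std::async replaces Cilk's spawn/sync, a plain sum replaces FEM_assembly, and Subdomain/assemble are illustrative names):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical binary-tree subdomain: leaves own element values,
// internal nodes own a separator processed after both children.
struct Subdomain {
    std::vector<double> elems;            // leaf work
    std::vector<double> sep;              // separator work
    Subdomain *left = nullptr, *right = nullptr;
};

// Leaf kernel standing in for FEM_assembly: here just a sum.
double assemble(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// D&C traversal: the two children run concurrently, and the separator
// runs only after the local sync -- one synchronization per tree node.
double compute(const Subdomain* d) {
    if (d->left && d->right) {
        auto l = std::async(std::launch::async, compute, d->left);
        double r = compute(d->right);
        double lv = l.get();              // local sync with the sibling
        return lv + r + assemble(d->sep); // separator treated last
    }
    return assemble(d->elems);
}
```

Because each sync involves only two sibling tasks, the synchronization stays local to one tree node instead of being a global barrier.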

The D&C approach Pipeline: input → partitioning & reordering → assembly → solver → update → solution. Optimizing locality: a data permutation (the composition of all the recursive node permutations) is applied, and done only once since the partition is topological.

The D&C approach
- Can create many independent tasks => concurrency
- Leaf data sets can be downsized at will to fit into caches => data locality
- Only one synchronization per task, between the two local children => synchronization locality
- Only log(N) synchronizations on the critical path => minimization of the sequential part

The Dassault test case DEFMESH, a mesh deformer from Dassault Aviation: a fluid dynamics code based on the Finite Element Method. Non-linear: displacements are cut into small increments. 3D mesh of an airplane fuel tank, composed of 6 346 108 elements and 1 079 758 nodes.

An Illustration of the D&C impact Figure sequence: the initial partition, then D&C with 2, 4, 8, 16 and 32 sub-domains, and finally D&C with 3000 leaves of size < 50.

Results On 2 sockets of 6-core Intel Xeon Westmere @ 2.67 GHz, comparing the MPI, Cilk and OpenMP versions: figures for execution time, parallel efficiency, performance and locality. Latest result: a 3x improvement on Ivy Bridge!

Impact on SpMV? The original matrix was already optimized for locality; however, all threads access the same lines (duplicated in cache), few threads access the whole vector, the load balancing is bad, and the kernel is bandwidth demanding: bad locality, memory coherency traffic, a larger band.

Impact on SpMV? After the D&C reordering, each thread accesses a limited number of sub-domains: one main, very dense one, while the others are very sparse separators. Increased locality and minimized synchronization => saved bandwidth.

Impact on SpMV? Improved locality on the matrix and the input vector, and conserved locality for the output vector: a 10 to 20% improvement on the CG solver without editing a single line of code!

Conclusion 1/2 The experiment on the Dassault Aviation test case is successful, and the approach is ready to be used in the DEFMESH production code. Since March, it has started to be experimented with in their main CFD code, with an interesting integration process: all the parallelization is in C++ and the physics in Fortran. A proto-app demonstrating the D&C assembly to the community is almost ready to be open-sourced, stay tuned!

Conclusion 2/2 Future work:
- D&C-friendly matrix format (CSB?) and SpMV, SpMM, SpTMM, SpMMM
- Shared-memory parallelized solvers
- Impact of the partitioner (Scotch, METIS, SAW)
- Use D&C at the distributed level, using GASPI / MPI 3.0
- Study dynamic load balancing: independent work, tree-based (D&C), symmetric execution (MIC/host, StarPU? XKaapi?)

Thank you all for your attention! Questions?

EXA2CT

Geometric vs. Topologic Figures: a geometric partitioning example and a topological partitioning example.