ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

Similar documents
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Scientific Computing Programming with Parallel Objects

HPC Deployment of OpenFOAM in an Industrial Setting

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

Stream Processing on GPUs Using Distributed Multimedia Middleware

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

Optimizing Load Balance Using Parallel Migratable Objects

Multi-GPU Load Balancing for Simulation and Rendering

Virtual Machines.

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp

Overlapping Data Transfer With Application Execution on Clusters

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

- An Essential Building Block for Stable and Reliable Compute Clusters

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

Parallel Large-Scale Visualization

Distributed communication-aware load balancing with TreeMatch in Charm++

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC

Scalability evaluation of barrier algorithms for OpenMP

Multicore Parallel Computing with OpenMP

Parallel Scalable Algorithms- Performance Parameters

A Case Study - Scaling Legacy Code on Next Generation Platforms

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Chapter 2 Parallel Architecture, Software And Performance

Scalability and Classifications

Topological Properties

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling

Chapter 18: Database System Architectures. Centralized Systems

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Bringing Big Data Modelling into the Hands of Domain Experts

L20: GPU Architecture and Models

Robust Algorithms for Current Deposition and Dynamic Load-balancing in a GPU Particle-in-Cell Code

Performance Monitoring of Parallel Scientific Applications

Lecture 7 - Meshing. Applied Computational Fluid Dynamics

A Multi-layered Domain-specific Language for Stencil Computations

Model of a flow in intersecting microchannels. Denis Semyonov

Performance of the JMA NWP models on the PC cluster TSUBAME.

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Software Development around a Millisecond

Basin simulation for complex geological settings

COM 444 Cloud Computing

BLM 413E - Parallel Programming Lecture 3

Load Balancing Techniques

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Locality Based Protocol for MultiWriter Replication systems

HPC Wales Skills Academy Course Catalogue 2015

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Program Coupling and Parallel Data Transfer Using PAWS

Symmetric Multiprocessing

Large-Scale Reservoir Simulation and Big Data Visualization

Mesh Generation and Load Balancing

Introduction to Cloud Computing

Cellular Computing on a Linux Cluster

Graph Analytics in Big Data. John Feo Pacific Northwest National Laboratory

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

High Performance Computing in CST STUDIO SUITE

Operating System for the K computer

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG

Load balancing Static Load Balancing

Best practices for efficient HPC performance with large models

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Radware ADC-VX Solution. The Agility of Virtual; The Predictability of Physical

Industrial Ethernet How to Keep Your Network Up and Running A Beginner s Guide to Redundancy Standards

Load Balancing and Termination Detection

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

ABSTRACT FOR THE 1ST INTERNATIONAL WORKSHOP ON HIGH ORDER CFD METHODS

PARALLEL PROGRAMMING

GPUs for Scientific Computing

Introduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber

Interconnection Networks

In-situ Visualization

Client/Server and Distributed Computing

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

Petascale Visualization: Approaches and Initial Results

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

A Review of Customized Dynamic Load Balancing for a Network of Workstations

Feedback guided load balancing in a distributed memory environment

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Accelerating CFD using OpenFOAM with GPUs

Rakam: Distributed Analytics API

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Transcription:

ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

What is ParFUM? A framework for writing parallel finite element codes Takes care of difficult tasks involved in parallelizing a serial code Provides advanced mesh operations such as mesh adaptivity and dynamic load balancing Constantly evolving to support application needs (for example, cohesive elements and collision detection) Based on Charm++ and AMPI. Supports C, C++, and Fortran

Making Parallel Finite Element Codes Easier A simple serial finite element code: Create mesh Perform finite element computations Extract results

Making Parallel Finite Element Codes Easier A simple parallel finite element code: Create mesh Partition mesh Distribute mesh data and create ghost layers Perform finite element computations with synchronization and load balancing Extract results

Making Parallel Finite Element Codes Easier Create mesh ParFUM can do these things automatically and let the developer concentrate on science and engineering Partition mesh Distribute mesh data and create ghost layers Perform finite element computations with synchronization and load balancing Extract results

The Structure of a ParFUM Program Init MP Cha MS Partitioning and Distribution Driver

The Big Picture Adjacency Generation User's Solver ParFUM Solution Transfer Bulk Adaptivity Ghost Layer Generation Partitioning IFEM ptops Collision Detection Contact Incremental Adaptivity Multi-phase Shared Arrays User View AMPI Charm++ System View Load Balancing Framework Charm Run-time System Communication Optimizations 7

Integrating Multiple Programming Models Adjacency Generation Global Shared Memory Ghost Layer Generation Partitioning User's Solver ParFUM Message ptops Passing IFEM Solution Transfer Collision Detection Message Contact Driven Bulk Adaptivity Incremental Adaptivity Multi-phase Shared Arrays User View AMPI Charm++ System View Load Balancing Framework Charm Run-time System Communication Optimizations 8

ParFUM and AMPI Application code is written in AMPI, an implementation of MPI on top of the Charm RTS. AMPI processes (virtual processors, or VPs) are not tied to a physical processor, they can migrate and there may be many of them per physical processor This allows easier porting of MPI codes and eases the learning curve of ParFUM

Virtualization Tradeoffs Advantages Allows adaptive overlap of communication and computation More granular load balancing Improved cache performance More flexibility Disadvantages More communication Worse ratio of remote data to local data Imposes some thread overhead High virtualization requires many elements per node

Virtualization Performance Impact For this dynamic fracture code, virtualization provides a substantial benefit

Parallel Mesh Adaptivity Efficient parallel adaptivity is critical for many unstructured mesh codes ParFUM provides two implementations of common operations: incremental (2D triangle meshes): each individual operation leaves the mesh consistent. Relatively slow and puts limitations on ghost layers bulk (2D triangles and 3D tets): many operations performed at once, ghosts and adjacencies updated at end. Lower cost, no restrictions on ghost layers. (ongoing work)

Higher Level Adaptivity Propagating Edge Bisection Operations like propagating edge bisection are composed from edge bisect, flip, and contraction primitives Which is better, bulk or incremental? Depends on amount and frequency of adaptivity

Load Balance, Adaptivity, and Virtualization Serious load imbalance: areas near fracture are much more expensive

Load Balance, Adaptivity, and Virtualization We can change the VP mapping to distribute computationally expensive parts of the mesh better

Load Balance, Adaptivity, and Virtualization Assigning VPs using a greedy load balancer further improves utilization

Spacetime Meshing Parallelization of Spacetime Discontinuous Galerkin (SDG) algorithm [Haber] Adaptive in both space and time, uses incremental adaptivity Asynchronous code, no global barriers

PTops Structural dynamics code for graded materials [Paulino] Based on Tops, a serial framework featuring an efficient topological mesh representation

PTops Strong Scaling 400,000 elements on Abe No virtualization

PTops and CUDA ParFUM-Tops interface has CUDA support Our implementation runs ~10x faster on a single node using CUDA Limited usefulness due to lack of double precision and lack of access to clusters which combine GPUs and high quality interconnects

Ongoing Work On-demand insertion of cohesive elements (truly extrinsic cohesives) in PTops for dynamic fracture simulations Efficient, scalable implementation of bulk edge flip and edge contraction Contact: use Charm++ collision detection library to detect when domain fragments come into contact

ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Sayantan Chakravorty, Isaac Dooley, Terry Wilmarth Parallel Programming Lab Charm++ Workshop 2008