FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling. Symposium on HPC and Data-Intensive Applications in Earth Sciences, 13 Nov 2014, Trieste, Italy. Matthias Lieber (matthias.lieber@tu-dresden.de), Center for Information Services and High Performance Computing (ZIH), Technische Universität Dresden, Germany.

"Climate models now include more cloud and aerosol processes, and their interactions, than at the time of the AR4, but there remains low confidence in the representation and quantification of these processes in models." IPCC, 2013: Summary for Policymakers. In: Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change.

Motivation: Spectral Bin Cloud Microphysics Schemes. Bin discretization of the cloud particle size distribution allows more detailed modeling of the interaction between aerosols, clouds, and precipitation than the widely used bulk models. However, it is computationally too expensive for forecasts and has only been used for process studies up to now. [Figure: schematic size distributions (mixing ratio vs. radius) for spectral bin microphysics and bulk models.] References: Lynn et al., Mon. Weather Rev., 133:59-71, 2005; Grützun et al., Atmos. Res., 90(2-4):233-242, 2008; Khain et al., J. Atmos. Sci., 67(2):365-384, 2010; Sato et al., J. Atmos. Sci., 69:2012-2030, 2012; Planche et al., Quart. J. Roy. Meteor. Soc., 140(683), 2014; Fan et al., Atmos. Chem. Phys., 14:81-101, 2014.

Motivation: Tropical Cyclone Forecast with SBM? On a horizontal grid of 1000 x 1000 grid cells, a real-time forecast requires roughly 10 000 CPU cores, so model systems must be tuned for efficient usage of large machines.

Outline: Bottleneck Analysis, Concept of Load-Balanced Coupling, FD4's Features, Benchmarks, Conclusion.

FD4 Motivation: COSMO-SPECS Performance. COSMO-SPECS is the atmospheric model COSMO extended with the highly detailed cloud microphysics model SPECS. Benchmark: a small 3D case with a 64x64x48 grid and a growing cumulus cloud. [Figure: scalability of COSMO-SPECS at t = 10 min and t = 30 min compared to ideal scalability.] Lieber et al., Highly Scalable Dynamic Load Balancing in the Atmospheric Modeling System COSMO-SPECS+FD4, PARA 2010, 2012.

Analysis: Common Parallelization Scheme. The 3D domain is partitioned into rectangular boxes using a 2D decomposition over the horizontal dimensions. Regular communication with the 4 direct neighbors is required (periodic boundary conditions). The parallelization is based on MPI (Message Passing Interface). [Figure: partition layout and neighbor communication.]

Analysis: Load Imbalance due to Microphysics. The SPECS computing time varies strongly depending on the range of the particle size distribution and the presence of frozen particles, which leads to load imbalances between the partitions. Solution: apply dynamic load balancing.

Analysis: Increasing Communication Volume. The surface-to-volume ratio of the partitions grows with the number of partitions P; in theory (best case), for a grid of G cells: 2D decomposition: A_2D(P) = 4 G^(2/3) P^(1/2) ~ P^(1/2); 3D decomposition: A_3D(P) = 6 G^(2/3) P^(1/3) ~ P^(1/3). Solution: apply a 3D decomposition.
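To make these formulas concrete, here is a short derivation under the simplifying assumption (not stated on the slide) of a cubic domain of G cells split evenly among P partitions, summing the communication surface over all partitions:

% 2D decomposition: each partition is a full-height column of side G^{1/3}/\sqrt{P} and height G^{1/3}, with 4 vertical faces
A_{2D}(P) = P \cdot 4 \cdot \frac{G^{1/3}}{\sqrt{P}} \cdot G^{1/3} = 4\,G^{2/3}\,P^{1/2} \sim P^{1/2}

% 3D decomposition: each partition is a cube of side (G/P)^{1/3} with 6 faces
A_{3D}(P) = P \cdot 6 \cdot \left(\frac{G}{P}\right)^{2/3} = 6\,G^{2/3}\,P^{1/3} \sim P^{1/3}

Since the total computational volume stays constant while the surface grows, the communication share grows much more slowly with a 3D decomposition.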

Concept of Load-Balanced Coupling. The atmospheric model keeps its 2D decomposition with static partitioning, while the spectral bin microphysics is moved onto a separate, highly scalable layer (P ≈ 10 000) with a block-based 3D decomposition, dynamic load balancing, and optimized data structures; model coupling connects the two parts. This concept is implemented as an independent framework, FD4: Four-Dimensional Distributed Dynamic Data structures. Lieber et al., Highly Scalable Dynamic Load Balancing in the Atmospheric Modeling System COSMO-SPECS+FD4, PARA 2010, 2012.

FD4: Dynamic Load Balancing. FD4 uses a 3D block decomposition of the rectangular grid and space-filling curve (SFC) partitioning to assign blocks to ranks. The SFC reduces the 3D partitioning problem to 1D (a minimal SFC index is sketched below), and its high locality leads to moderate communication costs. For high-quality 1D partitioning of the SFC-indexed blocks, a highly scalable, hierarchical method was developed.
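For illustration, here is a minimal Fortran sketch of an SFC index. It uses the simpler Morton (Z-order) curve via bit interleaving; FD4 itself uses the Hilbert curve (see the backup slides), which has better locality but is more involved to compute. The function name and interface are hypothetical, not FD4's API.

! Minimal sketch (hypothetical, not the FD4 API): map a 3D block position
! to a 1D index with the Morton (Z-order) space-filling curve by
! interleaving the coordinate bits. FD4 uses the Hilbert SFC instead.
integer function morton3d(ix, iy, iz, nbits)
  implicit none
  integer, intent(in) :: ix, iy, iz, nbits   ! block coordinates and bits per dimension
  integer :: b
  morton3d = 0
  do b = 0, nbits - 1
     morton3d = ior(morton3d, ishft(ibits(ix, b, 1), 3*b    ))
     morton3d = ior(morton3d, ishft(ibits(iy, b, 1), 3*b + 1))
     morton3d = ior(morton3d, ishft(ibits(iz, b, 1), 3*b + 2))
  end do
end function morton3d

Sorting the blocks by such an index turns the 3D assignment problem into partitioning a 1D chain of block weights.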

FD4: Model Coupling. FD4 handles data exchange between an FD4-based model and an external model, e.g. a weather or CFD model, with transfers in both directions. After each repartitioning of the FD4 grid, FD4 computes the partition overlaps with a highly scalable algorithm (the geometric core of this is sketched below). There is no grid transformation or interpolation: the external model must provide data matching the FD4 grid. Only sequential coupling is supported, i.e. both models run alternately on the same set of MPI ranks.
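The geometric core of the overlap computation can be sketched as a simple box intersection. This is only an illustration of the idea with made-up names, not FD4's coupling code, which additionally uses a spatially indexed meta-data structure to avoid all-to-all comparisons (see the backup slides).

! Illustration only (not the FD4 API): overlap of two rectangular partitions,
! each given by inclusive cell index bounds per dimension. If any component
! of hi_o is smaller than the corresponding lo_o, the partitions do not
! overlap and no data needs to be exchanged between their owners.
subroutine box_overlap(lo_a, hi_a, lo_b, hi_b, lo_o, hi_o)
  implicit none
  integer, intent(in)  :: lo_a(3), hi_a(3), lo_b(3), hi_b(3)
  integer, intent(out) :: lo_o(3), hi_o(3)
  lo_o = max(lo_a, lo_b)
  hi_o = min(hi_a, hi_b)
end subroutine box_overlap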

FD4: 4th Dimension. Grid variables have an extra, non-spatial dimension, e.g. for size-resolving models or an array of gas-phase tracers. FD4 is optimized for a large 4th dimension: COSMO-SPECS requires 2 x 11 x 66 ≈ 1500 values per grid cell (see the layout sketch below).
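As a rough illustration of what a large 4th dimension means for the data layout, the sketch below allocates one block with the non-spatial dimension innermost so that all values of a cell are contiguous in memory. The block size and variable names are assumptions for the example, not FD4's internal layout.

program fourth_dim_demo
  implicit none
  integer, parameter :: nx = 8, ny = 8, nz = 8    ! cells per block (assumed)
  integer, parameter :: nbin = 2*11*66            ! ~1500 values per cell as in COSMO-SPECS
  real, allocatable  :: blk(:,:,:,:)
  ! non-spatial dimension first: the ~1500 values of one grid cell are contiguous
  allocate(blk(nbin, nx, ny, nz))
  blk = 0.0
  print *, 'values per block:', size(blk)
end program fourth_dim_demo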

FD4: Adaptive Block Mode. The grid allocation adapts to the spatial structure of the simulated problem, which saves memory when data and computations are required for a subset of the domain only, e.g. for multiphase problems such as drops, clouds, or flame fronts. FD4 ensures the existence of all blocks required for correct stencil operations.

Benchmarks: COSMO-SPECS Performance Comparison. With FD4, COSMO-SPECS is almost 3 times faster at 1024 CPU cores. Load balancing and coupling scale well, but can we reach more than 10 000 processes? [Figure: runtime comparison of COSMO-SPECS and COSMO-SPECS+FD4.]

Benchmarks: Scalability on Blue Gene/Q. Grid size: 1024 x 1024 x 48 grid cells, more than 3 million blocks. COSMO-SPECS+FD4 runs on Blue Gene/Q with up to 262 144 MPI ranks and achieves a 14x speed-up from 16k to 256k ranks; at 256k ranks, a 30 min forecast is computed in less than 5 minutes (without initialization and I/O). Lieber, Nagel, Mix, Scalability Tuning of the Load Balancing and Coupling Framework FD4, NIC Symposium 2014, pp. 363-370.

Benchmarks: Load Balancing & Coupling Scalability. Grid size: 1024 x 1024 x 48 grid cells, more than 3 million blocks. The dynamic load balancing scales comparatively well, and the coupling scales nearly perfectly. [Figures: dynamic load balancing runtime % and coupling runtime %.] Lieber, Nagel, Mix, Scalability Tuning of the Load Balancing and Coupling Framework FD4, NIC Symposium 2014, pp. 363-370.

Benchmarks: 1D Partitioning Comparison on Blue Gene/Q. ExactBS: exact method, but slow and serial. H2: fast heuristic, but may result in poor load balance. HIER*: hierarchical algorithm implemented in FD4, achieves nearly optimal load balance. Measured partitioning times: ExactBS 2668 ms, QBS 692 ms, H2seq 363 ms, H2par 40.5 ms, HIER* (G=64) 8.55 ms, HIER* (P/G=256) 3.77 ms. Lieber, Nagel, Scalable High-Quality 1D Partitioning, HPCS 2014, pp. 112-119, 2014.

Heuristic H2 in Action (COSMO-SPECS+FD4)

HIER* in Action (COSMO-SPECS+FD4)

Conclusions. FD4 provides, for simulation models: parallelization of the numerical grid, communication between neighbor partitions, dynamic load balancing, model coupling, and high scalability. Initially developed for atmospheric modeling, it is generally applicable. FD4 is available as open source software (Fortran 95, MPI-2, NetCDF) and has been tested on many different HPC systems. FD4 website: http://wwwpub.zih.tu-dresden.de/~mlieber/fd4. References: Lieber, Nagel, Scalable High-Quality 1D Partitioning, HPCS 2014, pp. 112-119, 2014; Lieber, Nagel, Mix, Scalability Tuning of the Load Balancing and Coupling Framework FD4, NIC Symposium 2014; Lieber et al., Highly Scalable Dynamic Load Balancing in the Atmospheric Modeling System COSMO-SPECS+FD4, PARA 2010, 2012; Lieber et al., FD4: A Framework for Highly Scalable Load Balancing and Coupling of Multiphase Models, ICNAAM 2010.

Thank you very much for your attention! Acknowledgments: Verena Grützun, Ralf Wolke, Oswald Knoth, Martin Simmel, René Widera, Matthias Jurenz, Matthias Müller, Wolfgang E. Nagel. www.vampir.eu, www.tropos.de, www.cosmo-model.org, picongpu.hzdr.de.

Backup Slides

Framework FD4: Optimized Data Structure. A large number of small blocks is good for performance: with the size-resolved approach (~1000 variables per grid cell), only small blocks do not exceed the processor cache, and for load balancing, #blocks > #partitions enables fine-grained balancing. However, the additional memory cost for a boundary of ghost cells is too high for small blocks, so FD4 adds ghost blocks at the partition borders only (see the worked example below).
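A quick worked example, with illustrative numbers not taken from the slides (cubic blocks of 8^3 cells and a one-cell ghost layer), shows why ghost cells around every small block would be prohibitive:

\frac{(8+2)^3 - 8^3}{8^3} = \frac{1000 - 512}{512} \approx 0.95

i.e. nearly twice the memory per block, which is why ghost storage is kept only at the partition borders.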

From SFC Partitioning to 1D Partitioning. Space-filling curve (SFC) partitioning is widely used: the n-dimensional space is mapped to 1D by the SFC (e.g. the Hilbert SFC); the mapping is fast, has high locality, and migration typically occurs between neighbor ranks (Pilkington, Baden, Dynamic partitioning of non-uniform structured workloads with spacefilling curves, IEEE T. Parall. Distr., vol. 7, no. 3, pp. 288-300, 1996). The core problem of SFC partitioning is 1D partitioning, which decomposes the task chain into consecutive parts. There are two classes of existing 1D partitioning algorithms: heuristics are fast and parallel but give no optimal solution; exact methods are optimal but slow and serial (Pinar, Aykanat, Fast optimal load balancing algorithms for 1D partitioning, J. Parallel Distr. Com., vol. 64, no. 8, pp. 974-996, 2004). A minimal greedy heuristic is sketched below.
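The following is a minimal sketch of a greedy heuristic in the spirit of H1/H2 (assumed interface, not FD4 code): walk once over the SFC-ordered block weights and open a new partition whenever the running prefix sum reaches the next multiple of the ideal load W/P.

! Minimal sketch (hypothetical interface): greedy 1D partitioning of
! SFC-ordered block weights into nparts consecutive parts.
subroutine partition_1d(nblocks, weight, nparts, owner)
  implicit none
  integer, intent(in)  :: nblocks, nparts
  real,    intent(in)  :: weight(nblocks)   ! block weights in SFC order
  integer, intent(out) :: owner(nblocks)    ! partition index (0..nparts-1) per block
  real    :: ideal, prefix
  integer :: i, part
  ideal  = sum(weight) / nparts             ! ideal load per partition
  prefix = 0.0
  part   = 0
  do i = 1, nblocks
     prefix   = prefix + weight(i)
     owner(i) = part
     ! open the next partition once its ideal share is reached,
     ! but never run out of partitions for the remaining blocks
     if (prefix >= real(part + 1) * ideal .and. part < nparts - 1) part = part + 1
  end do
end subroutine partition_1d

Exact methods such as ExactBS instead search for the cut positions that minimize the maximum partition load; HIER* (described below) combines a parallel heuristic with exact refinement within small groups.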

Sequential vs. Concurrent Model Coupling. Sequential coupling: both models run alternately on the same set of MPI ranks; this allows tight coupling (data dependencies) and avoids load imbalances between the models. Concurrent coupling: the MPI ranks are split into groups; this is loose coupling, the codes may be separate, and it scales to a higher total number of ranks. [Figure: timelines of sequential and concurrent coupling with data exchanges between Model A and Model B.]

Scalable High-Quality 1D Partitioning: Algorithm HIER*. Large-scale applications require a fully parallel method, i.e. one that does not gather all task weights. Step 1: run the parallel heuristic H2 to create G < P coarse partitions; H2 is nearly optimal if w_max << W_N / P (Miguet, Pierson, Heuristics for 1D rectilinear partitioning as a low cost and high quality answer to dynamic load balancing, LNCS, vol. 1225, 1997, pp. 550-564). Step 2: run G independent instances of the exact method QBS* (q = 1.0) to create the final partitions within each group. The parameter G allows a trade-off between scalability (high G: the heuristic dominates) and load balance (small G: the exact method dominates). [Figure: original partitions P0-P7 are coarsened into groups by parallel H2 and refined within each group by QBS*.]

FD4: Implementation. Implemented in Fortran 95, MPI-based parallelization, open source software: www.tu-dresden.de/zih/clouds

! MPI initialization
call MPI_Init(err)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, err)
call MPI_Comm_size(MPI_COMM_WORLD, nproc, err)

! create the domain and allocate memory
call fd4_domain_create(domain, nb, size, &
                       vartab, ng, peri, MPI_COMM_WORLD, err)
call fd4_util_allocate_all_blocks(domain, err)

! initialize ghost communication
call fd4_ghostcomm_create(ghostcomm, domain, &
                          4, vars, steps, err)

! loop over time steps
do timestep = 1, nsteps

  ! exchange ghosts
  call fd4_ghostcomm_exch(ghostcomm, err)

  ! loop over local blocks
  call fd4_iter_init(domain, iter)
  do while (associated(iter%cur))
    ! do some computations
    call compute_block(iter)
    call fd4_iter_next(iter)
  end do

  ! dynamic load balancing
  call fd4_balance_readjust(domain, err)
end do

Benchmarks: COSMO-SPECS Performance Comparison. [Figure: runtime comparison of COSMO-SPECS and COSMO-SPECS+FD4.]

Scalable High-Quality 1D Partitioning: Load Balance. Cloud simulation with 1 357 824 tasks on JUQUEEN, IBM Blue Gene/Q: HIER* with G=64 achieves 99.2% of the optimal load balance at 262 144 processes.

HIER* seen in Vampir (one group of 256 out of 64Ki)

ExactBS in Action (COSMO-SPECS+FD4)

COSMO-SPECS+FD4: Comparison of Methods. Lieber, Nagel, Mix, Scalability Tuning of the Load Balancing and Coupling Framework FD4, NIC Symposium 2014, pp. 363-370.

Scalable Coupling: Meta Data, Subdomains, Handshaking. Identifying the partition overlaps between the coupled models turned out to be the main scalability bottleneck. This was solved with a spatially indexed data structure for the coupling meta data in FD4; the time for locating overlap candidates does not depend on the number of ranks. Lieber, Nagel, Mix, Scalability Tuning of the Load Balancing and Coupling Framework FD4, NIC Symposium 2014, pp. 363-370.