walberla: Towards an Adaptive, Dynamically Load-Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation
SIAM Parallel Processing for Scientific Computing 2012, February 16, 2012
Florian Schornbaum, Christian Feichtinger, Harald Köstler, Ulrich Rüde
Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
- Introduction
- Motivation / Problem Description
  - Current Framework Capabilities
  - Future Goals & Extensions
- Prototyping Environment
- Implementation
  - Data Structures
  - Distributed Refinement/Coarsening Algorithm
  - Procedure Virtualization / Virtual Blocks
  - Load Balancing
- Results / Benchmarks
- Summary & Conclusion
Introduction

walberla: a massively parallel software framework, originally developed for CFD simulations based on the lattice Boltzmann method (LBM).

Lattice Boltzmann method: in every time step, each cell of the discretized simulation space exchanges information only with its directly adjacent neighbors:
- high data locality
- especially well suited for extensive parallelization
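The nearest-neighbor access pattern behind this locality claim can be illustrated with a minimal sketch (hypothetical names, not walberla's actual API): a D2Q9 stream step in which every cell reads only from its directly adjacent cells.

```cpp
#include <array>
#include <vector>

// D2Q9 lattice: 9 discrete velocities, each pointing to the cell itself
// or to one of its 8 direct neighbors (illustrative sketch only).
constexpr int Q = 9;
constexpr std::array<int, Q> cx = {0, 1, 0, -1, 0, 1, -1, -1, 1};
constexpr std::array<int, Q> cy = {0, 0, 1, 0, -1, 1, 1, -1, -1};

using Field = std::vector<std::array<double, Q>>; // nx*ny cells, Q PDFs each

// Stream step: each cell pulls distribution functions from its direct
// neighbors only -> high data locality, easy to parallelize.
void stream(const Field& src, Field& dst, int nx, int ny) {
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x)
            for (int q = 0; q < Q; ++q) {
                int xs = (x - cx[q] + nx) % nx; // periodic neighbor index
                int ys = (y - cy[q] + ny) % ny;
                dst[y * nx + x][q] = src[ys * nx + xs][q];
            }
}
```

Because every cell update touches at most one neighbor per lattice direction, the domain can be cut into pieces that communicate only across their surfaces.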
Motivation / Problem Description
Current Framework Capabilities

Currently, the walberla framework does not support grid refinement; the simulation space is always regularly discretized. For parallel simulations, each process is assigned agglomerates of several thousand cells ("blocks" of cells) → geometric distribution of blocks to processes.
(figure: blocks distributed geometrically to processes 1 to 4)
Motivation / Problem Description
Current Framework Capabilities

(figures: inter-process and intra-process communication between blocks on processes 1 to 4)

The required inter- and intra-process communication schemes are relatively easy to understand and to implement: data must be exchanged only between neighboring blocks → straightforward parallelization of large simulations.
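The two communication paths can be sketched as follows (hypothetical types, not walberla's actual API): neighboring blocks on the same process copy boundary data directly, while blocks on different processes exchange it through a message buffer, which an MPI send/receive pair would transport.

```cpp
#include <vector>

// Illustrative sketch of the two communication paths between
// neighboring blocks (hypothetical structure).
struct Block {
    int process;                  // owning process rank
    std::vector<double> ghost;    // ghost layer received from the neighbor
    std::vector<double> boundary; // own boundary layer to send
};

// Intra-process: both blocks live in the same address space -> plain copy.
void exchangeIntra(Block& a, Block& b) {
    a.ghost = b.boundary;
    b.ghost = a.boundary;
}

// Inter-process: boundary data is packed into a message buffer; in the
// real framework an MPI send/recv pair would transport this buffer.
std::vector<double> packMessage(const Block& sender) { return sender.boundary; }
void unpackMessage(Block& receiver, const std::vector<double>& msg) {
    receiver.ghost = msg;
}
```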
Motivation / Problem Description
Future Goals & Extensions

walberla will be extended to support grid refinement (for more information on grid refinement and the LBM, see Filippova et al., Dupuis et al., and Krafczyk et al.).

Restrictions for and consequences of grid refinement:
- 2:1 size ratio of neighboring cells
- higher resolution in areas covered with obstacles
- with the lattice Boltzmann method, twice as many time steps must be performed on the fine grid as on the coarse grid
Motivation / Problem Description
Future Goals & Extensions

Restrictions for and consequences of grid refinement (cont.):

In 3D, one refinement step leads to eight times as many cells in the refined area: memory consumption ×8 and, because the time step is also halved, generated workload ×16.

If more than one refinement level is used, the 2:1 size ratio of neighboring cells must still be obeyed. With n refinement levels:

  memory on the finest grid = 8^(n-1) · memory on the coarsest grid
  workload on the finest grid = 16^(n-1) · workload on the coarsest grid
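These growth factors are easy to verify numerically; a small helper (hypothetical names) multiplies out the per-level factors of 8 (cells in 3D) and 16 (cells times the doubled number of time steps).

```cpp
#include <cstdint>

// Growth of memory and workload across n refinement levels in 3D:
// each additional level multiplies the cell count by 8 and halves the
// time step, so workload grows by a factor of 16 per level.
std::uint64_t memoryFactor(int n) {
    std::uint64_t f = 1;
    for (int i = 1; i < n; ++i) f *= 8;  // 8^(n-1)
    return f;
}

std::uint64_t workloadFactor(int n) {
    std::uint64_t f = 1;
    for (int i = 1; i < n; ++i) f *= 16; // 16^(n-1)
    return f;
}
```

Already for n = 5 levels, a finest-grid region is 65536 times as expensive per unit volume as a coarsest-grid region, which is why equally sized spatial partitions cannot balance the load.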
Motivation / Problem Description
Future Goals & Extensions

To achieve good load balancing, subdividing the simulation space into equally sized regions will not work. Each process must be assigned the same amount of work, where the workload is given by the number of cells weighted by the number of time steps that must be performed on the corresponding grid level → not trivial to solve for billions of cells!
Motivation / Problem Description
Future Goals & Extensions

The problem gets even worse if the fine regions are not static but dynamically change their location (moving obstacles, etc.). Areas initially consisting of coarse cells will require much more memory and generate much more workload after being refined (and vice versa) → massive workload & memory fluctuations!

Performing global refinement, coarsening, and load balancing (by synchronizing all processes or by using a master-slave scheme) can be extremely expensive, or even impossible, for simulations with billions of cells distributed across thousands of processes → solution: fully distributed algorithms working in parallel.
Implementation

In order to deal with all of these problems, new and adapted data structures and algorithms are required. A prototyping environment has been created within the walberla framework that focuses solely on the development of these new data structures and distributed algorithms:
- No actual lattice Boltzmann fluid simulation is executed. All data required for the LBM exists only in the form of accumulated, abstract information about workload and memory.
- Adaptive refinement is simulated by moving spherical objects through the domain and demanding a fine resolution around these objects.

The prototyping environment allows fast and efficient development and testing of different concepts and structures.
Implementation

The prototyping environment (written in C++) is not parallelized with MPI but only with OpenMP; it runs on shared-memory systems. Thousands of processes running in parallel and using distributed algorithms for refinement and balancing are only simulated.

Advantages:
- fast development and testing (thousands of processes can be simulated on a desktop computer)
- all tasks are also solved with easy-to-understand global algorithms, which are then used to validate the results of the fully distributed, parallel algorithms
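One way such a simulation of many processes on a shared-memory machine can be structured is sketched below (hypothetical types, not the actual prototype code): each simulated process owns only its own blocks, and one OpenMP-parallel loop executes an algorithm phase for all of them, mirroring the isolation of the distributed algorithm it stands in for.

```cpp
#include <vector>

// Illustrative sketch: each simulated "process" owns its blocks and an
// accumulated abstract workload (hypothetical structure).
struct SimProcess {
    std::vector<int> blockIds; // blocks currently owned by this process
    double workload = 0.0;     // accumulated abstract workload
};

// One algorithm phase: every simulated process works only on its own
// data; the OpenMP pragma parallelizes over the simulated processes
// (and is simply ignored if compiled without OpenMP support).
void runPhase(std::vector<SimProcess>& procs) {
    #pragma omp parallel for
    for (int p = 0; p < static_cast<int>(procs.size()); ++p) {
        // stand-in for the real per-process work: here, just count blocks
        procs[p].workload = static_cast<double>(procs[p].blockIds.size());
    }
}
```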
Data Structures

Algorithms working on a cell-based structure cannot be implemented efficiently:
- highly irregularly shaped partitions of the simulation domain
- completely irregular communication schemes

Computation sweeps over blocks of cells, as produced by the current homogeneous discretization, are much more efficient. The new structure is therefore also based on blocks of cells (e.g., 40 × 40 × 40); all cells within one block are of the same size.
Data Structures

The 2:1 cell size ratio restriction forces two neighboring blocks to have either the same cell size or cell sizes that differ by exactly one refinement level. Geometrically, the result is a forest of octrees (blocks = leaves).
(figure: region in the simulation domain where the underlying application demands a fine resolution)

What makes this structure special/different: no concepts and structures typically associated with trees (father-child connections, inner nodes, etc.) are used. Each block only knows all of its direct neighbors → perfect for parallelization!
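A minimal sketch of this tree-free view (hypothetical names): a block stores no father or child pointers, only its own refinement level and the levels of its direct neighbors, which is all the 2:1 check needs.

```cpp
#include <cstdlib>
#include <vector>

// Illustrative sketch of the tree-free block structure: no father/child
// connections, only direct-neighbor knowledge (hypothetical names).
struct BlockInfo {
    int level;                       // refinement level of this block
    std::vector<int> neighborLevels; // levels of all direct neighbors
};

// The 2:1 restriction: every direct neighbor differs by at most one level.
bool satisfiesTwoToOne(const BlockInfo& b) {
    for (int nl : b.neighborLevels)
        if (std::abs(nl - b.level) > 1) return false;
    return true;
}
```

Since this predicate involves only a block and its direct neighbors, it can be evaluated by each process independently, without any global tree traversal.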
Distributed Refinement/Coarsening Algorithm

If the area that requires the finest resolution changes, the data structure must be adapted accordingly. (From now on, each box in the figures represents an entire block of cells.) If one block is refined, additional blocks may be affected because of the 2:1 restriction.
Distributed Refinement/Coarsening Algorithm

The same holds true if multiple blocks are merged into one single block (→ coarsening). Refinement and coarsening are performed in parallel by a fully distributed algorithm. The runtime of this algorithm depends only on the number of grid levels, not on the number of processes!
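The level-dependent runtime can be made plausible with a serial stand-in for 2:1 enforcement (a sketch, not the actual distributed algorithm): marking one block for refinement may force its coarser neighbors to refine as well, and this effect propagates at most once per refinement level.

```cpp
#include <cstddef>
#include <vector>

// Serial sketch of 2:1 enforcement on a 1D row of blocks (illustrative
// stand-in for the distributed algorithm). level[i] is the desired
// refinement level of block i; after the sweeps, neighboring blocks
// differ by at most one level. The while loop runs at most as many
// times as there are refinement levels, independent of the block count.
void enforceTwoToOne(std::vector<int>& level) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i + 1 < level.size(); ++i) {
            if (level[i] < level[i + 1] - 1) { level[i] = level[i + 1] - 1; changed = true; }
            if (level[i + 1] < level[i] - 1) { level[i + 1] = level[i] - 1; changed = true; }
        }
    }
}
```

In the distributed version, each sweep corresponds to one round of neighbor-to-neighbor communication, which is why the number of rounds scales with the number of grid levels rather than the number of processes.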
Procedure Virtualization / Virtual Blocks

Idea: each block creates a virtual representation of itself.
- Each virtual block has a very small memory footprint: no cells, only values such as 'workload' and 'memory size' are stored.
- All algorithms (refinement, coarsening, and load balancing) operate on these virtual blocks. If a block moves from one process to another, only a small amount of memory must be communicated.
- Only at the end of the refinement-coarsening-balancing pipeline do the actual blocks follow their virtual blocks to the designated target processes (and only then are refinement and coarsening performed on the actual cells).
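A virtual block as described above can be sketched as a lightweight proxy (hypothetical fields, not walberla's actual types): it carries only the bookkeeping values that refinement, coarsening, and load balancing need, never the cell data.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of a virtual block: a lightweight proxy carrying
// only bookkeeping values, not the cell payload (hypothetical fields).
struct VirtualBlock {
    std::uint64_t id;   // block identifier
    int level;          // refinement level
    double workload;    // abstract workload of the real block
    double memorySize;  // memory footprint of the real block
    int targetProcess;  // where the real block will be sent at the end
};

// Balancing decisions only need aggregates like this; the heavyweight
// cell data is moved once, after the targets are fixed.
double totalWorkload(const std::vector<VirtualBlock>& blocks) {
    double sum = 0.0;
    for (const auto& b : blocks) sum += b.workload;
    return sum;
}
```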
Procedure Virtualization / Virtual Blocks

Starting situation: blocks may be aggregated, and one block needs to be refined; the figures show the resulting process distribution after each phase:
1. Initialization: virtual blocks are created from the actual blocks
2. Refinement
3. Coarsening
4. Load Balancing
5. Finalization: the actual blocks follow their virtual blocks
Load Balancing Each block has the same number of cells (→ identical memory consumption), but smaller cells generate more workload. In a simulation with 5 different grid levels, 2 blocks on the finest level generate the same amount of work as 32 blocks on the coarsest level, yet 32 blocks might not fit into the memory of one process. Blocks assigned to the same process should be close to each other. Load balancing problem/situation #1: Some processes may reach their memory limit without generating as much work as the average process. 23
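The workload arithmetic above can be checked directly; the 2:1 time-step ratio between consecutive grid levels is the usual assumption for LBM grid refinement:

```python
# Every block holds the same number of cells, but with a 2:1 time-step
# ratio between grid levels a block on level l performs 2**l times the
# work of a block on the coarsest level 0 (per coarse time step), at
# identical memory cost.

def block_work(level):
    return 2 ** level    # work relative to one coarsest-level block

# 5 grid levels: 0 (coarsest) .. 4 (finest)
assert 2 * block_work(4) == 32 * block_work(0)   # 2 finest = 32 coarsest
```

This is exactly why memory consumption and workload must be balanced as two separate quantities.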
Load Balancing The blocks should be large, i.e., they should contain many cells: few (maybe only one) blocks per process minimizes communication cost and enables efficient computation algorithms. However, only entire blocks can be exchanged between processes: many blocks per process is certainly good for balancing, i.e., the blocks should be small. Load balancing problem/situation #2: On average, each process owns about 4 to 10 blocks and has 20 to 25 neighbors (in 3D). 24
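The large-vs-small tradeoff can be quantified with a simple surface-to-volume argument (a sketch, assuming a single ghost layer on all six faces of a cubic block):

```python
# For a cubic block of n**3 cells, computational work scales with the
# volume n**3, while the data exchanged with neighbors scales with the
# surface, roughly 6*n**2 ghost-layer cells. Larger blocks therefore
# communicate relatively less per cell updated.

def ghost_cells_per_cell(n):
    return 6 * n * n / n ** 3    # communication-to-computation ratio

assert ghost_cells_per_cell(40) < ghost_cells_per_cell(10)
```

The ratio shrinks like 6/n, which is the argument for large blocks; the granularity needed for balancing is the argument for small ones.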
Load Balancing Static Load Balancing Implemented static load balancing strategies: space-filling curves, namely Z-order (aka Morton order or Morton code) and the Hilbert curve, both of which can be constructed by a depth-first search, plus a custom greedy algorithm that aggregates neighboring blocks. Comparison of these three methods: number of processes: greedy ≈ Hilbert < Morton; partition quality (intra-process communication): greedy > Hilbert > Morton; runtime (less is better): Morton ≈ Hilbert < greedy; all three algorithms run in O(#processes). 25
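As an illustration of the curve-based strategies, the following sketch orders blocks along a Z-order (Morton) curve by bit interleaving and cuts the sequence into pieces of roughly equal work. It is a minimal sketch under simplified assumptions (integer block coordinates, a user-supplied work function), not the framework's implementation:

```python
def morton(x, y, z, bits=10):
    """Z-order (Morton) code of an integer block coordinate."""
    code = 0
    for i in range(bits):                    # interleave the coordinate bits
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def balance(blocks, work, num_procs):
    """Cut the Morton-ordered block sequence into num_procs pieces of
    roughly equal total work."""
    blocks = sorted(blocks, key=lambda b: morton(*b))  # order along the curve
    target = sum(work(b) for b in blocks) / num_procs  # ideal work per process
    parts = [[] for _ in range(num_procs)]
    cur, acc = 0, 0.0
    for b in blocks:
        if acc >= target and cur < num_procs - 1:      # cut the curve here
            cur, acc = cur + 1, 0.0
        parts[cur].append(b)
        acc += work(b)
    return parts
```

A Hilbert-curve variant would only change the ordering key; the locality of the curve is what keeps blocks on the same process spatially close.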
Load Balancing Dynamic Load Balancing Dynamic load balancing is based on a diffusive algorithm: the 'work flow' between neighboring processes is calculated. If the flows on all edges were met exactly, almost perfect load balancing could be achieved. The flows cannot be met exactly, however: available/free memory must be taken into account, and there are fewer blocks per process than connections to other processes. (Figure: one process with 5 blocks; the workload per block and the work flow per edge of the process graph are illustrated.) 26
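A minimal sketch of such a diffusion scheme, assuming a simple first-order iteration in which each edge of the process graph carries a flow proportional to the load difference (the names and the parameters `alpha` and `iterations` are illustrative):

```python
def diffusion_flows(loads, edges, alpha=0.25, iterations=200):
    """Iteratively diffuse workload along the edges of the process graph
    and accumulate the resulting work flow per edge."""
    loads = dict(loads)                      # process -> current workload
    flows = {e: 0.0 for e in edges}          # edge (i, j) -> accumulated flow
    for _ in range(iterations):
        for i, j in edges:
            f = alpha * (loads[i] - loads[j])   # flow from i towards j
            flows[(i, j)] += f
            loads[i] -= f
            loads[j] += f
    return flows, loads
```

On a chain of three processes with loads 6, 0, 0, the accumulated flows converge to 4 on the first edge and 2 on the second, i.e., exactly what would equalize the load; the difficulty in practice is that only whole blocks can move.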
Load Balancing Dynamic Load Balancing The basic ideas behind our current implementation: 1) Refinement and coarsening can both lead to too many (virtual) blocks being located on the same process. By redistributing these blocks, a distributed algorithm makes sure that the memory limit is not violated. 2) The diffusive load balancing algorithm does not violate the memory limit (receiving processes must always authorize block exchanges) and uses the calculated work flows for guidance: the sum of the flows → the number of blocks to be sent/received; the work flow, the memory usage of all neighbors, etc. are used for guidance on where to send (the sending processes decide). 27
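The authorization idea in 2) can be sketched as a two-step handshake. `Block`, `propose`, and `authorize` are hypothetical names; the uniform per-block memory cost follows from the identical block sizes noted earlier:

```python
from collections import namedtuple

# A block with an id and the workload it generates; memory per block is
# uniform because every block holds the same number of cells.
Block = namedtuple("Block", "id work")

def propose(blocks, flow):
    """Sender: pick blocks whose summed work roughly covers the outgoing flow."""
    chosen, acc = [], 0.0
    for b in sorted(blocks, key=lambda b: b.work):
        if acc >= flow:
            break
        chosen.append(b)
        acc += b.work
    return chosen

def authorize(proposed, used_memory, memory_limit, block_memory):
    """Receiver: accept proposed blocks only while staying under the memory limit."""
    accepted = []
    for b in proposed:
        if used_memory + block_memory > memory_limit:
            break
        accepted.append(b)
        used_memory += block_memory
    return accepted
```

Because the receiver has the final word, the exchange can never push a process over its memory limit, even if the computed flows would demand it.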
Outline Introduction Motivation / Problem Description Current Framework Capabilities Future Goals & Extensions Prototyping Environment Implementation Data Structures Distributed Refinement/Coarsening Algorithm Procedure Virtualization / Virtual Blocks Load Balancing Results / Benchmarks Summary & Conclusion 28
Results / Benchmarks 300 Processes Setup A 'simulated' simulation: 14 rising bubbles, with high resolution around these bubbles. 29
Results / Benchmarks 300 Processes Setup 14 rising bubbles (→ high resolution around these bubbles), 5 different grid levels, initially 15 016 blocks (40 × 40 × 40 cells each). 30
Results / Benchmarks 300 Processes No Load Balancing 300 processes, initially 15 016 blocks & 961 024 000 cells. 32
Results / Benchmarks 300 Processes Load Balancing 300 processes, initially 15 016 blocks & 961 024 000 cells. 34
Results / Benchmarks 160 000 Processes Load Balancing 160 000 processes, initially 808 176 blocks & 51 723 264 000 cells. 36
Outline Introduction Motivation / Problem Description Current Framework Capabilities Future Goals & Extensions Prototyping Environment Implementation Data Structures Distributed Refinement/Coarsening Algorithm Procedure Virtualization / Virtual Blocks Load Balancing Results / Benchmarks Summary & Conclusion 39
Summary & Conclusion We have all ingredients required for very large, adaptive, dynamically load-balanced Lattice Boltzmann fluid simulations: handling of/interpolation between different grid resolutions (→ Filippova et al., Dupuis et al., Krafczyk et al.); our contribution: all the necessary data structures and algorithms for performing simulations in massively parallel environments (100,000 processes and more); very high data locality within the fully distributed 'blocks of cells' data structure; manipulation (refinement, balancing, etc.) only through distributed/diffusive algorithms; prototyping environment → production code (walberla framework). 40
walberla: Towards an Adaptive, Dynamically Load-Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation THE END. Questions? SIAM Parallel Processing for Scientific Computing 2012, February 16, 2012 Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany Florian Schornbaum, Christian Feichtinger, Harald Köstler, Ulrich Rüde