walberla: Towards an Adaptive, Dynamically Load-Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation
1 walberla: Towards an Adaptive, Dynamically Load-Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation. SIAM Parallel Processing for Scientific Computing 2012, February 16, 2012. Florian Schornbaum, Christian Feichtinger, Harald Köstler, Ulrich Rüde. Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
2 Outline: Introduction; Motivation / Problem Description (Current Framework Capabilities, Future Goals & Extensions); Prototyping Environment; Implementation (Data Structures, Distributed Refinement/Coarsening Algorithm, Procedure: Virtualization / Virtual Blocks, Load Balancing); Results / Benchmarks; Summary & Conclusion
4 Introduction: walberla is a massively parallel software framework originally developed for CFD simulations based on the Lattice Boltzmann method (LBM). Lattice Boltzmann method: in every time step, each cell in a discretized simulation space exchanges information only with its directly adjacent neighbors. This high data locality makes the method especially well suited for extensive parallelization.
9 Motivation / Problem Description, Current Framework Capabilities: Currently, the walberla framework does not support refinement; the simulation space is always regularly discretized. For parallel simulations, each process is assigned agglomerates of several thousand cells ("blocks" of cells), which are distributed geometrically among the processes.
12 Motivation / Problem Description, Current Framework Capabilities: The required inter- and intra-process communication schemes are relatively easy to understand and to implement, since data must be exchanged only between neighboring blocks. This allows a straightforward parallelization of large simulations.
15 Motivation / Problem Description, Future Goals & Extensions: walberla will be extended to support grid refinement (for more information on grid refinement and the LBM, see Filippova et al., Dupuis et al., Krafczyk et al.). Restrictions and consequences of grid refinement: a 2:1 size ratio of neighboring cells, and higher resolution in areas covered with obstacles. With the Lattice Boltzmann method, twice as many time steps must be performed on the fine grid as on the coarse grid.
19 Motivation / Problem Description, Future Goals & Extensions: Restrictions and consequences of grid refinement (cont.): In 3D, one refinement step leads to eight times as many cells in the refined area: memory consumption grows by a factor of 8 and the generated workload by a factor of 16, since the eight times as many cells are each updated twice as often. If more than one refinement level is used, the 2:1 size ratio of neighboring cells must still be obeyed. If n refinement levels are used, then the memory on the finest grid is 8^(n-1) times the memory on the coarsest grid, and the workload on the finest grid is 16^(n-1) times the workload on the coarsest grid.
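The per-level factors can be written out explicitly. Each refinement step splits a cell into 2 x 2 x 2 = 8 finer cells (a factor of 8 in memory) and halves the time step (a further factor of 2 in work), so after n - 1 refinement steps:

```latex
\frac{\text{memory on finest grid}}{\text{memory on coarsest grid}} = 8^{\,n-1},
\qquad
\frac{\text{workload on finest grid}}{\text{workload on coarsest grid}}
  = 8^{\,n-1} \cdot 2^{\,n-1} = 16^{\,n-1}
```

For n = 5 levels this already means a factor of 4096 in memory and 65536 in work per unit volume between the coarsest and the finest region.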
24 Motivation / Problem Description, Future Goals & Extensions: In order to achieve good load balancing, subdividing the simulation space into equally sized regions won't work. Each process must be assigned the same amount of work, where the workload is given by the number of cells weighted by the number of time steps that need to be performed on the corresponding grid level. This is not trivial to solve for billions of cells!
28 Motivation / Problem Description, Future Goals & Extensions: The problem gets even worse if the fine regions are not static but dynamically change their locations (moving obstacles etc.). Areas initially consisting of coarse cells will require much more memory and generate a lot more workload after being refined (and vice versa): massive workload and memory fluctuations! Performing global refinement, coarsening, and load balancing (by synchronizing all processes or using a master-slave scheme) can be extremely expensive or even impossible for simulations with billions of cells distributed to thousands of processes. The solution: fully distributed algorithms working in parallel.
33 Implementation: In order to deal with all of these problems, new and adapted data structures and algorithms are required. A prototyping environment has been created within the walberla framework that focuses solely on the development of these new data structures and distributed algorithms. No actual Lattice Boltzmann fluid simulation is executed: all the data required for the LBM exists only in the form of accumulated, abstract information about workload and memory. Adaptive refinement is simulated by moving spherical objects through the simulation and demanding a fine resolution around these objects. The prototyping environment allows for fast and efficient development and testing of different concepts and structures.
39 Implementation: The prototyping environment (written in C++) is not parallelized with MPI but only with OpenMP; it runs on shared memory systems. Thousands of processes running in parallel using distributed algorithms for refinement and balancing are only simulated. Advantages: fast development and testing (thousands of processes can be simulated on a desktop computer), and all tasks are also solved with easy-to-understand global algorithms, which are then used to validate the results of the fully distributed, parallel algorithms.
43 Data Structures: Algorithms working on a cell-based structure cannot be implemented efficiently: they lead to highly irregularly shaped partitions of the simulation domain and completely irregular communication schemes. Computation sweeps over blocks of cells, as they result from the current homogeneous discretization, are much more efficient. The new structure is therefore also based on blocks of cells, where all cells in one block are of the same size.
48 Data Structures: The 2:1 cell size ratio restriction causes two neighboring blocks to have either the same cell size or cell sizes that differ by only one refinement level. Geometrically, the result is a forest of octrees, with the blocks as its leaves; blocks are refined in regions of the simulation domain where the underlying application demands a fine resolution. What makes this structure special/different: no concepts and structures typically associated with trees (father-child connections, inner nodes, etc.) are used. Each block only knows all of its direct neighbors: perfect for parallelization!
53 Distributed Refinement/Coarsening Algorithm: If the area that requires the finest resolution changes, the data structure must be adapted accordingly. From now on, each box in the illustrations represents an entire block of cells. If one block is refined, additional blocks may be affected, since the 2:1 ratio must be preserved.
57 Distributed Refinement/Coarsening Algorithm: The same holds true if multiple blocks are merged into one single block (coarsening). Refinement and coarsening are performed in parallel by a fully distributed algorithm. The runtime of these algorithms depends only on the number of grid levels, not on the number of processes!
61 Procedure Virtualization / Virtual Blocks: Idea: each block creates a virtual representation of itself. Each virtual block has a very small memory footprint: no cells are stored, only values like 'workload' and 'memory size'. All algorithms (refinement, coarsening, and load balancing) operate on these virtual blocks, so if a block moves from one process to another, only a small amount of memory must be communicated. Only at the end of the refinement-coarsening-balancing pipeline do the actual blocks follow their virtual blocks to the designated target processes (and only then are refinement and coarsening performed on the actual cells).
65 Procedure Virtualization / Virtual Blocks: Starting situation: blocks may be aggregated, one block needs to be refined, and the blocks are distributed among the processes. Step 1, Initialization: virtual blocks are created from the actual blocks.
67 Procedure Virtualization / Virtual Blocks: Step 2, Refinement: the affected virtual blocks are refined.
70 Procedure Virtualization / Virtual Blocks: Step 3, Coarsening: virtual blocks marked for coarsening are merged.
73 Procedure Virtualization / Virtual Blocks: Step 4, Load Balancing: the virtual blocks are redistributed among the processes.
74 Procedure Virtualization / Virtual Blocks Starting situation: blocks may be aggregated block needs to be refined process distribution 5. Finalization: actual blocks virtual blocks 22
75 Load Balancing Each block has the same number of cells (identical memory consumption), but smaller cells generate more workload: in a simulation with 5 different grid levels, 2 blocks on the finest level generate the same amount of work as 32 blocks on the coarsest level, yet 32 blocks might not fit into the memory of one process. Blocks assigned to the same process should be close to each other. Load balancing problem/situation #1: some processes may reach their memory limit without generating as much work as the average process. 23
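The workload/memory asymmetry can be checked in a few lines (relative units; this sketch assumes, as is usual in LBM grid refinement, that the time step is halved on each finer level, which matches the slide's 2-versus-32 example):

```python
# Relative cost model: every block holds the same number of cells, but a
# block on a finer level is updated more often per coarse time step.
def workload(level: int) -> int:
    return 2 ** level  # time step halved per refinement level

def memory(level: int) -> int:
    return 1           # same cell count on every level

# 2 finest-level blocks (level 4 of 5 levels) generate the same amount
# of work as 32 coarsest-level blocks (level 0) ...
assert 2 * workload(4) == 32 * workload(0)
# ... but those 32 coarse blocks occupy 16 times the memory.
assert 32 * memory(0) == 16 * (2 * memory(4))
```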
80 Load Balancing The blocks should be large, i.e., they should contain many cells: few (maybe only one) blocks per process minimizes communication cost and enables efficient computation. But only entire blocks can be exchanged between processes: many blocks per process is certainly good for balancing, so the blocks should be small. Load balancing problem/situation #2: on average, each process owns about 4 to 10 blocks and has 20 to 25 neighbors (in 3D). 24
88 Load Balancing Static Load Balancing Implemented static load balancing strategies: space-filling curves, namely Z-order (a.k.a. Morton order or Morton code) and the Hilbert curve (both curves can be constructed by a depth-first search), and a custom greedy algorithm that aggregates neighboring blocks. Comparison of these three methods: number of processes: greedy Hilbert < Morton; partition quality (intra-process communication): greedy > Hilbert > Morton; runtime (less is better): Morton = Hilbert greedy; all three algorithms run in O(#processes). 25
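As an illustration of the Z-order strategy (a generic sketch, not walberla's implementation): the Morton code of a block is obtained by interleaving the bits of its grid coordinates, and sorting blocks by this code linearizes the domain along the curve, after which the sorted list can be cut into equal-work chunks.

```python
def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the bits of x, y, z (Z-order / Morton code)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# Sorting block coordinates by their Morton code yields the traversal
# order along the space-filling curve.
blocks = [(1, 0, 0), (0, 0, 0), (0, 1, 1), (1, 1, 1)]
curve = sorted(blocks, key=lambda b: morton3d(*b))
assert curve[0] == (0, 0, 0)
assert curve[-1] == (1, 1, 1)
```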
93 Load Balancing Dynamic Load Balancing Dynamic load balancing is based on a diffusive algorithm: the 'work flow' between neighboring processes is calculated. If the flows on all edges were met exactly, almost perfect load balance could be achieved. In practice, the flows cannot be met: available/free memory must be taken into account, and there are fewer blocks per process than connections to other processes. (figure: one process with 5 blocks; the workload per block and the work flow per edge of the process graph are illustrated) 26
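A first-order diffusion sketch of the 'work flow' computation (the diffusion constant and the graph are illustrative, not the framework's actual scheme): each process compares its load with each neighbor and derives a flow along the edge; a positive flow means work should move toward the neighbor.

```python
def work_flows(load, neighbors, alpha=0.25):
    """Compute one work-flow value per undirected process-graph edge:
    flow (p -> q) = alpha * (load[p] - load[q])."""
    flows = {}
    for p, nbrs in neighbors.items():
        for q in nbrs:
            if p < q:  # store each edge only once
                flows[(p, q)] = alpha * (load[p] - load[q])
    return flows

# Three processes in a triangle; process 0 is overloaded.
load = {0: 40.0, 1: 20.0, 2: 20.0}
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
flows = work_flows(load, neighbors)
assert flows[(0, 1)] == 5.0  # process 0 should shed work to process 1
assert flows[(1, 2)] == 0.0  # balanced edge, no flow
```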
99 Load Balancing Dynamic Load Balancing The basic ideas behind our current implementation: 1) Refinement and coarsening can both cause too many (virtual) blocks to end up on the same process; by redistributing these blocks, a distributed algorithm makes sure that the memory limit is not violated. 2) The diffusive load balancing algorithm does not violate the memory limit (receiving processes must always authorize block exchanges) and uses the calculated work flows for guidance: the sum of the flows → the number of blocks to be sent/received; the work flow, the memory usage of all neighbors, etc. are used for guidance on where to send (sending processes decide). 27
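The two safeguards described above can be sketched as follows (all names hypothetical): the sender greedily picks blocks whose summed workload approximates the requested outflow, and the receiver authorizes only as many of the offered blocks as fit below its memory limit.

```python
def pick_blocks(blocks, outflow):
    """Sender side: greedily choose (id, workload, memory) tuples whose
    summed workload stays within the requested outflow."""
    chosen, sent = [], 0.0
    for bid, work, mem in sorted(blocks, key=lambda b: -b[1]):
        if sent + work <= outflow:
            chosen.append((bid, work, mem))
            sent += work
    return chosen

def authorize(offer, used, limit):
    """Receiver side: accept offered blocks only while the memory
    limit is respected; everything else is refused."""
    accepted = []
    for bid, work, mem in offer:
        if used + mem <= limit:
            accepted.append(bid)
            used += mem
    return accepted

blocks = [(0, 8.0, 1), (1, 4.0, 1), (2, 2.0, 1)]
offer = pick_blocks(blocks, outflow=12.0)
assert [b[0] for b in offer] == [0, 1]
assert authorize(offer, used=9, limit=10) == [0]  # only one block fits
```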
105 Outline Introduction Motivation / Problem Description Current Framework Capabilities Future Goals & Extensions Prototyping Environment Implementation Data Structures Distributed Refinement/Coarsening Algorithm Procedure Virtualization / Virtual Blocks Load Balancing Results / Benchmarks Summary & Conclusion 28
106 Results / Benchmarks 300 Processes Setup 'simulated' simulation: 14 rising bubbles, with high resolution around these bubbles 29
107 Results / Benchmarks 300 Processes Setup 14 rising bubbles ( high resolution around these bubbles) 5 different grid levels initially: blocks ( cells) 30
109 Results / Benchmarks 300 Processes No Load Balancing 300 processes initially: blocks & cells 32
111 Results / Benchmarks 300 Processes Load Balancing 300 processes initially: blocks & cells 34
113 Results / Benchmarks Processes Load Balancing pro. initially: blocks & cells 36
116 Outline Introduction Motivation / Problem Description Current Framework Capabilities Future Goals & Extensions Prototyping Environment Implementation Data Structures Distributed Refinement/Coarsening Algorithm Procedure Virtualization / Virtual Blocks Load Balancing Results / Benchmarks Summary & Conclusion 39
117 Summary & Conclusion We have all the ingredients required for very large, adaptive, dynamically load-balanced Lattice Boltzmann fluid simulations: handling of / interpolation between different grid resolutions (Filippova et al., Dupuis et al., Krafczyk et al.); our contribution: all the necessary data structures and algorithms for performing simulations in massively parallel environments ( processes and more); very high data locality within the fully distributed 'blocks of cells' data structure; manipulation (refinement, balancing, etc.) only through distributed/diffusive algorithms; prototyping environment → production code (walberla framework) 40
123 walberla: Towards an Adaptive, THE END Dynamically Load-Balanced, Massively Parallel Lattice Boltzmann Fluid Simulation SIAM Parallel Processing for Scientific Computing 2012 Questions? February 16, 2012 Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany Florian Schornbaum, Christian Feichtinger, Harald Köstler, Ulrich Rüde
14 th European Conference on Mixing Warszawa, 10-13 September 2012 AN EFFECT OF GRID QUALITY ON THE RESULTS OF NUMERICAL SIMULATIONS OF THE FLUID FLOW FIELD IN AN AGITATED VESSEL Joanna Karcz, Lukasz Kacperski
Load Balancing Techniques
Load Balancing Techniques 1 Lecture Outline Following Topics will be discussed Static Load Balancing Dynamic Load Balancing Mapping for load balancing Minimizing Interaction 2 1 Load Balancing Techniques
Optimized Hybrid Parallel Lattice Boltzmann Fluid Flow Simulations on Complex Geometries
Optimized Hybrid Parallel Lattice Boltzmann Fluid Flow Simulations on Complex Geometries Jonas Fietz 2, Mathias J. Krause 2, Christian Schulz 1, Peter Sanders 1, and Vincent Heuveline 2 1 Karlsruhe Institute
OpenMosix Presented by Dr. Moshe Bar and MAASK [01]
OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive
Spatial Discretisation Schemes in the PDE framework Peano for Fluid-Structure Interactions
Spatial Discretisation Schemes in the PDE framework Peano for Fluid-Structure Interactions T. Neckel, H.-J. Bungartz, B. Gatzhammer, M. Mehl, C. Zenger TUM Department of Informatics Chair of Scientific
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
Load balancing; Termination detection
Load balancing; Termination detection Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 13, 2014 CPD (DEI / IST) Parallel and Distributed
Fast Parallel Algorithms for Computational Bio-Medicine
Fast Parallel Algorithms for Computational Bio-Medicine H. Köstler, J. Habich, J. Götz, M. Stürmer, S. Donath, T. Gradl, D. Ritter, D. Bartuschat, C. Feichtinger, C. Mihoubi, K. Iglberger (LSS Erlangen)
Load Balancing Strategies for Parallel SAMR Algorithms
Proposal for a Summer Undergraduate Research Fellowship 2005 Computer science / Applied and Computational Mathematics Load Balancing Strategies for Parallel SAMR Algorithms Randolf Rotta Institut für Informatik,
Energy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden [email protected],
International journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.
RESEARCH ARTICLE ISSN: 2321-7758 GLOBAL LOAD DISTRIBUTION USING SKIP GRAPH, BATON AND CHORD J.K.JEEVITHA, B.KARTHIKA* Information Technology,PSNA College of Engineering & Technology, Dindigul, India Article
Parallels Virtuozzo Containers
Parallels Virtuozzo Containers White Paper Greener Virtualization www.parallels.com Version 1.0 Greener Virtualization Operating system virtualization by Parallels Virtuozzo Containers from Parallels is
ABSTRACT FOR THE 1ST INTERNATIONAL WORKSHOP ON HIGH ORDER CFD METHODS
1 ABSTRACT FOR THE 1ST INTERNATIONAL WORKSHOP ON HIGH ORDER CFD METHODS Sreenivas Varadan a, Kentaro Hara b, Eric Johnsen a, Bram Van Leer b a. Department of Mechanical Engineering, University of Michigan,
Characterizing Task Usage Shapes in Google s Compute Clusters
Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang 1, Joseph L. Hellerstein 2, Raouf Boutaba 1 1 University of Waterloo, 2 Google Inc. Introduction Cloud computing is becoming a key
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
Large-Scale Reservoir Simulation and Big Data Visualization
Large-Scale Reservoir Simulation and Big Data Visualization Dr. Zhangxing John Chen NSERC/Alberta Innovates Energy Environment Solutions/Foundation CMG Chair Alberta Innovates Technology Future (icore)
Distributed Computing over Communication Networks: Maximal Independent Set
Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.
Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based Architectures
Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based Architectures Stefan Donath 1, Thomas Zeiser, Georg Hager, Johannes Habich, Gerhard Wellein Regionales Rechenzentrum
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102
How To Write A Program For The Pd Framework
Enhanced divergence-free elements for efficient incompressible flow simulations in the PDE framework Peano, Miriam Mehl, Christoph Zenger, Fakultät für Informatik TU München Germany Outline Derivation
The Advantages and Disadvantages of Network Computing Nodes
Big Data & Scripting storage networks and distributed file systems 1, 2, in the remainder we use networks of computing nodes to enable computations on even larger datasets for a computation, each node
Explicit Spatial Scattering for Load Balancing in Conservatively Synchronized Parallel Discrete-Event Simulations
Explicit Spatial ing for Load Balancing in Conservatively Synchronized Parallel Discrete-Event Simulations Sunil Thulasidasan Shiva Prasad Kasiviswanathan Stephan Eidenbenz Phillip Romero Los Alamos National
Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks
Dynamic Network Analyzer Building a Framework for the Graph-theoretic Analysis of Dynamic Networks Benjamin Schiller and Thorsten Strufe P2P Networks - TU Darmstadt [schiller, strufe][at]cs.tu-darmstadt.de
Load Balancing in Structured Peer to Peer Systems
Load Balancing in Structured Peer to Peer Systems DR.K.P.KALIYAMURTHIE 1, D.PARAMESWARI 2 Professor and Head, Dept. of IT, Bharath University, Chennai-600 073 1 Asst. Prof. (SG), Dept. of Computer Applications,
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH P.Neelakantan Department of Computer Science & Engineering, SVCET, Chittoor [email protected] ABSTRACT The grid
Load Balancing in Structured Peer to Peer Systems
Load Balancing in Structured Peer to Peer Systems Dr.K.P.Kaliyamurthie 1, D.Parameswari 2 1.Professor and Head, Dept. of IT, Bharath University, Chennai-600 073. 2.Asst. Prof.(SG), Dept. of Computer Applications,
Parallel Visualization for GIS Applications
Parallel Visualization for GIS Applications Alexandre Sorokine, Jamison Daniel, Cheng Liu Oak Ridge National Laboratory, Geographic Information Science & Technology, PO Box 2008 MS 6017, Oak Ridge National
Measuring the Performance of an Agent
25 Measuring the Performance of an Agent The rational agent that we are aiming at should be successful in the task it is performing To assess the success we need to have a performance measure What is rational
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Computer Graphics. Geometric Modeling. Page 1. Copyright Gotsman, Elber, Barequet, Karni, Sheffer Computer Science - Technion. An Example.
An Example 2 3 4 Outline Objective: Develop methods and algorithms to mathematically model shape of real world objects Categories: Wire-Frame Representation Object is represented as as a set of points
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
Load Imbalance Analysis
With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization
Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming)
Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming) Dynamic Load Balancing Dr. Ralf-Peter Mundani Center for Simulation Technology in Engineering Technische Universität München
CONVERGE Features, Capabilities and Applications
CONVERGE Features, Capabilities and Applications CONVERGE CONVERGE The industry leading CFD code for complex geometries with moving boundaries. Start using CONVERGE and never make a CFD mesh again. CONVERGE
Information Processing, Big Data, and the Cloud
Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of
LOAD BALANCING TECHNIQUES
LOAD BALANCING TECHNIQUES Two imporatnt characteristics of distributed systems are resource multiplicity and system transparency. In a distributed system we have a number of resources interconnected by
Visualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
Introduction to ANSYS ICEM CFD
Lecture 6 Mesh Preparation Before Output to Solver 14. 0 Release Introduction to ANSYS ICEM CFD 1 2011 ANSYS, Inc. March 22, 2015 Mesh Preparation Before Output to Solver What will you learn from this
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT
MINIMIZING STORAGE COST IN CLOUD COMPUTING ENVIRONMENT 1 SARIKA K B, 2 S SUBASREE 1 Department of Computer Science, Nehru College of Engineering and Research Centre, Thrissur, Kerala 2 Professor and Head,
Optimizing Load Balance Using Parallel Migratable Objects
Optimizing Load Balance Using Parallel Migratable Objects Laxmikant V. Kalé, Eric Bohm Parallel Programming Laboratory University of Illinois Urbana-Champaign 2012/9/25 Laxmikant V. Kalé, Eric Bohm (UIUC)
Data Warehousing und Data Mining
Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data
Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
Rapid Design of an optimized Radial Compressor using CFturbo and ANSYS
Rapid Design of an optimized Radial Compressor using CFturbo and ANSYS Enrique Correa, Marius Korfanty, Sebastian Stübing CFturbo Software & Engineering GmbH, Dresden (Germany) PRESENTATION TOPICS 1. Company
Performance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
