A Load Balancing Tool for Structured Multi-Block Grid CFD Applications



K. P. Apponsah* and D. W. Zingg**
University of Toronto Institute for Aerospace Studies (UTIAS), Toronto, ON, M3H 5T6, Canada
Email: kwesi@oddjob.utias.utoronto.ca

* MASc Candidate, UTIAS.
** Professor and Director, UTIAS. Tier 1 Canada Research Chair in Computational Aerodynamics and Environmentally Friendly Aircraft. J. Armand Bombardier Foundation Chair in Aerospace Flight.

ABSTRACT

Large-scale computational fluid dynamics (CFD) simulations have benefited from recent advances in high performance computers. Solving a CFD problem on 100 to 1000 processors is becoming commonplace. Multi-block grid methodology together with parallel computing resources has improved turnaround times for high-fidelity CFD simulations about complex geometries. The inherent coarse-grain parallelism exhibited by multi-block grids allows the solution process to proceed in separate sub-domains simultaneously. For a solution process that takes advantage of coarse-grain parallelism, domain decomposition techniques such as the Schur complement or Schwarz methods can be used to reduce the solution of the whole domain to solving smaller problems in parallel, one partition per processor. We present a static load balancing algorithm implemented for a parallel Newton-Krylov solver which solves the Euler and Reynolds-Averaged Navier-Stokes equations on structured multi-block grids. We show the performance of our basic load balancing algorithm and also perform scaling studies to help refine the basic assumption upon which the algorithm was developed, in order to reduce the cost of computation and improve the workload distribution across processors.

Key words: high performance computers, computational fluid dynamics, Newton-Krylov, load balancing, multi-block grid applications, domain decomposition

1 INTRODUCTION

CFD has grown from a purely scientific discipline to become an important tool for solving fluid dynamics problems in industry.
The use of CFD in solving fluid dynamics problems has become attractive due to its cost-effectiveness and its ability to reproduce physical flow phenomena quickly compared to performing experiments [3]. However, for large-scale and complex simulations, massively parallel computers are required to achieve low turnaround times. High performance or massively parallel computers are computer systems approaching the teraflop (10^12 floating point operations per second, FLOPS) regime. Generally, the expectation is that such high-end computing platforms will reduce computational time and also provide access to the large amounts of memory usually required for large-scale CFD applications.

For multi-block grid applications, blocks can either be of the same size or of different sizes. Therefore, to achieve a good static load balance, several blocks can be mapped onto a single processor by some algorithm dictated by the strategy employed for the CFD solution process. The objective of this paper is to develop an automatic tool to split blocks and partition the workload as evenly as possible across processors, in order to derive maximum benefit from a high-end computing platform.

2 NUMERICAL METHODOLOGY

In this section, we highlight the key features of our flow solver. We explain the motivation for our choice of grids and finite-difference discretization strategy. Furthermore, we give an overview of the numerical solution method and parallel implementation considerations for our solver.

2.1 Multi-block Grids and Discretization

As the complexity of the geometry increases, single-block structured grids become insufficient for obtaining accurate numerical solutions to the flow equations. Multi-block grids, either homogeneous or heterogeneous at the block level but structured within each block, provide geometric flexibility while preserving the computational efficiency of finite-difference methods. In addition, multi-block grids allow for direct parallelization of both grid generation and the flow codes on massively parallel computers [9].

Multi-block finite-difference methods also present some challenges that must be addressed. For example, for a CFD algorithm that uses the same interior scheme to discretize nodes at block interfaces through the use of ghost or halo nodes, the mesh must be sufficiently smooth in order to retain the desired accuracy. To eliminate the challenges associated with block interfaces and exceptional points, we use summation-by-parts (SBP) finite-difference operators [1, 2, 4] and simultaneous approximation terms (SATs) [1, 2, 4] for the discretization. An SBP operator is essentially a centered difference scheme with a specific boundary treatment that mimics the behaviour of the corresponding continuous operator with regard to certain properties. We discretize the Euler and Reynolds-Averaged Navier-Stokes (RANS) equations independently on each block of a multi-block grid using the SBP operators. The domain interfaces are treated as boundaries, and the associated conditions, together with those for the other boundary types, are enforced using SATs. For further information on SBP operators and SATs, we direct the reader to the literature on these topics [6, 11].

2.2 Solution Method and Parallel Implementation Considerations

We use an inexact Newton method [12] to solve the nonlinear equations resulting from the discretization of the steady Euler and RANS equations.
The resulting sparse linear system at each outer iteration is solved inexactly using a preconditioned Krylov solver, the flexible generalized minimal residual method (FGMRES) [7]. An approximate Schur preconditioner [8] is used during the Krylov solution process, with slight modifications to the original technique, primarily to improve CPU time [4]. For domain decomposition, we assign one block to a processor, or combine blocks to form a partition on each processor when the number of blocks exceeds the number of processors. On each partition the following main operations in our solution process are parallelized: block incomplete LU factorization (BILU(k)), inner products, and the matrix-vector products required during the Krylov stages. For example, to compute a global inner product, we first compute local inner products in parallel and then sum the components local to each processor using the MPI command MPI_Allreduce(). With regard to communication, we use the MPI non-blocking communication commands for data exchanges. This approach has the advantage of partially hiding communication time: since the program does not wait while data is being sent or received, useful work that does not require the exchanged data can proceed.

Previously, our ability to obtain a good static load balance was restricted to homogeneous multi-block grids and to processor counts that are factors of the number of blocks present in a grid. In order to accommodate more general grids and offer more flexibility in the number of processors, an automatic load balancing tool was needed.

3 LOAD BALANCING TECHNIQUE

The load balancing algorithm implemented is based on the coarse-grain parallelism approach: entire blocks are assigned to a processor during computation. The load balancing tool proposed is static, that is, block splitting and workload partitioning are done before the flow solution begins.
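The global inner product pattern described in Section 2.2 can be mocked up serially: per-rank partial dot products stand in for the parallel computation, and a plain sum stands in for the MPI_Allreduce reduction. This is only an illustrative sketch; the authors' solver uses MPI directly and is not Python-based.

```python
import numpy as np

def global_inner_product(u, v, n_ranks=4):
    # Each "rank" owns one contiguous slice of the distributed vectors.
    u_parts = np.array_split(u, n_ranks)
    v_parts = np.array_split(v, n_ranks)
    # Local inner products, computed independently per rank ...
    partial = [float(np.dot(a, b)) for a, b in zip(u_parts, v_parts)]
    # ... then combined; in the solver this is MPI_Allreduce with MPI_SUM.
    return sum(partial)
```

The result is independent of how the vectors are partitioned, which is why the decomposition can be chosen purely for load balance.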
In addition, the tool can handle cases where the number of blocks is less than the number of processors.

3.1 Workload Distribution Strategy

The underlying assumption is that the CPU time per node remains constant. Blocks are sorted in descending order of size, and at every instance of block assignment the processor with the minimum work (nodes) receives a block [5]. The processors which do not meet the load balance threshold set have their largest block split. Two constraints are used to control block splitting. The first is the block size constraint, f_b, defined as

    f_b = w_i / min(w_i^(o)),   i = 1, 2, ..., n_b,                (1)

where w_i is the number of nodes in block i, n_b is the number of blocks, and w_i^(o) is the number of nodes in block i of the initial mesh. The block size constraint is enforced when the number of blocks is less than the number of processors and blocks vary in size. It restricts a large block from being greater than the smallest block in the initial mesh by more than a factor specified by the user.

The second constraint is the workload threshold factor, f_t, defined as

    f_t = max(w_j) / w_avg,   j = 1, 2, ..., n_p,                  (2)

where w_avg = (1/n_p) * sum_{i=1}^{n_b} w_i and n_p is the number of processors. The workload threshold factor ensures that the workload per processor does not exceed the average workload by more than the factor specified by the user.

We also define a correction factor, c, as

    c = w_avg^(f) / w_avg^(o),                                     (3)

where w_avg^(o) and w_avg^(f) are the workload averages during the initial and final load balance iterations, respectively. The parameter c measures the growth in the total number of nodes caused by block splitting, i.e., the reduced volume-to-surface ratio of the blocks. For problems that are difficult to load balance (i.e., a small f_t and a small number of blocks compared to processors) we observe that c is high. In the next section we provide further insight into the effects of this parameter on the parallel scaling of problems that use the load balancing tool developed.

We present results for a homogeneous grid with 64 blocks for the ONERA M6 wing and a heterogeneous grid with 569 blocks for the Common Research Model wing-body (CRM) used for the Fourth Drag Prediction Workshop. Table 1 shows the flow conditions and some properties of the multi-block grids used.

    Property    ONERA M6 (Euler)   CRM (RANS)
    n_b         64                 569
    block min   37 x 37 x 37       11 x 11 x 15
    block max   37 x 37 x 37       44 x 41 x 43
    alpha       3.06               2.356
    Re          -                  5.0 x 10^6
    M           0.699              0.85

Table 1: Mesh properties and flow conditions for the test cases.

3.2 Block-splitting Tool

The block-splitting tool is based on recursive edge bisection (REB) [10]. Blocks are split halfway, or at the nearest halfway point, along the index direction with the largest number of nodes. This tool can also be used independently of the main load balancing loop.
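The workload distribution strategy and the block splitting of Section 3 can be sketched as follows. This is a loose paraphrase, not the authors' implementation: it uses a min-heap for the least-loaded processor, includes the blocks-to-processors exit criterion b_r (introduced later in the paper), and omits the block size constraint f_b for brevity.

```python
import heapq

def nodes(dims):
    """Total number of grid nodes in a block with the given dimensions."""
    n = 1
    for d in dims:
        n *= d
    return n

def split_block(dims):
    """Recursive edge bisection: split at the (nearest) halfway point along
    the index direction with the most nodes.  The interface plane is shared,
    so an n-node edge yields pieces of (n+1)//2 and n - (n+1)//2 + 1 nodes."""
    d = max(range(len(dims)), key=lambda k: dims[k])
    a, b = list(dims), list(dims)
    a[d] = (dims[d] + 1) // 2
    b[d] = dims[d] - a[d] + 1
    return tuple(a), tuple(b)

def balance(blocks, n_p, f_t=1.10, b_r=None):
    """Greedy static load balance: assign blocks (largest first) to the
    least-loaded processor; split the largest block of any processor whose
    load exceeds f_t times the average workload; optionally stop once the
    blocks-to-processors ratio reaches b_r."""
    blocks = list(blocks)
    while True:
        blocks.sort(key=nodes, reverse=True)
        w_avg = sum(nodes(b) for b in blocks) / n_p
        heap = [(0, p, []) for p in range(n_p)]    # (load, rank, owned blocks)
        for blk in blocks:
            load, p, owned = heapq.heappop(heap)   # least-loaded processor
            heapq.heappush(heap, (load + nodes(blk), p, owned + [blk]))
        overloaded = [owned for load, _, owned in heap if load > f_t * w_avg]
        if not overloaded or (b_r is not None and len(blocks) >= b_r * n_p):
            return [owned for _, _, owned in sorted(heap, key=lambda t: t[1])]
        for owned in overloaded:                   # split the largest block
            big = max(owned, key=nodes)            # on each overloaded rank
            blocks.remove(big)
            blocks.extend(split_block(big))
```

For example, `balance([(37, 37, 37)] * 64, 128)` splits every block once, leaving one block per processor, consistent with the perfect balance reported in Section 4.1 when the processors-to-blocks ratio is a power of 2.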
4 RESULTS

In this section, we discuss some of the results obtained so far and suggest possible improvements to the basic algorithm with regard to our flow solution algorithm. Lastly, we show how load balancing can improve turnaround times for CFD problems, especially in the case of heterogeneous multi-block grids.

4.1 Homogeneous Multi-block Grids

We refer to grids with blocks of the same size throughout the domain as homogeneous multi-block grids. We consider two cases based on the initial number of processors used for the scaling studies, since we extrapolate the ideal time from this starting point. We define an ideal time for a constant problem size as

    t_{p,ideal} = t_1 / n_p,                                       (4)

where t_1 is the computational time if the problem is run on a single processor (the serial case) and n_p is the number of processors. Since we do not run any of the cases considered on a single processor, we modify equation (4) slightly to

    t_{p,ideal} = t_p^(o) n_p^(o) / n_p,                           (5)

where t_p^(o) is the execution time for the smallest number of processors, n_p^(o), in the scaling studies group. It should be noted that t_{p,ideal} does not account for the changing size (measured with c) of each problem.

In Figure 1, we show a combined scaling diagram for the two distinct cases considered, including the ideal cases based on the two extrapolation points, 16 and 180 processors respectively.

Figure 1: Execution times obtained for 16, 30, 48, 72, 128, 180, 240, 250, 320, 460, 500, 580, 640, 720, 760, 840, 960 and 1024 processors.

From Figure 1 we observe that for 128, 320, 640, 720, 960 and 1024 processors we obtained lower execution times than for the neighbouring processor counts. One would expect a lower turnaround time for 180 processors than for 128, and for 460 processors than for 320, but this is not the case. This anomaly can be attributed to the fact that 128, 320, 640, 720, 960 and 1024 are multiples of the initial number of blocks (64). If the ratio of the number of processors to the number of initial blocks is a power of 2, then our block splitting strategy yields a perfect load balance; hence for 128 and 1024 processors we obtain a perfect load balance. In the cases (320, 640, 720 and 960 processors) where the processors-to-blocks ratio is not a power of 2, we still obtain a near-perfect load balance compared to the subsequent processor counts, hence recording a low turnaround time. Therefore, for homogeneous multi-block grids, low turnaround times and better performance can be achieved when the number of processors is a multiple of the number of blocks.

We show the parallel efficiency of the processor counts that are multiples of the number of blocks in Tables 2 and 3. Table 2 is computed with respect to the time measured for 16 processors, while for Table 3 we use 180 processors.

    Processors   f_t=1.10   f_t=1.20   f_t=1.40
    128          69%        69%        69%
    320          40%        40%        53%
    640          29%        29%        39%
    720          33%        33%        33%
    960          21%        25%        25%
    1024         28%        29%        29%

Table 2: Parallel efficiency for 128, 320, 640, 720, 960 and 1024 processors using 16 processors as the extrapolation point for the ideal time.

    Processors   f_t=1.10   f_t=1.20   f_t=1.40
    128          -          -          -
    320          80%        79%        104%
    640          58%        58%        77%
    720          66%        66%        66%
    960          42%        49%        49%
    1024         56%        57%        58%

Table 3: Parallel efficiency for 128, 320, 640, 720, 960 and 1024 processors using 180 processors as the extrapolation point for the ideal time.

4.1.1 Case I: n_b^(o) > n_p^(o)

For this case our initial 64-block mesh is started with 16 processors. We consider three different load balancing thresholds: 1.10, 1.20 and 1.40. Figure 2 shows how our code scales as we increase the number of processors.
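The efficiencies of Tables 2 and 3 follow directly from equation (5): parallel efficiency is the extrapolated ideal time divided by the measured time. A minimal helper, with made-up timings for illustration, might look like:

```python
def ideal_time(t_ref, p_ref, p):
    """Equation (5): extrapolate the ideal time from the smallest run,
    t_{p,ideal} = t_ref * p_ref / p."""
    return t_ref * p_ref / p

def efficiency(t_ref, p_ref, t_measured, p):
    """Parallel efficiency relative to the extrapolated ideal time."""
    return ideal_time(t_ref, p_ref, p) / t_measured
```

For instance, if a hypothetical 16-processor run took 1000 s, the ideal 128-processor time would be 1000 * 16 / 128 = 125 s, and a measured time of about 181 s would give the 69% efficiency reported for 128 processors in Table 2.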
Since the number of initial blocks is small compared to the larger processor counts, blocks have to be split several times in order to meet the workload threshold set. In cases where a low f_t is set, or the number of processors far exceeds the number of initial blocks, a high c results from the increased number of block splittings. For example, for 250 processors c reaches 1.15, as seen in Figure 3, when f_t = 1.10. This means the number of computational nodes in the initial grid increases by 15% after the final load balancing iteration. Given our parallel solution process, a high c results in longer preconditioning time and a higher number of Krylov iterations, as observed in Figures 4 and 5. During the Schur preconditioning stages we obtain approximate solution values for all the internal interface nodes, and the approximate solution of the Schur complement system is accelerated using GMRES. Consider a single block of size 37 x 37 x 37 which is split in two in all three index directions, resulting in 8 blocks: we record a c value of approximately 1.08. This 8% increase in the size of the problem results in a 100% increase in the number of interface nodes, translating into a bigger Schur complement problem and therefore significantly increased preconditioning time. Figure 4 shows the CPU time spent in the preconditioning stages for the cases in Figure 2.

In [4], Hicken and Zingg showed that the number of Krylov iterations required to solve the linear system at each inexact-Newton iteration of our flow solver using the approximate Schur preconditioning technique is independent of the number of processors, but dependent on the size of the problem. The implication of this conclusion for our load balancing tool is that the total number of Krylov iterations required will remain the same for any two or more problems that have the same c, irrespective of the f_t set and the number of processors used. For instance, in Figure 3, for 72 and 250 processors with f_t set to 1.10 and 1.40 respectively, the problem size is the same with a c of 1.08, and we record approximately the same number of Krylov iterations. The value of c grows steadily as we increase the number of processors, due to the small number of blocks started with, which in turn leads to increased Krylov iterations and increased computational time.

To help improve the scalability of our code for difficult load balancing problems, we introduce a new exit criterion for our main load balancing loop. Knowing that a small f_t ensures a good static load balance but does not necessarily improve the scalability of our code in terms of overall CPU time, Krylov iterations and preconditioning time, we relax the strict enforcement of f_t by allowing it to operate only while a blocks-to-processors ratio constraint (b_r) has not been met. This new constraint is also set by the user. Once the blocks-to-processors ratio for an iteration is equal to or greater than b_r, the main load balancing loop is exited even if f_t has not been met. The lines in Figures 2, 4 and 5 labelled "new" show the effect of introducing this constraint. We show results only for b_r = 2.0 and 30, 48, 72, 250 and 500 processors; these processor counts had b_r greater than 2.0 after the final iterations of the initial runs without the b_r constraint. Table 4 shows the percentage reductions obtained in overall CPU time, preconditioning time and Krylov iterations for the large processor counts, which are of most interest in these scaling studies.

    Processors   CPU time   Precon. time   Krylov its.
    72           12%        20%            8%
    250          5%         13%            7%
    500          6%         15%            10%

Table 4: Percentage reductions obtained after the introduction of b_r = 2.0.
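The 37 x 37 x 37 example above can be checked with a few lines of arithmetic: splitting once in each index direction yields eight 19 x 19 x 19 blocks (the shared interface planes are duplicated), and counting nodes reproduces both the c of about 1.08 and the roughly 100% growth in interface (surface) nodes.

```python
def nodes(dims):
    """Total nodes in a structured block."""
    n = 1
    for d in dims:
        n *= d
    return n

def surface_nodes(dims):
    """Nodes on the faces of a block (total minus strict interior)."""
    interior = 1
    for d in dims:
        interior *= max(d - 2, 0)
    return nodes(dims) - interior

before, after = (37, 37, 37), (19, 19, 19)
c = 8 * nodes(after) / nodes(before)                       # ~1.08
growth = 8 * surface_nodes(after) / surface_nodes(before)  # ~2.0, i.e. +100%
```

Here 8 * 19^3 = 54872 versus 37^3 = 50653 gives c of about 1.08, while the surface-node count roughly doubles, which is the growth of the Schur complement problem described in the text.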
Figure 2: Execution times obtained for 16, 30, 48, 72, 128, 250, 500, 720 and 1024 processors when n_b^(o) > n_p^(o) with varying f_t.

Figure 3: c values obtained for the cases in Figure 2.

Figure 4: Preconditioning time obtained for the cases in Figure 2.

Figure 5: Krylov iterations obtained for the cases in Figure 2.

4.1.2 Case II: n_b^(o) < n_p^(o)

We consider this case mainly to draw the reader's attention to how the overall CPU time scales with respect to the initial number of blocks, in comparison with the processor count used to extrapolate the ideal CPU time. Figure 6 shows the scaling performance. In the previous case, a perfect static load balance was obtained for the smallest number of processors (16) without any block splitting; in this case, some block splitting was required even for the smallest number of processors (180). Although the case with f_t = 1.40 tends to scale well compared to the other cases, we still observe poor scaling with an increasing number of processors.

Further investigation into the use of different parallel preconditioning techniques is being considered. In a previous study using the Schwarz preconditioner, we found that for cases where the ratio of the number of blocks to the number of processors is high, Schwarz preconditioning improves CPU time compared to the Schur approach. We intend to study this direction further, to take advantage of either preconditioning technique for additional improvements in CPU time. Figure 7 shows the turnaround times observed in our preconditioner comparison study using a 16-block ONERA M6 mesh. The initial grid was split into 32, 64, 128, 256, 512 and 1024 blocks, and all cases were run on 16 processors using the Euler equations. The ideal case is extrapolated by multiplying the time for 16 blocks by the increase in problem size, c.

Figure 6: Execution times obtained for 180, 240, 320, 460, 580, 640, 760, 840 and 960 processors when n_b^(o) < n_p^(o) with varying f_t.
Figure 7: A comparison of execution times for the Schur and Schwarz parallel preconditioners with varying block counts on 16 processors.

4.2 Heterogeneous Multi-block Grids

We refer to grids with blocks of different sizes as heterogeneous multi-block grids. The results for this case were obtained with the CRM grid using the RANS equations with the Spalart-Allmaras one-equation turbulence model. The case considered in this section supports our view that a load balancing tool can significantly improve turnaround times for large-scale problems. We first run the initial 569 blocks on 569 processors without any load balancing; the average of three such runs is shown with the * symbol in Figure 8. We then run the case with load balancing, setting f_t to 1.10, 1.20 and 1.40, on 190, 280, 370, 450, 560 and 740 processors. We observe that with load balancing we obtain lower turnaround times than the CPU time measured without load balancing, even on fewer than 569 processors.

Figure 8: Computational time obtained for the heterogeneous multi-block grid. The parallel efficiency for 740 processors with f_t = 1.20 is 68%.

5 CONCLUSIONS

We described our basic load balancing algorithm and presented results on workload distribution and computational times recorded for the different processor counts considered. In Figures 1, 2 and 6 we showed the scaling performance of our Newton-Krylov CFD code and how the extrapolation point for computing ideal CPU times affects the scaling diagrams. To improve the CPU time for computation and preconditioning, and the number of Krylov iterations, we introduced a new constraint to control block splitting during the use of our load balancing tool; the improvements gained from this constraint are shown in Figures 4 and 5 and Table 4. In Figure 7 we showed the impact of block splitting on CPU time for different preconditioning techniques. Finally, Figure 8 shows how the automatic load balancing tool can lead to reduced turnaround times for large-scale multi-block CFD applications with heterogeneous meshes.
APPENDIX A: LOAD BALANCING ALGORITHM

W_nb = (w_1, w_2, ..., w_nb), sorted in descending order
p_x = 0
repeat
    w_avg = (sum_{i=1}^{n_b} w_i) / n_p
    w_p,j = 0, j = 1, ..., n_p
    for i = 1 to n_b do
        w_p,min(j) = w_p,min(j) + w_i
    end for
    for j = 1 to n_p do
        w_r,j = w_p,j / w_avg
        if w_r,j > f_t then
            p_x = p_x + 1
        end if
    end for
    if p_x > 0 then
        split the largest block on each of the p_x processors
        n_b = n_b + p_x
        W_nb = W_nb,new
    end if
until p_x = 0 or n_b / n_p >= b_r

where
    n_b      = number of blocks in the grid file
    n_p      = number of processors specified
    W_nb     = set of initial block sizes, sorted in descending order
    W_nb,new = set of new block sizes, sorted in descending order
    w_i      = block workload
    w_p,j    = processor workload
    w_r,j    = processor workload ratio
    p_x      = number of processors with workload exceeding the load balance threshold
    f_t      = load balance threshold specified
    b_r      = blocks-to-processors ratio
    min(j)   = index of the processor with the minimum workload

ACKNOWLEDGEMENTS

The authors gratefully acknowledge financial assistance from the Mathematics of Information Technology and Complex Systems (MITACS) network, the Canada Research Chairs Program and the University of Toronto. All computations were performed on the GPC supercomputer at the SciNet HPC Consortium. (SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; the Ontario Research Fund - Research Excellence; and the University of Toronto.)

REFERENCES

[1] M. Osusky, J. E. Hicken and D. W. Zingg. A Parallel Newton-Krylov-Schur Flow Solver for the Navier-Stokes Equations Discretized Using the SBP-SAT Approach. 48th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition, AIAA 2010-116, 2010.

[2] J. E. Hicken. Efficient Algorithms for Future Aircraft Design: Contributions to Aerodynamic

Shape Optimization. PhD Thesis, University of Toronto, 2009.

[3] N. Gourdain, L. Gicquel, M. Montagnac, O. Vermorel, M. Gazaix, G. Staffelbach, M. Garcia, J.-F. Boussuge and T. Poinsot. High Performance Parallel Computing of Flows in Complex Geometries: I. Methods. Computational Science and Discovery, 2:015003 (26pp), 2009.

[4] J. E. Hicken and D. W. Zingg. Parallel Newton-Krylov Solver for the Euler Equations Discretized Using Simultaneous-Approximation Terms. AIAA Journal, 46(11):2773-2786, 2008.

[5] K. Sermeus, E. Laurendeau and F. Parpia. Parallelization and Performance Optimization of Bombardier Multi-Block Structured Navier-Stokes Solver on IBM eServer Cluster 1600. 45th AIAA Aerospace Sciences Meeting and Exhibit, AIAA 2007-1109, 2007.

[6] K. Mattsson and J. Nordström. Summation by Parts Operators for Finite Difference Approximations of Second Derivatives. Journal of Computational Physics, 199:503-540, 2004.

[7] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2nd ed., 2003.

[8] Y. Saad and M. Sosonkina. Distributed Schur Complement Techniques for General Sparse Linear Systems. SIAM Journal on Scientific Computing, 21:1337-1357, 1999.

[9] J. Häuser, P. Eiseman, Y. Xia and Z. Cheng. Parallel Multiblock Structured Grids. In Handbook of Grid Generation, Chapter 12. CRC Press, 1998.

[10] A. Ytterström. A Tool for Partitioning Structured Multi-Block Meshes for Parallel Computational Mechanics. International Journal of Supercomputer Applications, 11(4):336-343, 1997.

[11] B. Strand. Summation by Parts for Finite Difference Approximations for d/dx. Journal of Computational Physics, 110:47-67, 1994.

[12] T. H. Pulliam. Solution Methods in Computational Fluid Dynamics. Technical report, NASA Ames Research Center, 1992.