Suggar++: An Improved General Overset Grid Assembly Capability


19th AIAA Computational Fluid Dynamics Conference, June 2009, San Antonio, Texas

Ralph W. Noack, David A. Boger, and Robert F. Kunz
Applied Research Laboratory, The Pennsylvania State University, USA

Pablo M. Carrica
IIHR-Hydroscience and Engineering, University of Iowa, USA

The overset, or chimera, grid methodology utilizes a set of overlapping grids to discretize the solution domain. Parallel computation of the overset domain connectivity information (DCI) poses unique problems relative to the typical flow solver. The present paper investigates the parallel computation of the DCI within the context of Suggar++, a general capability for obtaining the overset domain connectivity information. Significant improvements to the donor search process are also presented that reduce the time required during the donor search phase. The effect of grid partitioning on the overset work is investigated and demonstrated to increase that work. A new approach to partitioning for the overset domain connectivity assembly process is proposed and demonstrated to avoid the increase in work incurred by conventional partitioning. The parallel execution of Suggar++ is examined in detail and is found to provide significantly improved parallel performance relative to the current capability in SUGGAR. Finally, a new approach to eliminating orphans in tight gap regions, in which the overset grid assembly process identifies solver locations to be treated as immersed boundary points, is also presented.

I. Introduction

The overset, or chimera, grid methodology [1] utilizes a set of overlapping grids to discretize the solution domain. The component grids are fitted to portions of the geometry without regard for other portions of the geometry, which greatly simplifies the grid generation process. The result is a flexible computational simulation framework that can be an enabling force in many situations. It has been widely used to simplify the structured grid generation requirements for complex geometries. The use of an overset grid system is also an enabling technology for the simulation of bodies in relative motion, such as store separation [2, 3] and rotorcraft [4].

The overlapping grid system must be processed to form a composite grid where the solution from one grid is linked to the solution on an overlapping grid. This requires identification of hole points in a component grid that are outside the domain of interest and should be excluded from the flow solver computations. Examples of hole points (commonly called OUT points) are points inside a body or behind a symmetry plane. The grid points adjacent to the holes become intergrid boundary points, termed receptor or fringe points.

The marking of holes and the surrounding fringe boundaries forms the first phase of an overset grid assembly process. The boundary values required by the flow-field solution at the fringe points are obtained by interpolating the solution from appropriate donor elements using information from other grids that overlap the region. The primary innovation of the overset method is the accommodation of holes within grids [5], as it is the holes and their boundaries within a grid that allow the solution from one overlapping grid to be coupled to the solutions on other grids to form a single composite grid and solution. The domain connectivity information that, along with the component grids, forms the overset composite grid requires identification of the hole and fringe points along with donor elements consisting of interpolation stencil members and weights.

Several different overset grid assembly codes are currently available with varying abilities for general use. Overflow and Beggar [6] each have an internal overset capability that cannot be readily used by other flow solvers. Pegasus 5 [7] is a stand-alone overset grid assembly capability that is restricted to node-centered structured grids. SUGGAR [8-10] is a general capability with the ability to assemble any combination of structured, unstructured, general polyhedral, and octree-based Cartesian unstructured grids for node- and/or cell-centered solver formulations. SUGGAR has enabled new overset simulation capabilities with a number of different flow solvers but has some significant weaknesses. The hole-cutting process in SUGGAR uses an octree-based Cartesian approximation to the geometry that can require excessive amounts of memory when a highly accurate hole cut is required. Balancing the memory usage against the accuracy of the approximation requires the user to iteratively adjust input parameters, which is undesirable. Improved serial and parallel execution performance is also needed but would be difficult to achieve with the current code architecture.

In an effort to provide an improved overset grid assembly capability, a new code is currently under development with the goal of providing the same general capability as SUGGAR to operate on all grid topologies for node- and/or cell-centered flow solver formulations. Specific improvements include a significant decrease in wall clock time using better algorithms and an improved hole-cutting procedure that is more accurate and easier to use while maintaining similar performance. The new code, called Suggar++, is written from scratch in C++. The hole-cutting procedure has been previously described in Reference 11. This paper will focus on two new improvements within Suggar++. The major portion of the paper will focus on the parallel execution capability within Suggar++. A detailed examination of the work involved in computing the overset domain connectivity will be presented. A new scheme for decomposing the system of grids for parallel execution will be presented and compared with a partitioning based upon a conventional approach. A detailed comparison of the parallel performance of Suggar++ for the two different decompositions will be presented, along with a comparison of parallel performance with SUGGAR. Finally, a new approach to eliminating orphans due to insufficient overlap will be presented.

II. Overset Hole Cutting

The hole-cutting procedure is a critical element of the typical overset grid assembly process. It is desirable for the hole-cutting process to be fast, efficient, and require a minimal amount of user input and control.
Many different methodologies have been developed, and each approach has a different level of complexity and execution speed. Hole-cutting techniques can be broadly classified as explicit cutters, query cutters, and direct cutters. Explicit cutters simply mark a user-specified range of nodes or cells as OUT. Query cutters mark as OUT any nodes or cells that fall within a specified region of space, where the space may be defined by user-specified analytical shapes, such as a sphere, cylinder, or bounding box, or by some other approximate representation of the geometry volume. The octree-based Cartesian approximation to the geometry used in SUGGAR is an example of a query cutter and has a significant number of weaknesses, such as the memory required when highly accurate cutting is needed. Finally, direct cutters directly intersect cells of the grid being cut with the faces of the cutting geometry. Since the intersection of the cutting faces in the direct cutter only defines the perimeter of the hole, a subsequent flood/fill operation is required to mark all the locations inside the body as OUT.
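To make the query-cutter idea concrete, a minimal axis-aligned bounding-box cutter might look like the sketch below. This is an illustration only, assuming a flat list of node coordinates; the type and member names are hypothetical and are not taken from Suggar++.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical illustration of a query cutter: an axis-aligned bounding box
// that marks every node falling inside it as OUT (a hole point).
enum class PointStatus { Active, Out };

struct BoxQueryCutter {
    std::array<double, 3> lo;   // minimum corner of the cutting box
    std::array<double, 3> hi;   // maximum corner of the cutting box

    bool contains(const std::array<double, 3>& p) const {
        for (int d = 0; d < 3; ++d)
            if (p[d] < lo[d] || p[d] > hi[d]) return false;
        return true;
    }

    // Apply the cutter to a grid given as a flat list of node coordinates.
    void apply(const std::vector<std::array<double, 3>>& nodes,
               std::vector<PointStatus>& status) const {
        for (std::size_t i = 0; i < nodes.size(); ++i)
            if (contains(nodes[i])) status[i] = PointStatus::Out;
    }
};
```

Explicit cutters are even simpler (a raw index range), while direct cutters, described next, work from the actual cutting-surface faces rather than an analytical region.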

In Suggar++, the primary hole-cutting approach is a direct cutter technique. An overview of this technique is provided in this section; more details can be found in Reference 11. Explicit cutters and query cutters based upon analytical shapes are also available. An example of a commonly used query cutter is the plane, which can be used when the computational domain includes one or more planes of symmetry.

The direct-cutting operation begins with a search for the grid cells containing each of the nodes within the cutting surface. These containing cells are marked as OUT, and the connected edges are tested for intersection with the cutting surface faces. The intersection of the cutting faces with the grid being cut traces an outline of the cutting geometry into that grid but does not explicitly mark all the locations inside the geometry as OUT. To accomplish this, a flood/fill operation is required after the cutting phase is completed to mark the points that lie in the interior of the bodies as OUT (i.e., outside the domain of interest). The fill operation is initiated by marking as OUT some point that is known to be inside the geometry. Each point marked as OUT recursively sets its neighbors to OUT. The fill does not propagate to neighbors across the links broken by the cutting process, which ultimately halts the process. However, if the cut is not complete and watertight, the fill operation will leak out of the actual geometry and erroneously mark the entire grid as OUT. A robust cutting method is therefore critical to the success of the direct cut method.

The direct cutter uses all the faces on surfaces marked as solid and provides a minimal hole cut, which is vital in grid systems with body components in close proximity. In many cases, a less accurate hole cut is acceptable. For example, Figure 1 shows the Hart grid system, where it can be seen that there is significant clearance between the fuselage and the blades. Figure 4 shows the fine detail captured by the surface grid for each blade, resulting in a large number (about 33,000) of surface triangles on the blade surface. Also shown in the figure is a bounding box for the blade. In this case, the bounding box provides an acceptable cutting geometry comprised of only six quadrilateral faces, which can provide a significant decrease in computational effort.

III. Donor Search

The overset domain connectivity requires an interpolation source, or donor, for each point identified as a fringe or intergrid boundary point. The methods used to find these donors are therefore a critical component of the overset grid assembly process, and the efficiency of these donor search methods is made even more important by the fact that they are employed in the direct-cutter hole-cutting process as well. As a result, Suggar++ utilizes search routines tailored to different grid topologies, aspects of which are described next.

A. Cartesian Grid

The search for a donor in a Cartesian grid is very fast relative to other grid topologies. The donor element containing a point can be found by direct calculation in a uniform Cartesian grid. A stretched Cartesian grid can have independent non-uniform spacing distributions in the X, Y, and Z directions, and simple table lookups in each direction are used to quickly locate the element containing a fringe point with a specified (X, Y, Z) value.
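As an illustration of this table lookup (a sketch only, assuming monotonically increasing coordinate arrays; the names are not Suggar++ source), the containing interval in each direction can be found with a binary search:

```cpp
#include <algorithm>
#include <vector>

// Locate the index of the interval [x[i], x[i+1]] that contains value v
// in a monotonically increasing coordinate array; returns -1 if outside.
static int findInterval(const std::vector<double>& x, double v) {
    if (v < x.front() || v > x.back()) return -1;
    // upper_bound gives the first entry greater than v
    auto it = std::upper_bound(x.begin(), x.end(), v);
    int i = static_cast<int>(it - x.begin()) - 1;
    return std::min(i, static_cast<int>(x.size()) - 2);
}

// A stretched Cartesian grid has independent spacing in each direction,
// so the donor element (i, j, k) follows from three 1-D lookups.
struct StretchedCartesianGrid {
    std::vector<double> x, y, z;  // node coordinates in each direction

    bool findDonorCell(double px, double py, double pz,
                       int& i, int& j, int& k) const {
        i = findInterval(x, px);
        j = findInterval(y, py);
        k = findInterval(z, pz);
        return i >= 0 && j >= 0 && k >= 0;
    }
};
```

For a uniform Cartesian grid the lookup collapses to a direct index calculation, with no search at all.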
B. Curvilinear Structured Grid

The donor search procedure in a curvilinear structured grid uses a standard non-linear Jacobian stencil-jumping procedure. A node-centered donor will interpolate using values and weights at the nodes of an element. These weights are formed in terms of a local (ξ, η, ζ) coordinate system: (x, y, z) = F(ξ, η, ζ), where F(ξ, η, ζ) is a non-linear system of equations. Given the (x, y, z) values of a fringe location, the system of equations is solved to find the corresponding (ξ, η, ζ) values. By assuming that the Jacobian of F(ξ, η, ζ) is locally constant, the procedure can jump across many elements to quickly find the correct donor element even when a poor starting point is used.
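In outline, this is a Newton-style update (a sketch only; the local-coordinate convention below is an assumption, not taken from Suggar++). With x_f the fringe location and ξ = (ξ, η, ζ),

ξ^{k+1} = ξ^k + J^{-1} ( x_f - F(ξ^k) ),   where J = ∂F/∂ξ is evaluated in the current element.

If a component of ξ^{k+1} falls outside the local coordinate range of the current element, the overshoot indicates how many elements to jump in that index direction before the next update; because J is treated as locally constant, a single update can step across many cells toward the donor element.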

C. Unstructured Grid

The donor search procedure in an unstructured grid uses a neighbor-walk procedure, where the search proceeds from one element across a shared face into a neighboring element. The procedure evaluates the dot product of each element face normal with a vector from a node on the face to the fringe location and moves to the neighbor across the face with the most negative dot product. If the dot product with all the faces is positive, then the donor element has been found.

D. Multiple Donors

The donor search procedure will find candidate donors from all possible donor grids and keep the most suitable donor. The Donor Suitability Function (DSF) is used in the same way as in SUGGAR to sort the donors and find the best candidate. Typically, the DSF is set to the element volume, but other options, such as the length of the element diagonal, are also available and can make a significant difference in the resulting domain connectivity.

E. Inverse Map

Suggar++ uses an inverse map similar in concept to that used by Meakin [12]. The inverse map is a data structure associated with each grid in the system to assist in searching that grid for candidate donors. The inverse map is simply a Cartesian grid sized to the bounding box of the donor grid that stores the donor grid element closest to the center of each element in the inverse map. In addition, a flag is set to indicate if no element in the donor grid overlaps the element in the inverse map. This data structure can be queried to determine whether a fringe location might lie within a given donor grid and, if so, to identify an element in the donor grid that is close to the fringe location. Since the inverse map is a uniform Cartesian grid, finding the inverse map element containing a fringe location is very fast. If that inverse map element stores a reference to a donor grid element that overlaps it, then that donor grid element can be used to start the donor search in the grid; otherwise, the point does not lie in the donor grid and a more rigorous search can be avoided. The inverse map is similar in concept to the vision space bins of PUNDIT [13]; however, Suggar++ does not go to the extra expense of rotating the Cartesian inverse map to align with the grid.
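A minimal sketch of such an inverse-map query follows. The data layout and names here are illustrative assumptions, not the Suggar++ implementation.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// A uniform Cartesian "inverse map" over a donor grid's bounding box.
// Each map cell stores the index of a nearby donor-grid element, or -1
// if no donor element overlaps that map cell.
struct InverseMap {
    std::array<double, 3> lo, hi;   // bounding box of the donor grid
    std::array<int, 3> n;           // number of map cells per direction
    std::vector<int> hint;          // size n[0]*n[1]*n[2]; -1 means "no overlap"

    // Return a donor-search starting element for point p, or -1 if the
    // point cannot lie in this donor grid (outside the box or empty cell).
    int queryHint(const std::array<double, 3>& p) const {
        std::array<int, 3> idx;
        for (int d = 0; d < 3; ++d) {
            if (p[d] < lo[d] || p[d] > hi[d]) return -1;
            double t = (p[d] - lo[d]) / (hi[d] - lo[d]);
            idx[d] = std::min(static_cast<int>(t * n[d]), n[d] - 1);
        }
        return hint[(idx[2] * n[1] + idx[1]) * n[0] + idx[0]];
    }
};
```

A returned element index seeds the neighbor walk in that donor grid; a return of -1 means the fringe location cannot lie in the donor grid, so the more expensive search there can be skipped.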
F. Overlap Minimization

The overset composite flow solution can be improved by reducing the overlap between component grids to the minimum possible, without degrading donor quality, such that the donors and fringes are of comparable size. Pegasus 5 and PUNDIT each contain methods that require finding donors for all grid points and then changing a grid point to a fringe if a donor of smaller size is found. The present approach, first used in SUGGAR [9], uses an iterative approach to reduce the number of donor searches performed by the minimization process. The iterative process begins adjacent to outer or inner fringes, which are themselves adjacent to hole/out points. Thus unnecessary searches for candidate donors in the boundary layer regions (where the elements are typically much smaller than background elements) can be avoided.

A direct comparison in performance between Suggar++ and Pegasus 5 or PUNDIT has not been made, but a simple investigation of the potential savings of the hole-cut and iterative minimization approach used in Suggar++ versus the strategies employed by other codes was attempted. For example, a preliminary capability is available in Suggar++ to search for donors for all grid points, similar to what is required for the implicit hole cutting in PUNDIT or the LEVEL2 minimization in Pegasus 5. Table 1 presents Suggar++ CPU times for searching for all donors, hole cutting, iterative minimization, the sum of the hole cutting and iterative minimization columns, and the ratio between the two approaches. These timings are for CPU time only and do not include any parallel communication or other wait times and hence are not the complete picture. The comparison does show that the hole cut plus iterative minimization has the potential to reduce the overall work compared to approaches that need to search for a donor for every point in the grid.

Table 1. Comparison of CPU time in searching for donors for all grid points versus a hole cut and iterative minimization. Columns: Rank; (A) Search All Donors; Hole Cut; Iterative Minimization; (B) Sum of Hole Cut and Iterative Minimization; Ratio (A)/(B).

G. Improving the Donor Search Efficiency

An efficient procedure for the donor search is critical, as this phase of the overset grid assembly process can be the most time consuming. Search routines specific to each grid topology are used to make the search steps within a grid as efficient as possible, but the best way to improve the efficiency is to minimize the number of steps required to find the proper donor. Suggar++ starts each donor search using a hint that can be provided in different ways depending on the context. The best hint for a fringe point is expected to be the donor already found by a neighboring fringe point. In order to take advantage of this type of hint as often as possible, the donor search works from a stack. Any fringe point that finds a donor adds its neighbors within the fringe set to the stack along with the donor that it found. The next point to be searched is the first in the stack, provided the stack is not empty. If the stack is empty (as it is for the first point in the set, for example), then the search must proceed without the benefit of a hint from a neighboring fringe. In this case, a hint may still be obtained from a number of other sources. As discussed in Section F, the overlap minimization proceeds in an iterative advancing-front fashion. A location selected to be a min-fringe in one iteration will sponsor its neighboring locations as candidates to be min-fringes. Thus, during the overlap minimization phase, a good hint is provided by the sponsor that caused the fringe point to be added as a candidate. (Note that the sponsor is of course a neighbor of the fringe point but is not within the same fringe set due to the iterative nature of the minimization.) If a donor exists from a previous time step, this can be used as a hint as well. Finally, a hint can be obtained from the inverse map as a last resort. Note that if the inverse map cannot provide a hint, then no donor is possible for the fringe point, in which case the search is aborted immediately. In most cases, Suggar++ will be able to complete the donor searches for all of the points in a given fringe set in long chains constructed as a result of the neighbor connectivity, with only the first point in the chain needing a hint from a source other than its immediate neighbors. This system is thus expected to keep the number of search steps per fringe set to a minimum.
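The hint-and-stack ordering described above can be pictured with a small driver loop. This is a sketch only; the names, interfaces, and callback signatures are hypothetical and do not represent the Suggar++ API.

```cpp
#include <stack>
#include <vector>

// Illustrative driver for the hint-driven donor search: each successful
// search seeds its unsearched fringe-set neighbors with the donor just found.
struct SearchItem { int fringePoint; int hintElement; };

void searchFringeSet(const std::vector<int>& fringeSet,
                     const std::vector<std::vector<int>>& neighborsOf, // fringe-set neighbors per point
                     int (*findDonor)(int point, int hintElement),     // returns donor element or -1
                     int (*fallbackHint)(int point),                   // e.g. an inverse-map query
                     std::vector<int>& donorOf) {
    std::vector<bool> done(donorOf.size(), false);
    std::stack<SearchItem> work;

    for (int seed : fringeSet) {
        if (done[seed]) continue;
        work.push({seed, fallbackHint(seed)});   // first point in a chain needs an outside hint
        while (!work.empty()) {
            SearchItem item = work.top();
            work.pop();
            if (done[item.fringePoint]) continue;
            done[item.fringePoint] = true;
            int donor = findDonor(item.fringePoint, item.hintElement);
            donorOf[item.fringePoint] = donor;
            if (donor < 0) continue;             // no donor: nothing to propagate
            for (int nb : neighborsOf[item.fringePoint])
                if (!done[nb]) work.push({nb, donor});  // neighbor reuses this donor as its hint
        }
    }
}
```

Because each donor found seeds its fringe-set neighbors, most points begin their walk only an element or two away from their actual donor, which is what keeps the number of search steps per fringe set small.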

IV. Parallel Processing

One of the primary goals of Suggar++ was to obtain scalable parallel performance in terms of both memory and run time. This is most important for the primary assembly procedures, such as hole cutting and the fringe donor search, that must occur at each time step in a simulation involving moving overset grids, but I/O operations and the initialization of the grids that takes place at start-up also benefit. Three modes of operation are of interest: a multithreaded shared-memory mode, a multiprocessor mode that utilizes MPI for interprocessor communication, and a mixed mode that incorporates both.

The parallelization is currently performed on a per-grid basis. Most of the parallelized procedures operate via a member function par applytogridlist in the CompositeGrid class. This function takes as arguments a grid list and any function that operates on grid objects. In MPI mode, the grid list typically comprises the list of member grids that was established by a mesh partitioning step. That is, each grid is assigned to one rank (or partitioned and distributed across several ranks) in such a way that the ranks have roughly equal amounts of estimated work. Each rank contains the light data (e.g., basic descriptors such as the size and type of each grid, its coordinate bounding box, its solid surface data, its relationship within any user-specified body hierarchy, etc.) for every grid in the system but contains the heavy data (e.g., grid coordinate data, connectivity, cell volumes, donor lists, etc.) for only its member grids. The vast majority of the procedures are designed to be thread safe, so the list of member grids on a rank may be processed in a multithreaded fashion by having each available thread (up to the number of threads specified by the user) operate on the next available grid in the list until all of the work for the list has been completed.

For the hole-cutting operation, each grid may be subject to explicit cutters, query cutters, and/or direct cutters. As detailed in Section II, the direct cutting operation begins with a search for the cells containing each of the nodes within the cutting surface. These containing cells are marked as OUT, and the connected edges are tested for intersection with the cutting surface faces. All of these operations are conducted by (or within) the grid being cut. As such (and because all of the cutters are carried as light data on every rank in the current implementation), the hole-cutting phase can be easily parallelized on a per-grid basis.

A. Parallel Decomposition

A common approach to efficient utilization of multiple processors is to partition the total work load into subsets that can be assigned to each processor for execution. The typical unstructured flow solver uses METIS [14] or ParMETIS [15] to decompose the set of elements in the simulation into a set of subgrids while minimizing the communication between subgrids. For a flow solver, the work associated with an element is approximately constant, and hence an equal number of elements will be distributed to each subgrid. The communication between subgrids is through the faces shared by neighboring elements and hence, from the perspective of the flow solver, the communication is minimized by minimizing the number of faces shared between subgrids.

Figure 1 presents the partitioned grid system for an overset helicopter rotor blade grid system using ParMETIS. The grid system consists of an unstructured tetrahedral grid around each rotor blade and a single large background unstructured tetrahedral grid. The image was obtained by taking a slice through the grid system and cycling the colors used to display each subgrid. The three blades and the background grid are all fed as a single mesh to ParMETIS. Since the rotor blades have no element connectivity to the background mesh, ParMETIS does not attempt to keep the blade elements on the same ranks as the background mesh elements.
In fact, for this decomposition, only the two subgrid regions colored gray are on the same rank. It is easy to see that this type of decomposition, while minimizing the communication within the flow solver, will actually increase the amount of communication required by the overset grid assembly code because the fringe locations and the donor grid will be stored on different ranks. Thus the fringe location must be sent via message passing to the donor grid for processing, and the results, including candidate donors, must be returned to the fringe grid via message passing.

The overset donor search must find donor elements that overlap the same region of space as the fringe location. Hence a partitioning optimized for the overset grid assembly process should assign elements in the fringe grid and donor grids to the same rank when they occupy the same region of space. It is also obvious for the helicopter rotor case, where the blades are constrained to rotate within a cylindrical volume, that a large portion of the domain is outside the region of overset connectivity activity. In the present case, because the background grid is clustered, approximately 78% of the elements in the entire grid system are in the overset activity region.

This means that a typical flow solver decomposition will lead to approximately 20% of the processors never contributing to the overset domain connectivity.

1. Spatial Decomposition

The present effort seeks to minimize the wall clock time required to compute the overset domain connectivity information using an efficient algorithm coupled with parallel computing resources. If the fringe locations and donor elements are decomposed to be on the same processors, then the communication to send the fringe location to the donor grid can be avoided. Thus, rather than decomposing based upon element connectivity, we propose to decompose based upon a volume in space, which we call a spatial decomposition volume (SDV). For rotor blade simulations, the region of overset connectivity activity is a cylindrical volume. By decomposing that volume using cylindrical shells, the donor-receptor pairs can be assigned to the same processor, so no message passing is required between the fringe and donor grids. Figure 2 presents the same grid system partitioned using the cylindrical spatial decomposition. The image was obtained by again taking a slice through the grid system and cycling the colors used to display each subgrid. The partitioned grids that are within the same cylindrical slice are assigned to the same rank. In addition, the outer grid, which is colored light blue, is inactive and does not participate in the overset domain connectivity computations.

A store separation case is an example where the motion of the body is not constrained to a cylindrical path. For the moment, consider a static case, such as when the store is in the carriage position and will not move. The geometry for the generic Eglin Wing/Pylon/Store configuration is shown in Figure 3. A flow-solver-type decomposition generated by ParMETIS for 8 partitions is shown in Figure 4, in which the surface geometry is colored by the partition number to which those cells are assigned. The boundaries of partition 1 for the wing and store grids are shown in Figure 5. One can clearly see that the majority of these two grid partitions do not overlap. A spatial decomposition using a Cartesian volume is best suited for the wing/pylon/store case, with the SDV sized to the bounding box of the store grid. Figure 6 shows the boundaries of the grid system colored by the assigned partition. The region of the wing grid outside of the store boundaries is inactive in the computation of the overset domain connectivity as there are no overlapping grids. Figure 7 shows the geometry and symmetry surfaces again colored by the assigned partition, with the active faces displayed as solids and the inactive faces as wireframe. Figure 7 also includes the boundaries of the different regions of the SDV as thick lines, along with the plane of each SDV farthest from the viewer.
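To illustrate the cylindrical spatial decomposition for the rotor case, assigning an element (or fringe location) to a shell can be as simple as binning the radius of its centroid about the rotor axis. The sketch below assumes the rotor axis is the z-axis and uses uniform shell widths purely for illustration; in practice the shell radii would be chosen to balance the estimated overset work rather than spaced evenly, and none of the names below come from Suggar++.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Assign a point to one of nShells cylindrical spatial-decomposition
// volumes (SDVs) by its radius about the rotor axis (taken as z here).
// Returns -1 for points outside the overset activity region (inactive).
int cylindricalSdvRank(const std::array<double, 3>& centroid,
                       double rMax, int nShells) {
    double r = std::hypot(centroid[0], centroid[1]);
    if (r >= rMax) return -1;                       // outside the activity cylinder
    int shell = static_cast<int>(r / rMax * nShells);
    return std::min(shell, nShells - 1);            // guard against roundoff near rMax
}
```

Because the same rule is applied to donor elements and fringe locations, a donor-receptor pair whose members occupy the same shell lands on the same rank, which is what removes the fringe-to-donor message passing for this case.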
V. Overset Computational Work

The goal of parallel decomposition is to take a given amount of work and divide it equally among the available computing resources. The parallel execution would then be scalable if no extra work or overhead results from executing the decomposed problem. For example, in the flow solver execution, the work is decomposed by putting approximately equal numbers of elements on each processor. The overhead for the solver parallel execution results from the need to communicate solver properties across the partition boundaries. As long as the time for communication is small relative to the solver computational time, the solution method will be scalable.

The computation of the overset domain connectivity information is significantly different from the flow solver computations. To begin with, the overset work is typically divided into two main operations: hole cutting and donor search. Each of these operations has unique computational characteristics that differ from the flow solver and from each other.

The hole-cutting operation, as indicated above, duplicates the hole-cutting geometry on all parallel processes and applies the hole cutting in parallel to the grids stored on each processor. This work is proportional to the number of surface faces in the cutting geometry and not to the entire solution volume, as is the case in the flow solver.

A grid will be cut if it overlaps the cutting geometry, and hence it is entirely likely to encounter situations where a processor does no hole-cutting work because none of its grids overlaps the cutting geometry. Depending upon the cutting approach, communication across partition boundaries will be required to synchronize the shared data.

The donor-search operation requires finding elements in donor grids that contain the fringe locations. Ignoring the overlap minimization for a moment, the number of fringes will be related to the number of elements intersected by the cutting surface and hence roughly related to the cutting surface geometry. Thus the work in the donor search is also roughly related to the cutting surface rather than the entire solution volume. Again, a processor will not contribute to any of the work if its grids do not overlap any fringe locations.

The initial donor search finds donors for the fringes at overlap grid boundaries and the intergrid boundary points adjacent to holes. In contrast, the overlap minimization requires finding donors for additional fringes in the interior of the grid. The work in the overlap minimization will be roughly proportional to the number of elements or nodes in the overlap region. Thus the donor search and overlap minimization operations will have significantly different work requirements than the hole-cut operation. This disparity will impact the parallel load balance between the two operations. Similarly, a processor could have donor search work but no hole-cutting work, resulting in an imbalance in the work for one operation if the partitioning is based upon the work for the other. The bottom line is that the computation of the overset domain connectivity information differs significantly from the flow solver in the way that work is distributed, and this can vary significantly even within different phases of the overset computation. As such, it is very difficult to both minimize the amount of work done in the overset assembly and distribute it equally.

A. Parallel Decomposition Can Increase the Work

The presence of intergrid boundaries due to decomposing a grid for parallel execution can also increase the work required to calculate the overset domain connectivity information. This is highly undesirable as it will further limit scalability. To illustrate how this happens, consider the donor search depicted in Figure 8. The arrowhead indicates the fringe location, and the tail of the arrow indicates the start location for the donor search. The donor search will use a neighbor walk to move from the start element and then iteratively move to the appropriate neighboring element until the donor element containing the fringe is found.

On the other hand, Figure 9 depicts a donor search in which the search path crosses an intergrid partition boundary resulting from parallel decomposition. The red portion of the grid is stored on one processor and the blue portion of the grid is stored on another processor. The red grid is the donor grid for the fringe location, but the donor search start element is such that the neighbor walk must cross the partition boundary into the blue grid and back into the red grid to find the proper donor element. The search starting at the element marked a will encounter the partition boundary face indicated by the bold white line and stop, as it must move into the blue grid. Without additional work, the donor for the fringe location will not be found, resulting in an erroneous declaration of an orphan.
Suggar++ makes the process robust by storing the boundary elements in an Alternating Digital Tree (ADT), which it queries to find additional start locations for the donor search. When the donor search hits a boundary face, the search is restarted using an appropriate element provided by the boundary element ADT, such as element b. As such, the presence of the boundary requires additional work (querying the ADT to restart the donor search) to be performed.

A second example is illustrated in Figure 10, where the fringe point lies in the blue grid. The red and blue grids are both searched in this case because the fringe location falls in the bounding box of each grid. It can thus be seen that the presence of intergrid boundaries is likely to result in additional searches being conducted on multiple grids for fringe locations that fall within such border regions. In this case the search starts at 1 in the red grid but fails to find a containing donor element in that grid. The search then queries the boundary element ADT, requiring it to select from cells 2 through 4 as potential start locations before it can abandon the search in the red grid.
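In outline, the restart logic can be pictured as the loop below. This is a sketch only; the walk and ADT interfaces are stand-ins supplied by the caller and do not represent the actual Suggar++ classes.

```cpp
#include <array>
#include <functional>

// Result of one neighbor-walk attempt within a single subgrid.
struct WalkResult { bool foundDonor; int element; bool hitPartitionBoundary; };

// walkToward: neighbor walk from a start element toward the fringe location.
// queryBoundaryAdt: returns another start element from the boundary-element
// ADT for this fringe location, or -1 when no candidates remain.
int findDonorWithRestarts(
        int startElement, const std::array<double, 3>& fringe, int maxRestarts,
        const std::function<WalkResult(int, const std::array<double, 3>&)>& walkToward,
        const std::function<int(const std::array<double, 3>&, int)>& queryBoundaryAdt) {
    int start = startElement;
    for (int attempt = 0; attempt <= maxRestarts; ++attempt) {
        WalkResult r = walkToward(start, fringe);
        if (r.foundDonor) return r.element;          // donor located in this subgrid
        if (!r.hitPartitionBoundary) break;          // walk left the grid entirely: no donor here
        start = queryBoundaryAdt(fringe, attempt);   // restart from a boundary element
        if (start < 0) break;                        // ADT has no further candidates
    }
    return -1;                                       // no donor found in this grid
}
```

Each restart is extra work that exists only because the partition boundary interrupted the walk, which is the cost the grid-fragment approach of Section V.B is designed to remove.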

A real-world example of searches starting in one partition with the actual donor located in an adjacent partition is shown in Figure 11. The gray crinkle surface is the intergrid partition boundary displayed as a lighted solid surface. The red dot indicates a fringe point for which a donor search was performed in the grid behind the partition boundary. The red line extends from the fringe point to the element where the donor search stopped adjacent to the partition boundary.

It is easy to see that a partition boundary that is smooth will have fewer fringe locations that fall in the region that is searched by both grids on either side of the boundary. METIS attempts to minimize the number of surface faces at the partition boundary while the SDV approach does not. Thus the smoother METIS partitions will result in fewer searches that require crossing partition boundaries in comparison to the SDV boundaries.

B. Eliminating the Extra Searches Due to Partition Boundaries

An approach to eliminate the increase in donor searches and work due to the presence of partition boundaries is highly desirable. As mentioned in the previous section, the extra searches are caused by two factors. The first is that more than one grid may be searched for donors for a fringe point because, with the approach as described above, we cannot definitively decide which grid contains the donor element and hence must search multiple possible donor grids. One approach to reduce the searches is to definitively decide which grid contains the fringe point and hence only search that grid. This is similar to the Binary Space Partition (BSP) tree used in Beggar [6]. Querying the BSP tree to determine if the point is in a grid, however, still constitutes additional work for every candidate fringe point.

The SDV provides a mechanism to uniquely assign a fringe location to a grid partition. We assign a grid element uniquely to a grid partition based upon which SDV the element center falls into. We can make a similar assignment for the fringe location. This requires solving the problem of finding the proper donor element when it is not assigned to the same partition as the fringe point grid. Fundamentally, this means that the rank must store any grid elements that overlap its SDV, and the donor search must be able to move between donor subgrids. One approach, which is not scalable in terms of memory, is to duplicate the entire grid system on every rank. SUGGAR took this simple approach for MPI parallel execution, but this is not feasible for large grid systems. The approach taken in Suggar++ is to store only the small fragments of other subgrids that overlap the SDV, which requires only a small increase in storage on all ranks for fragments of subgrids not assigned to a rank. By allowing the search to move from the on-rank subgrid into and out of the grid fragments, we can start a single search for each fringe location and find the donor if it exists. This eliminates the increase in searches described in the previous section.

VI. Immersed Boundary Points to Eliminate Orphans

The overset grid approach simplifies simulations of moving bodies by enabling gridding of each moving body component with minimal regard for other nearby body components. This works well as long as there is sufficient overlap between the component grids on nearby bodies.
If the gap between the bodies is small, such as between a fixed geometry and an attached movable control surface, the requirement for sufficient overlap can create gridding difficulties and require excessive grid refinement in a region that may not be that critical. One approach to partially alleviate this problem is to reduce the fringe depth if the flow solver can locally reduce the order of the spatial differencing. While reducing the order and the overlap requirement is helpful, it cannot eliminate the problem entirely.

An Immersed Boundary Method [16] can be used to apply a solid wall boundary condition at locations other than a grid boundary face. An overset solver with an immersed boundary capability could use this capability to eliminate orphans.

Since the orphan designation occurs during the construction of the overset domain connectivity information, the overset grid assembly process must be capable of making appropriate changes in the assembly procedure to flag appropriate locations as immersed boundary points, alleviating the overlap requirement and enabling solutions where the grid overlap is insufficient to support overset connectivity. Such a capability has been added to Suggar++.

For example, Figure 12 presents an overset grid system with two cylinders in contact. Figure 13 shows a closeup of the region to the left of the contact point with orphan points marked with squares. These points are orphans because no donor with acceptable quality could be found. For a static configuration, the typical approach is to insert a collar grid, as shown in Figure 14, that has solid surfaces conforming to the upper and lower cylinders. The collar grid is able to provide donors for the fringe points in the region and eliminates the orphans. A moving body case cannot use this collar grid, as a solid surface would no longer conform to the moving body.

Suggar++ has a capability to identify orphans and flag appropriate points as immersed boundary points to eliminate the need for fringe points in the region and hence the orphans. The procedure is currently enabled for user-selected bodies and is performed with the following steps for a node-centered solver:

1. During the hole cutting, edges connecting grid points are marked as cut if they intersect the cutting surface. If the cutting surface is part of an immersed boundary cutting body, the cut edge is marked accordingly.
2. After the hole cutting, if an edge marked as cut by an immersed boundary cutting body has a node that is marked as OUT, then the node is also marked as IMMERSED.
3. Mark the points adjacent to OUT points as INNER FRINGE 1. If the adjacent OUT point is also marked as IMMERSED, mark the fringe as IMMERSED.
4. Perform a donor search for any point marked as INNER FRINGE 1. If no acceptable donor is found and the point was marked as IMMERSED, remove the fringe marking and hence the orphan.
5. Mark the points adjacent to INNER FRINGE 1 points as INNER FRINGE 2.
6. Perform a donor search for any point marked as INNER FRINGE 2.
7. Look for INNER FRINGE 2 points that were flagged as ORPHAN because no donor of sufficient quality was found. Remove the INNER FRINGE 2 and ORPHAN markings from the point and, for any neighboring INNER FRINGE 1 points, remove the INNER FRINGE 1 marking.

Figure 15 shows the upper and lower grids along with the markings of points in the lower grid. The blue points are OUT points, either because they are interior to the body or because they are not needed by the solver due to the overlap minimization. The red points are locations in the lower grid that have been marked as fringe. The green points mark locations where the solver could apply an immersed boundary procedure. The gray squares mark points that were fringe but have been changed to active because of the presence of the immersed boundary points.
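A compact sketch of the point markings involved in step 4 above for a node-centered grid is given below. The flag names follow the text; the data layout and function are illustrative only and are not the Suggar++ implementation.

```cpp
#include <cstddef>
#include <vector>

// Point markings used by the immersed-boundary orphan-elimination steps
// (names follow the text; the layout is illustrative only).
enum class Mark { Active, Out, InnerFringe1, InnerFringe2, Orphan };

struct PointFlags {
    std::vector<Mark> mark;        // primary overset status per node
    std::vector<bool> immersed;    // node may be treated as an immersed boundary point
};

// Step 4 of the procedure: an INNER FRINGE 1 node with no acceptable donor
// reverts to an active node when it carries the IMMERSED flag, so the solver
// can apply an immersed boundary condition there instead of creating an orphan.
void resolveInnerFringe1(PointFlags& f, const std::vector<bool>& hasDonor) {
    for (std::size_t i = 0; i < f.mark.size(); ++i) {
        if (f.mark[i] != Mark::InnerFringe1 || hasDonor[i]) continue;
        if (f.immersed[i])
            f.mark[i] = Mark::Active;    // fringe marking removed; immersed BC applies here
        else
            f.mark[i] = Mark::Orphan;    // genuine orphan: insufficient overlap
    }
}
```

Steps 5 through 7 then apply the analogous logic one layer further out, demoting INNER FRINGE 2 orphans and their INNER FRINGE 1 neighbors back to active points.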
VII. Parallel Execution Results

The parallel execution capability in Suggar++ and the impact of the different approaches to partitioning will now be examined using two different configurations with unstructured grids. The first is the helicopter configuration utilized in the discussion on spatial decomposition. The second is a generic store separating from a wing with a pylon. The performance of the Spatial Decomposition Volume approach to grid partitioning will be examined and compared with ParMETIS, which represents a flow-solver-style partitioning. Limited comparisons of performance between Suggar++ and SUGGAR are made to show the improvements in the new capability. All of the following performance comparisons were executed on a shared-memory machine with four dual-core 3.00 GHz Xeon CPUs and 32 GB of shared memory.

A. Helicopter Fuselage and Blades Grid System

The helicopter system of grids utilizes all-tetrahedral grids, with one grid for the fuselage/background geometry and a separate grid around each of the four blades. Each blade grid has 1,156,735 grid points and 6,731,961 elements, and the background grid contains 8,971,420 grid points and 52,733,053 elements, for a composite grid system containing a total of 13,598,360 grid points and 79,660,897 elements. SUGGAR and Suggar++ were run in node-centered mode to produce the overset domain connectivity information that would be usable by an overset node-centered solver such as FUN3D. The geometry is shown in Figure 16, with a portion of the outer boundary for a blade grid shown in Figure 17. The following speedup plots were obtained for the fixed problem size given above by dividing the wall clock times for the serial execution on the unpartitioned grid system by the wall clock times for the parallel execution on the partitioned grid system.

1. Comparison of Suggar++ and SUGGAR

As indicated previously, the parallel capability in Suggar++ can use threads on a shared-memory machine or MPI for communication on a distributed-memory cluster. SUGGAR also includes threaded and MPI capabilities, although MPI parallel execution is not practical for large problems as it duplicates grid storage on all ranks. Thus the following comparison between Suggar++ and SUGGAR is for parallel execution using threads on a shared-memory machine and consists of wall clock times for different numbers of threads and a speedup plot. The different phases of the overset grid assembly process (hole cutting, initial fringe donor search, and overlap minimization) will be compared individually along with the total of the three phases. SUGGAR cannot utilize partitioned grids and hence must work with the original input grids, while the results for Suggar++ are for grids partitioned using the SDV.

A comparison of the performance between Suggar++ and SUGGAR in computing the node-centered DCI for the helicopter configuration is shown in terms of wall clock times and parallel speedup for hole cutting, initial fringe donor search, and overlap minimization in Figures 18 through 21. Figure 18 shows that the octree-based query cutter approach in SUGGAR is faster and scales better on this problem than the direct cut approach in Suggar++. The query cutter is applied to every point in the mesh but is a very fast technique for determining whether a point is inside the body. The major difficulty with the octree-based query cutter is the amount of memory required when a very tight cut is required. The direct cutter approach will cut minimal holes with lower memory usage, scaling with the number of faces in the cutting geometry. The approach to parallelizing the hole cutting via threads is also considerably different between the two codes. SUGGAR applies multiple threads to a single grid, with each thread performing the hole cutting on a stride of points. The result is that SUGGAR has finer granularity, yielding a better load balance for the hole cutting, which is evident in the speedup plot. Optimizations are planned for Suggar++ to improve its hole-cut performance.

The comparison of the initial donor search is shown in Figure 19. This initial donor search is performed on the fringes from the outer/overlap boundaries of component grids and the fringes adjacent to holes.
The improvements to the donor search process in Suggar++ yield a significant improvement in wall clock time in comparison to SUGGAR. Suggar++ assigns a partitioned subgrid to a thread, while SUGGAR again assigns multiple threads to a grid for the donor search.

The comparison of the overlap minimization is shown in Figure 20. An iterative approach to overlap minimization is used in both codes. Suggar++ again assigns a partitioned subgrid to a thread, while SUGGAR assigns a grid to a thread for the overlap minimization. The poor load balance in SUGGAR is very evident in the speedup plot and is especially evident in the case with eight threads, where SUGGAR is unable to utilize three of the eight threads since the input grid system contains only five grids. The work imbalance in Suggar++ is magnified by the way Suggar++ performs the operations in the overlap minimization. There are several distinct operations during an iteration of the minimization procedure.

Each of these operations is executed in parallel using the par applytogridlist method, which executes a thread join call. The code modularity is thus impacting the load balance, as a thread that finishes early cannot continue to the next operation and must wait for the slowest thread to finish. Future work will restructure the code to alleviate this synchronization when possible.

The sum of the hole cut, donor search, and overlap minimization times is compared in Figure 21. This sum does not include output of files or communication of the results to the flow solver and hence is not necessarily the total wall clock time per time step. Suggar++ shows a dramatic improvement in performance relative to SUGGAR for serial and parallel execution on this case.

2. Comparison of SDV and ParMETIS Partitioning

Suggar++ has two different methods for partitioning a grid system for parallel execution: the new Spatial Decomposition Volume approach or ParMETIS. This section will compare Suggar++ execution for these two very different partitioning approaches using the node-centered helicopter grid system. The following figures compare the performance of Suggar++ when partitioned using the cylindrical SDV and ParMETIS approaches using MPI and no threads. The curves labeled MPI:SDV utilize grids partitioned using the cylindrical SDV and include grid fragments of other subgrids on a rank so that a fringe location can be assigned to a rank and be assured of finding a donor on that rank. The curves labeled MPI:SDV: No Grid Fragments utilize grids partitioned using the cylindrical SDV but do not include grid fragments of other subgrids on a rank. Thus any subgrid whose bounding box overlaps the fringe location must be searched. The inverse map discussed in Section G is used to refine the bounding box test and eliminate many searches. The curves labeled MPI:ParMETIS utilize grids partitioned using ParMETIS for a partitioning that is typical of what would be used in a flow solver.

Figure 22 shows the wall clock time and speedup for the hole-cutting procedure. The SDV partitioning performs better than the ParMETIS partitioning, while there is little difference between the SDV with and without grid fragments. The improvement with the SDV for hole cutting relative to ParMETIS is attributed to better load balancing. There is negligible difference between the SDV with and without grid fragments, as the extra donor searches required when the grid fragments are not included represent only a small portion of the hole-cutting work.

Figure 23 shows the wall clock time and speedup for the initial donor search procedure. The SDV partitioning performs significantly better than the ParMETIS partitioning and slightly better with grid fragments than without.

Figure 24 shows the wall clock time and speedup for the overlap minimization procedure. The SDV partitioning again performs significantly better than the ParMETIS partitioning and slightly better with grid fragments than without. The fall-off in speedup is attributed to the extra communication that takes place with the current iterative implementation of the overlap minimization procedure. The load balance between the SDV with and without grid fragments for the donor search and overlap minimization will be essentially identical. The difference between the wall clock times for the two is attributed to the increased number of searches required when the SDV partitioning does not include the grid fragments.
The improvement with the SDV for hole cutting relative to the ParMETIS partitioning is attributed to better load balancing and to the extra searches required as a result of the partition boundaries in the ParMETIS partitioning. The increase in the number of searches performed is clearly evident in Figure 26. The slower rate of increase with the ParMETIS partitioning relative to the cylindrical SDV without grid fragments is attributed to the smoother nature of the partition boundary for the ParMETIS partitioning. The number of searches for the SDV with grid fragments is constant but is slightly higher than for the unpartitioned grid system. This increase is unexpected and has not been investigated.

Figures 27 and 28 show the variation in load balance for the different operations when using the SDV and ParMETIS partitions, respectively. The figures show the CPU time for each rank along with the wall clock time for the operation. The gold portion of each bar represents the system-time portion of the CPU time and is attributed to time spent in communication. One can see that, while there are significant load imbalances for all operations, the ParMETIS partitioning has a larger imbalance.

The difference between the maximum CPU time for a rank and the wall clock time is attributed to the communication time. This difference is largest for the overlap minimization operation and is a result of the iterative minimization process used in Suggar++. Reducing the iterations and the communication overhead is a priority for future work.

Additional figures show the performance of Suggar++ for larger numbers of partitions with ParMETIS partitioning. The serial run was performed on the same large shared-memory node as the previous results, while the NP=2 through NP=64 results were run on distributed-memory cluster nodes of similar hardware using Gigabit Ethernet interconnects for the message passing. These results for up to 64 partitions are intended only to demonstrate that Suggar++ can be run on a larger number of partitions than the 8 partitions used in the remainder of the results. The impediments to scalability for these results are the same as previously described: poor distribution of work and too much communication in the iterative minimization process. The current implementation of the SDV partitioning uses a one-dimensional distribution to ranks and would produce very thin SDVs for large numbers of partitions, which would produce a very poor initial load balance. Hence no attempt was made to run the SDV partitioning for larger numbers of partitions.

3. Comparison of Threads Versus MPI for Parallel Execution

As previously indicated, Suggar++ can utilize multiple threads and/or MPI processes for parallel execution. This section will briefly examine the performance of both modes using the same helicopter grid system with the grids partitioned using the cylindrical SDV. The associated figures show the wall clock time and parallel speedup for threads versus MPI parallel execution. The figures show that, in general, using threads or MPI provides comparable performance. The threaded execution is slightly slower than the MPI execution in spite of the overhead imposed by the MPI communications to exchange data. As indicated in the previous section, the threaded execution imposes more barriers via the thread join function, with a resulting accumulation of load imbalance.

B. Wing/Pylon/Store Grid System

The second configuration used for comparison purposes is the Eglin generic wing/pylon/store. The grid system consists of an unstructured tetrahedral background grid containing the wing and pylon portion of the geometry and an unstructured tetrahedral grid for the store geometry. The grid system was obtained by taking a baseline set of grids and performing three adaption cycles, as detailed in References 17 and 18, to reduce the number of orphans due to insufficient overlap. The resulting grid system contains 240,305 grid points and 1,397,052 elements in the store grid and 1,048,929 grid points and 6,114,751 elements in the wing/background grid, for a total of 1,289,234 grid points and 7,511,803 elements. Suggar++ was run in cell-centered mode to produce the overset domain connectivity information that would be usable by an overset cell-centered solver such as USM3D.

1. Comparison of Suggar++ and SUGGAR

A comparison of the performance between Suggar++ and SUGGAR in computing the cell-centered DCI for the wing/pylon/store configuration is shown in the accompanying figures. The conclusions to be drawn from the comparison for the cell-centered wing/pylon/store configuration are identical to those of the node-centered helicopter configuration.
The octree-based query cutter approach in SUGGAR is faster and scales better on this problem than the direct cut approach in Suggar++. The donor search process and overlap minimization in Suggar++ yield a significant improvement in wall clock time in comparison to SUGGAR. The poor load balance in SUGGAR's overlap minimization is again evident in the speedup plot, with SUGGAR unable to utilize more than two threads since the input grid system contains only two grids. Suggar++ again shows a dramatic improvement in performance relative to SUGGAR for serial and parallel execution on this cell-centered case.

2. Comparison of SDV and ParMETIS Partitioning

The comparison of Suggar++ parallel execution using the ParMETIS and Spatial Decomposition Volume partitioning approaches for the cell-centered wing/pylon/store grid system is shown in the accompanying figures. The results show a clear problem with the four-partition results when partitioned with ParMETIS. Table 2 lists the number of elements assigned to each rank for the four- and eight-partition cases. There is clearly some imbalance in the NP=4 case, but not enough to justify the poor performance. Figure 45 shows that the poor performance is caused by the drastic increase in the number of donor searches performed for the four-partition case on the partitioned grid system produced by ParMETIS. The boundaries of the grid partition for rank 1, shown in Figure 46, are clearly disjoint and not very compact or smooth. This case clearly illustrates the extra overset domain connectivity work, as discussed in Section V, that can be encountered with a traditional flow-solver-type decomposition.

Table 2. Number of elements on each rank for the wing/pylon and store grids with ParMETIS partitioning. Columns: Grid; NP=4 wing; NP=4 store; NP=8 wing; NP=8 store.

VIII. Demonstration of Immersed Boundary Approach to Eliminating Orphans

The use of the Suggar++ identification of immersed boundary locations to eliminate orphans in an overset grid system will be demonstrated for two different problems using two different flow solvers. The results presented here are intended as demonstrations only; additional development and validation of the results will be presented in the future.

A. Two-Dimensional Glottal

The first demonstration problem is a simple two-dimensional geometry. This configuration, which is illustrated in Figure 47, is a canonical model of the human glottis, for which experimental measurements are available [19]. In the experiment, the nominal vocal folds are closed at t=0, and a pressure difference is applied between the inflow and outflow boundaries. The vocal folds are then opened and closed over a period of 5.67 seconds, following a non-sinusoidal path controlled by a stepper motor. As the folds open, the Δp across the test section drives a jet flow through the space between the folds. This test case challenges the traditional overset approach due to the point contact between the cylindrical folds, which would normally result in orphans. This motivated the mesh refinement near the contact point illustrated in the figure, so as to suitably approximate point closure.

The equations for the fluid motion were solved using the NPHASE-PSU code, an unstructured, parallel finite volume solver. Algorithmically, NPHASE-PSU follows the well-established segregated pressure-based method. A colocated variable arrangement is used, and a lagged coefficient linearization is applied. In this laminar flow example, second-order upwind and central discretization schemes were selected for the convection and viscous terms in the momentum equations. Continuity is introduced through a pressure correction equation based on the SIMPLE-C algorithm. In constructing element face fluxes, a Rhie-Chow momentum interpolation scheme is employed. Further details of the scheme are available in Reference 20. DirtLib and the necessary iblanking instrumentation for the hybrid overset-immersed method were recently installed in the code.

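As a rough sketch of what that instrumentation amounts to, the fragment below flags as wall faces the internal faces that separate OUT-IMMERSED elements from elements that remain in the computation, and collects the OUT-IMMERSED elements whose equations are to be nullified through source terms. The element-status values, face structure, and function names here are hypothetical; they are not the DiRTlib or NPHASE-PSU interfaces.

    #include <cstddef>
    #include <vector>

    // Hypothetical iblank-style status per element (not the DiRTlib encoding).
    enum class ElementStatus { Active, Fringe, Out, OutImmersed };

    struct Face {
        std::size_t left, right;   // indices of the two adjacent elements
        bool isWall = false;       // solver will treat this face as a solid wall
    };

    // Flag wall faces between OUT-IMMERSED and computed (active or fringe)
    // elements, and report which elements should be nullified via source terms.
    void instrumentImmersedBoundary(const std::vector<ElementStatus>& status,
                                    std::vector<Face>& faces,
                                    std::vector<std::size_t>& nullified) {
        auto computed = [&](std::size_t e) {
            return status[e] == ElementStatus::Active ||
                   status[e] == ElementStatus::Fringe;
        };
        auto immersed = [&](std::size_t e) {
            return status[e] == ElementStatus::OutImmersed;
        };
        for (Face& f : faces) {
            // The inserted wall faces approximate the body surface in the gap.
            if ((immersed(f.left) && computed(f.right)) ||
                (immersed(f.right) && computed(f.left))) {
                f.isWall = true;
            }
        }
        for (std::size_t e = 0; e < status.size(); ++e) {
            if (immersed(e)) {
                nullified.push_back(e);   // equations zeroed through source terms
            }
        }
    }

In a cell-centered solver the flagged faces would receive the usual no-slip wall treatment, so the stair-step surface they form stands in for the true geometry in the closed gap, much as the thick lines in Figure 49 indicate.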
Since the glottal motion was specified, all of the overset assembly files were pre-generated using SUGGAR driven by a Python motion controller. In order to accommodate the fully closed conditions (no flow, despite a driving Δp between inlet and outlet), a constant stagnation pressure condition was specified at the inlet to match the experimentally obtained peak jet centerline velocity. A constant reference pressure was specified at the domain outlet for this incompressible flow. A dual-time approach was used wherein 50 sub-iterations were employed for each physical time step. At each sub-iteration, the discrete momentum equations are solved approximately, followed by a more exact solution of the pressure correction equation.

Figure 47 shows a portion of the overset grid system for this case when the glottis is partially open. The upper glottal grid is displayed in red, the channel grid is shown in gray, and the two lower collar grids are shown in cyan. The boundaries of all overset grids are also displayed with thick lines. Figure 48 shows the complete overset grid system when the gap is closed and the surfaces are in contact. Figure 49 shows a closeup of the grid system near the contact point. The upper and lower glottal grids are shown in red and blue, respectively. The cyan and magenta dots are located at the element centers of fringe elements. The red and blue dots are located at the centers of elements marked as OUT-IMMERSED in the upper and lower glottal grids, respectively. The faces between active and OUT-IMMERSED elements are shown as thick lines and indicate the approximate geometry being used in the solution at this time step. Figure 50 shows a sequence of the predicted instantaneous axial velocity through one open-close cycle.

The experimental configuration has a square cross section at the inlet, rather than the infinite spanwise extent assumed in the present 2D simulation. The authors are presently extending the model to capture the three-dimensionality of the geometry and the flow field.

B. Ship Rudder

The second demonstration problem is a three-dimensional geometry model for the rudder of a ship and is illustrated in Figure 51. The gray lower portion of the rudder moves relative to the cyan upper portion of the rudder, which is fixed to the blue symmetry plane for this case. The typical overset approach would be to leave a small gap between the components to accommodate the motion and to cluster the grid in the gap region with sufficient overlap to yield an orphan-free composite grid. The immersed boundary approach allows a smaller gap with fewer grid points in the gap region.

Ship moving surfaces (rudders, stabilizers, propellers) are characterized by tight gaps that make overset grid generation problematic. In the case of the spade rudder shown in this example, the gap between the root and the rudder amounts to about 20 mm, compared to a root chord of 3 m. The effect of the gap on the flow is usually negligible (except for some applications such as cavitation), and thus small modifications to the gap caused by the immersed boundary approach are acceptable. Figure 52 shows a closeup of the gap region displaying the geometry along with points marked as OUT-IMMERSED. The black points are located in the lower, movable portion of the rudder while the blue points are in the upper, fixed portion of the rudder.
Preliminary demonstration solutions for the rudder configuration were computed with CFDShip-Iowa V4, 21 a structured-grid solver. CFDShip-Iowa employs a single-phase level-set approach to solve the viscous flow with a free surface, using either RANS or DES models for turbulence, both based on a blended k-ε/k-ω model. Multi-block structured grids with dynamic overset capabilities allow handling of complex geometries and large-amplitude motions, including the motion of the propeller and control surfaces on moving ships. High-performance computations with CFDShip-Iowa using grids of up to 300 million grid points have been performed for the surface combatant model DTMB 5512 in straight-ahead and static drift conditions, in forward-speed diffraction, and in pitching and heaving in waves. 22 Figures 53 and 54 present these preliminary demonstration solutions. The slices through the grid are colored by pressure and also display velocity vectors. The black line near the top of the figures indicates the location of the free surface.
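For reference, and without claiming to reproduce the CFDShip-Iowa formulation in detail, a single-phase level-set method of this kind tracks the free surface as the zero level set of a scalar field φ that is transported by the flow,

\[
\frac{\partial \phi}{\partial t} + \mathbf{u} \cdot \nabla \phi = 0,
\qquad \Gamma_{\mathrm{fs}}(t) = \{\, \mathbf{x} : \phi(\mathbf{x}, t) = 0 \,\},
\]

with φ kept close to a signed distance function by periodic reinitialization and with the flow equations solved only in the water phase.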

IX. Summary

A new general overset grid assembly capability provided by the Suggar++ code has been presented, and detailed examination has shown it to have superior overall performance relative to SUGGAR, in part due to new, efficient hole-cutting and grid partitioning methods. Extended capabilities, such as an immersed boundary identification method, have also been demonstrated. The direct cut approach to hole cutting used in Suggar++ results in a minimal hole cut without requiring any user iteration on parameters, giving a more straightforward and more accurate hole cut at the cost of a small amount of speed but without requiring the excessive memory allocations of SUGGAR. The overall cost of the method is shown to be less than what is expected for alternative strategies that require finding a donor for every grid point. Load balancing of the method is complicated by the differing work requirements of the hole-cutting and overlap minimization procedures.

Investigation of efficiency in the donor search and related minimization schemes has focused on the influence of the domain decomposition. The conventional strategy based on ParMETIS is optimal for the flow-solver work distribution and communication patterns but has been shown to be inefficient for the overset assembly because it does not take the particular needs of the overset assembly into consideration, leading to an increase in the amount of donor search work. An alternative partitioning scheme based on a Spatial Decomposition Volume has been demonstrated here. Designed for the overset assembly process, it (a) guarantees that each fringe point is able to find a donor on its own rank, thus minimizing the amount of interprocessor communication, (b) keeps the total amount of work nearly constant as the number of partitions increases, and (c) decreases the overall amount of work relative to the unpartitioned grid by marking portions of the grid outside of the region of interest as inactive.

Finally, a new procedure has been presented that uses the overset grid assembly code to identify solver locations that can be treated as immersed boundary points, thus eliminating orphans in regions where moving bodies are in close proximity. Demonstration of the method was given in connection with flow solvers using both structured and unstructured grids.

X. Future Work

Suggar++ has been shown to have superior overall speed and parallel performance characteristics relative to SUGGAR. On the other hand, the detailed examination of the parallel performance in Suggar++ has highlighted several areas that need further work. Optimizations of the donor search method within the context of the direct hole-cutting method have already been identified and will be implemented in the near future, while the thread-based parallel implementation needs to be refactored to eliminate costs associated with the thread join operation.

Within Suggar++, the benefits of a spatial decomposition have been demonstrated relative to a conventional ParMETIS solver-type decomposition. The current implementation of the spatial decomposition uses a one-dimensional approach to assign ranks to bins in the Spatial Decomposition Volume. This needs to be extended to a multi-dimensional approach to allow the partitioning to be used for larger numbers of processors. A simple examination of the advancing front minimization scheme within Suggar++ has shown it to be less expensive than one based on searching for donors for every fringe point.
Nevertheless, the implementation of the minimization scheme currently requires too much communication within each iteration, and this becomes increasingly noticeable as the amount of work per iteration diminishes near convergence. Future work will focus on iterating as much as possible within a grid partition in order to significantly decrease communication while still trying to minimize the overall work by searching for donors for only a fraction of the total number of points in the grid. Finally, dynamic load balancing will be critical to the spatial partitioning when the motion is not constrained, such as in a store separation simulation. The load balancing will require migration of data not only to reduce the load imbalance but also to transition elements that were marked as inactive during one time step to an active partition for the next time step.
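As an illustration of the bin-to-rank assignment discussed above, the following C++ sketch contrasts a one-dimensional slab assignment of Spatial Decomposition Volume bins with a simple multi-dimensional block assignment. The structure and function names are hypothetical and do not represent the Suggar++ implementation; the sketch only shows why the one-dimensional form limits the usable number of ranks.

    #include <array>
    #include <cstddef>

    // Hypothetical Spatial Decomposition Volume: nx x ny x nz Cartesian bins.
    // Schematic of the rank assignment only; not the Suggar++ data structures.
    struct SDV { std::size_t nx, ny, nz; };

    // One-dimensional style of assignment: bins are sliced into contiguous
    // slabs along x only, so the number of usable ranks is limited by nx.
    std::size_t rankOf1D(const SDV& s, std::size_t i, std::size_t /*j*/,
                         std::size_t /*k*/, std::size_t nRanks) {
        return i * nRanks / s.nx;
    }

    // Multi-dimensional alternative: factor the ranks into an rx x ry x rz
    // block grid so the partition count can grow in all three directions.
    std::size_t rankOf3D(const SDV& s, std::size_t i, std::size_t j,
                         std::size_t k, const std::array<std::size_t, 3>& r) {
        const std::size_t bi = i * r[0] / s.nx;
        const std::size_t bj = j * r[1] / s.ny;
        const std::size_t bk = k * r[2] / s.nz;
        return (bk * r[1] + bj) * r[0] + bi;   // lexicographic rank numbering
    }

With, say, a 32x32x32-bin SDV, the one-dimensional slab assignment can keep at most 32 ranks busy, whereas a 4x2x2 block assignment already spreads the same bins over 16 ranks and extends naturally to larger rank counts.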
Acknowledgments

This material is based, in part, upon work supported by the National Aeronautics and Space Administration under Agreement No. NNX07AU75A issued through the Aeronautics Research Mission Directorate and the New 6DOF Environment project through the DoD HPC Institute for HPC Applications to Air Armament. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the author and do not necessarily reflect the views of the National Aeronautics and Space Administration.

References

1. Benek, J. A., Steger, J., and Dougherty, F., "A Flexible Grid Embedding Technique with Applications to the Euler Equations," AIAA Paper.
2. Lijewski, L. E. and Suhs, N., "Chimera-Eagle Store Separation," AIAA Paper.
3. Meakin, R. L., "Computations of the Unsteady Flow About a Generic Wing/Pylon/Finned-Store Configuration," AIAA Paper, AIAA Atmospheric Flight Mechanics Conference.
4. Meakin, R. L., "Unsteady Simulation of the Viscous Flow About a V-22 Rotor and Wing in Hover," AIAA Atmospheric Flight Mechanics Conference, CP, 1995.
5. Meakin, R. L., "Object X-Rays for Cutting Holes in Composite Overset Structured Grids," AIAA Paper.
6. Belk, D. and Maple, R., "Automated Assembly of Structured Grids for Moving Body Problems," Proceedings of the 12th AIAA Computational Fluid Dynamics Conference, AIAA Paper CP, San Diego, CA.
7. Suhs, N., Rogers, S., and Dietz, W., "PEGASUS 5: An Automated Pre-processor for Overset-Grid CFD," AIAA Paper.
8. Noack, R. W. and Kadanthot, T., "An Octree Based Overset Grid Hole Cutting Method," Proceedings of the 8th International Conference on Numerical Grid Generation in Computational Field Simulations, Honolulu, HI, 2002.
9. Noack, R. W., "Resolution Appropriate Overset Grid Assembly for Structured and Unstructured Grids," AIAA Paper, 16th AIAA Computational Fluid Dynamics Conference, Orlando, FL.
10. Noack, R. W., "SUGGAR: A General Capability for Moving Body Overset Grid Assembly," AIAA Paper, 17th AIAA Computational Fluid Dynamics Conference, Toronto, Ontario, Canada.
11. Noack, R. W., "A Direct Cut Approach for Overset Hole Cutting," AIAA Paper, 18th AIAA Computational Fluid Dynamics Conference, Miami, FL.
12. Meakin, R. L., "A New Method for Establishing Intergrid Communication Among Systems of Overset Grids," Tech. rep., Honolulu, HI.
13. Sitaraman, J., Floros, M., Wissink, A., and Potsdam, M., "Parallel Unsteady Overset Mesh Methodology for a Multi-Solver Paradigm with Adaptive Cartesian Grids," AIAA Paper, 26th AIAA Applied Aerodynamics Conference, Honolulu, Hawaii.
14. Karypis, G. and Kumar, V., "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs," SIAM Journal on Scientific Computing, Vol. 20, 1998.
15. Karypis, G. and Kumar, V., "A Parallel Algorithm for Multilevel Graph Partitioning and Sparse Matrix Ordering," Journal of Parallel and Distributed Computing, Vol. 48, 1998.
16. Peskin, C., "Flow Patterns Around Heart Valves: A Numerical Method," Journal of Computational Physics, Vol. 10, No. 2, 1972.
17. Power, G., Gudenkauf, J., Masters, J., Aboulmouna, M., and Calahan, J., "Integration of USM3D Into the Store Separation Process: Current Status and Future Direction," AIAA Paper, 47th AIAA Aerospace Sciences Meeting, Orlando, FL.
18. Noack, R. W. and Boger, D. A., "Improvements to SUGGAR and DiRTlib for Overset Store Separation Simulations," AIAA Paper, 47th AIAA Aerospace Sciences Meeting, Orlando, FL.
19. Krane, M., Barry, M., and Wei, T., "Unsteady Behavior of Flow in a Scaled-up Vocal Folds Model," Journal of the Acoustical Society of America, Vol. 122, No. 6, 2007.
20. Kunz, R., Yu, W., Antal, S., and Ettorre, S., "An Unstructured Two-Fluid Method Based on the Coupled Phasic Exchange Algorithm," AIAA Paper.
21. Carrica, P., Wilson, R., Noack, R., and Stern, F., "Ship Motions Using Single-Phase Level Set with Dynamic Overset Grids," Computers and Fluids, Vol. 36, 2007.
22. Carrica, P., Huang, J., Noack, R., Kaushik, D., Smith, B., and Stern, F., "Large-Scale DES Computations of the Forward Speed Diffraction and Pitch and Heave Problems for a Surface Combatant," submitted to Computers and Fluids.

Figure 1. Overset grid system for rotor blades decomposed into 32 partitions using ParMETIS.
Figure 2. Overset grid system for rotor blades decomposed using cylindrical spatial decomposition.
Figure 3. Geometry for Wing/Pylon/Store configuration.
Figure 4. Wing/Pylon/Store geometry colored by partition number for a METIS/flow solver type parallel decomposition.
Figure 5. Boundaries of partition number 1 of a METIS/flow solver type parallel decomposition for Wing/Pylon/Store grid system.
Figure 6. Wing/Pylon/Store partitioned using the Cartesian Spatial Decomposition Volume.
Figure 7. Boundaries of the Cartesian Spatial Decomposition Volume used to partition the Wing/Pylon/Store grid system.
Figure 8. Donor search within a single grid.
Figure 9. Donor search that crosses an inter-grid partition boundary.
Figure 10. Donor search that crosses an inter-grid partition boundary must restart the search multiple times.
Figure 11. Partition boundary and searches that tried to cross into the neighboring partition.
Figure 12. Overset grid system with two cylinders in contact.
Figure 13. Closeup of grid system for two cylinders showing orphan locations.
Figure 14. Closeup of grid system with collar grid to eliminate orphans.
Figure 15. Overset grid system showing OUT, fringe, and immersed boundary locations.
Figure 16. Geometry for helicopter fuselage and rotor blades.
Figure 17. Geometry for helicopter fuselage and rotor blades with a portion of the outer boundary of a blade.
Figure 18. Comparison of hole-cut performance for Suggar++ and SUGGAR for helicopter configuration.
Figure 19. Comparison of initial donor search performance for Suggar++ and SUGGAR for helicopter configuration.
Figure 20. Comparison of overlap minimization performance for Suggar++ and SUGGAR for helicopter configuration.
Figure 21. Comparison of sum of hole-cut, donor search, and overlap minimization performance for Suggar++ and SUGGAR for helicopter configuration.
Figure 22. Comparison of performance with SDV and ParMETIS partitioning for hole-cut operation in Suggar++ for helicopter configuration.
Figure 23. Comparison of performance with SDV and ParMETIS partitioning for donor search operation in Suggar++ for helicopter configuration.
Figure 24. Comparison of performance with SDV and ParMETIS partitioning for overlap minimization operation in Suggar++ for helicopter configuration.
Figure 25. Comparison of performance with SDV and ParMETIS partitioning of hole-cut, donor search, and overlap minimization operations in Suggar++ for helicopter configuration.
Figure 26. Comparison of the number of donor searches performed with SDV and ParMETIS partitioning in Suggar++ for helicopter configuration.
Figure 27. Load balance with SDV partitioning for hole-cut, donor search, and overlap minimization operations in Suggar++ for helicopter configuration (per-rank CPU time and wall time in seconds for ranks 0-7).
Figure 28. Load balance with ParMETIS partitioning for hole-cut, donor search, and overlap minimization operations in Suggar++ for helicopter configuration (per-rank CPU time and wall time in seconds for ranks 0-7).
Figure 29. Performance of ParMETIS partitioning up to 64 partitions for hole-cut operation in Suggar++ for helicopter configuration.
Figure 30. Performance of ParMETIS partitioning up to 64 partitions for donor search operation in Suggar++ for helicopter configuration.
Figure 31. Performance of ParMETIS partitioning up to 64 partitions for overlap minimization operation in Suggar++ for helicopter configuration.
Figure 32. Performance of ParMETIS partitioning up to 64 partitions for hole-cut, donor search, and overlap minimization operations in Suggar++ for helicopter configuration.
Figure 33. Comparison of MPI and threaded parallel execution for hole-cut operation in Suggar++ for helicopter configuration.
Figure 34. Comparison of MPI and threaded parallel execution for donor search operation in Suggar++ for helicopter configuration.
Figure 35. Comparison of MPI and threaded parallel execution for overlap minimization operation in Suggar++ for helicopter configuration.
Figure 36. Comparison of MPI and threaded parallel execution of hole-cut, donor search, and overlap minimization operations in Suggar++ for helicopter configuration.
Figure 37. Comparison of hole-cut performance for Suggar++ and SUGGAR for wing/pylon/store configuration.
Figure 38. Comparison of initial donor search performance for Suggar++ and SUGGAR for wing/pylon/store configuration.
Figure 39. Comparison of overlap minimization performance for Suggar++ and SUGGAR for wing/pylon/store configuration.
Figure 40. Comparison of sum of hole-cut, donor search, and overlap minimization performance for Suggar++ and SUGGAR for wing/pylon/store configuration.
Figure 41. Comparison of SDV and ParMETIS partitioning on Suggar++ hole-cut performance for wing/pylon/store configuration.
Figure 42. Comparison of SDV and ParMETIS partitioning on Suggar++ initial donor search performance for wing/pylon/store configuration.
Figure 43. Comparison of SDV and ParMETIS partitioning on Suggar++ overlap minimization performance for wing/pylon/store configuration.
Figure 44. Comparison of SDV and ParMETIS partitioning on Suggar++ sum of hole-cut, donor search, and overlap minimization performance for wing/pylon/store configuration.
Figure 45. Comparison of the number of donor searches performed with SDV and ParMETIS partitioning in Suggar++ for wing/pylon/store configuration.
Figure 46. Boundary of partition for rank=1 from ParMETIS partitioning with four partitions for wing/pylon/store configuration.
Figure 47. Portion of overset grid system for the 2D glottal case.
Figure 48. Complete overset grid system for the 2D glottal case when the gap is closed.
Figure 49. Closeup of the grid system near the contact point for the 2D glottal case when the gap is closed.
Figure 50. Sequence of the predicted instantaneous axial velocity through one open-close cycle for the 2D glottal case.
Figure 51. Ship rudder geometry used for immersed boundary demonstration.
Figure 52. Ship rudder geometry and OUT-IMMERSED locations.
Figure 53. Cross section, colored by pressure, of the rudder/root arrangement at the mid-chord of the rudder root section along with velocity vectors.
Figure 54. Cross section, colored by pressure, of the rudder/root arrangement at one chord length downstream from the rudder along with velocity vectors.
