PREDICTIVE PERFORMANCE AND SCALABILITY MODELING OF A LARGE-SCALE APPLICATION


D.J. Kerbyson*, H.J. Alme*, A. Hoisie*, F. Petrini*, H.J. Wasserman*, M. Gittings**
* Los Alamos National Laboratory, Parallel Architectures and Performance Team, CCS-3 Modeling, Algorithms and Informatics Group, Los Alamos, NM. {djk,hoisie}@lanl.gov
** SAIC and Los Alamos National Laboratory

ABSTRACT
In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of an important ASCI application. SAGE (SAIC's Adaptive Grid Eulerian hydrocode) is a multidimensional hydrodynamics code with adaptive mesh refinement. The model is validated against measurements on several systems, including ASCI Blue Mountain, ASCI White, and a Compaq AlphaServer ES45 system, showing high accuracy. It is parametric: basic machine performance numbers (latency, MFLOPS rate, bandwidth) and application characteristics (problem size, decomposition method, etc.) serve as input. The model is applied to add insight into the performance of current systems, to reveal bottlenecks, and to illustrate where tuning efforts can be effective. We also use the model to predict performance on future systems.

Keywords: Performance analysis, full application codes, parallel system architecture, teraflop-scale computing.

1. INTRODUCTION
SAGE (SAIC's Adaptive Grid Eulerian hydrocode) is a multidimensional (1D, 2D, and 3D), multimaterial, Eulerian hydrodynamics code with adaptive mesh refinement (AMR). The code uses second-order accurate numerical techniques. SAGE comes from the Los Alamos National Laboratory Crestone project, whose goal is the investigation of continuous adaptive Eulerian techniques applied to stockpile stewardship problems. SAGE has also been applied to a variety of problems in many areas of science and engineering, including water shock, energy coupling, cratering and ground shock, stemming and containment, early-time front-end design, explosively generated air blast, and hydrodynamic instability problems [9]. SAGE represents a large class of production ASCI applications at Los Alamos that routinely run on thousands of processors for months at a time.

SAGE is a large-scale parallel code written in Fortran 90, using MPI for inter-processor communications. Early versions of SAGE were developed for vector architectures. More recently, optimized versions of SAGE have been ported to all teraflop-scale ASCI architectures, as well as the Cray T3E and Linux-based cluster systems.

This work describes a performance and scalability analysis of SAGE. One essential result is the development of a performance model that encapsulates the code's crucial performance and scaling characteristics. The performance model has been formulated from an analysis of the code, inspection of key data structures, and analysis of traces gathered at run-time. The model has been validated against a number of ASCI machines with high accuracy. The model is also applied in this work to predict the performance of SAGE on extreme-scale future architectures, such as clusters of SMPs. Included is the application of the model for predicting the performance of the code when algorithmic changes are implemented, such as using a different parallel data decomposition. There are few existing performance studies that extend to full codes (for instance [10]); many tend to consider smaller applications, especially in distributed environments (e.g. [5,7]). This paper represents an example of performance engineering applied to a full-blown code.
SAGE has been analyzed and a performance model proposed and validated on all architectures of interest. The validated model is utilized for point-design studies involving changes in the architectures on which the code is running and in the algorithms utilized in the code. A predictive performance model of another important ASCI application is described in previous work [4].

(c) 2001 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. SC2001, November 2001, Denver. (c) 2001 ACM.

2. ESSENTIAL CHARACTERISTICS OF SAGE
In this section the important characteristics of SAGE that affect its performance and scaling behavior are described. In particular, the spatial data decomposition, the scaling of the sub-grid, the communication operations within a code cycle, and the adaptive mesh refinement operations are analyzed. In this work it is assumed that the spatial grid is three dimensional.

2.1 Parallel Spatial Decomposition in SAGE
SAGE uses a spatial discretization of the physical domain utilizing Cartesian grids. This spatial domain is partitioned across processors in sub-grids such that the first processor is assigned the first E cells in the grid (indexed in dimension order X, Y, Z), the second processor is assigned the next E cells, and so on. The assignment is actually done in blocks of 2x2x2 cells, as illustrated in Figure 1(a), where M is the number of blocks in the X dimension. Figure 1(b) illustrates the approximate assignment of cells over processors (PEs). Note that the grid is primarily partitioned in the Z dimension. Each processor contains cells which are either:
a) internal - all neighbor cells are contained on the same PE,
b) boundary - the cells belong to one of the spatial domain's physical boundaries ("faces"), or
c) inter-processor boundary - neighbor cells in physical space belong to sub-grids contained on different PEs (in one or more dimensions).

Figure 1. Cell and block assignment to processors in SAGE: (a) cell and block ordering; (b) example assignment of the spatial grid across four PEs.

A library designed for the communication requirements of the code is used to handle the necessary communications within SAGE. This includes the common MPI operations of allreduce and broadcast, for instance, as well as two main application-specific communication kernels: gather (get data) and scatter (put data) operations. These operations are used when processors require an update of their sub-grid with local cell information and inter-processor boundary data. The library uses a notion of tokens to record where all the necessary data can be found for each individual processor. A token is defined on each processor for cell-centered values and for cell-face values in each data neighbor direction (6 in total), and for the relationships of cells between levels in the AMR operation. Each token contains information on:
- sub-grid boundaries,
- data held locally within a processor,
- data held off-processor (requiring communication), and
- data required off-processor (also requiring communication).

2.2 Scaling of the Sub-grid
The sub-grid volume on each PE is a function of the number of cells per PE, a parameter which is specified in the input deck. SAGE assigns this number of cells to each PE. We are concerned with weak scaling in this analysis, in which the problem size grows proportionally with the number of processors, resulting in each PE doing approximately the same amount of work. The decomposition of the spatial grid across PEs is done in slabs (1-D partitioning), as suggested in Figure 1(b). Each slab is uniquely assigned to one processor. Taking the number of cells in each sub-grid to be E, the total grid volume V in cells is:

V = E·P = L^3    (1)

and the volume of each sub-grid is:

E = l·L^2    (2)

where P is the number of PEs, l is the short side of the slab (in the Z dimension) and L the side of the slab in the X and Y directions (assuming a square grid in the X-Y plane). The surface of the slab, L^2, in the X-Y plane is:

L^2 = V^(2/3) = (E·P)^(2/3)    (3)

From this it can be seen that the surface increases as P^(2/3).
This sub-grid surface is directly proportional to the maximum data size that is communicated between PEs in the data gather and scatter operations. The maximum size of this surface that a PE will contain is constrained by E. In fact, since the assignment of cells to PEs is done in 2x2x2 blocks, the maximum surface is E/2, at which point the slab degenerates to a foil with a thickness of 2 cells. It is possible for the surface of the full spatial domain to be greater than E/2, resulting in each surface being assigned to more than one PE. This leads to physically neighboring data cells being assigned to logically distant PEs, and hence communications will take place between more distant processors. The total communication requirement will remain (E·P)^(2/3), but it will be dealt with by more than one PE.
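To make the weak-scaling geometry of equations (1)-(3) concrete, the short Python sketch below (an illustration, not part of SAGE; the cells-per-PE value is the one assumed in this paper's examples) computes the global grid side, the slab surface, and the slab thickness as the processor count grows:

```python
# Illustrative sketch (not SAGE code): weak-scaling slab geometry following
# equations (1)-(3). E is the number of cells per PE, P the number of PEs.
def slab_geometry(E, P):
    V = E * P                    # eq. (1): total volume V = E*P = L^3
    L = V ** (1.0 / 3.0)         # side of the (assumed cubic) global grid
    surface = L * L              # eq. (3): slab surface L^2 = (E*P)^(2/3)
    thickness = E / surface      # eq. (2): E = l*L^2  =>  l = E / L^2
    return L, surface, thickness

if __name__ == "__main__":
    E = 13500                    # cells per PE assumed in the paper's examples
    for P in (1, 8, 64, 512, 4096):
        L, surf, l = slab_geometry(E, P)
        print(f"P={P:5d}  L={L:7.1f}  slab surface={surf:9.0f} cells  "
              f"thickness={l:6.2f} cells")
```

The printed surfaces grow as P^(2/3) while the slab thickness shrinks correspondingly, which is the behavior exploited in the remainder of the analysis.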

Figure 2. Cross-sections of the spatial grid and the assignment of cells to four increasing numbers of processors.

Examples of the partitioning of the spatial grid across processors using the slab decomposition are shown in Figure 2. The cross-section in the Z dimension is shown, and it is assumed that E = 13,500 cells throughout this work. Note that the volume of the spatial grid scales in accordance with equation (1). It can be seen that at the smaller processor counts each PE holds more than one foil; as the processor count increases, each foil becomes mapped first to two and then to several processors, and the maximum logical distance between PEs on foil boundaries grows accordingly. Consider again the volume of the entire grid:

V = E·P = (l·L^2)·P    (4)

This is partitioned across PEs such that there will be L/(2P) foils of width 2 on each PE, or:

(E·P)^(1/3) / (2P) = (E/(8P^2))^(1/3)    (5)

When this has a value less than one, a processor will contain less than a single foil, i.e. when

P > sqrt(E/8)    (6)

The maximum distance between the processors that hold a foil, termed the PE Distance (PED) here, is:

PED = ceil( (8P^2/E)^(1/3) )    (7)

where ceil denotes the integer ceiling function. The minimum distance between the processors that hold that foil is:

MAX( floor( (8P^2/E)^(1/3) ), 1 )    (8)

Thus, when a processor is not assigned a full foil of the spatial domain, a boundary exchange will involve all the processors that own the boundary, located at a logical distance of up to PED apart. The sub-grid surface L^2, the actual inter-processor boundary area owned by a processor (the PE surface, which results from the slab degenerating to a foil and the boundary subsequently being split amongst the processors within the PED), and the PED itself are shown in Figure 3. It can be seen that the PE surface achieves its maximum of E/2 once P exceeds sqrt(E/8) (equation 6). The sub-grid surface approximately equals the PE surface multiplied by the PED.

It is important to note that the PED is related to the communication requirements of the code; more precisely, it is proportional to the size of the messages generated in order to satisfy each necessary inter-processor boundary exchange. This is a consequence of the slab decomposition and could lead to communication inefficiencies depending on the specific machine topology. A further observation related to the communication pattern is that, when processors are considered to be labeled in a vector-like manner from 0 to P-1, and with P_SMP processors per SMP box, out-of-box communication involves no more than min(PED, P_SMP) pairs of processors. Of course, if the PED is larger than P_SMP, more than two SMP boxes will be involved in the boundary exchange. As an example, on ASCI Blue Mountain at Los Alamos, composed of SGI Origin 2000 boxes, given that P_SMP = 128 and that, from Figure 3(b), the PED does not exceed P_SMP for any reasonable number of PEs, no more than two Origin 2000 boxes will be involved in the communication for one boundary exchange.

Figure 3. SAGE scaling characteristics: (a) sub-grid surface and PE surface sizes; (b) PED, the logical neighbor distance.
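The foil and PE-distance behavior of equations (5)-(7), plotted in Figure 3, can be sketched as follows (again an illustration under the same assumed cells-per-PE value; it is not the paper's model code):

```python
# Illustrative sketch: PE surface and PE Distance (PED) for the slab
# decomposition, following equations (3), (6) and (7). Assumes 2x2x2 blocks.
import math

def slab_ped(E, P):
    subgrid_surface = (E * P) ** (2.0 / 3.0)       # L^2, eq. (3)
    pe_surface = min(subgrid_surface, E / 2.0)     # boundary area owned by one PE
    ped = max(1, math.ceil((8.0 * P * P / E) ** (1.0 / 3.0)))   # eq. (7)
    return subgrid_surface, pe_surface, ped

if __name__ == "__main__":
    E = 13500
    for P in (8, 32, 128, 512, 2048, 8192):
        sub, pe, ped = slab_ped(E, P)
        print(f"P={P:5d}  sub-grid surface={sub:9.0f}  "
              f"PE surface={pe:7.0f}  PED={ped:3d}")
```

As in Figure 3, the PE surface saturates at E/2 once P exceeds sqrt(E/8), while the sub-grid surface (approximately the PE surface multiplied by the PED) keeps growing.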

2.3 An Iteration Cycle of SAGE
The processing stages within a cycle typically involve three operations, which are repeated a number of times dependent on the time interval utilized for the integration of the equations in the code:
i) one (or more) gather operations to obtain a copy of the local and remote neighbor data,
ii) computation in each of the local cells, and
iii) one (or more) scatter operations to update data on remote processors.
These three operations of SAGE directly relate to the surface-to-volume ratio of the code [2]. The first and the third stage define the surface, related to the amount and pattern of communication, while the second stage represents the volume, related to the amount of computation. The gather and scatter operations are performed using the token library described in Section 2.1.

These three operations in a cycle of SAGE are shown schematically in Figure 4. In this example, it is assumed that the number of PEs (P) is 256 and the number of cells per PE (E) is 13,500. A single gather operation in all dimensions is depicted, followed by a processing step, and then a single scatter operation in all dimensions. The communication is shown only for processor n, but in reality all processors perform communications of the same sizes, in the same direction, at the same time. In this example it can be seen that the main communication is in the Z dimension, dealing with the sub-grid surface. The preponderance of communication in the Z dimension is also a consequence of the slab decomposition and is intuitive from Figure 1(b). The message sizes in both directions (HI and LO in SAGE terminology) of the three dimensions are shown in the box on the right side of Figure 4.

Figure 4. Schematic representation of the communication and computation within a cycle of SAGE, consisting of a data gather, processing, and a data scatter (shown for processor n), together with the gather/scatter message sizes (in cells) for the HI and LO directions of each dimension.

In addition to the gather and scatter operations, a number of other communications take place, including several MPI allreduce communications per cycle. A number of broadcast operations also exist, but only during the initialization phase of the code. The frequency of the gather/scatter operations was analyzed using MPI trace data. From this, the number of scatter/gather operations was taken to be 60 real and 7 integer operations per cycle. The surface communications in the Z dimension represent a minority of the total number of messages but over 95% of the total communication time. In addition, a number of allreduce operations take place per cycle, each only a few bytes in size.

2.4 Adaptive Mesh Refinement in SAGE
SAGE performs adaptive mesh refinement operations at the end of each cycle iteration. Each cell in the spatial grid at this point may either be:
- split into a 2x2x2 block of cells, or
- combined with its neighbors, within the same cell block, to form a single cell, or
- left unchanged.
The decision on whether to split or combine cells is determined by the current cell values in the calculation being performed. AMR enables more refined calculations to take place in those areas of the spatial grid characterized by more intense physical phenomena. For example, the shock-wave indicated by the solid line in the 2-D example in Figure 5 may cause the cells associated with it (and close to it) to be split into smaller cells. In this example, cells are represented at a certain level of refinement. A cell at level 0 is not refined, while a cell at level n represents a domain 8^n times smaller than one at level 0 in three dimensions. The adaptive refinement of cells can result in load imbalance across processors, for instance when there is a large degree of activity in a localized region of the spatial grid in comparison to the grid as a whole.
To overcome this, a load-balancing operation is performed at the end of each cycle when the maximum number of cells on any processor is more than 10% above the average number of cells over the processors, i.e. load-balance if

MAX_i(E_i) > 1.1 · (1/P) · SUM_{i=1..P} E_i    (9)

where E_i is the number of cells on processor i. The load-balancing operation takes advantage of the fact that the cells are organized into a one-dimensional logical vector. The cells at level 0 are indexed in X, Y, Z ordering corresponding to positions in the spatial grid. By partitioning this vector into approximately equally sized segments, the number of cells can remain approximately equal among processors. However, the load-balancing process requires the communication of all data values associated with the cells to be moved between processors. This can impact the overall application performance.
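A minimal sketch of the load-balance trigger of equation (9) is shown below; the 10% threshold is the one stated above, and the per-processor cell counts are invented purely for illustration:

```python
# Illustrative sketch of the load-balance trigger, equation (9): rebalance
# when the most loaded PE exceeds the average cell count by more than the
# imbalance threshold.
def needs_load_balance(cells_per_pe, threshold=0.10):
    average = sum(cells_per_pe) / len(cells_per_pe)
    return max(cells_per_pe) > (1.0 + threshold) * average

if __name__ == "__main__":
    balanced   = [13500, 13400, 13600, 13550]   # example counts only
    imbalanced = [13500, 13400, 13600, 16000]
    print(needs_load_balance(balanced))    # -> False
    print(needs_load_balance(imbalanced))  # -> True
```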

Figure 5. AMR example at multiple levels of refinement (levels 0 through 3).

The resulting data decomposition of the spatial grid among processors after this process remains similar to that depicted in Figure 1. However, the surface (in the X-Y plane) of each processor's sub-grid will no longer be of equal size. Since the AMR process is data dependent, each separate calculation using SAGE will result in a different adaptive refinement process, and hence a different performance will result.

3. A PERFORMANCE MODEL OF SAGE
In the analysis that follows in Section 3.1, the main characteristics of SAGE as described in Sections 2.1 to 2.3 are used to construct a performance model, but without AMR. This model is extended in Section 3.2 to include refinement. Applications of the model are illustrated in Section 4.

3.1 SAGE Model without AMR
The communication and computation stages of SAGE are centered around the gather/compute/scatter operations described in Section 2.3. The runtime for one cycle of the code, given that the three stages are not overlapped, can be described as:

T_cycle(P,E) = T_comp(E) + T_memcon(P,E) + T_GS(P,E) + T_allreduce(P)    (10)

where:
  P is the number of PEs,
  E is the number of cells per PE,
  T_comp(E) is the computation time,
  T_GS(P,E) is the gather and scatter communication time,
  T_allreduce(P) is the allreduce communication time, and
  T_memcon(P,E) is the memory contention that may occur between PEs within an SMP box.

The computation time, T_comp(E), is measured from an execution of SAGE on a single PE for a given number of cells E. The gather and scatter communication time is the time taken to provide boundary information by the processors owning the boundary. It depends on the PED (the communication distance described in Section 2.2) and on the sizes of the messages. T_GS(P,E) is modeled as:

T_GS(P,E) = C(P,E) · [ 60·T_c(Surface_Z·MPI_Real8, P) + 7·T_c(Surface_Z·MPI_Int, P)
                     + 60·T_c(Surface_Y·MPI_Real8, P) + 7·T_c(Surface_Y·MPI_Int, P)
                     + 60·T_c(Surface_X·MPI_Real8, P) + 7·T_c(Surface_X·MPI_Int, P) ]    (11)

where C(P,E) is the contention on the processor network when using P processors due to distant processor-neighbor communications (i.e. PED > 1), and T_c(S,P) is the time taken to communicate a message of size S when using P processors in the system. The sizes of the messages Surface_Z, Surface_Y, Surface_X (in words) are determined by the size of the sub-grid mapped to each processor, the number of processors P, and the data decomposition used, as described in Section 2.1. For the slab decomposition, Surface_Z = MIN(L^2, E/2), Surface_Y = 2·L, and Surface_X = 4 words. The sizes of MPI_Real8 and MPI_Int are determined by the MPI implementation. The coefficients multiplying the communication times in equation (11) are the frequencies of the messages, as described in Section 2.3.

A linear model for the communication time is assumed, which uses the latency (L_c) and bandwidth (B_c) of the communication network in the system. The communication latency and bandwidth vary depending on the size of the message and also on the number of processors used (for instance when dealing with in-box or out-of-box communication on an SMP-based machine):

T_c(S,P) = L_c(S,P) + S / B_c(S,P)    (12)

The communication model utilizes the bandwidths and latencies of the communication network observed in a single direction when performing bi-directional communications, as is the case in SAGE. This should not be confused with the peak uni-directional communication performance of the network or with peak measured bandwidths from a performance evaluation exercise (e.g. [11]). The impact of the PED on communication performance depends on the specific network topology.
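The structure of equations (11) and (12) can be sketched as below. The latency and bandwidth numbers, the contention factor, and the boundary sizes used in the example are placeholders standing in for the measured parameters of Table 2; the message frequencies (60 real and 7 integer) are the values described in Section 2.3, and 8-byte reals with assumed 4-byte integers are used for the MPI data type sizes.

```python
# Illustrative sketch of the gather/scatter communication model, equations
# (11) and (12). Latency, bandwidth and surface sizes are placeholders, not
# the measured system parameters of Table 2.
def comm_time(size_bytes, latency_s=5.0e-6, bandwidth_Bps=250.0e6):
    """Linear model, eq. (12): T_c(S,P) = L_c + S / B_c."""
    return latency_s + size_bytes / bandwidth_Bps

def gather_scatter_time(surfaces_cells, contention=1.0,
                        n_real=60, n_int=7, mpi_real8=8, mpi_int=4):
    """Eq. (11): per-direction real and integer boundary messages, scaled by
    the network contention factor C(P,E)."""
    total = 0.0
    for surface in surfaces_cells:          # (Surface_X, Surface_Y, Surface_Z)
        total += n_real * comm_time(surface * mpi_real8)
        total += n_int * comm_time(surface * mpi_int)
    return contention * total

if __name__ == "__main__":
    surfaces = (4, 300, 6750)               # illustrative slab boundary sizes
    print(f"T_GS ~ {gather_scatter_time(surfaces) * 1e3:.2f} ms per cycle")
```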
On a cluster of Compaq ES45 SMPs, the maximum contention from an SMP box occurs when all PEs within the box perform out-of-box sends and each receives from out-of-box PEs. This system's topology is a fat-tree using the Quadrics QsNet [6], as shown in Figure 6. This network is able to handle any logical PED without penalty; hence, for this particular network, there will be no extra overhead due to the physical distance between processors within the PED.

Figure 6. Network topology for a cluster of Compaq SMPs using the Quadrics QsNet fat-tree network.

Other topologies are not contention-free under this communication pattern, for example the Cray T3E, ASCI Red, and ASCI Blue Mountain. Communication involving processors within the PED will be bottlenecked by the lack of physical communication links between processors, which limits the concurrency of messages. For example, in the initial configuration of ASCI Blue Mountain, only a small number of HiPPI channels is used to interconnect the 128-PE SMP boxes, as shown in Figure 7.

Figure 7. Inter-SMP network on ASCI Blue Mountain: SMP boxes of 128 PEs interconnected by HiPPI links.

The contention on the processor network, C(P,E), is modeled as:

C(P,E) = MIN( MAX( L^2 / (PEsurface·CL), 1 ), P_SMP / CL )    (13)

where CL is the number of communication links per node, PEsurface is the inter-processor boundary area owned by a PE (Section 2.2), and P_SMP is the number of PEs per node. Thus the contention has a maximum of the number of PEs within the SMP divided by the number of communication links, i.e. when all PEs perform out-of-box sends and receives. It has a minimum of one, since at least one PE will perform out-of-box communications. This model of the contention on the processor network is optimistic, as it does not take into account possible overhead in the management of multiple communication links within an SMP box.

The time taken to perform the allreduce operations is modeled as:

T_allreduce(P) = N_allreduce · 2 · log2(P) · T_c(S_allreduce, P)    (14)

which consists of log2(P) stages in a binary-tree reduction operation. This is multiplied by 2, since the operation is effectively a reduction followed by a broadcast, and by the number of allreduce operations per SAGE cycle, N_allreduce, each of which communicates a message of only a few bytes (S_allreduce).

The memory contention represents the extra time required per cycle when multiple PEs contend for memory within an SMP. On some systems this can be measured by considering different configurations of processors for the same problem, for instance using all processors within an SMP node or using one processor in each of P_SMP nodes. The difference in execution times can be considered as the additional time due to memory contention. This is modeled as:

T_memcon(P,E) = E · T_mem(P)    (15)

where T_mem(P) is the measured memory contention on P processors, per cell per cycle.

Our overall model contains many inputs, which may be conveniently categorized as application, system, or mapping parameters. These inputs specify a particular design point: a matching of the application, in a particular configuration, with a target system in a particular configuration. The application and mapping parameters can be specified appropriately for the design point based on the input deck of a specific run, while the system parameters need to be measured or otherwise specified for a particular system. The input parameters to the SAGE performance model are listed in Table 1 below.

Table 1: Input parameters to the SAGE performance model.
  Application:
    E - cells per processor
  Mapping:
    Surface_X, Surface_Y, Surface_Z - surface size (in cells) of the sub-grid mapped to each processor, in each of the three dimensions
  System:
    P - number of processors
    P_SMP - processors per SMP box
    CL - communication links per SMP box
    L_c(S,P), B_c(S,P) - latencies and bandwidths achieved in one direction during bi-directional communication
    MPI_Real8, MPI_Int - sizes of the MPI data types
    T_comp(E) - sequential cycle time of SAGE on E cells
    T_mem(P) - memory contention per cell per cycle
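Putting equations (10) and (12)-(15) together, the complete no-AMR cycle model can be sketched as below. Every numeric input (single-PE cycle time, memory-contention cost, latency, bandwidth, allreduce count, X and Y boundary sizes, PEs per node, links per node) is a placeholder standing in for the measured parameters of Tables 1 and 2, so the printed times are purely illustrative:

```python
# Illustrative sketch of the no-AMR cycle model, equations (10) and (12)-(15).
# All numeric parameters are placeholders, not measured system values.
import math

def comm_time(size_bytes, latency_s, bandwidth_Bps):
    return latency_s + size_bytes / bandwidth_Bps                  # eq. (12)

def contention(L2, pe_surface, P_smp, links):                      # eq. (13)
    return min(max(L2 / (pe_surface * links), 1.0), P_smp / links)

def cycle_time(P, E, *, t_comp, t_mem, xy_surfaces=(4.0, 300.0),
               n_real=60, n_int=7, n_allreduce=20, allreduce_bytes=8,
               P_smp=4, links=1, latency_s=5.0e-6, bandwidth_Bps=250.0e6):
    L2 = (E * P) ** (2.0 / 3.0)
    pe_surface = min(L2, E / 2.0)            # Surface_Z = MIN(L^2, E/2)
    surfaces = (*xy_surfaces, pe_surface)
    C = contention(L2, pe_surface, P_smp, links)
    t_gs = C * sum(n_real * comm_time(s * 8, latency_s, bandwidth_Bps) +
                   n_int * comm_time(s * 4, latency_s, bandwidth_Bps)
                   for s in surfaces)                              # eq. (11)
    t_ar = n_allreduce * 2 * math.log2(P) * \
           comm_time(allreduce_bytes, latency_s, bandwidth_Bps)    # eq. (14)
    return t_comp + E * t_mem + t_gs + t_ar                        # eqs. (10), (15)

if __name__ == "__main__":
    for P in (16, 256, 4096):
        t = cycle_time(P, 13500, t_comp=1.0, t_mem=2.0e-6)
        print(f"P={P:5d}  predicted cycle time ~ {t:.3f} s")
```

A real instantiation of the model would replace the placeholder constants with the per-system, per-message-size values of Table 2.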
3.2 SAGE Model with AMR
The adaptive mesh refinement process in SAGE is performed at the end of each cycle, as described in Section 2.4. The AMR operation is triggered by one of several thresholds on the physical quantities contained in the cells, and is thus heavily dependent on the calculation being performed. In order to model this operation accurately, information on the AMR process resulting from the particular calculation is required. This includes:

- the number of new cells added in a cycle,
- the current cell division factor (the total number of cells divided by the number of level-0 cells), and
- the movement of cells between processors to load-balance.

For a particular calculation this information needs to be defined on a per-cycle basis. For example, a calculation that results in intense physical phenomena in a localized area of the spatial grid will require more time to load-balance (see Section 2.4) than a calculation with uniformly distributed phenomena. The performance model of SAGE presented in Section 3.1 can be extended to include the main characteristics of the adaptive mesh refinement process. A model that includes these operations is:

T_cycle_i(P,E,D,A,M_cm) = T_comp(E·D_i) + T_memcon(P, E·D_i) + T_allreduce(P) + T_GS(P,E,D_i)
                          + T_divide(A_i) + T_combine(E·D_i) + T_load(M_cm_i, P)    (16)

The main additional components in this model, in comparison to that defined previously in equation (10), are:
  T_divide(A_i) - the time to divide cells in the current cycle,
  T_combine(E·D_i) - the time to combine cells in the current cycle, and
  T_load(M_cm_i, P) - the time to perform the load-balancing.

In addition, three parameters are included in this model that define characteristics of the calculation being performed. Each represents a time history of values defined on a cycle-by-cycle basis:
  D - a vector containing the cell division factor [1 .. 8^maxlevel],
  A - a vector containing the maximum number of cells added (over all processors) through the division process, and
  M_cm - a vector containing the maximum number of cells moved between any two PEs in the load balancing.

By defining these inputs on a per-cycle basis, the model can accurately encompass the change in computation time and communication time due to the change in the amount of cell division. For instance, the computation time will scale in proportion to the amount of cell division (the volume of the sub-grid), whereas the size of the communications for the gather and scatter operations will scale as the 2/3 power of the cell division (the surface of the sub-grid). This model is not described in any further detail in this work. An application of the model with adaption is illustrated in Section 4.3.

Table 2. System parameters used in the validation for each system (AlphaServer ES45, AlphaServer ES40, ASCI Blue Mountain, and ASCI White): P_SMP (PEs per node), the number of nodes used in the validation, CL (communication links per node), the single-PE cycle time T_comp(E), the latency L_c(S,P) (in microseconds) and inverse bandwidth 1/B_c(S,P) (in ns per byte), each specified piecewise by message size S and by in-box versus out-of-box communication, and the memory contention T_mem(P) (in microseconds).
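A sketch of the AMR-extended cycle model of equation (16) is given below. The per-cell cost coefficients and the stand-in for the base (no-AMR) model are assumptions chosen only to make the sketch runnable; they are not fitted or measured values:

```python
# Illustrative sketch of the AMR-extended cycle model, equation (16).
# D, A and M_cm are the per-cycle histories defined above; the per-cell cost
# coefficients and the base-cycle stand-in are placeholder assumptions.
def amr_cycle_time(i, P, E, D, A, M_cm, base_cycle,
                   t_divide_per_cell=1.0e-6,
                   t_combine_per_cell=2.0e-7,
                   t_move_per_cell=4.0e-6):
    E_eff = E * D[i]                  # effective cells per PE in cycle i
    t = base_cycle(P, E_eff)          # comp + memcon + gather/scatter + allreduce
    t += t_divide_per_cell * A[i]     # T_divide(A_i)
    t += t_combine_per_cell * E_eff   # T_combine(E*D_i)
    t += t_move_per_cell * M_cm[i]    # T_load(M_cm_i, P): cells moved to rebalance
    return t

if __name__ == "__main__":
    # Trivial stand-in for the no-AMR model of equation (10).
    base = lambda P, E_eff: 1.0e-4 * E_eff ** (2.0 / 3.0) + 7.4e-5 * E_eff
    D = [1.0, 1.2, 1.5, 2.0, 2.6]     # cell division factor history
    A = [0, 2700, 4100, 6700, 8100]   # max cells added per cycle
    M = [0, 0, 0, 0, 30000]           # cells moved when load-balancing occurs
    for i in range(len(D)):
        print(f"cycle {i}: {amr_cycle_time(i, 256, 13500, D, A, M, base):.3f} s")
```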

4. APPLICATION OF THE MODEL
In this section the SAGE performance model described in Section 3 is validated against four existing architectures and is also applied to predicting performance on future architectures. In addition, the performance of SAGE is investigated given algorithmic changes that could be implemented in the code. An example of predicting the performance of SAGE with the adaptive mesh refinement process is also illustrated.

4.1 Validation and Performance Prediction on Future Architectures
The model presented in Section 3 has been validated against measurements taken on a Compaq AlphaServer ES45, an AlphaServer ES40, ASCI Blue Mountain (SGI Origin 2000), and from a preliminary performance analysis of ASCI White (IBM SP3). A detailed performance study of ASCI White can be found in [8]. The input parameters to the SAGE performance model are listed in Table 2. This includes the number of nodes used in the validation along with the system parameters used in the model for each architecture. The parameters are either measured or otherwise specified. The comparison is performed with the default slab decomposition in SAGE, as described in Section 2.1.

For each system, the time taken to perform one cycle of SAGE as given by the performance model is compared to measurements. In the case of the two Compaq systems, predictions are given for systems larger than those available in the measurement process. The Compaq AlphaServer ES45 cluster that we had access to was very small, but a larger system of this kind is being installed at Los Alamos National Laboratory. Without an available large-scale system, the performance model is able to provide an expectation of the performance on such a future architecture.

The comparison of predictions and measurements on the four systems is shown in Figure 8. The predictions from the performance model show high accuracy, mostly within 10% of the actual measurements. A comparison of the cell-cycles per second achieved by SAGE is shown in Figure 9. This metric is used by SAGE as a further indication of performance; it represents the number of cells that can be processed in each unit of wall-clock time. In Figure 9, measurements are used for the Cray T3E, ASCI White, and ASCI Blue Mountain, whereas we predict the performance of the Compaq system using our model.

Figure 8. Comparison of predictions from the SAGE performance model with measurements of the time per cycle: (a) Compaq AlphaServer ES45; (b) Compaq AlphaServer ES40; (c) ASCI Blue Mountain (SGI Origin 2000); (d) ASCI White (IBM SP3).

Figure 9. Comparison of the performance of SAGE (in cell-cycles per second) across several systems: AlphaServer ES45, Cray T3E, ASCI White, and ASCI Blue Mountain.

The model predicts the performance of the AlphaServer ES45 to be consistently greater than that of ASCI White (IBM SP3) on a comparable number of processors. A system with a peak performance of 30 Tflops, composed of the Compaq SMP boxes with Quadrics QsNet, would achieve roughly an order of magnitude or more greater SAGE performance than that achieved to date on ASCI Blue Mountain with 6000 SGI Origin 2000 processors. By comparison, the ratio of peak speeds is approximately 10.

4.2 Performance Prediction on Algorithmic Transformations: an Alternative Data Decomposition
The surface-to-volume ratio of the processing in SAGE is dependent on the grid decomposition. There is a large difference between the use of the slab decomposition (Figure 1) and a cube decomposition (Figure 10). Where the slab decomposition results in communications scaling as the 2/3 power of the number of PEs, as shown by equation (3), with a cube decomposition the communication size will remain approximately constant, though the number of PE pairs communicating will be larger. It can easily be shown (see for example [2]) that the surface-to-volume ratio (i.e. the communication-to-computation ratio) gets better (i.e. smaller) as the aspect ratio of the sub-grids changes towards being perfect cubes, as suggested in Figure 10. Of course, a perfect cubic decomposition can only be achieved when the number of processors is a cubic power, as is the decomposition on 8 processors shown in Figure 10.

Figure 10. Possible 3-D data decomposition configurations for 2, 4, and 8 processors.

A comparison between the cube decomposition and the slab decomposition is shown in Figure 11. The total surface of the sub-grid on an individual PE is plotted, which is proportional to the communication that takes place in each gather (and scatter) operation. The PE distance (PED) is also shown in Figure 11(b); the curves for the slab decomposition have already been presented in Figure 3. The PED for the slab decomposition in the X and Y dimensions is always equal to 1. For the cube decomposition the PED is always equal to 1 in the X dimension, but varies in the Y and Z dimensions. The communication size using the cube decomposition is considerably smaller than that for the slab, but the PED is considerably larger.

Figure 11. Comparison of the slab and cube decompositions: (a) total sub-grid surface sizes; (b) PE distance (PED) for the slab (Z) and cube (Y and Z) decompositions.

A comparison between the expected performance of SAGE using the cube decomposition and the current slab decomposition on the Compaq AlphaServer ES45 and on ASCI Blue Mountain is shown in Figure 12. This is achieved by modifying the parameters Surface_X, Surface_Y, and Surface_Z in equation (11) to represent the sub-grid surface sizes of the cube decomposition. For an ideal cube these would all be equal to (L/P^(1/3))^2. The use of the cube decomposition reduces the communication requirements and hence results in an expected performance improvement of 35% on the Compaq system, and a processor-count-dependent improvement on the SGI system, compared with the use of slabs.
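The reduction in boundary size when moving from slabs to cubes can be sketched numerically, as below. This mirrors the surface-size comparison of Figure 11(a) using the ideal (L/P^(1/3))^2 face size quoted above; the slab expressions are those of Section 3.1:

```python
# Illustrative sketch: total boundary surface per PE for the slab and the
# ideal cube decompositions, as compared in Figure 11(a).
def slab_total_surface(E, P):
    L = (E * P) ** (1.0 / 3.0)
    surface_z = min(L * L, E / 2.0)      # dominant Z boundary
    surface_y = 2.0 * L                  # one block row, 2 cells thick
    surface_x = 4.0                      # a single 2x2 block face
    return surface_x + surface_y + surface_z

def cube_total_surface(E, P):
    face = E ** (2.0 / 3.0)              # (L / P^(1/3))^2 = E^(2/3)
    return 3.0 * face                    # one face size per dimension

if __name__ == "__main__":
    E = 13500
    for P in (8, 64, 512, 4096):
        print(f"P={P:5d}  slab={slab_total_surface(E, P):8.0f} cells  "
              f"cube={cube_total_surface(E, P):7.0f} cells")
```

Under weak scaling the cube boundary stays constant while the slab boundary grows as P^(2/3), which is the source of the bandwidth reduction discussed next.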

Figure 12. Performance comparison of the slab and cube decompositions: (a) Compaq AlphaServer ES45; (b) ASCI Blue Mountain.

Figure 13. Time-component predictions (Compaq ES45): (a) slab decomposition; (b) cube decomposition.

The performance model can also be used to provide insight into where time is spent within the application. In Figure 13, time components representing computation (including memory), communication latency, and communication bandwidth are shown for both data decomposition schemes of SAGE on the Compaq AlphaServer ES45. It can be clearly seen that the communication bandwidth component is much reduced when using the cube decomposition, while both the computation and the communication latency components remain mostly unchanged. SAGE could benefit from a cube decomposition of the full grid if the communication network within the machine is able to handle the large logical PEDs without performance penalty. This is true of the fat-tree topology of the Quadrics network used on the cluster of Compaq SMPs, as described in Section 3.1.

4.3 Extending the Performance Model to AMR
The model can be used to explore the performance of different characteristic adaptive mesh refinement calculations on different architectures. In Figure 14, an example time history for each of the cell division factor (D), the maximum cells added in a cycle (A), and the maximum cells moved for load balancing (M_cm) is shown. The example time histories attempt to depict the situation in which a shock-wave propagates through the spatial grid. This results in the following characteristics:
- the cell division factor gradually increases, since cells are added in every cycle as the shock-wave expands, and
- load-balancing is assumed to take place every fifth cycle (a simple construction of such histories is sketched below).
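A sketch of how such time histories might be constructed and indexed by cycle is shown below; the growth rate, the fraction of cells moved, and the five-cycle load-balancing period are example assumptions in the spirit of the description above, not data from a real calculation:

```python
# Illustrative sketch: build example per-cycle histories D (division factor),
# A (max cells added) and M_cm (max cells moved), in the spirit of Figure 14.
def example_histories(n_cycles, E=13500, growth=0.03, lb_period=5):
    D, A, M_cm = [1.0], [0], [0]
    for i in range(1, n_cycles):
        added = growth * E * D[-1]         # cells added as the shock expands
        D.append(D[-1] + added / E)        # division factor rises accordingly
        A.append(int(added))
        # every lb_period cycles, assume a fraction of a PE's cells is moved
        M_cm.append(int(0.2 * E * D[-1]) if i % lb_period == 0 else 0)
    return D, A, M_cm

if __name__ == "__main__":
    D, A, M = example_histories(20)
    for i in (0, 5, 10, 15, 19):
        print(f"cycle {i:2d}: D={D[i]:.2f}  added={A[i]:5d}  moved={M[i]:5d}")
```

These vectors are exactly the D, A, and M_cm inputs of the AMR-extended model in equation (16).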

Figure 14. Example time histories for the cell division factor, cells added, and cells load-balanced (indexed by cycle).

The time histories depicted in Figure 14 were used in the performance model in order to investigate both the variation in cycle time during the calculation (Figure 15(a)) and the total time taken to perform the cycles while scaling the number of processors (Figure 15(b)). This was undertaken for the Compaq AlphaServer ES40. It can be seen from Figure 15(a) that cycles requiring load-balancing take slightly longer than those without.

Figure 15. Performance prediction of SAGE with AMR, using the input histories from Figure 14: (a) time taken for each cycle; (b) example adaption scalability (total time for all cycles).

5. SUMMARY AND CONCLUSIONS
In this paper we have presented a predictive performance and scalability model for an important application from the ASCI workload. The model takes into account the main computation and communication characteristics of the entire code. The model proposed was validated on two large-scale ASCI architectures, ASCI White (IBM SP3) and ASCI Blue Mountain (SGI Origin 2000), showing very good accuracy. The model was then utilized to predict the performance of SAGE on future architectures and also when using an alternative parallel data decomposition.

We believe that performance modeling is the key to building performance-engineered applications and architectures. To this end, the work presented in this paper represents one of a very few existing performance models of entire applications. Like our previous performance model of a particle transport application [4], the model incorporates information from various levels of the benchmark hierarchy [3] and is parametric: basic machine performance numbers (latency, computational rate, bandwidth) and application characteristics (problem size, decomposition method, etc.) serve as input. Such a model adds insight into the performance of current systems, revealing bottlenecks and showing where tuning efforts would be most effective. It also allows prediction of performance on future systems. The latter is important for both application and system architecture design, as well as for the procurement of supercomputer architectures.

A performance model is meant to be updated, refined, and further validated as new factors come into play. The work performed in this report was primarily concerned with the analysis of SAGE in the absence of grid adaptation. With additional analysis, the model has been extended to include the main characteristics of the adaptation process. The performance model can be used to investigate performance on alternative application configurations (data decompositions) and on alternative target systems.

ACKNOWLEDGEMENTS
This work was supported by funding from the Los Alamos Computer Science Institute. We would like to thank Ed Benson for access and support on the AlphaServer ES45 at Compaq in Marlborough, MA. Los Alamos National Laboratory is operated by the University of California for the National Nuclear Security Administration of the US Department of Energy.

6. REFERENCES
[1] Culler, D.E., Singh, J.P., Gupta, A., Parallel Computer Architecture, Morgan Kaufmann, 1999.
[2] Goedecker, S., Hoisie, A., Performance Optimization of Numerically Intensive Codes, SIAM Press, 2001.
[3] Hockney, R., Berry, M. (Eds.), Public International Benchmarks for Parallel Computers, Scientific Programming, Vol. 3, 1994.

[4] Hoisie, A., Lubeck, O., Wasserman, H., Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications, Int. J. of High Performance Computing Applications, Vol. 14, No. 4, Winter 2000.
[5] Nudd, G.R., Kerbyson, D.J., et al., PACE: A Toolset for the Performance Prediction of Parallel and Distributed Systems, Int. J. of High Performance Computing Applications, Vol. 14, No. 3, Fall 2000.
[6] Petrini, F., Feng, W., Hoisie, A., Coll, S., Frachtenberg, E., The Quadrics Network (QsNet): High-Performance Clustering Technology, in Hot Interconnects 9, Stanford, CA, August 2001.
[7] Rauber, T., Rünger, G., Modeling the Runtime of Scientific Programs on Parallel Computers, in Proc. 2000 ICPP Workshops, IEEE Computer Society, 2000.
[8] de Supinski, B.R., The ASCI PSE Milepost: Run-Time Systems Performance Tests, in Proc. Int. Conf. on Parallel & Distributed Processing Techniques & Applications, Las Vegas, June 2001.
[9] Weaver, R., Major 3-D Parallel Simulations, BITS - Computing and Communications News, Los Alamos National Laboratory, June/July 1999.
[10] Worley, P.H., Performance Tuning and Evaluation of a Parallel Community Climate Model, in Proc. SC99, Portland, Oregon, November 1999.
[11] Worley, P.H., Performance Evaluation of the IBM SP and the Compaq AlphaServer SC, in Proc. ICS 2000, ACM, 2000.


More information

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS Mihai Horia Zaharia, Florin Leon, Dan Galea (3) A Simulator for Load Balancing Analysis in Distributed Systems in A. Valachi, D. Galea, A. M. Florea, M. Craus (eds.) - Tehnologii informationale, Editura

More information

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Gabriele Jost and Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 {gjost,hjin}@nas.nasa.gov

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 191 Modeling Parallel Applications for calability Analysis: An approach to predict the communication pattern Javier Panadero 1, Alvaro Wong 1,

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Scalable Parallel Clustering for Data Mining on Multicomputers

Scalable Parallel Clustering for Data Mining on Multicomputers Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISI-CNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This

More information

A Framework For Application Performance Understanding and Prediction

A Framework For Application Performance Understanding and Prediction A Framework For Application Performance Understanding and Prediction Laura Carrington Ph.D. Lab (Performance Modeling & Characterization) at the 1 About us An NSF lab see www.sdsc.edu/ The mission of the

More information

Revenue Management for Transportation Problems

Revenue Management for Transportation Problems Revenue Management for Transportation Problems Francesca Guerriero Giovanna Miglionico Filomena Olivito Department of Electronic Informatics and Systems, University of Calabria Via P. Bucci, 87036 Rende

More information

Algorithms of Scientific Computing II

Algorithms of Scientific Computing II Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware

More information

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Source Code Transformations Strategies to Load-balance Grid Applications

Source Code Transformations Strategies to Load-balance Grid Applications Source Code Transformations Strategies to Load-balance Grid Applications Romaric David, Stéphane Genaud, Arnaud Giersch, Benjamin Schwarz, and Éric Violard LSIIT-ICPS, Université Louis Pasteur, Bd S. Brant,

More information

HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

Impacts of Operating Systems on the Scalability of Parallel Applications

Impacts of Operating Systems on the Scalability of Parallel Applications UCRL-MI-202629 LAWRENCE LIVERMORE NATIONAL LABORATORY Impacts of Operating Systems on the Scalability of Parallel Applications T.R. Jones L.B. Brenner J.M. Fier March 5, 2003 This document was prepared

More information

An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations. Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1

An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations. Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1 An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1 1 University of California, Davis 2 University of Nebraska, Lincoln

More information

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling Center for Information Services and High Performance Computing (ZIH) FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling Symposium on HPC and Data-Intensive Applications in Earth

More information

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M. SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA mbriat@esri.com, jmonnot@esri.com,

More information

Performance Tuning of a CFD Code on the Earth Simulator

Performance Tuning of a CFD Code on the Earth Simulator Applications on HPC Special Issue on High Performance Computing Performance Tuning of a CFD Code on the Earth Simulator By Ken ichi ITAKURA,* Atsuya UNO,* Mitsuo YOKOKAWA, Minoru SAITO, Takashi ISHIHARA

More information

Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing

Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing A. Cortés, A. Ripoll, M.A. Senar and E. Luque Computer Architecture and Operating Systems Group Universitat Autònoma

More information

Multilevel Load Balancing in NUMA Computers

Multilevel Load Balancing in NUMA Computers FACULDADE DE INFORMÁTICA PUCRS - Brazil http://www.pucrs.br/inf/pos/ Multilevel Load Balancing in NUMA Computers M. Corrêa, R. Chanin, A. Sales, R. Scheer, A. Zorzo Technical Report Series Number 049 July,

More information

Influence of Load Balancing on Quality of Real Time Data Transmission*

Influence of Load Balancing on Quality of Real Time Data Transmission* SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 6, No. 3, December 2009, 515-524 UDK: 004.738.2 Influence of Load Balancing on Quality of Real Time Data Transmission* Nataša Maksić 1,a, Petar Knežević 2,

More information

LOAD BALANCING AS A STRATEGY LEARNING TASK

LOAD BALANCING AS A STRATEGY LEARNING TASK LOAD BALANCING AS A STRATEGY LEARNING TASK 1 K.KUNGUMARAJ, 2 T.RAVICHANDRAN 1 Research Scholar, Karpagam University, Coimbatore 21. 2 Principal, Hindusthan Institute of Technology, Coimbatore 32. ABSTRACT

More information

Hank Childs, University of Oregon

Hank Childs, University of Oregon Exascale Analysis & Visualization: Get Ready For a Whole New World Sept. 16, 2015 Hank Childs, University of Oregon Before I forget VisIt: visualization and analysis for very big data DOE Workshop for

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

More information

Grid Scheduling Dictionary of Terms and Keywords

Grid Scheduling Dictionary of Terms and Keywords Grid Scheduling Dictionary Working Group M. Roehrig, Sandia National Laboratories W. Ziegler, Fraunhofer-Institute for Algorithms and Scientific Computing Document: Category: Informational June 2002 Status

More information

From Hypercubes to Dragonflies a short history of interconnect

From Hypercubes to Dragonflies a short history of interconnect From Hypercubes to Dragonflies a short history of interconnect William J. Dally Computer Science Department Stanford University IAA Workshop July 21, 2008 IAA: # Outline The low-radix era High-radix routers

More information

Optimizing a ëcontent-aware" Load Balancing Strategy for Shared Web Hosting Service Ludmila Cherkasova Hewlett-Packard Laboratories 1501 Page Mill Road, Palo Alto, CA 94303 cherkasova@hpl.hp.com Shankar

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

A NOVEL RESOURCE EFFICIENT DMMS APPROACH

A NOVEL RESOURCE EFFICIENT DMMS APPROACH A NOVEL RESOURCE EFFICIENT DMMS APPROACH FOR NETWORK MONITORING AND CONTROLLING FUNCTIONS Golam R. Khan 1, Sharmistha Khan 2, Dhadesugoor R. Vaman 3, and Suxia Cui 4 Department of Electrical and Computer

More information

Operating System Multilevel Load Balancing

Operating System Multilevel Load Balancing Operating System Multilevel Load Balancing M. Corrêa, A. Zorzo Faculty of Informatics - PUCRS Porto Alegre, Brazil {mcorrea, zorzo}@inf.pucrs.br R. Scheer HP Brazil R&D Porto Alegre, Brazil roque.scheer@hp.com

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

THE NAS KERNEL BENCHMARK PROGRAM

THE NAS KERNEL BENCHMARK PROGRAM THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

More information

Determining optimal window size for texture feature extraction methods

Determining optimal window size for texture feature extraction methods IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec

More information

OpenMosix Presented by Dr. Moshe Bar and MAASK [01]

OpenMosix Presented by Dr. Moshe Bar and MAASK [01] OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive

More information

FPGA area allocation for parallel C applications

FPGA area allocation for parallel C applications 1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information