PREDICTIVE PERFORMANCE AND SCALABILITY MODELING OF A LARGE-SCALE APPLICATION


D.J. Kerbyson*, H.J. Alme*, A. Hoisie*, F. Petrini*, H.J. Wasserman*, M. Gittings**
* Los Alamos National Laboratory, Parallel Architectures and Performance Team, CCS-3 Modeling, Algorithms and Informatics Group, Los Alamos, NM. {djk,hoisie}@lanl.gov
** SAIC and Los Alamos National Laboratory

ABSTRACT
In this work we present a predictive analytical model that encompasses the performance and scaling characteristics of an important ASCI application. SAGE (SAIC's Adaptive Grid Eulerian hydrocode) is a multidimensional hydrodynamics code with adaptive mesh refinement. The model is validated against measurements on several systems, including ASCI Blue Mountain, ASCI White, and a Compaq AlphaServer ES45 system, showing high accuracy. It is parametric: basic machine performance numbers (latency, MFLOPS rate, bandwidth) and application characteristics (problem size, decomposition method, etc.) serve as input. The model is applied to add insight into the performance of current systems, to reveal bottlenecks, and to illustrate where tuning efforts can be effective. We also use the model to predict performance on future systems.

Keywords: Performance analysis, full application codes, parallel system architecture, teraflop-scale computing.

1. INTRODUCTION
SAGE (SAIC's Adaptive Grid Eulerian hydrocode) is a multidimensional (1D, 2D, and 3D), multimaterial, Eulerian hydrodynamics code with adaptive mesh refinement (AMR). The code uses second-order accurate numerical techniques. SAGE comes from the Los Alamos National Laboratory Crestone project, whose goal is the investigation of continuous adaptive Eulerian techniques applied to stockpile stewardship problems. SAGE has also been applied to a variety of problems in many areas of science and engineering, including water shock, energy coupling, cratering and ground shock, stemming and containment, early-time front-end design, explosively generated air blast, and hydrodynamic instability problems [9]. SAGE represents a large class of production ASCI applications at Los Alamos that routinely run on thousands of processors for months at a time.

SAGE is a large-scale parallel code written in Fortran 90, using MPI for inter-processor communications. Early versions of SAGE were developed for vector architectures. More recently, optimized versions of SAGE have been ported to all teraflop-scale ASCI architectures, as well as the Cray T3E and Linux-based cluster systems.

This work describes a performance and scalability analysis of SAGE. One essential result is the development of a performance model that encapsulates the code's crucial performance and scaling characteristics. The performance model has been formulated from an analysis of the code, inspection of key data structures, and analysis of traces gathered at run-time. The model has been validated against a number of ASCI machines with high accuracy. The model is also applied in this work to predict the performance of SAGE on extreme-scale future architectures, such as clusters of SMPs. Included is the application of the model for predicting the performance of the code when algorithmic changes are implemented, such as using a different parallel data decomposition. There are few existing performance studies that extend to full codes (for instance [10]); many tend to consider smaller applications, especially in distributed environments (e.g. [5,7]). This paper represents an example of performance engineering applied to a full-blown code.
SAGE has been analyzed and a performance model proposed and validated on all architectures of interest. The validated model is utilized for point-design studies involving changes in the architectures on which the code is running and in the algorithms utilized in the code. A predictive performance model of another important ASCI application is described in previous work [4].

(c) 2001 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by a contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only. SC2001, November 2001, Denver. (c) 2001 ACM.

2. ESSENTIAL CHARACTERISTICS OF SAGE
In this section the important characteristics of SAGE that affect its performance and scaling behavior are described. In particular, the spatial data decomposition, the scaling of the sub-grid, the communication operations within a code cycle, and the adaptive mesh refinement operations are analyzed. In this work it is assumed that the spatial grid is three dimensional.

2.1 Parallel Spatial Decomposition in SAGE
SAGE uses a spatial discretization of the physical domain utilizing Cartesian grids. This spatial domain is partitioned across processors in sub-grids such that the first processor is assigned the first E cells in the grid (indexed in dimension order X, Y, Z), the second processor is assigned the next E cells, and so on. The assignment is actually done in blocks of 2x2x2 cells, as illustrated in Figure 1(a), where M is the number of blocks in the X dimension. Figure 1(b) illustrates the approximate assignment of cells over processors (PEs). Note that the grid is primarily partitioned in the Z dimension. Each processor contains cells which are either:
a) internal - all neighbor cells are contained on the same PE,
b) boundary - the cells belong to one of the spatial domain's physical boundaries ("faces"), or
c) inter-processor boundary - neighbor cells in physical space belong to sub-grids contained on different PEs (in one or more dimensions).

Figure 1. Cell and block assignment to processors in SAGE: (a) cell and block ordering; (b) example assignment of the spatial grid across four PEs.

A library designed for the communication requirements of the code is used to handle the necessary communications within SAGE. This includes the common MPI operations of allreduce and broadcast, for instance, as well as two main application-specific communication kernels: gather (get data) and scatter (put data) operations. These operations are used when processors require an update of their sub-grid with local cell information and inter-processor boundary data. The library uses a notion of tokens to record where all the necessary data can be found for each individual processor. A token is defined on each processor for cell-centered values and for cell-face values in each data neighbor direction (6 in total), and for the relationships of cells between levels in the AMR operation. Each token contains information on:
- sub-grid boundaries,
- data held locally within a processor,
- data held off-processor (requiring communication), and
- data required off-processor (also requiring communication).

2.2 Scaling of the Sub-grid
The sub-grid volume on each PE is a function of the number of cells per PE, a parameter which is specified in the input deck. SAGE assigns this number of cells to each PE. We are concerned with weak scaling in this analysis, in which the problem size grows proportionally with the number of processors, resulting in each PE doing approximately the same amount of work. The decomposition of the spatial grid across PEs is done in slabs (1-D partitioning), as suggested in Figure 1(b). Each slab is uniquely assigned to one processor. Taking the number of cells in each sub-grid to be E, the total grid volume V in cells is:

V = E·P = L^3    (1)

and the volume of each sub-grid is:

E = l·L^2    (2)

where P is the number of PEs, l is the short side of the slab (in the Z dimension) and L the side of the slab in the X and Y directions (assuming a square grid in the X-Y plane). The surface of the slab, L^2, in the X-Y plane is:

L^2 = V^(2/3) = (E·P)^(2/3)    (3)

From this it can be seen that the surface increases as P^(2/3).
This sub-grid surface is directly proportional to the maximum data size that is communicated between PEs in the data gather and scatter operations. The maximum size of this surface that a PE will contain is constrained by E. In fact, since the assignment of cells to PEs is done in 2x2x2 blocks, the maximum surface is E/2, at which point the slab degenerates to a foil with a thickness of 2 cells. It is possible for the surface of the full spatial domain to be greater than E/2, resulting in each surface being assigned to more than one PE. This leads to physically neighboring data cells being assigned to logically distant PEs, and hence communications will take place between more distant processors. The total communication requirement will remain (E·P)^(2/3), but it will be dealt with by more than one PE.
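To make the weak-scaling geometry of equations (1)-(3) concrete, the short Python sketch below (an illustration, not part of SAGE; the cells-per-PE value is the one assumed in this paper's examples) computes the global grid side, the slab surface, and the slab thickness as the processor count grows:

```python
# Illustrative sketch (not SAGE code): weak-scaling slab geometry following
# equations (1)-(3). E is the number of cells per PE, P the number of PEs.
def slab_geometry(E, P):
    V = E * P                    # eq. (1): total volume V = E*P = L^3
    L = V ** (1.0 / 3.0)         # side of the (assumed cubic) global grid
    surface = L * L              # eq. (3): slab surface L^2 = (E*P)^(2/3)
    thickness = E / surface      # eq. (2): E = l*L^2  =>  l = E / L^2
    return L, surface, thickness

if __name__ == "__main__":
    E = 13500                    # cells per PE assumed in the paper's examples
    for P in (1, 8, 64, 512, 4096):
        L, surf, l = slab_geometry(E, P)
        print(f"P={P:5d}  L={L:7.1f}  slab surface={surf:9.0f} cells  "
              f"thickness={l:6.2f} cells")
```

The printed surfaces grow as P^(2/3) while the slab thickness shrinks correspondingly, which is the behavior exploited in the remainder of the analysis.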

Figure 2. Cross-sections of the spatial grid and the assignment of cells to four increasing numbers of processors.

Examples of the partitioning of the spatial grid across processors using the slab decomposition are shown in Figure 2. The cross-section in the Z dimension is shown, and it is assumed that E = 13,500 cells throughout this work. Note that the volume of the spatial grid scales in accordance with equation (1). It can be seen that at the smaller processor counts each PE holds more than one foil; as the processor count increases, each foil becomes mapped first to two and then to several processors, and the maximum logical distance between PEs on foil boundaries grows accordingly. Consider again the volume of the entire grid:

V = E·P = (l·L^2)·P    (4)

This is partitioned across PEs such that there will be L/(2P) foils of width 2 on each PE, or:

(E·P)^(1/3) / (2P) = (E/(8P^2))^(1/3)    (5)

When this has a value less than one, a processor will contain less than a single foil, i.e. when

P > sqrt(E/8)    (6)

The maximum distance between the processors that hold a foil, termed the PE Distance (PED) here, is:

PED = ceil( (8P^2/E)^(1/3) )    (7)

where ceil denotes the integer ceiling function. The minimum distance between the processors that hold that foil is:

MAX( floor( (8P^2/E)^(1/3) ), 1 )    (8)

Thus, when a processor is not assigned a full foil of the spatial domain, a boundary exchange will involve all the processors that own the boundary, located at a logical distance of up to PED apart. The sub-grid surface L^2, the actual inter-processor boundary area owned by a processor (the PE surface, which results from the slab degenerating to a foil and the boundary subsequently being split amongst the processors within the PED), and the PED itself are shown in Figure 3. It can be seen that the PE surface achieves its maximum of E/2 once P exceeds sqrt(E/8) (equation 6). The sub-grid surface approximately equals the PE surface multiplied by the PED.

It is important to note that the PED is related to the communication requirements of the code; more precisely, it is proportional to the size of the messages generated in order to satisfy each necessary inter-processor boundary exchange. This is a consequence of the slab decomposition and could lead to communication inefficiencies depending on the specific machine topology. A further observation related to the communication pattern is that, when processors are considered to be labeled in a vector-like manner from 0 to P-1, and with P_SMP processors per SMP box, out-of-box communication involves no more than min(PED, P_SMP) pairs of processors. Of course, if the PED is larger than P_SMP, more than two SMP boxes will be involved in the boundary exchange. As an example, on ASCI Blue Mountain at Los Alamos, composed of SGI Origin 2000 boxes, given that P_SMP = 128 and that, from Figure 3(b), the PED does not exceed P_SMP for any reasonable number of PEs, no more than two Origin 2000 boxes will be involved in the communication for one boundary exchange.

Figure 3. SAGE scaling characteristics: (a) sub-grid surface and PE surface sizes; (b) PED, the logical neighbor distance.
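The foil and PE-distance behavior of equations (5)-(7), plotted in Figure 3, can be sketched as follows (again an illustration under the same assumed cells-per-PE value; it is not the paper's model code):

```python
# Illustrative sketch: PE surface and PE Distance (PED) for the slab
# decomposition, following equations (3), (6) and (7). Assumes 2x2x2 blocks.
import math

def slab_ped(E, P):
    subgrid_surface = (E * P) ** (2.0 / 3.0)       # L^2, eq. (3)
    pe_surface = min(subgrid_surface, E / 2.0)     # boundary area owned by one PE
    ped = max(1, math.ceil((8.0 * P * P / E) ** (1.0 / 3.0)))   # eq. (7)
    return subgrid_surface, pe_surface, ped

if __name__ == "__main__":
    E = 13500
    for P in (8, 32, 128, 512, 2048, 8192):
        sub, pe, ped = slab_ped(E, P)
        print(f"P={P:5d}  sub-grid surface={sub:9.0f}  "
              f"PE surface={pe:7.0f}  PED={ped:3d}")
```

As in Figure 3, the PE surface saturates at E/2 once P exceeds sqrt(E/8), while the sub-grid surface (approximately the PE surface multiplied by the PED) keeps growing.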

2.3 An Iteration Cycle of SAGE
The processing stages within a cycle typically involve three operations, which are repeated a number of times dependent on the time interval utilized for the integration of the equations in the code:
i) one (or more) gather operations to obtain a copy of the local and remote neighbor data,
ii) computation in each of the local cells, and
iii) one (or more) scatter operations to update data on remote processors.
These three operations of SAGE directly relate to the surface-to-volume ratio of the code [2]. The first and the third stage define the surface, related to the amount and pattern of communication, while the second stage represents the volume, related to the amount of computation. The gather and scatter operations are performed using the token library described in Section 2.1.

These three operations in a cycle of SAGE are shown schematically in Figure 4. In this example, it is assumed that the number of PEs (P) is 256 and the number of cells per PE (E) is 13,500. A single gather operation in all dimensions is depicted, followed by a processing step, and then a single scatter operation in all dimensions. The communication is shown only for processor n, but in reality all processors perform communications of the same sizes, in the same direction, at the same time. In this example it can be seen that the main communication is in the Z dimension, dealing with the sub-grid surface. The preponderance of communication in the Z dimension is also a consequence of the slab decomposition and is intuitive from Figure 1(b). The message sizes in both directions (HI and LO in SAGE terminology) of the three dimensions are shown in the box on the right side of Figure 4.

Figure 4. Schematic representation of the communication and computation within a cycle of SAGE, consisting of a data gather, processing, and a data scatter (shown for processor n), together with the gather/scatter message sizes (in cells) for the HI and LO directions of each dimension.

In addition to the gather and scatter operations, a number of other communications take place, including several MPI allreduce communications per cycle. A number of broadcast operations also exist, but only during the initialization phase of the code. The frequency of the gather/scatter operations was analyzed using MPI trace data. From this, the number of scatter/gather operations was taken to be 60 real and 7 integer operations per cycle. The surface communications in the Z dimension represent a minority of the total number of messages but over 95% of the total communication time. In addition, a number of allreduce operations take place per cycle, each only a few bytes in size.

2.4 Adaptive Mesh Refinement in SAGE
SAGE performs adaptive mesh refinement operations at the end of each cycle iteration. Each cell in the spatial grid at this point may either be:
- split into a 2x2x2 block of cells, or
- combined with its neighbors, within the same cell block, to form a single cell, or
- left unchanged.
The decision on whether to split or combine cells is determined by the current cell values in the calculation being performed. AMR enables more refined calculations to take place in those areas of the spatial grid characterized by more intense physical phenomena. For example, the shock-wave indicated by the solid line in the 2-D example in Figure 5 may cause the cells associated with it (and close to it) to be split into smaller cells. In this example, cells are represented at a certain level of refinement. A cell at level 0 is not refined, while a cell at level n represents a domain 8^n times smaller than one at level 0 in three dimensions. The adaptive refinement of cells can result in load imbalance across processors, for instance when there is a large degree of activity in a localized region of the spatial grid in comparison to the grid as a whole.
To overcome this, a load-balancing operation is performed at the end of each cycle when the maximum number of cells on any processor is more than 10% above the average number of cells over the processors, i.e. load-balance if

MAX_i(E_i) > 1.1 · (1/P) · SUM_{i=1..P} E_i    (9)

where E_i is the number of cells on processor i. The load-balancing operation takes advantage of the fact that the cells are organized into a one-dimensional logical vector. The cells at level 0 are indexed in X, Y, Z ordering corresponding to positions in the spatial grid. By partitioning this vector into approximately equally sized segments, the number of cells can remain approximately equal among processors. However, the load-balancing process requires the communication of all data values associated with the cells to be moved between processors. This can impact the overall application performance.
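A minimal sketch of the load-balance trigger of equation (9) is shown below; the 10% threshold is the one stated above, and the per-processor cell counts are invented purely for illustration:

```python
# Illustrative sketch of the load-balance trigger, equation (9): rebalance
# when the most loaded PE exceeds the average cell count by more than the
# imbalance threshold.
def needs_load_balance(cells_per_pe, threshold=0.10):
    average = sum(cells_per_pe) / len(cells_per_pe)
    return max(cells_per_pe) > (1.0 + threshold) * average

if __name__ == "__main__":
    balanced   = [13500, 13400, 13600, 13550]   # example counts only
    imbalanced = [13500, 13400, 13600, 16000]
    print(needs_load_balance(balanced))    # -> False
    print(needs_load_balance(imbalanced))  # -> True
```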

Figure 5. AMR example at multiple levels of refinement (levels 0 through 3).

The resulting data decomposition of the spatial grid among processors after this process remains similar to that depicted in Figure 1. However, the surface (in the X-Y plane) of each processor's sub-grid will no longer be of equal size. Since the AMR process is data dependent, each separate calculation using SAGE will result in a different adaptive refinement process, and hence a different performance will result.

3. A PERFORMANCE MODEL OF SAGE
In the analysis that follows in Section 3.1, the main characteristics of SAGE as described in Sections 2.1 to 2.3 are used to construct a performance model, but without AMR. This model is extended in Section 3.2 to include refinement. Applications of the model are illustrated in Section 4.

3.1 SAGE Model without AMR
The communication and computation stages of SAGE are centered around the gather/compute/scatter operations described in Section 2.3. The runtime for one cycle of the code, given that the three stages are not overlapped, can be described as:

T_cycle(P,E) = T_comp(E) + T_memcon(P,E) + T_GS(P,E) + T_allreduce(P)    (10)

where:
  P is the number of PEs,
  E is the number of cells per PE,
  T_comp(E) is the computation time,
  T_GS(P,E) is the gather and scatter communication time,
  T_allreduce(P) is the allreduce communication time, and
  T_memcon(P,E) is the memory contention that may occur between PEs within an SMP box.

The computation time, T_comp(E), is measured from an execution of SAGE on a single PE for a given number of cells E. The gather and scatter communication time is the time taken to provide boundary information by the processors owning the boundary. It depends on the PED (the communication distance described in Section 2.2) and on the sizes of the messages. T_GS(P,E) is modeled as:

T_GS(P,E) = C(P,E) · [ 60·T_c(Surface_Z·MPI_Real8, P) + 7·T_c(Surface_Z·MPI_Int, P)
                     + 60·T_c(Surface_Y·MPI_Real8, P) + 7·T_c(Surface_Y·MPI_Int, P)
                     + 60·T_c(Surface_X·MPI_Real8, P) + 7·T_c(Surface_X·MPI_Int, P) ]    (11)

where C(P,E) is the contention on the processor network when using P processors due to distant processor-neighbor communications (i.e. PED > 1), and T_c(S,P) is the time taken to communicate a message of size S when using P processors in the system. The sizes of the messages Surface_Z, Surface_Y, Surface_X (in words) are determined by the size of the sub-grid mapped to each processor, the number of processors P, and the data decomposition used, as described in Section 2.1. For the slab decomposition, Surface_Z = MIN(L^2, E/2), Surface_Y = 2·L, and Surface_X = 4 words. The sizes of MPI_Real8 and MPI_Int are determined by the MPI implementation. The coefficients multiplying the communication times in equation (11) are the frequencies of the messages, as described in Section 2.3.

A linear model for the communication time is assumed, which uses the latency (L_c) and bandwidth (B_c) of the communication network in the system. The communication latency and bandwidth vary depending on the size of the message and also on the number of processors used (for instance when dealing with in-box or out-of-box communication on an SMP-based machine):

T_c(S,P) = L_c(S,P) + S / B_c(S,P)    (12)

The communication model utilizes the bandwidths and latencies of the communication network observed in a single direction when performing bi-directional communications, as is the case in SAGE. This should not be confused with the peak uni-directional communication performance of the network or with peak measured bandwidths from a performance evaluation exercise (e.g. [11]). The impact of the PED on communication performance depends on the specific network topology.
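The structure of equations (11) and (12) can be sketched as below. The latency and bandwidth numbers, the contention factor, and the boundary sizes used in the example are placeholders standing in for the measured parameters of Table 2; the message frequencies (60 real and 7 integer) are the values described in Section 2.3, and 8-byte reals with assumed 4-byte integers are used for the MPI data type sizes.

```python
# Illustrative sketch of the gather/scatter communication model, equations
# (11) and (12). Latency, bandwidth and surface sizes are placeholders, not
# the measured system parameters of Table 2.
def comm_time(size_bytes, latency_s=5.0e-6, bandwidth_Bps=250.0e6):
    """Linear model, eq. (12): T_c(S,P) = L_c + S / B_c."""
    return latency_s + size_bytes / bandwidth_Bps

def gather_scatter_time(surfaces_cells, contention=1.0,
                        n_real=60, n_int=7, mpi_real8=8, mpi_int=4):
    """Eq. (11): per-direction real and integer boundary messages, scaled by
    the network contention factor C(P,E)."""
    total = 0.0
    for surface in surfaces_cells:          # (Surface_X, Surface_Y, Surface_Z)
        total += n_real * comm_time(surface * mpi_real8)
        total += n_int * comm_time(surface * mpi_int)
    return contention * total

if __name__ == "__main__":
    surfaces = (4, 300, 6750)               # illustrative slab boundary sizes
    print(f"T_GS ~ {gather_scatter_time(surfaces) * 1e3:.2f} ms per cycle")
```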
On a cluster of Compaq ES45 SMPs, the maximum contention from an SMP box occurs when all PEs within the box perform out-of-box sends and each receives from out-of-box PEs. This system's topology is a fat-tree using the Quadrics QsNet [6], as shown in Figure 6. This network is able to handle any logical PED without penalty; hence, for this particular network, there will be no extra overhead due to the physical distance between processors within the PED.

Figure 6. Network topology for a cluster of Compaq SMPs using the Quadrics QsNet fat-tree network.

Other topologies are not contention-free under this communication pattern, for example the Cray T3E, ASCI Red, and ASCI Blue Mountain. Communication involving processors within the PED will be bottlenecked by the lack of physical communication links between processors, which limits the concurrency of messages. For example, in the initial configuration of ASCI Blue Mountain, only a small number of HiPPI channels is used to interconnect the 128-PE SMP boxes, as shown in Figure 7.

Figure 7. Inter-SMP network on ASCI Blue Mountain: SMP boxes of 128 PEs interconnected by HiPPI links.

The contention on the processor network, C(P,E), is modeled as:

C(P,E) = MIN( MAX( L^2 / (PEsurface·CL), 1 ), P_SMP / CL )    (13)

where CL is the number of communication links per node, PEsurface is the inter-processor boundary area owned by a PE (Section 2.2), and P_SMP is the number of PEs per node. Thus the contention has a maximum of the number of PEs within the SMP divided by the number of communication links, i.e. when all PEs perform out-of-box sends and receives. It has a minimum of one, since at least one PE will perform out-of-box communications. This model of the contention on the processor network is optimistic, as it does not take into account possible overhead in the management of multiple communication links within an SMP box.

The time taken to perform the allreduce operations is modeled as:

T_allreduce(P) = N_allreduce · 2 · log2(P) · T_c(S_allreduce, P)    (14)

which consists of log2(P) stages in a binary-tree reduction operation. This is multiplied by 2, since the operation is effectively a reduction followed by a broadcast, and by the number of allreduce operations per SAGE cycle, N_allreduce, each of which communicates a message of only a few bytes (S_allreduce).

The memory contention represents the extra time required per cycle when multiple PEs contend for memory within an SMP. On some systems this can be measured by considering different configurations of processors for the same problem, for instance using all processors within an SMP node or using one processor in each of P_SMP nodes. The difference in execution times can be considered as the additional time due to memory contention. This is modeled as:

T_memcon(P,E) = E · T_mem(P)    (15)

where T_mem(P) is the measured memory contention on P processors, per cell per cycle.

Our overall model contains many inputs, which may be conveniently categorized as application, system, or mapping parameters. These inputs specify a particular design point: a matching of the application, in a particular configuration, with a target system in a particular configuration. The application and mapping parameters can be specified appropriately for the design point based on the input deck of a specific run, while the system parameters need to be measured or otherwise specified for a particular system. The input parameters to the SAGE performance model are listed in Table 1 below.

Table 1: Input parameters to the SAGE performance model.
  Application:
    E - cells per processor
  Mapping:
    Surface_X, Surface_Y, Surface_Z - surface size (in cells) of the sub-grid mapped to each processor, in each of the three dimensions
  System:
    P - number of processors
    P_SMP - processors per SMP box
    CL - communication links per SMP box
    L_c(S,P), B_c(S,P) - latencies and bandwidths achieved in one direction during bi-directional communication
    MPI_Real8, MPI_Int - sizes of the MPI data types
    T_comp(E) - sequential cycle time of SAGE on E cells
    T_mem(P) - memory contention per cell per cycle
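Putting equations (10) and (12)-(15) together, the complete no-AMR cycle model can be sketched as below. Every numeric input (single-PE cycle time, memory-contention cost, latency, bandwidth, allreduce count, X and Y boundary sizes, PEs per node, links per node) is a placeholder standing in for the measured parameters of Tables 1 and 2, so the printed times are purely illustrative:

```python
# Illustrative sketch of the no-AMR cycle model, equations (10) and (12)-(15).
# All numeric parameters are placeholders, not measured system values.
import math

def comm_time(size_bytes, latency_s, bandwidth_Bps):
    return latency_s + size_bytes / bandwidth_Bps                  # eq. (12)

def contention(L2, pe_surface, P_smp, links):                      # eq. (13)
    return min(max(L2 / (pe_surface * links), 1.0), P_smp / links)

def cycle_time(P, E, *, t_comp, t_mem, xy_surfaces=(4.0, 300.0),
               n_real=60, n_int=7, n_allreduce=20, allreduce_bytes=8,
               P_smp=4, links=1, latency_s=5.0e-6, bandwidth_Bps=250.0e6):
    L2 = (E * P) ** (2.0 / 3.0)
    pe_surface = min(L2, E / 2.0)            # Surface_Z = MIN(L^2, E/2)
    surfaces = (*xy_surfaces, pe_surface)
    C = contention(L2, pe_surface, P_smp, links)
    t_gs = C * sum(n_real * comm_time(s * 8, latency_s, bandwidth_Bps) +
                   n_int * comm_time(s * 4, latency_s, bandwidth_Bps)
                   for s in surfaces)                              # eq. (11)
    t_ar = n_allreduce * 2 * math.log2(P) * \
           comm_time(allreduce_bytes, latency_s, bandwidth_Bps)    # eq. (14)
    return t_comp + E * t_mem + t_gs + t_ar                        # eqs. (10), (15)

if __name__ == "__main__":
    for P in (16, 256, 4096):
        t = cycle_time(P, 13500, t_comp=1.0, t_mem=2.0e-6)
        print(f"P={P:5d}  predicted cycle time ~ {t:.3f} s")
```

A real instantiation of the model would replace the placeholder constants with the per-system, per-message-size values of Table 2.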
3.2 SAGE Model with AMR
The adaptive mesh refinement process in SAGE is performed at the end of each cycle, as described in Section 2.4. The AMR operation is triggered by one of several thresholds on the physical quantities contained in the cells, and is thus heavily dependent on the calculation being performed. In order to model this operation accurately, information on the AMR process resulting from the particular calculation is required. This includes:

- the number of new cells added in a cycle,
- the current cell division factor (the total number of cells divided by the number of level-0 cells), and
- the movement of cells between processors to load-balance.

For a particular calculation this information needs to be defined on a per-cycle basis. For example, a calculation that results in intense physical phenomena in a localized area of the spatial grid will require more time to load-balance (see Section 2.4) than a calculation with uniformly distributed phenomena. The performance model of SAGE presented in Section 3.1 can be extended to include the main characteristics of the adaptive mesh refinement process. A model that includes these operations is:

T_cycle_i(P,E,D,A,M_cm) = T_comp(E·D_i) + T_memcon(P, E·D_i) + T_allreduce(P) + T_GS(P,E,D_i)
                          + T_divide(A_i) + T_combine(E·D_i) + T_load(M_cm_i, P)    (16)

The main additional components in this model, in comparison to that defined previously in equation (10), are:
  T_divide(A_i) - the time to divide cells in the current cycle,
  T_combine(E·D_i) - the time to combine cells in the current cycle, and
  T_load(M_cm_i, P) - the time to perform the load-balancing.

In addition, three parameters are included in this model that define characteristics of the calculation being performed. Each represents a time history of values defined on a cycle-by-cycle basis:
  D - a vector containing the cell division factor [1 .. 8^maxlevel],
  A - a vector containing the maximum number of cells added (over all processors) through the division process, and
  M_cm - a vector containing the maximum number of cells moved between any two PEs in the load balancing.

By defining these inputs on a per-cycle basis, the model can accurately encompass the change in computation time and communication time due to the change in the amount of cell division. For instance, the computation time will scale in proportion to the amount of cell division (the volume of the sub-grid), whereas the size of the communications for the gather and scatter operations will scale as the 2/3 power of the cell division (the surface of the sub-grid). This model is not described in any further detail in this work. An application of the model with adaption is illustrated in Section 4.3.

Table 2. System parameters used in the validation for each system (AlphaServer ES45, AlphaServer ES40, ASCI Blue Mountain, and ASCI White): P_SMP (PEs per node), the number of nodes used in the validation, CL (communication links per node), the single-PE cycle time T_comp(E), the latency L_c(S,P) (in microseconds) and inverse bandwidth 1/B_c(S,P) (in ns per byte), each specified piecewise by message size S and by in-box versus out-of-box communication, and the memory contention T_mem(P) (in microseconds).
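A sketch of the AMR-extended cycle model of equation (16) is given below. The per-cell cost coefficients and the stand-in for the base (no-AMR) model are assumptions chosen only to make the sketch runnable; they are not fitted or measured values:

```python
# Illustrative sketch of the AMR-extended cycle model, equation (16).
# D, A and M_cm are the per-cycle histories defined above; the per-cell cost
# coefficients and the base-cycle stand-in are placeholder assumptions.
def amr_cycle_time(i, P, E, D, A, M_cm, base_cycle,
                   t_divide_per_cell=1.0e-6,
                   t_combine_per_cell=2.0e-7,
                   t_move_per_cell=4.0e-6):
    E_eff = E * D[i]                  # effective cells per PE in cycle i
    t = base_cycle(P, E_eff)          # comp + memcon + gather/scatter + allreduce
    t += t_divide_per_cell * A[i]     # T_divide(A_i)
    t += t_combine_per_cell * E_eff   # T_combine(E*D_i)
    t += t_move_per_cell * M_cm[i]    # T_load(M_cm_i, P): cells moved to rebalance
    return t

if __name__ == "__main__":
    # Trivial stand-in for the no-AMR model of equation (10).
    base = lambda P, E_eff: 1.0e-4 * E_eff ** (2.0 / 3.0) + 7.4e-5 * E_eff
    D = [1.0, 1.2, 1.5, 2.0, 2.6]     # cell division factor history
    A = [0, 2700, 4100, 6700, 8100]   # max cells added per cycle
    M = [0, 0, 0, 0, 30000]           # cells moved when load-balancing occurs
    for i in range(len(D)):
        print(f"cycle {i}: {amr_cycle_time(i, 256, 13500, D, A, M, base):.3f} s")
```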

4. APPLICATION OF THE MODEL
In this section the SAGE performance model described in Section 3 is validated against four existing architectures and is also applied to predicting performance on future architectures. In addition, the performance of SAGE is investigated given algorithmic changes that could be implemented in the code. An example of predicting the performance of SAGE with the adaptive mesh refinement process is also illustrated.

4.1 Validation and Performance Prediction on Future Architectures
The model presented in Section 3 has been validated against measurements taken on a Compaq AlphaServer ES45, an AlphaServer ES40, ASCI Blue Mountain (SGI Origin 2000), and from a preliminary performance analysis of ASCI White (IBM SP3). A detailed performance study of ASCI White can be found in [8]. The input parameters to the SAGE performance model are listed in Table 2. This includes the number of nodes used in the validation along with the system parameters used in the model for each architecture. The parameters are either measured or otherwise specified. The comparison is performed with the default slab decomposition in SAGE, as described in Section 2.1.

For each system, the time taken to perform one cycle of SAGE as given by the performance model is compared to measurements. In the case of the two Compaq systems, predictions are given for systems larger than those available in the measurement process. The Compaq AlphaServer ES45 cluster that we had access to was very small, but a larger system of this kind is being installed at Los Alamos National Laboratory. Without an available large-scale system, the performance model is able to provide an expectation of the performance on such a future architecture.

The comparison of predictions and measurements on the four systems is shown in Figure 8. The predictions from the performance model show high accuracy, mostly within 10% of the actual measurements. A comparison of the cell-cycles per second achieved by SAGE is shown in Figure 9. This metric is used by SAGE as a further indication of performance; it represents the number of cells that can be processed in each unit of wall-clock time. In Figure 9, measurements are used for the Cray T3E, ASCI White, and ASCI Blue Mountain, whereas we predict the performance of the Compaq system using our model.

Figure 8. Comparison of predictions from the SAGE performance model with measurements of the time per cycle: (a) Compaq AlphaServer ES45; (b) Compaq AlphaServer ES40; (c) ASCI Blue Mountain (SGI Origin 2000); (d) ASCI White (IBM SP3).

Figure 9. Comparison of the performance of SAGE (in cell-cycles per second) across several systems: AlphaServer ES45, Cray T3E, ASCI White, and ASCI Blue Mountain.

The model predicts the performance of the AlphaServer ES45 to be consistently greater than that of ASCI White (IBM SP3) on a comparable number of processors. A system with a peak performance of 30 Tflops, composed of the Compaq SMP boxes with Quadrics QsNet, would achieve roughly an order of magnitude or more greater SAGE performance than that achieved to date on ASCI Blue Mountain with 6000 SGI Origin 2000 processors. By comparison, the ratio of peak speeds is approximately 10.

4.2 Performance Prediction on Algorithmic Transformations: an Alternative Data Decomposition
The surface-to-volume ratio of the processing in SAGE is dependent on the grid decomposition. There is a large difference between the use of the slab decomposition (Figure 1) and a cube decomposition (Figure 10). Where the slab decomposition results in communications scaling as the 2/3 power of the number of PEs, as shown by equation (3), with a cube decomposition the communication size will remain approximately constant, though the number of PE pairs communicating will be larger. It can easily be shown (see for example [2]) that the surface-to-volume ratio (i.e. the communication-to-computation ratio) gets better (i.e. smaller) as the aspect ratio of the sub-grids changes towards being perfect cubes, as suggested in Figure 10. Of course, a perfect cubic decomposition can only be achieved when the number of processors is a cubic power, as is the decomposition on 8 processors shown in Figure 10.

Figure 10. Possible 3-D data decomposition configurations for 2, 4, and 8 processors.

A comparison between the cube decomposition and the slab decomposition is shown in Figure 11. The total surface of the sub-grid on an individual PE is plotted, which is proportional to the communication that takes place in each gather (and scatter) operation. The PE distance (PED) is also shown in Figure 11(b); the curves for the slab decomposition have already been presented in Figure 3. The PED for the slab decomposition in the X and Y dimensions is always equal to 1. For the cube decomposition the PED is always equal to 1 in the X dimension, but varies in the Y and Z dimensions. The communication size using the cube decomposition is considerably smaller than that for the slab, but the PED is considerably larger.

Figure 11. Comparison of the slab and cube decompositions: (a) total sub-grid surface sizes; (b) PE distance (PED) for the slab (Z) and cube (Y and Z) decompositions.

A comparison between the expected performance of SAGE using the cube decomposition and the current slab decomposition on the Compaq AlphaServer ES45 and on ASCI Blue Mountain is shown in Figure 12. This is achieved by modifying the parameters Surface_X, Surface_Y, and Surface_Z in equation (11) to represent the sub-grid surface sizes of the cube decomposition. For an ideal cube these would all be equal to (L/P^(1/3))^2. The use of the cube decomposition reduces the communication requirements and hence results in an expected performance improvement of 35% on the Compaq system, and a processor-count-dependent improvement on the SGI system, compared with the use of slabs.
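The reduction in boundary size when moving from slabs to cubes can be sketched numerically, as below. This mirrors the surface-size comparison of Figure 11(a) using the ideal (L/P^(1/3))^2 face size quoted above; the slab expressions are those of Section 3.1:

```python
# Illustrative sketch: total boundary surface per PE for the slab and the
# ideal cube decompositions, as compared in Figure 11(a).
def slab_total_surface(E, P):
    L = (E * P) ** (1.0 / 3.0)
    surface_z = min(L * L, E / 2.0)      # dominant Z boundary
    surface_y = 2.0 * L                  # one block row, 2 cells thick
    surface_x = 4.0                      # a single 2x2 block face
    return surface_x + surface_y + surface_z

def cube_total_surface(E, P):
    face = E ** (2.0 / 3.0)              # (L / P^(1/3))^2 = E^(2/3)
    return 3.0 * face                    # one face size per dimension

if __name__ == "__main__":
    E = 13500
    for P in (8, 64, 512, 4096):
        print(f"P={P:5d}  slab={slab_total_surface(E, P):8.0f} cells  "
              f"cube={cube_total_surface(E, P):7.0f} cells")
```

Under weak scaling the cube boundary stays constant while the slab boundary grows as P^(2/3), which is the source of the bandwidth reduction discussed next.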

Figure 12. Performance comparison of the slab and cube decompositions: (a) Compaq AlphaServer ES45; (b) ASCI Blue Mountain.

Figure 13. Time-component predictions (Compaq ES45): (a) slab decomposition; (b) cube decomposition.

The performance model can also be used to provide insight into where time is spent within the application. In Figure 13, time components representing computation (including memory), communication latency, and communication bandwidth are shown for both data decomposition schemes of SAGE on the Compaq AlphaServer ES45. It can be clearly seen that the communication bandwidth component is much reduced when using the cube decomposition, while both the computation and the communication latency components remain mostly unchanged. SAGE could benefit from a cube decomposition of the full grid if the communication network within the machine is able to handle the large logical PEDs without performance penalty. This is true of the fat-tree topology of the Quadrics network used on the cluster of Compaq SMPs, as described in Section 3.1.

4.3 Extending the Performance Model to AMR
The model can be used to explore the performance of different characteristic adaptive mesh refinement calculations on different architectures. In Figure 14, an example time history for each of the cell division factor (D), the maximum cells added in a cycle (A), and the maximum cells moved for load balancing (M_cm) is shown. The example time histories attempt to depict the situation in which a shock-wave propagates through the spatial grid. This results in the following characteristics:
- the cell division factor gradually increases, since cells are added in every cycle as the shock-wave expands, and
- load-balancing is assumed to take place every fifth cycle (a simple construction of such histories is sketched below).
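A sketch of how such time histories might be constructed and indexed by cycle is shown below; the growth rate, the fraction of cells moved, and the five-cycle load-balancing period are example assumptions in the spirit of the description above, not data from a real calculation:

```python
# Illustrative sketch: build example per-cycle histories D (division factor),
# A (max cells added) and M_cm (max cells moved), in the spirit of Figure 14.
def example_histories(n_cycles, E=13500, growth=0.03, lb_period=5):
    D, A, M_cm = [1.0], [0], [0]
    for i in range(1, n_cycles):
        added = growth * E * D[-1]         # cells added as the shock expands
        D.append(D[-1] + added / E)        # division factor rises accordingly
        A.append(int(added))
        # every lb_period cycles, assume a fraction of a PE's cells is moved
        M_cm.append(int(0.2 * E * D[-1]) if i % lb_period == 0 else 0)
    return D, A, M_cm

if __name__ == "__main__":
    D, A, M = example_histories(20)
    for i in (0, 5, 10, 15, 19):
        print(f"cycle {i:2d}: D={D[i]:.2f}  added={A[i]:5d}  moved={M[i]:5d}")
```

These vectors are exactly the D, A, and M_cm inputs of the AMR-extended model in equation (16).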

Figure 14. Example time histories for the cell division factor, cells added, and cells load-balanced (indexed by cycle).

The time histories depicted in Figure 14 were used in the performance model in order to investigate both the variation in cycle time during the calculation (Figure 15(a)) and the total time taken to perform the cycles while scaling the number of processors (Figure 15(b)). This was undertaken for the Compaq AlphaServer ES40. It can be seen from Figure 15(a) that cycles requiring load-balancing take slightly longer than those without.

Figure 15. Performance prediction of SAGE with AMR, using the input histories from Figure 14: (a) time taken for each cycle; (b) example adaption scalability (total time for all cycles).

5. SUMMARY AND CONCLUSIONS
In this paper we have presented a predictive performance and scalability model for an important application from the ASCI workload. The model takes into account the main computation and communication characteristics of the entire code. The model proposed was validated on two large-scale ASCI architectures, ASCI White (IBM SP3) and ASCI Blue Mountain (SGI Origin 2000), showing very good accuracy. The model was then utilized to predict the performance of SAGE on future architectures and also when using an alternative parallel data decomposition.

We believe that performance modeling is the key to building performance-engineered applications and architectures. To this end, the work presented in this paper represents one of a very few existing performance models of entire applications. Like our previous performance model of a particle transport application [4], the model incorporates information from various levels of the benchmark hierarchy [3] and is parametric: basic machine performance numbers (latency, computational rate, bandwidth) and application characteristics (problem size, decomposition method, etc.) serve as input. Such a model adds insight into the performance of current systems, revealing bottlenecks and showing where tuning efforts would be most effective. It also allows prediction of performance on future systems. The latter is important for both application and system architecture design, as well as for the procurement of supercomputer architectures.

A performance model is meant to be updated, refined, and further validated as new factors come into play. The work performed in this report was primarily concerned with the analysis of SAGE in the absence of grid adaptation. With additional analysis, the model has been extended to include the main characteristics of the adaptation process. The performance model can be used to investigate performance on alternative application configurations (data decompositions) and on alternative target systems.

ACKNOWLEDGEMENTS
This work was supported by funding from the Los Alamos Computer Science Institute. We would like to thank Ed Benson for access and support on the AlphaServer ES45 at Compaq in Marlborough, MA. Los Alamos National Laboratory is operated by the University of California for the National Nuclear Security Administration of the US Department of Energy.

6. REFERENCES
[1] Culler, D.E., Singh, J.P., Gupta, A., Parallel Computer Architecture, Morgan Kaufmann, 1999.
[2] Goedecker, S., Hoisie, A., Performance Optimization of Numerically Intensive Codes, SIAM Press, 2001.
[3] Hockney, R., Berry, M. (Eds.), Public International Benchmarks for Parallel Computers, Scientific Programming, Vol. 3, 1994.

[4] Hoisie, A., Lubeck, O., Wasserman, H., Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications, Int. J. of High Performance Computing Applications, Vol. 14, No. 4, Winter 2000.
[5] Nudd, G.R., Kerbyson, D.J., et al., PACE: A Toolset for the Performance Prediction of Parallel and Distributed Systems, Int. J. of High Performance Computing Applications, Vol. 14, No. 3, Fall 2000.
[6] Petrini, F., Feng, W., Hoisie, A., Coll, S., Frachtenberg, E., The Quadrics Network (QsNet): High-Performance Clustering Technology, in Hot Interconnects 9, Stanford, CA, August 2001.
[7] Rauber, T., Rünger, G., Modeling the Runtime of Scientific Programs on Parallel Computers, in Proc. 2000 ICPP Workshops, IEEE Computer Society, 2000.
[8] de Supinski, B.R., The ASCI PSE Milepost: Run-Time Systems Performance Tests, in Proc. Int. Conf. on Parallel & Distributed Processing Techniques & Applications, Las Vegas, June 2001.
[9] Weaver, R., Major 3-D Parallel Simulations, BITS - Computing and Communications News, Los Alamos National Laboratory, June/July 1999.
[10] Worley, P.H., Performance Tuning and Evaluation of a Parallel Community Climate Model, in Proc. SC99, Portland, Oregon, November 1999.
[11] Worley, P.H., Performance Evaluation of the IBM SP and the Compaq AlphaServer SC, in Proc. ICS 2000, ACM, 2000.


More information

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS

A SIMULATOR FOR LOAD BALANCING ANALYSIS IN DISTRIBUTED SYSTEMS Mihai Horia Zaharia, Florin Leon, Dan Galea (3) A Simulator for Load Balancing Analysis in Distributed Systems in A. Valachi, D. Galea, A. M. Florea, M. Craus (eds.) - Tehnologii informationale, Editura

More information

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Gabriele Jost and Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 {gjost,hjin}@nas.nasa.gov

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

Supercomputing applied to Parallel Network Simulation

Supercomputing applied to Parallel Network Simulation Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern

Modeling Parallel Applications for Scalability Analysis: An approach to predict the communication pattern Int'l Conf. Par. and Dist. Proc. Tech. and Appl. PDPTA'15 191 Modeling Parallel Applications for calability Analysis: An approach to predict the communication pattern Javier Panadero 1, Alvaro Wong 1,

More information

Performance of the JMA NWP models on the PC cluster TSUBAME.

Performance of the JMA NWP models on the PC cluster TSUBAME. Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)

More information

Scalable Parallel Clustering for Data Mining on Multicomputers

Scalable Parallel Clustering for Data Mining on Multicomputers Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISI-CNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This

More information

A Framework For Application Performance Understanding and Prediction

A Framework For Application Performance Understanding and Prediction A Framework For Application Performance Understanding and Prediction Laura Carrington Ph.D. Lab (Performance Modeling & Characterization) at the 1 About us An NSF lab see www.sdsc.edu/ The mission of the

More information

Revenue Management for Transportation Problems

Revenue Management for Transportation Problems Revenue Management for Transportation Problems Francesca Guerriero Giovanna Miglionico Filomena Olivito Department of Electronic Informatics and Systems, University of Calabria Via P. Bucci, 87036 Rende

More information

Algorithms of Scientific Computing II

Algorithms of Scientific Computing II Technische Universität München WS 2010/2011 Institut für Informatik Prof. Dr. Hans-Joachim Bungartz Alexander Heinecke, M.Sc., M.Sc.w.H. Algorithms of Scientific Computing II Exercise 4 - Hardware-aware

More information

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Source Code Transformations Strategies to Load-balance Grid Applications

Source Code Transformations Strategies to Load-balance Grid Applications Source Code Transformations Strategies to Load-balance Grid Applications Romaric David, Stéphane Genaud, Arnaud Giersch, Benjamin Schwarz, and Éric Violard LSIIT-ICPS, Université Louis Pasteur, Bd S. Brant,

More information

HPC Deployment of OpenFOAM in an Industrial Setting

HPC Deployment of OpenFOAM in an Industrial Setting HPC Deployment of OpenFOAM in an Industrial Setting Hrvoje Jasak h.jasak@wikki.co.uk Wikki Ltd, United Kingdom PRACE Seminar: Industrial Usage of HPC Stockholm, Sweden, 28-29 March 2011 HPC Deployment

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

Impacts of Operating Systems on the Scalability of Parallel Applications

Impacts of Operating Systems on the Scalability of Parallel Applications UCRL-MI-202629 LAWRENCE LIVERMORE NATIONAL LABORATORY Impacts of Operating Systems on the Scalability of Parallel Applications T.R. Jones L.B. Brenner J.M. Fier March 5, 2003 This document was prepared

More information

An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations. Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1

An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations. Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1 An Analytical Framework for Particle and Volume Data of Large-Scale Combustion Simulations Franz Sauer 1, Hongfeng Yu 2, Kwan-Liu Ma 1 1 University of California, Davis 2 University of Nebraska, Lincoln

More information

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling

FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling Center for Information Services and High Performance Computing (ZIH) FD4: A Framework for Highly Scalable Dynamic Load Balancing and Model Coupling Symposium on HPC and Data-Intensive Applications in Earth

More information

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M. SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA mbriat@esri.com, jmonnot@esri.com,

More information

Performance Tuning of a CFD Code on the Earth Simulator

Performance Tuning of a CFD Code on the Earth Simulator Applications on HPC Special Issue on High Performance Computing Performance Tuning of a CFD Code on the Earth Simulator By Ken ichi ITAKURA,* Atsuya UNO,* Mitsuo YOKOKAWA, Minoru SAITO, Takashi ISHIHARA

More information

Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing

Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing A. Cortés, A. Ripoll, M.A. Senar and E. Luque Computer Architecture and Operating Systems Group Universitat Autònoma

More information

Multilevel Load Balancing in NUMA Computers

Multilevel Load Balancing in NUMA Computers FACULDADE DE INFORMÁTICA PUCRS - Brazil http://www.pucrs.br/inf/pos/ Multilevel Load Balancing in NUMA Computers M. Corrêa, R. Chanin, A. Sales, R. Scheer, A. Zorzo Technical Report Series Number 049 July,

More information

Influence of Load Balancing on Quality of Real Time Data Transmission*

Influence of Load Balancing on Quality of Real Time Data Transmission* SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 6, No. 3, December 2009, 515-524 UDK: 004.738.2 Influence of Load Balancing on Quality of Real Time Data Transmission* Nataša Maksić 1,a, Petar Knežević 2,

More information

LOAD BALANCING AS A STRATEGY LEARNING TASK

LOAD BALANCING AS A STRATEGY LEARNING TASK LOAD BALANCING AS A STRATEGY LEARNING TASK 1 K.KUNGUMARAJ, 2 T.RAVICHANDRAN 1 Research Scholar, Karpagam University, Coimbatore 21. 2 Principal, Hindusthan Institute of Technology, Coimbatore 32. ABSTRACT

More information

Hank Childs, University of Oregon

Hank Childs, University of Oregon Exascale Analysis & Visualization: Get Ready For a Whole New World Sept. 16, 2015 Hank Childs, University of Oregon Before I forget VisIt: visualization and analysis for very big data DOE Workshop for

More information

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two

18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Resource Allocation Schemes for Gang Scheduling

Resource Allocation Schemes for Gang Scheduling Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Cellular Computing on a Linux Cluster

Cellular Computing on a Linux Cluster Cellular Computing on a Linux Cluster Alexei Agueev, Bernd Däne, Wolfgang Fengler TU Ilmenau, Department of Computer Architecture Topics 1. Cellular Computing 2. The Experiment 3. Experimental Results

More information

Grid Scheduling Dictionary of Terms and Keywords

Grid Scheduling Dictionary of Terms and Keywords Grid Scheduling Dictionary Working Group M. Roehrig, Sandia National Laboratories W. Ziegler, Fraunhofer-Institute for Algorithms and Scientific Computing Document: Category: Informational June 2002 Status

More information

From Hypercubes to Dragonflies a short history of interconnect

From Hypercubes to Dragonflies a short history of interconnect From Hypercubes to Dragonflies a short history of interconnect William J. Dally Computer Science Department Stanford University IAA Workshop July 21, 2008 IAA: # Outline The low-radix era High-radix routers

More information

Optimizing a ëcontent-aware" Load Balancing Strategy for Shared Web Hosting Service Ludmila Cherkasova Hewlett-Packard Laboratories 1501 Page Mill Road, Palo Alto, CA 94303 cherkasova@hpl.hp.com Shankar

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

A NOVEL RESOURCE EFFICIENT DMMS APPROACH

A NOVEL RESOURCE EFFICIENT DMMS APPROACH A NOVEL RESOURCE EFFICIENT DMMS APPROACH FOR NETWORK MONITORING AND CONTROLLING FUNCTIONS Golam R. Khan 1, Sharmistha Khan 2, Dhadesugoor R. Vaman 3, and Suxia Cui 4 Department of Electrical and Computer

More information

Operating System Multilevel Load Balancing

Operating System Multilevel Load Balancing Operating System Multilevel Load Balancing M. Corrêa, A. Zorzo Faculty of Informatics - PUCRS Porto Alegre, Brazil {mcorrea, zorzo}@inf.pucrs.br R. Scheer HP Brazil R&D Porto Alegre, Brazil roque.scheer@hp.com

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

1 Bull, 2011 Bull Extreme Computing

1 Bull, 2011 Bull Extreme Computing 1 Bull, 2011 Bull Extreme Computing Table of Contents HPC Overview. Cluster Overview. FLOPS. 2 Bull, 2011 Bull Extreme Computing HPC Overview Ares, Gerardo, HPC Team HPC concepts HPC: High Performance

More information

THE NAS KERNEL BENCHMARK PROGRAM

THE NAS KERNEL BENCHMARK PROGRAM THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

Recommendations for Performance Benchmarking

Recommendations for Performance Benchmarking Recommendations for Performance Benchmarking Shikhar Puri Abstract Performance benchmarking of applications is increasingly becoming essential before deployment. This paper covers recommendations and best

More information

Determining optimal window size for texture feature extraction methods

Determining optimal window size for texture feature extraction methods IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec

More information

OpenMosix Presented by Dr. Moshe Bar and MAASK [01]

OpenMosix Presented by Dr. Moshe Bar and MAASK [01] OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive

More information

FPGA area allocation for parallel C applications

FPGA area allocation for parallel C applications 1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information