Franck Cappello and Daniel Etiemble LRI, Université Paris-Sud, 91405, Orsay, France

Transcription

1 MPI versus MPI+OenMP on the IBM SP for the NAS Benchmarks Franck Caello and Daniel Etiemble LRI, Université Paris-Sud, 945, Orsay, France Abstract The hybrid memory model of clusters of multirocessors raises two issues: rogramming model and erformance. Many arallel rograms have been written by using the MPI standard. To evaluate the ertinence of hybrid models for existing MPI codes, we comare a unified model (MPI) and a hybrid one (OenMP fine grain arallelization after rofiling) for the NAS 2.3 benchmarks on two IBM SP systems. The sueriority of one model deends on ) the level of shared memory model arallelization, 2) the communication atterns and 3) the memory access atterns. The relative seeds of the main architecture comonents (CPU, memory, and network) are of tremendous imortance for selecting one model. With the used hybrid model, our results show that a unified MPI aroach is better for most of the benchmarks. The hybrid aroach becomes better only when fast rocessors make the communication erformance significant and the level of arallelization is sufficient. Introduction Some rimary suercomuter manufacturers like IBM and Comaq use CLUsters of MUltiProcessors (CLUMP) to rovide scalable high erformance comuters. The future ASCI White machine corresonds to this architecture with 52 6-way SMP nodes. CLUMPs may use different memory models for the interconnection of the SMP nodes: NUMA or Message-Passing. The SGI Origin comuters use the NUMA aroach as well as the Convex Exemlar ones. The IBM SPs and the Comaq SC Clusters use message-assing. In this aer, we focus on CLUMPs with a hybrid memory model (shared memory inside nodes and message assing between nodes) at the hardware level. Two main issues should be addressed for selecting a rogramming model: the ease of use and erformance. First, we must choose between a unified rogramming model or a hybrid one. In the unified rogramming models, the rogrammer uses a single API to describe the communications inside the multirocessor nodes and between them. All the message assing or DSM systems belong to this category. The hybrid memory models mix shared memory inside the multirocessor and message assing between the nodes. MPI+OenMP, MPI+threads are two hybrid models. Because the hybrid rogramming models require managing two different memory models, they are much more comlex for the rogrammer. Also, many existing alications use a MPI imlementation. For these alications, the relevance of the hybrid rogramming models must be carefully examined because, as we will demonstrate, even otimized hybrid codes may rovide insignificant erformance imrovement comared to the original MPI version. The second issue is erformance. It deends on at least three arameters: a) the sharing of communication suort (memory system and network interface) between the rocessors for the unified model i.e. how the er rocessor latency and bandwidth evolve when several rocessors use a same network interface (and rotocol), b) the degree of shared memory arallelism for the hybrid model and c) the seed-u on the arallel section for both models that can be different because loo nests may be executed in a different way. Section 2 resents the rogramming models and the methodology. Section 3 comares the erformance of MPI unified model and the MPI+ OenMP hybrid model for the NAS 2.3 Benchmarks. We resent detailed results for different cluster sizes and data set sizes (CLASS A and CLASS B) in section 4. Section 5 resents related works. Finally, section 6 concludes. 2 Programming models and methodology In this aer, we focus on existing MPI rograms that have been develoed for traditional arallel machines. Because most of the manufacturers rovide extended versions of their communication library for clusters of multirocessors, existing MPI codes can be directly used with a unified MPI model. The alternative is mixing MPI with a shared /2/$. (c) 2 IEEE.

2 memory model such as OenMP. In that case, different ossibilities exist, which must be comared according to the erformance and rogramming effort trade-off. 2. Programming efforts for mixing MPI + OenMP 2.. Fine-grain arallelization From an existing MPI code, the simlest aroach is the incremental one: it consists in OenMP arallelization of the loo nests in the comutation art of the MPI code. This aroach is also called OenMP fine-grain or loo level arallelization. Several otions can be used according to ) the rogramming effort and 2) the choice of the loo nests to arallelize. Several levels of rogramming effort can be used. First ossibility consists in arallelizing loo nests in the comutation art of the MPI code without any manual otimization. Only the correctness of the arallel version versus the sequential version semantic is checked. But the incremental aroach can be significantly imroved by alying several manual otimizations (loo ermutation, loo exchange, use of temorary variables). These otimizations are required ) to transform non arallel loo nests into arallel ones and 2) to imrove the arallel efficiency by avoiding false sharing or reducing the number of synchronization oints (critical sections, barriers). Another issue is the choice of the loo nests to arallelize. One otion is to arallelize all loo nests. This otion has two drawbacks: it increases the rogramming effort and the arallelization of loo nests that doesn t contribute significantly to the global execution time can be counterroductive. The alternative consists in selecting by rofiling the loo nests that contribute significantly to the global execution time. In this work, we have selected the loos to arallelize according to the framework shown in figure. The comlete descrition of intra-node arallelization rocess for the NAS arallel benchmarks is described in []. MPI code Execute with a rofiler Build the calling grah Select loo to arallelize and insert directives Candidate MPI+OenMP code for SMP nodes Remove or refine arallelization Check semantic and seedu False or Slow d own Correct Figure. The arallelization framework 2..2 Coarse-grain arallelization Correct MPI+OenMP code for SMP nodes Instead of alying a two level arallelization (rocess level and loo level), another currently investigated aroach is the coarse-grain OenMP SPMD arallelization. In this aroach, OenMP is still used to take advantage of the shared memory inside the SMP nodes but a SPMD rogramming style is used instead of the traditional shared memory multithread aroach. In this mode, OenMP is used to sawn N threads in the main rogram, having each thread act similarly to a MPI rocess. The OenMP Parallel directive is used at the outermost level of the rogram. The rincile is to sawn the threads just after the sawn of the MPI rocesses (some initializations may searate the two sawns). As for the message assing SPMD aroach, the rogrammer must take care of several issues: array distribution among threads, work distribution among threads and coordination between threads. Since the array distribution is done assuming a shared memory, the distribution of the arrays only concerns the attribution of different array regions to the different running threads. For maximum erformance, these regions should not overla for write references. The work distribution is made according to the array distribution. Tyically, the OenMP DO directive is not used for distributing the loo iterations among threads. Instead, the rogrammer inserts some calculations of the loo boundaries that deends on the thread number. Coordinating the threads involves managing critical sections (I/O, MPI calls) using either OenMP directives like Master or thread library calls like om get thread num() to guard conditional statements. Few results have been ublished on the coarse-grain mixed OenMP+MPI arallelization. In this aer, we only consider the fine grain incremental aroach including manual otimizations and rofiling to choice the loo nests to arallelize. The comarison results between MPI and MPI+OenMP erformance resented in this aer are only valid for this aroach. 2.2 MPI versus MPI+OenMP MPI The MPI rocesses within nodes communicates through the memory (MP-SHARED-MEMORY otion on the IBM SP) without using the network interface. The user sace mode has been used for highest ossible erformance. With this model, existing MPI codes run without modification (excet for time measurements). MPI+OenMP Parallelizing an alication for the SPMD MPI aradigm often roduces a rogram with the tyical layout resented in the left art of figure 2. It corresonds to the arallelization for a cluster of unirocessor nodes, where a MPI rocess is allocated on each node. The right art shows how the comutation art of the MPI rocess is slit into threads by using OenMP directives. The number of threads corresonds to the number of CPUs in each node. In Figure 2, P is the art of MPI code that can- 2

3 S P C Sy S Message assing (MPI) seq art init MPI init // art com. comm. sync end // art seq art Message assing + shared memory (OenMP) seq art init SMP art unarallel. com. arallel. com. sync end SMP art P P Pn instructions. The -qcache=auto otion lets the comiler decides whether otimizing or not according to the features of memory hierarchy. For OenMP, the following otions are needed: -qsm=noauto:schedule=static. It means that the comiler must not arallelize automatically, but obey the directives (OenMP or other ones). The scheduling otion indicates how the loo iterations must be distributed among the different threads. The static otion without argument means a block distribution with a block size of I/n, where I is the number of loo iterations and n is the number of arallel threads that will run concurrently. Table details the bidirectional communication erformance (asynchronous echo test) with WH2 nodes. Note that no comiler otimizations were used for these measures. Figure 2. Parallelizing a MPI code with OenMP by using the fine grain aroach: the comutation art of the MPI code on a node is arallelized with OenMP directives. not be arallelized with OenMP and Pn is the OenMP arallel art. 2.3 IBM SP systems Two IBM SP systems were used for the exeriments because they exhibit two different balances between the erformance of the main comonents (CPU, Memory, Network) erformance. The first system gathers 32 Winter- Hawk II (WH2) multirocessors with 4 Power3+ CPUs er node running at 375 MHz. Memory system for these nodes is based on a bus with a.6 GB/s maximum bandwidth. The second system connects 8 NightHawk I (NH) multirocessors based on Power3 CPUs running at 222 MHz. Each NH multirocessor rovides a theoretical 4.2 GB/s eak memory bandwidth under secific assumtions (memory bank oulation). The Power3 CPU has a 64 KB data cache. The unified L2 cache sizes are resectively 8 MB (WH2 nodes) and 4 MB (NH nodes) er CPU. The multi-stage interconnection network is the same for both SP systems. Each multirocessor node has one network interface. The maximum er node network bandwidth is 5-MB/sec unidirectional and 3-MB/sec bidirectional. The SP software environment includes the AIX 4.3 oerating system, the XLF 6. Fortran comiler that includes OenMP directives and the PPE 2.4 Parallel Environment (intra-node MPI communications use the shared memory). The following comiler otions have been used: -O3 -qarch=wr3 -qtune=w3 -qcache=auto for MPI. The Power3 microrocessor otions allow the use of refetch Unirocessor 4-way node 4-way node External comm. Internal comm. Bandwidth 93 MB/s 49 MB/s 36 MB/s ( comm.) (er CPU) 74 Mo/s (2 comm.) Latency ( comm.) 8.4 (2 comm.) Table. Communication erformance for bidirectional oint to oint tests ( comm. means that only two rocessors are communicating. 2 comm. means that two coules of rocessors are communicating simultaneously). Table shows that each CPU can use /4 of the maximum bandwidth delivered by the network board. The measured standard deviation is very low. It is lower than 5% for the er rocessor communication bandwidth when 4 CPUs from one node communicate with 4 CPUs of another node (extended echo test for clusters of SMP nodes). For this communication scheme, each CPU exeriences a latency of. For a given number of small messages to send from a node, the total communication time is thus two times smaller with 4 CPUs er node than with only one. The reason is the overla between the comutation arts of the communications. The internal communication erformance is far greater than the external one. Each node has a bus between the CPUs and the memory. This is why the bandwidth decreases by a half for two simultaneous communications within the SMP node. 2.4 Using SMP nodes 2.4. Scalability of the NAS benchmarks with -way nodes Before examinig the scalability of the NAS benchmarks with the two different memory models on clusters of multirocessors, we resent some background results about their scalability on a cluster of unirocessors. The behaviour of 3

4 Seedu Class A Perfect cg ft lu mg bt s Number of -way nodes Seedu Class B Perfect cg ft lu mg bt s Number of -way nodes Figure 3. Seedu for the NAS benchmarks with WH2 nodes. The left art corresonds to Class A and the right art to Class B. the NAS benchmark highly deends on the balance between the main comonents of the arallel architectures. In [2] most of the Class A benchmarks excet IS and FT have a linear seedu u to 32 rocessors on recent architectures. The SGI Origin even rovides suerlinear seedu for some benchmarks. The figure 3 resents the seedu of the Class A and class B benchmarks for the SP3 with WH2 nodes. The observed seedus, which deend on the communication/comutation ratio, are sensitive to the dataset sizes, as shown by CG and LU (sub-linear for Class A and suerlinear for Class B for 32 nodes). Previous results available on the NAS Benchmark Web age generally show sublinear seedus for most of the comuters. [2] resents a dee analysis of the reasons behind the different behaviours of the NAS benchmarks on different machines. The main arameters of the scalability are the sequential node architecture and the erformance of the communication system. The authors demonstrate that although the SGI Origine has a relatively low communication efficiency on the NAS communication atterns, some suerlinear seedus are achievable because of the node architecture (cache, memory system). -way/4-way exec. time ratio CPU 6-CPU 32-CPU cg-a cg-b ft-a ft-b lu-a lu-b mg-a mg-b Figure 4. Ratio of the MPI execution time with -way WH2 nodes over the MPI execution time with 4-way nodes according to the number of CPUs for the class A and class B benchmarks. The values are less than when -way nodes are better way nodes versus 4-way nodes Before going into details of the erformance comarison, we want to confirm the intuition that sharing the memory system and network interface makes a cluster of SMP nodes less efficient than a cluster of unirocessor nodes using the same number of CPUs. On the SP system with 32 WH2 nodes, we have comared the execution times of class A and class B MPI benchmarks when using the same number of CPUs, with either -way nodes or 4-way nodes. The comarison have thus been done with 8, 6 and 32 CPUs. Results are shown in figure 4. The figure shows that the ratio between clusters of -way nodes and clusters of 4-way nodes is always less than for any benchmark and any number of CPUs. Performance ratio clearly deends on the alication. For FT, the ga is accentuated because this benchmark uses global communications, which are resently not highly otimized on the SP systems. The results of the comarison confirms the ones resented in [5] where the authors claim that a cluster of unirocessors can be faster that a cluster of multirocessors. The detailed comarison of these 4

5 two tyes of clusters is out of the scoe of this aer, which aims on comaring rogramming models. More, comaring erformance according to the number of CPUs is very debatable according to the cost issues and the current trends in arallel architectures. Because of the increasing ga between CPU erformance and DRAM erformance, the technological trends are towards larger SMP nodes (NH2 nodes of the IBM SP use 6-way nodes and the future SP4 systems will have 32-way nodes) built as clusters of smaller SMP nodes. This is why we only consider erformance of SMP nodes in the rest of the aer. 4-way nodes are the largest nodes that can be used to comare MPI and MPI+OenMP on the two SP systems that were used. 2.5 Methodology We first alied the incremental aroach to the original MPI NPB-2.3 benchmarks to get the hybrid version. The NAS benchmarks have not been tuned secifically to take advantage of the memory hierarchy of the Power3 for both models. So erformance results do not give the highest ossible value for these codes. However, we try to be fair in comaring codes with quite equivalent otimization levels. Then, the execution time have been decomosed into comutation and communication times. For that, all benchmarks have been instrumented with time measurements. Timings are accumulated across all nodes and divided by the number of nodes to give mean values er rocessor. The communication time includes the communication calls and the synchronization rocedures (Wait). During the exeriments on the IBM latforms and u to now, no software was available to access the hardware erformance counters of the Power3 on the IBM SP. So all interretations of the measured results are based on the timing measurements. We have taken care to limit the interretation to the clearly aarent henomena. A revious study has been done and ublished [3] for a cluster of way Pentium-II nodes. On this latform, we had access to the hardware erformance counters and we could rovide a deeer analysis. 3 Overall comarison between MPI and MPI+OenMP Figure 5 resents the ratio between the MPI execution time and the MPI+OenMP execution time according to the number of 4-way nodes for each class of benchmarks (A or B) and each tye of SP nodes (WH2 and NH). For BT and SP, which use a square number of rocesses, we only use the number of nodes (, 4 and 6) that allows a direct comarison with the other benchmarks, omitting the 9-node configuration. A ratio less than means that the unified MPI version is more efficient than the hybrid version. The comarison results are clearly alicationdeendent. Whatever nodes and data sets are used, the unified MPI model is always better for LU, MG, BT and SP. The advantage is sectacular with LU, for which unified MPI is always more than 2 times more efficient than MPI+OenMP. For CG and FT, the advantage of one model deends on the data set size, on the tye of nodes and also on the number of nodes. For NH nodes and class A, MPI shows better erformance for CG and FT for any number of nodes. On the oosite, the hybrid model is always better for FT with WH2 nodes (classes A and B) and for class B (WH2 and NH nodes). It is always better for CG with WH2 nodes and class B. The advantage of one model can deend on the number of nodes: for CG and the WH2-class A and NH-class B configurations, MPI is better for a small number of nodes and MPI+OenMP for a larger number of nodes. In both case, the threshold is low (resectively for 2 and 8 nodes). The results are quite similar for the classes A and B, but they are slightly more favorable to the hybrid model for the class B. The results are also slightly more favorable to the hybrid model when using WH2 nodes instead of NH nodes. These results show that the relative erformance of each model deends on the alication, the dataset size and the features of the different comonents of the architecture (CPU, Memory system, Interconnection system). 4 Understanding the erformance results To exlain the results of the revious section, we have decomosed the execution time into comutation and communication times. According to Figure 2, we have further decomosed the comutation time into S+P and Pn, where S, P and Pn are resectively the sequential time, the MPI arallel time (that cannot be arallelized with OenMP) and the comutation time that can be arallelized with OenMP. 4. Amdahl s law for the hybrid model For constant size roblems, when using the MPI unified model with N 4-way nodes, 4N CPUs will run concurrently all the code arts excet the sequential and the communication arts. For the hybrid model with N 4-way nodes, only the Pn art of the rogram can be accelerated. Figure 6 resents the Pn execution time/total execution time ratio according to the number of -way nodes to evaluate the art of the execution time that can effectively benefit from the OenMP arallelization on SMP nodes. The results corresond to Class A and Class B benchmarks. Clearly, the time sent on the sequential sections and the MPI arallel sections that cannot be arallelized with OenMP and the communication time limit the fraction of 5

6 WH2 4-way nodes - Class A WH2 4-way nodes - Class B MPI /MPI+OMP cg ft lu mg bt s MPI / MPI+OMP cg ft lu mg bt s NH 4-way nodes - Class A NH 4-way nodes - Class B MPI / MPI + OenMP cg ft lu mg bt s MPI / MPI + OenMP cg lu mg Figure 5. Ratio of the MPI execution time over the MPI+OenMP execution time according to the number of 4-way nodes. The values are less than when MPI is better. The to diagrams corresond to WH2 nodes, with class A (left) and class B (left). The bottom diagrams corresond to NH nodes. For the NH nodes and class B, some measures are missing. the overall code that can be accelerated with an OenMP arallelization. This alication of Amdahl s law is more significant for the Class A than for the Class B, for which the communication imact is less significant. Its imact is more significant with the most owerful WH2 nodes. For class A and WH2 nodes, CG has the worst behavior, the Pn fraction going down to 25%. Pn fraction decreases from % (CG, FT) or 8% (LU, MG) down to 5 or 7% according to the benchmark and the class. These results are relevant considering the original assumtion that we use OenMP to arallelize existing MPI alications. Also, we should state that a better arallelization could be achieved for some benchmarks like LU using a stronger arallelization effort, but this no longer corresonds to the incremental aroach of arallelizing loo nests, which is the context of this aer. 4.2 Communication and Comutation times To exlain why the hybrid model erforms better on some benchmarks and SP configurations, we have decomosed the execution time into comutation and communication times. Summing searately the comutation times and communication times across all CPUs make easier the resentation of the results. With this assumtion, a erfect seedu would give the same value whatever number of CPUs is used. Class A results with WH2 nodes Figure 7 shows the result for CG, FT, LU and MG in class A with WH2 nodes. Comaring the left and right scales for each benchmark indicates the relative contribution of the communication time on the overall execution time. For all benchmarks excet LU, MPI communication 6

7 WH2 nodes NH nodes Pn exec.time/totalexec. time cg-a cg-b ft - A ft - B lu - A lu - B mg-a mg-b Pn exec. time/ total exec. time cg-a cg-b ft-a ft-b lu-a lu-b mg-a mg-b Figure 6. Fraction of the execution time that can be accelerated with OenMP on clusters of -way nodes according to the number of nodes for Class A and B benchmarks. Com-mi Comm-mi Com-mi+oenm Comm-mi+oenm Com-mi Comm-mi Com-mi+oenm Comm-mi+oenm CG_A Com. time (sec) Number of WH2 nodes CG_A Comm. time (sec) FT_A time (sec) Number of WH2 nodes LU-A Com. time (sec) Com-mi Comm-mi Number of WH2 nodes Com-mi+oenm Comm-mi+oenm LU_A Comm. time (sec) MG_A Com. time (sec) Com-mi Comm-mi Number of WH2 nodes Com-mi+oenm Comm-mi+oenm MG_A Comm. time (sec) Figure 7. Time Breakdown for MG, CG, LU and FT MPI and MPI+OenMP version according to the number of 4-way WH2 nodes. The figures breakdown the total execution time, summed across all the rocessors, into comutation times (shown as bars with the left scale) and communication times (lines with the right scale). 7

8 times are larger than for MPI+OenMP. The difference increases significantly from 4 nodes. To exlain this difference we should state that the two models use different communication atterns for the same number of CPUs. With N 4-way nodes, there are communications between 4N rocesses for the MPI model oosed to communications between N rocesses for the MPI+OenMP model. For constant size roblems, Won et al. [2] have shown for the NAS benchmarks that message size generally decreases when the number of nodes increases, but the number of messages sent and received by each node is likely to increase as more nodes lead to more comlicated communication atterns. These communication atterns and erformance (latency and bandwidth) exlain the behavior of CG. Figure 8 resents the main communication atterns for CG when using 2, 4, 8 and 6 rocessors. The small white boxes are for the rocessors. The grey rectangles corresond to the maing of communications on 4-way nodes. The external communications are between the grey boxes. In each art of the figure, communications are decomosed into small and large messages, with the number of each tye (number * 64-bit words). For CG, the long messages dominates the communication time. When increasing the number of rocesses, CG exchanges more long messages but the messages are shorter. As shown in table, the bandwidth when using a single MPI rocess er node is similar to the aggregate bandwidth when using 4 MPI rocesses er node. So, only the number of long messages gives advantage to the MPI+OenMP * 46 * * 46 * * 46 * * 46 * 35 Figure 8. Communication atterns for CG using 2, 4, 8 and 6 rocessors. The comarison of the communication time for LU gives the oosite result. The unified MPI aroach rovides a lower communication time than the hybrid aroach. Note that the communication time also encomasses the synchronization of the running rocesses. In the original code, there is no way to searate the communication and the synchronization times. LU uses many more small ( KB) messages with blocking communication calls than large messages with asynchronous communication calls. Only considering the extended echo test (for cluster of multirocessors) for synchronous communications cannot exlain the oosite behavior of LU comared to CG. Since ) the communication latency for the four rocesses inside a SMP node (unified aroach) is two times higher ( for 64 rocesses) than the communication latency ( for 6 rocesses) for a single rocess node (hybrid aroach) and 2) the communication time for the hybrid aroach ( ) stays close to the communication time ( ) for a void message (the latency is the dominant factor) although the hybrid aroach uses larger messages (2 times) than the unified aroach, the communication attern should give the advantage to the hybrid aroach. The oosite result suggests that the synchronization time is higher for the hybrid aroach than for the unified one. For all benchmarks, the MPI comutation times are less than the MPI+OenMP ones. The difference is relatively small for FT and CG. For MG, it becomes significant for a large number of nodes. For LU, the MPI comutation time is more than two times smaller than the MPI+OenMP one. Three factors contribute to the difference in comutation time between the two versions. First one is the Amdahl s law on the OenMP arallel art that was mentioned in the revious section. Second one is the data layout, which may be different when a rocess is slit into 4 threads by an OenMP arallelization or when there are 4 MPI rocesses within a node. This difference imacts on the cache behaviour, secially when MPI rogramming allows to exress multi-dimensional blocking. A third factor is the overhead of thread management in the OenMP version. Next section on memory access atterns and arallel efficiency on the OenMP arallel art will give more information to understand the large difference on LU comutation times between the two versions. Combining comutation and communication values exlain the overall results. The MPI aroach with LU, which has both better comutation and communication erformance, outerforms the hybrid aroach. For MG, the comutation advantage of MPI always comensates the communication disadvantage. For CG, the comutation advantage of MPI comensates the communication disadvantage only for a small number of nodes. For FT, communication disadvantage of the MPI aroach associated with the more comlicated all to all communication atterns leads to better erformance with the hybrid aroach. As reviously mentioned, the global 8

9 communications on SP3 systems are not otimized with SMP nodes. Other configurations For LU and MG, the decomosition of the execution time gives very similar results for the 3 other configurations: WH2-class B, NH-class A and NH- class B. We resent now for each configuration the time decomosition of CG (figure 9). Like FT, CG is a benchmark for which the best rogramming model deends on the configuration, the dataset size and the number of nodes. For class A with WH2 nodes, as reviously exlained, the comutation advantage of MPI comensates the communication disadvantage u to 4 nodes, but not for a larger number of nodes. With NH nodes, the comutation for both models are larger than with WH2 nodes (according to resective clock frequency) and the communication time are also slightly larger: the situation is similar, excet that MPI kees advantage until 8 nodes, but measures with 6 NH nodes would give advantage to the hybrid model. For class B, the clock frequency advantage of WH2 nodes only aears for 4 or 8 nodes, robably because some cache effects become significant. For less nodes, the frequency advantage is counterbalanced by the worse erformance of the memory system (bus versus crossbar). With NH nodes, the ga between the MPI communication time and the MPI+OenMP communication time increases quickly and only the configuration with one node shows an advantage for the MPI model. Remember that the communication time includes the synchronization time. We have no clear exlanation for the significant difference in MPI communication times between the two hardware systems for the same number of nodes. One reason might be that the communication time encomasses a CPU art and a Network Interface Communication art. In that case, slower CPUs would lead to larger communication times. Unfortunately, no hardware monitoring counters are available on the SP systems. They would be needed to recisely examine the behaviour of the memory hierarchy for each hardware configuration and benchmark class. 4.3 Memory access atterns The Amdahl s law is not sufficient to exlain the difference of comutation times between the two models. As for the communication atterns, the memory access atterns are different for the two models. To go further, we examine the arallel efficiency, which is defined as the measured seedu divided by the number of CPUs, on the Pn art of the comutation (art that can be arallelized with OenMP). Note that Pn art corresonds to the same code for both models. However loo nests are executed differently by each model. Figure shows the arallel efficiency on the Pn art for the class A and class B benchmarks with WH2 nodes. FT and MG exhibits similar behavior: the arallel efficiency is aroximately the same for the two aroaches. For FT, it increases with the number of nodes for small dataset sizes, which corresond to a better use of caches or remains constant if the dataset size is too large for the cache hierarchy. For MG, the arallel efficiency increases when the number of nodes increases above a threshold. For CG and LU, the MPI version has better arallel efficiency. For CG, the arallel efficiency increases with the number of nodes for large dataset sizes and reaches a maximum with 4 nodes and then decreases: the dataset size is too small for a full exloitation of the cache hierarchy. This last situation exists for LU for class A and class B. The decomosition of the comutation time for the hybrid version shows that two arameters govern its erformance: ) the time sent on the core of the loo nests and 2) the overhead of thread management. Although, the hybrid version reaches lower comutation time on loo cores for some benchmarks, it has higher comutation time for the whole loo nests due to the overhead of the thread management. At this oint, we should state that there is no reason why the hybrid aroach erforms better on some loo nests excet because loo nest otimizations used in the original MPI version do not match well the Power3 memory hierarchy. Conversely, we should note that the OenMP arallelization has a severe limitation comared to the MPI version: OenMP cannot exress multi-dimensional blocking when using the easiest arallelization aroach or simle loo otimizations. Such kinds of atterns require a comlete rewriting of the loo nest (loo fusion), which means a time consuming rogramming effort. 5 Related works The erformance of the IBM SP with Winter Hawk II nodes is resented in [4]. The author comares this architecture to a AlhaServer SC for several kernels and alications that do not include the NAS benchmarks. The aer does not comare unified and hybrid versions. The erformance issues of the NAS benchmarks have been resented in [2]. Esecially, Wong et al. have examined the architectural requirements and scalability. This work is restricted to unirocessor nodes and comares erformance of a cluster of workstations and the SGI Origin 2 machine. The erformance of Message Passing has been detailed in [5] for a cluster of SUN SMPs. Authors note that the memory hierarchy and esecially the oulation of the memory banks have a significant imact on the erformance. The aer does not address the issue of hybrid rogramming models. A comarison of shared memory and message assing 9

10 Com-mi-w Com-mi-n Comm-mi-w Comm-mi-n Com-mi+oenm-w Com-mi+oenm-n Comm-mi+oenm-w Comm-mi+oenm-n Com-mi-w Com-mi-n Comm-mi-w Comm-mi-n Com-mi+oenm-w Com-mi+oenm-n Comm-mi+oenm-w Comm-mi+oenm-n CG_A times (sec) Number of nodes CG_B times (sec) Figure 9. The execution time breakdown for CG according to the number of 4-way nodes for the MPI and MPI+OenMP versions. The left art corresonds to the Class A and the right art to the class B. w (res. n) stands for WH2 (res. NH) nodes. The figures breakdown the total execution time, summed across all the rocessors, into comutation times (shown as bars) and communication time (lines). The left scale is the same for comutation and communication times. Class A Class B Parallel Efficiency on Pn art cg cg ft ft lu lu mg mg Parallel efficiency on Pn art cg cg ft ft lu lu mg mg Figure. Parallel efficiency on the Pn art according to the number of nodes for the Class A (left art) and the Class B benchmarks (right art). tag corresonds to the unified MPI version with 4-way WH2 nodes. tag corresonds to the hybrid version with 4-way WH2 nodes. memory models has been reviously resented in [6]. The aer focuses on classical arallel architectures and does not examine the CLUMP issues. Several aers in the literature resent the imlementation of alications using a hybrid model like [7]. None of them rovides a dee understanding of why this model is better or worst than a unified one. Other rogramming models have been designed for the CLUMPs. For examle, the shared virtual memory environments (DSVM) rovide another alternative to unify the memory model, as resented in [8] [9] [][]. Recently, OenMP [2] has been imlemented on a cluster of SMPs on to of the Treadmark DSM system. As far as we know, there is no comarison between these unified aroaches and some hybrid ones. 6 Concluding remarks Based on the analysis of the NAS benchmarks running on two different IBM SP systems, we have demonstrated that the choice between unified or hybrid rogramming models is not trivial from the MPI existing codes. Several arameters must be considered. The first one is the level of shared memory arallelization achievable for

11 the loo nests of the original MPI rogram. From the original MPI rogram, it is difficult to evaluate if the incremental aroach that arallelizes the loo nests with OenMP (including secific otimizations) is sufficient to imrove erformance or if it is needed to reconsider the arallelization of the alication from scratch. The second arameter concerns the communication cost. For the same number of rocessors, this cost deends on how the communication atterns of the alication match the communication architecture of the cluster. Generally, clusters of multirocessors erform better with unified MPI aroach for latency sensitive rograms and worse for bandwidth sensitive rograms. A third arameter is the memory access atterns used by each aroach. Whereas MPI rogramming allows to exress multi-dimensional blocking, it is not natural for OenMP to do so. To erform the same memory atterns, loo nests must be rewritten which may be comlex for long loo core or dee loo nests. Efforts for imroving OenMP may rovide a solution to this roblem. Other rogramming languages like Co-array [3] may also be considered as alternative to OenMP for the hybrid rogramming or even for the unified rogramming. The last arameter is the erformance balance of the main comonents (CPUs, Memory, Network) of the CLUMP. For the same network erformance, fast rocessors may reduce the comutation time comared to slow rocessors u to the oint that communication erformance becomes significant. In that case the communication erformance allows to select the right model. In the other case, MPI seems to be always the best. Note that that our results and analysis only concerns a articular aroach of hybrid rogramming, which consists in loo level arallelization with a reasonable rogramming effort after rofiling to select the loo nests to arallelize. Other hybrid aroaches may lead to different results and analysis conclusions. We believe that selecting the aroriate model is a general roblem that is not secific to the alications analyzed in the aer. First, NAS benchmarks exhibit a large set of communication, memory reference and comutation atterns. Excet some irregular ones, real alications are likely to encomass similar atterns. Second, we have demonstrated that several imortant arameters distinguish the erformance of both aroaches. Although the relative imact of these arameters is alication deendent, they exist in any arallel alication. Future work will include erforming the same analysis for other clusters of multirocessors like Comaq SC, comaring with the other hybrid aroaches and esecially the coarse grain arallelization and deriving a general erformance model that may hel choosing between unified and hybrid models according to the characteristics of the architecture and the alication. 7 Acknowledgments Alain Quere and Eric Boyer (CINES in Montellier) allowed the first set of measurements on the SP3 with NH nodes as a single user with user sace communications. Olivier Hess, from IBM Montellier, was very helful for all technical roblems. John Levesque and its grou in the IBM Advanced Comuting Technology Center allowed the set of measurements on the SP3 with WH2 nodes. References [] F. Caello and O. Richard. Intra-node arallelization of MPI rograms with OenMP Reort TR-CAP-99, URL: htt:// fci/goinfrewww/96.s.gz [2] F. C. Wong, R. P. Martin, R. H. Araci-Dusseau, D. T. Wu, and D. E. Culler. Architectural requirements and scalability of the NAS arallel benchmarks. In Proceedings of Suercomuting 99, 999. [3] F. Caello, O. Richard O and D. Etiemble. Investigating the erformance of two rogramming models for clusters of SMP PCs In Proc. of the 6th IEEE Sym. on High-Performance Comuter Architecture (HPCA-6). January 2. [4] P. H. Worley. Performance Evaluation of the IBM SP and Comaq AlhaServer SC. In Proc. of the International Conference on Suercomuting, ICS, May 2 [5] Steven S. Lumetta, Alan Mainwaring, and David E. Culler. Multi-rotocol active messages on a cluster of SMPs. In SC 97: High Performance Networking and Comuting: Proceedings of the 997 ACM/IEEE SC97 Conference: November 5 2, 997, San Jose, California, USA., 997. [6] H. Shan and J. P. Singh. A Comarison of MPI, SHMEM, and Cache-Coherent Shared Address Sace Programming Models on the SGI Origin2. In Proc. of the International Conference on Suercomuting, ICS, June 999 [7] S. W. Bova, C. P. Breshears, C. Cuicchi, Z. Demirbilek, H. A. Ga. Dual-level Parallel Analysis of Harbor Wave Resonse Using MPI and OenMP In The International Journal of High Performance Comuting Alications, Sring 2. [8] D. J. Scales, K. Gharachorloo, and A. Aggarwal. Fine-grain software distributed shared memory on

12 SMP clusters. In Proc. of the 4th IEEE Sym. on High-Performance Comuter Architecture (HPCA-4), February 998. [9] R. Stets, S. Dwarkadas, N. Hardavellas, G. Hunt, L. Kontothanassis, S. Parthasarathy, and Michael Scott. Cashmere-2L: Software coherent shared memory on a clustered remote-write network. In Proc. of the 6th ACM Sym. on Oerating Systems Princiles (SOSP-6), October 997. [] R. Samanta, A. Bilas, L. Iftode, and J. P. Singh. Homebased SVM rotocols for SMP clusters: Design and erformance. In Proc. of the 4th IEEE Sym. on High-Performance Comuter Architecture (HPCA-4), February 998. [] Andrew Erlichson, Neal Nuckolls, Greg Chesson, and John Hennessy. SoftFLASH: Analyzing the erformance of clustered distributed virtual shared memory. In Proceedings of the Seventh International Conference on Architectural Suort for Programming Languages and Oerating Systems, ages 2 22, 996. [2] Charlie Hu Honghui Lu Alan Cox and Willy Zwaeneoel. OenMP for Networks of SMPs. In Proc. of the Second Merged Sym. IPPS/SPDP 99, 999. [3] R. W. Numrich and J. K. Reid. Co-Array Fortran for arallel rogramming Research reort RAL-TR-998-6, Deartment for Comutation and Information, Rutherford Aleton Laboratory, Oxon, UK, August 998 2