CONSISTENT PERFORMANCE ASSESSMENT OF MULTICORE COMPUTER SYSTEMS

GH. ADAM 1,2, S. ADAM 1,2, A. AYRIYAN 2, V. KORENKOV 2, V. MITSYN 2, M. DULEA 1, I. VASILE 1

1 Horia Hulubei National Institute for Physics and Nuclear Engineering (IFIN-HH), 407 Atomistilor, Magurele-Bucharest, 077125, Romania
E-mail: adamg@ifin.nipne.ro
2 Joint Institute for Nuclear Research, 141980 Dubna, Moscow reg., Russia

Received August 30, 2008

The performance assessment, through the High-Performance Linpack (HPL) benchmark, of the quad-core cluster with InfiniBand interconnect recently acquired at LIT-JINR Dubna is reported. Corroboration with previous results [Gh. Adam et al., Rom. J. Phys. 53, 665 (2008)] shows that the HPL benchmark scales for single-core, two-core, and quad-core chips and yields results which fit its intrinsic complexity under statistically relevant criteria. The free software implementations (OS and MPI) on the multicore clusters at LIT-JINR and IFIN-HH resulted in relative performances comparable to those reported in the September 2007 issue of TOP500, the list of the five hundred most productive parallel computers in the world.

1. INTRODUCTION

The multiprocessor computer architectures built by the computing system vendors are intended to solve complex computational problems. At one extreme there is the case of very large single problems (like, e.g., those arising in lattice quantum chromodynamics), which ultimately result in very large linear algebraic systems, as a consequence of the specific discretization procedures yielding the numerical algorithms. At the other extreme there is the case of very large sets of independent small to medium size problems of similar nature, which arise in very large scale projects (like, e.g., the four LHC experiments at CERN, for which data taking is planned to begin in September 2008). These two kinds of problems correspond to the two extremes of the existing multiprocessor architectures: parallel clusters (which do high performance computing under small interprocessor communication latencies) and distributed systems (Grids), which are reservoirs of computing power accessible from everywhere in the world within a virtual organization.

Most of the offers made during the last few years by the computer manufacturers for Grid infrastructure development use the multicore computer architecture, which involves several independent processors (cores) on a chip that communicate through shared memory.
Conceived mainly as a solution to the power consumption problem which impedes further increases of the processor clock frequency, the multicore computer architecture marks the start of a historic transition from sequential to parallel computation inside each multicore chip installed on the system. Provided the parallel computations scale with the number of cores on a chip, this would offer an alternative route towards further exponential performance improvement under Moore's law, through the exponential increase of the chip resources via the increase of the number of cores. This is, however, a formidable task, quoted at the recent Gartner conference [1] as one of the seven grand challenges facing IT for the next 25 years. Research concerning both the computer architecture issues raised by the new circumstances [2] and the development of new higher-level abstractions for writing parallel programs [3] is actively pursued.

The data accumulated both at LIT-JINR and abroad [4, 5] show that understanding the hardware transfer processes between the cores and the RAM for specific problems, together with the appropriate identification of the algorithm modules that may be executed in parallel and of the best MPI standard instructions for their handling, allows parallel code improvement.

The present paper discusses the performance assessment of a module of 20 quad-core processors with InfiniBand interconnect, acquired at the beginning of 2008 at LIT-JINR Dubna. This continues a similar study [6] of the performance of the CICC JINR supercomputer, consisting of 120 two-core processors with Gigabit Ethernet (GbEthernet) interconnect, and of the 16-processor parallel cluster SIMFAP at IFIN-HH, with Myrinet interconnect.

2. PERFORMANCE ASSESSMENT

The main characteristics of the three systems mentioned above are given in Table 1. Performance is measured by means of the High-Performance Linpack (HPL) benchmark [7], used in TOP500, the list of the five hundred most productive parallel computing systems in the world [8], and in TOP50, the list of the fifty most productive parallel computing systems in the CIS (the Commonwealth of Independent States) [9]. The HPL benchmark essentials and the discussion of its computational complexity can be found in [6].

The system performance is maximized provided the order $N$ of the solved algebraic system satisfies $N < N_{\max}$, where $N_{\max}$ denotes the maximum system order for which the coefficient matrix can be accommodated within the overall available RAM. At $N > N_{\max}$, performance deterioration occurs due to the need of using the HDD swap storage. The quantity $P_{\rm peak}$ denotes the theoretical peak performance which would be obtained under instantaneous information exchange along any of the paths involving the cores, the cache, the RAM, and the HDD.
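As a simple illustration (the sketch below is not part of the original assessment procedure), the order of magnitude of $N_{\max}$ follows from the requirement that the $8N^2$ bytes of the double precision coefficient matrix fit into the memory left available by the operating system, the MPI buffers, and the HPL workspace. In the following Python sketch, the usable memory fraction of 0.9 and the use of binary gigabytes are assumptions, not figures from the measurements.

    # Rough estimate of N_max: the largest order N for which the 8*N**2 bytes
    # of the double precision coefficient matrix still fit into the overall RAM.
    # The usable fraction (0.9) and the binary gigabytes are assumptions.
    from math import floor, sqrt

    def estimate_n_max(total_ram_gb, usable_fraction=0.9):
        usable_bytes = total_ram_gb * 2**30 * usable_fraction
        return floor(sqrt(usable_bytes / 8.0))

    for name, ram_gb in [("SIMFAP", 32),
                         ("CICC supercomputer", 480),
                         ("CICC parallel cluster", 80)]:
        print(f"{name}: N_max ~ {estimate_n_max(ram_gb):,}")

Under these assumptions the printed estimates (about 62, 241, and 98 thousand) land close to the $N_{\max}$ values of 63.2, 244.9, and 100 thousand quoted in Table 1.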
Table 1
Main characteristics of the three computing systems of interest

Features                   | IFIN-HH SIMFAP cluster | CICC supercomputer | CICC parallel cluster
Processors (Intel)         | Xeon Irwindale         | 2 x Xeon 5150      | Xeon 5315
Clock frequency, ν         | 3 GHz                  | 2.66 GHz           | 3 GHz
Cores per CPU              | 1                      | 2                  | 4
CPUs per node              | 2                      | 2                  | 2
Total nodes                | 8                      | 60                 | 10
Total CPUs                 | 16                     | 120                | 20
Total cores, n             | 16                     | 240                | 80
Level-2 cache per CPU      | 2 MB                   | 4 MB               | 8 MB
RAM per node               | 4 GB                   | 8 GB               | 8 GB
Overall RAM                | 32 GB                  | 480 GB             | 80 GB
Operating system           | CentOS 5               | SL 4.5             | SL 4.5
Network                    | Myrinet                | GbEthernet         | InfiniBand
MPI version                | 1.2.7                  | 1.2.7              | OpenMPI 1.2.5
Flops per clock cycle, k   | 2                      | 4                  | 4

System performance under the HPL benchmark
N_max                      | 63.2 x 10^3            | 244.9 x 10^3       | 100 x 10^3
P_peak = k n ν             | 96 GFlops              | 2553.6 GFlops      | 960 GFlops
P_max                      | 64.24 GFlops           | 1124 GFlops        | 684.5 GFlops
ρ_eff = P_max / P_peak     | 0.67                   | 0.44               | 0.713

The quantity $P_{\max} = N_{\rm op}/T$ denotes the maximum measured system performance, where
$$N_{\rm op} = \tfrac{2}{3}N^3 + 2N^2$$
is the number of floating point operations needed to solve the algebraic system of order $N \le N_{\max}$, and $T$ is the measured computing time in seconds. Finally, the ratio $\rho_{\rm eff} = P_{\max}/P_{\rm peak}$ denotes the effectiveness of the system under scrutiny. For values $N > N_{\max}$, the system performance is expected to be much smaller than $P_{\max}$.
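For illustration only (this sketch is not part of the original text), the relations above translate into a few lines of Python. The numerical check at the end uses the CICC parallel cluster entries of Table 1; the run time is back-solved from the quoted $P_{\max}$ purely for the sake of the example.

    # Figures of merit used in the HPL performance assessment:
    #   N_op    = (2/3)*N**3 + 2*N**2   floating point operations for order N
    #   P_max   = N_op / T              measured performance
    #   P_peak  = k * n * nu            theoretical peak performance
    #   rho_eff = P_max / P_peak        effectiveness of the system

    def n_op(n_order):
        """Floating point operation count of the HPL solve of order N."""
        return (2.0 / 3.0) * n_order**3 + 2.0 * n_order**2

    def p_max_gflops(n_order, t_seconds):
        """Measured performance, in GFlops, of a run of order N lasting T seconds."""
        return n_op(n_order) / t_seconds / 1e9

    def p_peak_gflops(k_flops_per_cycle, n_cores, nu_hz):
        """Theoretical peak performance, in GFlops."""
        return k_flops_per_cycle * n_cores * nu_hz / 1e9

    # Consistency check with the CICC parallel cluster column of Table 1:
    peak = p_peak_gflops(4, 80, 3e9)             # 960 GFlops
    print(round(684.5 / peak, 3))                # rho_eff = 0.713, as in Table 1
    # Run time back-solved from P_max at N = 100*10^3 (illustration only):
    t_run = n_op(100e3) / 684.5e9                # about 974 s
    print(round(p_max_gflops(100e3, t_run), 1))  # recovers 684.5 by construction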
Fig. 1 summarizes the results obtained for the three mentioned clusters. The bottom row gives the measured computing times, in minutes, as functions of $N$, while the upper row gives the resulting performances as functions of $N$, for each of the clusters. The interesting feature shown by both the SIMFAP cluster (Myrinet interconnect) and the CICC parallel cluster (InfiniBand interconnect) is the performance saturation near the upper end of the range of orders of the solved algebraic systems. This points to the advantage of having a wide-band dedicated data transfer bus among the processors. For the Gigabit Ethernet CICC supercomputer, saturation does not occur, due to the absence of such a dedicated bus. As compared to the previous performance estimates, the present statistics is larger and was derived at magic $N$ values [6].

Fig. 1. Performance (on the upper row) and time of calculation (on the bottom row) vs. the order $N$ of the linear system of equations, in $10^3$ units, for the three clusters.

3. DISCUSSION AND CONCLUSIONS

The least squares fit of the computing times measured at various $N$ values provides insight into the consistency of the performance assessment procedure [6]. On one side, the intrinsic degree of complexity of the HPL benchmark is $d = 3$. On the other side, we can determine the optimal degree $m$ of the least squares fitting polynomial under a particular assumption on the distribution law of the uncertainties $\{\sigma_i\}$ and a statistically significant termination criterion of the least squares procedure. Since the time measurements have been done independently of each other, we have to assume a Poisson distribution law. In [6], optimal values $m = d$ have been obtained for both the CICC supercomputer and the SIMFAP data, under the Hamming termination criterion (criterion 1 in the Appendix of [6]).
For the CICC parallel cluster data, this criterion proved to be ineffective. However, instead of the pure noise requirement involved in the Hamming criterion, we may impose the criterion
$$|z_{i,m}| < \tau \max(1, T_i), \qquad \tau \simeq 0.01, \qquad (1)$$
where $z_{i,m}$ denotes the residual associated with the time measurement $T_i$ within the fit by the polynomial of degree $m$. The criterion (1) has indeed resulted in $m = d = 3$; hence the present data are consistent with the third order complexity of the HPL benchmark as well (Fig. 2).

Fig. 2. Fitting the CICC parallel cluster performance data resulted in an optimal fitting polynomial of degree m = 3, with a sup-norm misfit magnitude below one percent.
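A minimal sketch of this degree selection procedure is given below. It is not the authors' code: the unweighted polynomial fit and the synthetic test data are assumptions made for the sake of a self-contained example, whereas the paper itself works with the measured times under a Poisson distribution of the uncertainties.

    # Increase the fitting degree m until every residual z_{i,m} of the least
    # squares fit of the computing times T_i satisfies criterion (1):
    #     |z_{i,m}| < tau * max(1, T_i),   tau = 0.01.
    import numpy as np

    def optimal_degree(n_values, t_values, tau=0.01, m_highest=6):
        n = np.asarray(n_values, dtype=float)
        t = np.asarray(t_values, dtype=float)
        for m in range(1, m_highest + 1):
            coeffs = np.polyfit(n, t, m)         # unweighted least squares fit
            z = t - np.polyval(coeffs, n)        # residuals z_{i,m}
            if np.all(np.abs(z) < tau * np.maximum(1.0, t)):
                return m
        return None                              # criterion never satisfied

    # Synthetic test data of the expected cubic form (illustration only):
    # N in 10^3 units, T in minutes, scaled with the CICC parallel cluster P_max.
    rng = np.random.default_rng(0)
    N = np.linspace(10.0, 100.0, 12)
    T = (2.0 / 3.0) * (N * 1e3) ** 3 / 684.5e9 / 60.0
    T += 0.002 * rng.standard_normal(T.size)     # small assumed timing noise
    print("optimal degree m =", optimal_degree(N, T))

On such synthetic cubic data the loop rejects m = 1, 2 and terminates at m = 3, mirroring the behaviour reported above for the measured CICC parallel cluster times.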
Corroborating this result with the evidence reported in [6], we conclude that the HPL benchmark scales perfectly for the multicore clusters. This lets us infer that, for scientific computing involving compact coefficient matrices, the derivation of scalable parallel codes is a feasible task within the MPI standard.

In the last line of Table 1, there is a large difference between the 44 percent effectiveness of the GbEthernet CICC supercomputer, the 67 percent effectiveness of the Myrinet SIMFAP cluster, and the 71.3 percent effectiveness of the InfiniBand CICC parallel cluster. We assume that these figures stem from the specific interconnects of the three computer clusters. An independent confirmation of this hypothesis comes from the comparison of these figures with the histogram representations of the efficiencies reported in the September 2007 issue of the TOP500 list for the GbEthernet, Myrinet, and InfiniBand parallel clusters (Fig. 3).

Fig. 3. Histograms summarizing the September 2007 TOP500 data for each of the existing interconnect networks. Arrows point to the present results.
The occurrence of relative performances at the level of the best computers in the world points to the fact that the home-made implementations of the free software (the operating systems and the MPI standard) have been done at a high quality level.

Acknowledgments. The Romanian authors acknowledge partial support from contract CEX05-D11-67. A. Ayriyan acknowledges partial support from RFBR grant #08-01-00800-a.

REFERENCES

1. Gartner Symposium/ITxpo 2008, Emerging Trends, 6-10 April 2008, Mandalay Bay, Las Vegas, NV, USA; Comm. ACM 51, no. 7, 10 (2008); http://www.networkworld.com/news/2008/040908-gartner-it-challenges.html
2. M. Oskin, Comm. ACM 51, no. 7, 70-78 (July 2008).
3. J. Larus, C. Kozyrakis, Comm. ACM 51, no. 7, 80-88 (July 2008).
4. V. Lindenstruth, Status and plans for building an energy efficient supercomputer in Frankfurt, GRID 2008, 3rd Intl. Conf. "Distributed Computing and Grid Technologies in Science and Education", JINR Dubna, 30 June - 4 July 2008; http://grid2008.jinr.ru/pdf/lindenstruth.pdf
5. S. Gorbunov, U. Kebschull, I. Kisel, V. Lindenstruth, W. F. J. Mueller, Comput. Phys. Commun. 178, 374-383 (2008).
6. Gh. Adam, S. Adam, A. Ayriyan, E. Dushanov, E. Hayryan, V. Korenkov, A. Lutsenko, V. Mitsyn, T. Sapozhnikova, A. Sapozhnikov, O. Streltsova, F. Buzatu, M. Dulea, I. Vasile, A. Sima, C. Vişan, J. Buša, I. Pokorny, Rom. J. Phys. 53, nos. 5-6, 665-677 (2008).
7. A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary, HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, http://www.netlib.org/benchmark/hpl/
8. http://www.top500.org/
9. http://www.supercomputers.ru/?page=rating/