CONSISTENT PERFORMANCE ASSESSMENT OF MULTICORE COMPUTER SYSTEMS




GH. ADAM 1,2, S. ADAM 1,2, A. AYRIYAN 2, V. KORENKOV 2, V. MITSYN 2, M. DULEA 1, I. VASILE 1

1 Horia Hulubei National Institute for Physics and Nuclear Engineering (IFIN-HH), 407 Atomistilor, Magurele-Bucharest, 077125, Romania, E-mail: adamg@ifin.nipne.ro
2 Joint Institute for Nuclear Research, 141980 Dubna, Moscow region, Russia

Received August 30, 2008

Rom. Journ. Phys., Vol. 53, Nos. 9-10, pp. 985-991, Bucharest, 2008

Performance assessment, through the High-Performance Linpack (HPL) benchmark, of the quad-core cluster with InfiniBand interconnect recently acquired at LIT-JINR Dubna is reported. Corroboration with previous results [Gh. Adam et al., Rom. J. Phys., 53, 665 (2008)] shows that the HPL benchmark scales for single-core, two-core, and quad-core chips and yields results that fit its intrinsic complexity under statistically relevant criteria. Free software implementations (OS and MPI) on the multicore clusters at LIT-JINR and IFIN-HH resulted in relative performances comparable to those reported in the September 2007 issue of TOP500, the list of the five hundred most productive parallel computers in the world.

1. INTRODUCTION

The multiprocessor computer architectures built by the computing system vendors are intended to solve complex computational problems. At one extreme there is the case of very large single problems (like, e.g., those arising in lattice quantum chromodynamics), which ultimately result in very large linear algebraic systems, as a consequence of the specific discretization procedures yielding numerical algorithms. At the other extreme there is the case of very large sets of independent small to medium size problems of similar nature, which arise in very large scale projects (like, e.g., the four LHC experiments at CERN, whose data taking is planned to begin in September 2008). These two kinds of problems correspond to two extremes of the existing multiprocessor architectures: parallel clusters (which perform high performance computing under small interprocessor communication latencies) and distributed systems (Grids) (which are reservoirs of computing power, accessible from anywhere in the world within a virtual organization).

Most of the systems offered during the last few years by computer manufacturers for Grid infrastructure development use multicore computer architectures, which involve several independent processors (cores) on a chip that communicate through shared memory.

Conceived mainly as a solution to overcome the power consumption problem that impedes further increases of the processor clock frequency, the multicore computer architecture marks the start of a historic transition from sequential to parallel computation inside each multicore chip installed on the system. Provided the parallel computations scale with the number of cores on a chip, this would afford an alternative way towards further exponential performance improvement, with Moore's law exponential increase in chip resources realized through the increase of the core count. This is, however, a formidable task, quoted at the recent Gartner Conference [1] as one of the seven grand challenges facing IT for the next 25 years. Research concerning both computer architecture issues under the new circumstances [2] and the development of new higher-level abstractions for writing parallel programs [3] is actively pursued. Data accumulated both at LIT-JINR and abroad [4, 5] show that understanding, for specific problems, the hardware transfer processes between the cores and the RAM, together with the appropriate identification of the algorithm modules that may be executed in parallel and of the best MPI standard instructions for their handling, allows parallel code improvement.

The present paper discusses the performance assessment of a module of 20 quad-core processors, with InfiniBand interconnect, acquired at the beginning of 2008 at LIT-JINR Dubna. This continues a similar study [6] of the performance assessments of the CICC JINR supercomputer, consisting of 120 two-core processors with Gigabit Ethernet (GbEthernet) interconnect, and of the parallel 16-processor cluster SIMFAP with Myrinet interconnect, at IFIN-HH.

2. PERFORMANCE ASSESSMENT

The main characteristics of the three systems mentioned above are given in Table 1. Performance is measured by means of the High-Performance Linpack (HPL) benchmark [7], used in TOP500, the list of the five hundred most productive parallel computing systems in the world [8], and in TOP50, the list of the fifty most productive parallel computing systems in the CIS (the Commonwealth of Independent States) [9]. The HPL benchmark essentials and a discussion of its computational complexity can be found in [6].

The system performance gets maximized provided the order N of the solved algebraic system satisfies N < N_max, where N_max denotes the maximum system order for which the coefficient matrix can be accommodated within the overall available RAM. At N > N_max, performance deterioration occurs due to the need of using the HDD swap storage. The quantity P_peak denotes the peak theoretical performance which would be obtained under instantaneous information exchange along any of the paths involving the cores, cache, RAM, and HDD.
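To illustrate how these two quantities follow from the hardware data, the Python sketch below computes P_peak = k n ν and a memory-bound estimate of N_max for the three clusters of Table 1. The one assumption added here, not stated in the paper, is that the HPL coefficient matrix is stored in double precision, so that about 8 N^2 bytes of the overall RAM are needed.

```python
# Rough estimate of the peak performance and of the memory-bound HPL problem size.
# Assumption (not stated in the paper): the N x N coefficient matrix is stored in
# double precision, so about 8*N^2 bytes of the overall RAM are required.
from math import sqrt

GB = 1e9  # bytes per GB (decimal)

clusters = {
    # name: (flops per cycle k, total cores n, clock frequency in GHz, overall RAM in GB)
    "SIMFAP (IFIN-HH)":       (2,  16, 3.00,  32),
    "CICC supercomputer":     (4, 240, 2.66, 480),
    "CICC parallel cluster":  (4,  80, 3.00,  80),
}

for name, (k, n, nu_ghz, ram_gb) in clusters.items():
    p_peak = k * n * nu_ghz             # P_peak = k n nu, in GFlops
    n_max = int(sqrt(ram_gb * GB / 8))  # largest N whose matrix fits in the overall RAM
    print(f"{name:24s} P_peak = {p_peak:7.1f} GFlops   N_max ~ {n_max}")
```

With these inputs the sketch reproduces the P_peak row of Table 1 exactly and yields approximately 63.2 x 10^3, 244.9 x 10^3, and 100 x 10^3 for N_max, matching the reported values, which suggests that the N_max entries of Table 1 were estimated from precisely this memory bound.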

Table 1
Main characteristics of the three computing systems of interest

Features                     IFIN-HH SIMFAP        CICC supercomputer   CICC parallel cluster
Intel processors             Xeon Irwindale        2 x Xeon 5150        Xeon 5315
Clock frequency, ν           3 GHz                 2.66 GHz             3 GHz
Cores per CPU                1                     2                    4
CPUs per node                2                     2                    2
Total nodes                  8                     60                   10
Total CPUs                   16                    120                  20
Total cores, n               16                    240                  80
L2 cache per CPU             2 MB                  4 MB                 8 MB
RAM per node                 4 GB                  8 GB                 8 GB
Overall RAM                  32 GB                 480 GB               80 GB
Operating system             CentOS 5              SL 4.5               SL 4.5
Network                      Myrinet               GbEthernet           InfiniBand
MPI version                  1.2.7                 1.2.7                OpenMPI 1.2.5
Flops per clock cycle, k     2                     4                    4

System performance under the HPL benchmark
N_max                        63.2 x 10^3           244.9 x 10^3         100 x 10^3
P_peak = k n ν               96 GFlops             2553.6 GFlops        960 GFlops
P_max                        64.24 GFlops          1124 GFlops          684.5 GFlops
ρ_eff = P_max / P_peak       0.67                  0.44                 0.713

The quantity P_max = N_op / T denotes the maximum measured system performance, where N_op = (2/3) N^3 + 2 N^2 is the number of floating point operations needed for solving the algebraic system of order N ≤ N_max, and T is the measured computing time in seconds. Finally, the ratio ρ_eff denotes the effectiveness of the system under scrutiny. For values N ≪ N_max, the system performance is expected to be much smaller than P_max.
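A minimal sketch of how these measured quantities combine is given below, using the formulas just stated. The time value is not a measurement from the paper; it is back-computed from the reported P_max of the CICC parallel cluster at N = 100 x 10^3, purely to make the example self-contained, and the helper names are ours.

```python
# Measured performance and effectiveness of an HPL run, following
# P_max = N_op / T with N_op = (2/3) N^3 + 2 N^2 and rho_eff = P_max / P_peak.

def hpl_flops(n: int) -> float:
    """Floating point operations needed to solve a dense linear system of order n."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def measured_performance(n: int, t_seconds: float) -> float:
    """Measured performance in GFlops for an HPL run of order n lasting t_seconds."""
    return hpl_flops(n) / t_seconds / 1e9

# Illustrative values for the CICC parallel cluster (T is back-computed, not measured):
n = 100_000
p_peak = 960.0                     # GFlops, from Table 1
t = hpl_flops(n) / (684.5 * 1e9)   # ~974 s, chosen to reproduce the reported P_max
p_max = measured_performance(n, t)
rho_eff = p_max / p_peak
print(f"P_max = {p_max:.1f} GFlops, rho_eff = {rho_eff:.3f}")   # -> 684.5 GFlops, 0.713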

Fig. 1 summarizes the results obtained for the three mentioned clusters. On the bottom row, the measured computing times are given in minutes as functions of N, while on the upper row the resulting performances as functions of N are given for each of the clusters. The interesting feature shown both by the SIMFAP cluster (Myrinet interconnect) and by the CICC parallel cluster (InfiniBand interconnect) is the performance saturation near the upper end of the orders of the solved algebraic systems. This points to the advantage of having a wide-band dedicated data transfer bus among the processors. For the Gigabit Ethernet CICC supercomputer, saturation does not occur, due to the absence of such a dedicated bus.

Fig. 1 - Performance (on the upper row) and time of calculation (on the bottom row) vs. the order N of the linear system of equations, in 10^3 units, for the three clusters.

As compared to the previous performance estimates, the present statistics is larger and was derived at the magic N values [6].

3. DISCUSSION AND CONCLUSIONS

The least squares fit of the computing times measured at various N values provides insight into the consistency of the performance assessment procedure [6]. On the one hand, the intrinsic degree of complexity of the HPL benchmark is d = 3. On the other hand, we can determine the optimal degree m of the least squares fitting polynomial under a particular assumption on the distribution law of the uncertainties {σ_i} and a statistically significant termination criterion of the least squares procedure. Since the time measurements have been done independently of each other, we have to assume a Poisson distribution law. In [6], optimal values m = d have been obtained both for the CICC supercomputer and for the SIMFAP data by asking for the Hamming termination criterion (criterion 1 in the Appendix of [6]). For the CICC parallel cluster data, this criterion proved to be ineffective. However, instead of the pure noise requirement involved in the Hamming criterion, we may impose the criterion

    | z_{i,m} | < τ max(1, T_i),   τ ≈ 0.01,   (1)

where z_{i,m} denotes the residual associated with the time measurement T_i within the fitting polynomial of degree m. Criterion (1) has indeed resulted in m = d = 3, hence the present data are consistent with the third order complexity of the HPL benchmark as well (Fig. 2).

Fig. 2 - Fitting the CICC parallel cluster performance data resulted in an optimal fitting polynomial of degree m = 3, with sup-norm misfit magnitude below one percent.
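A sketch of this degree-selection procedure follows: polynomials of increasing degree m are fitted to (N, T) pairs with numpy, and the lowest degree whose residuals satisfy criterion (1) is accepted. The timing data below are placeholders with built-in cubic behaviour, not the measurements behind Fig. 2, and the equal-weight least squares fit is a simplification of the Poisson-weighted procedure of [6].

```python
# Choose the lowest polynomial degree m whose least-squares residuals
# satisfy criterion (1): |z_{i,m}| < tau * max(1, T_i), with tau = 0.01.
import numpy as np

def optimal_degree(N, T, tau=0.01, max_degree=6):
    """Return the smallest degree m for which every residual passes criterion (1)."""
    N = np.asarray(N, dtype=float)
    T = np.asarray(T, dtype=float)
    for m in range(1, max_degree + 1):
        coeffs = np.polyfit(N, T, m)            # unweighted least-squares fit of degree m
        residuals = T - np.polyval(coeffs, N)   # z_{i,m}
        if np.all(np.abs(residuals) < tau * np.maximum(1.0, T)):
            return m
    return None  # no degree up to max_degree satisfies the criterion

# Placeholder data (N in 10^3 units, T in minutes), cubic by construction:
N_values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
T_values = [0.017 * n**3 / 1000 + 0.05 * n for n in N_values]
print(optimal_degree(N_values, T_values))   # -> 3 for data of intrinsic cubic complexity
```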

Corroborating this result with the evidence reported in [6], we conclude that the HPL benchmark scales perfectly for the multicore clusters. This lets us infer that, for scientific computing involving compact coefficient matrices, the derivation of scalable parallel codes is a feasible task within the MPI standard.

In the last line of Table 1, there is a large difference between the 44 percent effectiveness of the GbEthernet CICC supercomputer, the 67 percent effectiveness of the Myrinet SIMFAP cluster, and the 71.3 percent effectiveness of the InfiniBand CICC parallel cluster. We assume that these differences stem from the specific interconnects of the three computer clusters. An independent confirmation of this hypothesis comes from the comparison of these figures with the histogram representations of the efficiencies reported in the September 2007 issue of the TOP500 list for GbEthernet, Myrinet, and InfiniBand parallel clusters (Fig. 3).

Fig. 3 - Histograms summarize the September 2007 issue of the TOP500 data for each of the existing interconnect networks. Arrows point to the present results.

The occurrence of the present relative performances at the level of the best computers in the world points to the fact that the in-house free software implementations (OS and MPI) have been done at a high qualitative level.

Acknowledgments. The Romanian authors acknowledge partial support from contract CEX05-D11-67. A. Ayriyan acknowledges partial support from RFBR grant #08-01-00800-a.

REFERENCES

1. Gartner Symposium/ITxpo 2008, Emerging Trends, 6-10 April 2008, Mandalay Bay, Las Vegas, NV, USA; Comm. ACM, 51, no. 7, 10 (2008); http://www.networkworld.com/news/2008/040908-gartner-it-challenges.html
2. M. Oskin, Comm. ACM, 51, no. 7, 70-78 (July 2008).
3. J. Larus, C. Kozyrakis, Comm. ACM, 51, no. 7, 80-88 (July 2008).
4. V. Lindenstruth, Status and plans for building an energy efficient supercomputer in Frankfurt, GRID 2008, 3rd Intl. Conf. on Distributed Computing and Grid Technologies in Science and Education, JINR Dubna, 30 June - 4 July 2008; http://grid2008.jinr.ru/pdf/lindenstruth.pdf
5. S. Gorbunov, U. Kebschull, I. Kisel, V. Lindenstruth, W. F. J. Mueller, Comput. Phys. Commun., 178, 374-383 (2008).
6. Gh. Adam, S. Adam, A. Ayriyan, E. Dushanov, E. Hayryan, V. Korenkov, A. Lutsenko, V. Mitsyn, T. Sapozhnikova, A. Sapozhnikov, O. Streltsova, F. Buzatu, M. Dulea, I. Vasile, A. Sima, C. Vişan, J. Buša, I. Pokorny, Romanian J. Phys., 53, Nos. 5-6, 665-677 (2008).
7. A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary, HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, http://www.netlib.org/benchmark/hpl/
8. http://www.top500.org/
9. http://www.supercomputers.ru/?page=rating/