Building a Top500-class Supercomputing Cluster at LNS-BUAP
Dr. José Luis Ricardo Chávez, Dr. Humberto Salazar Ibargüen, Dr. Enrique Varela Carlos
Laboratorio Nacional de Supercómputo, Benemérita Universidad Autónoma de Puebla
Outline of the talk:
- The LNS project
- Planning and building the Cuetlaxcoapan supercomputing cluster
- Measuring performance: the HPL benchmark
- High-performance applications running on the cluster
- Summary
The LNS project
Laboratorio Nacional de Supercómputo of the Benemérita Universidad Autónoma de Puebla
Before the LNS:
- Individual efforts by some institutions of BUAP to build high-performance computing clusters, e.g. the Fénix cluster at the Faculty of Physics and Mathematics.
- General consensus about the need for a larger computing facility.
Planning and Building the Cuetlaxcoapan cluster
Important questions:
- What are the current and planned high-performance computing needs of our scientific community?
- What kind of applications will run on the cluster?
- What is the recommended hardware and software infrastructure?
Planning and Building the Cuetlaxcoapan cluster
To determine the actual needs, a meeting was organized at BUAP to discuss these matters. The general consensus was to focus initially on actual (current) performance needs.
Planning and Building the Cuetlaxcoapan cluster
Based on these needs, a panel of scientists and computing experts determined the hardware and software requirements and evaluated multiple proposals from hardware providers.
Actual performance needs:
- 160 TFLOPS peak
- About 1 PB (petabyte) of storage
It was decided to focus on the newer (very recently introduced) Intel Haswell architecture.
Planning and Building the Cuetlaxcoapan cluster Hardware partner: Fujitsu, Spain division. Proposal: an architecturally simple, tightly integrated supercomputing cluster.
Planning and Building the Cuetlaxcoapan cluster
Schematic representation of the cluster (figure).
Planning and Building the Cuetlaxcoapan cluster
204 compute nodes:
- 2 x Intel Xeon E5-2680 v3 at 2.5 GHz, 2 x 12 cores
- 128 GB DDR4 RAM
- AVX 2.0 (16 double-precision floating-point operations per clock cycle per core), giving 960 GFLOPS DP peak performance per node
All compute nodes run CentOS Linux 6.6.
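As a quick sanity check, the quoted per-node peak follows from the clock frequency, the AVX 2.0 throughput, and the core count (a minimal Python sketch using only the numbers above):

# Peak double-precision performance of one compute node
# (2 x Intel Xeon E5-2680 v3, values as quoted above).
freq_ghz = 2.5            # clock frequency in GHz
flops_per_cycle = 16      # DP operations per cycle per core with AVX 2.0
cores_per_node = 2 * 12   # two 12-core sockets

peak_gflops = freq_ghz * flops_per_cycle * cores_per_node
print(f"Peak per node: {peak_gflops:.0f} GFLOPS")   # -> 960 GFLOPS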
Planning and Building the Cuetlaxcoapan cluster
4 special compute nodes with accelerators (same CPUs as the normal compute nodes):
- 2 nodes with 2 NVIDIA K40 GPUs each: 2880 CUDA cores, 12 GB of memory, 1.43 TFLOPS DP peak performance per GPU
- 2 nodes with 2 Intel Xeon Phi coprocessors each: 61 cores, 16 GB of memory, 1.208 TFLOPS DP peak performance per coprocessor
Planning and Building the Cuetlaxcoapan cluster
An upgrade to the cluster is in progress and consists of 52 additional compute nodes with the same characteristics as the installed nodes. This upgrade increases the computing capacity by 25% and positions the cluster as one of the 500 most powerful supercomputing clusters in the world.
Planning and Building the Cuetlaxcoapan cluster
3 service nodes:
- Master node: cluster monitoring and software deployment
- Login node: user tools for code compilation, job execution and monitoring, etc.
- Job management node: SLURM resource management
All service nodes run RedHat Linux 6.6.
Planning and Building the Cuetlaxcoapan cluster
Fast data transfer network (computation and parallel filesystem):
- Mellanox FDR InfiniBand SX6518 director switch
- Up to 324 FDR IB ports: 56 Gb/s full bidirectional bandwidth with sub-1 μs port latency
- 36.3 Tb/s aggregate non-blocking bandwidth
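The aggregate bandwidth figure is consistent with all quoted ports running at 56 Gb/s in each direction (a quick check, assuming the port count and per-port speed above):

# Aggregate bidirectional bandwidth of the FDR InfiniBand switch.
ports = 324               # FDR IB ports quoted above
link_gbps = 56            # per-port bandwidth in each direction (Gb/s)

aggregate_tbps = ports * link_gbps * 2 / 1000   # both directions, in Tb/s
print(f"Aggregate bandwidth: {aggregate_tbps:.1f} Tb/s")   # -> 36.3 Tb/s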
Planning and Building the Cuetlaxcoapan cluster
2 x 1 Gb/s Ethernet interfaces per node:
- One for IPMI and TCP/IP management (Fujitsu ServerView system management; Nagios + Ganglia monitoring software)
- One for slow data transfer (NFS)
Planning and Building the Cuetlaxcoapan cluster
Storage servers:
Lustre parallel distributed filesystem:
- 6 object storage servers (OSS); each pair of OSS shares a 352 TB hardware RAID 6 object storage target (OST), for 1056 TB of raw storage capacity
- 2 metadata servers (MDS) sharing a 32 TB hardware RAID 6 metadata target (MDT)
Planning and Building the Cuetlaxcoapan cluster Storage servers: NFS: - 200 TB hardware RAID 6 cabinet - XFS filesystem
The HPL Benchmark Based on the LINPACK library developed in the 1970s by Jack Dongarra and coworkers. LINPACK is a collection of functions for the analysis and solution of linear systems of equations. HPL constitutes the standard performance test for the Top500 consortium.
The HPL Benchmark Structure of the HPL test: Solution of an order N dense linear system of equations Ax = b using LU decomposition with partial pivoting. The N x N matrix of coefficients A is set up with random numbers. In practice N is chosen so that the matrix uses almost all the available memory on all nodes.
The HPL Benchmark
Structure of the HPL test:
Required memory: 8 x N² bytes.
In the actual case of the Cuetlaxcoapan cluster N = 1788288, i.e., about 115 GB of local memory on each node.
The matrix A is distributed over the compute nodes in a P x Q grid. In practice, the values of P and Q should be optimized for maximum performance.
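A short Python sketch showing how the chosen N nearly fills the node memory, using only the values quoted above:

# Memory footprint of the HPL coefficient matrix A (double precision).
N = 1_788_288             # problem size used on Cuetlaxcoapan
nodes = 208               # compute nodes participating in the run

total_bytes = 8 * N**2                       # 8 x N^2 bytes
per_node_gib = total_bytes / nodes / 2**30
print(f"Total matrix size: {total_bytes / 1e12:.1f} TB")        # ~25.6 TB
print(f"Per node: {per_node_gib:.0f} GiB of 128 GB available")  # ~115 GiB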
The HPL Benchmark
Structure of the HPL test:
In order to maximize data communication performance among nodes, a block size NB for data transfer is chosen.
The total number of floating-point operations for the solution of the linear system is 2N³/3 + 2N².
The HPL Benchmark The performance of the test is computed by dividing the total number of floating point operations by the total computing time and is expressed as FLOPS (floating point operations per second). The theoretical performance of a processor (peak performance) is computed by multiplying the processor frequency by the number of floating point operations executed at each clock cycle.
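Putting the operation count and the definition of the rate together, the distributed-memory result reported below implies a wall-clock time of roughly seven hours for the full run (a derived estimate, not a figure taken from the test logs):

# HPL rate: total floating-point operations divided by wall-clock time.
N = 1_788_288
ops = 2 * N**3 / 3 + 2 * N**2       # operation count of the LU solve (see above)

reported_tflops = 153.408           # distributed-memory result reported below
implied_seconds = ops / (reported_tflops * 1e12)
print(f"Operation count: {ops:.3e}")                                # ~3.81e18
print(f"Implied wall-clock time: {implied_seconds / 3600:.1f} h")   # ~6.9 h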
The HPL Benchmark The aggregate peak performance of the cluster is computed by multiplying the peak performance of a single node by the total number of nodes. Intel provides a highly optimized HPL test for shared memory (to be run on a single node) and for distributed memory using MPI (to be run on the complete set of nodes of the cluster).
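Following that recipe, the aggregate CPU peak of the 208 nodes is the reference against which the measured HPL figure is compared (a sketch; the accelerators on the four special nodes are counted only through their host CPUs here):

# Aggregate peak performance of the cluster (CPU sockets only).
peak_per_node_gflops = 960          # per-node DP peak from the node specs
nodes = 208

aggregate_tflops = peak_per_node_gflops * nodes / 1000
print(f"Aggregate CPU peak: {aggregate_tflops:.1f} TFLOPS")   # ~199.7 TFLOPS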
The HPL Benchmark In practice the real (sustained) performance depends not only on raw processor performance but also on parameters N, P, Q, NB,and the speed of communications among nodes. It is also necessary to turn off hyperthreading since it degrades performance.
The HPL Benchmark
Results for the Cuetlaxcoapan cluster:
Optimized parameters: P = 52, Q = 96, NB = 192
Performance using the distributed memory test on the complete cluster (208 nodes): 153.408 TFLOPS
Average performance per node: 737.5 GFLOPS
The HPL Benchmark Performance using the shared memory test on individual nodes varies from 720 to 820 GFLOPS. Average performance per node (SMP test): 770 GFLOPS. This result corresponds to 80.3% of peak performance and is in good agreement with independent test results provided by Fujitsu and Intel.
The HPL Benchmark The performance degradation in the parallel test is of the order of 4%, which is reasonable given the need to interchange data among processors. Conclusion: the hardware reaches performance values that are in general better than those of other independent tests reported in the Top500 list. The speed and bandwidth of communications is not a limiting factor in the test.
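The quoted degradation follows directly from the two per-node averages (sketch):

# Degradation of the distributed-memory run relative to the SMP average.
smp_avg_gflops = 770        # average per-node SMP result
mpi_avg_gflops = 737.5      # average per-node result in the full-cluster run

degradation = (smp_avg_gflops - mpi_avg_gflops) / smp_avg_gflops
print(f"Degradation: {degradation:.1%}")   # ~4.2%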
The HPL Benchmark The Cuetlaxcoapan cluster is therefore placed among the 500 most powerful clusters in the world according to the Top500 list of November 2014.
Energy efficiency: the Green500 list
What about other performance parameters?
Energy consumption at full load: 96.3 kW
Energy efficiency: 1593.022 MFLOPS/W
This would rank the cluster 45th in the Green500 list of November 2014.
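The Green500 figure of merit is simply the sustained HPL performance divided by the power drawn at full load (worked out from the numbers above):

# Energy efficiency for the Green500 list: sustained MFLOPS per watt.
sustained_tflops = 153.408      # HPL result for the full cluster
power_kw = 96.3                 # measured power draw at full load

mflops_per_watt = sustained_tflops * 1e6 / (power_kw * 1e3)
print(f"Efficiency: {mflops_per_watt:.0f} MFLOPS/W")   # ~1593 MFLOPS/W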
High Performance Applications Running on Cuetlaxcoapan A resident team of scientists provides support to users for the installation and execution of high-performance applications.
High Performance Applications Running on Cuetlaxcoapan
Main scientific areas (charts): projections from the Users Forum of June 2014 vs. actual usage.
High Performance Applications Running on Cuetlaxcoapan
Number of research projects by scientific field:
- Condensed Matter Physics and Chemistry: 15
- Biology and Physiology: 3
- Mathematical Physics: 1
- High Energy Physics: 6
- Computational Science: 1
- Plastic and Visual Arts: 1
Current number of research accounts: 40
High Performance Applications Running on Cuetlaxcoapan Many of these projects are international collaborations: - ALICE - CMS - Auger - HAWC - Nanophotonics
High Performance Applications Running on Cuetlaxcoapan
An important effort was made to provide a balanced set of commercial and free HPC applications.
Number of research groups using HPC applications in condensed matter physics and chemistry:
- Gaussian: 7
- Abinit: 4
- CRYSTAL: 2
- NWChem: 2
- VASP: 3
- SIESTA: 1
- TeraChem: 3
- ORCA: 2
- Molpro: 2
- Quantum Espresso: 3
High Performance Applications Running on Cuetlaxcoapan
High energy physics:
- Corsika: 3
- Ape aerie: 1
- Fluka: 5
- Ape offline: 1
- Conex: 1
- Canopy: 1
- Geant4: 5
- Gate: 1
- Root: 5
- Aliroot: 1
High Performance Applications Running on Cuetlaxcoapan
Biophysics and Physiology:
- Sybyl: 2
- NAMD: 2
- Gromacs: 1
- GULP: 2
Plastic and Visual Arts:
- BLENDER: 1
Summary: We have designed a powerful supercomputing cluster based on the actual performance needs of the scientific community. Early adoption of the Haswell processor technology and a fast communication network result in more computing power with less hardware complexity, which also reduces energy consumption.
Thank you for your attention!