Building a Top500-class Supercomputing Cluster at LNS-BUAP

Dr. José Luis Ricardo Chávez, Dr. Humberto Salazar Ibargüen, Dr. Enrique Varela Carlos
Laboratorio Nacional de Supercómputo, Benemérita Universidad Autónoma de Puebla

Outline of the talk:
- The LNS project
- Planning and building the Cuetlaxcoapan supercomputing cluster
- Measuring performance: the HPL benchmark
- High performance applications running on the cluster
- Summary

The LNS project: Laboratorio Nacional de Supercómputo of Benemérita Universidad Autónoma de Puebla.
Before LNS:
- Individual efforts by some institutions of BUAP to build high-performance computing clusters, e.g. the Fénix cluster at the Faculty of Physics and Mathematics.
- General consensus about the need for a larger computing facility.

Planning and Building the Cuetlaxcoapan cluster
Important questions:
- What are the current and planned needs for high performance computing in our scientific community?
- What kind of applications will run on the cluster?
- What is the recommended hardware and software infrastructure?

Planning and Building the Cuetlaxcoapan cluster
To determine the actual needs, a meeting was organized at BUAP to discuss these matters. The general consensus was to focus initially on actual performance needs.

Planning and Building the Cuetlaxcoapan cluster
Based on these needs, a panel of scientists and computing experts determined the hardware and software requirements and evaluated multiple proposals from hardware providers.
Actual performance needs:
- 160 TFLOPS peak
- About 1 PB (petabyte) of storage
It was decided to focus on the newer (very recently introduced) Intel Haswell architecture.

Planning and Building the Cuetlaxcoapan cluster Hardware partner: Fujitsu, Spain division. Proposal: an architecturally simple, tightly integrated supercomputing cluster.

Planning and Building the Supercomputing Cluster Schematic representation of the cluster:

Planning and Building the Cuetlaxcoapan cluster
204 compute nodes:
- 2 x Intel Xeon E5-2680 v3 at 2.5 GHz (2 x 12 cores)
- 128 GB DDR4 RAM
- AVX 2.0: 16 double precision floating point operations per clock cycle per core, giving 960 GFLOPS DP peak performance per node
All compute nodes run CentOS Linux 6.6.
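
A quick sketch in Python of the arithmetic behind the quoted 960 GFLOPS figure (all numbers are taken from the slide above):

# Per-node DP peak of a dual-socket Xeon E5-2680 v3 (Haswell) node
sockets = 2
cores_per_socket = 12
clock_ghz = 2.5                  # base clock in GHz
dp_flops_per_cycle = 16          # AVX2 + FMA, double precision, per core

peak_gflops = sockets * cores_per_socket * clock_ghz * dp_flops_per_cycle
print(peak_gflops)               # 960.0 GFLOPS per node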

Planning and Building the Cuetlaxcoapan cluster
4 special compute nodes with accelerators (same CPUs as the normal compute nodes):
- 2 nodes with 2 NVIDIA K40 GPUs each: 2880 CUDA cores, 12 GB of memory, 1.43 TFLOPS DP peak performance per GPU
- 2 nodes with 2 Intel Xeon Phi coprocessors each: 61 cores, 16 GB of memory, 1.208 TFLOPS DP peak performance per coprocessor

Planning and Building the Cuetlaxcoapan cluster
An upgrade to the cluster is in progress and consists of 52 additional compute nodes with the same characteristics as the installed nodes. This upgrade increases the computing capacity by 25% and positions the cluster as one of the 500 most powerful supercomputing clusters in the world.

Planning and Building the Cuetlaxcoapan cluster
3 service nodes:
- Master node: cluster monitoring and software deployment
- Login node: user tools for code compiling, job execution and monitoring, etc.
- Job management node: SLURM resource management
All service nodes run Red Hat Linux 6.6.
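
For illustration, a minimal SLURM batch script of the kind a user would submit from the login node; the partition name, module names and binary are hypothetical, since the talk does not show an actual job script:

#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --nodes=2                 # two compute nodes
#SBATCH --ntasks-per-node=24      # one MPI rank per core (2 x 12 cores)
#SBATCH --time=01:00:00
#SBATCH --partition=compute       # hypothetical partition name

module load intel impi            # hypothetical module names
srun ./my_mpi_application         # hypothetical MPI binary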

Planning and Building the Cuetlaxcoapan cluster
Fast data transfer network (computation and parallel filesystem):
- Mellanox FDR InfiniBand SX6518 director switch
- Up to 324 FDR IB ports: 56 Gb/s full bidirectional bandwidth with sub-1 μs port latency
- 36.3 Tb/s aggregate non-blocking bandwidth

Planning and Building the Cuetlaxcoapan cluster
2 x 1 Gb/s Ethernet interfaces per node:
- One for IPMI and TCP/IP management (Fujitsu ServerView system management; Nagios + Ganglia monitoring software)
- One for slow data transfer (NFS)

Planning and Building the Cuetlaxcoapan cluster
Storage servers, Lustre parallel distributed filesystem:
- 6 object storage servers (OSS); each pair of OSSs shares a 352 TB hardware RAID 6 object storage target (OST), for 1056 TB of raw storage capacity
- 2 metadata servers (MDS) sharing a 32 TB hardware RAID 6 metadata target (MDT)

Planning and Building the Cuetlaxcoapan cluster
Storage servers, NFS:
- 200 TB hardware RAID 6 cabinet
- XFS filesystem


The HPL Benchmark
HPL is based on the LINPACK library, developed in the 1970s by Jack Dongarra and coworkers. LINPACK is a collection of functions for the analysis and solution of linear systems of equations. HPL is the standard performance test used by the Top500 consortium.

The HPL Benchmark
Structure of the HPL test:
- Solution of an order-N dense linear system of equations Ax = b using LU decomposition with partial pivoting.
- The N x N matrix of coefficients A is set up with random numbers.
- In practice, N is chosen so that the matrix uses almost all the available memory on all nodes.

The HPL Benchmark
Structure of the HPL test:
- Required memory: 8 × N² bytes.
- In the actual case of the Cuetlaxcoapan cluster, N = 1788288, i.e., about 115 GB of local memory on each node.
- The matrix A is distributed on the compute nodes in a P x Q grid. In practice, the values of P and Q should be optimized for maximum performance.
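
A quick check in Python of the quoted numbers, using the 208 nodes and the 128 GB of RAM per node mentioned earlier:

# HPL memory footprint: the N x N double precision matrix needs 8*N**2 bytes
N = 1788288
nodes = 208

total_bytes = 8 * N**2
per_node_gib = total_bytes / nodes / 2**30
print(round(total_bytes / 1e12, 1))   # ~25.6 TB for the whole matrix
print(round(per_node_gib))            # ~115 GiB per node, out of 128 GB of RAM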

The HPL Benchmark
Structure of the HPL test:
- In order to maximize data communication performance among nodes, a block size NB for data transfer is chosen.
- The total number of operations for the solution of the linear system is 2N³/3 + 2N².

The HPL Benchmark The performance of the test is computed by dividing the total number of floating point operations by the total computing time and is expressed as FLOPS (floating point operations per second). The theoretical performance of a processor (peak performance) is computed by multiplying the processor frequency by the number of floating point operations executed at each clock cycle.
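
Restated in symbols (t is the measured wall-clock time, f the clock frequency; the core count is included explicitly so that the formula reproduces the per-node figure quoted earlier):

\[
R_{\text{sustained}} = \frac{\tfrac{2}{3}N^{3} + 2N^{2}}{t},
\qquad
R_{\text{peak}} = f \times (\text{FLOP per cycle per core}) \times (\text{number of cores})
\]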

The HPL Benchmark
The aggregate peak performance of the cluster is computed by multiplying the peak performance of a single node by the total number of nodes. Intel provides a highly optimized HPL binary for shared memory (to be run on a single node) and for distributed memory using MPI (to be run on the complete set of nodes of the cluster).
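
Combining the per-node peak with the node count gives the aggregate figure that the measured HPL result is compared against. A back-of-the-envelope sketch in Python, assuming only the CPUs contribute:

# Aggregate CPU peak of the 208-node run, using the 960 GFLOPS/node figure
nodes = 208
peak_per_node_tflops = 0.960

aggregate_peak_tflops = nodes * peak_per_node_tflops
hpl_result_tflops = 153.408                # measured on the full cluster (next slides)
efficiency = hpl_result_tflops / aggregate_peak_tflops
print(round(aggregate_peak_tflops, 2))     # 199.68 TFLOPS aggregate peak
print(round(100 * efficiency, 1))          # ~76.8 % of peak sustained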

The HPL Benchmark
In practice, the real (sustained) performance depends not only on raw processor performance, but also on the parameters N, P, Q and NB, and on the speed of communications among nodes. It is also necessary to turn off Hyper-Threading, since it degrades HPL performance.

The HPL Benchmark
Results for the Cuetlaxcoapan cluster:
- Optimized parameters: P = 52, Q = 96, NB = 192
- Performance using the distributed memory test on the complete cluster (208 nodes): 153.408 TFLOPS
- Average performance per node: 737.5 GFLOPS
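
For reference, these parameters map onto the first entries of HPL's standard input file. A minimal HPL.dat fragment written from the stock file layout (only the first entries are shown; the actual file used on Cuetlaxcoapan is not reproduced in the talk):

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
1788288      Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
52           Ps
96           Qs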

The HPL Benchmark
Performance using the shared memory test on individual nodes varies from 720 to 820 GFLOPS. The average performance per node (SMP test) is 770 GFLOPS. This result corresponds to 80.3% of peak performance and is in good agreement with independent test results provided by Fujitsu and Intel.

The HPL Benchmark
The performance degradation in the parallel test is of the order of 4%, which is reasonable given the need to exchange data among processors. Conclusion: the hardware reaches performance values that are in general better than those of other independent tests reported in the Top500 list. The speed and bandwidth of communications is not a limiting factor in the test.

The HPL Benchmark The Cuetlaxcoapan cluster is therefore placed among the 500 most powerful clusters in the world according to the Top500 list of November 2014.

Energy efficiency: the Green500 list
What about other performance parameters?
- Energy consumption at full load: 96.3 kW
- Energy efficiency: 1593.022 MFLOPS/W
- This would rank 45th in the Green500 list of November 2014.
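
The efficiency figure follows directly from the HPL result and the measured power draw; a one-line check in Python:

# Green500 metric: sustained HPL performance per watt
hpl_mflops = 153.408e6     # 153.408 TFLOPS expressed in MFLOPS
power_watts = 96.3e3       # 96.3 kW at full load

print(round(hpl_mflops / power_watts, 3))   # ~1593.022 MFLOPS/W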

High Performance Applications Running on Cuetlaxcoapan
A resident team of scientists provides support to users for the installation and execution of high performance applications.

High Performance Applications Running on Cuetlaxcoapan
Main scientific areas: [charts: Users Forum June 2014 vs. actual usage]

High Performance Applications Running on Cuetlaxcoapan
Number of research projects by scientific field:
- Condensed Matter Physics and Chemistry: 15
- Biology and Physiology: 3
- Mathematical Physics: 1
- High Energy Physics: 6
- Computational Science: 1
- Plastic and Visual Arts: 1
Current number of research accounts: 40

High Performance Applications Running on Cuetlaxcoapan
Many of these projects are international collaborations:
- ALICE
- CMS
- Auger
- HAWC
- Nanophotonics

High Performance Applications Running on Cuetlaxcoapan
An important effort was made to provide a balanced set of commercial and free HPC applications.
Number of research groups using HPC applications in condensed matter physics and chemistry:
- Gaussian: 7
- Abinit: 4
- CRYSTAL: 2
- NWChem: 2
- VASP: 3
- SIESTA: 1
- TeraChem: 3
- ORCA: 2
- Molpro: 2
- Quantum Espresso: 3

High Performance Applications Running on Cuetlaxcoapan
High energy physics:
- Corsika: 3
- Ape aerie: 1
- Fluka: 5
- Ape offline: 1
- Conex: 1
- Canopy: 1
- Geant4: 5
- Gate: 1
- Root: 5
- Aliroot: 1

High Performance Applications Running on Cuetlaxcoapan
Biophysics and Physiology:
- Sybyl: 2
- NAMD: 2
- Gromacs: 1
- GULP: 2
Plastic and Visual Arts:
- BLENDER: 1

Summary:
We have designed a powerful supercomputing cluster based on the actual performance needs of the scientific community. Early adoption of the Haswell processor technology and a fast communication network result in more computing power with less hardware complexity, which also reduces energy consumption.

Thank you for your attention!