JuRoPA: Jülich Research on Petaflop Architecture, One Year On. Hugo R. Falter, COO; Lee J Porter, Engineering




JuRoPA: Jülich Research on Petaflop Architecture, One Year On. Hugo R. Falter, COO; Lee J Porter, Engineering. HPC Advisory Council Workshop 2010, Lugano. Slide 1

Outline: The work of ParTec on JuRoPA (HF); Overview of JuRoPA (HF); Basics of scalability (LP); Expectations & some experiments (LP); Conclusions (LP). Slide 2

The work of ParTec on JuRoPA. Project consulting: assistance in setting up a tender, solving legal questions, preparing MoUs and contracts. Manpower: on-site system administrators backed by teams of experienced developers and architects; system architects, HPC network specialists and HPC-aware software engineers. HPC systems consulting experience: 70+ man-years of software development, parallel application debugging, installation knowledge (hardware and software) and front-line support experience. Flexible / dynamic: a pragmatic approach, always seeking alternatives and finding workarounds. ParaStation V5 (open source): ParTec created and maintains the ParaStation cluster middleware, the chosen cluster operating system of JuRoPA. HPC Advisory Council Workshop 2010, Lugano. Slide 3

What is ParaStation V5? ParaStation V5 is an integral part of the success of the JuRoPA machine. Its components: ParaStation MPI, ParaStation Process Management, ParaStation Provisioning, ParaStation Accounting, ParaStation Parallel Debugger, ParaStation GridMonitor, ParaStation Healthchecker. HPC Advisory Council Workshop 2010, Lugano. Slide 4

The work of ParTec on JuRoPA (HF); Overview of JuRoPA (HF); Basics of scalability (LP); Expectations & some experiments (LP); Conclusions (LP). Slide 5

Forschungszentrum Jülich (FZJ) Slide 6

Jülich Supercomputing Centre (JSC) Slide 7

JuRoPA-JSC Slide 8

IB hardware, JuRoPA cluster: Sun M9 QDR switch (24 x 9 12X ports); Sun Blade 6048 InfiniBand QDR Switched Network Express Module; 12X optical CXP-to-CXP cables; 12X copper (Cu) CXP-to-CXP cables; 12X optical/copper splitter cables (CXP to 3x QSFP).
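A small arithmetic sketch of what these port counts imply (assuming, as the splitter cables suggest, that each 12X CXP port carries three 4X links):

```python
# Port arithmetic for the Sun M9 QDR switch as listed above:
# 24 x 9 = 216 12X ports; each 12X CXP port can be broken out into
# three 4X QSFP links with the splitter cables, giving 648 4X ports,
# consistent with the "648 ports max." figure on the network slide below.
ports_12x = 24 * 9
ports_4x = ports_12x * 3
print(ports_12x, ports_4x)   # 216 648
```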

JuRoPA - HPC-FF Slide 10

User research fields (JUROPA, ~200 projects): IAS Institute for Advanced Simulation, soft matter, nano/material science, biophysics, hadron physics. Jülich Supercomputing Centre (JSC). Slide 11

GCS: Gauss Centre for Supercomputing. Germany's Tier-0/1 supercomputing complex; association of Jülich, Garching and Stuttgart; a single joint scientific governance; Germany's representative in PRACE. More information: http://www.gauss-centre.de. Slide 12

Overview. JSC: Sun Blade SB6048, 2208 compute nodes, each with 2 Intel Nehalem 2.93 GHz, 24 GB DDR3 1066 MHz memory and a Mellanox IB ConnectX QDR HCA; 17664 cores, 207 Tflops peak. HPC-FF: Bull NovaScale R422-E2, 1080 compute nodes, each with 2 Intel Nehalem-EP 2.93 GHz, 24 GB DDR3 1066 MHz memory and a Mellanox IB ConnectX QDR HCA; 8640 cores, 101 Tflops peak. Lustre file system: 2 MDS (Bull R423-E2), 8 OSS (Sun Fire 4170), DDN S2A9900, Sun 4400 JBOD; Lustre over IB (530 TB). Novell SUSE Linux Enterprise Server; ParaStation cluster OS. HPC Advisory Council Workshop 2010, Lugano. Slide 13
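A small sketch that cross-checks the peak figures above, assuming (this is not stated on the slide) 4 double-precision flops per core per cycle for Nehalem and 8 cores per node:

```python
# Cross-check of the peak-performance figures on the overview slide.
# Assumption (not on the slide): each Nehalem core sustains 4 double-precision
# flops per cycle (2-wide SSE add + 2-wide SSE multiply), 8 cores per node.
FLOPS_PER_CYCLE = 4
CORES_PER_NODE = 8

def peak_tflops(nodes, clock_ghz):
    """Theoretical peak in Tflops for a homogeneous partition."""
    return nodes * CORES_PER_NODE * clock_ghz * FLOPS_PER_CYCLE / 1000.0

print(f"JSC:    {peak_tflops(2208, 2.93):.0f} Tflops")   # ~207 Tflops
print(f"HPC-FF: {peak_tflops(1080, 2.93):.0f} Tflops")   # ~101 Tflops
```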

JuRoPA common ingredients: 2208 + 1080 compute nodes; Intel Nehalem processors; Mellanox 4th Gen. QDR InfiniBand interconnect; global fat-tree network (adaptive routing not yet implemented); ParaStation V5 cluster middleware; Intel development tools (compiler, libraries, etc.); Linux OS (starting with Novell SLES, moving to Novell SLERT); Lustre for I/O (but gateways to GPFS); batch system: Torque/Moab; MPI: ParaStation (Intel MPI, further MPIs from OFED). Slide 14

Overall design schematic view Slide 15

Network schematic, JuRoPA / Service / HPC-FF. JuRoPA (JSC) part: 23 x 4 QNEM modules, 24 ports each; 6 x M9 switches, 648 ports max. each, 468/276 links used; Mellanox MTS3600 (Shark) switches, 36 ports, for service nodes. HPC-FF part: 4 Compute Sets (CS) with 15 Compute Cells (CC) each; each CC with 18 Compute Nodes (CN) and 1 Mellanox MTS3600 (Shark) switch; virtual 648-port switches constructed from 54x/44x Mellanox MTS3600. Slide 16
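A small sketch that cross-checks the node counts implied by this wiring against the 2208 + 1080 compute nodes from the overview slide:

```python
# Cross-check of the node counts implied by the wiring description above
# against the 2208 + 1080 compute nodes on the overview slide.
jsc_nodes = 23 * 4 * 24      # 23 x 4 QNEM modules with 24 node ports each
hpcff_nodes = 4 * 15 * 18    # 4 Compute Sets x 15 Compute Cells x 18 nodes
print(jsc_nodes, hpcff_nodes, jsc_nodes + hpcff_nodes)   # 2208 1080 3288
```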

Status today: the JuRoPA machine is at 90% aggregate utilization outside of maintenance slots, a utilization comparable to cloud systems. The JSC portion is currently overbooked by a factor of 10, which indicates this is a popular machine amongst the scientific community. HPC Advisory Council Workshop 2010, Lugano. Slide 17

The work of ParTec on JuRoPA (HF); Overview of JuRoPA (HF); Basics of scalability (LP); Expectations & some experiments (LP); Conclusions (LP). Slide 18

Scalability - Nehalem? The newest, fastest Intel processor. Technically: great memory bandwidth (up to 32 GB/s per socket); QPI; superior cache structure, large and with low latency (1st level: 32 kB, 8-way associative; 2nd level: 256 kB, 8-way associative; 3rd level: shared 8192 kB, 16-way associative); 128 bits loaded into an SSE register in one cycle. This results in > 90% efficiency in Linpack (BlueGene ~82%, other clusters 75%-80%). Slide 19

Scalability - QDR InfiniBand? Short answer: higher bandwidth (now at 32 Gbit/s) and lower latency (less than 1.0 µs point-to-point). Actual reasons: appropriate network topologies (Clos, fat-tree, Ω-network, ...); 4th Gen. QDR silicon has 36-port building blocks (24 ports in earlier generations); higher port arity -> fewer hops -> reduced complexity and reduced end-to-end latency; QDR supports adaptive routing (not yet implemented on JuRoPA), whereas IB's static routing might lead to congestion. Slide 20
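A minimal sketch of the fat-tree arithmetic behind the 36-port argument: a two-level fat-tree built from k-port switches can serve roughly k*(k/2) end-points at full bisection bandwidth, so 36-port silicon reaches 648 ports where 24-port silicon stops at 288. This is a generic property, not a description of the exact JuRoPA wiring:

```python
# End-point count of a two-level (leaf/spine) fat-tree built from k-port
# switches: k leaf switches with k/2 host ports each, k/2 spine switches.
def two_level_fat_tree_endpoints(k):
    return k * (k // 2)

print(two_level_fat_tree_endpoints(24))   # 288 (older 24-port silicon)
print(two_level_fat_tree_endpoints(36))   # 648 (4th-gen QDR silicon;
                                          #      matches the 648-port M9)
```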

Scalability - Middleware. Cluster middleware: can MPI scale to > 10,000 MPI tasks? Expose MPI tuning to the user environment. We used the ParaStation MPI and scheduler to start and manage over 25,000 MPI processes; the standard scheduler and MPI failed to scale (unexpected timeouts). The communication layer and process management (pscom) are designed to scale: ParaStation MPI uses pscom to start 25,000 processes in less than 5 minutes! ALL OPEN SOURCE. Slide 21
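A minimal sketch (assuming mpi4py is available) of the kind of start-up measurement described above; the 25,000-process figure was obtained with ParaStation MPI / pscom, not with this script:

```python
# Time how long the slowest rank waits at the first barrier after MPI_Init,
# i.e. until all ranks of the job are up and connected.
from mpi4py import MPI

comm = MPI.COMM_WORLD
t0 = MPI.Wtime()                      # taken right after MPI initialization
comm.Barrier()                        # returns once every rank has checked in
wait = comm.reduce(MPI.Wtime() - t0, op=MPI.MAX, root=0)

if comm.Get_rank() == 0:
    print(f"{comm.Get_size()} ranks up; slowest rank waited {wait:.2f} s")
```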

Scalability - Administration. ParaStation Healthchecker: special development of a tool to detect broken components in software and hardware at node boot and during the job preamble. IB network diagnostic tools: near-full automation of IB diagnostics, link monitoring, and active link management. Problem tracker: automated hardware/software PR submission with a ParTec-supported tracking system. HPC Advisory Council Workshop 2010, Lugano. Slide 22
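A minimal sketch of the kind of per-node check such a health checker can run at boot or in a job preamble; this is only an illustration using the standard Linux InfiniBand sysfs layout, not the ParaStation Healthchecker itself:

```python
# Refuse the node if any InfiniBand port is not in the ACTIVE state.
import glob
import sys

def ib_ports_active():
    ok = True
    for state_file in glob.glob("/sys/class/infiniband/*/ports/*/state"):
        with open(state_file) as f:
            state = f.read().strip()          # e.g. "4: ACTIVE"
        if "ACTIVE" not in state:
            print(f"{state_file}: {state}")
            ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if ib_ports_active() else 1)   # non-zero exit: take node offline
```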

The work of ParTec on JuRoPA (HF); Overview of JuRoPA (HF); Basics of scalability (LP); Experiences, Expectations & Experiments (LP); Conclusions (LP). Slide 23

IB bring-up cabling impressions Slide 24

IB bring-up cabling impressions Slide 25

Some experiences. The system was not expected to scale during the first phase (besides Linpack); we expected OS jitter problems to be solved within co-development projects. We were conservative: jobs with more than 512 nodes (4096 cores) were not allowed. Nevertheless, users are impatient and adventurous: they push applications to the allowed limits, try to scale beyond these limits, and investigate scalability. Let's look at some examples: synthetic benchmarks (link-test) and real-world applications. Slide 26

Application startup results Slide 27

Link-test by Wolfgang Frings - JSC Slide 28
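For illustration, a minimal point-to-point test in the spirit of the link-test shown above (the actual JSC link-test is a separate tool); assuming mpi4py and NumPy, ranks 0 and 1 exchange a 1 MiB buffer and report round-trip latency and bandwidth:

```python
# Ping-pong between rank 0 and rank 1: report round-trip time and bandwidth.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1 << 20, dtype=np.uint8)       # 1 MiB message
reps = 100

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
rtt = (MPI.Wtime() - t0) / reps

if rank == 0:
    print(f"round trip: {rtt * 1e6:.1f} us, "
          f"bandwidth: {2 * buf.nbytes / rtt / 1e9:.2f} GB/s")
```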

PEPC, a plasma physics N-body code: simulation of laser-plasma interaction; complexity reduction from O(N²) to O(N log N); scales up to 8k cores. Developed by P. Gibbon (JSC). Slide 29
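A small sketch of what the O(N²) -> O(N log N) reduction means in interaction counts (constants ignored, so only the trend is meaningful):

```python
# Interaction counts: direct O(N^2) sum versus O(N log N) tree code (trend only).
import math

for n in (10**4, 10**6, 10**8):
    direct = float(n) * n
    tree = n * math.log2(n)
    print(f"N={n:.0e}: direct ~{direct:.1e}, tree ~{tree:.1e}, "
          f"ratio ~{direct / tree:.0f}x")
```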

Slide 30

HYDRA, an astrophysics fluid code (PRACE). T. Downes, M. Browne, Dublin City Univ. Three prototypes were examined as part of this project: the Cray XT5 at CSC, and the JUROPA and JUGENE systems at FZJ. HYDRA was shown to scale remarkably well on JUGENE, exhibiting strong scaling [...] from 8192 cores up to the full system (294912 cores) with approximately 73% efficiency. HYDRA also exhibited excellent strong scaling on JUROPA up to 2048 cores, but less impressive scaling on the XT5 system.

Cost-to-solution, kWh per time-step:
  JUGENE (294912 cores)  0.52
  JuRoPA (2048 cores)    0.38
  LOUHI (1440 cores)     0.57

Slide 31
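A small sketch of what the quoted 73% strong-scaling efficiency on JUGENE implies, using the usual definition E = (N0 * T(N0)) / (N * T(N)), i.e. speedup = E * N / N0:

```python
# Implied speedup from the quoted strong-scaling efficiency on JUGENE.
base_cores, full_cores, efficiency = 8192, 294912, 0.73
speedup = efficiency * full_cores / base_cores
print(f"{full_cores // base_cores}x more cores -> ~{speedup:.0f}x speedup")
# 36x more cores -> ~26x speedup
```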

The work of ParTec on JuRoPA (HF); Overview of JuRoPA (HF); Basics of scalability (LP); Experiences, Expectations & Experiments (LP); Conclusions (LP). Slide 32

Conclusions. JuRoPA already has some of the ingredients needed to assure scalability: a good all-round InfiniBand topology for a throughput cluster, scalable services, and ParaStation MPI. Some room for improvement: progress the co-development on OS jitter resolution; MPI connections based on UD for a reduced memory footprint (ParTec development, in beta testing); implement adaptive routing; implement collective offloading. Some very good results already: many applications scale well, at least better than on the Cray XT5, and JuRoPA is greener than JUGENE for the HYDRA code. Slide 33

Thank you! www.par-tec.com. HPC Advisory Council Workshop 2010, Lugano. Slide 34