www.bsc.es MareNostrum 3 Javier Bartolomé BSC System Head Barcelona, April 2015




Index
- MareNostrum 3 Overview
- Compute Racks
- Infiniband Racks
- Management Racks
- GPFS Network Racks
- HPC GPFS Storage Hardware
- GPFS Data Services
- Long-Term Storage (Archive)
- Active Archive Hardware
- Active Archive Services
- Batch Scheduler System
- Software Stack

MN2 / MN3


MareNostrum 3
- 36x IBM iDataPlex compute racks, 84x IBM compute nodes each
  - 2x SandyBridge-EP E5-2670 (2.6 GHz/1600 MHz, 20 MB cache, 8-core, 115 W)
  - 8x 4 GB DDR3-1600 DIMMs (2 GB/core)
  - 500 GB 7200 rpm SATA II local HDD
- 4x IBM dx360 M4 compute nodes on a management rack
- 3028 compute nodes, 48,448 Intel cores
- Memory: 94.62 TB (32 GB/node)
- Peak performance: 1.1 Pflop/s
  - Node performance: 332.8 Gflops
  - Rack performance: 27.95 Tflops
  - Rack consumption: 28.04 kW/rack (nominal under HPL)
- Estimated power consumption: 1.08 MW
- Infiniband FDR10 non-blocking Fat Tree network topology

MareNostrum 1-2-3 (compute, performance, memory, network)

                     MN1 (2004)   Ratio    MN2 (2006)   Ratio     MN3 (2012)
Cores/chip           1            x2       2            x4        8
Chips/node           2                     2                      2
Cores/node           2            x2       4            x4        16
Nodes                2406         +154     2560         +468      3028
Total cores          4812         x2       10240        x4.73     48448
Freq. (GHz)          2.2                   2.3                    2.6
Gflops/core          8.8                   9.2                    20.8
Gflops/node          17.6                  36.8                   332.8
Total Tflops         42.3         x2       94.2         x10.61    1000.0
GB/core              2                     2                      2
GB/node              4            x2       8            x4        32
Total memory (TB)    9.6          x2       20           x4.84     96.89
Topology             Non-blocking Fat Tree Non-blocking Fat Tree  Non-blocking Fat Tree
Latency (µs)         4                     4            x5.7      0.7
Bandwidth (Gb/s)     4                     4            x10       40
Storage (TB)         236          x2       460          x4.34     2000
Consumption (kW)     650          x1.1     750          x1.44     1080

MN3 Hardware Layout (floor-plan diagram): compute racks C1-C40 (s01r1-s18r2), Infiniband core racks IB1-IB7, management racks M1-M3 and storage/network racks D1-D5.

MN3 Compute Racks (same floor-plan diagram, compute racks C1-C40 highlighted).

MN3 iDataPlex Compute Rack
- 84x IBM System x iDataPlex servers
- 4x Mellanox 36-port Managed FDR10 IB leaf switches
- 12x compute nodes connected to leaf switches in the IB core racks
- Management network: 2x BNT RackSwitch G8052F
- GPFS network: 2x BNT RackSwitch G8052F
- iDataPlex rack with RDHX (rear-door water cooling)
- Performance:
  - 2.60 GHz x 8 flops/cycle (AVX) = 20.8 Gflops/core
  - 16 cores x 20.8 Gflops/core = 332.8 Gflops/node
  - 84 nodes x 332.8 Gflops/node = 27.95 Tflops/rack
- (Rack elevation: 4x 3P 32A PDU, 2x BNT G8052 for management, 2x BNT G8052 for GPFS, 4x MLX FDR10 36-port switches)

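The per-core, per-node and per-rack peak figures above follow directly from the clock rate and the AVX throughput; a minimal sketch of that arithmetic, with values taken from the slide (variable names are illustrative):

```python
# Peak-performance arithmetic for one MN3 iDataPlex compute rack
# (figures from the slide above; names are illustrative).
ghz_per_core = 2.60          # SandyBridge-EP E5-2670 clock
flops_per_cycle = 8          # double-precision flops/cycle with AVX
cores_per_node = 16          # 2 sockets x 8 cores
nodes_per_rack = 84

gflops_per_core = ghz_per_core * flops_per_cycle           # 20.8
gflops_per_node = gflops_per_core * cores_per_node         # 332.8
tflops_per_rack = gflops_per_node * nodes_per_rack / 1000  # ~27.95

print(f"{gflops_per_core} Gflops/core, {gflops_per_node} Gflops/node, "
      f"{tflops_per_rack:.2f} Tflops/rack")
```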

Rear Door Heat Exchanger
- No leaks: sealed internal coils
- Lock handle to open/close the door
- Perforated door for clear airflow
- Industry-standard hose fittings
- Swings open to provide access to the rear PDUs

MN3
- 1 chassis (2U) for 2 nodes, with shared:
  - Power (2x 900 W redundant, N+N)
  - Cooling (80 mm fans)
- Each node:
  - 2x CPU sockets (FCLGA2011)
  - 16x DDR3 DIMM slots
  - One 3.5'' SATA drive
  - 2x 1GbE interfaces, 1x IMM interface
  - Mellanox ConnectX-3 dual-port FDR10 QSFP IB mezzanine card
- (Front/rear chassis views: IMM port, Ethernet ports, dual-port QSFP FDR10 IB mezzanine card and ports)

Block diagram (figure)

MN3 Network Physical Configuration (per iDataPlex rack, diagram)
- Management network (IMM and xCAT/boot): 2x BNT RackSwitch G8052F Ethernet switches; 1 Gb/s copper links to each dx360 M4 node, to the IB leaf switch management ports and to the GPFS switches
- GPFS network: 2x BNT RackSwitch G8052F Ethernet switches; 1 Gb/s copper links to each node; 4x 10 Gb/s optical uplinks
- Infiniband FDR10: 4x Mellanox 36-port FDR10 leaf switches; 40 Gb/s copper links to 17-18 nodes each; 18 optical uplinks per leaf switch

MN3 VLAN Configuration (per iDataPlex rack, diagram)
- Management network (1 Gb/s copper): IMM, remote control, consoles, switch management, plus OS services (xCAT, network boot, LSF, Ganglia)
- GPFS network (1 Gb/s copper, 4x 10 Gb/s optical uplinks): GPFS I/O traffic
- Infiniband FDR10 (40 Gb/s copper, 18 optical uplinks per leaf switch): MPI application traffic

MN3 Infiniband Racks (same floor-plan diagram, Infiniband core racks IB1-IB7 highlighted).

MN3 Infiniband Network
- 6x Infiniband core racks (4 today), each with a Mellanox 648-port FDR10 Infiniband core switch (29U)
- 1x Infiniband rack with leaf IB switches and UFM servers:
  - 18x Mellanox 36-port Managed FDR10 IB switches
  - 2x Infiniband UFM (Unified Fabric Manager) servers: provision, monitor and operate the data-center fabric
- 144x Mellanox 36-port Managed FDR10 IB leaf switches (100 today)
- (Photos: front and back cabling of the Infiniband racks)

MN3 Infiniband Network (fabric diagram): 6x Mellanox SX6535 core switches (522 ports available, 507 used each); Mellanox 36-port FDR10 leaf switches, each with 3 uplinks to every core switch (18 uplinks in total) and 12-18 node ports; UFM servers and login nodes also attach at the leaf level.

MN3 Management Racks (same floor-plan diagram, management racks M1-M3 highlighted).

MN3 Management Hardware
- 2x xCAT GPFS servers and 2 storage controllers
  - 9 TB filesystem mounted on the management servers only
  - Stores the operating system images of all nodes, plus cluster logs and configuration files
- 2x xCAT master servers
  - Main xCAT servers working in high availability
  - Main DNS servers for the cluster
- 18x xCAT service nodes (13 today)
  - Each holds the services (DHCP, TFTP, HTTP, NFS) for a portion of the machine
- 2x scheduler servers
- 2x monitoring servers
- 5x login nodes and 1 master node

MN3 Management Software
- xCAT: Extreme Cluster (Cloud) Administration Toolkit (http://xcat.sourceforge.net)
  - Framework for alerts and alert management
  - Hardware management, control, monitoring, etc.
  - Administration of cluster services: DNS, DHCP, Conserver
  - Software provisioning and maintenance
- Compute nodes
  - Boot from the network
  - RootFS is mounted via NFS (ro, rw, tmpfs) from the xCAT servers
  - Local hard drive used only for temporary data and swap space
  - Same OS image for all compute nodes

MN3 Management xCAT (hierarchy diagram): 2x xCAT master management nodes (DHCP, DNS, TFTP, HTTP) with the xcatdb MySQL database; 2x xCAT GPFS servers backed by a DS3512 controller with EXP3512 expansion; below them, xCAT service nodes (DHCP, TFTP, HTTP, NFS) organized in xCAT groups, each group serving 8x iDataPlex racks.

MN3 GPFS Network Racks (same floor-plan diagram, GPFS network racks highlighted).

MN3 GPFS Network (diagram)
- Each iDataPlex rack (idpx 1-36): 42x 1 Gb/s node links (5.25 GB/s aggregate) and 4x 10 Gb/s optical uplinks (4.8 GB/s)
- Core: a new Force10 E1200i 10G switch and the existing BSC Force10 E1200i 10G switch, populated with 10-port and 40-port 10G Ethernet line cards (including 4x ExaScale 10-port 10G line cards), with some slots still empty
- 20x 10 Gb/s (24 GB/s) and 30x 10 Gb/s links towards the 1.9 PB GPFS high-performance filesystems
- Compute nodes on the management rack are also attached


HPC GPFS Storage Racks (same floor-plan diagram, storage racks D1-D5 highlighted).

HPC Storage Hardware
- 3x data building blocks, each with:
  - 8x data servers (x3550 M3) with 48 GB main memory
  - 1x DS5300 controller couplet
  - 8x EXP5060 enclosures: 400x SATA 2 TB 7.2K rpm disks (50 disks/enclosure), 10 empty disk slots
  - Total capacity: 800 TB; net capacity: 640 TB (RAID6 8+2P)
- Total data capacity: 1200x SATA 2 TB 7.2K rpm disks; net capacity: 1920 TB (RAID6 8+2P)
- 1x metadata building block
  - 6x metadata servers (x3650 M3) with 128 GB main memory
  - 1x DS5300 controller couplet (4U)
  - 8x storage enclosure expansion units: 112x FC 600 GB 15K rpm disks (16 disks/enclosure)
  - Total capacity: 67.2 TB; net capacity: 33.6 TB (RAID1)
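The raw-to-net figures above come from the RAID layouts (RAID6 8+2P for data, RAID1 for metadata); a small sketch of that arithmetic, with disk counts and sizes taken from the slide (helper names are illustrative):

```python
# Net-capacity arithmetic for the HPC GPFS storage building blocks
# (disk counts and sizes from the slide; helper names are illustrative).
def raid6_8p2_net(raw_tb):
    """RAID6 8+2P keeps 8 data disks out of every 10."""
    return raw_tb * 8 / 10

def raid1_net(raw_tb):
    """RAID1 mirrors everything, so half the raw capacity is usable."""
    return raw_tb / 2

data_block_raw = 400 * 2                     # 400x 2 TB SATA disks per building block
print(raid6_8p2_net(data_block_raw))         # 640.0 TB net per building block
print(raid6_8p2_net(3 * data_block_raw))     # 1920.0 TB net for the 3 blocks

metadata_raw = 112 * 600 / 1000              # 112x 600 GB FC disks -> 67.2 TB raw
print(metadata_raw, raid1_net(metadata_raw)) # 67.2 TB raw, 33.6 TB net
```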

Storage: GPFS
- IBM high-performance shared-disk file management tool
- Allows multiple processes on all nodes to access the same file with standard syscalls
- File reads/writes are striped across multiple disks
  - Increases aggregate bandwidth
  - Balances the load across all disks in a filesystem
- Large files are divided into equal-sized blocks; consecutive blocks are allocated on different disks round-robin (a toy sketch of this follows below)
- Supports very large files and file system sizes (max tested: 4 PB)
- Allows concurrent reads and writes from multiple nodes
- GPFS uses a local cache on each client (MN pagepool: 1 GB)
- GPFS prefetches data into its buffer pool, issuing parallel I/O requests, for:
  - Sequential patterns
  - Reverse sequential patterns
  - Various strided access patterns
- Supports block sizes up to 8 MB
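As a purely illustrative sketch of the round-robin striping described above (a toy model, not GPFS's actual block allocator), this maps consecutive blocks of a large file onto a set of disks:

```python
# Illustrative round-robin striping: consecutive file blocks land on
# different disks, so large sequential reads/writes hit all disks in
# parallel. This is a toy model, not GPFS's real block allocator.
def block_layout(file_size_bytes, block_size_bytes, num_disks):
    """Return a list of (block_index, disk_index) pairs."""
    num_blocks = -(-file_size_bytes // block_size_bytes)  # ceiling division
    return [(b, b % num_disks) for b in range(num_blocks)]

# A 32 MB file on a filesystem with 4 MB blocks (like /gpfs/scratch)
# spread over 10 disks: blocks 0..7 go to disks 0..7, one per disk.
for block, disk in block_layout(32 * 2**20, 4 * 2**20, 10):
    print(f"block {block} -> disk {disk}")
```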

Storage: GPFS Distributed Locking Mechanism
- GPFS uses a distributed token-based lock system to keep files consistent
- Tokens are issued at block level or for the whole file, depending on the operation
- The file system manager acts as the token manager server
  - Coordinates access to files by granting the right to read/write data and metadata
- (Diagram) Token exchange for concurrent read()/write() calls: a client requests a read/write token from the file system manager; if no other node holds a conflicting token the manager grants it; otherwise it returns the list of nodes holding conflicting tokens, the requesting client asks the holder to relinquish its token, and once the token is granted the client performs its reads/writes
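A toy model of the token exchange sketched above (request, conflict list, relinquish, grant); it only illustrates the idea and is in no way GPFS's implementation:

```python
# Toy token manager: a client must hold the token for a file before it
# may read/write; if another client holds it, the manager points the
# requester at the conflicting holder, which relinquishes the token.
# Purely illustrative -- not GPFS's real distributed locking code.
class TokenManager:
    def __init__(self):
        self.holders = {}                 # file -> client currently holding the token

    def request(self, client, file):
        holder = self.holders.get(file)
        if holder is None or holder == client:
            self.holders[file] = client   # grant immediately
            return "granted"
        return f"conflict: ask {holder} to relinquish"

    def relinquish(self, client, file):
        if self.holders.get(file) == client:
            del self.holders[file]

mgr = TokenManager()
print(mgr.request("client1", "/gpfs/scratch/data"))  # granted
print(mgr.request("client2", "/gpfs/scratch/data"))  # conflict: ask client1 ...
mgr.relinquish("client1", "/gpfs/scratch/data")
print(mgr.request("client2", "/gpfs/scratch/data"))  # granted
```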

HPC Storage: GPFS Filesystems
- /gpfs/home: users' home directories (59 TB)
  - User quotas enforced
  - Data & metadata replication
  - Block size: 256 KB
- /gpfs/apps: applications (30 TB)
  - Data & metadata replication
  - Block size: 512 KB
- /gpfs/projects: data shared between users of the same project (612 TB)
  - Group quotas enforced
  - Metadata replication
  - Block size: 4 MB
- /gpfs/scratch: data used only during executions (1.1 PB)
  - Group quotas enforced
  - Metadata replication
  - Block size: 4 MB
  - No backup of this filesystem

HPC GPFS (diagram): clients of the 1.9 PB HPC GPFS filesystems (/gpfs/home, /gpfs/projects, /gpfs/scratch, /gpfs/apps), served at 15 GB/s over 10GE links:
- MareNostrum3 cluster: 3024 nodes, 1 PFlop (288x 10GE links)
- MinoTauro cluster: 128 nodes, 256 GPUs, 198 TFlops
- 2 SMP machines: 128 cores / 1.5 TB RAM and 96 cores / 1.2 TB RAM
- Nord cluster: 256 ppc64 nodes, 9.42 TFlops
- LifeScience cluster: 12 nodes
- HPC GPFS services: 1 login server (dl01.bsc.es) and 2 transfer servers (dt01.bsc.es, dt02.bsc.es)

HPC GPFS Services
- dlogin (dl01.bsc.es): interactive access via SSH from the Internet to BSC HPC GPFS
- dtransfer (dt01.bsc.es & dt02.bsc.es): transfer servers between the Internet and BSC HPC GPFS
  - Transfer protocols supported: SCP/SFTP, FTP+SSL, BBCP, GridFTP
  - These nodes also give access to other BSC storage:
    - Long-term storage: Active Archive & HSM (read-only mode)
    - Internal BSC departmental storage


Long-Term Storage (Archive)
- Not directly accessible from the HPC machines
- Can be used from any HPC machine through the batch system
- Commands: dtcp, dttar, dtmv, ...
- Active Archive (/gpfs/archive)
  - Archive system based on hard drives
  - 3.8 PB GPFS filesystem
  - Group quotas enabled
  - Metadata replicated
  - Block size: 1 MB

Active Archive Hardware Overview
- 12x GPFS servers (x3550 M4) with 16 GB RAM
- 10x data storage blocks
  - 1 DCS3700 controller + 2 EXP3700 expansions
  - 180x NL-SAS 3 TB 3.5'' 7.2K rpm disks (60 disks per enclosure)
  - Block capacity: 540 TB raw
- 3x metadata blocks
  - 1 DS3512 controller + 6 EXP3512 expansions
  - 77x SAS 600 GB 3.5'' 15K rpm disks
  - Block capacity: 45 TB raw
- Total capacity
  - Data: 5.45 PB raw (4.1 PB net)
  - Metadata: 135 TB raw (67 TB net)
- 10x client servers (x3550 M4) with 128 GB RAM
  - 4 amovers (explained later)
  - 4 NFS/CIFS servers for the BSC LAN
  - 4 Data Cloud Services servers for the Internet

Active Archive Services
- dtransfer (dt01 & dt02)
  - Transfers between the Internet and the long-term storage
  - Interactive access to Archive and HSM via NFS mounts
  - Low performance (interactive access only)
- (Diagram) dt01.bsc.es and dt02.bsc.es reach the Internet over 1x 10GE, mount the HPC GPFS filesystems (/gpfs/home, /gpfs/projects, /gpfs/scratch, /gpfs/apps) via GPFS, and mount the 3.7 PB /gpfs/archive via NFS over 1x 1GE

Active Archive Movers
- Batch Active Archive movers (amover1-amover4)
  - Non-interactive nodes
  - Execute data-movement commands between HPC GPFS and the Active Archive
- From ANY HPC machine (login or compute node), the user submits a batch job to the data-transfer queue:
  $ dtcp <ORIG> <DEST>
  $ dtmv <ORIG> <DEST>
  $ dtrsync <ORIG> <DEST>
- The job runs on one of the amover nodes when its turn comes
- Each amover server can provide up to 2 GB/s, connected with 2x 10GE links to /gpfs/home, /gpfs/projects, /gpfs/scratch, /gpfs/apps and the 3.7 PB /gpfs/archive


Batch Scheduler System
- Users only access the login nodes and submit jobs to the batch scheduler system
- IBM LSF is used in MareNostrum3
- LSF takes care of:
  - Handling user jobs (submit, cancel, query, ...; an illustrative submission sketch follows this list)
  - Prioritization between jobs
  - Health monitoring of all machine resources
  - Deciding which nodes are used by each job
  - Controlling process spawning and finalization for each job
  - Accounting of all hours consumed
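As a hedged illustration of how a user job might reach LSF from a login node, the sketch below builds a job script with standard #BSUB directives and pipes it to bsub; the queue name, job size, paths and command are made-up examples, not MareNostrum defaults:

```python
# Illustrative LSF submission from a login node: compose a job script
# with #BSUB directives and pipe it to bsub. Queue name, job size and
# command are made-up examples, not MareNostrum defaults.
import subprocess

job_script = """#!/bin/bash
#BSUB -J my_simulation          # job name
#BSUB -q class_a                # queue (example only)
#BSUB -n 64                     # number of tasks
#BSUB -W 02:00                  # wall-clock limit (hh:mm)
#BSUB -o %J.out                 # stdout file (%J = job id)
#BSUB -e %J.err                 # stderr file
mpirun ./my_mpi_app
"""

# bsub reads the script from stdin, as in: bsub < job.sh
subprocess.run(["bsub"], input=job_script, text=True, check=True)
```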

Batch Scheduler Overview (diagram): users on login1-login5 submit, query and cancel jobs against two scheduler servers and receive return codes and job information; when a job enters to run, the master spawns its processes on the allocated nodes, and when it finishes the nodes are cleaned up. Example queue listing:

JOBID   USER     STATE    QUEUE    CPUS
216454  user92   RUNNING  prace    2352
217530  user01   RUNNING  class_a  448
217558  user33   RUNNING  prace    448
217588  my_user  PEND     class_b  448
217589  my_user  PEND     class_b  448

Job Priorities
- Job priority is decided by a fair-share policy
- Hours are distributed according to a share distribution
- Dynamic priority per job (an illustrative sketch follows this list), based on:
  - Hours already consumed by the group the job belongs to
  - Type of hours (share distribution)
  - Waiting time in the queue
- Share distribution for MareNostrum 3:
  - 70% PRACE projects
  - 24% RES projects, with 3 internal levels of priority (A_hours >> B_hours >> C_hours)
  - 6% BSC internal use
- Larger jobs have higher priority than small ones
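Purely to illustrate the ingredients listed above (consumed hours, type of hours, waiting time), a toy dynamic-priority function might look like the sketch below; the weights and formula are invented for the example and are not LSF's fair-share implementation:

```python
# Toy dynamic-priority function combining the three ingredients named on
# the slide. Weights and formula are invented for illustration only --
# this is not LSF's actual fair-share computation.
def dynamic_priority(share, consumed_hours, allocated_hours,
                     hours_class_weight, waiting_hours):
    usage_ratio = consumed_hours / max(allocated_hours, 1)
    # priority drops as the group burns through its share of hours
    fair_share = share * (1.0 - min(usage_ratio, 1.0))
    return fair_share * hours_class_weight + 0.01 * waiting_hours

# A project with a 70% share and A-class hours (weight 3) that has used
# half its allocation and waited 12 hours in the queue:
print(dynamic_priority(0.70, 5000, 10000, hours_class_weight=3, waiting_hours=12))
```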

Process Spawning and Monitoring
- LSF decides where to run a job depending on:
  - Free resources at each moment
  - Use of a minimal number of Infiniband switches (FUTURE)
  - Minimal power usage (energy-aware scheduling)
- Health control
  - A regular process checks the health of all nodes, reporting any error
  - New jobs are not dispatched to a failed node
- Node clean-up
  - After any job finishes, an epilog process is executed to clean the node of any remaining processes

MareNostrum Monitoring
- Nagios: monitors basic administrative elements
- Ganglia: monitors performance values of the compute nodes
  - CPU load, memory used, local disk free space, ...
  - Graphic visualization of all those values
- xCAT: collects all SNMP traps from the hardware components
  - Hardware/firmware failure reporting
  - Scripts filter and process those traps
- BSC monitoring tools: GGcollector, og3, perfd


Software Stack
- Operating system: SLES11 SP2
- Cluster software: xCAT
- Compilers: Intel Cluster Studio, GNU compilers
- MPI: OpenMPI, IBM Parallel Environment, Intel MPI, MVAPICH2
- Infiniband: Mellanox OFED 1.5.3

www.bsc.es Thank you! For further information please contact Javier.bartolome@bsc.es