HPC Architecture End to End Alexandre Chauvin
Agenda: HPC Software Stack; Visualization; National Scientific Center
HPC Software Stack Alexandre Chauvin
Typical HPC Software Stack
Typical HPC cluster configuration: compute nodes, IO nodes, frontend/login nodes, a cluster management server, a high-speed interconnect, a management LAN, and an external LAN.
Software stack layers, from the hardware up: Hardware, Operating System, Cluster File System, Cluster Management, Development Tools, MPI, Resource Management, Applications.
Required HPC software stack, on the compute nodes: OS (Windows, Linux, Unix), parallel libraries, development tools. For the full architecture: high-speed IO filesystem, system management, job scheduler.
HPC Compute Nodes
Typical operating systems for High Performance Computing: Unix (AIX, HP-UX, Solaris), Linux (RHEL, SLES), Windows.
To build applications, HPC users need compilers and mathematical libraries, which usually depend on the hardware and operating system environment. Compilers: GNU, PGI, Intel, PathScale. Mathematical libraries: GotoBLAS, MKL, ACML, MASS.
To build parallel applications, HPC users usually need an implementation of MPI (Message Passing Interface). The available implementations depend on the network type, hardware vendor and operating system: Open MPI, MPICH, LAM/MPI, HP-MPI, IBM PE. A minimal example follows.
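To make the MPI layer concrete, here is a minimal MPI program in C. It is only an illustrative sketch: it assumes any standard MPI implementation (for example one of those listed above) and the usual mpicc/mpirun wrappers, which are toolchain conventions rather than content from the slide.

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI example: each process reports its rank and the total
 * number of processes in the job.
 * Compile with: mpicc -o hello hello.c
 * Run with (for example): mpirun -np 4 ./hello                      */
int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down the MPI runtime */
    return 0;
}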
HPC System Management
HPC system administrators need system management software in order to remotely: deploy operating systems with the appropriate configuration; manage user accounts; be alerted to predictive failures, security events, and hardware and software errors; and deploy new or updated software and operating system patches.
Additionally, tools such as Ganglia give HPC users and administrators a quick overview of the performance of the full system.
[Diagram: cluster system management through a cluster management server; screenshot of Ganglia monitoring.]
Job Scheduler
Job schedulers aim at improving resource utilization and quality of service. The main metrics are resource utilization, system throughput, and mean response time.
Different algorithms can be specified to the resource manager for scheduling jobs, based on the number of processors required, the estimated elapsed time required, and the job priority; an illustrative sketch follows.
Source: www.dell.com
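The slide does not name a specific scheduler, so the following C sketch is purely hypothetical: it shows one way the three criteria above (requested processors, estimated elapsed time, job priority) could be combined into a ranking score. The job fields and weights are illustrative assumptions, not the algorithm of any particular product.

#include <stdio.h>

/* Hypothetical job description: fields mirror the criteria on the slide. */
struct job {
    const char *name;
    int  procs_requested;    /* number of processors requested        */
    int  est_minutes;        /* user-estimated elapsed (wall) time    */
    int  user_priority;      /* higher value = more important         */
};

/* Illustrative ranking: favour high user priority, then short jobs,
 * then small jobs (a crude shortest/smallest-job-first bias).        */
static double rank(const struct job *j)
{
    return j->user_priority * 1000.0
         - j->est_minutes   * 10.0
         - j->procs_requested;
}

int main(void)
{
    struct job queue[] = {
        { "cfd_run",   128, 600, 5 },
        { "post_proc",   4,  30, 5 },
        { "urgent_fix", 16,  60, 9 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-10s score %.0f\n", queue[i].name, rank(&queue[i]));
    return 0;
}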
HPC Software Stack: High Performance Shared Filesystem
HPC environments require shared filesystems accessible by all nodes, or by a pool of nodes with a common purpose. An HPC filesystem needs to be:
High performance: IO can represent a substantial part of an HPC workload.
Scalable: it must be possible to increase capacity and performance as HPC systems and their requirements evolve.
Reliable: no single point of failure.
[Diagram: clients, IO network, IO servers (data and/or metadata), enterprise storage.]
HPC Software Stack: IO Parallelization
Local filesystem: not shared, no parallelization (clients only).
NFS filesystem: shared but not parallelized (clients, IO network, a single NFS server in front of its storage).
HPC filesystem: shared and parallelized (clients, IO network, multiple IO servers for data and/or metadata, enterprise storage); a parallel IO sketch follows.
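As a concrete illustration of parallelized IO on a shared filesystem, here is a small C sketch using MPI-IO (MPI_File_open / MPI_File_write_at from the MPI standard), in which every rank writes its own block of a single shared file. The filename and block size are made-up values; the point is that on a parallel filesystem such writes can proceed concurrently from many nodes, which a local disk or a single NFS server cannot offer.

#include <mpi.h>
#include <stdlib.h>

#define BLOCK 1048576   /* 1 MiB per rank, illustrative value */

/* Each rank writes its own 1 MiB block at a distinct offset of one
 * shared file; on a parallel filesystem the writes can proceed
 * concurrently across nodes.                                        */
int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK);
    for (int i = 0; i < BLOCK; i++) buf[i] = (char)rank;

    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                      MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}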
GPFS Architecture Performance
The performance of the GPFS subsystem depends on three different parts of its architecture: the server network, the SAN network, and the disk subsystem.
Client IO requests travel from the clients over the IO network to the IO servers (data and/or metadata), then over the SAN network to the enterprise storage; server network performance, SAN network performance, and SAN disk performance each contribute.
Each of them can be scaled to increase the IO capabilities. The goal is a configuration whose performance is balanced across the whole GPFS chain.
HPC Software Stack: Multi-Cluster Environment
HPC file systems and job scheduling can be shared among clusters at different sites. This is used by major computing consortia such as DEISA.
DEISA operates a heterogeneous HPC infrastructure currently formed by eleven European national supercomputing centres that are tightly interconnected by a dedicated high-performance network. Within the DEISA projects, all the clusters share the same data through Multi-Cluster GPFS; Jülich, CINECA, IDRIS and RZG have closely coupled job scheduling and file systems.
The DEISA Global File System at European Scale.
Visualization Alexandre Chauvin
HPC Visualization Introduction
Visualization is often the best way for HPC users to analyse the results of a large computation.
Typical applications: computer-aided design, visualization of medical data, modeling and simulation, oil and gas exploration, virtual training, remote collaboration.
Challenges: datasets are usually sent from the HPC data center to analysts and visualized locally on standalone workstations, but HPC computations write large datasets that are too large to be viewed on a single computer (especially at high resolution), too large to be transferred over a WAN, and possibly too confidential to be transferred over the internet. This method also prevents collaboration between analysts.
Visualization Example #1: Dam Breaking
EDF R&D SPARTACUS code: 250 K / 2 M / 5 M cells, 1000 time steps, 25 MB per file, loading time from 0.5 to 50 seconds per file.
Typical Visualization Architecture
[Diagram: the data center is connected through several networks to the users; data and visualization software (Data SW) are replicated at each user's site.]
HPC Visualization Architecture
A visualization cluster renders complex and large datasets inside the data center; this also allows a finer resolution of the final image. Only images are transferred to the users, not the data, which enables collaboration over the WAN.
[Diagram: data, visualization software and applications stay in the data center; image transfer, not data transfer, over the networks/WAN.]
National Computing Center Alexandre Chauvin
Customer Case: National Scientific Center
Missions: supporting national competitiveness for scientific research that needs very high compute capabilities; handling more than 250 projects per year and 1000 users.
Objectives for the upgrade: keep the platform evolving to drive competitiveness; add a high-performance architecture for scalar computation; integrate into a European computing consortium.
Current computing capabilities:
10 vector supercomputers: 8-processor systems with 64 GB of memory per system, 10 TB of IO disk space, 1.28 TFlops vector peak performance in total.
SMP clusters: 20 nodes of 32 processors at 1.7 GHz for OpenMP and MPI workloads, with 128 GB of memory and 1 TB of IO disk space per node; 96 nodes of 4 processors at 1.5 GHz with 8 GB of memory for MPI workloads, with 10 TB of shared IO disk space; 6.7 TFlops peak performance in total.
Also: pre- and post-processing capabilities, visualization, and 1000 TB of IO disk space for archiving.
New Computational Needs: Requirements
Requirements Analysis
Compute performance to be improved by a factor of 20; a hybrid architecture is welcome. Strong focus on IO capacity, usable performance, and availability. Focus on RAS and usability. The solution can be limited by datacenter characteristics, i.e. power consumption.
Solution Analysis: SMP Cluster
A hybrid solution is possible, based on SMP and MPP architectures. Some numbers: InfiniBand provides 1 GB/s of IO per link; 4 Gb Fibre Channel provides 500 MB/s of IO per link.
SMP cluster characteristics:
SMP servers: nodes with 16 dual-core processors at 4.7 GHz; 54 such nodes = 32 TFlops.
InfiniBand interconnect for the MPI network; InfiniBand interconnect for global IO access (2 cards for redundancy); 4 Gb SAN Fibre Channel connection for dedicated IO requirements (2 cards).
Switches: 2 InfiniBand switches for MPI communications, 2 InfiniBand switches for the global IO connection, 2 SAN Fibre Channel switches for dedicated IO.
Dedicated filesystem: 400 TB of disks.
[Diagram: to global IO; 2 SAN switches of 140 ports; 400 TB of dedicated filesystem.]
Solution Analysis: MPP
MPP solution characteristics: based on 10 racks of Blue Gene/P, 40960 processors at 850 MHz = 139 TFlops; 10 Gb interconnect towards the global IO filesystem, with 16 links for 2 GB/s of throughput performance.
Solution Analysis: Global IO Filesystem
High-speed shared filesystem based on GPFS software, with 2 different networks accessing the same data: 16x 10 Gb links for the MPP system and 32x InfiniBand links for the SMP cluster. The GPFS servers are connected to the storage disks through 4x 4 Gb Fibre Channel links.
IO subsystem performance: 16x 10 Gb links give 20 GB/s of theoretical throughput; 4x 4 Gb FC links give 2 GB/s of theoretical throughput; 8 disk bays give 16 GB/s of theoretical performance. A short sanity check of this arithmetic follows.
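As a sanity check of the theoretical figures above, the short C snippet below reproduces the arithmetic: link count times link speed in gigabits, divided by 8. Protocol overhead is ignored, which appears to be how the slide's numbers were obtained; that simplification is an assumption.

#include <stdio.h>

/* Reproduce the theoretical throughput figures quoted on this slide.
 * Raw conversion only (bits / 8); protocol overhead is ignored, as it
 * appears to be in the slide's numbers.                               */
static double aggregate_gbytes_per_s(int links, double gbits_per_link)
{
    return links * gbits_per_link / 8.0;
}

int main(void)
{
    printf("MPP  -> GPFS : 16 x 10 Gb links = %.1f GB/s\n",
           aggregate_gbytes_per_s(16, 10.0));
    printf("GPFS -> disks:  4 x  4 Gb FC    = %.1f GB/s\n",
           aggregate_gbytes_per_s(4, 4.0));
    return 0;
}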
Management Subsystem
Full HPC management subsystem: job scheduler, cluster management, multi-cluster job scheduling with GPFS, and cluster system management through a cluster management server.
Computing Solution Overview
[Diagram: 56 SMP nodes (54 SMP compute nodes and 2 interactive nodes); 10 BG/P racks with BG service nodes and BG front-end nodes; Myrinet 10G network; InfiniBand switches IB1, IB2, IB3, IB4; 16 GPFS nodes; links to the European HPC consortium and to DEISA; IDRIS network connection with 2x 10 Gb Ethernet ports and 5x 1 Gb Ethernet ports; 8 disk bays providing 800 TB usable at 16 GB/s; 400 TB for local filesystems at 8 GB/s; 2 SAN switches of 140 ports; a management node and admin network. Link legend: IB 4x, 1 Gb Ethernet, 10 Gb Ethernet, Myrinet 10G, 4 Gb SAN FC.]