Service Partition: Specialized Linux Nodes (Compute PE, Login PE, Network PE, System PE, I/O PE)




Service partition: specialized Linux nodes (Compute PE, Login PE, Network PE, System PE, I/O PE).
Microkernel on compute PEs, full-featured Linux on service PEs.
Service PEs specialize by function.
The software architecture eliminates OS jitter and enables reproducible run times.
Large machines boot in under 30 minutes, including the filesystem.

[System diagram: 3D torus (X, Y, Z) of compute nodes plus service nodes: login nodes, network nodes (GigE / 10 GigE), boot/syslog/database nodes, I/O and metadata nodes (Fibre Channel to the RAID subsystem), and the SMW.]

Node characteristics:
Number of cores: 12
Peak performance, Istanbul (2.6 GHz): 124 Gflops/sec
Memory size: 16 GB per node
Memory bandwidth: 25.6 GB/sec direct connect memory
Interconnect: Cray SeaStar2+
HyperTransport: 6.4 GB/sec direct connect

Cray XT5 systems ship with the SeaStar2+ interconnect, now scaled to 225,000 cores.
[Block diagram: 6-port router, blade control processor interface, DMA engine, HyperTransport interface, memory, PowerPC 440 processor.]
Custom ASIC
Integrated NIC / router
MPI offload engine
Connectionless protocol
Link-level reliability
Proven scalability to 225,000 cores

[Cabinet airflow diagram: alternating low-velocity and high-velocity airflow regions.]

[Node diagram: four Opteron dies, each with a 6 MB L3 cache and two DDR3 channels, fully connected with HT3 links; HT1/HT3 connection to the interconnect.]
2 multi-chip modules, 4 Opteron dies
8 channels of DDR3 bandwidth to 8 DIMMs
24 (or 16) computational cores, 24 MB of L3 cache
Dies are fully connected with HT3
Snoop filter feature allows the 4-die SMP to scale well

Cray Baker node characteristics:
Number of cores: 16 or 24
Peak performance: 140 or 210 Gflops/s
Memory size: 32 or 64 GB per node
Memory bandwidth: 85 GB/sec
Interconnect: 10 12X Gemini channels (each Gemini acts like two nodes on the 3D torus); high-radix YARC router with adaptive routing, 168 GB/sec capacity

[Diagram: module with SeaStar versus module with Gemini on the 3D torus (X, Y, Z).]

Node characteristics:
Number of cores: 16 or 24 (MC), 32 (IL)
Peak performance, MC-8 (2.4 GHz): 153 Gflops/sec
Peak performance, MC-12 (2.2 GHz): 211 Gflops/sec
Memory size: 32 or 64 GB per node
Memory bandwidth: 83.5 GB/sec direct connect memory
Interconnect: Cray SeaStar2+
HyperTransport: 6.4 GB/sec direct connect

[Cooling diagram: liquid R134a in, liquid/vapor mixture out; cool air is released into the computer room.]
The hot air stream passes through an evaporator and rejects heat to R134a via liquid-vapor phase change (evaporation).
R134a absorbs energy only in the presence of heated air.
Phase change is 10x more efficient than pure water cooling.

[Photo: R134a piping, inlet evaporator, and exit evaporators.]

New enhanced blower to handle the 130-watt Magny-Cours processor.
Enhanced sound kit to reduce noise.
More efficient design.
New VFD (Variable Frequency Drive) for the blower.
An upgrade kit product code, containing the required components, will be available for existing XT5 customers.

Air is taken from the top, with no line of sight for sound.
Foam-lined duct for sound absorption.
Extra foam added to the front; the door now seals to the front I/O extension.

Cray software ecosystem:
Applications: site-specific, public domain, and ISV applications
Compilers, debuggers, schedulers, and tools: CrayPat, Cray Apprentice, public-domain tools
Libraries
OS: Cray Linux Environment

Service nodes run a full-featured SLES10 Linux installation; we add our tools, libraries, and services.
Compute nodes run a slim-line Linux kernel with only the necessary services: only run what's needed so the application can rule the roost.
Libraries:
MPT - Message Passing Toolkit
LibSci - Cray Scientific Libraries (BLAS, LAPACK, ScaLAPACK, FFTW, etc.)
I/O libraries - HDF5 & NetCDF
Tools:
Compilers - PGI, Cray, GNU, PathScale, Intel
CrayPAT - performance analysis tools
ALPS - application placement, job launching, and application clean-up; users interface with ALPS primarily via aprun
PBS/TORQUE & MOAB - all jobs on the local XTs are batch jobs; MOAB is an advanced job scheduler used on Jaguar and Kraken (see the example batch script below)
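As a minimal sketch of how these pieces fit together, the hypothetical batch script below submits an MPI job through PBS/TORQUE and launches it on the compute nodes with aprun. The job name, account string, "-l size" syntax, and core counts are illustrative assumptions that vary by site.

  #!/bin/bash
  #PBS -N example_job          # job name (illustrative)
  #PBS -A ACCOUNT123           # project/account string (assumed; site-specific)
  #PBS -l walltime=01:00:00    # wall-clock limit
  #PBS -l size=24              # core-count syntax used on some XT sites (assumption)

  cd $PBS_O_WORKDIR            # run from the submission directory

  # Launch 24 MPI ranks, 12 per node, on the compute nodes via ALPS
  aprun -n 24 -N 12 ./my_app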

[Chart: compatibility versus scale. At one end, high scale with less compatibility (ultra-light Linux image); in the middle, capacity/production workloads (mid-weight Linux image); at the other end, low scale with full compatibility for shrink-wrapped 3rd-party applications (full Linux image and all services).]

Core specialization benefit: eliminates noise by directing overhead (interrupts, daemon execution) to a single core; it rearranges existing work.
Without core specialization, overhead affects every core; with core specialization, overhead is confined, giving the application exclusive access to the remaining cores.
It helps some applications and hurts others: POP 2.0.1 on 8K cores on an XT5 saw a 23% improvement.
Larger jobs should see a larger benefit, and future nodes with larger core counts will benefit even more.
This feature is adaptable and available on a job-by-job basis (see the launch sketch below).
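Per-job enablement is typically requested at launch time. As an assumption-laden sketch, on systems whose aprun supports the core-specialization option (-r), a launch line might look like the following; option availability, node size, and rank counts depend on the CLE release and machine.

  # Reserve 1 core per node for OS/daemon overhead (core specialization),
  # leaving 11 of the 12 cores per node for the application (assumed XT5 node).
  # 88 ranks = 8 nodes x 11 application cores per node.
  aprun -r 1 -n 88 -N 11 ./my_app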

Cray XT/XE supercomputers come with compiler wrappers to simplify building parallel applications (similar to mpicc/mpif90):
Fortran compiler: ftn
C compiler: cc
C++ compiler: CC
Using these wrappers ensures that your code is built for the compute nodes and linked against important libraries:
Cray MPT (MPI, SHMEM, etc.)
Cray LibSci (BLAS, LAPACK, etc.)
Choose the underlying compiler via the PrgEnv-* modules; do not call the PGI, Cray, etc. compilers directly.
Always load the appropriate xtpe-<arch> module for your machine: it enables the proper compiler target and links the optimized math libraries (see the example session below).
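A minimal sketch of such a build session follows. The specific module names (PrgEnv-pgi as the default, PrgEnv-cray as the target, xtpe-istanbul for the architecture) and source file names are assumptions that vary by machine and software release.

  # Swap the programming environment to select the underlying compiler
  module swap PrgEnv-pgi PrgEnv-cray   # assumes PrgEnv-pgi is the current default

  # Target the node's processor (module name is machine-specific; assumed here)
  module load xtpe-istanbul

  # Build with the wrappers; MPT and LibSci are linked in automatically
  ftn -o my_app_f  main.f90
  cc  -o my_app_c  main.c
  CC  -o my_app_cx main.cpp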

Compilers from Cray's perspective:

PGI
Very good Fortran and C, pretty good C++
Good vectorization
Good functional correctness with optimization enabled
Good manual and automatic prefetch capabilities
Very interested in the Linux HPC market, although that is not their only focus
Excellent working relationship with Cray, good bug responsiveness

PathScale
Good Fortran and C, possibly good C++
Outstanding scalar optimization for loops that do not vectorize
The Fortran front end uses an older version of the CCE Fortran front end
OpenMP uses a non-pthreads approach
Scalar benefits will not get as much mileage with longer vectors

Intel
Good Fortran, excellent C and C++ (if you ignore vectorization)
Automatic vectorization capabilities are modest compared to PGI and CCE
Use of inline assembly is encouraged
Focus is more on best speed for scalar, non-scaling apps
Tuned for Intel architectures, but actually works well for some applications on AMD (this is becoming increasingly important)

Compilers from Cray's perspective (continued):

GNU
Improving Fortran, outstanding C and C++ (if you ignore vectorization)
Obviously the best for gcc compatibility
Scalar optimizer was recently rewritten and is very good
Vectorization capabilities focus mostly on inline assembly, but automatic vectorization is improving
Note: you may need to recompile when changing between major versions (4.5 -> 4.6, for example)

CCE
Outstanding Fortran, very good C, and okay C++
Very good vectorization
Very good Fortran language support; the only real choice for coarrays
C support is quite good, with UPC support
Very good scalar optimization and automatic parallelization
Clean implementation of OpenMP 3.0, with tasks
Sole delivery focus is on Linux-based Cray hardware systems
Best bug turnaround time (if it isn't, let us know!)
Cleanest integration with other Cray tools (performance tools, debuggers, upcoming productivity tools)
No inline assembly support

Recommended starting flags:

PGI
-fast -Mipa=fast(,safe)
If you can be flexible with precision, also try -Mfprelaxed
Compiler feedback: -Minfo=all -Mneginfo
man pgf90; man pgcc; man pgCC; or pgf90 -help

Cray
<none; optimization is turned on by default>
Compiler feedback: -rm (Fortran), -hlist=m (C)
If you know you don't want OpenMP: -xomp or -Othread0
man crayftn; man craycc; man crayCC

PathScale
-Ofast
Note: this is a little looser with precision than the other compilers
Compiler feedback: -LNO:simd_verbose=ON:vintr_verbose=ON:prefetch_verbose=ON
man eko ("Every Known Optimization")

GNU
-O2 / -O3
Compiler feedback: -ftree-vectorizer-verbose=1
man gfortran; man gcc; man g++

Intel
-fast
Compiler feedback: -vec-report1
man ifort; man icc; man icpc

(See the example compile lines below.)
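As a hedged illustration, the lines below combine these flags with the compiler wrappers. The source and output file names are placeholders; the flags are the ones recommended above, and each line assumes the matching PrgEnv module is loaded.

  # With PrgEnv-pgi loaded: aggressive optimization plus feedback listing
  ftn -fast -Mipa=fast,safe -Minfo=all -Mneginfo -o my_app main.f90

  # With PrgEnv-cray loaded: defaults are already aggressive; request a loopmark listing
  ftn -rm -o my_app main.f90

  # With PrgEnv-gnu loaded: standard optimization plus vectorizer feedback
  ftn -O3 -ftree-vectorizer-verbose=1 -o my_app main.f90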

Goal of the scientific libraries: improve productivity at optimal performance. Cray uses four concentrations to achieve this:
Standardization - use standard or de facto standard interfaces whenever available
Hand tuning - use extensive knowledge of the target processor and network to optimize common code patterns
Auto-tuning - automate code generation and a huge number of empirical performance evaluations to configure the software to the target platform
Adaptive libraries - make runtime decisions to choose the best kernel/library/routine

FFT: CRAFFT, FFTW, P-CRAFFT
Dense: BLAS, LAPACK, ScaLAPACK, IRT, CASE
Sparse: CASK, PETSc, Trilinos
IRT - Iterative Refinement Toolkit; CASK - Cray Adaptive Sparse Kernels; CRAFFT - Cray Adaptive FFT; CASE - Cray Adaptive Simplified Eigensolver

BLAS, LAPACK, ScaLAPACK, BLACS, PBLAS, ACML, FFTW 2 & 3, PETSc, Trilinos, IRT*, MUMPS, ParMetis, SuperLU, SuperLU_dist, Hypre, Scotch, Sundials, CASK*, CRAFFT*, CASE* (* Cray-specific; see the module-loading sketch below)
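A hedged sketch of pulling a few of these packages into a build follows. Module names such as fftw and petsc are typical on XT/XE systems but are assumptions that vary by site and release; once the modules are loaded, the compiler wrappers add the include and link flags for you.

  # Load third-party library modules (names/versions are site-specific assumptions)
  module load fftw
  module load petsc

  # The ftn/cc/CC wrappers pick up include paths and libraries from the
  # loaded modules, so explicit -I/-L/-l flags are usually unnecessary
  cc -o sparse_solver solver.c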

Cray MPT (Message Passing Toolkit):
Full MPI-2 support (except process spawning), based on ANL MPICH2
Cray uses the MPICH2 Nemesis layer for Gemini
Cray-tuned collectives
Cray-tuned ROMIO for MPI-IO
Current release: 5.3.0 (MPICH2 1.3.1)
Improved MPI_Allreduce and MPI_Alltoallv
Initial support for checkpoint/restart for MPI or Cray SHMEM on XE systems
Improved support for MPI thread safety
module load xt-mpich2
Tuned SHMEM library: module load xt-shmem
(A short build-and-run example follows.)
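To tie this together, here is a minimal, hedged example of building and launching an MPI program with MPT. The xt-mpich2 module is usually loaded by default with the programming environment, and the file names and rank counts are illustrative.

  # MPT is typically loaded by default; load it explicitly if needed
  module load xt-mpich2

  # The cc wrapper links against Cray MPT automatically (no explicit MPI link flags)
  cc -o hello_mpi hello_mpi.c

  # Inside a batch job, launch 24 ranks across two 12-core nodes
  aprun -n 24 -N 12 ./hello_mpi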