ANALYSIS OF SUPERCOMPUTER DESIGN
CS/ECE 566 Parallel Processing, Fall 2011
Anh Huy Bui, Nilesh Malpekar, Vishnu Gajendran
AGENDA
- Brief introduction to supercomputers
- Supercomputer design concerns and analysis
  - Processor architecture and memory
  - Interconnection model
  - Cluster design
  - Software (system & application)
- Conclusions
1. BRIEF INTRODUCTION
Brief history of the supercomputer
- Introduced in the 1960s by Seymour Cray, the "father of the supercomputer," at CDC
- The first supercomputer: the CDC 6600 (1964), a scalar processor running at 40 MHz
1. BRIEF INTRODUCTION
Processor roadmap of supercomputers from then until now
- Early machines: scalar processors
- 1970s: most supercomputers used vector processors
- Mid-1980s: a number of vector processors working in parallel
- Late 1980s and 1990s: massively parallel processing systems with thousands of ordinary CPUs, either off-the-shelf units or custom designs
- Today: supercomputers are highly tuned computer clusters using commodity processors combined with custom interconnects
1. BRIEF INTRODUCTION: roadmap of supercomputer speed from then until now (chart)
SUPERCOMPUTER DEFINITION
- Per Landau and Fink: the class of fastest and most powerful computers available
- Per the Dictionary of Science and Technology: any computer that is one of the largest, fastest, and most powerful available at a given time
LINPACK BENCHMARK
- Introduced by Jack Dongarra
- Reflects the performance of a dedicated system solving a dense system of linear equations
- The algorithm must conform to LU factorization with partial pivoting and perform 2/3 n^3 + O(n^2) double-precision floating-point operations
LINPACK BENCHMARK - DETAILS
Flop/s
- 64-bit floating-point operations per second; operations refer to additions or multiplications
- Gigaflops => 10^9 flop/s
- Teraflops => 10^12 flop/s
- Petaflops => 10^15 flop/s
- Exaflops => 10^18 flop/s
LINPACK BENCHMARK - DETAILS
Rpeak
- Theoretical peak performance
- The number of full-precision floating-point additions and multiplications that can be completed within the cycle time of the machine
- E.g. if a 1.5 GHz computer completes 4 floating-point operations per cycle, then Rpeak is 6 Gigaflops
LINPACK BENCHMARK - DETAILS
Rmax
- Maximum measured performance of a supercomputer, in Gigaflops
Nhalf
- Problem size for which the machine achieves half its peak speed
- A good indicator of machine bandwidth; a small value of Nhalf => good machine balance
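As a worked illustration of these quantities, the short C sketch below plugs in the example figures from the Rpeak slide (1.5 GHz, 4 flops per cycle) and an arbitrary problem size n; the values are illustrative only, not measurements of any real machine.

#include <stdio.h>

int main(void) {
    /* Example values from the Rpeak slide: 1.5 GHz, 4 flops per cycle. */
    double clock_hz = 1.5e9;
    double flops_per_cycle = 4.0;
    double rpeak = clock_hz * flops_per_cycle;       /* flop/s */

    /* LINPACK operation count for an n x n dense solve: ~2/3 n^3 + O(n^2). */
    double n = 10000.0;                              /* illustrative problem size */
    double ops = (2.0 / 3.0) * n * n * n;

    printf("Rpeak              : %.2f Gflop/s\n", rpeak / 1e9);
    printf("LINPACK ops (n=%.0f): %.3e flops\n", n, ops);
    printf("Time at Rpeak      : %.1f s\n", ops / rpeak);
    return 0;
}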
2. DESIGN CONCERNS
- Processor architecture, memory
- Interconnection model
- Cluster design
- Software (system and application)
SUPERCOMPUTER ARCHITECTURE
Processor architecture: Flynn's taxonomy
- SISD
- SIMD
- MISD
- MIMD
Memory
- Shared memory
- Distributed memory
- Virtual shared memory
VECTOR PROCESSING (SIMD)
- Acts on an array of data instead of a single data item
- Pipelines the data to the ALU; scalar processors pipeline only the instruction execution
- Example: A[i] = B[i] + C[i] for i = 1 to 10
SCALAR PROCESSOR EXECUTION
Execute this loop 10 times:
  read the next instruction and decode it
  fetch this number
  fetch that number
  add them
  put the result here
end loop
Demerits:
- The instruction is fetched and decoded ten times
- Memory is accessed ten times
VECTOR PROCESSOR EXECUTION
Read the instruction and decode it; fetch array B[1..10] and array C[1..10], add them, and put the results in A[1..10]
Merits:
- Only two address translations are needed
- Instruction fetch and decode is done only once
Demerits:
- Increases the complexity of the decoder
- Might slow down the decoding of normal instructions
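The A[i] = B[i] + C[i] example above can be written as the minimal C sketch below; with contiguous arrays and no aliasing, a vectorizing compiler (or a vector processor) can execute the loop as bulk loads, one vector add, and a bulk store, whereas a scalar pipeline repeats the fetch/decode/add sequence per element. The array size and the restrict qualifiers are illustrative assumptions, not part of the slides.

#include <stdio.h>

#define N 10

/* The loop from the slides: A[i] = B[i] + C[i] for i = 1..10.
 * With contiguous arrays and no aliasing, a vector machine can perform
 * bulk loads of B and C, one vector add, and a bulk store of A, instead
 * of ten separate scalar fetch/decode/add iterations. */
void vector_add(double *restrict a, const double *restrict b,
                const double *restrict c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }
    vector_add(a, b, c, N);
    printf("a[9] = %.1f\n", a[N - 1]);   /* expect 27.0 */
    return 0;
}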
VECTOR PROCESSOR BASED
- Fujitsu VPP500 series
- Cray-1, Cray-2, Cray X-MP, Cray Y-MP
- NEC SX-4 series
RISC ARCHITECTURE
- Simple instructions
- Simple hardware design
- Pipelining is used to speed up RISC machines
- Lower cost and good performance
PIPELINED VS. NON-PIPELINED (figure)
RISC BASED SUPERCOMPUTERS
- IBM Roadrunner: #1 spot among supercomputers in 2008; uses the Cell processor
- Tianhe-1A: #1 spot among supercomputers in 2010; uses Intel Xeon processors and NVIDIA Tesla GPGPUs
GPGPU
- General-purpose computing on graphics processing units (GPUs)
- Stream processor: a processor that can run a single kernel on many records
- SIMD
- High arithmetic intensity
GPGPU BASED SUPERCOMPUTERS
3 out of the top 5 supercomputers in the world use NVIDIA Tesla GPUs:
- Tianhe-1A
- Nebulae
- Tsubame 2.0
SPECIAL PURPOSE SUPERCOMPUTERS
- High-performance computing devices with a hardware architecture dedicated to a single problem
- Custom FPGA or VLSI chips are used
Examples:
- GRAPE for astrophysics
- D. E. Shaw Research's Anton for simulating molecular dynamics
- MDGRAPE-3 for protein structure computation
- Belle for playing chess
TOP 500 - THE CPU ARCHITECTURE: share of Top500 rankings between 1993 and 2009 (chart)
SHARED AND DISTRIBUTED MEMORY (figure)
SHARED AND DISTRIBUTED MEMORY
Virtual shared memory
- A programming model that allows processors on a distributed-memory machine to be programmed as if they had shared memory
- A software layer takes care of the necessary communications
MEMORY HIERARCHIES
Two types:
- Cache based
- Vector register based
Factors affecting memory latency:
- Temporal locality - for instructions and data
- Spatial locality - for data only
CACHE BASED
- Hierarchy of memory
- The most recently used data is kept in the cache memory
- Cost increases and access time decreases going up the hierarchy
VECTOR REGISTER BASED
- Consists of a small set of vector registers
- Main memory built from SRAM
- Instructions move data from main memory to vector registers in high-bandwidth bulk transfers
CACHE BASED & VECTOR REGISTER BASED
Cache based
- Merits: lower average access time; low cost
- Demerits: lower bandwidth to memory; programs not exhibiting spatial or temporal locality are penalized
Vector register based
- Merits: faster access to main memory
- Demerits: expensive
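A minimal sketch of the locality penalty mentioned above: both functions below sum the same matrix, but the row-major walk matches C's memory layout (good spatial locality) while the column-major walk strides through memory and, on a cache-based machine, typically misses far more often. The matrix size is an arbitrary illustrative choice.

#include <stdio.h>

#define N 1024

static double m[N][N];

/* Row-major traversal: consecutive accesses are adjacent in memory,
 * so each fetched cache line is fully used (good spatial locality). */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: successive accesses are N * 8 bytes apart,
 * so most accesses touch a new cache line on a cache-based machine. */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 1.0;
    printf("row-major sum = %.0f, column-major sum = %.0f\n",
           sum_rows(), sum_cols());
    return 0;
}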
LATEST DEVELOPMENTS
Using flash memory instead of DRAM:
- Cheaper than DRAM
- Retains data when the power is turned off
- Reduces space and power requirements
- Livermore's Hyperion supercomputer uses flash-based memory
2.2 INTERCONNECTION
The supercomputer interconnect joins nodes within the supercomputer:
- Compute nodes
- I/O nodes
- Service nodes
- Network nodes
It needs to support:
- High bandwidth
- Very low communication latency
INTERCONNECT TOPOLOGY
- Static (fixed) or dynamic (switched)
- Routing
- Involves large quantities of network cabling that often must fit within small spaces
- Supercomputers do NOT use wireless networking technology internally!
INTERCONNECT USAGE (chart)
WIDELY USED INTERCONNECTS
Quadrics
- 6 of the 10 fastest supercomputers used Quadrics in 2003
Hardware:
- QsNet I: 350 MB/s @ 5 us MPI latency
- QsNet II: 912 MB/s @ 1.26 us MPI latency
- QsTenG: 10 Gigabit Ethernet switches, from 24 ports
- QsNet III: approx. 2 GB/s in each direction @ 1.3 us MPI latency
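A common first-order model for interconnect cost combines the two figures quoted per fabric: transfer time = latency + message size / bandwidth. The sketch below plugs in the QsNet II numbers from this slide purely to illustrate that small messages are latency-bound and large ones bandwidth-bound; it is not a benchmark of the actual hardware.

#include <stdio.h>

/* First-order cost model: t = latency + bytes / bandwidth. */
static double transfer_time(double bytes, double latency_s, double bw_bytes_per_s) {
    return latency_s + bytes / bw_bytes_per_s;
}

int main(void) {
    double latency = 1.26e-6;      /* QsNet II MPI latency, seconds */
    double bw = 912e6;             /* QsNet II bandwidth, bytes/s   */
    double sizes[] = { 64, 4096, 1 << 20 };

    for (int i = 0; i < 3; i++) {
        double t = transfer_time(sizes[i], latency, bw);
        printf("%8.0f bytes -> %8.2f us\n", sizes[i], t * 1e6);
    }
    return 0;
}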
WIDELY USED INTERCONNECTS
Infiniband
- Switched-fabric communication link
- Point-to-point bidirectional serial links between processor nodes and high-speed peripherals
- Supports several signaling rates
- Links can be bonded together for additional bandwidth
WIDELY USED INTERCONNECTS
Myrinet
- High-speed LAN
- Much lower protocol overhead: better throughput, less interference and latency
- Can bypass the operating system
- Physically two fiber-optic cables, upstream and downstream
TOFU: 6D MESH/TORUS
- From Fujitsu, for large-scale supercomputers that exceed 10 petaflops
- Stands for TOrus FUsion
- Can be divided into rectangular submeshes of arbitrary size, and provides a torus topology for each submesh
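To give a feel for how mesh/torus addressing works in general, the sketch below computes a node's neighbours in a small 3D torus using modular wraparound. This is a simplified, hypothetical illustration of torus addressing, not Fujitsu's Tofu implementation, which combines a 3D torus with three additional local dimensions and multipath routing.

#include <stdio.h>

#define DIM 3          /* simplified: 3 torus dimensions instead of Tofu's 6 */

/* Size of each torus dimension (illustrative values only). */
static const int size[DIM] = { 4, 4, 8 };

/* Print the two neighbours of a node along each dimension, using modular
 * wraparound, which is what turns a mesh into a torus. */
void print_neighbours(const int coord[DIM]) {
    for (int d = 0; d < DIM; d++) {
        int plus[DIM], minus[DIM];
        for (int i = 0; i < DIM; i++) { plus[i] = coord[i]; minus[i] = coord[i]; }
        plus[d]  = (coord[d] + 1) % size[d];
        minus[d] = (coord[d] - 1 + size[d]) % size[d];
        printf("dim %d: (%d,%d,%d) <-> (%d,%d,%d)\n", d,
               minus[0], minus[1], minus[2], plus[0], plus[1], plus[2]);
    }
}

int main(void) {
    int node[DIM] = { 0, 3, 7 };   /* a corner node, so wraparound is visible */
    print_neighbours(node);
    return 0;
}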
TOFU: 6D MESH/TORUS (figure)
TOFU: MULTIPATH ROUTING (figure)
TOFU: 3D TORUS VIEW (figure)
TOFU: OTHER FEATURES
Throughput and packet transfer:
- 10 GB/s of fully bidirectional bandwidth for each link
- 100 GB/s of off-chip bandwidth for each node, to feed enough data to a massive array of 128-Gflops processors
- Variable packet length: 32 B to 2 KB, including header and CRC
TOFU: 6D MESH/TORUS (additional figures)
2.3 CLUSTER DESIGN
Nowadays, most supercomputers are clusters:
- Typical nodes in a cluster
- Tiered architecture of a cluster
- Energy consumption
- Cooling problem
2.3 CLUSTER DESIGN
Typical nodes in a cluster:
- Compute nodes: comprise the heart of the system; this is where user jobs run
- I/O nodes: dedicated to performing all I/O requests from compute nodes; not available to users directly
- Login/front-end nodes: where users log in, compile, and interact with the batch system
- Service nodes: for management functions such as system boot, machine partitioning, system performance measurement, system health monitoring, etc.
2.3 CLUSTER DESIGN: nodes in the BlueGene/P general configuration (figure)
2.3 CLUSTER DESIGN: scaling architecture (H/W scaling) (figure)
2.3 CLUSTER DESIGN: a schematic overview of a Blue Gene/L supercomputer (figure)
2.3 CLUSTER DESIGN: a schematic overview of the tiered composition of the Roadrunner supercomputer cluster (figure)
2.3 CLUSTER DESIGN
Energy consumption:
- A typical supercomputer consumes a lot of energy, most of which turns into heat, which in turn requires cooling
Examples:
- Tianhe-1A: 4.04 MW; at 10 cents per kWh, that is about $400/hr and $3.5M/year
- K computer: 9.89 MW, roughly the power draw of 10,000 suburban homes; about $10M/year
- Energy efficiency is measured in FLOPS/watt
- Green500, June 2011: IBM BlueGene/Q is 1st at 2097.19 MFLOPS/watt
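The cost figures on the slide above follow from simple arithmetic, sketched below; the 10 cents per kWh electricity price is the slide's own assumption, and continuous full-power operation is assumed.

#include <stdio.h>

int main(void) {
    /* Tianhe-1A draws about 4.04 MW; assume 10 cents per kWh as on the slide. */
    double power_mw = 4.04;
    double price_per_kwh = 0.10;

    double cost_per_hour = power_mw * 1000.0 * price_per_kwh;     /* kW * $/kWh */
    double cost_per_year = cost_per_hour * 24.0 * 365.0;

    printf("Cost per hour: $%.0f\n", cost_per_hour);              /* ~ $404  */
    printf("Cost per year: $%.2fM\n", cost_per_year / 1e6);       /* ~ $3.5M */
    return 0;
}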
2.3 CLUSTER DESIGN
Cooling techniques:
- Liquid cooling: Fluorinert "cooling waterfall" (Cray-2); hot-water cooling (IBM Aquasar system, where the water is used to heat the building as well)
- Air cooling: IBM BlueGene/P
- Combination of air conditioning with liquid cooling: System X at Virginia Tech
- Using low-power processors: IBM BlueGene systems
2.3 CLUSTER DESIGN: IBM BlueGene/P cooling system (figure)
2.3 CLUSTER DESIGN: IBM Aquasar cooling system (figure)
2.4 SOFTWARE - SYSTEM SOFTWARE
Operating systems:
- Most supercomputers now use Linux
- Operating systems used by the Top 500 (chart)
2.4 SOFTWARE - APPLICATION SOFTWARE/TOOLS
Programming languages:
- Base languages: Fortran, C
- Variants of C: C for CUDA or OpenCL for GPGPUs
Libraries:
- Loosely connected clusters: PVM, MPI
- Tightly coordinated shared-memory clusters: OpenMP (see the hybrid sketch below)
Key software for different functions:
- Full Linux kernel on I/O nodes
- Proprietary kernel dedicated to compute nodes
- Scalable control system based on an external service node
Tools: open-source solutions such as Beowulf, Warewulf, ...
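As a minimal sketch of the programming-model split listed above, the hybrid example below uses MPI for message passing between nodes and OpenMP for shared-memory parallelism within a node; it assumes an MPI installation and is typically built with a wrapper compiler such as mpicc with OpenMP enabled (e.g. mpicc -fopenmp).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);            /* message passing across nodes */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Shared-memory parallelism within a node. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}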
2.4 SOFTWARE - APPLICATION SOFTWARE/TOOLS: IBM BlueGene software stacks (figure)
3. CONCLUSIONS
- Gave an overview of the concerns when designing a supercomputer: hardware design, interconnection, software design, cluster layout
- Other concerns: power consumption and cooling
- Not all topics could be covered; many designs remain proprietary
REFERENCES
- Supercomputer, Wikipedia: http://en.wikipedia.org/wiki/supercomputer
- Yuichiro Ajima, Shinji Sumimoto and Toshiyuki Shimizu (Fujitsu), "Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers"
- "Evolution of the IBM System Blue Gene Solution," IBM Redpaper REDP-4247-00
- "Using the Dawn BG/P System," https://computing.llnl.gov/tutorials/bgp/
THANK YOU!