ANALYSIS OF SUPERCOMPUTER DESIGN
CS/ECE 566 Parallel Processing, Fall 2011
Anh Huy Bui, Nilesh Malpekar, Vishnu Gajendran
AGENDA
- Brief introduction to supercomputers
- Supercomputer design concerns and analysis
  - Processor architecture and memory
  - Interconnection model
  - Cluster design
  - Software (system & application)
- Conclusions
1. BRIEF INTRODUCTION
Brief history of the supercomputer
- Introduced in the 1960s by Seymour Cray, the "father of the supercomputer," at CDC
- The first supercomputer: the CDC 6600 (1964), a scalar processor running at 40 MHz
1. BRIEF INTRODUCTION
Processor roadmap of supercomputers from then until now
- Early machines: scalar processors
- 1970s: most supercomputers used vector processors
- Mid-1980s: a number of vector processors working in parallel
- Late 1980s and 1990s: massively parallel processing systems with thousands of ordinary CPUs, either off-the-shelf units or custom designs
- Today: supercomputers are highly tuned computer clusters using commodity processors combined with custom interconnects
1. BRIEF INTRODUCTION: roadmap of supercomputer speed from then until now (chart)
SUPERCOMPUTER DEFINITION
- Per Landau and Fink: the class of fastest and most powerful computers available
- Per the Dictionary of Science and Technology: any computer that is one of the largest, fastest, and most powerful available at a given time
LINPACK BENCHMARK
- Introduced by Jack Dongarra
- Reflects the performance of a dedicated system solving a dense system of linear equations
- The algorithm must conform to LU factorization with partial pivoting and perform 2/3 n^3 + O(n^2) double-precision floating-point operations
LINPACK BENCHMARK - DETAILS
Flop/s
- 64-bit floating-point operations per second; operations refer to additions or multiplications
- Gigaflops => 10^9 flop/s
- Teraflops => 10^12 flop/s
- Petaflops => 10^15 flop/s
- Exaflops => 10^18 flop/s
LINPACK BENCHMARK - DETAILS
Rpeak
- Theoretical peak performance
- The number of full-precision floating-point additions and multiplications that can be completed within the cycle time of the machine
- E.g. if a 1.5 GHz computer completes 4 floating-point operations per cycle, then Rpeak is 6 Gigaflops
LINPACK BENCHMARK - DETAILS
Rmax
- Maximum measured performance of a supercomputer, in Gigaflops
Nhalf
- Problem size for which the machine achieves half its peak speed
- A good indicator of machine bandwidth; a small value of Nhalf => good machine balance
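As a worked illustration of these quantities, the short C sketch below plugs in the example figures from the Rpeak slide (1.5 GHz, 4 flops per cycle) and an arbitrary problem size n; the values are illustrative only, not measurements of any real machine.

#include <stdio.h>

int main(void) {
    /* Example values from the Rpeak slide: 1.5 GHz, 4 flops per cycle. */
    double clock_hz = 1.5e9;
    double flops_per_cycle = 4.0;
    double rpeak = clock_hz * flops_per_cycle;       /* flop/s */

    /* LINPACK operation count for an n x n dense solve: ~2/3 n^3 + O(n^2). */
    double n = 10000.0;                              /* illustrative problem size */
    double ops = (2.0 / 3.0) * n * n * n;

    printf("Rpeak              : %.2f Gflop/s\n", rpeak / 1e9);
    printf("LINPACK ops (n=%.0f): %.3e flops\n", n, ops);
    printf("Time at Rpeak      : %.1f s\n", ops / rpeak);
    return 0;
}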
2. DESIGN CONCERNS
- Processor architecture, memory
- Interconnection model
- Cluster design
- Software (system and application)
SUPERCOMPUTER ARCHITECTURE
Processor architecture: Flynn's taxonomy
- SISD
- SIMD
- MISD
- MIMD
Memory
- Shared memory
- Distributed memory
- Virtual shared memory
VECTOR PROCESSING (SIMD)
- Acts on an array of data instead of a single data item
- Pipelines the data to the ALU; scalar processors pipeline only the instruction execution
- Example: A[i] = B[i] + C[i] for i = 1 to 10
SCALAR PROCESSOR EXECUTION
Execute this loop 10 times:
  read the next instruction and decode it
  fetch this number
  fetch that number
  add them
  put the result here
end loop
Demerits:
- The instruction is fetched and decoded ten times
- Memory is accessed ten times
VECTOR PROCESSOR EXECUTION
Read the instruction and decode it; fetch array B[1..10] and array C[1..10], add them, and put the results in A[1..10]
Merits:
- Only two address translations are needed
- Instruction fetch and decode is done only once
Demerits:
- Increases the complexity of the decoder
- Might slow down the decoding of normal instructions
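The A[i] = B[i] + C[i] example above can be written as the minimal C sketch below; with contiguous arrays and no aliasing, a vectorizing compiler (or a vector processor) can execute the loop as bulk loads, one vector add, and a bulk store, whereas a scalar pipeline repeats the fetch/decode/add sequence per element. The array size and the restrict qualifiers are illustrative assumptions, not part of the slides.

#include <stdio.h>

#define N 10

/* The loop from the slides: A[i] = B[i] + C[i] for i = 1..10.
 * With contiguous arrays and no aliasing, a vector machine can perform
 * bulk loads of B and C, one vector add, and a bulk store of A, instead
 * of ten separate scalar fetch/decode/add iterations. */
void vector_add(double *restrict a, const double *restrict b,
                const double *restrict c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }
    vector_add(a, b, c, N);
    printf("a[9] = %.1f\n", a[N - 1]);   /* expect 27.0 */
    return 0;
}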
VECTOR PROCESSOR BASED
- Fujitsu VPP500 series
- Cray-1, Cray-2, Cray X-MP, Cray Y-MP
- NEC SX-4 series
RISC ARCHITECTURE
- Simple instructions
- Simple hardware design
- Pipelining is used to speed up RISC machines
- Lower cost and good performance
PIPELINED VS. NON-PIPELINED (figure)
RISC BASED SUPERCOMPUTERS
- IBM Roadrunner: #1 spot among supercomputers in 2008; uses the Cell processor
- Tianhe-1A: #1 spot among supercomputers in 2010; uses Intel Xeon processors and NVIDIA Tesla GPGPUs
GPGPU
- General-purpose computing on graphics processing units (GPUs)
- Stream processor: a processor that can run a single kernel on many records
- SIMD
- High arithmetic intensity
GPGPU BASED SUPERCOMPUTERS
3 out of the top 5 supercomputers in the world use NVIDIA Tesla GPUs:
- Tianhe-1A
- Nebulae
- Tsubame 2.0
SPECIAL PURPOSE SUPERCOMPUTERS
- High-performance computing devices with a hardware architecture dedicated to a single problem
- Custom FPGA or VLSI chips are used
Examples:
- GRAPE for astrophysics
- D. E. Shaw Research's Anton for simulating molecular dynamics
- MDGRAPE-3 for protein structure computation
- Belle for playing chess
TOP 500 - THE CPU ARCHITECTURE: share of Top500 rankings between 1993 and 2009 (chart)
SHARED AND DISTRIBUTED MEMORY (figure)
SHARED AND DISTRIBUTED MEMORY
Virtual shared memory
- A programming model that allows processors on a distributed-memory machine to be programmed as if they had shared memory
- A software layer takes care of the necessary communications
MEMORY HIERARCHIES
Two types:
- Cache based
- Vector register based
Factors affecting memory latency:
- Temporal locality - for instructions and data
- Spatial locality - for data only
CACHE BASED
- Hierarchy of memory
- The most recently used data is kept in the cache memory
- Cost increases and access time decreases going up the hierarchy
VECTOR REGISTER BASED
- Consists of a small set of vector registers
- Main memory built from SRAM
- Instructions move data from main memory to vector registers in high-bandwidth bulk transfers
CACHE BASED & VECTOR REGISTER BASED
Cache based
- Merits: lower average access time; low cost
- Demerits: lower bandwidth to memory; programs not exhibiting spatial or temporal locality are penalized
Vector register based
- Merits: faster access to main memory
- Demerits: expensive
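A minimal sketch of the locality penalty mentioned above: both functions below sum the same matrix, but the row-major walk matches C's memory layout (good spatial locality) while the column-major walk strides through memory and, on a cache-based machine, typically misses far more often. The matrix size is an arbitrary illustrative choice.

#include <stdio.h>

#define N 1024

static double m[N][N];

/* Row-major traversal: consecutive accesses are adjacent in memory,
 * so each fetched cache line is fully used (good spatial locality). */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal: successive accesses are N * 8 bytes apart,
 * so most accesses touch a new cache line on a cache-based machine. */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] = 1.0;
    printf("row-major sum = %.0f, column-major sum = %.0f\n",
           sum_rows(), sum_cols());
    return 0;
}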
LATEST DEVELOPMENTS
Using flash memory instead of DRAM:
- Cheaper than DRAM
- Retains data when the power is turned off
- Reduces space and power requirements
- Livermore's Hyperion supercomputer uses flash-based memory
2.2 INTERCONNECTION
The supercomputer interconnect joins nodes within the supercomputer:
- Compute nodes
- I/O nodes
- Service nodes
- Network nodes
It needs to support:
- High bandwidth
- Very low communication latency
INTERCONNECT TOPOLOGY
- Static (fixed) or dynamic (switched)
- Routing
- Involves large quantities of network cabling that often must fit within small spaces
- Supercomputers do NOT use wireless networking technology internally!
INTERCONNECT USAGE (chart)
WIDELY USED INTERCONNECTS
Quadrics
- 6 of the 10 fastest supercomputers used Quadrics in 2003
Hardware:
- QsNet I: 350 MB/s @ 5 us MPI latency
- QsNet II: 912 MB/s @ 1.26 us MPI latency
- QsTenG: 10 Gigabit Ethernet switches, from 24 ports
- QsNet III: approx. 2 GB/s in each direction @ 1.3 us MPI latency
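A common first-order model for interconnect cost combines the two figures quoted per fabric: transfer time = latency + message size / bandwidth. The sketch below plugs in the QsNet II numbers from this slide purely to illustrate that small messages are latency-bound and large ones bandwidth-bound; it is not a benchmark of the actual hardware.

#include <stdio.h>

/* First-order cost model: t = latency + bytes / bandwidth. */
static double transfer_time(double bytes, double latency_s, double bw_bytes_per_s) {
    return latency_s + bytes / bw_bytes_per_s;
}

int main(void) {
    double latency = 1.26e-6;      /* QsNet II MPI latency, seconds */
    double bw = 912e6;             /* QsNet II bandwidth, bytes/s   */
    double sizes[] = { 64, 4096, 1 << 20 };

    for (int i = 0; i < 3; i++) {
        double t = transfer_time(sizes[i], latency, bw);
        printf("%8.0f bytes -> %8.2f us\n", sizes[i], t * 1e6);
    }
    return 0;
}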
WIDELY USED INTERCONNECTS
Infiniband
- Switched-fabric communication link
- Point-to-point bidirectional serial links between processor nodes and high-speed peripherals
- Supports several signaling rates
- Links can be bonded together for additional bandwidth
WIDELY USED INTERCONNECTS
Myrinet
- High-speed LAN
- Much lower protocol overhead: better throughput, less interference and latency
- Can bypass the operating system
- Physically two fiber-optic cables, upstream and downstream
TOFU: 6D MESH/TORUS
- From Fujitsu, for large-scale supercomputers that exceed 10 petaflops
- Stands for TOrus FUsion
- Can be divided into rectangular submeshes of arbitrary size, and provides a torus topology for each submesh
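To give a feel for how mesh/torus addressing works in general, the sketch below computes a node's neighbours in a small 3D torus using modular wraparound. This is a simplified, hypothetical illustration of torus addressing, not Fujitsu's Tofu implementation, which combines a 3D torus with three additional local dimensions and multipath routing.

#include <stdio.h>

#define DIM 3          /* simplified: 3 torus dimensions instead of Tofu's 6 */

/* Size of each torus dimension (illustrative values only). */
static const int size[DIM] = { 4, 4, 8 };

/* Print the two neighbours of a node along each dimension, using modular
 * wraparound, which is what turns a mesh into a torus. */
void print_neighbours(const int coord[DIM]) {
    for (int d = 0; d < DIM; d++) {
        int plus[DIM], minus[DIM];
        for (int i = 0; i < DIM; i++) { plus[i] = coord[i]; minus[i] = coord[i]; }
        plus[d]  = (coord[d] + 1) % size[d];
        minus[d] = (coord[d] - 1 + size[d]) % size[d];
        printf("dim %d: (%d,%d,%d) <-> (%d,%d,%d)\n", d,
               minus[0], minus[1], minus[2], plus[0], plus[1], plus[2]);
    }
}

int main(void) {
    int node[DIM] = { 0, 3, 7 };   /* a corner node, so wraparound is visible */
    print_neighbours(node);
    return 0;
}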
TOFU: 6D MESH/TORUS (figure)
TOFU: MULTIPATH ROUTING (figure)
TOFU: 3D TORUS VIEW (figure)
TOFU: OTHER FEATURES
Throughput and packet transfer:
- 10 GB/s of fully bidirectional bandwidth for each link
- 100 GB/s of off-chip bandwidth for each node, to feed enough data to a massive array of 128-Gflops processors
- Variable packet length: 32 B to 2 KB, including header and CRC
TOFU: 6D MESH/TORUS (additional figures)
2.3 CLUSTER DESIGN
Nowadays, most supercomputers are clusters:
- Typical nodes in a cluster
- Tiered architecture of a cluster
- Energy consumption
- Cooling problem
2.3 CLUSTER DESIGN
Typical nodes in a cluster:
- Compute nodes: comprise the heart of the system; this is where user jobs run
- I/O nodes: dedicated to performing all I/O requests from compute nodes; not available to users directly
- Login/front-end nodes: where users log in, compile, and interact with the batch system
- Service nodes: for management functions such as system boot, machine partitioning, system performance measurement, system health monitoring, etc.
2.3 CLUSTER DESIGN: nodes in the BlueGene/P general configuration (figure)
2.3 CLUSTER DESIGN: scaling architecture (H/W scaling) (figure)
2.3 CLUSTER DESIGN: a schematic overview of a Blue Gene/L supercomputer (figure)
2.3 CLUSTER DESIGN: a schematic overview of the tiered composition of the Roadrunner supercomputer cluster (figure)
2.3 CLUSTER DESIGN
Energy consumption:
- A typical supercomputer consumes a lot of energy, most of which turns into heat, which in turn requires cooling
Examples:
- Tianhe-1A: 4.04 MW; at 10 cents per kWh, that is about $400/hr and $3.5M/year
- K computer: 9.89 MW, roughly the power draw of 10,000 suburban homes; about $10M/year
- Energy efficiency is measured in FLOPS/watt
- Green500, June 2011: IBM BlueGene/Q is 1st at 2097.19 MFLOPS/watt
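The cost figures on the slide above follow from simple arithmetic, sketched below; the 10 cents per kWh electricity price is the slide's own assumption, and continuous full-power operation is assumed.

#include <stdio.h>

int main(void) {
    /* Tianhe-1A draws about 4.04 MW; assume 10 cents per kWh as on the slide. */
    double power_mw = 4.04;
    double price_per_kwh = 0.10;

    double cost_per_hour = power_mw * 1000.0 * price_per_kwh;     /* kW * $/kWh */
    double cost_per_year = cost_per_hour * 24.0 * 365.0;

    printf("Cost per hour: $%.0f\n", cost_per_hour);              /* ~ $404  */
    printf("Cost per year: $%.2fM\n", cost_per_year / 1e6);       /* ~ $3.5M */
    return 0;
}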
2.3 CLUSTER DESIGN
Cooling techniques:
- Liquid cooling: Fluorinert "cooling waterfall" (Cray-2); hot-water cooling (IBM Aquasar system, where the water is used to heat the building as well)
- Air cooling: IBM BlueGene/P
- Combination of air conditioning with liquid cooling: System X at Virginia Tech
- Using low-power processors: IBM BlueGene systems
2.3 CLUSTER DESIGN: IBM BlueGene/P cooling system (figure)
2.3 CLUSTER DESIGN: IBM Aquasar cooling system (figure)
2.4 SOFTWARE - SYSTEM SOFTWARE
Operating systems:
- Most supercomputers now use Linux
- Operating systems used by the Top 500 (chart)
2.4 SOFTWARE - APPLICATION SOFTWARE/TOOLS
Programming languages:
- Base languages: Fortran, C
- Variants of C: C for CUDA or OpenCL for GPGPUs
Libraries:
- Loosely connected clusters: PVM, MPI
- Tightly coordinated shared-memory clusters: OpenMP (see the hybrid sketch below)
Key software for different functions:
- Full Linux kernel on I/O nodes
- Proprietary kernel dedicated to compute nodes
- Scalable control system based on an external service node
Tools: open-source solutions such as Beowulf, Warewulf, ...
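As a minimal sketch of the programming-model split listed above, the hybrid example below uses MPI for message passing between nodes and OpenMP for shared-memory parallelism within a node; it assumes an MPI installation and is typically built with a wrapper compiler such as mpicc with OpenMP enabled (e.g. mpicc -fopenmp).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);            /* message passing across nodes */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Shared-memory parallelism within a node. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}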
2.4 SOFTWARE - APPLICATION SOFTWARE/TOOLS: IBM BlueGene software stacks (figure)
3. CONCLUSIONS
- Gave an overview of the concerns when designing a supercomputer: hardware design, interconnection, software design, cluster layout
- Other concerns: power consumption and cooling
- Not all topics could be covered; many designs remain proprietary
REFERENCES
- Supercomputer, Wikipedia: http://en.wikipedia.org/wiki/supercomputer
- Yuichiro Ajima, Shinji Sumimoto and Toshiyuki Shimizu (Fujitsu), "Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers"
- "Evolution of the IBM System Blue Gene Solution," IBM Redpaper REDP-4247-00
- "Using the Dawn BG/P System," https://computing.llnl.gov/tutorials/bgp/
THANK YOU!