From Hypercubes to Dragonflies: a short history of interconnect

From Hypercubes to Dragonflies: a short history of interconnect. William J. Dally, Computer Science Department, Stanford University. IAA Workshop, July 21, 2008.

Outline: the low-radix era; high-radix routers and networks; to ExaScale and beyond; NoCs, the final frontier.

Partial Timeline (Date / Event / Features):
1983  Caltech Cosmic Cube       hypercube, programmed transfer
1985  Torus Routing Chip        torus, routed, wormhole, virtual channels
1987  iPSC/2                    routed hypercube
1990  Virtual-channel flow control
1991  J-Machine
1992  Paragon, T3D, CM5
1994  Vulcan
1995  T3E
1996  Reliable Router           link-level retry
2000  NoCs
2001  SP2, Quadrics
2002  X1
2004  Global adaptive routing
2005  High-Radix Routers
2006  YARC/BW
(The earlier entries on the timeline are labeled the Low-Radix Era.)

The Low-Radix Era

The Cosmic Cube. Caltech, 1983. Hypercube topology. No routers; programmed transfers for every hop. Store and forward: T = (L/B) x H.

Torus Routing Chip. Caltech, 1985, 3 µm CMOS. Torus topology; topology driven by technology constraints (pins, bisection). Wormhole routing: T = L/B + H. Virtual channels to break deadlock.
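
A rough numeric comparison of the two latency models above; the parameter values (message length L, channel bandwidth B, hop count H, per-hop delay t_r) are assumptions for illustration, not figures from the talk, and the wormhole formula is written with an explicit per-hop delay to keep the units consistent.

```python
# Store-and-forward vs. wormhole latency, per the slide's formulas.
# All values are illustrative assumptions, not numbers from the talk.

L = 1024 * 8      # message length in bits (a 1 KB message)
B = 1e9           # channel bandwidth in bits/s (1 Gb/s)
H = 7             # number of hops (e.g., the diameter of a 128-node hypercube)
t_r = 50e-9       # assumed per-hop routing delay in seconds

# Store and forward (Cosmic Cube): the whole message is received at each
# hop before being forwarded, so the serialization time L/B is paid H times.
T_sf = (L / B) * H

# Wormhole (Torus Routing Chip): the header pipelines through the network,
# so serialization is paid once, plus a small per-hop delay.
T_wh = L / B + H * t_r

print(f"store-and-forward: {T_sf * 1e6:.1f} us")   # ~57.3 us
print(f"wormhole:          {T_wh * 1e6:.1f} us")   # ~8.5 us
```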

1985-2004, The Low-Radix Era: low-radix (4 ≤ k ≤ 8) routers; torus, mesh, or Clos (fat-tree) topologies; wormhole or virtual cut-through flow control; virtual channels for deadlock avoidance and performance; almost exclusively minimal routing. Machines included Delta, Paragon, T3D, SP-1, CM-5, T3E, and SP-2. But router bandwidth was increasing exponentially.

Some Routers: MARS Router (1984), Torus Routing Chip (1985), Network Design Frame (1988), MDP (1991), Reliable Router (1994), MAP (1998), Imagine (2002). [Chip photographs]

High-Radix Routers and Networks

Bandwidth: Velio 3003. 1296-ball BGA; 280 pairs at 3.2 Gb/s; 440 Gb/s in + 440 Gb/s out.

Router Bandwidth Scaling: ~100x in 10 years. [Plot: bandwidth per router node (Gb/s, log scale from 0.1 to 10,000) vs. year (1985-2010); data points include the Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, and YARC/BlackWidow.]

High-Radix Router: a low-radix router has a small number of fat ports; a high-radix router has a large number of skinny ports. [Diagrams]

Latency: Latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B, where k = radix, B = total bandwidth, N = number of nodes, L = message size.

Latency vs. Radix. [Plot: latency (ns, 0-300) vs. radix (0-250) for 2003 and 2010 technology.] As radix grows, header latency decreases while serialization latency increases; the optimal radix is ~40 with 2003 technology and ~128 with 2010 technology.

Determining Optimal Radix: Latency = header latency + serialization latency = H·t_r + L/b = 2·t_r·log_k(N) + 2kL/B, where k = radix, B = total bandwidth, N = number of nodes, L = message size. The optimal radix k satisfies k·(log k)² = (B·t_r·log N)/L, the quantity termed the aspect ratio.
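
A small sketch of this trade-off; the parameter values for t_r, N, L, and B are assumptions chosen to roughly reproduce the "optimal radix ~40" figure quoted for 2003 technology, not the talk's exact inputs.

```python
# Latency vs. radix, following the slide's model:
#   T(k) = 2*t_r*log_k(N) + 2*k*L/B
# (H = 2*log_k(N) hops, and the per-port bandwidth is b = B/(2k),
#  so the serialization term L/b equals 2*k*L/B.)
import math

t_r = 20e-9       # assumed per-router delay (s)
N = 1024          # number of nodes
L = 128           # message length (bits)
B = 500e9         # total router bandwidth (bits/s)

def latency(k):
    return 2 * t_r * math.log(N, k) + 2 * k * L / B

best_k = min(range(2, 257), key=latency)
print(f"optimal radix ~ {best_k}, latency ~ {latency(best_k) * 1e9:.0f} ns")   # ~40, ~96 ns

# Equivalently, the optimum satisfies k*(ln k)**2 = B*t_r*ln(N)/L,
# the quantity the slide calls the aspect ratio.
print(f"aspect ratio = {B * t_r * math.log(N) / L:.0f}")
```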

High-Radix Router. Many router structures scale as P² (or P²V²); allocators are particularly difficult; not feasible with P = 64 and V ~ 8. Decompose the router: each sub-allocation is feasible, and the buffers are put where they do the most good. YARC (Cray, 2006): 64 ports, each 18.75 Gb/s (3 x 6.25 Gb/s); tiled hierarchical design; 8x8 array of 8x8 subswitches; buffering at subswitch inputs.

High-Radix Switch Architectures (II). [Diagrams, each with inputs 1..k and outputs 1..k: (a) baseline design; (b) fully buffered crossbar.]

High-Radix Switch Architectures (III). [Diagrams, each with inputs 1..k and outputs 1..k: (a) baseline design; (b) fully buffered crossbar; (c) hierarchical crossbar built from subswitches.]

Global Adaptive Routing Enables New Topologies. VAL gives optimal worst-case throughput; MIN gives optimal performance on benign traffic. UGAL (Universal Globally-Adaptive Load-balanced routing) [Singh 05] routes benign traffic minimally, starts routing like VAL if there is load imbalance in the channel queues, and in the worst case degenerates into VAL, thus giving optimal worst-case throughput.

UGAL:
1. H_m = minimal (shortest) path length from source s to destination d.
2. q_m = congestion of the outgoing channel on the minimal path.
3. Pick i, a random intermediate node.
4. H_nm = non-minimal path (s -> i -> d) length.
5. q_nm = congestion of the outgoing channel for the s -> i -> d path.
6. Route on the minimal path if H_m·q_m ≤ H_nm·q_nm; otherwise route via i, minimally in each phase.
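
A minimal sketch of that decision rule; channel queue depths stand in for congestion, and the names and numbers below are illustrative only.

```python
# UGAL routing choice: compare congestion-weighted path lengths and pick
# either the minimal path or Valiant-style routing via the random
# intermediate node i chosen in step 3.

def ugal_choice(H_min, q_min, H_nonmin, q_nonmin):
    """Return 'MIN' for minimal routing or 'VAL' for routing via i,
    per step 6 above."""
    return "MIN" if H_min * q_min <= H_nonmin * q_nonmin else "VAL"

# Minimal path short but congested vs. non-minimal path longer but idle:
print(ugal_choice(H_min=1, q_min=30, H_nonmin=2, q_nonmin=4))   # -> VAL
print(ugal_choice(H_min=1, q_min=5,  H_nonmin=2, q_nonmin=4))   # -> MIN
```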

CQR: TOR (tornado) traffic throughput. [Plot] CQR switches to non-minimal routing at an offered load of about 0.12.

CQR: TOR (tornado) traffic latency. [Plot]

UGAL report card. Throughput (fraction of capacity) on three 64-node topologies:

Topology       Algorithm   Θ_benign   Θ_adversarial   Θ_average
K64            VAL         0.5        0.5             0.5
K64            MIN         1.0        0.02            0.02
K64            UGAL        1.0        0.5             0.5
8x8 torus      VAL         0.5        0.5             0.5
8x8 torus      MIN         1.0        0.33            0.63
8x8 torus      UGAL        1.0        0.5             0.7
64-node CCC    VAL         0.5        0.5             0.5
64-node CCC    MIN         1.0        0.2             0.52
64-node CCC    UGAL        1.0        0.5             0.63

Transient Imbalance. [Plot: maximum buffer size (0-16) across the routers in the middle stage of the network.]

With Adaptive Routing. [Plot: maximum buffer size (0-16) across the routers in the middle stage of the network.]

High-Radix Topology. Use high radix k to get low hop count, H = log_k(N); hop count ~ cost. Also provide good performance on both benign and adversarial traffic patterns. This rules out butterfly networks: H = log_k(N) is optimal, but there is no path diversity, giving dismal throughput on worst-case traffic. Clos networks work OK: H = 2·log_k(N), with short-circuiting, but twice the hop count is needed on benign traffic. Cayley graphs have nice properties but are hard to route.

Clos Networks: Delivering Predictable Performance Since 1953.

[Image: IEEE Spectrum Online, July 16, 2008]

Flattened Butterfly (ISCA 07)

Dragonfly Topology. [Diagram: groups G0 ... G(g-1) are connected by an inter-group interconnection network; within each group, routers R0 ... R(a-1) are connected by an intra-group interconnection network.]

Dragonfly Topology Example. [Diagram: nine groups G0-G8; each group contains four routers (e.g., R0-R3 in G0, R12-R15 in G3), and each router attaches two terminals (P0, P1, ...).]
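
For reference, a sketch of the standard dragonfly sizing relations (a routers per group, p terminals per router, h global channels per router); the parameter values are chosen to give the nine-group example above, on the assumption that it uses p = 2, a = 4, h = 2.

```python
# Standard dragonfly sizing relations: each group has a routers, each router
# attaches p terminals and drives h global (inter-group) channels. With one
# global channel between every pair of groups, there are g = a*h + 1 groups.

def dragonfly_size(p, a, h):
    g = a * h + 1               # maximum number of groups
    N = a * p * g               # terminals in a maximum-size system
    radix = p + (a - 1) + h     # router radix: terminal + local + global ports
    return g, N, radix

g, N, radix = dragonfly_size(p=2, a=4, h=2)
print(f"groups={g}, terminals={N}, router radix={radix}")   # groups=9, terminals=72, radix=7

# A "balanced" dragonfly sets a = 2p = 2h, which these example values satisfy.
```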

Topology Cost Comparison

A Summary of Where We Are. High-radix routers and networks are best able to convert high pin bandwidth into value. Router organization: hierarchical switch and allocator, subswitch buffers. Routing: global adaptive routing enables aggressive topologies. Topology: the flattened butterfly gives minimal diameter (hence cost); the dragonfly minimizes the number of long (expensive) links.

To ExaScale and Beyond

1 EFlop/s Strawman. Sizing done by balancing power budgets against achievable capabilities. 4 FPUs + register files per core (= 6 GFlop/s at 1.5 GHz). 1 chip = 742 cores (= 4.5 TF/s), with 213 MB of L1 I&D and 93 MB of L2. 1 node = 1 processor chip + 16 DRAMs (16 GB). 1 group = 12 nodes + 12 routers (= 54 TF/s). 1 rack = 32 groups (= 1.7 PF/s), 384 nodes per rack. 3.6 EB of disk storage included. 1 system = 583 racks (= 1 EF/s), 68 MW with aggressive assumptions: 166 million cores, 680 million FPUs, and 3.6 PB of memory = 0.0036 bytes/flop.
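
The headline totals follow from the per-level figures; a quick check of the arithmetic (rounded as in the talk; this only re-derives the slide's own numbers):

```python
# Re-deriving the strawman's totals from the per-level figures on the slide.

gflops_per_core = 4 * 1.5                  # 4 FPUs at 1.5 GHz -> 6 GFlop/s per core
chip_tflops = gflops_per_core * 742 / 1e3  # 742 cores -> ~4.5 TF/s per chip
group_tflops = 12 * chip_tflops            # 12 nodes  -> ~54 TF/s per group
rack_pflops = 32 * group_tflops / 1e3      # 32 groups -> ~1.7 PF/s per rack
system_eflops = 583 * rack_pflops / 1e3    # 583 racks -> ~1 EF/s

nodes = 12 * 32 * 583                      # 384 nodes/rack * 583 racks = 223,872 nodes
cores = 742 * nodes                        # ~166 million cores
dram_pb = 16 * nodes / 1e6                 # 16 GB/node -> ~3.6 PB of DRAM
print(f"{chip_tflops:.2f} TF/chip, {rack_pflops:.2f} PF/rack, {system_eflops:.2f} EF/s")
print(f"{cores / 1e6:.0f}M cores, {dram_pb:.1f} PB DRAM "
      f"-> {dram_pb * 1e15 / (system_eflops * 1e18):.4f} bytes/flop")
```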

Some Exa Observations. It's all about energy (J/bit); more on this later. There are two levels of network: on-chip (NoC) and off-chip. The system network is a dragonfly: 12 CMPs, 31 local, and 21 global links per router. 12 parallel networks enable tuning and tapering.

Exa Energy Budget, Fixed. Allocating 60% of the power to FPUs doesn't leave much for communication, and leads to a steep bandwidth taper.

Exa Energy Budget, Adaptive. 1060 cores per chip (vs. 742), 4 FPUs/core. Can sustain 1 access per cycle at L1 or L2, but nothing else; can use all of the power at L3, DRAM, or globally. Adapt the actual power budget to the demand of the application; throttle to stay in bounds.

It's all about Joules/bit. At 10 pJ/FLOP in 32 nm, it costs 128 FLOPs of energy to read a word from DRAM, 32 FLOPs of energy to send a word within a cabinet, and 256 FLOPs to cross the machine.
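
In absolute terms, a straightforward conversion of those FLOP-equivalents at the stated 10 pJ/FLOP:

```python
# Converting the slide's "FLOPs of energy" figures into picojoules at the
# stated 10 pJ/FLOP budget for 32 nm.

PJ_PER_FLOP = 10
costs_in_flops = {
    "read a word from DRAM":         128,   # -> 1280 pJ
    "send a word within a cabinet":   32,   # ->  320 pJ
    "send a word across the machine": 256,  # -> 2560 pJ
}
for op, flops in costs_in_flops.items():
    print(f"{op}: {flops} FLOPs of energy ~= {flops * PJ_PER_FLOP} pJ")
```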

NoCs

Exa NoC. Much of the communication is on chip: the first 4 levels of the storage hierarchy, and the bulk of all bandwidth. What does this network look like?

On-Chip Interconnect. Enabling circuit technology (links, switches, buffers): 10x-100x improvements in bits/Joule drive topologies, routing, and flow control. Network design: efficient topology/routing; flow control (trade bandwidth for buffers); low-latency microarchitecture. [Diagram: 4x4 mesh of routers R0-R15.]

Evaluation Comparison. [Bar charts: normalized latency and energy-delay product for a mesh vs. a flattened butterfly, annotated ~35% (latency) and ~65% (energy-delay product).] The flattened butterfly can be extended to on-chip networks to achieve lower latency and lower energy.

Summary. Low-radix era: developed key network technologies (wormhole flow control, virtual channels, etc.); matched network design to packaging and signaling constraints; 4 < k < 8; torus, mesh, and Clos networks. High-radix era: router bandwidth increasing 100x per decade; increasing network radix is better able to exploit the increased router bandwidth, but requires a partitioned router organization with subswitch buffers; global adaptive routing enables efficient topologies (flattened butterflies and dragonflies); data centers are now driving these technologies. NoCs: the bulk of bandwidth is on-chip; the flattened butterfly is an ideal topology here too (fewer hops); many open questions remain in NoC design.

Challenges. Power at all levels: circuits/devices; topology/routing/flow control; on and off chip. Network architecture: indirect adaptive routing; dealing with cable aggregation (cables and WDM). On-chip networks, at all levels: circuits; topology/routing/flow control; on/off-chip interfaces. Network interfaces: can have zero-overhead send and receive (J-Machine, M-Machine). Programming abstractions: abstract the communication hierarchy (Sequoia); vertical, not horizontal, is what matters; focus on essential issues, not incidentals (like MPI occupancy). Device/circuit technology: focus on pJ/bit.

Some very good books