On-Chip Interconnection Networks Low-Power Interconnect




William J. Dally
Computer Systems Laboratory, Stanford University
ISLPED, August 27, 2007

Outline
  Demand for On-Chip Networks
  What is Unique about On-Chip Networks
  The Power Problem
  Enabling Technologies
  Circuits - set the constraints
  Topology
  Micro-Architecture
  Summary

Urgent Demand for OCINs

The Future is CMPs
  OCINs are a Critical Component
  [Figure: timeline, 2006 to 2015]

Example CMP OCIN

Growing Complexity of SoCs Demands an On-Chip Interconnection Network
  Avner Goren, TI, EPF 2004

So, what's different about on-chip networks?

State of Off-Chip Networks

Technology Trends
  [Figure: bandwidth per router node (Gb/s, log scale, 0.1 to 10000) versus year (1985 to 2010). Labeled routers: Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC, BlackWidow.]

Some History
  MARS Router, 1984
  Torus Routing Chip, 1985
  MDP, 1991
  Network Design Frame, 1988
  Robert Mullins
  Reliable Router, 1994
  MAP, 1998
  Imagine, 2002
  YARC, 2006

Some very good books

Summary of Off-Chip Networks*
  Topology
    Fit to packaging and signaling technology
    High-radix - Clos or FlatBfly gives lowest cost
  Routing
    Global adaptive routing balances load w/o destroying locality
  Flow control
    Virtual channels/virtual cut-through
  *oversimplified

So, what's different about on-chip networks?

Cost, Channels, Workload are Different
  Cost
    Off-chip: cost is channels - pins, connectors, cables, optics
    On-chip: cost is Si area and power (storage and switches), wires plentiful
    Drives networks with many long, wide channels, few buffers
  Channel Characteristics
    On-chip RC lines - need a repeater every 1 mm (or less)
    Short distance - low latency
    Can put logic in repeaters, motivates low-latency routers
  Workload
    CMP cache traffic
    SoC isochronous flows
  Design issues
    Floorplanning
  Different constraints motivate some surprising differences in design.

The Power Problem

Power is Dominated by Interconnect
  In 65nm
    A 32-bit add takes 0.8 pJ
    Moving a 32-bit word 1 mm takes 13 pJ
  A 64-core CMP at 1 GHz
    Estimated BW demand of 6.4 GW/s (~200 Gb/s)
    Average distance for each transfer of 10 mm (each way)
    Power due to network ~50 W
    Power due to 64 adds at 1 GHz ~50 mW
    Power due to 64 RISC processors at 1 GHz ~13 W
  Need to reduce network power by at least 10x
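The slide gives the inputs but not the arithmetic. A minimal sketch of how such estimates can be computed follows. The function names are illustrative, and the two readings of the 6.4 GW/s figure (chip-wide total versus per-core demand) are assumptions, not something the talk states. Only the one-way 10 mm distance is counted.

```python
# Worked arithmetic for the 65 nm energy figures quoted on the slide.
ADD_PJ = 0.8            # energy of one 32-bit add (from the slide)
MOVE_PJ_PER_MM = 13.0   # energy to move one 32-bit word 1 mm (from the slide)


def add_power_w(cores: int, clock_hz: float) -> float:
    """Power of `cores` adders, each doing one 32-bit add per cycle."""
    return cores * clock_hz * ADD_PJ * 1e-12


def network_power_w(words_per_s: float, distance_mm: float) -> float:
    """Power spent moving `words_per_s` 32-bit words over `distance_mm` of wire."""
    return words_per_s * distance_mm * MOVE_PJ_PER_MM * 1e-12


adds = add_power_w(64, 1e9)
# Assumption A: 6.4 GW/s is the total demand of the whole chip.
aggregate = network_power_w(6.4e9, 10.0)
# Assumption B: 6.4 GW/s is the demand of each of the 64 cores.
per_core = network_power_w(64 * 6.4e9, 10.0)

print(f"64 adds at 1 GHz:          {adds * 1e3:.1f} mW")
print(f"network, total reading:    {aggregate:.2f} W")
print(f"network, per-core reading: {per_core:.1f} W")
```

Under these assumptions the add power comes out near the slide's ~50 mW, and the per-core reading is the one that lands near the slide's ~50 W network figure.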

OCINs are an Enabling Technology
  Modular: standard interface enables design reuse
  Optimized: encapsulates circuits and wiring, enables optimization
  Efficient: just the wires needed to reach the destination are toggled

OCINs vs Buses
  Most CMPs and SoCs today use one or more buses for interconnect
  A bus is just one type of OCIN
    Low bandwidth (one transaction at a time)
    High concentration (all processors share one link)
    Inefficient (all wire segments toggled even for short communication)
  For both buses and other OCINs
    Control path is dominated by arbiters/allocators
    Data path is dominated by data movement and buffering
  Can do much better by searching more of the network design space
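To illustrate the "all wire segments toggled" point, here is a rough sketch comparing wire energy per transfer on a shared bus and on a segmented network. The row of 8 tiles at 1 mm pitch is a hypothetical layout, the 13 pJ per word-mm figure is taken from the power slide, and router, repeater and arbitration energy are deliberately ignored.

```python
from itertools import permutations

MOVE_PJ_PER_MM = 13.0  # 32-bit word moved 1 mm at 65 nm (from the power slide)


def bus_energy_pj(n_tiles: int, pitch_mm: float) -> float:
    """A shared bus toggles its full length on every transfer."""
    return (n_tiles - 1) * pitch_mm * MOVE_PJ_PER_MM


def segmented_energy_pj(src: int, dst: int, pitch_mm: float) -> float:
    """A segmented network toggles only the wire between src and dst."""
    return abs(src - dst) * pitch_mm * MOVE_PJ_PER_MM


def mean_segmented_energy_pj(n_tiles: int, pitch_mm: float) -> float:
    """Average over all ordered source/destination pairs (uniform traffic)."""
    pairs = list(permutations(range(n_tiles), 2))
    total = sum(segmented_energy_pj(s, d, pitch_mm) for s, d in pairs)
    return total / len(pairs)


# Hypothetical layout: 8 tiles in a row, 1 mm apart.
bus = bus_energy_pj(8, 1.0)
net = mean_segmented_energy_pj(8, 1.0)
print(f"bus: {bus:.0f} pJ per word, segmented: {net:.0f} pJ per word")
```

The gap widens with locality: a neighbor-to-neighbor transfer costs one segment on the network but still the full bus length on the bus.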

Circuits set Cost & Area Constraints for Architecture
  Can do substantially (10x-100x) better than default circuits

Enabling Technology is a Prerequisite
  Channels, Buffers, Switches
  Topology
  Routing
  Flow Control
  Microarchitecture

Channels
  10x to 100x power reduction
  Eq signaling for faster propagation and increased repeater distance (D & P Chapter 8, Heaton '01)
  Elastic channels provide free buffers (Mizuno '01)
  Send 4-8 bits per cycle per wire (assuming 20 FO4 cycle)

Buffers
  Dense arrays (vs. flip-flops or latches)
    1/10 area/bit
    1/10 power for low-swing read
  Low-swing write: 1/10 power for writes
  Low-swing read: can keep swing low through muxes

Switches
  Low-swing bit lines
  Operate at channel rate
    Reduces area and hence power
  Equalized drive
  Buffered crosspoints
  Integral allocation

Circuits Impact Architecture
  With standard-cell approach
    Power is approximately evenly split between channels, buffers, and routers
  With efficient circuits
    Channels 1/30, buffers 1/3
    Routers dominate
  Routing >> Buffering >> Propagating
  Motivates topologies with fewer hops, longer channels
  Just propagate bits - avoid buffering, really avoid routing
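A small worked example of why routers end up dominating. It assumes the "approximately even" split is exactly one third each and that router power is unchanged by the circuit improvements. Both are simplifications, and the slide gives no scaling factor for routers.

```python
from fractions import Fraction


def power_shares(baseline: dict, scale: dict) -> dict:
    """Scale each component's power, then renormalize to fractions of the new total."""
    scaled = {name: baseline[name] * scale[name] for name in baseline}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Assumed baseline: exactly even three-way split (standard-cell approach).
baseline = {
    "channels": Fraction(1, 3),
    "buffers": Fraction(1, 3),
    "routers": Fraction(1, 3),
}
# Reduction factors from the slide; routers assumed unchanged.
scale = {
    "channels": Fraction(1, 30),
    "buffers": Fraction(1, 3),
    "routers": Fraction(1),
}

shares = power_shares(baseline, scale)
for name, share in shares.items():
    print(f"{name}: {float(share):.1%}")
```

With these inputs, routers account for roughly three quarters of the remaining power, buffers for about a quarter, and channels for a few percent, which matches the ordering Routing >> Buffering >> Propagating.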

Properties of these elements drive optimal network organization

On-Chip Interconnection Network
  System = Processor Tiles
  Source: Balfour and Dally, ICS '06

On-Chip Interconnection Network (2)
  System = Processor Tiles + Channels
  Source: Balfour and Dally, ICS '06

On-Chip Interconnection Network (3)
  System = Processor Tiles + Channels + Routers
  Source: Balfour and Dally, ICS '06

Router Architecture
  Input-queued
  Virtual Channel
  Speculative Pipeline
  Source: Balfour and Dally, ICS '06

Router Area
  Accurate modeling requires floorplan
  Source: Balfour and Dally, ICS '06

Torus
  Source: Balfour and Dally, ICS '06

Concentrated Mesh
  Source: Balfour and Dally, ICS '06

Express Links
  Source: Balfour and Dally, ICS '06

Network Replication
  Abundant wire resources: build second network
  Resource allocation tradeoff
  Wide:
    [+] Serialization Latency
    [+] Router Energy Efficiency
    [-] Router Area
  Replicated:
    [+] Decoupled Resources
    [+] Area Efficiency
    [?] Energy Efficiency
    [-] Serialization Latency
    [+] SCALABLE
  Source: Balfour and Dally, ICS '06
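The serialization-latency side of this tradeoff can be sketched with simple arithmetic. The 512-bit packet and 128-bit channel widths below are hypothetical values chosen for illustration, not figures from the talk.

```python
def serialization_cycles(packet_bits: int, channel_bits: int) -> int:
    """Cycles needed to push one packet through a channel (ceiling division)."""
    return -(-packet_bits // channel_bits)


PACKET_BITS = 512      # hypothetical packet size
WIDE_BITS = 128        # hypothetical width of the single wide network

# One wide network: each packet uses the full width.
wide = serialization_cycles(PACKET_BITS, WIDE_BITS)

# Two replicated networks sharing the same wires: each is half as wide,
# so one packet takes twice as long, but two packets can be in flight at once.
replicated = serialization_cycles(PACKET_BITS, WIDE_BITS // 2)

print(f"wide: {wide} cycles per packet, replicated: {replicated} cycles per packet")
```

Total wire bandwidth is the same in both cases. Replication gives up per-packet latency in exchange for smaller, decoupled routers.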

Energy Efficiency
  [Figure: Network Energy, Completion Time (normalized to Torus network), axis from 0.0 to 1.5. Networks: Concentrated Mesh (Replicated), Torus, Fat-Tree, Fat-Tree with Taper, Mesh (Replicated).]
  Source: Balfour and Dally, ICS '06

Large differences in efficiency. Optimal topology not obvious, not regular and very sensitive to properties of network elements

Where is Energy Expended?
  [Figure: Power [W], axis from 0 to 20, broken down into Buffers, Switch, Channels. Networks: Concentrated Mesh (Replicated), Torus.]
  Source: Balfour and Dally, ICS '06

Flow Control
  Trade channel bandwidth (cheap) for buffer space (expensive)
    Make buffers shallow
    Compensate for lower duty factor by overprovisioning channels
    Little cost in energy
  Circuit switching (no buffers)
  Elastic buffers - use free buffers in the channels
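A toy cycle-level sketch of the elastic-buffer idea, in which the repeater stages of a channel hold flits when the destination stalls. This is an idealized ready/valid model in which backpressure propagates through the whole channel in one cycle. The `step` function and its signature are illustrative and do not model any specific circuit from the cited work.

```python
from typing import List, Optional, Tuple

Flit = Optional[str]


def step(stages: List[Flit], incoming: Flit,
         sink_ready: bool) -> Tuple[List[Flit], Flit, bool]:
    """Advance an elastic channel by one cycle.

    stages[0] is nearest the source, stages[-1] nearest the sink.
    Returns (new_stages, delivered_flit, incoming_accepted).
    """
    new = list(stages)
    delivered: Flit = None
    # The sink takes the last flit only when it is ready.
    if new[-1] is not None and sink_ready:
        delivered, new[-1] = new[-1], None
    # Walk from the sink end so a freed stage can be refilled this cycle.
    for i in range(len(new) - 2, -1, -1):
        if new[i] is not None and new[i + 1] is None:
            new[i + 1], new[i] = new[i], None
    # The source may inject only if the first stage is free.
    accepted = incoming is not None and new[0] is None
    if accepted:
        new[0] = incoming
    return new, delivered, accepted


# A 3-stage channel absorbs three flits while the sink is stalled.
channel: List[Flit] = [None, None, None]
for flit in ["A", "B", "C", "D"]:
    channel, out, ok = step(channel, flit, sink_ready=False)
    print(flit, "accepted" if ok else "stalled", channel)
```

In this model the fourth flit is stalled at the source: the channel itself has provided three flits of buffering with no separate buffer at the router.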

Flow Control in an On-Chip FlatBfly
  [Diagram: nodes S, X, D]

View as Two Buffered Links
  [Diagram: nodes S, X, D]

Channels Have Repeaters
  [Diagram: nodes S, X, D]

Buffers Decouple Channel Allocation in Time
  [Diagram: nodes S, X, D]

Circuit Switching
  [Diagram: nodes S, X, D]

With Elastic Buffers
  [Diagram: nodes S, X, D]

Summary

Comments on Shekar's Talk
  There is more to life than meshes and buses
    Optimal networks, like flattened butterflies, are better than either
  Buses weren't even good on boards
    Slow, no parallelism, excess power, no locality
    Even Shekar's numbers say buses are worse: 25-100W vs 20-80W
  Differential signaling can be applied to networks too
  Routers don't take 3-5 clocks - many examples of one-clock routers
    See Mullins, ISCA 2004
  Arbiters for circuit-switched network no different than routers for packet-switched
    But drops packets on collisions - packet switching more efficient
  Yes, embrace parallelism - with well-designed networks
  We're making progress - in December Shekar was saying buses, now he's advanced to circuit switching

Summary
  OCINs critically important
    Vital component of CMPs, SoCs
    Less mature technology than other components
    Naïve implementations consume significant power
    Power is dominated by interconnect - operations are cheap
  Very different than off-chip networks
    Cost, Channels, Workloads, Design Issues
  Efficient network elements are enabling technology
    Energy and area efficient channels, buffers, & switches
    Change the equation for network design: channels << buffers << routers
  Topology
    Minimize diameter
    Concentrated Mesh with Express Channels
    Flattened Butterfly
    Bus is one extreme, but almost never the right answer
  Flow Control
    Minimize buffers at switch points
    Use elastic buffers to minimize latency
  Efficient OCIN designs yield significant gains

Backup Slides

On-Chip Flattened Butterfly
  [Figure: Conventional 2D Mesh; 2D Flattened Butterfly]
  Source: Kim and Dally, to appear
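To make the mesh versus flattened-butterfly comparison concrete, here is a minimal hop-count sketch. It assumes a k x k array of routers, minimal routing, uniform random traffic (self-traffic included), and no concentration. In the 2D flattened butterfly each router is taken to link directly to every other router in its row and column. The function names are illustrative.

```python
from itertools import product
from typing import Callable, Tuple

Node = Tuple[int, int]


def mesh_hops(a: Node, b: Node) -> int:
    """2D mesh: one hop per unit of Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def fbfly_hops(a: Node, b: Node) -> int:
    """2D flattened butterfly: at most one hop per dimension that differs."""
    return int(a[0] != b[0]) + int(a[1] != b[1])


def average_hops(k: int, hops: Callable[[Node, Node], int]) -> float:
    """Mean hop count over all ordered node pairs of a k x k network."""
    nodes = list(product(range(k), repeat=2))
    total = sum(hops(a, b) for a in nodes for b in nodes)
    return total / len(nodes) ** 2


K = 4  # hypothetical 4 x 4 router array
print(f"mesh:  {average_hops(K, mesh_hops):.2f} hops on average")
print(f"fbfly: {average_hops(K, fbfly_hops):.2f} hops on average")
```

The flattened butterfly reduces router traversals by using longer channels, which is the trade the earlier "Routing >> Buffering >> Propagating" slide argues for.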

On-Chip Flattened Butterfly
  [Figure: Layout Mapping, dimension 1 and dimension 2]
  Source: Kim and Dally, to appear

Bypass Channels
  [Figure: Conventional Flattened Butterfly; Flattened Butterfly with Bypass Channels; label: connected to local router]
  Source: Kim and Dally, to appear

Latency Comparison
  [Figure: latency (normalized to mesh network, axis from 0 to 1) for traffic patterns bitrev, tornado, UR, transpose, randperm, bitcomp. Networks: MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, FBFLY-BYP.]
  Source: Kim and Dally, to appear

Power Comparison
  [Figure: power (W, axis from 0 to 12) broken down into Memory, Crossbar, Channel. Networks: MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, FBFLY-BYP.]
  Source: Kim and Dally, to appear