On-Chip Interconnection Networks Low-Power Interconnect William J. Dally Computer Systems Laboratory Stanford University ISLPED August 27, 2007 ISLPED: 1 Aug 27, 2007
Outline Demand for On-Chip Networks What is Unique about On-Chip Networks The Power Problem Enabling Technologies Circuits - set the constraints Topology Micro-Architecture Summary ISLPED: 2 Aug 27, 2007
Urgent Demand for OCINs ISLPED: 3 Aug 27, 2007
The Future is CMPs OCINs are a Critical Component 2006 2007.5 2009 2010.5 2012 2015 2013.5 ISLPED: 4 Aug 27, 2007
Example CMP OCIN ISLPED: 5 Aug 27, 2007
Growing Complexity of SoCs Demands an On-Chip Interconnection Network Avner Goren TI EPF 2004 ISLPED: 6 Aug 27, 2007
So, what s different about on-chip networks? ISLPED: 7 Aug 27, 2007
State of Off-Chip Networks ISLPED: 8 Aug 27, 2007
Technology Trends bandwidth per router node (Gb/s) 10000 BlackWidow 1000 100 10 1 0.1 1985 1990 1995 2000 2005 2010 Torus Routing Chip Intel ipsc/2 J-Machine CM-5 Intel Paragon XP Cray T3D MIT Alewife IBM Vulcan Cray T3E SGI Origin 2000 AlphaServer GS320 IBM SP Switch2 Quadrics QsNet Cray X1 Velio 3003 IBM HPS SGI Altix 3000 Cray XT3 YARC year ISLPED: 9 Aug 27, 2007
Some History MARS Router 1984 Torus Routing Chip 1985 MDP 1991 Network Design Frame 1988 Robert Mullins Reliable Router 1994 ISLPED: 10 MAP 1998 Imagine 2002 YARC 2006 Aug 27, 2007
Some very good books ISLPED: 11 Aug 27, 2007
Topology Summary of Off-Chip Networks* Fit to packaging and signaling technology High-radix - Clos or FlatBfly gives lowest cost Routing Global adaptive routing balances load w/o destroying locality Flow control Virtual channels/virtual cut-through *oversimplified ISLPED: 12 Aug 27, 2007
So, what s different about on-chip networks? ISLPED: 13 Aug 27, 2007
Cost, Channels, Workload are Different Cost Off-chip: cost is channels - pins, connectors, cables, optics On-chip: cost is Si area and Power (storage and switches), wires plentiful Drives networks with many long, wide channels, few buffers Channel Characteristics On-chip RC lines - need a repeater every 1mm (or less) Short distance - low latency Can put logic in repeaters, motivates low-latency routers Workload CMP cache traffic SoC isochronous flows Design issues Floorplanning Different constraints motivate some surprising differences in design. ISLPED: 14 Aug 27, 2007
The Power Problem ISLPED: 15 Aug 27, 2007
Power is Dominated by Interconnect In 65nm A 32 bit add takes 0.8pJ Moving a 32 bit word 1mm takes 13pJ A 64-core CMP at 1GHz Estimated BW demand of 6.4GW/s (~200Gb/s) Average distance for each transfer of 10mm (each way) Power due to network ~50W Power due to 64 adds at 1GHz ~50mW Power due to 64 RISC processors at 1GHz ~13W Need to reduce network power by at least 10x ISLPED: 16 Aug 27, 2007
OCINs are an Enabling Technology Modular Standard interface enables design reuse Optimized Encapsulates circuits and wiring. Enables optimization. Efficient Just the wires needed to reach the destination are toggled. ISLPED: 17 Aug 27, 2007
OCINs vs Buses Most CMPs and SOCs today use one or more buses for interconnect A bus is just one type of OCIN Low bandwidth (one transation at a time) High concentration (all processors share one link) Inefficient (all wire segments toggled even for short communication) For both buses and other OCINs Control path is dominated by arbiters/allocators Data path is dominated by data movement and buffering Can do much better by searching more of the network design space ISLPED: 18 Aug 27, 2007
Circuits set Cost & Area Constraints for Architecture Can do substantially (10x-100x) better than default circuits ISLPED: 19 Aug 27, 2007
Enabling Technology is a Prerequisite Channels, Buffers, Switches Topology Routing Flow Control Microarchitecture ISLPED: 20 Aug 27, 2007
Channels 10x to 100x power reduction Eq signaling for faster propagation and increased repeater distance (D & P Chapter 8, Heaton 01) Elastic channels provide free buffers (Mizuno 01) Send 4-8 bits per cycle per wire (assuming 20FO4 cycle) ISLPED: 21 Aug 27, 2007
Buffers Dense arrays (vs. Flip-Flops or Latches) 1/10 area/bit 1/10 power for low-swing read Low-swing write 1/10 power for writes. Low-swing read - can keep swing low through muxes. ISLPED: 22 Aug 27, 2007
Switches Low-swing bit lines Operate at channel rate Reduces area and hence power Equalized drive Buffered crosspoints Integral allocation ISLPED: 23 Aug 27, 2007
Circuits Impact Architecture With standard-cell approach Power is approximately evenly split between channels, buffers, and routers With efficient circuits Channels 1/30, buffers 1/3 Routers dominate Routing >> Buffering >> Propagating Motivates topologies with fewer hops, longer channels. Just propagate bits - avoid buffering, really avoid routing ISLPED: 24 Aug 27, 2007
Properties of these elements drives optimal network organization ISLPED: 25 Aug 27, 2007
On-Chip Interconnection Network System = Processor Tiles Source: Balfour and Dally, ICS 06 ISLPED: 26 Aug 27, 2007
On-Chip Interconnection Network (2) System = Processor Tiles + Channels Source: Balfour and Dally, ICS 06 ISLPED: 27 Aug 27, 2007
Interconnection Network (3) System = Processor Tiles + Channels + Routers Source: Balfour and Dally, ICS 06 ISLPED: 28 Aug 27, 2007
Router Architecture Input-queued Virtual Channel Speculative Pipeline Source: Balfour and Dally, ICS 06 ISLPED: 29 Aug 27, 2007
Router Area Accurate modeling requires floorplan Source: Balfour and Dally, ICS 06 ISLPED: 30 Aug 27, 2007
Torus Source: Balfour and Dally, ICS 06 ISLPED: 31 Aug 27, 2007
Concentrated Mesh Source: Balfour and Dally, ICS 06 ISLPED: 32 Aug 27, 2007
Express Links Source: Balfour and Dally, ICS 06 ISLPED: 33 Aug 27, 2007
Network Replication Abundant wire resources build second network Resource allocation tradeoff Wide: [+] Serialization Latency [+] Router Energy Efficiency [ ] Router Area Replicated: [+] Decoupled Resources [+] Area Efficiency [?] Energy Efficiency [ ] Serialization Latency [+] SCALABLE Source: Balfour and Dally, ICS 06 ISLPED: 34 Aug 27, 2007
Energy Efficiency Network Energy Completion Time (normalized to Torus network) 1.5 1.0 0.5 0.0 Concentrated Mesh (Replicated) Torus Fat-Tree Fat-Tree with Taper Mesh (Replicated) Source: Balfour and Dally, ICS 06 ISLPED: 35 Aug 27, 2007
Large differences in efficiency. Optimal topology not obvious, not regular and very sensitive to properties of network elements ISLPED: 36 Aug 27, 2007
Where is Energy Expended? Power [W] 20 15 10 5 Buffers Switch Channels 0 Concentrated Mesh (Replicated) Torus Source: Balfour and Dally, ICS 06 ISLPED: 37 Aug 27, 2007
Flow Control Trade channel bandwidth (cheap) for buffer space (expensive) Make buffers shallow Compensate for lower duty factor by overprovisioning channels Little cost in energy Circuit switching (no buffers) Elastic buffers - use free buffers in the channels ISLPED: 38 Aug 27, 2007
Flow Control in an On-Chip FlatBfly S X D ISLPED: 39 Aug 27, 2007
View as Two Buffered Links S X D S X D ISLPED: 40 Aug 27, 2007
Channels Have Repeaters S X D ISLPED: 41 Aug 27, 2007
Buffers Decouple Channel Allocation in Time S X D S X D ISLPED: 42 Aug 27, 2007
Circuit Switching S X D S X D ISLPED: 43 Aug 27, 2007
With Elastic Buffers S X D S X D ISLPED: 44 Aug 27, 2007
Summary ISLPED: 45 Aug 27, 2007
Comments on Shekar s Talk There is more to life than meshes and buses Optimal networks, like flattened butterflies are better than either Buses weren t even good on boards Slow, no parallelism, excess power, no locality Even Shekar s numbers say buses are worse 25-100W vs 20-80W Differential signaling can be applied to networks too. Routers don t take 3-5 clocks - many examples of one clock routers See Mullins, ISCA 2004 Arbiters for circuit switched network no different than routers for packet switched But drops packets on collisions - packet switching more efficient Yes, embrace parallelism - with well-designed networks We re making progress - in December Shekar was saying buses, now he s advanced to circuit switching ISLPED: 46 Aug 27, 2007
Summary OCINs critically important Vital component of CMPs, SoCs Less mature technology than other components Naïve implementations consume significant Power Power is dominated by interconnect - operations are cheap Very different than off-chip networks Cost, Channels, Workloads, Design Issues Efficient network elements are enabling technology Energy and area efficient channels, buffers, & switches Change the equation for network design: channels << buffers << routers Topology Minimize diameter Concentrated Mesh with Express Channels Flattened Butterfly Bus is one extreme, but almost never the right answer Flow Control Minimize buffers at switch points Use elastic buffers to minimize latency Efficient OCIN designs yield significant gains ISLPED: 47 Aug 27, 2007
Backup Slides ISLPED: 48 Aug 27, 2007
On-Chip Flattened Butterfly Conventional 2D Mesh 2D Flattened Butterfly Source: Kim and Dally, to appear ISLPED: 49 Aug 27, 2007
On-Chip Flattened Butterfly dimension 1 Layout Mapping dimension 2 Source: Kim and Dally, to appear ISLPED: 50 Aug 27, 2007
Bypass Channels Conventional Flattened Butterfly Flattened Butterfly with Bypass Channels connected to local router Source: Kim and Dally, to appear ISLPED: 51 Aug 27, 2007
Latency Comparison Latency (normalized to mesh network) 1 0.8 0.6 0.4 0.2 0 bitrev tornado UR MESH CMESH FBFLY-MIN FBFLY- NONMIN transpose randperm bitcomp FBFLY-BYP Source: Kim and Dally, to appear ISLPED: 52 Aug 27, 2007
Power Comparison Power (W) 12 10 8 6 4 Memory Crossbar Channel 2 0 MESH CMESH FBFLY-MIN FBFLY- NONMIN Source: Kim and Dally, to appear FBFLY-BYP ISLPED: 53 Aug 27, 2007