On-Chip Interconnection Networks Low-Power Interconnect

Transcription

1 On-Chip Interconnection Networks Low-Power Interconnect William J. Dally Computer Systems Laboratory Stanford University ISLPED August 27, 2007 ISLPED: 1 Aug 27, 2007

2 Outline
  Demand for On-Chip Networks
  What is Unique about On-Chip Networks
  The Power Problem
  Enabling Technologies
  Circuits - set the constraints
  Topology
  Micro-Architecture
  Summary

3 Urgent Demand for OCINs

4 The Future is CMPs OCINs are a Critical Component

5 Example CMP OCIN

6 Growing Complexity of SoCs Demands an On-Chip Interconnection Network Avner Goren TI EPF 2004

7 So, what's different about on-chip networks?

8 State of Off-Chip Networks

9 Technology Trends
  [Figure: bandwidth per router node (Gb/s) vs. year. Labeled routers: Torus Routing Chip, Intel iPSC/2, J-Machine, CM-5, Intel Paragon XP, Cray T3D, MIT Alewife, IBM Vulcan, Cray T3E, SGI Origin 2000, AlphaServer GS320, IBM SP Switch2, Quadrics QsNet, Cray X1, Velio 3003, IBM HPS, SGI Altix 3000, Cray XT3, YARC, BlackWidow]

10 Some History
  MARS Router 1984
  Torus Routing Chip 1985
  MDP 1991
  Network Design Frame 1988
  Robert Mullins
  Reliable Router 1994
  MAP 1998
  Imagine 2002
  YARC 2006

11 Some very good books

12 Summary of Off-Chip Networks*
  Topology
    Fit to packaging and signaling technology
    High-radix - Clos or FlatBfly gives lowest cost
  Routing
    Global adaptive routing balances load w/o destroying locality
  Flow control
    Virtual channels/virtual cut-through
  *oversimplified

13 So, what's different about on-chip networks?

14 Cost, Channels, Workload are Different
  Cost
    Off-chip: cost is channels - pins, connectors, cables, optics
    On-chip: cost is Si area and power (storage and switches); wires are plentiful
    Drives networks with many long, wide channels and few buffers
  Channel characteristics
    On-chip RC lines - need a repeater every 1mm (or less)
    Short distance - low latency
    Can put logic in repeaters; motivates low-latency routers
  Workload
    CMP cache traffic
    SoC isochronous flows
  Design issues
    Floorplanning
  Different constraints motivate some surprising differences in design.

15 The Power Problem

16 Power is Dominated by Interconnect
  In 65nm
    A 32-bit add takes 0.8pJ
    Moving a 32-bit word 1mm takes 13pJ
  A 64-core CMP at 1GHz
    Estimated BW demand of 6.4GW/s (~200Gb/s)
    Average distance for each transfer of 10mm (each way)
    Power due to network ~50W
    Power due to 64 adds at 1GHz ~50mW
    Power due to 64 RISC processors at 1GHz ~13W
  Need to reduce network power by at least 10x
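The arithmetic behind some of the slide 16 figures can be checked with a short sketch. It uses only the numbers quoted on the slide; the function names are illustrative, not from the talk. The add power and the Gb/s conversion come out as quoted. The wire-only transport power it computes is well below the slide's ~50W network estimate; the slide does not itemize what else that estimate includes, so the sketch does not try to reproduce it.

```python
# Figures quoted on slide 16 (65nm). Function names are illustrative.
ADD_ENERGY_PJ = 0.8            # one 32-bit add
WIRE_ENERGY_PJ_PER_MM = 13.0   # moving one 32-bit word 1mm
WORD_BITS = 32
CORES = 64
CLOCK_HZ = 1e9
WORDS_PER_SECOND = 6.4e9       # estimated bandwidth demand
DISTANCE_MM = 10.0             # average transfer distance, each way
PJ = 1e-12                     # joules per picojoule


def add_power_w():
    """Power of one add per core per cycle across all cores."""
    return CORES * CLOCK_HZ * ADD_ENERGY_PJ * PJ


def bandwidth_bits_per_s():
    """Bandwidth demand converted from words/s to bits/s."""
    return WORDS_PER_SECOND * WORD_BITS


def wire_power_w(round_trip=True):
    """Wire-only transport power (no routers or buffers: an assumption)."""
    distance = DISTANCE_MM * (2 if round_trip else 1)
    return WORDS_PER_SECOND * distance * WIRE_ENERGY_PJ_PER_MM * PJ


def move_vs_add_ratio():
    """Energy to move a word the average one-way distance vs. one add."""
    return (WIRE_ENERGY_PJ_PER_MM * DISTANCE_MM) / ADD_ENERGY_PJ


if __name__ == "__main__":
    print(f"64 adds at 1GHz:      {add_power_w() * 1e3:.1f} mW")
    print(f"bandwidth demand:     {bandwidth_bits_per_s() / 1e9:.1f} Gb/s")
    print(f"wire-only round trip: {wire_power_w():.2f} W")
    print(f"move 10mm vs. add:    {move_vs_add_ratio():.1f}x")
```

Per word, moving data the average distance costs over a hundred times the energy of the add itself, which is the point the slide is making.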

17 OCINs are an Enabling Technology
  Modular: standard interface enables design reuse
  Optimized: encapsulates circuits and wiring; enables optimization
  Efficient: just the wires needed to reach the destination are toggled

18 OCINs vs Buses
  Most CMPs and SoCs today use one or more buses for interconnect
  A bus is just one type of OCIN
    Low bandwidth (one transaction at a time)
    High concentration (all processors share one link)
    Inefficient (all wire segments toggled even for short communication)
  For both buses and other OCINs
    Control path is dominated by arbiters/allocators
    Data path is dominated by data movement and buffering
  Can do much better by searching more of the network design space
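The "all wire segments toggled" point can be illustrated with a first-order sketch. It assumes wire energy is proportional to the length of wire toggled and reuses the 13pJ per 32-bit word per mm figure quoted on slide 16. The 20mm bus length and 2mm transfer distance are hypothetical, and router and buffer energy in the network case is deliberately ignored.

```python
# First-order wire-energy comparison. Assumptions: energy scales linearly
# with toggled wire length; router and buffer energy is ignored.
WIRE_ENERGY_PJ_PER_MM = 13.0  # 32-bit word per mm, from slide 16


def bus_transfer_energy_pj(bus_length_mm):
    """A bus toggles its whole length regardless of where the receiver is."""
    return WIRE_ENERGY_PJ_PER_MM * bus_length_mm


def network_transfer_energy_pj(path_length_mm):
    """A switched network toggles only the segments on the path taken."""
    return WIRE_ENERGY_PJ_PER_MM * path_length_mm


if __name__ == "__main__":
    bus_mm, path_mm = 20.0, 2.0  # hypothetical sizes
    bus = bus_transfer_energy_pj(bus_mm)
    net = network_transfer_energy_pj(path_mm)
    print(f"bus: {bus:.0f} pJ, network: {net:.0f} pJ, ratio {bus / net:.0f}x")
```

Under these assumptions the gap grows with how local the traffic is: the shorter the typical transfer relative to the bus length, the more energy the bus wastes.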

19 Circuits set Cost & Area Constraints for Architecture
  Can do substantially (10x-100x) better than default circuits

20 Enabling Technology is a Prerequisite
  Channels, Buffers, Switches
  Topology
  Routing
  Flow Control
  Microarchitecture

21 Channels
  10x to 100x power reduction
  Eq signaling for faster propagation and increased repeater distance (D & P Chapter 8, Heaton 01)
  Elastic channels provide free buffers (Mizuno 01)
  Send 4-8 bits per cycle per wire (assuming 20FO4 cycle)

22 Buffers
  Dense arrays (vs. flip-flops or latches)
    1/10 area/bit
    1/10 power for low-swing read
  Low-swing write
    1/10 power for writes
  Low-swing read - can keep swing low through muxes

23 Switches
  Low-swing bit lines
  Operate at channel rate
  Reduces area and hence power
  Equalized drive
  Buffered crosspoints
  Integral allocation

24 Circuits Impact Architecture
  With standard-cell approach
    Power is approximately evenly split between channels, buffers, and routers
  With efficient circuits
    Channels 1/30, buffers 1/3
    Routers dominate
  Routing >> Buffering >> Propagating
  Motivates topologies with fewer hops, longer channels
  Just propagate bits - avoid buffering, really avoid routing
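A small worked example shows why routers dominate once the circuits improve. It reads the slide's "channels 1/30, buffers 1/3" as reduction factors relative to the standard-cell baseline and assumes router power is unchanged. That reading, and the names below, are assumptions rather than something the slide states.

```python
from fractions import Fraction


def rescaled_shares(baseline, scale):
    """Scale each component's power, then renormalize to shares of the new total.

    Components missing from `scale` are left unchanged (factor 1).
    Returns (shares, new_total_relative_to_baseline).
    """
    scaled = {name: power * scale.get(name, Fraction(1))
              for name, power in baseline.items()}
    total = sum(scaled.values())
    shares = {name: power / total for name, power in scaled.items()}
    return shares, total


# Standard-cell baseline: power split evenly three ways (from the slide).
BASELINE = {
    "channels": Fraction(1, 3),
    "buffers": Fraction(1, 3),
    "routers": Fraction(1, 3),
}
# Efficient circuits: reduction factors from the slide; routers assumed unchanged.
SCALE = {"channels": Fraction(1, 30), "buffers": Fraction(1, 3)}

if __name__ == "__main__":
    shares, total = rescaled_shares(BASELINE, SCALE)
    for name, share in shares.items():
        print(f"{name:9s} {float(share):6.1%}")
    print(f"total vs. baseline: {float(total):.2f}")
```

Under these assumptions routers account for 30/41 of the remaining power, buffers 10/41, and channels 1/41, which matches the slide's ordering of routing, buffering, and propagating.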

25 Properties of these elements drive optimal network organization

26 On-Chip Interconnection Network System = Processor Tiles Source: Balfour and Dally, ICS 06

27 On-Chip Interconnection Network (2) System = Processor Tiles + Channels Source: Balfour and Dally, ICS 06

28 Interconnection Network (3) System = Processor Tiles + Channels + Routers Source: Balfour and Dally, ICS 06

29 Router Architecture Input-queued Virtual Channel Speculative Pipeline Source: Balfour and Dally, ICS 06

30 Router Area Accurate modeling requires floorplan Source: Balfour and Dally, ICS 06

31 Torus Source: Balfour and Dally, ICS 06

32 Concentrated Mesh Source: Balfour and Dally, ICS 06

33 Express Links Source: Balfour and Dally, ICS 06

34 Network Replication
  Abundant wire resources: build a second network
  Resource allocation tradeoff
  Wide:
    [+] Serialization Latency
    [+] Router Energy Efficiency
    [-] Router Area
  Replicated:
    [+] Decoupled Resources
    [+] Area Efficiency
    [?] Energy Efficiency
    [-] Serialization Latency
    [+] SCALABLE
  Source: Balfour and Dally, ICS 06
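The wide-versus-replicated tradeoff can be sketched with two first-order models. Serialization latency is taken as packet size divided by channel width. Crossbar area is assumed to grow with the square of (ports x channel width), a common approximation that the slide does not state. The 512-bit packet, 5-port router, and 128-bit and 64-bit widths are hypothetical.

```python
def serialization_cycles(packet_bits, channel_width_bits):
    """Cycles to push one packet onto a channel (ceiling division)."""
    return -(-packet_bits // channel_width_bits)


def crossbar_area_units(ports, channel_width_bits, networks=1):
    """Relative crossbar area, assuming area ~ (ports * width)^2 per network."""
    return networks * (ports * channel_width_bits) ** 2


if __name__ == "__main__":
    packet_bits, ports = 512, 5  # hypothetical values

    wide_latency = serialization_cycles(packet_bits, 128)
    wide_area = crossbar_area_units(ports, 128)

    # Two networks of half the width carry the same total bandwidth.
    repl_latency = serialization_cycles(packet_bits, 64)
    repl_area = crossbar_area_units(ports, 64, networks=2)

    print(f"wide:       {wide_latency} cycles, area {wide_area}")
    print(f"replicated: {repl_latency} cycles, area {repl_area}")
```

With these assumptions the replicated design halves total crossbar area but doubles serialization latency, consistent with the plus and minus entries on the slide.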

35 Energy Efficiency
  [Figure: axes Network Energy and Completion Time (normalized to Torus network). Networks shown: Concentrated Mesh (Replicated), Torus, Fat-Tree, Fat-Tree with Taper, Mesh (Replicated)]
  Source: Balfour and Dally, ICS 06

36 Large differences in efficiency. Optimal topology not obvious, not regular and very sensitive to properties of network elements

37 Where is Energy Expended?
  [Figure: Power [W] broken down into Buffers, Switch, and Channels for Concentrated Mesh (Replicated) and Torus]
  Source: Balfour and Dally, ICS 06

38 Flow Control
  Trade channel bandwidth (cheap) for buffer space (expensive)
    Make buffers shallow
    Compensate for lower duty factor by overprovisioning channels
    Little cost in energy
  Circuit switching (no buffers)
  Elastic buffers - use free buffers in the channels
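The buffer-for-bandwidth trade can be made concrete with a minimal sketch that assumes credit-based flow control, a mechanism the slide does not name. Under that assumption, a channel whose downstream buffer holds fewer flits than the credit round trip takes in cycles sits idle part of the time. Its duty factor is roughly buffer depth divided by round-trip time, capped at 1. The function names and example numbers are hypothetical.

```python
import math
from fractions import Fraction


def duty_factor(buffer_depth_flits, credit_round_trip_cycles):
    """Fraction of cycles a channel can send, limited by available credits."""
    return min(Fraction(1),
               Fraction(buffer_depth_flits, credit_round_trip_cycles))


def channels_needed(demand_flits_per_cycle, duty):
    """Parallel channels required to carry the demand at a given duty factor."""
    return math.ceil(Fraction(demand_flits_per_cycle) / duty)


if __name__ == "__main__":
    round_trip = 4  # hypothetical credit round trip, in cycles
    for depth in (4, 2, 1):
        duty = duty_factor(depth, round_trip)
        print(f"depth {depth}: duty {duty}, "
              f"channels for 1 flit/cycle: {channels_needed(1, duty)}")
```

Halving the buffer depth below the round-trip time halves the duty factor, and the lost throughput is bought back with a second channel, which is cheap when wires are plentiful.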

39 Flow Control in an On-Chip FlatBfly [Diagram: nodes S, X, D]

40 View as Two Buffered Links [Diagram: nodes S, X, D]

41 Channels Have Repeaters [Diagram: nodes S, X, D]

42 Buffers Decouple Channel Allocation in Time [Diagram: nodes S, X, D]

43 Circuit Switching [Diagram: nodes S, X, D]

44 With Elastic Buffers [Diagram: nodes S, X, D]

45 Summary

46 Comments on Shekar's Talk
  There is more to life than meshes and buses
    Optimal networks, like flattened butterflies, are better than either
  Buses weren't even good on boards
    Slow, no parallelism, excess power, no locality
    Even Shekar's numbers say buses are worse
      W vs 20-80W
  Differential signaling can be applied to networks too
  Routers don't take 3-5 clocks - many examples of one-clock routers
    See Mullins, ISCA 2004
  Arbiters for a circuit-switched network are no different than routers for a packet-switched one
    But drops packets on collisions - packet switching more efficient
  Yes, embrace parallelism - with well-designed networks
  We're making progress - in December Shekar was saying buses, now he's advanced to circuit switching

47 Summary
  OCINs critically important
    Vital component of CMPs, SoCs
    Less mature technology than other components
  Naïve implementations consume significant power
    Power is dominated by interconnect - operations are cheap
  Very different than off-chip networks
    Cost, Channels, Workloads, Design Issues
  Efficient network elements are enabling technology
    Energy and area efficient channels, buffers, & switches
    Change the equation for network design: channels << buffers << routers
  Topology
    Minimize diameter
    Concentrated Mesh with Express Channels
    Flattened Butterfly
    Bus is one extreme, but almost never the right answer
  Flow Control
    Minimize buffers at switch points
    Use elastic buffers to minimize latency
  Efficient OCIN designs yield significant gains

48 Backup Slides

49 On-Chip Flattened Butterfly Conventional 2D Mesh 2D Flattened Butterfly Source: Kim and Dally, to appear
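The diameter gap between these topologies can be tabulated with a short sketch. It counts router-to-router hops under minimal routing and uses the standard formulas for a k x k mesh and torus. For the 2D flattened butterfly it relies on each router connecting directly to every other router in its row and column. The 64-tile system and concentration of 4 are illustrative choices, not figures taken from this slide.

```python
import math


def routers_per_side(tiles, concentration):
    """Side length k of the square router array for a given concentration."""
    routers = tiles // concentration
    side = math.isqrt(routers)
    if side * side != routers or routers * concentration != tiles:
        raise ValueError("tiles/concentration must be a perfect square")
    return side


def mesh_diameter(k):
    """Worst-case hops in a k x k mesh: corner to opposite corner."""
    return 2 * (k - 1)


def torus_diameter(k):
    """Worst-case hops in a k x k torus: halfway around each ring."""
    return 2 * (k // 2)


def flattened_butterfly_diameter(dimensions=2):
    """One hop per dimension: routers in a row or column are fully connected."""
    return dimensions


if __name__ == "__main__":
    tiles = 64  # illustrative system size
    k1 = routers_per_side(tiles, 1)
    k4 = routers_per_side(tiles, 4)
    print(f"mesh {k1}x{k1}:                {mesh_diameter(k1)} hops")
    print(f"torus {k1}x{k1}:               {torus_diameter(k1)} hops")
    print(f"concentrated mesh {k4}x{k4}:   {mesh_diameter(k4)} hops")
    print(f"flattened butterfly {k4}x{k4}: {flattened_butterfly_diameter()} hops")
```

Fewer hops mean fewer router traversals per packet, which matters most when routers are the dominant energy cost.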

50 On-Chip Flattened Butterfly
  [Diagram: Layout Mapping, showing dimension 1 and dimension 2]
  Source: Kim and Dally, to appear

51 Bypass Channels
  [Diagram: Conventional Flattened Butterfly vs. Flattened Butterfly with Bypass Channels connected to local router]
  Source: Kim and Dally, to appear

52 Latency Comparison
  [Figure: latency (normalized to mesh network) for traffic patterns bitrev, tornado, UR, transpose, randperm, bitcomp. Networks: MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, FBFLY-BYP]
  Source: Kim and Dally, to appear

53 Power Comparison
  [Figure: Power (W) broken down into Memory, Crossbar, and Channel. Networks: MESH, CMESH, FBFLY-MIN, FBFLY-NONMIN, FBFLY-BYP]
  Source: Kim and Dally, to appear