Interconnection Networks


1 Interconnection Networks. Z. Jerry Shi, Assistant Professor of Computer Science and Engineering, University of Connecticut. * Slides adapted from Blumrich & Gschwind/ELE475 03, Peh/ELE475 *
Three questions about interconnection networks. What is an interconnection network? A programmable system that transports data between terminals. Where do you find interconnection networks? In almost all digital systems that are large enough to have two components to connect. The most common applications are in computer systems and communication switches: connections between processors and memories, and between I/O devices and I/O controllers. Simple bus systems are used in many systems, but high processor performance demands fast interconnection networks. Why are interconnection networks important? They are the limiting factor in the performance of many systems.

2 Architecture of Interconnection Networks How to connect the nodes up (processors, memories, router line cards, SoC modules) TOPOLOGY Which path should a message take? ROUTING AND DEADLOCKS How is the message actually forwarded from source to destination FLOW CONTROL How to build the routers ROUTER MICROARCHITECTURE How to build the links LINK ARCHITECTURE How do nodes talk to the network NETWORK INTERFACE Metrics in Interconnection Networks Performance Latency How fast data can be transported through the network Throughput How many pieces of data (messages) can be transported in each time unit Power Area Cost Fault-Tolerance Quality-of-service 2

3 Topology. An interconnection network consists of a set of shared router nodes and channels. Topology refers to the arrangement of these nodes and channels. It is analogous to a roadmap: channels (roads), packets (cars), router nodes (intersections).
Topological Properties. Routing distance: number of links on a route. Average distance. Diameter: maximum routing distance. Bisection bandwidth: the bandwidth crossing a minimal cut that divides the network in half. A network is partitioned by a set of links if their removal disconnects the graph. Degree: number of communication links attached to a node.

4 Linear Arrays and Rings. Linear array (figure: nodes ..., N-2, N-1): Diameter? Average distance? Bisection bandwidth? Route A -> B is given by the relative address $R = B - A$. Ring? Examples: Fiber Distributed Data Interface (FDDI), Scalable Coherent Interface (SCI), Fibre Channel Arbitrated Loop.
Multidimensional Meshes, Tori, and Hypercubes. d-dimensional k-ary torus (or k-ary d-cube): $N = k^d$. Each dimension has k nodes, and each node can be located with a vector of d coordinates. A k-ary d-cube can be constructed from k k-ary (d-1)-cubes. The radix in each dimension may be different, for example a 2,3,4-ary 3-cube. d-dimensional k-ary mesh: similar to a torus, but with the channels between the first and last node in every dimension cut. Hypercube: binary d-cube. The radix in every dimension is 2, so each coordinate is either 0 or 1.

5 Hypercubes. Also called binary n-cubes. Number of nodes: $N = 2^n$. Distance: $O(\log N)$ hops. Good bisection bandwidth. Complexity: out-degree is $n = \log_2 N$. (Figure: hypercubes of dimension 0-D, 1-D, 2-D, 3-D, 4-D, 5-D.)
Real World 2D mesh. 1824-node Paragon: 16 x 114 mesh.
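
A minimal sketch of why distance in a binary n-cube is logarithmic in N: two nodes are neighbors when their addresses differ in exactly one bit, so the hop count between any two nodes is the Hamming distance of their addresses, which is at most n. The function names are illustrative and not from the slides.

```python
def hypercube_distance(a, b):
    """Hops between nodes a and b in a binary n-cube (Hamming distance)."""
    return bin(a ^ b).count("1")


def hypercube_neighbors(node, n):
    """The n neighbors of a node: flip one address bit per dimension."""
    return [node ^ (1 << i) for i in range(n)]


if __name__ == "__main__":
    n = 4
    nodes = range(2 ** n)
    # The diameter is the largest distance over all pairs; it equals n.
    print(max(hypercube_distance(a, b) for a in nodes for b in nodes))
```

Each node has n neighbors, which matches the out-degree stated on the slide.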

6 Properties. Routing: relative distance $R = (b_{d-1} - a_{d-1}, \ldots, b_0 - a_0)$. Traverse $r_i = b_i - a_i$ hops in each dimension (dimension-order routing). Degree? Diameter? Average distance: $dk/4$ for a cube. Bisection bandwidth? $k^{d-1}$ bidirectional links. Physical layout? 2D in $O(N)$ space. Higher dimensions?
Embeddings in two dimensions. Example: 6 x 3 x 2. Embed multiple logical dimensions in one physical dimension using long wires.
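
The relative-distance rule above can be sketched for a k-ary d-cube with wraparound links. This is an illustrative sketch, not the slides' own code: coordinates are stored lowest dimension first, each offset is folded to the shorter way around the ring, and ties at exactly k/2 are resolved in the positive direction (all three are assumptions).

```python
def relative_address(a, b, k):
    """Signed per-dimension offsets from node a to node b in a k-ary d-cube."""
    offsets = []
    for ai, bi in zip(a, b):
        r = (bi - ai) % k      # hops needed in the + direction
        if r > k // 2:         # going the other way around is shorter
            r -= k
        offsets.append(r)
    return offsets


def dimension_order_route(a, b, k):
    """Nodes visited when offsets are corrected one dimension at a time."""
    node = list(a)
    path = [tuple(node)]
    for dim, r in enumerate(relative_address(a, b, k)):
        step = 1 if r > 0 else -1
        for _ in range(abs(r)):
            node[dim] = (node[dim] + step) % k
            path.append(tuple(node))
    return path


if __name__ == "__main__":
    # 8-ary 2-cube: 3 hops backwards in dimension 0, then 1 hop in dimension 1.
    print(dimension_order_route((0, 0), (5, 1), 8))
```

For a mesh (no wraparound) the fold step would be dropped and the offset would simply be b_i - a_i.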

7 Topology Summary.
Topology        Degree            Diameter            Ave Dist           Bisection     D (D ave) @ P=1024
1D Array        2                 N-1                 N/3                1             huge
1D Ring         2                 N/2                 N/4                2
2D Mesh         4                 $2(N^{1/2} - 1)$    $(2/3) N^{1/2}$    $N^{1/2}$     63 (21)
2D Torus        4                 $N^{1/2}$           $(1/2) N^{1/2}$    $2N^{1/2}$    32 (16)
k-ary n-cube    2n                nk/2                nk/4               nk/4          15
Hypercube       $n = \log_2 N$    n                   n/2                N/2           10 (5)
All have some bad permutations. Many popular permutations are very bad for meshes (transpose). Randomness in wiring or routing makes it hard to find a bad one!
Trees. Diameter and average distance are logarithmic. k-ary tree, height $d = \log_k N$. An address is specified as a d-vector of radix-k coordinates describing the path down from the root. Fixed degree. Route up to the common ancestor and down: R = B xor A; let i be the position of the most significant 1 in R; route up i+1 levels, then down in the direction given by the low i+1 bits of B. H-tree space is $O(N)$ with $O(\sqrt{N})$ long wires. Bisection BW?
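
The tree routing rule above can be sketched for the binary case (k = 2), where leaf addresses are plain integers. The function name and the return format (levels to climb, then a list of left/right turns) are assumptions made for illustration.

```python
def tree_route(a, b):
    """Route between leaves a and b of a binary tree.

    Returns (levels_up, down_turns), where down_turns lists the child taken
    at each level on the way down (0 or 1), most significant bit first.
    """
    r = a ^ b
    if r == 0:
        return 0, []                 # source and destination are the same leaf
    i = r.bit_length() - 1           # position of the most significant 1 in R
    levels_up = i + 1                # climb to the common ancestor
    down_turns = [(b >> j) & 1 for j in range(i, -1, -1)]  # low i+1 bits of B
    return levels_up, down_turns


if __name__ == "__main__":
    print(tree_route(0b000, 0b011))
```

Leaves that share a long address prefix have a low common ancestor, so nearby leaves need only a short climb.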

8 Fat-Trees. Fatter links (really more of them) as you go up, so bisection BW scales with N.
Butterflies. Tree with lots of roots! $N \log N$ (actually $N/2 \times \log N$). Exactly one route from any source to any destination. R = A xor B; at level i use the straight edge if $r_i = 0$, otherwise the cross edge. Bisection: N/2.
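
The butterfly routing rule above can be sketched as follows. The mapping of level i to address bit i (least significant bit first) is an assumption; a real butterfly may consume the bits in the opposite order. The function names are illustrative.

```python
def butterfly_route(src, dst, n):
    """Edge taken at each of the n levels of a butterfly with 2**n rows."""
    r = src ^ dst
    return ["cross" if (r >> i) & 1 else "straight" for i in range(n)]


def butterfly_rows(src, dst, n):
    """Row occupied after each level; a cross edge flips that level's bit."""
    row = src
    rows = [row]
    for i, edge in enumerate(butterfly_route(src, dst, n)):
        if edge == "cross":
            row ^= 1 << i
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(butterfly_route(0b101, 0b011, 3))
    print(butterfly_rows(0b101, 0b011, 3))
```

Because every level makes exactly one forced choice, there is exactly one route per source and destination pair, and it always takes n steps.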

9 Benes network and Fat Tree Back-to-back butterfly can route all permutations Off line What if you just pick a random mid point? INPUT Butterfly network Inverse butterfly network OUTPUT Relationship Butterflies to Hypercubes Wiring is isomorphic Except that Butterfly always takes log n steps Many other types of multistage interconnection networks 9

10 How Many Dimensions? n = 2 or n = 3: short wires, easy to build; many hops, low bisection bandwidth; requires traffic locality. n >= 4: harder to build, more wires, longer average length; fewer hops, better bisection bandwidth; can handle non-local traffic. k-ary d-cubes provide a consistent framework for comparison: $N = k^d$; scale the dimension (d) or the nodes per dimension (k).
Real Machines. Wide links, smaller routing delay. Tremendous variation.

11 Routing. Messages, Packets, Flits, Phits. A flit (flow control digit) is the basic unit of bandwidth and storage allocation. A phit (physical transfer digit) is the unit of information that is transferred across a channel in a single clock cycle.

12 Typical Packet Format. (Figure labels: Trailer, Error Code, Data Payload, Routing and Control Header; digital symbol; sequence of symbols transmitted over a channel.) A packet consists of different types of flits: head, body, or tail. The head flit carries the packet's routing information. A packet has a format of HB*T*.
Routing. The routing algorithm determines which of the possible paths are used as routes and how the route is determined. $R: N \times N \to C$, which at each switch maps the destination node to the next channel on the route. Issues: routing mechanism (arithmetic, source-based port select, table driven, general computation); properties of the routes; deadlock freedom.

13 Taxonomy of Routing Algorithms. Deterministic: route determined by (source, dest), not intermediate state (i.e. traffic); given two nodes x and y, the path $R_{x,y}$ is always the same. Oblivious: choose a route without considering any information about the network's current state; example: a random algorithm. Adaptive: route influenced by traffic along the way. Minimal: only selects shortest paths.
Example: routing on a ring. Greedy: always send the packet in the shortest direction. Uniform random: randomly pick a direction, with equal probability for either direction. Weighted random: randomly pick a direction, but weight the short direction with $1 - d/n$, where d is the length of the shortest path. Adaptive: send the packet in the direction for which the local channel has the lowest load; record how many packets a channel has transmitted over the last T slots.
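
The first three ring policies above can be sketched as a single direction-selection function. This is a sketch under assumptions: n is taken to be the number of nodes on the ring, +1 means clockwise, and the adaptive policy is left out because it needs per-channel load counters that are not modeled here.

```python
import random


def ring_direction(src, dst, n, policy="greedy", rng=random):
    """Return +1 (clockwise) or -1 (counterclockwise) on an n-node ring."""
    cw = (dst - src) % n           # hops if the packet goes clockwise
    ccw = n - cw                   # hops if it goes counterclockwise
    short = 1 if cw <= ccw else -1
    d = min(cw, ccw)               # length of the shortest path
    if policy == "greedy":
        return short
    if policy == "uniform":
        return rng.choice([1, -1])
    if policy == "weighted":
        # Short direction with probability 1 - d/n, long direction otherwise.
        return short if rng.random() < 1 - d / n else -short
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [ring_direction(0, 3, 8, "weighted", rng) for _ in range(10000)]
    print(picks.count(1) / len(picks))   # close to 1 - 3/8
```

Greedy is minimal and deterministic; the two random policies are oblivious, since they ignore network state.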

14 Routing relation. $R: N \times N \to \rho(P)$: the output of the relation is an entire path; there may be multiple paths. $R: N \times N \to \rho(C)$: routing is incremental; the output only indicates the channels that the packet may take at the current node. $R: C \times N \to \rho(C)$: similar to the second method, but uses the current channel instead of the current node.
Adaptive Routing. $R: C \times N \times \Sigma \to C$. Essential for fault tolerance (at least multipath). Can improve utilization of the network. Simple deterministic algorithms easily run into bad permutations. Fully/partially adaptive, minimal/non-minimal. Can introduce complexity or anomalies. A little adaptation goes a long way!

15 Routing Mechanism. Need to select an output port for each input packet in a few cycles. Simple arithmetic in regular topologies. Example: Δx, Δy routing in a grid:
  west (-x):    Δx < 0
  east (+x):    Δx > 0
  south (-y):   Δx = 0, Δy < 0
  north (+y):   Δx = 0, Δy > 0
  processor:    Δx = 0, Δy = 0
Reduce the relative address of each dimension in order. Dimension-order routing in k-ary d-cubes: calculate preferred directions, then adjust one dimension each time. Used in the Cray T3D, which connects up to 2048 DEC Alpha processing elements.
Routing Mechanism (cont). Source-based (figure: header carrying port selects P3, P2, P1, P0): mainly used in deterministic and oblivious routing. All routing decisions are made at the source, and the message header carries a series of port selects that are used and stripped en route. Fast, simple, and scalable. Examples: CS-2, Myrinet, MIT Arctic. Node-table: more appropriate for adaptive routing. Decide the output channel based on the incoming channel and the destination. Can redirect traffic if one output link is congested or fails. Examples: ATM, HIPPI.
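
A minimal sketch of the two mechanisms above: the arithmetic port select for x, y routing, and a source-based variant that precomputes the whole list of port selects so each router only strips the first entry. The function names, and the convention that dx and dy are the remaining signed offsets to the destination, are assumptions for illustration.

```python
def xy_port(dx, dy):
    """Arithmetic port select: correct x completely before touching y."""
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"


def source_route(dx, dy):
    """Port selects computed once at the source and carried in the header."""
    ports = []
    while (dx, dy) != (0, 0):
        port = xy_port(dx, dy)
        ports.append(port)
        # Moving one hop reduces the magnitude of the matching offset by one.
        dx += {"west": 1, "east": -1}.get(port, 0)
        dy += {"south": 1, "north": -1}.get(port, 0)
    ports.append("processor")      # final select ejects the packet
    return ports


if __name__ == "__main__":
    print(source_route(-2, 1))
```

With the arithmetic mechanism each router recomputes the select from the remaining offset; with the source-based mechanism the same decisions are made once and the header shrinks by one entry per hop.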

16 Deadlock. How can it arise? Necessary conditions: a shared resource (buffers or channels) that is incrementally allocated and non-preemptible. Think of a channel as a shared resource that is acquired incrementally: source buffer then destination buffer; channels along a route. How do you avoid it? Deadlock avoidance: guarantee no deadlock by constraining how channel resources are allocated (example: dimension order). Deadlock recovery: deadlock is detected and corrected. How do you prove that a routing algorithm is deadlock free?
Deadlock Freedom. Resources are logically associated with channels. Messages introduce dependences between resources as they move forward. Need to articulate the possible dependences that can arise between channels. Show that there are no cycles in the channel dependence graph: find a numbering of channel resources such that every legal route follows a monotonic sequence => no traffic pattern can lead to deadlock. The network need not be acyclic, only the channel dependence graph. All deadlock avoidance techniques use some form of resource ordering.
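
The deadlock-freedom test above (no cycles in the channel dependence graph) can be sketched with a depth-first search. The graph encoding is an assumption: a dictionary mapping each channel to the channels a packet holding it may request next.

```python
def has_cycle(deps):
    """True if the channel dependence graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}

    def visit(channel):
        color[channel] = GREY
        for nxt in deps.get(channel, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:      # back edge: nxt is already on this path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color.get(c, WHITE) == WHITE and visit(c) for c in list(deps))


if __name__ == "__main__":
    # Unidirectional 4-node ring: channel i may wait on channel (i + 1) mod 4.
    ring = {i: [(i + 1) % 4] for i in range(4)}
    # Same channels, but routes never use the wraparound dependence.
    line = {0: [1], 1: [2], 2: [3], 3: []}
    print(has_cycle(ring), has_cycle(line))
```

The acyclic case corresponds to the numbering argument: channels 0, 1, 2, 3 are always acquired in increasing order.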

17 Deadlock Recovery. Detection: determining exactly whether the network is deadlocked is difficult. Most practical detection mechanisms are conservative and may have false positives. Timeout counters: reset when making progress. Recovery. Regressive: packets or connections that are deadlocked are removed. Progressive: keep the packets or connections in an escape buffer; potentially has better performance; routing using the escape buffer is designed to be deadlock-free.
Flow Control. Flow control determines how a network's resources are allocated. Resources: channel bandwidth, buffer capacity, etc. Good flow control achieves a high fraction of ideal bandwidth and delivers packets with low, predictable latency. It can also be viewed as a problem of contention resolution: the problem is there because we are sharing resources. Processor: the resources are ALUs and registers; the question is how to run as many operations as possible, optimizing use of ALUs and registers. Network: the resources are buffers and links; the question is how to forward as many messages as possible, optimizing use of buffers and links.

18 Contention Two packets trying to use the same link at the same time Limited buffering Drop? Flow control protocols Bufferless Dropping Misrouting Circuit switching Header traverses the network and reserves resources Data are then sent through the reserved path Buffered Store-and-forward Virtual cut-through Wormhole Virtual-channel 18

19 Simplest Flow Control: Dropping. If two things arrive and I don't have resources, drop one of them. Flow control protocol on the Internet. Not used in interconnection networks: why?
Time-space Diagram: Dropping

20 Next Simplest Flow Control: Misrouting If only one message can enter the network at each node, and one message can exit the network at each node, the network can never be congested. Right? Philosophy behind misrouting: intentionally route away from congestion No need for buffering Circuit Switching Bufferless Probe that sets up path through network If the request flit is blocked, it is held in place (not dropped) Reserve all links Data are then sent through links Simple router Similar to the dropping case Need only one register to buffer the header When is this good? When is it not? 20

21 Time-space Diagram: Circuit Switching Store-and-Forward Buffered flow control: flits can be stored in routing nodes Flits arriving on cycle i do not have to leave on cycle i + 1 Make intermediate stops and wait till the whole packet has arrived before you move on Two resources must be allocated to the packet A packet-sized buffer at the other side of the channel Exclusive use of the channel Other packets can use intermediate links Pros and cons? 21

22 Time-space Diagram: Store-and-Forward. With store-and-forward, packets do not have to be divided into flits.
Virtual Cut-through. Why wait till the entire message has arrived at each intermediate stop? The head of the message can dash off first. Of course, the two resources must still be allocated. When the head gets blocked, the whole message gets blocked at the intermediate node.
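
The benefit of letting the head dash off first can be put in numbers with the usual zero-load latency approximations. These formulas are not stated on the slides; they are the standard contention-free estimates, and the parameter names are illustrative.

```python
def store_and_forward_latency(hops, router_delay, packet_len, bandwidth):
    """Each hop waits for the whole packet before forwarding it."""
    return hops * (router_delay + packet_len / bandwidth)


def cut_through_latency(hops, router_delay, packet_len, bandwidth):
    """The head is forwarded at once; serialization is paid only one time."""
    return hops * router_delay + packet_len / bandwidth


if __name__ == "__main__":
    # 4 hops, 1 cycle per router, 8-flit packet, 1 flit per cycle.
    print(store_and_forward_latency(4, 1, 8, 1))
    print(cut_through_latency(4, 1, 8, 1))
```

The gap grows with both hop count and packet length, which is why cut-through matters most for long packets on large networks.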

23 Time-space Diagram: Virtual Cut-through
Wormhole. Similar to virtual cut-through, but channel and buffers are allocated to flits rather than packets. When the head flit arrives, it must acquire three resources before being forwarded to the next node: a virtual channel for the packet (state bits indicating the output channel, the state of the virtual channel (idle, waiting for resources, or active), and other information); one flit buffer; and one flit of channel bandwidth. Body flits do not need to acquire virtual channels, but they still need to allocate a flit buffer and channel bandwidth. The tail flit releases the virtual channel. The channel is owned by a packet, but buffers are allocated on a flit-by-flit basis. When a flit cannot acquire a buffer, the channel goes idle.

24 Time-space Diagram: Wormhole
Virtual Channel. Associates several virtual channels with a single physical channel. When a packet blocks, instead of holding on to physical links so others cannot use them, it holds on to virtual links. The head flit needs three resources to advance: a virtual channel, a downstream flit buffer, and channel bandwidth. Subsequent body flits use the same virtual channel, but they still need to allocate a flit buffer and channel bandwidth. However, these flits are not guaranteed access to the channel bandwidth. Analogy: lanes on the highway; you have to compete with other cars.

25 Time-space diagram: virtual-channel. Arbitration may not be fair; it can be winner-take-all.
Link-level flow control. Given that you can't drop packets, how do you manage the buffers? When can you send stuff forward, and when not? Three techniques. Credit-based: the upstream router keeps a count of the number of free flit buffers in each virtual channel downstream. On/off: a single bit indicates whether the upstream node can send or not. Ack/nack: the upstream node optimistically sends flits when they are available, and the downstream node sends back an ack or nack. Flit-reservation: reduces buffer turnaround time.
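
Credit-based flow control, as described above, can be sketched from the upstream router's point of view. The class and method names are hypothetical; the sketch tracks one downstream virtual-channel buffer and ignores the delay of returning a credit.

```python
class CreditCounter:
    """Upstream count of free flit buffers in one downstream virtual channel."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        """Consume one credit; the flit now occupies a downstream buffer."""
        if not self.can_send():
            raise RuntimeError("no credit: the flit must wait upstream")
        self.credits -= 1

    def receive_credit(self):
        """The downstream node freed a buffer and returned a credit."""
        self.credits += 1


if __name__ == "__main__":
    link = CreditCounter(buffer_slots=2)
    link.send_flit()
    link.send_flit()
    print(link.can_send())     # downstream buffers are full
    link.receive_credit()
    print(link.can_send())     # one buffer was freed
```

Since a flit is only sent when a credit is held, a downstream buffer can never overflow and no flit has to be dropped.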

26 Link-level flow control. Short links (figure: source and destination, each with an F/E flag, connected by Req, Ready/Ack, and Data signals). Long links: several flits on the wire.
Buffer turnaround time. A flit leaves the downstream node. A credit is sent to the current node. The credit is processed and a flit is sent to the downstream node. The downstream node receives the flit. (Figure: timeline of one buffer with the segments hold, pipeline delay, wire delay, buffer use, release, and credit delay.)

27 Flit-reservation flow control. Hides the overhead by separating the control and data networks. Control flits race ahead to reserve network resources. Can also streamline the delivery of credits. Allows zero buffer turnaround time. Not always possible to reserve resources. The control head flit is similar to a typical head flit, but with an additional field that shows the time offset to the first data flit. The routing node knows when the data flit will arrive, and starts to prepare a buffer now.
Router (switch) microarchitecture: What's in a router? It's a system as well. Logic: state machines, arbiters, allocators; they control the movement through the router (idle, routing, waiting for resources, active). Memory: buffers; they store flits before forwarding them; SRAMs, registers, processor memory. Communication: switches; they transfer flits from input to output ports; crossbars, multiple crossbars, fully-connected, bus.

28 Typical Router Design. (Figure: input ports with receiver and input buffer; cross-bar; output buffer and transmitter at the output ports; control for routing and scheduling.)
Router Components. Output ports: transmitter (typically drives clock and data). Input ports: a synchronizer aligns the data signal with the local clock domain; essentially a FIFO buffer. Crossbar: connects each input to any output; degree limited by area or pinout. Buffering. Control logic: complexity depends on the routing logic and scheduling algorithm; determines the output port for each incoming packet; arbitrates among inputs directed at the same output.

29 Buffer Organizations Input buffers Buffering at each input port, stores flits till they get to leave through switch to next hop Central buffers A central memory shared among every port Functions as switch as well Output buffers Flits flow right through to output port Highest throughput, no head-of-line blocking Input Buffered Router Input Ports R0 Output Ports R1 R2 Cross-bar R3 Scheduling Independent routing logic per input FSM Scheduler logic arbitrates each output Priority, FIFO, or random Head-of-line blocking problem If an earlier flit is missed, the later flits hold the buffer 29

30 Output Buffered Router. (Figure: input ports, routing logic R0 to R3, output ports, control.) Commit to output: limited adaptivity. Switch has to handle input line speeds.
Virtual-channel Router

31 Virtual-channel Router Packet head, body, tail flits Head Routing output port Request and arbitrate for next VC Request and arbitrate for switch path Request and arbitrate for buffer Traverse switch Body Request and arbitrate for switch path Request and arbitrate for buffer Traverse switch Tail Request and arbitrate for switch path Request and arbitrate for buffer Traverse switch Release switch path State machines Control the state of the router Each input channel G: Global State: is it idle? routing? waiting for VC? buffer? R: Output port Filled by routing O: Output VC Filled by VC allocation P: Head and tail queue pointers C: Credits Each output channel G: Global state: Idle? Active? Waiting for credits? I: Input VC that is sending flits to this output port C: Credit count 31

32 Pipelining of a typical virtual channel router. Pipeline stages per flit (figure): head flit RC, VA, SA, ST; body flit 1 SA, ST; body flit 2 SA, ST; tail flit SA, ST. Cycle 0: the head flit arrives; G will change to R on the next cycle. Cycle 1: RC (routing computation); R and G (=V) will be updated on the next cycle. Cycle 2: VA (virtual channel allocation); on the next cycle, O and G (=A) will be updated, and the state of the output channel will be updated. Cycle 3: SA (switch allocation). Cycle 4: ST (switch traversal).
Output arbiters. N requesters (inputs) trying to get a single resource under contention (output). N:1 arbiter for each output. Several types of arbiters: fixed priority arbiter; variable priority arbiter (oblivious arbiter, round robin arbiter).
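
The pipeline above can be tabulated with a short sketch. The head flit's cycle numbers follow the slide (RC in cycle 1 through ST in cycle 4); the timing of the later flits is an assumption: no stalls, and each following flit enters switch allocation one cycle after the flit ahead of it.

```python
def pipeline_schedule(num_flits):
    """Cycle in which each flit occupies each stage, assuming no stalls."""
    schedule = []
    for k in range(num_flits):
        if k == 0:
            # Head flit: routing computation, VC allocation, switch
            # allocation, switch traversal.
            schedule.append({"RC": 1, "VA": 2, "SA": 3, "ST": 4})
        else:
            # Body and tail flits reuse the head's route and virtual channel.
            schedule.append({"SA": 3 + k, "ST": 4 + k})
    return schedule


if __name__ == "__main__":
    for k, stages in enumerate(pipeline_schedule(4)):
        print(k, stages)
```

Under these assumptions the switch is traversed by one flit per cycle once the head has cleared routing and virtual-channel allocation.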

33 Fixed Priority Arbiter
Variable Priority Arbiter. A one-hot priority signal p selects the highest priority; only one of the p's can be 1.

34 Variable Priority Arbiters Oblivious Not dependent on previous grants or requests Rotating priorities Random priorities Variable Priority Arbiters Round robin Request that was last served should have lowest priority Serve all other requests first before returning to this requestor If a grant is issued this cycle, the request next to the one receiving the grant will have the highest priority on the next cycle 34
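
A behavioral sketch of the two arbiter styles above. In the fixed-priority arbiter, index 0 is assumed to be the highest priority. The round-robin arbiter follows the stated rule: after a grant, the requester next to the winner gets the highest priority. Names are illustrative, and real arbiters are combinational logic rather than loops.

```python
def fixed_priority_arbiter(requests):
    """Grant the lowest-index asserted request; return its index or None."""
    for i, requested in enumerate(requests):
        if requested:
            return i
    return None


class RoundRobinArbiter:
    def __init__(self, n):
        self.n = n
        self.next_priority = 0     # requester with highest priority this cycle

    def arbitrate(self, requests):
        """Grant one request, then rotate priority past the winner."""
        for offset in range(self.n):
            i = (self.next_priority + offset) % self.n
            if requests[i]:
                self.next_priority = (i + 1) % self.n
                return i
        return None                # no requests: priority is unchanged


if __name__ == "__main__":
    arb = RoundRobinArbiter(3)
    print([arb.arbitrate([True, True, True]) for _ in range(4)])
```

With every input requesting, the fixed-priority arbiter would grant input 0 forever, while the round-robin arbiter serves each input in turn.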

35 Allocators. NxM allocator: N requestors fighting for M resources. Results: a grant can be asserted only if the corresponding request is asserted; at most one grant for each input may be asserted; at most one grant for each resource may be asserted.
Allocators In Routers. VC allocator: input VCs requesting a range of output VCs. E.g. a packet of VC0 arrives at the East input port. It's destined for the West output port, and would like to get any of the VCs of that output port. Switch allocator: input VCs of an input port request different output ports (e.g. one's going North, another's going West).

36 Simplest Allocators: Separable Approximate with two stages of arbitration One on inputs, one on outputs. They can be in either order. Separable Allocator Example: Dumb arbiters that always choose the first request 36
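
A sketch of a separable allocator in the input-first order, using the "dumb" arbiters from the example above that always pick the first request. The request matrix encoding (requests[i][j] is truthy when input i wants output j) and the function names are assumptions.

```python
def first_request(bits):
    """Dumb arbiter: index of the first asserted request, or None."""
    for i, bit in enumerate(bits):
        if bit:
            return i
    return None


def separable_input_first(requests):
    """Two arbitration stages: one per input, then one per output."""
    n_in = len(requests)
    n_out = len(requests[0]) if requests else 0
    # Stage 1: every input narrows its requests down to a single output.
    chosen = [first_request(row) for row in requests]
    # Stage 2: every output picks one of the inputs that chose it.
    grants = []
    for j in range(n_out):
        winner = first_request([chosen[i] == j for i in range(n_in)])
        if winner is not None:
            grants.append((winner, j))
    return grants


if __name__ == "__main__":
    requests = [[1, 1, 0],
                [1, 0, 0],
                [0, 1, 1]]
    print(separable_input_first(requests))
```

In this example only two grants are issued, although a matching of size three exists (input 0 to output 1, input 1 to output 0, input 2 to output 2). That loss is the price of approximating allocation with two independent stages of arbitration.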

37 Switches. The fabric that directs flits from an input port to an output port. Design issues: number of input and output ports, and speedups. Speedup: the ratio of the total input bandwidth to the network's ideal capacity (the best throughput). Tradeoff between cost (delay, area, power) and performance (throughput). Tradeoff between leaving it up to allocation or simplifying the job for allocators.
Crossbar switches. (Figure: input speedup = 1; input speedup = 2.)

38 Effect of input speedup With a random allocator Throughput is the fraction of capacity Several flit buffer organizations Central Simple logical view There are actually two switches: MUX in and demux out Problems: bandwidth and latency Separate memory per input port Virtual channels associated with a physical channel can share buffer 38

39 Virtual Channel (VC) Buffer Organization One buffer per VC Allows switches to access multiple VC associated with one PC, but leads to poor memory utilization. Approximations: A small amount of output ports on a single buffer Divide VCs among buffers Memory Interleaving! Case Study: Alpha router 39

40 Alpha router Torus Virtual cut-through (316 packet buffers) Adaptive routing: prefer to continue in the same dimension Deadlock avoidance Coherence: Requests may fill up buffers, stalling acks (Solution: Virtual channel class, order) Network: Escape virtual channel Router microarchitecture 40

41 Router microarchitecture Network Interface How a processor sends data to the network Shared memory cache-coherent multiprocessors Interfaces caches with networks Message-passing multiprocessors Interfaces processor pipeline with networks Dedicated register (or two registers) Register map Memory map Virtual memory map I/O interrupt + DMA 41

42 Cache-coherent SMP processor-network interface Highly optimized interface: from load/ store to messages in a few cycles Request is placed in memory request register Tag: how to handle the reply, e.g., store the data in R24 Type: cacheable or not; read or write Cache hit: place in reply register right away Cache miss: enter miss status holding register (MSHR) Use this to merge reads/writes as well Number of MSHRs == number of pending memory references (4 to 32) Cache-coherent SMP memory-network interface Messages from the network initialize transaction status holding register (TSHR) Messages may be queued TSHR tracks the status of pending memory operations Example: For a non-cacheable read, the TSHR status changes: Read pending (waiting for bank) Bank activated (waiting for data) Read complete (preparing message) Idle (the reply message sent) 42

43 Message-passing multiprocessors: Dedicated register. Send: move a value to the network out register; a special MOV instruction for the last word terminates the packet. Read: block on the register until a packet arrives, or test the register and retry later. Pros: fast. Cons: long messages (the processor becomes a DMA engine!); security (hold the register forever).
Register map. Send a message atomically from a subset of the processor's general purpose registers. Cons: long messages have to be segmented; pressure on general purpose registers; processors are still DMA engines.

44 I/O interface Most common interface today, in PCs, Clusters of workstations (e.g. Infiniband, Myrinet, PCI) Software-level messaging: Interrupt triggers handler Handler sets up DMA DMA engine constructs packets from memory and sends out to network Physical-memory-mapped or virtual-memory-mapped Case Study: Princeton SHRIMP Where: I/O bus How: Virtual memory map 44

45 Virtual memory mapping Map_network(My_virtual_addr_range,Your_virtual_addr_range) Each virtual page -> local physical page -> remote physical page -> remote virtual address Store to these virtual addresses => network Virtual memory map (SHRIMP) 45

46 Case Study: M-Machine Multicomputer. Experimental multicomputer built at MIT and Stanford. 2-D torus. Multi-ALU processor (MAP) chip.