Parallel Architectures Group (Grupo de Arquitecturas Paralelas, GAP)


Switching Techniques, Adaptive Routing and Deadlock Handling in Interconnection Networks

Jose Duato
Dept. de Ingeniería de Sistemas, Computadores y Automática
Universidad Politécnica de Valencia, Spain

Outline
- Introduction
- Switching techniques
- Optimized switching techniques
- Deadlock handling
- Theory of deadlock avoidance
- Design methodologies
- Application to deadlock recovery
- Application to networks of workstations
- Performance evaluation


Introduction (from W. J. Dally)
- The performance of most digital systems today is limited by their communication or interconnection, not by their logic or memory.
- Most of the power is used to drive wires, and most of the clock cycle is spent on wire delay, not gate delay.
- As technology improves, pin density and wiring density are scaling at a slower rate than the components themselves. Also, the frequency of communication between components is lagging far behind the clock rates of modern processors.
- These factors combine to make interconnection the key factor in the success of future digital systems.

Introduction (from W. J. Dally)
- As designers strive to make more efficient use of scarce interconnection bandwidth, interconnection networks are emerging as a nearly universal solution to the system-level communication problems of modern digital systems.
- Originally developed for the demanding communication requirements of multicomputers, interconnection networks are beginning to replace buses as the standard system-level interconnect.
- Interconnection networks are also replacing dedicated wiring in special-purpose systems as designers discover that routing packets is both faster and more economical than routing wires.

Introduction
Interconnection networks are currently being used for many different applications, ranging from internal buses in VLSI circuits to wide area computer networks. These applications include:
- System area networks
- Telephone switches
- Internal networks for ATM switches
- Processor/memory interconnects for vector supercomputers
- Interconnects for multicomputers
- Interconnects for distributed shared-memory multiprocessors
- Clusters of workstations
- Local area networks
- Metropolitan area networks
- Wide area networks
(The last three are collectively referred to as computer networks.)

Introduction
- Parallel computers should be designed using commodity components to be cost-effective.
- Unfortunately, commodity communication subsystems have been designed to meet a different set of requirements, i.e., those arising in computer networks. Most manufacturers designed custom interconnection networks.
- Designing high-performance interconnection networks becomes a critical issue to exploit the performance of parallel computers.
- Recently, several high-performance switches have been developed to build inexpensive parallel computers by connecting cost-effective computers through those switches.

Main design parameters
- Topology: defines how the nodes are interconnected by channels. Direct networks, switch-based networks.
- Routing algorithm: determines the path selected by a message to reach its destination. Deterministic routing, adaptive routing.
- Switching technique: determines how and when buffers are reserved and switches are configured. Packet switching, circuit switching, wormhole, virtual cut-through.

Classification of interconnection networks
- Shared-Medium Networks
  - Local Area Networks
    - Contention Bus (Ethernet)
    - Token Bus (Arcnet)
    - Token Ring (FDDI Ring, IBM Token Ring)
  - Backplane Bus (Sun Gigaplane, DEC AlphaServer8X00, SGI PowerPath-2)
- Direct Networks (Router-Based Networks)
  - Strictly Orthogonal Topologies
    - Mesh: 2-D mesh (Intel Paragon), 3-D mesh (MIT J-Machine)
    - Torus (k-ary n-cube): 1-D unidirectional torus or ring (KSR first-level ring), 2-D bidirectional torus (Intel/CMU iWarp), 3-D bidirectional torus (Cray T3D, Cray T3E)
    - Hypercube (Intel iPSC, nCUBE)
  - Other Topologies: trees, cube-connected cycles, de Bruijn, star graphs, etc.
- Indirect Networks (Switch-Based Networks)
  - Regular Topologies
    - Crossbar (Cray X/Y-MP, DEC GIGAswitch, Myrinet)
    - Multistage Interconnection Networks
      - Blocking networks: unidirectional MIN (NEC Cenju-3, IBM RP3), bidirectional MIN (IBM SP, TMC CM-5)
      - Nonblocking networks: Clos network
  - Irregular Topologies (DEC Autonet, Myrinet, ServerNet)
- Hybrid Networks
  - Multiple-Backplane Buses (Sun XDBus)
  - Hierarchical Networks (bridged LANs, KSR)
  - Cluster-Based Networks (Stanford DASH, HP/Convex Exemplar)
  - Other Hypergraph Topologies: hyperbuses, hypermeshes, etc.

[Figure] Direct networks: (a) 2-ary 4-cube (hypercube); (b) 3-ary 2-cube; (c) 3-ary 3-D mesh.

[Figure] Multistage interconnection networks: a multistage butterfly network and an Omega network, each connecting 16 nodes (0000-1111).

[Figure] Switch-based irregular topologies: switches interconnected by bidirectional links, with processing elements attached, and the corresponding graph representation.

[Figure] Generalized MIN model: N input ports and M output ports connected through g switch stages G_0 ... G_{g-1} interleaved with g+1 connection stages C_0 ... C_g.

Unified view
Some manufacturers developed switches that are suitable to implement either direct or indirect networks (Inmos C104, SGI SPIDER).
We can view networks using point-to-point links as a set of interconnected switches, each one connected to zero, one, or more nodes:
- Direct networks correspond to the case where every switch is connected to a single node.
- Crossbar networks correspond to the case where there is a single switch connected to all the nodes.
- Multistage interconnection networks correspond to the case where switches are arranged into several stages and the switches in intermediate stages are not connected to any processor.

[Figure] Router organization: input and output channels with link controllers (LC), injection and ejection channels, a switch, and a routing & arbitration unit.

Switching
- Switching: determines how and when buffers are reserved and switches are configured.
- Flow control: synchronization protocol for transmitting and receiving a unit of information.
- Unit of flow control: portion of the message whose transfer must be synchronized.
- Flow control occurs at two levels: message flow control and physical channel flow control.

[Figure] Packet switching and circuit switching: time-space diagrams (channel vs. time) for each technique.

[Figure] Virtual cut-through and wormhole switching: time-space diagram (channel vs. time) showing header (H), data (D), and tail (T) flits pipelined across channels.

[Figure] Virtual channels: per-virtual-channel flit buffers at each end of the link, with a channel multiplexor and demultiplexor sharing one physical channel between switches.

Performance of switching techniques
- Packet switching is well suited for very short messages.
- Circuit switching is well suited for very long messages.
- Virtual cut-through switching is well suited for messages of any length, but requires splitting messages into fixed-size packets.
- Wormhole switching is well suited for messages of any length, but saturates at moderate loads. Virtual channels alleviate this situation.
- Wormhole switching has been preferred for electronic routers because buffers can be small and the resulting circuits are compact and fast.
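The suitability claims above follow from simple no-load latency models. As a sketch (textbook-style approximations, not formulas from these slides; all parameter values are illustrative):

```python
# No-load latency models for three switching techniques (approximations).
# D: distance in hops, L: message length (bits), L_h: header length (bits),
# B: channel bandwidth (bits/cycle), t_r: per-hop routing delay (cycles).

def store_and_forward(D, L, B, t_r):
    # The whole packet is stored at every intermediate hop.
    return D * (t_r + L / B)

def cut_through(D, L, L_h, B, t_r):
    # Only the header pays the per-hop cost; the body is pipelined behind it
    # (same no-load latency for virtual cut-through and wormhole).
    return D * (t_r + L_h / B) + L / B

def circuit_switching(D, L, L_h, B, t_r):
    # Round-trip set-up (probe plus acknowledgment), then the message streams.
    setup = 2 * D * (t_r + L_h / B)
    return setup + L / B

# Short messages favor packet/cut-through switching; very long messages
# amortize circuit switching's set-up cost.
for L in (128, 1_000_000):
    print(L,
          store_and_forward(D=10, L=L, B=1.0, t_r=20),
          cut_through(D=10, L=L, L_h=64, B=1.0, t_r=20),
          circuit_switching(D=10, L=L, L_h=64, B=1.0, t_r=20))
```

Under these assumptions, cut-through pipelining removes the per-hop cost of the message body, and circuit switching's fixed set-up cost becomes negligible as messages grow, matching the rules of thumb above.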

Optimized switching techniques
- Traffic from real applications may be bimodal and may vary over time.
- Wormhole switching can be used for short messages.
- Circuit switching can be used for very long messages.
- Path set-up can be overlapped with useful computation, and/or circuits can be reused.
- Physical circuits do not need buffers at intermediate routers and can be made much faster than conventional links, either by using wave pipelining or optical technology.

[Figure] Optimized router organization: pipelined input and output channels with synchronizers, switches S_0 ... S_{k-1}, a wormhole control unit and a PCS control unit, with multiplexors onto the output channels and a connection to the local processor.

[Figures] Performance for multimedia applications: average latency (cycles) vs. traffic (flits/node/cycle).
- CS 28+4 vs. WSNR 16+16 vs. WH, with 10% short messages (16 flits) and 90% long messages (1024 flits), for link speeds of CLK x2, CLK x3, and CLK x4.
- WS 16+16 vs. WH with 2 and 3 virtual channels: latency of the long messages only (1024 flits, 90%) and of the short messages only (16 flits, 10%).
- Effect of the number of time slots (1 to 32) with 256 Gbps link bandwidth, for a 10%/16-flit plus 90%/1024-flit mix and a 40%/16-flit plus 60%/1024-flit mix.

Routing algorithms (taxonomy)
- Number of destinations: unicast routing, multicast routing
- Routing decisions: centralized routing, source routing, distributed routing, multiphase routing
- Implementation: table lookup, finite-state machine
- Adaptivity: deterministic routing, adaptive routing
- Progressiveness: progressive, backtracking
- Minimality: profitable, misrouting
- Number of paths: complete, partial

Undeliverable packets
Situations that may prevent packet delivery, and how they are handled:
- Deadlock: prevention, avoidance, recovery
- Livelock: minimal paths, restricted nonminimal paths, probabilistic avoidance
- Starvation: resource assignment scheme

Deadlock handling
- Deadlock prevention: backtracking
- Deadlock avoidance: acyclic graph, acyclic subgraph
- Regressive deadlock recovery: message removal, message abortion
- Progressive deadlock recovery: Disha
Main goal: design of efficient deadlock-free fully adaptive routing algorithms.

Deadlocked configuration
[Figure] Four nodes N0-N3 with messages occupying the channels between them in a cycle.
Messages wait for resources held by other messages in a cyclic way. Removing the cyclic dependencies will avoid deadlock.

Allowing cyclic dependencies
Example for the unidirectional ring: the c_Ai channels can be used to forward messages to all the destinations. The c_Hi channels can only be used if the destination is higher than the current node.
There exist cyclic dependencies between the c_Ai channels. However, the c_Hi channels have no cyclic dependencies.
There is no deadlock because messages waiting for resources can always escape by using the c_Hi channels.
[Figure] Four-node ring (n0-n3) with channels c_A0 ... c_A3 and c_H0 ... c_H2.
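The ring example can be rendered as a routing function. This is a minimal sketch (the function name and channel encoding are mine, not the slides'), following the rule above: c_Ai usable for any destination, c_Hi only when the destination is higher than the current node:

```python
# Routing function for a unidirectional n-node ring with two channels per hop:
#   ("cA", i): adaptive channel from node i to (i+1) % n, usable for any destination
#   ("cH", i): escape channel from node i to (i+1) % n, usable only if dest > current
# The cA channels form a cycle of dependencies; the cH channels do not, because a
# message at node i with dest > i never needs the wrap-around escape channel.

def routing_function(current, dest, n):
    """Return the set of channels a message may take at `current`."""
    if current == dest:
        return set()                      # deliver locally
    channels = {("cA", current)}          # adaptive channel: always allowed
    if dest > current:
        channels.add(("cH", current))     # escape channel: only toward higher nodes
    return channels

print(routing_function(1, 3, 4))  # both the adaptive and the escape channel
print(routing_function(3, 0, 4))  # only the adaptive (wrap-around) channel
```

A message blocked on the adaptive channels can always fall back to the escape channels until delivery, which is why the cycle on the cA channels cannot deadlock.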

Theory of deadlock avoidance (informal)
[Figure] An interconnection network.

[Figure] Adaptive routing function and selection function: given the current node n_c and the destination node n_d, the routing function supplies a set of candidate output channels, and the selection function chooses one of them.

Routing subfunction
Network channels can be split into two subsets: adaptive and escape channels.
The routing function will be referred to as a routing subfunction when restricted to escape channels.

Approach to avoid deadlock
An adaptive routing function may allow cyclic dependencies between channels as long as:
- There exists a subset of channels (escape channels) that have no cyclic dependencies between them.
- It is possible to establish a path from the current node to the destination node using only escape channels.
- For wormhole switching, when a message reserves an escape channel and then an adaptive channel, it must still be able to select an escape channel at the current node, i.e., escape channels should have no cyclic dependencies indirectly through adaptive channels.

Deadlock produced by indirect dependencies
A set of messages are cyclically waiting for channels occupied by other messages in the set.
Some messages are able to use escape channels but reach another cycle. Messages using escape channels are cyclically waiting indirectly through adaptive channels.
=> There is a deadlock.

Design methodology
- Based on the extension of other routing functions.
- Allows the use of all the alternative minimal paths.
- Does not increase the number of physical channels.
- Provides a way to:
  - Extend the network topology and the routing function.
  - Guarantee the absence of deadlocks.

Design methodology
Steps:
1. Given an interconnection network I1, define a minimal-path deadlock-free routing function R1.
2. Split each physical channel into a set of additional virtual channels. The new routing function can use any of the new channels belonging to a minimal path or, alternatively, the channels supplied by R1.
3. Verify that the extended channel dependency graph for R1 is acyclic. If it is, the routing algorithm is valid. Otherwise, it must be discarded. This step is not required for store-and-forward and virtual cut-through.
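Step 3's verification amounts to cycle detection on a dependency graph. A sketch (a generic depth-first search, not the slides' specific construction of the extended graph; the example graphs are illustrative):

```python
# Verify that a channel dependency graph is acyclic (step 3 of the methodology).
# `deps` maps each channel to the channels it may depend on, i.e. the channels
# a message holding it could request next under the routing (sub)function.

def is_acyclic(deps):
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {c: WHITE for c in deps}
    def dfs(c):
        color[c] = GRAY
        for nxt in deps.get(c, ()):
            if color.get(nxt, WHITE) == GRAY:       # back edge: cycle found
                return False
            if color.get(nxt, WHITE) == WHITE and not dfs(nxt):
                return False
        color[c] = BLACK
        return True
    return all(dfs(c) for c in deps if color[c] == WHITE)

# A chain of dependencies is acyclic, so the routing algorithm would be valid...
assert is_acyclic({"b0": ["b1"], "b1": ["b2"], "b2": []})
# ...while a ring routed only through wrap-around channels is not.
assert not is_acyclic({"c0": ["c1"], "c1": ["c2"], "c2": ["c0"]})
```

The real check operates on the extended channel dependency graph, which also includes the dependencies that R1's channels acquire through the new adaptive channels.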

Design example
Routing algorithm for n-dimensional meshes:
- Basic algorithm: dimension-order routing.
- Step 2: split each physical channel c_i into k virtual channels a_{i,1}, a_{i,2}, ..., a_{i,k-1}, b_i.
- New algorithm: route over any minimal path using any of the a channels. Alternatively, route over the lowest useful dimension using the corresponding b channel.
The MIT Reliable Router uses two virtual channels for fully adaptive minimal routing and two virtual channels for dimension-order routing in the absence of faults (on a 2-D mesh).
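For a 2-D mesh, the resulting routing function can be sketched as follows (the channel naming and coordinate encoding are mine, not the slides'):

```python
# Fully adaptive routing on a 2-D mesh following the design example: each
# physical channel is split into an adaptive channel "a" and an escape
# channel "b"; the b channels follow dimension-order (X-first) routing.

def route(current, dest):
    """Return candidate (channel_class, direction) pairs at `current`."""
    (cx, cy), (dx, dy) = current, dest
    candidates = []
    # Adaptive a-channels: any minimal direction may be taken.
    if dx != cx:
        candidates.append(("a", "+x" if dx > cx else "-x"))
    if dy != cy:
        candidates.append(("a", "+y" if dy > cy else "-y"))
    # Escape b-channel: only the lowest useful dimension (dimension order).
    if dx != cx:
        candidates.append(("b", "+x" if dx > cx else "-x"))
    elif dy != cy:
        candidates.append(("b", "+y" if dy > cy else "-y"))
    return candidates

print(route((1, 1), (3, 0)))
# adaptive channels toward +x and -y, plus the escape channel on x
```

Because the b channels alone implement deadlock-free dimension-order routing, they serve as the escape subset required by the theory, while the a channels provide full adaptivity on minimal paths.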

[Figure] Example routing paths for a 4x4 2-D mesh (nodes 0-15), showing a source node, a destination node, and the channels supplied by R1.

[Figure] Extended channel dependency graph for R1, showing the dependencies among the b channels (b01, b10, b12, b21, ...).

[Figures] Performance evaluation on meshes and tori: average latency vs. accepted traffic (message length: 16 flits).
- 2-D mesh (256 processors) and 3-D mesh (512 processors), random traffic: deterministic routing (1 and 2 virtual channels) vs. adaptive routing (2 virtual channels); latency in cycles vs. normalized accepted traffic.
- 2-D torus (256 processors) and 3-D torus (512 processors): deterministic (2 vc) vs. partially adaptive (2 and 3 vc) vs. adaptive (3 vc) routing, under random, local, and bit-reversal traffic.
- Accurate evaluation for the 3-D torus (512 processors): average latency in ns vs. traffic in flits/node/us, under random, local, and bit-reversal traffic.

Application to deadlock recovery
- Routing resources (channels or buffers) are split into two classes: adaptive and escape.
- Adaptive resources can be freely used by all the packets.
- When a packet has been waiting for longer than a timeout, it moves to an escape resource.
- Once a packet uses an escape resource, it cannot use an adaptive resource again.
- This routing scheme eliminates all the indirect dependencies between adaptive and escape resources.
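The timeout rule above can be sketched behaviorally (real routers implement this in hardware; the class and function names are mine):

```python
# Timeout-driven deadlock recovery: a packet uses adaptive resources freely,
# but after waiting longer than `timeout` cycles it switches, irreversibly,
# to the escape resources.

class Packet:
    def __init__(self):
        self.waiting_cycles = 0
        self.on_escape = False           # once True, never goes back

def advance(pkt, blocked, timeout):
    """Advance one cycle; return which resource class the packet may use."""
    if blocked:
        pkt.waiting_cycles += 1
        if pkt.waiting_cycles > timeout:
            pkt.on_escape = True         # suspected deadlock: move to escape
    else:
        pkt.waiting_cycles = 0           # progress resets the counter
    return "escape" if pkt.on_escape else "adaptive"

p = Packet()
for _ in range(3):
    state = advance(p, blocked=True, timeout=2)
print(state)                             # blocked past the timeout: "escape"
```

Forbidding the return from escape to adaptive resources is what eliminates the indirect dependencies between the two classes.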

[Figure] Router organization for Disha: input and output channels with link controllers (LC), injection and ejection channels, and a switch, with the routing and arbitration unit extended by a central deadlock buffer.

Routing on edge and deadlock buffers
- Edge buffers allow fully adaptive minimal routing.
- Deadlock buffers can only be used in increasing label order.
- When a deadlock is detected, the packet header can be routed to the deadlock buffer.
- Escape channels are defined so that the routing subfunction is able to deliver messages for any destination (including deadlock buffers).
[Figure] 4x4 mesh with nodes labeled in a Hamiltonian order: 0 1 2 3 / 7 6 5 4 / 8 9 10 11 / 15 14 13 12.

[Figure] Extended channel dependency graph for edge buffers: nodes n0-n8 with channel dependencies c10, c21, c32, c41, c50, c65, c74, c83.

[Figure] Performance evaluation: average latency (cycles) vs. normalized accepted traffic for deadlock avoidance with deterministic routing (2 VC), recovery with deterministic routing (2 VC), avoidance with adaptive routing (3 VC), and recovery with adaptive routing (3 VC).

Injection limitation
- Prevents performance degradation at saturation.
- Reduces the frequency of deadlock occurrence to negligible values.
[Figure] A counter tracks the number of busy output channels (incremented on reserve, decremented on release); a comparator against a threshold decides whether injection is permitted.
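The mechanism in the figure amounts to a counter and a comparator. A behavioral sketch (the class name and threshold value are illustrative):

```python
# Injection limitation: count the busy output channels and permit injection
# only while that count stays below a threshold.

class InjectionLimiter:
    def __init__(self, threshold):
        self.threshold = threshold
        self.busy = 0                    # counter of busy output channels

    def reserve(self):                   # an output channel becomes busy
        self.busy += 1

    def release(self):                   # an output channel is freed
        self.busy -= 1

    def injection_permitted(self):
        return self.busy < self.threshold

limiter = InjectionLimiter(threshold=2)
limiter.reserve()
print(limiter.injection_permitted())     # one busy channel: still below threshold
limiter.reserve()
print(limiter.injection_permitted())     # at the threshold: injection blocked
```

Keeping some output channels free at all times is what makes deadlocked configurations, and hence recoveries, rare.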

[Figure] Improved injection limitation mechanism: each physical channel provides virtual channels V0 ... Vn-1; a translation table indexed by message number and a bitwise OR of the busy bits feed a counter of busy output channels, which a comparator checks against a threshold to decide whether injection is permitted.

[Figure] Improved deadlock detection mechanism: each input channel has a counter compared against a threshold to flag messages suspected of being deadlocked.

Application to networks of workstations
Networks of workstations are emerging as a cost-effective alternative to parallel computers.
Switch-based interconnects like Autonet, Myrinet and ServerNet have been proposed to build networks of workstations with irregular topology. The irregularity provides:
- Wiring flexibility
- Scalability
- Incremental expansion capability

Drawback: the irregularity makes deadlock avoidance and routing quite complicated.
Simplest solution: avoid deadlock by eliminating all the cyclic dependencies between channels.
=> Many messages are routed following non-minimal paths:
- Higher message latency
- Waste of resources
- Lower throughput
Alternative solution: allow cyclic dependencies between channels.
- Reduces contention by increasing routing adaptivity
- Allows more messages to follow minimal paths

[Figure] Switch-based network with irregular topology: switches interconnected by bidirectional links, with processing elements attached, and the corresponding graph representation.

The Autonet routing algorithm
General characteristics:
- Deadlock-free routing scheme (up/down routing)
- Provides partially adaptive communication between nodes
- Distributed
- Implemented using table lookup

The up/down routing algorithm
Routing is based on an assignment of a direction to the operational links.
Routing rule: a legal route must traverse zero or more links in the "up" direction followed by zero or more links in the "down" direction.
Each cycle has at least one link in the "up" direction and one link in the "down" direction. Cyclic dependencies are avoided: messages cannot cross a link in the "up" direction after one in the "down" direction.
[Figure] Eight-switch example network (switches 0-7) with the "up" direction of each link marked.
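The routing rule can be checked mechanically. A sketch (the toy link labeling below is mine; in Autonet the "up" directions come from a spanning tree rooted at one switch):

```python
# Up/down routing: a route is legal iff it never takes an "up" link after
# a "down" link. `up` maps each directed link (a, b) to True when a -> b
# goes in the "up" direction.

def is_legal_route(path, up):
    """path: list of switches; up: dict[(a, b)] -> bool."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if up[(a, b)]:
            if gone_down:
                return False    # "up" after "down": forbidden
        else:
            gone_down = True
    return True

# Toy labeling on a fully connected 4-switch network:
# links toward lower-numbered switches go "up".
up = {(a, b): b < a for a in range(4) for b in range(4) if a != b}
print(is_legal_route([3, 0, 2], up))    # up then down: legal
print(is_legal_route([0, 2, 1], up))    # down then up: illegal
```

Since every "up" edge points toward the root in the real assignment, no cycle can consist of "up"-then-"down" routes only, which is what makes the rule deadlock-free.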

Routing efficiency
- From 7 to 0: OK.
- From 2 to 5: lack of adaptivity.
- From 4 to 1: non-minimal routing.
The basic routing rule prevents minimal routing and adaptivity in most cases because of "down"-to-"up" conflicts. The probability of non-minimal routing increases with network size.
[Figure] The same eight-switch example network with the "up" directions marked.

A design methodology for adaptive routing algorithms

  interconnection network + deadlock-free routing function
      | methodology: physical channels duplicated,
      | or split into two virtual channels
      v
  extended interconnection network (original and new channels)
      + extended routing function

Extended routing function
- Newly injected messages can use the new channels without any restriction. For performance reasons, only minimal paths are allowed.
- Original channels are used exactly in the same way as in the original routing function.
- Once a message reserves one of the original channels, it cannot use any of the new channels again.
- When the routing table provides both kinds of channels, give preference to the new channels.
- The extended routing function is deadlock-free.

Improving the efficiency of the methodology
Idea: focus on minimal routing, even if adaptivity is reduced. Restrict the transition from new channels to original channels.
Improved adaptive routing function:
- Newly injected messages can only use new channels.
- At intermediate switches, a higher priority is assigned to the new channels belonging to minimal paths.
- If all the new channels are busy, then an original channel belonging to a minimal path (if any) is selected.
- If none exists, then the one that provides the shortest path is used (this ensures deadlock freedom).
- Once a message reserves an original channel, it can no longer reserve a new one.
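The selection order above can be sketched as follows (the candidate-tuple representation is mine; `minimal` and `busy` would come from the routing tables and channel status, and candidates are assumed ordered by increasing path length):

```python
# Improved adaptive routing: channel selection priority at an intermediate
# switch. Each candidate is (kind, minimal, busy); kind is "new" or "original".

def select_channel(candidates, holds_original):
    # Once a message holds an original channel, new channels are off limits.
    if holds_original:
        candidates = [c for c in candidates if c[0] == "original"]
    free = [c for c in candidates if not c[2]]
    # 1) free new channels on minimal paths have the highest priority
    for c in free:
        if c[0] == "new" and c[1]:
            return c
    # 2) then free original channels on minimal paths
    for c in free:
        if c[0] == "original" and c[1]:
            return c
    # 3) otherwise, the first free original channel, i.e. the one providing
    #    the shortest path (deadlock freedom comes from the original function)
    originals = [c for c in free if c[0] == "original"]
    return originals[0] if originals else None   # None: block and wait

print(select_channel([("new", True, True), ("original", True, False)], False))
# the new channel is busy, so the original minimal channel is selected
```

Note that step 3 never returns a new channel: the fall-back path always lands on the original, deadlock-free routing function.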

Performance evaluation
Evaluation of four routing schemes:
- Basic up/down routing scheme (UD)
- Up/down routing scheme using two virtual channels per physical channel (UD-2VC)
- Adaptive routing scheme using two virtual channels per physical channel (A-2VC)
- Improved adaptive routing scheme using two virtual channels per physical channel (MA-2VC)
Performance evaluation carried out by simulation.

Network model:
- Topology generated randomly (8-port switches)
- 4 nodes (processors) connected to each switch
- Two adjacent switches are connected by a single link
- Message destination is randomly chosen among the nodes
- One routing control unit per switch (assigned in a round-robin fashion)
- It takes one clock cycle to compute the routing algorithm, to transfer one flit from an input buffer to an output buffer, or to transfer one flit across a physical channel

[Figures] Simulation results (I-V): average latency (cycles) vs. traffic (flits/cycle/node) for UD, UD-2VC, A-2VC, and MA-2VC:
- 16, 32, and 64 switches with 16-flit messages
- 64 switches with 64-flit and 256-flit messages

[Figure] Simulation results for application traces: number of messages injected over time, from traces of Barnes-Hut executed on 64 processors.

[Figure] Message latency (cycles) over time for MA-2VC, UD-2VC, and UD on the Barnes-Hut traces.

[Figures] Zoom of the first, second, and third latency peaks of the trace-driven simulation: latency (cycles) over time for MA-2VC, UD-2VC, and UD.

Final Remarks
- Hybrid switching techniques may considerably increase performance by using the appropriate switching technique for each message class.
- Circuit switching can take advantage of wave pipelining and optical technology to increase link bandwidth.
- Flexible deadlock avoidance and recovery schemes allow the design of more efficient routing algorithms.
- These routing algorithms have been implemented in the MIT Reliable Router and the Cray T3E.
- Adaptive routing and virtual channels are especially interesting when applications produce bursty traffic that saturates the network during some time intervals (usually prior to synchronization points).
- Adaptive routing and virtual channels must be implemented efficiently to minimize the increment in clock cycle time.
