Interconnection Network

Recap: Generic Parallel Architecture
A generic modern multiprocessor
[Figure: nodes, each with a processor (P), cache ($), memory (Mem), and communication assist (CA), attached to a scalable network]
- Node: processor(s), memory system, plus communication assist
  - Network interface and communication controller
- Scalable network
- Convergence allows lots of innovation, now within a framework
  - Integration of the assist with the node, what operations, how efficiently...

Interconnection Networks
Static vs. dynamic networks
- Static: built out of point-to-point communication links between processors (aka direct networks)
  - Usually associated with message-passing architectures
  - Examples: completely connected, star-connected, linear array, ring, mesh, hypercube
- Dynamic: built out of links and switches (aka indirect networks)
  - Usually associated with shared-address-space architectures
  - Examples: crossbar, bus-based, multistage

Goal: Bandwidth and Latency
[Figure: two plots. Left: latency vs. delivered bandwidth, with latency rising sharply near saturation. Right: delivered bandwidth vs. offered bandwidth, leveling off at saturation.]
- Latencies remain reasonable only as long as the application's bandwidth requirement is much smaller than the bandwidth available on the machine

Topological Properties
- Degree: number of links incident on a node
- Diameter: maximum routing distance
- Bisection bandwidth: sum of the bandwidths of the smallest set of links that partition the network into two halves
- Routing distance: number of links on a route
- Average distance: average routing distance over all pairs of nodes
- Scalability: the ability to be modularly expandable with scalable performance
- Partitionable: whether a subgraph keeps the same properties
- Symmetric: uniform traffic vs. hot-spot
- Fault tolerance

Static Interconnection Networks
- Completely connected networks
- Star-connected networks
- Linear array
- Mesh
- Hypercube network

Linear Arrays and Rings
[Figure: linear array; torus (ring); torus arranged to use short wires]
- Linear array
  - Diameter of an array of n nodes?
  - Average distance?
  - Bisection bandwidth?
- Node labeling and routing algorithm:
  - For linear arrays: next route(myid, src, dest) {...}
  - For bidirectional rings: ??
- Examples: FDDI, SCI, Fibre Channel Arbitrated Loop

2-D Meshes and Tori
- Examples: Intel Paragon (2D mesh), Cray T3D (3D torus)
- For a 2D mesh of size n x n
  - Diameter?
  - Bisection bandwidth?
  - Average distance?
- X-Y routing: label each node with a pair of integers (i, j). A message is routed first along the X dimension until it reaches the column of the destination node, and then along the Y dimension until it reaches its destination
  - nid route(myid, src, dest) {...}
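The slides leave the bodies of the route(...) functions elided. One possible way to fill them in is sketched below in Python. The function names, the 0..n-1 node labeling, and the (x, y) coordinate convention are assumptions made for illustration, and the src argument is dropped because a deterministic next-hop decision only needs the current node and the destination.

```python
def route_linear(myid, dest):
    """Next node on a linear array with nodes labeled 0..n-1 (None on arrival)."""
    if myid == dest:
        return None
    return myid + 1 if dest > myid else myid - 1


def route_ring(myid, dest, n):
    """Next node on a bidirectional ring of n nodes, going the shorter way around."""
    if myid == dest:
        return None
    clockwise = (dest - myid) % n            # hops needed if labels keep increasing
    step = 1 if clockwise <= n - clockwise else -1
    return (myid + step) % n


def route_xy(myid, dest):
    """Next node under X-Y routing on a 2D mesh; nodes are (x, y) pairs."""
    (x, y), (tx, ty) = myid, dest
    if x != tx:                              # first correct the X coordinate
        return (x + (1 if tx > x else -1), y)
    if y != ty:                              # then correct the Y coordinate
        return (x, y + (1 if ty > y else -1))
    return None


def path(start, dest, step):
    """Follow a next-hop function until it reports arrival."""
    hops, node = [start], start
    while (node := step(node, dest)) is not None:
        hops.append(node)
    return hops


if __name__ == "__main__":
    print(path(1, 4, route_linear))
    print(path(0, 6, lambda a, b: route_ring(a, b, 8)))
    print(path((0, 0), (2, 1), route_xy))
```

On the ring, ties between the two directions are broken toward increasing labels; that choice is arbitrary.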

Multidimensional Meshes and Tori
[Figure: 2D mesh, 2D torus, 3D cube]
- d-dimensional array
  - N = k_(d-1) x ... x k_0 nodes
  - each node described by a d-vector of coordinates (i_(d-1), ..., i_0)
- d-dimensional k-ary mesh: N = k^d
  - k = N^(1/d)
  - each node described by a d-vector of radix-k coordinates
- d-dimensional k-ary torus (or k-ary d-cube)?

Hypercubes
- Also called binary n-cubes. Number of nodes: N = 2^n
- Degree: n = log N
- Distance: O(log N) hops
- Good bisection bandwidth
- Examples: SGI Origin 2000
[Figure: hypercubes of dimension 0-D, 1-D, 2-D, 3-D, 4-D, 5-D]
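The radix-k coordinate description of a d-dimensional k-ary mesh can be made concrete with a short Python sketch. It assumes nodes are numbered 0..k^d - 1 and that coordinates are stored least significant digit first; the helper names are hypothetical. The torus flag adds the wraparound links that distinguish a k-ary torus from a mesh.

```python
def to_coords(node, k, d):
    """Radix-k digits of node, least significant first: [i_0, ..., i_(d-1)]."""
    coords = []
    for _ in range(d):
        node, digit = divmod(node, k)
        coords.append(digit)
    return coords


def from_coords(coords, k):
    """Inverse of to_coords: rebuild the node number from its digits."""
    node = 0
    for digit in reversed(coords):
        node = node * k + digit
    return node


def neighbors(node, k, d, torus=False):
    """Nodes one hop away; with torus=True each dimension wraps around."""
    coords = to_coords(node, k, d)
    result = set()
    for dim in range(d):
        for delta in (-1, 1):
            c = coords[dim] + delta
            if torus:
                c %= k
            elif not 0 <= c < k:
                continue                     # off the edge of the mesh
            moved = coords.copy()
            moved[dim] = c
            result.add(from_coords(moved, k))
    return sorted(result)


if __name__ == "__main__":
    print(to_coords(11, 4, 2))
    print(neighbors(0, 4, 2))
    print(neighbors(0, 4, 2, torus=True))
```

A corner node of the mesh has only d neighbors, while every torus node has 2d (for k > 2).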

Hypercube Labeling and Routing
- Binary representation
- Distance? xor operation
- E-cube routing

E-Cube Routing
- Dimension-ordered routing (extension of XY routing)
- Deterministic and minimal message routing
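As a sketch of how the xor of two labels drives E-cube routing, the Python below corrects the differing bits one dimension at a time. Resolving the lowest dimension first is an assumption; some presentations go from the most significant bit instead. The function names are illustrative.

```python
def hamming_distance(a, b):
    """Number of bit positions in which labels a and b differ."""
    return bin(a ^ b).count("1")


def ecube_route(src, dest):
    """Nodes visited when differing bits are fixed in increasing dimension order."""
    path, node = [src], src
    diff, dim = src ^ dest, 0
    while diff:
        if diff & 1:                         # labels differ in this dimension
            node ^= 1 << dim                 # cross the link along dimension dim
            path.append(node)
        diff >>= 1
        dim += 1
    return path


if __name__ == "__main__":
    route = ecube_route(0b010, 0b111)
    print([format(v, "03b") for v in route])
    print(len(route) - 1 == hamming_distance(0b010, 0b111))
```

The route is minimal because it takes exactly one hop per differing bit, and deterministic because the dimension order is fixed.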

Hypercube Properties
- Each node is connected to d others
- A one-bit difference in labels means a direct link
- One hypercube can be partitioned into two (d-1)-dimensional hypercubes
- The Hamming distance = shortest path length
  - Hamming distance = number of bits that differ between source and dest (the binary addresses of the two nodes) = number of 1 bits in source xor dest
- Each node address contains d bits. Fixing k of these bits, the nodes that differ in the remaining (d-k) bits form a (d-k)-dimensional subcube of 2^(d-k) nodes. There are 2^k such subcubes.

K-ary d-cubes
- A k-ary d-cube is a d-dimensional mesh (with wraparound links) with k elements along each dimension
  - k is the radix, d is the dimension
  - built from k-ary (d-1)-cubes by connecting the corresponding processors into a ring
- Some of the other topologies are particular instances of k-ary d-cubes:
  - A ring of n nodes is an n-ary 1-cube
  - A two-dimensional n x n torus is an n-ary 2-cube
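The subcube property of the hypercube can be checked by enumeration. The sketch below is illustrative only: the subcubes function and its fixed_positions argument (the bit positions held constant) are hypothetical names. It groups the 2^d labels by the values of the fixed bits.

```python
from itertools import product


def subcubes(d, fixed_positions):
    """Map each assignment of the fixed bits to the nodes of its subcube."""
    free = [b for b in range(d) if b not in fixed_positions]
    cubes = {}
    for fixed_values in product((0, 1), repeat=len(fixed_positions)):
        base = sum(v << p for v, p in zip(fixed_values, fixed_positions))
        members = [
            base | sum(v << p for v, p in zip(free_values, free))
            for free_values in product((0, 1), repeat=len(free))
        ]
        cubes[base] = sorted(members)
    return cubes


if __name__ == "__main__":
    # Fixing the top bit of a 3-cube splits it into two 2-dimensional subcubes.
    for base, members in subcubes(3, [2]).items():
        print(format(base, "03b"), members)
```

With d = 4 and k = 2 fixed bits, the call returns 2^2 = 4 groups of 2^(4-2) = 4 nodes each.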

Topology Summary

Topology       Degree      Diameter         Ave Dist       Bisection    D (D ave) @ P=1024
1D Array       2           N-1              N/3            1            huge
1D Ring        2           N/2              N/4            2
2D Mesh        4           2(N^(1/2) - 1)   2/3 N^(1/2)    N^(1/2)      62 (21)
2D Torus       4           N^(1/2)          1/2 N^(1/2)    2 N^(1/2)    32 (16)
k-ary n-cube   2n          nk/2             nk/4           nk/4         15 (7.5) @ n=3
Hypercube      n = log N   n                n/2            N/2          10 (5)

Bus-based Networks
- A bus is a shared communication link that uses one set of wires to connect multiple processing elements
  - Processor/memory bus, I/O bus
  - Processor/processor bus
- A very simple concept; its major drawback is that bandwidth does not scale up with the number of processors
  - Caches can alleviate the problem because they reduce traffic to memory
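The diameter and average-distance entries of the topology summary can be sanity-checked on small instances by brute force. The Python sketch below builds adjacency lists for a ring, a 2D mesh, and a hypercube and measures them with breadth-first search. All names are illustrative, and the average is taken over ordered pairs of distinct nodes, which is an assumption about how the table's approximate averages were derived.

```python
from collections import deque


def bfs_distances(adj, start):
    """Hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_average(adj):
    """(diameter, average distance over ordered pairs of distinct nodes)."""
    longest, total, pairs = 0, 0, 0
    for s in adj:
        dist = bfs_distances(adj, s)
        longest = max(longest, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return longest, total / pairs


def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mesh2d(side):
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [(x + dx, y + dy) for dx, dy in steps
                 if 0 <= x + dx < side and 0 <= y + dy < side]
        for x in range(side) for y in range(side)
    }


def hypercube(n):
    return {v: [v ^ (1 << b) for b in range(n)] for v in range(2 ** n)}


if __name__ == "__main__":
    for name, adj in (("ring(16)", ring(16)),
                      ("mesh 4x4", mesh2d(4)),
                      ("hypercube(4)", hypercube(4))):
        print(name, diameter_and_average(adj))
```

For these sizes the measured diameters match N/2, 2(N^(1/2) - 1), and n exactly, while the measured averages land close to, but not exactly on, the table's approximate expressions.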

Crossbar Switching Networks
- Digital analogue of a switchboard
  - allows connection of any of p processors to any of b memory banks
- Examples: Sun Ultra HPC 1000, Fujitsu VPP 500, Myrinet switch
- Main drawback: complexity grows as p^2
  - too expensive for large p
- The crossbar [bus] network is the dynamic analogue of the completely connected [star-connected] network

Multistage Interconnection Network
- A good compromise between cost and performance
  - More scalable in terms of cost than a crossbar, more scalable in terms of performance than a bus
- Popular schemes include omega and butterfly networks

Omega Network
- Perfect shuffle between stages
- log p stages, each with p/2 switches

Routing in Omega Network
- Routing algorithm
  - At each stage, look at the corresponding bit (starting with the msb) of the source and dest addresses
  - If the bits are the same, the message passes straight through; otherwise it is crossed over
- Example of blocking: either (010 to 111) or (110 to 100) has to wait until link AB is free
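The routing rule and the blocking example can be traced with a small Python model of an omega network. This is a sketch under assumptions: each stage is modeled as a perfect shuffle (a one-bit left rotation of the line number) followed by a 2x2 switch, and the line numbering is this model's own, so the shared link it reports may or may not be the one the slide's figure labels AB.

```python
def omega_route(src, dst, n_bits):
    """Per stage: (stage, switch setting, output line) for a message src -> dst."""
    mask = (1 << n_bits) - 1
    pos, hops = src, []
    for stage in range(n_bits):
        shift = n_bits - 1 - stage           # examine address bits msb first
        src_bit = (src >> shift) & 1
        dst_bit = (dst >> shift) & 1
        setting = "straight" if src_bit == dst_bit else "cross"
        # Perfect shuffle: rotate the line number left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        # After the shuffle the low bit equals src_bit; the switch leaves it
        # (straight) or flips it (cross), so it ends up equal to dst_bit.
        pos = (pos & ~1) | dst_bit
        hops.append((stage, setting, pos))
    return hops


def conflicts(route_a, route_b):
    """Links (stage, line) that both routes need."""
    links_a = {(stage, line) for stage, _, line in route_a}
    return sorted((stage, line) for stage, _, line in route_b
                  if (stage, line) in links_a)


if __name__ == "__main__":
    a = omega_route(0b010, 0b111, 3)
    b = omega_route(0b110, 0b100, 3)
    print(a)
    print(b)
    print(conflicts(a, b))
```

In this model both messages leave the first stage on the same line, so one of them has to wait, which is the blocking behavior the example describes.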

Butterflies
[Figure: 16-node butterfly with levels 4 down to 0, and its 2 x 2 switch building block]
- Tree with lots of roots!
- N log N (actually N/2 x log N)
- Exactly one route from any source to any destination
- Bisection N/2 vs. N^((d-1)/d)

Tree Structures
- Static vs. dynamic trees
- H-tree

Tree Properties
- Diameter and average distance are logarithmic
  - k-ary tree, height d = log_k N
  - address specified by a d-vector of radix-k coordinates describing the path down from the root
- Fixed degree
- H-tree space is O(N) with O(sqrt(N)) long wires
- Bisection BW?
- Fat tree example: CM-5

Benes Network and Fat Tree
[Figure: 16-node Benes network (unidirectional) and 16-node 2-ary fat-tree (bidirectional)]
- Back-to-back butterflies can route all permutations off line
- What if you just pick a random midpoint?

Relationship of Butterflies to Hypercubes
- Wiring is isomorphic
- Except that the butterfly always takes log n steps

Real Machines
- Wide links, smaller routing delay
- Tremendous variation

Basic Switch Organization
- Switches: input ports, output ports, internal crossbar switch, buffers, control
[Figure: switch block diagram. Input ports -> receiver -> input buffer -> crossbar -> output buffer -> transmitter -> output ports, with control (routing, scheduling)]

Basic Switching Strategies
- Circuit switching
  - establish a circuit from source to destination
- Packet switching
  - Store and forward (SF)
    - move the entire packet one hop toward the destination
    - buffer until the next hop is permitted
    - e.g. Internet
  - Virtual cut-through and wormhole
    - pipeline the hops: the switch examines the header, decides where to send the message, and then starts forwarding it immediately
    - Virtual cut-through: buffer on blockage
    - Wormhole: leave the message spread through the network on blockage

Store-and-Forward (SF) Performance Model
- Startup time Ts: the time required to handle a message at the sending and receiving nodes
- Per-hop time Th: the transfer time of the message header across a link
- Per-word transfer time Tw: if the link bandwidth is r, then Tw = 1/r
- Latency for a message of m words transmitted through l links:
  Tcomm = Ts + (m*Tw + Th) * l

Cut-Through and Wormhole
- Each message is broken into fixed-size units called flow control digits (flits)
- Flits contain no routing information
- They follow the same path established by a header
- A message of m words traversing l links:
  Tcomm = Ts + l*Th + m*Tw
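To see how far apart the two latency models can be, the formulas can be evaluated side by side. The Python below is a direct transcription of the two expressions; the parameter values in the example are made up for illustration and do not describe any real machine.

```python
def t_store_and_forward(m, l, ts, th, tw):
    """Tcomm = Ts + (m*Tw + Th) * l: the whole message is retransmitted per hop."""
    return ts + (m * tw + th) * l


def t_cut_through(m, l, ts, th, tw):
    """Tcomm = Ts + l*Th + m*Tw: only the header pays the per-hop cost."""
    return ts + l * th + m * tw


if __name__ == "__main__":
    # Hypothetical parameters, in arbitrary time units.
    ts, th, tw = 50, 2, 1
    m, l = 100, 5
    print("store-and-forward:", t_store_and_forward(m, l, ts, th, tw))
    print("cut-through:      ", t_cut_through(m, l, ts, th, tw))
```

With these made-up numbers, store-and-forward costs 560 units and cut-through 160; the gap grows with the product m*l, since store-and-forward multiplies the message transfer time by the hop count while cut-through adds the two.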

SF vs. Cut-Through Switching
[Figure: time-space diagrams comparing store-and-forward routing and cut-through routing of a packet from source to destination]

Routing
- Recall: the routing algorithm determines
  - which of the possible paths are used as routes
  - how the route is determined
  - R: N x N -> C, which at each switch maps the destination node n_d to the next channel on the route
- Issues:
  - Routing mechanism
    - arithmetic
    - source-based port select
    - table driven
    - general computation
  - Properties of the routes
  - Deadlock free

Routing Mechanism
- Need to select an output port for each input packet
  - in a few cycles
- Simple arithmetic in regular topologies
  - ex: Δx, Δy routing in a grid
      west (-x)    Δx < 0
      east (+x)    Δx > 0
      south (-y)   Δx = 0, Δy < 0
      north (+y)   Δx = 0, Δy > 0
      processor    Δx = 0, Δy = 0
- Reduce the relative address of each dimension in order
  - Dimension-order routing in k-ary d-cubes
  - e-cube routing in n-cubes

Properties of Routing Algorithms
- Deterministic
  - route determined by (source, dest), not intermediate state (i.e. traffic)
- Adaptive
  - route influenced by traffic along the way
- Minimal
  - only selects shortest paths
- Deadlock free
  - no traffic pattern can lead to a situation where no packets move forward
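The port-selection table for the grid can be read as a handful of comparisons on the relative address carried in the packet header. The sketch below takes that reading, treating the table's conditions as tests on the remaining offset (dx, dy) to the destination; the names select_port and trace and the header-update convention are assumptions for illustration.

```python
def select_port(dx, dy):
    """Output port for a packet whose remaining offset to the destination is (dx, dy)."""
    if dx < 0:
        return "west"
    if dx > 0:
        return "east"
    if dy < 0:
        return "south"
    if dy > 0:
        return "north"
    return "processor"                       # offset is (0, 0): deliver locally


# How the remaining offset changes after one hop through each port.
HEADER_UPDATE = {
    "west": (1, 0),    # moved one step in -x, so dx rises toward 0
    "east": (-1, 0),
    "south": (0, 1),
    "north": (0, -1),
}


def trace(dx, dy):
    """Ports chosen hop by hop until the packet is delivered."""
    ports = []
    while True:
        port = select_port(dx, dy)
        ports.append(port)
        if port == "processor":
            return ports
        ddx, ddy = HEADER_UPDATE[port]
        dx, dy = dx + ddx, dy + ddy


if __name__ == "__main__":
    print(trace(-2, 1))
```

Because the x offset is always reduced to zero before y is touched, this is dimension-order routing: deterministic and minimal.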

Deadlock-Free Routing
- How can it arise?
  - necessary conditions:
    - shared resource
    - incrementally allocated
    - non-preemptive
- How do you avoid it?
  - constrain how channel resources are allocated
  - ex: dimension-ordered routing
- How do you prove that a routing algorithm is deadlock free?

Proof Technique
- Resources are logically associated with channels
- Messages introduce dependences between resources as they move forward
- Need to articulate the possible dependences that can arise between channels
- Show that there are no cycles in the channel dependence graph
  - find a numbering of channel resources such that every legal route follows a monotonic sequence
  => no traffic pattern can lead to deadlock
- The network need not be acyclic, only the channel dependence graph
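The proof technique can be tried mechanically on small networks: enumerate every route, record which channel each message holds while requesting the next, and test the resulting channel dependence graph for cycles. The Python sketch below does this for dimension-ordered (X-Y) routing on a 4 x 4 mesh and, for contrast, for clockwise-only routing on a 4-node ring. The network sizes, function names, and the use of a topological sort (Kahn's algorithm) as the cycle test are choices made for this illustration.

```python
def xy_path(src, dst):
    """Nodes visited by X-Y routing on a mesh: X first, then Y."""
    (x, y), (tx, ty) = src, dst
    nodes = [(x, y)]
    while x != tx:
        x += 1 if tx > x else -1
        nodes.append((x, y))
    while y != ty:
        y += 1 if ty > y else -1
        nodes.append((x, y))
    return nodes


def ring_path(src, dst, n):
    """Nodes visited when every message travels clockwise on an n-node ring."""
    nodes = [src]
    while nodes[-1] != dst:
        nodes.append((nodes[-1] + 1) % n)
    return nodes


def channel_dependence_graph(paths):
    """Edge c1 -> c2 when some route holds channel c1 while requesting c2."""
    deps = {}
    for nodes in paths:
        channels = list(zip(nodes, nodes[1:]))   # a channel is a directed link
        for c in channels:
            deps.setdefault(c, set())
        for held, wanted in zip(channels, channels[1:]):
            deps[held].add(wanted)
    return deps


def has_cycle(deps):
    """True if the dependence graph cannot be fully topologically sorted."""
    indegree = {c: 0 for c in deps}
    for targets in deps.values():
        for t in targets:
            indegree[t] += 1
    ready = [c for c, deg in indegree.items() if deg == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for t in deps[c]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return removed < len(deps)


if __name__ == "__main__":
    mesh_nodes = [(x, y) for x in range(4) for y in range(4)]
    mesh_paths = [xy_path(s, d) for s in mesh_nodes for d in mesh_nodes if s != d]
    ring_paths = [ring_path(s, d, 4) for s in range(4) for d in range(4) if s != d]
    print("X-Y mesh has cyclic dependences:", has_cycle(channel_dependence_graph(mesh_paths)))
    print("clockwise ring has cyclic dependences:", has_cycle(channel_dependence_graph(ring_paths)))
```

An acyclic dependence graph is exactly the condition under which a monotonic channel numbering exists; the ring's cyclic graph shows why a restriction or extra channels are needed there.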

Example: k-ary 2D array
- Theorem: x,y routing is deadlock free
- Numbering
  [Figure: 4 x 4 array with nodes labeled 00 to 33 and a number assigned to each channel]
- Any routing sequence (x direction, turn, y direction) is increasing

Breaking Deadlock with Virtual Channels
- Packet switches from lo to hi channel

Flow Control
- Multiple streams trying to use the same link at the same time
  - Ethernet/LAN: collision detection and retry after delay
  - TCP/WAN: buffer, drop, adjust rate
  - any solution must adjust to the output rate
- Flow control in parallel computers
  - Link-level control: block in place
  - The dest buffer may not be available to accept transfers; this may cause the buffers at the sources to fill and exert back pressure in end-to-end flow control
[Figure: link-level handshake with Ready and Data signals]

Network Characterization
- Topology (what)
  - physical interconnection structure of the network graph
  - direct vs. indirect
- Routing algorithm (which)
  - restricts the set of paths that messages may follow
- Switching strategy (how)
  - how data in a message traverses a route
  - circuit switching vs. packet switching
- Flow control mechanism (when)
  - when a message or portions of it traverse a route
  - what happens when traffic is encountered?
- The interplay of all of these determines performance