Lecture 18: Interconnection Networks. CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)

Similar documents
Interconnection Network

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Why the Network Matters

Interconnection Networks

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)

Interconnection Network Design

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors

Components: Interconnect Page 1 of 18

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

From Hypercubes to Dragonflies a short history of interconnect

On-Chip Interconnection Networks Low-Power Interconnect

Interconnection Networks

Topological Properties

Lecture 23: Interconnection Networks. Topics: communication latency, centralized and decentralized switches (Appendix E)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

Scalability and Classifications

Lecture 2 Parallel Programming Platforms

Chapter 2. Multiprocessors Interconnection Networks

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms

Asynchronous Bypass Channels

Scaling 10Gb/s Clustering at Wire-Speed

Parallel Programming

CS 78 Computer Networks. Internet Protocol (IP) our focus. The Network Layer. Interplay between routing and forwarding

Communication Networks. MAP-TELE 2011/12 José Ruela

Chapter 12: Multiprocessor Architectures. Lesson 04: Interconnect Networks

Switched Interconnect for System-on-a-Chip Designs

Low-Overhead Hard Real-time Aware Interconnect Network Router

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

MPLS. Packet switching vs. circuit switching Virtual circuits

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

Technology-Driven, Highly-Scalable Dragonfly Topology

QoS Switching. Two Related Areas to Cover (1) Switched IP Forwarding (2) 802.1Q (Virtual LANs) and 802.1p (GARP/Priorities)

Interconnection Networks

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch?

Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip

Quality of Service (QoS) for Asynchronous On-Chip Networks

Switch Fabric Implementation Using Shared Memory

Leveraging Torus Topology with Deadlock Recovery for Cost-Efficient On-Chip Network

Data Center Switch Fabric Competitive Analysis

Packetization and routing analysis of on-chip multiprocessor networks

Introduction to Infiniband. Hussein N. Harake, Performance U! Winter School

Annotation to the assignments and the solution sheet. Note the following points


1. The subnet must prevent additional packets from entering the congested region until those already present can be processed.

Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy

Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs

ECE 358: Computer Networks. Solutions to Homework #4. Chapter 4 - The Network Layer

Three Key Design Considerations of IP Video Surveillance Systems

Large-Scale Distributed Systems. Datacenter Networks. COMP6511A Spring 2014 HKUST. Lin Gu

A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator

Architecture of distributed network processors: specifics of application in information security systems

On-Chip Communication Architectures

Networks on Chip. on-chip interconnect: physical. Kees Goossens. Kees Goossens Eindhoven University of Technology 1

Performance of Software Switching

Optimizing Enterprise Network Bandwidth For Security Applications. Improving Performance Using Antaira s Management Features

WAN. Introduction. Services used by WAN. Circuit Switched Services. Architecture of Switch Services

Intel Ethernet Switch Converged Enhanced Ethernet (CEE) and Datacenter Bridging (DCB) Using Intel Ethernet Switch Family Switches

Master Course Computer Networks IN2097

TRILL for Service Provider Data Center and IXP. Francois Tallet, Cisco Systems

Switching. An Engineering Approach to Computer Networking

MPLS Environment. To allow more complex routing capabilities, MPLS permits attaching a

Announcements. Midterms. Mt #1 Tuesday March 6 Mt #2 Tuesday April 15 Final project design due April 11. Chapters 1 & 2 Chapter 5 (to 5.

Optical interconnection networks with time slot routing

Broadband Networks. Prof. Karandikar. Department of Electrical Engineering. Indian Institute of Technology, Bombay. Lecture - 26

Multiprotocol Label Switching (MPLS)

Brocade Solution for EMC VSPEX Server Virtualization

Smart Queue Scheduling for QoS Spring 2001 Final Report

Outline. Institute of Computer and Communication Network Engineering. Institute of Computer and Communication Network Engineering

Resource Utilization of Middleware Components in Embedded Systems

Datagram-based network layer: forwarding; routing. Additional function of VCbased network layer: call setup.

Computer Systems Structure Input/Output

The proliferation of the raw processing

vci_anoc_network Specifications & implementation for the SoClib platform

Large Scale Clustering with Voltaire InfiniBand HyperScale Technology

InfiniBand Clustering

Advanced Computer Networks. Datacenter Network Fabric

OpenFlow Based Load Balancing

MPLS-TP. Future Ready. Today. Introduction. Connection Oriented Transport

Network Architecture and Topology

Virtualization and SDN Applications

CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING

SiteCelerate white paper

Solving Network Challenges

Transcription:

Lecture 18: Interconnection Networks CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)

Announcements Project deadlines: - Mon, April 2: project proposal: 1-2 page writeup - Fri, April 20: project checkpoint: 1-2 page writeup - Thurs, May 10: final presentations + final writeup

Today s Agenda Interconnection Networks - Introduction and Terminology - Topology - Buffering and Flow control

Inteconnection Network Basics Topology - Specifies way switches are wired - Affects routing, reliability, throughput, latency, building ease Routing - How does a message get from source to destination - Static or adaptive Buffering and Flow Control - What do we store within the network? - Entire packets, parts of packets, etc? - How do we manage and negotiate buffer space? - How do we throttle during oversubscription? - Tightly coupled with routing strategy

Terminology Network interface - Connects endpoints (e.g. cores) to network. - Decouples computation/communication Links - Bundle of wires that carries a signal Switch/router - Connects fixed number of input channels to fixed number of output channels Channel - A single logical connection between routers/switches

More Terminology Node - A network endpoint connected to a router/switch Message - Unit of transfer for network clients (e.g. cores, memory) Packet - Unit of transfer for network Packet Flit - Flow control digit - Unit of flow control within network H F Head Flit F F F F Flits F T Tail Flit

Some More Terminology Direct or Indirect Networks - Endpoints sit inside (direct) or outside (indirect) the network - E.g. mesh is direct; every node is both endpoint and switch Router (switch), Radix of 2 (2 inputs, 2 outputs) Abbreviation: Radix-ary These routers are 2-ary 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Indirect Direct

Today s Agenda Interconnection Networks - Introduction and Terminology - Topology - Buffering and Flow control

Properties of a Topology/Network Regular or Irregular - regular if topology is regular graph (e.g. ring, mesh) Routing Distance - number of links/hops along route Diameter - maximum routing distance Average Distance - average number of hops across all valid routes

Properties of a Topology/Network Bisection Bandwidth - Often used to describe network performance - Cut network in half and sum bandwidth of links severed - (Min # channels spanning two halves) * (BW of each channel) - Meaningful only for recursive topologies - Can be misleading, because does not account for switch and routing efficiency Blocking vs. Non-Blocking - If connecting any permutation of sources & destinations is possible, network is non-blocking; otherwise network is blocking.

Many Topology Examples Bus Crossbar Ring Tree Omega Hypercube Mesh Torus Butterfly

Bus + Simple + Cost effective for a small number of nodes + Easy to implement coherence (snooping) - Not scalable to large number of nodes (limited bandwidth, electrical loading reduced frequency) - High contention Memory Memory Memory Memory cache cache cache cache Proc Proc Proc Proc

Crossbar Every node connected to all others (non-blocking) Good for small number of nodes + Low latency and high throughput - Expensive - Not scalable O(N 2 ) cost - Difficult to arbitrate Core-to-cache-bank networks: - IBM POWER5 - Sun Niagara I/II 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7

Ring + Cheap: O(N) cost - High latency: O(N) - Not easy to scale - Bisection bandwidth remains constant Used in: RING - Intel Larrabee/Core i7 - IBM Cell S S S P P P M M M

Mesh O(N) cost Average latency: O(sqrt(N)) Easy to layout on-chip: regular & equal-length links Path diversity: many ways to get from one node to another Used in: - Tilera 100-core CMP - On-chip network prototypes

Torus Mesh is not symmetric on edges: performance very sensitive to placement of task on edge vs. middle Torus avoids this problem + Higher path diversity (& bisection bandwidth) than mesh - Higher cost - Harder to lay out on-chip - Unequal link lengths

Trees Planar, hierarchical topology Latency: O(logN) Good for local traffic + Cheap: O(N) cost + Easy to Layout - Root can become a bottleneck Fat trees avoid this problem (CM-5) Fat Tree

Hypercube Latency: O(logN) Radix: O(logN) #links: O(NlogN) + Low latency - Hard to lay out in 2D/3D 1101 1111 Used in some early message passing machines, e.g.: 0101 0111 1100 1110 - Intel ipsc - ncube 0100 0110 1000 1001 1011 1010 0001 0011 0000 0010

Multistage Logarithmic Networks Idea: Indirect networks with multiple layers of switches between terminals Cost: O(NlogN), Latency: O(logN) Many variations (Omega, Butterfly, Benes, Banyan, ) E.g. Omega Network: 000 001 Omega Net w or k 000 001 010 011 100 101 010 011 100 101 Q: Blocking or non-blocking? 110 111 110 111 conflict

Review: Topologies 3 7 7 2 6 5 6 5 1 4 3 4 3 0 2 2 1 1 0 1 2 3 0 0 Topology Crossbar Multistage Logarith. Mesh Direct/Indirect Indirect Indirect Direct Blocking/ Non-blocking Cost Non-blocking Blocking Blocking O(N 2 ) O(NlogN) O(N) Latency O(1) O(logN) O(sqrt(N))

Today s Agenda Interconnection Networks - Introduction and Terminology - Topology - Buffering and Flow control

Circuit vs. Packet Switching Circuit switching sets up full path - Establish route then send data - (no one else can use those links) - faster and higher bandwidth - setting up and bringing down links slow Packet switching routes per packet - Route each packet individually (possibly via different paths) - if link is free can use - potentially slower (must dynamically switch) - no setup, bring down time

Packet Switched Networks: Packet Format Header - routing and control information Payload - carries data (non HW specific information) - can be further divided (framing, protocol stacks ) Error Code - generally at tail of packet so it can be generated on the way out Header Payload Error Code

Handling Contention Two packets trying to use the same link at the same time What do you do? - Buffer one - Drop one - Misroute one (deflection) We will only consider buffering in this lecture

Flow Control Methods Circuit switching Store and forward (Packet based) Virtual Cut Through (Packet based) Wormhole (Flit based)

Circuit Switching Revisited Resource allocation granularity is high Idea: Pre-allocate resources across multiple switches for a given flow Need to send a probe to set up the path for preallocation + No need for buffering + No contention (flow s performance is isolated) + Can handle arbitrary message sizes - Lower link utilization: two flows cannot use the same link - Handshake overhead to set up a circuit

Store and Forward Flow Control Packet based flow control Store and Forward - Packet copied entirely into network router before moving to the next node - Flow control unit is the entire packet Leads to high per-packet latency Requires buffering for entire packet in each node S D Can we do better?

Cut through Flow Control Another form of packet based flow control Start forwarding as soon as header is received and resources (buffer, channel, etc) allocated - Dramatic reduction in latency Still allocate buffers and channel bandwidth for full packets S D What if packets are large?

Cut through Flow Control What to do if output port is blocked? Lets the tail continue when the head is blocked, absorbing the whole message into a single switch. - Requires a buffer large enough to hold the largest packet. Degenerates to store-and-forward with high contention Can we do better?

Wormhole Flow Control T B Packets broken into (potentially) smaller flits (buffer/bw allocation unit) Flits are sent across the fabric in a wormhole fashion - Body follows head, tail follows body - Pipelined - If head blocked, rest of packet stops - Routing (src/dest) information only in head B H How does body/tail know where to go? Latency almost independent of distance for long messages

Wormhole Flow Control Advantages over store and forward flow control + Lower latency + More efficient buffer utilization Limitations - Suffers from head-of-line (HOL) blocking - If head flit cannot move due to contention, another worm cannot proceed even though links may be idle Input Queues Switching Fabric Outputs 1 1 2 1 1 Idle! 2 2 2 1 2 HOL Blocking

Head-of-Line Blocking Red holds this channel: channel remains idle until read proceeds Channel idle but red packet blocked behind blue Buffer full: blue cannot proceed Blocked by other packets

Virtual Channel Flow Control Idea: Multiplex multiple channels over one physical channel Reduces head-of-line blocking Divide up the input buffer into multiple buffers sharing a single physical channel Dally, Virtual Channel Flow Control, ISCA 1990.

Virtual Channel Flow Control Buffer full: blue cannot proceed Blocked by other packets

Other Uses of Virtual Channels Deadlock avoidance - Enforcing switching to a different set of virtual channels on some turns can break the cyclic dependency of resources - Escape VCs: Have at least one VC that uses deadlock-free routing. Ensure each flit has fair access to that VC. - Protocol level deadlock: Ensure request and response packets use different VCs prevent cycles due to intermixing of different packet classes Prioritization of traffic classes - Some virtual channels can have higher priority than others

Communicating Buffer Availability Credit-based flow control - Upstream knows how many buffers are downstream - Downstream passes back credits to upstream - Significant upstream signaling (esp. for small flits) On/Off (XON/XOFF) flow control - Downstream has on/off signal to upstream Ack/Nack flow control - Upstream optimistically sends downstream - Buffer cannot be deallocated until ACK/NACK received - Inefficiently utilizes buffer space

Review: Flow Control S Store and Forward S Cut Through / Wormhole Shrink Buffers D Reduce latency D Any other issues? Head-of-Line Blocking Red holds this channel: channel remains idle until read proceeds Channel idle but red packet blocked behind blue Use Virtual Channels Blocked by other packets Buffer full: blue cannot proceed

Review: Flow Control S Store and Forward S Cut Through / Wormhole Shrink Buffers D Reduce latency D Any other issues? Head-of-Line Blocking Buffer full: blue cannot proceed Use Virtual Channels Blocked by other packets