A Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution


A Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution
Georgios Faldamis, Cavium Inc.; Gennette Gill, D. E. Shaw Research; Weiwei Jiang, Dept. of Computer Science, Columbia University; Steven M. Nowick, Dept. of Computer Science, Columbia University
ACM/IEEE Asia and South Pacific Design Automation Conf. (ASP-DAC '14)

Motivation for Networks-on-Chip
Future of computing is multi-core: 2 to 4 cores are common, 8 to 16 widely available (e.g. Niagara 16-core, Intel 10-core Xeon, AMD 16-core Opteron). Expected progression: hundreds or thousands of cores. Trend towards complex systems-on-chip (SoC). Communication complexity is the new limiting factor.
NoC design enables orthogonalization of concerns:
Improves scalability - buses and crossbars are unable to deliver the desired bandwidth - global ad-hoc wiring does not scale to large systems
Provides flexibility - handles pre-scheduled and dynamic traffic - routes around faulty network nodes
Facilitates design reuse - standard interfaces increase modularity and decrease design time

Key Active Research Challenges for NoCs
Power consumption: will exceed future power budgets by a factor of 10x [Owens, IEEE Micro-07]. Global clocks consume a large fraction of overall power; complex clock-gating techniques are needed [Benini et al., TVLSI-02]. Chips partitioned into multiple timing domains make it difficult to integrate heterogeneous modules; dynamic voltage/frequency scaling (DVFS) is used for lower power [Ogras/Marculescu DAC-08].
A key performance bottleneck = latency: latency is critical for on-chip memory access and important for chip multiprocessors (CMPs).

Potential Advantages of Asynchronous Design
Lower power: no clock power consumed; idle components consume no dynamic power - IBM/Columbia FIR filter [Tierno, Singh, Nowick, et al., ISSCC-02]
Greater flexibility/modularity: easier integration between multiple timing domains; supports reusable components - [Bainbridge/Furber, IEEE Micro-02 Magazine] - [Dobkin/Ginosar, Async-04]
Lower system latency: no per-router clock synchronization, no waiting for the clock - [Sheibanyrad/Greiner et al., IEEE Design & Test '08] - [Horak, Nowick, et al., NOCS-10]

Motivation for Our Research
Target = interconnection network for CMPs: the network between processors and cache memory. GALS NoC: sync/async interfaces + async network.
Requires high performance: low system-level latency - lightweight routers for low latency; high sustained throughput - maximize steady-state throughput.
Target topology = variant MoT ("Mesh-of-Trees"). Tree topologies are becoming widely used for CMPs: - XMT [Balkan/Vishkin et al., Hot Interconnects-07] - Single-cycle network [Rahimi, Benini, et al., DATE-11] - NOC-OUT [Grot, Falsafi, et al., IEEE Micro-12]
Our two main contributions: a high-performance async network with advance arbitration, and a detailed comparative evaluation on 8 benchmarks.
(Figure: cores connected to a shared cache through the network.)

Contributions (1)
Mesh-of-Trees (MoT) network with early arbitration: targets the system-latency bottleneck. Observe newly-entering traffic; perform early arbitration + channel pre-allocation. Net benefit: bypass arbitration logic + pre-opened channel.
Early arbitration capability in fan-in router nodes: simple and fast - operates as a FIFO in many traffic scenarios.
Monitoring network: rapid advance notification of incoming data. Fast and lightweight; a key component for early arbitration.

Contributions (2)
Detailed experimentation and analysis: early arbitration network vs. baseline and predictive - baseline: [Horak/Nowick, NOCS-10] - predictive: [Gill/Nowick, NOCS-11]
8 diverse synthetic benchmarks, representing different network conditions.
Significant latency improvement and comparable throughput - new vs. baseline: 23-30% latency improvement - new vs. predictive: 13-38% latency improvement.
Low end-to-end system latency: ~1.7ns (at 25% load, 90nm) through 6 router nodes + 5 hops.

Related Work: NoC Acceleration Techniques
Express virtual channels [Kumar/Peh, ISCA-07]: selected packets use dedicated fast channels and virtually bypass intermediate nodes - improvements only against a slow coarse-grained baseline (3-cycle operation).
SMART NoC [Chen/Peh, DATE-13]: selected packets traverse multiple hops in one cycle - requires advanced circuit-level techniques + aggressive timing assumptions.
Hybrid network [Modarressi/Arjomand, DATE-09]: a normal packet-switched network + a fast circuit-switched network; flits can switch between the two sub-networks - requires a partitioned (statically-allocated) network + large circuit-switched setup time.
NoC using advanced bundles [Kumar et al., ICCD-07]: provides advance information of flit arrival - closest to our approach, but advance bundles move ahead only one cycle per hop (unlike our approach).

Outline
Introduction
Background
New Asynchronous MoT Network: Overview of the Early Arbitration Approach; Monitoring Network; Design of the New Arbitration Node
Experimental Results: Simulation Setup; Network-Level Results
Conclusion and Future Work

Background: Mesh-of-Trees (MoT) Variant
Topology basics: a fan-out and fan-in network - the inverse of the classical MoT (Leighton). Two node types: routing (1 input and 2 output channels) and arbitration (2 input and 1 output channels).
Routing features: deterministic wormhole routing; path examples shown in the figure; no contention between distinct source/sink pairs.
Potential performance benefits: lower latency and higher throughput than a 2D-mesh; shown to perform well for CMPs [Balkan/Vishkin, Trans. VLSI, Oct. '09], [Balkan/Vishkin, Hot Interconnects-07].
(Figure: MoT example with sources 0-3 and sinks 0-3.)
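The topology numbers quoted later in the deck (112 router nodes for the 8x8 network; 6 router nodes + 5 hops per path) follow directly from this structure. A minimal Python sketch, assuming the variant MoT described here (each source roots a binary fan-out tree of routing nodes, each sink a binary fan-in tree of arbitration nodes; the function names are illustrative):

```python
import math

def mot_node_counts(n: int):
    """Node counts for an n-source x n-sink MoT variant: each source
    roots a binary fan-out tree with n - 1 routing nodes, and each sink
    roots a binary fan-in tree with n - 1 arbitration nodes."""
    assert n > 1 and n & (n - 1) == 0, "n must be a power of two"
    return n * (n - 1), n * (n - 1)   # (routing nodes, arbitration nodes)

def mot_path_stats(n: int):
    """One source-to-sink path: log2(n) routing nodes, then log2(n)
    arbitration nodes, with one channel hop between consecutive nodes."""
    depth = int(math.log2(n))
    nodes = 2 * depth                 # router nodes traversed
    return nodes, nodes - 1           # (router nodes, hops)
```

For n = 8 this gives 56 routing + 56 arbitration = 112 nodes, and a 6-node, 5-hop source-to-sink path, matching the figures quoted in the experimental section.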

Background: Two Node Types
Routing primitive: 1 incoming handshaking channel (Req, Ack, Data) and 2 outgoing handshaking channels; routes the input to one of the outputs.
Arbitration primitive: 2 incoming handshaking channels and 1 outgoing handshaking channel; merges two input streams into one output stream.
(Figure: block diagrams of the routing and arbitration primitives with their Req/Ack/Data channels.)
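As a behavioral illustration only (not the asynchronous circuit), the two primitives can be sketched in Python. The class names and the fair-alternation policy in the arbitration model are illustrative assumptions - the real fan-in node resolves contention with an analog mutex:

```python
from collections import deque

class RoutingPrimitive:
    """1 input channel, 2 output channels: steers each flit to one
    output according to one bit of its destination address."""
    def __init__(self, level):
        self.level = level            # which destination-address bit to examine
        self.out = (deque(), deque())
    def accept(self, dest, payload):
        bit = (dest >> self.level) & 1
        self.out[bit].append((dest, payload))

class ArbitrationPrimitive:
    """2 input channels, 1 output channel: merges two flit streams.
    The real async node arbitrates with a mutex; this model simply
    alternates fairly when both inputs are pending."""
    def __init__(self):
        self.inq = (deque(), deque())
        self.out = deque()
        self._last = 1                # channel that won most recently
    def accept(self, ch, flit):
        self.inq[ch].append(flit)
    def step(self):
        """Forward at most one flit to the output; True if one moved."""
        for ch in (1 - self._last, self._last):
            if self.inq[ch]:
                self.out.append(self.inq[ch].popleft())
                self._last = ch
                return True
        return False
```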

Background: Asynchronous Protocols
Handshaking: transition signaling (two-phase). Two events per transaction - Req and Ack each toggle once. Merits over level signaling (four-phase): one roundtrip communication per data item; high throughput and low power. Challenge of two-phase signaling: designing lightweight implementations.
Data encoding: single-rail bundled data - standard synchronous single-rail data + an extra bundling req. Merits of single-rail bundled data: low power and very good coding efficiency; allows re-use of synchronous components. Challenge: requires a matched delay for the bundling req - a one-sided timing constraint: the request must arrive after the data is stable.
(Figure: sender/receiver waveforms for two successive communications.)
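The "two events per transaction" claim can be made concrete with a small sketch counting wire events; this is a hypothetical event-counting model, not the circuit:

```python
def two_phase_events(n_items: int) -> int:
    """Transition signaling: per item, the sender toggles req once
    (after the data is stable) and the receiver toggles ack once."""
    req = ack = 0
    events = 0
    for _ in range(n_items):
        req ^= 1; events += 1   # req toggle: data valid
        ack ^= 1; events += 1   # ack toggle: data accepted
    return events

def four_phase_events(n_items: int) -> int:
    """Level signaling (return-to-zero): req+, ack+, req-, ack- per
    item, i.e. two roundtrips of wire events for every data item."""
    return 4 * n_items
```

Two-phase signaling thus halves the handshake events per item, which is the source of the throughput and power advantage noted above.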

Outline
Introduction
Background
New Asynchronous MoT Network: Overview of the Early Arbitration Approach; Monitoring Network; Design of the New Arbitration Node
Experimental Results: Simulation Setup; Network-Level Results
Conclusion and Future Work

Overview: Early Arbitration Strategy
Key network bottleneck: system latency - the bottleneck of arbitration logic in fan-in nodes.
Basic strategy = anticipation: observe newly-entering traffic; do early arbitration + channel pre-allocation. Net benefit: bypass arbitration logic.
Proposed network: as soon as a flit enters the network - all downstream nodes are quickly notified (by a monitoring network) - fan-in nodes initiate early arbitration + channel pre-allocation. When the flit arrives at each fan-in node, it is quickly sent out through the pre-allocated channel.
Routing nodes are unchanged; arbitration nodes are new.
(Figure: MoT with sources 0-3 and sinks 0-3, distinguishing unchanged routing nodes from new arbitration nodes.)

Outline
Introduction
Background
New Asynchronous MoT Network: Overview of the Early Arbitration Approach; Monitoring Network; Design of the New Arbitration Node
Experimental Results: Simulation Setup; Network-Level Results
Conclusion and Future Work

Monitoring Network: Overview
Purpose: rapid advance notification of incoming data.
Structure: a lightweight shadow replica of the MoT network - a small monitoring control unit attached to each node (i.e. both routing and arbitration).
Fast and lightweight: implemented with several gates per control unit.
Different roles for fan-out and fan-in monitoring: fan-out nodes fast-forward the early notification without using it; fan-in nodes fast-forward it and use it for early arbitration.

Monitoring Network: Structure
Structure: a shadow replica of the MoT network, with a small and fast monitoring control unit attached to each node.
(Figure: monitoring channels mirroring the MoT from the fan-out roots to the fan-in roots, with a control unit attached to each node.)

Monitoring Network: Operation
When a flit enters the network, an early notification is generated at the fan-out root and fast-forwarded. The early notification traces the same path as the flit, and the fan-in nodes pre-allocate the channel.
(Figure: an early notification propagating through the monitoring channels ahead of the flit.)

Outline
Introduction
Background
New Asynchronous MoT Network: Overview of the Early Arbitration Approach; Monitoring Network; Design of the New Arbitration Node
Experimental Results: Simulation Setup; Network-Level Results
Conclusion and Future Work

New Arbitration Node: Circuit-Level
(Figure: full schematic of the new arbitration node - input latches L1-L4, Mutex, Req-Latch, Monitor, and output register; signals include reqin0/1, ackout0/1, datain0/1, preackout0/1, zerowins/onewins, mutex-req0/1, mux_select, output-en, takeover, something-coming-in-0/1, something-coming-out, reqout, ackin, and dataout.)

New Arbitration Node: Interfaces
2 input data channels (reqin0/1, ackout0/1, datain0/1) and 1 output data channel (reqout, ackin, dataout).
(Figure: schematic with the input and output data channels highlighted.)

New Arbitration Node: Interfaces (cont.)
Monitoring channels (something-coming-in-0/1, something-coming-out): provide advance information on incoming traffic.
(Figure: schematic with the monitoring channels highlighted.)

New Arbitration Node: Structure
Mutex: resolves arbitration between the 2 input channels.
(Figure: schematic with the Mutex highlighted.)

New Arbitration Node: Structure (cont.)
Monitor: requests/releases the Mutex - the key component enabling early arbitration.
(Figure: schematic with the Monitor highlighted.)

New Arbitration Node: Structure (cont.)
Input channel latch + control, with two functions: (i) enables channel pre-allocation, (ii) flow control.
(Figure: schematic with the input channel latches and their control highlighted.)

New Arbitration Node: Structure (cont.)
Monitoring control: fast-forwards the early notification.
(Figure: schematic with the monitoring control path highlighted.)

New Arbitration Node: Key Feature (1)
Early arbitration capability: the something-coming signals initiate arbitration before the actual flit arrives.
(Figure: schematic showing the something-coming signals driving the Mutex requests.)

New Arbitration Node: Key Feature (2)
Highly optimized forward path: contains only 1 pre-opened latch = a single FIFO stage.
(Figure: forward path from reqin/datain through the latch pre-opened by early arbitration to reqout/dataout.)

Simulation: Overview
Two simulations:
#1. Single-flit scenario - friendly case - illustrates how early arbitration works.
#2. Contention between two input channels - more advanced and adversarial case - illustrates how contention is resolved.

Simulation #1: Single-Flit
Step #1: the something-coming signal arrives (well before the actual flit). It is quickly forwarded and initiates early arbitration.
(Figure: the notification entering the Monitor and requesting the Mutex.)

Simulation #1: Single-Flit (cont.)
Step #2: early arbitration completes - the channel wins arbitration and is opened.
(Figure: the Mutex grant opening the input channel latch.)

Simulation #1: Single-Flit (cont.)
Step #3: the flit arrives and gets through the pre-allocated channel - the channel is already open, so the flit is sent out immediately.
(Figure: the flit passing through the pre-opened latch to the output.)

Forward Latency: Single-Flit
Forward latency = one D-latch + one XOR2 gate.
(Figure: the flit's forward path through the already-open channel.)

Simulation #2: Contention
Both monitoring signals arrive almost simultaneously.
(Figure: something-coming-in-0 and something-coming-in-1 both asserted at the Monitor.)

Simulation #2: Contention (cont.)
Both monitoring signals request the mutex; assume channel #0 wins arbitration - channel #0 wins the mutex and opens.
(Figure: the Mutex granting channel #0.)

Simulation #2: Contention (cont.)
Both flits arrive. The flit on channel #0 goes through its pre-allocated (already-open) channel and is sent out; the flit on channel #1 arrives but is blocked.
(Figure: channel #0's flit passing through while channel #1 is held.)

Simulation #2: Contention (cont.)
Channel #0 finally releases the mutex - channel #1 wins the mutex and finally opens.
(Figure: the Mutex grant moving to channel #1.)

Simulation #2: Contention (cont.)
The flit on channel #1 gets through - channel #1 is now open, and its flit is sent out.
(Figure: channel #1's flit passing through to the output.)
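The two simulations above can be condensed into a small protocol-level model. This is an illustrative sketch only (the names are invented, and the real node releases the mutex as part of the output handshake, which is abstracted away here for the single-flit case):

```python
class EarlyArbNode:
    """Protocol-level sketch of the early-arbitration fan-in node.
    'Something-coming' notifications race ahead of the flits and request
    the mutex early, so the winning channel is pre-opened before its
    flit arrives; the losing channel stays blocked until release."""
    def __init__(self):
        self.waiting = []      # channels queued on the mutex, in arrival order
        self.owner = None      # channel currently holding the mutex
        self.sent = []

    def something_coming(self, ch):
        """Early arbitration: request the mutex on advance notification."""
        if self.owner is None:
            self.owner = ch            # wins immediately: channel pre-opened
        else:
            self.waiting.append(ch)

    def flit_arrives(self, ch, flit):
        """Return True if the flit passes through a pre-opened channel."""
        if self.owner != ch:
            return False               # blocked: another channel owns the mutex
        self.sent.append(flit)         # pre-opened channel: no arbitration wait
        # single-flit design: release the mutex once the flit is through
        self.owner = self.waiting.pop(0) if self.waiting else None
        return True
```

Replaying Simulation #2 with this model: both channels notify, channel #0 wins and is pre-opened, channel #1's flit is blocked until channel #0's flit passes and releases the mutex.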

New Arbitration Node: Multi-Flit Design
Structure largely the same as the single-flit design. Difference: the monitor also receives a tail flag (flit-type information) on the end0/end1 inputs.
(Figure: single-flit schematic extended with end0/end1 tail-flag inputs.)

Outline
Introduction
Background
New Asynchronous MoT Network: Overview of the Early Arbitration Approach; Monitoring Network; Design of the New Arbitration Node
Experimental Results: Simulation Setup; Network-Level Results
Conclusion and Future Work

Experimental Results: Overview
Two levels of evaluation: node-level (new arbitration node in isolation) and network-level (8x8 network with the new node).
Node-level evaluation (see paper for details): new arbitration node vs. two previous designs - Baseline [Horak/Nowick NOCS-10] and Predictive [Gill/Nowick NOCS-11]; 90nm ARM standard cells, gate-level SPICE simulation.
Network-level evaluation (our focus): three 8x8 MoT networks - Baseline, Predictive, New - each with 112 router nodes; modeled in structural technology-mapped Verilog (a more accurate model than in [Gill/Nowick NOCS-11]); 8 synthetic benchmarks covering a wide range of traffic patterns.

Benchmarks
8 diverse benchmarks, the same as those in NOCS-11, representing different network conditions.
Classification:
Three friendly benchmarks - (1) Shuffle, (2) Tornado and (7) Single-source broadcast [Dally '03] - no contention.
Three moderately adversarial benchmarks - (4) Simple alternation with overlap, (5) Random restricted broadcast with partial overlap, (8) Partial streaming with random interruption - no contention for some nodes, light or moderate contention for others.
Two most adversarial benchmarks - (3) All-to-all random and (6) Hotspot8 - heavy contention at some nodes.
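For reference, the shuffle and tornado patterns (benchmarks #1 and #2) have standard textbook definitions [Dally '03]; a sketch, assuming the deck uses the standard forms:

```python
def shuffle(src: int, n: int) -> int:
    """Shuffle permutation: rotate the source's address left by one bit
    (n must be a power of two)."""
    b = n.bit_length() - 1             # address width in bits
    return ((src << 1) | (src >> (b - 1))) & (n - 1)

def tornado(src: int, n: int) -> int:
    """Tornado permutation: each source targets the node
    (ceil(n/2) - 1) positions away around the ring."""
    return (src + n // 2 - 1) % n
```

Both are permutations (each destination receives from exactly one source), which is why they produce no contention at the fan-in nodes of the MoT.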

Network-Level Latency: Single-Flit Design
Moderate to significant improvement over all benchmarks: new vs. baseline, 23-30% improvement; new vs. predictive, 13-38% improvement.
(Chart: latency comparison at 25% network load - Baseline, Predictive, New.)

Network-Level Latency: Single-Flit Design
Performs well for benchmarks #3 and #6 (the adversarial cases): Predictive is even worse than Baseline (~20% higher latency), while New is better than Baseline (~25% lower latency).
(Chart: latency comparison at 25% network load - Baseline, Predictive, New.)

Network-Level Latency: Single-Flit Design
Excellent latency stability: provides predictable behavior. Network latency = ~1700ps across all benchmarks, through 6 router nodes + 5 hops - important for memory access in CMPs.
(Chart: latency comparison at 25% network load, with the ~1700ps level marked - Baseline, Predictive, New.)

Network-Level Throughput: Single-Flit Design
New vs. baseline: improvement of up to 17% on 6 benchmarks. New vs. predictive: comparable throughput over all benchmarks.
(Chart: saturation throughput - Baseline, Predictive, New.)

Network-Level Results: Multi-Flit Design
Fixed packet length = 3 flits/packet; results only for benchmarks #1 and #3. For both benchmarks: ~30% latency and ~14% throughput improvement.
(Chart: latency vs. input rate for benchmarks #1 and #3, baseline vs. new.)

Conclusion
Introduced a MoT network using early arbitration, addressing the system-latency bottleneck: observe newly-entering traffic (via a lightweight shadow monitoring network), then perform early arbitration + channel pre-allocation.
Detailed experimentation and analysis showed significant improvements in system latency - new vs. baseline: 23-30% across all benchmarks; new vs. predictive: up to 38%.

Future Work
Narrow the channel reservation window: decrease the time between channel reservation and flit arrival; increase network utilization.
Target different topologies: extend early arbitration to 2D-mesh, Clos networks, etc.
Build a complete GALS system: add mixed-timing interfaces; connect cores through the network.
More experiments: real-traffic benchmarks.

Back-up Slides

Strategy Comparison: Overview
Three network designs:
Baseline - [Horak/Nowick, NOCS-10] - the foundation of this research.
Predictive - [Gill/Nowick, NOCS-11] - a more recent design.
New - the proposed design.

Baseline Arbitration Node: Operation
Step #1: a flit arrives on an input channel and waits for arbitration to complete.
Step #2: arbitration resolves, and the flit is sent out on the output channel.
(Figure: two input channels feeding an arbiter into one output channel, shown in both steps.)

Predictive Arbitration Node: Operation
Default mode: similar operation to the baseline design - a flit arrives and waits for arbitration to complete before being sent out.
Biased mode (optimized for one input channel, by prediction): the biased channel is held open, so its flits are sent out without waiting; the non-biased channel is entirely blocked.
(Figure: datain0/1 with something-coming signals through the arbiter to dataout, in both modes.)

New Arbitration Node: Operation
Step #1: the advance notification arrives, completes arbitration, and opens the channel well before the actual flit arrives.
Step #2: the flit arrives; arbitration is already done, so flits are sent out without waiting.
(Figure: datain0/1 with something-coming signals through the arbiter to dataout, in both steps.)

The Role of the Monitoring Network
Predictive design [Gill/Nowick, NOCS-11]: facilitates mode change - from optimized (biased) to unoptimized (default) only; used for safety purposes only - plays a secondary role.
New design: the key component of the early arbitration strategy - directly initiates early arbitration; used for higher performance - especially system latency.