A Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution
|
|
|
- Arleen Edwards
- 10 years ago
- Views:
Transcription
1 A Low-Latency Asynchronous Interconnection Network with arly Arbitration Resolution Georgios Faldamis Cavium Inc. Gennette Gill D.. Shaw Research Weiwei Jiang Dept. of Computer Science Columbia University Steven M. Nowick Dept. of Computer Science Columbia University ACM/I Asia and South Pacific Design Automation Conf. (ASP-DAC 14)
2 Motivation for Networks-on-Chip Future of computing is multi-core 2 to 4 cores are common, 8 to 16 widely available e.g. Niagara 16-core, Intel 10-core Xeon, AMD 16-core Opteron xpected progression: hundreds or thousands of cores Trend towards complex systems-on-chip (SoC) Communication complexity: new limiting factor NoC design enables orthogonalization of concerns: Improves scalability - buses and crossbars unable to deliver desired bandwidth - global ad-hoc wiring does not scale to large systems Provides flexibility - handle pre-scheduled and dynamic traffic - route around faulty network nodes Facilitates design reuse - standard interfaces increase modularity, decrease design time 2
3 Key Active Research Challenges for NoCs Power consumption Will exceed future power budgets by a factor of 10x - [Owens I Micro-07] Global clocks: consume large fraction of overall power Complex clock-gating techniques - [Benini et al., TVLSI-02] Chips partitioned into multiple timing domains Difficult to integrate heterogeneous modules Dynamic voltage/frequency scaling (DVFS) for lower power - [Ogras/Marculescu DAC-08] A key performance bottleneck = latency Latency critical for on-chip memory access Important for chip multiprocessors (CMP s) 3
4 Potential Advantages of Asynchronous Design Lower power No clock power consumed Idle components consume no dynamic power - IBM/Columbia FIR filter [Tierno, Singh, Nowick, et al., ISSCC-02] Greater flexibility/modularity asier integration between multiple timing domains Supports reusable components - [Bainbridge/Furber, I Micro-02 Magazine] - [Dobkin/Ginosar, Async-04] Lower system latency No per-router clock synchronization no waiting for clock - [Sheibanyrad/Greiner et al., I Design & Test 08] - [Horak, Nowick, et al., NOCS-10] 4
5 Motivation for Our Research Target = interconnection network for CMP s Network between processors and cache memory GALS NoC: sync/async interfaces + async network Requires high performance Low system-level latency - Lightweight routers for low-latency High sustained throughput - Maximize steady-state throughput Target topology = variant MoT ( Mesh-of-Trees ) Tree topologies becoming widely used for CMP s: - XMT [Balkan/Vishkin et al., Hot Interconnects-07] - Single-cycle network [Rahimi, Benini, et al., DAT-11] - NOC-OUT [Grot, Falsafi, et al., I Micro-12] Our two main contributions: High-performance async network with advance arbitration Detailed comparative evaluation on 8 benchmarks 5 cores shared cache
6 Contributions (1) Mesh-of-Trees (MoT) network with early arbitration Target system-latency bottleneck Observe newly-entering traffic Perform early arbitration + channel pre-allocation Net benefit: bypass arbitration logic + pre-opened channel arly arbitration capability in fan-in router nodes Simple and fast operate as FIFO in many traffic scenarios network: Rapid advance notification of incoming data Fast and lightweight Key component for early arbitration 6
7 Contributions (2) Detailed experimentation and analysis arly arbitration network vs. baseline and predictive - baseline : [Horak/Nowick, NOCS-10] - predictive : [Gill/Nowick, NOCS-11] 8 diverse synthetic benchmarks - represent different network conditions Significant latency improvement and comparable throughput - New vs. baseline: 23-30% latency improvement - New vs. predictive: 13-38% latency improvement Low end-to-end system latency - ~1.7ns (at 25% load, 90nm): through 6 router nodes + 5 hops 7
8 Related Work: NoC Acceleration Techniques xpress virtual channels [Kumar/Peh, ISCA-07] Selective packets use dedicated fast channels Virtually bypass intermediate nodes improvements only against slow coarse-grained baseline: 3-cycle operation SMART NoC [Chen/Peh, DAT-13] Selective packets traverse multiple hops in one cycle requires advanced circuit-level techniques + aggressive timing assumptions Hybrid network [Modarressi/Arjomand, DAT-09] A normal packet-switched network + fast circuit-switched network Flits can switch between two sub-networks requires partitioned network (statically-allocated) + large circuit-switched setup time NoC using advanced bundles [Kumar et al., ICCD-07] Provides advanced information of flit arrival Closer to our approach advance bundles advance only one cycle per hop (unlike our approach) 8
9 Outline Introduction Background New Asynchronous MoT Network Overview of the arly Arbitration Approach Network Design of the New Arbitration Node xperimental Results Simulation Setup Network-Level Results Conclusion and Future Work 9
10 Background: Mesh-of-Trees (MoT) Variant Topology basics Fan-out and fan-in network inverse of classical MoT (Leighton) Two node types Routing: 1 input and 2 output channels Arbitration: 2 input and 1 output channels Routing features Deterministic wormhole routing Path examples shown in the figure No contention between distinct source/sink pairs Potential performance benefits Lower latency and higher throughput over 2D-mesh Shown to perform well for CMP s [Balkan/Vishkin, Trans. VLSI, Oct. 09], [Balkan/Vishkin, Hot Interconnects-07]
11 Background: Two Node Types Source Routing Req Ack Boolean Data 1 incoming handshaking channel Routing Primitive 2 outgoing handshaking channels Req0 Ack0 Data0 Req1 Ack1 Data1 Req0 Ack0 Data0 Req1 Ack1 Data1 2 incoming handshaking channels Arbitration Primitive Req Ack Data 1 outgoing handshaking channel Routing primitive 1 input channel and 2 output handshaking channels Route the input to one of the outputs Arbitration primitive 2 input and 1 output handshaking channels Merge two input streams into one output stream 11
12 Background: Asynchronous Protocols Sender Receiver Handshaking: transition signaling (two-phase) Two events per transaction - Req/Ack toggle Merits over level signaling (four-phase): - 1 roundtrip communication per data item - High throughput and low power Challenge of two-phase signaling: - designing lightweight implementations First communication Data encoding: single-rail bundled data Standard synchronous single-rail data + extra bundling req Merits of single-rail bundled data: - low power and very good coding efficiency - allow to re-use synchronous components Challenge: requires matched delay for bundling req - one-sided timing constraint: request must arrive after data is stable req ack req ack Second communication 12
13 Outline Introduction Background New Asynchronous MoT Network Overview of the arly Arbitration Approach Network Design of the New Arbitration Node xperimental Results Simulation Setup Network-Level Results Conclusion and Future Work 13
14 Overview: arly Arbitration Strategy Key network bottleneck System-latency - bottleneck of arbitration logic in fan-in nodes Basic strategy = anticipation Observe newly-entering traffic Do early arbitration + channel pre-allocation Net benefit: bypass arbitration logic Proposed network As soon as flit enters network: - all downstream nodes quickly notified (by a monitoring network) - fan-in nodes: initiate early arbitration + channel pre-allocation When flit arrives at each fan-in node: - quickly sent out through pre-allocated channel Routing nodes (unchanged) New arbitration nodes 14
15 Outline Introduction Background New Asynchronous MoT Network Overview of the arly Arbitration Approach Network Design of the New Arbitration Node xperimental Results Simulation Setup Network-Level Results Conclusion and Future Work 15
16 Network: Overview Purpose: rapid advance notification of incoming data Structure: lightweight shadow replica of MoT network Small monitoring control unit attached to each node - i.e. both routing and arbitration Fast and lightweight Implemented by several gates for each control unit Different role for fan-out and fan-in monitoring Fan-out: fast forward early notification without using it Fan-in: fast forward and use it for early arbitration 16
17 Network: Structure Structure: a shadow replica of MoT network Small and fast monitoring control unit attached for each node Channel fan-out root Channel Channels control attached to each node Channels x Channels fan-in root Channel 17
18 Network: Operation When a flit enters the network arly notification generated and fast forwarded arly notification generated at fan-out root Channel fan-out root Channel arly notification traces same path as flits Channels Fan-in nodes preallocates the channel Channels Channels fan-in root Channel 18
19 Outline Introduction Background New Asynchronous MoT Network Overview of the arly Arbitration Approach Network Design of the New Arbitration Node xperimental Results Simulation Setup Network-Level Results Conclusion and Future Work 19
20 New Arbitration Node: Circuit-Level ackout0 ackout1 reqin0 reqin1 datain0 datain1 L3 D L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D RG takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout dataout 20
21 New Arbitration Node: Interfaces ackout0 ackout1 L3 D 2 input data channels reqin0 reqin1 datain0 datain1 L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D RG takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin 1 output data channel reqout dataout 21
22 New Arbitration Node: Interfaces (cont.) channels: provide advance info. on incoming traffic ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 22
23 New Arbitration Node: Structure Mutex: resolves arbitration between 2 input channels ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 23
24 New Arbitration Node: Structure (cont.) : requests/releases Mutex Key component to enable early arbitration ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 24
25 New Arbitration Node: Structure (cont.) Input channel latch + control: Two functions: (i) enables channel pre-allocation, (ii) flow control ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 25
26 New Arbitration Node: Structure (cont.) control: fast forwards early notification ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 26
27 New Arbitration Node: Key Feature (1) arly arbitration capability: signals initiate arbitration, before actual flit arrival ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 27
28 New Arbitration Node: Key Feature (2) Highly optimized forward path: contains only 1 pre-opened latch = FIFO stage ackout0 ackout1 Latch preopened by early arbitration L3 D Forward path reqin0 reqin1 datain0 datain1 L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout dataout RG 28
29 Simulation: Overview Two simulations #1. Single-flit scenario - friendly case - illustrate how early arbitration works #2. Contention between two input channels - more advanced and adversarial case - illustrate how to resolve contention 29
30 Simulation #1: Single-Flit Step #1: signal arrives (well before actual flit) ackout0 ackout1 Initiates early arbitration reqin0 reqin1 datain0 datain1 L3 D L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en uickly forwarded D takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout dataout RG 30
31 Simulation #1: Single-Flit (cont.) Step #2: Completes early arbitration ackout0 ackout1 Wins arbitration Opens channel reqin0 reqin1 L3 D L4 D S R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout datain0 datain1 0 1 D dataout RG 31
32 Simulation #1: Single-Flit (cont.) Step #3: Flit arrives and gets through pre-allocated channel ackout0 ackout1 Channel already opened Flit arrives reqin0 reqin1 datain0 datain1 L3 D L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout Flit sent out dataout RG 32
33 Forward Latency: Single-Flit ackout0 ackout1 Channel already opened Flit arrives reqin0 reqin1 datain0 datain1 L3 D L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout Flit sent out dataout Forward latency = D-latch + XOR2 gate RG 33
34 Simulation #2: Contention Both monitoring signals arrive almost simultaneously ackout0 ackout1 L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 34
35 Simulation #2: Contention (cont.) ackout0 ackout1 Channel #0 wins mutex Channel #0 Opens reqin0 reqin1 datain0 datain1 Both monitoring signals request mutex Assume channel #0 wins arbitration L3 D L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D RG takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout dataout 35
36 Simulation #2: Contention (cont.) Flit on channel #0 arrives and goes through pre-allocated channel Flit on channel #1 arrives but is blocked ackout0 ackout1 Channel #0 already opened L3 D Both flits arrive reqin0 reqin1 datain0 datain1 L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin Channel #1 is blocked reqout Channel #0 flit sent out dataout RG 36
37 Simulation #2: Contention (cont.) Channel #0 finally releases mutex channel #1 wins ackout0 ackout1 Channel #1 wins mutex L3 D Channel #1 finally opens reqin0 reqin1 L4 D S R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout datain0 datain1 0 1 D dataout RG 37
38 Simulation #2: Contention (cont.) Flit on channel #1 gets through ackout0 ackout1 Channel #1 is now opened reqin0 reqin1 datain0 datain1 L3 D L4 D S 0 1 R 0 zerowins mux_select preackout0 Mutex L1 preackout1 1 L2 onewins Req-Latch mutex-req0 mutex-req1 output-en D takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqout Channel #1 flit sent out dataout RG 38
39 New Arbitration Node: Multi-Flit Design end0 end1 Structure largely the same as single-flit design Different : receives tail flag ackout0 ackout1 Flit type info: tail flag L3 D L4 D preackout0 0 zerowins Mutex preackout1 1 onewins mutex-req0 mutex-req1 output-en takeover Monitor somethingcoming-in-0 somethingcoming-in-1 somethingcoming-out ackin reqin0 reqin1 S R mux_select L1 L2 Req-Latch reqout datain0 datain1 0 1 D dataout RG 39
40 Outline Introduction Background New Asynchronous MoT Network Overview of the arly Arbitration Approach Network Design of the New Arbitration Node xperimental Results Simulation Setup Network-Level Results Conclusion and Future Work 40
41 xperimental Results: Overview Two levels of evaluation: Node-level: new arbitration node in isolation Network-level: 8 8 network with new node Node-level evaluation: see paper for details New arbitration node vs. two previous designs: - Baseline [Horak/Nowick NOCS-10] - Predictive [Gill/Nowick NOCS-11] 90nm ARM standard cells, gate-level SPIC simulation Network-level evaluation: our focus Three 8 8 MoT networks: each has 112 router nodes - Baseline, Predictive, New Modeled in structural technology-mapped Verilog - more accurate model than in [Gill/Nowick NOCS-11] 8 synthetic benchmarks: a wide range of traffic patterns 41
42 Benchmarks 8 diverse benchmarks The same as those in NOCS-11 Represent different network conditions Classification Three friendly benchmarks: - (1) Shuffle, (2) Tornado and (7) Single Source broadcast [Dally`03] - No contention Three moderately adversarial benchmarks: - (4) Simple alternation with overlap - (5) Random restricted broadcast with partial overlap - (8) Partial streaming with random interruption - No contention for some nodes, light or moderate contention for others Two most adversarial benchmarks: - (3) All-to-all random and (6) Hotspot8 - Heavy contention at some nodes 42
43 Network-Level Latency: Single-Flit Design Moderate to significant improvement over all benchmarks New vs. baseline: 23-30% improvement New vs. predictive: 13-38% improvement Latency Comparison for 25% Network Load Baseline Predictive New 43
44 Network-Level Latency: Single-Flit Design Perform well for benchmark #3 and #6 (adversarial cases) Predictive: even worse than baseline (~20% higher latency) New: better than baseline (~25% lower latency) Latency Comparison for 25% Network Load Baseline Predictive New 44
45 Network-Level Latency: Single-Flit Design xcellent latency stability: provides predictable behavior Network latency = ~1700ps, across all benchmarks through 6 router nodes + 5 hops Important for memory access in CMP s Latency Comparison for 25% Network Load 1700 Baseline Predictive New 45
46 Network-Level Throughput: Single-Flit Design New vs. baseline: improvement up to 17% on 6 benchmarks New vs. predictive: comparable throughput over all benchmarks Saturation Throughput Baseline Predictive New 46
47 Network-Level Results: Multi-Flit Design Fixed packet length = 3 flits/packet Results only for benchmark #1 and #3 For both benchmarks: ~30% latency and ~14% throughput improvement Latency vs. Input Rate #1 baseline #3 baseline #3 new #1 new 47
48 Conclusion Introduced a MoT network using early arbitration Address system-latency bottleneck Observe newly entering traffic - via lightweight shadow monitoring network Perform early arbitration + channel pre-allocation Detailed experimentation and analysis Significant improvements in system-latency - New vs. baseline: 23-30% across all benchmarks - New vs. predictive: up to 38% 48
49 Future Work Narrow channel reservation window Decrease time between channel reservation and flit arrival Increase network utility Target different topology xtend early arbitration to 2D-mesh, Clos network, etc. Build a complete GALS system Add mixed-timing interface connect cores by the network More experiments Real traffic benchmarks 49
50 Back-up Slides 50
51 Strategy Comparison: Overview Three network designs Baseline - [Horak/Nowick, NOCS-10] - foundation of the research Predictive - [Gill/Nowick, NOCS-11] - a more recent design New - the proposed design 51
52 Baseline Arbitration Node: Operation Step #1: Waiting for arbitration to complete Input channel 0 Flit arrives Input channel 1 a r b Output channel Step #2: Arbitration resolves Input channel 0 Input channel 1 a r b Output channel Flit sent out 52
53 Predictive Arbitration Node: Operation Datain0 Something-coming-0 Datain1 Something-coming-1 Waiting for arbitration to complete Flit arrives a r b Biased channel: held open Flits sent out without waiting Dataout Flit sent out Default Mode: similar operation as baseline design Datain0 Something-coming-0 Flit arrives Dataout Datain1 Something-coming-1 Flit sent out Non-biased channel: entirely blocked Biased Mode: optimized for one input channel (by prediction) 53
54 New Arbitration Node: Operation Step #1: Advance notification completes arbitration and opens the channel well before actual flit arrival arrives Datain0 Something-coming-0 Datain1 Something-coming-1 a r b Dataout Step #2: Flit arrives Datain0 Something-coming-0 Datain1 Something-coming-1 Flits sent out without waiting a r b Dataout Flit sent out Arbitration already done 54
55 The Role of Network Predictive design [Gill/Nowick, NOCS-11] Facilitates mode change - from optimized (biased) to unoptimized (default) only For safety purpose only - plays secondary role New design Key component of early arbitration strategy - directly initiates early arbitration For higher performance - especially system-latency 55
Asynchronous Bypass Channels
Asynchronous Bypass Channels Improving Performance for Multi-Synchronous NoCs T. Jain, P. Gratz, A. Sprintson, G. Choi, Department of Electrical and Computer Engineering, Texas A&M University, USA Table
Lecture 18: Interconnection Networks. CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)
Lecture 18: Interconnection Networks CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Project deadlines: - Mon, April 2: project proposal: 1-2 page writeup - Fri,
Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip
Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip Ms Lavanya Thunuguntla 1, Saritha Sapa 2 1 Associate Professor, Department of ECE, HITAM, Telangana
Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors
2011 International Symposium on Computer Networks and Distributed Systems (CNDS), February 23-24, 2011 Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors Atefeh Khosravi,
Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng
Architectural Level Power Consumption of Network Presenter: YUAN Zheng Why Architectural Low Power Design? High-speed and large volume communication among different parts on a chip Problem: Power consumption
From Bus and Crossbar to Network-On-Chip. Arteris S.A.
From Bus and Crossbar to Network-On-Chip Arteris S.A. Copyright 2009 Arteris S.A. All rights reserved. Contact information Corporate Headquarters Arteris, Inc. 1741 Technology Drive, Suite 250 San Jose,
On-Chip Interconnection Networks Low-Power Interconnect
On-Chip Interconnection Networks Low-Power Interconnect William J. Dally Computer Systems Laboratory Stanford University ISLPED August 27, 2007 ISLPED: 1 Aug 27, 2007 Outline Demand for On-Chip Networks
Demystifying Data-Driven and Pausible Clocking Schemes
Demystifying Data-Driven and Pausible Clocking Schemes Robert Mullins Computer Architecture Group Computer Laboratory, University of Cambridge ASYNC 2007, 13 th IEEE International Symposium on Asynchronous
Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09
Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors NoCArc 09 Jesús Camacho Villanueva, José Flich, José Duato Universidad Politécnica de Valencia December 12,
Interconnection Networks
Advanced Computer Architecture (0630561) Lecture 15 Interconnection Networks Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interconnection Networks: Multiprocessors INs can be classified based on: 1. Mode
A Pausible Bisynchronous FIFO for GALS Systems
2015 Symposium on Asynchronous Circuits and Systems A Pausible Bisynchronous FIFO for GALS Systems Ben Keller*, Ma@hew FojCk*, Brucek Khailany* *NVIDIA CorporaCon University of California, Berkeley May
Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!
Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel
Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors
Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors Hyungjun Kim, 1 Arseniy Vitkovsky, 2 Paul V. Gratz, 1 Vassos Soteriou 2 1 Department of Electrical and Computer Engineering, Texas
Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy
Hardware Implementation of Improved Adaptive NoC Rer with Flit Flow History based Load Balancing Selection Strategy Parag Parandkar 1, Sumant Katiyal 2, Geetesh Kwatra 3 1,3 Research Scholar, School of
Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip
Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip Manjunath E 1, Dhana Selvi D 2 M.Tech Student [DE], Dept. of ECE, CMRIT, AECS Layout, Bangalore, Karnataka,
A CDMA Based Scalable Hierarchical Architecture for Network- On-Chip
www.ijcsi.org 241 A CDMA Based Scalable Hierarchical Architecture for Network- On-Chip Ahmed A. El Badry 1 and Mohamed A. Abd El Ghany 2 1 Communications Engineering Dept., German University in Cairo,
A Generic Network Interface Architecture for a Networked Processor Array (NePA)
A Generic Network Interface Architecture for a Networked Processor Array (NePA) Seung Eun Lee, Jun Ho Bahn, Yoon Seok Yang, and Nader Bagherzadeh EECS @ University of California, Irvine Outline Introduction
Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip
Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip Cristina SILVANO [email protected] Politecnico di Milano, Milano (Italy) Talk Outline
In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches
In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches Xi Chen 1, Zheng Xu 1, Hyungjun Kim 1, Paul V. Gratz 1, Jiang Hu 1, Michael Kishinevsky 2 and Umit Ogras 2
Packetization and routing analysis of on-chip multiprocessor networks
Journal of Systems Architecture 50 (2004) 81 104 www.elsevier.com/locate/sysarc Packetization and routing analysis of on-chip multiprocessor networks Terry Tao Ye a, *, Luca Benini b, Giovanni De Micheli
System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1
System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect
Chapter 2. Multiprocessors Interconnection Networks
Chapter 2 Multiprocessors Interconnection Networks 2.1 Taxonomy Interconnection Network Static Dynamic 1-D 2-D HC Bus-based Switch-based Single Multiple SS MS Crossbar 2.2 Bus-Based Dynamic Single Bus
- Nishad Nerurkar. - Aniket Mhatre
- Nishad Nerurkar - Aniket Mhatre Single Chip Cloud Computer is a project developed by Intel. It was developed by Intel Lab Bangalore, Intel Lab America and Intel Lab Germany. It is part of a larger project,
Question: 3 When using Application Intelligence, Server Time may be defined as.
1 Network General - 1T6-521 Application Performance Analysis and Troubleshooting Question: 1 One component in an application turn is. A. Server response time B. Network process time C. Application response
Low-Overhead Hard Real-time Aware Interconnect Network Router
Low-Overhead Hard Real-time Aware Interconnect Network Router Michel A. Kinsy! Department of Computer and Information Science University of Oregon Srinivas Devadas! Department of Electrical Engineering
Optimizing Configuration and Application Mapping for MPSoC Architectures
Optimizing Configuration and Application Mapping for MPSoC Architectures École Polytechnique de Montréal, Canada Email : [email protected] 1 Multi-Processor Systems on Chip (MPSoC) Design Trends
Switched Interconnect for System-on-a-Chip Designs
witched Interconnect for ystem-on-a-chip Designs Abstract Daniel iklund and Dake Liu Dept. of Physics and Measurement Technology Linköping University -581 83 Linköping {danwi,dake}@ifm.liu.se ith the increased
Lustre Networking BY PETER J. BRAAM
Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information
Interconnection Generation for System-on-Chip Design and Design Space Exploration
Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. G. Fettweis Interconnection Generation for System-on-Chip Design and Design Space Exploration Dipl.-Ing. Markus Winter Vodafone Chair for Mobile
Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs
Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs Antoni Roca, Jose Flich Parallel Architectures Group Universitat Politechnica de Valencia (UPV) Valencia, Spain Giorgos Dimitrakopoulos
Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip
Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC
Interconnection Network Design
Interconnection Network Design Vida Vukašinović 1 Introduction Parallel computer networks are interesting topic, but they are also difficult to understand in an overall sense. The topological structure
Low Power AMD Athlon 64 and AMD Opteron Processors
Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD
Applying the Benefits of Network on a Chip Architecture to FPGA System Design
Applying the Benefits of on a Chip Architecture to FPGA System Design WP-01149-1.1 White Paper This document describes the advantages of network on a chip (NoC) architecture in Altera FPGA system design.
Routing in packet-switching networks
Routing in packet-switching networks Circuit switching vs. Packet switching Most of WANs based on circuit or packet switching Circuit switching designed for voice Resources dedicated to a particular call
Interconnection Networks
CMPT765/408 08-1 Interconnection Networks Qianping Gu 1 Interconnection Networks The note is mainly based on Chapters 1, 2, and 4 of Interconnection Networks, An Engineering Approach by J. Duato, S. Yalamanchili,
Computer Engineering: MS Program Overview, Fall 2013
Computer Engineering: MS Program Overview, Fall 2013 Prof. Steven Nowick ([email protected]) Chair, (on sabbatical) Prof. Charles Zukowski ([email protected]) Acting Chair, Overview of Program The
Architectures and Platforms
Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation
Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow
Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow Bradley R. Quinton Dept. of Electrical and Computer Engineering University of British Columbia [email protected]
Power Reduction Techniques in the SoC Clock Network. Clock Power
Power Reduction Techniques in the SoC Network Low Power Design for SoCs ASIC Tutorial SoC.1 Power Why clock power is important/large» Generally the signal with the highest frequency» Typically drives a
Based on Computer Networking, 4 th Edition by Kurose and Ross
Computer Networks Ethernet Hubs and Switches Based on Computer Networking, 4 th Edition by Kurose and Ross Ethernet dominant wired LAN technology: cheap $20 for NIC first widely used LAN technology Simpler,
other. A B AP wired network
1 Routing and Channel Assignment in Multi-Channel Multi-Hop Wireless Networks with Single-NIC Devices Jungmin So + Nitin H. Vaidya Department of Computer Science +, Department of Electrical and Computer
Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)
19th International Symposium on Computer Architecture and High Performance Computing Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP) Seung Eun Lee, Jun Ho Bahn, and
On-Chip Communications Network Report
On-Chip Communications Network Report ABSTRACT This report covers the results of an independent, blind worldwide survey covering on-chip communications networks (OCCN), defined as is the entire interconnect
Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
Quality of Service (QoS) for Asynchronous On-Chip Networks
Quality of Service (QoS) for synchronous On-Chip Networks Tomaz Felicijan and Steve Furber Department of Computer Science The University of Manchester Oxford Road, Manchester, M13 9PL, UK {felicijt,sfurber}@cs.man.ac.uk
Interconnection Network
Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network
Multihoming and Multi-path Routing. CS 7260 Nick Feamster January 29. 2007
Multihoming and Multi-path Routing CS 7260 Nick Feamster January 29. 2007 Today s Topic IP-Based Multihoming What is it? What problem is it solving? (Why multihome?) How is it implemented today (in IP)?
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology
Topics of Chapter 5 Sequential Machines Memory elements Memory elements. Basics of sequential machines. Clocking issues. Two-phase clocking. Testing of combinational (Chapter 4) and sequential (Chapter
Computer Network. Interconnected collection of autonomous computers that are able to exchange information
Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.
3D On-chip Data Center Networks Using Circuit Switches and Packet Switches
3D On-chip Data Center Networks Using Circuit Switches and Packet Switches Takahide Ikeda Yuichi Ohsita, and Masayuki Murata Graduate School of Information Science and Technology, Osaka University Osaka,
Protagonist International Journal of Management And Technology (PIJMT) Online ISSN- 2394-3742. Vol 2 No 3 (May-2015) Active Queue Management
Protagonist International Journal of Management And Technology (PIJMT) Online ISSN- 2394-3742 Vol 2 No 3 (May-2015) Active Queue Management For Transmission Congestion control Manu Yadav M.Tech Student
Photonic Networks for Data Centres and High Performance Computing
Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore
PCI Express Overview. And, by the way, they need to do it in less time.
PCI Express Overview Introduction This paper is intended to introduce design engineers, system architects and business managers to the PCI Express protocol and how this interconnect technology fits into
Scalability evaluation of barrier algorithms for OpenMP
Scalability evaluation of barrier algorithms for OpenMP Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman and Haoqiang H. Jin High Performance Computing and Tools Group (HPCTools) Computer Science
Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project
Effects of Filler Traffic In IP Networks Adam Feldman April 5, 2001 Master s Project Abstract On the Internet, there is a well-documented requirement that much more bandwidth be available than is used
Network Structure or Topology
Volume 1, Issue 2, July 2013 International Journal of Advance Research in Computer Science and Management Studies Research Paper Available online at: www.ijarcsms.com Network Structure or Topology Kartik
The Sierra Clustered Database Engine, the technology at the heart of
A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel
CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING
CHAPTER 6 CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING 6.1 INTRODUCTION The technical challenges in WMNs are load balancing, optimal routing, fairness, network auto-configuration and mobility
Router Architectures
Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components
CCNA R&S: Introduction to Networks. Chapter 5: Ethernet
CCNA R&S: Introduction to Networks Chapter 5: Ethernet 5.0.1.1 Introduction The OSI physical layer provides the means to transport the bits that make up a data link layer frame across the network media.
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP SWAPNA S 2013 EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP A
The Quality of Internet Service: AT&T s Global IP Network Performance Measurements
The Quality of Internet Service: AT&T s Global IP Network Performance Measurements In today's economy, corporations need to make the most of opportunities made possible by the Internet, while managing
Optical interconnection networks with time slot routing
Theoretical and Applied Informatics ISSN 896 5 Vol. x 00x, no. x pp. x x Optical interconnection networks with time slot routing IRENEUSZ SZCZEŚNIAK AND ROMAN WYRZYKOWSKI a a Institute of Computer and
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
TCP for Wireless Networks
TCP for Wireless Networks Outline Motivation TCP mechanisms Indirect TCP Snooping TCP Mobile TCP Fast retransmit/recovery Transmission freezing Selective retransmission Transaction oriented TCP Adapted
TCP and Wireless Networks Classical Approaches Optimizations TCP for 2.5G/3G Systems. Lehrstuhl für Informatik 4 Kommunikation und verteilte Systeme
Chapter 2 Technical Basics: Layer 1 Methods for Medium Access: Layer 2 Chapter 3 Wireless Networks: Bluetooth, WLAN, WirelessMAN, WirelessWAN Mobile Networks: GSM, GPRS, UMTS Chapter 4 Mobility on the
Principles and characteristics of distributed systems and environments
Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single
Module 15: Network Structures
Module 15: Network Structures Background Topology Network Types Communication Communication Protocol Robustness Design Strategies 15.1 A Distributed System 15.2 Motivation Resource sharing sharing and
Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT:
Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT: In view of the fast-growing Internet traffic, this paper propose a distributed traffic management
Solving I/O Bottlenecks to Enable Superior Cloud Efficiency
WHITE PAPER Solving I/O Bottlenecks to Enable Superior Cloud Efficiency Overview...1 Mellanox I/O Virtualization Features and Benefits...2 Summary...6 Overview We already have 8 or even 16 cores on one
Optimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
SAN Conceptual and Design Basics
TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer
Computer Systems Structure Input/Output
Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices
An Active Packet can be classified as
Mobile Agents for Active Network Management By Rumeel Kazi and Patricia Morreale Stevens Institute of Technology Contact: rkazi,[email protected] Abstract-Traditionally, network management systems
OpenSoC Fabric: On-Chip Network Generator
OpenSoC Fabric: On-Chip Network Generator Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf MODSIM 2014 Presentation
TRILL Large Layer 2 Network Solution
TRILL Large Layer 2 Network Solution Contents 1 Network Architecture Requirements of Data Centers in the Cloud Computing Era... 3 2 TRILL Characteristics... 5 3 Huawei TRILL-based Large Layer 2 Network
DDR subsystem: Enhancing System Reliability and Yield
DDR subsystem: Enhancing System Reliability and Yield Agenda Evolution of DDR SDRAM standards What is the variation problem? How DRAM standards tackle system variability What problems have been adequately
Low latency synchronization through speculation
Low latency synchronization through speculation D.J.Kinniment, and A.V.Yakovlev School of Electrical and Electronic and Computer Engineering, University of Newcastle, NE1 7RU, UK {David.Kinniment,Alex.Yakovlev}@ncl.ac.uk
Why the Network Matters
Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing
A Seamless Handover Mechanism for IEEE 802.16e Broadband Wireless Access
A Seamless Handover Mechanism for IEEE 802.16e Broadband Wireless Access Kyung-ah Kim 1, Chong-Kwon Kim 2, and Tongsok Kim 1 1 Marketing & Technology Lab., KT, Seoul, Republic of Korea, {kka1,tongsok}@kt.co.kr
An Ultra-low low energy asynchronous processor for Wireless Sensor Networks
An Ultra-low low energy asynchronous processor for Wireless Sensor Networks.Necchi,.avagno, D.Pandini,.Vanzago Politecnico di Torino ST Microelectronics Wireless Sensor Networks - Ad-hoc wireless networks
Adapting Distributed Hash Tables for Mobile Ad Hoc Networks
University of Tübingen Chair for Computer Networks and Internet Adapting Distributed Hash Tables for Mobile Ad Hoc Networks Tobias Heer, Stefan Götz, Simon Rieche, Klaus Wehrle Protocol Engineering and
Recovery Modeling in MPLS Networks
Proceedings of the Int. Conf. on Computer and Communication Engineering, ICCCE 06 Vol. I, 9-11 May 2006, Kuala Lumpur, Malaysia Recovery Modeling in MPLS Networks Wajdi Al-Khateeb 1, Sufyan Al-Irhayim
What is a System on a Chip?
What is a System on a Chip? Integration of a complete system, that until recently consisted of multiple ICs, onto a single IC. CPU PCI DSP SRAM ROM MPEG SoC DRAM System Chips Why? Characteristics: Complex
