Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs

Similar documents
Applying the Benefits of Network on a Chip Architecture to FPGA System Design

Asynchronous Bypass Channels

Networking Virtualization Using FPGAs

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Lecture 18: Interconnection Networks. CMU : Parallel Computer Architecture and Programming (Spring 2012)

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

On-Chip Interconnection Networks Low-Power Interconnect

Interconnection Network Design

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

Interconnection Generation for System-on-Chip Design and Design Space Exploration

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)

Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors

Qsys and IP Core Integration

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP

Switched Interconnect for System-on-a-Chip Designs

A Generic Network Interface Architecture for a Networked Processor Array (NePA)

Low-Overhead Hard Real-time Aware Interconnect Network Router

Scalability and Classifications

What is a System on a Chip?

Interconnection Network

LogiCORE IP AXI Performance Monitor v2.00.a

SoC IP Interfaces and Infrastructure A Hybrid Approach

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator

SOCWIRE: A SPACEWIRE INSPIRED FAULT TOLERANT NETWORK-ON-CHIP FOR RECONFIGURABLE SYSTEM-ON-CHIP DESIGNS

University of Castilla-La Mancha

Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip

PEX 8748, PCI Express Gen 3 Switch, 48 Lanes, 12 Ports

DigiPoints Volume 1. Student Workbook. Module 4 Bandwidth Management

Computer Systems Structure Input/Output

Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Design of a High Speed Communications Link Using Field Programmable Gate Arrays

Quality of Service (QoS) for Asynchronous On-Chip Networks

Optimizing Configuration and Application Mapping for MPSoC Architectures

Architectures and Platforms

Computer Network. Interconnected collection of autonomous computers that are able to exchange information

Asynchronous IC Interconnect Network Design and Implementation Using a Standard ASIC Flow

Module 5. Broadcast Communication Networks. Version 2 CSE IIT, Kharagpur

PROGETTO DI SISTEMI ELETTRONICI DIGITALI. Digital Systems Design. Digital Circuits Advanced Topics

Header Parsing Logic in Network Switches Using Fine and Coarse-Grained Dynamic Reconfiguration Strategies

Open Flow Controller and Switch Datasheet

A Low-Radix and Low-Diameter 3D Interconnection Network Design

- Nishad Nerurkar. - Aniket Mhatre

Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems

Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware

How Router Technology Shapes Inter-Cloud Computing Service Architecture for The Future Internet

Solving Network Challenges

Interconnection Networks

Using FPGAs to Design Gigabit Serial Backplanes. April 17, 2002

Switch Fabric Implementation Using Shared Memory

A Low Latency Library in FPGA Hardware for High Frequency Trading (HFT)

SCSI vs. Fibre Channel White Paper

OpenSoC Fabric: On-Chip Network Generator

路 論 Chapter 15 System-Level Physical Design

Lecture 7: Clocking of VLSI Systems

Threshold-based Exhaustive Round-robin for the CICQ Switch with Virtual Crosspoint Queues

2.1 CAN Bit Structure The Nominal Bit Rate of the network is uniform throughout the network and is given by:

White Paper Increase Flexibility in Layer 2 Switches by Integrating Ethernet ASSP Functions Into FPGAs

FPGA. AT6000 FPGAs. Application Note AT6000 FPGAs. 3x3 Convolver with Run-Time Reconfigurable Vector Multiplier in Atmel AT6000 FPGAs.

FPGA Clocking. Clock related issues: distribution generation (frequency synthesis) multiplexing run time programming domain crossing

Topological Properties

Nexus: An Asynchronous Crossbar Interconnect for Synchronous System-on-Chip Designs

Lecture 2 Parallel Programming Platforms

Enabling Open-Source High Speed Network Monitoring on NetFPGA

Behavior Analysis of Multilayer Multistage Interconnection Network With Extra Stages

Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors

Aims and Objectives. E 3.05 Digital System Design. Course Syllabus. Course Syllabus (1) Programmable Logic

Real-time Processor Interconnection Network for FPGA-based Multiprocessor System-on-Chip (MPSoC)

Bloom Filter based Inter-domain Name Resolution: A Feasibility Study

Maximizing Server Storage Performance with PCI Express and Serial Attached SCSI. Article for InfoStor November 2003 Paul Griffith Adaptec, Inc.

CMS Level 1 Track Trigger

Storage at a Distance; Using RoCE as a WAN Transport

Introduction to System-on-Chip

Interconnection Networks

Computer System Design. System-on-Chip

A case study of mobile SoC architecture design based on transaction-level modeling

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

SpW-10X Network Performance Testing. Peter Mendham, Jon Bowyer, Stuart Mills, Steve Parkes. Space Technology Centre University of Dundee

7 Series FPGA Overview

PCI Express Overview. And, by the way, they need to do it in less time.

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)

Transcription:

Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs Antoni Roca, Jose Flich Parallel Architectures Group Universitat Politechnica de Valencia (UPV) Valencia, Spain Giorgos Dimitrakopoulos Electrical and Computer Engineering Dept. Democritus University of Thrace (DUTH) Xanthi, Greece dimitrak@ee.duth.gr G. Dimitrakopoulos - FPL 2012 1

Outline Interconnecting System-on-Chips implemented on top of FPGAs Network-on-FPGAs Constraints set by the FPGA architecture Elastic switch architecture Step-by-step construction Layout freedom due to distributed nature Experimental results Conclusions G. Dimitrakopoulos - FPL 2012 2

The need for scalable soft interconnect FPGAs can host a complete system-on-chip Platform FPGAs System integration and cores communication is an increasingly complex problem Already well-defined abstractions (AXI) are used Need scalability both at the physical level and at the logical-abstraction-level Apply networking principles to the on-fpga environment Idea already adopted in the ASIC domain Can the basic architectures migrate to the FPGA environment? G. Dimitrakopoulos - FPL 2012 3

Network-on-FPGAs Switches Links The soft interconnection network is the glue that binds together the components of the system The NoC parallelizes communication using switches connected with pointto-point links Data are packetized and transmitted Network Interfaces flit-by-flit on the links G. Dimitrakopoulos - FPL 2012 4

Baseline switch architecture Tasks can be implemented in a single cycle Merged arbiter multiplexer (FPL 2011) helps in this direction by unifying SA and ST G. Dimitrakopoulos - FPL 2012 5

Mapping NoC Switches on FPGAs SA, ST and LRC are mapped on LUTs and FFs of FPGA Delay affected heavily on high-radix switches Word width does not increase delay much For buffers we have many options Use RAM macros -> Poor Utilization of a scarce resource Use Distributed RAM -> Better utilization-overhead tradeoff Use FFs -> High overhead Abundant wiring for wide links Initial analysis done by M. Papamichael, J. Hoe, ACM FPGA 2012 G. Dimitrakopoulos - FPL 2012 6

NoC Switch Design Guidelines for FPGAs Utilize wide datapaths taking advantage of the abundant wiring without violating the area constraints. This approach benefits also packet latency by decreasing its serialization part Use the appropriate radix for the switches that does not limit the maximum clock frequency of the network. As long as the clock frequency constraint is not violated highradix switches can be built further reducing network latency Balanced pipelines is a good option (state-of-the-art targets single cycle implementations) Use cautiously the storage resources for buffering without negatively affecting network throughput State-of-the-art Uses LUT RAM for flit buffers Allow any form of flexibility in the implementation that would not stress too much the corresponding CAD tools G. Dimitrakopoulos - FPL 2012 7

Main idea Allow for fine-grained switch decomposition Build switch from identical modules that operate independently Each module (called AC) can store data, arbitrate and switch data independently Distributed switch Each AC uses is mapped only on the logic array (LUTs and registers) leaving any of the scarce RAM resource free to other modules of the system G. Dimitrakopoulos - FPL 2012 8

AC-based switch Step 1 1 2 Move input buffers to the upstream output port G. Dimitrakopoulos - FPL 2012 9

AC-based switch Step 2 2 3 Retime input buffers inside the crossbar s multiplexers Arbitration done per stage G. Dimitrakopoulos - FPL 2012 10

AC-based switch Step 3 3 4 Balance wire length Data transferred using an elastic request/ready pipeline G. Dimitrakopoulos - FPL 2012 11

Elastic AC module Data transferred when request and ready are both true The two incoming branches are round-robin arbitrated Backpressure propagates only one cycle backwards Two registers per output are needed to catch any in-flight data Effectively the 2 registers form a 2-slot FIFO More registers can be added to act as a larger FIFO if needed for performance One register per elastic stage is possible but reduces the throughput to half G. Dimitrakopoulos - FPL 2012 12

AC radix: Latency vs Buffering Latency affected by clock freq. and number of cycles to cross a switch High-radix AC modules reduces the number of cycles per switch but increase clock frequency Connecting wires become longer High-radix AC modules require less the buffering G. Dimitrakopoulos - FPL 2012 13

Distributed Layout Small pressure to the place & route CAD tools Even if starting from a regular topology the final placement can be arbitrary without any impact on clock frequency G. Dimitrakopoulos - FPL 2012 14

Results 5x5 switches with DOR for a 2D mesh on a Virtex 5 FPGA DESA offers smaller area and delay Delay benefits come from pipelining and from simple logic per pipeline stage The delay in each stage is mostly determined by the delay of the links If we pipeline the baseline for speed this will increase RTT and will add significant area overhead due to the extra buffers needed to cover RTT DESA delay linearly increases with the link distance between AC modules G. Dimitrakopoulos - FPL 2012 15

Network-level performance Network performance on a 4x4 2D mesh Uniform and Bit-complement traffic Both switches have equal number of buffers Baseline slightly lower latency at low loads DESA needs 2-cycles per switch but a 2x the clock frequency So absolute latency in ns is better or the same Same saturation throughput DESA has 2x the clock frequency which means 2x absolute throughput G. Dimitrakopoulos - FPL 2012 16

Conclusions Optimize the automatic mapping of networks on FPGAs using a distributed elastic switch architecture Avoids using any of the scarce memory blocks Reduces the effects of long links The AC module reduces significantly the per-switch latency while keeping network throughput unaffected. Area savings are also noticeable AC radix provides a tradeoff between latency, speed and buffering G. Dimitrakopoulos - FPL 2012 17

Backup slides Complementary to FPL 2011 paper G. Dimitrakopoulos - FPL 2012 18

Round-robin arbitration Most widely used selection policy Arbitration begins at the highest-priority position (HP) and searches for the first active request The search moves in a cyclic manner and covers all positions The request that won receives the lowest priority in the next cycle G. Dimitrakopoulos - FPL 2011 19

Requests-Priority recoding Transform each request and priority bit to a 2bit unsigned arithmetic symbol The request the MSBit Arbitration equivalent to sorting (max selection) G. Dimitrakopoulos - FPL 2011 20

Merged arbiter multiplexer G. Dimitrakopoulos - FPL 2012 21