OpenSoC Fabric: On-Chip Network Generator



Similar documents
All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Architekturen und Einsatz von FPGAs mit integrierten Prozessor Kernen. Hans-Joachim Gelke Institute of Embedded Systems Professur für Mikroelektronik

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

OpenSPARC T1 Processor

7a. System-on-chip design and prototyping platforms

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip

- Nishad Nerurkar. - Aniket Mhatre

Applying the Benefits of Network on a Chip Architecture to FPGA System Design

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Xeon+FPGA Platform for the Data Center

FPGA-based MapReduce Framework for Machine Learning

Networking Virtualization Using FPGAs

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

SOC architecture and design

Kalray MPPA Massively Parallel Processing Array

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

Solving Network Challenges

From Hypercubes to Dragonflies a short history of interconnect

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

What is a System on a Chip?

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

ARM Webinar series. ARM Based SoC. Abey Thomas

High Performance or Cycle Accuracy?


Qsys and IP Core Integration

Lecture 18: Interconnection Networks. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

On-Chip Communications Network Report

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

Architectures and Platforms

Data Center and Cloud Computing Market Landscape and Challenges

Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy

Product Brief. R7A-200 Processor Card. Rev 1.0

Accelerate Cloud Computing with the Xilinx Zynq SoC

On-Chip Interconnection Networks Low-Power Interconnect

How To Build An Ark Processor With An Nvidia Gpu And An African Processor

Supercomputing Clusters with RapidIO Interconnect Fabric

PCI Express and Storage. Ron Emerick, Sun Microsystems

Pre-tested System-on-Chip Design. Accelerates PLD Development

Extending the Power of FPGAs. Salil Raje, Xilinx

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

OpenSPARC Program. David Weaver Principal Engineer, UltraSPARC Architecture Principal OpenSPARC Evangelist Sun Microsystems, Inc.

Accelerating the Data Plane With the TILE-Mx Manycore Processor

Introduction to System-on-Chip

Using a Generic Plug and Play Performance Monitor for SoC Verification

Cloud Data Center Acceleration 2015

Parallel Programming Survey

Scaling Mobile Compute to the Data Center. John Goodacre

ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler

A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator

Memory Architecture and Management in a NoC Platform

Low-Overhead Hard Real-time Aware Interconnect Network Router

Exascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP

PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

EDUCATION. PCI Express, InfiniBand and Storage Ron Emerick, Sun Microsystems Paul Millard, Xyratex Corporation

Secured Embedded Many-Core Accelerator for Big Data Processing

Echtzeittesten mit MathWorks leicht gemacht Simulink Real-Time Tobias Kuschmider Applikationsingenieur

White Paper. S2C Inc Technology Drive, Suite 620 San Jose, CA 95110, USA Tel: Fax:

Computer Organization

Embedded Development Tools

System Performance Analysis of an All Programmable SoC

Multiprocessor System-on-Chip

Cloud-Based Apps Drive the Need for Frequency-Flexible Clock Generators in Converged Data Center Networks

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

OC By Arsene Fansi T. POLIMI

Von der Hardware zur Software in FPGAs mit Embedded Prozessoren. Alexander Hahn Senior Field Application Engineer Lattice Semiconductor

WiSER: Dynamic Spectrum Access Platform and Infrastructure

ZigBee Technology Overview

Next Generation GPU Architecture Code-named Fermi

Open Flow Controller and Switch Datasheet

Using Network Virtualization to Scale Data Centers

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

CHAPTER 1 INTRODUCTION

How Router Technology Shapes Inter-Cloud Computing Service Architecture for The Future Internet

Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs

HPC Update: Engagement Model

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

White Paper The Numascale Solution: Extreme BIG DATA Computing

High Performance Computing in the Multi-core Area

How To Understand The Concept Of A Distributed System

Rapid System Prototyping with FPGAs

Optimizing Configuration and Application Mapping for MPSoC Architectures

numascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT

Going Linux on Massive Multicore

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

An Interconnection Network for a Cache Coherent System on FPGAs. Vincent Mirian

Multistage Interconnection Network for MPSoC: Performances study and prototyping on FPGA

Why the Network Matters

Transcription:

OpenSoC Fabric: On-Chip Network Generator Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf MODSIM 2014 Presentation 2014 Aug 13

Abstract Machine Model For Emerging Node Architectures The number of cores on a chip will be on the order of 1000s Throughput Optimized Cores (Thin Cores) Massively Parallel,Simple Core Latency Optimized Core (Fat Cores) Coherence Domain Maintaining cache coherence is NOT scalable Expect coherence domains Flat and infinitely fast on-chip interconnect is NO longer practical Expect complex NOCs Processing elements within a node are NOT equidistant. Expect non-uniformity Download the CAL AMM doc from http://www.cal-design.org/

Motivation Straw-man Exascale Processor Shekhar Borkar, IPDPS 2013 Simplest Core First level of hierarchy RF C C C C * Shared Cache Logic C C C C 600K Transistors Logic Processor Next level of hierarchy Interconnect Next level cache Next level of hierarchy Interconnect Next level cache Next level of hierarchy Interconnect Next level cache Next level of hierarchy Interconnect Last level cache Interconnect Next level cache Next level of hierarchy Interconnect Next level cache Technology 7nm, 2018 Die area 500 mm2 Cores 2048 Frequency 4.2 GHz TFLOPs 17.2 Power 600 Watts E Efficiency 34 pj/flop Computations alone consume 34 MW for Exascale 41

Impact of NoCs Application Performance Dual FPMA Cs 34% Power Router s and links 26% Clock distrib ution 10% An analysis of on-chip interconnection networks for large-scale chip multiprocessors ACM Transactions on computer architecture and code optimization (TACO), April 2010 IMEM and DMEM 20% 10-port RF 10% A 5-GHz Mesh Interconnect for a Teraflops Processor. IEEE Micro. 2007

Evaluation Infrastructure Software simulation slow for large-scale chips and systems Also does not evaluate cycle time Hardware emulation requires a large development effort Offers limited internal visibility for statistics 5 A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs. FPGA 2008

Building an SoC from IP Logic Blocks It s Legos with a some extra integration and verification cost Processor Core (ARM, Tensilica, MIPS deriv) With extra options like DP FPU, ECC OpenSoC Fabric (on-chip network) (currently proprietary ARM or Arteris) DDR memory controller (Denali/Cadence, SiCreations) + Phy & Programmable PLL PCIe Gen3 Root complex Memory DRAM Memory DRAM memctl memctl PCIe FLASH ctl IB or GigE IB or GigE Integrated FLASH Controller 6 10GigE or IB DDR 4x Channel

Design Priorities Primary goal is re-configurability Provide a powerful collection of parameters Make creating new modules and replacing existing ones easy Fast verification of hardware and software models Provide standardized connection interface Invite community to participate and contribute 7

Abstract Block Diagram CPU(s) CPU(s) CPU(s) AXI AXI AXI CPU(s) AXI OpenSoC Fabric AXI CPU(s) AXI AXI AXI PCIe 10GbE HMC 8

Example Instantiation class MyOpenSOC extends Module {! val mycpus = Vec.fill(5) {Module(new CPU())}! val myhmc = Module(new HMC())! val mytengbe = Module(new TenGbE())! val mypcie = Module(new PCIe())! val NumNodes = 8!! val OpenSoCFabric = Module(new MyOpenSoCFabric(NumNodes))!! for (i <- 0 until 4) {! mycpus(i).io.axiport <> MyOpenSoCFabric.io.AXI(i)! }! myhmc.io.axiport <> MyOpenSoCFabric.io.AXI(5)! mytengbe.io.axiport <> MyOpenSoCFabric.io.AXI(6)! mypcie.io.axiport <> MyOpenSoCFabric.io.AXI(7)! }! 9

What is Chisel? Constructing Hardware In a Scala Embedded Language An open-source hardware construction language developed at the ASPIRE Lab at UC Berkeley Hierarchical + Object Oriented + Functional Construction Generates both software and hardware Easy functional verification with software model (C++) Use hardware (Verilog) in FPGAs or ASICs import Chisel._!! class GCD extends Module {! val io = new Bundle {! val a = UInt(INPUT, 16)! val b = UInt(INPUT, 16)! val e = Bool(INPUT)! val z = UInt(OUTPUT, 16)! val v = Bool(OUTPUT)! }! val x = Reg(UInt())! val y = Reg(UInt())! when (x > y) { x := x - y }! unless (x > y) { y := y - x }! when (io.e) { x := io.a; y := io.b }! io.z := x! io.v := y === UInt(0)! }!! object Example {! def main(args: Array[String]): Unit = {! chiselmain(args,! () => Module(new GCD()))! }! }!

ASICs in Chisel Raven 28nm Processor Site Clock test DCDC site test site SRAM test site

Why Chisel? For thousand-core chips, neither software or hardware models suffice alone Software models too slow and do not regard hardware complexity Hardware RTL too labor intensive and collecting internal statistics can be difficult Software Compilation Chisel Scala Hardware Compilation Chisel provides both models from the same codebase C++ Simulation SystemC Simulation Verilog Provides hierarchical design for easy parameterization and module replacement SST FPGA ASIC 12

Fabric Internal Block Diagram Top-Level Network Interface Parameters Router Topology Generator Routing Function Allocator Arbiter Arbiter Channel Channel Router Switch Router Channel Channel AXI AXI Network Interface Network Interface... Channel InjectionQ...... Channel... Channel EjectionQ...... Channel Network Interface Network Interface AXI AXI InjectionQ EjectionQ 13

Class Hierarchy Diagram Flow of Instantiation Top-Level Network Network Interface Injection/ Ejection Queues Topology Other PIF Routing Function Router AXI Torus Flatten Butterfly Mesh Allocator Switch Arbiter Cyclic 14 Priority Round Robin

Existing NoC Generators A few examples such as: Arteris (FlexNoC) ARM (CoreLink and CoreSight) Orchestra (SoC generator) Academic tools (Verilog for FPGA) Our project will be open source Functionality can be extended by users Both software and hardware models 15

Current Status All major architectural components are complete and tested Each with individual testers Initial network is simple mesh network Dimension ordered routing Wormhole buffered flow control Currently debugging network functionality and adding features Implementing and testing Virtual Channels Implementing and testing different topologies (Flattened Butterfly, Torus, etc.) 16 Developing more sophisticated testbenches to ensure robustness

Future Work Continue to add network features Adaptive routing Additional topologies And so much more! Expand to support circuit-switched networks Enables optical networking studies Validate against Booksim Booksim has been validated against RTL implementations Complete AXI Interface Enable integration with ARM-based cores Integrate into multi-core SystemC based software models Include HMC, DRAM, and NVRAM memory endpoints Use ASIC flow to incorporate power and area calculations into software model

Questions? For more information, please visit: http://www.opensocfabric.org