Reconfigurable Computing. Reconfigurable Architectures. Chapter 3.2




Reconfigurable Architectures Chapter 3.2 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software-Co-Design

Coarse-Grained Reconfigurable Devices

Recall: 1. Brief historical development (Estrin's Fix-Plus and the Rammig machine). 2. Programmable logic: PALs and PLAs; CPLDs; FPGAs (technology, and architecture by means of examples: Actel, Xilinx, Altera).

Once again: general purpose vs. special purpose. With LUTs as function generators, FPGAs can be seen as general-purpose devices. Like any general-purpose device, they are flexible but often inefficient: flexible, because any n-variable Boolean function can be implemented in an n-input LUT; inefficient, because complex functions must be spread over many LUTs at different locations. The connections among the LUTs run through the routing matrix, which increases signal delays, and a LUT implementation is usually slower than direct wiring.
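To make the LUT idea concrete, here is a minimal sketch (Python, not tied to any particular FPGA family) of a k-input LUT: the configuration is nothing but a 2^k-entry truth table, so any k-variable Boolean function can be implemented simply by filling in the table.

```python
# A k-input LUT is a 2^k-entry truth table: the input bits form an
# address, and the stored bit at that address is the output. Any
# k-variable Boolean function is realized by choosing the table contents.

def make_lut(truth_table):
    """Return a function that looks its Boolean inputs up in the table."""
    def lut(*inputs):
        index = 0
        for bit in inputs:            # inputs form the address, MSB first
            index = (index << 1) | int(bit)
        return truth_table[index]
    return lut

# Example configuration: a 2-input LUT programmed as XOR.
xor_lut = make_lut([0, 1, 1, 0])

print(xor_lut(0, 0), xor_lut(0, 1), xor_lut(1, 0), xor_lut(1, 1))  # 0 1 1 0
```

Reprogramming the device means rewriting the table, not rewiring gates, which is exactly the source of both the flexibility and the inefficiency discussed above.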

Once again: general purpose vs. special purpose. Example: implement the function F = ABD + AC D + A BC using 2-input LUTs. LUTs are grouped into logic blocks (LBs), with two 2-input LUTs per LB. Connections inside an LB are efficient (direct); connections between LBs are slow (they pass through the connection matrix).

Once again: general purpose vs. special purpose. Idea: implement frequently used blocks as hard-core modules in the device, so that a function like F above becomes a single dedicated block instead of several LUTs joined through the connection matrix.

Coarse-grained reconfigurable devices overcome the inefficiency of FPGAs by providing coarse-grained functional units (adders, multipliers, integrators, etc.) that are efficiently implemented. Advantages: very efficient in terms of speed, since basic operators need no connections over connection matrices, and direct wiring replaces LUT implementation. A coarse-grained device is usually an array of identical programmable processing elements (PEs), each capable of executing a few operations such as addition and multiplication. Depending on the manufacturer, the functional units communicate via buses or are directly connected through programmable routing matrices.

Coarse-grained reconfigurable devices: memory exists between and inside the PEs, along with various other functional units, depending on the manufacturer. A PE is usually an 8-, 16-, or 32-bit tiny ALU that is configured to execute only one operation during a given period (until the next configuration). Communication among the PEs is either packet-oriented (on buses) or point-to-point (using crossbar switches). Since each vendor has its own implementation approach, we study the concept by means of a few examples: the PACT XPP, the Quicksilver ACM, the NEC DRP, and TCPAs.
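As a toy illustration of this PE model (hypothetical, not any vendor's actual device), the sketch below configures a PE with exactly one operation, which it then applies to every operand pair until it is reconfigured:

```python
# Toy model of a coarse-grained PE: it is configured with ONE operation
# (e.g. add or multiply) and keeps executing it until reconfigured.

OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

class PE:
    def __init__(self):
        self.op = None

    def configure(self, op_name):
        """Reconfiguration: select the single operation for the next period."""
        self.op = OPS[op_name]

    def execute(self, a, b):
        return self.op(a, b)

pe = PE()
pe.configure("add")
print(pe.execute(3, 4))   # 7
pe.configure("mul")       # next configuration period
print(pe.execute(3, 4))   # 12
```

The point of the coarse grain is visible even in the toy: the "configuration" selects among a handful of word-level operators rather than filling in bit-level truth tables.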

The PACT XPP, overall structure: the XPP (eXtreme Processing Platform) is a hierarchical structure consisting of an array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs); a Processing Array Cluster (PAC), which is a PA plus a Configuration Manager (CM); and a hierarchical configuration tree. Local CMs manage configuration at the PA level and access the local configuration memory, while the supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device.

Further components: a communication network, memory elements beside the PACs, and a set of I/Os. The PAEs: there are two types of PAE, the ALU-PAE and the RAM-PAE. The ALU-PAE contains an ALU that can be configured to perform basic operations; a Back Register (BREG) provides routing channels for data and events from bottom to top, and a Forward Register (FREG) provides routing channels from top to bottom.

The PACT XPP, the processing array elements: a DataFlow Register (DF-REG) can be used at the object outputs for buffering data, and the input registers can be preloaded with configuration data. The RAM-PAE differs from the ALU-PAE only in function: instead of an ALU, it contains a dual-ported RAM, useful for data storage. Data is written or read after an address is applied at the RAM inputs. The BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE.

The PACT XPP, routing: two independent networks, one for data transmission and one for event transmission. A configuration bus exists besides the data and event networks (very little information is available about it). All objects can be connected to horizontal routing channels using switch objects; vertical routing channels are provided by the BREGs and FREGs (BREGs route from bottom to top, FREGs from top to bottom).

The PACT XPP, interfaces: interfaces are available inside the chip; their number and type vary from device to device. On the XPP42-A1 there are six internal interfaces: four identical general-purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right), one configuration-manager interface, and one JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes (not shown in the picture).

The I/O interfaces can operate independently of each other, in one of two modes: RAM mode and streaming mode. In RAM mode, each port can access external static RAM (SRAM); control signals for the SRAM transactions are available, so no additional logic is required.

Streaming mode is used for high-speed streaming of data to and from the device. Each I/O element provides two bidirectional ports for data streaming, and handshake signals are used to synchronize data packets with the external port.
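The handshake idea can be sketched generically (a valid/ready protocol as used by many streaming interfaces; the XPP's actual signal names are not given here): the sender presents a packet, and the transfer happens only in a cycle where the receiver is also ready, so no packet is ever lost to backpressure.

```python
# Generic valid/ready handshake: a data packet is transferred only in a
# cycle where the sender asserts `valid` AND the receiver asserts `ready`.

def stream(packets, ready_pattern):
    """Simulate cycles; return the list of (cycle, packet) transfers."""
    transferred = []
    it = iter(packets)
    pending = next(it, None)
    for cycle, ready in enumerate(ready_pattern):
        valid = pending is not None
        if valid and ready:               # handshake succeeds this cycle
            transferred.append((cycle, pending))
            pending = next(it, None)
        # if not ready, the sender holds `pending` stable (backpressure)
    return transferred

# Receiver is busy in cycles 1 and 3: packets wait without being lost.
print(stream(["p0", "p1", "p2"], [True, False, True, False, True]))
# [(0, 'p0'), (2, 'p1'), (4, 'p2')]
```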

The Quicksilver ACM, architecture and structure: a fractal-like structure, i.e. a hierarchical grouping of four nodes with full communication among them, where four lower-level nodes are grouped into one higher-level node. The lowest level consists of four heterogeneous processing nodes. The connection is made by a Matrix Interconnect Network (MIN). The chip also contains a system controller and various I/Os.

The Quicksilver ACM, the processing node: an ACM processing node consists of an algorithmic engine, unique to each node type, which defines the operation performed by the node; the node memory, for data storage at the node level; and a node wrapper, common to all nodes, which hides the complexity of the heterogeneous architecture.

Four types of node exist: the Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers; the Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations; the Domain Bit Manipulation (DBM) node provides bit-manipulation and byte-oriented operations; and the External Memory Controller node provides DDR RAM, SRAM, memory random access, and DMA control interfaces.

The Quicksilver ACM, the processing node (figures: the ACM AXN node and the ACM DBM node).

The node wrapper envelopes the algorithmic engine and presents an identical interface to neighbouring nodes. It features: a MIN interface to support communication among nodes via the MIN network; a hardware task manager for task management at the node level; a DMA engine; dedicated I/O circuitry; memory controllers; and data distributors and aggregators.

The Quicksilver ACM, the MIN: the Matrix Interconnect Network is the communication medium in an ACM chip. It is hierarchically organized: the MIN at a given level connects several lower-level MINs. The MIN root is used for off-chip communication and for configuration. The MIN supports the communication among nodes and provides services such as point-to-point dataflow streaming, real-time broadcasting, and DMA. (Figure: example of an ACM chip configuration.)
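The hierarchical grouping in fours can be illustrated with a small model (an idealized quad-tree numbering, not QuickSilver's actual addressing scheme): two nodes communicate through the lowest MIN level at which their groups of four merge.

```python
# Idealized model of a hierarchical 4-ary interconnect: node i belongs to
# group i // 4 at level 1, group i // 16 at level 2, and so on. Traffic
# between two nodes must climb to the first level where their groups match.

def min_level(a, b):
    """Lowest MIN level whose cluster contains both nodes (0 = same node)."""
    level = 0
    while a != b:
        a //= 4          # move one level up the hierarchy
        b //= 4
        level += 1
    return level

print(min_level(0, 3))    # 1: same lowest-level group of four
print(min_level(0, 5))    # 2: the two groups merge one level higher
print(min_level(0, 63))   # 3: only the root of a 64-node MIN joins them
```

This is why nearby nodes enjoy low-latency links while cross-chip traffic pays for the climb to the MIN root.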

The Quicksilver ACM, the system controller: the system controller is in charge of system management. It loads tasks into the nodes' ready-to-run queues for execution, statically or dynamically sets up the communication channels between the processing nodes, and carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis. The ACM chip also features a set of I/O interface controllers, such as PCI, PLL, SDRAM, and SRAM controllers.

The NEC DRP architecture: the NEC Dynamically Reconfigurable Processor (DRP) consists of a set of byte-oriented processing elements (PEs); a programmable interconnection network for communication among the PEs; a sequencer, which can be programmed as a finite state machine (FSM) to control the reconfiguration process; memory around the device for storing configuration and computation data; and various interfaces.

The NEC DRP, the processing element: the ALU performs ordinary byte arithmetic/logic operations, while the DMU (data management unit) handles byte select, shift, mask, and constant generation, as well as bit manipulations. An instruction dictates the ALU/DMU operations and the inter-PE connections. Source and destination operands can come from, and go to, the PE's own register file or other PEs (i.e., flow through). The instruction pointer (IP) is provided by the STC (state transition controller).

The NEC DRP, reconfiguration process: the instruction pointer (IP) from the STC identifies a datapath plane, and computation proceeds spatially on this customized plane. When the IP changes, the datapath plane switches instantaneously. The PE instructions, taken as a collection, behave like an extreme VLIW (task selection by descriptor); sequencing through instructions thus amounts to dynamic reconfiguration. (Figure: a data stream flows through multiple datapath planes, e.g. AES, 3DES, MD5, SHA-1, control, and compression, selected by dynamic reconfiguration.)

The NEC DRP, reconfiguration process, step by step: (1) identify the instruction to be executed (via the IP); (2) decode the instruction in the ALU plane; (3)+(4) configure the ALU plane according to the instruction. (Figure: with IP = 1, the ALU/DMU instruction memories of the PE array select operations such as Add, Sel, and Cmp for the individual PEs.)
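The IP-driven plane switch can be caricatured in a few lines (hypothetical operations; the real DRP configures a whole PE array, not a lookup of Python functions): changing the IP selects a different preloaded datapath plane in a single step.

```python
# Caricature of DRP-style dynamic reconfiguration: several datapath
# "planes" are preloaded; the instruction pointer (IP) from the sequencer
# selects which plane processes the data in the current step.

planes = {
    0: lambda x: x + 1,        # plane 0: e.g. an adder datapath
    1: lambda x: x * 2,        # plane 1: e.g. a multiplier datapath
    2: lambda x: x & 0xFF,     # plane 2: e.g. a byte-masking datapath
}

def run(data, ip_sequence):
    """Each step: the STC supplies an IP; the selected plane computes."""
    out = []
    for x, ip in zip(data, ip_sequence):
        out.append(planes[ip](x))   # switching planes = changing the IP
    return out

print(run([10, 10, 300], [0, 1, 2]))   # [11, 20, 44]
```

The key property mirrored here is that the switch itself costs no rebuilding: every plane is already resident, and the IP merely selects among them.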

Tightly-Coupled Processor Arrays (TCPAs): processor elements (PEs) with a VLIW (very long instruction word) architecture. The PEs are weakly programmable, with a small local instruction memory and a limited, parameterizable instruction set focused on digital signal processing. The control path is data-flow oriented: there is no global address space, and data streams over the processing field through a regular interconnect network. Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...

Tightly-Coupled Processor Arrays (TCPA): overview (figure).

TCPA interconnect network. Basic structure: a grid, dynamically reconfigurable. By using a bypass, more than one hop is possible in a single clock cycle. An interconnect wrapper around each PE is responsible for the switching.
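The effect of the bypass can be shown with a toy latency model (illustrative numbers only, not TCPA timing data): without a bypass, a signal takes one clock cycle per grid hop, while a configured bypass lets it cross several interconnect wrappers in the same cycle.

```python
# Toy latency model of a grid interconnect with bypasses. Normally each
# hop between neighbouring PEs costs one clock cycle; a bypass configured
# in the interconnect wrappers lets a signal cross several hops per cycle.

def cycles(hops, bypass_span=1):
    """Clock cycles to cover `hops` grid hops, crossing up to
    `bypass_span` hops per cycle (bypass_span=1 means no bypass)."""
    return -(-hops // bypass_span)      # ceiling division

distance = 6                            # e.g. 6 hops across the PE grid
print(cycles(distance))                 # 6 cycles without a bypass
print(cycles(distance, bypass_span=3))  # 2 cycles with a 3-hop bypass
```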

TCPA network example: a 4D hypercube (figure).

TCPA network example: a 2D torus (figure).

TCPA dynamic reconfiguration: a multicast scheme is used for partial dynamic reconfiguration; differential reconfiguration (of programs and/or connections) is also possible.

A 24-core TCPA (Lehrstuhl für Informatik 12): 24 cores of 16 bits each. Technology: 90 nm standard-cell CMOS, 1.0 V, 9 metal layers. FUs per PE: 2 adders, 2 multipliers, 1 shifter, 1 DPU. Registers per PE: 15. Instruction memory: 1024 x 32 bits = 4 kB. Clock frequency: 200 MHz. Peak performance: 24 GOPS. Energy consumption: 133 mW at 200 MHz (with hybrid clock gating). Power efficiency: 180 MOPS/mW.
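The quoted power efficiency follows directly from the other figures, as a quick check shows: 24 GOPS is 24,000 MOPS, and 24,000 MOPS / 133 mW is approximately 180 MOPS/mW.

```python
# Sanity check of the quoted TCPA figures: peak performance divided by
# power consumption should reproduce the stated ~180 MOPS/mW.

peak_mops = 24e3      # 24 GOPS expressed in MOPS
power_mw = 133        # measured at 200 MHz with hybrid clock gating

efficiency = peak_mops / power_mw
print(round(efficiency))   # 180
```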