Reconfigurable Architectures Chapter 3.2 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software-Co-Design
Coarse-Grained Reconfigurable Devices
Recall:
1. Brief historical development (Estrin's Fix-Plus machine and the Rammig machine)
2. Programmable logic
   1. PALs and PLAs
   2. CPLDs
3. FPGAs
   1. Technology
   2. Architecture by means of examples: Actel, Xilinx, Altera
Once again: General purpose vs. special purpose
With LUTs as function generators, FPGAs can be seen as general-purpose devices. Like any general-purpose device, they are flexible but often inefficient:
- Flexible, because any n-variable Boolean function can be implemented using an n-input LUT.
- Inefficient, since complex functions must be implemented in many LUTs at different locations. The connections among the LUTs run through the routing matrix, which increases the signal delays, and a LUT implementation is usually slower than direct wiring.
Once again: General purpose vs. special purpose
Example: implement the function F = A·B·D + A·C̄·D + Ā·B·C using 2-input LUTs.
- LUTs are grouped in logic blocks (LBs), with two 2-input LUTs per LB.
- Connections inside an LB are efficient (direct).
- Connections between LBs are slow (connection matrix).
(Figure: LUT implementation of F, with the inputs A, B, C, D routed over the connection matrix.)
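The cost of spreading a function over many small LUTs can be made concrete with a short sketch (illustrative Python, not FPGA tooling): each 2-input LUT is just a four-entry truth table, and building F from the product terms A·B·D, A·C̄·D, and Ā·B·C already takes a tree of seven such LUTs.

```python
# A 2-input LUT is a 4-entry truth table indexed by its two inputs.
def lut2(table, a, b):
    """Evaluate a 2-input LUT; 'table' lists outputs for inputs 00, 01, 10, 11."""
    return table[(a << 1) | b]

AND = [0, 0, 0, 1]
OR  = [0, 1, 1, 1]

def F(a, b, c, d):
    # Each product term needs a chain of 2-input LUTs, possibly in different LBs:
    t1 = lut2(AND, lut2(AND, a, b), d)       # A·B·D
    t2 = lut2(AND, lut2(AND, a, 1 - c), d)   # A·C̄·D
    t3 = lut2(AND, lut2(AND, 1 - a, b), c)   # Ā·B·C
    return lut2(OR, lut2(OR, t1, t2), t3)    # sum of the three products

# Brute-force check against the Boolean expression over all 16 input vectors:
for n in range(16):
    a, b, c, d = (n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1
    ref = (a and b and d) or (a and not c and d) or (not a and b and c)
    assert F(a, b, c, d) == int(ref)
```

Seven LUTs at potentially different locations, connected over the routing matrix, illustrate why a hard-wired implementation of the same function is faster.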
Once again: General purpose vs. special purpose
Idea: implement frequently used blocks as hard-core modules in the device.
(Figure: the function F realized as a single hard-wired block instead of several LUTs connected over the connection matrix.)
Coarse-grained reconfigurable devices
Overcome the inefficiency of FPGAs by providing coarse-grained functional units (adders, multipliers, integrators, etc.) that are efficiently implemented.
- Advantage: very efficient in terms of speed (no connections over connection matrices for basic operators).
- Advantage: direct wiring instead of a LUT implementation.
A coarse-grained device is usually an array of programmable, identical processing elements (PEs), each capable of executing a few operations such as addition and multiplication. Depending on the manufacturer, the functional units communicate via buses or can be directly connected using programmable routing matrices.
Coarse-grained reconfigurable devices
Memory exists between and inside the PEs, along with several other functional units, depending on the manufacturer. A PE is usually an 8-bit, 16-bit, or 32-bit tiny ALU that can be configured to execute only one operation during a given period (until the next configuration). Communication among the PEs is either packet-oriented (over buses) or point-to-point (using crossbar switches). Since each vendor has its own implementation approach, the study is done by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, and TCPA.
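The "configure once, execute until the next configuration" behaviour of such a PE can be sketched as follows (a hypothetical 16-bit PE; the class and operation names are illustrative, not any vendor's API):

```python
# Operations a hypothetical 16-bit PE can be configured with;
# results wrap around at 16 bits, like a small fixed-width ALU.
OPS = {
    "add": lambda x, y: (x + y) & 0xFFFF,
    "mul": lambda x, y: (x * y) & 0xFFFF,
}

class PE:
    """Sketch of a coarse-grained processing element."""
    def __init__(self):
        self.op = None
    def configure(self, op_name):
        # Reconfiguration: the PE executes only this one operation afterwards.
        self.op = OPS[op_name]
    def execute(self, x, y):
        return self.op(x, y)

pe = PE()
pe.configure("add")
print(pe.execute(40000, 30000))  # 4464 (70000 wrapped to 16 bits)
pe.configure("mul")              # reconfigure for the next period
print(pe.execute(300, 300))      # 24464 (90000 wrapped to 16 bits)
```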
The PACT XPP - Overall structure
The XPP (eXtreme Processing Platform) is a hierarchical structure consisting of:
- An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs)
- Processing Array Clusters (PACs), each consisting of a PA plus a Configuration Manager (CM)
- A hierarchical configuration tree
Local CMs manage the configuration at the PA level. The local CMs access the local configuration memory, while the supervising CM (SCM) accesses external memory and supervises the whole configuration process on the device.
The PACT XPP - The Processing Array Elements
Further components: a communication network, memory elements beside the PACs, and a set of I/Os.
The PAE: two types of PAE exist, the ALU-PAE and the RAM-PAE.
The ALU-PAE:
- Contains an ALU that can be configured to perform basic operations
- The Back Register (BREG) provides routing channels for data and events from bottom to top
- The Forward Register (FREG) provides routing channels from top to bottom
(Figure: the ALU-PAE.)
The PACT XPP - The Processing Array Elements
Dataflow registers (DF-REGs) can be used at the object outputs for buffering data; input registers can be preloaded with configuration data.
The RAM-PAE:
1. Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM
2. Useful for data storage
3. Data is written or read after an address is applied at the RAM inputs
4. The BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE
(Figure: the RAM-PAE.)
The PACT XPP - Routing
Routing in the PACT XPP uses two independent networks: one for data transmission, the other for event transmission. A configuration bus exists besides the data and event networks (very little information is available about it). All objects can be connected to horizontal routing channels using switch objects. Vertical routing channels are provided by the BREGs and FREGs: BREGs route from bottom to top, FREGs from top to bottom.
(Figure: vertical and horizontal routing channels.)
The PACT XPP - Interfaces
Interfaces are available inside the chip; their number and type vary from device to device. On the XPP42-A1 there are six internal interfaces:
- Four identical general-purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right)
- One configuration manager interface
- One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes (not shown in the picture)
The PACT XPP - Interfaces
The I/O interfaces can operate independently of each other and support two operation modes: the RAM mode and the streaming mode.
RAM mode: each port can access external static RAM (SRAM); control signals for the SRAM transactions are available, and no additional logic is required.
The PACT XPP - Interfaces
Streaming mode:
1. For high-speed streaming of data to and from the device
2. Each I/O element provides two bidirectional ports for data streaming
3. Handshake signals are used for the synchronization of data packets at the external ports
The Quicksilver ACM - Architecture
Structure:
- Fractal-like structure: a hierarchical grouping of four nodes with full communication among them; four lower-level nodes are grouped into one higher-level node
- The lowest level consists of four heterogeneous processing nodes
- The connection is made through a Matrix Interconnect Network (MIN)
- A system controller
- Various I/Os
The Quicksilver ACM - The processing node
An ACM processing node consists of:
- An algorithmic engine, which is unique to each node type and defines the operation performed by the node
- The node memory, for data storage at the node level
- A node wrapper, which is common to all nodes and hides the complexity of the heterogeneous architecture
The Quicksilver ACM - The processing node
Four types of nodes exist:
- The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers
- The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations
- The Domain Bit Manipulation (DBM) node provides bit-manipulation and byte-oriented operations
- The External Memory Controller node provides DDR RAM, SRAM, and memory random access, as well as a DMA control interface
(Figure: the ACM PSN node.)
The Quicksilver ACM - The processing node
(Figures: the ACM AXN node and the ACM DBM node.)
The Quicksilver ACM - The processing node
The node wrapper envelopes the algorithmic engine and presents an identical interface to neighbouring nodes. It features:
1. A MIN interface to support the communication among nodes via the MIN network
2. A hardware task manager for task management at the node level
3. A DMA engine
4. Dedicated I/O circuitry
5. Memory controllers
6. Data distributors and aggregators
(Figure: the ACM node wrapper.)
The Quicksilver ACM - The MIN
The Matrix Interconnect Network is the communication medium in an ACM chip:
1. Hierarchically organized: the MIN at a given level connects many lower-level MINs
2. The MIN root is used for off-chip communication and configuration
3. Supports the communication among nodes
4. Provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc.
(Figure: example of an ACM chip configuration.)
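The hierarchical routing idea can be illustrated with a toy model (the addressing scheme is an assumption for illustration, not Quicksilver's actual one): give each node a base-4 address with one digit per hierarchy level; the MIN level two nodes must climb to is then determined by the most significant digit in which their addresses differ.

```python
def min_level(a, b, levels=3):
    """Lowest MIN level that spans nodes a and b (1 = leaf-level MIN).

    Addresses are base-4: 2 bits per hierarchy level, 'levels' levels deep.
    """
    for lvl in range(levels - 1, -1, -1):
        if (a >> (2 * lvl)) != (b >> (2 * lvl)):
            return lvl + 1
    return 0  # a == b: no MIN traversal needed

# Nodes 0 and 3 share the lowest-level quad: their level-1 MIN suffices.
print(min_level(0, 3))   # 1
# Nodes 0 and 5 sit in different quads: traffic must climb to level 2.
print(min_level(0, 5))   # 2
# Nodes 0 and 21 differ already in the top digit: the MIN root is needed.
print(min_level(0, 21))  # 3
```

This mirrors why local traffic stays cheap in the fractal structure while only cross-cluster and off-chip traffic loads the MIN root.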
The Quicksilver ACM - The system controller
The system controller is in charge of the system management:
- Loads tasks into the nodes' ready-to-run queues for execution
- Statically or dynamically sets the communication channels between the processing nodes
- Carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis
The ACM chip also features a set of I/O interface controllers such as PCI, PLL, SDRAM, and SRAM.
(Figures: the system controller and the interface controllers.)
The NEC DRP - Architecture
The NEC Dynamically Reconfigurable Processor (DRP) consists of:
- A set of byte-oriented processing elements (PEs)
- A programmable interconnection network for communication among the PEs
- A sequencer that can be programmed as a finite state machine (FSM) to control the reconfiguration process
- Memory around the device for storing configuration and computation data
- Various interfaces
The NEC DRP - The processing element
- ALU: ordinary byte arithmetic/logic operations
- DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations
- An instruction dictates the ALU/DMU operations and the inter-PE connections
- Source/destination operands can come from/go to either the PE's own register file or other PEs (i.e., flow-through)
- The instruction pointer (IP) is provided by the STC (state transition controller)
The NEC DRP - Reconfiguration process
- The instruction pointer (IP) from the STC identifies a datapath plane; computation is carried out spatially on this customized datapath plane
- When the IP changes, the datapath plane switches instantaneously
- The PE instructions, taken as a collection, behave like an extreme VLIW (task selection by descriptor)
- Sequencing through the instructions yields dynamic reconfiguration
(Figure: multiple datapath planes between data in and data out, e.g. AES, 3DES, MD5, SHA-1, compress, and control, switched by dynamic reconfiguration.)
The NEC DRP - Reconfiguration process
The steps of a plane switch (here with IP = 1 selecting instruction 1 in each PE):
1. Identify the instruction to be executed (the IP indexes the local ALU/DMU instruction memories)
2. Decode the instruction in the PE array
3.+4. Configure the ALU plane according to the instruction; the PEs then act, e.g., as Add, Sel, or Cmp units
(Figure: the PE array and its ALU/DMU instruction memories during the four steps.)
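The plane-switching mechanism can be sketched in a few lines (the context contents, class names, and transition rule are illustrative assumptions, not NEC's actual programming model): every PE holds several preloaded instructions, and changing the IP switches all PEs to another configuration at once.

```python
# Each "datapath plane" is one configuration context shared by all PEs;
# changing the instruction pointer (IP) switches every PE in one step.
contexts = [
    {"pe0": "add", "pe1": "add", "pe2": "cmp"},  # plane 0 (hypothetical)
    {"pe0": "mul", "pe1": "sel", "pe2": "add"},  # plane 1 (hypothetical)
]

class STC:
    """State transition controller: a tiny FSM that outputs the IP."""
    def __init__(self):
        self.ip = 0
    def step(self, event):
        # Hypothetical transition rule: an event selects the next plane.
        self.ip = 1 if event else 0

stc = STC()
print(contexts[stc.ip]["pe0"])  # add  (plane 0 active)
stc.step(event=True)
print(contexts[stc.ip]["pe0"])  # mul  (plane switched on IP change)
```

Because the contexts are preloaded, the switch costs no configuration download, which is why the slide can speak of an instantaneous plane change.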
Tightly-Coupled Processor Arrays (TCPA)
- Processor elements (PEs) with a VLIW (very long instruction word) architecture
- Weakly programmable: small local instruction memory; limited, parametrizable instruction set focused on digital signal processing
- Data-flow-oriented control path, no global address space, data streaming over the processor field
- Regular interconnect network
Application areas: digital signal processing, e.g., mobile communication, HDTV, multimedia, ...
Tightly-Coupled Processor Arrays (TCPA)
(Figure: overall TCPA architecture.)
TCPA interconnect network
- Basic structure: a grid
- Dynamically reconfigurable
- By using a bypass, more than one hop is possible in a single clock cycle
- An interconnect wrapper is responsible for the switching
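The bypass idea can be sketched as follows (an illustrative model, not the actual RTL): a wrapper either registers an incoming value, costing one cycle per hop, or forwards it combinationally, so several bypassed hops fit into a single clock cycle.

```python
class Wrapper:
    """Sketch of a TCPA interconnect wrapper with an optional bypass."""
    def __init__(self, bypass=False):
        self.bypass = bypass  # set by (re)configuration
        self.reg = 0
    def tick(self, value):
        if self.bypass:
            return value                     # combinational: same cycle
        out, self.reg = self.reg, value      # registered: one-cycle delay
        return out

# Registered hop: the value becomes visible one cycle later ...
hop = Wrapper(bypass=False)
print(hop.tick(7))   # 0 (old register contents)
print(hop.tick(0))   # 7 (arrives one cycle later)
# ... with bypass, it crosses the wrapper within the same cycle:
fast = Wrapper(bypass=True)
print(fast.tick(7))  # 7
```

A chain of bypassing wrappers thus lets a value traverse multiple grid hops before the next clock edge, at the cost of a longer combinational path.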
TCPA network example: 4D hypercube
TCPA network example: 2D torus
TCPA dynamic reconfiguration
- A multicast scheme is used for partial dynamic reconfiguration
- Differential reconfiguration (of programs/connections) is also possible
24-core TCPA
- 24 × 16-bit cores
- Technology: CMOS, 1.0 V, 9 metal layers, 90 nm standard-cell layout
- FUs per PE: 2 × Add, 2 × Mul, 1 × Shift, 1 × DPU
- Registers per PE: 15
- Instruction memory: 1024 × 32 bit = 4 kB
- Clock frequency: 200 MHz
- Peak performance: 24 GOPS
- Energy consumption: 133 mW @ 200 MHz (hybrid clock gating)
- Power efficiency: 180 MOPS/mW
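As a sanity check, the quoted power efficiency follows directly from the peak performance and power figures on this slide:

```python
peak_gops = 24   # peak performance: 24 GOPS
power_mw = 133   # power consumption: 133 mW @ 200 MHz

# GOPS -> MOPS (×1000), divided by milliwatts:
mops_per_mw = peak_gops * 1000 / power_mw
print(round(mops_per_mw))  # 180, matching the quoted 180 MOPS/mW
```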