08 - Address Generator Unit (AGU)

September 30, 2013

Today's lecture
- Memory subsystem
- Address Generator Unit (AGU)

Memory subsystem
- Applications may need from kilobytes to gigabytes of memory
- Having large amounts of memory on-chip is expensive
- Accessing memory is costly in terms of power
- Designing the memory subsystem is one of the main challenges when designing for low silicon area and low power

Memory issues
- Memory speed increases more slowly than logic speed
- Impact of memory size
  - Small memory: fast but area inefficient
  - Large memory: slow but area efficient

Comparison: Traditional SRAM vs ASIC memory block
- Traditional memory
  - Single tri-state data bus for read/write
  - Asynchronous operation
- ASIC memory block
  - Separate buses for data input and data output
  - Synchronous operation

Embedded SRAM overview
- Medium to large sized SRAM memories are almost always based on this architecture
- That is, it is not possible to have a large asynchronous memory!

Best practices for memory usage in RTL code
- Use synchronous memories for all but the smallest memory blocks
- Register the data output as soon as possible
  - A little combinational logic could be OK, but don't put a multiplier unit here, for example
- Some combinational logic before the inputs to the memory is OK
  - The physical layout of the chip can cause delays here that you don't see in the RTL code

Best practices for memory usage in RTL code
- Deassert the enable signal to the memory when you are not reading or writing
  - This will save you a lot of power

ASIC memories
- ASIC synthesis tools are usually not capable of creating optimized memories
  - Specialized memory compilers are used for this task
- Conclusion: You can't implement large memories in VHDL or Verilog, you need to instantiate them
  - An inferred memory will be implemented using standard cells, which is very inefficient (10x larger than an instantiated memory and much slower)

Inferred vs instantiated memory

Inferred:

    reg [31:0] mem [511:0];            // 512 words of 32 bits

    always @(posedge clk) begin
        if (enable) begin
            if (write) begin
                mem[addr] <= writedata; // synchronous write
            end else begin
                data <= mem[addr];      // registered (synchronous) read
            end
        end
    end

Instantiated:

    SPHS9gp_512x32m4d4_bL dm(.q(data), .a(addr), .ck(clk), .d(writedata), .en(enable), .we(write));
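
The clocked behavior of the inferred memory can be mimicked in software to make the one-cycle read latency visible. The following is a minimal behavioral sketch in Python, not RTL; the class name `SyncMemory` and the method `clock` are hypothetical names chosen for illustration.

```python
class SyncMemory:
    """Behavioral model of a synchronous single-port memory (illustrative only)."""

    def __init__(self, depth=512, width=32):
        self.mem = [0] * depth           # storage array, like `mem` in the Verilog
        self.mask = (1 << width) - 1     # keeps written values within the word width
        self.data = 0                    # registered read output

    def clock(self, enable, write, addr, writedata=0):
        """Model one rising clock edge and return the registered output."""
        if enable:
            if write:
                self.mem[addr] = writedata & self.mask
            else:
                self.data = self.mem[addr]
        # When enable is low, the output register simply holds its value.
        return self.data
```

A read result only appears on `data` after the clock edge that samples the address, and the output is held while `enable` is low, which is what lets the enable signal be used to save power.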

Memory design in a core
- Memory is not located in the core
- Memory address generation is in the core
- Memory interface is in the core

Scratch pad vs Cache memories
- Scratch pad memory
  - Simpler, cheaper, and uses less power
  - Has more deterministic behavior
  - Suitable for embedded / DSP
  - May exist in separate address spaces

Scratch pad vs Cache memories
- Cache memory
  - Consumes more power
  - Cycles lost to cache misses make timing uncertain
  - Suitable for general computing, general DSP
  - Global address space; hides complexity

Selecting Scratch pad vs Cache memory

              | Low addressing complexity                     | High addressing complexity
  Cache       | If you are lazy                               | Good candidate when aiming for quick TTM
  Scratch pad | Good candidate when aiming for low power/cost | Difficult to implement and analyze

The cost of caches
- More silicon area (tag memory and memory area)
- More power (need to read both tag and memory area at the same time)
  - If low latency is desired, several ways in a set associative cache may be read simultaneously as well
- Higher verification cost
  - Many potential corner cases when dealing with cache misses

Address generation
- Regardless of whether cache or scratch pad memory is used, address generation must be efficient
- Key characteristic of most DSP applications: Addressing is deterministic

Typical AGU location

Basic AGU functionality
[Figure: block diagram of a basic AGU. Labels: inputs, initial address, address calculation logic circuit, keeper, address pointer, combinational output, registered output, addressing feedback.] [Liu2008]

AGU Example

  Addressing mode             | Operation
  Memory direct               | A <= Immediate data
  Address register + offset   | A <= AR + Immediate data
  Register indirect           | A <= RF
  Register + offset           | A <= RF + Immediate data
  Addr. reg. post increment   | A <= AR; AR <= AR + 1
  Addr. reg. pre decrement    | AR <= AR - 1; A <= AR
  Address register + register | A <= RF + AR
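
The addressing modes in this example can be written out as a small behavioral model. This is a sketch in Python rather than hardware; the class `SimpleAGU`, its method names, and the 16-bit address width are assumptions made for illustration and do not come from the slides.

```python
class SimpleAGU:
    """Behavioral model of the listed addressing modes (illustrative, not RTL)."""

    def __init__(self, ar=0, width=16):
        self.mask = (1 << width) - 1   # assumed address width
        self.ar = ar & self.mask       # address register (AR)

    def direct(self, imm):
        return imm & self.mask                   # A <= Immediate data

    def ar_offset(self, imm):
        return (self.ar + imm) & self.mask       # A <= AR + Immediate data

    def reg_indirect(self, rf):
        return rf & self.mask                    # A <= RF

    def reg_offset(self, rf, imm):
        return (rf + imm) & self.mask            # A <= RF + Immediate data

    def post_increment(self):
        a = self.ar                              # A <= AR
        self.ar = (self.ar + 1) & self.mask      # AR <= AR + 1
        return a

    def pre_decrement(self):
        self.ar = (self.ar - 1) & self.mask      # AR <= AR - 1
        return self.ar                           # A <= AR (the updated value)

    def ar_plus_reg(self, rf):
        return (rf + self.ar) & self.mask        # A <= RF + AR
```

Only the post-increment and pre-decrement modes change AR; every other mode computes an address combinationally and leaves the register untouched.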

Modulo addressing
- Most general solution
  - Address = TOP + AR % BUFFERSIZE
  - The modulo operation is too expensive in an AGU (unless BUFFERSIZE is a power of two)
- More practical
  - AR = AR + 1
  - If AR is more than BOT, correct by setting AR to TOP

AGU Example with Modulo Addressing
Let's add modulo addressing to the AGU:

    A = AR;
    AR = AR + 1;
    if (AR == BOT) AR = TOP;
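
The wrap-around can be traced in software. The sketch below is a Python model of the post-increment update with the equality test; it assumes that TOP is the first buffer address and that BOT is the first address past the buffer, which is one reading of the equality check and is not stated in the slides. The function names are hypothetical.

```python
def post_increment_modulo(ar, top, bot):
    """Return (address, next_ar) for one post-increment access.

    Assumption: `top` is the first buffer address and `bot` is the first
    address past the buffer, so the equality test wraps AR back to `top`.
    """
    a = ar            # A = AR
    ar = ar + 1       # AR = AR + 1
    if ar == bot:     # if (AR == BOT) AR = TOP
        ar = top
    return a, ar


def address_stream(start, top, bot, count):
    """Collect `count` consecutive addresses produced by the AGU model."""
    addresses = []
    ar = start
    for _ in range(count):
        a, ar = post_increment_modulo(ar, top, bot)
        addresses.append(a)
    return addresses
```

For a four-word buffer at 0x100, `address_stream(0x102, 0x100, 0x104, 6)` yields 0x102, 0x103, 0x100, 0x101, 0x102, 0x103, so the pointer cycles through the buffer without any modulo operation.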

AGU Example with Modulo Addressing What about post-decrement mode?

Modulo addressing - Post Decrement
- The programmer can exchange TOP and BOTTOM
- Alternative: Add hardware to select TOP and BOTTOM based on which instruction is used
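
Exchanging the two bounds can be illustrated with the same equality test running in the opposite direction. This Python sketch assumes that, after the exchange, the TOP register holds the last buffer address and the BOTTOM register holds the address just below the buffer; these conventions and the names `top_reg` and `bot_reg` are assumptions for illustration.

```python
def post_decrement_modulo(ar, top_reg, bot_reg):
    """Return (address, next_ar) for one post-decrement access.

    Assumption: the programmer has exchanged the bounds, so `top_reg`
    holds the last buffer address and `bot_reg` holds the address just
    below the first buffer address.
    """
    a = ar              # A = AR
    ar = ar - 1         # AR = AR - 1
    if ar == bot_reg:   # same equality comparator as in the increment case
        ar = top_reg
    return a, ar


def decrement_stream(start, top_reg, bot_reg, count):
    """Collect `count` consecutive addresses in post-decrement mode."""
    addresses = []
    ar = start
    for _ in range(count):
        a, ar = post_decrement_modulo(ar, top_reg, bot_reg)
        addresses.append(a)
    return addresses
```

With a buffer at 0x100 to 0x103, calling `decrement_stream(0x101, 0x103, 0x0FF, 5)` walks 0x101, 0x100, 0x103, 0x102, 0x101, reusing the comparator and reload path of the increment mode.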

Variable step size
- Sometimes it makes sense to use a larger step size than 1
- In this case we can't check for equality but must check for greater than or less than conditions

    if (AR > BOTTOM) AR = AR - BUFFERSIZE;

Variable step size
- Keepers and registers for BUFFERSIZE, STEPSIZE, and BOTTOM not shown
- Note that STEPSIZE can't be larger than BUFFERSIZE
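
A short software model shows why the comparison must be "greater than" once the step is larger than 1: the pointer can jump past the end of the buffer without ever being equal to it. The sketch below is in Python and assumes that BOTTOM is the last valid buffer address; the function names are hypothetical.

```python
def post_increment_step(ar, stepsize, bottom, buffersize):
    """Return (address, next_ar) for one access with a variable step.

    Assumption: `bottom` is the last valid address of the buffer.
    """
    if not 0 < stepsize <= buffersize:
        # A single subtraction can only correct an overshoot of at most
        # one buffer length, hence the STEPSIZE <= BUFFERSIZE restriction.
        raise ValueError("stepsize must be in 1..buffersize")
    a = ar
    ar = ar + stepsize
    if ar > bottom:             # if (AR > BOTTOM) AR = AR - BUFFERSIZE
        ar = ar - buffersize
    return a, ar


def step_stream(start, stepsize, bottom, buffersize, count):
    """Collect `count` consecutive addresses for the given step size."""
    addresses = []
    ar = start
    for _ in range(count):
        a, ar = post_increment_step(ar, stepsize, bottom, buffersize)
        addresses.append(a)
    return addresses
```

For an eight-word buffer at 0x100 to 0x107 and a step of 3, the stream starting at 0x100 is 0x100, 0x103, 0x106, 0x101, 0x104, 0x107, 0x102; the value 0x108 is never produced, so an equality test against the end of the buffer would miss the wrap.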

Bit reversed addressing
- Important for FFTs and similar creatures
- Typical behavior from FFT-like transforms:

  Input sample | Transformed sample | Output index (binary)
  x[0]         | X[0]               | 000
  x[1]         | X[4]               | 100
  x[2]         | X[2]               | 010
  x[3]         | X[6]               | 110
  x[4]         | X[1]               | 001
  x[5]         | X[5]               | 101
  x[6]         | X[3]               | 011
  x[7]         | X[7]               | 111
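
The index pattern in this table can be reproduced by reversing the bits of a plain counter. The following Python sketch shows the idea; the function name `bit_reverse` is chosen for illustration. In hardware the reversal is only a rewiring of the counter bits, whereas software needs a loop.

```python
def bit_reverse(value, nbits):
    """Reverse the lowest `nbits` bits of `value`."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)  # move the lowest bit of value into result
        value >>= 1
    return result


def bit_reversed_order(nbits):
    """Output indices visited by a counter running from 0 to 2**nbits - 1."""
    return [bit_reverse(i, nbits) for i in range(1 << nbits)]
```

`bit_reversed_order(3)` gives 0, 4, 2, 6, 1, 5, 3, 7, which is the transformed sample order listed for the 8-point case.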

Bit reversed addressing
- Case 1: Buffer size and buffer location can be fixed
- Case 2: Buffer size is fixed, buffer location is arbitrary
- Case 3: Buffer size and location are not fixed

Bit reversed addressing
- Case 1: Buffer size and buffer location can be fixed
- Solution: If buffer size is 2^N, place the start of the buffer at an even multiple of 2^N

    ADDR = {FIXED_PART, BIT_REVERSE(ADDR_COUNTER_N_BITS)};

Bit reversed addressing
- Case 2: Buffer size is fixed, buffer location is arbitrary
- Solution: If buffer size is 2^N, place the start of the buffer at an even multiple of 2^N

    ADDR = BASE_REGISTER + BIT_REVERSE(ADDR_COUNTER_N_BITS);
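
The difference between the first two cases is whether the reversed counter is concatenated with a fixed upper part or added to a base register. The Python sketch below models both; the function names and the example values are assumptions for illustration.

```python
def bit_reverse(value, nbits):
    """Reverse the lowest `nbits` bits of `value`."""
    result = 0
    for _ in range(nbits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def addr_case1(fixed_part, counter, nbits):
    """Case 1: concatenation {FIXED_PART, BIT_REVERSE(counter)}; no adder needed."""
    return (fixed_part << nbits) | bit_reverse(counter, nbits)


def addr_case2(base_register, counter, nbits):
    """Case 2: BASE_REGISTER + BIT_REVERSE(counter); costs an adder."""
    return base_register + bit_reverse(counter, nbits)
```

When the base is aligned to 2^N the two forms give the same address, for example `addr_case1(0x12, 1, 3)` and `addr_case2(0x90, 1, 3)` both evaluate to 0x94. The adder in case 2 is what allows a base that is not aligned.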

Bit reversed addressing
- Case 3: Buffer size and location are not fixed
- The most programmer friendly solution
- Can be done in several ways

Why design for memory hierarchy
- Small memories are faster
  - Keep active data / program in small memories
- Large memories are area efficient
  - Use large memories for volume storage

Typical SoC Memory Hierarchy
[Figure: block diagram. Labels: DSP, MCU, Accelerators; L1: RF; DP+CP; DM1 ... DMn; PM; DMA I/F; SoC bus and its arbitration / routing / control; main on-chip memory; nonvolatile memory I/F; off-chip DRAM I/F.] [Liu2008]

Memory partition
- Requirements
  - The number of data items accessed simultaneously
  - Supporting access of different data types
  - Memory shutdown for low power
  - Overhead costs from memory peripherals
  - Critical path from memory peripherals
  - Limit of on-chip memory size

Issues with off-chip memory
- Relatively low clock rate compared to on-chip
  - Will need clock domain crossing, possibly using asynchronous FIFOs
- High latency
  - Many clock cycles to send a command and receive a response
- Burst oriented
  - Theoretical bandwidth can only be reached by relatively large consecutive read/write transactions

Why burst reads from external memory?
- Procedure for reading from (DDR-)SDRAM memory
  - Read out one column (with around 2048 bits) from DRAM memory into flip-flops (slow)
  - Do a burst read from these flip-flops (fast)
  - Write back all bits into the DRAM memory
- Conclusion: If we have off-chip memory we should use burst reads if at all possible
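
The benefit of bursts can be put into rough numbers with a very simple cost model: every transaction pays a fixed overhead for the slow steps (read-out and write-back) and then one cycle per transferred word. The Python sketch below uses this model; the overhead of 10 cycles in the example is a hypothetical figure for illustration, not a value from any datasheet.

```python
def burst_efficiency(burst_length, overhead_cycles):
    """Fraction of cycles that actually transfer data in one transaction.

    Simplified model (an assumption): a fixed overhead per transaction
    plus one cycle per word in the burst.
    """
    if burst_length <= 0 or overhead_cycles < 0:
        raise ValueError("burst_length must be positive and overhead_cycles non-negative")
    return burst_length / (overhead_cycles + burst_length)


def efficiency_table(burst_lengths, overhead_cycles):
    """Map each burst length to its efficiency under the model above."""
    return {n: burst_efficiency(n, overhead_cycles) for n in burst_lengths}
```

With an assumed overhead of 10 cycles, a single-word read uses 1 of 11 cycles for data, while a 64-word burst uses 64 of 74, which is why the theoretical bandwidth is only approached with long consecutive transfers.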

Example: Image processing
- Typical organization of framebuffer memory
  - Linear addresses

Example: Image processing
- Fetching an 8x8 pixel block from main memory
  - Minimum transfer size: 16 pixels
  - Minimum alignment: 16 pixels
  - Must load: 256 bytes

Rearranging memory content to save bandwidth Use tile based frame buffer instead of linear

Rearranging memory content to save bandwidth
- Fetching 8x8 block from memory
  - Minimum transfer size: 16 pixels
  - Minimum alignment: 16 pixels
  - Must load: 144 pixels
  - Only 56% of the previous example!
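
The two worst-case figures can be checked by counting how many aligned 16-pixel transfers an unaligned 8x8 block touches. The Python sketch below assumes that each image row in the linear layout starts on a 16-pixel boundary, and that the tiled layout stores each 4x4 tile (16 pixels) as one aligned transfer. The tile shape is an assumption, since the slides do not state it, and the function names are hypothetical.

```python
def chunks_touched(start, length, chunk):
    """Number of aligned chunks of size `chunk` that cover [start, start + length)."""
    first = start // chunk
    last = (start + length - 1) // chunk
    return last - first + 1


def linear_pixels(x, block=8, burst=16):
    """Pixels loaded for a block at column `x` in a linear frame buffer."""
    # Every one of the `block` rows fetches whole aligned bursts.
    return block * chunks_touched(x, block, burst) * burst


def tiled_pixels(x, y, block=8, tile=4):
    """Pixels loaded for a block at (x, y) when each tile is one transfer."""
    tiles = chunks_touched(x, block, tile) * chunks_touched(y, block, tile)
    return tiles * tile * tile


def worst_case_linear():
    return max(linear_pixels(x) for x in range(16))


def worst_case_tiled():
    return max(tiled_pixels(x, y) for x in range(4) for y in range(4))
```

Under these assumptions the worst cases are 256 pixels for the linear layout (two bursts in each of 8 rows) and 144 pixels for the tiled layout (a 3x3 group of tiles), a ratio of about 56%.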

Rearranging memory content to save bandwidth Extra important when using a wide memory bus and/or a cache

Buses
- External memory
  - SDRAM, DDR-SDRAM, DDR2-SDRAM, etc.
- SoC bus
  - AMBA, PLB, Wishbone, etc.
- You still need to worry about burst length, latency, etc.

DMA definition and specification
- DMA: Direct memory access
  - An external device independent of the core
  - Runs loads and stores in parallel with the DSP
  - The DSP processor can do other things in parallel
- Requirements
  - Large bandwidth and low latency
  - Flexible, with support for different access patterns
  - For DSP: Multiple access is not so important

Data memory access policies
- Both DMA and DSP can access the data memory simultaneously
  - Requires a dual-port memory
- DSP has priority, DMA must use spare cycles to access DSP memory
  - Verifying the DMA controller gets more complicated
- DMA has priority, can stall DSP processor
  - Verifying the processor gets more complicated
- Ping-pong buffer
  - Verifying the software gets more complicated

Ping-pong buffers
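
The idea behind a ping-pong buffer is that the DMA fills one buffer while the DSP processes the other, and the two roles are exchanged once both sides are done. The Python sketch below models only the ownership swap; the class and attribute names are hypothetical, and the synchronization that software must provide before swapping is left out.

```python
class PingPongBuffer:
    """Two equally sized buffers with alternating DMA / DSP ownership."""

    def __init__(self, size):
        self.buffers = [[0] * size, [0] * size]
        self.dma_index = 0  # index of the buffer currently owned by the DMA

    @property
    def dma_buffer(self):
        """Buffer the DMA is allowed to write."""
        return self.buffers[self.dma_index]

    @property
    def dsp_buffer(self):
        """Buffer the DSP is allowed to read and process."""
        return self.buffers[1 - self.dma_index]

    def swap(self):
        """Exchange roles; software must ensure both sides have finished."""
        self.dma_index = 1 - self.dma_index
```

Because the two sides never touch the same buffer at the same time, a single-port memory per buffer is enough, and the cost moves to software, which has to decide when it is safe to call `swap`.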

A simplified DMA architecture [Liu2008]

A typical DMA request
- Start address in data memory
- Start address in off-chip memory
- Length
- Priority
  - Permission to stall DSP core
  - Priority over other users of off-chip memory
- Interrupt flag
  - Should the DSP core get an interrupt when the DMA is finished?

Scatter Gather-DMA
- Request list
  - Pointer to DMA request 1
  - Pointer to DMA request 2
  - Pointer to DMA request 3
  - End of List
- Each DMA request
  - Start address in data memory
  - Start address in off-chip memory
  - Length
  - Priority
    - Permission to stall DSP core
    - Priority over other users of off-chip memory
  - Interrupt flag
    - Should the DSP core get an interrupt when the DMA is finished?
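
A request descriptor and a scatter-gather list can be sketched as plain data structures. The Python model below is an illustration only: the field names are hypothetical, the transfer direction (off-chip memory to data memory) is an assumption, `None` stands in for the End of List marker, and the priority fields are stored but not acted on.

```python
from dataclasses import dataclass


@dataclass
class DmaRequest:
    """One DMA request with the fields listed above (names are hypothetical)."""
    dm_start: int                      # start address in data memory
    offchip_start: int                 # start address in off-chip memory
    length: int                        # number of words to transfer
    may_stall_dsp: bool = False        # permission to stall the DSP core
    offchip_priority: int = 0          # priority over other off-chip users
    interrupt_when_done: bool = False  # raise an interrupt when finished?


def run_scatter_gather(request_list, offchip, data_memory):
    """Process requests in order until the end-of-list marker (None).

    Assumption: data moves from off-chip memory into data memory.
    Returns the number of interrupts that would be raised.
    """
    interrupts = 0
    for request in request_list:
        if request is None:            # End of List
            break
        for i in range(request.length):
            data_memory[request.dm_start + i] = offchip[request.offchip_start + i]
        if request.interrupt_when_done:
            interrupts += 1
    return interrupts
```

The DSP core sets up the whole list once and then only needs to react to the interrupt of the last request, instead of programming the DMA for every block.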

Discussed on the whiteboard
- Why a cache takes more power than a scratch pad memory
- Schematic of a basic AGU including a control table
- Bit-reversed addressing in an AGU