Architectures for Programmable DSPs: Basic Architectural Features


Basic Architectural Features
Dr Deepa Kundur, University of Toronto

A digital signal processor is a specialized microprocessor for real-time DSP computing. DSP applications commonly share the following characteristics:
- Algorithms are mathematically intensive; common algorithms require many multiply-and-accumulate (MAC) operations.
- Algorithms must run in real time; processing of a data block must complete before the next block arrives.
- Algorithms are under constant development; DSP systems should be flexible enough to support changes and improvements in the state of the art.

Programmable DSPs should provide the instructions one would find in most general-purpose microprocessors. The basic instruction capabilities (provided with dedicated high-speed hardware) should include:
- arithmetic operations: add, subtract and multiply
- logic operations: AND, OR, XOR and NOT
- the multiply-and-accumulate (MAC) operation
- signal scaling operations before and/or after digital signal processing

The supporting architecture should include:
- RAM, i.e., on-chip memory for signal samples
- ROM, i.e., on-chip program memory for programs and algorithm parameters such as filter coefficients
- on-chip registers for storage of intermediate results

Multiplier

The datapath of a DSP typically comprises a multiplier, a shifter, a multiply-and-accumulate (MAC) unit and an arithmetic logic unit. The following specifications are important when designing a multiplier:
- speed: decided by the architecture, which trades off against circuit complexity and power dissipation
- accuracy: decided by the format representation (number of bits and fixed/floating point)
- dynamic range: decided by the format representation

Parallel Multiplier

Advances in the speed and density of VLSI technology have made hardware implementation of parallel (array) multipliers possible. Parallel multipliers implement a complete multiplication of two binary numbers, generating the product within a single processor cycle!

Parallel Multiplier: Bit Expansion

Consider the multiplication of two unsigned fixed-point integers A and B, where A is m bits (A_{m-1}, A_{m-2}, ..., A_0) and B is n bits (B_{n-1}, B_{n-2}, ..., B_0):

A = sum_{i=0}^{m-1} A_i 2^i,   0 <= A <= 2^m - 1,   A_i in {0, 1}
B = sum_{j=0}^{n-1} B_j 2^j,   0 <= B <= 2^n - 1,   B_j in {0, 1}

Generally, r bits with r > max(m, n) are required to represent the product P = A × B; this is known as bit expansion.

Parallel Multiplier: Bit Expansion

Q: How many bits are required to represent P = A × B?

Let the minimum number of bits needed to represent the range of P be r. An r-bit unsigned fixed-point integer can represent values between 0 and 2^r - 1, so 0 <= P <= 2^r - 1, where

P_min = A_min × B_min = 0 × 0 = 0
P_max = A_max × B_max = (2^m - 1)(2^n - 1) = 2^(n+m) - 2^m - 2^n + 1

Rephrased Q: how many bits are required to represent P_max?

Since 2^m + 2^n - 1 > 1 for positive n, m, we have P_max = 2^(n+m) - (2^m + 2^n - 1) < 2^(n+m); and for large n, m, P_max ≈ 2^(n+m). Therefore P_max < 2^(n+m) is a tight bound, and

r = ceil(log2(P_max)) = log2(2^(n+m)) = m + n

Parallel Multiplier: n = m = 4

Example: m = n = 4; note r = m + n = 4 + 4 = 8.

P = A × B = (sum_{i=0}^{3} A_i 2^i)(sum_{j=0}^{3} B_j 2^j)
  = (A_0 2^0 + A_1 2^1 + A_2 2^2 + A_3 2^3)(B_0 2^0 + B_1 2^1 + B_2 2^2 + B_3 2^3)
  = A_0 B_0 2^0 + (A_0 B_1 + A_1 B_0) 2^1 + (A_0 B_2 + A_1 B_1 + A_2 B_0) 2^2
    + (A_0 B_3 + A_1 B_2 + A_2 B_1 + A_3 B_0) 2^3 + (A_1 B_3 + A_2 B_2 + A_3 B_1) 2^4
    + (A_2 B_3 + A_3 B_2) 2^5 + (A_3 B_3) 2^6
  = P_0 2^0 + P_1 2^1 + P_2 2^2 + P_3 2^3 + P_4 2^4 + P_5 2^5 + P_6 2^6 + P_7 2^7

The expansion only reaches 2^6, yet the product needs 8 bits: the column sums can exceed 1, and the extra bits come from carry-overs.
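As a quick sanity check of the bit-expansion bound (a Python sketch, not part of the original slides): the largest m-bit × n-bit product always fits in m + n bits, and for m = n = 4 it needs all 8 of them.

```python
# Hypothetical check of the bit-expansion bound r = m + n.
def product_bits(m, n):
    """Bits needed for the largest m-bit x n-bit unsigned product."""
    p_max = (2**m - 1) * (2**n - 1)   # P_max = 2^(n+m) - 2^m - 2^n + 1
    return p_max.bit_length()

assert product_bits(4, 4) == 8                      # r = m + n for m = n = 4
assert all(product_bits(m, n) <= m + n              # bound holds in general
           for m in range(1, 9) for n in range(1, 9))
```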

Parallel Multiplier: n = m = 4

Recall base-10 addition. Example: 3785 + 6584 = 10369, where the carry-overs propagate from one digit position to the next. The same happens in the binary expansion above: each column sum can exceed 1, so we need to compensate for carry-over bits!

P_0 = A_0 B_0
P_1 = (A_0 B_1 + A_1 B_0 + 0) mod 2, producing a carry-out into the next column
P_2 = (A_0 B_2 + A_1 B_1 + A_2 B_0 + previous carry-over) mod 2
...
P_6 = (A_3 B_3 + previous carry-over) mod 2
P_7 = previous carry-over

Parallel Multiplier: Braun Multiplier

[Figure: structure of a 4 × 4 Braun multiplier (i.e., m = n = 4): an array of AND gates and full-adder cells, where each cell takes an operand bit and a carry-in and produces a sum and a carry-out, with the carries rippling through the array.]
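The carry compensation above can be sketched as follows (Python; the function name `array_multiply` is illustrative, not from the slides): form the column sums of partial products A_i B_j, then ripple the carries so that each product bit P_k is 0 or 1.

```python
# Sketch of an array multiplier on little-endian bit lists.
def array_multiply(a_bits, b_bits):
    """Multiply two unsigned numbers given as [LSB, ..., MSB] bit lists."""
    m, n = len(a_bits), len(b_bits)
    cols = [0] * (m + n)
    for i in range(m):                    # partial products A_i * B_j
        for j in range(n):
            cols[i + j] += a_bits[i] * b_bits[j]
    p, carry = [], 0
    for s in cols:                        # compensate for carry-over bits
        s += carry
        p.append(s % 2)                   # product bit P_k
        carry = s // 2                    # carry into the next column
    return p

# 4x4 example: A = 13 = [1,0,1,1] (LSB first), B = 11 = [1,1,0,1]; 13*11 = 143
assert array_multiply([1, 0, 1, 1], [1, 1, 0, 1]) == [1, 1, 1, 1, 0, 0, 0, 1]
```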

Parallel Multiplier: Braun Multiplier

Speed: for a parallel multiplier, the multiplication time is only the longest path delay through the gates and adders (well within one processor cycle).

Note: additional hardware before and after the Braun multiplier is required to deal with signed numbers represented in two's-complement form.

Bus widths: a straightforward implementation requires two buses of width n bits and a third bus of width 2n bits for the product, which is expensive to implement. To avoid complex bus implementations:
- the program bus can be reused after the multiplication instruction is fetched
- the bus for X can be used for Z by discarding the lower n bits of Z, or by saving Z at two successive memory locations

Shifter

Q: When is scaling useful?

A shifter is required to scale operands and results down or up to avoid errors resulting from overflows and underflows during computations.

Note: when computing the sum of N numbers, each represented by n bits, the overall sum will have n + log2(N) bits. Q: Why? Each number is represented with n bits, so for the sum of N numbers

P_max = N(2^n - 1), and therefore r = ceil(log2 P_max) ≈ log2(N · 2^n) = n + log2 N

To avoid overflow, each of the N numbers can be scaled down by log2(N) bits before conducting the sum; to obtain the actual sum, the result is scaled up by log2(N) bits when required. There is a trade-off between overflow prevention and accuracy.

Shifter

Example: suppose n = 4 and we are summing N = 3 unsigned fixed-point integers:

S = x1 + x2 + x3
x1 = 10 = [1 0 1 0],  x2 = 5 = [0 1 0 1],  x3 = 8 = [1 0 0 0]
S = 10 + 5 + 8 = 23 > 2^4 - 1 = 15

We must scale the numbers down by at least log2(N) = log2(3) ≈ 1.584 bits; here a single right shift (scaling by a factor of 2) suffices.

To scale the numbers down by a factor of 2:
x1 = 10 = [1 0 1 0] → x̂1 = [0 1 0 1] = 5
x2 =  5 = [0 1 0 1] → x̂2 = [0 0 1 0] = 2 ≈ 5/2
x3 =  8 = [1 0 0 0] → x̂3 = [0 1 0 0] = 4

Adding: Ŝ = x̂1 + x̂2 + x̂3 = 5 + 2 + 4 = 11 = [1 0 1 1]

Scaling the sum up by a factor of 2 (allowing bit expansion here):
S ≈ [1 0 1 1 0] = 22 ≈ 23 = 10 + 5 + 8 = S

Consider the following related example. To scale the numbers down by a factor of 2:
x1 = 11 = [1 0 1 1] → x̂1 = [0 1 0 1] = 5 ≈ 11/2
x2 =  5 = [0 1 0 1] → x̂2 = [0 0 1 0] = 2 ≈ 5/2
x3 =  9 = [1 0 0 1] → x̂3 = [0 1 0 0] = 4 ≈ 9/2

Adding: Ŝ = 5 + 2 + 4 = 11 = [1 0 1 1]; scaling up by a factor of 2: S ≈ [1 0 1 1 0] = 22 ≈ 25 = 11 + 5 + 9 = S. The error is larger here because each right shift discards a low-order bit.

Q: When else is scaling useful? When conducting floating-point additions, each operand should be normalized to the same exponent prior to addition; one of the operands can be shifted by the required number of bit positions to equalize the exponents.
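The worked example can be reproduced with a short sketch (Python; the helper `scaled_sum` is illustrative, not from the slides): right-shifting before the sum prevents overflow, at the cost of the discarded low-order bits.

```python
# Sketch of the scale-down / add / scale-up procedure above.
def scaled_sum(xs, shift):
    """Right-shift each operand, add, then left-shift the result back."""
    s_hat = sum(x >> shift for x in xs)   # each >> discards low-order bits
    return s_hat << shift                 # scale back up (bit expansion allowed)

assert scaled_sum([10, 5, 8], 1) == 22    # true sum is 23: small error
assert scaled_sum([11, 5, 9], 1) == 22    # true sum is 25: larger error
```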

Barrel Shifter

Shifting in conventional microprocessors is implemented by an operation similar to that of a shift register, taking one clock cycle for every single-bit shift. Since many shifts are often required, this creates a latency of multiple clock cycles. Barrel shifters allow shifting by multiple bit positions within one clock cycle, reducing latency for real-time DSP computations.

Implementation of a 4-bit shift-right barrel shifter: the input bits are routed to the output bits through a matrix of switches; a switch closes when its control signal is on, and the control inputs encode the number of bit positions for the shift (left/right). Note: only one control signal can be on at a time.

Input (A3 A2 A1 A0)   Shift (switch)   Output (B3 B2 B1 B0)
A3 A2 A1 A0           0 (S0)           A3 A2 A1 A0
A3 A2 A1 A0           1 (S1)           A3 A3 A2 A1
A3 A2 A1 A0           2 (S2)           A3 A3 A3 A2
A3 A2 A1 A0           3 (S3)           A3 A3 A3 A3

The logic circuit takes only a fraction of a clock cycle to execute; the majority of the delay is in decoding the control lines and setting up the path from the input lines to the output lines.
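A behavioral sketch of the table above (Python; the function is illustrative, not the hardware): the sign bit A3 is replicated into the vacated positions, and the result is available in a single step regardless of the shift amount.

```python
# Sketch of a 4-bit arithmetic shift-right barrel shifter.
def barrel_shift_right(bits, shift):
    """bits = [A3, A2, A1, A0] (MSB first); returns [B3, B2, B1, B0]."""
    # Output bit i comes from input bit (i - shift), clamped to the MSB A3,
    # which replicates the sign bit into the vacated positions.
    return [bits[max(i - shift, 0)] for i in range(4)]

a = [1, 0, 1, 0]                               # A3 A2 A1 A0
assert barrel_shift_right(a, 0) == [1, 0, 1, 0]
assert barrel_shift_right(a, 1) == [1, 1, 0, 1]
assert barrel_shift_right(a, 3) == [1, 1, 1, 1]
```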

Multiply and Accumulate (MAC) Unit

The MAC unit performs the accumulation of a series of successively generated products, a common operation in DSP applications such as filtering. It consists of a multiplier, a product register, an ADD/SUB unit and an accumulator, and can implement A + B·C operations; clearing the accumulator at the right time (e.g., as an initialization to zero) provides the appropriate sum of products.

Multiplication and accumulation each require a separate instruction execution cycle; however, they can work in parallel: while the multiplier is working on the current product, the accumulator works on adding the previous product.

If N products are to be accumulated, N - 1 multiplies can overlap with the accumulation. During the first multiply the accumulator is idle, and during the last accumulate the multiplier is idle, since all N products have already been computed. To compute a MAC of N products, N + 1 instruction execution cycles are therefore required; for N >> 1, this works out to almost one MAC operation per instruction cycle.
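The N + 1 cycle count can be checked with a toy timeline simulation (a Python sketch, not from the slides), stepping the multiplier and the accumulator in parallel:

```python
# Sketch: count cycles while the multiplier and accumulator overlap.
def mac_cycles(n):
    """Cycles to accumulate n products with an overlapped MAC unit."""
    cycles = 0
    products = 0        # products computed by the multiplier so far
    accumulated = 0     # products folded into the accumulator so far
    while accumulated < n:
        cycles += 1
        if accumulated < products:
            accumulated += 1              # add the previous product
        if products < n:
            products += 1                 # compute the next product in parallel
    return cycles

assert mac_cycles(1) == 2
assert mac_cycles(8) == 9                 # N + 1 cycles for N products
assert mac_cycles(256) == 257
```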

MAC Unit: Example

Q: If a sum of 256 products is to be computed using a pipelined MAC unit, and the execution time of the MAC unit is 100 ns, what is the total time required to compute the operation?

A: For 256 MAC operations, 257 execution cycles are needed. Total time required = 257 × 100 × 10^-9 s = 25.7 µs.

Overflow and Underflow

Strategies to address overflow and underflow in the MAC unit:
- accumulator guard bits (i.e., extra bits for the accumulator); implication: the size of the ADD/SUB unit increases
- barrel shifters at the input and output of the MAC unit, to normalize values
- saturation logic, which assigns the largest (smallest) representable value to the accumulator when overflow (underflow) occurs

Arithmetic and Logic Unit

The arithmetic logic unit (ALU) carries out the additional arithmetic and logic operations required for a DSP:
- add, subtract, increment, decrement, negate
- AND, OR, NOT, XOR, compare
- shift, multiply (uncommon in general microprocessors)

with additional features common to general microprocessors:
- status flags for sign, zero, carry and overflow
- overflow management via saturation logic
- register files for storing intermediate results

Bus Architecture and Memory

Bus architecture and memory play a significant role in dictating the cost, speed and size of DSPs. Common architectures include the von Neumann and Harvard architectures.
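Saturation logic, for instance, can be sketched as a clamp to the two's-complement range (Python; illustrative only, not a specific DSP's behavior):

```python
# Sketch of saturation logic for a `bits`-wide two's-complement accumulator:
# on overflow (underflow) the largest (smallest) representable value is
# assigned instead of letting the result wrap around.
def saturate(value, bits=8):
    hi, lo = 2**(bits - 1) - 1, -2**(bits - 1)
    return min(max(value, lo), hi)

assert saturate(200, bits=8) == 127       # overflow clamps to +127
assert saturate(-300, bits=8) == -128     # underflow clamps to -128
assert saturate(42, bits=8) == 42         # in-range values pass through
```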

von Neumann Architecture

Program and data reside in the same memory, and a single bus (address and data) is used to access both. Implication: program execution slows down, since the processor has to wait for data even after the instruction is made available.

Harvard Architecture

Program and data reside in separate memories, with two independent address/data buses. Implication: faster program execution because of the simultaneous memory access capability. But what if there are two operands?

Another Possible DSP Bus Structure

A structure with a program memory and two data memories, each with dedicated address and data buses, is suited to operations with two operands (e.g., multiplication). Implications: it requires more hardware and interconnections, increasing cost; a hardware complexity versus speed trade-off is needed!

On-Chip Memory

On-chip (i.e., on-processor) memories help run DSP algorithms faster than when memory is off-chip, since dedicated address and data buses are available.
- speed: on-chip memories should match the speed of the ALU operations
- size: the more chip area the memory takes, the less area is available for other DSP functions

On-Chip Memory

Wish list: all memory should reside on-chip, with separate on-chip program and data spaces, and the on-chip data space partitioned further into areas for data samples, coefficients and results. Implication: the area dedicated to memory would be so large that basic DSP functions might not be implementable on-chip!

Practical Organization of On-Chip Memory

For DSP algorithms requiring repeated execution of a single instruction (e.g., MAC):
- instructions can be placed in external memory and, once fetched, placed in the instruction cache
- the result is normally saved only at the end, so external memory can be employed for it
- just two data memories for the operands can be placed on-chip

Using dual-access on-chip memories (which can be accessed twice per instruction cycle), one can get away with only two on-chip memories: for instructions such as multiply, the instruction fetch, the two operand fetches and the memory access to save the result can all be done in one clock cycle, and the on-chip memory can be configured for different uses at different times.

Data Addressing Capabilities

An efficient way of accessing data (signal samples and filter coefficients) can significantly improve implementation performance. Flexible ways to access data help in writing efficient programs; data addressing modes enhance DSP implementations.

DSP Addressing Modes
- immediate
- register
- direct
- indirect
- special addressing modes: circular, bit-reversed

Immediate Addressing Mode

The operand's value is explicitly known: the capability to include data as part of the instruction.

Instruction: ADD #imm        Operation: #imm + A → A

#imm: the value represented by imm (a fixed number, such as a filter coefficient, known ahead of time); A: accumulator register.

Register Addressing Mode

The operand is always in the processor register reg: the capability to reference data through its register.

Instruction: ADD reg        Operation: reg + A → A

reg: processor register providing the operand; A: accumulator register.

Direct Addressing Mode

The operand is always in memory location mem: the capability to reference data by giving its memory location directly.

Instruction: ADD mem        Operation: mem + A → A

mem: specified memory location providing the operand (e.g., the memory location could hold an input signal value); A: accumulator register.

Indirect Addressing Mode

The operand's memory location is variable: the operand address is given by the value of the register addrreg, i.e., the operand is accessed using the pointer addrreg.

Instruction: ADD addrreg        Operation: the value pointed to by addrreg, plus A → A

addrreg: must be loaded with the operand's memory location before use; A: accumulator register.

Special Addressing Modes

Circular Addressing Mode: a circular buffer allows one to handle a continuous stream of incoming data samples; once the end of the buffer is reached, samples wrap around and are added at the beginning again. This is useful for implementing real-time digital signal processing where the input stream is effectively continuous, and it avoids constantly testing for the need to wrap. Suppose eight registers store an incoming data stream:

Reference Index                     Address
0 = 0 mod 8 = 8 mod 8 = 16 mod 8    000 = 0
1 = 1 mod 8 = 9 mod 8 = 17 mod 8    001 = 1
2 = 2 mod 8 = 10 mod 8 = 18 mod 8   010 = 2
3 = 3 mod 8 = 11 mod 8 = 19 mod 8   011 = 3
4 = 4 mod 8 = 12 mod 8 = 20 mod 8   100 = 4
5 = 5 mod 8 = 13 mod 8 = 21 mod 8   101 = 5
6 = 6 mod 8 = 14 mod 8 = 22 mod 8   110 = 6
7 = 7 mod 8 = 15 mod 8 = 23 mod 8   111 = 7

Bit-Reversed Addressing Mode: the address generation unit can be provided with the capability of producing bit-reversed indices. This is useful for implementing radix-2 FFT (fast Fourier transform) algorithms, where either the input or the output is in bit-reversed order:

Input Index   Output Index
000 = 0       000 = 0
001 = 1       100 = 4
010 = 2       010 = 2
011 = 3       110 = 6
100 = 4       001 = 1
101 = 5       101 = 5
110 = 6       011 = 3
111 = 7       111 = 7
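Both special modes reduce to simple index arithmetic, sketched here in Python (illustrative helpers, not the DSP's address generation hardware): a modulo for circular addressing and a bit reversal matching the table above.

```python
# Sketch of circular and bit-reversed index generation.
def circular_index(i, size=8):
    """Wrap the running index instead of testing for the buffer end."""
    return i % size

def bit_reverse(i, nbits=3):
    """Reverse the low `nbits` bits of i (radix-2 FFT index order)."""
    out = 0
    for _ in range(nbits):                # peel bits off i, push them onto out
        out = (out << 1) | (i & 1)
        i >>= 1
    return out

assert [circular_index(i) for i in (0, 8, 16, 23)] == [0, 0, 0, 7]
assert [bit_reverse(i) for i in range(8)] == [0, 4, 2, 6, 1, 5, 3, 7]
```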

Bit-Reversed Addressing: Why?

[Figure: signal-flow graph of an 8-point radix-2 FFT with three butterfly stages; the inputs x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7) enter in bit-reversed order, and the outputs X(0) through X(7) emerge in natural order.]

Hardware Architecture

Fast execution of algorithms is the most important requirement of a DSP architecture: high-speed instruction operation and large throughputs, facilitated by advances in VLSI technology and by design innovations. Dedicated hardware support for multiplications, scaling, loops and repeats, and special addressing modes is essential for fast DSP implementations. The Harvard architecture significantly improves program execution time compared to the von Neumann architecture, and on-chip memories aid the speed of program execution considerably.

Parallelism

Parallelism means the provision of multiple function units that may operate in parallel to increase throughput: multiple memories, and different ALUs for data and address computations.
- advantage: algorithms can perform more than one operation at a time, increasing speed
- disadvantage: complex hardware is required to control the units and to ensure that instructions and data can be fetched simultaneously

Pipelining

Pipelining is an architectural feature in which an instruction is broken into a number of steps; a separate unit performs each step, all at the same time, usually working on different stages of the data.
- advantage: if repeated use of the instruction is required, then after an initial latency the throughput becomes one instruction per unit time
- disadvantages: pipeline latency, and having to break instructions into equally timed steps

Pipelining Example

Five steps:
Step 1: instruction fetch
Step 2: instruction decode
Step 3: operand fetch
Step 4: execute
Step 5: save

Time Slot   Step 1   Step 2   Step 3   Step 4   Step 5   Result
t0          Inst 1
t1          Inst 2   Inst 1
t2          Inst 3   Inst 2   Inst 1
t3          Inst 4   Inst 3   Inst 2   Inst 1
t4          Inst 5   Inst 4   Inst 3   Inst 2   Inst 1   Inst 1 complete
t5          Inst 6   Inst 5   Inst 4   Inst 3   Inst 2   Inst 2 complete

Simplifying assumption: all steps take equal time.

System Level Parallelism and Pipelining

Consider an 8-tap FIR filter:

y(n) = sum_{k=0}^{7} h(k) x(n - k)
     = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + ... + h(6)x(n-6) + h(7)x(n-7)

The input needed in registers is [x(n) x(n-1) x(n-2) ... x(n-7)], and the time to produce y(n) is the time to process this input block. The filter can be implemented in many ways, depending on the number of multipliers and accumulators available. A new input x(n+1) can be processed after y(n) is produced; the corresponding register contents are [x(n+1) x(n) x(n-1) ... x(n-6)].
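The 8-tap FIR computation above can be sketched as a single-MAC loop (Python; the coefficient values below are made-up examples): one multiply-accumulate per iteration, eight iterations per output sample.

```python
# Sketch of the 8-tap FIR output computed with a single MAC:
# one product term per time step T, so one output sample takes 8T.
def fir_output(h, x_regs):
    """h: 8 coefficients; x_regs: [x(n), x(n-1), ..., x(n-7)]."""
    acc = 0                               # accumulator initialized to zero
    for k in range(8):                    # one MAC operation per iteration
        acc += h[k] * x_regs[k]           # Accumulator += h(k) x(n - k)
    return acc

h = [1, 2, 3, 4, 4, 3, 2, 1]              # hypothetical example coefficients
assert fir_output(h, [1, 0, 0, 0, 0, 0, 0, 0]) == 1   # impulse gives h(0)
assert fir_output(h, [1] * 8) == sum(h)               # all-ones input sums h
```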

System Level Parallelism and Pipelining

If it takes T_B time units to process the register block, then for a continuous input stream the throughput is one output sample per T_B time units: a new input sample is placed into the register block every T_B time units.

Time   Register Block
0      [x(n) x(n-1) x(n-2) ... x(n-7)]
T_B    [x(n+1) x(n) x(n-1) ... x(n-6)]
2T_B   [x(n+2) x(n+1) x(n) ... x(n-5)]
3T_B   [x(n+3) x(n+2) x(n+1) ... x(n-4)]

A shift in the register block every T_B time units is needed to accommodate each new input sample. If the sampling period T_S is larger than T_B, the processor may be idle; if T_S is less than T_B, buffering is needed, since samples arrive faster than they can be processed. T_B can be reduced with appropriate parallelism and pipelining.

Implementation Using a Single MAC

Let T be the time taken to compute one product term and add it to the accumulator. A new input sample can then be processed every 8T time units; i.e., T_B = 8T. At t = 0, initialization occurs: Accumulator = 0.

At t = T:  Accumulator = 0 + h(0)x(n)
At t = 2T: Accumulator = h(0)x(n) + h(1)x(n-1)
At t = 3T: Accumulator = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2)
At t = 4T: Accumulator = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + h(3)x(n-3)

At t = 5T: Accumulator = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + h(3)x(n-3) + h(4)x(n-4)
At t = 6T: Accumulator = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + h(3)x(n-3) + h(4)x(n-4) + h(5)x(n-5)
At t = 7T: Accumulator = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + h(3)x(n-3) + h(4)x(n-4) + h(5)x(n-5) + h(6)x(n-6)
At t = 8T: Accumulator = h(0)x(n) + h(1)x(n-1) + h(2)x(n-2) + h(3)x(n-3) + h(4)x(n-4) + h(5)x(n-5) + h(6)x(n-6) + h(7)x(n-7)

Pipelined Implementation: 8 Multipliers and 8 Accumulators

Let T be the time taken to compute one product term and add it to the accumulator. With 8 multipliers and 8 accumulators arranged as a pipeline, a new input sample can be processed every T time units; i.e., T_B = T (8 times faster!).

Parallel Implementation: Two MACs

With two MAC units working in parallel, each handling four of the eight product terms, a new input sample can be processed every 4T time units; i.e., T_B = 4T.