Chapter 3: Pipelining and Parallel Processing. Keshab K. Parhi

Similar documents
CMOS Thyristor Based Low Frequency Ring Oscillator

on an system with an infinite number of processors. Calculate the speedup of

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

NTE2053 Integrated Circuit 8 Bit MPU Compatible A/D Converter

NETWORK STRUCTURES FOR IIR SYSTEMS. Solution 12.1

LMS is a simple but powerful algorithm and can be implemented to take advantage of the Lattice FPGA architecture.

Abstract. Cycle Domain Simulator for Phase-Locked Loops

Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2)

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

PROGETTO DI SISTEMI ELETTRONICI DIGITALI. Digital Systems Design. Digital Circuits Advanced Topics

Interfacing Analog to Digital Data Converters

ECE124 Digital Circuits and Systems Page 1

MIMO detector algorithms and their implementations for LTE/LTE-A

Experiment NO.3 Series and parallel connection

1. Memory technology & Hierarchy

Low Power AMD Athlon 64 and AMD Opteron Processors

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Gate Delay Model. Estimating Delays. Effort Delay. Gate Delay. Computing Logical Effort. Logical Effort

Alpha CPU and Clock Design Evolution

Diode Applications. As we have already seen the diode can act as a switch Forward biased or reverse biased - On or Off.

Chapter 9 Latches, Flip-Flops, and Timers

Chapter 6 PLL and Clock Generator

Series and Parallel Circuits

Systolic Computing. Fundamentals

From Control Loops to Software

Power Reduction Techniques in the SoC Clock Network. Clock Power

Counters and Decoders

High-Level Synthesis for FPGA Designs

AN111: Using 8-Bit MCUs in 5 Volt Systems

The full wave rectifier consists of two diodes and a resister as shown in Figure

ICS514 LOCO PLL CLOCK GENERATOR. Description. Features. Block Diagram DATASHEET

Introduction to Digital System Design

Timing Methodologies (cont d) Registers. Typical timing specifications. Synchronous System Model. Short Paths. System Clock Frequency

Sequential Logic: Clocks, Registers, etc.

Introduction, Rate and Latency

9/14/ :38

ISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7

Kirchhoff's Current Law (KCL)

A New Paradigm for Synchronous State Machine Design in Verilog

M25P05-A. 512-Kbit, serial flash memory, 50 MHz SPI bus interface. Features

A Novel Low Power, High Speed 14 Transistor CMOS Full Adder Cell with 50% Improvement in Threshold Loss Problem

Application Note 82 Using the Dallas Trickle Charge Timekeeper

Section 3. Sensor to ADC Design Example

Lecture 5: Gate Logic Logic Optimization

On some Potential Research Contributions to the Multi-Core Enterprise

CHAPTER 5 FINITE STATE MACHINE FOR LOOKUP ENGINE

Topics of Chapter 5 Sequential Machines. Memory elements. Memory element terminology. Clock terminology

Let s put together a Manual Processor

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Lesson 12 Sequential Circuits: Flip-Flops

VLSI IMPLEMENTATION OF INTERNET CHECKSUM CALCULATION FOR 10 GIGABIT ETHERNET

4.2 Description of the Event operation Network (EON)

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

RAM & ROM Based Digital Design. ECE 152A Winter 2012

Lecture 10 Sequential Circuit Design Zhuo Feng. Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 2010

TRUE SINGLE PHASE CLOCKING BASED FLIP-FLOP DESIGN

Flip-Flops, Registers, Counters, and a Simple Processor

International Journal of Electronics and Computer Science Engineering 1482

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

Elettronica dei Sistemi Digitali Costantino Giaconia SERIAL I/O COMMON PROTOCOLS

Instruction scheduling

Series and Parallel Circuits

PowerPC Microprocessor Clock Modes

Measurement of Capacitance

CHAPTER 16 MEMORY CIRCUITS

CD4027BC Dual J-K Master/Slave Flip-Flop with Set and Reset

EE 42/100 Lecture 24: Latches and Flip Flops. Rev B 4/21/2010 (2:04 PM) Prof. Ali M. Niknejad

Fault Modeling. Why model faults? Some real defects in VLSI and PCB Common fault models Stuck-at faults. Transistor faults Summary

CMOS Logic Integrated Circuits

Sequential 4-bit Adder Design Report

Software Programmable DSP Platform Analysis Episode 7, Monday 19 March 2007, Ingredients. Software Pipelining. Data Dependence. Resource Constraints

MM74HC4538 Dual Retriggerable Monostable Multivibrator

Profiling services for resource optimization and capacity planning in distributed systems

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

(Refer Slide Time: 2:03)

Chapter 19 Operational Amplifiers

PACKAGE OUTLINE DALLAS DS2434 DS2434 GND. PR 35 PACKAGE See Mech. Drawings Section

An Overview of Distributed Databases

Control 2004, University of Bath, UK, September 2004

Experiment 2 Diode Applications: Rectifiers

8-Bit Flash Microcontroller for Smart Cards. AT89SCXXXXA Summary. Features. Description. Complete datasheet available under NDA

A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form

Power Supplies. 1.0 Power Supply Basics. Module

Module 4 : Propagation Delays in MOS Lecture 22 : Logical Effort Calculation of few Basic Logic Circuits

HT9170 DTMF Receiver. Features. General Description. Selection Table

Transistor Amplifiers

COMBINATIONAL and SEQUENTIAL LOGIC CIRCUITS Hardware implementation and software design

The I2C Bus. NXP Semiconductors: UM10204 I2C-bus specification and user manual HAW - Arduino 1

LTC Channel Analog Multiplexer with Serial Interface U DESCRIPTIO

Web Streamed SDR s in Service at GB2RHQ Hack Green Bunker

Accurate Measurement of the Mains Electricity Frequency

SEQUENTIAL CIRCUITS. Block diagram. Flip Flop. S-R Flip Flop. Block Diagram. Circuit Diagram

Experiment #5, Series and Parallel Circuits, Kirchhoff s Laws

What is a System on a Chip?

Lecture-3 MEMORY: Development of Memory:

ELECTRICAL CIRCUITS. Electrical Circuits

Product Datasheet P MHz RF Powerharvester Receiver

Series and Parallel Resistive Circuits

Interfacing To Alphanumeric Displays

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Transcription:

Chapter 3: Pipelining and Parallel Processing Keshab K. Parhi

Outline Introduction Pipelining of FIR Digital Filters Parallel Processing Pipelining and Parallel Processing for Low Power Pipelining for Lower Power Parallel Processing for Lower Power Combining Pipelining and Parallel Processing for Lower Power

Introduction Pipelining Comes from the idea of a water pipe: continue sending water without waiting the water in the pipe to be out water pipe leads to a reduction in the critical path Either increases the clock speed (or sampling speed) or reduces the power consumption at same speed in a DSP system Parallel Processing Multiple outputs are computed in parallel in a clock period he effective sampling speed is increased by the level of parallelism Can also be used to reduce the power consumption 3

Introduction (cont d) Example 1: Consider a 3-tap FIR filter: y(n)=ax(n)+bx(n-1)+cx(n-) x(n) x(n-1) x(n-) D D a b c M A : : multiplica tion time Addition time y(n) he critical path (or the minimum time required for processing a new sample) is limited by 1 multiply and add times. hus, the sample period (or the sample frequency ) is given by: f sample sample M M + 1 + A A 4

Introduction (cont d) If some real-time application requires a faster input rate (sample rate), then this direct-form structure cannot be used! In this case, the critical path can be reduced by either pipelining or parallel processing. Pipelining: reduce the effective critical path by introducing pipelining latches along the critical data path Parallel Processing: increases the sampling rate by replicating hardware so that several inputs can be processed in parallel and several outputs can be produced at the same time Examples of Pipelining and Parallel Processing See the figures on the next page 5

Introduction (cont d) Example Figure (a): A data path Figure (b): he -level pipelined structure of (a) Figure (c): he -level parallel processing structure of (a) 6

Pipelining of FIR Digital Filters he pipelined implementation: By introducing additional latches in Example 1, the critical path is reduced from M+A to M+A.he schedule of events for this pipelined system is shown in the following table. You can see that, at any time, consecutive outputs are computed in an interleaved manner. Clock Input Node 1 Node Node 3 Output 0 x(0) ax(0)+bx(-1) 1 x(1) ax(1)+bx(0) ax(0)+bx(-1) cx(-) y(0) x() ax()+bx(1) ax(1)+bx(0) cx(-1) y(1) 3 x(3) ax(3)+bx() ax()+bx(1) cx(0) y() x(n) D D D a b c D 1 3 y(n) 7

Pipelining of FIR Digital Filters (cont d) In a pipelined system: In an M-level pipelined system, the number of delay elements in any path from input to output is (M-1) greater than that in the same path in the original sequential circuit Pipelining reduces the critical path, but leads to a penalty in terms of an increased latency Latency: the difference in the availability of the first output data in the pipelined system and the sequential system wo main drawbacks: increase in the number of latches and in system latency Important Notes: he speed of a DSP architecture ( or the clock period) is limited by the longest path between any latches, or between an input and a latch, or between a latch and an output, or between the input and the output 8

Pipelining of FIR Digital Filters (cont d) his longest path or the critical path can be reduced by suitably placing the pipelining latches in the DSP architecture he pipelining latches can only be placed across any feed-forward cutset of the graph wo important definitions Cutset: a cutset is a set of edges of a graph such that if these edges are removed from the graph, the graph becomes disjoint Feed-forward cutset: a cutset is called a feed-forward cutset if the data move in the forward direction on all the edge of the cutset Example 3: (P.66, Example 3..1, see the figures on the next page) (1) he critical path is A3 A5 A4 A6, its computation time: 4 u.t. () Figure (b) is not a valid pipelining because it s not a feed-forward cutset (3) Figure (c) shows -stage pipelining, a valid feed-forward cutset. Its critical path is u.t. 9

Pipelining of FIR Digital Filters (cont d) Signal-flow graph representation of Cutset Original SFG A cutset A feed-forward cutset Assume: he computation time for each node is assumed to be 1 unit of time 10

Pipelining of FIR Digital Filters (cont d) ransposed SFG and Data-broadcast structure of FIR filters ransposed SFG of FIR filters x(n) 1 Z 1 Z y(n) 1 Z 1 Z a b c y(n) a b c x(n) Data broadcast structure of FIR filters x(n) c b a D D y(n) 11

Pipelining of FIR Digital Filters (cont d) Fine-Grain Pipelining Let M=10 units and A= units. If the multiplier is broken into smaller units with processing times of 6 units and 4 units, respectively (by placing the latches on the horizontal cutset across the multiplier), then the desired clock period can be achieved as (M+A)/ A fine-grain pipelined version of the 3-tap data-broadcast FIR filter is shown below. Figure: fine-grain pipelining of FIR filter 1

Parallel Processing Parallel processing and pipelining techniques are duals each other: if a computation can be pipelined, it can also be processed in parallel. Both of them exploit concurrency available in the computation in different ways. How to design a Parallel FIR system? Consider a single-input single-output (SISO) FIR filter: y(n)=a*x(n)+b*x(n-1)+c*x(n-) Convert the SISO system into an MIMO (multiple-input multiple-output) system in order to obtain a parallel processing structure For example, to get a parallel system with 3 inputs per clock cycle (i.e., level of parallel processing L=3) y(3k)=a*x(3k)+b*x(3k-1)+c*x(3k-) y(3k+1)=a*x(3k+1)+b*x(3k)+c*x(3k-1) y(3k+)=a*x(3k+)+b*x(3k+1)+c*x(3k) 13

Parallel Processing (cont d) Parallel processing system is also called block processing, and the number of inputs processed in a clock cycle is referred to as the block size SISO x(3k) x(3k+1) x(n) y(n) MIMO x(3k+) y(3k) y(3k+1) y(3k+) Sequential System 3-Parallel System In this parallel processing system, at the k-th clock cycle, 3 inputs x(3k), x(3k+1) and x(3k+) are processed and 3 samples y(3k), y(3k+1) and y(3k+) are generated at the output Note 1: In the MIMO structure, placing a latch at any line produces an effective delay of L clock cycles at the sample rate (L: the block size). So, each delay element is referred to as a block delay (also referred to as L-slow) 14

Parallel Processing (cont d) For example: When block size is, 1 delay element = sampling delays x(k) x(k-) D When block size is 10, 1 delay element = 10 sampling delays x(10k) X(10k-10) D 15

Parallel Processing (cont d) Figure: Parallel processing architecture for a 3-tap FIR filter with block size 3 16

Parallel Processing (cont d) Note : he critical path of the block (or parallel) processing system remains unchanged. But since 3 samples are processed in 1 (not 3) clock cycle, the iteration (or sample) period is given by the following equations: clock iteration = M + sample A = clock L + 3 So, it is important to understand that in a parallel system sample clock, whereas in a pipelined system sample = clock Example: A complete parallel processing system with block size 4 (including serial-to-parallel and parallel-to-serial converters) (also see P.7, Fig. 3.11) M A 17

Parallel Processing (cont d) A serial-to-parallel converter Sample Period /4 /4 /4 /4 x(n) D D D 4k+3 A parallel-to-serial converter x(4k+3) x(4k+) x(4k+1) x(4k) y(4k+3) y(4k+) y(4k+1) y(4k) 4k 0 D D D /4 /4 /4 y(n) 18

Parallel Processing (cont d) Why use parallel processing when pipelining can be used equally well? Consider the following chip set, when the critical path is less than the I/O bound (output-pad delay plus input-pad delay and the wire delay between the two chips), we say this system is communication bounded So, we know that pipelining can be used only to the extent such that the critical path computation time is limited by the communication (or I/O) bound. Once this is reached, pipelining can no longer increase the speed Chip 1 Chip output pad communicat ion input pad computatio n 19

Parallel Processing (cont d) So, in such cases, pipelining can be combined with parallel processing to further increase the speed of the DSP system By combining parallel processing (block size: L) and pipelining (pipelining stage: M), the sample period can be reduce to: iteration = sample L clock Example: (p.73, Fig.3.15 ) Pipelining plus parallel processing Example (see the next page) Parallel processing can also be used for reduction of power consumption while using slow clocks = M 0

Parallel Processing (cont d) Example:Combined fine-grain pipelining and parallel processing for 3-tap FIR filter 1

Pipelining and Parallel Processing for Low Power wo main advantages of using pipelining and parallel processing: Higher speed and Lower power consumption When sample speed does not need to be increased, these techniques can be used for lowering the power consumption wo important formulas: Computing the propagation delay pd of CMOS circuit pd = C k V ch arg e ( V0 Vt Computing the power consumption in CMOS circuit 0 ) Ccharge: the capacitance to be charged or discharged in a single clock cycle P CMOS = C total V 0 f Ctotal: the total capacitance of the CMOS circuit

Pipelining and Parallel Processing for Low Power (cont d) Pipelining for Lower Power he power consumption in the original sequential FIR filter seq seq: the clock period of the = Ctotal V0 f, f = seq original sequential FIR filter P 1 For a M-level pipelined system, its critical path is reduced to 1/M of its original length, and the capacitance to be charged/discharged in a single clock cycle is also reduced to 1/M of its original capacitance If the same clock speed (clock frequency f) is maintained, only a fraction (1/M) of the original capacitance is charged/discharged in the same amount of time. his implies that the supply voltage can be reduced to βvo (0<β <1). Hence, the power consumption of the pipelined filter is: P pip = C total β V 0 f = β P seq 3

Pipelining and Parallel Processing for Low Power (cont d) he power consumption of the pipelined system, compared with the original system, is reduced by a factor ofβ How to determine the power consumption reduction factor β? Using the relationship between the propagation delay of the original filter and the pipelined filter Sequential (critical path): seq Pipelined: (critical path when M=3) ( ) V 0 seq seq seq ( ) βv 0 4

Pipelining and Parallel Processing for Low Power (cont d) he propagation delays of the original sequential filter and the pipelined FIR filter are: seq = C k V, ( C M ch arg e 0 ch arg e pip ( V0 Vt ) k ( βv0 Vt = ) βv ) 0 M Since the same clock speed is maintained in both filters, we get the equation to solve β: ( βv0 Vt ) = β ( V0 Vt ) Example: Please read textbook for Example 3.4.1 (pp.75) 5

Pipelining and Parallel Processing for Low Power (cont d) Parallel Processing for Low Power In an L-parallel system, the charging capacitance does not change, but the total capacitance is increased by L times In order to maintain the same sample rate, the clock period of the L- parallel circuit is increased to Lseq (where seq is the propagation delay of the original sequential circuit). his means that the charging capacitance is charged/discharged L times longer (i.e., Lseq). In other words, the supply voltage can be reduced to βvo since there is more time to charge the same capacitance How to get the power consumption reduction factor β? he propagation delay consideration can again be used to compute β (Please see the next page) 6

Pipelining and Parallel Processing for Low Power (cont d) Sequential(critical path): seq Parallel: (critical path when L=3) ( ) V 0 3 seq 3 seq 3 seq ( ) βv 0 he propagation delay of the original system is still same, but the propagation delay of the L-parallel system is given by L seq = C k β V ch arg e ( β V 0 V t 0 ) 7

Pipelining and Parallel Processing for Low Power (cont d) Hence, we get the following equation to compute β: L ( β V 0 V t ) = β ( V 0 V t ) Once β is computed, the power consumption of the L-parallel system can be calculated as P para = f ( LC total )( β V 0 ) = β L Examples: Please see the examples (Example 3.4. and Example 3.4.3) in the textbook (pp.77 and pp.80) P seq 8

Pipelining and Parallel Processing for Low Power (cont d) Figures for Example 3.4. A 4-tap FIR filter A -parallel filter An area-efficient -parallel filter 9

Pipelining and Parallel Processing for Low Power (cont d) Figures for Example 3.4.3 (pp.80) An area-efficient -parallel filter and its critical path 30

Pipelining and Parallel Processing for Low Power (cont d) Combining Pipelining and Parallel Processing for Lower Power Pipelining and parallel processing can be combined for lower power consumption: pipelining reduces the capacitance to be charged/discharged in 1 clock period, while parallel processing increases the clock period for charging/discharging the original capacitance 3 3 3 3 (a) 3 (b) 3 Figure: (a) Charge/discharge of entire capacitance in clock period (b) Charge/discharge of capacitance in clock period 3 using a 3-parallel -pipelined FIR filter 31

Pipelining and Parallel Processing for Low Power (cont d) he propagation delay of the L-parallel M-pipelined filter is obtained as: L pd = ( C M ) C ch arg e 0 ch arg e ( β V 0 V t ) k ( V 0 V t k β V = L V ) 0 Finally, we can obtain the following equation to compute β LM ( β V 0 V t ) = β ( V 0 V t ) Example: please see the example in the textbook (pp.8) 3