

Digital Device Components

A simple processor illustrates many of the basic components used in any digital system:
    Memory
    Control
    Input-Output
    Datapath

Datapath: The core. All other components are support units that either store the results of the datapath or determine what happens in the next cycle.

(December 11, 2000 3:44 pm) UNIVERSITY OF MARYLAND BALTIMORE COUNTY

Memory: A broad range of classes exists, determined by the way data is accessed:
    Read-Only vs. Read-Write
    Sequential vs. Random access
    Single-ported vs. Multi-ported access
or by their data retention characteristics:
    Dynamic vs. Static
Stay tuned for a more extensive treatment of memories.

Control: An FSM (sequential circuit) implemented using random logic, PLAs or memories.

Interconnect and Input-Output: Parasitic resistance, capacitance and inductance affect the performance of wires both on and off the chip. Growing die size increases the length of the on-chip interconnect, increasing the value of the parasitics.

Datapath elements include adders, multipliers, shifters, BFUs, etc.

The speed of these elements often dominates the overall system performance, so optimization techniques are important. However, as we will see, the task is non-trivial since there are multiple equivalent logic and circuit topologies to choose from, each with advantages and disadvantages in terms of speed, power and area. Also, optimizations focused at one design level, e.g., sizing transistors, lead to inferior designs.

Bit-sliced organization is common for datapaths.

[Figure: bit-sliced datapath. Blocks from Data-In to Data-Out: Registers, Adder, Shifter, Multiplexer, driven by Control; slices Bit 4 down to Bit 0.]

Datapath Operators: Addition/Subtraction

Let's start with addition, since it is a very common datapath element and often a speed-limiting element. Optimizations can be applied at the logic or circuit level.
    Logic-level optimizations try to rearrange the Boolean equations to produce a faster or smaller circuit, e.g. the carry look-ahead adder.
    Circuit-level optimizations manipulate transistor sizes and circuit topology to optimize speed.

Let's start with some basic definitions before considering optimizations:

A  B  C_i | G (A.B)  P (A+B)  P' (A XOR B) | Sum  C_o | Carry status
0  0  0   |    0        0          0       |  0    0  | delete
0  0  1   |    0        0          0       |  1    0  | delete
0  1  0   |    0        1          1       |  1    0  | propagate
0  1  1   |    0        1          1       |  0    1  | propagate
1  0  0   |    0        1          1       |  1    0  | propagate
1  0  1   |    0        1          1       |  0    1  | propagate
1  1  0   |    1        1          0       |  0    1  | generate
1  1  1   |    1        1          0       |  1    1  | generate

G (A.B): (generate) Occurs when a C_o is internally generated within the adder (independent of C_i).
P (A+B): (propagate) Indicates that C_i is propagated (passed) to C_o.
P' (A XOR B): (propagate) Used in some adders for the P term since it can be reused to generate the sum term.
D ($\bar{A}.\bar{B}$): (delete) Ensures that a carry bit will be deleted at C_o.

The Boolean expressions for S and C_o are:

$\text{Sum} = A B C_i + A \bar{B} \bar{C_i} + \bar{A} B \bar{C_i} + \bar{A} \bar{B} C_i = A \oplus B \oplus C_i$

$\text{Carry} = A B + A C_i + B C_i$

But S and C_o can be written in terms of G and P':

$C_o(G, P) = G + P C_i$ (or P' in this case)
$S(G, P') = P' \oplus C_i$

Note that G and P are independent of C_i. (Also, C_o and S can be expressed in terms of delete (D).)

Ripple-carry adder:

[Figure: 4-bit ripple-carry adder. Four full adders with inputs A_0, B_0 through A_3, B_3 and sum outputs S_0 through S_3; carry-in C_{i,0}, carry-outs C_{o,0} through C_{o,3}, each carry-out feeding the next stage's carry-in.]

The critical path (worst case delay over all possible inputs) is a ripple from lsb to msb.
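The generate/propagate formulation and the ripple structure can be checked with a few lines of Python. This is a minimal sketch, not a circuit model: the function names and the lsb-first bit lists are assumptions made for illustration, and the XOR form of propagate is used so that the same signal also produces the sum.

```python
def full_adder_gp(a, b, c_in):
    """One full-adder bit expressed with generate and XOR-propagate."""
    g = a & b                    # generate: carry-out regardless of c_in
    p_xor = a ^ b                # XOR propagate: c_in is passed to c_out
    s = p_xor ^ c_in             # S = P' XOR C_i
    c_out = g | (p_xor & c_in)   # C_o = G + P'.C_i
    return s, c_out


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Add two equal-length bit lists (lsb first); the carry ripples lsb to msb."""
    sums = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder_gp(a, b, carry)
        sums.append(s)
    return sums, carry


def to_bits(x, n):
    """Integer to an lsb-first list of n bits."""
    return [(x >> k) & 1 for k in range(n)]


def from_bits(bits):
    """Lsb-first bit list back to an integer."""
    return sum(b << k for k, b in enumerate(bits))
```

For example, `ripple_carry_add(to_bits(9, 4), to_bits(8, 4))` returns the sum bits of 1 together with a carry-out of 1, i.e. 17.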

The delay in this case is proportional to the number of bits, N, in the input words:

$t_{adder} = (N - 1)\, t_{carry} + t_{sum}$

where $t_{carry}$ and $t_{sum}$ are the propagation delays from C_i to C_o and S, respectively.

One possible worst case bit pattern (rippling from lsb to msb) is:
    A: 00000001; B: 01111111
Convince yourself that this is true.

Note that when optimizing this structure, it is far more important to optimize $t_{carry}$ than $t_{sum}$.

The inverting property of a full adder can be used to achieve this goal:

[Figure: a full adder with inputs A, B, C_i and outputs S, C_o is equivalent to the same full adder with all inputs and all outputs inverted.]
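One way to convince yourself of the worst-case pattern is to count how many consecutive stages a single carry travels through. The sketch below assumes the two patterns are written msb first (so bit 0 of both words is 1) and uses hypothetical helper names; the delay values passed to `adder_delay` are arbitrary units, not figures from the slides.

```python
def adder_delay(n_bits, t_carry, t_sum):
    """Ripple-carry delay: t_adder = (N - 1) * t_carry + t_sum."""
    return (n_bits - 1) * t_carry + t_sum


def longest_carry_ripple(a, b, n_bits):
    """Longest number of consecutive stages a single carry passes through.

    A carry starts at a generate stage (a_k = b_k = 1), continues through
    propagate stages (a_k != b_k) and dies at a delete stage (a_k = b_k = 0).
    """
    longest = 0
    run = 0
    for k in range(n_bits):
        a_k = (a >> k) & 1
        b_k = (b >> k) & 1
        if a_k & b_k:
            run = 1              # a new carry is generated at this stage
        elif (a_k ^ b_k) and run:
            run += 1             # the incoming carry propagates one stage further
        else:
            run = 0              # no carry is travelling through this stage
        longest = max(longest, run)
    return longest
```

With A = 00000001 and B = 01111111, the carry generated at bit 0 passes through N - 1 = 7 carry stages before the final sum bit is formed, which matches the delay expression.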

Thus,

$\bar{S}(A, B, C_i) = S(\bar{A}, \bar{B}, \bar{C_i})$
$\bar{C_o}(A, B, C_i) = C_o(\bar{A}, \bar{B}, \bar{C_i})$

One possible (un-optimized) implementation:

[Figure: gate-level full adder. S = P' XOR C_i; C_o = G (A.B) + C_i.P (A+B).]

The transistor-level diagram uses 32 transistors (see Weste and Eshraghian).
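The inverting property is easy to confirm exhaustively, since a full adder has only eight input combinations. The following is a small check written for illustration; `full_adder` and `inverting_property_holds` are names introduced here.

```python
from itertools import product


def full_adder(a, b, c_in):
    """Reference full adder: returns (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out


def inverting_property_holds():
    """True if inverting all inputs inverts both outputs, for all 8 input cases."""
    for a, b, c_in in product((0, 1), repeat=3):
        s, c_out = full_adder(a, b, c_in)
        s_inv, c_out_inv = full_adder(1 - a, 1 - b, 1 - c_in)
        if s_inv != 1 - s or c_out_inv != 1 - c_out:
            return False
    return True
```

Because the property holds, alternate bits of a ripple adder can work on inverted operands and carries, which is what allows the per-bit carry inverter to be dropped.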

C_o is reused in the S term as:

$\text{Sum} = A B C_i + (A + B + C_i)\, \bar{C_o}$

[Figure: 28-transistor complementary CMOS full adder with inputs A, B, C_i and outputs C_o and S. Annotations: "Symmetrical design eliminates diffusion caps and reduces series R." and "Are the n and p trees duals of each other?"]

Even with some design tricks, e.g., placing the transistors on the critical path (C_i) closest to the output and using a symmetrical design, this implementation is slow.

The load capacitance on C_o in the previous version consists of 2 diffusion capacitances (inverter) and 6 (next bit) gate capacitances.

[Figure: ripple adder/subtractor with inputs A<0>...A<n> and B<0>...B<n>, a Subtract control, C_in, carries C<3>...C<n+1>, and sum outputs S<0>...S<n>. Annotations: "Overflow" and "Sign of the result".]

Eliminates the inverter delay per bit for carry!

This version increases C_o's load to 4 diffusion caps, 2 internal (sum) gate caps plus the 6 (next bit) gate caps.

Serial addition can be used if area is a concern:

[Figure: bit-serial adder. Two n-bit shift registers (addend and augend) feed a full adder; C_out is stored in a 1-bit register (with Set, Clr and Clk) and fed back as C_in; the sum is shifted into an n-bit result shift register.]

In this case, you want equal Sum and Carry delays in order to minimize clock cycle time.

Bit-level pipelining can be used to break the dependency between addition time and the number of bits by inserting s between each register bit.
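The serial scheme can be modelled as one full adder reused every clock cycle, with the carry held in a 1-bit register between cycles. This is a behavioural sketch only; `serial_add` and its integer-based shift registers are assumptions for illustration.

```python
def serial_add(a, b, n_bits):
    """Bit-serial addition: one full adder, one carry flip-flop, n_bits clock cycles."""
    carry = 0                    # 1-bit carry register, cleared before the first cycle
    result = 0                   # models the n-bit result shift register
    for clk in range(n_bits):
        a_bit = (a >> clk) & 1   # lsb is shifted out first
        b_bit = (b >> clk) & 1
        s = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= s << clk
    return result, carry
```

The addition takes `n_bits` cycles, and each cycle must accommodate both the sum and the carry path, which is why equal Sum and Carry delays give the shortest clock period.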

Transmission-gate Adder: The total number of transistors is 26.

[Figure: transmission-gate full adder with inputs A, B, C_i, internal XOR and XNOR signals, and outputs S and C_o.]

Note: S and C_o delay times are approximately equal, which is good for multipliers.

See Weste and Eshraghian for an 18 transistor implementation.

Dynamic Adder Design: np-CMOS adder

[Figure: two-bit np-CMOS adder clocked by φ and its complement, with inputs A_0, B_0, A_1, B_1 and C_{i0}, carries C_{i1} and C_{i2}, and sum outputs S_0 and S_1.]

Dynamic Adder Design: Manchester Carry-Chain adder.

A chain of pass-transistors is used to implement the carry chain.

[Figure: Manchester carry chain. Pass transistors gated by P_0...P_4 connect C_{i,0} through nodes C_{o,0}...C_{o,4}; G_0...G_4 drive the nodes; φ clocks the chain. Devices are annotated with relative sizes between 1 and 4. Annotation: "Transistor sizes largest here since worst case is to discharge all nodes C_{o,k}."]

Precharge: All intermediate nodes, e.g. C_{o,0}, are charged to V_DD.
Evaluate: Node C_{o,k} is discharged, for example, if there is an incoming carry, C_{i,0}, and the previous propagate signals, P_0 to P_{k-1}, are high.

Only 4 diffusion capacitances are present per node, but the distributed RC nature of the chain results in a delay that is quadratic in the number of bits.

Buffers and/or transistor sizing can be used to improve performance.

Consider the worst case delay of the carry chain:

[Figure: RC ladder with series resistances R_1...R_6 and node capacitances C_1...C_6; the output is taken at the last node.]

Elmore delay is given by:

$t_p = 0.69 \sum_{i=1}^{N} C_i \left( \sum_{j=1}^{i} R_j \right)$

The delay of the RC network is then:

$t_p = 0.69\,\big( C_1 R_1 + C_2 (R_1 + R_2) + C_3 (R_1 + R_2 + R_3) + C_4 (R_1 + R_2 + R_3 + R_4) + C_5 (R_1 + R_2 + R_3 + R_4 + R_5) + C_6 (R_1 + R_2 + R_3 + R_4 + R_5 + R_6) \big)$

Since R_1 appears 6 times in the expression, it makes sense to minimize its contribution.

Note that reducing R by a factor, e.g. k, at each stage increases the capacitance by a factor k and increases area. A k-factor of 1.5 reduces delay by 40% and increases area by 3.5X.
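The Elmore expression is straightforward to evaluate numerically. The sketch below is illustrative only: `elmore_delay` is a name introduced here, and the unit R and C values in the usage note are placeholders rather than extracted device parameters.

```python
def elmore_delay(resistances, capacitances):
    """Elmore delay of an RC ladder: 0.69 * sum_i C_i * (R_1 + ... + R_i).

    resistances[i] is the series resistance leading into node i and
    capacitances[i] is the capacitance at node i, ordered from the driven end.
    """
    total = 0.0
    r_upstream = 0.0
    for r, c in zip(resistances, capacitances):
        r_upstream += r          # all resistance between the source and this node
        total += c * r_upstream
    return 0.69 * total
```

With six identical stages, `elmore_delay([1.0] * 6, [1.0] * 6)` evaluates 0.69 x (1 + 2 + ... + 6) = 0.69 x 21, and in general N identical stages give 0.69 x N(N + 1)/2 x RC, which is the quadratic growth noted for the Manchester chain.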

Carry-Bypass adder:

[Figure: four full-adder stages with (P_0, G_0) through (P_3, G_3), carry-in C_{i,0} and carries C_{o,0}...C_{o,3}; a Mux controlled by BP = P_0 P_1 P_2 P_3 selects between the rippled C_{o,3} and C_{i,0}.]

Assume A_k and B_k (for k = 0...3) are set such that all P_k (propagate) are high. In this case, an incoming carry C_{i,0} = 1 propagates along the complete chain and C_{o,3} = 1. In other words:

    if (P_0 P_1 P_2 P_3 == 1) then C_{o,3} = C_{i,0}
    else either DELETE or GENERATE occurred.
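A behavioural sketch of one bypass block follows. It assumes the XOR form of propagate, because with P = A + B the condition "all P high" could coincide with a generate and the bypass would no longer be exact; the function name and the lsb-first bit lists are also assumptions.

```python
def bypass_block_carry(a_bits, b_bits, c_in):
    """Carry-out of one carry-bypass block (bit lists are lsb first)."""
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # XOR propagate per bit
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate per bit
    if all(p):
        return c_in              # BP = P0.P1.P2.P3 = 1: the mux forwards C_i,0
    carry = c_in
    for g_k, p_k in zip(g, p):
        carry = g_k | (p_k & carry)   # otherwise use the rippled carry
    return carry
```

In hardware both paths exist at once and the mux only selects; the point of the bypass is that the slow full-length ripple is never the path that matters, since whenever BP = 0 a delete or generate has cut the chain short.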

Linear Carry-Select adder:

One way around waiting for the incoming carry is to compute the result of both possible values in advance and let the incoming carry select the correct result.

[Figure: one carry-select block, adding bits k to k+3. Stages: Setup (P, G); 0-carry propagation and 1-carry propagation in parallel; a Mux, controlled by C_{o,k-1}, selecting the carry vector and producing C_{o,k+3}; Sum Generation.]

The select operation is much faster than the time to compute either of the two possible carry vectors.

A Square-Root Carry-Select Adder (delay = $O(N^{1/2})$) is constructed by increasing the number of input bits in each block from lsb to msb, so that higher order blocks take more operand bits than lower order blocks.
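The select idea can be sketched behaviourally as follows. This models a linear carry-select adder with fixed-size blocks; `carry_select_add`, `add_block` and the default block width of 4 are assumptions made for the example.

```python
def add_block(a_blk, b_blk, c_in, width):
    """Add one block; returns (sum bits as int, carry-out)."""
    total = a_blk + b_blk + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, n_bits, block=4):
    """Linear carry-select addition of two n_bits-wide unsigned integers."""
    result = 0
    carry = 0
    for lo in range(0, n_bits, block):
        width = min(block, n_bits - lo)
        mask = (1 << width) - 1
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        # Both candidate results are computed without waiting for the carry.
        sum0, c0 = add_block(a_blk, b_blk, 0, width)
        sum1, c1 = add_block(a_blk, b_blk, 1, width)
        # The incoming carry only drives the multiplexer.
        s, carry = (sum1, c1) if carry else (sum0, c0)
        result |= s << lo
    return result, carry
```

A square-root variant would replace the fixed `block` with widths that grow from lsb to msb, so that each block's two candidates are ready just as its select signal arrives.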

Carry look-ahead adder (avoiding the ripple altogether):

Compute the carries to each stage in parallel. The carry out of the k-th stage is computed as:

$C_{o,k} = G_k + P_k C_{o,k-1}$, where $G_k = A_k B_k$ and $P_k = A_k + B_k$

The dependency between $C_{o,k}$ and $C_{o,k-1}$ can be eliminated by expanding $C_{o,k-1}$:

$C_{o,k} = G_k + P_k (G_{k-1} + P_{k-1} C_{o,k-2})$

For example, for 4 stages of look-ahead:

$C_0 = G_0 + P_0 C_i$
$C_1 = G_1 + P_1 G_0 + P_1 P_0 C_i$
$C_2 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_i$
$C_3 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_i$

Note that the low-order terms, e.g., P_0 and G_0, appear in the expression for every bit, making the fanout load large.
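The four expanded carry equations can be transcribed directly and checked against ordinary integer addition. This is a sketch for verification purposes; `cla4_carries` and the lsb-first bit lists are assumptions.

```python
def cla4_carries(a_bits, b_bits, c_in):
    """Carries C0..C3 of a 4-bit look-ahead unit (bit lists are lsb first).

    Each carry is a two-level function of G, P and c_in only, so no carry
    waits for another carry.
    """
    g = [a & b for a, b in zip(a_bits, b_bits)]   # G_k = A_k.B_k
    p = [a | b for a, b in zip(a_bits, b_bits)]   # P_k = A_k + B_k
    c0 = g[0] | p[0] & c_in
    c1 = g[1] | p[1] & g[0] | p[1] & p[0] & c_in
    c2 = g[2] | p[2] & g[1] | p[2] & p[1] & g[0] | p[2] & p[1] & p[0] & c_in
    c3 = (g[3] | p[3] & g[2] | p[3] & p[2] & g[1]
          | p[3] & p[2] & p[1] & g[0] | p[3] & p[2] & p[1] & p[0] & c_in)
    return [c0, c1, c2, c3]
```

The growing number of terms per carry is visible in the code: `c3` already needs a 5-input AND and a 5-input OR, which is the fan-in problem that limits the look-ahead width.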

Carry look-ahead adder: One possible implementation without using simple logic gates.

[Figure: transistor network computing C_{o,3} from G_0...G_3, P_0...P_3 and C_{i,0}.]

Size and fan-in of the gates limit the look-ahead size to about four.

Carry look-ahead adder: Factoring the term C_3 yields:

$C_3 = G_3 + P_3 (G_2 + P_2 (G_1 + P_1 (G_0 + P_0 C_{i,0})))$

Domino CMOS implementation:

[Figure: domino CMOS gate computing C<3>, clocked by Clk, with inputs P<0>...P<3>, G<0>...G<3> and C_{i,0}.]

Worst case is pull-down through 6 series n-channel transistors.

Other high speed versions are given in Weste and Eshraghian.

The Logarithmic look-ahead adder: $O(\log_2 N)$ delay

[Figure: 8-bit logarithmic look-ahead tree. Inputs (G_0, P_0) through (G_7, P_7); a forward binary tree forms group terms such as (C_{4-7}, P_{4-7}); an inverse binary tree produces the carries C_{o,0} through C_{o,7}.]

The dot operator is defined as:

$(g, p) \cdot (g', p') = (g + p g',\; p p')$

The number of logic levels is proportional to $\log_2 N$, fan-in is limited and the layout is compact (jigsaw puzzle); see Rabaey for details.
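The dot operator and a logarithmic-depth carry computation can be sketched as below. Note the swap in structure: for brevity this uses a recursive-doubling (Kogge-Stone style) arrangement rather than the forward/inverse binary tree of the figure; both rely on the associativity of the dot operator. Function names are assumptions.

```python
def dot(hi, lo):
    """(g, p) . (g', p') = (g + p.g', p.p'), with hi the more significant group."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)


def prefix_carries(a_bits, b_bits, c_in=0):
    """All carry-outs C_o,k in about log2(N) levels of dot operators (lsb-first lists)."""
    n = len(a_bits)
    gp = [(a & b, a | b) for a, b in zip(a_bits, b_bits)]
    dist = 1
    while dist < n:
        # One logic level: every position combines with the group 'dist' bits below it.
        gp = [dot(gp[k], gp[k - dist]) if k >= dist else gp[k] for k in range(n)]
        dist *= 2
    # gp[k] now holds the group (G, P) spanning bits k..0.
    return [g | (p & c_in) for g, p in gp]
```

For 8 bits the loop runs three times (dist = 1, 2, 4), compared with the eight sequential steps of a ripple chain.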

Datapath Operators: Comparison

Magnitude Comparators: May be built from an adder, a complementer (XOR gates) and a zero detect unit.

[Figure: 4-bit comparator. Inputs A<0>...A<3> and B<0>...B<3> feed an adder chain whose carry-out gives B >= A; a zero-detect NOR gate on the result gives B = A.]

Think about the modifications necessary to make it a signed comparator (Hint: a couple of XOR gates).
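The adder-based comparator amounts to computing B - A as B + NOT(A) + 1 and inspecting the carry-out and the result bits. The sketch below assumes unsigned operands and introduces the name `compare` for illustration.

```python
def compare(a, b, n_bits=4):
    """Unsigned magnitude comparison built from an adder.

    Computes B - A as B + (~A) + 1; the carry-out flags B >= A and a
    zero-detect (NOR of all result bits) flags B == A.
    """
    mask = (1 << n_bits) - 1
    total = b + ((~a) & mask) + 1        # complementer plus carry-in of 1
    b_ge_a = (total >> n_bits) & 1       # carry-out of the adder
    b_eq_a = int((total & mask) == 0)    # zero detect on the difference
    return b_ge_a, b_eq_a
```

For instance, `compare(5, 9)` gives `(1, 0)` and `compare(9, 5)` gives `(0, 0)`.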

Datapath Operators: Binary Counters

Asynchronous: Based on the Toggle register.

[Figure: toggle (T) register cell, and a "Ripple Carry" binary counter built from four T registers with outputs Q<0>...Q<3>; Clk drives the first stage and each stage's output clocks the next.]

Not a good choice for performance and testability (with no reset).

Synchronous counter.

[Figure: 4-bit synchronous counter. Four 1-bit D registers with outputs Q<0>...Q<3>, each with Clk and Clear inputs and a 0/1 multiplexer at its D input.]

Replace gate with an adder for up/down counting capability.

Weste and Eshraghian also show a version that can be initialized.
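A behavioural sketch of the synchronous counter follows: all bits update on the same clock edge, and bit k toggles only when every lower bit is 1. The function name, the lsb-first state list and the synchronous treatment of Clear are assumptions for the example, not details taken from the figure.

```python
def sync_counter_step(q_bits, clear=0):
    """Next state of a synchronous binary up-counter (q_bits is lsb first)."""
    if clear:
        return [0] * len(q_bits)
    next_bits = []
    carry = 1                          # the count enable: add 1 on every clock
    for q in q_bits:
        next_bits.append(q ^ carry)    # toggle when all lower bits are 1
        carry = q & carry              # half-adder carry chain
    return next_bits
```

Stepping from `[0, 0, 0, 0]` five times gives `[1, 0, 1, 0]` (the value 5, lsb first), and sixteen steps wrap back to all zeros.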

Datapath Operators: Multiplication

Multiplication can be broken down into two steps:
    Computation of partial products.
    Accumulation of the shifted partial products.

        1100
   x    0101
   ---------
        1100
       0000
      1100
     0000
   ---------
     0111100

Binary multiplication is equivalent to the AND operation.

Multipliers may be classified by the format in which data words are accessed:
    Serial
    Serial/parallel
    Parallel

The parallel form computes the partial products in parallel.
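The two steps can be written out directly: each partial product is the multiplicand ANDed with one multiplier bit and shifted into position, and the product is their sum. The helper names below are introduced for illustration.

```python
def partial_products(x, y, n_bits):
    """Shifted partial products of x * y, one per multiplier bit (lsb first)."""
    rows = []
    for j in range(n_bits):
        y_j = (y >> j) & 1
        row = x if y_j else 0      # bitwise AND of every x bit with y_j
        rows.append(row << j)      # shift into the column of weight 2^j
    return rows


def multiply(x, y, n_bits):
    """Accumulate the shifted partial products."""
    return sum(partial_products(x, y, n_bits))
```

For 1100 x 0101, the rows are 1100, 0000, 1100 and 0000 at shifts 0 to 3, and they sum to 0111100.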

Parallel Unsigned Multiplication:

$X = \sum_{i=0}^{m-1} X_i 2^i \qquad Y = \sum_{j=0}^{n-1} Y_j 2^j$

Multiplying 2 unsigned binary integers results in:

$P = X \times Y = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (X_i Y_j)\, 2^{i+j} = \sum_{k=0}^{m+n-1} P_k 2^k$

                            X3     X2     X1     X0       Multiplicand
                            Y3     Y2     Y1     Y0       Multiplier
                            --------------------------
                            X3Y0   X2Y0   X1Y0   X0Y0
                     X3Y1   X2Y1   X1Y1   X0Y1
              X3Y2   X2Y2   X1Y2   X0Y2
       X3Y3   X2Y3   X1Y3   X0Y3
P7     P6     P5     P4     P3     P2     P1     P0

There are m*n summands produced by a set of m*n AND gates in parallel.

Parallel Multiplication:

Multiplication is carried out using a bitwise AND of the operands, X_i and Y_i. Most of the work (and delay) is in summing the partial products.

[Figure: multiplier cell. An AND gate forms the X.Y partial product, which feeds a full adder with inputs A, B, C_i and output C_o. Labels: "Partial products", "Sum the partial products".]

An N x N multiplier requires:
    N(N-2) full adders
    N half adders
    N^2 AND gates

Array multiplier:

[Figure: 4 x 4 array multiplier. Rows of AND gates form X_3...X_0 with Y_0, Y_1, Y_2 and Y_3; rows of half adders (HA) and full adders sum them to produce P_0 through P_7. N and M mark the array dimensions.]

$t_{mult} = [(M-1) + (N-2)]\, t_{carry} + (N-1)\, t_{sum} + t_{and}$

There are a large number of nearly identical critical paths in this circuit.

From the delay expression and the fact that all critical paths have the same length, minimizing $t_{mult}$ requires minimizing both $t_{carry}$ and $t_{sum}$. This is in contrast with the adder, where minimizing $t_{carry}$ was key. The transmission gate adder is a good choice here.

Parallel Signed Multiplication: Baugh-Wooley algorithm. Only 3 additional adders are required over the unsigned version.

Let A and B represent signed integers:

$A = -a_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} a_i 2^i \qquad B = -b_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} b_i 2^i$

$P = A \times B = a_{m-1} b_{n-1} 2^{m+n-2} + \sum_{i=0}^{m-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} - \sum_{i=0}^{m-2} a_i b_{n-1} 2^{n-1+i} - \sum_{i=0}^{n-2} a_{m-1} b_i 2^{m-1+i}$

Expanding shows that the last two rows of summands are all negative, so the algorithm simply adds in their negations.
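The signed expansion can be checked numerically: a two's complement word is its positive low-order bits minus the weighted sign bit, and the product splits into one positive sign-by-sign term, the positive low-order array, and two negative rows. The sketch below only evaluates that expansion; it does not model the Baugh-Wooley array itself, and the helper names are assumptions.

```python
def to_bits(x, n):
    """Two's complement encoding of x as an lsb-first list of n bits."""
    return [(x >> k) & 1 for k in range(n)]


def signed_product_expansion(a_bits, b_bits):
    """Evaluate the four-term expansion of the two's complement product."""
    m, n = len(a_bits), len(b_bits)
    a_s, b_s = a_bits[-1], b_bits[-1]          # sign bits a_(m-1), b_(n-1)
    p = (a_s & b_s) << (m + n - 2)             # sign x sign term (positive)
    p += sum((a_bits[i] & b_bits[j]) << (i + j)
             for i in range(m - 1) for j in range(n - 1))
    # The last two rows of summands carry negative weight.
    p -= sum((a_bits[i] & b_s) << (n - 1 + i) for i in range(m - 1))
    p -= sum((a_s & b_bits[i]) << (m - 1 + i) for i in range(n - 1))
    return p
```

For example, `signed_product_expansion(to_bits(-3, 4), to_bits(5, 4))` returns -15.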

Parallel Signed Multiplication:

[Figure: 8 x 8 Baugh-Wooley array multiplier. Inputs a0...a7 and b0...b7; an array of partial-product cells a_i b_j (i, j = 0...7) including the sign-row terms; product outputs P0 through P15.]

Carry-Save Multiplier:

Carry bits can be passed diagonally downwards instead of to the left. Here the carry bits are not immediately added but rather saved for the next adder stage.

[Figure: 4x4 carry-save multiplier. Rows of half adders (HA) and full adders with carries routed diagonally, followed by a vector-merging adder.]

Cost: A little extra area (the vector-merging adder).

Advantage: The critical path is uniquely defined:

$t_{mult} = (N-1)\, t_{carry} + t_{and} + t_{merge}$ (assuming $t_{add} = t_{carry}$)

Minimizing $t_{merge}$ is useful, e.g. use carry-select or lookahead.
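Carry-save accumulation can be sketched with word-wide bitwise operations: each stage reduces three words to a sum word and a carry word without propagating any carry, and a single final addition plays the role of the vector-merging adder. Function names are assumptions, and Python's unbounded integers stand in for suitably wide buses.

```python
def carry_save_add(x, y, z):
    """3:2 compression: returns (sum word, carry word) with x + y + z == s + c."""
    s = x ^ y ^ z                              # per-bit sum, no carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carry, saved for the next stage
    return s, c


def carry_save_multiply(x, y, n_bits):
    """Unsigned multiply that accumulates partial products in carry-save form."""
    s, c = 0, 0
    for j in range(n_bits):
        pp = (x << j) if (y >> j) & 1 else 0   # shifted partial product
        s, c = carry_save_add(s, c, pp)
    return s + c                               # vector-merging adder
```

Only the final `s + c` involves a full carry propagation, which is why speeding up the merging adder (carry-select or lookahead) pays off.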

Serial Unsigned Multiplication: Useful if area is a concern.

[Figure: serial multiplier. Gate G_1 combines the serial inputs X and Y; a full adder with C_in and a 1-bit register (Clk, reset), gate G_2, and a serial register holding P_7...P_0.]

X_i and Y_i are delivered serially to the inputs of G_1 at different rates. The circuit computes the summands row-wise from right to left.

Disadvantage: Quadratic delay:

$t_{mult} = M \times N \times t_{carry}$

A Serial/Parallel Unsigned Multiplier is shown in Weste and Eshraghian.

Booth Encoding:

A special encoding of the multiplier word reduces the number of required addition stages and speeds up multiplication substantially.

Radix-4 scheme:

$Y = \sum_{j=0}^{(N-1)/2} Y_j 4^j$ with $Y_j \in \{-2, -1, 0, 1, 2\}$

The number of partial products (and additions) is halved, resulting in an area and speed advantage. The disadvantage is a somewhat more involved multiplier cell: the AND operation is replaced with inversion and shift logic.

Virtually every multiplier in use employs the Booth scheme.
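A common way to obtain the radix-4 digits is to scan overlapping 3-bit windows of the multiplier. The sketch below assumes a two's complement multiplier with an even number of bits and uses the standard recoding rule digit = y(2j-1) + y(2j) - 2.y(2j+1); the function name is introduced for illustration.

```python
def booth_radix4_digits(y, n_bits):
    """Radix-4 Booth digits of a two's complement n_bits value (n_bits even).

    Returns digits lsb first, each in {-2, -1, 0, 1, 2}, such that
    sum(d * 4**j) equals the signed value of y.
    """
    digits = []
    prev = 0                                   # implicit bit y_(-1) = 0
    for j in range(0, n_bits, 2):
        b0 = (y >> j) & 1
        b1 = (y >> (j + 1)) & 1
        digits.append(b0 + prev - 2 * b1)      # overlapping 3-bit window
        prev = b1
    return digits
```

For example, `booth_radix4_digits(7, 4)` returns `[-1, 2]`, i.e. 7 = 2 x 4 - 1, so a 4-bit multiplier needs two partial products instead of four, each being 0, +/-X or +/-2X (a shift and an optional inversion).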

Wallace Multiplier:

Trees can be used to replace the linear partial-sum adders.

[Figure: slice of a 6-bit carry-save multiplier with inputs Y_0...Y_5, carries C_{i-1} and C_i, and outputs C and Sum (number of ripple stages is N-2), shown alongside the equivalent tree arrangement.]

Advantage: $O(\log_2 N)$ multiplication time.
Disadvantage: Very irregular, so difficult to lay out.

Datapath Operators: Shifters

Right/Left 1-bit shifter:

[Figure: four 2-input multiplexers with a common select S (Right/Left). Inputs A_3...A_0 plus the shift-in bits I_R and I_L; outputs H_3...H_0.]

Barrel shifter:

[Figure: barrel shifter with shift-select lines s<3>...s<0> (shift values 1, 2, 4, 8), input l<6:0> and result r<3>...r<0>. The result is one of l<3:0>, l<4:1>, l<5:2> or l<6:3>.]

Arithmetic and logical shifts and rotates are possible by muxing l<6:0> to the appropriate values.
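The window-selection view of the barrel shifter can be sketched as follows. The 4-bit result is a window of the 7-bit input l<6:0>, and the kind of shift is decided by how l is assembled from the operand. The function names and the use of a shift amount 0 to 3 in place of the select lines are assumptions for illustration.

```python
def barrel_window(l, shift):
    """Select r<3:0> = l<shift+3:shift> from the 7-bit input l<6:0> (shift in 0..3)."""
    return (l >> shift) & 0xF


def shift_right(a, shift, mode="logical"):
    """4-bit right shift or rotate, chosen by what is muxed into l<6:4>."""
    a &= 0xF
    if mode == "rotate":
        upper = a & 0x7                        # low bits wrap around to the top
    elif mode == "arithmetic":
        upper = 0x7 if (a >> 3) & 1 else 0     # replicate the sign bit
    else:
        upper = 0                              # logical: fill with zeros
    l = (upper << 4) | a                       # assemble l<6:0>
    return barrel_window(l, shift)
```

With a = 1010 and a shift of 1, the logical result is 0101, the arithmetic result is 1101, and the rotate result is 0101.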