A Radix-10 Combinational Multiplier

Similar documents
RN-coding of Numbers: New Insights and Some Applications

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1

Let s put together a Manual Processor

An Efficient RNS to Binary Converter Using the Moduli Set {2n + 1, 2n, 2n 1}

Implementing the Functional Model of High Accuracy Fixed Width Modified Booth Multiplier

CSE140 Homework #7 - Solution

The string of digits in the binary number system represents the quantity

Systems I: Computer Organization and Architecture

Digital Logic Design. Basics Combinational Circuits Sequential Circuits. Pu-Jen Cheng

Multipliers. Introduction

RN-Codings: New Insights and Some Applications

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2)

Lecture 8: Binary Multiplication & Division

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit

Counters and Decoders

SAD computation based on online arithmetic for motion. estimation

System on Chip Design. Michael Nydegger

Computer Science 281 Binary and Hexadecimal Review

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

Design and FPGA Implementation of a Novel Square Root Evaluator based on Vedic Mathematics

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

VLSI IMPLEMENTATION OF INTERNET CHECKSUM CALCULATION FOR 10 GIGABIT ETHERNET

CHAPTER 3 Boolean Algebra and Digital Logic

Module 3: Floyd, Digital Fundamental

Upon completion of unit 1.1, students will be able to

ASYNCHRONOUS COUNTERS

Latches, the D Flip-Flop & Counter Design. ECE 152A Winter 2012

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers

Lecture 2. Binary and Hexadecimal Numbers

Number Systems and Radix Conversion

Floating Point Fused Add-Subtract and Fused Dot-Product Units

High-Performance Modular Multiplication on the Cell Processor

A new binary floating-point division algorithm and its software implementation on the ST231 processor

FPGA. AT6000 FPGAs. Application Note AT6000 FPGAs. 3x3 Convolver with Run-Time Reconfigurable Vector Multiplier in Atmel AT6000 FPGAs.

Lecture 5: Gate Logic Logic Optimization

Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

BINARY CODED DECIMAL: B.C.D.

2) Write in detail the issues in the design of code generator.

FPGA Implementation of an Extended Binary GCD Algorithm for Systolic Reduction of Rational Numbers

Introduction to Xilinx System Generator Part II. Evan Everett and Michael Wu ELEC Spring 2013

Step : Create Dependency Graph for Data Path Step b: 8-way Addition? So, the data operations are: 8 multiplications one 8-way addition Balanced binary

Oct: 50 8 = 6 (r = 2) 6 8 = 0 (r = 6) Writing the remainders in reverse order we get: (50) 10 = (62) 8

SPARC64 VIIIfx: CPU for the K computer

Understanding Logic Design

The components. E3: Digital electronics. Goals:

Chapter 1: Digital Systems and Binary Numbers

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

EXPERIMENT 4. Parallel Adders, Subtractors, and Complementors

on an system with an infinite number of processors. Calculate the speedup of

Sequential Logic. (Materials taken from: Principles of Computer Hardware by Alan Clements )

CSI 333 Lecture 1 Number Systems

Digital System Design Prof. D Roychoudhry Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Speed Gate Level Synchronous Full Adder Designs

Levent EREN A-306 Office Phone: INTRODUCTION TO DIGITAL LOGIC

Experiment # 9. Clock generator circuits & Counters. Eng. Waleed Y. Mousa

RAM & ROM Based Digital Design. ECE 152A Winter 2012

International Journal of Electronics and Computer Science Engineering 1482

Sistemas Digitais I LESI - 2º ano

CPU Performance. Lecture 8 CAP

Faculty of Engineering Student Number:

An Effective Deterministic BIST Scheme for Shifter/Accumulator Pairs in Datapaths

AN IMPROVED DESIGN OF REVERSIBLE BINARY TO BINARY CODED DECIMAL CONVERTER FOR BINARY CODED DECIMAL MULTIPLICATION

Decimal Number (base 10) Binary Number (base 2)

DEPARTMENT OF INFORMATION TECHNLOGY

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

To convert an arbitrary power of 2 into its English equivalent, remember the rules of exponential arithmetic:

Today. Binary addition Representing negative numbers. Andrew H. Fagg: Embedded Real- Time Systems: Binary Arithmetic

路 論 Chapter 15 System-Level Physical Design

EE 261 Introduction to Logic Circuits. Module #2 Number Systems

NEW adder cells are useful for designing larger circuits despite increase in transistor count by four per cell.

Computer Architecture TDTS10

A High-Performance 8-Tap FIR Filter Using Logarithmic Number System

The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

2011, The McGraw-Hill Companies, Inc. Chapter 3

Binary Adders: Half Adders and Full Adders

Zuse's Z3 Square Root Algorithm Talk given at Fall meeting of the Ohio Section of the MAA October College of Wooster

Chapter 2 Logic Gates and Introduction to Computer Architecture

INTEGRATED CIRCUITS. For a complete data sheet, please also download:

Design Example: Counters. Design Example: Counters. 3-Bit Binary Counter. 3-Bit Binary Counter. Other useful counters:

COMBINATIONAL and SEQUENTIAL LOGIC CIRCUITS Hardware implementation and software design

SIM-PL: Software for teaching computer hardware at secondary schools in the Netherlands

Improved NAND Flash Memories Storage Reliablity Using Nonlinear Multi Error Correction Codes

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

Module 4 : Propagation Delays in MOS Lecture 22 : Logical Effort Calculation of few Basic Logic Circuits

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

DNA Data and Program Representation. Alexandre David

Flip-Flops, Registers, Counters, and a Simple Processor

A Digital Timer Implementation using 7 Segment Displays

NUMBER SYSTEMS. William Stallings

High Speed and Efficient 4-Tap FIR Filter Design Using Modified ETA and Multipliers

Efficient Interconnect Design with Novel Repeater Insertion for Low Power Applications

HOMEWORK # 2 SOLUTIO

ETEC 2301 Programmable Logic Devices. Chapter 10 Counters. Shawnee State University Department of Industrial and Engineering Technologies

A High-Yield Area-Power Efficient DWT Hardware for Implantable Neural Interface Applications

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

LSN 2 Number Systems. ECT 224 Digital Computer Fundamentals. Department of Engineering Technology

Digital Electronics Detailed Outline

Transcription:

A Radix- Combinational Multiplier Tomás Lang and Alberto Nannarelli Dept. of Electrical Engineering and Computer Science, University of California, Irvine, USA Dept. of Informatics & Math. Modelling, Technical University of Denmark, Kongens Lyngby, Denmark Abstract In this work, we present a combinational decimal multiply unit which can be pipelined to reach the desired throughput. With respect to previous implementations of decimal multiplication, the proposed unit is combinational (parallel) and not sequential, has a simpler recoding of the operands which reduces the number of partial product precomputations and uses counters to eliminate the need of the decimal equivalent of a 4: adder. The results of the implementation show that the combinational decimal multiplier offers a good compromise between latency and area when compared to other decimal multiply units and to binary double-precision multipliers. I. INTRODUCTION Hardware implementations of decimal arithmetic units have recently gained importance because they provide higher accuracy in financial applications []. In this work, we present a combinational decimal multiplier which can be pipelined to reach the desired throughput. The multiply unit is organized as follows: the multiplier is recoded; the partial products are kept in a redundant format; the partial product are accumulated by a tree of redundant adders and the final product is obtained by converting the carry-save tree s outputs into binary-coded decimal (BCD) format. With respect to previous implementations of radix- multipliers such as the ones in [], [3] and [4], our design is different in the following aspects: ) the multiplier is combinational (parallel) and not sequential; ) we recode only the multiplier while in [4] both operands are recoded in -5 to +5; 3) in the partial product generation only multiples of 5 and are required; 4) the accumulation of partial products is done in a tree of radix- carry-save adders and counters while in the sequential unit of [4] signed-digit adders are used. We present the standard cells implementation of the multiplier and compare its latency with those of the schemes presented in [] and [3]. Moreover, we compare the delay and the area of the decimal combinational multiplier with those obtained by the implementation of a binary (radix-4) doubleprecision multiplier. II. MULTIPLIER ARCHITECTURE For the multiplication p = x y, we assume that both the multiplicand x and the multiplier y are sign-and-magnitude n-digit fractional numbers in BCD format normalized in [.,.). The multiplication shift-and-add algorithm is based on the identity n p = x y = xy i r i i= where for decimal operands r =, y i [, 9] and x is a n-digit BCD vector. We consider in the following n =6. To avoid complicated multiples of x, we recode y i = y Hi + y Li with y H {, 5, } and y L {,,,, } as indicated in Table I. With this recoding, we need to precompute only the multiples 5x and x, while x is obtained by left-shifting x one digit. The negative values x and x are represented in radix- radix-complement. Each partial product xy i is positive and sign extension is not necessary. The partial products are accumulated by using an adder tree, and the multiplication is completed by a carry-save to BCD conversion, as shown in Fig.. A. Precomputation of x and 5x The multiplication by is straightforward. Each digit is multiplied by and the carry is propagated to the next digit. The carry does not propagate any further. In [5] the multiples 5x and x are used for decimal multiplication and division. However, the generation of 5x is performed with a carry propagation over the whole number. We now present the algorithm we use for the precomputation of 5x without carry propagation. To obtain g =5x =x/ we perform the following two steps: () e =x (shift left one digit) () g = e/: To perform this operation we divide by two each digit of e. However, since e i / has a fractional part when e i is odd, we have e i / = f i + h i+ / f i =,...,4 and h i =, g i = f i +5h i g i =,...,9 That is, the algorithm to produce g =5x is h i+ = if e i (or x i+ ) odd f i = e i / = x i+ / g i = f i +5h i y i y H y L y i y H y L 5 5 6 5 7 5 3 5-8 - 4 5-9 - TABLE I RECODING OF DIGITS OF y. 444 785 /6/$. 33

X Y 6 6 X n*4 Y i 4 precomputations Partial Product Generation Adder Tree shift left digit x precompute 5X 5x Precomputations Partial Product generation 5 3: digit mux 9 s complement x 9 s complement _ x 5: digit mux x (n+)*4 precompute X x 4 recoder carry save > BCD P 3 (n+)*4 radix carry save adder (n+)*4 n+ xy xy i S i C cin Fig.. Scheme of the multiplier. Fig.. Partial product generation. B. Partial Product Generation The recoding of Table I produces two n-digit terms for each partial product (PP). Table II shows an example of multiplication (4 4 digits). Negative numbers are represented in radix- complement and are implemented by doing the 9 s complement of the BCD digit and adding a in the leastsignificant position. To reduce the delay of the additions, we perform the addition xy i = xy Hi + xy Li carry save and use a radix- carry-save representation of the partial products xy i = xy Hi + xy Li = { xyis [, 9] xy ic [, ] That is, the addition to produce a partial product results in x : 9 6 3 y : 8 4 5 = xy : 5x 9 8 5 xy : 5x 9 8 5 x - 9 6 3 xy : x 9 6 3 xy 3 : x 9 6 3 x - 3 9 6 5 9 8 8 6 3 5 TABLE II EXAMPLE OF MULTIPLICATION (4 4 DIGITS). n... xy Hi xxxx... xxxx xxxx xy Li xxxx... xxxx xxxx xy is ssss... ssss ssss xy ic c... c b b= if y Li < The scheme for the generation of a partial product is shown in Fig.. C. Partial Product Accumulation The carry-free accumulation of the partial products is done using radix- carry-save adders (CSA), which add a carrysave operand plus another BCD operand to produce a carrysave result (see Fig. 3). For 6-digit operands, we obtain 6 carry-save partial products. Because the radix- CSA reduces two BCD digits and one carry bit to one BCD digit and one carry bit, by arranging in a first-level of the tree the xy is s and by using 8 CSAs, we leave the carries (xy ic s) of 8 partial products not accounted for. This is shown in Fig. 4. These carry vectors are added by using an array of carrycounters (CC). A digit counter of this array adds 8 carries of the same weight and produces a decimal digit. That is, inputs 8 m-bit vectors C i c j [, ] output m-digit vector Z z j [, 8] asshowninfig.5. By arranging the radix- carry-save adders and the carrycounters as in Fig. 6, we can accumulate the partial products in a 6-level tree. 34

A B D n*4 n*4 n C i m m n*4 S C n... A aaaa... aaaa aaaa B bbbb... bbbb bbbb D d... d d S ssss... ssss ssss C c... c Fig. 3. n n-digit radix- CSA. r CC m*4 Z m m... C c c... c c c C c c... c c c C 3 c c... c c c C 4 c c... c c c C 5 c c... c c c C 6 c c... c c c C 7 c c... c c c C 8 c c... c c c Z zzzz zzzz... zzzz zzzz zzzz Fig. 5. m-digit radix- counter. The adder can be divided into three stages. Like in radix-, the first stage of the unit produces p and g defined as { { if ai + d p i = i =9 if ai + d g otherwise i = i = otherwise In parallel, the following operations are computed (modulo ) Fig. 4. Array for partial products. Solid circles indicate BCD digits, hollow circles indicate bits. D. Radix- carry-save to BCD converter The final carry-propagating addition consists in converting the radix- carry-save representation into the BCD one. This can be done with a simplified radix- CPA in which the input is just a value in radix- carry-save representation. The inputs are the n-digit/bit outputs of the adder tree and the output is the product of x and y represented in 3-digit BCD (no truncation or rounding are considered) inputs: A n-digit vec. with a i [, 9] D n-bit vec. with d i [, ] output: S n-digit vec. with s i [, 9] (BCD) s i = a i + d i i =,,...,n s i = a i + d i + A second stage consists of a parallel prefix structure to compute the carries c i from the ps andgs and a final stage selects the precomputed digits of s or s according to the specific carry bit c i { si if c s i = i = i =,,...,n s i if c i = This adding scheme can be generalized to a BCD carrypropagate adder by first applying a simplified radix- CSA which reduces the two BCD vectors to a radix- carry-save representation. Then, the above presented converter is used to complete the carry-propagate addition. III. IMPLEMENTATION AND COMPARISONS The 6-digit radix- combinational multiplier (Fig. ) has been synthesized using the STM 9 nm CMOS standard cells library [6] and Synopsys Design Compiler. The synthesized unit has a critical path of.65 ns and an area of.3 mm equivalent to the area of about 68, NAND- gates (Table III). The unit can be pipelined to achieve a target clock cycle T C. For example, allowing a latch overhead of % of T C, our unit can be pipelined into critical path m = stages..8 T C Obtained by omitting the input D in the scheme of Fig. 3. 35

xy F xy E xy D xy C xy B xy A xy 9 xy 8 xy 7 xy 6 xy 5 xy 4 xy 3 xy xy xy xy odd carries r CC 8*4 8 8*4 8 8*4 8 8*4 8 8*4 8 8*4 8 8*4 8 8*4 8 3*4 s 7 s 6 s 5 s 4 s 3 s s s s 8 LEVEL s 7 s 6 s 5 s 4 s 3 s s s LEVEL *4 *4 *4 *4 3 LEVEL 3 3 4*4 4 4*4 4 s3 s3 LEVEL 4 s3 s3 LEVEL 5 3*4 3 s4 s s4 c s 8 s odd odd s3 odd carries 3*4 3 r CC 3*4 s5 s s5 c LEVEL 6 3*4 3 p s p c Fig. 6. Adder tree. To clock the unit at a frequency of GHz, the multiplier can be split into 4 stages with a resulting latency of 4. ns. The sequential decimal multipliers of [] and [4] for 6-digit operands have a latency of n +4=cycles. The unit of [] has been implemented in [3] in a. μm library of standard cells and the reported critical path is.58 ns. Because between the. μm and the 9 nm libraries, the CMOS transistor channel lenght L and the supply voltage V DD scale by about the same amount S = L (.μm) = L (9 nm) 9. V. V = V DD(.μm) V DD(9 nm) we can assume the constant field scaling model applies. As a consequence, the critical path.58 ns scales to.58 S =.47 ns in the 9 nm library. Therefore, the latency of the the radix- multiplier of [] for 6-digit operands can be estimated to be (n +4).47 = 9.5 ns. The sequential multiplier of [3] has a latency of n +8 cycles and is implemented in the. μm library for different operand sizes. The reported value of T C =.5ns for 6- digit operands scales to T C =.4 ns in our library with a corresponding latency of (n +8).4 = 9.6 ns. By pipelining the combinational multiplier to match T C =.4 ns, we get stages (see delays in Table III) and an operation latency of.4 = 4.4 ns. Furthermore, we compared the radix- combinational mul- 36

Block Delay [ns] Area [μm ] Comp. x, 5x. PPs gen. mux. 55, PPs gen. CSA.9 Tree level.6 Tree level.7 Tree level 3. 35, Tree level 4.7 Tree level 5.9 Tree level 6. r- CS BCD.4, Whole multiplier.65 3, crit. path TABLE III RESULTS OF IMPLEMENTATION. tiplier with a binary double-precision tree-multiplier (with radix-4 recoding) which has a comparable dynamic range ( 53 < 6 ). The latency of the binary unit is.4 ns and its area. mm (Table IV). By comparing the radix- and the binary multipliers, the binary one is about two times faster and 33% smaller. IV. CONCLUSIONS In this work, we presented the architecture of a radix- combinational multiplier and its implementation in standard cells. The partial products are generated in such a way that only multiples and 5 of the multiplicand are required, and their accumulation is done by using radix- CSAs and counters. The synthesized unit has a operation latency of.65 ns and can be pipelined to obtain a target throughput. Critical path Area Unit [ns] ratio [mm ] ratio radix-4 mult.4... decimal mult.65.9.3.5 TABLE IV COMPARISON OF RADIX-4 AND RADIX- MULTIPLIERS. Although it might not be reasonable to compare the latencies of combinational and sequential units, the proposed multiplier has the shortest latency when pipelined and clocked at the maximum frequency of pre-existing radix- sequential multipliers. Finally, the radix- multiplier is compared with a binary double-precision multiplier, which is a de facto standard in most processors. The delay of the radix- multiplier is about twice that of the binary multiplier, and its area its about 5% larger. REFERENCES [] M. F. Cowlishaw, Decimal floating-point: algorism for computers, in Proc. of 6th Symposium on Computer Arithmetic, June 3, pp. 4. [] M. Erle and M. Schulte, Decimal Multiplication via Carry-save Addition, in Proc. of 4th International Conference on Application-Specific Systems, Architectures and Processors, July 3, pp. 337 347. [3] R. Kenney, M. Erle, and M. Schulte, A High-Frequency Decimal Multiplier, in Proc. of International Conference on Computer Design (ICCD), Oct. 4, pp. 6 9. [4] M. Erle, E. Schwarz, and M. Schulte, Decimal multiplication with efficient partial product generation, in Proc. of 7th Symposium on Computer Arithmetic, June 5, pp. 8. [5] R. K. Richards, Arithmetic Operations in Digital Computers. D. Van Nostrand Company, Inc., 955. [6] STMicroelectronics. 9nm CMOS9 Design Platform. [Online]. Available: http://www.st.com/stonline/prodpres/dedicate/soc/asic/9plat.htm 37