Floating Point Fused Add-Subtract and Fused Dot-Product Units




S. Kishor [1], S. P. Prakash [2]
[1] PG Scholar (VLSI Design), Department of ECE, Bannari Amman Institute of Technology, Sathyamangalam, Tamil Nadu, India
[2] Assistant Professor (Sr.G), Department of ECE, Bannari Amman Institute of Technology, Sathyamangalam, Tamil Nadu, India

ABSTRACT: A single-precision floating-point fused add-subtract unit and a fused dot-product unit are presented that perform simultaneous floating-point add and multiply operations. A major part of the single-precision addition, subtraction, and dot-product computation is performed using a parallel implementation. The units use the IEEE-754 single-precision format and support all rounding modes. The fused add-subtract unit is only about 56% larger than a conventional floating-point multiplier and consumes about 50% more power than a conventional floating-point adder. The fused dot-product unit is about 27% faster than the conventional parallel approach. These units are intended mainly for use in FFT algorithms. The simulation results are obtained with the Xilinx 14.3 EDA tool; the RTL views and synthesis reports are presented.

KEYWORDS: Fused Add-Subtract unit (FAS), Single-precision floating-point Fused Dot-Product unit (FDP), Rounding modes, Number of LUTs, Delay, Verilog, Xilinx.

I. INTRODUCTION

Fixed-point arithmetic was used for a long time in computer arithmetic because of its ease of implementation compared to floating-point arithmetic and the limited integration capabilities of earlier chip design technologies. The design of binary fixed-point adders, subtractors, multipliers, and dividers is covered in numerous textbooks and conference papers. However, advanced technology applications require a data space that ranges from the infinitesimally small to the infinitely large, and such applications require floating-point hardware. A floating-point number representation can simultaneously provide a large range of numbers and a high degree of precision. As a result, a portion of most microprocessors is dedicated to hardware for floating-point computation.

Floating-point arithmetic is attractive for the implementation of a variety of Digital Signal Processing (DSP) applications because it allows the designer and user to concentrate on the algorithms and architecture without worrying about numerical issues such as scaling, overflow, and underflow. In the past, many DSP applications used fixed-point arithmetic due to the high cost (in time, silicon area, and power consumption) of floating-point arithmetic units. In IEEE 754, the 32-bit base-2 format is officially referred to as single precision or binary32; it was simply called single in IEEE 754-1985.

Fig.1 Parallel Implementation of FAS
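For reference, the binary32 word layout described above can be unpacked into its fields as in the following minimal Verilog sketch. This is only an illustration of the format, not part of the proposed design; the module name fp32_unpack and its port names are assumptions made here.

    // Minimal sketch: split an IEEE-754 binary32 word into sign, biased
    // exponent and significand, restoring the hidden leading 1 for
    // normalized values (NaN/infinity encodings are not treated specially).
    module fp32_unpack (
        input  wire [31:0] a,
        output wire        sign,
        output wire [7:0]  exp,    // biased exponent (bias = 127)
        output wire [23:0] sig     // significand with hidden bit restored
    );
        assign sign = a[31];
        assign exp  = a[30:23];
        assign sig  = {(exp != 8'd0), a[22:0]};  // hidden bit is 0 for zero/subnormals
    endmodule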

Simultaneous computation of the sum and the difference of the same pair of operands is required, for example, in the FFT butterfly operation. In traditional floating-point hardware these operations are performed serially, which limits the throughput. The use of a fused add-subtract (FAS) unit (Fig.1) and a fused dot-product (FDP) unit (Fig.2) accelerates the butterfly operation. Alternatively, the addition and subtraction may be performed in parallel with two floating-point adders, which is expensive in silicon area and power consumption.

Fig.2 Parallel Implementation of FDP

This paper is organized as follows. Section I introduces the FAS and FDP units for FFT applications. Section II describes the floating-point FAS unit. Section III describes the floating-point FDP unit. Section IV presents the simulation results and performance analysis in tabular form, followed by the conclusion in Section V.

II. FLOATING POINT FUSED ADD-SUBTRACT UNIT

The architecture of the fused add-subtract unit is derived from the floating-point add unit. The exponent difference, significand shift, and exponent adjustment functions can be performed once with a single set of hardware, with the results shared by the add and the subtract operations. Fig.3 shows the architecture of the fused add-subtract unit: the blocks with a white background are the same blocks used for a single floating-point add operation; the blocks with a green background are additional blocks used to perform the subtract operation; and the blocks with a yellow background are similar to the floating-point add blocks, but with extended functionality to calculate the sign and exponent for the new subtract operation.

Fig.3 Floating-Point Fused Add-Subtract Unit

The unit detects the effective operation based on the signs of the two operands and the intended operation. It also generates guard and pre-sticky bits that aid in the proper rounding of the final results.
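The hardware sharing described above can be illustrated with a simplified behavioral Verilog sketch. This is not the proposed unit: it omits sign handling, the guard and pre-sticky bits, rounding, normalization, and the operand swap needed when the exponents are equal; the module name fas_core and its ports are illustrative.

    // Sketch of the sharing in the fused add-subtract unit: the exponent
    // difference and the alignment shift are computed once and the aligned
    // significands feed both the adder and the subtractor.
    module fas_core (
        input  wire [7:0]  exp_a, exp_b,
        input  wire [23:0] sig_a, sig_b,   // significands with hidden bit
        output wire [24:0] sum,
        output wire [24:0] diff
    );
        wire        a_ge_b   = (exp_a >= exp_b);
        wire [7:0]  exp_diff = a_ge_b ? (exp_a - exp_b) : (exp_b - exp_a);

        // single alignment shifter, shared by the add and the subtract paths
        wire [23:0] sig_big           = a_ge_b ? sig_a : sig_b;
        wire [23:0] sig_small         = a_ge_b ? sig_b : sig_a;
        wire [23:0] sig_small_aligned = sig_small >> exp_diff;

        assign sum  = {1'b0, sig_big} + {1'b0, sig_small_aligned};
        assign diff = {1'b0, sig_big} - {1'b0, sig_small_aligned};
    endmodule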

In a parallel conventional implementation of the fused add-subtract operation, two floating-point adders are used to perform the two operations. This approach is fast; however, the area and power overhead is large because two complete floating-point add/subtract units are used. In a serial conventional implementation, one floating-point adder/subtractor is used together with a storage element that holds the first result. This approach is very efficient in terms of area. However, due to the serial execution of the two operations, the time needed to obtain both results is twice that of the parallel approach, and the storage element itself adds slightly to the area and power overhead.

III. FLOATING POINT FUSED DOT-PRODUCT UNIT

The architecture of the fused dot-product unit is likewise derived from the floating-point add unit. The exponent difference, significand shift, and exponent adjustment functions are performed once with a single set of hardware, with the results shared by both the add and the subtract operations (Fig.3); new add and normalize blocks are needed for the subtract operation. As in the fused add-subtract unit, the blocks with a white background are the same blocks used for a single floating-point add operation, the blocks with a green background are additional blocks used to perform the subtract operation, and the remaining blocks are similar to the floating-point add blocks but with extended functionality to calculate the sign and exponent for the subtract operation. Since the two operations explicitly produce the sum and the difference (if the addition gives the sum, the subtraction gives the difference), the addition and subtraction are placed separately and only one LZA and one normalization stage (for the subtraction) are required. Assuming both sign bits are positive, the addition and subtraction are performed separately; two multiplexers then select the sum and the difference using the operation decision bit, which is the XOR of the two sign bits. The two products of the dot product are formed and summed within the same fused datapath, so the FDP is faster than a serial implementation and increases the efficiency of the FFT implementation.

Fig.4 Floating point Fused Dot-Product Unit
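The sum and difference selection just described (the operation decision bit being the XOR of the two sign bits) can be sketched as follows. The module name sum_diff_select and its ports are illustrative; the sketch assumes the two magnitude results have already been computed in parallel.

    // With |A|+|B| and |A|-|B| computed in parallel, the XOR of the two sign
    // bits decides which magnitude corresponds to the sum A+B and which to
    // the difference A-B.
    module sum_diff_select (
        input  wire        sign_a, sign_b,
        input  wire [24:0] mag_add,   // |A| + |B|
        input  wire [24:0] mag_sub,   // |A| - |B|
        output wire [24:0] sum_mag,   // magnitude of A + B
        output wire [24:0] diff_mag   // magnitude of A - B
    );
        wire eff_sub = sign_a ^ sign_b;   // effective-operation decision bit
        assign sum_mag  = eff_sub ? mag_sub : mag_add;
        assign diff_mag = eff_sub ? mag_add : mag_sub;
    endmodule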

IV. SIMULATION RESULTS AND PERFORMANCE ANALYSIS

The various arithmetic modules (the conventional floating-point multiplier, the fused add-subtract unit, and the fused dot-product unit) are developed in Verilog. Figs.5 and 6 show the FAS and FDP symbols, and Figs.7-9 show the RTL views of the FAS and FDP units.

Fig.5 FAS symbol
Fig.6 FDP symbol
Fig.7 RTL view of FAS
Fig.8 RTL view of FDP
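A purely illustrative, real-valued reference model (not the authors' testbench; all names here are assumptions) can be used to predict the values that the FAS and FDP outputs, such as those shown in Fig.10, should match:

    // Behavioral reference model: prints the expected FAS results (sum and
    // difference) and the expected FDP result (a*b + c*d) for one operand set.
    module ref_model_tb;
        real a, b, c, d;
        initial begin
            a = 3.25; b = 1.5; c = 2.0; d = 0.5;
            $display("FAS: sum = %f, diff = %f", a + b, a - b);
            $display("FDP: a*b + c*d = %f", a * b + c * d);
            $finish;
        end
    endmodule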

Fig.10 FAS Outputs

TABLE 1 Performance Analysis of FAS

Logic Utilization                    Used    Available   Utilization
Number of Slice LUTs                 360     27288       1%
Number of Fully Used LUT-FF Pairs    4       363         1%
Number of Bonded IOBs                136     296         45%
Number of BUFG/BUFGCTRLs             1       16          6%

V. CONCLUSION

The proposed fused architectures are designed to be faster than a conventional floating-point add-subtract unit and multiplier and to provide a slightly more accurate result. Both the fused FDP and FAS units are more efficient than the older designs; the fused approach makes more sense because only one rounding is performed, compared with the three roundings of the parallel discrete approach. These architectures are mainly useful for FFT implementation, and a radix-2 FFT built from two FAS units and two FDP units is planned as future work. The proposed system reduces the alignment shift amount, applies normalization to reduce the size of the significand addition, and uses an LZA to reduce the reduction tree, which will further benefit future FFT implementations.
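As a rough illustration of the radix-2 butterfly decomposition mentioned above, the following real-valued behavioral sketch (module and signal names are illustrative, not the planned design) splits one butterfly into two dot-product shaped operations and two add-subtract shaped operations, which is why it maps onto two FDP and two FAS units:

    // Radix-2 DIT butterfly reference: t = b*W needs two dot-product style
    // operations (one per complex component), and X0 = a + t, X1 = a - t
    // needs two add-subtract pairs (one per component).
    module radix2_butterfly_ref;
        real ar, ai, br, bi, wr, wi;   // inputs a, b and twiddle factor W
        real tr, ti;                   // t = b * W
        real x0r, x0i, x1r, x1i;       // outputs X0 = a + t, X1 = a - t
        initial begin
            ar = 1.0; ai = 2.0; br = 0.5; bi = -1.0; wr = 0.7071; wi = -0.7071;
            tr = br * wr - bi * wi;         // dot-product form: br*wr - bi*wi
            ti = br * wi + bi * wr;         // dot-product form: br*wi + bi*wr
            x0r = ar + tr; x1r = ar - tr;   // add-subtract pair (real part)
            x0i = ai + ti; x1i = ai - ti;   // add-subtract pair (imaginary part)
            $display("X0 = (%f, %f)  X1 = (%f, %f)", x0r, x0i, x1r, x1i);
        end
    endmodule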

REFERENCES

[1] H. H. Saleh and E. E. Swartzlander, Jr., "A Floating-Point Fused Add-Subtract Unit," Proc. IEEE Midwest Symp. Circuits and Systems (MWSCAS), pp. 519-522, 2008.
[2] H. H. Saleh, "Fused Floating-Point Arithmetic for DSP," PhD dissertation, Univ. of Texas, 2008.
[3] Jorge Tonfat and Ricardo Reis, "Improved Fused Floating Point Add-Subtract and Multiply-Add Unit for FFT Implementation," 2014 2nd International Conference on Devices, Circuits and Systems (ICDCS), 2014.
[4] Jongwook Sohn and Earl E. Swartzlander, "Improved Architectures for a Fused Floating-Point Add Subtract Unit," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, no. 10, October 2012.
[5] V. Oklobdzija, "High-Speed VLSI Arithmetic Units: Adders and Multipliers," in Design of High-Performance Microprocessor Circuits, A. Chandrakasan, Ed., IEEE Press, 2000.
[6] Earl E. Swartzlander and Hani H. M. Saleh, "FFT Implementation with Fused Floating-Point Operations," IEEE Transactions on Computers, vol. 61, no. 2, February 2012.
[7] Jongwook Sohn and Earl E. Swartzlander, "Improved Architectures for a Floating-Point Fused Dot Product Unit," 2013 IEEE 21st Symposium on Computer Arithmetic, 2013.
[8] Zhang Zhang and Dongge Wang, "FFT Implementation with Multi-Operand Floating Point Units," IEEE, 2011.