A DA Serial Multiplier Technique based on 32- Tap FIR Filter for Audio Application



Similar documents
Floating Point Fused Add-Subtract and Fused Dot-Product Units

Implementation and Design of AES S-Box on FPGA

High Speed and Efficient 4-Tap FIR Filter Design Using Modified ETA and Multipliers

LMS is a simple but powerful algorithm and can be implemented to take advantage of the Lattice FPGA architecture.

FPGA. AT6000 FPGAs. Application Note AT6000 FPGAs. 3x3 Convolver with Run-Time Reconfigurable Vector Multiplier in Atmel AT6000 FPGAs.

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

Multipliers. Introduction

Introduction to Xilinx System Generator Part II. Evan Everett and Michael Wu ELEC Spring 2013

High-Level Synthesis for FPGA Designs

Implementation of Full -Parallelism AES Encryption and Decryption

DDS. 16-bit Direct Digital Synthesizer / Periodic waveform generator Rev Key Design Features. Block Diagram. Generic Parameters.

Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays

ON SUITABILITY OF FPGA BASED EVOLVABLE HARDWARE SYSTEMS TO INTEGRATE RECONFIGURABLE CIRCUITS WITH HOST PROCESSING UNIT

Digital Systems Design! Lecture 1 - Introduction!!

Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2)

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

A High-Performance 8-Tap FIR Filter Using Logarithmic Number System

Understanding CIC Compensation Filters

7a. System-on-chip design and prototyping platforms

International Journal of Electronics and Computer Science Engineering 1482

FPGAs for High-Performance DSP Applications

Hardware and Software

Reconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio

synthesizer called C Compatible Architecture Prototyper(CCAP).

Innovative improvement of fundamental metrics including power dissipation and efficiency of the ALU system

Open Flow Controller and Switch Datasheet

Architectures and Platforms

International Journal of Advancements in Research & Technology, Volume 2, Issue3, March ISSN

Modeling a GPS Receiver Using SystemC

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy

Lecture N -1- PHYS Microcontrollers

INTRODUCTION TO DIGITAL SYSTEMS. IMPLEMENTATION: MODULES (ICs) AND NETWORKS IMPLEMENTATION OF ALGORITHMS IN HARDWARE

Department of Electrical and Computer Engineering Ben-Gurion University of the Negev. LAB 1 - Introduction to USRP

Polymorphic AES Encryption Implementation

Attaining EDF Task Scheduling with O(1) Time Complexity

How To Fix A 3 Bit Error In Data From A Data Point To A Bit Code (Data Point) With A Power Source (Data Source) And A Power Cell (Power Source)

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Design and FPGA Implementation of a Novel Square Root Evaluator based on Vedic Mathematics

AC : PRACTICAL DESIGN PROJECTS UTILIZING COMPLEX PROGRAMMABLE LOGIC DEVICES (CPLD)

FPGA Design of Reconfigurable Binary Processor Using VLSI

An Efficient Architecture for Image Compression and Lightweight Encryption using Parameterized DWT

Aims and Objectives. E 3.05 Digital System Design. Course Syllabus. Course Syllabus (1) Programmable Logic

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

Infinite Impulse Response Filter Structures in Xilinx FPGAs

White Paper FPGA Performance Benchmarking Methodology

Introduction to Digital System Design

Technical Aspects of Creating and Assessing a Learning Environment in Digital Electronics for High School Students

FPGA Implementation of RSA Encryption Engine with Flexible Key Size

What is a System on a Chip?

Chapter 2 Logic Gates and Introduction to Computer Architecture

Quartus II Software Design Series : Foundation. Digitale Signalverarbeitung mit FPGA. Digitale Signalverarbeitung mit FPGA (DSF) Quartus II 1

A VHDL Implemetation of the Advanced Encryption Standard-Rijndael Algorithm. Rajender Manteena

System on Chip Platform Based on OpenCores for Telecommunication Applications

DEVELOPMENT OF DEVICES AND METHODS FOR PHASE AND AC LINEARITY MEASUREMENTS IN DIGITIZERS

Non-Data Aided Carrier Offset Compensation for SDR Implementation

ALFFT FAST FOURIER Transform Core Application Notes

IJESRT. [Padama, 2(5): May, 2013] ISSN:

40G MACsec Encryption in an FPGA

9/14/ :38

Verification & Design Techniques Used in a Graduate Level VHDL Course

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

Em bedded DSP : I ntroduction to Digital Filters

Design of a High Speed Communications Link Using Field Programmable Gate Arrays

Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications

Low Cost System on Chip Design for Audio Processing

Float to Fix conversion

Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware

RAPID PROTOTYPING OF DIGITAL SYSTEMS Second Edition

NIOS II Based Embedded Web Server Development for Networking Applications

Digital to Analog Converter. Raghu Tumati

Modeling Sequential Elements with Verilog. Prof. Chien-Nan Liu TEL: ext: Sequential Circuit

University of St. Thomas ENGR Digital Design 4 Credit Course Monday, Wednesday, Friday from 1:35 p.m. to 2:40 p.m. Lecture: Room OWS LL54

Digital Signal Controller Based Automatic Transfer Switch

路 論 Chapter 15 System-Level Physical Design

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Power Reduction Techniques in the SoC Clock Network. Clock Power

Improved Method for Parallel AES-GCM Cores Using FPGAs

Design and Verification of Nine port Network Router

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Implementation of Digital Signal Processing: Some Background on GFSK Modulation

The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

DIGITAL-TO-ANALOGUE AND ANALOGUE-TO-DIGITAL CONVERSION

BUILD VERSUS BUY. Understanding the Total Cost of Embedded Design.

MP3 Player CSEE 4840 SPRING 2010 PROJECT DESIGN.

Final Year Project Progress Report. Frequency-Domain Adaptive Filtering. Myles Friel. Supervisor: Dr.Edward Jones

EXPERIMENT 8. Flip-Flops and Sequential Circuits

Design and Implementation of Fast Fourier Transform Algorithm in FPGA

Hardware Implementation of AES Encryption and Decryption System Based on FPGA

EE361: Digital Computer Organization Course Syllabus

Abstract. Cycle Domain Simulator for Phase-Locked Loops

A DESIGN OF DSPIC BASED SIGNAL MONITORING AND PROCESSING SYSTEM

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

Chapter 7 Memory and Programmable Logic

From Concept to Production in Secure Voice Communications

Transcription:

A DA Serial Multiplier Technique ased on 32- Tap FIR Filter for Audio Application K Balraj 1, Ashish Raman 2, Dinesh Chand Gupta 3 Department of ECE Department of ECE Department of ECE Dr. B.R. Amedkar National Institute Dr. B.R. Amedkar National Institute Dr. B.R. Amedkar National Of Technology of Technology Institute of Technology (NIT) Jalandhar, Punja, India (NIT) Jalandhar, Punja, India (NIT) Jalandhar, Punja, India alraj224@gmail.com ramana@nitj.ac.in dinesh_chand_gupta@yahoo.com Astract--This paper presents the design and implementation of a high speed 32-tap finite impulse response (FIR) filter that employs the Distriutive Arithmetic (DA) technique for the complex computation of Audio Coefficients and Multipliers. Distriutive Arithmetic (DA) has een used to implement a it-serial scheme of a general symmetric version of FIR filter due to its high staility and linearity y taking optimal advantage of the look-up tale (LUT), ase structure of Field Programmale Gate Array (FPGA). Distriutive Arithmetic algorithm (DA-FIR) technique is employed for reducing the complex computations, therey increasing the speed and it also reduces the area and power consumption. The design is modeled using Verilog HDL and implemented on Virtex II Pro FPGA that consumes 39% resources of FPGA and shows the clock latency of 34 cycles at 192.45 MHz clock frequency. Keywords- FIR filter, MATLAB, Distriutive Arithmetic (DA), FPGA, Cadence RTL compiler. I. Introduction The most common approaches to the implementation of digital filtering algorithm are general purpose digital signal processing chips for audio application, or special purpose digital filtering chips and application-specific integrated circuits (ASICs) for higher rates[9].this approach to the implementation of Distriutive Arithmetic algorithm on FPGA. Technological advancements y Xilinx in Field Programmale Gate Arrays (FPGAs) in the past 10 years have opened new paths for DSP design engineers. The FPGA maintains the advantages of the high specificity of the ASIC while avoiding the high development costs and inaility to make design modifications after production. The FPGA also adds design flexiility and adaptaility with optimal device utilization conserving oth oard space and system power which is often not the case with DSP chips. When the design is demanding more than 100 MIPS, time-to-market critical or design adaptaility is crucial the Xilinx FPGA is a great solution. When designing a DSP system in an FPGA the data can e processed taking advantage of single chip paralleled structures and arithmetic algorithms to exceed the performance of a single general-purpose DSP device. Distriuted Arithmetic [2] used for Baugh-Wooley multiplication is just one of the approaches used to increase data andwidth and the throughput as much as several orders of magnitudes greater in FPGAs than that possile in off-the-shelf DSP solutions. One example is a 16- Tap, 8-Bit Finite Impulse Response (FIR) filters [3]. It can support more than 780 MIPS at 5 million samples per second while occupying less than 2,100-gates or 70% of the 100 CLBs in an XC4003 device. Another example is a 32-Tap, 10-Bit FIR filter. It can support more than 3 GOPS at 5 million samples per second. When increasing the numer of Taps has no significant impact on the sample rate when Distriuted Arithmetic is used; The FPGA design cycle requires less hardware specific knowledge that required y most DSP chips or ASIC design solutions. Smaller design groups with less experienced engineers can e given the opportunity and trusted to design larger, more complex DSP systems in a shorter amount of time than that of larger design groups with more experienced engineers who are required to have knowledge of device specific programming languages. The FPGA design team can have a complex DSP system designed, tested, verified. The paper is organized in the following sections: In section II FPGA Implementation of FIR filter using DA Technique to compression are discussed. Section III MATLAB designing of FIR filter explained in this section. The performance parameters of the system are discussed in section IV. Finally conclusions have een drawn in section V. II. FPGA Implementation of FIR Digital Filter Using DA Technique and Baugh-Wooley Multiplier This paper focus on the implementation of a specific fixed coefficient FIR filters y using the Xilinx tools. The specific filter is designed with 10-its input and 32-its coefficients, which are optimized y using our new pulic-domain software. Our ojective is to compare the new software techniques with the technique used in the Xilinx audio application. The Audio application use to asic uilding locks to uild an 32-tap low pass FR filter with 10 its input and 32 its signed coefficients. The asic uilding locks include constant coefficient Multiplier (KCM), Adders, registers and a delay locked loop. In fully parallel FIR filter, the inputs to the multiplier are the tap data and the constant coefficient. These multipliers are called KCM since one of the inputs in a constant. In Xilinx, KCMs are implemented using the Hyrid Technique [q. An example structure similar to [6] is presented in Figure 1. Efficient implementation of the KCM is achieved y the authors y storing the pre-computed partial products of the fixed 158

coefficients. These partial products are stored in ROMs using distriuted memory in Xilinx FPGAs. In our new pulic domain software, the multipliers are constructed using shifting and adding arithmetic operation. The performance of the filter generated y the software is then comparing to the results, which appear in the Audio application. Also, note that the Audio application uses a low pass FIR filter. To e consistent with the Audio application, look up tale (LUT) and the numer of slice registers of the Xilinx technology are the main asis for comparison. However, the Audio application structure has a latency of four clock cycles while our structure outputs data at the end of first clock cycle itself. Using FIR filter in the actual system, multiple applications of the FIR filter linear-phase characteristics, according to the any coefficients used to design. Specification of the given FIR filter is: 32taps, 10-its data and 32-its coefficient width and single MAC implementation. Basic lock diagram of the direct form programmale FIR Filter is shown in the Fig.3. X [9:5] X [9:0] 10 X [4:0] 5 5 LUT XK LUT XK A. FIR FILTER IMPLEMENTATION Finite impulse response (FIR) filtering is one of the most widely used operations in Digital Signal Processing (DSP) devices [2]. The asic equation of the FIR filter is given as N 1 n H[ z] h[ n] z (1) n 0 N is the length of h (n), that FIR filter tap numer. Because it is z -1 of the (N-1) polynomial, which has (N-1) a zero in the z plane, the original z= 0 is a (N-1) re-order pole. The FIR filter direct-type structure as shown in Figure2, the output can e expressed as: N 1 (2) y[ n] h[ i] x[ n i] It is not difficult to achieve this structure of FIR filter with adder and the multiplier ut the direct implementation of the FIR filter regardless of speed or in power consumption is failed. 16 16 A D D E R Figure1. Look Up Tale Architecture Figure2. Direct-type FIR filter y 16 [15:0] 16 Figure3. Basic Block diagram of FIR Filter Data width of all components are 10-it, XRAM and BRAM oth have size 10x64[2]. XRAM stores the input data while coefficients are already stored in the BRAM. Controller is responsile for ordering and control of each logic function. It generates the addresses, read, and write signals for oth memories. Data path is responsile for performing data manipulations. As far as operation is concerned, the present and N-1 previous data samples of input x (n) comes to the data path through XREG & are multiplied y corresponding N-tap coefficients one y one at each clock through BREG. Summation is also done at each clock y adding current and previous results of multiplications to form the filter output y (n) after N cycles. Data path consists of a single 10-it MAC unit and a round off module to get a 16-it output of the filter output after rounding the 32-it output of MAC unit. Direct form FIR filter implementation is run time programmale and uses generic multiplier. Upper limit for the numer of taps is 32, 10- its data and 32-its coefficient. Power, area and speed results of this FIR filter are considered as a referenced to study this effect of partitioned multiplier on the performance of Direct Form FIR Filter. B. Distriuted Arithmetic (DA) Distriuted arithmetic (DA) is an important FPGA technology. It is extensively used in computing the sum of products, N 1 y[ n] h, x h[ n] x[ n] (3) n 0 DA system, assumes that the variale x [n] is represented y B 1 (4) 0 x[ n] X [ n]2, x [ n] [0,1] If h [n] is the known coefficients of the FIR filter, then output of FIR filter in it level from is: 159

N 1 B 1 (5) n 0 0 y[ n] h[ n] X x [ n] X 2 In distriuted arithmetic from is given elow eq. B 1 N 1 y[ n] 2 X f ( h[ n], x [ n]) (6) 0 n 0 In Eq. (6) second summation term realizing as one LUT. The use of this LUT or ROM eliminates the multipliers [10]. Distriuted Arithmetic (DA) algorithm as a very efficient solution especially suited for LUT- ased FPGA architecture. Using the Baugh-Wooley multiplier architecture is used on an efficient partition of the function in partial terms using 2 s complement inary data representation. The partial it terms can e stored in LUTs [8], oserved that the requirement of ROM/LUT capacity increases exponentially with the order of the filter and taps of the filter. C. Baugh-Wooley multiplier The Baugh Wooley (BW) multiplier is an efficient way to handle the sign its. This technique has een developed in order to design regular multipliers, suited for 2 s-complement numers [8]. Geali has extended this asic idea and developed efficient fastest product processors capale of performing multiply-accumulate operations without the speed penalty [7]. Let us consider two n-it numers, A and B to e multiplied. A and B can e represented as. A = -a n-1 2 n-1 + a i2 i (7) B = - n-1 2 n-1 + j2 j (8) Where the a i and i are the its in A and B respectively, and a n-1 and n-1 are sign its. The product of A*B=P is given elow P = A*B (9) P = (-a n-1 2 n-1 + a i2 i ) x (- n-1 2 n-1 + j2 j ) (10) The final product is otained y sutracting the last two positive terms from the first two terms P = a n-1 n-1 2 2n-2 + a i j 2 i+j -2 n-1 a i n-1 2 i -2 n-1 a n-i j 2 j (11) D. Baugh- Wooley 8- Bit Multiplier Multipliers are complex adder arrays. Figure 5 shows the 8-it BW multiplier. A BW multiplier is a regular multiplier that is suited for 2 s complement numers. Multiplication is done two steps. This BW multiplier each ox corresponded to an AND gate, full- adder shown in figure [4]. Figure [4] shows internal structure of cell construction in Baugh-Wooley multiplier. The main aim is power consumption y using Baugh-Wooley multiplier. This multiplier high speed and less area required. Figure [5] shows 16 it multiplication has an output of 32 its, when doing a multiplication of 8-it precision; only the most significant 8 its need to go through the final stage of adders. The results of the products in the less significant 8 its can e directly routed to the output, 8 additional multiplexers after the last row of adder will control the output its 31 to 16, ased on the information whether the circuit is performing a multiplication of either 8-its or 16-its. During the 8-it precision multiplication, there is switching activity only in the lower, the output further lower power consumption. III. s i c i FA S C0 0 Figure4. 0 BW multiplier cell construction Figure5. 16, 8-it scalale BW Multiplier [5] MATLAB Design of 32-tap FIR filter for Audio Application IV. A. Design performance requirements Design prolem of FIR digital filter is determine the transforming the required sequence or the impulse response of the constant prolems design a system function H (z) to e approaching the technical use of the finite impulse algorithm. a 160

Design methods are mainly low pass filter function method, frequency sampling method, time domain method, This Filter design methodology used to FIR low-pass filter, sampling frequency of 40000Hz; Pass-and frequency of 22000Hz; stopand frequency of 1200Hz. B. Audio Application specific design method of FIR filter The MATLAB TOOL is used to design the audio recording specific signal spectrum. We get some of the impulse response. Generate the original speech signal in MATLAB after add the noise speech signal and finally we get the reconstructed noisy speech wave forms shown in elow figures. A m p li t u d e Figure6. Impulse response of FIR filter Figure7. Original speech signal Figure8. Noisy speech signal Time A. Determine the impulse response coefficients Using MATLAB software to find the filter coefficients, the 32-tap FIR filter using audio application unit impulse response are real numers to meet the odd or even symmetry conditions. Unit impulse response coefficients otained are as follows tale1. Tale I. Impulse Response Coefficients Coefficient Fir Digital Filter Coefficients numer H[0]=H[31] 0.098 H[1]=H[30] -0.811 H[2]=H[29] 1.097 H[3]=H[28] 0.353 H[4]=H[27] 0.0091 H[5]=H[26] -0.6692 H[6]=H[25] -0.0236 H[7]=H[24] 0.9774 H[8]=H[23] 0.0624 H[9]=H[22] 1.5443 H[10]=H[21] 1.2018 H[11]=H[20] -1.05689 The most of FIR filter tap coefficients are the decimal and signed numers ut do not support floating numers you need to convert tap coefficients into a inary numers so the numer of taps coefficient code is an issue that must e considered. This design uses SD code to minimize the code numer of 1and -1 to achieve multiplier at a minimum cost. Filter coefficients of the SD code as follows, each tap coefficient of the first left seven (multiplied y 128) accurate to four decimal places. Some example is given here. 128 h (0) = h (31) = (1.1648) 10 = (1.00101), 128 h (1) = h (30) = (1.5872) 10 = (1.1001), B. Verilog simulation results for 32-tap FIR filter According to the characteristics of FIR filter coefficients even symmetric, adding up the output of the symmetric tap a coefficient then multiplied y the corresponding weighted of the inary and then adds the results. Top level of 32-tap filter synthesizes results shown in elow. Figure9. Reconstructed noisy speech signal IV. Experimental Results of FIR filter 161

Figure10. Top level of 32-tap FIR filter We Design the Net list of 32-tap FIR filter figure from Cadence RTL Compiler. C. Comparison of Power, Area and Timing Report Tale2. Shows the comparison of power, area and timing report etween different architectures of 32-tap FIR Filter. Tale 2: Comparison of Different Architectures Results Figure11. Net list of 32-tap FIR filter Types 32-tap Decimation Filter using audio application DA serial multiplier using audio application This work 32-tap FIR filter using DA and BW algorithm for audio application Power (mw) Area(µm 2 ) Timing Report(Ps) 29.329 42843 3567 22.568 29394 2987 19.755 19657 1818 Figure12. RTL Schematic of 32-tap FIR filter Figure13.Simulation result of 32-tap FIR filter using audio application Simulation is as shown in Figure13 can e verified: the results of y (n) =h (n) *x for example, the corresponding pulse (00001111) derived from the output y (7) =h (n) *7=38.54*15=578.10 and y (7) = (00000000000000000000001001000010) 2 = (578), comparison simulation results with the actual operational error =0.68, the results are very close to meet the design requirements. Figure14. Layout of 32-tap FIR filter using Speech signal in SOC IV. Conclusion and future work In this paper a 32-tap FIR filter is designed for Audio application. Xilinx 13.2 ISE is used for designing filter that uses DA and BW algorithm and oth are implemented on FPGA vertex II pro. Simulation results shows that oth DA and BW consumed less power and area as compare to decimation and DA serial multiplier for desired application. The DA and BW consume 39% resources of FPGA and show the clock latency of 34 cycles at 192.45 MHz clock frequency. Pipelining may e used for etter results. REFERENCES [1] Y. Lio, FPGA application development from practice to improve Beijing: China Electric Power Press, 2007 [2] W.Jinmig, VerilogHDL programming tutorial. Beijing: People s Posts & Telecom Press, 2004.1 [3] Z. ghaijun, FPGA-ased 16-order FIR filter design and Implementation. Journal of Anhui University: Neuroscience Edition, 2009. [4] Z.B.cheng, Dynamic Distriuted Algorithm for FPGA-ased research and application. Journal of Changchun University of Technology: Natural Science Edition, 2005.12. [5] I. Koren, Computer Arithmetic Algorithms, 2nd Edition, A K Peters, Ltd. 63 South Avenue Natick, MA 01760. [6] K. Chapman, Building High Performance FIR Filter Using KCM Xilinx Ltd -UK, July 1996. [7] S. Akhter, VHDL Implementation of Fast NXN Multiplier Based on Vedic Mathematics, Jaypee Institute of Information Technology University, Noida, 201307 UP, INDIA, 2007 IEEE. [8] P. Mehta and D. Gawali, Conventional versus Vedic Mathematical Method for Hardware Implementation of a Multiplier, Proceeding International Conference on Advances in Computing, Control, & Telecommunication Technologies, 2009. ACT '09, pp-640-642, 2009. [9] K.Y.Khoo, A. Kwentus, and A. N. Willson, Jr. An efficient 175 MHz Programmale FIR digital filter. In IEEE Int.Symp. Circuits and Syst., pages 72-75, May 1993. [10] T. Vigneswaran and P. S. Reddy Design of Digital FIR Filter Based on DA algorithm Journal of Applied Science, 2007 162

[11] J. S. Park, Computation Sharing Programmale FIR Filter for Low- Power and High-Performance Applications, IEEE Journal of Solid- State Circuits, vol. 39, no. 2, Feruary 2004. 163