SAD computation based on online arithmetic for motion estimation




J. Olivares (a), J. Hormigo (b), J. Villalba (b), I. Benavides (a) and E. L. Zapata (b)

(a) Dept. of Electrics and Electronics, University of Córdoba, Spain, {olivares, el1bebej}@uco.es
(b) Dept. of Computer Architecture, University of Málaga, Spain, {hormigo, julio, ezapata}@ac.uma.es

Abstract

Block-based motion estimation is one of the critical tasks in today's video compression standards such as H.26x, MPEG-1, -2 and -4. Most block-based motion estimation algorithms compute the Sum of Absolute Differences (SAD) between corresponding elements in the candidate and reference blocks. In this paper an FPGA design is proposed for rapidly computing the minimum SAD. The use of online arithmetic achieves two goals: a full 16×16 macroblock SAD can be implemented in a single FPGA device, and the computation is sped up by terminating the SAD calculation early when the candidate SAD is bigger than the current reference SAD. Reconfigurable devices enable us to switch between 8×8 and 16×16 pixels per block quickly and easily. A 16×16 SAD unit requires 1945 look-up tables (LUTs) and runs at 425 MHz. A comparison with other related works is provided.

Keywords: motion estimation, FPGA, sum of absolute differences, online arithmetic

1 Introduction

Motion estimation (ME) plays an important role in today's video coding and processing systems, since motion vectors provide critical information for temporal redundancy reduction. It has been widely used in the H.26x, MPEG-1, -2 and -4 video compression standards. Motion estimation is defined as searching for the best motion vector, that is, the displacement of the coordinates of the most similar block in the previous frame compared to the block in the current frame. Full search block matching is the most popular ME algorithm; it searches through every candidate location to find the best match. To do this, the current frame is partitioned into two-dimensional blocks (typically 8×8 or 16×16 pixels) and a search window in the reference frame is defined. Each block of the current frame is compared with all the blocks of a previous frame within the same window. The final motion vector corresponds to the block with minimum distortion within the search window. The most commonly used metric to calculate the distortion is the Sum of Absolute Differences (SAD) [1], which adds up the absolute differences between corresponding elements in the candidate and reference block.

The heavy computational cost of block matching algorithms (BMA) can be a significant problem in real-time coding applications. To reduce computational complexity, many fast algorithms have been proposed which search only a subset of the candidate blocks [2][3]. Besides this, different architectures have been designed to speed up the associated massive arithmetic calculation [4][1]. However, the need for specialized hardware conflicts with the flexibility demanded by current video coding systems. A feasible solution to this problem is to use a programmable processor core along with a field programmable gate array (FPGA) device which is in charge of performing critical tasks. The advantages of using FPGAs include increased flexibility and rapid adaptation to new developments; appropriate performance; and faster design times achieved by re-using IP cores and high-level design languages (such as VHDL).
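The full-search procedure described above can be sketched in software as follows. This is only an illustrative reference model, not the hardware design of this paper; the function names, the list-of-rows frame layout and the `search_range` parameter are assumptions made for the example.

```python
# Illustrative software model of full-search block matching with the SAD metric.
# Frames are lists of rows of pixel values; all names here are hypothetical.

def sad(cur_frame, prev_frame, bx, by, dx, dy, n):
    """SAD between the n x n block of cur_frame at (bx, by) and the block of
    prev_frame displaced by (dx, dy)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur_frame[by + i][bx + j]
                         - prev_frame[by + dy + i][bx + dx + j])
    return total


def full_search(cur_frame, prev_frame, bx, by, n, search_range):
    """Try every displacement within +/- search_range that keeps the candidate
    block inside prev_frame; return (best_vector, minimum_sad)."""
    height, width = len(prev_frame), len(prev_frame[0])
    best_vec, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue  # candidate block would fall outside the frame
            s = sad(cur_frame, prev_frame, bx, by, dx, dy, n)
            if best_sad is None or s < best_sad:
                best_vec, best_sad = (dx, dy), s
    return best_vec, best_sad
```

Every candidate costs n² subtractions, absolute values and additions, which is the workload the proposed hardware is meant to accelerate.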
In this context, our design is intended to speed up computation of the minimum SAD by implementing it in an FPGA (the SAD processor in Figure 1), while a data dispatcher supplies the reference and candidate blocks to the FPGA device. An FPGA architecture to compute the minimum SAD is proposed in this paper. This design can be integrated with any BMA (full search or another efficient search strategy).

[Figure 1: Motion estimation system. Block labels: dispatcher, candidate block, reference block, SAD processor (labelled "MAD processor" in the figure), minimum SAD, motion vector, core processor.]

Despite the parallelism inherent to SAD, a fully parallel implementation has proved difficult, since it requires a large number of operands for typical block sizes (an 8×8 pixel block requires 128 8-bit operands, and a 16×16 pixel macroblock needs 512 8-bit operands). Due to the large amount of hardware, only the SAD of one macroblock row (16×1) is implemented on an FPGA device in [1], whose authors propose replicating or pipelining the design to obtain the 16×16 computation. Four FPGA chips with 1234 I/O pins each are used in [5] for a completely parallel design. On the other hand, the use of online arithmetic for motion estimation is proposed in [6] to speed up the computation by early termination of the SAD calculation. A serial architecture (pixel by pixel) for 4×4 blocks is proposed in [6], based on ASIC implementation.

This paper is organized as follows: in Section 2 a brief description of online arithmetic techniques is provided; in Section 3 we deal with the computation of the minimum SAD using online arithmetic; Section 4 presents the implementation of the proposed design in FPGA devices; the results of several simulations are shown in Section 5 to illustrate the clock cycles saved with early termination; a comparison with other works is described in Section 6; and finally, the most relevant results of this paper are summarized in Section 7.

2 Online arithmetic

Online arithmetic techniques have been considered as the solution to many signal processing problems, such as digital filtering, Fourier transform, and others [7]-[10]. Recent works have shown the suitability of online arithmetic for FPGA designs [11]. The basic idea of online arithmetic is to perform computations which overlap with the digit-by-digit communication of operands and results [7]. Online algorithms operate in a digit-serial manner, beginning with the most significant digit (MSD). To generate the first digit of the result, δ + 1 digits of the input operands are needed. Thus, after δ digits of the operands are received, a new digit of the result is obtained for each new digit of the operands. For this reason, δ is known as the online delay. Due to the online delay, after the last digits of the inputs are introduced into the system, a number of zero digits equal to the online delay has to be introduced to ensure a correct result.

The most-significant-digit-first mode of computation requires flexibility in computing digits on the basis of partial information about the inputs. This is achieved by using a redundant representation system. In a redundant representation with radix r, each digit has more than r possible values. This permits several representations of a given value. Therefore, there is flexibility in choosing an output digit at a given step, so that a compensation can be introduced later if needed. A signed-digit (SD) representation system [12] is used in this paper. In radix-2 SD representation, the digit set is {-1, 0, 1}. Two bits are required to represent each digit, as shown in Table 1. The first bit is negatively weighted and the second one is positively weighted. This number representation system eliminates the long carry propagation chains in the addition operation, although it requires the carry of the two previous digits.

In short, the advantages of using online arithmetic are as follows: it reduces the number of signal lines connecting modules due to its digit-serial character; the MSD-first computation allows subsequent calculations to start at a much earlier stage; and it eliminates carry propagation chains, since it uses a redundant number representation system.
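The redundancy of the radix-2 SD system can be illustrated with a short sketch. It assumes the bit ordering stated above (first bit negatively weighted, second bit positively weighted); the function names are hypothetical and the code is a software illustration only.

```python
# Radix-2 signed-digit numbers as MSD-first lists of (neg_bit, pos_bit) pairs.
# The value of one digit is pos_bit - neg_bit, so (0,0) and (1,1) both encode 0.

def sd_value(digits):
    """Integer value of an MSD-first list of (neg_bit, pos_bit) digits."""
    value = 0
    for neg, pos in digits:
        value = 2 * value + (pos - neg)
    return value


def sd_split(digits):
    """Return (positive_part, negative_part): the two unsigned numbers formed by
    the positively and negatively weighted bits. Their difference is the value."""
    pos_part = neg_part = 0
    for neg, pos in digits:
        pos_part = 2 * pos_part + pos
        neg_part = 2 * neg_part + neg
    return pos_part, neg_part


# Three different digit strings that all represent the value 3.
three_a = [(0, 0), (0, 1), (0, 1)]   # 0, +1, +1
three_b = [(0, 1), (0, 0), (1, 0)]   # +1, 0, -1  (4 - 1)
three_c = [(0, 1), (1, 1), (1, 0)]   # +1, 0 (encoded as 11), -1
```

Because several digit strings share one value, an online unit can commit to an output digit early and compensate in later digits.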

3 Online computation of the minimum SAD

The goal of our FPGA design is to find which of the candidate blocks (supplied by the dispatcher) best matches the reference block. The most commonly used metric to determine the best match is the Sum of Absolute Differences (SAD). Thus, our design computes the minimum SAD among all the candidate blocks. To do this, a search iteration is performed for each candidate block. During each search iteration, the SAD corresponding to a candidate block is computed using all its pixels simultaneously. The value obtained is compared with the reference SAD (SADr), which is the minimum SAD computed before the current iteration. If the current SAD (SADc) is less than SADr, it is stored as SADr for the remaining search iterations. Both the SAD computation and the comparison are performed using online arithmetic techniques. This allows us to begin the comparison when the first digit of the SAD is obtained and to stop the computation early if the digits computed are sufficient to ensure that SADc is greater than SADr.

3.1 Online SAD computation

The SAD adds up the absolute differences between corresponding elements in the candidate and reference block

$$\mathrm{SAD} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| c_{i,j} - r_{i,j} \right|, \tag{1}$$

where $r_{i,j}$ are the elements of the reference block, and $c_{i,j}$ the elements of the candidate block. Thus, the computation of the SAD is divided into three steps:

- Compute the differences between corresponding elements, $d_{i,j} = c_{i,j} - r_{i,j}$
- Determine the absolute value of each difference, $|d_{i,j}|$
- Add all absolute values

We now describe how each of these operations is performed using online arithmetic, and how the pixel values are converted into radix-2 SD representation.

Table 1: Digit codification in radix-2 signed-digit representation

Digit value   Digit representation
+1            01
 0            00
 0            11
-1            10

Conversion to SD representation and difference computation: In radix-2 signed-digit representation, each digit is composed of two bits, the first one negatively weighted and the second positively weighted. Thus, a signed-digit number can be interpreted as the difference between two unsigned numbers: the one composed of the positively weighted bits of each digit, minus the one composed of the negatively weighted bits. In fact, this difference must be computed to convert an SD number into a non-redundant representation. This property is used to simultaneously convert each pixel value into SD representation and compute the difference between the pixels of the reference block and the current block at no computational cost. Each digit of the value $d_{i,j} = c_{i,j} - r_{i,j}$ is obtained in SD representation simply by taking the corresponding bit of $c_{i,j}$ as the positively weighted bit and the corresponding bit of $r_{i,j}$ as the negatively weighted bit, since $c_{i,j}$ and $r_{i,j}$ are unsigned numbers.

Absolute value: To compute the absolute value of $d_{i,j}$, the sign of this value has to be changed if $d_{i,j}$ is negative. In SD representation, negation is performed by exchanging the two bits of each digit. Since the MSD-first mode of computation is being used, the sign of $d_{i,j}$ is detected on-the-fly by checking whether the first non-zero digit of $d_{i,j}$ is positive (01) or negative (10). The digits of $d_{i,j}$ are received in MSD-first mode and go directly to the output while they are zero (00 or 11). If the first non-zero digit received is positive (01), this digit and all the remaining digits go directly to the output. If, however, the first non-zero digit received is negative (10), the bits of this digit and of all the remaining digits are interchanged to obtain the output. The absolute value operation is performed with no online delay.
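The free conversion and the on-the-fly absolute value described above can be modelled as follows. This is a behavioural sketch under the stated bit ordering (first bit negative, second bit positive); the function names and the 8-bit default width are assumptions for the example, not part of the original design description.

```python
# Behavioural model of the SD difference and the delay-free absolute value.
# A digit is a (neg_bit, pos_bit) pair; digit lists are MSD first.

def diff_digits(c, r, width=8):
    """SD digits of c - r for unsigned c and r: bit k of r becomes the
    negatively weighted bit and bit k of c the positively weighted bit."""
    return [((r >> k) & 1, (c >> k) & 1) for k in range(width - 1, -1, -1)]


def online_abs(digits):
    """Digit-by-digit absolute value. Zero digits (00 or 11) pass through; the
    first non-zero digit fixes the sign, and if it is negative (10) the bits of
    that digit and of every later digit are exchanged."""
    out = []
    sign_known = False
    swap = False
    for neg, pos in digits:
        if not sign_known and neg != pos:
            sign_known = True
            swap = (neg, pos) == (1, 0)
        out.append((pos, neg) if swap else (neg, pos))
    return out


def sd_value(digits):
    """Integer value of an MSD-first SD digit list (used only for checking)."""
    value = 0
    for neg, pos in digits:
        value = 2 * value + (pos - neg)
    return value
```

For example, `sd_value(online_abs(diff_digits(3, 200)))` evaluates to 197, and each output digit depends only on digits already received, which is why no online delay is needed.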

[Figure 2: Online design for the sum of the absolute differences. Inputs A to H, each as a positive and a negative bit stream, feed an adder tree with partial sums AB, CD, EF, GH, then AD and EH, and finally AH.]

Sum of absolute differences: The absolute differences of all the pixels of the current and reference blocks are computed in parallel. Thus, $N^2$ absolute difference blocks are required. An online adder tree is used to obtain the sum of all $|d_{i,j}|$ values. Figure 2 shows this structure for 4×4 pixels per block (N = 4). Each adder in this figure corresponds to a standard SD online adder (see Figure 3). The number of addition steps of the complete adder tree is $\log_2(N^2)$. In radix-2 signed-digit representation, the online delay of the addition is two, i.e., the MSD of the result is obtained two cycles after the MSD of the inputs has been sent to the adder. Nevertheless, in our case the carry bit is used as the MSD of the result, and this digit is obtained one cycle earlier. Therefore, the online delay of the complete adder tree is $2\log_2(N^2)$, but the first digit of the result is obtained $\log_2(N^2)$ cycles earlier.

[Figure 3: Online adder design. Inputs $X^{+}_{j+3}$, $X^{-}_{j+3}$, $Y^{+}_{j+3}$, $Y^{-}_{j+3}$; two full adders (FA) and registers; outputs $Z^{+}_{j}$, $Z^{-}_{j}$.]

3.2 Signed-digit online comparison

Once the first digit of the SAD corresponding to the current block is obtained, the comparison between the current SAD and the minimum SAD can begin. Because the MSD-first mode of computation is used, an efficient comparison algorithm can be applied. Nevertheless, since SD representation allows several representations of a given value, comparing two values is not as simple as in conventional representations. In [13, 6] a comparison algorithm and its hardware implementation are proposed. The two SD numbers are first converted to sign-magnitude format and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two. We propose a comparison algorithm with no online delay. It is based on analysing the sign of the difference between the two values to be compared. In this way, the online delay of two that a subtraction operation would introduce is avoided.

Let us define the SD number A as

$$A = \sum_{i=0}^{n-1} a_i 2^i, \qquad a_i \in \{-1, 0, 1\} \tag{2}$$

where B has a similar expression. The result of the operation A - B is

$$A - B = \sum_{i=0}^{n-1} (a_i - b_i)\, 2^i, \qquad a_i, b_i \in \{-1, 0, 1\} \tag{3}$$

Let R be the result of the difference

$$R = A - B = \sum_{i=0}^{n-1} r_i 2^i, \qquad r_i \in \{-2, -1, 0, 1, 2\} \tag{4}$$

When using an online comparator, the sign of R can be determined at digit k if the partial accumulated sum $R_k$ complies with

$$|R_k| = \left| \sum_{i=k}^{n-1} r_i 2^i \right| \ge 2^{k+1} \tag{5}$$

Given the previous definition of $R_k$, R can be rewritten as

$$R = A - B = R_k + \sum_{i=0}^{k-1} r_i 2^i \tag{6}$$

Since

$$\left| \sum_{i=0}^{k-1} r_i 2^i \right| \le 2 \sum_{i=0}^{k-1} 2^i < 2^{k+1} \tag{7}$$

it is proved that

$$R_k \ge 2^{k+1} \Rightarrow R > 0 \tag{8}$$

$$R_k \le -2^{k+1} \Rightarrow R < 0 \tag{9}$$

If the condition in equation (5) is not met for any k > 0, the sign of R cannot be guaranteed until the last digit (k = 0). Let us define the normalized partial accumulated sum as $R'_k = R_k / 2^k$; the condition in equation (5) is then equivalent to

$$|R'_k| = \left| \sum_{i=k}^{n-1} r_i 2^{i-k} \right| \ge 2 \tag{10}$$

The value $R'_k$ can be computed using an online recurrence (note that k ranges from n-1 down to 0)

$$R'_k = 2 R'_{k+1} + (a_k - b_k) \tag{11}$$

The value $R'_k$ depends only on its previous value and the current digits; thus an online comparator, as well as minimum or maximum algorithms, can be implemented with no online delay based on this computation. An online comparator requires the value $R'_k$ to be computed in each iteration, starting at k = n-1 (MSD), until $|R'_k| \ge 2$ or k = 0. At this point, the decision is made based on the sign of $R'_k$.

[Figure 4: State-flow diagram of the online comparator design. States: $R'_k > 1$, $R'_k = 1$, $R'_k = 0$, $R'_k = -1$, $R'_k < -1$; transitions are labelled with the value of $a_k - b_k$ (from -2 to 2).]

In [14] we evaluate different hardware designs for the comparator. The fastest implementation is obtained when the design is implemented as a state machine following the state-flow diagram in Figure 4. Each state represents a possible value of $R'_k$, i.e., equal, possibly greater, greater, possibly less or less. The transitions between states are determined by the digits $a_k$ and $b_k$. The design used in this paper is a simplification of this state machine.
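The recurrence in equation (11) and the stopping condition in equation (10) translate directly into a few lines of code. The following sketch is a software illustration, not the state machine of Figure 4; the function name and the return convention are assumptions for the example.

```python
# Software model of the zero-delay online comparator.
# Inputs are equal-length MSD-first digit lists with values in {-1, 0, 1}.

def online_compare(a_digits, b_digits):
    """Return (sign, digits_used): sign is +1 if A > B, -1 if A < B and 0 if
    A == B; digits_used is how many digit pairs were consumed before deciding."""
    r = 0        # normalized partial accumulated sum R'_k
    used = 0
    for a, b in zip(a_digits, b_digits):
        r = 2 * r + (a - b)      # recurrence (11)
        used += 1
        if abs(r) >= 2:          # condition (10): the sign can no longer change
            break
    sign = (r > 0) - (r < 0)
    return sign, used
```

For instance, comparing A = 8 (`[1, 0, 0, 0]`) with B = 3 (`[0, 0, 1, 1]`) is decided after two digits, whereas two different encodings of the same value, such as `[1, 0, -1]` and `[0, 1, 1]`, need all digits before equality is known.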

4 FPGA implementation of the SAD processor

Figure 5 presents the architecture of the SAD processor. The absolute value of the differences is computed for each pair of pixels ($|c_{i,j} - r_{i,j}|$) and their summation is calculated in the $N^2$-operand adder. The result is stored digit-by-digit in the SADc register and is simultaneously compared with the corresponding digit of SADr in the comparator (COMP). If at any cycle the condition SADc > SADr is detected, the computation is stopped and a new candidate block is requested. Otherwise, if the condition SADc < SADr is verified, SADc is stored in SADr once the least significant digit of the SAD has been calculated.

[Figure 5: SAD processor architecture. $N^2$ absolute-difference blocks $|c_{i,j} - r_{i,j}|$ with 2-bit digit outputs feed the $N^2$-operand adder; its output goes to the SADc register and to the comparator (COMP), which also reads SADr and produces the stop and min signals.]

The timing of the computation for the 4×4 SAD processor is shown in Figure 6. For each cycle, the figure shows the outputs of the absolute value blocks ($|c_{i,j} - r_{i,j}|$), of each of the four steps in the adder tree (Σ(i)), and of the comparator (COMP). In fact, the entry shown for the comparator is not really an output, but rather the last digit used for the comparison. The zero digits represent the zero values which have to be introduced at the input due to the system's online delay. Since each addition has an online delay of two, and the absolute value blocks and the comparator have no online delay, eight zeroes are required in this case. The worst case occurs when a new minimum SAD is found; 21 cycles are then required for the full process, where the last cycle is used to store SADc in SADr.

[Figure 6: Timing of the 4×4 SAD processor over cycles 1 to 21. Rows: $|c_{ij} - r_{ij}|$ (digits d7 to d0 followed by eight zeroes), Σ(1) (d8 to d0, six zeroes), Σ(2) (d9 to d0, four zeroes), Σ(3) (d10 to d0, two zeroes), Σ(4) (d11 to d0) and COMP (d11 to d0). Marked: best case (9 cycles), worst case (21 cycles) and the start of a new computation.]

However, as Figure 6 shows, a new SAD computation can start after 16 cycles (once the 8 digits and 8 zeroes have been introduced), so this is the maximum period between two consecutive SAD computations. This period is shorter if the candidate SAD is rejected earlier. In the best case, this happens after analysing the MSD of the candidate SAD, i.e., after 9 cycles. Therefore, the number of cycles for a SAD computation and comparison is between 9 and 16 for a 4×4 SAD processor. This period ranges from 13 to 20 cycles for an 8×8 block, and from 17 to 24 cycles for a 16×16 block.

The design has been implemented on the Xilinx SPARTAN-II and VIRTEX-II FPGA families for three different block sizes. For compilation, simulation and implementation, we use the Xilinx ISE Series 5.2i. The main results of the implementation are shown in Table 2. The ratio of area to number of pixels is relatively low, due to the digit-serial character of online computation. The maximum clock frequency is independent of block size because, when the number of operators increases, only the number of parallel operations and the number of steps in the adder tree increase. Although this value strongly depends on the technology used (as shown in Table 2), our results are very promising. Table 3 shows how the area and delay are distributed among the different parts of the design for the 16×16 SAD processor. Note that the percentage given refers to the total number of LUTs of the SAD processor.
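The cycle ranges quoted in this section (9 to 16, 13 to 20 and 17 to 24) follow a simple pattern: 8 input digits, an online delay of two per adder-tree level, and rejection at the earliest on the first SAD digit. The sketch below encodes that pattern; the closed-form expressions are inferred from the figures stated in the text rather than given explicitly by the paper, and the function name and parameters are illustrative.

```python
import math

def sad_period(n, pixel_bits=8, adder_delay=2):
    """Best and worst number of cycles between two consecutive SAD computations
    for an n x n block, following the timing pattern described for Figure 6."""
    levels = int(math.log2(n * n))       # number of steps in the adder tree
    tree_delay = adder_delay * levels    # zero digits appended after the inputs
    best = tree_delay + 1                # rejected on the MSD of the candidate SAD
    worst = pixel_bits + tree_delay      # all input digits plus all padding zeroes
    return best, worst
```

Calling `sad_period(4)`, `sad_period(8)` and `sad_period(16)` reproduces the three ranges given above.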
The maximum clock frequency of the global system is determined either by the delay of the comparator or by that of the adder (although both values are similar), depending on the FPGA family, since the basic cells are slightly different. The area is mainly occupied by the absolute value blocks and the adder tree, due to the large number of operands for this block size. The general performance of these implementations is shown in Table 4, where the number of SADs per second and the number of frames per second (fps) are given for an image of 640×480 pixels per frame.

Table 2: Area and clock frequency corresponding to different FPGA implementations.

                                SPARTAN-II   VIRTEX-II
Area (4-input LUTs)
  4x4 block (16 pixels)         246          241
  8x8 block (64 pixels)         603          595
  16x16 block (256 pixels)      1982         1945
Maximum frequency (MHz)         231.24       424.99

Table 3: Distribution of LUTs and delay in the 16×16 SAD processor.

                            Time delay (ns)            Area
Part                        SPARTAN-II   VIRTEX-II     LUTs    %
Absolute difference         3.675        1.839         1024    52.7%
Adder-tree                  4.325        2.353         768     39.5%
Comparator                  4.887        2.048         6       0.3%
Control and connectivity    -            -             146     7.5%
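The SAD throughput figures reported in Table 4 appear to correspond to the maximum clock frequency of Table 2 divided by the worst-case period between consecutive SAD computations given in Section 4. The following sketch checks that reading; the dictionary names are hypothetical, and the frames-per-second column is not modelled because the number of candidates per frame is not stated.

```python
# Worst-case SAD throughput = maximum clock frequency / worst-case cycles per SAD.

MAX_FREQ_MHZ = {"SPARTAN-II": 231.24, "VIRTEX-II": 424.99}   # from Table 2
WORST_CYCLES = {"4x4": 16, "8x8": 20, "16x16": 24}           # from Section 4


def sads_per_second(freq_mhz, cycles_per_sad):
    """Throughput in millions of SADs per second."""
    return freq_mhz / cycles_per_sad


for family, freq in MAX_FREQ_MHZ.items():
    for block, cycles in WORST_CYCLES.items():
        rate = sads_per_second(freq, cycles)
        print(f"{family:10s} {block:6s} {rate:6.2f} million SADs per second")
```

Under this assumption the computed values agree with the SAD columns of Table 4 to within rounding.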

Table 4: Number of SAD calculations and frames per second.

                            SPARTAN-II                        VIRTEX-II
Block size   Window size    SAD (millions/s)   fps            SAD (millions/s)   fps
4x4          8x8            14.45              77.08          26.56              141.66
8x8          16x16          11.56              30.50          21.25              56.06
16x16        32x32          9.64               9.56           17.71              17.57

5 Early termination of SAD calculation

Several video sequences have been processed to estimate the number of clock cycles saved. The parameters used are:

- 16x16 block size.
- 24x24 search window.
- Full-search block matching algorithm.
- 150 frames of each video have been evaluated.

The traditional model shown in Figure 5 uses a single final comparator for the SAD comparison. A new model is proposed (Figure 7) which introduces several comparison levels into the adder tree to evaluate partial SAD information. The partial SADs of 64 pixels or 128 pixels of a 16x16 block may already be greater than the reference SAD; if so, the SAD calculation can be stopped before running the full number of cycles, which cannot be done with the traditional model. The benefit of this property is quantified in the present section. The added cost of the new model is the area occupied by six new comparators. Nevertheless, each comparator requires only 6 LUTs, so the six together (36 LUTs) account for less than 2% of the final area.

[Figure 7: New comparators for partial SAD comparison. Four 64-pixel adder trees, each followed by a comparator (64 pixels processed level); two comparators at the 128 pixels processed level; one comparator at the 256 pixels processed level.]

Figure 8 shows the results obtained for three versions of the implemented algorithm: one with only the final comparator at the 256 pixels processed level, called C256P; one with the final comparator plus two comparators at the 128 pixels processed level, called C128P; and one with the final comparator plus two comparators at the 128 pixels processed level and four comparators at the 64 pixels processed level, called C64P. The videos tested were:

- hall monitor.mpeg
- flower.mpeg
- tennis.mpeg
- coast guard.mpeg

The number of clock cycles saved with the C64P model ranges from 4.5% to 13%, in contrast to the conventional C256P model with only one comparator, which saves between 3.3% and 4.53% of the clock cycles. Introducing partial comparators therefore improves the efficiency of the system.
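The idea behind the partial comparators can be modelled in software: because every absolute difference is non-negative, a partial SAD that already exceeds SADr guarantees that the full SAD does too. The sketch below is a behavioural illustration only. It assumes that each 64-pixel tree covers one 8×8 quadrant of the 16×16 block and that ties are accepted; neither detail is specified in the text, and the function name and return values are hypothetical. It does not model cycle counts.

```python
# Behavioural model of partial-SAD rejection at the 64, 128 and 256 pixel levels.
# `cand` and `ref` are 16 x 16 blocks given as lists of rows.

def sad_with_partial_checks(cand, ref, sad_ref):
    """Return (sad, level). `level` is "64", "128" or "256" when the candidate
    is rejected at that comparison level (sad is then None), else "accepted"."""

    def quadrant_sad(row0, col0):
        # Assumption: one 64-pixel tree per 8 x 8 quadrant.
        return sum(abs(cand[row0 + i][col0 + j] - ref[row0 + i][col0 + j])
                   for i in range(8) for j in range(8))

    quads = [quadrant_sad(0, 0), quadrant_sad(0, 8),
             quadrant_sad(8, 0), quadrant_sad(8, 8)]
    if any(s > sad_ref for s in quads):
        return None, "64"

    halves = [quads[0] + quads[1], quads[2] + quads[3]]
    if any(s > sad_ref for s in halves):
        return None, "128"

    total = halves[0] + halves[1]
    if total > sad_ref:
        return None, "256"
    return total, "accepted"   # assumption: a tie with sad_ref is kept
```

A candidate rejected at the 64 or 128 level never reaches the final comparator, which is the source of the additional cycle savings reported for the C128P and C64P models.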

[Figure 8: Number of clock cycles saved]

6 Comparison with other works

In this section we compare our design to other recent works, the main ones being [13, 6] and [1, 5].

The use of online arithmetic to compute the minimum SAD was proposed in [13, 6] for ASIC implementation. An SD adder was used there for the computation of the differences, whereas our approach does not need such hardware, since we merge this computation with the SD conversion, saving both time and area. Note that since a difference computation is required for each pixel, the amount of hardware saved is considerable. The authors consider independent bit planes and compute the summation of absolute differences for independent planes, starting from the most significant digit. The mathematical basis for this procedure is not correct, since the absolute value of a signed-digit number is not equal to the summation of the absolute values of its weighted digits. This is due to the fact that each digit can be positive or negative. As a result, the authors obtain a motion vector which is not correct in most cases. For details see [15]. Our approach also considers bit planes. Moreover, we take into account the dependence between bit planes (carry propagation and correct calculation of the absolute value), which leads to obtaining the best motion vector.

On the other hand, an algorithm based on online arithmetic is proposed in [13, 6] for the SAD comparison. The two SD numbers are first converted to sign-magnitude format, and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two and relatively high complexity. The main advantage of our design is that no online delay is required for the comparison operation, thus speeding up computation. Furthermore, our design is based on a simpler method involving less hardware cost.

The authors do not provide enough data regarding their ASIC implementation to enable us to perform a quantitative comparison in terms of area and delay. According to [13, 6], the cycle time corresponds to one SD adder plus one 2-to-1 MUX, one AND gate and one three-input OR gate. The cycle time of our design is only one SD adder. Although our design is intended for an FPGA implementation, we estimate that an ASIC implementation of our design would significantly improve on the performance of the design in [13, 6].

In [1], the computation of the SAD for 16 pixels (SAD16), which is equivalent to a macroblock row for MPEG, is implemented on an FPGA device. The design is based on carry-save adders which perform the computation in parallel over all the digits of the data. According to the authors, the design is synthesized using FPGA Express from Synopsys targeting the FLEX20KE family from Altera, obtaining an area of 1699 LUTs and a maximum frequency of 197 MHz, with a latency of 19 cycles (96 ns).
The estimated bandwidth for this design is 50.4 Gbps and the estimated throughput is 197 million SADs per second. The results of our implementation on the VIRTEX-II family are used for comparison, since it provides similar performance. The worst case for our equivalent design (4×4, or 16 pixels) occurs when a new minimum SAD is found; 21 cycles are then required to complete the full process (see Section 4), that is, to compute SAD16, compare the result with the previous minimum and store it. This takes 49 ns at a frequency of 425 MHz. The bandwidth of our design is 27.2 Gbps, which is less than in [1], since data are transmitted serially. As shown in Table 4, the throughput is 26.56 million SADs per second, which is about seven times less than in [1]. In exchange, our design requires only 241 LUTs, which is about seven times less area than in [1].

However, current compression standards require 16×16 blocks (and also 8×8 for MPEG-4). The authors of [1] state briefly how to extend the design to compute a 16×16 SAD in two ways. The first one is based on using 16 SAD16 units (one for each row) and a final adder tree. They estimate that 27 clock cycles are required. Nevertheless, the number of LUTs for this design is close to 30000, which does not seem feasible for current FPGA devices. Our 16×16 design requires only 1945 LUTs, which is easily implemented on a single FPGA device. The second approach presented in [1] is based on reusing the SAD16 units to compute the SAD of all 16 rows, which are buffered and finally added up. This involves 42 clock cycles and a larger area, due to buffering and to the fact that longer binary data (16 bits instead of 12 bits) must be supported. Moreover, the intrinsic pipeline behavior of the SAD16 units is lost. For a similar area, our design computes a SAD every 24 cycles in the worst case (including the comparison, see Section 4).

On the other hand, the solution proposed in [5] involves the use of four Altera STRATIX EP1S80 devices with 1234 I/O pins. This design uses 7765 LCs and requires 29 cycles for a SAD computation at 380 MHz. This means that our design obtains better performance regarding time while requiring far less hardware.

We would like to emphasize that the previous comparisons refer to our worst case (16 cycles for 4×4 SAD and 24 cycles for 16×16 SAD).
However, in the best case the candidate SAD is rejected as soon as its MSD has been analysed; this involves only 9 cycles for 4×4 SAD and 17 cycles for 16×16 SAD (see Section 6). Moreover, the timing results used for our design include the comparison operation (which involves a few more clock cycles due to carry propagation), whereas the designs in [1] and [5] do not include this operation time.
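
The figures quoted in this section can be cross-checked with a few lines of arithmetic. The Python sketch below only recomputes numbers already given in the text; nothing in it is measured. The function names are illustrative, and two readings are assumptions on our part: that the 197 million SADs per second reported for [1] corresponds to one SAD16 per clock cycle, and that our 26.56 million SADs per second corresponds to one SAD every 16 cycles at 425 MHz.

```python
def latency_ns(cycles: int, freq_mhz: float) -> float:
    """Time taken by `cycles` clock cycles at `freq_mhz`, in nanoseconds."""
    return cycles * 1e3 / freq_mhz


def throughput_msads(freq_mhz: float, cycles_per_sad: int) -> float:
    """Million SADs per second if one SAD completes every `cycles_per_sad` cycles."""
    return freq_mhz / cycles_per_sad


# Latency: 19 cycles at 197 MHz for [1], 21 cycles at 425 MHz for our 4x4 design.
ref_latency = latency_ns(19, 197)
our_latency = latency_ns(21, 425)

# Throughput (assumed initiation intervals: 1 cycle for [1], 16 cycles for ours).
ref_throughput = throughput_msads(197, 1)
our_throughput = throughput_msads(425, 16)
throughput_ratio = ref_throughput / our_throughput

# Area: 1699 LUTs in [1] against 241 LUTs for our 4x4 design.
area_ratio = 1699 / 241

# Lower bound for the 16-unit extension of [1], before adding the adder tree.
luts_16_units = 16 * 1699

print(f"latency    [1]: {ref_latency:.1f} ns, ours: {our_latency:.1f} ns")
print(f"throughput [1]: {ref_throughput:.2f}, ours: {our_throughput:.2f} MSAD/s")
print(f"throughput ratio: {throughput_ratio:.1f}, area ratio: {area_ratio:.1f}")
print(f"16 SAD16 units: at least {luts_16_units} LUTs")
```

Under these assumptions the latencies come out at roughly 96 ns and 49 ns, both ratios are close to seven, and 16 SAD16 units alone already need more than 27000 LUTs, which is consistent with the estimate of close to 30000 once the adder tree is included.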

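The objection raised in Section 6 against the independent bit-plane procedure of [13, 6] can be illustrated with a small numeric counterexample. The sketch below assumes radix-2 signed digits in {-1, 0, 1}, listed most significant digit first; the function names and the sample number are hypothetical, and the code reproduces neither the hardware of [13, 6] nor the analysis in [15]. It only shows that taking the absolute value digit by digit differs from taking the absolute value of the whole number.

```python
def sd_value(digits):
    """Value of a radix-2 signed-digit integer, most significant digit first."""
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("digits must be -1, 0 or 1")
        value = 2 * value + d
    return value


def per_plane_abs(digits):
    """Sum of |digit| * weight: the absolute value taken plane by plane."""
    n = len(digits)
    return sum(abs(d) * 2 ** (n - 1 - i) for i, d in enumerate(digits))


x = [1, -1, 0, -1]            # 8 - 4 + 0 - 1 = 3
true_abs = abs(sd_value(x))   # absolute value of the whole number: 3
plane_abs = per_plane_abs(x)  # 8 + 4 + 0 + 1 = 13

print(true_abs, plane_abs)
```

The two results differ (3 against 13) because the negative digits cancel part of the positive ones in the true value, which is the dependence between bit planes that the correct computation has to track.
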
7 Conclusion

An FPGA implementation of a motion estimation core based on the computation of the minimum SAD has been presented in this paper. The proposed core can be integrated with a full search algorithm or with any more efficient search strategy. The computation is carried out using online arithmetic. The different operations involved in the SAD computation have been efficiently adapted to online arithmetic, and a new comparator design with no online delay has been proposed. This allows us to implement the design on a single FPGA device. The proposed core can speed up the computation by terminating the SAD calculation early when the candidate involved is bigger than the current SAD reference. Furthermore, the FPGA implementation makes it possible to reconfigure the hardware to deal with 8×8 and 16×16 pixel blocks, according to the MPEG-4 standard requirements. We present the implementation's delay and area details for 4×4, 8×8 and 16×16 pixel blocks. We also provide comparisons with other current related works, demonstrating the advantages of our design.

References

[1] S. Wong, S. Vassiliadis, S. Cotofana, "A Sum of Absolute Differences Implementation in FPGA Hardware", 28th Euromicro Conference (EUROMICRO'02), pp. 183-188, Dortmund, Germany, 2002.
[2] J. Kim, S. Byun, Y. Kim, B. Ahn, "Fast Full Search Motion Estimation Algorithm Using Early Detection of Impossible Candidate Vectors", IEEE Trans. on Signal Processing, vol. 50, pp. 2355-2365, Sep. 2002.
[3] Y. Chan, W. Siu, "An efficient Search Strategy for Block Motion Estimation Using Image Features", IEEE Trans. on Image Processing, vol. 10, pp. 1223-1238, Aug. 2001.
[4] S. B. Pan, S. S. Chae and R. H. Park, "VLSI Architecture for Block matching Algorithms using Systolic Arrays", IEEE Trans. Circuits Syst. Video Tech., vol. 6, pp. 67-73, Feb. 1996.

[5] S. Wong, B. Stougie, S. Cotofana, "Alternatives in FPGA-based SAD Implementations", Proc. IEEE International Conf. on Field-Programmable Technology, pp. 449-452, 2002.
[6] C. Su and C. Jen, "Motion Estimation using MSD-first Processing", IEE Proc. Circuits Devices System, vol. 150, no. 2, pp. 124-133, 2003.
[7] M. Ercegovac and T. Lang, "On-line Arithmetic for DSP Applications", 32nd Midwest Symposium on Circuits and Systems, pp. 365-368, 1989.
[8] M. D. Ercegovac and T. Lang, "On-line Arithmetic: a Design Methodology and Applications in Digital Signal Processing", in VLSI Signal Processing III, pp. 252-263, 1988. Reprinted in E. E. Swartzlander, Computer Arithmetic, Vol. 2, IEEE Computer Society Press Tutorial, Los Alamitos, CA, 1990.
[9] D. Lau, A. Schneider, M. D. Ercegovac, J. Villasenor, "FPGA-based Structures for Online FFT and DCT", Proc. 7th IEEE Symposium Field-Programmable Custom Computing Machines, pp. 310-311, 1999.
[10] S. Rajagopal, J. Cavallaro, "On-line Arithmetic for Detection in Digital Communication Receivers", 15th IEEE Symposium on Computer Arithmetic, pp. 257-265, 2001.
[11] R. McIlhenny, M. D. Ercegovac, "On the Design of an On-line FFT Network for FPGA's", 33rd Asilomar Conference on Signals, Systems, and Computers, vol. 2, pp. 1484-1488, 1999.
[12] A. Avizienis, "Signed Digit Number Representation for Fast Parallel Arithmetic", IRE Trans. Electron. Comput., vol. EC-10, pp. 389-400, 1961.
[13] C. Su and C. Jen, "Motion Estimation Using On-Line Arithmetic", IEEE Int. Symposium on Circuits and Systems (ISCAS-2000), pp. 683-686, May 28-31, 2000.
[14] J. Hormigo, J. Olivares, J. Villalba, I. Benavides, "New On-line Comparator with no Online Delay", 8th World Multiconference on Systemics, Cybernetics and Informatics, 2004.
[15] J. Villalba, J. Hormigo, "Analysis of the Mistakes in the Paper 'Motion Estimation using MSD-first Processing', IEE Circ., Dev. & Syst, vol. 150, no. 2, April 2003", Internal Report, Depart. Computer Architecture, University of Málaga, Dec. 2004, http://www.ac.uma.es/cgibin/htgrep/pubsearch.cgi?isindex=villalba