SAD computation based on online arithmetic for motion estimation


J. Olivares a, J. Hormigo b, J. Villalba b, I. Benavides a and E. L. Zapata b

a Dept. of Electrics and Electronics, University of Córdoba, Spain, {olivares, el1bebej}@uco.es
b Dept. of Computer Architecture, University of Málaga, Spain, {hormigo, julio, ezapata}@ac.uma.es

Abstract

Block-based motion estimation is one of the critical tasks in today's video compression standards such as H.26x, MPEG-1, -2 and -4. Most block-based motion estimation algorithms are based on computing the Sum of Absolute Differences (SAD) between corresponding elements in the candidate and reference blocks. In this paper an FPGA design is proposed for rapidly computing the minimum SAD. Two goals are achieved through the use of online arithmetic: a full macroblock SAD can be implemented in a single FPGA device; and computation is sped up by early termination of the SAD calculation when the candidate SAD is bigger than the current reference SAD. Reconfigurable devices enable us to switch between 8×8 and 16×16 pixels per block quickly and easily. For a 16×16 SAD unit, 1945 look-up tables (LUTs) are required at 425 MHz. A comparison with other related works is provided.

Keywords: motion estimation, FPGA, sum of absolute differences, online arithmetic

1 Introduction

Motion estimation (ME) plays an important role in today's video coding and processing systems, since motion vectors provide critical information for temporal redundancy reduction. It has been widely used in the H.26x, MPEG-1, -2 and -4 video compression standards. Motion estimation is defined as searching for the best motion vector, which is the displacement of the coordinates of the most similar block in the previous frame relative to the block in the current frame. Full search block matching is the most popular algorithm to perform ME, and it searches through every candidate location to find the best match. To do this, the current frame is partitioned into two-dimensional blocks (typically 8×8 or 16×16 pixel blocks) and a search window in the reference frame is defined. Each block of the current frame is compared with all the blocks of a previous frame within the same window. The final motion vector corresponds to the block with minimum distortion within the search window. The most commonly used metric to calculate the distortion is the Sum of Absolute Differences (SAD) [1], which adds up the absolute differences between corresponding elements in the candidate and reference block.

The heavy computational cost of block matching algorithms (BMA) can be a significant problem in real-time coding applications. To reduce computational complexity many fast algorithms have been proposed, which search a subset of candidate blocks [2][3]. Besides this, different architectures have been designed to speed up the associated massive arithmetic calculation [4][1].

However, the need for specialized hardware conflicts with the flexibility demanded by current video coding systems. A feasible solution to this problem is to use a programmable processor core along with a field programmable gate array (FPGA) device which is in charge of performing critical tasks. The reasons for using FPGAs include the following advantages: increased flexibility and rapid adaptation to new developments; appropriate performance; and faster design times achieved by re-using IP cores and high-level design languages (such as VHDL).
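As a software point of reference for the full search block matching described above, the following is a minimal sketch in Python. It is not the hardware design proposed in this paper; the function names `sad` and `full_search`, the frame layout (lists of rows) and the `search` radius parameter are assumptions made for illustration only.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
    return total


def full_search(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search]^2 that keeps the candidate
    block inside `ref` and return (best motion vector, minimum SAD)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue  # candidate block falls outside the reference frame
            s = sad(cur, ref, bx, by, dx, dy, n)
            if best_sad is None or s < best_sad:
                best_mv, best_sad = (dx, dy), s
    return best_mv, best_sad
```

Every candidate costs N² absolute differences and additions, which is the workload that the SAD processor is meant to take over.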
In this context, our design is intended to speed up computation of the minimum SAD by implementing it in an FPGA (SAD processor in Figure 1), while a data dispatcher supplies the reference and candidate blocks to the FPGA device. An FPGA architecture to compute the minimum SAD is proposed in this paper. This design can be integrated with any BMA (full search or another efficient search strategy).

Figure 1: Motion estimation system (blocks: dispatcher, SAD processor, core processor; signals: candidate block, reference block, minimum SAD, motion vector).

Despite the parallelism inherent to SAD, a fully parallel implementation has proved difficult, since it requires a large number of operands for typical block sizes (an 8×8 pixel block requires 128 8-bit operands, and a 16×16 pixel macroblock needs 512 8-bit operands). Due to the large amount of hardware, only the computation of the SAD on one row of a macroblock (16×1) is implemented on an FPGA device in [1], whose authors propose replicating or pipelining the design to obtain the 16×16 computation. Four FPGA chips with 1234 I/O pins each are used in [5] for a completely parallel design. On the other hand, the use of online arithmetic for motion estimation is proposed in [6] to speed up the computation by early termination of the SAD calculation. A serial architecture (pixel by pixel) for 4×4 blocks is proposed in [6], based on an ASIC implementation.

This paper is organized as follows: in Section 2 a brief description of online arithmetic techniques is provided; in Section 3 we deal with the computation of the minimum SAD using online arithmetic; Section 4 presents the implementation of the proposed design in FPGA devices; the results of several simulations are shown in Section 5 to illustrate the clock cycles saved with early termination; a comparison with other works is described in Section 6; and finally, the most relevant results of this paper are summarized in Section 7.

2 Online arithmetic

Online arithmetic techniques have been proposed as a solution to many signal processing problems, such as digital filtering, Fourier transform, and others [7]-[10]. Recent works have shown the suitability of online arithmetic for FPGA designs [11]. The basic idea of online arithmetic is to perform computations which overlap with the digit-by-digit communication of operands and results [7]. Online algorithms operate in a digit-serial manner, beginning with the most significant digit (MSD). To generate the first digit of the result, δ + 1 digits of the input operands are needed. Thus, after δ digits of the operands are received, a new digit of the result is obtained for each new digit of the operands. For this reason, δ is known as the online delay. Due to the online delay, after the last digits of the inputs are introduced into the system, a number of zero digits equal to the online delay have to be introduced to ensure a correct result.

The most-significant-digit-first mode of computation requires flexibility in computing digits on the basis of partial information about the inputs. This is achieved by using a redundant representation system. In a redundant representation with radix r, each digit has more than r possible values. This permits several representations of a given value. Therefore, there is flexibility in choosing an output digit at a given step, so that a compensation can be introduced later if needed. A signed-digit (SD) representation system [12] is used in this paper. In radix-2 SD representation, the digit set is {-1, 0, 1}. Two bits are required to represent each digit, as shown in Table 1. The first bit is negatively weighted and the second one is positively weighted. This number representation system eliminates the long carry propagation chains in the addition operation, although it requires the carry of the two previous digits.

In short, the advantages of using online arithmetic are as follows: it reduces the number of signal lines connecting modules due to its digit-serial character; the MSD-first computation allows subsequent calculations to start at a much earlier stage; and it eliminates carry propagation chains, since it uses a redundant number representation system.
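The redundancy of the radix-2 signed-digit system can be illustrated with a short Python sketch. The bit-pair table follows the encoding stated in the text (first bit negatively weighted, second bit positively weighted); the names `DIGIT_FROM_BITS`, `decode` and `sd_value` are hypothetical helpers introduced only for this illustration.

```python
# (negative bit, positive bit) -> digit value; both 00 and 11 encode zero.
DIGIT_FROM_BITS = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): -1}


def decode(bit_pairs):
    """Turn a list of (negative bit, positive bit) pairs into SD digits."""
    return [DIGIT_FROM_BITS[pair] for pair in bit_pairs]


def sd_value(digits):
    """Integer value of an MSD-first list of radix-2 signed digits."""
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("radix-2 SD digits must be -1, 0 or 1")
        value = 2 * value + d
    return value


# Three different digit strings that all represent the value 5.
assert sd_value([0, 1, 0, 1]) == 5
assert sd_value([1, -1, 0, 1]) == 5
assert sd_value([0, 1, 1, -1]) == 5
```

Because several digit strings share one value, an MSD-first unit can commit to an output digit early and compensate with later digits, which is the flexibility the paragraph above refers to.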

3 Online computation of the minimum SAD

The goal of our FPGA design is to find which of the candidate blocks (supplied by the dispatcher) best matches the reference block. The most commonly used metric to determine the best match is the Sum of Absolute Differences (SAD). Thus, our design computes the minimum SAD from among all the candidate blocks. To do this, a search iteration is performed for each candidate block. During each search iteration, the SAD corresponding to a candidate block is computed using all its pixels simultaneously. The value obtained is compared with the reference SAD (SADr), which is the minimum SAD computed before the current iteration. If the current SAD (SADc) is less than SADr, it is stored as SADr for the remaining search iterations.

Both the SAD computation and the comparison operation are performed using online arithmetic techniques. This allows us to begin the comparison as soon as the first digit of the SAD is obtained and to stop the computation early if the digits computed are sufficient to ensure that SADc is greater than SADr.

3.1 Online SAD computation

The SAD adds up the absolute differences between corresponding elements in the candidate and reference block

\[ \mathrm{SAD} = \sum_{i=1}^{N} \sum_{j=1}^{N} |c_{i,j} - r_{i,j}| \tag{1} \]

where $r_{i,j}$ are the elements of the reference block, and $c_{i,j}$ the elements of the candidate block.

Table 1: Digit codification in radix-2 signed-digit representation

Digit value    Digit representation
 0             00 or 11
 1             01
-1             10

Thus, the computation of the SAD is divided into three steps:
- Compute the differences between corresponding elements, $d_{i,j} = c_{i,j} - r_{i,j}$
- Determine the absolute value of each difference, $|d_{i,j}|$
- Add all absolute values

We now describe how each of these operations is performed using online arithmetic, and how the pixel values are converted into radix-2 SD representation.

Conversion to SD representation and difference computation: In radix-2 signed-digit representation, each digit is composed of two bits, the first one negatively weighted and the second positively weighted. Thus, a signed-digit number can be interpreted as the difference between two unsigned numbers: the one composed of the positively weighted bits of each digit, minus the one composed of the negatively weighted bits. In fact, this difference must be computed to convert an SD number into a non-redundant representation. This property is used to simultaneously convert each pixel value into SD representation and compute the difference between the pixels of the reference block and the current block at no computational cost. In this way, each digit of the value $d_{i,j} = c_{i,j} - r_{i,j}$ is obtained in SD representation simply by taking the corresponding bit of $c_{i,j}$ as the positively weighted one and the corresponding bit of $r_{i,j}$ as the negatively weighted one, since $c_{i,j}$ and $r_{i,j}$ are unsigned numbers.

Absolute value: To compute the absolute value of $d_{i,j}$, the sign of this value has to be changed if $d_{i,j}$ is negative. In SD representation, the negation operation is performed by exchanging both bits of each digit. Since the MSD-first mode of computation is being used, the sign detection of $d_{i,j}$ is performed on-the-fly by checking whether the first non-zero digit of $d_{i,j}$ is positive (01) or negative (10). The digits of $d_{i,j}$ are received in MSD-first mode and go directly to the output when they are zero (00 or 11). If the first non-zero digit received is positive (01), this and all the remaining digits go directly to the output. If, however, the first non-zero digit received is negative (10), the bits of this and all the remaining digits are interchanged to obtain the output. The absolute value operation is performed with no online delay.
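The two steps just described (difference by bit pairing, then on-the-fly absolute value) can be modelled in a few lines of Python. This is a behavioural sketch only, not the FPGA logic; the 8-bit pixel width and the helper names `to_bits`, `sd_difference`, `online_abs` and `sd_pairs_value` are assumptions introduced here.

```python
def to_bits(value, width=8):
    """MSD-first list of the bits of an unsigned integer."""
    return [(value >> (width - 1 - k)) & 1 for k in range(width)]


def sd_difference(c, r, width=8):
    """SD digits of c - r as (negative bit, positive bit) pairs: the bit of
    r is the negatively weighted one, the bit of c the positively weighted."""
    return list(zip(to_bits(r, width), to_bits(c, width)))


def online_abs(pairs):
    """Digit-serial absolute value with no delay: once the first non-zero
    digit is seen to be negative (10), swap the bits of it and all later digits."""
    out = []
    sign_known = False
    negate = False
    for neg, pos in pairs:
        if not sign_known and neg != pos:  # first non-zero digit
            sign_known = True
            negate = (neg, pos) == (1, 0)
        out.append((pos, neg) if negate else (neg, pos))
    return out


def sd_pairs_value(pairs):
    """Integer value of an MSD-first list of (negative, positive) bit pairs."""
    value = 0
    for neg, pos in pairs:
        value = 2 * value + (pos - neg)
    return value


assert sd_pairs_value(sd_difference(37, 200)) == 37 - 200
assert sd_pairs_value(online_abs(sd_difference(37, 200))) == 163
```

Each output digit depends only on the current input digit and one bit of state (whether a negative leading digit has been seen), which is consistent with the absolute value having no online delay.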

Figure 2: Online design for the sum of the absolute differences (inputs A to H, each as a positive and a negative bit; adder levels AB, CD, EF, GH, then AD, EH, then AH).

Sum of absolute differences: The absolute difference of all the pixels corresponding to the current and reference blocks is computed in parallel. Thus, $N^2$ absolute difference blocks are required. An online adder tree is used to obtain the sum of all $|d_{i,j}|$ values. In Figure 2 this structure is shown for 4×4 pixels per block (N = 4). Each online adder in this figure corresponds to a standard SD online adder (see Figure 3). The number of addition steps of the complete adder tree is $\log_2(N^2)$. In radix-2 signed-digit representation, the online delay of the addition is two, i.e., the MSD of the result is obtained two cycles after the MSD of the inputs has been sent to the adder. Nevertheless, in our case the carry bit is used as the MSD of the result and this digit is obtained one cycle earlier. Therefore, the online delay of the complete adder tree is $2\log_2(N^2)$, but the first digit of the result is obtained $\log_2(N^2)$ cycles earlier.

Figure 3: Online adder design (inputs $X_{j+3}$ and $Y_{j+3}$ as positive and negative bits, two full adders (FA) with registers, output $Z_j$).

3.2 Signed-digit online comparison

Once the first digit of the SAD corresponding to the current block is obtained, the comparison between the current SAD and the minimum SAD can begin. Thanks to the fact that the MSD-first mode of computation is used, an efficient comparison algorithm can be applied. Nevertheless, since SD representation allows several representations for a given value, the comparison operation between two values is not as simple as in conventional representations.

In [13, 6] a comparison algorithm and its hardware implementation are proposed. The two SD numbers are first converted to sign-magnitude format and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two. We propose a comparison algorithm with no online delay. This is based on the analysis of the sign of the difference between the two values to be compared. Thus, the online delay of two that an explicit subtraction operation would introduce is avoided.

Let us define the SD numbers A and B, where

\[ A = \sum_{i=0}^{n-1} a_i 2^i, \quad a_i \in \{-1, 0, 1\} \tag{2} \]

and B has a similar expression. The result of the operation A - B is

\[ A - B = \sum_{i=0}^{n-1} (a_i - b_i) 2^i, \quad a_i, b_i \in \{-1, 0, 1\} \tag{3} \]

Let R be the result of the difference

\[ R = A - B = \sum_{i=0}^{n-1} r_i 2^i, \quad r_i \in \{-2, -1, 0, 1, 2\} \tag{4} \]

Let us assume that when using an online comparator, the sign of R can be determined at digit k, if the partial accumulated sum $R_k$ complies with

\[ |R_k| = \left| \sum_{i=k}^{n-1} r_i 2^i \right| \geq 2^{k+1} \tag{5} \]

Given the previous definition of $R_k$, R can be redefined as

\[ R = A - B = R_k + \sum_{i=0}^{k-1} r_i 2^i \tag{6} \]

Since

\[ \left| \sum_{i=0}^{k-1} r_i 2^i \right| \leq 2 \sum_{i=0}^{k-1} 2^i < 2^{k+1} \tag{7} \]

it is proved that

\[ R_k \geq 2^{k+1} \Rightarrow R > 0 \tag{8} \]

\[ R_k \leq -2^{k+1} \Rightarrow R < 0 \tag{9} \]

If the condition in equation (5) is not met for any k > 0, the sign of R cannot be guaranteed until the last digit (k = 0). Let us define the normalized partial accumulated sum as $\hat{R}_k = R_k / 2^k$; the condition in equation (5) is then equivalent to

\[ |\hat{R}_k| = \left| \sum_{i=k}^{n-1} r_i 2^{i-k} \right| \geq 2 \tag{10} \]

The value $\hat{R}_k$ can be computed using an online recurrence (note that k ranges from n-1 down to 0)

\[ \hat{R}_k = 2 \hat{R}_{k+1} + (a_k - b_k) \tag{11} \]

The value $\hat{R}_k$ only depends on its previous value and the current digits; thus an online comparator, as well as minimum or maximum algorithms, can be implemented with no online delay based on this computation. An online comparator requires the value $\hat{R}_k$ to be computed in each iteration, starting at k = n-1 (MSD), until $|\hat{R}_k| \geq 2$ or k = 0. At this point, the decision is determined by the sign of $\hat{R}_k$.

Figure 4: State-flow diagram of the online comparator design (states $\hat{R}_k > 1$, $\hat{R}_k = 1$, $\hat{R}_k = 0$, $\hat{R}_k = -1$, $\hat{R}_k < -1$; transitions labelled with $a_k - b_k$).

In [14] we evaluate different hardware designs for the comparator. A faster implementation is obtained if the design is implemented as a state machine following the state-flow diagram represented in Figure 4. Each state represents a possible value of $\hat{R}_k$, i.e., equal, possibly greater, greater, possibly less or less. The transitions between states are determined by the digits $a_k$ and $b_k$. The design used in this paper is a simplification of this state machine.
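The recurrence in equation (11) and the stopping rule in equation (10) can be checked with a small behavioural model in Python. This is a sketch of the arithmetic only, not the state-machine implementation of [14]; the function name `online_compare` and its return convention are assumptions made for this illustration.

```python
def online_compare(a_digits, b_digits):
    """Compare two equal-length MSD-first radix-2 SD numbers A and B.

    Returns (sign, consumed): sign is +1 if A > B, -1 if A < B, 0 if equal,
    and consumed is the number of digit pairs read before the decision."""
    if len(a_digits) != len(b_digits):
        raise ValueError("operands must have the same number of digits")
    r = 0          # normalized partial accumulated sum
    consumed = 0
    for a, b in zip(a_digits, b_digits):
        r = 2 * r + (a - b)   # online recurrence, one step per digit pair
        consumed += 1
        if abs(r) >= 2:       # sign can no longer change: stop early
            break
    sign = (r > 0) - (r < 0)
    return sign, consumed


# 8 versus -4: decided after two digits.
assert online_compare([1, 0, 0, 0], [0, -1, 0, 0]) == (1, 2)
# Two different SD strings for the value 5: equality needs every digit.
assert online_compare([1, -1, 0, 1], [0, 1, 0, 1]) == (0, 4)
```

Before the stopping condition is met the accumulated value stays in {-1, 0, 1}, which matches the three undecided states (possibly less, equal, possibly greater) described for Figure 4.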

4 FPGA implementation of the SAD processor

Figure 5 presents the architecture of the design corresponding to the SAD processor. The absolute value of the difference is computed for each pair of pixels ($|c_{i,j} - r_{i,j}|$) and their summation is calculated by the $N^2$-operand adder. The result is stored digit by digit in the SADc register and is simultaneously compared with the corresponding digit of SADr in the comparator (COMP). If at any cycle the condition SADc > SADr is detected, the computation is stopped and a new candidate block is requested. Otherwise, if the condition SADc < SADr is verified, SADc is stored in SADr once the least significant digit of the SAD has been calculated.

Figure 5: SAD processor architecture (absolute difference units for each pixel pair, $N^2$-operand adder, SADc and SADr registers, comparator COMP with stop and min outputs).

The timing of the computation for the 4×4 SAD processor is shown in Figure 6. In each cycle, the outputs corresponding to the absolute value block ($|c_{i,j} - r_{i,j}|$), each of the four steps in the adder tree (Σ(i)), and the comparator (COMP) are represented. Regarding the comparator, the figure does not really show its output, but rather the last digit used for the comparison. The zero digits represent the zero values which have to be introduced into the input due to the system's online delay. Since each addition has an online delay of two, and the absolute value blocks and the comparator have no online delay, eight zeroes are required in this case. The worst case occurs when a new minimum SAD is found, and then 21 cycles are required for the full process, where the last cycle is run to store SADc in SADr. However, as Figure 6 shows, a new SAD computation can start after 16 cycles (after the 8 digits and 8 zeroes are introduced), so this is the maximum period between two consecutive SAD computations. This period is reduced if the candidate SAD is rejected earlier. In the best case, this happens after analysing the MSD of the candidate SAD, i.e., after 9 cycles. Therefore, the number of cycles for a SAD computation and comparison is between 9 and 16 for a 4×4 SAD processor. This period ranges from 13 to 20 cycles for an 8×8 block, and from 17 to 24 cycles for a 16×16 block.

Figure 6: Timing of the 4×4 SAD processor (rows: $|c_{ij} - r_{ij}|$, Σ(1) to Σ(4), COMP; best case 9 cycles, worst case 21 cycles, a new computation can start after 16 cycles).

The design has been implemented on the Xilinx SPARTAN-II and VIRTEX-II FPGA families for three different block sizes. For compilation, simulation and implementation, we use the Xilinx ISE Series 5.2i. The main results of the implementation are shown in Table 2. The area/number of pixels ratio is relatively low, due to the digit-serial character of online computation. The maximum clock frequency is independent of block size because when the number of operators increases, only the number of parallel operations and the number of steps in the adder tree increase. Although this value strongly depends on the technology used (as shown in Table 2), our results are very promising.

Table 2: Area and clock frequency corresponding to different FPGA implementations (columns: SPARTAN-II, VIRTEX-II; rows: area in 4-input LUTs for block sizes 4x4 (16 pixels), 8x8 (64 pixels) and 16x16 (256 pixels), and maximum frequency in MHz).

Table 3 shows how the area and delay are distributed among the different parts of the design for the SAD processor. Note that the percentage given refers to the total number of LUTs of the SAD processor. The maximum clock frequency of the global system is determined either by the delay of the comparator or by that of the adder (although both values are similar), depending on the FPGA family, since the basic cells are slightly different. The area is mainly occupied by the absolute value blocks and the adder tree, due to the large number of operands for this block size.

Table 3: Distribution of LUTs and delay in the SAD processor (columns: time delay in ns for SPARTAN-II and VIRTEX-II, area in LUTs and %; rows: absolute difference, adder tree, comparator, control and connectivity).

The general performance of these implementations is shown in Table 4, where the number of SADs per second and the number of frames per second (fps) are given for an image of 640x480 pixels per frame.
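The cycle ranges quoted in this section (9 to 16, 13 to 20 and 17 to 24) are consistent with a simple rule: the adder tree contributes 2·log2(N²) cycles, the best case adds one digit and the worst case adds all pixel digits. The Python sketch below encodes that rule as an inference from the quoted numbers, not as a formula stated by the authors; `sad_cycles`, `sads_per_second` and the 8-bit pixel width are assumptions.

```python
def sad_cycles(n, pixel_bits=8):
    """Return (best, worst) cycles between consecutive SAD computations for
    an n x n block, assuming n is a power of two."""
    levels = (n * n).bit_length() - 1   # log2(n^2) adder-tree levels
    tree_delay = 2 * levels             # online delay of two per level
    best = 1 + tree_delay               # rejected on the MSD of the SAD
    worst = pixel_bits + tree_delay     # all digits plus the padding zeroes
    return best, worst


def sads_per_second(freq_hz, cycles):
    """SAD computations per second at a given cycle count per SAD."""
    return freq_hz / cycles


assert sad_cycles(4) == (9, 16)
assert sad_cycles(8) == (13, 20)
assert sad_cycles(16) == (17, 24)
```

For example, `sads_per_second(425e6, 16)` gives about 26.6 million worst-case 4×4 SADs per second at the 425 MHz clock quoted in the abstract; this figure is computed here and is not read from Table 4.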

Table 4: Number of SAD calculations and frames per second (columns: SAD in millions per second and fps, for SPARTAN-II and VIRTEX-II; rows: block size 4x4 with 8x8 window, 8x8 with 16x16 window, 16x16 with 32x32 window).

5 Early termination of SAD calculation

Several video sequences have been processed to estimate the number of clock cycles saved. The parameters used are:
- 16x16 block size.
- 24x24 search window.
- Full-search block matching algorithm.

A set of frames from each video has been evaluated.

The traditional model shown in Figure 5 uses a single final comparator for the SAD comparison. A new model is proposed (as shown in Figure 7), which introduces several comparison levels into the adder tree to evaluate partial SAD information. It is possible that partial SADs of 64 pixels or 128 pixels of a 16x16 block are already greater than the reference SAD; if so, the SAD calculation can be stopped before running the entire number of cycles, which cannot be done with the traditional model. This property is demonstrated in the present section. The added cost of the new model is the area occupied by six new comparators. Nevertheless, each comparator only requires 6 LUTs, so together they account for less than 2% of the final area.

Figure 7: New comparators for partial SAD comparison (four 64-pixel trees with four comparators at the 64 PIXELS PROCESSED LEVEL, two comparators at the 128 PIXELS PROCESSED LEVEL, and one comparator at the 256 PIXELS PROCESSED LEVEL).

Figure 8 shows the results obtained for three versions of the implemented algorithm: one with only one final comparator for the 256 PIXELS PROCESSED LEVEL, called C256P; one with a final comparator plus two comparators for the 128 PIXELS PROCESSED LEVEL, called C128P; and one with a final comparator plus two comparators for the 128 PIXELS PROCESSED LEVEL and four comparators for the 64 PIXELS PROCESSED LEVEL, called C64P. The videos tested were:
- hall monitor.mpeg
- flower.mpeg
- tennis.mpeg
- coast guard.mpeg

The number of clock cycles saved for the C64P model ranges from 4.5% to 13%, in contrast to the conventional C256P model with only one comparator, which saves between 3.3% and 4.53% of the clock cycles. Introducing partial comparators therefore improves the efficiency of the system.
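The reason partial comparators can reject a candidate is that every partial SAD is a sum of non-negative terms, so it never exceeds the full SAD. The following Python sketch models the C64P, C128P and C256P variants at word level; it ignores the digit-serial timing of the real design, and the names `partial_sads` and `reject_level` are hypothetical.

```python
def partial_sads(cand, ref, group):
    """Partial SADs over consecutive groups of `group` pixels
    (blocks are given as flat lists of pixel values)."""
    diffs = [abs(c - r) for c, r in zip(cand, ref)]
    return [sum(diffs[k:k + group]) for k in range(0, len(diffs), group)]


def reject_level(cand, ref, sad_ref, levels=(64, 128, 256)):
    """Return the first comparison level (pixels processed) at which some
    partial SAD exceeds the reference SAD, or None if never rejected.
    levels=(256,) models C256P, (128, 256) C128P, (64, 128, 256) C64P."""
    for group in levels:
        if any(p > sad_ref for p in partial_sads(cand, ref, group)):
            return group
    return None


ref_block = [0] * 256
cand_block = [10] * 64 + [0] * 192   # all the distortion sits in one quarter
assert reject_level(cand_block, ref_block, sad_ref=500) == 64
assert reject_level(cand_block, ref_block, sad_ref=500, levels=(256,)) == 256
```

A candidate whose distortion is concentrated in one quarter of the block is rejected at the 64-pixel level in this model, whereas the single final comparator has to wait for the full sum.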

Figure 8: Number of clock cycles saved

6 Comparison with other works

In this section we compare our design to other recent works, the main ones being [13, 6] and [1, 5].

The use of online arithmetic to compute the minimum SAD was proposed in [13, 6] for ASIC implementation. An SD adder was used for the computation of the differences, whereas our approach does not use such hardware, since we merge this computation with the SD conversion, saving both time and area. Note that since a difference computation is required for each pixel, the amount of hardware saved is considerable. The authors consider independent bit planes and compute the summation of absolute differences for independent planes, starting from the most significant digit. The mathematical basis for this procedure is not correct, since the absolute value of a signed-digit number is not equal to the summation of the absolute values of the different weighted digits. This is due to the fact that each digit can be positive or negative. This leads the authors to obtain a motion vector which is not correct in most cases. For details see [15]. Our approach also considers bit planes. Moreover, we take into account the dependence between bit planes (carry propagation and correct calculation of the absolute value), which leads to obtaining the best motion vector.

On the other hand, an algorithm based on online arithmetic is proposed in [13, 6] for the SAD comparison. The two SD numbers are first converted to sign-magnitude format, and then a standard comparison is used. The magnitude computation and comparison are performed on-the-fly in an MSD-first manner. Nevertheless, this comparator has an online delay of two and relatively high complexity. The main advantage of our design is that no online delay is required for the comparison operation, thus speeding up computation. Furthermore, our design is based on a simpler method involving less hardware cost.

The authors do not provide enough data regarding their ASIC implementation to enable us to perform a quantitative comparison in terms of area and delay. According to [13, 6], the cycle time corresponds to one SD adder plus one 2-to-1 MUX, one AND and one three-input OR gate. The cycle time of our design is only one SD adder. Although our design is intended for an FPGA implementation, we estimate that an ASIC implementation of our design would significantly outperform the design in [13, 6].

In [1], the computation of the SAD for 16 pixels (SAD16), which is equivalent to a macroblock row for MPEG, is implemented on an FPGA device. The design is based on carry save adders which perform the computation in parallel over all the digits of the data. According to the authors, the design is synthesized using FPGA Express from Synopsys targeting the FLEX20KE family from Altera, obtaining an area of 1699 LUTs and a maximum frequency of 197 MHz, with a latency of 19 cycles (96 ns).
The estimated bandwidth for this design is 50.4 Gbps and the estimated throughput is 197 million SADs per second. The results of our implementation using the VIRTEX-II family are used for comparison, since it provides similar performance. The worst case for our equivalent design (4×4 or 16 pixels) occurs when a new minimum SAD is found, and then 21 cycles are required to complete the full process (see Section 4); that is, to compute SAD16, compare the result with the previous minimum and store it; this lasts 49 ns at a frequency of 425 MHz. The bandwidth of our design is 27.2 Gbps, which is less than in [1], since data are serially transmitted. As shown in Table 4, the throughput in SADs per second is about seven times less than in [1]. Besides this, our design only requires 241 LUTs, which is seven times less area than in [1].

However, the current compression standard systems require 16×16 blocks (and also 8×8 for MPEG-4). The authors of [1] state briefly how to extend the design to compute a 16×16 SAD in two ways. The first one is based on using 16 SAD16 units (one for each row) and a final adder tree. They estimate that 27 clock cycles are required. Nevertheless, the number of LUTs for the design is close to 30000, which does not seem feasible for the current FPGA devices. Our design requires only 1945 LUTs, which is easily implemented on a single FPGA device. The second approach presented in [1] is based on reusing the SAD16 units to compute the SAD of all the 16 rows, which are buffered, to finally add them up. This involves 42 clock cycles with a larger area due to buffering and the fact that longer binary data (16 bits instead of 12 bits) must be supported. Moreover, the intrinsic pipeline behavior of the SAD16 units is eliminated. For a similar area, our design computes a 16×16 SAD every 24 cycles in the worst case (including the comparison, see Section 4).

On the other hand, the solution proposed in [5] involves the use of four Altera STRATIX EP1S80 devices with 1234 I/O pins. This design uses 7765 LCs and requires 29 cycles for a SAD computation at 380 MHz. This means that our design obtains better performance regarding time while requiring far less hardware.

We would like to emphasize that the previous comparisons refer to our worst case (16 cycles for the 4×4 SAD and 24 cycles for the 16×16 SAD). However, in the best case the candidate SAD is rejected after analysing its MSD; this involves only 9 cycles for the 4×4 SAD and 17 cycles for the 16×16 SAD (see Section 4). Moreover, the timing results used for our design include the comparison operation (which involves a few more clock cycles due to carry propagation), whereas the designs in [1] and [5] do not include this operation time.
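The latency, bandwidth and area figures quoted in this comparison can be cross-checked with basic arithmetic. The Python sketch below only reproduces numbers already given in the text; the helper names are hypothetical, and the bits-per-cycle values (256 for [1], read as 16 pixel pairs of 8 bits each, and 64 for our design) are assumptions back-computed from the quoted bandwidths rather than figures stated by the authors.

```python
def latency_ns(cycles, freq_mhz):
    """Latency in nanoseconds of `cycles` clock cycles at `freq_mhz`."""
    return cycles / freq_mhz * 1e3


def bandwidth_gbps(freq_mhz, bits_per_cycle):
    """Input bandwidth in Gbps for a given number of bits consumed per cycle."""
    return freq_mhz * 1e6 * bits_per_cycle / 1e9


# Worst-case 4x4 SAD of this design: 21 cycles at 425 MHz (about 49 ns).
ours_ns = latency_ns(21, 425)
# SAD16 design of [1]: 19 cycles at 197 MHz (about 96 ns).
theirs_ns = latency_ns(19, 197)

# Bandwidths quoted in the text, under the assumed bits per cycle.
theirs_gbps = bandwidth_gbps(197, 256)   # about 50.4
ours_gbps = bandwidth_gbps(425, 64)      # about 27.2

# Area ratio between the 1699 LUTs of [1] and the 241 LUTs of this design.
area_ratio = 1699 / 241                  # about 7
```

The area ratio of about seven matches the "seven times less area" stated above, and the two latencies match the 49 ns and 96 ns figures.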

7 Conclusion

An FPGA implementation of a motion estimation core based on the computation of the minimum SAD has been presented in this paper. The proposed core can be integrated with a full search algorithm or any more efficient search strategy. The computation is carried out by using online arithmetic. The different operations involved in the SAD computation have been efficiently adapted to online arithmetic, and a new comparator design with no online delay has been proposed. This allows us to implement the design on a single FPGA device. The proposed core can speed up the computation by early termination of the SAD calculation when the candidate SAD is bigger than the current reference SAD. Furthermore, the FPGA implementation of the design makes it possible to reconfigure the hardware to deal with 8×8 and 16×16 pixel blocks, according to the MPEG-4 standard requirements. We present the implementation's delay and area details for 4×4, 8×8 and 16×16 pixel blocks. We also provide comparisons with other current related works demonstrating the advantages of using our design.

References

[1] S. Wong, S. Vassiliadis, S. Cotofana, "A Sum of Absolute Differences Implementation in FPGA Hardware", 28th Euromicro Conference (EUROMICRO'02), Dortmund, Germany.
[2] J. Kim, S. Byun, Y. Kim, B. Ahn, "Fast Full Search Motion Estimation Algorithm Using Early Detection of Impossible Candidate Vectors", IEEE Trans. on Signal Processing, vol. 50, Sep.
[3] Y. Chan, W. Siu, "An Efficient Search Strategy for Block Motion Estimation Using Image Features", IEEE Trans. on Image Processing, vol. 10, Aug.
[4] S. B. Pan, S. S. Chae and R. H. Park, "VLSI Architecture for Block Matching Algorithms using Systolic Arrays", IEEE Trans. Circuits Syst. Video Tech., vol. 6, pp. 67-73, Feb.
[5] S. Wong, B. Stougie, S. Cotofana, "Alternatives in FPGA-based SAD Implementations", Proc. IEEE International Conf. on Field-Programmable Technology.
[6] C. Su and C. Jen, "Motion Estimation using MSD-first Processing", IEE Proc. Circuits Devices System, vol. 150, no. 2.
[7] M. Ercegovac and T. Lang, "On-line Arithmetic for DSP Applications", 32nd Midwest Symposium on Circuits and Systems.
[8] M. D. Ercegovac and T. Lang, "On-line Arithmetic: a Design Methodology and Applications in Digital Signal Processing", in VLSI Signal Processing III. Reprinted in E. E. Swartzlander, Computer Arithmetic, Vol. 2, IEEE Computer Society Press Tutorial, Los Alamitos, CA.
[9] D. Lau, A. Schneider, M. D. Ercegovac, J. Villasenor, "FPGA-based Structures for Online FFT and DCT", Proc. 7th IEEE Symposium on Field-Programmable Custom Computing Machines.
[10] S. Rajagopal, J. Cavallaro, "On-line Arithmetic for Detection in Digital Communication Receivers", 15th IEEE Symposium on Computer Arithmetic.
[11] R. McIlhenny, M. D. Ercegovac, "On the Design of an On-line FFT Network for FPGA's", 33rd Asilomar Conference on Signals, Systems, and Computers, vol. 2.
[12] A. Avizienis, "Signed Digit Number Representation for Fast Parallel Arithmetic", IRE Trans. Electron. Comput., vol. EC-10.
[13] C. Su and C. Jen, "Motion Estimation Using On-Line Arithmetic", IEEE Int. Symposium on Circuits and Systems (ISCAS-2000), May 28-31, 2000.
[14] J. Hormigo, J. Olivares, J. Villalba, I. Benavides, "New On-line Comparator with no Online Delay", 8th World Multiconference on Systemics, Cybernetics and Informatics.
[15] J. Villalba, J. Hormigo, "Analysis of the Mistakes in the Paper 'Motion Estimation using MSD-first Processing', IEE Circ., Dev. & Syst., vol. 150, no. 2, April 2003", Internal Report, Depart. Computer Architecture, University of Málaga, Dec.

Published in: Microprocessors and Microsystems 30 (2006) 250-258, www.elsevier.com/locate/micpro


More information

Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications

Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications Theses of the Ph.D. dissertation Zoltán Nagy Scientific adviser: Dr. Péter Szolgay Doctoral School

More information

Reconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio

Reconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio Reconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio 1 Anuradha S. Deshmukh, 2 Prof. M. N. Thakare, 3 Prof.G.D.Korde 1 M.Tech (VLSI) III rd sem Student, 2 Assistant Professor(Selection

More information

Performance Comparison of an Algorithmic Current- Mode ADC Implemented using Different Current Comparators

Performance Comparison of an Algorithmic Current- Mode ADC Implemented using Different Current Comparators Performance Comparison of an Algorithmic Current- Mode ADC Implemented using Different Current Comparators Veepsa Bhatia Indira Gandhi Delhi Technical University for Women Delhi, India Neeta Pandey Delhi

More information

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution

More information

SIM-PL: Software for teaching computer hardware at secondary schools in the Netherlands

SIM-PL: Software for teaching computer hardware at secondary schools in the Netherlands SIM-PL: Software for teaching computer hardware at secondary schools in the Netherlands Ben Bruidegom, benb@science.uva.nl AMSTEL Instituut Universiteit van Amsterdam Kruislaan 404 NL-1098 SM Amsterdam

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

http://www.ece.ucy.ac.cy/labs/easoc/people/kyrkou/index.html BSc in Computer Engineering, University of Cyprus

http://www.ece.ucy.ac.cy/labs/easoc/people/kyrkou/index.html BSc in Computer Engineering, University of Cyprus Christos Kyrkou, PhD KIOS Research Center for Intelligent Systems and Networks, Department of Electrical and Computer Engineering, University of Cyprus, Tel:(+357)99569478, email: ckyrkou@gmail.com Education

More information