Low Latency, High Throughput, and Less Complex VLSI Architecture for 2D-DFT

The International Arab Journal of Information Technology, Vol., o., January Low Latency, High Throughput, Less Complex VLSI Architecture for D-DT Sohil Shah, Preethi Venkatesan, Deepa Sundar, unii Kannan Department of Electronics Engineering, Anna University, India Abstract: This paper proposes a pipelined, systolic architecture for two- dimensional discrete ourier transform computation which is highly concurrent. The architecture consists of two, one-dimensional discrete ourier transform blocks connected via an intermediate buffer. The proposed architecture offers low latency as well as high throughput can perform both one two- dimensional discrete ourier transforms. The architecture supports transform length that is not power of two not based on products of co-prime numbers. The simulation synthesis were carried out using Cadence tools, csim RTL Compiler, respectively, with nm libraries. Keywords: Digital signal processing chip, discrete ourier transforms, systolic array, very-large-scale integration circuit. Received ebruary, ; accepted June,. Introduction The discrete ourier transform has a vital importance in various domains like signal processing, image processing, communication various other areas of science engineering. In particular, two dimensional Discrete ourier Transform (DT) has a variety of applications in image processing domain like filtering, compression, pattern matching, magnetic resonance imaging watermarking. Hence efficient implementation of the DT architecture has become a challenge. Since many of these applications process real time data, computational speed has to increase with low latency. ast ourier Transform (T) based architectures were designed to acquire computational speed [. But they had many disadvantages such as the requirement of transform length to be a power of some fixed radix which limits the choice of reachable values of the increase in computational complexity latency in pipelined Ts with the increase in the transform length. Other architectures has been studied for years, among which systolic architecture is preferred as it is simple offers rhythmic processing with repetitive elements or identical cells called processing elements. Hence there is a need for the development of a high performance systolic architecture for two- dimensional DT that work on transform lengths that is not power of two. In this paper, a less complex systolic architecture for the computation of d-dt is proposed based on an algorithm which facilitates the computation of DT for any transform length,.i.e., can be a prime number need not be a power of any fixed radix. High throughput is achieved by using pipelining concept where d-dt is computed via two pipelined d-dt blocks provided with an intermediate buffer. In Section, the previous works related to systolic architectures of DT are summarized. Section gives a mathematical framework for d-dt computation for transform length via four inner products of real valued data of length approximately equal to /. Section discusses the proposed VLSI architecture which includes one-dimensional DT block, intermediate buffer, floating point adders multipliers. Implementation aspects results for the proposed architecture are given in section.. Related ork A variety of systolic array designs have been proposed for the direct computation of DT [,. These designs offer simple architecture can be used for any transform length but they are computationally complex as they work on the direct algorithm of DT where the arithmetic operations per DT computation is of O( ). Hence as the transform length increases they offer a computation intensive architecture which does not support direct calculation of two dimensional, d-dt. Since computation of d-dt from d-dt requires a transposition operation, latency is reduced hence transposition free two dimensional DT architectures were developed [,. But this has been achieved at the cost of increased hardware reduced performance which is undesirable. Computationally efficient architecture has been designed using two level transform factorization which provides low latency high throughput [,. But this design is not

The International Arab Journal of Information Technology, Vol., o., January applicable for odd values of hence offers less choice in choosing. A reduced complexity algorithm that operates on any transform length for -D DT is given in [.Based on that algorithm, a two dimensional architecture offering low latency high throughput has been proposed in this paper where a dual port RA is used in between two -D blocks. The dual port RA is controlled in such a way that pipelining is achieved thereby reducing latency increasing throughput.. athematical Background.. General ormulation DT of an array {f (n, m) for n, m=,, -} of size ( x ) is defined as [, ( k, l ) = n = m = for k, m, l, n =,, -, f ( m, n ) ab =e -jπab/ ln DT of a sequence {f(n),,,..., -} is defined as, ( k ) = for k, n =,,., - f ( n ) The equation can be split into two separate mathematical computations, ' ( m, l ) = for l, n =,,., - ( k, l ) = for k, m =,,., - m = f ( m, n) ' ( m, l ) ln The -point D DT of all rows of an input array of size (X) will result in intermediate matrix [ (m, l). The D DT is then computed by the inner product of the k th column of the DT kernel with the l th column of the intermediate matrix [ (m, l) [... ormulation of a Proposed Algorithm Since equations indicates that D-DT computation can be performed using two consecutive D-DT, we hereby extend the algorithm which was proposed for D-DT to help in D-DT computation with the introduction of an additional hardware which we propose to avoid transposition operation to enhance speed of computation. Thus equation can be expressed as, km km () () () () () ( k ) = f ( ) + ( k ) = f () + ( f ( n ) ( f ( n) where k=,,., ( ) = where =( -)/. f ( ) + + ( f ( n ) + f ( n ) + f ( n) f ( n )) can either be an odd or even number. Here we assume to be an odd number for hardware proposal suitable modification can be done for even values of. odified equations to for even values of are, jπk ( k) = f () + f ( / ) e + (f(n) + f( n) ) () jπk ( k) = f () + f ( / ) e + ( f ( n) + f ( n) ) () ( ) = f () + ( f ( n) + f ( n)) + f ( / ) () ( / ) = ( f (n) f (n+ )) () where =(/)-. Splitting the real imaginary coefficients of we get, ( k ) = f () + ( ( k ) ( k )) + j( ( k ) + ( k )) ( k ) = f () + ( ( k ) + ( k )) + j( ( k ) ( k )) where where, ( k ) = Re[ x ( n ) Re[ n = ( k ) = Im[ y ( n ) Im[ ( k ) = Im[ x ( n ) Re[ ( k ) = n = Re[ y ( n ) Im[ x(n)=[f(n) + f( -n) y(n)=[f(n) - f( -n) Re[. Im[. are the real imaginary part of a complex inputs respectively. Thus -point complex inner products is reduced to real inner products ) ) () () () () () () () () () () ()

Low Latency, High Throughput, Less Complex VLSI Architecture for D-DT which can be processed simultaneously each of length (/) points.. Architectural Aspects.. Proposed D-DT Structure rom the mathematical modification it is own that d structure can be easily arrived at by simply computing d-dt concurrently for the rows columns of the input array matrix the intermediate matrix result, respectively. The resultant intermediate matrix columns are globally routed in a proper manner using the intermediate buffer block of size ( x ) cells each of length bits. igure shows the fully pipelined general architectural implementation of d- DT. It consists of number of Processing Cells (PCs), number of Summing blocks & (SB & SB) one Summing block (SB), where is given by (-)/. Summing blocks form the first row last column of the structure in both the DT Blocks Processing cells are placed on the left side of the structure below the first row. The functions of the processing cells summing blocks are shown in the igure. Each summing block consists of only floating point adders performing addition/subtraction operation, which in turn receives continuous inputs during every clock cycle. SB receives a pair of inputs f( + n) f( n + ) where,,..,, denotes the total number of SB & SB blocks, from the preprocessing blocks (integer to P converter) or from the intermediate buffer stage in case of second block. SB consists of adder cells. Each SB unit receives the first element of each row/column from the blocks discussed above it consists of only single adder, computing the DC component for each row of the input matrix. Each processing cells will consist of four floating point multiplier to multiply the two processed output from the summing blocks with the DT coefficient producing real imaginary outputs separately, thus in short it computes the operation shown in equations,,,, outputting four intermediate results. ultiplicative DT coefficients are hardwired in the multiplier unit. This complex multiplication can also be performed using CORDIC multiplier by repeated shift add operation or RO based look up tables can also be used. Here each multiplier are replaced by RO tables of size L, where L is the word length based number of bits used for mantissa part for P representation, which contain all the possible product values obtained on multiplying fixed DT coefficients by all possible inputs values. Thus four adders are required in processing cells partially computing the operation of real imaginary part separately as shown in equations. inal summing block SB takes four inputs from the processing cells in the previous column adds the DC component received from SB, thus performing five add/subtract operation, giving out real imaginary values of the computed d-dt separately, of the input row/column of the matrix. Each PCs SBs consists of latches at each output point for data transfer to the adjacent PCs/SB through both horizontal vertical stream thus making it fully pipelined architecture reducing latency. The output from the SB block is available after (+) cycles in the following next clock cycle, a pair of output from each SB unit is available one after another from d-dt unit. PCs form the critical path of the structure, which can be reduced using the techniques suggested above, thus serve as a deciding factor for maximum frequency of operation. The inputs outputs to the DT unit are latched in order to store the processed values in correct sequence to buffer unit which in turn is controlled using S thus this latch doesn t play any role in latency calculation. The number of latches increases with thus we have latches following the th SB block. The input to the second DT unit can be fed only after a column computed from the DT of row sequence is available in the buffer unit. Thus, only after (+) clock cycles, input is fed to second DT block thus output of the d-dt sequence become available after clock cycles from the time the input for which DT has to be computed is fed to the first DT block. igure. Proposed d systolic structure for (x) d-dt computation.

The International Arab Journal of Information Technology, Vol., o., January diagonal arrows indicate the flow in which input is stored. Vertical arrows indicate the flow at which the output is read out number adjacent to rows indicate the clock cycle number at which the particular row of cells each is read out. (a) Summing block SB. (b) Summing block SB. igure. Structure of internal buffer for =. (c) Summing block SB. (d) Processing cell PC. igure. Structure function of various units... Buffer Unit The internal structure operation of buffer unit is shown in igure. It consists of cells each of size * L (word length considering both real imaginary part). Twice the size of input matrix number of cells are required to store the d-dt output of input matrix, thus performing continuous computation without any halt. hile output from first block is store continuously into one set of buffer cells other set feed the input to second DT block concurrently. Thus it can also be termed as Dual Port Ram where ports are used to simultaneously write the input into one memory cell read the output from another memory cell simultaneously in a single clock period. This buffer can be implemented as either D- or as RA in ASIC PGA. Using it as RA unit will add to extra fetching time thus slow down the operation using it as D- designed using stard cells will add overhead in terms of area power. Clock gating can be applied to the cells using D- to save power. Thus there is a trade off in terms of area speed. The number inside the cell indicates the clock cycle number at which data to each cell is written.. loating Point Block Each bit floating point input is split into a real an imaginary part, with each part having a total of bits, where mantissa is bits, exponent is bits sign bit. To multiply two floating point numbers, the mantissa of both the inputs are multiplied, while the exponents are added, the result is rounded normalized. hen we add two floating point numbers, first their exponents have to be made equal for this, shifters have been used. Then both the input s mantissa is added the result is normalized. Carry look ahead addition technique has been used in both the floating point multiplication addition. odified Booth algorithm has been used for performing the multiplication. The output also consists of bits mantissa, bits exponent sign bit... Scope for uture ork This architecture can be even used for implementing prime factor DT as well as for multidimensional DT by simply repeating the same block over over again with careful implementation of buffer unit which will further store planes of (-) DT output before computation in the last th block. igure shows the general structure for multidimensional DT. igure. General structure for ultidimensional Implementation.. Implementation Results The whole d-dt architecture for a x input has been coded in VERILOG HDL simulation synthesis has been carried out using Cadence tools, csim RTL compiler respectively. The simulation is shown in igure. Each input is a bit integer. An integer to floating point converter is used to convert

Low Latency, High Throughput, Less Complex VLSI Architecture for D-DT the integer input to a complex number with floating point representation. A finite state machine has been used as a control circuit which controls the flow of d- DT output into the buffers from buffers into the other d-dt block. All the synthesis reports have been obtained using the nm technology library. The synthesis report of the entire design has been summarized in Table.umber of multiplication addition needed for different transform length along with their latency Average Cycles per Transform (ACT) for the proposed architecture has been given in Table. igure. Simulation using Cadence csim. Table. Synthesis results. d-dt block Speed Power Values hz.milliwatt Area.mm Gate Count Table. Latency, Computation for proposed architecture. umber of points Latency ACT ultipliers Adders, Table. Comparison of proposed structure with existing architectures =. Structure Adders ultipliers Latency ACT [,, [,, [,, Proposed,, The proposed architecture is compared with existing architectures of d-dt for being in Table. Here Latency denotes the number of cycles within which the first output is obtained. ACT is defined as the number of clock cycles between corresponding results for successive data blocks. A lesser value of ACC denotes higher throughput. umber of adders multipliers denotes the complexity of the structure. Hence from the table it is obvious that the proposed structure is less complex when compared to [, [. Also, it offers low latency compared to [ high throughput when compared to [ [.. Application As stated previously discrete ourier transform has a wide range of applications in signal processing communication domain. ireless communication has seen many advances in recent years in which channel estimation plays an important role to cancel out the channel noise offer better performance. Two dimensional discrete ourier transform offers an efficient method to estimate spatially correlated channels in IO communication systems [ offers an interesting area of research. Pattern matching is another area where D DT has found its importance. The phase component of discrete ourier transform serves as an excellent classifier has been used in areas like finger print recognition [, dental radiography identification [ texture feature extraction [.. Conclusion Thus a systolic architecture for two dimensional discrete ourier transform providing less complex design with low latency meeting high throughput requirements has been implemented in this paper. Architectural techniques such as pipelining parallelism have been used. Since the architecture is recursive, it can be extended to multi-dimensional DT by using intermediate buffers whose size increases exponentially with the dimension. Inverse Discrete Cosine Transforms (IDCT) Discrete Cosine Transform (DCT) can also be implemented using the same architecture with slight modification as they are based on the same algorithm. The front end design has been carried out. Placement Routing on Cadence Encounter is proposed for the future. References [ Chang C., ang C., Chang Y., Efficient VLSI Architectures for ast Computation of the Discrete ourier Transform its Inverse, IEEE Transactions on Signal Processing, vol., no., pp. -,. [ Ito K., orita A., Aoki T., Higuchi T., akajima H., Kobayashi K., A ingerprint Recognition Algorithm Using Phase-Based Image atching for Low-Quality ingerprints, in IEEE International Conference on the Image Processing, vol., pp. -,.

The International Arab Journal of Information Technology, Vol., o., January [ Kar D. Rao V., A ew Systolic Realization for the Discrete ourier Transform, IEEE Transactions on Signal Processing, vol., no., pp. -,. [ Lim H. Swartzler E., A Systolic Array for D DT D DCT, in Proceedings of the IEEE International Conference on Application Specific Array Processors, pp. -,. [ Lim H. Swartzler E., ultidimensional Systolic Arrays for The Implementation Of Discrete ourier Transforms, IEEE Transactions Signal Processing, vol., no., pp. -,. [ Liu C., Jen C., On the Design of VLSI Arrays for Discrete ourier Transform, in IEE Proceedings on Circuits, Devices Systems, vol., no., pp. -,. [ eher P., Design of a ully-pipelined Systolic Array for lexible Transposition-ree VLSI of - D DT, IEEE Transactions on Circuits System II, vol., no., pp. -,. [ eher P., Highly Concurrent Reduced Complexity -D Systolic Array for Discrete ourier Transform, IEEE Signal Processing Letters, vol., no.,. [ ash J., Computationally Efficient Systolic Architecture for Computing the Discrete ourier Transform, IEEE Transactions on Signal Processing, vol., no.,. [ ash J., Systolic Architecture for Computing the Discrete ourier Transform on PGAs, in Proceedings of the th Annual IEEE Symposium on ield-programmable Custom Computing achines, pp. -,. [ ikaido A., Ito K., Aoki T., Kosuge E., Kawamata R., A Phase-Based Image Registration Algorithm for Dental Radiograph Identification, IEEE International Conference on Image Processing, vol., pp. -,. [ Tong H. Zekavat S., Spatially Correlated IO Channel: Generation via Virtual Channel Representation, IEEE Communication Letters, vol., pp. -,. [ Xianchao Z., Liusheng H., Guoliang C., A ew Approach for Computing the Discrete ourier Transform of Arbitrary Length, in Proceedings of the th International Conference on Signal Processing, vol., pp. -,. [ Yu T., uthukkumarasamy V., Verma B., Blumenstein., A Texture eature Extraction Technique Using D DT Hamming Distance, in Proceedings of the ifth International Conference on Computational Intelligence ultimedia Applications, p.,. Sohil Shah has completed his BE in electronics communication Engineering at adras Institute of Technology, Anna University in the year. He is pursuing his.tech in communication at IIT, Bombay. His areas of interest include VLSI design optical communication. Preethi Venkatesan has completed her BE in electronics communication engineering at adras Institute of Technology, Anna University in the year. Her areas of interest include VLSI design PGA implementations. Deepa Sundar has completed her BE in electronics communication engineering at adras Institute of Technology, Anna University in the year. Her areas of interest include VLSI design image processing. unii Kannan has been working as faculty in the department of electronics engineering, IT Campus of Anna University since. His areas of interest are computer architecture, digital electronics, computer networking VLSI. His current area of interest is VLSI architecture for signal processing applications.