
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 58, NO. 7, JULY 2011

A Reconfigurable Multi-Transform VLSI Architecture Supporting Video Codec Design

Kanwen Wang, Jialin Chen, Wei Cao, Ying Wang, Lingli Wang, Member, IEEE, and Jiarong Tong

Abstract: This brief presents a reconfigurable VLSI architecture designed for the multi-transform codec of several video coding standards: MPEG-2/4, VC-1, H.264/AVC, and AVS. A reconfigurable multiple constant multiplication algorithm with two fusing strategies is provided to generate the constant multipliers in the matrix calculation blocks. Additionally, an adder-sharing strategy is adopted in the unified preprocessing/postprocessing block to save circuit area. The proposed architecture supports different standards through static reconfiguration and forward/inverse transform functions through dynamic reconfiguration. It is suitable for the real-time processing of a 1080P HD video codec with the transforms of six video standards.

Index Terms: Adder-sharing, computer-aided design (CAD), dynamic reconfiguration, multi-transform, reconfigurable multiple constant multiplication (RMCM), static reconfiguration.

I. INTRODUCTION

WITH the advent of video coding standards, such as Motion Pictures Expert Group (MPEG-2/4), Windows Media Video 9 (VC-1), Advanced Video Coding (H.264/AVC), and Audio Video Coding Standard of China (AVS), there is an urgent need to integrate them into a single chip. Simply combining multi-standard codec circuits increases the silicon area and power consumption, which makes the design unacceptable. A closer look at the encoding and decoding processes reveals much logic that can be shared. Moreover, the basic algorithms of the different compression standards are alike in spite of their own coding methods. Therefore, through reconfiguration, different standards may be efficiently incorporated.
Manuscript received March 15, 2011; accepted April 5, 2011. Date of current version July 0, 2011. This work was supported in part by the State Key Laboratory of ASIC and System Research Program of Fudan University under Grant 09ZD005 and Grant 09XT00, by the National 863 Program of China under Grant 009AA0101, and by the Fundamental Research Funds for the Central Universities. This paper was recommended by Associate Editor B.-D. (Brian) Liu. The authors are with the State Key Laboratory of ASIC and System, Fudan University, Shanghai 0103, China. Corresponding author: Wei Cao (e-mail: caow@fudan.edu.cn). Digital Object Identifier 10.1109/TCSII.011.15865

In video compression, the discrete cosine transform (DCT) is widely used because it concentrates signal information in a few low-frequency components. To meet the requirement of real-time processing, hardware implementations of the 2-D DCT/inverse DCT (IDCT) are adopted. The 2-D DCT/IDCT can be implemented with the 1-D DCT/IDCT and a transpose memory in a row-column decomposition manner. However, the DCT requires floating-point multiplication, which causes precision problems in hardware implementations. Hence, a transform without floating-point multiplication, namely, the integer transform, has been introduced. It is similar, but not identical, to the DCT. The integer transform is employed in all of these video coding standards except MPEG-2/4. The similarity of the transform matrices among the different coding standards can be exploited for sharing.

In the literature, many multi-transform designs have been published. A low-cost VLSI architecture is designed for the multistandard inverse transform in [5]. In [6], the delta matrix is employed for sharing the inverse multi-transform. In [7], fast 1-D integer forward/inverse transforms for VC-1 are proposed. However, these designs target either video decoders only or a limited set of standards. On the other hand, finding an optimal solution for such designs involves a large search space. Usually, this is accomplished with computer-aided design (CAD).
In [2] and [3], such tools have been developed for multiple constant multiplication (MCM). However, their solutions apply only to multiplication with multiple outputs or with a reconfigurable single output. In this brief, a multi-transform VLSI architecture utilizing the reconfigurable MCM (RMCM) algorithm is proposed for the real-time processing of 1080P HD video, which supports both the forward and inverse transforms of MPEG-2/4, VC-1, H.264/AVC, and AVS.

The rest of this brief is organized as follows: Section II reviews the 1-D multi-transform. Section III presents the proposed circuit architecture along with the RMCM algorithm and the adder-sharing strategy. The VLSI implementation results and comparisons are given in Section IV. Finally, Section V concludes this brief.

II. REVIEWS OF 1-D MULTI-TRANSFORM

In video coding standards, 8 × 8 transform coding is required in MPEG-2/4, VC-1, H.264/AVC, and AVS, whereas 4 × 4 transform coding is required in VC-1 and H.264/AVC. The 1-D 8-point forward transform coefficient matrix is described as follows:

$$
C_{tran\_8} =
\begin{bmatrix}
a & a & a & a & a & a & a & a \\
b & c & d & e & -e & -d & -c & -b \\
f & g & -g & -f & -f & -g & g & f \\
c & -e & -b & -d & d & b & e & -c \\
a & -a & -a & a & a & -a & -a & a \\
d & -b & e & c & -c & -e & b & -d \\
g & -f & f & -g & -g & f & -f & g \\
e & -d & c & -b & b & -c & d & -e
\end{bmatrix}.
\tag{1}
$$

In total, there are 64 coefficients in this matrix, with 7 different values, a to g. The inverse transform matrix is the transpose of the forward transform matrix. By using the fast algorithm from [1], the 8-point forward transform matrix can be decomposed into two 4-point forward transform matrices, which are

$$
U_{tran\_4} =
\begin{bmatrix}
a & a & a & a \\
f & g & -g & -f \\
a & -a & -a & a \\
g & -f & f & -g
\end{bmatrix},
\qquad
V_{tran\_4} =
\begin{bmatrix}
b & c & d & e \\
c & -e & -b & -d \\
d & -b & e & c \\
e & -d & c & -b
\end{bmatrix}.
\tag{2}
$$
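As a quick consistency check of this decomposition, the following minimal Python sketch assembles the 8 × 8 matrix of (1) from the two 4-point matrices of (2). It relies only on the structure visible in (1): even rows are mirror-symmetric and odd rows are mirror-antisymmetric. The names `U4`, `V4`, `negate`, and `assemble_c8` are illustrative and not taken from the brief.

```python
# Symbolic entries of U_tran_4 and V_tran_4 from (2), kept as strings.
U4 = [["a", "a", "a", "a"],
      ["f", "g", "-g", "-f"],
      ["a", "-a", "-a", "a"],
      ["g", "-f", "f", "-g"]]
V4 = [["b", "c", "d", "e"],
      ["c", "-e", "-b", "-d"],
      ["d", "-b", "e", "c"],
      ["e", "-d", "c", "-b"]]


def negate(term):
    """Flip the sign of a symbolic coefficient such as 'a' or '-a'."""
    return term[1:] if term.startswith("-") else "-" + term


def assemble_c8(u4, v4):
    """Interleave rows of U (even rows of C8) and V (odd rows of C8)."""
    rows = []
    for u_row, v_row in zip(u4, v4):
        # Even row: the right half mirrors the left half.
        rows.append(u_row + u_row[::-1])
        # Odd row: the right half mirrors the left half with opposite sign.
        rows.append(v_row + [negate(t) for t in v_row[::-1]])
    return rows


if __name__ == "__main__":
    for row in assemble_c8(U4, V4):
        print(" ".join(f"{t:>2}" for t in row))
```

Running the script prints the eight rows in the order used in (1), so the sign pattern can be compared entry by entry.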

Note that the 4-point U matrix is also used in 4-point transform coding. The 1-D 8-point transform is expressed as

$$
Y_8 = C_8 X_8
\tag{3}
$$

where $Y_8 = [y_0\ y_1\ y_2\ y_3\ y_4\ y_5\ y_6\ y_7]^T$, $X_8 = [x_0\ x_1\ x_2\ x_3\ x_4\ x_5\ x_6\ x_7]^T$, and $C_8$ is the 8-point transform coefficient matrix. The 1-D 8-point forward transform can be decomposed as

$$
\begin{bmatrix} y_0 \\ y_2 \\ y_4 \\ y_6 \end{bmatrix}
= U_{tran\_4}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix},
\qquad
\begin{bmatrix} y_1 \\ y_3 \\ y_5 \\ y_7 \end{bmatrix}
= V_{tran\_4}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}.
\tag{4}
$$

In addition, the 1-D inverse transform is expressed as

$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}
= U_{tran\_4}^{T} \begin{bmatrix} x_0 \\ x_2 \\ x_4 \\ x_6 \end{bmatrix}
+ V_{tran\_4}^{T} \begin{bmatrix} x_1 \\ x_3 \\ x_5 \\ x_7 \end{bmatrix},
\qquad
\begin{bmatrix} y_7 \\ y_6 \\ y_5 \\ y_4 \end{bmatrix}
= U_{tran\_4}^{T} \begin{bmatrix} x_0 \\ x_2 \\ x_4 \\ x_6 \end{bmatrix}
- V_{tran\_4}^{T} \begin{bmatrix} x_1 \\ x_3 \\ x_5 \\ x_7 \end{bmatrix}.
\tag{5}
$$

Obtaining the results of the matrix calculation requires many multiplications and additions. The positions and signs of coefficients a to g are the same in all standards, although each standard has its own coefficient values in the transform matrix. The coefficients for MPEG-2/4 are represented here as 10-bit fixed-point numbers. According to [5], this is the minimum bitwidth that meets the constraint of IEEE Standard 1180-1990. Additionally, the inverse transform of H.264/AVC must conform to the data flow defined in the standard to avoid mismatch problems. The transform matrices of this data flow are referred to as $U_{h264}^{T}$ and $V_{h264}^{T}$ in (6).

[Equation (6): entries of $U_{h264}^{T}$ and $V_{h264}^{T}$ are not recoverable from this copy.]

The other coefficients are integers defined by the standards. Table I summarizes all coefficients needed among the different video coding standards.

TABLE I. COEFFICIENTS AMONG DIFFERENT VIDEO CODING STANDARDS

III. PROPOSED 1-D MULTI-TRANSFORM CIRCUIT ARCHITECTURE

Fig. 1. Proposed 1-D multi-transform VLSI architecture.

The proposed architecture is shown in Fig. 1. It mainly consists of the U matrix calculation block, the V matrix calculation block, the preprocessing/postprocessing block, the adder tree block, and two mux blocks.
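To illustrate how these blocks cooperate, the following behavioural Python sketch models the 1-D 8-point datapath according to (4) and (5) and checks it against a direct matrix product. It is a functional model only, not the authors' RTL. The coefficient values are an assumption (those commonly quoted for the H.264/AVC 8 × 8 forward integer transform), since the entries of Table I are not reproduced here, and all function names are hypothetical.

```python
# Illustrative values for a..g (assumed: H.264/AVC 8x8 forward transform).
A, B, C, D, E, F, G = 8, 12, 10, 6, 3, 8, 4

U = [[A, A, A, A], [F, G, -G, -F], [A, -A, -A, A], [G, -F, F, -G]]
V = [[B, C, D, E], [C, -E, -B, -D], [D, -B, E, C], [E, -D, C, -B]]


def matvec(m, v):
    """Matrix-vector product: per-term products followed by row sums."""
    return [sum(c * x for c, x in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def transform_1d(x, inverse=False):
    """Behavioural model of the 1-D 8-point forward/inverse transform."""
    if not inverse:
        # Preprocessing: butterfly of the forward transform, as in (4).
        s = [x[i] + x[7 - i] for i in range(4)]
        d = [x[i] - x[7 - i] for i in range(4)]
        even, odd = matvec(U, s), matvec(V, d)
        # Postprocessing: permutation that interleaves even and odd outputs.
        y = [0] * 8
        y[0::2], y[1::2] = even, odd
        return y
    # Preprocessing: permutation that separates even and odd inputs, as in (5).
    p = matvec(transpose(U), x[0::2])
    q = matvec(transpose(V), x[1::2])
    # Postprocessing: butterfly of the inverse transform.
    upper = [p[i] + q[i] for i in range(4)]   # y0, y1, y2, y3
    lower = [p[i] - q[i] for i in range(4)]   # y7, y6, y5, y4
    return upper + lower[::-1]


def reference_matrix():
    """Full 8x8 matrix of (1), built from U and V by even/odd symmetry."""
    c8 = [[0] * 8 for _ in range(8)]
    for k in range(4):
        for j in range(4):
            c8[2 * k][j] = c8[2 * k][7 - j] = U[k][j]
            c8[2 * k + 1][j], c8[2 * k + 1][7 - j] = V[k][j], -V[k][j]
    return c8
```

For any input vector `x`, `transform_1d(x)` should equal `matvec(reference_matrix(), x)`, and `transform_1d(x, inverse=True)` should equal the product with the transposed matrix. The sketch models neither the pipeline registers nor the fixed-point word lengths.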
Three pipeline stages are applied at the preprocessing/postprocessing, matrix calculation, and adder tree blocks, as indicated by dashed lines. The matrix calculation is multiplierless, i.e., it is made of only adders and shifters. Two kinds of constant multipliers are used to calculate each term of the matrix product: an AFG constant multiplier is in charge of the U matrix calculations, and a BCDE constant multiplier is in charge of the V matrix calculations. The constant multipliers calculate the coefficient products in parallel and can be reconfigured to support different standards.

Finding the optimal solution for MCM problems, i.e., the one with the fewest adders, is known to be NP-complete [2]. Voronenko and Puschel [2] proposed a heuristic graph-based algorithm, which can generate an optimal directed acyclic graph (DAG) for given multiple constants with multiple outputs. In [3], an algorithm for time-multiplexed MCM was further presented, which can produce a single output for multiple constants based on the input control. Unfortunately, neither of them can produce architectures with reconfigurable multiple outputs. Here, a novel RMCM algorithm, which is based on [2] and [3], is given. An integrated tool with two fusing strategies is developed to support this algorithm. This tool is able to produce architectures with reconfigurable multiple outputs.

A small example is demonstrated here. In Fig. 2, two optimal MCM DAGs are displayed for multiplication, i.e., {y0 = 5x, y1 = 19x} in (a) and {y2 = 36x, y3 = x} in (b). Strategy 1 performs node assignment and edge merging from DAG_B to DAG_A. Usually, adders and subtractors are assigned to their counterparts, but configurable adders/subtractors may result. Edge merging appends multiplexers to the inputs of nodes where necessary. Strategy 2 first tries to reconstruct DAG_A and DAG_B to find more common nodes (e.g., 9x) and then performs the same steps as strategy 1.
The reconstructed DAG_C and DAG_D are shown in (c) and (d). They may not be the optimal MCM DAGs, but more common nodes can reduce the cost of multiplexers. (e) and (f) are the RMCM DAGs generated using strategies 1 and 2, respectively. Outputs y0 and y2 are fused to y02, whereas outputs y1 and y3 are fused to y13. In (e), with both multiplexers set to select the left inputs, the active datapaths correspond to (a), and with both multiplexers set to select the right inputs, they correspond to (b). In this example, (f), which uses strategy 2, has the better area result.

Fig. 2. (a) Optimal MCM DAG_A for outputs y0 and y1. (b) Optimal MCM DAG_B for outputs y2 and y3. (c) Reconstructed MCM DAG_C for outputs y0 and y1. (d) Reconstructed MCM DAG_D for outputs y2 and y3. (e) RMCM DAG_E using strategy 1 for outputs y02 and y13. (f) RMCM DAG_F using strategy 2 for outputs y02 and y13.

The following pseudocode describes how two MCM DAGs, i.e., DAG_A and DAG_B, are fused into an RMCM DAG.

    FuseMCMDAGs(DAG_A, DAG_B)
      min_cost = MAX_COST
      rmcm_dag = NULL
      for all node mappings of DAG_B to DAG_A do
        dag = FuseDAGs(DAG_A, DAG_B)
        cost = EstimateCost(dag)
        if cost < min_cost then
          min_cost = cost
          rmcm_dag = dag
        end if
      end for // Strategy 1
      (DAG_C, DAG_D) = ReconstructDAG(DAG_A, DAG_B)
      for all node mappings of DAG_D to DAG_C do
        dag = FuseDAGs(DAG_C, DAG_D)
        cost = EstimateCost(dag)
        if cost < min_cost then
          min_cost = cost
          rmcm_dag = dag
        end if
      end for // Strategy 2
      return rmcm_dag

Fig. 3. Proposed AFG constant multiplier.

The node mappings enumerate all possibilities while respecting the ordering of the nodes in both MCM DAGs. The output nodes of each DAG are also aligned. The FuseDAGs process includes both node assignment and edge merging. The EstimateCost process evaluates the cost of the fused DAG according to information from the technology library being used. The RMCM algorithm chooses between the results of the two fusing strategies. It guarantees that the number of adder nodes does not exceed that of the larger of the initial DAGs being fused, and that the multiplexers appended incur the least cost.
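The selection logic of the pseudocode above can be sketched in Python as follows. This is a minimal sketch under strong assumptions, not the authors' tool: the fusing, cost estimation, reconstruction, and mapping enumeration are passed in as callables, and the toy versions below (DAGs as tuples of node constants, cost as the number of node pairs that differ) are hypothetical stand-ins for FuseDAGs, EstimateCost, and ReconstructDAG, whose real behaviour depends on the DAG structure and the technology library.

```python
from itertools import permutations


def fuse_mcm_dags(dag_a, dag_b, node_mappings, fuse, estimate_cost, reconstruct):
    """Return the lowest-cost fused DAG over both strategies, with its cost."""
    best_cost, best_dag = float("inf"), None
    # Strategy 1 uses the original DAGs; strategy 2 uses the reconstructed ones.
    for a, b in ((dag_a, dag_b), reconstruct(dag_a, dag_b)):
        for mapping in node_mappings(a, b):
            dag = fuse(a, b, mapping)
            cost = estimate_cost(dag)
            if cost < best_cost:
                best_cost, best_dag = cost, dag
    return best_dag, best_cost


# ---- Toy stand-ins (hypothetical, for illustration only) ----

def toy_mappings(a, b):
    """Every assignment of the nodes of b to the nodes of a (equal sizes)."""
    return permutations(range(len(b)))


def toy_fuse(a, b, mapping):
    """Pair each node of a with its assigned node of b."""
    return tuple((a[i], b[mapping[i]]) for i in range(len(a)))


def toy_cost(dag):
    """Count fused nodes whose two constants differ (each needs a mux)."""
    return sum(1 for p, q in dag if p != q)


def toy_reconstruct(a, b):
    """Pretend that b can be rebuilt so that it shares the node 3x with a."""
    return a, (3, 11)


if __name__ == "__main__":
    dag, cost = fuse_mcm_dags((3, 13), (5, 11), toy_mappings,
                              toy_fuse, toy_cost, toy_reconstruct)
    print(dag, cost)
```

In this toy setting, strategy 1 needs two multiplexed nodes, whereas the reconstructed pair shares one node and needs only one, so the second result is kept. This mirrors the reasoning in the text that more common nodes reduce the multiplexer cost.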
The hardware structure that multiplies by the coefficient values in the matrix can be easily generated by the RMCM tool in the form of a hardware description language. The tool first constructs a DAG for each set of coefficients of one standard. Then, every two DAGs (e.g., one DAG with 362, 473, and 196 in the U matrix for MPEG-2/4 and the other DAG with 17, 22, and 10 in the U matrix for 4-point VC-1) are fused with the two aforementioned strategies to obtain a lower area result. The RMCM DAG with the minimum cost becomes the final structure of the constant multiplier. It is capable of generating multiple outputs in parallel and of changing their values through different configurations.

The structure of the AFG constant multiplier is shown in Fig. 3. There are five adder nodes in total in the AFG constant multiplier, with a depth of three. It consumes 95% more area than an AFG constant multiplier design that supports only the MPEG-2/4 coefficients. This is the overhead caused by the multiplexing circuit for reconfigurability. The structure of the BCDE constant multiplier is shown in Fig. 4. There are six adder nodes in total in the BCDE constant multiplier, with a depth of three, whereas there are seven adder nodes in [5]. The reconfigurability overhead for the BCDE constant multiplier design is 71% more area.

Each node has a name and represents an intermediate value, which is labeled in brackets. Note that under different configurations, a node may produce different values, which are separated by commas. All node calculation steps are shown in Table II. The final matrix coefficients can be easily obtained from the intermediate nodes, as expressed in Table III. For example, coefficient 362 is the result of left-shifting 181 by 1 bit, where 181 is the AFG node value of sub_31. Moreover, 181 is the result of subtracting 15 from 49 left-shifted by 2 bits, which is the AFG node value of sub_1 and add_sub_11. Then, the full expression of 362 is (((64 − (16 − 1)) << 2) − (16 − 1)) << 1.
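The shift-and-add expression for coefficient 362 can be checked with a few lines of Python. The sketch below follows only the arithmetic given in the text. The variable names `n15`, `n49`, and `n181` are illustrative and are not the node names of Table II.

```python
def times_362(x):
    """Multiply x by 362 using only shifts, one reused term, and subtractions."""
    n15 = (x << 4) - x        # 15x  = 16x - x
    n49 = (x << 6) - n15      # 49x  = 64x - 15x
    n181 = (n49 << 2) - n15   # 181x = 196x - 15x
    return n181 << 1          # 362x = 2 * 181x


if __name__ == "__main__":
    for sample in (1, 7, -3, 255):
        print(sample, times_362(sample), 362 * sample)
```

The intermediate value 15x is computed once and used twice, which is the kind of sharing an MCM graph exploits. Only three subtractions are needed along this path.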
Fig. 4. Proposed BCDE constant multiplier.

TABLE II. NODE CALCULATION STEPS

TABLE III. COEFFICIENTS CALCULATION STEPS

The preprocessing block realizes the butterfly structure of the forward transform and the permutation structure of the inverse transform, whereas the postprocessing block computes the butterfly structure of the inverse transform and the permutation structure of the forward transform. In [4], they are designed separately. Here, a unified preprocessing/postprocessing block is presented using an adder-sharing strategy, which means that the forward and inverse transforms share a common butterfly structure and a common permutation structure. In this way, eight adders are saved at the cost of eight multiplexers. Fig. 5 shows this architecture.

Fig. 5. Proposed unified preprocessing/postprocessing architecture.

The adder tree block is used to obtain the sums of the matrix products. The sign positions of the forward and inverse transform additions are the same for the two 4-point matrices; thus, only the input signals to the adders change. In addition, the mux input and output blocks are responsible for the 4-/8-point and forward/inverse transform selections.

The hardware for the H.264/AVC inverse transform is designed separately. The 8-point inverse transform requires 32 adders [11]. These adders are shared with the proposed architecture: 4 in the matrix calculation block, 20 in the adder tree block, and 8 in the preprocessing/postprocessing block. The operations x/2, 3x/2, and x/4 in the matrix are performed with x >> 1, x + (x >> 1), and x >> 2, respectively. In detail, x/2 is computed in the four AFG constant multiplier blocks, whereas 3x/2 is computed by add_11 in the four BCDE constant multiplier blocks. In the adder tree block, the additions of the U matrix products remain unchanged, but those of the V matrix products require extra multiplexers to select add/sub operations and to accomplish x/4. The adders in the preprocessing/postprocessing block are shared to perform the butterfly operations without any changes.

IV. VLSI IMPLEMENTATION RESULTS AND COMPARISONS

The proposed design is described in Verilog HDL, modeled in MATLAB, and verified in Synopsys VCS with MATLAB data. The verification of H.264/AVC is done with the help of the JM 17. reference software. The design is synthesized using Synopsys Design Compiler with the SMIC 130-nm standard cell library. Table IV gives the design result. It shows that the U and V matrix calculations account for about 60% of the whole architecture; therefore, improving them is critical. To demonstrate the advantages of the reconfigurable architecture, the forward and inverse transforms are also implemented separately; all three results are listed in Table V.

TABLE IV. HARDWARE COST OF 1-D MULTI-TRANSFORM ARCHITECTURE

TABLE V. THREE TRANSFORM FUNCTIONS OF IMPLEMENTATION

It is noted that a unified architecture can perform both the forward and inverse transforms through dynamic reconfiguration, instead of using two architectures. Accordingly, 31% of the area can be saved. Qi et al. [5] also implemented constant multipliers. The BCDE subunit in Fig. from [5] is reimplemented here and shows % more area than the proposed RMCM-based BCDE constant multiplier. The total adder count of the proposed inverse transform architecture is 66, whereas in [5] and [6], these numbers are 7 (the number 70 in [5] is wrong, which has been confirmed by the author) and 11. As a result, 6 and 6 adders are saved, respectively. In addition, with the help of CAD tools, the RMCM-based constant multiplier can be extended to support more video coding standards.

TABLE VI. 1-D MULTITRANSFORM ARCHITECTURE COMPARISONS

Table VI shows comparisons between the proposed and existing designs. To the best of the authors' knowledge, no other architecture supports the codec design of the transforms of all six video standards. Under the same 130-nm technology, the inverse transform architecture presented here saves 1% and 7% area while supporting more standards, compared with those of [5] and [6]. Furthermore, the proposed reconfigurable multi-transform VLSI architecture requires about 8% and 1% area penalty.

The throughputs of the proposed 1-D architecture are 8 pixels/cycle with the 8-point transform and 4 pixels/cycle with the 4-point transform, as it computes the transform data in parallel. Since the architecture applies three pipeline stages, it takes 10 (3 + 7) cycles to finish the processing of eight 1-D 8-point transforms in row (column) order, considering the pipeline latency. Two of the proposed 1-D multi-transform architectures can be used along with a modified transpose memory [10] to construct a 2-D multi-transform architecture.
The total latency for a 2-D transform is 10 (3 + 4 + 3) cycles. Supposing the architecture is fully pipelined, it needs 105 (4 × 16 × 1.5 − 1 + 10) cycles to process the 2-D 4 × 4 transforms of a 4:2:0 macroblock. This number is larger than that of the 2-D 8 × 8 transform. Therefore, the real-time analysis is based on 2-D 4 × 4 transforms. When used inside a decoder, the proposed multi-transform VLSI architecture can process a 1080P HD video bitstream at 60 Hz under a 55-MHz (1920 × 1088 × 60 × 105/(16 × 16)) working frequency. When used inside an encoder, both the forward and inverse transform functions are required. This is achieved through dynamic reconfiguration. Considering one cycle of reconfiguration time, it needs 211 (105 × 2 + 1) cycles to process the 2-D 4 × 4 transforms of a 4:2:0 macroblock. Thus, it can process a 1080P HD video bitstream at 30 Hz under a 55-MHz (1920 × 1088 × 30 × 211/(16 × 16)) working frequency. Hence, the multi-transform architecture can be used in a real-time HD video codec to process MPEG-2/4, VC-1, H.264/AVC, and AVS video bitstreams.

V. CONCLUSION

In this brief, a reconfigurable multi-transform VLSI architecture supporting video codec design has been presented. The reconfigurability of the architecture is reflected in two ways: 1) matrix coefficients from different standards can be statically reconfigured based on the RMCM algorithm, and 2) forward and inverse transforms can be dynamically reconfigured with the adder-sharing design of the preprocessing/postprocessing blocks. Moreover, the RMCM algorithm can be used to find optimal solutions to support more standards. This architecture is suitable for the transform processing in 1080P HD video codec designs for MPEG-2/4, VC-1, H.264/AVC, and AVS.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers, whose advice helped to improve the quality of this brief.

REFERENCES

[1] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. COM-25, no. 9, pp. 1004–1009, Sep. 1977.
[2] Y. Voronenko and M. Puschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, vol. 3, no. 2, p. 11, May 2007.
[3] P. Tummeltshammer, J. C. Hoe, and M. Puschel, "Time-multiplexed multiple-constant multiplication," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 9, pp. 1551–1563, Sep. 2007.
[4] J. I. Guo, R. C. Ju, and J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 416–428, Apr. 2004.
[5] H. Qi, Q. Huang, and W. Gao, "A low-cost very large scale integration architecture for multistandard inverse transform," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 7, pp. 551–555, Jul. 2010.
[6] S. Lee and K. Cho, "Architecture of transform circuit for video decoder supporting multiple standards," Electron. Lett., vol., no., pp. 7 75, Feb. 2008.
[7] C. P. Fan and G. A. Su, "Fast algorithm and low-cost hardware-sharing design of multiple integer transforms for VC-1," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 10, pp. 788 79, Oct. 2009.
[8] G. A. Su and C. P. Fan, "Low-cost hardware-sharing architecture of fast 1-D inverse transforms for H.264/AVC and AVS applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 12, pp. 19 153, Dec. 2008.
[9] K. Kim and J. S. Koh, "An area efficient DCT architecture for MPEG-2 video encoder," IEEE Trans. Consum. Electron., vol. 45, no. 1, pp. 6 67, Feb. 1999.
[10] T. C. Wang, Y. W. Huang, H. C. Fang, and L. G. Chen, "Parallel 4 × 4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264," in Proc. IEEE ISCAS, May 2003, pp. 800–803.
[11] Y. C. Chao, H. H. Tsai, Y. H. Lin, J. F. Fang, and B. D. Liu, "A novel design for computations of all transforms in H.264/AVC decoders," in Proc. IEEE ICME, Jul. 2007, pp. 191 1917.