IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATION, VOL. XX, NO. Y, MONTH

Low power data transfer and storage exploration for H.263 video decoder system

Lode Nachtergaele, Francky Catthoor, Bhanu Kapoor, Stefan Janssens and Dennis Moolenaar

Abstract

We describe a power exploration methodology for data-dominated applications, using an H.263 video decoder as demonstrator application. The starting point for our exploration is a C specification of the video decoder, available in the public domain from Telenor Research. We have transformed the data transfer scheme in the specification and have optimised the distributed memory organisation. This results in a memory architecture with significantly reduced power consumption. For the worst-case mode using Predicted (P) frames, memory power consumption is reduced by a factor of 7 when compared to the reference design. For the worst-case mode using Predicted and Bidirectional (PB) frames, memory power consumption is reduced by a factor of 9. To achieve these results, we make use of our formalised high-level memory management methodology, partly supported in our ATOMIUM environment.

Keywords

Videophone systems, Logic design, Very-large-scale integration

I. Introduction

The video coding algorithm of Draft Recommendation H.263 is based on motion-compensated hybrid predictive and transform coding, with improvements to fit bit rates below 64 kbit/s. It is a complex and relevant example of a data-dominated application. A hardware realisation of such a decoder has to be power efficient in order to reduce the size of the chip package in which it is embedded, or of the battery if it is used in a mobile application. It is well known by now that any future complex chip realisation has to take power reduction into account [1]. Our previous research has clearly shown that the dominant power contribution in data-dominated designs lies in the data transfer and storage of multi-dimensional array signals and other complex data types [2], [3].
In this paper we exploit this feature to achieve large savings in the system power without having to worry about the detailed data-path, foreground registers, and controller architecture. The main contributions of this paper are the evaluation of the applicability and effectiveness of our power oriented methodology for data-dominated applications [4], [5], [3] (see Fig. 1), a study of the effect of the possible optimisations, and the application of the most promising alternatives in the correct sequence on the H.263 decoder algorithm. In addition, we have substantiated our earlier claims [2] that the cost of the background storage and related transfers is dominant during the system exploration. This will be shown in section VI by investigating the power in a representative data-path in H.263, including its corresponding local memories. In the rest of this paper, we have concentrated on the main storage (memory) and transfer related parts of the H.263 decoder architecture. This exploration has been done based on a power model described in section III. The final results for the different steps are illustrated in Fig. 12. A brief version of this paper has been published in [6].

Author note: This research was partly sponsored by a co-operation with Texas Instruments Incorporated. L. Nachtergaele and F. Catthoor are with IMEC, Kapeldreef 75, B-3001 Heverlee, Belgium. F. Catthoor is also Professor at the Katholieke Universiteit Leuven, Belgium. B. Kapoor was a resident at IMEC from the Corporate R&D labs of Texas Instruments Incorporated, Dallas, Texas. S. Janssens was a student from Erasmus Hogeschool and is now with IMEC. D. Moolenaar was a student from Delft Univ. of Technology and is now with IMEC.

The numerous pointers and variables in the C code, which are used in the reference implementation, have been removed by rewriting the specification into a mixed applicative-procedural DFL description [7].
As a result, more indices and some extra signal copies and accesses are present in the code, but the dependencies are much more transparent. This allows for systematic identification of the sources of potential optimisation. Moreover, this step is essential for applying a number of data storage and transfer related analysis and exploration/optimisation techniques, which are collected in our high-level memory management methodology/script, partly supported by the prototype ATOMIUM environment [4].

Our strategy to obtain area and power figures is based on selecting the worst-case parameters and modes in the H.263 specification. This is valid for computing the maximal power budget and for finding the component size, which affects mainly area, but not directly for the average power consumption. Still, we believe that the maximal power consumption gives a good view on the relative importance of the different components in the power budget and on the savings which can be obtained. In order to have a good view on the absolute average power consumption, we require accurate statistics on the occurrence of the different cases. In the sequel, we will only give some relative indication of this.

The following major algorithmic transformations and memory organisation optimisations have been performed on the DFL specification, mainly targeting the power budget related to the accesses to and from the frame memories:

1. First the code was pruned to retain the operations relevant to the overall complexity of the description with respect to the number of cycles, area and power consumption. This boils down to keeping the relevant storage and accesses of the arrays storing the picture information explicit and hiding details of arithmetic operations in function calls. As a result, the potential overhead of transfers and storage in the applicative writing style is removed when interpreted effectively.

2. Several data-flow transformations have been performed. The methodology for carrying out these transformations and their effects are described in [8]. One of the major transformations results in the removal of all the border accesses used in the H.263 decoder, as discussed in subsection V-A.

3. Advanced transformations on the global function hierarchy and loop nests have been performed. These transformations have a significant effect and will be partly discussed in subsection V-B. They are also essential to enable the application of the further exploration steps.

4. In order to further exploit locality of access and data reuse, extra memory hierarchy levels have been incorporated (see subsection V-C). For the P mode, this step has been especially effective in the "overlapped-block motion compensation" (OBMC) mode, which has the largest power consumption. We will only show the principle involved in this optimisation, as depicted in Fig. 6. For the B pictures and for the combination with computation, we will show more details.

5. Finally we have performed actual memory allocation and in-place mapping to determine the detailed memory organisation for the frame memories and some of the smaller intermediate memories. This step will be discussed in subsection V-D. It has a large effect on the required area, which is reduced by almost a factor of 2, with only a very limited increase in the power budget.

Fig. 1. ATOMIUM script for storage/communication optimisation in the specification to be used for simulation and hardware/software synthesis. (Figure labels, in flow order: system specification; pruning/analysis; pruned system specification; data flow and loop transformations; optimized flow-graph; memory hierarchy and flow-graph balancing; extended/ordered flow-graph; allocation/assignment of signals to ports and memories; in-place optimisation; index expressions in reduced memories; address optimisation and address hardware generation; netlist of memories and address logic. Side labels: formal verification; #cycles; memory library (#, type, ports); cost weights w1 on #r/w and w2 on size (power/area); C-code generation; simulation; updated flow-graph; HBB lib; data-path/control synthesis.)

II. H.263 video decoding

H.263 is a draft recommendation for video coding for narrow telecommunication channels at < 64 kbit/s [9]. The coding/decoding is a block-based algorithm that exploits spatial and temporal redundancy. Three standard video formats are used in conjunction with H.263, called QCIF, Sub-QCIF, and CIF. A QCIF picture has 176 × 144 pixels, represented by 9 × 11 macroblocks. Each macroblock has six blocks of 8 × 8 pixels. This is due to the (4:2:0) decimation of chrominance values.

The picture that serves as the reference for prediction is called the P-picture. From the past P-picture, a future P-picture is predicted. This is called the forward P prediction. Interpolation between past and future P-pictures yields bidirectional B-pictures (see Fig. 2). A PB-frame consists of two pictures: a P-picture, which is predicted from the last decoded P-picture, and a B-picture, which is predicted from the last decoded P-picture and the P-picture currently being decoded. Parts of the B-picture may be bidirectionally predicted from the past and future P-pictures. For PB-frames the coding mode intra (I) implies that the P-blocks are intra coded and the B-blocks are inter coded, with prediction as for an inter block.

A decoder can be in one of three modes: I, P, or PB mode. Two extensions are orthogonal to the P and the PB modes: the unrestricted motion vector extension allows motion vectors pointing outside the frame, whereas in overlapped block motion compensation (OBMC), 4 extra motion vectors are used to compensate motion. When we refer in this paper to the P or PB mode, we assume that both extensions are in use.
Hence, the P and the PB mode refer to the two modes that are most energy consuming.

Fig. 2. Forward P, forward B and backward B predictions. (Figure labels: past P-picture, B-picture and future P-picture along a time axis, linked by forward P, forward B and backward B.)

III. Power model

For data intensive applications, such as video decoding, data transfers dominate the power consumption. Therefore the primary design goal is to reduce memory transfers between large frame memories and datapaths. The cost of a data transfer is a function of the memory size, the memory type, and the access frequency F_real. F_real is defined as the real number of accesses per second and not the clock frequency. When there is a clock tick and the memory is not accessed, it is assumed that the memory is in power-down mode. This assumption holds for most modern low-power RAMs [10]. The memory itself is characterised by the number of ports, words and bits, and by the aspect ratio of the layout. We make use of an accurate but proprietary power model from Texas Instruments for the power exploration. In this paper, only the number of transfers per frame/picture (directly related to F_real) will be discussed.

IV. The reference design

To obtain an acceptable reference, we have counted the number of transfers to the arrays that hold the past P, future P, and B pictures in the Telenor C implementation [11]. These numbers depend on the mode of reconstruction. The flow of data using all extensions is depicted in Fig. 3 using thin lines. The order of computation of pictures P[T-1], Pext[T-1], Pnew[T-1], P[T], B1[T], B2[T], and B[T] is shown in the figure. The dashed lines indicate which pictures share storage: Pnew[T-1] and P[T] are stored in the new frame array, whereas B1[T], B2[T] and B[T] are stored in a separate array for the B picture. The rectangles with a bold border are the final pictures after decoding. The thick line indicates that oldframe and the new frame array are interchanged after each decoded frame. In the C code, this is done by swapping the pointers to these two arrays. This reflects that main memory is not being wasted in the C implementation, because the simulation speed is also affected by this. The corresponding abstract organisation for the continuous P mode is shown in Fig. 4.

Fig. 3. Data flow for decoding PB-frames. (Figure labels: P[T-1] in oldframe; add border; Pext[T] in edgeframe; decoding; forward P; forward B; backward B; in-place switch; Pnew[T]; B1[T]; P[T]; B2[T]; B[T].)

Fig. 4. Reference memory organisation and worst-case number of transfers while decoding 1 PB-frame. (Figure labels: data-paths with read/write connections; edgeframe, 54,912 words × 9 bit, 1-port SRAM; two further 38,016 words × 9 bit 1-port SRAMs in a shared memory space; buffers.)

TABLE I. Frame memory transfers per picture in the reference C code. (Three sub-tables, the first for the old/new frame and the second for edgeframe, each with columns Mode, Worst-Case and Average; the numerical entries are missing.)

Table I lists the worst-case and average number of transfers to the frame memories per picture. The worst-case numbers are obtained analytically and not by simulation. This means that whenever code is executed conditionally, the conditions are assumed to branch to the most energy consuming option. For example, it is assumed that every macroblock is motion compensated. This is clearly a worst-case assumption. Mode 1 uses prediction with overlapped motion compensation and unrestricted motion vectors. In Mode 2, bidirectional prediction is also included, introducing the extra transfers to the B picture memory. An acceptable value for the average case was obtained by counting the accesses when simulating the decoding of the video stream suz. This stream contains 75 QCIF frames (which corresponds to 2.5 seconds of real-time video). The length of the stream depends on the encoding options and is listed in Table II together with the compression ratio.

The C code used as a reference is optimised to run as fast as possible on a given workstation. It is not optimised for efficient implementation, but it is the typical documentation that implementation groups start from. Mostly, a direct mapping of the algorithm and the data structures is made on a block diagram, and each block is then optimised locally and implemented efficiently. This is why most video decoders have a large external memory (with high bandwidth) that holds 3 complete images. Also, the memory interface typically becomes a big component of the design.
When compared to similar state-of-the-art video decoders [12], [13], [14], [15], [16], [17], [18], [19], which also include memory for three pictures, we believe that the number of accesses to these memories will be comparable if the bi-directional mode is considered. If the bi-directional mode is not used, the accesses will be comparable to those corresponding to the P-picture.
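
The bookkeeping behind the power model of Section III can be illustrated with a small sketch. The actual model is proprietary, so the energy function below is a purely hypothetical stand-in (growth with the square root of the number of words is an assumption made only for illustration). Only the structure follows the text: the real access frequency F_real times a size-dependent cost per access, summed over the memories. The names `Memory`, `f_real`, `energy_per_access` and `relative_power`, and the transfer counts in the example, are invented here.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Memory:
    name: str
    words: int                  # number of words in the memory
    transfers_per_picture: int  # reads plus writes per decoded picture


def f_real(mem: Memory, pictures_per_second: float) -> float:
    """Real access frequency: accesses per second, not the clock frequency."""
    return mem.transfers_per_picture * pictures_per_second


def energy_per_access(words: int) -> float:
    # HYPOTHETICAL placeholder: cost grows with the square root of the size.
    # The proprietary model used in the paper is not reproduced here.
    return sqrt(words)


def relative_power(memories, pictures_per_second: float) -> float:
    # Cycles without an access cost nothing: the memory is assumed to be
    # in power-down mode whenever it is not accessed.
    return sum(f_real(m, pictures_per_second) * energy_per_access(m.words)
               for m in memories)


# Illustrative numbers only: a QCIF-sized frame memory versus a small buffer
# receiving the same (made-up) number of transfers per picture.
frame = Memory("frame memory", words=38_016, transfers_per_picture=114_048)
small = Memory("local buffer", words=384, transfers_per_picture=114_048)

print(f_real(frame, 30))
print(relative_power([frame], 30) > relative_power([small], 30))
```

Under any cost function that increases with memory size, the same number of accesses is cheaper in the small buffer, which is the motivation for moving repeated accesses away from the frame memories.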

TABLE II. Number of words of 8 bit in the test stream suz and the compression factor. Mode d stands for "unrestricted motion vectors", Mode f stands for "Overlapped Block Motion Compensation (OBMC)" and Mode g stands for "bidirectional prediction". (Columns: Mode, Length, Compression. Rows: d, f, fd, g, gd, gf, gfd. The numerical entries are missing.)

V. System exploration for power

We will now give a summary of each of the main optimisations listed in section I. They have been applied starting from the initial mixed applicative and procedural DFL description of the video decoder. The high-level memory management methodology/script, partly supported by the prototype ATOMIUM environment [4], has been applied here.

A. Removal of the border

In order to accommodate unrestricted motion vectors, a complete border consisting of 44 macroblocks is added to the oldframe. It is not just filled with zeroes but with real data copies in a non-trivial way [9]. To simplify the control flow in the original C code, these data are duplicated in the frame signals (cf. edgeframe in Fig. 3) prior to the actual image manipulations, resulting in storage and transfer overhead both for reading and writing. This requires extra pixels to be stored for the border. To reduce this overhead, the dependences on the border data can be checked by (manifest) conditions on the position of the pixels to be read. Now, instead of storing and accessing duplicate data, the original pixels are read at the boundary rows/columns of the image frame. These guarding conditions have to be implemented in the controller and will steer the data-path. Usually, some local buffering is then also necessary. Several stages of optimisation are possible here, starting from a simple context-independent caching of the border data (which is apparently selected in most industrial designs) up to a heavily optimised context-dependent checking with reduced local buffering.
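
A minimal sketch of the guarded read described above. It assumes that the border copies replicate the nearest edge pixel; the text only states that the border is filled "in a non-trivial way", so this clamping rule is an assumption, as are the function names. The sketch also reproduces the border size of 44 macroblocks, which matches a border one macroblock wide around a QCIF picture of 9 × 11 macroblocks.

```python
MB_ROWS, MB_COLS = 9, 11  # QCIF picture size in macroblocks


def border_macroblocks(rows: int = MB_ROWS, cols: int = MB_COLS) -> int:
    """Macroblocks in a border one macroblock wide around the picture."""
    return (rows + 2) * (cols + 2) - rows * cols


def read_guarded(frame, y: int, x: int):
    """Read pixel (y, x) without a stored border.

    Coordinates outside the picture are redirected to the boundary
    row/column (ASSUMED rule: nearest edge pixel), so no duplicate
    border data has to be stored or transferred.
    """
    height, width = len(frame), len(frame[0])
    yc = min(max(y, 0), height - 1)  # guarding condition on the row
    xc = min(max(x, 0), width - 1)   # guarding condition on the column
    return frame[yc][xc]


# Tiny 3 x 4 "picture" to exercise the guards.
picture = [[10, 11, 12, 13],
           [20, 21, 22, 23],
           [30, 31, 32, 33]]

print(border_macroblocks())          # 44
print(read_guarded(picture, -5, 2))  # 12, taken from the top boundary row
print(read_guarded(picture, 1, 9))   # 23, taken from the right boundary column
```

In hardware, the two clamping expressions correspond to the guarding conditions that the controller evaluates before steering the data-path.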
All of these alternatives make the storage for the extra borders superfluous, but only the latter option allows all redundant picture accesses to be removed. If we assume that on average this reduction is about a factor of 8 (see footnote 1), we obtain an extra reduction in read accesses. This is however data-dependent. In terms of power consumption, our detailed models show that we obtain a saving of between 24% and 27% by the combined effect of fewer transfers and a reduced frame size. The gain in power comes at the price of an increased complexity of the code and a larger controller. Still, as the power consumed in the controller is quite small, the trade-off for power is clearly in favour of transforming all the border accesses. The resulting data flow without the border is depicted in Fig. 5.

Footnote 1: This is realistic because the motion vectors are in practice relatively small and not uniformly distributed between 0 and 15. If we assume a uniform range of (-7,7) for the motion, the number of pixels residing outside the oldframe border is on average about 1/8 of the pixels inside.

B. Loop and function restructuring to combine backward and forward P and B predictions

In the Telenor C code, decoding a PB-frame starts with decoding the incoming bit stream, which results in a P and a B macroblock containing differential errors in the frequency domain, and in motion information (Task 1 in Fig. 5). Next, the forward P and B predictions are performed based on the motion information (Tasks 2 and 3). This yields a forward predicted B and P block. Both blocks are directly stored in a picture, called B1[T] and Pnew[T] respectively. Then, the decoded P macroblock is transformed to the spatial domain by means of an inverse transform (Task 4). This P macroblock is added to the macroblock read back from picture Pnew[T] and stored in picture P[T] (Task 5). This picture, together with the macroblock stored in B1[T], is needed to do the backward B prediction (Task 6). The result is stored in picture B2[T]. This picture is also corrected with differential errors, in the same way as the P-picture (Tasks 7 and 8).

Fig. 5. Data flow after removal of the border and ordering of the main tasks while inter decoding a PB-frame. (Figure labels: P[T-1] in oldframe; decoding (1); forward B (2); forward P (3); Pnew[T]; B1[T]; tasks 4 and 5 leading to P[T]; backward B (6); B2[T]; tasks 7 and 8 leading to B[T].)

The figure also illustrates that instead of just producing a P and a B picture once, the pictures are read and written several times in the original description. More precisely, since B1[T], B2[T] and B[T] are stored in the same array, every pixel in it is written three times and read twice. Pnew[T] and P[T] also share one array, hence this picture memory is written twice and read once. Probably, the reason for this was to simplify the algorithmic description effort for the system designers.

As an illustrative example, we will now explain how global loop transformations and complex restructuring of the hierarchy in the code allow more locality of access to be created, based on the pseudo code for Tasks 2 and 3 in Fig. 5. The code for Forward P (Task 3) is:

    if "Advanced prediction mode" {
      for (comp=0; comp < 4; comp++) {
        P[T][0] = recon_comp_obmc(P[T-1][0], ..., comp);
      }
      P[T][1] = recon_comp(P[T-1][1], ...);
      P[T][2] = recon_comp(P[T-1][2], ...);
    } else {
      ...
    }

In this code the first index of signal P indicates the previous ([T-1]) or the current ([T]) prediction. The second index selects the luminance information Y (index [0]) or the chrominance information Cr (index [1]) and Cb (index [2]). The code for Forward B (Task 2) is:

    if "Advanced prediction mode" {
      if "Overlapped motion compensation" {
        B[T][0] = recon_comp(P[T-1][0], ..., comp);
      } else {
        ...
      }
      B[T][1] = recon_comp(P[T-1][1], ...);
      B[T][2] = recon_comp(P[T-1][2], ...);
    } else {
      ...
    }

Note that in this code the recon_comp function is issued once, instead of the four calls of the recon_comp_obmc function in the comp loop. When merging Tasks 2 and 3, we get:

    if "Advanced prediction mode" {
      for (comp=0; comp < 4; comp++) {
        P[T][0] = recon_comp_obmc(P[T-1][0], ..., comp);
        if (mode == MODE_INTER4V) {
          B[T][0] = recon_comp(P[T-1][0], ..., comp);
        } else {
          ...
        }
      }
      P[T][1] = recon_comp(P[T-1][1], ...);
      B[T][1] = recon_comp(P[T-1][1], ...);
      P[T][2] = recon_comp(P[T-1][2], ...);
      B[T][2] = recon_comp(P[T-1][2], ...);
    } else {
      ...
    }

The recon_comp, recon_comp_new and recon_comp_obmc functions perform different kinds of motion compensation depending on the motion vectors. Moreover, they are not embedded in the same loop scopes. However, with complex code restructuring it is possible to combine them. This class of optimisations is crucial because it enables further optimisation of the memory hierarchy, which is discussed hereafter in subsection V-C.

C. Memory hierarchy related optimisations

This step involves data flow transformations which introduce extra transfers between the different memory levels and which are used mainly to reduce the power cost. In particular, temporary values (to be assigned to a "lower" level) are added wherever a signal in a "higher" level is read more than once. The duplicate read is then performed on the lower level temporary signal. The same can happen in the other direction for writes. If a signal assigned to a higher level is composed of several contributions, it does not make sense to always update the final result in the higher level memory. Instead, it is usually better to perform the composition from the contributions consecutively (or in a close ordering) in a lower level (or in several levels in more complex situations) and then directly transfer the final result to the higher level. The principle of this buffering process on the macroblock access is shown in Fig. 6.

Fig. 6. Principle of 3 × 3 (macro-)block buffering between oldframe/edgeframe and the motion compensation routines, which act on the central block to be stored in the new frame. (Figure labels: P[T-1] in oldframe; 3x3 buffer; reconstruction; P[T].)

In Fig. 5, the reconstructed macroblocks are first written to Pnew[T-1]. After reconstruction, a correction is performed with the differential errors resulting from the inverse transform. This process is shown in the following pseudo code:

    for (macroblocknr=1 to 99) {
      (Pblock[][], Bblock[][]) = Decoding();
      (reconf_Pblock[][], reconf_Bblock[][]) = Forward_B&P_prediction(P[T-1]);
      "Store reconf_Pblock[][] in Pnew[T]";
      _Pblock[][] = (Pblock[][]);
      "Add _Pblock[][] to Pnew[T] and store in P[T]";
    }

This can be rewritten as:

    for (macroblocknr=1 to 99) {
      (Pblock[][], Bblock[][]) = Decoding();
      (recon_Pblock[][], recon_Bblock[][]) = Forward_B&P_prediction(buffer);
      _Pblock[][] = (Pblock[][]);
      new_Pblock[][] = recon_Pblock[][] + _Pblock[][];
      "Store new_Pblock[][] in P[T]";
    }

In this code, a buffer, called buffer, is created. Now the task Forward_B&P_prediction will read several times from buffer instead of from picture P[T-1]. This results in large power savings. Also, an extra block, called new_Pblock, is introduced. Therefore extra copies from this block to picture P[T] are necessary. Since these extra copies are situated at a lower memory hierarchy level, the global power consumption due to the memory transfers will still be reduced.

We now apply this principle to the decoding of a PB-frame, as depicted in Fig. 5. Instead of storing the forward predicted macroblock in Pnew[T], the result is stored in a buffer called reconf_Pblock. This buffer is corrected with the differential errors that result from the inverse transform, called _Pblock, to yield the final forward P prediction new_Pblock. This final block, together with the forward predicted B macroblock, called reconf_Bblock, and the motion information, is required for the backward reconstruction. Instead of reading from B1[T] and P[T], the backward reconstruction is based on the buffers reconf_Bblock and new_Pblock. The result is stored in the buffer reconb_Bblock instead of in B2[T]. As for the P macroblock, it is corrected with the differential errors in Bblock to yield the final B prediction block new_Bblock. Extra transfers are introduced to transfer the final block to the picture B[T] stored at the highest level.

The resulting data flow, when decoding a PB-frame after introducing extra memory hierarchy, is shown in Fig. 7. The pictures with a bold border are to be stored at the "highest" level of hierarchy. This level corresponds to the memory with the biggest transfer cost. Other, smaller buffers, such as buffer, Pblock, Bblock, reconf_Pblock, reconf_Bblock, new_Pblock, reconb_Bblock and new_Bblock, are stored at "lower" levels.
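
The effect of keeping the intermediate P macroblock in local blocks can be checked by counting accesses. The sketch below is only an accounting exercise under simplifying assumptions: it ignores the reads of P[T-1] needed for the prediction itself, treats every macroblock as six blocks of 8 × 8 words, and uses function names invented here.

```python
MACROBLOCKS = 99           # QCIF: 9 x 11 macroblocks
WORDS_PER_MB = 6 * 8 * 8   # six blocks of 8 x 8 words per macroblock


def reference_accesses():
    """Frame-memory (reads, writes) when Pnew[T] and P[T] live in the frame memory."""
    reads = writes = 0
    for _ in range(MACROBLOCKS):
        writes += WORDS_PER_MB  # store the forward prediction in Pnew[T]
        reads += WORDS_PER_MB   # read it back for the correction
        writes += WORDS_PER_MB  # store the corrected macroblock in P[T]
    return reads, writes


def buffered_accesses():
    """(frame reads, frame writes, local accesses) with local macroblock buffers."""
    frame_reads = frame_writes = local = 0
    for _ in range(MACROBLOCKS):
        local += WORDS_PER_MB         # write the prediction to the local block
        local += WORDS_PER_MB         # read it for the correction
        local += WORDS_PER_MB         # write new_Pblock
        local += WORDS_PER_MB         # read new_Pblock for the final store
        frame_writes += WORDS_PER_MB  # single store of new_Pblock in P[T]
    return frame_reads, frame_writes, local


print(reference_accesses())  # (38016, 76032)
print(buffered_accesses())   # (0, 38016, 152064)
```

Frame-memory accesses for this path drop from three per word to one, at the price of additional accesses to small local blocks, which are cheaper per access.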
In addition to this, many other similar optimisations have been performed for the different decoder modes (especially in the "overlapped-block motion compensation" mode).

Fig. 7. Modified dataflow after introducing memory hierarchy. (Figure labels: P[T-1] in oldframe; buffer; decoding (1); forward P&B (2); Pblock; Bblock; reconf_Pblock; reconf_Bblock; _Pblock; new_Pblock; backward B; _Bblock; reconb_Bblock; new_Bblock; P[T]; B[T]; tasks numbered 1 to 7.)

D. In-place storage of past and future P-pictures

In Fig. 8 (Left), the light gray area covers the portion of oldframe that is still needed for reconstruction. In Fig. 8 (Middle), the gray area covers pixels that have already been calculated. The array signals oldframe and the new frame can be stored in-place if the shaded area in Fig. 8 (Right) is stored in a buffer. Decoding the macroblock in row y and column x uses data that is stored in blocks with coordinates (y ± 1, x ± 1) in oldframe. The worst-case dependence, the one that needs most memory, corresponds to the dependence in the following pseudo code:

    for (y=1; y <= 9; y++) {
      for (x=1; x <= 11; x++) {
        Read block (y-1,x-1) from P[T-1];
        Predict block (y,x);
        Write block (y,x) in P[T];
      }
    }

Fig. 8. Putting oldframe and the new frame in-place.

Subtracting the consumption address $(y-1) \cdot 11 + x - 1$ from the production address $y \cdot 11 + x$,

$[y \cdot 11 + x] - [(y-1) \cdot 11 + x - 1] = 11 + 1 = 12$,

yields the number of blocks in the diagonally shaded intersection of Fig. 8 (Right). When introducing a first-in-first-out (FIFO) buffer of 12 + 1 macroblocks, picture P[T-1] and picture P[T] can be stored in-place in only 1 picture, the combined old/new frame:

    for (y=1; y <= 9; y++) {
      for (x=1; x <= 11; x++) {
        Read block (y-1,x-1) from old/new frame;
        Predict block (y,x);
        Pop block from buffer and store at position (y-1,x-1);
        Push block (y,x) in the buffer;
      }
    }

The buffer mechanism can be implemented by calculating the block addresses modulo 13 [20]. This results in a snake-like operation of the buffer, as illustrated in Fig. 9. The resulting data flow is depicted in Fig. 10, where the 13 macroblocks are shown in the pipeline of the snake. Implementing this dataflow, taking into account extra possibilities of memory hierarchy optimisations, leads to the detailed organisation depicted in Fig. 11. This in-place optimisation does not affect the number of background transfers but significantly reduces the total size of the background memories. This will result in a smaller area cost. The combined picture is only 13 macroblocks larger than one of the two pictures required initially.

Fig. 9. Principle of buffer update mechanism. (Figure labels: block (y-1,x-1) at T-1; block (y,x) at T; block (y,x) at T-1.)

Fig. 10. Data flow after in-place storage of oldframe and the new frame. (Figure labels: header decoding; old/new frame; decoding (1); forward P&B (2); backward B; tasks up to 7; B[T].)

Fig. 11. Detailed memory organisation of in-place storage. (Figure labels: new frame; new active; old active; old frame; recon; error.)

E. Relative impact of the different exploration steps

Fig. 12 gives an overview of the relative power consumption for each optimisation stage for the PB mode. This is the power consumed by the picture memories when decoding bidirectional B frames with unrestricted motion vectors and overlapped motion compensation. The power consumption is normalised with respect to the power consumption of the reference design. The power figures are based on worst-case assumptions. The bar chart shows that when all optimisations discussed in this paper are applied, the power consumption is reduced by a factor of 9.

Fig. 12. Relative power in continuous PB mode for each optimisation stage of H.263 frame access. (Bar labels: H263 reference (PB); Remove border; Combine back/forward; Combine; Local cache OBMC; Interpolation buffer; Inplace.)

Optimisations similar to those reported in this article have been applied to the H.263 decoder running in the P mode. They reduced the worst-case power consumption by a factor of 7. The main difference from the optimisations for the PB mode is the absence of optimisations related to bidirectional coding.

The optimisations described in this paper have been partially applied to the public domain C code from Telenor Research [11]. Simulation of the resulting C code, while decoding stream suz for all the different decoding modes, shows that the average power consumption of the memories is reduced to 57%. Note that in these simulations, the extra transfers due to an extra layer of hierarchy are taken into account.

VI. Power consumption of a representative data-path

A DFL specification of the algorithm of [21] was simulated and verified using Mentor's DSP Station. This specification was synthesised using our datapath synthesis tool Dolphin [22]. Dolphin synthesis has resulted in a VHDL netlist, which was mapped to the TI TGC2000 library using Synopsys' Design Analyzer and converted to a Verilog netlist. A net capacitance file for the design was generated using Synopsys' Design Analyzer tool. The Verilog netlist has been simulated for toggle counts using Cadence's Verilog-XL simulator. The average power consumption of the datapath was then computed using the net capacitance and the toggle count files. The computation of the power consumption of the memory unit in this module uses the power modelling described in section III. Table III lists the average power consumption of the module for the 3 video formats used in conjunction with H.263. The computation uses a frame rate of 30 frames per second to derive the smallest possible frequency of operation for the datapath and memory units.

TABLE III. Power consumption for the three video formats. Pt = normalised average power for the module; Pm = normalised average memory power at frequency fm; fm = smallest frequency (in MHz) at which the memory can be operated; Pd = normalised average datapath power in mW at frequency fd; fd = smallest frequency (in MHz) at which the datapath can be operated. (Columns: Format, Pt, Pm, fm, Pd, fd. Rows: QCIF, Sub-QCIF, CIF. The numerical entries are missing.)

This module is the most arithmetic dominant in the entire H.263 specification. Still, it has been shown that the power for a direct realisation with commercial logic synthesis and gate array circuits is about 2 orders of magnitude smaller than the power in the combined unoptimised frame accesses. So, initially ignoring this arithmetic in the system exploration is justified.

VII. Conclusion

We believe that the results described in this paper clearly substantiate the validity of the proposed high-level memory management methodology for data-dominated applications like the H.263 video decoder. They show the very promising results on power reduction which can be obtained by system level exploration, i.e. up to a factor of 9 of maximal power in the worst-case mode. The same methodology has also been applied to an MPEG-2 [23] video decoder, a medical back-projector [24] and a segment protocol processor of the common adaptation layer of ATM [24]. There, too, significant savings have been obtained. In the future, we will also explore the possibilities of these optimisations on a mixed software-hardware platform, as provided e.g. by the TI cDSP approach, which supports a single-chip heterogeneous design consisting of embedded cores, sea-of-gates logic and embedded memories.

Acknowledgements: We gratefully acknowledge the discussions with our colleagues and especially the contributions of E. De Greef, M. Eyckmans, P. Six and S. Wuytack. This research was partly sponsored by Texas Instruments Incorporated, Dallas, Texas.

References

[1] R.-H. Yan, L. Terman (eds.), "Special issue on Low Power Electronics," Proceedings of the IEEE, vol. 83, no. 4, pp. 495-700, April
[2] F. Catthoor, F. Franssen, S. Wuytack, L. Nachtergaele, and H. De Man, "Global Communication and Memory Optimizing Transformations for Low Power Signal Processing Systems," in VLSI Signal Processing VII, Jan Rabaey, Paul M. Chau, and John Eldon, Eds., New York, October 1994, IEEE workshop on VLSI signal processing, pp. 178-187, IEEE Press.
[3] Sven Wuytack, Francky Catthoor, Lode Nachtergaele, and Hugo De Man, "Power exploration for data dominated video applications," in International Symposium on Low Power Electronics and Design, Monterey, California, August 1996, pp. 359-364.
[4] Lode Nachtergaele, Francky Catthoor, Florin Balasa, Frank Franssen, Eddy De Greef, Hans Samsom, and Hugo De Man, "Optimization of memory organization and hierarchy for decreased size and power in video and image processing systems," in Records of the 1995 IEEE International Workshop on Memory Technology, Design and Testing, San Jose, California, August 1995, pp. 82-87.
[5] Eddy De Greef, Francky Catthoor, and Hugo De Man, "Memory organization for video algorithms on programmable signal processors," in Computer Design: VLSI in Computers & Processors. IEEE, October 1995, pp. 552-557.
[6] Lode Nachtergaele, Francky Catthoor, Bhanu Kapoor, Stefan Janssens, and Dennis Moolenaar, "Low power storage exploration for H.263 video decoder," in VLSI Signal Processing, November.
[7] P.N. Hilfinger, J. Rabaey, D. Genin, C. Scheers, and H. De Man, "DSP specification using the Silage language," in Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Albuquerque, NM, April 1990, pp. 1057-1060.
[8] F. Catthoor, M. Janssen, L. Nachtergaele, and H. De Man, "System-level data-flow transformations for power reduction in image and video processing," in Proceedings of the International Conference on Electronics, Circuits and Systems, Rhodos, Greece, October 1996, IEEE, pp. 1025-1028.
[9] Karel Rijkse, "Video coding for narrow telecommunication channels at < 64 kbit/s," Tech. Rep., Telenor R&D.
[10] Kiyoo Itoh, Katsuro Sasaki, and Yoshinobu Nakagome, "Trends in low-power RAM circuit technologies," Proceedings of the IEEE, vol. 83, no. 4, pp. 524-543, April.
[11] Digital Video Coding at Telenor R&D, "Telenor's H.263 software, version 1.3," February 1995, software/.
[12] Aldo Cugnini and Richard Shen, "MPEG-2 video decoder for the digital HDTV grand alliance system," IEEE Transactions on Consumer Electronics, vol. 41, no. 3, pp. 748-753, August.
[13] T. Demura et al., "A single-chip MPEG2 video decoder LSI," in International Solid-State Circuits Conference. IEEE, February 1994, pp. 72-73.
[14] D. Galbi et al., "An MPEG-1 audio/video decoder with runlength compressed antialiased video overlays," in International Solid-State Circuits Conference. IEEE, February 1995, pp. 289-287.
[15] GEC Plessey Semiconductor, "An overview of the H.261 video compression standard and its implementation in the GPS chipset," October.
[16] Michel Harrand, Michel Henry, Philippe Chaisemartin, Paul Mougeat, Yves Durand, Alain Tournier, Robin Wilson, Jean-Claude Herluison, Jean-Claude Langchambon, Jean-Luc Bauer, Michel Runtz, and Joseph Bulone, "A single chip videophone encoder/decoder," in International Solid-State Circuits Conference. IEEE, February 1995, pp. 292-293.
[17] Toshihiro Masaki, Yasuo Morimoto, Takao Onoye, and Isao Shirakawa, "VLSI implementation of inverse discrete cosine transformer and motion compensator for MPEG2 HDTV video decoding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 5, pp. 387-395, October.
[18] M. Toyokura et al., "A video DSP with a macroblock-level pipeline and a SIMD type vector-pipeline architecture for MPEG2 codec," in International Solid-State Circuits Conference. IEEE, February 1994, pp. 74-75.
[19] Shinobu Ueda, Y. Kiyose, Y. Kishida, S. Sotoda, M. Kawabata, T. Furukawa, and S. Kawabe, "Development of an MPEG2 decoder for magneto-optical disk video players," IEEE Transactions on Consumer Electronics, vol. 41, no. 3, pp. 521-527, August.
[20] J. Vanhoof, K. van Rompaey, I. Bolsens, G. Goossens, and H. De Man, High-Level Synthesis for Real-Time Digital Signal Processing, Kluwer Academic Publishers, Boston.
[21] W-H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Transactions on Communications, pp. 1004-1009, September.
[22] P. Schaumont, B. Van Thournout, I. Bolsens, and H. De Man, "Synthesis of pipelined DSP accelerators with dynamic scheduling," in Proceedings of the 8th International Symposium on System-Level Synthesis, Cannes, France, September 1995, ACM/IEEE, pp. 72-77.
[23] D. Moolenaar, "System specification and storage exploration for two video compression standards," M.S. thesis, Delft University, Delft, The Netherlands, May 1996, ftp://ftp.imec.be/pub/vsdm/reports/video codec optim/ MPEG2 code optim.ps.gz.
[24] F. Catthoor, L. Nachtergaele, and S. Wuytack, "Optimizing data transfers and memory for low power," accepted for publication in ASIC & EDA magazine, ftp://ftp.imec.be/pub/vsdm/reports/system lev power opt/ fc-asic eda96.ps.gz, 1997.
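
Illustrative note on the datapath power estimate described before the conclusion. The text combines a net capacitance file with simulated toggle counts to obtain average datapath power. A minimal sketch of such a computation is given below, assuming the standard switching-power relation in which each toggle of a net with capacitance C dissipates about 0.5·C·Vdd². The function name, the dictionary layout, the supply voltage, the clock frequency and all numbers are hypothetical and are not taken from the paper or from the file formats of the tools it names.

```python
"""Sketch: average dynamic power from per-net capacitance and toggle counts."""


def average_dynamic_power(net_cap_farads, toggle_counts, vdd_volts,
                          clock_hz, simulated_cycles):
    """Return the average switching power in watts.

    Assumption: each toggle of a net with capacitance C costs 0.5 * C * Vdd^2.
    The total switching energy is summed over all nets and divided by the
    simulated time, which is simulated_cycles / clock_hz.
    """
    if simulated_cycles <= 0 or clock_hz <= 0:
        raise ValueError("clock_hz and simulated_cycles must be positive")
    energy_joules = 0.0
    for net, cap in net_cap_farads.items():
        # Nets absent from the toggle report are treated as never switching.
        energy_joules += 0.5 * cap * vdd_volts ** 2 * toggle_counts.get(net, 0)
    simulated_seconds = simulated_cycles / clock_hz
    return energy_joules / simulated_seconds


# Hypothetical example data (not from the paper).
caps = {"n1": 50e-15, "n2": 120e-15, "n3": 80e-15}  # net capacitance in farads
toggles = {"n1": 4000, "n2": 1500}                    # n3 never toggles
power_w = average_dynamic_power(caps, toggles, vdd_volts=3.3,
                                clock_hz=10e6, simulated_cycles=10000)
print(f"average datapath power: {power_w * 1e3:.4f} mW")
```

For a fixed number of simulated cycles and fixed toggle counts, the estimate scales linearly with the clock frequency. This is why deriving the smallest possible frequency of operation from the 30 frames per second requirement matters for the reported figures.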

Lode Nachtergaele has been a member of the Multimedia Image Compression Systems (MICS) group since '96. This group is part of the application group of the VLSI Systems & Design Methodology (VSDM) division of the Interuniversity Micro-Electronics Center (IMEC). His current aim is to further distill an operational methodology that improves design times of embedded multimedia systems. The resulting design flow is fed back as a stepping stone for future application challenges. Ing. Nachtergaele received his degree of Industrial Engineer in 1989 from the Katholieke Industriele Hogeschool, Oostende, Belgium. In the same year he joined IMEC, starting his career in the group that worked on the Cathedral-II silicon compiler. There he was involved in the development of the Silage simulator S2C. In '92, he joined the System Exploration for Memory and Power (SEMP) group. Together with his colleagues, he worked on the ATOMIUM methodology, partly supported with prototype tools.

Francky Catthoor received the engineering degree and a Ph.D. in electrical engineering from the Katholieke Universiteit Leuven, Belgium, in 1982 and 1987 respectively. From September 1983 till June 1987 he was a researcher in the area of VLSI design methodologies for Digital Signal Processing, with Prof. Hugo De Man and Prof. Joos Vandewalle as Ph.D. thesis advisors. Since 1987, he has headed several research domains in the area of high-level and system synthesis techniques and architectural methodologies, all within the VLSI Systems & Design Methodology (VSDM) division at the Interuniversity Micro-Electronics Center (IMEC), Heverlee, Belgium. He is an assistant professor at the EE department of the K.U.Leuven. His current research activities belong to the field of architecture design methods and system-level exploration for power and area, mainly oriented towards memory management and global data transfer optimization. The major target application domains are real-time signal and data processing algorithms in image, video and end-user telecom applications, and data structure dominated modules in telecom networks. Both customized architectures and programmable multimedia processors are targeted. In 1986 he received the Young Scientist Award from the Marconi International Fellowship. He has been an associate editor for the IEEE Trans. on VLSI Systems since 1995 and for the Journal of VLSI Signal Processing since 1996.

Stefan Janssens has been with the System Exploration for Memory and Power (SEMP) group since '96. This group is part of the design technology group of the VLSI Systems & Design Methodology (VSDM) division of the Interuniversity Micro-Electronics Center (IMEC). He is currently focused on the application and evaluation of the ATOMIUM and ADOPT methodologies in industrial applications. Ing. Janssens received his degree of Industrial Engineer in 1996 and joined IMEC in the same year; before that, he was involved with IMEC for his thesis.

Dennis Moolenaar joined IMEC's Wireless Systems group in '96. This group is part of the application group of the VLSI Systems & Design Methodology (VSDM) division of the Interuniversity Micro-Electronics Center (IMEC). The mission of this group is to do research in future telecom systems on silicon. Ir. Moolenaar's current aim is to investigate the integration of multi-processor systems on a single chip. At the moment he is implementing a custom low-power multi-processor architecture for a DECT/GSM/DCS1800 multi-mode terminal. His interests are in processor architectures, multi-processor systems and low-power design. Before joining IMEC, he was involved with IMEC as a student for his internship and master's thesis.

Bhanu Kapoor received his B. Tech. degree in Electrical Engineering from the Indian Institute of Technology, Kanpur, India. He received his M.S. and Ph.D. degrees in Computer Science from the Southern Methodist University, Dallas, Texas, in 1990 and 1994, respectively. He is with the Corporate R&D labs of Texas Instruments Incorporated. His main research interests are in the areas of high performance and low power VLSI design and CAD tools, with an emphasis on algorithms and architectures for DSP applications. He is a member of the IEEE and the ACM.