Chinese Journal of Electronics
Vol.23, No.3, July 2014

DRA Audio Coding Standard

MA Wenhua 1, XU Jing 2, MA Yuanzhe 3 and YOU Yuli 4

(1. Department of Computers, Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou, China)
(2. Department of Science and Technology, Guangdong Rising Assets Management Co., Guangzhou, China)
(3. Department of Biomedical Science and Technology, South China University of Technology, Guangzhou, China)
(4. Guangdong Provincial Key Laboratory for Digital Audio Technology, Guangzhou, China)

Abstract: The DRA (Dynamic resolution adaptation) audio coding standard deploys a transient-localized MDCT to effectively suppress pre-echo artifacts and a statistical allocation of codebooks to improve the compression efficiency of Huffman coding. Its quantizers and Huffman codebooks are designed in such a way that a signal path of 24 bits is provided throughout the codec, so that high audio quality can be delivered if the bit rate suffices. Although simple, it delivers state-of-the-art compression efficiency, as shown by five rounds of ITU-R BS.1116 compliant subjective listening tests.

Key words: Audio coding, Standard, Listening test, Modified discrete cosine transform (MDCT), Huffman coding.

I. Introduction

There have been extensive standardization activities for audio coding in the past twenty years. MPEG-1 may be the first international standard for perceptual high-quality audio coding [1,2]. It is essentially a subband coder that deploys a 32-band QMF (Quadrature mirror filter bank). Its Layer 3 applies a switched MDCT (Modified discrete cosine transform) to the subband signals output from the QMF for increased frequency resolution. MPEG-1 was extended by MPEG-2 BC (Backward compatible) to provide for lower sample rates and multichannel surround sound [2,3]. MPEG-2 AAC abandons backward compatibility with MPEG-1 and MPEG-2 BC in order to achieve a significant improvement in compression efficiency [2,4].
AAC uses an MDCT that switches between 1024 and 128 MDCT coefficients. AAC is carried over into MPEG-4 AAC with the addition of more coding tools, such as Perceptual Noise Substitution and Long Term Prediction, and more coder configurations [5]. In order to handle low-bit-rate applications such as internet streaming effectively, the HE-AAC version 1 profile (HE-AAC v1) introduces Spectral band replication (SBR) to enhance the compression efficiency in the frequency domain [6], and the HE-AAC version 2 profile (HE-AAC v2) adds Parametric stereo (PS) to enhance the compression efficiency for stereo signals [7].

Dolby AC-3 is probably the most commercially successful audio coding standard [2,8]. It uses an MDCT that switches between 128 and 256 MDCT coefficients. WMA (Windows media audio) of Microsoft uses an MDCT that switches between 64, 128, 256, 512, 1024, and 2048 MDCT coefficients [9]. Vorbis, an open source codec offered by the Xiph.Org Foundation, uses an MDCT that switches between 256 and 1024 MDCT coefficients [10].

The DRA (Dynamic resolution adaptation) audio coding algorithm from Digital rise technologies was first adopted as an audio coding standard for the electronic industry by China's Ministry of Information Industry in 2007 [11] and quickly became a mandatory audio coding standard for China mobile multimedia broadcasting (CMMB), launched by China's State Administration of Radio, Film and Television [12]. In early 2009, it was adopted as part of the Blu-ray disc format by the Blu-ray Disc Association [13] and then as a national standard of China by China's State Standardization Commission [14]. It became a mandatory audio coding standard for China's Terrestrial DTV Broadcasting in 2011 [15].

This paper presents an overview of the DRA audio coding standard. Section II outlines its algorithmic architecture.
Section III describes its transient-localized MDCT, which provides improved pre-echo suppression with minimal bit and computational overhead, and Section IV explains its statistical approach to the allocation of Huffman codebooks, which enhances compression efficiency. Section V summarizes the results of five rounds of ITU-R BS.1116 compliant subjective listening tests administered by the Chinese government during DRA's standardization process, which indicate that the DRA audio coding standard delivers state-of-the-art compression efficiency. Section VI concludes this paper.

Manuscript Received Mar. 2013; Accepted May. This work is supported by the 2012 Guangdong Science and Technology Plan for Commercialization of Advanced and New Technologies (No.2012B ), Collaboration of Guangdong and Hong Kong for Major Break-Through in Key Technical Fields (No.2005A ) and by Fund for Guangdong Provincial Key Labs (No.2005B-60149).

II. Algorithmic Architecture

The encoder and decoder of the DRA audio coding standard are shown in Figs.1 and 2, respectively. It is apparent that the DRA audio coding algorithm is a simple, essentially bare-bones, adaptive transform coder.

Fig. 1. DRA encoder

Fig. 2. DRA decoder

The encoding process (see Fig.1) starts with the transformation of the input PCM samples into the frequency domain by the transient-localized MDCT. This is a simple and effective tool for suppressing the pre-echo artifacts that frequently plague audio coding algorithms, and it is discussed in Section III.

The resulting MDCT coefficients are optionally (deployable only at low bit rates) processed by joint channel coding: sum/difference coding and joint intensity coding. While the implementation of sum/difference coding is conventional, the joint intensity coding is a little different. Instead of joining stereo pairs, it joins all channels into the left channel, thereby providing a significant bit rate reduction when surround sound is involved.

Linear scalar quantization is used to quantize the MDCT coefficients. Toward this end, the MDCT coefficients are first segmented into quantization units, each of which is bounded in the frequency domain by the critical bands and in the time domain by MDCT blocks that are statistically similar. Each quantization unit is then assigned a quantization step size by the global bit allocation algorithm. This step size is subsequently used to uniformly (linearly) quantize all MDCT coefficients inside the unit.

The quantization step sizes are themselves logarithmically quantized with a step size of 0.2 dB. When the quantization step size is one, the maximum allowed quantization index is $\pm(2^{23}-1)$, and the Huffman codebooks are designed to accommodate this. Consequently, a signal path of up to 24 bits is provided throughout the codec, so that audio quality far exceeding the perceptual capability of the human ear can be delivered if the bit rate suffices.

The quantization indexes thus obtained are Huffman-encoded with codebooks assigned from a library by the statistics-adaptive codebook assignment tool, discussed in Section IV.

While transient detection, the perceptual model and global bit allocation are necessary components of a DRA encoder, they are not part of the DRA decoder and there is little, if any, restriction on their implementation. They are therefore not stipulated in the DRA standard and are not discussed here.

III. Transient-Localized MDCT

Audio signals are usually quasi-stationary with very strong tonal components, but are frequently interrupted by dramatic transients. It is therefore critical for an audio coding algorithm to employ a filter bank that can adapt its temporal-frequency resolution to this piecewise quasi-stationary nature of audio signals, i.e., one that has high frequency resolution during quasi-stationary episodes and high temporal resolution around transients.

One widely used filter bank is the MDCT [2,16], which can be described by the following basis function:

$$h(k,n) = w(n)\sqrt{\frac{2}{M}}\cos\left[\frac{\pi}{M}\left(n+\frac{M+1}{2}\right)\left(k+\frac{1}{2}\right)\right] \tag{1}$$

where $k = 0, 1, \ldots, M-1$; $n = 0, 1, \ldots, 2M-1$; and $w(n)$ is a window function of size $2M$. The temporal-frequency resolution is largely determined by $M$, which may be referred to as the block size. A large $M$ means low temporal resolution but high frequency resolution, while a small $M$ means high temporal resolution but low frequency resolution.

Many widely used algorithms, such as MPEG-1 Layer 3 [1], Dolby AC-3 [8], and MPEG-2/4 AAC [4,5], use two block sizes (L for long and S for short windows, respectively) in order to adapt their resolution to the input audio signal. The window functions corresponding to these two block sizes are illustrated as (a) and (b) in Fig.3. In order for the MDCT to switch properly between these two block sizes, the perfect reconstruction conditions [2,16] call for the use of the three transitional window functions illustrated in Fig.3 as (c), (d), and (e).
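To make the MDCT basis function above concrete, the following Python sketch evaluates it directly and checks that two windowed, 50%-overlapped blocks reconstruct the samples they share. It is a minimal illustration under stated assumptions, not part of the DRA standard: the sine window is assumed because it satisfies the perfect reconstruction conditions, the function names are hypothetical, and the direct summation stands in for the fast algorithms a practical codec would use.

```python
import math

def sine_window(M):
    # Assumed window: w(n)^2 + w(n+M)^2 = 1 and w is symmetric over 2M samples.
    return [math.sin(math.pi * (n + 0.5) / (2 * M)) for n in range(2 * M)]

def basis(k, n, M, w):
    # h(k, n) = w(n) * sqrt(2/M) * cos[(pi/M) * (n + (M+1)/2) * (k + 1/2)]
    return w[n] * math.sqrt(2.0 / M) * math.cos(
        math.pi / M * (n + (M + 1) / 2) * (k + 0.5))

def mdct(block, M, w):
    # 2M windowed input samples -> M coefficients.
    return [sum(block[n] * basis(k, n, M, w) for n in range(2 * M))
            for k in range(M)]

def imdct(coeffs, M, w):
    # M coefficients -> 2M windowed, time-aliased output samples.
    return [sum(coeffs[k] * basis(k, n, M, w) for k in range(M))
            for n in range(2 * M)]

def reconstruct_middle(x, M):
    # x holds 3M samples; two overlapping blocks cover the middle M samples.
    w = sine_window(M)
    y0 = imdct(mdct(x[0:2 * M], M, w), M, w)
    y1 = imdct(mdct(x[M:3 * M], M, w), M, w)
    # Overlap-add cancels the time-domain aliasing in the shared region.
    return [y0[M + n] + y1[n] for n in range(M)]

M = 8
x = [math.sin(0.37 * n) + 0.1 * n for n in range(3 * M)]
error = max(abs(r - x[M + n]) for n, r in enumerate(reconstruct_middle(x, M)))
print(error)  # expected to be at the level of floating-point rounding
```

Changing `M` in this sketch trades one resolution for the other: each block yields `M` frequency coefficients but spans `2 * M` time samples.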
Window function (e) in Fig.3 may not be needed if certain restrictions on the application of the short window are imposed.

Because transients in audio signals typically consist of no more than a few samples, the short block size for a frame with a detected transient should be a few samples as well, thereby matching the filter's temporal resolution to the transient. Unfortunately, such a choice results in extremely poor frequency resolution within the frame and is therefore inappropriate for the rest of the samples in the frame: those other samples, provided they are sufficiently far away from the transient, are quasi-stationary and are better processed with high frequency resolution. This conflict has conventionally resulted in a compromise short block size that is optimal neither for the transient samples nor for the quasi-stationary samples in the same frame. One consequence of this is pre-echo artifacts.

Fig. 3. Window functions

Temporal noise shaping (TNS) [17], used by AAC [5], is one of the pre-echo control methods [2]. It deploys linear prediction over the MDCT coefficients of the block containing the transient, in order to boost the coding gain for the block and also to shift some of the pre-echo quantization noise to behind the transient. TNS is computationally intensive because of the linear prediction, and the overhead for transferring the description of the predictor, including the prediction filter coefficients, is considerable.

The DRA audio coding standard [18,19] deploys a very simple method for pre-echo suppression: the transient samples are covered by a special short window, which has the same window span as the normal short window but a significantly shorter effective size. The shorter effective size provides better transient localization, while the normal window span leaves the frequency resolution unchanged. Pre-echo is better suppressed because transient localization reduces the spread of both the quantization noise and the high bit rates associated with transients.

1. Brief window functions

This special short window is illustrated as (f) in Fig.3; it is referred to as the brief window function and labeled WS B2B. It is still a short window of length S, the same length as the other windows within the frame. However, unlike those other windows, the brief window WS B2B uses only a central portion of its overall length for signal shaping, employing a number of leading and trailing zeros in order to improve its temporal resolution.
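The idea of a window that keeps its full span but shapes the signal only in a central portion can be sketched in a few lines of Python. This is a hedged illustration rather than the window defined by the standard: the default dimensions (a 256-sample span with a 160-sample nonzero core and 16-sample edges), the sine-shaped edges, the flat top and all function names are assumptions made here for demonstration.

```python
import math

def brief_window(span=256, effective=160, overlap=16):
    # Leading and trailing zeros shrink the effective size; the span is unchanged.
    pad = (span - effective) // 2
    flat = effective - 2 * overlap
    # Assumed edge shape: a quarter-period sine rise, mirrored for the fall.
    rise = [math.sin(math.pi * (i + 0.5) / (2 * overlap)) for i in range(overlap)]
    return [0.0] * pad + rise + [1.0] * flat + rise[::-1] + [0.0] * pad

def effective_size(window):
    # Number of samples between the first and last nonzero values, inclusive.
    nonzero = [i for i, v in enumerate(window) if v != 0.0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0

def spread_ms(samples, sample_rate_hz=48000):
    # Duration of a run of samples in milliseconds.
    return 1000.0 * samples / sample_rate_hz

w = brief_window()
print(len(w), effective_size(w))       # span and effective size in samples
print(spread_ms(len(w)), spread_ms(effective_size(w)))  # both in milliseconds
```

With these assumed dimensions the window still occupies 256 samples, so it fits the same block grid as an ordinary short window, while quantization noise can only spread over the 160 nonzero samples.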
The effective size of the brief window is obviously reduced. For example, if the short window size is S = 256 samples, the brief window WS B2B may be designed such that it is nonzero within the central 160 samples, with the first 16 and last 16 of those samples overlapping the respective adjacent transition windows, and with zeros for the first 48 and the last 48 samples of the window. Its effective window size is thus reduced from 256 to 160.

In order to switch to/from this brief window from/to the long (WL L2L) and short (WS S2S) windows, the perfect reconstruction conditions [2,16] call for the addition of transitional windows, which are illustrated as (g), (h), (i), (j), (k), (l), and (m) in Fig.3. Obviously, these windows and the brief window require more memory for storage.

The brief window is applied only to the block of samples containing a transient, while the short and/or the appropriate transition windows are applied to the quasi-stationary samples in the remainder of the frame.

2. Enhanced pre-echo control

The significantly reduced effective size of the brief window offers better pre-echo suppression for two reasons. The first is that a finer temporal resolution is applied to the transient samples, so the high bit rates associated with transients are confined to fewer samples. The second is that the spread of quantization noise is reduced. For the example above, where the short window size is S = 256 samples, the spread of quantization noise is 256 samples for the short window, but only 160 samples for the brief window. At a typical sample rate of 48 kHz, these amount to 5.33 and 3.33 ms, respectively. Given that significant premasking tends to last about 1 to 2 ms [2], subtracting the premasking duration from these spreads shows that the spread of quantization noise that may cause pre-echo is reduced from about 3.33-4.33 ms to about 1.33-2.33 ms, a significant reduction.

3. Window sequencing

Due to the increased number of windows compared with conventional methods, the determination of the appropriate window sequence is more involved, but a short procedure [18] that applies the perfect reconstruction conditions suffices. It is outlined as follows.

If no transient is detected within the current frame, a long window is selected. Its specific shape depends on the existence and location of any transient in the previous and subsequent frames, as shown in Table 1.

Table 1. Determination of long windows for a frame without detected transient

Previous frame | Current frame | Subsequent frame
No transient | WL L2L | No transient
No transient | WL L2S | Transient, but not in the first block
No transient | WL L2B | Transient in the first block
Transient not in last block | WL S2L | No transient
Transient not in last block | WL S2S | Transient, but not in the first block
Transient not in last block | WL S2B | Transient in the first block
Transient in last block | WL B2L | No transient
Transient in last block | WL B2S | Transient, but not in the first block
Transient in last block | WL B2B | Transient in the first block

If a transient has been detected in the current frame, the selection of windows is a little more involved. First, identify the location(s) of the transient(s). Around a transient, select a series of short windows according to the following principles:

WS B2B is applied to the block where a transient occurs, in order to improve the temporal resolution of the MDCT.

The window for the block immediately before this transient block has a designation of the form "...2B".

The window for the block immediately after this transient block has a designation of the form "B2...".

Consequently, the allowed placement of windows is summarized in Table 2.

Table 2. Allowed placement of windows around a block with a detected transient

Pre-transient | Transient | Post-transient
WL L2B | WS B2B | WL B2L
WL S2B | WS B2B | WL B2S
WL B2B | WS B2B | WL B2B
WS S2B | WS B2B | WS B2S
WS B2B | WS B2B | WS B2B

For the rest of the frame (away from the transient), the short window WS S2S should be deployed, except for the first and last blocks of the frame. For the first block of the frame, WS S2S should be used if there is no transient in the last block of the previous frame; otherwise, WS B2S should be used. For the last block of the frame, WS S2S should be used if there is no transient in the first block of the subsequent frame; otherwise, WS S2B should be used.

Some example window sequences are shown in Fig.4.
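The selection rules above can be expressed compactly in code. The Python sketch below is one possible reading of those rules, not the normative procedure of [18]: it assumes eight short blocks per frame, represents each frame by the set of block positions that contain a transient, and uses hypothetical function names. It also generalizes the short-window rules by letting each non-transient block take "B" on whichever side touches a transient block.

```python
BLOCKS_PER_FRAME = 8  # assumption: eight short blocks in one frame

def edge_letter(transients, edge_block):
    # "L": neighbouring frame has no transient; "B": transient in the block
    # adjacent to the current frame; "S": transient elsewhere in that frame.
    if not transients:
        return "L"
    return "B" if edge_block in transients else "S"

def long_window(prev_transients, next_transients):
    # Long-window label for a frame without a detected transient (Table 1).
    start = edge_letter(prev_transients, BLOCKS_PER_FRAME - 1)
    end = edge_letter(next_transients, 0)
    return f"WL {start}2{end}"

def short_windows(prev_transients, cur_transients, next_transients):
    # Short-window labels, one per block, for a frame with transient(s).
    def is_transient(block):
        if block < 0:  # last block of the previous frame
            return (BLOCKS_PER_FRAME + block) in prev_transients
        if block >= BLOCKS_PER_FRAME:  # first block of the subsequent frame
            return (block - BLOCKS_PER_FRAME) in next_transients
        return block in cur_transients

    labels = []
    for block in range(BLOCKS_PER_FRAME):
        if is_transient(block):
            labels.append("WS B2B")  # brief window on the transient block
        else:
            start = "B" if is_transient(block - 1) else "S"
            end = "B" if is_transient(block + 1) else "S"
            labels.append(f"WS {start}2{end}")
    return labels

print(long_window(set(), {0}))            # quiet frame before a first-block transient
print(short_windows(set(), {2}, set()))   # transient in the third block
```

Representing transient positions as sets keeps the two cases symmetric: the long-window rule only inspects the blocks adjacent to the current frame, and the short-window rule only inspects the two neighbours of each block.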
In Fig.4, (a) is an example of the conventional approach; (b) shows a transient occurring in the first block of the frame, so the brief window WS B2B is deployed for this block, WL L2B is deployed for the previous frame and WS B2S is deployed for the second block of the frame; (c) shows a transient occurring in the third block of the frame; (d) shows two transients occurring in the third and sixth blocks, so two brief windows are placed accordingly; and (e) shows a transient occurring in the last block of the frame.

Fig. 4. Window sequence examples. (a) Window sequence for conventional methods; (b) Window sequence for the proposed method when a transient occurs in the first block; (c) When a transient occurs in the third block; (d) When two transients occur in the third and sixth blocks; (e) When a transient occurs in the last block

4. Indication of window sequence to decoder

The encoder needs to indicate to the decoder the window(s) that it used to encode the current frame, so that the decoder can use the same window(s) to decode the frame. With the conventional methods, this can be accomplished by a window index. This is also true for the present method.

For a frame without a detected transient, one label from those in Table 1 needs to be transferred to the decoder.

For a frame with a detected transient, the window sequencing procedure outlined in Section III needs to know (1) the transient location(s), which the conventional methods also need, and (2) whether there is a transient in the first block of the current frame and of the subsequent frame, respectively. This latter information may be conveyed by the nomenclature WS Current2Subs, where Current (S = no, B = yes) identifies whether there is a transient in the first block of the current frame, and Subs (S = no, B = yes) identifies whether there is a transient in the first block of the subsequent frame. This is shown in Table 3. Obviously, the first column of Table 3 is the same set of labels used for the short windows, i.e., (a), (f), (g), and (h) in Fig.3.

Table 3. Encoding of the existence or absence of transient in the first block of the current/subsequent frame

Label | Transient in the first block of current frame | Transient in the first block of subsequent frame
WS B2B | yes | yes
WS B2S | yes | no
WS S2B | no | yes
WS S2S | no | no

Combining the labels in Table 1 and Table 3, we get the complete set of window labels shown in Fig.3, which is sufficient to convey the window sequencing information. The total number of window labels is now 13 (nine long and four short), while that number is 5 or 4 for the conventional approach (depending on whether or not window WL S2S is forbidden), so one or two more bits (4 versus 3 or 2) are needed to transfer the window sequencing information.

IV. Statistics-Adaptive Codebook Assignment

Another special feature of the DRA audio coding standard is the assignment of its Huffman codebooks. The conventional approach to codebook allocation usually assigns a single Huffman codebook to encode all MDCT coefficients within a quantization unit. This codebook is usually the smallest one that can accommodate the largest quantization index within the unit. Consequently, once the quantization step size is determined, all quantization indexes within the unit are determined as well, and so is the Huffman codebook; there is no other option.
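The conventional rule just described can be sketched in Python as follows. The codebook library here is hypothetical: `CODEBOOK_MAX_INDEX` lists made-up limits on the largest absolute quantization index each codebook can encode, and does not reflect the actual DRA codebooks. The function name and the sample data are likewise assumptions for illustration.

```python
# Hypothetical library: largest absolute quantization index per codebook,
# ordered from the smallest codebook to the largest.
CODEBOOK_MAX_INDEX = [1, 2, 4, 8, 15, 31, 63, 127, 255]

def smallest_codebook(indexes, limits=CODEBOOK_MAX_INDEX):
    # Return the position of the smallest codebook that can accommodate
    # the largest absolute quantization index in the given unit.
    peak = max((abs(q) for q in indexes), default=0)
    for book, limit in enumerate(limits):
        if peak <= limit:
            return book
    raise ValueError("index exceeds the largest codebook in this toy library")

# One quantization unit in which a single large index dominates.
unit = [0, 1, -1, 0, 2, 0, 1, -40, 0, 1, 0, -1]
print(smallest_codebook(unit))                               # forced by the outlier
print(smallest_codebook([q for q in unit if abs(q) < 40]))   # without the outlier
```

In this toy unit a single index of magnitude 40 dictates the codebook for all twelve indexes, even though the others would fit a much smaller one.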

Since the quantization indexes within a quantization unit do not necessarily share the same statistical properties, the traditional approach does not provide a good match, if any, between the statistical properties of the Huffman codebooks and those of the quantization indexes. This motivates a statistics-adaptive approach to codebook assignment, whose steps are outlined as follows:

(1) Group the quantization indexes into granules of four, and assign to each granule the smallest codebook that can accommodate the largest quantization index in the granule.

(2) Segment the indexes of these codebooks into large segments based on their local statistical properties.

(3) Select the largest codebook within each segment as the codebook for that segment.

The advantage of this approach is illustrated in Fig.5. Because the largest quantization index falls into quantization unit d, the conventional methods assign a large codebook to that unit, which is obviously not a good match because most of the indexes in the unit are much smaller. With the DRA approach, however, the largest quantization index is placed in segment C and so shares a codebook with other large quantization indexes. In addition, all quantization indexes in segment D are small, so a small codebook is selected. This obviously results in fewer bits for coding the quantization indexes.

Fig. 5. Advantages of statistic allocation of Huffman codebooks

The conventional methods only need to transfer the codebook indexes, because the scopes of codebook application are the same as the quantization units, which are predetermined. The DRA approach, however, needs to transfer the scopes of codebook application in addition to the codebook indexes, since the scopes are unrelated to the quantization units. This overhead must be contained in order for the DRA approach to offer any advantage, which can easily be achieved by imposing a limit on the number of segments.

V. Subjective Listening Tests

During its standardization process, the DRA audio coding standard went through five rounds of ITU-R BS.1116 [1] compliant subjective listening tests. The results are shown in Table 4.

Table 4. Results of ITU-R BS.1116 compliant subjective listening tests. All listening panels consisted of expert listeners, most of them young audio professionals

Test lab | Test date | Stereo 128 kbps | 320 kbps | 384 kbps
NTICRT | 08/ | | |
SLDST | 10/ | | |
SLDST | 01/ | | |
SLDST | 07/ | | |
SLDST | 08/ | | |

The bit rate in all these tests was specified as an upper limit that must not be exceeded in any frame. For example, if the sample rate is 48 kHz, a bit rate of 128 kbps translates into 2730 bits per frame (128000 × 1024 / 48000, rounded down), because a DRA frame consists of 1024 samples. No frame can use more than 2730 bits, and no bit reservoir is allowed.

The first test was conducted by the National Testing and Inspection Center for Radio and TV Products of China (NTICRT) in August. Ten stereo sound tracks, selected mostly from the SQAM CD [6], and five 5.1 surround sound tracks were used in the test. The test subjects were all expert listeners, consisting of conductors, musicians, recording engineers, and audio engineers.

The other four tests were all performed by the State Lab for DTV System Testing (SLDST) under China's State Administration for Radio, Film, and TV. Some of China's famous conductors, musicians and recording engineers were among the listening panels. Seniors and graduate students from the School of Recording Arts, Communication University of China, and the Department of Sound Recording, Beijing Film Academy, were found to possess better acuity due to their age and training, so they formed the majority in the later tests. Other than a few Chinese sound tracks, most of the test materials were selected from the SQAM CD [6] and a pool of surround sound tracks used by EBU and MPEG, including Pitch pipe, Harpsichord, and Elloit1.

VI. Conclusion

The DRA audio coding standard uses a transient-localized MDCT to effectively suppress pre-echo artifacts and a statistical allocation of codebooks to improve its entropy coding efficiency. Its quantizers and Huffman codebooks are designed in such a way that a signal path of 24 bits is provided throughout the codec, so that the highest audio quality can be delivered if the bit rate suffices. The final results of five ITU-R BS.1116 compliant subjective listening tests indicate that the DRA audio coding standard has achieved transparency for stereo at 128 kbps and for 5.1 surround at 384 kbps.

References

[1] ISO/IEC :1993 Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1,5 Mbit/s Part
[2] T. Painter and A. Spanias, Perceptual coding of digital audio, Proceedings of the IEEE, Vol.88, No.4, pp ,
[3] ISO/IEC :1997 Information Technology: Generic Coding
[4] ISO/IEC :1997 Information Technology: Generic Coding 7: Advanced Audio Coding (AAC).
[5] ISO/IEC :1999 Information Technology: Generic Coding
[6] ISO/IEC :2001 Information Technology: Generic Coding
[7] ISO/IEC :2005 Information Technology: Generic Coding
[8] Dolby Laboratories, Digital Audio Compression Standard A/52B, Advanced Television Systems Committee (ATSC),
[9] Windows media audio, Media Audio,
[10] Vorbis I specification, Xiph.org Foundation,
[11] SJ/T , Electronic Industry Standard: DRA Audio Coding Technology.
[12] GY/T , Mobile Multimedia Broadcasting, Part 2: Multiplexing.
[13] Q/DCHL , BD-ROM 2.3 Specification: DRA Audio Coding Technology.
[14] GB/T , National Standard: DRA Audio Coding Technology.
[15] GB/T , Terrestrial DTV Receiver Specification.
[16] John P. Princen and Alan B. Bradley, Analysis/synthesis filter bank design based on time domain aliasing cancellation, IEEE Transactions on ASSP, Vol.34, No.5, pp ,
[17] J. Herre and J.D. Johnston, Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS), 101st AES Convention, 1996, Preprint #4384.
[18] Yuli You, Audio Coding: Theory and Practice, Springer, USA,
[19] Yuli You and Wenhua Ma, Mobile Multimedia Broadcasting Standards: Technology and Practice, Springer, USA,

MA Wenhua is an associate professor at Guangdong University of Foreign Studies and a senior member of the Chinese Institute of Electronics. His research interests include signal processing and embedded systems. ( @qq.com)

XU Jing received the MBA and M.S. degrees from Sun Yat-Sen University and is the director of the Department of Science and Technology, Guangdong Rising Asset Management Co., Ltd., Guangzhou, China. His research interests include audio coding and streaming technologies.

MA Yuanzhe is pursuing the B.S. degree in biomedical engineering at South China University of Technology, Guangzhou, China.

YOU Yuli received the Ph.D. degree from the University of Minnesota, Twin Cities, Minneapolis, MN 55455, USA in 1995 and is the inventor of the DRA audio coding standard. His research interests include audio, image and video processing.