Multimedia Communications: Coding, Systems, and Networking. Prof. Tsuhan Chen MPEG Audio

18-796 ultimedia Communications: Coding, Systems, and Networking Prof. Tsuhan Chen tsuhan@ece.cmu.edu PEG Audio 1

Outline Basics Psychoacoustics Subband coding PEG-1 audio Layer I and II Layer III Frame structure and packetization PEG-2 audio ultichannel audio Backward compatible coding Non backward compatible coding Digital Audio Telephone Speech Wideband Speech ediumband Audio Wideband Audio Frequency Band (Hz) Sampling Rate (khz) Bits per Sample Raw Bitrate (kbits/s) 300~3400 8 8 64 50~7000 16 8 128 10~11000 24 16 384 10~22000 48 16 768 CD: 44.1 khz 16 bits 2 channels = 1.411 bits/s 2

Psychoacoustics Threshold in quiet 26 critical bands 0~24 khz Frequency masking in the same critical band Frequency asking SR (Signal-to-ask Ratio) 3

Temporal asking Post-asking: 50~200ms Pre-asking: 1/10 of post-masking Subband Coding H 1 (z) Q F 1 (z) H 2 (z) Q F 2 (z) H (z) Q F (z) Analysis Filterbank Synthesis Filterbank aximal downsampling Q should be based on signal-to-masking ratio (SR) Ear s critical bands are not uniform, but logarithmic The filter bank should match the critical bands Tree-structure filter bank (to be derived on board) 4

Subband Coding vs. DCT z -1 E(z) R(z) z z -1 z Polyphase Representation When E(z) = DCT matrix, this becomes DCT No overlap; blocking artifact odified DCT (DCT) 50% overlap; less blocking artifact PEG-1 Audio ISO/IEC 11172-3 (1988~1991) First high quality audio compression standard Sampling rates: 32, 44.1, 48 khz CD quality two-channel audio at ~256 kbits/s CD: 44.1 khz 16 bits 2 = 1.411 bits/s Quality demonstration (PEG-1 Layer II) Stereo 44.1 khz at 64 kbits/s Stereo 44.1 khz at 128 kbits/s Stereo 44.1 khz at 192 kbits/s Stereo 44.1 khz at 256 kbits/s 5

Block Diagram PC audio samples 32, 44.1, 48 khz analysis filterbank quantizer and coding frame packing encoded bitstream psychoacoustic model 11172-3 ancillary data Decoder Block Diagram encoded bitstream frame unpacking reconstruction synthesis filterbank PC audio samples 32, 44.1, 48 khz 11172-3 Decoder ancillary data 6

Layers Increasing complexity, delay, and quality Layer I: ~384 kbits/s for perceptually lossless quality (4:1) Layer II: ~192 kbits/s for perceptually lossless quality (8:1) Layer III: ~128 kbits/s for perceptually lossless quality (12:1) (for two channels) 100% perceptual lossless Layer I and II 32 Analysis Filterbank 512-tap Scaler & Quantizer ux FFT 512-pt for Layer I 1024-pt for Layer II/III asking Threshold Generator Dynamic Bit Allocator Coder 7

Block-Based Coding 12 12 12 Analysis Filterbank... Block: Layer I Superblock: Layer II/III 12 samples for Layer I, 36 samples for Layer II/III Block companding: Each block normalized by scalefactor For Layer II, up to 3 scalefactors, with 2-bit scalefactor select Each block/superblock receives one bit allocation Layer III Analysis Filterbank 6 or 18 with overlap DCT Scaler & Quantizer Huffman Coding ux FFT asking Threshold Generator Coding 8

Features in Layer III Hybrid filterbank DCT with filterbank Long/short window switching Short for better temporal resolution (to prevent pre-echoes) Long for better frequency resolution Nonuniform quantization Entropy coding Run-length and Huffman coding Bit reservoir (buffer) Frame Structure Header Info Side Info Subband Sanples Aux Data Header info: Sync bits, system info, CRC (cyclic redundancy code) Side info: bit allocation, scalefactor, (and scalefactor select for Layer II and III) Subband samples: 32 12 for Layer I, 32 36 for Layer II and III Packetization: 4-byte header, 184-byte payload 9

Stereo Redundancy Coding Four modes: mono, stereo, dual with two separate channel, joint stereo Joint stereo mode Human stereo perception > 2kHz is based on envelope Intensity stereo coding > 2kHz Encode (L + R) Assign independent left- and right- scalefactors Layer III supports (L+R) and (L R) coding PEG-2 Audio ISO/IEC 13818-3 Allows lower sampling rates 16, 22.05, and 24 khz: about half of PEG-1 From wideband speech to mediumband audio Higher frequency resolution Layer I, II, and III ultichannel coding 2~5 channels; surround sound, multilingual, for visual/hearing-impaired Backward compatible and non-backward compatible coding (13818-7: PEG-2 AAC) 10

ultichannel Audio 2/0-stereo 3/0 3/1 Surround 3/2 3/2 with woofer (5.1 system) LFE: Low-frequency enhancement (woofer) 15~120 Hz Can be anywhere Compatibility Forward compatibility A new decoder can decode an old bitstream Usually simple to achieve Backward compatibility An old decoder can decode a new bitstream, at least partially Usually limits the coding efficiency 11

PEG-2 Backward Compatible Audio Coding PEG-1 Header PEG-1 Data PEG-1 Ancillary Data PEG-1/2 Frame PEG-2 Header PEG-2 Data L C R LS RS atrix L0 R0 T3 T4 T5 PEG-1 PEG-2 Extension ux L0 = α ( L + R0 = α ( R + β C + δ LS) β C + δ RS) α = 1+ 1 1 ; β = δ = or α = 1; β = δ = 0 2 2 Backward Compatible Audio Coding (cont.) L C R LS RS atrix L0 R0 T3 T4 T5 PEG-1 PEG-2 Extension ux Demux PEG-1 Decoder PEG-2 Extension Decoder L0 R0 T3 T4 T5 Inverse atrix L C R LS RS atrixing Dematrixing 12

Non Backward Compatible (NBC) Coding PEG-2 Advanced Audio Coding (AAC) ISO/IEC 13818-7 (April 1997) 320~384 kbits/s for 5 channels, 64kbits/channel NBC at 320 kbits/s as good as BC coding at 640 kbits/s 1~48 audio channels, 0~16 LFEs, 0~16 data streams Same framework (perceptual subband coding) as PEG-1, with some enhancements PEG-2 AAC Enhancements Preprocessing High resolution filterbanks Legend Data Control Noiseless Decoding Inverse Quantizer 1024-line DCT / 128 Temporal noise shaping (TNS): time-dependent quantization Coupling channel 13818-7 Coded Audio Stream Bitstream Demultiplex Scale Factors /S Prediction Intensity multichannel coding Backward adaptive prediction in subbands /S stereo coding Noiseless coding (entropy coding): Huffman coding Intensity/ Coupling TNS Filter Bank Gain Control Output Time Signal 13

Input time signal Perceptual odel Gain Control Legend Filter Bank Data Control TNS Intensity/ Coupling Iteration Loops Quantized Spectrum of Previous Frame Prediction /S Bitstream ultiplex 13818-7 Coded Audio Stream Scale Factors Rate/Distortion Control Process Quantizer Noiseless Coding PEG-2 AAC Profiles ain Low Complexity ain profile Best quality, highest complexity 1024 or 128 DCT Low-complexity profile No temporal noise shaping, no prediction Scalable sampling-rate profile Scalable output sampling rates and complexity Uses hybrid filterbanks (like PEG-1 Layer III) No prediction, no coupling channel Scaleable Sampling Rate 20 khz 18 khz 12 khz 6 khz 14

Simcast To achieve backward compatibility at the cost of higher bitrate L0 R0 PEG-1 PEG-1 Decoder L0 R0 L C R LS RS PEG-2 AAC ux Demux PEG-2 AAC Decoder L C R LS RS References Peter Noll, PEG digital audio coding, IEEE Signal Processing agazine, Sept. 1997, pp. 59-81 D. Pan, A tutorial on PEG/audio compression, IEEE ultimedia, v. 2, no. 2, 1995, pp. 60-74 http://www.mpeg.org/peg/audio.html http://www.cselt.it/mpeg/faq/faq-audio.htm http://www.tnt.uni-hannover.de/project/mpeg/audio/ 15