Lecture 7: Audio compression and coding

Transcription

1 EE E682: Speech & Audio Processing & Recognition Lecture 7: Audio compression and coding Dan Ellis Michael Mandel Columbia University Dept. of Electrical Engineering dpwe/e682 March 31, 29 1 Information, Compression & Quantization 2 Speech coding 3 Wide-Bandwidth Audio Coding E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 1 / 37

2 Outline 1 Information, Compression & Quantization 2 Speech coding 3 Wide-Bandwidth Audio Coding E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 2 / 37

3 Compression & Quantization How big is audio data? What is the bitrate? F s frames/second (e.g. 8 or 441) x C samples/frame (e.g. 1 or 2 channels) x B bits/sample (e.g. 8 or 16) F s C B bits/second (e.g. 64 Kbps or 1.4 Mbps) bits / frame CD Audio 1.4 Mbps 32 8 Mobile 8 441!13 Kbps frames / sec Telephony 64 Kbps How to reduce? lower sampling rate less bandwidth (muffled) lower channel count no stereo image lower sample size quantization noise Or: use data compression E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 3 / 37

4 Data compression: Redundancy vs. Irrelevance Two main principles in compression: remove redundant information remove irrelevant information Redundant information is implicit in remainder e.g. signal bandlimited to 2kHz, but sample at 8kHz can recover every other sample by interpolation: In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples sample Irrelevant info is unique but unnecessary e.g. recording a microphone signal at 8 khz sampling rate time E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 4 / 37

5 Irrelevant data in audio coding For coding of audio signals, irrelevant means perceptually insignificant an empirical property Compact Disc standard is adequate: 44 khz sampling for 2 khz bandwidth 16 bit linear samples for 96 db peak SNR Reflect limits of human sensitivity: 2 khz bandwidth, 1 db intensity sinusoid phase, detail of noise structure dynamic properties - hard to characterize Problem: separating salient & irrelevant E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 5 / 37

6 Quantization Represent waveform with discrete levels Q[x] x Q[x[n]] Equivalent to adding error e[n]: x[n] = Q [x[n]] + e[n] e[n] uncorrelated, uniform white noise p(e[n]) -D /2 +D /2 x[n] -2 error e[n] = x[n] - Q[x[n]] variance σ 2 e = D2 12 E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 6 / 37

7 Quantization noise (Q-noise) Uncorrelated noise has flat spectrum With a B bit word and a quantization step D max signal range (x) = (2 B 1 ) D... (2 B 1 1) D quantization noise (e) = D/2... D/2 Best signal-to-noise ratio (power) SNR = E[x 2 ]/E[e 2 ] = (2 B ) 2.. or, in db, 2 log 1 2 B 6 B db level / db -2-4 Quantized at 7 bits freq / Hz E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 7 / 37

8 Redundant information Redundancy removal is lossless Signal correlation implies redundant information e.g. if x[n] = x[n 1] + v[n] x[n] has a greater amplitude range uses more bits than v[n] sending v[n] = x[n] x[n 1] can reduce amplitude, hence bitrate x[n] - x[n-1] but: white noise sequence has no redundancy... Problem: separating unique & redundant E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 8 / 37

9 Optimal coding Shannon information: An unlikely occurrence is more informative p(a) =.5 p(b) =.5 ABBBBAAABBABBABBABB p(a) =.9 p(b) =.1 AAAAABBAAAAAABAAAAB A, B equiprobable A is expected; B is big news Information in bits I = log 2 (probability) clearly works when all possibilities equiprobable Optimal bitrate av.token length = entropy H = E[I ].. equal-length tokens are equally likely How to achieve this? transform signal to have uniform pdf nonuniform quantization for equiprobable tokens variable-length tokens Huffman coding E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 9 / 37

10 Quantization for optimum bitrate Quantization should reflect pdf of signal: p(x = x ) p(x < x ) x' cumulative pdf p(x < x ) maps to uniform x Or, codeword length per Shannon log 2 (p(x)): p(x) x Shannon info / bits Codewords xx 11111xx 1111xx 11xx 1xx xx 11xx 111xx 11111xx xx xx xx xx Huffman coding: tree-structured decoder E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 1 / 37

11 Vector Quantization Quantize mutually dependent values in joint space: 3 2 x x May help even if values are largely independent larger space x1,x2 is easier for Huffman E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

12 Compression & Representation As always, success depends on representation Appropriate domain may be naturally bandlimited e.g. vocal-tract-shape coefficients can reduce sampling rate without data loss In right domain, irrelevance is easier to get at e.g. STFT to separate magnitude and phase E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

13 Aside: Coding standards Coding only useful if recipient knows the code! Standardization efforts are important Federal Standards: Low bit-rate secure voice: FS115e: LPC Kbps FS116: 4.8 Kbps CELP ITU G.x series (also H.x for video) G.726 ADPCM G.729 Low delay CELP MPEG MPEG-Audio layers 1,2,3 (mp3) MPEG 2 Advanced Audio Codec (AAC) MPEG 4 Synthetic-Natural Hybrid Codec More recent standards proprietary: WMA, Skype... Speex... E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

14 Outline 1 Information, Compression & Quantization 2 Speech coding 3 Wide-Bandwidth Audio Coding E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

15 Speech coding Standard voice channel: analog: 4 khz slot ( 4 db SNR) digital: 64 Kbps = 8 bit µ-law x 8 khz How to compress? Redundant signal assumed to be a single voice, not any possible waveform Irrelevant need code only enough for intelligibility, speaker identification (c/w analog channel) Specifically, source-filter decomposition vocal tract & f change slowly Applications: live communications offline storage E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

16 Channel Vocoder (194s-196s) Basic source-filter decomposition filterbank breaks into spectral bands transmit slowly-changing energy in each band Encoder Bandpass filter 1 Smoothed energy Downsample & encode E 1 Decoder Bandpass filter 1 Input Bandpass filter N Smoothed energy Downsample & encode E N Bandpass filter 1 Output Voicing analysis V/UV Pitch Pulse generator Noise source 1-2 bands, perceptually spaced Downsampling? Excitation? pitch / noise model or: baseband + flattening... E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

17 LPC encoding The classic source-filter model Input s[n] Encoder Filter coefficients {ai} 1 /A(e j! ) LPC analysis f Residual e[n] t Represent & encode Represent & encode Decoder Excitation generator e[n] ^ All-pole filter H(z) = "a i z -i Output s[n] ^ Compression gains: filter parameters are slowly changing excitation can be represented many ways Filter parameters Excitation/pitch parameters 5 ms 2 ms E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

18 Encoding LPC filter parameters For communications quality : 8 khz sampling (4 khz bandwidth) 1th order LPC (up to 5 pole pairs) update every 2-3 ms 3-5 param/s Representation & quantization {ai } - poor distribution, can t interpolate reflection coefficients {ki }: guaranteed stable LSPs - lovely! a i k i f Li Bit allocation (filter): GSM (13 kbps): 8 LARs x 3-6 bits / 2 ms = 1.8 Kbps FS116 (4.8 kbps): 1 LSPs x 3-4 bits / 3 ms = 1.1 Kbps E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

19 Line Spectral Pairs (LSPs) LSPs encode LPC filter by a set of frequencies Excellent for quantization & interpolation Definition: zeros of P(z) = A(z) + z p 1 A(z 1 ) Q(z) = A(z) z p 1 A(z 1 ) z = e jω z 1 = e jω A(z) = A(z 1 ) on u.circ. P(z), Q(z) have (interleaved) zeros when {A(z)} = ± {z p 1 A(z 1 )} reconstruct P(z), Q(z) = i (1 ζ iz 1 ) etc. A(z) = [P(z) + Q(z)]/2 A(z -1 ) = A(z) = Q(z) = P(z) = E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

20 Encoding LPC excitation Excitation already better than raw signal: 5 Original signal LPC residual time / s save several bits/sample, but still > 32 Kbps Crude model: U/V flag + pitch period 7 bits / 5 ms = 1.4 Kbps 2.4 Kbps Pitch period values 16 ms frame boundaries time / s Band-limit then re-extend (RELP) E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 2 / 37

21 Encoding excitation Something between full-quality residual (32 Kbps) and pitch parameters (1.4 kbps)? Analysis by synthesis loop: Input s[n] LPC analysis Filter coefficients A(z) Excitation code Synthetic excitation c[n] control Pitch-cycle predictor + b z NL x[n] ^ LPC filter 1 A(z) + MSError minimization Excitation encoding Perceptual weighting A(z) W(z)= A(z/!) Perceptual weighting discounts peaks: gain / db 2 1/A(z) 1 1/A(z/!) -1 A(z) W(z)= A(z/!) ^x[n]*h a [n] - s[n] freq / Hz E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

22 Multi-Pulse Excitation (MPE-LPC) Stylize excitation as N discrete pulses mi 5 original LPC residual multi-pulse excitation ti time / samps encode as N (ti, m i ) pairs Greedy algorithm places one pulse at a time: E pcp = A(z) [ ] X (z) A(z/γ) A(z) S(z) = X (z) A(z/γ) R(z) A(z/γ) R(z) is residual of target waveform after inverse-filtering cross-correlate hγ and r h γ, iterate E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

23 CELP Represent excitation with codebook e.g. 512 sparse excitation vectors Codebook Search index excitation time / samples linear search for minimum weighted error? FS Kbps CELP (3ms frame = 144 bits): 1 LSPs 4x4 + 6x3 bits = 34 bits Pitch delay 4 x 7 bits = 28 bits Pitch gain 4 x 5 bits = 2 bits Codebk index 4 x 9 bits = 36 bits Codebk gain 4 x 5 bits = 2 bits 138 bits E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

24 Aside: CELP for nonspeech? CELP is sometimes called a hybrid coder: originally based on source-filter voice model CELP residual is waveform coding (no model) freq / khz 4 3 Original (mrzebra-8k) CELP does not break with multiple voices etc. just does the best it can kbps buzz-hiss 4.8 kbps CELP LPC filter models vocal tract; also matches auditory system? i.e. the source-filter separation is good for relevance as well as redundancy? E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37 time / sec

25 Outline 1 Information, Compression & Quantization 2 Speech coding 3 Wide-Bandwidth Audio Coding E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

26 Wide-Bandwidth Audio Coding Goals: transparent coding i.e. no perceptible effect general purpose - handles any signal Simple approaches (redundancy removal) Adaptive Differential PCM (ADPCM) X[n] + + D[n] Adaptive quantizer C[n] = Q[D[n]] C[n] X p [n] Predictor X p [n] = F[X'[n-i]] X'[n] D'[n] Dequantizer D'[n] = Q -1 [C[n]] as prediction gets smarter, becomes LPC e.g. shorten - lossless LPC encoding Larger compression gains needs irrelevance hide quantization noise with psychoacoustic masking E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

27 Noise shaping Plain Q-noise sounds like added white noise actually, not all that disturbing.. but worst-case for exploiting masking Have Q-noise scale with signal level i.e. quantizer step gets larger with amplitude minimum distortion for some center-heavy pdf mu-law quantization mu(x) Or: put Q-noise around peaks in spectrum key to getting benefit of perceptual masking x - mu(x) level / db 2-2 Signal Linear Q-noise -4-6 Transform Q-noise freq / Hz E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

28 Subband coding Idea: Quantize separately in separate bands Input Encoder Bandpass Analysis filters f Decoder Downsample Quantize M Q[ ] Q -1 [ ] M f Reconstruction Output (M channels) + filters f M Q[ ] Q -1 [ ] M Q-noise stays within band, gets masked Critical sampling 1/M of spectrum per band gain / db -1-2 alias energy normalized freq 1 some aliasing inevitable Trick is to cancel with alias of adjacent band quadrature-mirror filters E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

29 MPEG-Audio (layer I, II) Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands Input 32 band polyphase filterbank Psychoacoustic masking model Scale normalize Scale normalize Scale indices Per-band masking margins Quantize Quantize Format & pack bitstream Bitstream Data frame Control & scales 32 chans x 36 samples 24 ms 22 khz 32 equal bands = 69 Hz bandwidth 8 / 24 ms frames = 12 / 36 subband samples fixed bitrates Kbps/chan (1-6 bits/samp) scale factors are like LPC envelope? E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

30 MPEG Psychoacoustic model Based on simultaneous masking experiments Difficulties: noise energy masks 1 db better than tones masking level nonlinear in frequency & intensity complex, dynamic sounds not well understood Procedure pick tonal peaks in NB FFT spectrum remaining energy noisy peaks apply nonlinear spreading function sum all masking & threshold in power domain SPL / db SPL / db SPL / db Signal spectrum Tonal peaks Masking spread Non-tonal energy Resultant masking freq / Bark E682 (Ellis & Mandel) L7: Audio compression and coding March 31, 29 3 / 37

31 MPEG Bit allocation Result of psychoacoustic model is maximum tolerable noise per subband Masking tone level SNR ~ 6 B Masked threshold Safe noise level Quantization noise Subband N freq safe noise level required SNR bits B Bit allocation procedure (fixed bit rate): pick channel with worst noise-masker ratio improve its quantization by one step repeat while more bits available for this frame Bands with no signal above masking curve can be skipped E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

32 MPEG Audio Layer III Transform coder on top of subband coder Digital Audio Signal (PCM) (768 kbit/s) Subband 31 Filterbank 32 Subbands MDCT FFT 124 Points Window Switching Line 575 Psychoacoustic Model Distortion Control Loop Nonuniform Quantization Rate Control Loop Huffman Encoding Coding of Sideinformation External Control Bitstream Formatting CRC-Check Coded Audio Signal 192 kbit/s 32 kbit/s Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples more redundancy in spectral domain finer control e.g. of aliasing, masking scale factors now in band-blocks Fixed Huffman tables optimized for audio data Power-law quantizer E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

33 Adaptive time window Time window relies on temporal masking single quantization level over 8-24 ms window Nightmare scenario: Pre-echo distortion backward masking saves in most cases Adaptive switching of time window: window level normal transition short time / ms E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

34 The effects of MP3 Before & after: freq / khz Josie - direct from CD 5 freq / khz After MP3 encode (16 kbps) and decode 5 freq / khz Residual (after aligning 1148 sample delay) chop off high frequency (above 16 khz) occasional other time-frequency holes quantization noise under signal time / sec E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

35 MP3 & Beyond MP3 is transparent at 128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate) only decoder is standardized: better psychological models better encoders MPEG2 AAC rebuild of MP3 without backwards compatibility 3% better (stereo at 96 Kbps?) multichannel etc. MPEG4-Audio wide range of component encodings MPEG Audio, LSPs,... SAOL synthetic component of MPEG-4 Audio complete DSP/computer music language! how to encode into it? E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

36 Summary For coding, every bit counts take care over quantization domain & effects Shannon limits... Speech coding LPC modeling is old but good CELP residual modeling can go beyond speech Wide-band coding noise shaping hides quantization noise detailed psychoacoustic models are key E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37

37 References Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. Wiley, July ISBN D. Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):6 74, Karlheinz Brandenburg. MP3 and AAC explained. In Proc. AES 17th International Conference on High Quality Audio Coding, pages 1 12, T. Painter and A. Spanias. Perceptual coding of digital audio. Proc. IEEE, 8(4): , Apr. 2. E682 (Ellis & Mandel) L7: Audio compression and coding March 31, / 37