Audio Coding, Psychoacoustic Model and MP3

1 INF5081: Multimedia Coding and Applications
Audio Coding, Psychoacoustic Model and MP3, NR
Torbjørn Ekman, Ifi
Nils Christophersen, Ifi
Sverre Holm, Ifi

What is Sound?
Sound waves: 20 Hz - 20 kHz
Speed: m/s (air)
Wavelength: 165 cm cm

2 Threshold for audible sound
Reference: 20 µPa = $2 \cdot 10^{-5}$ N/m²

Analogue audio
Frequencies: 20 Hz - 20 kHz
Mono: $x(t)$ is a scalar
Stereo: $x(t) = \begin{pmatrix} x_r(t) \\ x_l(t) \end{pmatrix}$

3 Dynamics compression
A-law:
$$S' = \begin{cases} \operatorname{sign}(S)\,\dfrac{A\,|S|}{1+\ln A} & \text{for } |S| \le \dfrac{1}{A} \\[1ex] \operatorname{sign}(S)\,\dfrac{1+\ln(A\,|S|)}{1+\ln A} & \text{else} \end{cases}$$
µ-law:
$$S' = \operatorname{sign}(S)\,\frac{\ln(1+\mu\,|S|)}{\ln(1+\mu)}, \qquad \mu = 255$$

Audio Compression
Goal: small files, low data rate at transmission
The reconstruction must be as similar as possible to the original signal
Redundancy: lossless coding
Irrelevancy: do not code what you cannot hear
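The two companding curves on this slide can be tried out numerically. The following is a minimal sketch, not a codec implementation: µ = 255 is the value from the slide, while A = 87.6 is an assumption (a commonly quoted A-law constant that the slide does not state), and the function names are made up for illustration.

```python
import math

MU = 255.0   # value given on the slide
A = 87.6     # assumption: commonly quoted A-law constant, not stated on the slide

def mu_law(s, mu=MU):
    """Compress s in [-1, 1] with the mu-law curve."""
    return math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)

def mu_law_inverse(c, mu=MU):
    """Expand a mu-law value c in [-1, 1] back to the linear domain."""
    return math.copysign(((1.0 + mu) ** abs(c) - 1.0) / mu, c)

def a_law(s, a=A):
    """Compress s in [-1, 1] with the two-piece A-law curve."""
    x = abs(s)
    if x <= 1.0 / a:
        y = a * x / (1.0 + math.log(a))                    # linear part near zero
    else:
        y = (1.0 + math.log(a * x)) / (1.0 + math.log(a))  # logarithmic part
    return math.copysign(y, s)

# Small magnitudes are expanded, so they keep more quantization steps.
for s in (0.01, 0.1, 0.5, 1.0):
    print(s, mu_law(s), a_law(s))
```

Both curves map 1 to 1 and 0 to 0, and the two A-law pieces meet at |S| = 1/A.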

4 Data Rates
Stereo CD audio: 44100 1/s × 16 bit × 2 = 1411200 bit/s

Quality          Sample rate (kHz)  Bit/sample  Channels  Data rate (kb/s)  Frequency (Hz)
Telephone        8.000              8           Mono      64.00             200-3400
MW (AM radio)    11.025             8           Mono      88.20
UKW (FM radio)   22.050             16          Stereo    705.60
CD               44.100             16          Stereo    1411.20
DAT              48.000             16          Stereo    1536.00

Figure: The inner human ear
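The data rates in this table follow from sample rate × bits per sample × channels. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def pcm_data_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in kb/s (1 kb = 1000 bit)."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0

# (quality, sample rate in Hz, bits per sample, channels)
formats = [
    ("Telephone", 8000, 8, 1),
    ("CD", 44100, 16, 2),
    ("DAT", 48000, 16, 2),
]
for name, fs, bits, ch in formats:
    print(name, pcm_data_rate_kbps(fs, bits, ch), "kb/s")
```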

5 Bandpass filter in the ear
Figure: frequency to position

Digital representation of sound
Figure: continuous variable and discretized variable over time, with quantization levels (4 bit)
Using 16 bit we get 2*2*2*...*2 = 2^16 = 65536 levels, giving 16*44100 = 705600 bits/s

6 Example: Beethoven, 5th symphony
Figure: Beethoven's 5th symphony, sampled and quantized; amplitude against time, with 4 bit quantizing for comparison ("dumb and bad")

Sampling
When $x(t)$ is bandwidth-limited, $X(f) = 0$ for $|f| > \omega$, then with
$$\Delta t = \frac{1}{f_s} < \frac{1}{2\omega}, \qquad x[n] = x(n\,\Delta t)$$
the signal can be written as
$$x(t) = \sum_{n=-\infty}^{\infty} x[n]\, g(t - n\,\Delta t), \qquad g(t) = \frac{\sin(2\pi\omega t)}{2\pi\omega t}$$
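The interpolation formula on this slide can be checked numerically. The sketch below makes several assumptions that are not on the slide: a band limit of 1 Hz, critical sampling with Δt = 1/(2ω) (the case in which the kernel g as written interpolates exactly), a 0.3 Hz test tone, and a finite sum over 2001 samples instead of the infinite one, so the result is only approximate.

```python
import math

def g(t, omega):
    """Interpolation kernel sin(2*pi*omega*t) / (2*pi*omega*t), with g(0) = 1."""
    a = 2.0 * math.pi * omega * t
    return 1.0 if a == 0.0 else math.sin(a) / a

def reconstruct(samples, first_index, dt, omega, t):
    """Evaluate sum_n x[n] * g(t - n*dt) over the stored (finite) range of n."""
    return sum(x * g(t - (first_index + k) * dt, omega)
               for k, x in enumerate(samples))

omega = 1.0                  # assumed band limit in Hz
dt = 1.0 / (2.0 * omega)     # assumed critical sampling interval
f0 = 0.3                     # assumed test tone, below the band limit
N = 1000                     # samples n = -N .. N (the ideal sum is infinite)

def signal(t):
    return math.cos(2.0 * math.pi * f0 * t)

samples = [signal(n * dt) for n in range(-N, N + 1)]
t_mid = 0.5 * dt             # halfway between two sample instants
approx = reconstruct(samples, -N, dt, omega, t_mid)
print(approx, signal(t_mid))
```

The two printed values should agree closely; the remaining difference comes from truncating the sum.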

7 Quantisation
$x \to Q(x)$
$k$ bits give $L = 2^k$ representations $\{y_1, \ldots, y_L\}$
$|x - y_i| \le |x - y_j|$ for all $j$ $\Rightarrow$ $Q(x) = y_i$

PCM = Pulse Code Modulation
Sampling: $\{x(t)\} \to \{x[n]\}$
Quantisation: $\{x[n]\} \to \{Q(x[n])\}$
Coding: $Q(\{x[n]\}) \to \{i_n\}$
Redundancy, irrelevancy
Play: $y(t) = \sum_n Q(x[n])\, g(t - n\,\Delta t)$
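The quantisation rule on this slide picks the representation closest to x. A minimal sketch, assuming uniformly spaced levels on [-1, 1] (the slide does not say how the levels are placed) and using hypothetical function names:

```python
def make_levels(k, lo=-1.0, hi=1.0):
    """Return L = 2**k representation levels, here the mid-points of equal cells."""
    L = 2 ** k
    step = (hi - lo) / L
    return [lo + (i + 0.5) * step for i in range(L)]

def quantize(x, levels):
    """Return (i, y_i), where y_i is the level nearest to x."""
    i = min(range(len(levels)), key=lambda j: abs(x - levels[j]))
    return i, levels[i]

levels = make_levels(2)          # 4 levels: -0.75, -0.25, 0.25, 0.75
index, value = quantize(0.3, levels)
print(index, value)
```

Only the index needs to be stored or transmitted; the decoder looks up the level from the same table.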

8 Masking
Masking threshold for the human ear
The threshold changes:
  with neighbouring frequencies (example: 0.5, 1, 4, 8 kHz)
  in time

9 Masking
Absolute threshold of hearing.
Masking: one sound is inaudible in the presence of another sound.
1. Simultaneous masking
   Noise Masking Tone
   Tone Masking Noise
   Noise Masking Noise
2. Nonsimultaneous masking
   Pre-masking (2 ms)
   Post-masking (100 ms)

Noise Masking Tone
Filtered noise: center 410 Hz, width 111 Hz
Tone 1: 820 Hz, 5 dB below the noise
Tone 2: 410 Hz, 5 dB below the noise
Noise + Tone 1: not masked
Noise + Tone 2: masked
You cannot hear a sinusoid that lies in the same critical band as a filtered noise if its sound pressure level is below a certain threshold. This effect also stretches out beyond the critical band.

10 Tone Masking Noise
Filtered noise: center 1 kHz, width 162 Hz, 15 dB below the tone
Tone 1: 2 kHz
Tone 2: 1 kHz
Noise + Tone 1: not masked
Noise + Tone 2: masked
You cannot hear a filtered noise that lies in the same critical band as a sinusoid if its sound pressure level is below a certain threshold. This effect also stretches out beyond the critical band.

Exploit Masking
If a sound is masked we can't hear it.
Make a frequency analysis of the signal and find the masking threshold.
Put the quantization noise under the masking threshold and we don't hear the quantization.

11 Pre-echo distortion
The original sound of a castanet.
The abruptness in the time domain results in all frequencies being involved.
The data is split into windows of finite length.
The quantization noise is spread over an entire window. This makes the castanets sound less distinct.
Audible effects can be avoided with shorter windows, exploiting pre-masking.

Scale factors and Quantization
When the dynamics change over time, only a small subset of the quantization steps is used in regions with low magnitudes.
Use scale factors instead:
  1. Take a window of data.
  2. Find the max magnitude in this window.
  3. Use the next larger scale factor from a table.
  4. Normalize with the scale factor.
  5. Quantize. Now the whole dynamic range of the quantizer is used.
  6. Send scale factor and quantized samples.
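The scale factor steps above can be sketched as follows. The table of powers of two, the symmetric integer quantizer and all names are assumptions made for illustration; they are not the table or quantizer of any MPEG layer.

```python
# Hypothetical scale factor table: 1, 1/2, 1/4, ...
SCALE_TABLE = [2.0 ** -i for i in range(16)]

def pick_scale_factor(window):
    """Smallest table entry that is still >= the peak magnitude ("next larger")."""
    peak = max(abs(s) for s in window)
    candidates = [sf for sf in SCALE_TABLE if sf >= peak]
    return min(candidates) if candidates else max(SCALE_TABLE)

def encode_window(window, bits):
    """Normalize by the scale factor, then quantize to integers in [-half, half]."""
    sf = pick_scale_factor(window)
    half = 2 ** (bits - 1) - 1
    codes = [round(s / sf * half) for s in window]
    return sf, codes

def decode_window(sf, codes, bits):
    """Undo the normalization on the receiver side."""
    half = 2 ** (bits - 1) - 1
    return [c / half * sf for c in codes]

window = [0.01, -0.02, 0.03]          # a quiet passage
sf, codes = encode_window(window, bits=4)
print(sf, codes, decode_window(sf, codes, bits=4))
```

Quantizing the same window with 4 bits and no scale factor would map every sample to zero, which is the problem the slide describes.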

12 MPEG compression factors
MPEG-1 Audio: PCM 32, 44.1, 48 kHz, max 448 kbit/s
MPEG-2 Audio: PCM 16, 22.05, 24, 32, 44.1, 48 kHz, max 384 kbit/s

MPEG Audio Layer I, II, III
Layer I
Layer II: Digital TV
Layer III: MP3

13 MP3 - MPEG-1 Audio Layer 3
Sampling: 16 kHz - 48 kHz
Bit rate: 32 kb/s kb/s
(CD audio: 44.1 kHz, 1411 kb/s)
Karlheinz Brandenburg: "MP3 and AAC explained"

Psychoacoustics in the Encoder

14 Perceptual encoding / decoding
Filterbank

15 Ideal sub-band coder
Impossible: ideal sub-band coder
Possible: nearly perfect
$$H_m(f) = \begin{cases} 1 & \text{for } f \in D_m,\ m = 1, \ldots, M \\ 0 & \text{else} \end{cases}$$

Downsampling, aliasing
From $M f_s$ back to $f_s$: with sub-bandwidth $B$ and an upper frequency that is a multiple of $B$, we can sample at $f_s = 2B$ (instead of $M f_s = 2MB$).
$$x_m[n] \;\to\; \downarrow M \;\to\; y_m[k], \qquad y_m[k] = x_m[kM]$$
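The downsampling rule y_m[k] = x_m[kM] is a one-liner, and a short experiment shows why the sub-band filter has to come first. In this sketch M = 4 and the tone frequency are assumed values: a tone that does not fit the reduced rate shows up at a different (aliased) frequency after downsampling.

```python
import math

def downsample(x, M):
    """Keep every M-th sample: y[k] = x[k*M]."""
    return x[::M]

M = 4
f = 0.3   # assumed tone frequency in cycles per input sample
x = [math.cos(2.0 * math.pi * f * n) for n in range(64)]
y = downsample(x, M)

# After downsampling the tone advances 1.2 cycles per output sample,
# which cannot be told apart from 0.2 cycles per output sample.
alias = [math.cos(2.0 * math.pi * 0.2 * k) for k in range(len(y))]
print(max(abs(a - b) for a, b in zip(y, alias)))
```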

16 Filterbank in MPEG-1 audio layer 1-3
Polyphase filterbank
32 subbands
512 tap FIR filters
80 additions and multiplications per output
Equal width
Not perfect reconstruction
Frequency overlap

A closer look
The subbands overlap the adjacent bands at 3 dB.
The leakage to the other bands is small.
The total response almost adds up to one (0 dB).

17 White noise
White noise run through the filterbank. The samples from each band are played in the order of the subbands.
The subsampled filtered sequence. The samples from each band are played in the order of the subbands.
The reconstruction error is 84 dB.

Nonideal filterbanks
$$Y(e^{j\omega}) = X(e^{j\omega})\,\frac{1}{M}\sum_{k=0}^{M-1} H_k^R(e^{j\omega})\,H_k^A(e^{j\omega}) \;+\; \sum_{n=1}^{M-1} X\!\left(e^{j(\omega - \frac{2\pi n}{M})}\right)\frac{1}{M}\sum_{k=0}^{M-1} H_k^R(e^{j\omega})\,H_k^A\!\left(e^{j(\omega - \frac{2\pi n}{M})}\right)$$
In a perfect filterbank the first part is the only part.
The second part consists of the aliasing terms. The filterbank is designed so that the aliasing is small.

18 Tubthumper, a time domain view
The red line is the reconstruction error after splitting the signal into subbands, downsampling and applying the synthesis filterbank.
The reconstruction error is 84 dB and sounds like

Tubthumper, frequency view
Figure: subband against center frequency [kHz]; panels: no subsampling, subsampled 32 times

19 Filterbank
MPEG polyphase filterbank: band 0, band 1, ..., band 31, each delivering blocks of 12 samples
Layer I frame: 384 samples
Layer II/III frame: 1152 samples

Critical Bands
Heinrich Barkhausen ( )
Psycho-acoustic width measured in bark
$$1\ \text{bark} = \begin{cases} f/100 & \text{for } f < 500\ \text{Hz} \\ 9 + 4\log_2(f/1000) & \text{else} \end{cases}$$
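The bark formula on this slide is only partly legible. The sketch below uses a commonly quoted piecewise approximation, f/100 below 500 Hz and 9 + 4·log2(f/1000) above; the constants 500, 9 and 4 and the base-2 logarithm are assumptions taken from that textbook form, not values read from the slide. It also checks the frame sizes, which follow from 32 bands with 12 samples each.

```python
import math

def hz_to_bark(f_hz):
    """Piecewise bark approximation (constants are assumed, see the text above)."""
    if f_hz < 500.0:
        return f_hz / 100.0
    return 9.0 + 4.0 * math.log2(f_hz / 1000.0)

for f in (100, 500, 1000, 2000, 4000):
    print(f, "Hz ->", hz_to_bark(f), "bark")

BANDS = 32
SAMPLES_PER_BAND = 12
layer1_frame = BANDS * SAMPLES_PER_BAND          # one block of 12 samples per band
layer23_frame = 3 * BANDS * SAMPLES_PER_BAND     # three such blocks per band
print(layer1_frame, layer23_frame)
```

The two branches meet at 500 Hz, where both give 5 bark.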

20 MPEG - Sub bands
Layer I: 32 bands, 625 Hz each, Fourier transform
Layer II: 32 bands, three frames, time masking
Layer III: division according to critical bands

MPEG masking
Psycho-acoustic model
Masking of neighbouring bands
Signals are coded when above the masking threshold
MUSICAM (Masking-pattern adapted Universal Subband Integrated Coding and Multiplexing)
Layer I: simplified, Layer II: entirely, Layer III: with other methods

21 Example: Masking
MPEG Audio
Table columns: band, level, masking, coding

Bit Allocation and Masking
The masking threshold in each subband gives the Just Noticeable Distortion (JND) limit for that band.
Bits are assigned to the subbands so that the quantization noise falls below the JND, or exceeds it by as little as possible.
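One way to picture this bit allocation is a greedy loop that always gives the next bit to the band whose quantization noise sticks out furthest above its masking threshold. This is a sketch under stated assumptions, not the allocation procedure of the MPEG standard: the rule of thumb of about 6.02 dB less noise per bit, the limit of 15 bits per band, and the example levels are all assumed.

```python
DB_PER_BIT = 6.02   # assumed rule of thumb for a uniform quantizer

def allocate_bits(levels_db, masks_db, total_bits, max_bits=15):
    """Greedy allocation: spend bits where noise exceeds the masking threshold most."""
    n = len(levels_db)
    bits = [0] * n

    def noise_over_mask(b):
        # Estimated quantization noise of band b minus its masking threshold, in dB.
        return (levels_db[b] - DB_PER_BIT * bits[b]) - masks_db[b]

    for _ in range(total_bits):
        candidates = [b for b in range(n) if bits[b] < max_bits]
        if not candidates:
            break
        worst = max(candidates, key=noise_over_mask)
        if noise_over_mask(worst) <= 0:
            break                     # all noise is already below the JND
        bits[worst] += 1
    return bits

levels = [60.0, 40.0, 20.0]   # assumed signal level per band in dB
masks = [30.0, 35.0, 25.0]    # assumed masking threshold per band in dB
print(allocate_bits(levels, masks, total_bits=10))
```

In this example the third band lies below its masking threshold from the start, so it receives no bits at all.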

22 Castanets and Guitar
Bit allocation with 2 bits per sample

23 Bit allocation with 4 bits per sample
Signal to Quantization Noise Ratio and the Just Noticeable Distortion
Figure panels: frame at t = 0.6 s, frame at t = 1.1 s

24 Examples on compression
Method                Bits per sample  Error (SQR)
MP1                   4 bit            22 dB
MP1                   2 bit            11 dB
Direct quantization   8 bit            31 dB
Direct quantization   4 bit            7.8 dB
Direct quantization   2 bit            1.1 dB

Downsampling to 22 kHz bandwidth and quantization: 16 bit, 8 bit, 4 bit

MPEG-1 Layer 3 encoder

25 MP3
Filter bank - sub bands
Series MDCT
Fine grain frequency resolution
Non-uniform quantisation
Perception model
Huffman coding

MP3 (vs. Layer I/II)
Modified DCT (series MDCT vs. FFT)
Critical bands
Huffman coding
Entropy reduction
Dynamics compression
Difference and sum of stereo signals

26 MPEG Audio Layer I, II, III
Layer I: 19 ms delay, FFT, 384 samples, frequency masking, equal bands
Layer II: 35 ms delay, FFT, 1152 samples, frequency masking, time simulated, equal bands
Layer III: 59 ms delay, DCT, 1152 samples, frequency and time masking, bands as in bark scale

MPEG Layer I, II, III: data rates
Format            Subj. quality  Bandwidth (kb/s)  Compression  1 min audio
Audio CD          CD                                            MB
MPEG1 Layer I     CD                                            MB
MPEG1 Layer II    CD                                            MB
MPEG1 Layer III   CD                               :1           962 kB
MPEG2 Layer III   Radio          64                22:1         481 kB
MPEG2 Layer III   Telephone      16                88:1         120 kB
CS-ACELP          Speech         5.30              264:1        40 kB
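The compression and size columns of this table can be approximated from the bit rate alone. The sketch assumes that the reference is stereo CD audio at 1411.2 kb/s and that sizes are counted in units of 1000 bytes; the slide's own figures appear to use slightly different rounding, so the results are close to the table rather than identical.

```python
CD_RATE_KBPS = 1411.2   # 44.1 kHz * 16 bit * 2 channels

def compression_ratio(rate_kbps):
    """Compression relative to uncompressed stereo CD audio."""
    return CD_RATE_KBPS / rate_kbps

def size_per_minute_kb(rate_kbps):
    """Size of one minute of audio in kB (1 kB = 1000 bytes)."""
    return rate_kbps * 60 / 8

for name, rate in [("Radio", 64), ("Telephone", 16), ("Speech", 5.3)]:
    print(name, round(compression_ratio(rate)), ":1,",
          round(size_per_minute_kb(rate)), "kB per minute")
```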

27 MPEG-2 AAC

Audio Formats
PCM - Pulse Code Modulation
  ITU G.711; speech data, 4 kHz bandwidth, 64 kb/s data rate
ADPCM (Adaptive Differential PCM)
  ITU G.726, G.727; 16, 24, 32, 40 kbit/s. Standard for CCITT G.721
SB-ADPCM (Sub-Band ADPCM)
  ISDN, G.722; 7 kHz bandwidth in 64 kbit/s streams

28 Audio Formats
AIFF - Audio Interchange File Format
  Apple (extension from IFF by Electronic Arts)
Wave (by Microsoft and IBM)
  Part of RIFF (Resource Interchange File Format)
NeXT/Sun Audio File Format
  big endian

Proprietary Audio Formats
AT&T Proprietary Compression Algorithm
EPAC (Bell Labs)
Microsoft Windows Media Audio (WMA)
AC-3, Audio Code No. 3 - Dolby Digital Surround

29 Speech compression formats
GSM 06-10: bit values in 260 bit (33 byte) are compressed; 8000 samples/s result in a data rate of 1650 byte/s
CELP (Code Excited Linear Prediction): analytical model
LD-CELP (Low Delay CELP): G.728
LPC-10E (Linear Prediction Coder, Enhanced): military coder, analytical model, 2.4 kbit/s; understandable, but low quality

End of Part
Thank you for your attention!
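The GSM data rate on this slide can be reproduced with a few lines of arithmetic. The frame length of 160 samples is an assumption here, since the number is not legible on the slide; the 260 bits per frame, the 33 bytes and the 8000 samples/s are taken from the slide.

```python
import math

SAMPLE_RATE = 8000        # samples per second, from the slide
FRAME_BITS = 260          # bits per coded frame, from the slide
SAMPLES_PER_FRAME = 160   # assumption: not legible on the slide

frame_bytes = math.ceil(FRAME_BITS / 8)               # 260 bits packed into whole bytes
frames_per_second = SAMPLE_RATE // SAMPLES_PER_FRAME
bytes_per_second = frames_per_second * frame_bytes
print(frame_bytes, frames_per_second, bytes_per_second)
```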
