Audio coding: 3-dimensional stereo and presence


Audio coding
Scheme: CIN / TI / M.Eng.
Version: 1.0
Last update: January 2002
Date: January 2002
Lecturer: David Robinson, University of Essex

2 Introduction

In this second lecture, we will examine the design and operation of audio codecs. An audio codec is a device that reduces the amount of data required to represent an audio signal. We will discuss the need for audio coding, the general principles of audio coding, and the design of a psychoacoustic model.

2.1 Why reduce the data rate?

The compact disc is now so much a part of everyday life that its technological properties are taken for granted. Indeed, the 750 MB of audio data contained upon a typical CD seems small compared to the capacity of current storage devices. Moore's law [1] predicts that computational processing power will double every 18 months. Data storage capacity is increasing at a similar rate. The capacity of the humble CD will seem minuscule compared to next year's hard disk drives and future optical disc formats.

As the storage capacity of a CD is dwarfed, it is easy to forget that the data requirements of CD quality digital audio are immense compared to textual media. For example, 30 seconds of CD quality digital audio requires the same storage space as the complete works of Shakespeare (footnote 1). Though the cost of digital storage falls year on year, the data rate of CD quality audio is still too high for certain applications. Two pertinent examples are discussed below.

Firstly, audio broadcasters wish to transmit CD quality radio services. However, the radio spectrum is very crowded, and the proliferation of devices such as mobile phones has made radio bandwidth an expensive commodity. If CD quality audio were transmitted on existing analogue FM frequencies, then the frequency range from 88 MHz to 108 MHz would accommodate just 12 radio stations. However, analogue transmissions must continue during the transition to digital broadcasting, so additional bandwidth has been allocated for the digital services (footnote 2). The bandwidth allocated for the five BBC national radio stations is 1.54 MHz. After channel coding, this yields a broadcast data rate of 1.2 Mbps. The data rate of CD is 1.4 Mbps. Thus, a single CD-quality audio service requires more bandwidth than is available for five radio stations.

Secondly, computer networks, especially home connections, have failed to increase in capacity in accordance with Moore's law. The most common internet connection at home in the UK is currently the 56k modem. Data transfer rates of approximately 3-4 KB per second (32000 bits per second) are typical. Thus, for every one second of download time, the user can transfer only about 0.02 seconds of CD quality audio. Real-time delivery of audio in this manner is impossible. Distributing albums of music over the internet for off-line listening is similarly impractical, since a 3 minute pop song requires over two hours download time.

Footnote 1: The complete works of Shakespeare in ASCII plain text format occupy 5219 KB. 30 seconds of CD quality audio occupy 44100*16*2*30 = 42,336,000 bits. Thus this edition of the complete works of Shakespeare requires the same binary storage as 31.3 seconds of CD quality digital audio.

Footnote 2: In the United Kingdom, 12.5 MHz of Band III spectrum has been allocated to digital audio broadcasting. This will accommodate seven data channels. The BBC has been allocated one of these channels for its national services.
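The data-rate figures quoted in Section 2.1 can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function and constant names are invented for this example, and the modem throughput is the 32000 bits per second figure quoted in the text.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of linear PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


CD_BITRATE = pcm_bitrate(44100, 16, 2)   # CD: 44.1 kHz, 16 bit, stereo
MODEM_BPS = 32000                        # typical 56k modem throughput (from the text)


def download_seconds(audio_seconds, audio_bps=CD_BITRATE, link_bps=MODEM_BPS):
    """Time needed to transfer a given duration of audio over a link."""
    return audio_seconds * audio_bps / link_bps


song_seconds = download_seconds(180)     # a 3 minute pop song
print("CD bitrate (bps):", CD_BITRATE)
print("Download time (hours):", round(song_seconds / 3600, 1))
print("Audio seconds per download second:", round(MODEM_BPS / CD_BITRATE, 3))
```

The result agrees with the claim that a 3 minute song needs over two hours to download at this rate.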

The data rate of CD quality digital audio is too high for both these applications. The data rate must be reduced in order to make either application practical. In addition, there are other applications where the data rate of CD quality audio is not prohibitive, but reducing this data rate would provide economic or functional benefits. For these reasons, it is desirable to reduce the data rate of the audio signal, without compromising the audio quality. However, without sophisticated audio codecs, the data rate and audio quality are inextricably linked.

2.2 Data reduction by quality reduction

The simplest method of reducing bitrate (footnote 3) is to reduce the audio quality. Three bitrate reducing strategies are listed below, together with the quality implications for each strategy.

1. Reduce the sampling rate. This will reduce the frequency range (bandwidth) of the audio signal.
2. Reduce the bit-depth. This will increase the noise floor of the audio signal.
3. Convert a stereo (2-channel) signal to a mono (1-channel) signal. This will remove all spatial information from the audio signal.

Table 2.1 lists some common audio formats. These illustrate various combinations of the above strategies.

Table 2.1: Linear PCM bitrates
Columns: name; samples / second (kHz); PCM bits / sample; channels; frequency range / Hz; SNR / dB; PCM bit rate / kbps
Rows: DVD, DAT, CD, FM, FM, PC, Phone

The lowest bitrate in Table 2.1 is still too high to transmit in real time over a 56k modem. The stereo FM parameters define a digital channel with comparable quality to existing analogue FM broadcasts. This quality is acceptable to most consumers, but quality reductions below this level are perceived and disliked by many listeners. To reduce the bitrate further, a more sophisticated approach is required.

Footnote 3: Throughout this discussion, the data rate of an audio signal will be referred to as the bitrate. The bitrate is specified in bits per second (bps), kilobits per second (kbps), or Megabits per second (Mbps). The k and M prefixes are used to represent 10^3 and 10^6 respectively (SI units) rather than 2^10 and 2^20 (commonly used in PC specifications - see [IEC, 2000] for clarification of this issue).

2.3 Lossless and lossy audio codecs

There are two distinct types of audio codec: lossless and lossy. A lossless codec will return an exact copy of the original digital audio signal following the encode and decode process. A similar approach is often used within the computer world to reduce the size of documents or program files, without changing the data. Algorithms suitable for data include Zip [2] and Sit [3]. Algorithms suitable for audio include LPAC [4], Meridian Lossless Packing (MLP) [5], and Monkey's Audio [6].

Both types of algorithm exploit redundancies within the data. For example, the waveforms of musical signals are often repetitive in nature. Storing the difference between each cycle of the waveform, rather than the waveform itself, often requires fewer bits. In a lossless codec, the difference between the predicted values and the actual waveform is also stored, so that the waveform can be reconstructed exactly. Further details of lossless codec design are given at the end of these notes.

A lossless audio codec by definition cannot reduce the audio quality. However, lossless audio codecs rarely reduce the bitrate to below 50% of the original value. Also, the exact bitrate reduction is highly signal dependent, so the bitrate of the audio data cannot be guaranteed to match that of the transmission channel. A burst of white noise (which is random and hence difficult to predict or compress) may cause the encoded bitrate to match or exceed that of the original signal.

To reduce the bitrate still further, lossy audio codecs discard audio data. This means that the decoded waveform is not an exact copy of the original. However, unlike the measures described in Section 2.2, lossy audio codecs aim to discard data in a manner that is inaudible, or at least not objectionable to a human listener. This is possible due to the complex nature of human hearing. This topic was discussed in depth in the first lecture. To summarise: the presence of one sound can prevent a human listener from hearing a second (quieter) sound. This phenomenon is illustrated in Figure 2.1 [7]. The MAF (Minimum Audible Field) curve represents the level below which a sound of a given frequency is inaudible. The presence of an audible tone raises the threshold in the spectral region around the tone, such that any additional sound falling below the masked threshold (as indicated in Figure 2.1) is inaudible.

Figure 2.1: Spectral masking.

Where this occurs, the masked sound can be removed or distorted by the audio codec without changing the perceived quality of the audio signal. Lossy codecs which operate in this manner are often referred to as psychoacoustic based codecs, since they require knowledge of the properties of the human auditory system. By combining this approach with lossless data reduction, the bitrate may be reduced by 90% without significantly reducing the perceived audio quality. The result is that a bitrate which provides little better than telephone quality without data reduction can yield near CD quality with data reduction.

Psychoacoustic based codecs are the most recent generation of lossy audio codecs. Two other types or families of lossy audio codec exist, and these are mentioned in passing. The first type aims to discard data without significantly reducing the perceived quality of the audio signal, but does so without sophisticated knowledge of the human auditory system. The oldest such codecs are the A-law and µ-law coding schemes, where non-linear quantisation steps are used to increase the perceived signal to noise ratio of an 8-bit quantiser.

Another lossy coding mechanism is Adaptive Differential Pulse Code Modulation (ADPCM). In ADPCM, each sample is predicted from the previous samples, and only the difference between the prediction and the actual value is stored. The decoder follows the same predictive rules as the encoder, and adds the stored difference to each predicted sample value. Typically, the input samples are of 8 or 16 bit resolution, and the encoded differences are stored in four bit resolution, giving 50% or 75% data reduction. This codec is lossless, except where the difference between the predicted and actual values cannot be represented in four bits. In practice, this situation is common, but the error is sometimes inaudible, and rarely annoying.
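The idea of storing only a four bit difference can be illustrated with a much simplified sketch. This is not a real ADPCM implementation: it assumes the simplest possible predictor (the previous decoded sample) and a fixed step size, whereas practical ADPCM schemes adapt the step size to the signal. All function names are invented for the example.

```python
def dpcm_encode(samples, bits=4):
    """Encode integer samples as clamped differences from the previous decoded sample."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1   # -8..7 for four bits
    codes, prediction = [], 0
    for s in samples:
        diff = max(lo, min(hi, s - prediction))   # clamp to the four bit range
        codes.append(diff)
        prediction += diff                        # track what the decoder will reconstruct
    return codes


def dpcm_decode(codes):
    """Rebuild the samples by accumulating the stored differences."""
    out, prediction = [], 0
    for d in codes:
        prediction += d
        out.append(prediction)
    return out


smooth = [0, 3, 7, 10, 12, 11, 8]    # every step fits in four bits
jumpy = [0, 20, 20]                  # the first step does not fit

print(dpcm_decode(dpcm_encode(smooth)) == smooth)   # reconstructed exactly
print(dpcm_decode(dpcm_encode(jumpy)))              # the decoder lags behind the jump
```

The smooth signal is recovered exactly, while the sudden jump is reproduced with an error that takes several samples to correct; this is the lossy case described above.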

Both the above lossy codecs are designed for use with telephone quality speech signals, though they can be used with some success to code CD quality music signals. There is a further type of lossy codec which is designed for speech coding only. Code excited linear predictive (CELP) coding employs a code book of excitation signals followed by a linear predictive filter. The output of the code book and filter is compared with the incoming speech signal, and the code book index which gives the best match is transmitted. Typically, a single 10-bit index into the code book can represent 40 incoming samples. This mechanism of lossy coding is used on digital mobile telephone networks, and the code book is designed to represent speech-like sounds. This approach is not suitable for high quality music coding, as anyone who has heard music via a GSM mobile phone can testify. These speech-only lossy codecs are not relevant to high quality audio, and will not be discussed further.

Psychoacoustic based lossy codecs are most relevant to high quality audio. The general principle of operation, and the details of the popular MPEG-1 family of codecs, will now be discussed.

2.4 General psychoacoustic coding principles

A generalised psychoacoustic codec may operate as shown in Figure 2.2.

Figure 2.2: General structure of a psychoacoustic codec

In the first stage of the encoder, the incoming signal is split into several frequency bands by a bank of bandpass filters. A psychoacoustic model calculates the masked threshold for each frequency band, and this is converted into a Signal to Mask Ratio (SMR) for each band. Spectral components that lie above the masked threshold are judged to be audible, and yield a positive Signal to Mask Ratio. Spectral components that lie below the masked threshold are judged to be inaudible, and yield a negative Signal to Mask Ratio. The Signal to Mask Ratio directs a bit allocation algorithm. The number of bits allocated to each frequency band determines the accuracy of the quantiser, which in turn determines the amount of noise that will be added within each band.
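The link between the number of bits given to a band and the noise added to it can be illustrated numerically. The sketch below is not taken from any codec: the uniform quantiser, the test tone and the function names are all assumptions chosen for illustration. It measures the signal to noise ratio of a normalised sine wave after quantisation with different numbers of bits.

```python
import math


def quantise(x, bits):
    """Uniform quantiser for a value x in the range -1..1."""
    steps = (1 << (bits - 1)) - 1      # e.g. 7 steps either side of zero for 4 bits
    return round(x * steps) / steps


def snr_db(signal, bits):
    """Signal to quantisation noise ratio in dB."""
    noise = [s - quantise(s, bits) for s in signal]
    p_signal = sum(s * s for s in signal)
    p_noise = sum(e * e for e in noise)
    return 10 * math.log10(p_signal / p_noise)


# A test tone normalised to unity peak amplitude (assumed, for illustration).
signal = [math.sin(2 * math.pi * 0.01234 * n) for n in range(4000)]

for bits in (4, 8, 12):
    print(bits, "bits:", round(snr_db(signal, bits), 1), "dB")
```

Each extra bit lowers the quantisation noise by roughly 6 dB, so a band whose Signal to Mask Ratio is low can be coded with few bits while its noise still remains below the masked threshold.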
The intention is to add noise within masked spectral regions of the audio signal, but not to change or distort audible spectral components. The amplitude of the signal in each band is normalised to unity before quantisation, and the scale factor required to revert the signal to its original level is stored, along with the output of the quantiser. The scale factor and/or quantiser output for a given band may be omitted if the signal within the frequency band lies well below the masked threshold. The resulting bitrate is much less than that of the original audio signal.

The decoder reverses this process by generating the signal in each band from the quantised values, multiplying each signal by the appropriate scale factor, and bandpass filtering the contents of each band. Finally, outputs of all the frequency bands are summed to yield the final decoded audio signal. Hopefully, the decoded signal will sound almost identical to the original signal.

The accuracy of the psychoacoustic model will affect the perceived sound quality of the coded audio. If the model incorrectly predicts that a spectral component is inaudible, when in reality it is above the masked threshold, then a human listener will perceive the noise added by the codec within this frequency region.

However, even if the psychoacoustic model perfectly predicts human perception, the resulting coded audio signal will still contain audible noise if the bitrate is too low. In a constant bitrate compressed audio signal, only a certain number of bits are available per second. If the psychoacoustic model calculates a high Signal to Mask Ratio for many frequency bands, this may instruct the bit allocation model to use more bits than are available. In this case, the bit allocation model must choose the best compromise to minimise the audible coding noise, whilst remaining within the allocated bitrate.

Variable bitrate coding overcomes this problem, by allocating the correct number of bits to ensure that the quantisation noise within each frequency band is below the masked threshold. This will reduce the bitrate during quiet or easy to encode passages, whilst increasing the bitrate during loud or complex passages. Variable bitrate encoding is only available within some audio codecs.

There are two sub-types of psychoacoustic codec: subband codecs and transform codecs. Subband codecs store the waveform present in each frequency band in a sub-sampled, quantised form. Transform codecs perform a time to frequency transformation (e.g. the Fast Fourier Transform) upon the original audio signal, or the signal within each frequency band. The resulting transform coefficients are stored, after quantisation, according to the SMR prediction of the psychoacoustic model. Transform codecs typically offer greater bitrate reduction than subband codecs. This is partly due to the higher frequency resolution offered by the transform, which allows the coding noise to be distributed more accurately according to the masked threshold. The major disadvantage of transform coding is that all current time to frequency transformations process the audio in discrete time domain blocks, and this blocking can cause audible problems. These problems will be discussed in Section 2.5.3, with respect to the MPEG-1 layer III codec.

2.5 MPEG audio codecs

These general principles of audio coding are seen at work in the MPEG-1 family of audio codecs. The MPEG-1 standard consists of three layers of coding, where each layer offers an increase in complexity, delay, and subjective performance with respect to the previous layer. The higher layers build on the technology of the lower layers, and a layer n decoder is required to decode all lower layers.

The MPEG-1 standard [8] supports sampling rates of 32 kHz, 44.1 kHz and 48 kHz, and bitrates between 32 kbps (mono) and 448 kbps (Layer I stereo). The MPEG-2 standard [9] contains a backwards compatible multi-channel codec, and extends the range of allowed bitrates and sampling rates (footnote 4). A proprietary extension called MPEG-2.5 [10] is in common use for layer III. The sampling rates and bitrates are summarised in the following table.

Table 2.2: Allowed bitrates in the MPEG audio coding standards

MPEG-1 (sampling rates 32, 44.1, 48 kHz)
  layer I:   32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352, 384, 416, 448 kbps
  layer II:  32, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 384 kbps
  layer III: 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320 kbps

MPEG-2 (sampling rates 16, 22.05, 24 kHz)
  layer I:   32, 48, 56, 64, 80, 96, 112, 128, 144, 160, 176, 192, 224, 256 kbps
  layer II:  8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160 kbps
  layer III: 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160 kbps

MPEG-2.5 (sampling rates 8, 11.025, 12 kHz)
  layer III: 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160 kbps

A review of the MPEG standards for audio coding is found in [11], and a clear description of layer III and AAC coding is contained in [12]. Parts of the following explanation are drawn from [13].

Footnote 4: The MPEG-2 standard also defines a non-backwards compatible codec known as MPEG-2 AAC (Advanced Audio Coding). This section of the standard was finalised some years after layers I, II, and III. It includes several refinements that improve coding efficiency (most notably temporal noise shaping), but the general coding principles are very similar to MPEG-1 layer III. Further details can be found in the standards document, and an excellent description appears in [Bosi et al, 1997].

2.5.1 MPEG-1 layer I audio coding

Figure 2.3: Structure of MPEG-1 audio encoder and decoder, Layers I and II.

The structure of the MPEG-1 layers I and II encoder is shown in Figure 2.3. The operation of the layer I encoder is as follows. All references to time and frequency assume 48 kHz sampling.

1. The analysis filterbank splits the incoming audio signal into 32 spectral bands. The filters are linearly spaced, each having a bandwidth of 750 Hz.
2. The samples in each band are critically decimated, and split into blocks of 12 decimated samples. Scalefactors are calculated which normalise the amplitude of the maximum sample in each band to unity.
3. In a parallel process, the signal is windowed, and a 512-point FFT is performed, to calculate the spectrum of the current audio block.
4. The psychoacoustic model calculates the masked threshold from the spectrum of the current block. This is transformed into a Signal to Masker Ratio for each band.
5. The dynamic bit and scalefactor allocator selects one of 15 possible quantisers for each band, based upon the available bitrate, the scalefactor, and the masking information. The aim is to meet the bitrate requirements whilst masking the coding noise as much as possible.
6. The scaler and quantiser acts as instructed by the allocator, to scale and quantise each block of 12 samples.
7. Finally, the quantised samples, scalefactors, and control information are multiplexed together for transmission or storage.
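The numbers in steps 1 and 2, and the scaling and quantising of step 6, can be sketched as follows. This is a simplified illustration rather than the procedure defined in the standard: it assumes the scalefactor is simply the peak magnitude of the block (a real encoder chooses from a table of scalefactors), and the quantiser and function names are invented for the example.

```python
SAMPLE_RATE = 48000   # Hz, as assumed in the text
BANDS = 32            # subbands produced by the analysis filterbank
BLOCK = 12            # decimated samples per band per block

band_width_hz = SAMPLE_RATE / 2 / BANDS            # width of each subband
frame_samples = BANDS * BLOCK                      # input samples covered by one block
frame_ms = 1000 * frame_samples / SAMPLE_RATE      # duration of one block


def scale_and_quantise(block, levels):
    """Normalise a block to unity peak, then quantise to an odd number of levels."""
    scalefactor = max(abs(s) for s in block) or 1.0
    half = (levels - 1) // 2
    codes = [round(s / scalefactor * half) for s in block]
    return scalefactor, codes


def dequantise(scalefactor, codes, levels):
    """Decoder side: rescale the quantised values to the original level."""
    half = (levels - 1) // 2
    return [c / half * scalefactor for c in codes]


block = [0.50, -0.25, 0.10, 0.40, -0.45, 0.05, 0.30, -0.10, 0.20, -0.35, 0.15, 0.00]
sf, codes = scale_and_quantise(block, levels=15)
decoded = dequantise(sf, codes, levels=15)

print(band_width_hz, "Hz per band;", frame_ms, "ms per block")
print("scalefactor:", sf, "codes:", codes)
print("max error:", max(abs(a - b) for a, b in zip(block, decoded)))
```

The error never exceeds half a quantiser step multiplied by the scalefactor, which is why normalising each band before quantisation keeps the coding noise proportional to the signal level in that band.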

The decoder unpacks this information, scales and interpolates the quantised samples as instructed via the control information, and passes the 32 bands through a synthesis filter to generate PCM audio samples.

The decoder does not require a psychoacoustic model, so decoder complexity is reduced compared to the encoder. This is useful for broadcast applications, where a single (expensive) encoder must transmit to thousands of (inexpensive) decoders. The decoder is specified exactly by the MPEG standard, but the encoder can use any coding strategy that yields a valid bitstream. For example, the psychoacoustic model may be arbitrarily complex (or non-existent if encoding speed is the only concern). In theory, this allows future developments in psychoacoustic knowledge to be incorporated into the encoder, without breaking compatibility with existing decoders. In practice, the fixed choice of filterbank parameters limits the fine-tuning that may be carried out.

2.5.2 MPEG-1 layer II

The layer II codec operates in a similar manner to layer I, but achieves higher audio quality at a given bitrate via the following modifications.

1. The 512-point FFT is replaced by a 1024-point FFT. This increases the frequency resolution of the masking calculation, at the expense of increasing the encoder delay.
2. The similarity between adjacent scalefactors in adjacent blocks is exploited, thus reducing the amount of control information that must be transmitted.
3. More accurate (smaller stepped) quantisers are made available.

MPEG-1 layer II coding is used by Digital Audio Broadcasting within the UK and much of the world (apart from America). It achieves near CD-quality at around 256 kbps stereo.

2.5.3 MPEG-1 layer III

Figure 2.4: Structure of MPEG-1 layer III audio encoder and decoder.

The layer III codec is significantly more complex than the lower layers. It uses both subband and transform coding, and is the only layer with mandatory support for variable bitrate coding. The layer III encoder is shown in Figure 2.4.

Each of the 32 frequency bands is sub-divided by a 6-point or 18-point Modified Discrete Cosine Transform (MDCT). This gives a possible frequency resolution of up to 42 Hz, compared to 750 Hz for layers I and II. The layer III codec switches between the two possible MDCT lengths (often referred to as short and long blocks) depending on the input signal. This strategy is useful because, after quantisation of the coefficients, the temporal structure of the audio information within the MDCT block is often distorted. Hence, short blocks are used for encoding transient information to minimise audible temporal smearing, while long blocks are used for near steady-state signals to give increased spectral accuracy.

Three other significant improvements are included in the layer III encoder. A non-uniform quantiser is used to increase the effective dynamic range (in a similar manner to A-law or µ-law encoding, but operating upon a single frequency band). The quantised samples are losslessly packed using Huffman coding. Finally, a bit reservoir is included in the layer III specification. This allows the encoder to increase the bitrate during brief hard to encode sections, so long as it can reduce the bitrate during a nearby easy to encode section. The overall bitrate is held constant, so the scheme is still referred to as constant bitrate. In this manner, the reservoir provides some of the advantages of variable bitrate coding, whilst maintaining compatibility with fixed bitrate transmission channels.

The layer III decoder is more complex than that required for layers I or II. However, the popularity of MPEG-1 and -2 layer III has led to low-cost single chip layer III decoders becoming available. Layer III is said to offer near CD quality at 128 kbps.

Many of the intricacies of the MPEG-1 layers are not covered here. Example encoders and decoders are described in the appropriate standards documents ([8] and [9]). One important feature is discussed in the next section.

2.5.4 Joint stereo coding

The redundancy sometimes found within two channel (stereo) signals allows for a significant bitrate reduction without a corresponding reduction in audio quality. MPEG-1 defines four modes:

1. Mono
2. Stereo
3. Dual (two separate channels)
4. Joint Stereo

In the first three modes, one or two separate channels are coded individually. In the fourth mode, the information in the two stereo channels is combined in one of two possible ways to reduce the bitrate.

Intensity stereo coding takes advantage of the human ear's insensitivity to interaural phase differences at higher frequencies. For each frequency band, the data from the two stereo channels is combined, and the resulting single channel of audio data is coded. Two coefficients are also stored to define the level at which this single channel should appear in each of the stereo channels upon decoding. This procedure is only appropriate at higher frequencies, but it can offer a 20% bitrate saving compared to normal stereo.

Unfortunately, the use of intensity stereo can be audible. Though the ear cannot detect the interaural phase of high frequency tones, the ear can detect interaural time delays in the envelope of high frequency signals. These time delays are destroyed by intensity stereo coding, and the stereo image appears to partially collapse. However, this effect is less objectionable than highly audible coding noise, so intensity stereo is useful at low bitrates, where it effectively frees some bits to reduce the coding noise.

Matrix stereo coding exploits the similarity between two stereo channels. Rather than coding the Left and Right channels, the Sum (or "Middle") and Difference (or "Side") signals are coded instead, thus:

$$M = \frac{L + R}{\sqrt{2}} \tag{2-1}$$

$$S = \frac{L - R}{\sqrt{2}} \tag{2-2}$$

$$L = \frac{M + S}{\sqrt{2}} \tag{2-3}$$

$$R = \frac{M - S}{\sqrt{2}} \tag{2-4}$$

The transformation from L/R to M/S is entirely lossless and reversible via equations (2-3) and (2-4), though quantisation of the M/S signals will prevent perfect reconstruction in practice. For a signal with very little difference between the two stereo channels (i.e. an almost mono signal) the energy within the S channel is minimal, and the bitrate required for this channel is comparatively low. Thus, for a mono or 100% out of phase signal, the bitrate reduction is nearly 50%.

For most audio signals, some bitrate reduction may be achieved by the use of joint stereo. It offers no benefit where the two stereo channels are completely uncorrelated. In some circumstances, it may cause problems. For example, consider a stereo signal consisting of audio on the left channel only, with an almost silent right channel. The right channel may contain a hiss, or a quiet echo. The M and S channels will be almost identical. However, the difference between the two channels is enough to ensure that the coding noise introduced into each channel is not identical. This coding noise is masked in both channels of the M/S representation. When the left and right channels are restored in the decoder, the right channel consists of the difference between the M and S signals. Hence, the right channel will contain very little signal information, but lots of coding noise. This occurs because the signal that masked the coding noise in the M/S representation is spatially separated from the coding noise in the decoded L/R output.

MPEG-1 layer III can use a combination of stereo techniques, in which the encoder switches dynamically between independent stereo, matrix stereo, and/or intensity stereo, depending on the incoming audio signal and the desired bitrate.
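The matrix stereo transformation can be tried out with a short sketch. It assumes the sum and difference signals are each scaled by 1/sqrt(2), so that the same scaling inverts the transformation; the function names and the test signal are invented for this illustration, and no quantisation is applied.

```python
import math

ROOT2 = math.sqrt(2)


def ms_encode(left, right):
    """Convert one pair of L/R samples to Middle/Side."""
    return (left + right) / ROOT2, (left - right) / ROOT2


def ms_decode(mid, side):
    """Convert one pair of Middle/Side samples back to L/R."""
    return (mid + side) / ROOT2, (mid - side) / ROOT2


# An almost mono signal: the right channel is a slightly quieter copy of the left.
left = [math.sin(0.1 * n) for n in range(100)]
right = [0.98 * x for x in left]

encoded = [ms_encode(l, r) for l, r in zip(left, right)]
mid_energy = sum(m * m for m, _ in encoded)
side_energy = sum(s * s for _, s in encoded)
decoded = [ms_decode(m, s) for m, s in encoded]

print("side / mid energy:", side_energy / mid_energy)
print("max round trip error:",
      max(abs(l - dl) + abs(r - dr)
          for (l, r), (dl, dr) in zip(zip(left, right), decoded)))
```

For this nearly mono input almost all of the energy ends up in the Middle channel, which is why the Side channel needs comparatively few bits.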
The ability to switch between stereo techniques is yet another reason why layer III can achieve higher quality at a specified bitrate, or a lower bitrate at a given quality, than layers I and II.

It is interesting to note the target bitrates of the three layers. The specifications suggest that layers I and II achieve CD quality at 256 kbps stereo; layer II at 192 kbps joint stereo, and layer III at 128 kbps joint stereo. Experience suggests that these recommendations are less than exact. Some audio signals are audibly degraded by some or all of the layers at any bitrate. Further, the suggested bitrate for layer III is especially optimistic; nearly twice this bitrate is often required to ensure CD quality over a wide range of material. The majority of layer III encoders deliver a bandwidth of 16 kHz at 128 kbps, which is by definition not CD quality. Whilst many audio extracts do sound acceptable at 128 kbps, a significant minority do not.

3 The Psychoacoustic model

One of the most important components within a psychoacoustic based codec is the psychoacoustic model. This consists of an algorithm that predicts what is (and is not) audible to a human listener. In theory, the sound quality of an audio codec depends on the accuracy of the psychoacoustic model within the encoder. In practice, other factors are equally as important as the psychoacoustic model, such as the filterbank parameters and the choice of block-size. If the rest of the codec is well designed, then a comparatively simple psychoacoustic model can yield good results. A basic psychoacoustic model will be described here.

In 1988, James Johnston published details of a model for calculating the perceptual entropy of audio signals [14]. This model calculates the masked threshold due to an audio signal in order to predict which components of the signal are inaudible. In this way, the model can be used to predict how much data is needed to transparently code the signal. The model is included in an audio coder [15], where the prediction of inaudible components is used to feed a bit allocation algorithm which reduces the data rate to 128 kbps for a mono signal. Similar models are included in most audio codecs.

The Johnston model calculates the spectral masking due to an audio signal, but temporal masking is not addressed. Though the human auditory system is continuous in both time and frequency, the spectral masking estimate calculated by the Johnston model is discrete in time and frequency. The signal is split into short (64 ms) frames, and the spectral masking for each frame is computed as if the signal were steady state. The masking is computed for 25 frequency bins, spaced equally on the critical band scale. Thus one masked threshold is calculated for each frequency bin every 64 ms. This threshold is calculated by taking the FFT of a 64 ms frame, summing the energy in each frequency bin, spreading the energy to simulate spectral masking, adjusting for the nature of the signal, and normalising the result.

The following detailed walk through the Johnston auditory model is drawn from [14], [15], and [7]. The plots show the progress of a synthetic signal (consisting of a 500 Hz tone and a 5 kHz tone) through the model.

3.1 Algorithm

Window and FFT

The audio signal is split into frames. A frame length of 64 ms is employed (2048 samples at a sampling frequency of 32 kHz). The current frame is windowed with a Hanning (raised cosine) window and an FFT (Fast Fourier Transform) is performed. Each line (n) in the complex FFT refers to a spectral component of frequency f (in kHz), given by

$$f(n) = \frac{n \cdot f_s}{1000 \cdot \mathrm{window}} \tag{3-1}$$

where $f_s$ is the sampling frequency in Hz and window is the frame length in samples. This is valid for the first (window / 2) complex lines.

Figure 3.1: Signal spectrum

Critical Band Analysis

The real and imaginary components of the spectrum, $\mathrm{Re}(n)$ and $\mathrm{Im}(n)$, from the FFT are converted to the power spectrum, $P(n)$, thus:

$$P(n) = \mathrm{Re}^2(n) + \mathrm{Im}^2(n) \tag{3-2}$$

This power spectrum is segregated in linear frequency, but the auditory system processes frequency on a near logarithmic scale, called the critical band scale, as discussed in the first lecture. The relationship between linear frequency, f in kHz, and the critical band, or Bark frequency, $z_c$ in Bark, is given by

$$z_c = 1 + \left[ 13 \arctan(0.76 f) + 3.5 \arctan\left( \left( \frac{f}{7.5} \right)^2 \right) \right] \tag{3-3}$$

adapted from [16] to number critical bands from 1 to 25. If the lowest frequency component falling in critical band i, where $i = \mathrm{int}(z_c)$, is given by $bl_i$, and the highest frequency component falling in critical band i is given by $bh_i$, then the summation of the energy in band i is given by:

$$B_i = \sum_{n = bl_i}^{bh_i} P(n) \tag{3-4}$$
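The critical band analysis can be sketched in a few lines. The sketch assumes the power spectrum has already been computed from the FFT, that the Bark conversion takes the form $1 + 13\arctan(0.76f) + 3.5\arctan((f/7.5)^2)$, and that the D.C. line is excluded from the summation; the function names and the two-line test spectrum are invented for the example.

```python
import math


def bin_frequency_khz(n, fs_hz, window):
    """Frequency in kHz of FFT line n."""
    return n * fs_hz / (1000.0 * window)


def bark(f_khz):
    """Bark frequency, offset so that critical bands are numbered from 1."""
    return 1 + 13 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)


def critical_band_energy(power, fs_hz, window, bands=25):
    """Sum the power spectrum within each critical band; returns B_1..B_25."""
    energy = [0.0] * bands
    for n in range(1, window // 2):              # n = 0 (D.C.) is skipped
        band = min(int(bark(bin_frequency_khz(n, fs_hz, window))), bands)
        energy[band - 1] += power[n]
    return energy


FS, WINDOW = 32000, 2048                         # 64 ms frames at 32 kHz
power = [0.0] * (WINDOW // 2)
power[32] = 1.0                                  # line 32 is 500 Hz
power[320] = 2.0                                 # line 320 is 5 kHz

B = critical_band_energy(power, FS, WINDOW)
for i, value in enumerate(B, start=1):
    if value:
        print("band", i, "energy", value)
```

With these assumptions the 500 Hz component falls in band 5 and the 5 kHz component in band 19.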

Figure 3.2: $B_i$ Critical Band Energy

The energy in each critical band is summed in this manner. The D.C. component of the spectrum is not included in this summation.

Spreading function

Figure 3.3: $S_{i,j}$ Spreading Function

The following spreading function (taken from [17]) is used to estimate the effects of masking across critical bands.

$$S_{i,j}(\mathrm{dB}) = 15.81 + 7.5\,(y + 0.474) - 17.5\sqrt{1 + (y + 0.474)^2} \tag{3-5}$$

$$S_{i,j} = 10^{\,S_{i,j}(\mathrm{dB}) / 10} \tag{3-6}$$

where:
y = i - j (not the modulus, as stated in some other papers)
i = Bark frequency of masked signal
j = Bark frequency of masker signal

The spread critical band spectrum is calculated by convolving the spreading function with the critical band spectrum. This can be achieved by matrix multiplication, as given in equation (3-7).

Figure 3.4: $C_i$ Spread CB Spectrum
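The spreading operation can be sketched as below. The sketch assumes the widely quoted form of the spreading function, $15.81 + 7.5(y + 0.474) - 17.5\sqrt{1 + (y + 0.474)^2}$ dB, with y equal to the masked band number minus the masker band number; the constants and the function names should be treated as assumptions of this example.

```python
import math


def spreading_db(y):
    """Spreading function in dB for a Bark distance y = masked - masker."""
    return 15.81 + 7.5 * (y + 0.474) - 17.5 * math.sqrt(1 + (y + 0.474) ** 2)


def spreading_matrix(bands=25):
    """Linear-power matrix S, with S[i][j] the spread from masker j+1 to band i+1."""
    return [[10 ** (spreading_db(i - j) / 10) for j in range(bands)]
            for i in range(bands)]


def spread(energy, matrix):
    """Spread critical band spectrum: C_i is the sum over j of S[i][j] * B_j."""
    return [sum(row[j] * energy[j] for j in range(len(energy))) for row in matrix]


S = spreading_matrix()
B = [0.0] * 25
B[9] = 1.0                      # all the energy in critical band 10

C = spread(B, S)
print("one band above masker:", round(10 * math.log10(C[10]), 1), "dB")
print("one band below masker:", round(10 * math.log10(C[8]), 1), "dB")
print("total energy after spreading:", round(sum(C), 3))
```

The spread is asymmetric, with more energy reaching the bands above the masker than those below it, and the total energy after spreading exceeds the original energy.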

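The spread critical band spectrum C is obtained from the critical band energies B by the matrix multiplication:

$$\begin{bmatrix} C_1 \\ C_2 \\ C_3 \\ \vdots \\ C_{25} \end{bmatrix} = \begin{bmatrix} S_{1,1} & S_{1,2} & S_{1,3} & \cdots & S_{1,25} \\ S_{2,1} & S_{2,2} & S_{2,3} & \cdots & S_{2,25} \\ S_{3,1} & S_{3,2} & S_{3,3} & \cdots & S_{3,25} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_{25,1} & S_{25,2} & S_{25,3} & \cdots & S_{25,25} \end{bmatrix} \begin{bmatrix} B_1 \\ B_2 \\ B_3 \\ \vdots \\ B_{25} \end{bmatrix} \tag{3-7}$$

Coefficient of tonality

The masking threshold for noise masked by a tone is taken to be (14.5 + i) dB below $C_i$, but the masking threshold for a tone masked by noise is taken to be 5.5 dB below $C_i$ [15]. Johnston uses the Spectral Flatness Measure (SFM), calculated from the geometric and arithmetic means of the power spectrum, to determine how tone-like or noise-like the signal is. The SFM is given by

$$SFM_{dB} = 10\left[ \log_{10}(GM) - \log_{10}(AM) \right] \tag{3-8}$$

where

$$\log_{10}(GM) = \frac{1}{N} \sum_{n=1}^{N} \log_{10}\big(P(n)\big) \tag{3-9}$$

$$\log_{10}(AM) = \log_{10}\left( \frac{1}{N} \sum_{n=1}^{N} P(n) \right) \tag{3-10}$$

and N = window / 2. An SFM of zero dB would indicate that the signal is entirely noise like, while an SFM at or beyond $SFM_{dB\,max}$, where $SFM_{dB\,max} = -60$ dB, would indicate that the signal is entirely tone like. Most tone like signals, such as organ, sine waves, or flute have an SFM that is close to or over the limit. A coefficient of tonality is calculated as follows:

$$\alpha = \min\left( \frac{SFM_{dB}}{SFM_{dB\,max}},\ 1 \right) \tag{3-11}$$

This coefficient is used to geometrically weight the two thresholds, yielding $O_i$, the threshold offset, thus:

$$O_i = \alpha\,(14.5 + i) + (1 - \alpha)\,5.5 \tag{3-12}$$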
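The tonality calculation can be sketched as follows. The sketch assumes the offsets used in Johnston's published model (14.5 + i dB for a tone masking noise, 5.5 dB for noise masking a tone) and a limit of -60 dB for the Spectral Flatness Measure; the function names and the two test spectra are invented for the example, and every power value must be greater than zero for the logarithms to be defined.

```python
import math

SFM_DB_MAX = -60.0          # SFM at which the signal is treated as entirely tone like


def sfm_db(power):
    """Spectral Flatness Measure in dB from the geometric and arithmetic means."""
    n = len(power)
    log_gm = sum(math.log10(p) for p in power) / n
    log_am = math.log10(sum(power) / n)
    return 10 * (log_gm - log_am)


def tonality(power):
    """Coefficient of tonality: 0 for noise like, 1 for tone like signals."""
    return min(sfm_db(power) / SFM_DB_MAX, 1.0)


def threshold_offsets(alpha, bands=25):
    """Offset O_i in dB for critical bands i = 1..bands."""
    return [alpha * (14.5 + i) + (1 - alpha) * 5.5 for i in range(1, bands + 1)]


flat = [1.0] * 1024                      # white, noise like spectrum
peaky = [1e-12] * 1024
peaky[100] = 1.0                         # a single dominant line, tone like

for name, spectrum in (("flat", flat), ("peaky", peaky)):
    alpha = tonality(spectrum)
    offsets = threshold_offsets(alpha)
    print(name, "alpha =", round(alpha, 3),
          "O_1 =", round(offsets[0], 2), "O_25 =", round(offsets[-1], 2))
```

The flat spectrum gives a constant 5.5 dB offset in every band, while the single-line spectrum gives the full tone-masking-noise offset, which grows with the band number.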

18 Audo Codng, 3-D tereo and Presence pread threshold estmate The offset O s subtracted from the spread crtcal band spectrum C to gve the spread spectrum estmate T, thus: T log 10 ( C ) ( /10) 10 O = (3-13) Re-normalsaton of the threshold estmate The spreadng functon descrbed n ecton 3 ncreases the overall energy, where as the psychophyscal process that we are attemptng to model spreads the energy by dspersng t. For example, examne the behavour wth a hypothetcal stmulus wth unty energy n each crtcal band. The actual spreadng functon of the ear wll result n no overall change to the level of energy n any crtcal band 5. However, the spreadng functon presented here wll cause the energy n each band to ncrease, due to the addtve contrbutons of energy spread from adjacent crtcal bands. The soluton presented here s to normalse the threshold estmate at ths stage. A hypothetcal stmulus, wth unty energy n each crtcal band, s used as the B n equaton (3-7), to gve the spread spectrum error, C E, thus: Fgure 3.5: T I pread Threshold Estmate Fgure 3.6: T Normalsed threshold estmate C C C C E1 E 2 E3 E 25 = 1,1 1,2 3,1 25,1 1,2 2,2 3,2 25,2 1,3 2,3 3,3 25,3 1,25 2,25 3,25 25, (3-14) The normalsed threshold estmate from the threshold estmate, thus: ' T s calculated by convertng C E nto db, and subtractng t 5 In realty the lowest and hghest bands wll loose energy by ths process, but all other bands wll loose and gan equal amounts of energy by dsperson, hence the total level of energy n each band wll reman unchanged

$$
T'_i = T_i - 10\log_{10}(C_{Ei}) \qquad (3\text{-}15)
$$

(here $T_i$ is understood to be expressed in dB).

Conversion to dB SPL

To include the absolute threshold of hearing (the Minimum Audible Field, or MAF) in the masking threshold estimate, it is necessary to relate the digital audio signal to a real listening level. Johnston sets the absolute level such that a signal of 4 kHz, with a peak magnitude of ±1 least significant bit in a 16-bit integer, is at the absolute threshold of hearing.

Inclusion of Minimum Audible Field threshold information

The minimum audible field information is taken from [18]. The conversion to dB SPL is carried out using the reference of the MAF threshold at 2 kHz being equivalent to 0 dB SPL [19]. The minimum threshold in each critical band, $M_i$, is taken to be the median value of the MAF curve falling within that band. Hence, the final threshold estimate is given by

$$
T''_i = \max\left(T'_{i\,(\mathrm{dB\,SPL})},\; M_i\right). \qquad (3\text{-}16)
$$

This estimate is compared to the signal level in each band, and a signal to masker ratio (SMR) is calculated. The SMR drives the bit allocation algorithm, as described elsewhere in these notes.

Figure 3.7: $T'_i$, Normalised Threshold Estimate in dB SPL

Figure 3.8: $M_i$, Minimum Audible Field in dB SPL

Figure 3.9: $T''_i$, Final Threshold Estimate in dB SPL, including MAF
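The chain from the spread spectrum to the final threshold, equations (3-13) to (3-16), can be sketched as follows. This is an illustrative outline under several assumptions, not the model's actual code: the function names are invented, the spreading matrix is a made-up 3-band toy rather than the 25-band matrix of (3-7), all thresholds are handled in dB, and the input band energies are assumed to be already calibrated so that their dB values are dB SPL.

```python
import math


def spread(S, B):
    """Matrix-vector product of (3-7): one spread value per critical band."""
    return [sum(s * b for s, b in zip(row, B)) for row in S]


def final_threshold_db(S, B, offsets_db, maf_db):
    """Final threshold per band, following (3-13) to (3-16) in the dB domain."""
    C = spread(S, B)                      # spread critical band spectrum
    C_E = spread(S, [1.0] * len(B))       # (3-14): response to unity energy
    result = []
    for c, o, c_e, m in zip(C, offsets_db, C_E, maf_db):
        t_db = 10.0 * math.log10(c) - o            # (3-13), expressed in dB
        t_norm = t_db - 10.0 * math.log10(c_e)     # (3-15): remove spreading gain
        result.append(max(t_norm, m))              # (3-16): never below the MAF
    return result


# Toy 3-band example; the matrix and levels are made up for illustration.
S_TOY = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.5],
         [0.0, 0.5, 1.0]]
thresholds = final_threshold_db(S_TOY, [1.0, 1.0, 1.0],
                                offsets_db=[5.5, 5.5, 5.5],
                                maf_db=[-10.0, 0.0, -10.0])
```

With unity energy in every band the normalisation cancels the gain added by spreading, so each band's threshold is simply the offset below 0 dB, except in the middle band where the (hypothetical) MAF value of 0 dB is higher and takes over.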

4 Lossless audio coding

The following explanation of lossless audio coding was written by Matt Ashland, and is reproduced with permission. See [6] for more details.

4.1 Conversion to X,Y

The first step in lossless compression is to model the channels L and R in a more efficient manner, as some X and Y values. There is often a great deal of correlation between the L and R channels, and this can be exploited in several ways, one popular way being mid/side encoding. In this case, a mid (X) and a side (Y) value are encoded instead of an L and an R value. The mid (X) is the average of the L and R channels and the side (Y) is the difference between the channels. This can be achieved thus:

    X = (L + R) / 2
    Y = (L - R)

4.2 Predictor

Next, the X and Y data are passed through a predictor in an attempt to remove any redundancy. The aim of this stage is to make the X and Y arrays contain the smallest possible values while still remaining decompressible. This stage is what separates one compression scheme from another. There are virtually countless ways to do this. Here is a sample using simple linear algebra, where PX and PY are the predicted X and Y, X_-1 is the previous X value, and X_-2 is the X value two back:

    PX = (2 * X_-1) - X_-2
    PY = (2 * Y_-1) - Y_-2

As an example, if X = (2, 8, 24, ?):

    PX = (2 * X_-1) - X_-2 = (2 * 24) - 8 = 40

Then, these predicted values are compared with the actual value, and the difference (error) is what gets sent to the next stage for encoding.

Most good predictors are adaptive, i.e. they adjust to how predictable the current data is. For example, consider a factor 'm' that ranges from 0 to 1024 (0 is no prediction and 1024 is full prediction). After each prediction, m is adjusted up or down depending on whether the prediction was helpful or not. Therefore, in the previous example, the output of the predictor is:

    X = (2, 8, 24, ?)
    PX = (2 * X_-1) - X_-2 = (2 * 24) - 8 = 40

If ? = 45 and m = 512, then

    [Final Value] = ? - (PX * m / 1024)
                  = 45 - (40 * m / 1024)
                  = 45 - (40 * 512 / 1024)
                  = 45 - 20
                  = 25
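The mid/side conversion and the weighted second-order predictor described in Sections 4.1 and 4.2 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the code of any particular compressor: the function names are invented, integer division is assumed for X (the lost low bit is recovered from the parity of Y, a common arrangement that the text does not spell out), and the rule used to adjust m, including its step size of 16, is purely hypothetical.

```python
def to_mid_side(left, right):
    """X = (L + R) / 2 with integer division, Y = L - R."""
    return (left + right) // 2, left - right


def from_mid_side(mid, side):
    """Invert to_mid_side; L + R and L - R always share the same parity."""
    total = 2 * mid + (side & 1)          # restore the bit lost by // 2
    return (total + side) // 2, (total - side) // 2


def predict(prev1, prev2):
    """Second-order prediction: PX = (2 * X_-1) - X_-2."""
    return 2 * prev1 - prev2


def residual(actual, prev1, prev2, m):
    """Value passed to the encoder: actual - (PX * m / 1024)."""
    return actual - (predict(prev1, prev2) * m) // 1024


def adapt_m(m, actual, predicted, step=16):
    """Hypothetical update: raise m when full prediction beats no prediction."""
    if abs(actual - predicted) < abs(actual):
        return min(1024, m + step)
    return max(0, m - step)


x = [2, 8, 24, 45]
px = predict(x[2], x[1])                    # prediction for the fourth sample
out = residual(x[3], x[2], x[1], m=512)     # half-weighted prediction
new_m = adapt_m(512, x[3], px)
```

For the sequence (2, 8, 24, 45) the prediction is 40 and the half-weighted residual is 25, matching the worked example; because the full prediction would have left an error of only 5, this hypothetical rule raises m.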

After this sample (actual value 45, prediction 40), m would be adjusted upwards because a higher m would have been more efficient.

Using different prediction equations and using multiple passes through the predictor can make a substantial difference in the compression ratio that may be achieved. Here is a list of some prediction equations as shown in the Shorten technical documentation [20] (for different orders):

    P0 = 0
    P1 = X_-1
    P2 = (2 * X_-1) - X_-2
    P3 = (3 * X_-1) - (3 * X_-2) + X_-3

4.3 Encoding of Data / Rice coding

The goal behind audio compression is to make all of the numbers as small as possible by removing any correlation that may exist between them. Once this is achieved, the resulting numbers must be written to disk. One of the most efficient ways to do this (if not the most efficient) is with Rice coding.

Why are smaller numbers better? They are better because they can be represented using fewer bits. For example, consider the following array of numbers (32-bit longs):

    Base 10: 10, 14, 15, 46

or, in binary:

    Base 2: 1010, 1110, 1111, 101110

Now obviously if we want to represent these numbers in the fewest possible bits, it would be quite inefficient to represent them each as separate longs with 32 bits apiece. That would take 128 bits, and just from looking at the same numbers represented in base two, it is obvious that there must be a better way. The ideal thing would be just to concatenate the four numbers together using the fewest bits necessary, so 1010, 1110, 1111, 101110 without the commas would be 101011101111101110. The problem here is that we don't know where one number ends and the next begins. This is where Rice coding comes into play. Rice coding is a way of using fewer bits to represent small numbers, while still maintaining the ability to tell one from the next.
In essence, it works as follows:

1) Make a best guess as to how many bits a number will take, and call that k
2) Take the rightmost k bits of the number and remember what they are
3) Imagine the binary number without those rightmost k bits and look at its new value (this is the overflow that doesn't fit in k bits)
4) Use these values to encode the number. This encoded value is represented as a number of zeroes corresponding to step 3, followed by a 1 to terminate the "overflow", then finally the k bits from step 2.

Consider the fourth number in the example 10, 14, 15, 46, as follows:

1) You make your best guess as to how many bits a number will take, and call that k: since the previous 3 numbers took 4 bits, that seems like a reasonable guess, so we will set k = 4
2) Take the rightmost k bits of the number and remember what they are: the right 4 bits of 46 (101110) are 1110
3) Imagine the binary number without those rightmost k bits and look at its new value (this is the overflow that doesn't fit in k bits): when you take the 1110 away from the right of 101110 you are left with 10, or 2 (in base 10)
4) Use these values to encode the number: we put two 0's, followed by the terminating 1, followed by the k bits 1110. Altogether we have 0011110

To reverse this operation, we just take 0011110 and k = 4 and work our way backwards. We first see that the overflow is 2 (there are two zeroes before the terminating 1). We also see that the last four bits are 1110. So, we take the value 10 (the overflow) and the value 1110 (the k bits) and just do a little shifting, and voilà: (10 << 4) + 1110 = 101110 = 46 (the overflow is shifted left by k bits).

Here is a slightly more technical and mathematical description of the same process. Assume some integer n is the number to encode, and k is the number of bits to encode directly. The code is made up of:

1) sign (1 for positive, 0 for negative)
2) n / (2^k) 0's
3) terminating 1
4) k least significant bits of n

As an example, if n = 578 and k = 8:

1) sign (1 for positive, 0 for negative): [1]
2) n / (2^k) 0's: n / 2^k = 578 / 256 = 2 (integer division), giving [00]
3) terminating 1: [1]
4) k least significant bits of n: 578 = 1001000010 in binary, giving [01000010]
5) put 1-4 together: [1][00][1][01000010] = 100101000010

During the encode process, the optimum k is determined by looking at the average value over a number of recent values, and choosing the optimum k for that average (basically, it guesses what the next value will be, and tries to choose the most efficient k based on that). The optimum k can be calculated as [log(n) / log(2)], where n is that average.
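The encoding and decoding steps above can be sketched in Python, using strings of '0' and '1' characters in place of a real bit stream for readability. The function names and the string representation are choices made for this sketch; `estimate_k` assumes that the square brackets in the optimum-k formula mean "integer part" and that the average is taken over magnitudes.

```python
import math


def rice_encode(n, k, signed=False):
    """Rice code of n; n must be non-negative unless signed is True."""
    bits = ""
    if signed:
        bits += "1" if n >= 0 else "0"    # sign bit: 1 positive, 0 negative
        n = abs(n)
    bits += "0" * (n >> k) + "1"          # overflow zeroes, then terminating 1
    if k > 0:
        bits += format(n & ((1 << k) - 1), "0{}b".format(k))  # k low bits
    return bits


def rice_decode(bits, k, pos=0, signed=False):
    """Decode one value starting at pos; return (value, next position)."""
    sign = 1
    if signed:
        sign = 1 if bits[pos] == "1" else -1
        pos += 1
    overflow = 0
    while bits[pos] == "0":               # count zeroes up to the terminator
        overflow += 1
        pos += 1
    pos += 1                              # skip the terminating 1
    low = int(bits[pos:pos + k], 2) if k > 0 else 0
    return sign * ((overflow << k) | low), pos + k


def estimate_k(recent):
    """Integer part of log2 of the mean magnitude of recent values."""
    mean = sum(abs(v) for v in recent) / len(recent)
    return int(math.log2(mean)) if mean >= 1 else 0


values = [10, 14, 15, 46]
stream = "".join(rice_encode(v, 4) for v in values)

decoded, pos = [], 0
while pos < len(stream):
    value, pos = rice_decode(stream, 4, pos)
    decoded.append(value)
```

With k = 4 the four example numbers occupy 22 bits instead of 128, and the terminating 1 after each run of zeroes is what lets the decoder find the boundaries. Calling `rice_encode(46, 4)` reproduces the 0011110 of the worked example, and `rice_encode(578, 8, signed=True)` reproduces 100101000010.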

REFERENCES

[1] Moore, G. E. (1965). Cramming More Components onto Integrated Circuits. Electronics, vol. 38, April.
[2] PKWARE (WEB). Genuine PKZIP Products.
[3] Aladdin Systems (WEB). StuffIt - The complete zip and sit compression solution.
[4] Liebchen, T. (WEB). LPAC - Lossless Predictive Audio Compression.
[5] Gerzon, M. A.; Craven, P. G.; Stuart, J. R.; Law, M. J.; Wilson, R. J. (1999). The MLP Lossless Compression System. Paper I, presented at the Audio Engineering Society 17th International Conference: High-Quality Audio Coding, September.
[6] Ashland, M. T. (WEB). Monkey's Audio - a fast and powerful lossless audio compressor.
[7] Rimell, A. (1996). Psychoacoustic foundations, in Reduction of loudspeaker polar response aberrations through the application of psychoacoustic error concealment. PhD thesis, Department of Electronic Systems Engineering, University of Essex.
[8] ISO/IEC (1993). Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: Audio. Geneva: International Organisation for Standardization.
[9] ISO/IEC (1998). Information technology - Generic coding of moving pictures and associated audio information - Part 3: Audio. Geneva: International Organisation for Standardization.
[10] Dietz, M.; Herre, J.; Teichmann, B.; Brandenburg, K. (1997). Bridging the Gap: Extending MPEG Audio Down to 8 kbit/s. Preprint 4508, presented at the 102nd convention of the Audio Engineering Society, March 1997.
[11] Brandenburg, K.; Bosi, M. (1997). Overview of MPEG Audio: Current and Future Standards for Low-Bit-Rate Audio Coding. Journal of the Audio Engineering Society, vol. 45, Jan.
[12] Brandenburg, K. (1999). MP3 and AAC Explained. Paper presented at the AES 17th International Conference of the Audio Engineering Society: High-Quality Audio Coding, September.

[13] Hollier, M. P. (1996). Data Reduction - A series of 3 lectures. Course notes, M.Sc. Audio Systems Engineering, University of Essex.
[14] Johnston, J. D. (1988a). Estimation of Perceptual Entropy Using Noise Masking Criteria. ICASSP, A1.9.
[15] Johnston, J. D. (1988b). Transform Coding of Audio Signals Using Perceptual Noise Criteria. IEEE Journal on Selected Areas in Communications, vol. 6, Feb.
[16] Zwicker, E.; Terhardt, E. (1980). Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. Journal of the Acoustical Society of America, vol. 68.
[17] Schroeder, M. R.; Atal, B. S.; Hall, J. L. (1979). Optimizing digital speech coders by exploiting masking properties of the human ear. Journal of the Acoustical Society of America, vol. 66.
[18] Robinson, D. W.; Dadson, R. S. (1956). A re-determination of the equal-loudness relations for pure tones. British Journal of Applied Physics, vol. 7.
[19] ISO (1996). Acoustics - Reference zero for the calibration of audiometric equipment - Part 7: Reference threshold of hearing under free-field and diffuse-field listening conditions. Geneva: International Organisation for Standardization.
[20] Robinson, T. (WEB). SHORTEN: Simple lossless and near-lossless waveform compression.

Background Reading: [11] Sections 2 and 3

Adapted from: Robinson, D. J. M. (2002). Perceptual model for assessment of coded audio. PhD thesis, Department of Electronic Systems Engineering, University of Essex (in press).