Multichannel Audio Modeling and Coding Using a Multiscale Source/Filter Model
University of Crete, Department of Computer Science
FO.R.T.H. Institute of Computer Science

Multichannel Audio Modeling and Coding Using a Multiscale Source/Filter Model
(MSc Thesis)

Kyriaki Karadimou
Heraklion, November 2005
Department of Computer Science, University of Crete

Multichannel Audio Modeling and Coding Using a Multiscale Source/Filter Model

Submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science, November 14, 2005.
© 2005 University of Crete & ICS-FO.R.T.H. All rights reserved.

Author: Kyriaki Karadimou, Department of Computer Science

Board of enquiry:
Supervisor: Panagiotis Tsakalides, Associate Professor
Member: Yannis Stylianou, Associate Professor
Member: Apostolos Traganitis, Professor

Accepted by:
Chairman of the Graduate Studies Committee: Dimitris Plexousakis, Associate Professor

Heraklion, November 2005
In memory of my father
Abstract

During the last decade, stereophonic audio has progressively been replaced commercially by multichannel audio. The increase in the number of audio channels, at both the capturing and rendering sides, has led to more realistic audio recordings that immerse the listener into the acoustic scene, but it has also increased the required transmission data rates. For this reason, many compression techniques have been proposed in order to provide efficient solutions under various storage and transmission constraints. Consequently, multichannel audio compression algorithms have been developed which reduce not only the intra-channel redundancies, but the inter-channel redundancies as well. These algorithms, while very effective, remain highly demanding for many practical applications. This thesis therefore proposes a source/filter model which divides the multichannel audio signal into two parts: a low-dimensional part that contains the microphone-specific information, and a high-dimensional part that contains most of the inter-channel redundancy. By taking advantage of this redundancy among the channels, we achieve very low data rate requirements, such as 10 kbps per channel for high-quality coding (similar to the original recording). For comparison, current compression algorithms for multichannel audio, such as Dolby AC-3 and MPEG-2 AAC, have minimal bit rate requirements on the order of 64 kbps per channel for achieving acoustically indistinguishable encoding. Another relevant issue is that, in order to create the multiple channels for multichannel audio rendering, a large number of microphones is used in a venue. These microphone signals are then mixed into a smaller number of channels that constitute the final multichannel audio recording. Methods that allow for remote mixing are of great interest for various applications in the music industry. One of our goals was to design an algorithm that can be used towards addressing such issues.

Another area of interest is the remote collaboration of geographically distributed musicians, a field of great significance with extensions to music education and research. The proposed approach relaxes the current bandwidth constraints of these demanding applications, enabling their widespread usage and more clearly revealing their value. The coding part of our model is based on a speech coding scheme, which estimates the probability density function of the source, giving a more compact representation of the transmitted data, and efficiently quantizes the estimated parameters at varying bit rates. This thesis applies this coding method for the first time in the multichannel audio domain.
Greek Abstract

In recent years, a major revolution has been taking place in the field of sound reproduction. Following the transition from analog systems to digital audio, stereophonic reproduction systems have begun to give way to multichannel ones, increasing the number of reproduction channels (loudspeakers). This has led to more realistic recordings, which immerse the listener into the acoustic scene, giving the sense of being present at the recording venue (e.g., a concert hall). On the other hand, this amount of information has also increased the requirements for storage and for efficient transmission, whether over the Internet or over wireless networks. For this reason, algorithms for multichannel audio compression have recently been developed, which reduce not only the information of each channel separately (intra-channel) but also the information that may be shared among more than one channel (inter-channel). These algorithms, although highly effective, are still considered impractical for many applications where the transmission bandwidth is particularly low. At this point we should mention that a multichannel recording is made with more microphones than the final number of channels of the recording. These microphone signals are used for the synthesis of the multichannel recording, a procedure usually referred to as "mixing", which is carried out by experts, based mainly on aesthetic criteria and empirical knowledge. Today, in order to transmit a multichannel recording (e.g., a live concert) over digital radio, the presence of the mixing expert at the venue where the concert takes place is necessary; moreover, the entire procedure must take place at that venue. In practice, this constitutes a problem and is one of the reasons that currently hinder the transmission of live multichannel audio programs.

With the method we present, remote mixing of multichannel audio becomes possible, which has many applications in the music industry. Another application for which the existing coding techniques are insufficient is the simultaneous remote collaboration of musicians. This is considered one of the most important applications of virtual environments, and it has been shown that high transmission rates are required for musicians to collaborate without perceiving the transmission delays. One of the goals of this thesis is, therefore, to present a new method which removes the factors that make such applications prohibitive in practice (low transmission rates, difficulties in mixing) and allows their wider use and spread. The present thesis proposes a new method that codes multichannel music at very low transmission rates. Specifically, we propose a source/filter model which separates the information that is common to all audio channels from the information that characterizes each particular channel. In this way, the common information can be transmitted over the network for only one channel (the reference channel). The remaining channels are resynthesized at the receiver from the reference channel, using the additional information (which should have the smallest possible network rate requirements). By exploiting the redundant information of each channel, we achieve transmission rates on the order of 10 kbps per channel for high-quality coding (similar to the quality of the original recording). Indicatively, related multichannel audio compression systems, such as the popular Dolby AC-3 and MPEG-2 AAC, have a minimum required transmission rate of 64 kbps per channel. Finally, for the coding of the parameters of the proposed model we rely on a speech signal coding system, which approximates the probability density function of the source (providing a compact representation of the data to be transmitted) and efficiently quantizes the model parameters at varying transmission rates. We note that this coding method is applied in the present work for the first time to multichannel audio. The interested reader can find more information on the proposed method (in Greek) in Appendix B.
Acknowledgments

In this section, stepping for a moment outside the scientific scope of this thesis, I feel the need to thank my father, Dimitrios Karadimos, who believed in me, supported my every effort, and to whom I owe to a great degree what I am today. I would also like to thank my mother, Kalliopi Karadimou, for her boundless support during my difficult moments here on the island (without her, none of this would have happened), as well as my beloved sisters, Tatiana and Maria Karadimou. I would further like to thank my supervisor, Prof. Panagiotis Tsakalides, for his encouragement and his guidance. Despite the load of his obligations as chairman, he was always present with substantial advice, based both on his scientific knowledge and on his long experience in research. I thank him and hope that I have proved worthy of his expectations and his trust. I am also particularly grateful to my second advisor and specialist in the field of multichannel audio, Athanasios Mouchtaris, without whose substantial contribution this thesis would not have been completed. During the last year and a half we had an excellent and pleasant collaboration, from which I gained a great deal. I would next like to thank the other two members of the examination committee of my Master's thesis, Prof. Apostolos Traganitis and Prof. Yannis Stylianou, for their participation and their useful remarks. I would especially like to thank Prof. Stylianou, several of whose courses I attended; I believe that he is a remarkable teacher with a strong inclination towards, and love for, teaching, and his advice and observations helped significantly in this effort. Another member of the Department whom I thank, for his valuable guidance and psychological support during my first steps in the Department of Computer Science (coming as I was from a completely different background), is my first academic advisor, Prof. Georgakopoulos. By no means would I like to forget Mrs. Rena Kalaitzaki, the "mother" of the graduate students as we often call her, whose helpfulness, interest, kindness and diligence made a particular impression on me, accustomed as I was to the completely opposite behavior of the secretarial staff of my previous school. I also warmly thank, for the time they devoted, all those who participated in the listening tests of this thesis; their opinion was invaluable and helped in drawing useful conclusions about the method we propose. I owe a big thank you to Ilias Grinias, both for his help and for the patience he showed during my various anxiety crises. Also important in the course of this work were the discussions on coding topics, as well as the carefree excursions to various beautiful corners of Crete, with Yannis Agiomyrgiannakis. Another friend, whose support and constructive grumbling helped in their own way, is Christos Papachristou; I wish him all the best with his doctorate. I also wish Panagiotis Koutsourakis, whose restless spirit and humor were often the occasion for very pleasant breaks, a good stay and a good return, because we miss him. With Zoe Politopoulou, Themida Zamani, Maria Markaki and Vasiliki Alevizou we shared many pleasant moments, discussing every subject under the sun. I wish Zoe courage and patience with the writing of her Master's thesis, Themida never to lose her liveliness, Maria good luck with her doctorate, and Vasiliki a good career; and may she not forget, as none of us should, the words of Cavafy... And if you cannot make your life as you want it, at least try this as much as you can: do not degrade it in too much contact with the world.
Contents

1 Introduction
    1.1 Motivation
    1.2 Structure of the thesis

Part I: Theoretical Background

2 Background
    Introduction to Multichannel Audio
    Multichannel Audio Coding Systems
        Dolby AC-3
        MPEG-2 Advanced Audio Coding
        MAACKLT Multichannel Audio Compression Algorithm
        Binaural Cue Coding
    Areas of Applications

3 Filter Banks
    Introduction to Filter Banks
    Sampling Operation
    Matrix Representations
    Polyphase Decomposition
    Perfect Reconstruction
    Quadrature Mirror Filter Banks
    Orthogonal Filter Banks
    Linear-Phase Filter Banks
    Tree-Structured Filter Banks
    Polyphase M-Channel Filter Banks
        DFT Filter Banks
        MDFT Filter Banks
        Cosine Modulated Filter Banks
            Critically Subsampled Case
            Oversampled Case
        Modified Discrete Cosine Transform Filter Banks
            MDCT and Window Functions

4 Wavelets
    Introduction to Wavelets
    Continuous Wavelet Transform
        Morlet Wavelet
    Discrete Wavelet Transform
        Haar Wavelets
        Daubechies Wavelets
        Biorthogonal Wavelets
        Comparison of Orthogonal and Biorthogonal Wavelets
        Other Wavelet Families

5 Autocorrelation Analysis
    Autoregressive Models
        Correlation Matrix
        Power Spectral Density
        Yule-Walker Equations
    Linear Prediction Model
        Wiener-Hopf Equations
        Levinson-Durbin Algorithm
        Linear Prediction and Autoregressive Models
    Line Spectral Frequencies (LSF)

6 Random Process Modeling and Decorrelation
    Introduction
    Gaussian Mixture Model
        Model Definition
        Model Motivations
    Karhunen-Loève Transform
        Eigen-analysis
        Karhunen-Loève Transform Definition

Part II: Multichannel Audio Modeling and Coding

7 Multiscale Source/Filter Model
    Introduction
    Recordings for Multichannel Audio
    Multiscale Source/Filter Model
    Modeling Results

8 Multichannel Audio Coding
    Introduction
    General Model Description
    Quantization of Speech Line Spectral Frequencies
    Coding Results

9 Conclusion and Future work

Part III: Appendices
    A The first steps of Multichannel Audio
    B Description of the thesis in Greek
List of Tables

7.1 Results from the ABX listening tests. We tested 4 different types of filter banks (3 wavelet-based and 1 MDCT-based), namely 8-band with 40th-order Daubechies filters (db40, test ABX-1) and db4 (ABX-2), 2-band with db40 (ABX-3), and 32-level MDCT-based with KBD window (ABX-4).

8.1 The Log Spectral Distortion for various bit rates. The value of 10 KBits/sec is found to be the minimal bit rate for high-quality coding (similar to the original recording). For comparison, current compression algorithms for multichannel audio have minimal bit rate requirements on the order of 64 KBits/sec/channel for achieving high-quality encoding.

8.2 An example of the fixed-rate total bits that were assigned to every band for the coding procedure, corresponding to the 10 KBits/sec case of Table 8.1 (bit rate vs. LSD).

8.3 The LSD values for various numbers of GMM classes per band. We can conclude that the LSD decreases only marginally as the number of GMM clusters increases.

8.4 During the coding procedure we choose, among the coded LSF vectors, the one with the minimum LSD. Here, we compare it with the case of coding the LSF vector with the GMM cluster of maximum probability.
List of Figures

- 2.1 The setup of the speakers in a 5.1-channel surround system.
- The encoding procedure of the Dolby AC-3 coding system.
- The principle of Psychoacoustic Masking [1].
- The decoding procedure of the Dolby AC-3 coding system.
- The encoding procedure of the MPEG-2 AAC coding system.
- The decoding procedure of the MPEG-2 AAC coding system.
- Modified AAC encoder in the MAACKLT multichannel audio compression algorithm [2].
- The proposed decoder in the MAACKLT multichannel audio compression algorithm [2].
- The main framework of Binaural Cue Coding [3].
- The psychoacoustic parameters synthesis scheme in the decoder of Binaural Cue Coding [4].
- Newer automobiles have at least five loudspeakers and a subwoofer, which, compared to home environments, are located in a better configuration.
- M-channel filter bank.
- (a) Downsampling and (b) upsampling operations.
- Polyphase Decomposition for M=
- Analysis filter bank: (a) direct implementation, (b) polyphase realization.
- Synthesis filter bank: (a) direct implementation, (b) polyphase realization.
- (a) Two-channel filter bank; (b) signal spectra with aliasing.
- Regular tree-structured filter banks: (a) analysis, (b) synthesis.
- Octave-band tree structure with J stages: (a) analysis, (b) synthesis octave-band tree structure filter bank, and (c) the corresponding frequency response.
- M-channel filter bank in polyphase structure.
- Idealized frequency response of the DFT filter bank.
- DFT polyphase filter bank.
- 3.13 Modified DFT filter bank [5].
- (a) Analysis and (b) synthesis cosine modulated filter bank with critical subsampling.
- Kaiser window for window length n = 10 and α = 0.5, 1, 2, 4, 8, 16 (obtained from window).
- KBD window function for window length n = 100 and α = 0.5, 2, 8, 32 (obtained from window).
- Wavelet analysis contrasted with other analysis techniques.
- A Dirac pulse at t = t_0 and its region of influence (a) for the CWT and (b) for the STFT [6, p. 20].
- Morlet wavelet in the time domain.
- Octave-band analysis and synthesis filter bank [5, p. 222].
- Haar wavelet and scaling function [5, p. 230].
- Some members of the Daubechies family.
- Scaling function, wavelet, and their duals for the biorthogonal wavelets used in the FBI fingerprint compression standard [5, p. 251].
- Symlet wavelets.
- Mexican Hat wavelet.
- Meyer wavelet.
- AR analyzer with delayed inputs u(n-1), u(n-2), ..., u(n-M) and parameters a_1, a_2, ..., a_M.
- Forward linear predictor of order M, with tap inputs u(n-1), ..., u(n-M) and linear prediction coefficients w_1, ..., w_M.
- The forward prediction-error filter, which uses the inputs u(n), u(n-1), ..., u(n-M) to produce the error e_M(n).
- (a) A forward prediction-error filter of order M and (b) the corresponding autoregressive model.
- (a) The spectrum of a signal with linear prediction coefficients at 210, 1280, 2320, 2720 and 3180 Hz and (b) the corresponding poles, whose angles θ_i (i = 1, ..., 4) with the horizontal axis are the LSFs.
- The representation of a GMM with M components. The unknown probability density function g(x) of a random vector x is given as the weighted sum of M Gaussian component densities N(x; μ_i^x, Σ_i^xx) with mixture weights p(ω_i), i = 1, ..., M.
- 7.1 Consider a microphone signal M_1, which can be seen as the convolution of its spectral envelope s_1 and the residual signal e_1. The residual e_1 will then contain the same harmonics as M_1, but their amplitudes will have an almost flat shape in the frequency spectrum.
- If the AR vector could capture the exact envelope of the spectrum of the recordings of microphones M_1 and M_2, the two residual signals e_1 and e_2 would have flat magnitude with exactly the same frequency components and would resemble each other; thus, they are almost equal in the frequency domain.
- Every recording is decomposed into M subband signals. For each signal we compute its spectral envelope; for the subbands of the first recording, we also compute the residual signal. We then code and transmit all the envelopes together with the residual of the first recording.
- Normalized Mutual Information between the residual signals from the reference and target recordings, as a function of the number of bands of the filter bank, for various filter orders. Increasing the filter order results in better separation of the different bands in the frequency domain and thus in better modeling of the spectral envelopes. The latter is a very important feature, since it leads to more similar residual signals, which in turn increases the NMI.
- Results from the 5-grade scale DCR-based listening tests, with graphical representations of the 95% confidence interval (the x's mark the mean value and the two horizontal lines indicate the confidence limits). These results show clearly that the resynthesized signals are of high quality (similar to the quality of the original recording) and that the model does not seem to introduce any serious artifacts.
- The proposed Multiscale Source/Filter Model.
- The modeling and coding procedure of (a) the first recording and (b) the j-th recording (j = 2, ..., N), in the special case where a wavelet filter bank of W layers is used.
- Every layer of the wavelet decomposition is segmented into a series of k short-time overlapping frames using a sliding Hamming window, and AR analysis is applied to each frame. The spectral envelopes are modeled as LPCs, which are converted to LSFs. (a) In Layer 1, the LSFs are quantized and transmitted together with the residual signals, while (b) in Layer j (j = 2, ..., M) only the quantized LSFs are transmitted.
- 8.4 Quantization scheme among clusters.
- The logarithmic compression function c(x) = ½(1 + erf(x/√6)).
- Overall quantization scheme.
- The reconstruction of the initial recordings at the receiver.
- Results from the 5-grade scale DCR-based listening tests for the coding procedure.
Chapter 1: Introduction

During the last decade, a revolution has been taking place in the field of sound reproduction. Stereophonic audio is progressively being replaced commercially by multichannel audio. Audio reproduction systems with 5 or even 7 channels around the listener, and 1 or 2 channels for low-frequency sounds, are becoming more and more popular. This increase in the number of audio channels, at both the capturing and rendering sides, leads to more realistic audio recordings that immerse the listener into the acoustic scene. The most prevalent configuration is the so-called 5.1 channels. It was introduced in the film industry[1] and tends to become the standard in home theatre systems.

In the long run, multichannel audio reproduction systems are expected to be replaced by systems that immerse the listener into a virtual acoustic scene. The difference lies in the number of reproduction channels and in the way they are placed (e.g., placing the listener in a sphere of loudspeakers). However, the main objective is the creation of an interactive scene, where the listener is not simply a passive receptor but can alter the acoustic scene according to his needs. These systems will allow the listener to modify his environment in real time and with the highest possible realism. For example, he will have the opportunity to virtually experience a live concert in the Boston Symphony Hall, through live audio and video streaming of the performance. Immersive audio virtual rendering systems will also allow the remote collaboration of musicians. For example, geographically distributed musicians will have the opportunity to give a common concert while being in different locations. The audience can also be in different places, while having the sense of being in a specific concert hall, selected by the virtual concert's organizer. The connection of the musicians and the audience will take place through the Internet.

There are several immersive audio system applications of great importance, whose

[1] For more information about the first steps of multichannel audio, the interested reader is referred to Appendix A.
experimental implementation is made possible by today's technology. However, they are difficult to implement outside laboratory conditions. For example, the IMSC[2] captured the New World Symphony's performance of Copland's Symphony, using multichannel audio methods and high-definition video, and capturing the acoustics of Miami Beach's Lincoln Theatre. The concert was stored on an IMSC server in Arlington and streamed to the campus of the University of Southern California in Los Angeles during the performance event. A 10.2-channel Immersive Audio system was installed in the Bing Theatre to render the sound for the live audience as if they were at the Lincoln Theatre on the original night of the performance. In this demonstration, the only constraint was the required bit rate; special networks were used with bit rates on the order of GBits/sec. Besides the fact that applications of this kind are impossible to implement under current Internet conditions, the implementation of virtual applications adds further constraints with regard to the required transmission bit rates.

1.1 Motivation

The increase in the number of audio channels that occurred in the passage from stereophonic to multichannel audio gives rise to a very important issue: how to efficiently record, store and transmit this number of audio channels over the current Internet infrastructure and wireless networks. This issue gained the interest of the research community, and many compression techniques have been proposed in order to give efficient solutions to the issues of low data rate, low complexity and error handling. Consequently, multichannel audio compression algorithms have been developed which reduce not only the intra-channel redundancies, but the inter-channel redundancies as well. Some popular methods of the latter kind are Mid/Side Coding [7, 8, 9, 10], Intensity Stereo Coding [11, 10, 12, 13] and KLT-based methods [2].

Although the multichannel audio coding algorithms mentioned in the previous paragraph reduce the data rates required by the original recording, they still remain highly demanding for many practical applications where the available channel bandwidth is low. This is especially important given the fact that many multichannel audio systems require even more than the 5.1 channels of the currently popular standards, and thus even higher data rates. In recent years, the concept of Spatial Audio Coding has been introduced, with the objective of further taking advantage of the inter-channel redundancies in multichannel audio recordings. Under this approach, the objective is to decode a (stereo or mono) channel of audio using some additional (side) information, so as to recreate

[2] Integrated Media Systems Center (IMSC) Laboratory of the University of Southern California (USC), Los Angeles, USA.
the spatial rendering of the original multichannel recording. The side information is extracted during encoding; in the most popular implementation of this approach, Binaural Cue Coding (BCC) [14, 3], the side information contains the inter-channel level difference, time difference, and correlation. The resulting signal contains one full channel of audio (the downmix), along with the side information, whose bit rate is on the order of a few KBits/sec per channel.

Multichannel audio recordings are made using a large number of microphones in a venue, resulting in numerous microphone signals. These are then mixed in order to create the final multichannel audio recording. In many applications it would be desirable to transmit the multiple microphone signals of a performance before they are mixed into the (usually much smaller number of) channels of the multichannel recording. This would allow for remote mixing of the multichannel recording, which is important for many applications in the music industry. Remote collaboration of geographically distributed musicians is a field of great significance, with extensions to music education and research. Current experiments have shown that high data rates are needed so that musicians can perform and interact with minimal delay [15]. Remote mixing on the client side would also enable users to interact with the music in an unparalleled fashion, allowing them to create their own music by mixing sounds as they please.

In this thesis, we propose a source/filter model for multichannel audio recordings that can be utilized for revealing the underlying inter-channel similarities. Our model consists of a filter part, which corresponds to the microphone-specific information of each channel, and a source part, which contains mostly the inter-channel similarities. Using the appropriate filter for each channel and the source part of only one of the microphone signals, we can resynthesize a high-quality approximation of each channel. Only the filter part of each channel needs to be encoded, requiring a few KBits/sec per channel for high-quality coding (along with one full audio channel); this part can be used to decode the multiple channels at the receiving end. Thus, the model achieves low bit rates for the transmission of the multiple microphone signals of a music performance. This is important, since our method allows the transmission of multichannel audio signals through low-bandwidth channels, such as the current Internet infrastructure or wireless networks for broadcasting.

The methods proposed in this thesis are tailored towards the transmission of the various microphone signals (stem recordings) of a performance before they are mixed into the final multichannel signal. The algorithm can handle a large number of microphone signals that need to be transmitted, and it can be applied in applications such as remote mixing and distributed performances. Our approach relaxes the current bit rate constraints of these applications, allowing their widespread usage. Our method has the same objective as Spatial Audio Coding, i.e., to reduce a multichannel recording to one full audio channel and some side information on the order of a few KBits/sec per channel. However, it should be viewed as a generalization of BCC. In BCC, the side information can be used to recreate the spatial rendering of the various channels. In our method, the side information can theoretically (as we explain in detail in the next chapters) recreate the exact microphone signals of the multichannel recording.

Finally, in our methodology we propose a tradeoff between the accuracy and the objectives of the final multichannel recording. We propose that it is possible to achieve low data rates by substituting some microphone signals with others which, although acoustically different, retain the objectives of the initial recording. By the term objectives we refer to the aesthetic reason for a particular microphone placement (e.g., a microphone might be placed close to the chorus of an orchestra in order to place emphasis on this part of the orchestra). With our algorithm, we attempt to resynthesize the recording of this chorus microphone signal using another microphone signal of the same performance (e.g., from a microphone placed close to the violins). The resynthesized signal can be of very good quality and might sound as if it were recorded with a microphone placed close to the chorus, albeit different when compared to the actual chorus microphone signal. This is the case when the new signal retains the objectives of the recording, with a loss of accuracy (i.e., the new recording does not sound the same as the original recording). We claim that with our model it is possible to achieve low data rates and good audio quality, and to retain the sense of realism, without significant sacrifices regarding the accuracy of the multichannel recording. The performance of the proposed model is verified by objective and subjective measures.
The current model was presented in [16] 1.2 Structure of the thesis The present thesis deals with the issue of Multichannel Audio Modeling and Coding and it is organized as follows: In Chapter 2, we describe the architecture of the most common used coding systems for multichannel audio; the Dolby AC-3 and the MPEG-2 AAC. We also present an algorithm similar to AAC, the MAACKLT, which proposes a new technique related to the reduction of inter-channel redundancies and a recent development in the concept of Spatial Audio Coding, called Binaural Cue Coding (BCC). In the next four chapters (Chapter 3 to 6) we discuss some fundamental theoretical issues, which contribute in the plenitude of the thesis and they are used in the implementation of the proposed method. In Chapter 3, we present the basic theory regarding filter banks and we describe some popular filter banks such as QMF, DFT, MDFT and MDCT. Some of these filter banks have been implemented in order to improve the modeling performance of our algorithm. In
Chapter 4, we continue the theoretical analysis, examining the framework of wavelets. We begin with the definitions of the Continuous and Discrete Wavelet Transforms and we describe some of the most frequently used wavelet families (Morlet, Haar, Daubechies, Symlets and more). In Chapter 5, we study two fundamental models in autocorrelation analysis: the Autoregressive Model and the Linear Prediction Model. We close this chapter with a description of the Line Spectral Frequency (LSF) coefficients that are widely used in the proposed model. In Chapter 6, we present the Gaussian Mixture Model (GMM) and the Karhunen-Loève Transform, which are combined in our method to model and decorrelate the LSFs of the multichannel signals. In Chapter 7, we propose a novel source/filter model for the transmission of multichannel audio signals, which takes advantage of the redundancy among the channels in order to achieve low data rate requirements. In particular, we begin with a brief description of how multimicrophone recordings for multichannel rendering are made, we continue with the description of the proposed method, and we conclude with some modeling results, such as distortion measurements and statistical treatment of experimental data. In Chapter 8, we extend the speech source coding scheme of [17] to multichannel audio recordings. In this scheme, an optimal parametric vector quantizer for speech LSFs is described, which has been applied in our source/filter model, and we present the results of the coding algorithm's implementation. Finally, in Chapter 9, we conclude with remarks and we propose some future research directions.
Part I

Theoretical Background
Chapter 2 Background

2.1 Introduction to Multichannel Audio

Today, the majority of audio recordings used in the fields of information, entertainment, multimedia and so forth are two-channel presentations (the stereo sound format). Despite the dominance of stereo, another audio format has recently appeared, known as multichannel audio or the surround sound format. In the latter, several audio channels are recorded and mixed in order to recreate the spatial realism of the recording venue. Multichannel audio can immerse the listener into the acoustic scene, which is not possible to achieve with stereo music. One of the most popular surround systems is the so-called 5.1 channels, which refers to a format that uses five channels of full frequency sound and one subwoofer channel (the ".1" designation) for low frequency effects (LFE), below 120 Hz. The bandwidth needed for the low frequency channel is very small compared to that of the other five channels. The setup of the speakers that reproduce the channel signals is depicted in Fig. 2.1. Compared to the previous formats, in multichannel audio the amount of information needed to represent the audio signals is increased, which results in an imperative need to efficiently manipulate this information in order to store or transmit it. Therefore, many compression algorithms have been proposed, which represent the audio signals in a way such that, after decoding, the reproduced sound is very similar to the original, while achieving low data rates. The most dominant algorithms are implemented in two popular coding systems: the Digital Audio Compression Standard AC-3 of Dolby Laboratories and the Advanced Audio Coding (AAC) of MPEG. Other relevant algorithms are MAACKLT, proposed in [2], and Binaural Cue Coding (BCC) [14, 3, 4]. In the next section we will describe these algorithms, trying to examine their strong points. 1 Obtained from entertainment/roomlayout.html
Figure 2.1: The setup of the speakers in the 5.1-channel surround system.

2.2 Multichannel Audio Coding Systems

2.2.1 Dolby AC-3

A very popular algorithm for high-quality audio compression is Dolby AC-3, also known as Dolby Digital or Dolby SR-D, which was developed by Dolby Laboratories in the late 1980s. It has both stereophonic and five-channel versions, and its data rates range from 32 to 640 kbps. In the 5.1-channel case, the minimum data rate achieved for high quality audio is 382 kbps. The encoding procedure of the AC-3 coding system is depicted in Fig. 2.2. Overlapping blocks of 512 Pulse Code Modulation (PCM) time samples of an audio signal are multiplied by a Kaiser-Bessel Derived (KBD) analysis window and are processed by the analysis part of a perfect reconstruction Modified Discrete Cosine Transform (MDCT) filter bank 2. The resulting frequency coefficients are represented as a binary exponent and a mantissa. The exponent encoder exploits the redundancies of the MDCT coefficients that occur in the time and frequency domains and estimates the signal spectrum, known as the spectral envelope. The mantissa quantizer groups MDCT coefficients in blocks, and the maximum of each block is quantized with an exponent proportional to the number of left shifts required until overflow. In the next step of encoding, the spectral envelope is used in the bit allocation routine. This routine determines the number of bits that will be used for the encoding of each mantissa and determines the prospective bit rate. Finally, the signal spectrum along with the quantized mantissas are combined into AC-3 frames, which are transmitted to the receiver. 2 More information related to the MDCT filter bank and window functions can be found in subsections 3.8 and respectively.
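The MDCT analysis/synthesis pair at the heart of AC-3 is compact enough to sketch. The following minimal numpy example (using a sine window rather than AC-3's KBD window, purely to keep the sketch short; the block size is an arbitrary illustrative value) verifies perfect reconstruction by 50%-overlap-add of inverse-transformed blocks:

```python
import numpy as np

N = 64                                       # number of subbands; each block spans 2N samples
n = np.arange(2 * N)
w = np.sin(np.pi / (2 * N) * (n + 0.5))      # sine window (satisfies the Princen-Bradley condition)
C = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2)
           * (np.arange(N)[:, None] + 0.5))  # N x 2N MDCT basis

def mdct(block):                             # 2N samples -> N coefficients
    return C @ (w * block)

def imdct(coeffs):                           # N coefficients -> 2N windowed samples
    return w * ((2.0 / N) * (C.T @ coeffs))

rng = np.random.default_rng(0)
x = rng.standard_normal(8 * N)
y = np.zeros_like(x)
for start in range(0, len(x) - 2 * N + 1, N):    # 50% overlapping blocks
    y[start:start + 2 * N] += imdct(mdct(x[start:start + 2 * N]))
# time-domain aliasing cancels in the overlap-add, except at the two signal edges
assert np.allclose(x[N:-N], y[N:-N])
```

The MDCT of a single block is not invertible on its own; it is the overlap-add of adjacent inverse transforms that cancels the time-domain aliasing, which is why the filter bank achieves critical sampling (N coefficients per N new input samples).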
Figure 2.2: The encoding procedure of the Dolby AC-3 coding system.

At this point we mention that Dolby AC-3 is enhanced with psychoacoustic analysis, exploiting knowledge of the properties of the human auditory system (in particular, the spectral and temporal masking effects of the inner ear). The principle of audio masking is illustrated in Fig. 2.3. The signal component at 1 kHz distorts and raises the masking threshold, which defines the level that other signal components must exceed in order to be audible. If a second audio component is present at the same time and close in frequency to the first, then for the second component to be perceived by the ear, it must be at a higher level than it would otherwise need to be if present on its own; otherwise it is masked by the first signal. Essentially, the system codes only the audio signal components that the ear will hear and discards any audio information that the ear will not perceive, according to the psychoacoustical model. Specifically, in AC-3 (see Fig. 2.2), the coefficients of the spectral envelope are entered into a perceptual model, which estimates the masked threshold of each frame. This model exists only at the encoder (it is not mirrored at the decoder) and it determines the most suitable set of perceptual model parameters for the audio data. After several threshold calculations in a rate control routine, these parameters converge to a fixed form and they are transmitted to the decoder. The encoding algorithm of AC-3 has some extra functions as well. A frame header is specified, which determines the bit rate, the number of channels, the sampling frequency and further information necessary for the retrieval of the original bit stream at the decoder.
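The exponent/mantissa split described above is essentially a floating-point representation of each MDCT coefficient. Python's math.frexp shows the idea (this illustrates the representation only, not AC-3's exact bit layout; the coefficient value is an arbitrary example):

```python
import math

coeff = -0.15625                        # an example MDCT coefficient (illustrative value)
mantissa, exponent = math.frexp(coeff)  # mantissa with magnitude in [0.5, 1), plus a power of two
assert (mantissa, exponent) == (-0.625, -2)
# the pair represents the coefficient exactly
assert mantissa * 2 ** exponent == coeff
```

In AC-3 the exponents of neighboring coefficients are themselves differentially encoded (the spectral envelope), while the mantissas are quantized with a bit allocation driven by the perceptual model.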
Error detection functions are also inserted, which allow the decoder to make sure that the data are error free. The spectral resolution of the analysis MDCT filter bank can vary dynamically, adapting to the features of the audio blocks. Finally, the original channels may be coupled at high frequencies during the coding procedure (a technique also known as intensity coding) [11, 10], in order to accomplish a more efficient coding approach. In channel coupling, the properties of spatial hearing are exploited and the main idea is to transmit only one spectral envelope (instead of two or more) from independent channels together
Figure 2.3: The principle of Psychoacoustic Masking [1].

with some side information, which is used in the decoder for recovering the individual envelopes. On the other hand, the decoding portion of the AC-3 system is mainly the inverse of the corresponding encoding, as shown in Fig. 2.4. The encoded bit stream is received, synchronized, checked for transmission errors and decomposed into spectral envelopes and quantized mantissas. From bit allocation iterations, useful information for the de-quantization of the mantissas is extracted. The spectral envelopes are decoded and transformed to exponents, which together with the de-quantized mantissas are inserted into the synthesis MDCT filter bank, giving the original PCM time samples. In the case where the transmitted channels were coupled in the encoding process, they must be de-coupled. Also, if the spectral resolution at the analysis filter bank has been assigned dynamically, it must be altered in the synthesis filter bank in the same manner. For more information on Dolby AC-3 the interested reader is referred to [11, 10, 12, 13].

2.2.2 MPEG-2 Advanced Audio Coding

MPEG-2 Advanced Audio Coding (AAC) is considered the most powerful high-quality compression system for digital multichannel audio signals of the MPEG family, supporting up to 48 coded channels. It was developed in the early 1990s and it achieves a 320 kbps data rate for the 5.1-channel surround system. The applications of AAC cover a very large range, from multichannel broadcasting systems (Digital Audio Broadcasting (DAB), digital TV broadcasting, music download services, Internet music streaming, Internet radio, teleconferencing, audio for games) to storage operations. The encoder and decoder of the AAC system are shown in Figs. 2.5 and 2.6. Note that the decoder consists of the inverse encoding processes.
AAC is a scalable system, which offers three profiles [8, 9]: Main Profile (MP): In this profile, the best audio quality is achieved at various sampling
Figure 2.4: The decoding procedure of the Dolby AC-3 coding system.

rates. All the tools depicted in Fig. 2.5 may be used, except the pre-processing tool, and the memory/processing requirements are higher than those of the Low Complexity Profile (LCP). The decoder in the main profile can decode audio bit streams produced by the LC profile. Low Complexity Profile (LCP): In LCP, the pre-processing and prediction tools are excluded and the use of TNS (which will be discussed later) is restricted. The memory/processing requirements are very low, while a high level of quality is retained. Scalable Sampling Rate Profile (SSRP): In this profile, a pre-processing tool is used, consisting of a polyphase quadrature filter (PQF), gain detectors and gain modifiers. The prediction tool is excluded, while the use of TNS is restricted. Lower complexity than MP and LCP is achieved, and frequency-scalable signals can be provided. The MPEG AAC encoder, which is shown in Fig. 2.5, receives an audio signal and uses an appropriate window function together with an MDCT filter bank to decompose the input signal into subsampled spectral components. The window functions, which ensure the good frequency selectivity of the filter bank, vary dynamically with the signal's characteristics. In particular, switching between sine and KBD windows is allowed, and an extra bit for every frame (signifying the window function) is transmitted to the decoder. Along with the decomposition in the frequency domain, an estimate of the current masking threshold is computed by the perceptual model.
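The KBD window mentioned above is derived from a Kaiser window by cumulative summation. A small numpy sketch (the window length and the β parameter are arbitrary illustrative choices), which also checks the Princen-Bradley condition w²[n] + w²[n+N] = 1 that an MDCT window must satisfy for perfect reconstruction:

```python
import numpy as np

def kbd_window(M, beta=10.0):
    """Kaiser-Bessel Derived window of even length M."""
    N = M // 2
    k = np.kaiser(N + 1, beta)                    # symmetric Kaiser prototype
    half = np.sqrt(np.cumsum(k[:N]) / np.sum(k))  # cumulative energy, normalized
    return np.concatenate([half, half[::-1]])     # mirror to form the full window

w = kbd_window(512)
N = 256
# Princen-Bradley condition required for perfect reconstruction with the MDCT
assert np.allclose(w[:N] ** 2 + w[N:] ** 2, 1.0)
```

The same condition holds for the sine window; the KBD window trades a slightly wider main lobe for much stronger stopband attenuation, which is why encoders switch between the two depending on the signal.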
Rules from the psychoacoustics field are applied for the threshold estimation, and the resulting information is used in the quantizer, minimizing the audible quantization distortion. After the analysis filter bank, the Temporal Noise Shaping (TNS) process is applied. In coding, many difficulties may occur due to the fact that quantization errors from one block are spread in time (extending over a few milliseconds) and are not effectively cut off by the masking threshold. This problem is known as the pre-echo problem, and it is addressed by the TNS tool.
Figure 2.5: The encoding procedure of the MPEG-2 AAC coding system.

As in Dolby AC-3, the intensity coding (or channel coupling) technique is used in AAC, together with Middle/Side (M/S) coding [7, 10], in order to eliminate the redundancies among channels. The M/S technique (or "sum/difference coding") is based on the evidence that at high frequencies (above about 2 kHz) the human ear does not focus on the signal itself, but mainly evaluates energy envelopes. Thus, in M/S, two symmetric channels (e.g., arranged symmetrically on the listener's left and right side) are not coded directly; instead, only their sum and their difference are encoded and transmitted. The next step in the AAC encoder is prediction, which exploits the fact that the spectral components of consecutive frames (resulting from the spectral decomposition of the filter bank) are correlated, achieving a reduction of redundancy. Backward adaptive predictors are used, and the quantizer is fed with the prediction error instead of the spectral components. However, the predictor is applied only in the case where a coding gain is guaranteed. For this reason an appropriate predictor control is applied. The latter activates the prediction procedure when a coding gain can be achieved and transmits a small amount of predictor control information to the decoder. Otherwise, prediction is deactivated. The biggest data rate reduction is accomplished during quantization. First, a nonuniform quantizer is applied to the spectral values, followed by Huffman coding.
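The M/S (sum/difference) transform described above is a one-line rotation of the channel pair. A toy numpy sketch (the mixing coefficients that make the two channels strongly correlated are arbitrary illustrative values) shows both the coding gain and the exact invertibility:

```python
import numpy as np

rng = np.random.default_rng(1)
left = rng.standard_normal(1024)
right = 0.9 * left + 0.1 * rng.standard_normal(1024)  # two strongly correlated channels
mid, side = (left + right) / 2, (left - right) / 2
# the side channel carries far less energy than either input, so it codes cheaply
assert np.sum(side ** 2) < 0.1 * np.sum(left ** 2)
# the transform is exactly invertible
assert np.allclose(left, mid + side) and np.allclose(right, mid - side)
```

When the two channels are similar, almost all of the signal energy concentrates in the mid channel, so the quantizer can spend most of the bit budget there.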
The number of bits used to code the quantized spectrum in the Huffman procedure (12 Huffman codebooks are used) depends on the sampling frequency and the intended data rate. As mentioned before, the psychoacoustic model, which is similar to the one used in AC-3, contributes to the reduction of quantization distortion. Another tool, which improves the shaping of the quantization noise, is the individual amplification of groups of spectral coefficients, the so-called scale factor bands. The amplification information is stored in the scale factors and is transmitted to the decoder, while differential scale factors are Huffman coded. Finally, the quantized/coded coefficients and the control parameters (coded side information) are fed to a bit stream formatter, resulting in the final encoded bit stream of the MPEG-2 AAC encoder. At this point, we mention that the depicted (see Fig. 2.5) pre-processing part of the encoder is used only in the Scalable Sampling Rate Profile. It includes a PQF with four
bandwidth outputs; at a 48 kHz sampling rate it can give output bandwidths of 24, 18, 12 and 6 kHz. This part of the encoding process also includes a gain control, which deals with the pre-echo problem, as well as gain controllers that control the amplitude of every PQF band. More information on the MPEG-2 AAC compression system can be found in [8, 9, 7, 10].

Figure 2.6: The decoding procedure of the MPEG-2 AAC coding system.

2.2.3 MAACKLT Multichannel Audio Compression Algorithm

Modified AAC with Karhunen-Loève Transform (MAACKLT) is a compression system proposed in 2003, which is based on the AAC algorithm and introduces two novelties in multichannel coding [2]. Firstly, a novel inter-channel decorrelation scheme is proposed, achieving a better coding gain. In particular, MAACKLT results in audio signals with better quality than AAC at the typical data rate of 64 kbps per channel. Secondly, a quality-scalable encoding policy is suggested. The encoder of this algorithm is shown in Fig. 2.7 and is a modified version of the corresponding AAC encoder. In Fig. 2.7 the shaded components show the novel parts, while the rest remain the same as in AAC (see Fig. 2.5). One of the main differences compared to AAC is the use of the Karhunen-Loève Transform after the MDCT filter bank. After the input signals are transformed into the frequency domain, the KLT produces the so-called decorrelated eigen-channel signals. The latter participate in the estimation of the masking thresholds in the perceptual model. Overhead information related to the KLT is added into the final bit stream. Another difference from the AAC system is the fact that the M/S tool is disabled in MAACKLT.
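The KLT decorrelation step can be sketched in a few lines of numpy: diagonalizing the inter-channel covariance yields eigen-channels whose sample correlations vanish, and the transform is exactly invertible. The mixing matrix below is an arbitrary toy example, not a real recording setup:

```python
import numpy as np

rng = np.random.default_rng(2)
sources = rng.standard_normal((2, 4096))
A = np.array([[1.0, 0.6, 0.3],
              [0.2, 0.7, 1.0]])           # arbitrary toy mixing matrix
channels = A.T @ sources                  # three correlated "microphone" channels
R = np.cov(channels)                      # 3x3 inter-channel covariance
_, V = np.linalg.eigh(R)                  # eigenvectors of the covariance
eigen_channels = V.T @ channels           # KLT: decorrelated eigen-channels
C = np.cov(eigen_channels)
assert np.allclose(C - np.diag(np.diag(C)), 0)    # pairwise correlations vanish
assert np.allclose(V @ eigen_channels, channels)  # inverse KLT is exact
```

Since the eigenvector matrix is signal dependent, the covariance (or the eigenvectors) must be sent to the decoder as side information, which is the overhead mentioned above.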
Since KLT results in independent eigen-channels with minimal correlation between any pair of channels, the M/S tool is not required. Finally, a prioritized eigen-channel transmission policy is applied, in order to achieve the quality scalability. The decoding procedure of this algorithm is depicted in Fig. 2.8, where the encoded bitstream, the mapping information and the covariance matrices are extracted from the transmitted bit stream. If data loss occurs during the transmission, the eigen-channel concealment block is enabled to restore the lost data. The signals of the eigen-channels
Figure 2.7: The modified AAC encoder in the MAACKLT multichannel audio compression algorithm [2].

are reconstructed from their compressed versions with the use of the decoder of the AAC main profile. The mapping information is used to restore the decoded eigen-channels from a 16-bit dynamic range to their initial range. The inverse KLT uses the covariance matrices and reconstructs the multichannel blocks. Eventually, these blocks are combined into the desired multichannel signals. A more detailed description of MAACKLT can be found in [2, 18].

2.2.4 Binaural Cue Coding

In recent years, a new concept for the efficient coding of multichannel audio signals has been introduced, called Spatial Audio Coding. This approach takes advantage of the inter-channel redundancies in multichannel recordings, maintains backward compatibility with existing stereophonic decoders and achieves data rates close to those needed for stereo (or mono) transmission. The most popular development in the coding of spatial audio is Binaural Cue Coding (BCC), which is based on the parametric coding of multichannel signals using well-known parameters from psychoacoustics [14, 3, 4]. The main framework of BCC is illustrated in Fig. 2.9. In the BCC encoder, the input signals x_1(n), ..., x_C(n) are downmixed into a single sum signal, while the most prominent perceptually motivated parameters are extracted. In the psychoacoustics area, the inter-channel time difference (ICTD), level difference (ICLD) and correlation (ICC) have a well-documented history in describing the perception of the auditory spatial image. The latter parameters are evaluated between channel pairs (both in the frequency and time domains) and coded, constituting the BCC side information.
The BCC side information, together with the coded downmixed signal, is transmitted to the receiver, where the decoder attempts
3 ICTD, ICLD and ICC are estimated in the frequency domain using a cochlear filter bank or a DFT-based filter bank (to achieve an implementation of low complexity).
Figure 2.8: The proposed decoder in the MAACKLT multichannel audio compression algorithm [2].

Figure 2.9: The main framework of Binaural Cue Coding [3].

to resynthesize the spatial image. Thus, in the decoder the sum signal (see Fig. 2.10) is processed by a filter bank and spectral coefficients are estimated. The latter are combined with different delays (d_1, ..., d_C) and different scale factors (a_i, i = 1, ..., C), and they are re-correlated. The resulting signals are converted to the time domain by the synthesis part of the filter bank (IFB). The bit rates achieved by BCC range from 24 kbps to 64 kbps, and the lower the bit rate, the greater the advantage of the BCC coding scheme compared to related algorithms. Finally, BCC has been included in the standardization of the MPEG Surround system by the ISO/MPEG group [19] and in the experimental multichannel version of the popular MP3 compression format [20].

Figure 2.10: The psychoacoustic parameter synthesis scheme in the decoder of Binaural Cue Coding [4].
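Of the three BCC cues, the ICLD is the simplest to sketch: a subband power ratio in dB between a channel pair. The uniform band partition below is a toy choice for illustration, not BCC's perceptually motivated partition:

```python
import numpy as np

def icld_db(ch_ref, ch_other, n_bands=8):
    """Inter-channel level difference per subband, in dB (uniform DFT-band sketch)."""
    X1, X2 = np.fft.rfft(ch_ref), np.fft.rfft(ch_other)
    bands = np.array_split(np.arange(len(X1)), n_bands)
    return np.array([10 * np.log10(np.sum(np.abs(X1[b]) ** 2)
                                   / np.sum(np.abs(X2[b]) ** 2)) for b in bands])

rng = np.random.default_rng(3)
x = rng.standard_normal(1024)
# a channel at half amplitude sits about 6 dB below the reference in every band
levels = icld_db(x, 0.5 * x)
assert np.allclose(levels, 20 * np.log10(2))
```

In a real encoder these per-band cues are computed frame by frame and quantized coarsely, which is why the side information costs only a few kbits/sec.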
2.3 Areas of Applications

The following areas are some examples where multichannel audio is starting to appear.

Audio Broadcasting: Today, most digital broadcasting systems provide interference-free reception and the potential for enhanced user services. However, due to the small channel bandwidth and the large cost of transmission equipment and radio broadcasting licenses, the spread of these systems is limited. In the future, it is expected that the current radio programs will be supplemented by multichannel audio, pictures, texts and graphics, which will increase their information value and will motivate users to make the transition from traditional FM receivers to new digital receivers. In this scheme, multichannel audio meets its most promising application. Despite its tremendous usage in home theatre surround systems, multichannel audio is not widely used in environments such as cars. This could be a very popular multichannel implementation, since many people listen to the radio in their cars. Newer automobiles have at least five loudspeakers and a subwoofer, which, compared to home environments, are located in a better configuration (Fig. 2.11). The listener in a car also has a stable and predictable position relative to the loudspeakers, which provides a more realistic experience of the music. Recently, mostly in the US, satellite radio service providers such as Sirius and XM transmit their programs in multichannel format [21, 22].

Digital Television Broadcasting: Currently, the majority of digital TV broadcasting services use the stereo audio format. On the other hand, the few Digital Video Broadcasting (DVB) systems which provide surround sound simulcast stereo and multichannel services. The latter, though, are contained in a separate data stream, resulting in high bit rates and great operational complexity.
The greatest challenge for multichannel system applications is to establish these services in a multichannel format with a small overhead in bit rate within the same transmitted signal, while maintaining stereo compatibility.

Internet Audio: A few years ago, the usage of perceptual models in audio coding systems (such as MP3 of the MPEG family and Dolby AC-3) led to a rapid development of Internet audio
Figure 2.11: Newer automobiles have at least five loudspeakers and a subwoofer, which, compared to home environments, are located in a better configuration.

and computer-based audio. Nowadays, an experimental backward compatible extension of the MP3 compression format is presented by the Fraunhofer IIS [20], which produces high quality 5.1-channel sound using novel coding schemes such as BCC. Such multichannel compression formats could be used in various ways. Many radio stations are streaming their program to the Internet. Due to the constrained transmission bandwidth, they only use stereo or even monophonic content. Thus, the streaming of multichannel radio broadcasting material is a very challenging field. Another possible service of surround sound, related to the Internet, is music download and preview services. A number of commercial music download services are already available and work with success, using stereo sound. Additionally, many record companies and music companies (working with mail orders) would like to give consumers the opportunity to listen to a short excerpt of their products, even a low quality preview of their songs. Such services could be implemented with multichannel audio coding systems, since scalability (for the preview service) and high audio quality are provided.
Chapter 3 Filter Banks

3.1 Introduction to Filter Banks

In this chapter we study filter banks, which are structures of lowpass, bandpass, and highpass filters, commonly used in signal processing applications such as audio and image coding [23, 24, 5, 25, 26, 27, 28, 29]. Their popularity derives from the fact that they provide efficient implementations of signal decomposition and composition, emphasizing prescribed spectral aspects of the original signal. Figure 3.1 depicts a filter bank which consists of M channels. The input signal x(n) is decomposed into M signals y_0(m), y_1(m), ..., y_{M-1}(m) using the passband filters H_0(z), H_1(z), ..., H_{M-1}(z) and a downsampler of order N. By downsampling, we discard all samples except those located at multiples of N, and the sampling rate of each channel is reduced by a factor of N. The signals y_i (i = 0, ..., M-1) are called subband signals, and the first part of the depicted structure (up to these signals) is called the analysis filter bank. In order to reconstruct the original signal, we first apply an upsampler (of order L), we then pass the upsampled signals through M filters G_0(z), G_1(z), ..., G_{M-1}(z), and finally we add the resulting signals. In the upsampling operation we insert zeros between the samples to recover the original sampling rate, while the subsequent filters replace these zeros with meaningful values. The second part of the depicted structure is called the synthesis filter bank. If the order of the downsampler equals the order of the upsampler (N = L), the sampling is called critical subsampling, since this order is the maximum factor for which perfect reconstruction of x(n) can be achieved. In the rest of this section, we begin with the definition of some fundamental issues in filter bank analysis and we continue by describing some of the most frequently used filter banks.
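The whole analysis/synthesis chain just described can be exercised numerically. The sketch below uses the two-channel Haar filter pair (a simple perfect-reconstruction choice, not one discussed in this chapter's later sections) and confirms that the output equals the input delayed by one sample:

```python
import numpy as np

s = np.sqrt(2)
h0, h1 = np.array([1, 1]) / s, np.array([1, -1]) / s  # analysis lowpass / highpass
g0, g1 = np.array([1, 1]) / s, np.array([-1, 1]) / s  # matching synthesis filters

rng = np.random.default_rng(4)
x = rng.standard_normal(16)
out = np.zeros(len(x) + 1)
for h, g in ((h0, g0), (h1, g1)):
    y = np.convolve(x, h)[::2]           # analysis: filter, then downsample by 2
    v = np.zeros(2 * len(y))
    v[::2] = y                           # synthesis: upsample by 2 (insert zeros)
    out += np.convolve(v, g)[:len(out)]  # filter and sum the two branches
assert np.allclose(out[1:], x)           # perfect reconstruction up to a 1-sample delay
```

Note that the bank is critically subsampled: the two subband signals together contain as many samples as the input.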
Figure 3.1: M-channel filter bank.

Sampling Operation

Downsampling: This operation serves to reduce or eliminate redundancies in the M subband signals. In Fig. 3.2(a) the following relationship holds:

x'[n] = x[nN].   (3.1)

In the z-transform domain:

X'(z) = \frac{1}{N} \sum_{k=0}^{N-1} X(W_N^k z^{1/N}), \quad \text{where } W_N = e^{j2\pi/N},   (3.2)

and X'(z) is the z-transform of x'[n]. Equation (3.2) for N = 2 becomes

X'(z) = \frac{1}{2} \left( X(z^{1/2}) + X(-z^{1/2}) \right).

Upsampling: This operation serves to recover the original sampling rate by inserting zeros between the

Figure 3.2: (a) Downsampling and (b) Upsampling operations.
samples. In Fig. 3.2(b) the following relationship holds:

x'[n] = x[n/L] \text{ for } n = kL, k \in \mathbb{Z}, \quad x'[n] = 0 \text{ otherwise}.   (3.3)

In the z-transform domain:

X'(z) = X(z^L),   (3.4)

and for L = 2: X'(z) = X(z^2).

Matrix Representations

In filter bank analysis we usually use matrix representations, because they are more convenient and also offer a compact way to describe filter banks. Below, we give the definitions of some important matrices and we describe their relation to the analysis and synthesis filters. Most of the following relations refer to a two-channel filter bank, but they can be generalized.

Analysis and Synthesis Matrices: In order to express the two-channel filter bank in a compact matrix form, we define the matrix

H_m(z) = \begin{bmatrix} H_0(z) & H_0(-z) \\ H_1(z) & H_1(-z) \end{bmatrix},   (3.5)

which is called the analysis matrix, and the matrix

G_m(z) = \begin{bmatrix} G_0(z) & G_1(z) \\ G_0(-z) & G_1(-z) \end{bmatrix},   (3.6)

which is known as the synthesis matrix. Thus, the relation between the subband signals and the original signal is given in matrix form by

y_p(z^2) = \frac{1}{2} H_m(z) x_m(z),   (3.7)
where

x_m(z) = \begin{bmatrix} X(z) \\ X(-z) \end{bmatrix}, \qquad y_p(z) = \begin{bmatrix} Y_0(z) \\ Y_1(z) \end{bmatrix}.   (3.8)

Polyphase Decomposition

A polyphase decomposition of a signal x(n) into M components is given by [27]:

X(z) = \sum_{l=0}^{M-1} z^{-l} X_l(z^M),   (3.9)

where

X_l(z) \leftrightarrow x_l(n) = x(nM + l).   (3.10)

An example of the polyphase decomposition of a signal x(n) into three sub-signals x_0(n), x_1(n), x_2(n) is shown in Fig. 3.3. The aforementioned sub-signals are called the polyphase components of x(n), and if we interleave them, we can retrieve the original signal x(n).

Figure 3.3: Polyphase decomposition for M = 3.

Polyphase Representation of the Analysis Filter Bank: If we consider the analysis filter bank shown in Fig. 3.4(a), the signals y_0(m) and y_1(m)
Figure 3.4: Analysis filter bank: (a) direct implementation, (b) polyphase realization.

can be written as

y_0(m) = \sum_n h_0(n) x(2m - n)
       = \sum_k h_0(2k) x(2m - 2k) + \sum_k h_0(2k+1) x(2m - 2k - 1)
       = \sum_k h_{00}(k) x_0(m - k) + \sum_k h_{01}(k) x_1(m - k)   (3.11)

and

y_1(m) = \sum_n h_1(n) x(2m - n)
       = \sum_k h_{10}(k) x_0(m - k) + \sum_k h_{11}(k) x_1(m - k),   (3.12)

where

h_{00}(k) = h_0(2k), \quad h_{01}(k) = h_0(2k+1), \quad h_{10}(k) = h_1(2k), \quad h_{11}(k) = h_1(2k+1), \quad x_0(k) = x(2k), \quad x_1(k) = x(2k-1)   (3.13)

are the polyphase components [5]. From equations (3.11) and (3.12) we conclude that we can implement the analysis filter bank using the polyphase components (Fig. 3.4(b)). If we compare this polyphase implementation with the respective direct one (Fig. 3.4(a)), we observe that only the required components are computed. Thus, an efficient implementation with simple filter design is achieved, taking advantage of the filter bank's properties. Equations (3.11) and (3.12) can be rewritten in matrix form in the z-domain as

y_p(z) = H_p(z) x_p(z),   (3.14)

where
y_p(z) = \begin{bmatrix} Y_0(z) \\ Y_1(z) \end{bmatrix}, \quad x_p(z) = \begin{bmatrix} X_0(z) \\ X_1(z) \end{bmatrix}, \quad H_p(z) = \begin{bmatrix} H_{00}(z) & H_{01}(z) \\ H_{10}(z) & H_{11}(z) \end{bmatrix}.   (3.15)

The matrix H_p(z) is called the analysis polyphase matrix, and H_{ik}(z) is the k-th polyphase component of the i-th filter H_i(z), where

H_i(z) = H_{i0}(z^2) + z^{-1} H_{i1}(z^2).   (3.16)

It is obvious that the polyphase decomposition is used for the signals as well as for the filters of the filter bank.

Polyphase Representation of the Synthesis Filter Bank: Let us now consider the synthesis part of a filter bank as depicted in Fig. 3.5, where (a) shows the direct implementation and (b) shows the polyphase realization of the filter bank. Correspondingly to the analysis polyphase matrix H_p(z), we define the synthesis polyphase matrix G_p(z) by

G_p(z) = \begin{bmatrix} G_{00}(z) & G_{10}(z) \\ G_{01}(z) & G_{11}(z) \end{bmatrix},   (3.17)

where

G_i(z) = G_{i0}(z^2) + z^{-1} G_{i1}(z^2).   (3.18)

The z-transform \hat{X}(z) of the reconstructed signal \hat{x}(n), in matrix representation, is given

Figure 3.5: Synthesis filter bank: (a) direct implementation, (b) polyphase realization.
by

\hat{X}(z) = \begin{bmatrix} 1 & z^{-1} \end{bmatrix} \begin{bmatrix} G_{00}(z^2) & G_{10}(z^2) \\ G_{01}(z^2) & G_{11}(z^2) \end{bmatrix} \begin{bmatrix} Y_0(z^2) \\ Y_1(z^2) \end{bmatrix}.   (3.19)

Note that the synthesis polyphase matrix and the last vector in equation (3.19) are in z^2. This allows us to put the filters G_{ik}(z), i, k = 0, 1, before the upsampler and replace z^2 by z in the implementation, as shown in Fig. 3.5(b). We can also observe a duality between the analysis and synthesis filter banks, and that G_p(z) is a transpose of H_p(z) in terms of indices [24]. If we substitute equations (3.14), (3.15) and (3.17) in (3.19), we obtain a relation between the z-transform of the output signal \hat{x}(n) and the z-transform of the input signal x(n):

\hat{X}(z) = \begin{bmatrix} 1 & z^{-1} \end{bmatrix} G_p(z^2) H_p(z^2) x_p(z^2).   (3.20)

Perfect Reconstruction

When we say that a signal can be perfectly reconstructed by a filter bank, we mean that the output signal \hat{x}[n] is a copy of the input signal x[n] with no further distortion than a time shift and an amplitude scaling [5]. Below we give the constraints on the analysis and synthesis filter banks that ensure perfect reconstruction in direct, z-transform domain, matrix and polyphase form.

Direct Characterization: If we consider the filter bank shown in Fig. 3.1, then an analysis filter bank is Perfect Reconstruction (PR) iff

\sum_{i=0}^{M-1} \sum_n h_i(Nn + n_1) g_i(-Nn + n_2) = \delta(n_1 - n_2), \quad \forall n_1, n_2.   (3.21)

Correspondingly [23], a synthesis filter bank is PR iff

\sum_n h_i(n) g_j(lL - n) = \delta(l) \delta(i - j), \quad i, j \in \{0, 1, ..., M-1\}.   (3.22)

If the order of the downsampler equals the order of the upsampler (i.e., in the case of critical subsampling), equations (3.21) and (3.22) are equivalent.
Figure 3.6: (a) Two-channel filter bank; (b) signal spectra with aliasing.

Z-transform domain: In the z-transform domain, the general condition for perfect reconstruction is given by

X^(z) = z^{-q} X(z),    (3.23)

where X^(z) is the z-transform of the reconstructed signal and X(z) is the z-transform of the original signal. Let us now consider the two-channel filter bank of Fig. 3.6(a) and calculate the relation between the input X(z) and the output X^(z):

X^(z) = Y'_0(z) G_0(z) + Y'_1(z) G_1(z).    (3.24)

Using equation (3.4), Y'_i(z) (i = 0, 1) can be represented as

Y'_0(z) = Y_0(z^2),  Y'_1(z) = Y_1(z^2).    (3.25)

Combining equation (3.25) with (3.2), we obtain

Y_0(z^2) = (1/2) [H_0(z) X(z) + H_0(-z) X(-z)],
Y_1(z^2) = (1/2) [H_1(z) X(z) + H_1(-z) X(-z)].    (3.26)

The combination of (3.24), (3.25) and (3.26) yields the input-output relation:

X^(z) = (1/2) [H_0(z) G_0(z) + H_1(z) G_1(z)] X(z) + (1/2) [H_0(-z) G_0(z) + H_1(-z) G_1(z)] X(-z).    (3.27)

In equation (3.27), the first term characterizes the transmission of X(z) through the two-channel filter bank (Fig. 3.6(a)), and the second characterizes the aliasing component of the filter bank, as depicted in Fig. 3.6(b). As we mentioned at the beginning of this section, perfect reconstruction is achieved if x^(n) is a delayed version of x(n). Thus, in order to achieve PR, the following two conditions must hold [5]:

H_0(z) G_0(z) + H_1(z) G_1(z) = 2 z^{-q},    (3.28)

H_0(-z) G_0(z) + H_1(-z) G_1(z) = 0.    (3.29)

Condition (3.28) refers to amplitude distortion cancellation, while (3.29) refers to aliasing cancellation. If only (3.29) holds, the output signal is free of aliasing, but amplitude distortion may occur; the distortion disappears if both (3.28) and (3.29) hold. Filter banks which are critically subsampled and achieve PR are called biorthogonal filter banks.

Matrix Characterization: The input-output relation (3.24) is given in matrix form by

X^(z) = [G_0(z)  G_1(z)] y_p(z^2),    (3.30)

where y_p(z) = [Y_0(z), Y_1(z)]^T and, from (3.7) and (3.8), y_p(z^2) = (1/2) H_m(z) x_m(z). Combining equations (3.7) and (3.30), the input-output relation is rewritten as

X^(z) = (1/2) [G_0(z)  G_1(z)] H_m(z) x_m(z).    (3.31)

Finally, the PR conditions (3.28) and (3.29) can be expressed in a matrix representation as

[G_0(z)  G_1(z)] H_m(z) = [2 z^{-q}  0].    (3.32)
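Conditions (3.28)-(3.29) can be verified numerically for a concrete biorthogonal pair. A minimal sketch using the (orthogonal) Haar filters as an assumed example; coefficient arrays represent polynomials in z^{-1}, so polynomial products become convolutions:

```python
import numpy as np

s = np.sqrt(2.0)
h0, h1 = np.array([1, 1]) / s, np.array([1, -1]) / s   # analysis filters (Haar)
g0, g1 = np.array([1, 1]) / s, np.array([-1, 1]) / s   # synthesis filters

# H(-z) flips the sign of the odd powers of z^{-1}
alt = lambda h: h * (-1.0) ** np.arange(len(h))

no_distortion = np.convolve(h0, g0) + np.convolve(h1, g1)            # eq. (3.28)
no_alias      = np.convolve(alt(h0), g0) + np.convolve(alt(h1), g1)  # eq. (3.29)
print(no_distortion)   # [0. 2. 0.]  i.e. 2 z^{-1}: PR with delay q = 1
print(no_alias)        # [0. 0. 0.]  aliasing cancels exactly
```

Any other biorthogonal pair can be dropped into the same four lines; only the delay q in (3.28) changes.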
Polyphase Characterization: As we saw in the polyphase decomposition, the input-output relationship (3.20) was

X^(z) = [1  z^{-1}] G_p(z^2) H_p(z^2) x_p(z^2).

It is obvious that if G_p(z) H_p(z) = I, we have

X^(z) = [1  z^{-1}] [X_0(z^2), X_1(z^2)]^T = X_0(z^2) + z^{-1} X_1(z^2) = X(z),

since equation (3.9) for M = 2 gives

X(z) = X_0(z^2) + z^{-1} X_1(z^2).

So, a polyphase analysis-synthesis filter bank achieves PR iff

G_p(z) H_p(z) = I.    (3.33)

3.2 Quadrature Mirror Filter Banks

Quadrature mirror filter banks (QMF banks) are two-channel filter banks (see Fig. 3.6(a)), where H_0(z) is a linear-phase lowpass filter [5, 27] and H_1(z), G_0(z), G_1(z) are given by

H_1(z) = H_0(-z),  G_0(z) = H_0(z),  G_1(z) = -H_1(z).    (3.34)

If we substitute equations (3.34) into (3.29), we observe that QMF banks achieve complete aliasing cancellation [24]:

H_0(-z) H_0(z) - H_0(z) H_0(-z) = 0.

On the other hand, the amplitude distortion cancellation condition (3.28) is only approximately satisfied. By substituting equations (3.34) into condition (3.28) we get

H_0^2(z) - H_0^2(-z) = 2 z^{-q}.    (3.35)
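The two defining QMF properties — exact aliasing cancellation but only approximate satisfaction of (3.35) — can be observed numerically. A sketch with an arbitrary short linear-phase lowpass prototype (not an optimized QMF design):

```python
import numpy as np

h0 = np.array([1, 3, 3, 1]) / 8.0            # linear-phase lowpass prototype
alt = lambda h: h * (-1.0) ** np.arange(len(h))

h1 = alt(h0)          # H1(z) = H0(-z)
g0 = h0               # G0(z) = H0(z)
g1 = -h1              # G1(z) = -H1(z)

alias = np.convolve(alt(h0), g0) + np.convolve(alt(h1), g1)   # eq. (3.29)
dist  = np.convolve(h0, g0) + np.convolve(h1, g1)             # = H0^2(z) - H0^2(-z)

print(np.allclose(alias, 0))   # True: aliasing cancels exactly, by construction
print(dist)                    # several nonzero taps -> not a pure delay:
                               # residual amplitude distortion remains
```

A designed QMF prototype (e.g., a Johnston filter) would shrink the off-center taps of `dist`, but for FIR filters they can never vanish entirely, which is exactly the statement around (3.35).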
For FIR¹ filters, equation (3.35) cannot be satisfied exactly, but it can be approximated. We therefore refer to QMF banks as linear-phase filter banks which achieve almost perfect reconstruction. Equation (3.35) also explains why these filter banks are named quadrature mirror: on the unit circle, the filter H_1(z) = H_0(-z), i.e., H_1(e^{jw}) = H_0(e^{j(w+pi)}), is the mirror image of H_0(e^{jw}), and in (3.35) their squares combine to the pure delay 2 z^{-q}. Thus, the name QMF comes from the mirror image property [5]:

|H_1(e^{j(pi/2 - w)})| = |H_0(e^{j(pi/2 + w)})|,    (3.36)

with symmetry around pi/2. The importance of QMF banks stems from their efficient implementation, since the highpass filter is a modulated version of the lowpass filter: H_1(z) = H_0(-z).

3.3 Orthogonal Filter Banks

An orthogonal (or unitary, or paraunitary) filter bank is a two-channel filter bank whose synthesis filters are time-reversed versions of the analysis filters [29]:

g_i(n) = h_i(-n),  i = 0, 1.    (3.37)

This kind of filter bank is commonly used because it is computationally efficient: since equation (3.37) holds, only one of the analysis and synthesis filter sets needs to be designed, which results in a convenient implementation. Another advantage is that the energy is preserved between the input x(n) and the subband signals y_0, y_1 (see Fig. 3.6(a)):

||x||^2 = ||y_0||^2 + ||y_1||^2.    (3.38)

In what follows, we give some definitions for orthogonal filter banks.

Definition 1 (Orthogonality in the Time Domain) [24]: A filter bank is orthogonal iff the following conditions hold among the synthesis filters:

<g_i[n - 2k], g_j[n - 2l]> = delta[i - j] delta[k - l],  i, j = 0, 1.    (3.39)

Using equation (3.37), a similar relation follows for the analysis filters.

¹Systems whose impulse response has only a finite number of nonzero samples are called finite-duration impulse response (FIR) systems [30].
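Definition 1 can be checked for a concrete orthogonal pair. The sketch below uses the length-4 Daubechies lowpass filter as an assumed example, with the highpass obtained by the "alternating flip" construction derived later in this section (eq. (3.53)); the even-indexed correlations reproduce delta[i - j] delta[k - l]:

```python
import numpy as np

r3 = np.sqrt(3.0)
g0 = np.array([1 + r3, 3 + r3, 3 - r3, 1 - r3]) / (4 * np.sqrt(2.0))  # Daubechies length-4 lowpass
g1 = (-1.0) ** np.arange(4) * g0[::-1]    # highpass via the alternating flip (eq. 3.53)

def even_lags(a, b):
    r = np.correlate(a, b, mode="full")   # lags -3..3; the middle entry is lag 0
    return r[1::2]                        # keep the even lags -2, 0, 2

print(even_lags(g0, g0))   # ~[0, 1, 0]: g0 orthonormal to its double shifts
print(even_lags(g1, g1))   # ~[0, 1, 0]: same for g1
print(even_lags(g0, g1))   # ~[0, 0, 0]: cross-terms vanish at all even shifts
```

Odd-lag correlations need not vanish; only the double-shift inner products of (3.39) are constrained.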
54 32 Multichannel Audio Modelling and Coding Using a Multiscale Source/Filter Model In the special critical subsampling case, the analysis filter bank is orthogonal iff the synthesis filter bank is orthogonal. Definition 2 (Orthogonality using Matrix Characterization): The orthogonality condition of (3.39) in a matrix form is given by g 0 [n ], g 0 [n + 2(k l)] = δ[k l], (3.40) where i = j = 0 and n = n 2k. If we substitute k l = m and n n, equation (3.40) is rewritten as g 0 [n], g 0 [n + 2m] = δ[m]. (3.41) Note that r[l] = g 0 [n], g 0 [n + l] is the autocorrelation function of g 0 [n]. In that way the left side of (3.41) is the even indexed (l = 2m) autocorrelation of g 0 [n]. From way point of view, g 0 [n], g 0 [n + 2m] represents the autocorrelation of g 0 [n], downsampled by 2, such that r [m] = r[2m], where r [m], in the z-transform domain, is given from equation (3.2) by: P (z) = 1 2 [P (z1/2 ) + P ( z 1/2 )]. (3.42) In equation (3.42) and without loss of generality, we replace z by z 2 : P (z 2 ) = 1 [P (z) + P ( z)]. (3.43) 2 The z-transform of autocorrelation g 0 [n] is given by: P (z) = G 0 (z) G 0 (z 1 ) (3.44) and combining (3.43) and (3.44) P (z 2 ) = 1 2 [G 0(z)G 0 (z 1 ) + G 0 ( z)g 0 ( z 1 )]. (3.45) So, from equation (3.43), the z-transform of (3.41) is given by G 0 (z)g 0 (z 1 ) + G 0 ( z)g 0 ( z 1 ) = 2. (3.46)
55 Chapter 3. Filter Banks 33 In a similar way, we also have from (3.39): G 1 (z)g 1 (z 1 ) + G 1 ( z)g 1 ( z 1 ) = 2 (3.47) G 0 (z)g 1 (z 1 ) + G 0 ( z)g 1 ( z 1 ) = 0. (3.48) Equations (3.46)-(3.48) can be rewritten in a matrix form as G 0(z 1 ) G 0 ( z 1 ) G 0(z) G 1 (z) = 2 G 1 (z 1 ) G 1 ( z 1 ) G 0 ( z) G 1 ( z) (3.49) and if we use the definition of synthesis matrix G m (z) in (3.6) we have that G T m(z 1 )G m (z) = 2I. (3.50) Similarly, since equation (3.37) holds [24]: H m (z 1 )H T m(z) = 2I, (3.51) where H m (z) is the analysis matrix in (3.5). Note that if the filters have real coefficients the following equation holds: G(e jω ) = G (e jω ) and from (3.46), (3.47) we conclude that the squared magnitude of the filters G 1, G 0 and their corresponding modulated version sum up to the constant 2: Gi (e jω ) 2 + Gi (e j(ω+π) ) 2 = 2, i = 0, 1. (3.52) and we say that these quantities are power complementary. Since we assumed that the filters of the filter bank have real coefficients, the matrix in equation (3.51) is called paraunitary matrix and if this system is also stable the filter bank is an orthogonal filter bank. A very useful property of a two-channel orthogonal filter bank is the fact that all its filters can be obtained from a single prototype filter g o, as long as g 0 has the power complementary property (see eq. (3.52)): g 1 [n] = ( 1) n g 0 [2K 1 n] (3.53)
56 34 Multichannel Audio Modelling and Coding Using a Multiscale Source/Filter Model or in a matrix form G 1 (z) = z 2K+1 G 0 ( z 1 ) (3.54) Equations (3.53), (3.54) arise from the fact that in eq. (3.49) the quantity [G 1 (z 1 ) G 1 ( z 1 )] T has to be orthogonal to [G 0 (z) G 0 ( z)] T [24]. Definition 3 (Orthogonality in Polyphase Domain): An analysis filter bank is orthogonal iff H p (z)h T p (z 1 ) = H T p (z 1 )H p (z) = I. (3.55) A synthesis filter bank is orthogonal iff G T p (z 1 )G p (z) = G p (z)g T p (z 1 ) = I. (3.56) Note that H p (z) = G T p (z 1 ). 3.4 Linear-Phase Filter Banks In many applications it is desirable to have filter banks with linear phase filters, since they reduce the number of the free parameters in the design. It would be very convenient if we could design orthogonal linear phase filter banks, but unfortunately except the Haar filters (which are orthogonal and linear phase) there are no such filter banks. Let us now examine a way to design linear phase FIR filter banks. Consider an FIR filter H(z) of length L, with impulse range 0, 1,..., L 1. With the use of H(z) a linear phase filter bank can be described as H(z) = ±z L+1 H(z 1 ) (3.57) where ± corresponds to the symmetric and antisymmetric filter. It has been proved that in a linear phase filter bank, PR is achieved iff one of the following relations holds: H 00 (z)h 11 (z) H 01 (z)h 10 (z) = z l (3.58) H 0 (z)h 1 ( z) H 0 ( z)h 1 (z) = 2z 2l 1 (3.59) where in (3.58) the left-side expresses the determinant of the analysis polyphase matrix H p (z), which must be equal to a delay and the left-side in (3.59) represents the determinant
57 Chapter 3. Filter Banks 35 of the analysis matrix H m (z). It has been also proved that the corresponding synthesis filters are equal to an arbitrary shift k of the filters H i as: G 0 (z) = z k H 1 ( z) G 1 (z) = z k H 0 ( z) (3.60) Consequently, we can say that in a PR two-channel filter bank, consisting of linear phase filters, the analysis filters may have three forms [24]: Both H 0, H 1 have odd length, are symmetric and differ by 2k, where k is an odd number. Both filters have even length, are equal or differ by 2λ, where λ is an even number. In this case one of H 0 and H 1 is symmetric, while the other is antisymmetric. Filter H 0 is symmetric and H 1 is antisymmetric (or reversely), or both are symmetric. One has even length, while the other has odd. An example of a two-channel linear phase filter bank is the aforementioned QMF, which has the desirable property of linearity, but approximates PR with an unavoidable reconstruction error. This approximation of PR is derived from the following function [24]: O = cs + (1 c)e (3.61) where S is the stopband attenuation error of H 0 (z) and is defined as S = π H0 (e jω ) 2 dω, (3.62) ω s E is the reconstruction error given by E = π 0 2 (H0 (e jω )) 2 + (H 0 (e j(ω+π) )) 2 2 dω (3.63) and c is a constant, which shares the reconstruction cost between E and S. Our intent is to minimize O, given the coefficients of H Tree-Structured Filter Banks In many applications there is a need to construct multichannel filter banks. A simple way to design this kind of filter banks is to build appropriate cascades of two-channel
banks.

Figure 3.7: Regular tree-structured filter banks.

Figure 3.7 shows a regular tree structure [24, 5]. Another, more general structure arises when a J-step subdivision into 2 channels is applied, leading to a tree with 2^J leaves (Fig. 3.8). Each leaf corresponds to (1/2^J)-th of the original bandwidth, with downsampling factor 2^J. One more structure is shown in Fig. 3.9, where the two-channel division is iterated on the previous lowpass channel. This framework is called an octave-band filter bank, since each highpass output contains an octave of the input bandwidth. The structure is sometimes called constant-Q, because in every channel the ratio of the bandwidth to its center frequency is constant, or logarithmic, since the channels have equal bandwidth on a logarithmic scale. Further arbitrary tree-structured filter banks can be built, chosen so as to best match each application; these give rise to wavelet packets, discussed in a following chapter.

3.6 Polyphase M-Channel Filter Banks

In the previous section we discussed the main features of the polyphase representation of a two-channel filter bank. Here, we examine the polyphase representation of M-channel filter banks with subsampling factor N (see Fig. 3.10).

Analysis: With the use of the following relationships

x_p(z) = [X_0(z), X_1(z), ..., X_{N-1}(z)]^T,    (3.64)

y_p(z) = [Y_0(z), Y_1(z), ..., Y_{M-1}(z)]^T,    (3.65)
Figure 3.8: (a) Analysis and (b) synthesis octave-band tree structure with J stages.
Figure 3.9: (a) Analysis and (b) synthesis octave-band tree-structured filter bank, and (c) the corresponding frequency response.

Figure 3.10: M-channel filter bank in polyphase structure.
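The octave-band cascade of Figs. 3.8-3.9 can be sketched with the Haar two-channel bank: at each of J stages one highpass (detail) band is split off and the lowpass branch is iterated, halving the bandwidth each time (the constant-Q property). The sketch below implements the Haar stage in butterfly form, which keeps the delay bookkeeping trivial and makes the reconstruction exact:

```python
import numpy as np

def haar_analysis(x, J):
    """Octave-band cascade (Fig. 3.9): split off one detail band per stage."""
    a, bands = np.asarray(x, dtype=float), []
    for _ in range(J):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        bands.append(d)                      # one octave of the input bandwidth
    return a, bands                          # final lowpass approximation + details

def haar_synthesis(a, bands):
    for d in reversed(bands):
        up = np.empty(2 * len(a))
        up[0::2], up[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        a = up
    return a

x = np.random.default_rng(0).standard_normal(64)
a, bands = haar_analysis(x, J=3)
print([len(d) for d in bands], len(a))            # [32, 16, 8] 8
print(np.allclose(haar_synthesis(a, bands), x))   # True: perfect reconstruction
```

The band lengths 32, 16, 8, 8 mirror the logarithmic bandwidth split of Fig. 3.9(c); a signal length that is a multiple of 2^J is assumed.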
61 Chapter 3. Filter Banks 39 H p (z) = H 00 (z) H 01 (z)... H 0,N 1 (z) H 10 (z) H 11 (z)... H 1,N 1 (z) H M 1,0 (z) H M 1,1 (z)... H M 1,N 1 (z) (3.66) the analysis filter bank is described by y p (z) = H p (z)x p (z). (3.67) Synthesis: In a similar way, the synthesis filter bank is described by ˆx p (z) = G p (z)y p (z), (3.68) where G p (z) = G 00 (z) G 10 (z)... G M 1,0 (z) G 01 (z) G 11 (z)... G M 1,1 (z) G 0,N 1 (z) G 1,N 1 (z)... G N 1,M 1 (z). (3.69) Perfect Reconstruction: Combining (3.67) and (3.68) we have: ˆx p (z) = G p (z)y p (z) = G p (z)h p (z)x p (z). (3.70) Using the required condition for perfect reconstruction (3.23), the following requirement comes to question: which results in a total delay of Mq 0 + r + M 1 samples. H p (z)g p (z) = z q 0 I, (3.71) 3.7 DFT Filter Banks Discrete Fourier Transform Filter Banks are modulated filter banks, where all the analysis filters are modulations of a prototype P k (z) = P 0 (zw k ) [25]. The idealized fre-
62 40 Multichannel Audio Modelling and Coding Using a Multiscale Source/Filter Model H 0 H 1 H M 1 0 π / Μ 2π ω Figure 3.11: Idealized frequency response of the DFT filter bank. quency responses are depicted in Fig In a similar way, the synthesis filters are Q k (z) = Q 0 (zw k ), where Q 0 is the prototype synthesis filter. Note that if P 0 and Q 0 are linear phase filters, then all filters have linear phase. As we can see in Fig. 3.12, in a polyphase implementation of DFT filter bank, the input signal x(n) is decomposed into M delayed components and these components are downsampled by M. Next, the resulting signals are filtered by the polyphase components P k (z) and are transformed using DFT. The synthesis banks, on the other hand, reverse the steps mentioned above. For the case of critical subsampling, PR condition for the DFT filter banks is proved to be P k (z)q M 1 k (z) = z q 0 M. (3.72) This means that critical subsampling DFT filter banks with PR are degraded into DFT. Also, in the oversampling by a factor l = M N holds: Z, PR is achieved if the following condition l 1 P k+in (z)q M 1 k in (z) = z q 0 i=0 M (3.73) As we can see, the critically oversampled case is a sub-case of the subsampled framework, since (3.72) is included in (3.73). The (3.73) PR condition gives an increased design freedom in comparison with (3.72), which can be exploited in order to design FIR prototypes P 0, Q 0 with convenient filter properties. In general, the prototypes P 0 and Q 0 are picked to be lowpass filters. A customary criterion for these prototypes is to minimize the stopband energy and the passband ripple: min{ α( P (e jω ) 1) 2 dω + β P (e jω ) 2 dω}. (3.74) passband stopband Such kind of filter banks have very efficient implementations, because only suitable prototypes have to be found (one for analysis and one for the synthesis part) and in many applications the same prototype is used.
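The polyphase DFT analysis bank of Fig. 3.12 can be sketched directly. The implementation below follows one common convention — delay-line ordering and the DFT versus inverse-DFT choice vary between references, so treat the transform direction as an assumption; with a length-M rectangular prototype (L = 1) the bank degenerates to a block DFT, which the last line checks:

```python
import numpy as np

def dft_bank_analysis(x, M, p):
    """Critically subsampled polyphase DFT bank: split x into M polyphase
    branches, filter each branch with a polyphase component of the prototype
    p (length L*M), then take a DFT across the branches."""
    L = len(p) // M
    xp = x.reshape(-1, M)             # one block of polyphase samples per row
    pp = p.reshape(L, M)              # polyphase components of the prototype
    y = np.empty((xp.shape[0], M), dtype=complex)
    for m in range(xp.shape[0]):
        j0 = max(0, m - L + 1)        # causal branch convolution over past blocks
        acc = sum(pp[m - j] * xp[j] for j in range(j0, m + 1))
        y[m] = np.fft.fft(acc)        # subband signals y_k(m)
    return y

x = np.random.default_rng(1).standard_normal(32)
M = 8
y = dft_bank_analysis(x, M, np.ones(M) / M)   # L = 1: degenerates to a block DFT
print(np.allclose(y, np.fft.fft(x.reshape(-1, M) / M, axis=1)))   # True
```

With a longer, properly designed lowpass prototype the same loop realizes the windowed (weighted overlap-add) analysis bank; only `p` changes.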
63 Chapter 3. Filter Banks 41 x(n) z 1 z 1 N N.. P (z) 0 P (z) 1. N P (z) N 1 W H y (m) 0 y (m) 1. y (m) N 1 W Q (z) N 1 Q (z) N 2. Q (z) 0 N N. N. z 1 z 1 z 1 x(n) ^ Figure 3.12: DFT polyphase filter bank. At this point, we should mention that polyphase implementations of DFT filter banks are very fast and efficient, but they have the disadvantage that are not good in reconstruction and aliasing cancellation is impossible. For this reason, an alternative bank, which is simultaneously fast and reconstructs perfectly, is the Cosine Modulated filter bank, which will be discussed in a following section. Note that all prototypes for M-channel cosine filter banks, which ensure PR, can be also used as prototypes for oversampled 2M-channel DFT filter banks MDFT Filter Banks Modified Discrete Fourier Transform (MDFT) Filter Banks have the feature that the perfect reconstruction is achieved with the use of FIR filters. As we can see in Fig the key to achieve PR is to subsample the filter output signals by M/2, extracting the real R{ } and imaginary I{ } parts and use these parts to compose the complex subband signals y k (m), k = 0, 1,..., M 1. The real R{ } and imaginary I{ } parts are extracted in adjoining channels and the inverse operation is followed for the synthesis part, as it is depicted in Fig Cosine Modulated Filter Banks Cosine Modulated Filter Bank (CMFB) is a filter bank category,which accomplishes very fast, efficient implementations and simultaneously reconstructs perfectly [5, 31, 32, 33, 28, 34, 24, 25]. It has been widely used in speech and audio coding applications, in order to transform an audio sequence from time domain into subband domain for compression. Consider the analysis filters h k (n), k = 0,..., M 1 and the synthesis filters g k (n), k = 0,..., M 1 defined as: [ π h k (n) = 2p(n) cos M g k (n) = 2q(n) cos [ π M ( k + 1 ) ( n D ) ] + φ k, n = 0,..., L p ( k + 1 ) ( n D ) ], (3.75) φ k, n = 0,..., L q 1 2 2
64 42 Multichannel Audio Modelling and Coding Using a Multiscale Source/Filter Model Figure 3.13: Modified DFT filter bank [5]. where p(n), q(n) are FIR prototypes for analysis and synthesis filters, L p, L q are the lengths of p(n), q(n) and D is the overall delay of the whole system. In order to examine the PR conditions, we will decompose the prototypes P (z), Q(z) into their polyphase components as: P i (z) = Q i (z) = M 1 j=0 M 1 j=0 p (2jM + i) z j, i = 0,..., 2M 1 q (2jM + i) z j, i = 0,..., 2M 1. (3.76) Critically Subsampled Case In the critically subsampled case (see Fig. 3.14), the analysis polyphase matrix [5] is written as : H(z) = K 1 A 0(z 2 ), (3.77) z 1 A 1 (z 2 ) where [K 1 ] k,j = 2 cos [ ( ) ( ) ] π M k j D 2 + φk, k = 0,..., M 1, j = 0,..., 2M 1 (3.78) A 0 (z 2 ) = diag {P 0 ( z 2 ), P 1 ( z 2 ),..., P M 1 ( z 2 )} A 1 (z 2 ) = diag {P M ( z 2 ), P M+1 ( z 2 ),..., P 2M 1 ( z 2 )}. (3.79) Similarly, the synthesis polyphase matrix is given by: [ G(z) = z 1 S 1 (z 2 ) S 0 (z 2 ) ] K T 2 (3.80)
Figure 3.14: (a) Analysis and (b) synthesis cosine-modulated filter bank with critical subsampling.
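The modulated analysis filters of eq. (3.75) are easy to generate from a prototype. A sketch with M = 8 channels, a sine-window prototype of length L = 2M, D = L - 1, and phases phi_k = (-1)^k pi/4 — all of these are illustrative assumptions, not the thesis's specific design; each resulting band should peak near its nominal centre frequency (k + 1/2) pi / M:

```python
import numpy as np

M, L = 8, 16                               # channels; prototype length L = 2M (assumed)
n = np.arange(L)
p = np.sin(np.pi / L * (n + 0.5))          # sine-window lowpass prototype (assumed)
D = L - 1                                  # overall system delay
phi = (-1.0) ** np.arange(M) * np.pi / 4   # a customary phase choice

# analysis filters of eq. (3.75): cosine-modulated copies of the prototype
h = np.array([2 * p * np.cos(np.pi / M * (k + 0.5) * (n - D / 2) + phi[k])
              for k in range(M)])

# locate the spectral peak of each band on a fine frequency grid
W = np.abs(np.fft.rfft(h, 1024, axis=1))
centres = W.argmax(axis=1) / 1024 * 2 * np.pi
print(centres * M / np.pi - 0.5)           # close to 0, 1, ..., M-1
```

Only the single prototype p has to be designed; all 2M analysis and synthesis filters then follow from the modulation formulas, which is the efficiency argument made above.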
66 44 Multichannel Audio Modelling and Coding Using a Multiscale Source/Filter Model where and [K 2 ] k,j = 2 cos [ ( ) ( ) ] π M k M 1 j D 2 φk, k = 0,..., M 1, j = 0,..., 2M 1 S 0 (z 2 ) = diag {Q M 1 ( z 2 ),..., Q 1 ( z 2 ), Q 0 ( z 2 )} S 1 (z 2 ) = diag {Q 2M 1 ( z 2 ),..., Q M+1 ( z 2 ), Q M ( z 2 )}. (3.81) (3.82) The perfect reconstruction condition is defined as: G(z)H(z) = z q 0 I, (3.83) where I is an M M unitary matrix. From equation (3.83), we have the following equivalent conditions: P k (z)q 2M 1 k (z) + P M+k (z)q M 1 k (z) = z s 2M (3.84) where k = 0,..., M 2 1 and q 0 = 2s + 1. P k (z)q M+k (z) P M+k (z)q k (z) = 0 (3.85) If we use the same prototype for analysis and synthesis, that is we set Q(z) = P (z), the conditions (3.84), (3.85) become: P k (z)p 2M 1 k (z) + P M+k (z)p M 1 k (z) = z s 2M, k = 0,..., M 2 1. (3.86) Oversampled Case In the oversampled case, considering the oversampling factor µ = M N matrices are given by: Z, the polyphase H (µ) (z) = 1 µ K A 0 (z 2µ ) z 1 A 1 (z 2µ ) z (2µ 1) A 2µ 1 (z 2µ ), (3.87) G (µ) (z) = 1 µ [ z (2µ 1) S 2µ 1 (z 2µ )... z 1 S 1 (z 2µ ) S 0 (z 2µ ) ] K T. (3.88) where
67 Chapter 3. Filter Banks 45 A i (z 2µ ) = diag { P in ( z 2µ ), P in+1 ( z 2µ ),..., P in+(n 1) ( z 2µ )}, i = 0, 1, 2 (3.89) S i (z 2µ ) = diag { Q in+(n 1) ( z 2µ ),..., Q in+1 ( z 2µ ), Q in ( z 2µ )}, i = 0, 1, 2. (3.90) The perfect reconstruction condition is defined as: and equivalently we have: G (µ) (z)h (µ) (z) = z q(µ) 0 (3.91) 2µ 1 i=0 P k+in (z) Q 2M 1 k in (z) = z s 2M (3.92) P k+in (z) Q M+k+iN (z) P M+k+iN (z) Q k+in (z) = 0, (3.93) where k = 0,..., N 1 and i = 0,..., µ 1. The delay q (µ) 0 relates to s and is defined as: q (µ) 0 = 2µs + 2µ 1 and the total delay as: q = N 1 + q (µ) 0 N Modified Discrete Cosine Transform Filter Banks Modified Discrete Cosine Transform (MDCT) Filter Banks is a member of the CMFBs family, which also reconstructs perfectly and is efficiently implemented [35, 34, 18, 8]. It is based on the type-iv discrete cosine transform (DCT). It has the design property of being lapped by 50% (the last half of each block coincides with the first half of its following) and the Time Domain Aliasing Cancellation (TDAC) property that is explained next. MDCT is a linear orthogonal lapped transform, which has half as many outputs as inputs (it is a linear function of the form f : R 2n R n ), thus aliasing is introduced. Nevertheless, perfect reconstruction of the original signal is achieved by the overlap-and-add procedure applied to the coefficients of the inverse MDCT (IMDCT), causing the errors introduced by the transform to be canceled. This technique is called TDAC. These properties make the MDCT filter bank very attractive for signal compression
68 46 Multichannel Audio Modelling and Coding Using a Multiscale Source/Filter Model applications, such as MP3, Dolby AC-3 and MPEG-2 AAC coders. In MP3, the MDCT is applied on the output of a polyphase quadrature filter. Such a combination of a filter bank with MDCT is called hybrid filter bank or subband MDCT. Finally, AAC and AC-3 use MDCT filter bank. The forward MDCT filter bank is defined as: X k = 1 N N 1 j=0 and the corresponding inverse MDCT (IMDCT): ˆx j = N 1 k=0 ( 2π x j cos N (j + n 0) (k + 1 ) 2 ) for k = 0, 1,..., N 1 (3.94) ( 2π X k cos N (j + n 0) (k + 1 ) 2 ) for j = 0, 1,..., N 1 (3.95) where N is the transform block length, j is the index of the sample and n 0 = N MDCT and Window Functions In most of the signal compression implementations, the input samples are modulated by a window function and then transformed by MDCT, because some transform properties such as the frequency selectivity 2 of the filter bank and avoidance of discontinuities at block boundaries are better implemented. One of the most commonly used windows is the sine window, which is defined as: [ ( π w k = sin k + 1 )] 2n 2 k = 0,..., 2n 1 (3.96) The invertibility and PR of the transform is retained (after the application of a symmetric window function w k = w 2n 1 k ), as long as w k satisfies the Princen-Bradley condition: w 2 k + w 2 k+n = 1, (3.97) which is obviously satisfied by (3.96). An alternative window function is the Kaiser-Bessel Derived (KBD) window, which also guarantees PR (it satisfies eq. 3.96) and it is has been designed for use with MDCT filter banks. The definition of KBD is based on the Kaiser window, which is given by [30, 2 with selectivity in the frequency domain we mean that the filter bank demonstrates good separation of contiguous spectral components.
Figure 3.15: Kaiser window for window length n = 10 and alpha = 0.5, 1, 2, 4, 8, 16.

Figure 3.16: KBD window function for window length n = 100 and alpha = 0.5, 2, 8, 32.

p. 474]:

w_k = I_0( pi alpha sqrt(1 - (2k/n - 1)^2) ) / I_0(pi alpha)  if 0 <= k <= n,  and  w_k = 0  otherwise,    (3.98)

where I_0 is the zeroth-order modified Bessel function of the first kind, alpha in R is a constant determining the window's shape, and n in N is the window's length. Several plots of the Kaiser window are depicted in Fig. 3.15. As we can see, increasing alpha while holding n constant narrows the window, and for large values of alpha the Kaiser window tends to the shape of a Gaussian curve. Finally, for alpha = 0 the window becomes rectangular. The KBD window is given by:

KBD_k = sqrt( Sum_{i=0}^{k} w_i / Sum_{i=0}^{n} w_i )        if 0 <= k < n,
KBD_k = sqrt( Sum_{i=0}^{2n-1-k} w_i / Sum_{i=0}^{n} w_i )   if n <= k < 2n,
KBD_k = 0                                                    otherwise,    (3.99)
where w_i is the Kaiser window of eq. (3.98). Many compression applications use such window functions. In particular, MPEG-2 AAC allows dynamic switching between the sine and KBD windows in order to improve the perceptual audio encoder [8]. KBD is found to be more efficient when the spectral components are more than 220 Hz apart, while the sine window performs better when the components are spaced more closely than 140 Hz. The MP3 and Dolby AC-3 compression systems, on the other hand, use the sine window.
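Both the window construction and the TDAC mechanism of Section 3.9 can be exercised in a few lines. The sketch below builds a KBD window, checks the Princen-Bradley condition (3.97), and then runs a windowed MDCT/IMDCT overlap-add round trip; note that it uses the common "2N inputs, N coefficients" indexing convention, so the thesis's block length corresponds to 2N here, and the alpha value is an arbitrary choice:

```python
import numpy as np

def kbd(n, alpha):
    # Kaiser-Bessel Derived window of length 2n (eq. 3.99); np.i0 is the
    # zeroth-order modified Bessel function I_0 of eq. (3.98)
    k = np.arange(n + 1)
    w = np.i0(np.pi * alpha * np.sqrt(1 - (2 * k / n - 1) ** 2)) / np.i0(np.pi * alpha)
    half = np.sqrt(np.cumsum(w)[:n] / np.sum(w))
    return np.concatenate([half, half[::-1]])      # symmetric

def mdct(frame):                          # 2N windowed samples -> N coefficients
    N = len(frame) // 2
    n, k = np.arange(2 * N), np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)) @ frame

def imdct(X):                             # N coefficients -> 2N aliased samples
    N = len(X)
    n, k = np.arange(2 * N), np.arange(N)[:, None]
    return (2.0 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5)).T @ X

N = 32
w = kbd(N, alpha=4.0)
print(np.allclose(w[:N] ** 2 + w[N:] ** 2, 1.0))   # True: eq. (3.97) holds

x = np.random.default_rng(0).standard_normal(8 * N)
y = np.zeros_like(x)
for t in range(0, len(x) - 2 * N + 1, N):          # 50%-lapped frames
    y[t:t + 2 * N] += w * imdct(mdct(w * x[t:t + 2 * N]))   # window twice + overlap-add
print(np.allclose(y[N:-N], x[N:-N]))               # True: the aliasing cancels (TDAC)
```

Swapping in the sine window of eq. (3.96) leaves the result unchanged, since any symmetric window satisfying (3.97) yields perfect reconstruction; only the first and last half-frames, which lack an overlapping partner, are not reconstructed.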
Chapter 4

Wavelets

4.1 Introduction to Wavelets

The theory of wavelets provides a useful framework for applications in many fields, such as signal processing, speech and image compression, pattern recognition, and applied mathematics. It covers a very large scientific area, especially because it treats both the continuous-time and the discrete-time cases. In particular, the Wavelet Transform (WT) provides an alternative to the classical Short-Time Fourier Transform (STFT, or Gabor transform), since it analyzes non-stationary signals — the class to which most real-world signals belong — more efficiently. In contrast to the STFT and other transforms, the WT uses short windows at high frequencies and long windows at low frequencies, as we can see in Fig. 4.1. This is the reason why the WT is called constant-Q. In many applications, the WT is used to decompose the signal onto a set of basis functions, which are called wavelets. They arise from a single prototype wavelet by dilations and contractions (scaling), together with translations, and this is one of the reasons that wavelets have a very efficient and simple implementation. As we may have noticed in Fig. 4.1, wavelet analysis does not use a time-frequency region, but rather a time-scale region. This means that a signal is mapped into a time-scale plane (the equivalent of the time-frequency plane used in the STFT) and that we examine and analyze the signal at various scales and resolutions. Also, wavelet analysis can perform local analysis (analysis of a particular portion of a large signal), which gives the advantage of satisfactory localization of discontinuities that other analysis techniques miss. Another advantage is that wavelets afford a different view of the data, resulting in the ability to compress or de-noise a signal without appreciable degradation.
Figure 4.1: Wavelet analysis contrasted with other analysis techniques.

4.2 Continuous Wavelet Transform

The Continuous Wavelet Transform (CWT) W_x(b, alpha) of a continuous-time signal x(t) is defined as

W_x(b, alpha) = (1/sqrt(alpha)) Int x(t) psi*((t - b)/alpha) dt,    (4.1)

where * denotes complex conjugation, psi(t) is called the wavelet, the factor 1/sqrt(alpha) normalizes the energy of the wavelets, and alpha, b in R are the scaling and translation parameters [5, 25, 24, 6, 26, 36]. The parameter alpha is given by alpha = omega_0 / omega, where omega_0 is the central frequency of the Fourier transform of psi(t), and b is a parameter which specifies the location in the time domain. The parameter alpha is also responsible for the time and frequency resolution of the CWT: when alpha is small (corresponding to high analysis frequencies), good time localization but poor frequency resolution is achieved; for large values of alpha (low analysis frequencies), on the other hand, the inverse trade-off holds (good resolution in the frequency domain, but poor in time). From the above definition, we see that the wavelet transform is computed as the inner product of x(t) with shifted and scaled versions of a single function psi(t) in L^2(R)², which is also called the mother wavelet or prototype and can be considered a bandpass impulse response; wavelet analysis can thus be viewed as a bandpass analysis. Variation of the scaling parameter alpha corresponds to a shift of the central frequency and the bandwidth of the bandpass filter, while variation of b simply corresponds to a shift in time.

²L^2(R) is the space of input signals of the wavelet transform.
Admissibility Condition: For the wavelet transform, the condition that must be met in order to ensure perfect reconstruction is

C_psi = Int |Psi(omega)|^2 / |omega| domega < infinity,    (4.2)

where Psi(omega) denotes the Fourier transform of the wavelet psi(t) (psi(t) <-> Psi(omega)). The above condition is referred to as the admissibility condition. Apparently, in order to satisfy (4.2), the wavelet must satisfy

Psi(0) = Int psi(t) dt = 0.    (4.3)

Furthermore, |Psi(omega)| must decay sufficiently fast both as omega -> 0 and as omega -> infinity, which can be interpreted as psi(t) having to be a bandpass filter. A bandpass impulse response, however, has the form of a small wave, which is the reason this transform is called the wavelet transform.

Scalogram: Similarly to spectrograms in the STFT, in wavelet analysis we define the wavelet spectrogram, or scalogram, as the squared modulus of the wavelet transform:

|W_x(b, alpha)|^2 = | (1/sqrt(alpha)) Int x(t) psi*((t - b)/alpha) dt |^2.    (4.4)

Scalograms represent the energy distribution of the signal in the time-scale plane, and are therefore expressed in power per frequency unit. The main difference from spectrograms is the fact that scalograms use different resolutions at different scales in order to picture the distribution of the signal energy, whereas in spectrograms the resolution remains constant. This fact is evident in Fig. 4.2. The only disadvantage of this kind of representation is the absence of phase information, since in many cases the phase representation shows local bursts in a signal more distinctly.

Figure 4.2: A Dirac pulse at t = t_0 and its region of influence (a) for the CWT and (b) for the STFT [6, p. 20].
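A direct discrete sketch of (4.1)/(4.4), using the Morlet wavelet introduced in Section 4.3 (the scale grid, signal, and omega_0 = 5 are illustrative choices): the scalogram of a pure 5 Hz tone concentrates its energy at the matching scale, here a ~ omega_0 * f_s / (2 pi * 5 Hz) ~ 16.

```python
import numpy as np

def morlet(t, w0=5.0):
    # Morlet prototype (Section 4.3): a Gaussian-windowed complex exponential
    return np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)

def scalogram(x, scales):
    """|W_x(b, a)|^2 of eq. (4.4) on a discrete grid (b in samples)."""
    out = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        n = np.arange(-4 * a, 4 * a + 1)        # ~4 standard deviations of support
        psi = morlet(n / a) / np.sqrt(a)        # scaled, energy-normalized wavelet
        W = np.convolve(x, np.conj(psi)[::-1], mode="same")   # correlate with psi_{b,a}
        out[i] = np.abs(W) ** 2
    return out

fs = 100.0
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 5.0 * t)                 # a 5 Hz tone
S = scalogram(x, scales=[4, 8, 16, 32])
print(S.mean(axis=1).argmax())                  # 2  (the a = 16 row dominates)
```

Replacing the tone with a short click instead lights up all scales near the click's location, the "region of influence" behaviour of Fig. 4.2(a).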
Inverse Continuous Wavelet Transform: As with every transform, invertibility is desired. It can be shown that, under the admissibility condition (see eq. (4.2)), we can reconstruct the original signal x(t) from its wavelet transform by

x(t) = (1/C_psi) Int_0^infinity Int_{-infinity}^{infinity} W_x(b, alpha) psi_{b,alpha}(t) (db dalpha)/alpha^2,    (4.5)

where

psi_{b,alpha}(t) = (1/sqrt(alpha)) psi((t - b)/alpha).    (4.6)

Properties of the Continuous Wavelet Transform: Some of the basic properties of the continuous wavelet transform are the following:

Linearity: The linearity of the CWT follows from the linearity of the inner product and can be expressed as follows:

W_{x+y}(b, alpha) = W_x(b, alpha) + W_y(b, alpha).

Shift Property: If x(t) has the CWT W_x(b, alpha), then it can be shown that x'(t) = x(t - b') has the CWT

W_{x'}(b, alpha) = W_x(b - b', alpha).

Scaling Property: If x(t) has the CWT W_x(b, alpha), then x'(t) = (1/sqrt(s)) x(t/s) has the CWT

W_{x'}(b, alpha) = W_x(b/s, alpha/s).

Energy Conservation: The CWT W_x(b, alpha) of x(t) has an energy conservation property (similar to Parseval's theorem for the Fourier transform), which is given by

Int |x(t)|^2 dt = (1/C_psi) IntInt |W_x(b, alpha)|^2 (db dalpha)/alpha^2.
The proofs of the above equations and some more properties can be found in [24, 37].

Multiresolution Analysis: As mentioned in the introduction, we can view the CWT as a decomposition of the signal over a set of basis functions. The input signal is decomposed into a coarse approximation plus added details. The set of tools used in order to derive wavelet bases is based on the multiresolution approach. In particular, multiresolution analysis is defined as follows [24, 37]:

A multiresolution analysis consists of a sequence (V_m)_{m \in \mathbb{Z}} of embedded closed subspaces of L^2(\mathbb{R}),

\cdots \subset V_{-2} \subset V_{-1} \subset V_0 \subset V_1 \subset V_2 \subset \cdots

such that:

(i) Upward Completeness: \overline{\bigcup_{m \in \mathbb{Z}} V_m} = L^2(\mathbb{R})

(ii) Downward Completeness: \bigcap_{m \in \mathbb{Z}} V_m = \{0\}

(iii) Scale Invariance: f(t) \in V_m \Leftrightarrow f(2^{-m} t) \in V_0

(iv) Shift Invariance: f(t) \in V_0 \Rightarrow f(t - n) \in V_0, \forall n \in \mathbb{Z}

(v) Existence of a Basis: There exists a function \varphi(t) \in V_0 such that \{\varphi(t - n)\}_{n \in \mathbb{Z}} is an orthogonal basis for V_0. This function is called the scaling function.

Using properties (iii)-(v), we obtain that \{2^{m/2} \varphi(2^m t - n)\}_{n \in \mathbb{Z}} is a basis for V_m. For more details on the multiresolution approach the interested reader is referred to [24, 37, 38].

4.3 Morlet Wavelet

The wavelet most frequently used in continuous-time wavelet analysis is the Morlet wavelet, which is depicted in Fig. 4.3. It uses a windowed complex exponential as prototype, characterized as a modulated Gaussian function, and is defined as follows:

\psi(t) = \frac{1}{\sqrt{2\pi}}\, e^{j\omega_0 t}\, e^{-t^2/2}    (4.7)
Figure 4.3: Morlet wavelet in the time domain.

where the factor \frac{1}{\sqrt{2\pi}} in (4.7) ensures that \|\psi(t)\| = 1. The central frequency \omega_0 is usually selected so that the second maximum of the real part of \psi(t) (for positive t) is half of the first value \psi(0). Note that this wavelet does not satisfy the admissibility condition, since \Psi(0) \neq 0. In fact, \Psi(0) is nonzero but negligible, which does not pose a practical problem. It can be corrected in order to have \Psi(0) = 0, but the improvement is very small.

4.4 Discrete Wavelet Transform

The continuous wavelet transform of definition (4.1) cannot be used directly in practical applications; we would like to be able to reconstruct the original signal from a set of basis functions defined on a discrete grid. To do so, we select the following discretization of the scale and shift parameters [24]:

a = a_0^m, \quad b = n b_0 a_0^m, \quad m, n \in \mathbb{Z}, \quad a_0 > 1, \quad b_0 > 0    (4.8)

where b_0 is the sampling interval. The discretized wavelet family is given by

\psi_{mn}(t) = a_0^{-m/2}\, \psi(a_0^{-m} t - n b_0)    (4.9)

and the wavelet analysis can be written as

W_x(b, a) = W_x(a_0^m n b_0, a_0^m) = \langle x, \psi_{mn} \rangle.    (4.10)

The scaling function \varphi(t) is defined as

\varphi(t) = \sum_n h(n)\, \sqrt{2}\, \varphi(2t - n), \quad n \in \mathbb{Z}    (4.11)

where the h(n) are called scaling coefficients. It is well known that the main body of computations in the DWT and in an octave-band filter
bank are essentially identical. The only difference is that the filters used in wavelet analysis are regular (filters that are both orthogonal and converge to continuous functions). Thus, in Fig. 4.4 we see an octave-band analysis and synthesis filter bank, which can be considered as the wavelet analysis and synthesis for a_0 = 2, where \psi(t) and \tilde{\psi}(t) are respectively the analysis and synthesis discretized wavelet families, and x(t), \hat{x}(t) are the original and reconstructed signals.

Stability Condition: At this point we have to remark that, for a given wavelet \psi(t), the possibility of perfect reconstruction depends on the sampling interval b_0. There is a trade-off between the oversampling factor and the freedom in selecting the basis functions. If we oversample (by selecting b_0 very small), high redundancy shows up and reconstruction is feasible with mild restrictions on the basis functions. On the other hand, if the redundancy is small (b_0 is large), for example close to critical sampling, then the constraints on the basis functions tighten in order to ensure the existence of a dual set \tilde{\psi}(t). The set \tilde{\psi}(t) exists if the stability condition holds:

A \|x\|^2 \leq \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} |\langle x, \psi_{mn} \rangle|^2 \leq B \|x\|^2    (4.12)

where 0 < A \leq B < \infty are called the frame bounds. If A = B, we speak of a tight frame and perfect reconstruction is possible, with \tilde{\psi}_{mn}(t) = \psi_{mn}(t). If the above condition (4.12) holds, the admissibility condition (4.2) is also valid.

Biorthogonal Condition: If b_0 is selected as large as possible, the samples contain no redundancy at all (critical sampling), the stability condition holds, the functions \psi_{mn}(t) form a basis, and the following relation, called the biorthogonal condition, holds:

\langle \psi_{mn}, \tilde{\psi}_{lk} \rangle = \delta_{ml}\, \delta_{nk}, \quad m, n, l, k \in \mathbb{Z}    (4.13)

Figure 4.4: Octave-band analysis and synthesis filter bank [5, p. 222].
The wavelets which meet the above condition are named biorthogonal wavelets.

Orthonormal Condition: A particular case of the biorthogonal condition is the following orthonormal condition:

\langle \psi_{mn}, \psi_{lk} \rangle = \delta_{ml}\, \delta_{nk}, \quad m, n, l, k \in \mathbb{Z}    (4.14)

Again, the wavelets which meet the above condition have a particular name, orthonormal wavelets, and a very interesting feature is the fact that the analysis and synthesis functions are equal. This is the reason why these wavelets are characterized as self-reciprocal. Also, bases of this kind (orthonormal) have equal frame bounds (a tight frame), and in that case equation (4.14) is a special case of Parseval's theorem.

4.5 Haar Wavelets

The simplest example of an orthogonal wavelet is the Haar wavelet. As shown in Fig. 4.5, the Haar wavelet is defined by:

\psi(t) = \varphi(2t) - \varphi(2t - 1) = \begin{cases} 1, & 0 \leq t < \frac{1}{2} \\ -1, & \frac{1}{2} \leq t < 1 \\ 0, & \text{otherwise} \end{cases}    (4.15)

The discrete basis functions in the Haar case are given by:

\varphi_{2k}(n) = \begin{cases} \frac{1}{\sqrt{2}}, & n = 2k, 2k+1 \\ 0, & \text{otherwise} \end{cases}    (4.16)

\varphi_{2k+1}(n) = \begin{cases} \frac{1}{\sqrt{2}}, & n = 2k \\ -\frac{1}{\sqrt{2}}, & n = 2k+1 \\ 0, & \text{otherwise} \end{cases}    (4.17)

So, the scaling function which corresponds to the Haar wavelet in (4.15) is given by:

\varphi(t) = \varphi(2t) + \varphi(2t - 1) = \begin{cases} 1, & 0 \leq t < 1 \\ 0, & \text{otherwise} \end{cases}    (4.18)

Note that the Haar coefficients in the two-scale relation (4.11) for this scaling function are h(0) = h(1) = 1/\sqrt{2}. It is very important to note that the Haar wavelet has perfect time resolution but poor frequency resolution.
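One analysis/synthesis level of the Haar filter bank (using the coefficients h(0) = h(1) = 1/√2 above) can be sketched as follows; the test signal is illustrative, and periodic even-length input is assumed.

```python
import numpy as np

# One level of the octave-band filter bank of Fig. 4.4 with the Haar filters:
# lowpass [1/sqrt(2), 1/sqrt(2)] and highpass [1/sqrt(2), -1/sqrt(2)].
# Minimal sketch assuming an even-length input.

def haar_analysis(x):
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # lowpass + downsample by 2
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # highpass + downsample by 2
    return approx, detail

def haar_synthesis(approx, detail):
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)    # upsample + synthesis filters
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])  # illustrative signal
a, d = haar_analysis(x)
xr = haar_synthesis(a, d)
```

Reconstruction is exact, and the coefficient energy equals the signal energy, reflecting the orthonormality (tight frame, A = B = 1) of the Haar basis.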
Figure 4.5: Haar wavelet and scaling function [5, p. 230].

4.6 Daubechies Wavelets

Ingrid Daubechies, a renowned researcher in the wavelet field, designed the Daubechies wavelet family. These wavelets have the special and very useful feature of orthonormality (see eq. (4.14)). Recalling eq. (3.52) from subsection 3.3, one expression of the orthogonality and perfect reconstruction condition is the following:

|G_0(e^{j\omega})|^2 + |G_0(e^{j(\omega+\pi)})|^2 = 2

If we substitute G_0(e^{j\omega}) = \sqrt{2}\, L_0(e^{j\omega}) for normalization reasons, we obtain:

|L_0(e^{j\omega})|^2 + |L_0(e^{j(\omega+\pi)})|^2 = 1    (4.19)

Also, for regularity (continuity) reasons, we assume the following relation:

L_0(e^{j\omega}) = \left[\frac{1}{2}(1 + e^{j\omega})\right]^N R(e^{j\omega}) \quad \Rightarrow \quad |L_0(e^{j\omega})|^2 = \left[\cos^2\frac{\omega}{2}\right]^N |R(e^{j\omega})|^2    (4.20)

where N is called the order of the Daubechies family. If we use the notation y = \cos^2\frac{\omega}{2} and P(1-y) = |R(e^{j\omega})|^2, substituting (4.20) into (4.19) yields:

y^N P(1-y) + (1-y)^N P(y) = 1    (4.21)

Daubechies proved that the polynomial P(y) which satisfies equation (4.21) has the form:

P(y) = \sum_{j=0}^{N-1} \binom{N-1+j}{j}\, y^j + y^N Q(y)    (4.22)
where Q(y) is an antisymmetric polynomial. The family of Daubechies wavelets is obtained for Q \equiv 0, and L_0(e^{j\omega}) is then called the Daubechies filter. Note that for N = 1 we obtain the Haar wavelet. In Fig. 4.6 we see nine members of the Daubechies family, for orders 2, ..., 10 respectively (note that db1 \equiv Haar).

4.7 Biorthogonal Wavelets

The wavelets discussed so far were mostly orthogonal. We now examine the case of biorthogonal wavelets, which are frequently used in practical applications because of the convenience of the symmetry property, as shown in Fig. 4.7. The biorthogonal condition, as discussed in Section 4.4, is given by:

\langle \psi_{mn}, \tilde{\psi}_{lk} \rangle = \delta_{ml}\, \delta_{nk}, \quad m, n, l, k \in \mathbb{Z}    (4.23)

where \psi_{mn}, \tilde{\psi}_{mn} are the analysis and synthesis biorthogonal families, and m, n are used for dilation and shifting, respectively. The wavelets that satisfy eq. (4.23) are called biorthogonal wavelets.

Biorthogonal wavelets are constructed from biorthogonal filter banks, which can be seen as a more relaxed case of the orthogonal ones in terms of the imposed conditions. So, we consider such a filter bank, with analysis filters H_i(z) and synthesis filters G_i(z) (i = 0, 1). To ensure perfect reconstruction, we choose these filters in such a way that they satisfy the relation

H_0(z) G_0(z) + H_0(-z) G_0(-z) = 2    (4.24)

and

Figure 4.6: Some members of the Daubechies family (db2, ..., db10).
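Returning briefly to condition (4.19) of Section 4.6, it can be checked numerically for a concrete filter. The sketch below uses the standard db2 (N = 2) coefficients as quoted in the wavelet literature (an assumption, not taken from the thesis), normalized so that L₀ = 1 at ω = 0.

```python
import numpy as np

# Numerical check of the Daubechies condition (4.19):
# |L0(e^{jw})|^2 + |L0(e^{j(w+pi)})|^2 = 1, for the db2 (N = 2) filter.
# Coefficients are the standard db2 values, normalized so that L0(1) = 1.

s3 = np.sqrt(3.0)
l0 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / 8.0

def L0(w):
    n = np.arange(len(l0))
    return np.sum(l0 * np.exp(-1j * w * n))

w = np.linspace(0.0, np.pi, 100)
cond = np.array([abs(L0(wk)) ** 2 + abs(L0(wk + np.pi)) ** 2 for wk in w])
```

The condition holds to machine precision for all ω, and L₀ has the required lowpass behavior: L₀(e^{j0}) = 1 and L₀(e^{jπ}) = 0 (the N-th order zero at ω = π of eq. (4.20)).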
Figure 4.7: Scaling function, wavelet and their duals for the biorthogonal wavelets used in the FBI fingerprint compression standard [5, p. 251].

H_1(z) = z^{2k+1}\, G_0(-z), \qquad G_1(z) = z^{-(2k+1)}\, H_0(-z)    (4.25)

For normalization reasons, we define the filters

L_i(\omega) = \frac{H_i(e^{j\omega})}{\sqrt{2}}, \qquad \tilde{L}_i(\omega) = \frac{G_i(e^{j\omega})}{\sqrt{2}}, \qquad i = 0, 1    (4.26)

The associated scaling functions for analysis and synthesis are defined as

\Phi(\omega) = \prod_{k=1}^{\infty} L_0\!\left(\frac{\omega}{2^k}\right), \qquad \tilde{\Phi}(\omega) = \prod_{k=1}^{\infty} \tilde{L}_0\!\left(\frac{\omega}{2^k}\right)    (4.27)

and the wavelets as

\Psi(\omega) = L_1\!\left(\frac{\omega}{2}\right) \prod_{k=2}^{\infty} L_0\!\left(\frac{\omega}{2^k}\right), \qquad \tilde{\Psi}(\omega) = \tilde{L}_1\!\left(\frac{\omega}{2}\right) \prod_{k=2}^{\infty} \tilde{L}_0\!\left(\frac{\omega}{2^k}\right)    (4.28)

4.8 Comparison of Orthogonal and Biorthogonal Wavelets

Biorthogonal wavelets can be thought of as a generalization of the orthogonal ones, since the former are more flexible, resulting in easier and more efficient implementations. The main differences between these families are the following:

The main reason for preferring biorthogonal over orthogonal wavelets is the fact that the first
category has symmetric scaling functions and wavelets.

The basic disadvantage of biorthogonal wavelets is that Parseval's theorem no longer holds; for this reason, the focus of many researchers has been to make biorthogonal systems as close as possible to orthogonal ones.

In orthogonal systems, the scaling functions and the wavelets must have the same even length, whereas in biorthogonal systems this limitation is relaxed.

If we interchange the roles of the primary and the dual wavelet in the biorthogonal analysis-synthesis framework, the system will still be valid. So, the arrangement depends on the needs of the application.

If an orthogonal transform is applied to white Gaussian noise, the noise remains white; this does not hold for nonorthogonal transforms, such as biorthogonal ones. So, special treatment may be needed for noise removal or suppression in biorthogonal systems.

4.9 Other Wavelet Families

Symlet Wavelets: The symlets (see Fig. 4.8) are almost symmetric wavelets proposed by Ingrid Daubechies as a modification of the popular Daubechies family, and they have similar properties.

Figure 4.8: Symlet wavelets (sym2, ..., sym10).
Figure 4.9: Mexican hat wavelet.

Mexican Hat Wavelet: The Mexican hat wavelet (see Fig. 4.9) has no scaling function and has a shape similar to the second derivative of the Gaussian probability density function.

Meyer Wavelet: The main idea of the Meyer wavelet (see Fig. 4.10) is to find a smoother version of the ideal (sinc) case; because it cannot be efficiently implemented, it is considered mainly of theoretical interest. Another feature of the Meyer wavelet is that both the scaling function and the wavelet are defined in the frequency domain.

Figure 4.10: Meyer wavelet.
Chapter 5

Autocorrelation Analysis

5.1 Autoregressive Models

A very popular linear stochastic model is the autoregressive (AR) model of order M. This model describes a time series u(n), u(n-1), ..., u(n-M) which satisfies the equation

u(n) + a_1^* u(n-1) + a_2^* u(n-2) + \cdots + a_M^* u(n-M) = v(n)    (5.1)

where v(n) is a white-noise process, the parameters a_1, a_2, ..., a_M are called the AR coefficients, and the asterisk denotes complex conjugation.

First of all, let us examine which models are called regressive. When we have a linear model

x = \sum_{i=1}^{N} b_i y_i + e    (5.2)

where the variable x is a linear combination of the independent variables y_i plus an error e, the model is called a regression model. Now, if we rewrite (5.1) as

u(n) = -a_1^* u(n-1) - a_2^* u(n-2) - \cdots - a_M^* u(n-M) + v(n)    (5.3)

we notice that it is a regression model of the form (5.2), where u is regressed on past realizations of itself. Therefore, these models are called autoregressive.

Given the parameters a_1, a_2, ..., a_M of an AR model¹ and the filter of Fig. 5.1, a white-noise process can be produced. This particular filter is called the AR analyzer and it is used for the description of our model.

¹ The estimation of these parameters is explained later in this chapter.
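As a concrete illustration of eq. (5.3), the sketch below synthesizes a real AR(2) process by driving the recursion with white noise. The coefficient values are illustrative (not from the thesis), chosen so that the model poles lie inside the unit circle.

```python
import numpy as np

# Synthesizing an AR(M) process from eq. (5.3):
# u(n) = -a_1 u(n-1) - ... - a_M u(n-M) + v(n), for a real AR(2) example.

rng = np.random.default_rng(0)
a = np.array([-1.2, 0.5])          # illustrative a_1, a_2 (stable model)
N = 5000
v = rng.standard_normal(N)         # white-noise excitation
u = np.zeros(N)
for n in range(N):
    u[n] = v[n]
    for i, ai in enumerate(a, start=1):
        if n - i >= 0:
            u[n] -= ai * u[n - i]  # the regression on past samples
```

For this AR(2) model the theoretical output variance works out to about 3.7 times the excitation variance, so the filter shapes (colors) the white input noticeably.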
Figure 5.1: AR analyzer with delayed inputs u(n-1), u(n-2), ..., u(n-M) and parameters a_1, a_2, ..., a_M.

5.1.1 Correlation Matrix

The correlation matrix R of a stationary discrete-time stochastic process, with observation vector

u(n) = [u(n), u(n-1), \ldots, u(n-M+1)]^T    (5.4)

is the expectation of the outer product of u(n) and its Hermitian² transpose [39]:

R = E[u(n)\, u^H(n)].    (5.5)

The autocorrelation function of a wide-sense stationary process is defined as

r(i) = \frac{1}{M} \sum_{n=0}^{M-1} u(n)\, u^*(n-i), \quad 0 \leq i \leq M-1.    (5.6)

With the use of definition (5.6), we can rewrite the correlation matrix definition (5.5) as

R = \begin{bmatrix} r(0) & r(1) & \cdots & r(M-1) \\ r(-1) & r(0) & \cdots & r(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(-M+1) & r(-M+2) & \cdots & r(0) \end{bmatrix}    (5.7)

Main properties of the correlation matrix:

1. The correlation matrix of a discrete-time stationary stochastic process has the Hermitian property. From the definition (5.5) it follows that the correlation matrix is equal to its conjugate transpose (Hermitian property):

R^H = R    (5.8)

or, in terms of the autocorrelation function:

r(-k) = r^*(k)    (5.9)

With the use of equation (5.9), the correlation matrix definition (5.7) can be written as

R = \begin{bmatrix} r(0) & r(1) & \cdots & r(M-1) \\ r^*(1) & r(0) & \cdots & r(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M-1) & r^*(M-2) & \cdots & r(0) \end{bmatrix}    (5.10)

² Hermitian: transpose and complex conjugate.

2. The correlation matrix of a discrete-time stationary stochastic process is a Toeplitz matrix. A square matrix T is called Toeplitz when all the elements on its main diagonal are equal, and the elements on every other diagonal parallel to the main one are also equal:

T = \begin{bmatrix} t_0 & t_1 & t_2 & \cdots & t_n \\ t_{-1} & t_0 & t_1 & \ddots & \vdots \\ t_{-2} & t_{-1} & t_0 & \ddots & t_2 \\ \vdots & \ddots & \ddots & \ddots & t_1 \\ t_{-n} & \cdots & t_{-2} & t_{-1} & t_0 \end{bmatrix}    (5.11)

By comparing the correlation matrix in (5.10) and the Toeplitz matrix in (5.11), it is clear that they have the same form.

5.1.2 Power Spectral Density

While in the time domain the most common description of a stochastic process is the aforementioned autocorrelation function, in the frequency domain the power spectral density is usually used, and it is the one we define below. If we consider an infinite stochastic process u(n), n = \ldots, -2, -1, 0, 1, 2, \ldots, the power spectral density is given by

S(\omega) = \sum_{i=-\infty}^{\infty} r(i)\, e^{-j\omega i}    (5.12)
where \omega is the angular frequency, \omega \in (-\pi, \pi], and r(i) is the autocorrelation function of the process.

Main properties of the power spectral density:

1. The power spectral density is the discrete-time Fourier transform of the autocorrelation function, and the autocorrelation function is the inverse discrete-time Fourier transform of the power spectral density:

S(\omega) = \sum_{i=-\infty}^{\infty} r(i)\, e^{-j\omega i}, \quad \omega \in (-\pi, \pi]    (5.13)

r(i) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega)\, e^{j\omega i}\, d\omega, \quad i = 0, \pm 1, \pm 2, \ldots    (5.14)

2. The expected power of a stationary discrete-time stochastic process is given by

r(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\omega)\, d\omega.    (5.15)

The quantity of equation (5.15) results from

r(k) = E[u(n)\, u^*(n-k)] \;\Rightarrow\; r(0) = E[u(n)\, u^*(n)] = E[|u(n)|^2].    (5.16)

Therefore, r(0) is the mean-square value of the process.

3. The power spectral density of a stationary discrete-time stochastic process is nonnegative:

S(\omega) \geq 0, \quad \forall \omega.    (5.17)

5.1.3 Yule-Walker Equations

As has been mentioned, in order to fully define an AR model we need to estimate the AR coefficients a_i, i = 1, \ldots, M. It can be proved that they are evaluated from the following set of equations, known as the Yule-Walker equations:

\begin{bmatrix} r(0) & r(1) & \cdots & r(M-1) \\ r^*(1) & r(0) & \cdots & r(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M-1) & r^*(M-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} = \begin{bmatrix} r^*(1) \\ r^*(2) \\ \vdots \\ r^*(M) \end{bmatrix}    (5.18)

where w_i = -a_i, i = 1, \ldots, M, are the unknown parameters and r(0), r(1), \ldots, r(M) are
the autocorrelation function values defined in (5.6). Equation (5.18) can be written in matrix form as

R\, w = r    (5.19)

where the vectors w, r are defined as

w = [w_1\; w_2\; \ldots\; w_M]^T, \qquad r = [r^*(1)\; r^*(2)\; \ldots\; r^*(M)]^T.

If the correlation matrix R is nonsingular³, the AR coefficients are calculated from (5.19) as

w = R^{-1} r.    (5.20)

5.2 Linear Prediction Model

One of the most common applications in time-series analysis is the linear prediction of a future sample of a discrete-time stationary stochastic process using a linear combination of the process's past samples. There are two forms of the prediction procedure: forward linear prediction and backward linear prediction. Consider a time series u(n), u(n-1), ..., u(n-M). In forward linear prediction, the past samples u(n-1), ..., u(n-M) are used to estimate the predicted value u(n), where M is the number of samples needed for the estimation of u(n) and is called the order of the predictor. On the other hand, in backward linear prediction we use the future samples u(n), u(n-1), ..., u(n-M+1) to predict the past sample u(n-M). In this section we consider the forward form of linear prediction, since it is the one we use in the implementation of our model.

The predicted value û(n) of the sample u(n) of a forward linear prediction model is given by

\hat{u}(n) = \sum_{i=1}^{M} w_i^*\, u(n-i),    (5.21)

where every term of the above summation is the product of the complex conjugate of the coefficient w_i, i = 1, ..., M, and the input u(n-i). These coefficients are called linear prediction coefficients (LPCs) and they represent the unknown parameters of the model. The implementation of this model is depicted in Fig. 5.2, where we have M delays (labelled z^{-1}), whose outputs are usually referred to as tap points, M tap inputs u(n-1), ..., u(n-M), and M tap weights w_1^*, ..., w_M^*. At this point, we have to mention that

³ A square matrix is nonsingular iff its determinant is nonzero.
Figure 5.2: Forward linear predictor of order M, with tap inputs u(n-1), ..., u(n-M) and linear prediction coefficients w_1, ..., w_M.

the tap inputs are assumed to come from a wide-sense stationary⁴ stochastic process with mean value equal to zero (\mu = 0).

The error of the forward prediction model of order M is given by

e_M(n) = u(n) - \hat{u}(n) = u(n) - \sum_{i=1}^{M} w_i^*\, u(n-i) = \sum_{i=0}^{M} \bar{w}_i^*\, u(n-i)    (5.22)

where

\bar{w}_i = \begin{cases} 1, & i = 0 \\ -w_i, & i = 1, 2, \ldots, M \end{cases}    (5.23)

The model of equation (5.22) is implemented by the filter of Fig. 5.3, which uses the inputs u(n), u(n-1), ..., u(n-M) to produce the error e_M(n) and is called the forward prediction-error filter. Since we assumed that u(n) has zero mean, the error e_M(n) has zero mean too. The minimum mean-square error of the predictor is equal to

P_M = E[|e_M(n)|^2],    (5.24)

which is equal to the variance of the error (5.22), since e_M(n) has zero mean, and can also be considered as the power of e_M(n).

5.2.1 Wiener-Hopf Equations

To specify a linear prediction model, we need to estimate the linear prediction coefficients (LPCs). In order to achieve this estimation we use the Wiener-Hopf equations,

⁴ A stochastic process u(n) whose statistical properties are invariant to a shift of time (strictly stationary) is called wide-sense stationary iff E[|u(n)|²] < ∞, ∀n [39].
Figure 5.3: The forward prediction-error filter, which uses the inputs u(n), u(n-1), ..., u(n-M) to produce the error e_M(n).

which are described by the following set

\begin{bmatrix} r(0) & r(1) & \cdots & r(M-1) \\ r^*(1) & r(0) & \cdots & r(M-2) \\ \vdots & \vdots & \ddots & \vdots \\ r^*(M-1) & r^*(M-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} = \begin{bmatrix} r^*(1) \\ r^*(2) \\ \vdots \\ r^*(M) \end{bmatrix},    (5.25)

where the first matrix is the correlation matrix R of the tap inputs, the vector

w = [w_1\; w_2\; \ldots\; w_M]^T    (5.26)

consists of the optimum, unknown LPCs, and the last vector is the cross-correlation vector between the tap input vector

u = [u(n-1)\; u(n-2)\; \ldots\; u(n-M)]^T    (5.27)

and the desired response u(n), equal to

r = E[u\, u^*(n)] = [r^*(1)\; r^*(2)\; \ldots\; r^*(M)]^T = [r(-1)\; r(-2)\; \ldots\; r(-M)]^T.    (5.28)

If we use equations (5.26), (5.27), we can rewrite the Wiener-Hopf equations (5.25) in compact form as

R\, w = r,    (5.29)

where, if the correlation matrix R is nonsingular, the final solution is given by

w = R^{-1} r.    (5.30)

Equations (5.30) and (5.20) have identical mathematical form; this is discussed in Section 5.3.
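The solution (5.30) can be sketched numerically: the snippet below synthesizes a real AR(2) sequence (illustrative coefficients, for real data the conjugates in (5.25) drop out), estimates the autocorrelation values r(i) as in (5.6), and solves R w = r, recovering w_i = -a_i.

```python
import numpy as np

# Estimating LPC's by solving the normal equations (5.29)-(5.30) on a
# synthetic real AR(2) sequence. Coefficients are illustrative.

rng = np.random.default_rng(1)
a_true = np.array([-1.2, 0.5])                 # a_1, a_2 of eq. (5.1)
N = 100000
v = rng.standard_normal(N)
u = np.zeros(N)
for n in range(N):                             # AR synthesis, eq. (5.3)
    u[n] = v[n]
    if n >= 1:
        u[n] -= a_true[0] * u[n - 1]
    if n >= 2:
        u[n] -= a_true[1] * u[n - 2]

M = 2
r = np.array([np.mean(u[i:] * u[:N - i]) for i in range(M + 1)])  # r(0..M)
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
w = np.linalg.solve(R, r[1:])                  # w = R^{-1} r, eq. (5.30)
a_hat = -w                                     # since w_i = -a_i
```

Here a_hat closely matches a_true, and the residual power r(0) - wᵀ[r(1), ..., r(M)] approximates the excitation variance (1 in this example), anticipating the minimum error P_M of (5.24).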
5.2.2 Levinson-Durbin Algorithm

The Levinson-Durbin algorithm is a very efficient method for estimating the prediction coefficients by solving the Wiener-Hopf equations. The algorithm is recursive and takes advantage of the Toeplitz structure of the input's correlation matrix. Another application of the Levinson-Durbin formulation is the Yule-Walker AR problem, where an unknown system is modeled as an autoregressive process. More precisely, the system is modeled as the output of an all-pole IIR filter driven by white Gaussian noise.

In general, the Levinson-Durbin algorithm is a fast algorithm which solves an n-th order linear system of the form R a = b, where R is a Hermitian, positive-definite, Toeplitz matrix, b is identical to the first column of R shifted by one element and with the opposite sign, and a is the vector of the unknown coefficients. The method is computationally efficient, since it requires O(n²) operations, in contrast to general-purpose methods such as standard Gaussian elimination, which requires O(n³) operations.

5.3 Linear Prediction and Autoregressive Models

When we compare the solution of the Yule-Walker equations (5.20) and the solution of the Wiener-Hopf equations (5.30), we observe that they have identical mathematical form. For this reason, we can state that when a forward linear prediction model is optimized, its coefficients take the same values as the corresponding coefficients of an autoregressive model. However, when the stochastic process is not autoregressive, the linear prediction coefficients (LPCs) approximate the random process as an AR process.
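The Levinson-Durbin recursion of subsection 5.2.2 can be sketched as follows for real-valued autocorrelations; this is a hypothetical compact implementation (not the thesis code), checked against a direct solve of the same Toeplitz system.

```python
import numpy as np

# Levinson-Durbin recursion for the real symmetric Toeplitz system (5.25):
# given r(0), ..., r(M), return the predictor coefficients w and the final
# prediction-error power P_M. Assumes a positive-definite Toeplitz matrix.

def levinson_durbin(r, M):
    w = np.zeros(M)
    P = r[0]                                          # zeroth-order error power
    for m in range(1, M + 1):
        # reflection coefficient of order m
        k = (r[m] - np.dot(w[:m - 1], r[1:m][::-1])) / P
        w[:m - 1] = w[:m - 1] - k * w[:m - 1][::-1]   # order-update of coefficients
        w[m - 1] = k
        P *= 1.0 - k ** 2                             # error power shrinks each order
    return w, P

# Check against a direct solve of R w = r for an example autocorrelation.
r = np.array([4.0, 2.0, 1.0, 0.5])
M = 3
w, P = levinson_durbin(r, M)
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
w_direct = np.linalg.solve(R, r[1:M + 1])
```

The returned P also equals r(0) - Σᵢ wᵢ r(i), i.e., the minimum mean-square error P_M of (5.24), and the recursion costs O(M²) operations versus O(M³) for the direct solve.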
We notice, then, that the two models are closely related, and we can consider the forward linear predictor complementary to an autoregressive process, as depicted in Fig. 5.4, where (a) shows a forward prediction-error filter of order M and (b) shows the corresponding autoregressive model. These models are characterized as complementary because the
Figure 5.4: (a) A forward prediction-error filter of order M and (b) the corresponding autoregressive model.
first model can be viewed as the analysis operation, which whitens the original stationary process u(n), estimates the LPCs w_1^*, w_2^*, ..., w_M^* (since \bar{w}_0 = 1), and produces the error e_M(n), while the second can be considered the synthesis part, which takes as input a white-noise signal v(n) (with zero mean and variance \sigma_v^2), uses the AR coefficients a_1, a_2, ..., a_M, and outputs an estimate of the original process u(n). The relation between the LPCs and the AR coefficients is given by

w_i = -a_i, \quad i = 1, \ldots, M.

Another argument that the two models form a matched pair is the fact that they have equivalent spectra. From equation (5.12), the power spectral density of a forward prediction-error filter of a stochastic process u(n) can be determined as

S_{LPC}(\omega) = \frac{P_M}{\left| 1 - \sum_{k=1}^{M} w_k^*\, e^{-j\omega k} \right|^2},    (5.31)

where the w_k are the linear prediction coefficients and P_M is the minimum mean-squared error of the predictor of order M [39]. Correspondingly, the power spectral density of an autoregressive process u(n) is given by

S_{AR}(\omega) = \frac{\sigma_v^2}{\left| 1 + \sum_{k=1}^{M} a_k^*\, e^{-j\omega k} \right|^2},    (5.32)

where the a_k are the AR coefficients and \sigma_v^2 is the variance of the white noise. This density is usually called the autoregressive power spectrum or, more simply, the AR spectrum. Finally, as we also observe from (5.31), (5.32), the linear prediction and the autoregressive model are indeed equivalent models and they can be seen as complementary.

5.4 Line Spectral Frequencies (LSF)

In our model, once the linear prediction coefficients (LPCs) have been estimated, the next step is to code them and transmit them over the communication channel. In audio compression, though, LPCs are considered inappropriate for quantization, since they exhibit a large dynamic range and frequent stability problems, which result in corruptions during transmission. Hence, the direct quantization of LPCs is rarely used.
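Returning briefly to the complementary pair of Fig. 5.4, the sketch below (illustrative real AR(2) coefficients, not from the thesis) verifies numerically that the prediction-error filter of Fig. 5.4(a) exactly recovers the white excitation that drove the synthesis filter of Fig. 5.4(b).

```python
import numpy as np

# Analysis/synthesis complementarity of Fig. 5.4 for a real AR(2) model.
# Synthesis: white noise v(n) -> u(n) through the all-pole filter of (5.3).
# Analysis: u(n) -> e(n) through the error filter with w_i = -a_i.

rng = np.random.default_rng(2)
a = np.array([-1.2, 0.5])                  # illustrative AR coefficients
N = 20000
v = rng.standard_normal(N)

u = np.zeros(N)                            # synthesis (Fig. 5.4(b))
for n in range(N):
    u[n] = v[n]
    for i, ai in enumerate(a, start=1):
        if n - i >= 0:
            u[n] -= ai * u[n - i]

e = u.copy()                               # analysis (Fig. 5.4(a))
for i, ai in enumerate(a, start=1):
    e[i:] += ai * u[:-i]                   # e(n) = u(n) + sum_i a_i u(n-i)
```

Here e matches v sample for sample: the error filter whitens u(n) completely when the process really is autoregressive of the assumed order.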
Instead, equivalent representations of the same spectral information with better properties are used, such as line spectral frequencies (LSFs), reflection coefficients, and log-area ratios [40, 41, 42, 43]. Among these, for reasons that are discussed further down in this
section, LSFs are the most convenient, and they are the ones used in our model. At this point, we define the LSFs and some of their essential properties.

In a linear prediction model of order M, the predicted value is assumed to be the output of an all-pole filter H(z) = 1/A(z), where

A(z) = 1 + a_1 z^{-1} + \cdots + a_M z^{-M}    (5.33)

and a_1, ..., a_M are the parameters of the model. With the use of (5.33), we construct the two following polynomials:

P(z) = A(z) + z^{-(M+1)} A(z^{-1})    (5.34)

Q(z) = A(z) - z^{-(M+1)} A(z^{-1}).    (5.35)

The roots of the even-symmetric P(z) and the odd-symmetric Q(z) are called line spectral frequencies (LSFs) [40].

Main properties of the line spectrum pair {P(z), Q(z)}:

1. All the roots of the pair {P(z), Q(z)} lie on the unit circle.

2. The roots of the symmetric polynomial P(z) and the antisymmetric polynomial Q(z) exhibit an ordering and interlacing property.

From the first property we conclude that the roots can be represented as e^{j\omega_i}, where the \omega_i are frequencies; for this reason they are called line spectral frequencies. Both of the above properties are used in order to evaluate the roots of P(z), Q(z). Also, the second property is used in error-control schemes to detect and correct, when possible, transmission errors. More precisely, when a packet is lost we can estimate it from its adjacent packets by interpolation, with the use of the above properties.

In Fig. 5.5(a), the spectrum of a signal with linear prediction coefficients at 210, 1280, 2320, 2720 and 3180 Hz is depicted. These coefficients correspond to the poles of {P(z), Q(z)}, whose angles \theta_i, i = 1, ..., 4, with the horizontal axis are the LSFs (Fig. 5.5(b)). Furthermore, the LSFs are robust to transmission distortions, which means that an unacceptable change in one frequency will not leak into all the estimated LSFs; it will affect at most a small neighborhood.
This property is frequently referred to as the localized spectral sensitivity property of LSFs [41] and is very helpful, since it ensures that we can quantize LSFs without having to consider the spread of quantization distortion from one spectral area to another. Thus, we can represent LSFs at higher frequencies with fewer
Figure 5.5: (a) The spectrum of a signal with linear prediction coefficients at 210, 1280, 2320, 2720 and 3180 Hz and (b) the corresponding poles, whose angles \theta_i (i = 1, ..., 4) with the horizontal axis are the LSFs.

bits (taking advantage of the human ear's reduced sensitivity at high frequencies) and gain in terms of bit rate.
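The construction (5.34)-(5.35) can be sketched directly: build P(z) and Q(z) from an example A(z), take their roots, and keep the angles of the upper-half-plane roots as the LSFs. The A(z) below is illustrative, not from the thesis.

```python
import numpy as np

# Computing LSF's from eqs. (5.34)-(5.35). A(z) is represented by its
# coefficient vector [1, a_1, ..., a_M]; padding with one zero gives a
# degree-(M+1) polynomial, and reversing it yields z^{-(M+1)} A(z^{-1}).

a = np.array([1.0, -1.2, 0.5])            # A(z) = 1 - 1.2 z^-1 + 0.5 z^-2 (M = 2)
A_pad = np.concatenate([a, [0.0]])
A_rev = A_pad[::-1]                       # coefficients of z^{-(M+1)} A(z^{-1})
P = A_pad + A_rev                         # symmetric polynomial, eq. (5.34)
Q = A_pad - A_rev                         # antisymmetric polynomial, eq. (5.35)

roots_P = np.roots(P)
roots_Q = np.roots(Q)

# Keep one root per conjugate pair (positive angle), discarding the trivial
# real roots at z = -1 (in P) and z = +1 (in Q).
lsf_P = np.angle(roots_P[np.imag(roots_P) > 1e-8])
lsf_Q = np.angle(roots_Q[np.imag(roots_Q) > 1e-8])
lsf = np.sort(np.concatenate([lsf_P, lsf_Q]))
```

All roots of P and Q lie on the unit circle (property 1), and the sorted angles alternate between the two polynomials, illustrating the ordering and interlacing property (property 2).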
Chapter 6

Random Process Modeling and Decorrelation

6.1 Introduction

This chapter consists of two fundamental sections. The first section begins with the description of a very popular technique for modeling a random process and estimating its arbitrary density function: the Gaussian mixture model (GMM). We give the formal definition of this mixture model and explain the motivations for its use. The second section studies an optimal transform with many applications in signal processing, which decorrelates a random process through the eigen-structure of its correlation matrix. This transform is called the Karhunen-Loève transform (KLT). We begin with some tools of eigen-analysis for the description of the transform and continue with the definition of the KLT.

6.2 Gaussian Mixture Model

Model Definition

The Gaussian mixture model is a technique (Fig. 6.1) which approximates the unknown probability density function (pdf) of a random vector x as a mixture of Gaussians. It is often collectively represented as {p(\omega_i), \mu_i^x, \Sigma_i^{xx}} and is given by the equation

g(x) = \sum_{i=1}^{M} p(\omega_i)\, N(x; \mu_i^x, \Sigma_i^{xx})    (6.1)
Figure 6.1: The representation of a GMM with M components. The unknown probability density function g(x) of a random vector x is given as the weighted sum of M Gaussian component densities N(x; \mu_i^x, \Sigma_i^{xx}) with mixture weights p(\omega_i), i = 1, ..., M.

where x is a random D-dimensional vector, the N(x; \mu_i^x, \Sigma_i^{xx}) are the M Gaussian component densities of the model, with mean vectors \mu_i^x and covariance matrices \Sigma_i^{xx}, and the p(\omega_i) are the mixture weights (prior probabilities of the Gaussian components), under the constraint \sum_{i=1}^{M} p(\omega_i) = 1. The form of the Gaussian component densities is given by

N(x; \mu_i^x, \Sigma_i^{xx}) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_i^{xx}|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i^x)^T (\Sigma_i^{xx})^{-1} (x - \mu_i^x) \right\}    (6.2)

The GMM has several forms, since the covariance matrices can be chosen to be diagonal or full, and we can have one covariance matrix per component or one covariance matrix shared by all components. In our model we choose full per-component matrices, even though any of the other choices could be made.

Another relevant issue is, given the training data, to estimate the parameters of the Gaussian component densities, p(\omega_i), \mu_i^x and \Sigma_i^{xx}, which best match the training data's distribution. There are many available algorithms for this estimation [44], but the most efficient and commonly used is the expectation-maximization algorithm, known as the EM algorithm. The basic idea of EM is to iteratively increase the likelihood of the data that are present. The algorithm begins with an initial parameter set s_0 = {p^{(0)}(\omega_i), \mu_i^{(0)x}, \Sigma_i^{(0)xx}} and the set x of N training data vectors, and estimates a new parameter set s_t = {p^{(t)}(\omega_i), \mu_i^{(t)x}, \Sigma_i^{(t)xx}} such that p(x|s_t) \geq p(x|s_0). Then the new set s_t becomes the initial set and a new iteration begins. This process continues until a convergence threshold is reached.
For the convenience of the reader, we briefly review the basic formulas of the EM algorithm for a GMM pdf. The parameters to be estimated are p(\omega_i), \mu_i^x and \Sigma_i^{xx},
for each Gaussian class \omega_i. The initial values of these parameters are usually obtained by a clustering procedure such as k-means. During the t-th iteration of the EM algorithm, the expectation step (E-step) involves calculating the following conditional probabilities:

p^{(t)}(\omega_i | x) = \frac{p^{(t)}(\omega_i)\, N(x; \mu_i^{(t)x}, \Sigma_i^{(t)xx})}{\sum_{j=1}^{M} p^{(t)}(\omega_j)\, N(x; \mu_j^{(t)x}, \Sigma_j^{(t)xx})}.    (6.3)

During the maximization step (M-step) that follows, the GMM parameters are re-estimated and will be used at the E-step of the next (t+1-th) iteration:

p^{(t+1)}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} p^{(t)}(\omega_i | x_k),    (6.4)

\mu_i^{(t+1)x} = \frac{\sum_{k=1}^{n} p^{(t)}(\omega_i | x_k)\, x_k}{\sum_{k=1}^{n} p^{(t)}(\omega_i | x_k)},    (6.5)

\Sigma_i^{(t+1)xx} = \frac{\sum_{k=1}^{n} p^{(t)}(\omega_i | x_k)\, (x_k - \mu_i^{(t+1)x})(x_k - \mu_i^{(t+1)x})^T}{\sum_{k=1}^{n} p^{(t)}(\omega_i | x_k)}    (6.6)

This estimation is iterated until a convergence criterion is reached, while a monotonic increase of the likelihood is guaranteed.

Model Motivations

There are several motivations for the use of the Gaussian mixture model (GMM) as a technique for approximating the pdf of our data vectors. The first is that GMMs have a well-documented history in the functional approximation of the training-data probability density function. Specifically, one of the advantages of GMMs is their ability to provide a smooth fit of arbitrarily-shaped densities to an overall distribution [45]. Thus, they provide an alternative to traditional histograms, which can also be used as densities, and they also give great flexibility in modeling the underlying statistics of the data. Secondly, in our model the features that are going to be coded and transmitted are the line spectral frequencies (LSFs), and it is well known that GMMs have very good performance in modeling LSFs [17, 46].

The third and very important motivation for the use of the GMM is the fact that, given
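A minimal sketch of the iteration (6.3)-(6.6) on synthetic 2-D data with two full-covariance components; all data and settings are illustrative, and the initialization here is a crude coordinate split standing in for k-means.

```python
import numpy as np

# EM for a 2-component, full-covariance GMM, following eqs. (6.3)-(6.6):
# E-step posteriors p(w_i | x_k), then M-step re-estimation of the weights,
# means, and covariances. Synthetic, well-separated 2-D data.

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([-3.0, 0.0], 0.7, (300, 2)),
               rng.normal([3.0, 1.0], 0.7, (300, 2))])
N, D, M = X.shape[0], X.shape[1], 2

def gauss(X, mu, Sigma):
    """Component density N(x; mu, Sigma) of eq. (6.2), row-wise over X."""
    d = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('nd,de,ne->n', d, inv, d)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))

p = np.full(M, 1.0 / M)                          # mixture weights
mu = np.array([X[X[:, 0] < 0].mean(axis=0),      # crude initialization
               X[X[:, 0] >= 0].mean(axis=0)])
Sigma = np.array([np.eye(D) for _ in range(M)])

for _ in range(20):
    # E-step, eq. (6.3)
    lik = np.stack([p[i] * gauss(X, mu[i], Sigma[i]) for i in range(M)])
    post = lik / lik.sum(axis=0)
    # M-step, eqs. (6.4)-(6.6)
    Nk = post.sum(axis=1)
    p = Nk / N
    mu = (post @ X) / Nk[:, None]
    for i in range(M):
        d = X - mu[i]
        Sigma[i] = (post[i][:, None] * d).T @ d / Nk[i]
```

After a few iterations the estimated means approach the two generating centers and the weights approach 1/2 each, consistent with the guaranteed monotonic likelihood increase noted above.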
the parameters of the GMM, and especially the covariance matrices, the data can be easily decorrelated. In essence, when we assume that the modeled data may come from different Gaussian sources, each with its own mean, covariance, and prior probability, we consider each data vector as a realization of one of M Gaussian pdfs. In other words, each vector can be classified into one of the Gaussian classes, so that efficient data clustering can be achieved. Once this classification has been made, the Karhunen-Loève Transform, which is discussed in the next section, can be applied using the covariance of the particular Gaussian in order to decorrelate the data. In fact, this scheme has been suggested in [47], where the problem of coding data with a GMM pdf is considered.

6.3 Karhunen-Loève Transform

Eigen-analysis

Consider the square matrix R, the (M × M) correlation matrix of a discrete-time stochastic process u(n). As mentioned in the properties of a correlation matrix (subsection 5.1.1), R is Hermitian. We wish to find a nonzero (M × 1) vector q that satisfies the equation

R q = λ q,   (6.7)

where the scalar λ is called an eigenvalue with corresponding eigenvector q. Equation (6.7) can be rewritten in the form

(R − λI) q = 0,   (6.8)

where I is the (M × M) identity matrix and 0 is the null vector. By Cramér's rule, a homogeneous linear system such as (6.8) has nontrivial solutions if and only if the determinant vanishes. Thus, the solutions of equation (6.8) are given by

det(R − λI) = 0.   (6.9)

Equation (6.9) is known as the characteristic equation, and det(R − λI) is known as the characteristic polynomial, which is a polynomial in λ of degree M with M roots.
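As a quick numerical illustration of eqs. (6.7)-(6.9), the sketch below (our own, not part of the thesis) builds a sample correlation matrix, finds the roots of its characteristic polynomial, and checks that they coincide with the eigenvalues and that each eigenpair satisfies R q = λq. The characteristic-polynomial route is only practical for small M; numerical libraries use dedicated eigensolvers instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
R = (X.T @ X) / 1000                        # sample correlation matrix (real symmetric)

lam, Q = np.linalg.eigh(R)                  # eigenvalues (ascending) and eigenvectors
# roots of det(R - lambda*I) = 0, eq (6.9)
roots = np.sort(np.roots(np.poly(R)).real)
assert np.allclose(np.sort(lam), roots)     # same numbers, as the theory predicts

for i in range(4):                          # each pair satisfies eq (6.7)
    assert np.allclose(R @ Q[:, i], lam[i] * Q[:, i])
```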
Let λ_1, λ_2, ..., λ_M denote the M roots of the characteristic polynomial. These roots are the eigenvalues of the square matrix R; they may be distinct, or some of them may coincide, in which case the characteristic polynomial has multiple roots. If λ_i is
one of the eigenvalues, then a vector q_i satisfying

R q_i = λ_i q_i   (6.10)

is called an eigenvector corresponding to the eigenvalue λ_i. Since both q_i and a q_i (a ∈ R, a ≠ 0) are eigenvectors associated with λ_i, we conclude that an eigenvalue may have several eigenvectors, but an eigenvector can be associated with only one eigenvalue.

Main properties of eigenvalues and eigenvectors:

1. If the eigenvalues λ_1, λ_2, ..., λ_M of an (M × M) correlation matrix R are distinct, then the corresponding eigenvectors q_1, q_2, ..., q_M are linearly independent; that is, there exist no scalars c_1, c_2, ..., c_M, not all zero, such that

Σ_{i=1}^{M} c_i q_i = 0.   (6.11)

2. All the eigenvalues λ_1, λ_2, ..., λ_M of an (M × M) correlation matrix R are real and nonnegative:

λ_i ≥ 0,  i = 1, ..., M.   (6.12)

3. If the eigenvalues λ_1, λ_2, ..., λ_M of an (M × M) correlation matrix R are distinct, then the corresponding eigenvectors q_1, q_2, ..., q_M are orthogonal to each other:

q_i^H q_j = 0,  for all i ≠ j.   (6.13)

4. The sum of the eigenvalues λ_1, λ_2, ..., λ_M of an (M × M) correlation matrix R equals the trace of R:

trace[R] = Σ_{i=1}^{M} λ_i,   (6.14)

where the trace of a square matrix is the sum of the elements on its main diagonal.

5. Every eigenvalue λ_i (i = 1, ..., M) of the (M × M) correlation matrix R of a discrete-time stochastic process is bounded by the minimum and maximum values of the process's power spectral density:

S_min ≤ λ_i ≤ S_max,  i = 1, ..., M.   (6.15)

6. Matrix Diagonalization:
Matrix diagonalization is the process of converting a square matrix into a diagonal matrix. It is equivalent to transforming the underlying system of equations into a special set of coordinate axes in which the matrix takes this canonical form [48]. In order to diagonalize the (M × M) correlation matrix R of a discrete-time stochastic process, we evaluate from equation (6.7) its eigenvectors q_1, q_2, ..., q_M, associated with the distinct eigenvalues λ_1, λ_2, ..., λ_M. Let

Q = [q_1 q_2 ... q_M]   (6.16)

be the matrix whose columns are the eigenvectors of R. Then the square correlation matrix can be transformed into a diagonal matrix Λ by

Λ = Q^H R Q.   (6.17)

Karhunen-Loève Transform Definition

The Karhunen-Loève Transform (KLT) is an optimal transform with many applications in signal processing, such as multichannel audio coding [2], image compression [49], speech recognition [50], neural clustering [51], and speaker and face recognition [52, 53]. In particular, the KLT is a unitary(1) linear transform that exploits the statistical properties of a discrete-time stochastic process in order to optimally decorrelate the process and to diagonalize its correlation matrix. If R is the correlation matrix of a zero-mean discrete-time stochastic process u(n) and Q is the matrix of eigenvectors described in equation (6.16), the KLT is defined by the unitary transform

R = Q Λ Q^H  ⇔  Λ = Q^H R Q,   (6.18)

where Λ is the diagonal matrix

Λ = diag(λ_1, λ_2, ..., λ_M),   (6.19)

(1) A unitary transform has the form A' = B A B^H with B unitary, where H denotes the conjugate transpose.
λ_1, λ_2, ..., λ_M being the eigenvalues of R; Q is called the KLT matrix. This matrix transforms R into the diagonal Λ and therefore decorrelates the stochastic process. The decorrelation property can be seen by considering the forward and inverse KLT. If c is the forward transform of the zero-mean random vector u,

c = Q^{-1} u = Q^H u,   (6.20)

then the inverse transform of c, which returns to the original space (before the transform to the KLT domain), is given by

u = Q c,   (6.21)

and the correlation matrix of c (using equation (6.20)) is

E[c c^H] = E[Q^H u u^H Q] = Q^H E[u u^H] Q = Q^H R Q = Λ.   (6.22)

From equation (6.22), since the correlation matrix of the vector c is diagonal, we conclude that the cross-correlations have been removed: the vector u has been transformed into a decorrelated vector c using the KLT matrix Q.
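The chain of eqs. (6.20)-(6.22) can be sketched numerically as follows (an illustrative script of ours, for the real-valued case where Q^H = Q^T): we estimate the correlation matrix of a correlated zero-mean process, apply the forward KLT, and verify that the transformed vectors are decorrelated and perfectly invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 10000, 4
A = rng.normal(size=(M, M))
u = rng.normal(size=(n, M)) @ A.T      # correlated zero-mean vectors (one per row)

R = (u.T @ u) / n                      # sample correlation matrix of u (symmetric)
lam, Q = np.linalg.eigh(R)             # diagonalization, Lambda = Q^H R Q, eq (6.17)
c = u @ Q                              # forward KLT, c = Q^H u, eq (6.20)

Lam = (c.T @ c) / n                    # correlation matrix of c, eq (6.22)
off_diag = Lam - np.diag(np.diag(Lam))
assert np.max(np.abs(off_diag)) < 1e-8 * np.max(lam)   # c is decorrelated

u_rec = c @ Q.T                        # inverse KLT, u = Q c, eq (6.21)
assert np.allclose(u_rec, u)           # lossless reconstruction
```

Since Q is orthogonal, the transform is energy-preserving and the inverse is simply multiplication by Q, exactly as in eq. (6.21).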
Part II

Multichannel Audio Modeling and Coding
Chapter 7

Multiscale Source/Filter Model

7.1 Introduction

In this chapter we propose a source/filter model for the transmission of multichannel audio signals that takes advantage of the redundancy among the channels in order to achieve low data rate requirements. Our methodology is based on a model that divides the signal into two parts: a low-dimensional part that contains the microphone-specific information, and a high-dimensional part that contains most of the inter-channel redundancy. We begin with a brief description of how multimicrophone recordings for multichannel rendering are made, continue with the description of the proposed method, and conclude with some modeling results.

7.2 Recordings for Multichannel Audio

Before proceeding to the description of the proposed method, we briefly describe how the multiple microphone signals for multichannel rendering are recorded. In this thesis we focus mainly on live concert hall performances, due to our expertise in recording in such venues, although this does not entail any loss of generality of our methods. A number of microphones are used to capture several characteristics of the venue, resulting in an equal number of microphone signals (stem recordings). These signals are then mixed and played back through a multichannel audio system that attempts to recreate the spatial realism of the recording venue. Our objective is to design a system, based on the available microphone signals, that is able to recreate all of these target microphone signals from a smaller set (or even only one) of reference microphone signals at the receiving end. The result would be a significant reduction in transmission requirements, while enabling remote mixing at the receiving end. In a related work [54], the interest was to completely synthesize the target signals using
the reference signals, without any additional information. Here, we propose using some additional information for each microphone in order to achieve high-quality resynthesis, with the constraint that this additional information requires minimal data rates for transmission. By examining the acoustical characteristics of the various stem recordings, we distinguish between reverberant and spot microphones. Spot microphones are microphones placed close to the sound source. These microphones present a very challenging situation: because the sound source is not a point source but rather distributed, as in an orchestra, their recordings depend largely on the instruments near the microphone and not so much on the hall acoustics. Resynthesizing the signals captured by these microphones therefore involves enhancing certain instruments and diminishing others, which in most cases overlap in both the time and frequency domains. Reverberant microphones are the microphones placed far from the sound source. These must be treated separately as one category because they mainly capture reverberant information of the venue (which can be reproduced by the surround channels of a multichannel playback system). The recordings captured by these microphones can be synthesized by filtering the reference recordings through linear time-invariant (LTI) filters, which can be designed using the methods described in [54]. We have obtained such stem microphone recordings from two US orchestra halls by placing microphones at various locations throughout each hall. Having recorded a performance with a total of sixteen microphones, our objective is to design a system that recreates these recordings from a smaller subset of the sixteen recordings.
Ideally, only one microphone recording would suffice to resynthesize all the remaining recordings; we discuss the implications of our approach later in this work. It is important to note that in this thesis we are interested only in spot microphone resynthesis. The reverberant microphone problem has been treated in [54], where it was shown that reverberant recordings can be resynthesized from a reference recording using specially designed LTI filters that accurately capture the acoustics of the venue. Thus, for these microphones no additional information is necessary for high-quality resynthesis (apart from the filters themselves, which add only a negligible burden to the data flow). As mentioned, our objective has been to significantly reduce the bandwidth requirements of the multiple microphone recordings of a music performance. In order to achieve such low bit rates, some trade-off must be introduced, typically regarding the quality of the recording. Here we propose, instead, a trade-off regarding the accuracy of the final multichannel recording. We will show that it is possible to achieve low data rates by substituting some microphone signals with others which, although acoustically different, retain the objectives of the initial recording.
The term objectives refers to the main purpose of placing a microphone at a particular position in the venue. If a microphone was placed, for example, near the chorus of an orchestra, then the main objective of that placement is to capture a recording in which the chorus is the most prevalent part with respect to the remaining parts of the orchestra. If this microphone signal is substituted by a different (i.e., resynthesized) one that again contains the same performance with the chorus as the prevalent part, the result is considered a recording that retains the objective of the initial microphone signal. On the other hand, the term accuracy refers to the distance between the two signals. We will show that our methods result in a resynthesized signal that retains the objective of the initial microphone signal, while introducing a trade-off between the required bit rates and the accuracy achieved. In our previous example, the chorus will still be the prevalent part of the orchestra in the resynthesized signal (thus the objective is retained), but the other parts of the orchestra may now be more audible than in the initial signal (i.e., loss of accuracy). Subjectively, the new signal will sound as if it had been captured by a microphone placed farther from the chorus than in the original recording. However, since the objective is retained, the resynthesized signal will still sound as if it had been made with a microphone placed close to the chorus. We claim that it is possible to achieve low data rates without significant sacrifices regarding the accuracy of the multichannel recording.

7.3 Multiscale Source/Filter Model

Our proposed methodology, which is based on a multiscale source/filter representation of the multiple microphone signals, consists of the following steps.
Each microphone signal is segmented into a series of short-time overlapping frames using a sliding window. Within each frame, the audio signal is considered approximately stationary, and the spectral envelope is modeled by a vector of linear predictive coefficients (LPCs) obtained with autocorrelation analysis (see Chapter 5). The resulting vector contains the coefficients of an all-pole filter that approximates the spectral envelope of the audio signal in that particular frame. The modeling error is obtained by inverse filtering the audio frame with the estimated linear prediction filter. Linear predictive analysis yields an all-pole filter with far fewer coefficients than the number of samples in the audio frame. Under the source/filter model, for each frame the signal s(n) at time n is related to
the p previous signal samples by the following autoregressive (AR) equation:

s(n) = Σ_{i=1}^{p} a(i) s(n − i) + e(n),   (7.1)

where e(n) is the modeling error and p is the AR filter order. In the frequency domain, this relation can be written as

P_s(ω) = P_e(ω) / |A(ω)|^2,   (7.2)

where P_x(ω) denotes the power spectrum of the signal x(n) and A(ω) denotes the frequency response of the AR filter, i.e.,

A(ω) = 1 − Σ_{i=1}^{p} a(i) e^{−jωi}.   (7.3)

The (p + 1)-dimensional vector a = [1, a_1, a_2, ..., a_p]^T is the low-dimensional representation of the spectral properties of the signal. If s(n) is an AR process, the noise e(n) is white, and thus a completely characterizes the signal's spectral properties. In the general case, though, the error signal will not have white-noise statistics and thus cannot be ignored. In this general case, the AR all-pole filter, which can be computed using linear predictive (LP) analysis (see Section 5.2), gives only an approximation of the signal spectrum, and more specifically of the spectral envelope. The error signal can be obtained from the original signal by inverse filtering with the estimated AR filter:

e(n) = s(n) − Σ_{i=1}^{p} a(i) s(n − i).   (7.4)

In essence, this is the result of filtering s(n) with the all-zero filter A(ω). This error signal is also referred to as the residual signal. For the particular case of audio signals, the spectrum contains only the frequency components that correspond to the fundamental frequencies of the recorded instruments and all their integer multiples (harmonics). The AR filter of an audio frame captures its spectral envelope, while the error signal contains exactly the same harmonics as the audio frame. The error signal is the result of the audio frame filtered with the inverse of its spectral envelope.
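The per-frame analysis of eqs. (7.1) and (7.4) can be sketched as below. This is a minimal illustration of ours using the autocorrelation method; the helper names are hypothetical, and a practical coder would use the Levinson-Durbin recursion and windowed frames rather than a direct linear solve.

```python
import numpy as np

def lpc(frame, p):
    """AR coefficients a(1..p) of eq (7.1) via the autocorrelation method."""
    n = len(frame)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(p + 1)])
    # Yule-Walker normal equations with Toeplitz matrix R[i, j] = r(|i - j|)
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

def residual(frame, a):
    """Inverse filtering, eq (7.4): e(n) = s(n) - sum_i a(i) s(n - i)."""
    p = len(a)
    s = np.concatenate([np.zeros(p), frame])     # zero history before the frame
    return np.array([s[n + p] - a @ s[n:n + p][::-1]
                     for n in range(len(frame))])
```

For a frame drawn from a true AR process, the residual recovers the white excitation; for real audio frames it retains the harmonics of the frame with a flattened spectral envelope, as described above.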
Thus, we conclude that the error signal will contain the same harmonics as the audio frame, but their amplitudes will now have a significantly flatter shape in the frequency spectrum (see Fig. 7.1). Consider now two microphone signals of the same music performance, which have
been placed close to two different groups of instruments of the orchestra. Each of these microphones mainly captures that particular group of instruments, but also captures all the other instruments of the orchestra.

Figure 7.1: Consider a microphone signal M_1, which can be seen as the convolution of its spectral envelope s_1 and the residual signal e_1 (a multiplication of their spectra in the frequency domain). The residual e_1 contains the same harmonics as M_1, but their amplitudes have an almost flat shape in the frequency spectrum.

For simplification, consider an orchestra consisting of only two instruments, e.g., a violin and a trumpet. Microphone 1 is placed close to the violin and microphone 2 close to the trumpet. In most practical situations, microphone 1 will also capture the trumpet, at much lower amplitude than the violin, and vice versa for microphone 2. In that case, the signal s_1 from microphone 1 and the signal s_2 from microphone 2 will contain the fundamentals and corresponding harmonics of both instruments, but they will differ in their spectral amplitudes. Consider a particular frame of these two signals corresponding to exactly the same part of the music (i.e., some time-alignment procedure will be necessary to align the two microphone signals). We model each of the two audio frames with the source/filter model, resulting in two different AR filters and residual signals:

s_1(n) = Σ_{i=1}^{p} a_1(i) s_1(n − i) + e_1(n),   (7.5)

s_2(n) = Σ_{i=1}^{p} a_2(i) s_2(n − i) + e_2(n).   (7.6)

It is apparent that if the AR vector could capture the exact envelope (shape) of the spectrum of the particular audio segment, then the two residual signals e_1 and e_2 would contain the same flat harmonic frequency components. If the envelope modeling were perfect, it follows that they would be equal (differences in total gain are of no
interest for this application), since they would have flat magnitude with exactly the same frequency components (see Fig. 7.2).

Figure 7.2: If the AR vector could capture the exact envelope of the spectra of the recordings of microphones M_1 and M_2, the two residual signals e_1 and e_2 would have flat magnitude with exactly the same frequency components and would resemble each other; they would thus be almost equal in the frequency domain.

In that case, we would be able to resynthesize each of the two audio frames using only the AR filter that corresponds to that audio frame and the residual signal of only one of the other microphones. The final signal is resynthesized from the audio frames using the overlap-add procedure. If we similarly applied the source/filter model to all the spot microphone signals of a single performance, we would be able to completely resynthesize these signals using their AR vector sequences (one vector per audio frame) and the residual of only one microphone signal (see Fig. 7.3). This would result in a great reduction of the data rate of the multiple microphone signals. In practice, the AR filter is not an exact representation of the spectral envelope of the audio frame, and the residual signals of the two microphones will not be equal. However, we can improve the modeling performance of the AR filter by using filter banks (see Ch. 3). As depicted in Fig. 7.3, the spectrum of every recording is decomposed into M subband signals using filter bank analysis, and AR analysis is then applied in each band separately (the subband signals are downsampled). A small AR filter order for each band can yield a much better estimate of the spectral envelope than a single high-order filter over the full frequency band. The multiscale source/filter model in Fig.
7.3 achieves a flatter frequency response for the residual signals. Consequently, we can achieve a better approximation of the residual signal of microphone 1 with the residual of microphone 2. Then we can use one of them for resynthesizing the other microphone signals, in the manner explained in the previous
paragraph. However, the error signals cannot be made exactly equal, so the resynthesized signals will not sound exactly the same as the originally recorded signals. This corresponds to the loss of accuracy of the multichannel recording that was discussed earlier.

Figure 7.3: Every recording is decomposed into M subband signals. For each subband signal we compute its spectral envelope; for the subbands of the first recording we also compute the residual signal. All the envelopes are then coded and transmitted along with the residual of the first recording.

We claim that the use of the multiband source/filter model results in high-quality audio signals that retain the objective of the initial recordings, in the sense introduced here. In other words, the main instrument being captured still remains the prominent part of the microphone signal, while other parts of the orchestra might be more audible in the resynthesized signal than in the original microphone signal. Returning to the example of the two microphones and the two instruments, if we use the residual of microphone 1 to resynthesize the signal of microphone 2, then the violin will most likely be more audible in the result than in the original microphone 2 signal. This happens because some information of the first microphone signal remains in the error signal, since the spectral envelope modeling is not perfect. However, the trumpet will still be the more prominent of the two instruments in the resynthesized signal for microphone 2, since we used the original spectral information of that microphone signal.
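The two-microphone example above can be sketched with synthetic signals. The toy script below (ours; the frequencies, amplitudes, and helper names are illustrative, not the thesis experiment) builds two "microphone" signals sharing the same two partials in different proportions, fits an AR envelope to each, and resynthesizes signal 2 from its own envelope plus the residual of signal 1.

```python
import numpy as np

def lpc_ar(x, p):
    # autocorrelation-method AR fit, eq (7.1) (illustrative helper)
    r = np.array([x[:len(x) - k] @ x[k:] for k in range(p + 1)])
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

def inv_filter(x, a):
    # residual by inverse (all-zero) filtering, eq (7.4)
    p = len(x) and len(a)
    xp = np.concatenate([np.zeros(p), x])
    return np.array([xp[n + p] - a @ xp[n:n + p][::-1] for n in range(len(x))])

def synth(e, a):
    # all-pole synthesis: s(n) = sum_i a(i) s(n - i) + e(n)
    p = len(a)
    s = np.zeros(len(e) + p)
    for n in range(len(e)):
        s[n + p] = a @ s[n:n + p][::-1] + e[n]
    return s[p:]

fs, N = 8000, 2048
t = np.arange(N) / fs
f1, f2 = 112 * fs / N, 168 * fs / N          # two partials placed on FFT bins
rng = np.random.default_rng(0)
# "microphone 1" favours the first partial, "microphone 2" the second
s1 = 1.0 * np.sin(2 * np.pi * f1 * t) + 0.2 * np.sin(2 * np.pi * f2 * t) \
     + 0.01 * rng.normal(size=N)
s2 = 0.2 * np.sin(2 * np.pi * f1 * t) + 1.0 * np.sin(2 * np.pi * f2 * t) \
     + 0.01 * rng.normal(size=N)

a1, a2 = lpc_ar(s1, 8), lpc_ar(s2, 8)
e1 = inv_filter(s1, a1)                      # residual of the reference mic
s2_hat = synth(e1, a2)                       # envelope of mic 2 + residual of mic 1
```

In the spectrum of s2_hat the second partial should again dominate the first, i.e., the objective of microphone 2 is retained even though its own residual was never used.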
Equally important is the fact that the accuracy and the final audio quality of the multiscale source/filter model can be controlled by a variety of parameters: the duration of the audio frames in each band, the AR order in each band, the percentage of frame overlap, the total number of bands, and the filter bank used.
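Among these parameters, the filter bank itself is central. The thesis uses tree-structured Daubechies filter banks; as a minimal stand-in of ours, the sketch below implements a two-band Haar (db1) analysis/synthesis pair with the perfect reconstruction property. Recursing the analysis on the lowpass branch yields the octave frequency-band division discussed here.

```python
import numpy as np

def haar_analysis(x):
    """Split x into two half-rate subbands (Haar lowpass/highpass filters)."""
    x = x[:len(x) - len(x) % 2]          # even length required for 2:1 downsampling
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return lo, hi

def haar_synthesis(lo, hi):
    """Upsample and merge the subbands; perfect reconstruction for Haar."""
    x = np.empty(2 * len(lo))
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x
```

Applying `haar_analysis` recursively to `lo` gives an octave-spaced tree; longer Daubechies filters (e.g., db40, as used in the experiments below) reduce the aliasing between bands at the cost of longer filters.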
By changing these parameters we can achieve various data rates with correspondingly varying audio quality. Our system is thus quality-scalable, which is a significant property for communication systems. The appropriate values of frame duration and overlap percentage are found experimentally as the largest values that still result in acceptable audio quality. The choice of AR order and number of bands can vary more freely, depending on the desired accuracy and quality and on the data rate constraints; these are the main parameters that provide the quality-scalability property of our model. It is easy to verify experimentally that our claims hold for other types of harmonic signals, e.g., speech signals. We should note that some specific types of microphone signals, such as percussive instruments and signals from microphones far from the source, present different challenges, which are addressed in [54]. The method proposed in this thesis focuses on the large class of audio signals that can be modeled using a short-time analysis approach with emphasis on their spectral envelope. On the other hand, the proposed model does not hold for percussive sounds: for these types of sounds, the residual contains the most important information about the instrument excitation [55], whereas our approach here has focused on the AR coefficients. The proposed approach is also unsuitable for the reverberant microphones, because our methodology is based on a short-time analysis approach, which cannot capture reverberation effects that must be modeled using much larger time frames [54].

7.4 Modeling Results

In this section, we show that the proposed method results in a modeled signal that is both objectively and subjectively very close to the original recording.
For this purpose, we use two microphone recordings of a live concert hall performance that we recorded with a large number of microphones. One of the microphones captures the male voices of the chorus, while the other captures the female voices only. The objective is to resynthesize one of these recordings using its corresponding low-dimensional model coefficients along with the residual of the other recording. An important issue is the availability of an objective measure of performance for the designed system. Ideally, we would use an objective measure that indicates the quality of the final synthesized microphone signals; unfortunately, no such measure is available. However, a useful performance indicator for our system is the distance between the residual signals obtained from the reference and target microphone signals under the source/filter model. Our intuitive claim is that as the distance between these two residual signals decreases, the audio quality of the resynthesized signal improves.
From initial informal listening tests, it became clear that using around 8 bands in our multiband source/filter model produces high-quality resynthesis without loss of the objective of the initial recording. For example, we have been able to resynthesize the male-voices recording based on the residual of the female voices. On the other hand, without the use of a filter bank, the quality of the resynthesized signal deteriorated greatly, with a complete loss of the recording objective. In order to show this objectively, we measured the distance between the residual signals of the two recordings, using the normalized mutual information as a distance measure. As mentioned in the previous paragraph, the intuitive claim is that decreasing the distance between the two residuals increases the quality of the resynthesized recording. Our listening tests indicated that increasing the number of subbands in our model, and consequently improving the model accuracy, resulted in much better quality of the resynthesized signals. While several measures were tested, the normalized mutual information proved to be very consistent in this sense. The use of mutual information as a distance measure is very common in pattern comparison (see for example [55]). By definition, the mutual information of two random variables X and Y with joint probability density function (pdf) p(x, y) and marginal pdfs p(x) and p(y) is the relative entropy between the joint distribution and the product of the marginals:

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [ p(x, y) / (p(x) p(y)) ].   (7.7)

It is easy to prove that

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X),   (7.8)

where H(X) is the entropy of X and H(X|Y) is the conditional entropy. The mutual information is always nonnegative.
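Eqs. (7.7) and (7.8) can be estimated from data with a simple plug-in (histogram) estimator, sketched below. This is our own illustration, not the thesis's segmental procedure; histogram estimators are biased upward for small samples or many bins, so bin counts must be chosen with care.

```python
import numpy as np

def entropy(y, bins=16):
    """Plug-in estimate of H(Y) in nats from a histogram of y."""
    p, _ = np.histogram(y, bins=bins)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=16):
    """Plug-in estimate of I(X;Y), eq (7.7), from a joint histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of X
    py = pxy.sum(axis=0, keepdims=True)      # marginal of Y
    nz = pxy > 0                             # 0 * log 0 = 0 convention
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
```

Dividing `mutual_information(x, y)` by `entropy(y)` gives the normalized variant used below: it is 1 when x and y carry identical information and near 0 when they are independent.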
Since our interest is in comparing two vectors X and Y (with Y being the desired response), it is useful to adopt a modified definition of the mutual information, the Normalized Mutual Information (NMI) I_N(X; Y), which can be defined as [55, p. 47]:

I_N(X; Y) = [H(Y) − H(Y|X)] / H(Y) = I(X; Y) / H(Y).   (7.9)

Obviously, 0 ≤ I_N(X; Y) ≤ 1. The NMI attains its minimum value when X and Y are statistically independent and its maximum value when X = Y. The NMI does not constitute a metric, since it lacks symmetry. On the other hand, the NMI is invariant to amplitude differences [56], which is a very
important property, especially for comparing audio waveforms.

Figure 7.4: Median segmental normalized mutual information between the residual signals of the reference and target recordings as a function of the number of filter bank bands, for various Daubechies filter orders (db4 to db40). Increasing the filter order gives better separation of the bands in the frequency domain and thus better modeling of the spectral envelopes; this is a very important feature, since it leads to more similar residual signals, which in turn increases the NMI.

In Fig. 7.4 we plot the NMI between the power spectra of the two residual signals against the number of subbands used, for different orders of the Daubechies wavelet filters used in our tree-structured filter bank. With this construction, our filter bank has the perfect reconstruction property, which is essential for an analysis/synthesis system, as well as octave frequency-band division, which is important since the LP algorithm is especially error-prone in the lower frequency bands. For our implementation, we used a 32nd-order LP filter on 1024-sample frames (about 23 msec at a 44.1 kHz sampling rate) for the full-band analysis. For the subband analysis, we used an 8th-order filter for each band, with a constant frame length of 256 samples per band (and thus a frame duration in msec that varies across bands). The amount of overlap giving the best quality was found to be 75% in all cases. These parameters were chosen so that the total number of transmitted coefficients for the resynthesized recording is the same in both the full-band and the subband cases. For the particular parameters used, the total number of coefficients used for the resynthesis is eight times smaller than the total number of audio samples.
The final coefficients that we intend to code for each microphone signal are the line spectral frequencies (LSFs), given their favorable quantization properties compared to the LP coefficients. The NMI values in Fig. 7.4 are median values of the segmental NMI between the two residual signals using an analysis window of 6 msec. The residual signals are obtained using an overlap-add procedure so that they can be compared over the same analysis window. Our claim, that a subband analysis with a small LP order per band produces much better modeling results than a high LP order over the full frequency band, is strongly supported by the results shown: the median NMI of the 4-band filter bank (40th-order wavelet filters) clearly exceeds the NMI obtained for the full-band analysis. In Fig. 7.4 we also plot the median NMI for different orders of the Daubechies filters. Increasing the filter order leads to slightly better results, as expected intuitively: a higher filter order results in better separation of the different bands, which is important since we model each subband signal independently of the others. In a similar experiment, we compared the residual signals in the time domain and found that the median NMI doubles for the 8-band system compared to the full-band case. The results in both the frequency and time domains are similar regardless of the analysis window length used for obtaining the segmental NMI values. When the window size increases, the NMI drops, which is expected since more data are compared; however, the decrease is similar for all the numbers of bands we tested. In order to test the performance of our method, we also conducted subjective (listening) tests, in which a total of 17 listeners participated (individually, using good-quality headphones, Philips SBC HP 800).
We used the two concert hall recordings from the same performance mentioned earlier (one capturing the male voices and one capturing the female voices of the chorus). We chose three parts of the performance (about 10 sec. each, referred to as Signals 1-3 here) where both parts of the chorus are active, so that the two different microphone signals can be easily distinguished. For each signal we designed an ABX test, where A and B correspond to the male and female chorus recording (in random order), while each listener was asked to classify X as being closer to A or B according to whether the male or female voices prevail in the recording. We tested 4 different types of filter banks (3 wavelet-based and 1 MDCT-based), namely 8-band with filters db40 (test ABX-1) and db4 (ABX-2), 2-band with db40 (ABX-3), and 32-level MDCT-based with KBD window (ABX-4). For each of these 4 tests, we used all three of the chosen signals, thus a total of 12 ABX tests were conducted per listener. The results are given in Table 7.1. We can conclude that the objective results, as well as the various claims made in the previous sections regarding the model, are verified by the listening tests. It is clear that the 8-level wavelet-based filter bank produces excellent results when aliasing is limited (i.e., the db40 case), although there is certainly room for improvement, and further enhancement of our model is currently underway. On the other hand, when aliasing is high or when the number of bands (and thus the modeling accuracy) drops (ABX-2 and ABX-3), the performance of the proposed method greatly deteriorates, not only in the sense of enhancing the male voices, but also regarding final quality (which most listeners noticed during the experiments). Accuracy (but not quality) dropped in the case of the MDCT-based filter bank as well (ABX-4). In other words, we noticed that octave filter banks produce far superior results compared to equally-spaced filter banks, which could be attributed to the fact that the LP algorithm is especially error-prone in the lower frequency bands. At this point we note that in our informal tests the Laplacian pyramid, which is a different type of octave-spaced filter bank [57], produced results comparable to the wavelet case.

Table 7.1: Results from the ABX listening tests. We tested 4 different types of filter banks (3 wavelet-based and 1 MDCT-based), namely 8-band with 40th order Daubechies filters (db40, test ABX-1) and db4 (ABX-2), 2-band with db40 (ABX-3), and 32-level MDCT-based with KBD window (ABX-4).

                  ABX-1   ABX-2   ABX-3   ABX-4
Results correct   86%     63%     10%     8%

Figure 7.5: Results from the 5-grade scale DCR-based listening tests for Signals 1-3, where graphical representations of the 95% confidence interval are shown (the x's mark the mean value and the two horizontal lines indicate the confidence limits). These results show clearly that the resynthesized signals are of high quality (similar to the quality of the original recording) and the model does not seem to introduce any serious artifacts.
The choice of filter bank, and whether octave-spaced filter banks are indeed better than equally-spaced ones for our model, is a subject of our ongoing research. We also conducted DCR-based (Degradation Category Rating) listening tests for evaluating the quality of the resynthesized signals using a 5-grade scale in reference to the original recording (5 corresponding to the same quality and 1 to the lowest quality, when compared with the original male chorus recording). This test is often performed for speech coding [58]. Subjects listened to the three sound clips (Signals 1-3), where the resynthesized signals were obtained using the best modeling parameters (8-level, db40 wavelet-based). The results are depicted in Fig. 7.5, where graphical representations of the 95% confidence interval are shown (the x's mark the mean value and the two horizontal lines indicate the confidence limits). These results show clearly that the resynthesized signals are of high quality (similar to the quality of the original recording) and the model does not seem to introduce any serious artifacts.
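The 95% confidence intervals shown in Fig. 7.5 can be obtained with a standard two-sided Student-t interval over the listeners' ratings. A minimal sketch (the text does not state which interval estimator was used, so the t-based form and the function name below are assumptions):

```python
import numpy as np
from scipy import stats

def dcr_confidence_interval(scores, level=0.95):
    """Mean opinion score and a two-sided t-based confidence interval
    for a set of 5-grade DCR ratings (one rating per listener)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    mean = float(scores.mean())
    half = stats.t.ppf(0.5 + level / 2.0, df=n - 1) * scores.std(ddof=1) / np.sqrt(n)
    return mean, (mean - half, mean + half)
```

For 17 listeners, `dcr_confidence_interval` would be called once per signal with that signal's 17 ratings.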
Chapter 8

Multichannel Audio Coding

8.1 Introduction

After the description of the source/filter model, where the Linear Predictive Coefficients (LPCs) have been estimated, we proceed in this chapter to the coding of the LPCs. The approach is based on the source coding scheme of [17], which estimates the probability density function of the source, giving a compact representation of the transmission data and efficiently quantizing the estimated parameters at various bit rates. Our method is tailored towards the transmission of the many microphone recordings of a performance before they are mixed, and thus can be used in applications where the mixing procedure takes place at the receiver (remote mixing). This innovative approach relaxes the current bandwidth constraints of demanding applications such as remote mixing or distributed performances, enabling their widespread usage.

8.2 General Model Description

As we mentioned in Ch. 7, in our model every multichannel recording is decomposed into a number of subband signals using filter bank analysis, and AR analysis is applied in each band separately (see Fig. 8.1). In the first recording, we estimate the spectral envelopes and the residual signals of every band and we encode only the envelopes (see Fig. 8.2(a)). For the rest of the recordings, we proceed in the same way, except that the residuals are not estimated (see Fig. 8.2(b)). Finally, we encode the spectral envelopes and transmit them together with the residuals of the first recording, which will be combined with the spectral envelopes in order to reconstruct the original signals. We now explain some parts of the proposed model. In particular, several filter banks have been used in the implementation. In this chapter, without loss of generality, we use wavelets for the description of the coding scheme. The original signals are decomposed
in W layers as shown in Fig. 8.2. Every layer is then segmented into a series of k short-time overlapping frames using a sliding Hamming window, and AR analysis is applied in each frame, as depicted in Fig. 8.3. The spectral envelopes are modeled as vectors of linear predictive coefficients (LPCs), since audio signal frames are considered approximately stationary, and they are coded.

Figure 8.1: The proposed Multiscale Source/Filter Model.

Figure 8.2: The modeling and coding procedure of (a) the first recording and (b) recordings j (j = 2, ..., N), in the special case where a wavelet filter bank of W layers is used.
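The per-layer framing and AR analysis just described can be sketched as follows. This is an illustrative prototype, not the thesis implementation: the autocorrelation-method LP solver and the 256-sample frames with 75% overlap follow the parameters quoted in the text, while the function names are our own.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_coefficients(frame, order):
    # Autocorrelation-method linear prediction: returns
    # A = [1, a1, ..., ap] of the whitening filter A(z).
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], -r[1:order + 1])
    return np.concatenate(([1.0], a))

def analyze_layer(layer, order=8, frame_len=256, hop=64):
    # One wavelet layer -> per-frame spectral envelopes (LP vectors)
    # and residuals; 256-sample Hamming frames with 75% overlap
    # (hop = 64) and 8th-order filters, matching the subband
    # parameters given in the text.
    win = np.hamming(frame_len)
    envelopes, residuals = [], []
    for start in range(0, len(layer) - frame_len + 1, hop):
        frame = layer[start:start + frame_len] * win
        A = lp_coefficients(frame, order)
        envelopes.append(A)
        residuals.append(lfilter(A, [1.0], frame))  # e(n): whitened excitation
    return np.array(envelopes), np.array(residuals)
```

Filtering a frame with A(z) yields its residual; filtering the residual with 1/A(z) at the receiver reverses the operation, which is the source/filter synthesis used throughout the model.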
The coding procedure consists of two steps (Fig. 8.3). The first is the conversion of the LPCs into Line Spectral Frequencies (LSFs), since the latter have been found to be more tolerant to quantization errors. This step will not be analyzed here, since it has been described in detail in an earlier section. The second step is the quantization of the LSFs, which is discussed extensively further down in this chapter.

8.3 Quantization of Speech Line Spectral Frequencies

As mentioned before, the next step in our algorithm is to quantize the spectral envelopes of each of the microphone signals. We follow the quantization scheme of [17], developed for vector quantization of speech line spectral frequencies (LSFs). We transform the AR coefficients of each microphone signal to LSFs, since LSFs are more resistant to quantization errors. The next step in our method, after transforming the LPCs into LSFs, is to model the sequence of LSFs obtained from each microphone signal with a Gaussian Mixture Model (GMM)

g(x) = Σ_{i=1}^{m} p_i N(x; μ_i, Σ_i)    (8.1)

where N(x; μ, Σ) is the multivariate normal distribution with mean vector μ and covariance matrix Σ, m is the number of clusters, and p_i is the prior probability that the observation x has been generated by cluster i. The set of parameters {p_i, μ_i, Σ_i} is estimated from the training data using the EM algorithm (see Ch. 6). The Karhunen-Loève Transform (KLT) is adopted for the decorrelation of the LSFs. The KLT is especially suitable for GMM-modeled parameters, since it is the optimal transform for Gaussian signals in a minimum-distortion sense. Using GMMs, each LSF vector is assigned to one of the Gaussian classes using a classification measure (which is discussed later in this chapter); it can thus be considered approximately Gaussian and can be best decorrelated using the KLT.
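The LPC-to-LSF conversion mentioned above can be sketched via root-finding on the symmetric and antisymmetric polynomials P(z) and Q(z). This is a textbook construction and the helper below is our own illustration (practical coders typically use a Chebyshev-series search instead of generic root-finding):

```python
import numpy as np

def lpc_to_lsf(A):
    # A = [1, a1, ..., ap] for A(z) = 1 + a1 z^-1 + ... + ap z^-p.
    # P(z) = A(z) + z^-(p+1) A(1/z),  Q(z) = A(z) - z^-(p+1) A(1/z);
    # the LSFs are the angles of their unit-circle roots in (0, pi).
    A = np.asarray(A, dtype=float)
    P = np.concatenate((A, [0.0])) + np.concatenate(([0.0], A[::-1]))
    Q = np.concatenate((A, [0.0])) - np.concatenate(([0.0], A[::-1]))
    lsf = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        # keep one angle per conjugate pair, drop the trivial roots at 0 and pi
        lsf.extend(w for w in ang if 1e-8 < w < np.pi - 1e-8)
    return np.sort(np.array(lsf))
```

For a stable A(z), the p resulting frequencies are strictly increasing in (0, π), and the P- and Q-roots interleave, which is what makes LSFs robust to quantization.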
Using the GMM modeling of the spectral parameters, the covariance matrix of each class can be diagonalized using the eigenvalue decomposition as

Σ_i = Q_i Λ_i Q_i^T    (8.2)

where i = 1, ..., m and Λ_i = diag(λ_{i,1}, λ_{i,2}, ..., λ_{i,p}). In other words, Λ_i is the diagonal matrix containing the eigenvalues, and Q_i is the matrix containing the corresponding set of orthogonal eigenvectors of Σ_i, for the i-th Gaussian class of the model.
Figure 8.3: Every layer from the wavelet decomposition is segmented into a series of k short-time overlapping frames using a sliding Hamming window and AR analysis is applied in each frame. The spectral envelopes are modeled as LPCs, which are converted to LSFs. (a) In Layer 1, the LSFs are quantized and transmitted together with the residual signals, while (b) in Layer j (j = 2, ..., W) only the quantized LSFs are transmitted.
A. Cluster Quantization. After the estimation of the LSFs and the matrices Q_i, we quantize each frame's LSF vector with the parameters of every GMM cluster. The quantization procedure consists of the following steps (Fig. 8.4):

(i) We subtract from the LSF vector z_k the mean μ_i of the i-th cluster, where i = 1, ..., m.

(ii) We decorrelate the resultant vector using the matrix Q_i^T:

w_k = Q_i^T (z_k − μ_i).    (8.3)

(iii) We pass the components of the vector w_k through a scalar compressor c(.), which is given by [59]:

c(w_i) = (1/2) (1 + erf(w_i/√6)),    (8.4)

where w_i are the components of the vector w_k and erf is the error function:

erf(w_i) = (2/√π) ∫_0^{w_i} e^{−t²} dt.    (8.5)

(iv) A uniform quantizer is applied to the compressed vector. A detailed bit allocation scheme for the uniform quantizer follows later in this section.

(v) The resultant quantized vector q_k is expanded using the inverse function of the compressor c(.):

c^{−1}(q_i) = √6 erf^{−1}(2q_i − 1),    (8.6)

where q_i are the components of q_k.

(vi) Similarly to step (ii), the inverse KLT procedure (IKLT) recorrelates the quantized and expanded vector q_k by multiplying with Q_i.

(vii) We finally add the cluster mean μ_i to obtain the quantized value of z_k by the i-th cluster,

ẑ_k = Q_i q_k + μ_i.    (8.7)

The combination of a compressor, a uniform quantizer and an expander realizes a nonuniform quantizer. As mentioned above, the decorrelated vectors are processed using the logarithmic compression function shown in Fig. 8.5, quantized by a uniform quantizer, and
expanded using the inverse of the compression function. The function of (8.4) is used, since logarithmic companding laws result in a more robust quantizer.

Figure 8.4: Quantization scheme among clusters.

Figure 8.5: The logarithmic compression function c(x) = (1/2)(1 + erf(x/√6)).

A bit allocation scheme for the uniform quantizer is needed in order to distribute the total available bits (denoted by b_tot and specified by the user) for quantizing the source among the various clusters of the GMM. Let b_i be the bits for quantizing cluster i, and q_i the quantity

q_i = [ ∏_{j=1}^{p} λ_{i,j} ]^{1/p},  i = 1, ..., m,    (8.8)

where p is the dimensionality of the LSF vector. In the fixed rate bit allocation scheme the length of the codewords is fixed, and it can easily be shown that it satisfies the constraint

2^{b_tot} = Σ_{i=1}^{m} 2^{b_i}.

Subject to this constraint, the optimal bit allocation which minimizes the total average
mean square distortion is given by

b_i = b_tot − log₂[ Σ_{j=1}^{m} (p_j q_j)^{p/(p+2)} ] + (p/(p+2)) log₂(p_i q_i),  i = 1, ..., m    (8.9)

where p_i is the prior probability of cluster i. After the evaluation of the bits allocated to each cluster, we calculate the bit allocation among the cluster dimensions as

b_{i,j} = b_i/p + (1/2) log₂(λ_{i,j}/q_i),  i = 1, ..., m,  j = 1, ..., p    (8.10)

where b_{i,j} is the number of bits allocated to the j-th component of the i-th cluster and λ_{i,j} is the j-th eigenvalue of cluster i. In our implementation we rounded b_{i,j} to the nearest integer for more accurate bit allocation.

B. Overall Quantization. As discussed above, each frame's LSF vector z_k is quantized using every cluster's parameters of the GMM. In order to choose the GMM cluster that best models a particular LSF vector, we evaluate the relative distortion value for the vector and we choose the one with the minimum distortion (Fig. 8.6). Here, we employ the Log Spectral Distortion (LSD) as a measure of distance, as in [17]:

LSD(i) = [ (1/F_s) ∫_0^{F_s} [10 log₁₀ ( S(f) / Ŝ^{(i)}(f) )]² df ]^{1/2}    (8.11)

where F_s is the sampling rate and S(f), Ŝ^{(i)}(f) are, respectively, the LPC power spectra corresponding to the original vector z_k and the quantized vector ẑ_k^{(i)}, obtained using the parameter set of cluster i (i = 1, ..., m). We choose the ẑ_k^{(i)} which corresponds to the cluster of minimum LSD and transmit its bitwise representation. At the receiver (see Fig. 8.7), we convert the quantized LSF vector into its corresponding LPC vector; we combine the LPC vectors of every channel with the residual signal of the first recording, using AR synthesis; we use the synthesis part of the filter bank, which achieves perfect reconstruction; and we reconstruct the recordings under the accuracy-objective perspective discussed in Ch. 7.
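The compander of steps (iii)-(v) and the allocation formulas (8.8)-(8.10) translate directly into code. The sketch below uses our own function names; the midpoint uniform quantizer on (0, 1) is a simple choice not specified by the text, and, as in the text, the per-dimension bits are rounded to the nearest integer.

```python
import numpy as np
from scipy.special import erf, erfinv

def compress(w):
    # Step (iii): c(w) = 0.5 * (1 + erf(w / sqrt(6))), maps R onto (0, 1).
    return 0.5 * (1.0 + erf(w / np.sqrt(6.0)))

def expand(q):
    # Step (v): inverse compressor, c^-1(q) = sqrt(6) * erfinv(2q - 1).
    return np.sqrt(6.0) * erfinv(2.0 * q - 1.0)

def nonuniform_quantize(w, bits):
    # Steps (iii)-(v): compressor + midpoint uniform quantizer + expander.
    levels = 2 ** bits
    idx = np.clip(np.floor(compress(w) * levels), 0, levels - 1)
    return expand((idx + 0.5) / levels)

def cluster_bits(b_tot, priors, eigvals):
    # Eq. (8.9): fixed-rate allocation across the m GMM clusters, with
    # q_i the geometric mean of cluster i's eigenvalues, eq. (8.8).
    eigvals = np.asarray(eigvals, dtype=float)   # shape (m, p)
    priors = np.asarray(priors, dtype=float)     # shape (m,)
    p = eigvals.shape[1]
    q = np.prod(eigvals, axis=1) ** (1.0 / p)
    rho = (priors * q) ** (p / (p + 2.0))
    return b_tot - np.log2(rho.sum()) + np.log2(rho)

def dimension_bits(b_i, eigvals_i):
    # Eq. (8.10): b_ij = b_i / p + 0.5 * log2(lambda_ij / q_i), rounded.
    lam = np.asarray(eigvals_i, dtype=float)
    q_i = np.prod(lam) ** (1.0 / lam.size)
    return np.rint(b_i / lam.size + 0.5 * np.log2(lam / q_i)).astype(int)
```

One can check that the cluster allocation of (8.9) preserves the fixed-rate constraint, since 2^{b_i} = 2^{b_tot} ρ_i / Σ_j ρ_j with ρ_i = (p_i q_i)^{p/(p+2)}, which sums to 2^{b_tot}.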
Figure 8.6: Overall quantization scheme.

Figure 8.7: The reconstruction of the initial recordings at the receiver.

8.4 Coding Results

In this section we describe some practical details of the implementation and present some results regarding the quantization distortion. In particular, for our experiments, we obtained microphone signals from a US orchestra hall by placing 16 microphones at various locations throughout the hall. Our objective is to show that the model and coding methods we propose result in a high quality recording with low data rate requirements. For this purpose, we use two of these microphone signals, where one of the microphones mainly captures the male voices of the chorus of the orchestra, while the other mainly captures the female voices. These recordings are very easy to distinguish acoustically. The efficiency of the proposed model has been tested via listening tests (which are discussed further down), indicating that the model results in a very good quality microphone signal that retains the objective of the initial recording with a very small loss of accuracy. Before we proceed to some objective results using the Log Spectral Distortion (LSD) measure (regarding the proposed coding scheme), let us describe some parameters of the model. Specifically, the sampling rate of the audio data was 44.1 kHz; we divided the
frequency range into 8 octave subbands using 40th order Daubechies wavelet filters (these offer the perfect reconstruction property of a tree-structured filter bank [25]). In each subband we applied a constant Hamming window of 256 samples, using an 8th order AR filter in each band, and the amount of overlap for best quality was found to be 75% in all cases. The LSFs of each band are modeled using a GMM of 16 components. The parameters of the GMM were estimated with the EM algorithm (see section 6.2), using a training audio dataset of about vectors per subband, similarly to [17]. For obtaining this training dataset, we employed a different window rate for each subband (for maximizing the number of training vectors). This database consists of recordings of the same performance as the data we encoded (but a different part of the recording than the one used for testing). For the parameters discussed above and with varying choice of bit rate, we obtain the values of Table 8.1 (bit rate vs. LSD). An example of the fixed rate total bits that were allocated to every band for the coding procedure is given in Table 8.2, corresponding to the 10 kbits/sec case. The value of 10 kbits/sec was found to be the minimal bit rate for high quality coding (similar to the original recording). By varying the number of GMM classes per band we obtained the LSD values of Table 8.3, from which we can conclude that the LSD is only marginally decreased as the number of GMM clusters increases. For comparison, current compression algorithms for multichannel audio have minimal bit rate requirements in the order of 64 kbits/sec/channel for achieving high quality encoding. We note, though, that our algorithm has been developed based on the accuracy-objectives tradeoff that was explained in Ch. 7, and thus mainly focuses on encoding the microphone signals before those are mixed into the channels of the final multichannel signal.
Table 8.1: The Log Spectral Distortion (LSD, in dB) for various bit rates (kbits/sec). The value of 10 kbits/sec is found to be the minimal bit rate for high quality coding (similar to the original recording). For comparison, current compression algorithms for multichannel audio have minimal bit rate requirements in the order of 64 kbits/sec/channel for achieving high quality encoding.

At this point, it is important to examine the minimum LSD criterion. In particular, during the coding procedure, we code each LSF vector with the parameters of every GMM
Table 8.2: An example of the fixed rate total bits (band number, bits/band, and resulting LSD in dB) that were assigned to every band for the coding procedure, corresponding to the 10 kbits/sec case of Table 8.1 (bit rate vs. LSD).

Table 8.3: The LSD values (dB) for various numbers of GMM clusters per layer. We can conclude that the LSD is only marginally decreased as the number of GMM clusters increases.

Table 8.4: During the coding procedure we choose, among the coded LSF vectors, the one with the minimum LSD. Here, we compare it (min LSD, in dB) with the case of coding the LSF vector with the GMM cluster of maximum probability (max Prob, in dB), for various bit rates (kbits/sec).
cluster, and we decide to transmit the coded LSF vector which gives the minimum LSD. It is reasonable to wonder what would happen to the coded signal's distortion if we code (classify) the LSF vector using the parameters of the GMM cluster which has the maximum a posteriori probability, given by

P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{m} p(x | ω_j) P(ω_j)    (8.12)

where P(ω_i) is the a priori probability of cluster i and p(x | ω_i) is the probability of the observation x given cluster i. The answer is given in Table 8.4, where we coded the same signal, with the same aforementioned modeling parameters, using both choices, for various bit rates. The resultant distortion (in dB) for both cases indicates that the minimum LSD criterion gives better results compared to the maximum probability case.

Figure 8.8: Results from the 5-grade scale DCR-based listening tests for the coding procedure.

Finally, in the listening tests described in section 7.4, we also tested the performance of the coding procedure. In particular, we employed DCR (Degradation Category Rating) listening tests, using the same 5-grade scale. We used three sound clips (Signals 1-3), which were obtained using the 8-level, 40th order Daubechies wavelet filter bank and a GMM of 16 clusters. The listening test volunteers listened to these signals, coded at bit rates of 10 and 20 kbps. For every coded signal a 95% confidence interval was evaluated and the results are shown in Fig. 8.8, where the blue and red graphics correspond to the cases of 10 and 20 kbps respectively; the x's indicate the mean value of the subjects' scores and the other two lines mark the upper and lower confidence limits respectively. It is clear from the results that the quality of the coded audio signals is high and the proposed model
seems to be robust to the proposed coding scheme. We believe that the bit rates can be reduced even further with additional research.
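For reference, the posterior of eq. (8.12) is the standard GMM responsibility. A minimal sketch of the maximum-probability classification that Table 8.4 compares against (function names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_posteriors(x, priors, means, covs):
    # Eq. (8.12): P(cluster i | x) for a GMM with the given priors P(w_i),
    # mean vectors and covariance matrices.
    lik = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                    for m, c in zip(means, covs)])
    w = np.asarray(priors, dtype=float) * lik
    return w / w.sum()

def map_cluster(x, priors, means, covs):
    # Maximum a posteriori classification of x.
    return int(np.argmax(cluster_posteriors(x, priors, means, covs)))
```

The minimum-LSD rule instead quantizes x with every cluster and keeps the cluster whose reconstruction minimizes eq. (8.11), which is what Table 8.4 shows to be the better choice.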
Chapter 9

Conclusion and Future Work

We presented a new approach for modeling and coding multichannel audio using a multiscale source/filter model. The proposed method was based on the accuracy-objectives tradeoff introduced earlier. It can be used for resynthesizing the multiple microphone signals before they are mixed into the channels of the final multichannel recording, and is thus tailored towards applications such as remote mixing and distributed musician collaboration. The main advantage of the model is that it separates each microphone signal into a low-dimensional signal, which mainly captures the microphone-specific properties, and a high-dimensional signal, which mainly contains the inter-channel similarities. Our model also introduces a novel multichannel audio coding scheme where only one audio channel, along with side information of a few kbps per channel, can be decoded into the multiple channels of the original recording at the receiving end. This results in a great reduction of the data rate of the multiple microphone signals, allowing low bit rate transmission. The presented results show good quality for coded audio signals using side information in the order of 10 kbits/sec/channel, while current compression algorithms for high quality multichannel audio have minimal requirements in the order of 64 kbits/sec/channel. Thus, the model can be used in many coding applications for multichannel audio. We now describe future research directions which could further improve the source/filter model. In Chapter 7 we presented some subjective results for several types of filter banks used in our model. We noticed that the octave-spaced filter banks displayed better results compared to the equally-spaced case. In our future research, we intend to study the underlying reasons for this observation and to test more filter bank types, hoping to improve the modeling.
It is important to mention here that our method can be used in the more general case when the residual signals are not considered equal. A possible approach might be, for example, to investigate relations between the residual signals for deriving an equally
efficient but more accurate method for coding these signals. For the coding scheme described in Ch. 8, we intend to further decrease the required bit rate by employing variable rate schemes as well as coding with memory. Another important issue for further research is the handling of packets lost during transmission, which is very common in practical implementations. We plan to investigate a recovery scheme, where a lost packet of LSFs can be estimated from the received LSFs of the same channel, or even from other channels of the same recording.
Part III

Appendices
Appendix A

The first steps of Multichannel Audio

In the early 1930s, Bell Labs experimented with the first recordings in surround sound. They set up a three-channel stereo recording of the Philadelphia Orchestra (conducted by Leopold Stokowski) and transmitted the recordings live to a listening room in a theater in Washington, D.C. (also set up by Bell Labs). This project was similar to modern multichannel systems, consisting of a right, a left, and a center channel, but its value was underestimated and it was finally shelved. A few years later, Stokowski met Walt Disney, who was working on Fantasia. During their meeting, they hatched the idea of a cartoon featuring classical music. Stokowski encouraged Disney's engineers to contact Bell Labs in order to use their earlier experimental stereophonic recording system in the film. After the collaboration with Bell Labs was confirmed, Walt Disney had an inspiration. He thought that if some sounds of the movie were heard all around the audience, instead of just in front of them, it would add extra realism to the movie. Thereby, the soundtrack of the film was carried out with a three-channel recording system, where the rear channels were electronically steered in when desired. For this reason, Walt Disney is considered the inventor of surround sound. The resulting know-how was named Fantasound and was demonstrated in New York, Los Angeles, and a few other places, but it did not have the desired effect and it was retired. During World War II, Stalin's government expressed an interest in using it for Soviet films; an agreement was made and all of the equipment was sent to Europe by ship. Unfortunately, this ship was torpedoed by a German U-boat and Fantasound went down to the bottom of the ocean. After the war (around 1950), with the explosion of new technologies, Twentieth Century Fox resurrected the concept of multichannel sound.
They included in the film's package four-channel stereophonic sound (left, center, right, and one mono surround channel). But the cost of this project was very high and it was abandoned. In the 1970s, quadraphonic sound tried to increase the quality of sound, but since psychoacoustic principles
were not taken into account (it was thought of only in terms of playback technology), it failed. Later on, in 1977, Dolby Laboratories introduced a pioneering surround sound application, which was used in the famous film Star Wars; this multichannel audio soundtrack finally caught everyone's attention. It was the first time that a left and a right side channel of a six-channel array were used for low-frequency channel effects, introducing today's ".1" subwoofer. Nowadays, 5.1-channel surround sound tends to be dominant in home theatre systems and in cinema. However, some researchers believe that additional side channels would improve the sound experience. Thus, they experiment with a 10.2-channel system, where they add to the existing 5.1 the following six channels: a left and a right front-wide channel (at 60 degrees from the listener), a left and a right front-high channel (to break the vertical plane), a center rear channel, and an additional subwoofer channel. On the other hand, some listening tests by Swedish Radio indicated that the human ear's directional and frequency perception drops off in side and rear positions; the human auditory system is forward-focused and is not especially perceptive to sources behind the head.
Appendix B

Description of the thesis in Greek

Introduction - Problem Definition

Multichannel audio offers significant advantages over stereophonic audio as far as music reproduction is concerned. The use of a large number of playback channels (loudspeakers) around the listener results in a realistic reproduction of the music, adding multiple acoustic sources and consequently immersing the listener in the acoustic scene. However, the use of more than 2 channels in multichannel systems (compared to the stereophonic recording/playback system) implies more than 2 microphone signals (to be routed to the corresponding loudspeakers). A direct consequence is the increase of the information required to reproduce the music, a significant problem in applications transmitting audio over networks. For this reason, multichannel audio compression algorithms have recently been developed, which reduce not only the information of each channel separately (intra-channel), but also the information that may be shared among more than one channel (inter-channel). The various multichannel audio compression algorithms developed so far, although they reduce the required transmission rate of multichannel audio, are still considered impractical for applications where the transmission bandwidth is particularly low (such as the Internet). An application where current coding techniques do not suffice is the remote, real-time collaboration of musicians. This is considered one of the most important applications of virtual environments today, and it has been shown that high transmission rates are required for the musicians to collaborate without perceptible time delays in the transmission [15]. A multichannel recording is made with more microphones than the final number of channels of the recording. These microphone signals are used for
the synthesis of the multichannel recording, a procedure called mixing, which is performed by experts based mainly on aesthetic criteria and empirical knowledge. Today, transmitting a multichannel recording (e.g., a live concert) via digital radio requires the presence of the mixing expert in the specific venue where the concert takes place. Moreover, the whole procedure must be carried out in that venue. This procedure is thus a problem in practice, and it is one of the reasons that have so far hindered the transmission of live multichannel audio programs. The goal of this thesis is to present a new method that encodes multichannel music at very low transmission rates, enabling its transmission over the Internet or wireless networks. Our method exhibits many advantages, especially for applications such as the remote synthesis of multichannel audio (remote mixing), as well as the remote collaboration of musicians over, e.g., the Internet.

Modeling of the Multichannel Audio Signal (Description - Results)

Before proceeding to the description of the implementation, it is useful to mention some basic facts about the way the music signals were recorded. The musical event that was recorded took place in an orchestra hall in the USA, and sixteen microphones were used at different locations in the concert venue. The goal of the application we will present is the design of a system that reproduces the spatial realism of the location where the recording takes place from a subset of the sixteen recordings. In other words, the goal is the reduction of the bandwidth of the information we wish to transmit, without, of course, losing the quality of the recordings.
We will show that, to achieve this reduction, it is acceptable to replace some recordings with others which, although they sound different, preserve the "objective purpose" of the original recordings. By the term "objective purpose" we mean the basic reason for which a microphone was placed at a specific position. If, for example, we have placed a microphone close to the choir of the concert, its objective purpose is to capture a signal in which the dominant sound is that of the choir, with the rest of the orchestra heard to a lesser degree. If this signal is replaced by another one, synthesized by us, in which the choir still dominates over the other sounds, then we consider that the new recording preserves the objective purpose of the original. We must also consider to what extent the synthesized signal is "accurate", i.e., how faithful it is to the original recording. The technique we propose, while preserving the objective purpose, presents a trade-off between the desired data rate and the accuracy. More specifically, in the previous example, the choir remains the dominant part of the synthesized signal (preservation of the objective purpose), but the other contributors to the musical performance may be heard to a greater degree than in the actual recording. As long as the objective purpose of the recording is preserved, this loss of accuracy is counterbalanced by the low data rate that is achieved. We argue that low data rates can be attained with only a small sacrifice in accuracy.

The main objective of the proposed method is to model the recordings in such a way that the data rate to be transmitted is small. To this end, each recording is modeled using the source/filter model. The audio recording is segmented into short, overlapping frames using a window that slides in time. Each of these frames can be considered a stationary stochastic process, so its spectral envelope can be modeled as a vector of linear prediction coefficients (referred to below as the parameter vector). Specifically, in our model the signal s(n) is related to its p previous samples through the following autoregressive equation:

    s(n) = \sum_{i=1}^{p} a(i) s(n-i) + e(n)    (B.1)

where e(n) is the prediction error (residual signal) and p is the order of the autoregressive model. In the frequency domain, this relation can be written as:

    P_s(\omega) = \frac{1}{|A(\omega)|^2} P_e(\omega)    (B.2)

where P_x(\omega) is the power spectrum of the signal x(n) and A(\omega) is the frequency response of the autoregressive (AR) filter.
This results in a reduction of the amount of information, since each frame is described by a small number of coefficients:

    a = [1, a_1, a_2, \ldots, a_p]^T    (B.3)

For the computation of the coefficients, linear prediction theory was used, and specifically the Levinson-Durbin recursion. The linear prediction error is computed as:

    e(n) = s(n) - \sum_{i=1}^{p} a(i) s(n-i)    (B.4)
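The analysis of (B.1)-(B.4) can be sketched in a few lines of NumPy. This is not code from the thesis but a minimal illustration under the sign convention of (B.1), with prediction coefficients a(i) such that e(n) = s(n) - sum_i a(i) s(n-i); the function names are our own:

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the LP normal equations via the Levinson-Durbin recursion.
    Returns predictor coefficients a(1..p) in the convention of (B.1),
    together with the final prediction error power."""
    a = np.zeros(p + 1)                      # a[0] unused; 1-based indexing
    err = r[0]                               # zeroth-order prediction error
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff.
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]
        err *= (1.0 - k * k)                 # error power update
    return a[1:], err

def lp_residual(s, a):
    """Prediction error e(n) = s(n) - sum_i a(i) s(n-i), as in (B.4)."""
    e = s.astype(float).copy()
    for i, ai in enumerate(a, start=1):
        e[i:] -= ai * s[:-i]
    return e

# Sanity check on a synthetic AR(1) process s(n) = 0.9 s(n-1) + noise(n)
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)
s = np.zeros_like(noise)
for n in range(1, len(s)):
    s[n] = 0.9 * s[n - 1] + noise[n]
r = np.array([np.dot(s[:len(s) - i], s[i:]) for i in range(3)]) / len(s)
a_hat, err = levinson_durbin(r, 2)           # a_hat should be close to [0.9, 0]
res = lp_residual(s, a_hat)                  # residual close to the white noise
```

In a per-frame implementation, the recursion would be run on each windowed frame's autocorrelation, yielding one parameter vector (B.3) per frame.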
Having computed both the linear prediction coefficients of the spectral envelope (B.3) for each recording and the model error (prediction residual), our goal is to transmit the residual of only one recording, together with the spectral envelopes of all the recordings, after first quantizing them. We argue that, given these data, the receiver is able to synthesize the original recordings while preserving their objective purpose, and at the same time a reduction of the data rate is achieved. At this point it is necessary to mention that, in practice, the linear prediction coefficients do not represent the spectral envelope of the audio signal exactly, with the result that the prediction residuals of the recordings are not equal. Nevertheless, we can improve the performance of the autoregressive model by using filter banks. More specifically, we divide the signal obtained from the successive frames of the original recording (with the help of the sliding window) into several frequency bands, and we apply linear prediction analysis to each band in order to compute its spectral envelope. This decomposition is implemented using wavelet transforms. In our effort to show that computing the spectral envelope in several frequency bands with filter banks gives better results than the corresponding computation over the whole frame of the original recording, we used two recordings from a live musical performance that we recorded with a large number of microphones. In one of them mainly the male voices of the choir were captured (the objective purpose of that recording), while in the other mainly the female voices.
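The octave-band splitting described above can be illustrated with the simplest orthogonal wavelet pair, the 2-tap Haar (db1) filters; the thesis uses higher-order Daubechies filters, so this is only a structural sketch with names of our own choosing:

```python
import numpy as np

H = np.array([1.0, 1.0]) / np.sqrt(2.0)      # Haar analysis lowpass (db1)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)     # Haar analysis highpass

def octave_analysis(x, levels):
    """Split x into octave bands [coarsest approximation, detail_1, ...].
    At each level only the lowpass branch is split again, producing the
    dyadic (octave) band structure of a wavelet filter bank."""
    bands = []
    approx = x
    for _ in range(levels):
        lo = np.convolve(approx, H)[1::2]    # lowpass filter + downsample by 2
        hi = np.convolve(approx, G)[1::2]    # highpass filter + downsample by 2
        bands.append(hi)
        approx = lo
    bands.append(approx)
    return bands[::-1]                       # coarsest band first

# A 3-level split of a 1024-sample frame yields 4 bands of 128, 128, 256 and
# 512 samples; linear prediction would then be applied separately per band.
rng = np.random.default_rng(1)
frame = rng.standard_normal(1024)
bands = octave_analysis(frame, 3)
```

Because the Haar pair is orthogonal, the decomposition preserves the signal energy across the bands, which is convenient when comparing per-band residual energies.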
From informal listening tests it was particularly evident that, using 8 frequency bands in the multiband source/filter model and resynthesizing the male voices from the residual signal of the recording with the female voices, we achieved a high-quality resynthesis of the original recordings, without loss of the objective purpose. In contrast, when we applied the same procedure without the use of filter banks, not only was the objective purpose of the recording lost, but the quality of the synthesized signal was also severely degraded. In order to quantify this observation, we measured the distance between the two error signals of the two recordings, using the mutual information as the distance criterion. By definition, the mutual information of two random variables X and Y is given by:

    I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (B.5)

where p(x) and p(y) are the probability density functions (pdfs) of the variables X and Y, and p(x, y) is their joint pdf. It can be shown that:

    I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)    (B.6)

where H(X) is the entropy of the variable X and H(X|Y) is the conditional entropy of X given Y. The mutual information is always non-negative. Since we are interested in comparing the vectors X and Y (with Y the desired response), it is preferable to use the normalized mutual information (NMI), defined as:

    I_N(X; Y) = \frac{H(Y) - H(Y|X)}{H(Y)} = \frac{I(X; Y)}{H(Y)}    (B.7)

It is evident that 0 <= I_N(X; Y) <= 1. The normalized mutual information attains its minimum when the variables X and Y are statistically independent, and its maximum when X = Y. Moreover, although it is not a metric, since it lacks the symmetry property, it is not affected by amplitude variations of the signal (a very important property when comparing audio waveforms).

Figure 1 depicts the normalized mutual information of the spectral magnitudes of the two error signals versus the number of frequency bands, for various orders of Daubechies wavelet filters. In this implementation, a 32nd-order linear prediction model was used, and the signal was segmented into frames of 1024 samples, which corresponds to 23 ms at a sampling rate of 44.1 kHz. For the analysis into frequency bands, an 8th-order prediction filter was used with a frame length of 256 samples in each band. The overlap between frames was found to be optimal at 75%. These parameters were chosen so that the total number of transmitted coefficients of the resynthesized recording remains the same both with and without the analysis into frequency bands.
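A histogram-based estimate of (B.5)-(B.7) can be sketched as follows. This is our own illustration of the definition (the thesis applies the NMI to spectral magnitudes of the error signals); note that the amplitude-invariance property mentioned above also holds for this estimator:

```python
import numpy as np

def normalized_mutual_information(x, y, bins=32):
    """Estimate I_N(X;Y) = I(X;Y)/H(Y) of (B.7) from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # joint pdf estimate
    px = pxy.sum(axis=1)                     # marginal of X
    py = pxy.sum(axis=0)                     # marginal of Y
    outer = px[:, None] * py[None, :]
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / outer[nz]))   # (B.5), in bits
    hy = -np.sum(py[py > 0] * np.log2(py[py > 0]))        # entropy H(Y)
    return mi / hy

rng = np.random.default_rng(2)
x = rng.standard_normal(100000)
y = rng.standard_normal(100000)
nmi_same = normalized_mutual_information(x, x)         # identical signals -> 1
nmi_indep = normalized_mutual_information(x, y)        # independent -> near 0
nmi_scaled = normalized_mutual_information(3.0 * x, x) # amplitude-invariant
```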
Moreover, for these parameter values, the total number of coefficients of the synthesized recording is 8 times smaller than the total number of audio samples. Examining the values of the normalized mutual information in Figure 1 confirms that our hypothesis holds. Specifically, the normalized mutual information obtained when the whole frame of the original recording is analyzed (fullband) is clearly lower than the value obtained with 4 frequency bands and a 40th-order wavelet filter. Furthermore, observing in Figure 1 the curves for the various orders of Daubechies wavelet filters, we conclude that the results improve slightly as the filter order increases.

[Figure 1: The normalized mutual information (median segmental NMI) of the spectral magnitudes of the two error signals versus the number of frequency bands, for Daubechies wavelet filters of various orders.]

We observe that as the order of the wavelet filters increases, better frequency separation is achieved and, consequently, better modeling of the spectral envelope; this means that the modeling errors become more similar to each other and that their normalized mutual information takes a higher value.

To evaluate the performance of the proposed model we conducted listening tests, in which 17 volunteers participated individually, using good-quality headphones (Philips SBC HP 800). We used two recordings from a live musical performance: in one of them mainly the male voices of the choir were captured (the objective purpose of that recording), while in the other mainly the female voices. We selected three musical excerpts from this performance (each about 10 seconds long), referred to below as Signals 1, 2 and 3, in which the two recordings (male and female voices) could be distinguished very easily. For each signal we designed tests of the ABQ type, where A and B were the signals with the male and the female voices (in random order). We asked the listeners to match the signal Q to either A or B with respect to which voices dominate. In the position of signal Q we placed four different kinds of filter banks (three based on wavelet transforms and one based on the MDCT; more information is given in Chapter 3):

ABQ-1: 40th-order Daubechies filters, with 8 frequency bands
ABQ-2: 4th-order Daubechies filters, with 8 frequency bands
ABQ-3: 40th-order Daubechies filters, with 2 frequency bands
ABQ-4: MDCT filter bank, with a KBD window and 32 frequency bands

For each of the above tests we used all three excerpts (Signals 1-3), so each listener was subjected to 12 tests in total. The results of the tests are shown in Table 1.

Table 1: The results of the ABQ tests.

                      ABQ-1   ABQ-2   ABQ-3   ABQ-4
    Correct matching   86%     63%     10%      8%

It is evident that the filter banks based on wavelet transforms, with 8 frequency bands and a small degree of aliasing (see the 40th-order Daubechies filters), give very good results, without excluding the possibility of further improvement. On the other hand, when a high degree of aliasing is present (ABQ-2) or when the number of frequency bands is reduced (ABQ-3), a reduction of the model performance is observed, both with respect to the dominance of the voices and with respect to the final quality (a fact that the listeners also pointed out). In the case of the MDCT filter bank, the "accuracy" of the results decreased noticeably, while the quality remained unaffected. We thus observe that the analysis in octave bands gives particularly good results, in contrast to the analysis in equal frequency bands; this can be attributed to the fact that the linear prediction algorithm is particularly error-prone in the low frequency bands.
We note at this point that in informal experiments one more kind of filter bank was tested, the Laplacian pyramid [57], which gave results similar to those of the wavelet transforms. The choice of the filter bank, as well as the choice between octave-band and equal-band analysis, are two issues that are part of our laboratory's current research. We also conducted listening tests of the DCR (Degradation Category Rating) type [58], which took place under the same conditions as the ABQ tests. Our goal was to assess the quality of the recordings resynthesized with the proposed model.
We asked the listeners to rate the Signals on a scale from 1 to 5 (1 for the lowest quality and 5 for the highest) with respect to the original recording; Signals 1-3 were created with 40th-order Daubechies filters and 8 frequency bands. The results are depicted in Figure 2. The analysis was performed with 95% confidence intervals, where the x marks depict the mean value of the given ratings, while the horizontal lines show the bounds of the confidence intervals. It is evident from the figure that the quality of the recordings reconstructed by the proposed model is high.

[Figure 2: Results of the listening tests evaluating the proposed multichannel audio modeling method for Signals 1-3, on the scale: very annoying / annoying / slightly annoying / perceived but not annoying / not perceived.]

Multichannel Audio Signal Coding (Description and Results)

In the previous section we mentioned that, before transmitting the spectral envelopes of the recordings, we quantize them, aiming at a more efficient compression of the data. This quantization procedure is based on a speech coding system [17], which approximates the probability density function of the source (providing a compact representation of the data to be transmitted) and quantizes the model parameters efficiently at various transmission rates. First, we convert the linear prediction coefficients of each spectral envelope into another set of parameters, the Line Spectral Frequencies (LSF). The LSF parameters (see the corresponding chapter) have a one-to-one correspondence with the linear prediction parameters (which model the spectral envelope), but with significant advantages for coding applications. The LSFs of the channel are used as training data for a parametric model of the probability density function of the parameters. The training of the model was performed with the Expectation-Maximization algorithm (Chapter 6). The model we use, which is very widely used for modeling similar parameters of audio and speech signals, is the Gaussian Mixture Model (GMM). A GMM represents the probability density of a random vector as:

    g(x) = \sum_{i=1}^{m} p_i N(x; \mu_i, \Sigma_i)    (B.8)

where N(x; \mu_i, \Sigma_i) is the multivariate normal distribution with mean vector \mu_i and covariance matrix \Sigma_i, m is the number of component distributions (often referred to as classes), and p_i is the prior probability that the observation x belongs to class i.

The Karhunen-Loève transform (KLT) is used for the decorrelation of the LSF parameters, since it is considered particularly suitable for data modeled with a GMM. Specifically, we apply the KLT to the covariance matrices \Sigma_i and compute the decorrelation matrices Q_i as follows:

    \Sigma_i = Q_i \Lambda_i Q_i^T    (B.9)

where i = 1, ..., m and \Lambda_i = diag(\lambda_{i,1}, \lambda_{i,2}, ..., \lambda_{i,p}). In other words, \Lambda_i is a diagonal matrix whose elements are the eigenvalues of the covariance matrix \Sigma_i, while Q_i is the matrix of the eigenvectors of \Sigma_i. Thus, the KLT transforms the LSF vector z_k (corresponding to time instant k) into a vector w_k with decorrelated coordinates:

    w_k = Q_i^T (z_k - \mu_i)    (B.10)

Correspondingly, the inverse Karhunen-Loève transform (IKLT) recovers the vector z_k from the decorrelated w_k as follows:

    z_k = Q_i w_k + \mu_i    (B.11)

After the decorrelation of the data, and before their re-correlation, a non-uniform quantizer is used, implemented as the cascade of a compressor, a uniform quantizer and an expander.
The compressor is given by the relation [59]:

    c(w_i) = \frac{1}{2} \left( 1 + \mathrm{erf}(w_i / \sqrt{6}) \right)    (B.12)

where w_i are the coordinates of w_k, while the expander is:

    c^{-1}(q_i) = \sqrt{6} \, \mathrm{erf}^{-1}(2 q_i - 1)    (B.13)

where q_i are the coordinates of the quantized vector q_k. The uniform quantizer, in turn, is implemented using the following allocation of bits to each of the m classes of the GMM:

    b_i = b_{tot} - \log_2 \left[ \sum_{j=1}^{m} (p_j q_j)^{\frac{p}{p+2}} \right] + \frac{p}{p+2} \log_2 (p_i q_i), \quad i = 1, \ldots, m    (B.14)

where b_{tot} is the total number of bits for the quantization of the source (set by the user), b_i is the number of bits assigned to each of the m classes, p is the dimension of the LSF vectors, p_i is the prior probability, and q_i is the geometric mean of the eigenvalues of the covariance matrix, given by:

    q_i = \left[ \prod_{j=1}^{p} \lambda_{i,j} \right]^{1/p}, \quad i = 1, \ldots, m    (B.15)

A. Quantization using the parameters of one class

For the quantization of the LSF parameters, say z_k, the following procedure is applied (see Figure 3):

- Subtract the mean vector \mu_i from the vector z_k.
- Decorrelate the resulting vector using the matrix Q_i^T.
- Use an exponential compressor to limit the range of values.
- Apply a uniform scalar quantizer.
- Use the inverse of the compressor (expander) to restore the now quantized data to their original range of values.
- Re-correlate the resulting vector using the matrix Q_i.
- Add the mean vector \mu_i.
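The per-class quantization chain just listed, together with the bit allocation of (B.14), can be sketched as follows. This is our own illustration, not the thesis implementation: the 3-dimensional "class" (mu, Sigma) is invented for the example, and since the standard library has no inverse erf, the expander is realized with a bisection search:

```python
import math
import numpy as np

def compress(w):                                  # compressor, (B.12)
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(6.0))) for v in w])

def expand(q):                                    # expander, (B.13), by bisection
    def inv(target, lo=-20.0, hi=20.0):
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if 0.5 * (1.0 + math.erf(mid / math.sqrt(6.0))) < target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return np.array([inv(v) for v in q])

def bit_allocation(b_tot, priors, geo_means, p):  # bit allocation, (B.14)
    alpha = p / (p + 2.0)
    s = sum((pi * qi) ** alpha for pi, qi in zip(priors, geo_means))
    return [b_tot - math.log2(s) + alpha * math.log2(pi * qi)
            for pi, qi in zip(priors, geo_means)]

# One hypothetical GMM class: mean mu and covariance Sigma, cf. (B.9)-(B.11)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.8, 0.2], [0.8, 1.5, 0.3], [0.2, 0.3, 1.0]])
lam, Q = np.linalg.eigh(Sigma)                    # Sigma = Q diag(lam) Q^T

z = mu + np.array([0.3, -0.4, 0.2])               # an LSF-like parameter vector
w = Q.T @ (z - mu)                                # KLT: decorrelate, (B.10)
c = compress(w)                                   # limit range to (0, 1)
bits = 6                                          # uniform quantizer resolution
levels = 2 ** bits
q = (np.clip(np.floor(c * levels), 0, levels - 1) + 0.5) / levels
z_hat = Q @ expand(q) + mu                        # expander + IKLT, (B.11)
```

The reconstruction z_hat stays close to z, and the allocation of (B.14) distributes the budget so that the per-class budgets satisfy sum_i 2^{b_i} = 2^{b_tot}.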
[Figure 3: Quantization of the LSF vector z_k using the parameters of the i-th class. Block diagram: z_k -> subtract \mu_i -> KLT (decorrelator Q_i^T) -> compressor -> uniform quantizer -> expander -> IKLT (correlator Q_i) -> add \mu_i -> \hat{z}_k.]

B. Overall quantization

The above procedure is repeated for each of the m classes (see Figure 4), and we select the quantized vector that deviates least from the original. The deviation of each vector from the original is computed using the Log Spectral Distortion (LSD), which is given by the relation:

    LSD(i) = \left( \frac{1}{F_s} \int_0^{F_s} \left[ 10 \log_{10} \frac{S(f)}{\hat{S}^{(i)}(f)} \right]^2 df \right)^{1/2}    (B.16)

where F_s is the sampling rate and S(f), \hat{S}^{(i)}(f) are, respectively, the power spectra of the original vector z_k and of the quantized vector \hat{z}_k^{(i)} corresponding to class i (i = 1, ..., m). The quantized LSF vector with the smallest LSD deviation is then transmitted to the receiver, where it is converted into the corresponding linear prediction coefficients for the reconstruction of frame k of the recording.

[Figure 4: Overall quantization among the m classes: the vector z_k is quantized with the parameters of each of the clusters 1, ..., m, and the result \hat{z}_k^{(i)} with the minimum LSD is selected.]

From informal listening tests of the coding scheme described above, we concluded that for data rates on the order of 10 kbps, the quality of the coded recording was similar to the quality of the original. Using a sampling rate of 44.1 kHz and 40th-order Daubechies filters, we divided the frequency spectrum into 8 octaves. The LSF parameters of each octave were modeled with a GMM of 16 Gaussian distributions. The GMM parameters were estimated using a large training set of vectors for each frequency band. With these parameters, and for various data rates, the results of Table 2 (bit rate vs. LSD) were obtained. An example of the allocation of the total bits (b_tot) to each frequency band in the 10 kbps case is given in Table 3. The rate of 10 kbps is the minimum that has been achieved, although we believe it can be reduced further. For comparison, other algorithms that address multichannel audio coding (e.g., Dolby AC-3 and MPEG-2 AAC) achieve 64 kbps per channel for high-quality coding.

To examine the effect of coding the parameters of the proposed model, we conducted DCR listening tests in which 20 volunteers participated individually. Three recordings (Signals 1-3) from a classical music concert were used, coded at data rates of 10 and 20 kbps. The ratings given by the listeners are depicted in Figure 5, where, as in the previous DCR test (see Figure 2), 95% confidence intervals were used. We conclude that the coding system we used gives high-quality results (of similar quality to the original recording) without introducing serious degradation of the audio quality.

[Figure 5: Results of the DCR listening tests evaluating the coding method, at 10 and 20 kbps, for Signals 1-3.]
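The LSD selection criterion of (B.16) can be approximated on a discrete frequency grid. The sketch below is our own illustration (using AR envelopes 1/|A(w)|^2 as in (B.2), with invented candidate coefficient sets standing in for the per-class quantized vectors):

```python
import numpy as np

def ar_power_spectrum(a, nfft=512):
    """Envelope P(w) = 1 / |A(w)|^2 for predictor coefficients a, cf. (B.2)."""
    A = np.fft.rfft(np.concatenate(([1.0], -np.asarray(a, dtype=float))), nfft)
    return 1.0 / np.abs(A) ** 2

def log_spectral_distortion(a, a_hat, nfft=512):
    """Discrete approximation of (B.16), in dB."""
    d = 10.0 * np.log10(ar_power_spectrum(a, nfft) /
                        ar_power_spectrum(a_hat, nfft))
    return float(np.sqrt(np.mean(d ** 2)))

# Select, among candidate quantized coefficient sets, the one with minimum LSD,
# as in the overall quantization step of Figure 4.
a_orig = [0.9, -0.4]
candidates = [[0.88, -0.38], [0.7, -0.1], [0.95, -0.6]]
lsds = [log_spectral_distortion(a_orig, c) for c in candidates]
best = int(np.argmin(lsds))                  # index of the closest candidate
```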
[Table 2: The results in terms of the LSD deviation (dB) for various data rates (kbits/sec).]

[Table 3: An example of the allocation of the total bits (b_tot) to each frequency band during the coding procedure in the 10 kbps case (band number, bits per band, and resulting LSD in dB).]

Conclusions

In this thesis we presented a new method for modeling and coding multichannel audio, which achieves low transmission rates. The proposed algorithm encodes multichannel recordings using a source/filter model, which separates the information that is common among the channels from the information that characterizes each channel exclusively. We argue that the common information needs to be transmitted for only one channel, achieving in this way a particularly noticeable reduction of the information that must be transmitted. Finally, we note that this algorithm, while preserving the "objective purpose" of the recording, balances the desired data rate against "accuracy"; we argue that low transmission rates can be achieved without serious losses in the "accuracy" of the original recording.
Bibliography

[1] A. J. Bower, Audio in digital radio - the Eureka 147 DAB system, Electronic Engineering, April.
[2] D. Yang, H. Ai, C. Kyriakakis, and C.-C. J. Kuo, High-fidelity multichannel audio coding with Karhunen-Loève transform, IEEE Trans. Speech and Audio Processing, vol. 11, July 2003.
[3] C. Faller and F. Baumgarte, Binaural cue coding - Part II: Schemes and applications, IEEE Trans. Speech and Audio Processing, vol. 11, November 2003.
[4] J. Herre, P. Kroon, C. Faller, and S. Geyersberger, Spatial audio coding - an enabling technology for bitrate-efficient and compatible multichannel audio broadcasting. To be published in the Society of Motion Picture and Television Engineers (SMPTE) Journal.
[5] Alfred Mertins, Signal Analysis: Wavelets, Filter Banks, Time-Frequency Transforms and Applications. John Wiley & Sons Ltd.
[6] Olivier Rioul and Martin Vetterli, Wavelets and signal processing, IEEE Signal Processing Magazine, vol. 8, October 1991.
[7] J. D. Johnston and A. J. Ferreira, Sum-difference stereo transform coding, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP).
[8] K. Brandenburg and M. Bosi, ISO/IEC MPEG-2 advanced audio coding: overview and applications, in AES 103rd Convention, New York, September, AES preprint.
[9] M. Bosi, K. Brandenburg, S. Quackenbush, L. Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Davidson, and Y. Oikawa, ISO/IEC MPEG-2 advanced audio coding, in AES 101st Convention, Los Angeles, CA, November, AES preprint.
[10] Ted Painter and Andreas Spanias, Perceptual coding of digital audio, Proceedings of the IEEE, vol. 88, April 2000.
[11] J. Herre, K. Brandenburg, and D. Lederer, Intensity stereo coding, in Proc. 96th Convention of the Audio Engineering Society (AES), preprint no. 3799.
[12] Digital Audio Compression Standard (AC-3), ATSC Document A/52.
[13] M. Davis, The AC-3 multichannel coder, in AES 95th Convention, New York, October, AES preprint.
[14] F. Baumgarte and C. Faller, Binaural cue coding - Part I: Psychoacoustic fundamentals and design principles, IEEE Trans. Speech and Audio Processing, vol. 11, November 2003.
[15] A. Sawchuk, E. Chew, R. Zimmermann, C. Papadopoulos, and C. Kyriakakis, From remote media immersion to distributed immersive performance, in Proc. ACM SIGMM Workshop on Experiential Telepresence (ETP), Berkeley, CA, November.
[16] K. Karadimou, A. Mouchtaris, and P. Tsakalides, Multichannel audio modeling and coding using a multiband source/filter model, in 39th Annual Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, October 2005.
[17] A. D. Subramaniam and B. D. Rao, PDF optimized parametric vector quantization of speech line spectral frequencies, IEEE Trans. Speech and Audio Processing, vol. 11, March 2003.
[18] D. Yang, High Fidelity Multichannel Audio Compression. PhD thesis, University of Southern California, August.
[19] ISO/IEC JTC1/SC29/WG11 (MPEG), Document N7390, Tutorial on MPEG Surround audio coding, Poznan, Poland, July.
[20] J. Herre, J. Hilpert, C. Ertel, A. Hoelzer, C. Spenger, S. Disch, K. Linzmeier, and C. Faller, An Introduction to MP3 Surround. Fraunhofer Institute for Integrated Circuits IIS, Germany.
[21] Sirius Satellite Radio.
[22] XM Satellite Radio.
[23] C. Sidney Burrus, Ramesh A. Gopinath, and Haitao Guo, Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice Hall.
[24] Martin Vetterli and Jelena Kovačević, Wavelets and Subband Coding. Prentice Hall Signal Processing Series.
[25] G. Strang and T. Nguyen, Wavelets and Filter Banks. Wellesley-Cambridge.
[26] Thomas Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice. Prentice Hall, second ed.
[27] P. P. Vaidyanathan, Multirate digital filters, filter banks, polyphase networks, and applications: a tutorial, Proceedings of the IEEE, vol. 78, January 1990.
[28] Helmut Bölcskei and Franz Hlawatsch, Oversampled cosine modulated filter banks with perfect reconstruction, IEEE Trans. Circuits and Systems II (Special Issue on Multirate Systems, Filter Banks, and Wavelets), vol. 45, August 1998.
[29] Ying-Jui Chen and Kevin Amaratunga, How to complete paraunitary filter banks and simultaneously preserve linear phase?, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Montreal, Canada, May 2004.
[30] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing. Prentice Hall Signal Processing Series, second ed.
[31] Peter N. Heller, Tanja Karp, and Truong Q. Nguyen, A general formulation of modulated filter banks, IEEE Trans. Signal Processing, vol. 47, April 1999.
[32] Gerald D. T. Schuller and Tanja Karp, Modulated filter banks with arbitrary system delay: efficient implementations and the time-varying case, IEEE Trans. Signal Processing, vol. 48, March 2000.
[33] Chi-Min Liu and Wen-Chieh Lee, A unified fast algorithm for cosine modulated filter banks in current audio coding standards, Journal of the Audio Engineering Society, vol. 47, December 1999.
[34] Ye Wang and Miikka Vilermo, The modified discrete cosine transform: its implications for audio coding and error concealment, in AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, Espoo, Finland, June 2002.
[35] Ralf Geiger and Gerald Schuller, Integer low delay and MDCT filter banks, in Asilomar Conference on Signals, Systems and Computers.
[36] Tapan K. Sarkar and C. Su, A tutorial on wavelets from an electrical engineering perspective, Part 2: the continuous case, IEEE Antennas & Propagation Magazine, vol. 40, December 1998.
[37] S. Mallat, A Wavelet Tour of Signal Processing. Academic Press, second ed.
[38] C. K. Chui, An Introduction to Wavelets. Academic Press.
[39] S. Haykin, Adaptive Filter Theory. Prentice Hall.
[40] F. K. Soong and B. H. Juang, Line spectrum pair (LSP) and speech data compression, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP).
[41] K. K. Paliwal and B. S. Atal, Efficient vector quantization of LPC parameters at 24 bits/frame, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP).
[42] R. Viswanathan and J. Makhoul, Quantization properties of transmission parameters in linear predictive systems, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-24, June.
[43] A. Gray and J. Markel, Quantization and bit allocation in speech processing, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-24, December 1976.
[44] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley Interscience, second ed., November.
[45] D. A. Reynolds and R. C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Trans. Speech and Audio Processing, vol. 3, January 1995.
[46] A. Kain and M. W. Macon, Spectral voice conversion for text-to-speech synthesis, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Seattle, WA, May 1998.
[47] J. K. Su and R. M. Mersereau, Coding using Gaussian mixture and generalized Gaussian models, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP).
[48] Eric W. Weisstein, Matrix diagonalization. From MathWorld - A Wolfram Web Resource.
[49] D. Tretter and C. A. Bouman, Optimal transforms for multispectral and multilayer image coding, IEEE Trans. Image Processing, vol. 4, March 1995.
[50] A. Herrara, M. Martinez, and O. Sanchez, An acoustic isolated speech recognition approach using KLT and VQ, in ICSPAT '97, San Diego, CA, September 1997.
[51] G. Martinelli, L. P. Ricolti, and G. Marcone, Neural clustering for optimal KLT image compression, IEEE Trans. Signal Processing, vol. 41, April 1993.
[52] C. C. T. Chen, C. T. Chen, and C. M. Tsai, Karhunen-Loève transform for text independent speaker recognition, in 1997 Intl. Symposium on Communications, Hsinchu, Taiwan, December 1997.
[53] N. Tsapatsoulis, V. Alexopoulos, and S. Kollias, A vector based approximation of KLT and its application to face recognition, in EUSIPCO-98, vol. 3, Island of Rhodes, Greece, September 1998.
[54] A. Mouchtaris, S. S. Narayanan, and C. Kyriakakis, Virtual microphones for multichannel audio resynthesis, EURASIP Journal on Applied Signal Processing, Special Issue on Digital Audio for Multimedia Communications, vol. 2003:10, 2003.
[55] J. Laroche and J.-L. Meillier, Multichannel excitation/filter modeling of percussive sounds with application to the piano, IEEE Trans. Speech and Audio Processing, vol. 2, 1994.
[56] C. Shekhar and R. Chellappa, Experimental evaluation of two criteria for pattern comparison and alignment, in Proc. Fourteenth International Conference on Pattern Recognition, vol. 1, Brisbane, Australia, August 1998.
[57] P. J. Burt and E. H. Adelson, The Laplacian pyramid as a compact image code, IEEE Trans. Communications, 1983.
[58] W. B. Kleijn and K. K. Paliwal, eds., Speech Coding and Synthesis. Elsevier Science.
[59] J. A. Bucklew and N. C. Gallagher, A note on the computation of optimal minimum mean-square error quantizers, IEEE Trans. Communications, vol. 30, no. 1, 1982.