Compensation Approaches for Far-field Speaer Identification Qin Jin, Kshitiz Kmar, Tanja Schltz, and Richard Stern Carnegie Mellon University, USA {qjin,shitiz,tanja,rms}@cs.cm.ed Abstract While speaer identification performance has improved dramatically over the past years, the presence of interfering noise and the variety of channel conditions pose a major obstacle. Particlarly the mismatch between training and test condition leads to severe performance degradations. In this paper we explored several approaches to compensation for the effects of reverberation inclding compensation sing linear post-filtering and frame-base score competition.. Introdction Over the years atomatic Speaer IDentification (SID) has developed into a rather matre technology that is crcial to a large variety of spoen langage applications. However, SID systems still lac robstness, i.e. their performance degrades dramatically when the acostic training data mismatch with the given test conditions [][]. Robstness is crrently the major challenge for real-world applications of speaer recognition. We proposed a reverberation compensation approach sing cepstral post-filtering [8]. The goal of this processing was to minimize the sqared distortion between training and testing featres by passing the cepstral featres that emerge from the featre extraction process throgh a linear filter. As noted above, the filter was designed to minimize mean sqare distortion between cepstral in the training and testing conditions. We also proposed the Frame based Score Competition (FSC) approach in [5] to improve speaer recognition in far-field sitations. In this paper we frther elaborate this approach by adding simlated data on a large variety of noise conditions, i.e. we artificially create additional data by applying a filter approach and extend the nmber and variety of models for the competition approach. The paper is organized as follows: In the next section we describe the reverberation compensation approach sing cepstral post-filtering and experimental reslts on the YOHO database. Section 3 describes the Frame-based Score Competition approach and shows the experimental reslts on a live FarSID database and section 4 concldes the paper. featres derived from the corresponding clean speech. Let x[n] represent the cepstral featres of a speech waveform (and not the original waveform itself), and let y [n] represent the corresponding cepstral featres after ndergoing room reverberation. Becase reverberation can be thoght of as the convoltion of the inpt speech with the effective implse response of a room, there wold be a constant difference between x[n] and y [n] if these featres represented long-term cepstra of the entire waveform. When x[n] and y [n] are cepstral coefficients of brief segments of speech (as in shorttime Forier analysis), there is an interaction between the speech and the analysis window and the difference between x[n] and y [n] is no longer constant. For simplicity, we propose that the reverberated cepstral featres y [n] can be represented as the convoltion between the inpt cepstra x[n] and the cepstral coefficients h[n] representing the effects of the room: y N h [] n = x[] n + h[][ i x n i] i= [ n] M n Referring to Figre, x = x = represents clean speech featres and y [ ] M = y n n= represents the corresponding reverberated featres. The sbscript in y indicates ncompensated speech in the testing environment. Ths, the assmption in () implies a linear filter in the cepstral featre h = h L. Note that in this h N h domain with filter taps being [ ] representation h[] =. As a reslt of reverberation, system training will be performed on the featres of clean speech x bt testing will se reverberated featres y. We define the instantaneos ncompensated distortion d[n] to be d [] n = x[] n y [] n = x[] n h[] n x[] n = h[][ i x n i] N h i= (The second eqality is valid becase h[] =.) () (). Reverberation Compensation sing Cepstral Post-Filtering.. Mathematical Representation of Reverberation This section provides a representation of reverberated speech in terms of the corresponding clean speech. The SID system wors in conventional fashion, by extracting featres from the signal and determining which of a set of trained models provides the best match to an ensemble of incoming featres. In order to maintain high SID accracy, it is desirable that the featres derived from reverberated speech closely match Figre : Bloc diagram representing the model of reverberation and compensation We compensate for the effects of reverberation by imposing a finite-implse response LTI filter on the observed featres (p in Figre ). We refer to the otpts of these featres as compensated, and we se the notations x c and y c to indicate
featres representing compensated clean speech and reverberated speech, respectively. We define the instantaneos compensated distortion d c [n] to be the difference between x c [n] and y c [n]: d n = x n y n = p n x n p n h n x n = c [] c[] c[] [] [] [] [] [ ] N p p[] n d [] n = p[] j d [ n j] j = Where Np is the nmber of taps in the p filter. We see to obtain the optimal p filter which, when applied to both x and y, minimizes the mean sqare compensated distortion d c [n] as defined above... Soltion to the Minimization Problem In this section, we determine the optimal p filter as defined in Sec... We define the objective for optimization to be the minimm expected distortion between the compensated training and testing featres, and we find p to minimize E[ d c [ n ]. Using (3), obtain E d c [] n as below: [ ] [ ] c p[ i] p[ j] E d[ i] d[ j] d = E d n c = i, j NP d in (4), the terms [ [ ] [ ] For evalating c E d m d n can be obtained by sing () as below: E d [ m] d [ n] = E h[ i] h[ j] x[ m i] x[ n j] (5) i, j Nh Frther, assming that E h[ i] h[ j] = σδ[ i j], σ (6) E hih [ ] [ jxmxn ] [ ] [ ] = E hih [ ] [ j] E xmxn [ ] [ ], ijmn,,, with δ being Kronecer delta, we can obtain E[ d [ m] d[ n ] in (5) as E[ d [ m] d [ n] ] = N σ R [ n m] (7) h x where Rx is the atocorrelation seqence of x. Sbstitting (7) into (4), we obtain d = N σ c h p[] i p[ j] Rx[ i j] (8) i, j Np We can differentiate (8) with respect to p to find the optimal p bt this will reslt in the optimal p being : if all the elements in p are eqal to, all featres in x and y will be mapped to, and the mean sqare distortion d c will always be zero as well. While this is clearly the optimal soltion in the mathematical sense, it is not a sefl soltion. In order to avoid the degenerate soltion p = we frther constrain p: N p p[ j] (9) The constraint in (9) means that the p filter mst have non-zero DC gain. Next, the fact that the p filter will be applied to both training and testing implies that scaling featres by the same factor in both training and testing will leave the SID accracy nchanged. This implies that we lose no generality by sing the more specific constraint on p N p p[ j] = () (4) (3) To minimize d c in (8) nder (), we constrct a Lagrangian optimization criterion as below: Λ ( p, λ) = N σ h p[ i] p[ j] Rx[ i j] + i, j Np N p () λ( p j ) [ ] Differentiating () with respect to [,λ] p and eqating the differentials to zero, we can obtain the optimal p as below: R [] R [] L R [ N ] x x x p R [] R [] R [ N 3] x x x p L L L L L R [ N ] R [ N 3] L R [] x p x p x L () p[] p[] L = L pn [ ] p λ ' λ where λ = Rx[]. Note that the nnown in N hσ N σ h de to the reverberation filter h has been incorporated into λ '. For later reference and compactness, we write () eqivalently as (3). T p Φ = b (3) λ We note that Rx, the atocorrelation seqence of the clean featres nderlying the reverberated featres, is the only nnown reqired to find p in (3), and specifically that the optimal soltion does not depend on the reverberation filter h. We have ths designed an optimal post-filter p which is invariant to the reverberation de to h and ths far depends only on Rx. Of corse, an SID system operating in a reverberant environment can observe directly the reverberated featres y, bt the atocorrelation of the clean speech Rx is not directly observable. Nevertheless, we can approximate Rx as the atocorrelation seqence obtained from clean featres extracted from training data: R [ m] R [ m], m=, L N (4) x T p where R T [m] is the atocorrelation seqence obtained from clean featres in training data. Combining (4) and (), we can solve for p, so the p is invariant to the nderlying clean featres in x. The Φ matrix in (3) is Hermitian bt not Toeplitz, its invertibility garantees a soltion that is both optimal and niqe for p. Φ was fond to be invertible in SID evalations so combining (), (6) (9), and (4), we claim that we have developed an optimal and niqe soltion for p which is not only invariant to the reverberation de to h bt also invariant to the nderlying clean featres in x. This invariance greatly simplifies or SID system. We can design p sing training featres alone and se p to generate compensated training featres x c, as in Figre. The processed featres x c are sed for training speaer models. Dring testing we apply the same filter p to the observed testing featres in y and generate compensated testing featres y c, again as in Figre. Becase the same p is applied across all reverberation conditions, no
modification of the filter design needs to be done for any particlar reverberant environment..3. EXPERIMENTAL VERIFICATION.3.. Experimental Procedres We developed the simlated FarSID database for this stdy, which was obtained by degrading clean speech from the English portion of Verbmobil I (VMI) database. The tterances consisted of high-qality recordings of spontaneosly spoen face-to-face dialogs between two speaers who are discssing appointment schedling and travel arrangements []. The data were recorded at a sampling rate of 6 Hz, with 6-bit resoltion, and PCM encoding. For or stdy we selected a sbset of male speaers from the VMI database based on the drations of their tterances. The speaers contribtions are segmented by trns, where each file contains one trn. The trns are annotated for the dialog sessions to accommodate withinand between-session testing. In a second step we distorted the clean speech data by convolving it with simlated implse responses that represent rooms of different size and reverberation time, and with differing distances between the sond sorce and the simlated microphones. The choice of the far-field conditions and the degradation process itself are described in detail in the next two sbsections. The far-field conditions sed to compile this database were based on the reslts of a series of pilot stdies. In these stdies, we passed tterances from the YOHO database [7] throgh simlated room implse responses of varios types. The reslts of the pilot stdy were sed to determine the ranges of the ey parameters of room dimensions, reverberation time, and distance between sorce and microphone that were both compact and meaningfl in terms of the type of reverberation introdced. We ltimately selected three different room sizes, for different reverberation times, and five different distances between sorce and microphone, which were observed to have a significant impact on speaer identification performance nder simlated distorted conditions. The reslting far-field conditions are listed in Table. In total, the FarSID database contains 45 far-field conditions pls the original clean speech condition. We name each of the far-field conditions in the format <room><reverberation><distance>. For example, SR5D refers to a medim conference room with reverberation time.5 seconds and a sorce-microphone distance of centimeters. We distorted the clean speech by convolving it with the simlated Room Implse Response (RIR) generated sing [6]. While the RIR program incorporates the dependence of room implse response on many different physical characteristics of the room and environment, we focsed on three major attribtes, the room size, reverberation time, and distance between the sorce and microphone, sing the parameter vales in Table. We sed the standard definition of reverberation time as the time time reqired for the acostic signal power in the room to decrease by 6 db when a sond sorce is trned off..3.. Experimental Reslts 94 9 9 88 86 84 8 UnCompensated Compensated SR3D5 SR3D SR3D SR3D4 SR5D5 SR5D SR5D SR5D4 SRD5 SRD SRD SRD4 SRD5 SRD SRD SRD4 Figre : SID performance in a medim-sized room with post-filtering reverberation compensation Figre presents SID performance with the Reverberation Compensation algorithm applied across conditions in Medim Room. The compensation algorithm attempts to find an optimal casal linear filter which minimizes sqared distortion de to mismatch in training and testing conditions. Under certain assmptions, the optimal filter is invariant to the RIR as well as invariant to the test speech nder consideration. Compensation is applied individally to different cepstral featres. The nmber of taps in the optima filter was taen as 5. The filter has to be applied dring testing as well as in training and ths new speaer models need to be bilt by sing filtered training speech. Comparing the reslts in Figre 5, we see that applying compensation algorithm provides sbstantial improvement in SID accracy. We see that on an average across different conditions, applying compensation provides a relative redction in error by % with the highest relative redction being %..4. DISCUSSION In this section we discss some of the assmptions made in this approach. At first we assmed a representation for reverberated featres in (). As featres are generated by windowing on overlapping segments of speech signal, an exact relationship between reverberated and clean featres is hard to find. We therefore, needed approximations and borrowed () from representation of reverberation as a LTI filer in time domain. h = was chosen to eep the problem analytically attractable. The assmption of (6) essentially means that the freqency response of the h filter is flat. This wold occr if the RTs were constant over all freqencies. This assmption was made in simlating the speech data that was sed for the reslts in Figre, bt it is not empirically valid, as noted above. Nevertheless, the algorithm was also sccessfl for the environmental conditions smmarized in Figre. These data indirectly validate the proposition that the assmption of (4), althogh physically invalid, still can prodce sefl compensation reslts in practice. We can obtain an estimate of nmber of taps in the optimal p as the nee in the crve describing the dependence of d c on Np.
Althogh d c decreases with a larger nmber of taps, the dissimilarity among compensated featres for different speaers also decreases. As Np becomes very large, the p filter converges to be a niform moving average filter that smoothes ot all the data and redces every featre to its mean vale, which is zero for the MFCC featres nder consideration. This redces the SID decisions to a random gess. For these reasons, we expect a local maximm in performance as a fnction of Np. Next, we note that the optimization constrction in section. garantees that for optimal p, the mean sqared distortion for compensated case d c [] n ], is never greater than that for ncompensated case E[ d [] n ]. Frther, the optimal p is a linear phase filter. The post-filtering algorithm was also applied directly to the speech signal in the time domain bt in this case ncompensated case otperformed compensated case. This indicates that the modeling and assmptions in Section. and. is not easily generalizable to the time domain. Or approach for dereverberation is somewhat similar to Wiener Filtering at a conceptal level. The major digression from the Wiener filter is that the p is applied to both training and testing speech, which leads to different reqirements and soltions to the problem. While or approach is invariant to the detailed natre of the reverberation this is not the case for Wiener Filtering. We will consider generalizations of or approach in later stdies. We will also apply the algorithm for speaer verification tass. 3. Frame-base Score Competition (FSC) In this section we first qicly review the decision process of speaer identification systems based on GMM lielihood scores and then smmarize or FSC approach which is described in more detail in [5]. Let S be the total nmber of enrolled speaers and LL( X Θ ) the log lielihood score that the test featre seqence X was generated by the GMM Θ of speaer, which consists of M mixtres of Gassian distribtions. Then the recognized speaer identity S * is given by: S * ( X Θ )} =,, L, S = arg max{ LL (5) Since the vectors of the seqence X = ( x, x, L, xn ) are assmed to be independent and identically distribted, the lielihood score for speaer and model Θ is compted as: LL N ( X Θ ) = LL( Θ ). Or mltiple microphone setp x n n= i allows s to bild mltiple GMM models Θ for each speaer and each channel CH i, reslting in a set CH CH C Θ = { Θ, L, Θ } for speaers and C channels. The ey idea of or FSC approach is to se this set of mltiple GMM models rather than a single GMM model for the speaer identity decision. In each frame we compare featre vector x i provided by channel CH h to the mltiple GMMs of speaer. The highest log lielihood score is chosen to be the frame score. In this case the lielihood score of the observed featres given speaer is compted as: CH LL N ( X Θ ) = LL( x Θ ) n= N = max n= n CH j { LL( xn Θ )} C (6) Note that this process does not rely on models for the test channel. Also, this competition process differs from the monochannel scoring process in that per-frame log lielihood scores for different speaers are not necessarily derived on the same channel. 6 Array -7 3 4 5 8 9 3.. Data and Setp A Far-Field Speaer Identification (FarSID) Database has recently been collected at Carnegie Mellon University to stdy the performance of speaer identification algorithms in adverse conditions, inclding a far-field microphone setp, varios interfering noise sorces, and reverberant room characteristics. Similar to the database described in [5] the FarSID database consists of speech recordings from mltiple far-distant microphones as depicted in Figre 3. In addition, to mae the FarSID database even more challenging than its predecessor database, we additionally recorded nder varios noise and reverberation conditions. The FarSID database consists of conversational speech recorded in face-to-face dialog sessions nder two different reverberation conditions (small vs. medim-sized room) nder six noise conditions per room. The noise conditions were applied by playing interfering noises at different Signal to Noise Ratio (SNR) levels (msic, white noise, speech) while recording the respective sessions. Each condition lasted for abot 3 mintes per session per speaer. The different noise conditions and their respective SNR are indicated below: No noise Msic Noise -5 db SNR Msic Noise db SNR White Noise -5 db SNR White Noise db SNR Speaer Interfering Noise -5 db In total we recorded native speaers of American English, where each speaer is engaged by an interviewer in a conversation abot varios topics. Each speaer participated in 4 recording sessions, two in a small and two in a medim-sized room, totaling to hors dration per speaer. Figre 3 shows the distant microphone setp in the FarSID database. The right hand side illstrates the microphone positioning in the 3D space. Five microphones (labeled 9 to and 4) are hanging from the ceiling or are monted to high microphone stands. Seven microphones (labeled to 7) are bilding a microphone array, with a distance of 5cm between each microphone. The microphone array and two other microphones (labeled 3 and 5) are set p on the table which is 3 9 Array -7 4 6 8 Figre 3: Microphone setp in the FarSID database 5
arranged between the interviewer and interviewee. In addition, we se two lapel microphones, nmber 8 worn by the interviewer and nmber 6 worn by the interviewee, i.e. the speaer whose identity is to be recognized. The left hand side of Figre 3 illstrates the distance between the speaer (mared by a red X ) and these 6 microphones. One grid nit roghly corresponds to.5 meters. 3.. Experimental Setp and Reslts The experiments reported in this section are based on the FarSID database. The aim of the investigation is to demonstrate the robstness of or FSC approach for far-field scenarios and show how it addresses the challenges posed by mismatched condition, i.e. the fact that speaer models are applied to acostic channel conditions which have not been observed dring training. For this prpose we train and test or approach nder for scenarios, where each scenario is designed to be more challenging than the previos one and more close to the challenges for real-world applications. The for scenarios are described next, followed by the experimental reslts in the respective sbsections. [Match] Matched condition: train a speaer model with data recorded nder the same conditions as in the test case this is the golden line as it reflects the best case scenario. [Mis-MM] Mismatched condition with Mltiple Microphone data: train speaer models with data simltaneosly recorded by mltiple microphones that cover a variety of conditions bt the test condition. [Mis-SM-] Mismatched conditions with Single Microphone data and nowledge abot test condition: train speaer models with data recorded with one microphone nder one condition and tested on a different condition sing some prior nowledge abot the test condition. [Mis-SM-n] lie [Mis-SM-] bt this time we varied the nmber of microphone positions involved for model training. The experiments were carried ot on data with the first noise condition of the FarSID database (see section ), i.e. distant speech with common bacgrond noise sch as air conditioning, compter fans, and reverberation recorded by the 6 microphones (as described above) in a medim-sized room. We selected 6 seconds of these data per speaer to train the speaer model. For testing we selected 3 seconds per speaer of the same noise condition and same room. The major mismatch reslts from the selection of the microphone positions for training and test, as described above. In total we had 6 test trials. All described experiments are condcted as closed-set speaer identification. Performance is measred in terms of identification accracy, i.e. the percentage of correctly identified test trials. The applied speech featres X are Mel Freqency Cepstral Coefficients (MFCC); the speaer models consist of Gassian Mixtre Models (GMM) with 64 Gassians per model. 3... Experiment on [Match] Scenario To get the pper bond performance, i.e. the best case scenario we trained and tested nder the matched scenario. The second colmn of Table shows the breadown of speaer identification performance for each microphone position. To get the performance for microphone position y we trained all speaer models on the training trials recorded by microphone y and tested on the test trials recorded by microphone y. So, on average we get 98.4% identification rate on speaers for farfield recordings if we assme that we now the test condition and that we do have recordings in this condition available for each speaer. 3... Experiment on [Mis-MM] Scenario The reslts of the second experiment are described in colmn 3 of Table. Here we assme that we do have simltaneos recordings from microphone positions 9-5. To calclate the performance on position 9 we train 6 speaer models per speaer, one on each of the remaining positions,,, 3, 4, and 5. For the final nmber given in the table we average over the identification rates for each of these mismatching conditions. We achieve 9.% accracy over all positions, i.e. a drastic drop from the matched condition. The forth colmn in Table compares this brte-force approach to or FSC techniqe. The same 6 models per speaer are now combined sing competition at the scoring stage. As can been seen FSC significantly improves over the brte-force approach and even gets close to the [Match] performance. This reslt indicates that FSC compensates well for scenarios, in which recordings with mltiple microphones bt the matching one are available for a speaer. Or earlier reslts also showed that the SID performance frther improves if the matching condition is available [5]. Table. Performance for Mlti-Microphone Setp Test [Mis-MM] [Match] [Mis-MM] Microphone FSC Position 9 98. 85.8 97. Position 99. 9. 99. Position 98. 9. 97. Position 96. 9.4 97. Position 3. 94.. Position 4 99. 95.9 99. Position 5 98. 95. 98. Average 98.4 9. 98. 3..3. Experiment on [Mis-SM-] Scenario In this next set of experiments we investigate the performance of or FSC approach in the more realistic case where only single microphone recordings are available per speaer. We assme here withot loss of generality to have training data from microphone position 4. As colmn 3 of Table 3 (labeled as ) shows, the performance drops drastically compared to the matched and the mlti-microphone performance. The gap is more sbstantial for microphone positions 9, which is intitively clear as these are frther away from microphone 4 than microphone positions 3 5. The ey idea to mae se of or FSC approach in the single microphone case is to simlate mltiple microphone recordings from the single microphone data. In order to apply FSC, we simlated different channels from the microphone 4 speech. This simlation targets microphone positions 9-5 by convolving the sorce speech with the simlated Room Implse Response (RIR) generated sing [6]. While the RIR program incorporates the dependence of room implse response on many different physical characteristics of the room and environment, we focsed here on three major attribtes, the room size,
reverberation time, and distance between the speaer and the microphone. Colmn 4 in Table 3 (labeled -FSC) shows the performance when simlating the 7 microphone positions. FSC on simlated channels significantly otperforms the baseline nder mismatched conditions althogh it cannot beat the performance of FSC on real mlti-microphone data [Mis-MM] FSC from Table. Please note that for both, the FSC on real mlti-microphone data and on simlated mlti-microphone data, we prposely exclde the data from the matching channel, i.e. we assme to not now the microphone position of the test condition. However, we do assme in the simlation to have some nowledge abot possible microphone positions, i.e. the RIR filter do now the actal room size and reverberation time, and create realistic microphone distances. Table 3. Performance for Single-Microphone Setp Test [Match] Microphone FSC Position 9 98. 57.5 84.9 Position 99. 66. 93.4 Position 98. 68.9 93.4 Position 96. 7.7 94.3 Position 3. 94.3 97. Position 4 99. 95.3 98. Position 5 98. 93.4 97. Average 98.4 78. 94.. 9. 8. 7. 6. 5. mic 9 mic Matched FSC mic mic mic 3 Test Channel [Mis-MM] FSC Mismatched mic 4 mic 5 Avg Figre 4: SID Performance comparison nder all setps Figre 4 smmarizes or findings on the for cases, i.e. trained and tested on microphone 4 [Match], real mlti-microphone conditions with FSC [Mis-MM]-FSC, single microphone conditions with simlation -FSC, and the mismatched case [Mismatched], in which the models are trained on single microphone data at position 4 and applied to position 9-5 microphones. 3..4. Experiment on [Mis-SM-n] Scenario In this final set of experiments, we compared the impact of the nmber of microphone positions on performance. We tested this by repeating the -FSC experiments bt this time applying FSC on different nmbers of simlated mltimicrophone data streams. FSC6 refers to the case, in which we sed all 6 mismatched microphone data (position 9-5 except the position matched with test condition) to train 6 models. This corresponds to experiment -FSC. FSC5 refers to the case where we sed only 5 ot of 6 simlated mlti-microphone data. Since we can have 6 over 5 = 6 different choices, we averaged the performance over all different choices. In addition, we calclated the best and worst performance depending on the choice. FSC4, FSC3, FSC repeat the same experiments with fewer microphones giving s 6 over 4 = 5, 6 over 3 =, and 6 over = 5 choices, respectively. Figres 5 and 6 show the best and worst performance for all selections.. 8. 6. 4.... 8. 6. 4.... 95. 9. 85. 8. 75. mic 9 FSC6 FSC5 FSC4 FSC3 FSC mic mic mic mic 3 mic 4 mic 5 Test Channel Figre 5: Best performance of -FSC mic 9 FSC6 FSC5 FSC4 FSC3 FSC mic mic mic mic 3 mic 4 mic 5 Test Channel Figre 6: Worst performance of -FSC FSC6 Best Average Worst FSC5 FSC4 FSC3 FSC Avg Avg nofsc Figre 7: Performance smmary of FSC with different nmber of channels
As can been seen from Figres 5 and 6 the worst selection of microphone positions for the data simlation does not have a significant impact on the system performance compared to the best choice. In other words, the sccess of the FSC approach does not depend on a proper selection of microphone positions for the simlation. Figre 7 compares the worst, average, and best case and with the mismatched condition. Even in the worst case, FSC still significantly improves performance compared to the baseline performance nder mismatched condition (nofsc in Figre 7) 4. Discssion and Conclsions In this paper, we presented an algorithm for reverberation compensation sing Cepstral Post-Filtering. The compensation procedre consisted of a relatively simple FIR filter that is applied to seqences of cepstral coefficients, with the coefficients of the filter optimized to minimize the mean sqare difference between the compensated coefficients for speech in the training and testing environments. The optimal filter obtained was niqe and invariant to the environmental conditions of a particlar test trial. This approach provided significant improvement in SID accracy across different reverberation conditions encompassing simlated as well as actal RIRs and also across different speech databases. In this paper we also reported far-field speaer recognition performances nder mismatched conditions. The aim of the investigation is to demonstrate the robstness of or framebased score competition approach (FSC) for far-field scenarios and show how it addresses the challenges posed by mismatched condition, i.e. the fact that speaer models are sally applied to acostic channel conditions which have not been observed dring training. For this prpose we trained and tested or approach nder for scenarios, where each scenario is designed to be more challenging than the previos one and more close to the challenges for real-world applications. The first scenario assmes to now the test condition, i.e. the best bt most nliely case. The second case assmes to not now the test condition bt to have training samples recorded from mltiple microphone positions for the speaer in qestion. Here, FSC significantly improves over the mismatched case, i.e. applying mlti-microphone data gives more robstness since they cover mltiple microphone positions and ths better prepare for the nnown. In the third scenario we assme to have only singlemicrophone data available and compensate this lac by simlating mlti-microphone data sing room implse response filters. FSC manages to still significantly otperform the mismatched scenario. In other words, even when only single microphone recordings from a speaer are available, the simlation of mltiple microphone recordings combined with or FSC approach improves the overall performance significantly. In the last scenario we vary the selection of microphone positions for the simlated data and show that even if we mae the worst choice of microphone positions, we still see significant improvements over the mismatched case. official policies, either expressed or implied, of the NGA, the United States Government, or Rosettex. We wold also lie to than Fred Goodman for sefl discssions and comments. 6. References [] Chin-Hi Lee, Fran K. Soong, Kldip K. Paliwal, Atomatic Speech and Speaer Recognition: Advanced Topics, Springer, 996, ISBN:793976. [] S. Fri, Towards Robst Speech Recognition Under Adverse Conditions, ESCA Worshop on Speech Processing in Adverse Conditions, p. 3-4, 99. [3] F. Bimbot, J.-F. Bonastre, C. Fredoille, G. Gravier, I. Magrin- Chagnollea, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovsa-Delacretaz and D. A. Reynolds, A Ttorial on Text- Independent Speaer Verification, Jornal on Applied Signal Processing 4, pp. 43-45, 4. [4] D. Reynolds, Speaer Identification and Verification Using Gassian Mixtre Speaer Models, Speech Commnication, Vol. 7, No. -, p. 9-8, Agst 995. [5] Q. Jin, T. Schltz, and A. Waibel, Far-field Speaer Recognition, IEEE Transactions on Adio, Speech, and Langage Processing, Vol. 5, No.7, p. 3-3, 7. [6] D. Campbell, K. Palomäi, and G. Brown, A MATLAB Simlation of Shoebox Room Acostics for se in Research and Teaching. http://cis.paisley.ac./research/jornal /V9/V9N3/campbell.doc [7] J.P. Campbell and D.A. Reynolds, Corpora for the evalation of speaer recognition systems, IEEE ICASSP, 999. [8] K. Kmar, and R. M. Stern, Environment-Invariant Compensation for Reverberation sing Linear Post-Filtering for Minimm Distortion, IEEE International Conference on Acostics, Speech, and Signal Processing, April 8, Las Vegas, Nevada. [9] S. G. McGovern, A model for room acostics, http://pi.s/rir.html. [] CMU Sphinx Open Sorce Speech Recognition Engines, http://cmsphinx.sorceforge.net/html/cmsphinx.php. [] RWCP Sond Scene Database in Real Acostical Environments, http://tosa.mri.co.jp/sonddb/indexe.htm. [] S. Brger, K. Weilhammer, F. Schiel, H.G. Tillmann, Verbmobil Data Collection and Annotation, in: Wolfgang Wahlster (Ed.) Verbmobil: Fondations of Speech-to-Speech Translation, Springer-Verlag Berlin, Heidelberg, New Yor, pp. 537-549,. 5. Acnowledgements The wor was fnded in part by the National Geospatial- Intelligence Agency (NGA) nder National Technology Alliance (NTA) Agreement Nmber NMA 4--9-. The views and conclsions contained in this docment are those of the athors and shold not be interpreted as representing the