Discriminative Decision Function Based Scoring Method Used in Speaker Verification

Chinese Journal of Electronics, Vol.21, No.4, Oct. 2012

LIANG Chunyan, ZHANG Xiang and YAN Yonghong
(The Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China)

Abstract  The log likelihood ratio decision function, derived from classical hypothesis testing theory, is widely used in Gaussian mixture model based speaker recognition systems. This paper introduces a discriminative decision function based scoring method for speaker recognition with the state-of-the-art Joint factor analysis (JFA) system. In the scoring module of the JFA system, an approximate form of the decision function is proposed. Based on this approximation, we present a discriminative decision function that re-estimates the contribution of each speech sound unit to the decision function, in order to further improve speaker verification performance. The discriminative decision function exploits the individual Gaussian components for better classification. The experiments are carried out on the core conditions of the National institute of standards and technology (NIST) 2010 speaker recognition evaluation data. The experimental results show that, on the whole, the proposed scoring method outperforms the conventional frame-by-frame strategy.

Key words  Speaker verification, Joint factor analysis (JFA), Discriminative decision function.

I. Introduction

The task of speaker verification is to determine whether a given segment of speech is spoken by the hypothesized speaker [1,2]. The task can be treated as a hypothesis-testing problem. Given a trial consisting of a test utterance and a target speaker, a True or False decision is made by comparing the log likelihood score of the trial with a threshold. Gaussian mixture models (GMMs) have long been the dominant approach in speaker verification [1].
In this approach, GMMs are used to model the data distribution, and the Log likelihood ratio (LLR) derived from hypothesis testing theory is used as the decision function.

In recent years, Joint factor analysis (JFA) [3,4] has become the state-of-the-art technique in speaker verification. It was proposed to address speaker and session variability in the GMM framework. Many sites used JFA in the latest NIST evaluations, and several scoring methods exist [5-7]. Frame-by-frame scoring is the most conventional one: the whole feature file of each utterance is processed with a full GMM log-likelihood evaluation. It treats the GMM simply as a probability density function of the feature vectors from a target speaker.

In this study, we propose a scoring method based on a discriminative decision function, which expands a single GMM into a set of individual Gaussian components. In the proposed method, we re-estimate the contribution of each speech sound unit to the decision function to further improve speaker verification performance.

The rest of this paper is organized as follows. We briefly introduce the theory of JFA in Section II. The traditional frame-by-frame scoring method is presented in Section III. We propose the discriminative decision function based scoring strategy in Section IV. Experimental results are shown in Section V. Finally, we give the conclusion in Section VI.

II. Joint Factor Analysis

JFA has attracted wide attention in recent years and has become the state-of-the-art system in the field of speaker recognition. The JFA model addresses speaker and session variability in the GMM framework. In this model, the speaker and channel dependent mean supervector $M$ is represented as the sum of two supervectors:

$$M = s + c \tag{1}$$

where $s$ is the speaker supervector and $c$ is the channel supervector, both of which are normally distributed.
They can be represented respectively by

$$s = m + Vy + Dz \tag{2}$$

$$c = Ux \tag{3}$$

where $m$ is the speaker-independent mean supervector, that is, the mean supervector of the Universal background model (UBM), $V$ is the speaker loading matrix (eigenvoices, capturing high speaker variability), $D$ is the diagonal loading matrix describing the remaining speaker variability not covered by $V$, and $U$ is the channel loading matrix (eigenchannels, capturing high intersession variability). $y$, $z$ and $x$ are the speaker factor, diagonal factor and channel factor respectively, all of which are assumed to be standard normally distributed random variables.

The underlying task in JFA is to train the hyperparameters $U$, $V$ and $D$ on a large training set. In the Bayesian framework, the posterior distribution of the factors given their priors can be computed using the enrollment data. The likelihood of a test utterance $\chi$ is then computed by integrating over the posterior distribution of $y$ and $z$, and the prior distribution of $x$ [8].

Manuscript Received June 2011; Accepted Apr. This work is supported by the National Natural Science Foundation of China and the Strategic Priority Research Program of the Chinese Academy of Sciences (No.XDA).

III. Traditional Frame-by-Frame Scoring Method

The frame-by-frame scoring method is based on a full GMM log-likelihood evaluation [7]. The log-likelihood of test utterance $\chi$ given model $s$ is computed as an average frame log-likelihood:

$$\log P(\chi \mid s) = \frac{1}{T}\sum_{t=1}^{T} \log \sum_{c=1}^{C} \omega_c\, \mathcal{N}(o_t; s'_c, \Sigma_c) = \frac{1}{T}\sum_{t=1}^{T} \log p(o_t \mid s') \tag{4}$$

where $o_t$ is the feature vector at frame $t$, $T$ is the length in frames of test utterance $\chi$, $C$ is the number of Gaussians in the GMM, and $s' = s + Ux$ is the supervector of the target model after channel adaptation, with $Ux$ the channel supervector of the test utterance. Similarly, when calculating the log-likelihood of utterance $\chi$ given the UBM, the mean supervector of the UBM is also compensated as $m' = m + Ux$. This is equivalent to moving the mean supervectors of both the target model and the UBM into the channel space in which the test utterance lies, which effectively reduces the acoustic mismatch between the training and test environments. Thus, the average verification score is obtained as the log-likelihood ratio between the compensated target speaker model $s'$ and the compensated UBM $m'$ for the test utterance $\chi$:

$$\Lambda(\chi) = \frac{1}{T}\sum_{t=1}^{T} \left[ \log p(o_t \mid s') - \log p(o_t \mid m') \right] \tag{5}$$

IV. Discriminative Decision Function Based Scoring Method

1. The approximation of decision function

If we define $p(o_t)$ as the total probability of a feature frame $o_t$ under both the speaker model $s'$ and the UBM $m'$, that is

$$p(o_t) = p(o_t \mid s') + p(o_t \mid m') \tag{6}$$

then Eq.5 can be written as

$$\Lambda(\chi) = \frac{1}{T}\sum_{t=1}^{T} \left[ \log \frac{p(o_t \mid s')}{p(o_t)} - \log \frac{p(o_t \mid m')}{p(o_t)} \right] = \frac{1}{T}\sum_{t=1}^{T} \left[ \log \frac{p(o_t \mid s')}{p(o_t \mid s') + p(o_t \mid m')} - \log \frac{p(o_t \mid m')}{p(o_t \mid s') + p(o_t \mid m')} \right] \tag{7}$$

In a GMM $\lambda$, the probability $p(o_t \mid \lambda)$ of an observed feature frame $o_t$ is

$$p(o_t \mid \lambda) = \sum_{c=1}^{C} \omega_c\, p(o_t \mid \lambda_c) = \sum_{c=1}^{C} g(o_t \mid \lambda_c) \tag{8}$$

The first two terms of the Taylor series, $\log x \approx x - 1$, are used to approximate Eq.7, and the constant 1 is discarded since this change does not affect the classification accuracy:

$$\Lambda(\chi) \approx \frac{1}{T}\sum_{t=1}^{T} \left[ \frac{p(o_t \mid s')}{p(o_t \mid s') + p(o_t \mid m')} - \frac{p(o_t \mid m')}{p(o_t \mid s') + p(o_t \mid m')} \right] = \frac{1}{T}\sum_{t=1}^{T}\sum_{c=1}^{C} \left[ \frac{g(o_t \mid s'_c)}{\sum_{j=1}^{C} \left[ g(o_t \mid s'_j) + g(o_t \mid m'_j) \right]} - \frac{g(o_t \mid m'_c)}{\sum_{j=1}^{C} \left[ g(o_t \mid s'_j) + g(o_t \mid m'_j) \right]} \right] \tag{9}$$

If we define

$$\Phi_c = \frac{1}{T}\sum_{t=1}^{T} \frac{g(o_t \mid s'_c) - g(o_t \mid m'_c)}{\sum_{j=1}^{C} \left[ g(o_t \mid s'_j) + g(o_t \mid m'_j) \right]}, \quad c = 1, 2, \ldots, C \tag{10}$$

as the difference in average occupation probability, over the whole observation sequence, for Gaussian component $c$ between the adapted speaker model and the UBM, Eq.9 can be rewritten as an inner product

$$\Lambda(\chi) \approx w^t b_\eta \tag{11}$$

where $w = [1, \ldots, 1]^t$ is an all-ones weight vector and $b_\eta = [\Phi_1, \ldots, \Phi_C]^t$ denotes the difference vector of occupation probability for a trial $\eta$.

From Eq.11, we can see that, given a trial $\eta$, the value of the decision function, and hence the True or False decision for the trial, is completely determined by the weight vector $w$ and the difference vector $b_\eta$. The average occupation probability can be thought of as the occurrence frequency of Gaussian mixture component $c$ over the whole observation sequence. We call the difference vector $b_\eta$ the trial's information vector, since it maps the trial into a vector. The values in the weight vector $w$ can be viewed as the contributions to the decision function of the corresponding elements of the trial's information vector. Hence, we call $w$ the contribution factor; it can also be regarded as a classifier between the true information vectors and the false ones.
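The mapping from a trial to its information vector (Eq.10) and the resulting score (Eq.11) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `information_vector` and `score` are hypothetical, and the per-frame weighted component likelihoods are random synthetic numbers rather than outputs of a real JFA system.

```python
import numpy as np

def information_vector(g_spk, g_ubm):
    """Eq.10: map one trial to its information vector b = [Phi_1, ..., Phi_C].

    g_spk, g_ubm: arrays of shape (T, C) holding the weighted component
    likelihoods g(o_t | s'_c) and g(o_t | m'_c) for every frame and component.
    """
    # Per-frame denominator: sum_j [g(o_t|s'_j) + g(o_t|m'_j)], shape (T, 1).
    denom = (g_spk + g_ubm).sum(axis=1, keepdims=True)
    # Average over frames of the normalized per-component difference.
    return ((g_spk - g_ubm) / denom).mean(axis=0)

def score(b, w=None):
    """Eq.11: inner product of the contribution factor w and the vector b."""
    if w is None:
        w = np.ones_like(b)  # equal contributions for all components
    return float(w @ b)

# Synthetic stand-in for real likelihoods (an assumption, not paper data).
rng = np.random.default_rng(0)
T, C = 200, 8
g_spk = rng.uniform(0.01, 1.0, size=(T, C))
g_ubm = rng.uniform(0.01, 1.0, size=(T, C))

b = information_vector(g_spk, g_ubm)
approx = score(b)

# Cross-check against the first form of Eq.9, built from total likelihoods.
p_s, p_m = g_spk.sum(axis=1), g_ubm.sum(axis=1)
direct = float(np.mean((p_s - p_m) / (p_s + p_m)))
```

With the all-ones weight vector, `approx` and `direct` agree up to rounding, which is the step from Eq.9 to Eq.11. Passing a different `w` to `score` gives the weighted form discussed next.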
In Eq.11, all the values in $w$ are the same, which means that the differences of average occupation probability for all the Gaussian components contribute equally. In GMMs for speaker verification, the Gaussian components can be considered to model the underlying broad phonetic sounds that characterize a person's voice [1]. Hence, $\Phi_c$, $c = 1, \ldots, C$, can be thought of as the difference between the average occupation probability that a feature vector of the test utterance is accounted for by the corresponding speech sound unit under the target speaker model and that under the UBM. The contributions of the sound units to the decision function are determined by $w$. In practice, some sound units carry more speaker-discriminative information and should be weighted heavily, whereas less discriminative sound units should be weighted less. In the following, we show how to obtain a discriminative contribution factor $w$ to further improve speaker verification performance.

2. MSE criterion

Suppose we have a training set consisting of $N_+ + N_-$ trials, in which true trials are denoted as $\{x_i\}$, $i = 1, \ldots, N_+$, and false trials as $\{y_j\}$, $j = 1, \ldots, N_-$. Each trial is mapped into a difference vector of occupation probability, $b_{x_i}$, $i = 1, \ldots, N_+$, and $b_{y_j}$, $j = 1, \ldots, N_-$. The score of the decision function for a trial $x$ can thus be written as $\mathrm{score} = w^t b_x$. We first obtain the discriminative contribution factor $w$ based on the minimum sum-of-squares error (MSE) criterion [9]:

$$w = \arg\min_{w} E\left\{ \left| w^t b_x - y_x \right|^2 \right\} \tag{12}$$

where $E$ denotes expectation and $y_x$ is the ideal output for trial $x$. Let the ideal output be 1 for true trial vectors and 0 for false trial vectors, i.e. $y_{true} = 1$ and $y_{false} = 0$. The criterion above can then be approximated on the training set as

$$w = \arg\min_{w} \left[ \sum_{i=1}^{N_+} \left| w^t b_{x_i} - 1 \right|^2 + \sum_{j=1}^{N_-} \left| w^t b_{y_j} \right|^2 \right] \tag{13}$$

We construct the matrices $M_+$ and $M_-$ from all the information vectors of the true and false trials respectively:

$$M_+ = \begin{bmatrix} b_{x_1}^t \\ b_{x_2}^t \\ \vdots \\ b_{x_{N_+}}^t \end{bmatrix}, \qquad M_- = \begin{bmatrix} b_{y_1}^t \\ b_{y_2}^t \\ \vdots \\ b_{y_{N_-}}^t \end{bmatrix} \tag{14}$$

and we define

$$M = \begin{bmatrix} M_+ \\ M_- \end{bmatrix} \tag{15}$$

Then the problem of Eq.13 becomes

$$w = \arg\min_{w} \left\| Mw - o \right\|^2 \tag{16}$$

where $o$ is a vector consisting of $N_+$ ones followed by $N_-$ zeros (i.e., the ideal outputs for the training trials). The problem of Eq.16 can be solved using the method of normal equations

$$M^t M w = M^t o \tag{17}$$

and Eq.17 can be rearranged as

$$M^t M w = M_+^t \mathbf{1} + M_-^t \mathbf{0} = M_+^t \mathbf{1} \tag{18}$$

where $\mathbf{1}$ is the vector of all ones and $\mathbf{0}$ is an all-zeros vector.
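The least-squares problem of Eq.16 and its normal equations (Eqs.17 and 18) can be illustrated numerically. The sketch below is not the authors' implementation: the helper name `train_mse_weights` is hypothetical, and the true and false information vectors are synthetic Gaussian samples chosen only so that the two classes are separable.

```python
import numpy as np

def train_mse_weights(B_true, B_false):
    """Solve Eq.16 for the contribution factor w by least squares.

    B_true:  (N_plus, C) information vectors of true trials (rows of M_+).
    B_false: (N_minus, C) information vectors of false trials (rows of M_-).
    """
    M = np.vstack([B_true, B_false])                      # Eq.15
    o = np.concatenate([np.ones(len(B_true)),             # ideal output 1
                        np.zeros(len(B_false))])          # ideal output 0
    w, *_ = np.linalg.lstsq(M, o, rcond=None)
    return w

# Synthetic training trials (an assumption, not data from the paper).
rng = np.random.default_rng(0)
C = 5
B_true = rng.normal(1.0, 0.5, size=(60, C))
B_false = rng.normal(-1.0, 0.5, size=(80, C))

w = train_mse_weights(B_true, B_false)

# Same solution from the rearranged normal equations M^t M w = M_+^t 1 (Eq.18).
M = np.vstack([B_true, B_false])
w_normal = np.linalg.solve(M.T @ M, B_true.T @ np.ones(len(B_true)))

scores_true = B_true @ w     # score = w^t b_x for true trials
scores_false = B_false @ w   # score = w^t b_y for false trials
```

Using `np.linalg.lstsq` avoids forming $M^t M$ explicitly, which is usually better conditioned; the explicit solve is shown only to mirror Eq.18.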
If we define $R = M^t M$, $w$ can be obtained by

$$w = R^{-1} M_+^t \mathbf{1} \tag{19}$$

Under the MSE criterion, the classifier fits all the training samples rather than concentrating on those that are easily misclassified, so the discriminative power of $w$ trained by Eq.19 is limited. Based on Eq.19, we then use the Generalized linear discriminant sequence (GLDS) kernel based Support vector machine (SVM) to obtain the optimal $w$.

3. GLDS kernel method for the discriminative training of contribution factor w

Combining the solution of Eq.19 with the scoring form of Eq.11, we have

$$\mathrm{score} = b^t w = b^t R^{-1} M_+^t \mathbf{1} \tag{20}$$

The above equation can be rewritten as

$$\mathrm{score} = b^t \bar{R}^{-1} \bar{b}_+ \tag{21}$$

where $\bar{b}_+ = (1/N_+) M_+^t \mathbf{1}$ and $\bar{R} = (1/N_+) R$. We compare two trials $x$ and $y$ by first mapping them into trial information vectors $b_x$ and $b_y$ and then computing the GLDS kernel as [10]

$$K_{GLDS} = b_x^t \bar{R}^{-1} b_y \tag{22}$$

To reduce training time, we factor $\bar{R}^{-1} = U^t U$ using the Cholesky decomposition. Then

$$K_{GLDS} = (U b_x)^t (U b_y) \tag{23}$$

If we transform all the trial information vectors by $U b_x$, the kernel becomes a simple inner product, which dramatically reduces the time used in SVM training. Finally, the SVM training procedure finds the corresponding $\alpha_i$ for each support vector $b_i$ and a global offset $d$. The optimal contribution factor $w$ can then be obtained as

$$w = \sum_{i=1}^{l} \alpha_i y_i \bar{R}^{-1} b_i + \mathbf{d} \tag{24}$$

where $l$ is the number of support vectors and $\mathbf{d} = [d\ 0\ \cdots\ 0]^t$. Given a new trial $z$, we first convert it to the corresponding information vector $b_z$. The discriminative decision function based on the optimal contribution factor $w$ can then be expressed as

$$\mathrm{score} = \left( \sum_{i=1}^{l} \alpha_i y_i \bar{R}^{-1} b_i + \mathbf{d} \right)^t b_z \tag{25}$$

V. Experiments

1. Experiments setup

The experiments for JFA systems based on the two scoring methods (the traditional frame-by-frame method and the proposed discriminative decision function based method) are carried out on the NIST 2010 speaker recognition evaluation corpus.
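As an implementation note on the GLDS kernel of Section IV.3, the equivalence between Eq.22 and Eq.23 can be checked numerically. The sketch below assumes synthetic information vectors and a hypothetical helper `glds_whitener`; it covers only the kernel factorization, not the SVM training itself, and it assumes $\bar{R}$ is well conditioned (real data may need regularization, which is not discussed in the paper).

```python
import numpy as np

def glds_whitener(B_all, n_true):
    """Return (U, R_bar) with R_bar = (1/N_+) M^t M and R_bar^{-1} = U^t U.

    B_all:  (N_plus + N_minus, C) matrix M of all training information vectors.
    n_true: N_plus, the number of true trials.
    """
    R_bar = B_all.T @ B_all / n_true
    # np.linalg.cholesky returns lower-triangular L with L @ L.T = R_bar^{-1},
    # so U = L.T satisfies U.T @ U = R_bar^{-1}.
    L = np.linalg.cholesky(np.linalg.inv(R_bar))
    return L.T, R_bar

# Synthetic information vectors (an assumption, not data from the paper).
rng = np.random.default_rng(0)
C, n_true, n_false = 6, 50, 70
B_all = rng.normal(size=(n_true + n_false, C))

U, R_bar = glds_whitener(B_all, n_true)

b_x, b_y = B_all[0], B_all[-1]
k_direct = float(b_x @ np.linalg.solve(R_bar, b_y))   # Eq.22
k_whitened = float((U @ b_x) @ (U @ b_y))             # Eq.23
```

Once every information vector has been mapped to `U @ b`, any linear SVM trainer can be applied to the transformed vectors, which is the source of the speed-up described for Eq.23.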
The NIST SRE 2010 is similar to SRE 2008 but differs from prior evaluations in that the training and test conditions for the core test include not only conversational telephone speech recorded over ordinary telephone channels, but also such speech recorded over a room microphone channel, and conversational speech from an interview scenario recorded over a room microphone channel. We name these three conditions telephone, microphone and interview for short. In this study, we focus on three types of trials: telephone-telephone, interview-interview and interview-telephone. Equal error rate (EER) and the minimum Decision cost function (minDCF) are used as evaluation metrics [11,12].

In our experiments, we use Mel-frequency cepstral coefficients (MFCCs) as the acoustic features. 18 cepstral coefficients are computed, and first order derivatives over 5 frames are appended to each feature vector, which results in a dimensionality of 36. These feature vectors are modeled using GMMs, and JFA is used to handle speaker and session variability. The gender dependent UBMs with 1024 mixture components are trained using the NIST SRE side training corpus. The Switchboard II and Switchboard Cellular corpora, as well as the telephone data from the NIST SRE 2005 and 2006 corpora, are used to train the speaker loading matrix with 300 speaker factors. The NIST SRE 2004 corpus is used to train the diagonal matrix. For the channel loading matrix, a telephone loading matrix with 100 channel factors is trained on the telephone data from the NIST SRE 2004, 2005 and 2006 corpora for the telephone-telephone condition. A common channel loading matrix, also with 100 channel factors, is trained for both the interview-interview and interview-telephone conditions on the telephone and microphone data from the NIST SRE 2004, 2005 and 2006 corpora as well as the MIXER5 interview development corpus. The true and false trials for the telephone-telephone, interview-interview and interview-telephone conditions provided in NIST SRE 2008 are used to train the contribution factor $w$ for the corresponding test conditions in NIST SRE 2010.

2. Experiments of Taylor series approximation

Since the discriminative decision function based scoring method is derived from an approximate decision function, the effect of the Taylor series approximation should be examined.
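A check of this kind can be sketched on synthetic data before looking at real scores. The example below is only an illustration under assumptions: per-frame likelihoods are simulated (each trial gets a random offset in its frame log-ratio), the function names are hypothetical, and the resulting correlation says nothing about the NIST data.

```python
import numpy as np

def exact_llr(p_s, p_m):
    """Eq.5: frame-averaged log-likelihood ratio, one value per trial."""
    return np.mean(np.log(p_s) - np.log(p_m), axis=-1)

def approx_llr(p_s, p_m):
    """Eq.9: frame-averaged score after the log x ~ x - 1 approximation."""
    return np.mean((p_s - p_m) / (p_s + p_m), axis=-1)

# Simulated trials (an assumption): the frame log-ratio of each trial is a
# per-trial offset plus unit-variance noise; the UBM likelihood is fixed at 1.
rng = np.random.default_rng(1)
n_trials, T = 300, 400
offset = rng.uniform(-1.0, 1.0, size=(n_trials, 1))
log_ratio = offset + rng.normal(0.0, 1.0, size=(n_trials, T))
p_m = np.ones((n_trials, T))
p_s = np.exp(log_ratio)

exact = exact_llr(p_s, p_m)
approx = approx_llr(p_s, p_m)

# Pearson correlation between the two score sets across trials.
corr = float(np.corrcoef(exact, approx)[0, 1])
```

A correlation close to 1 indicates that the two scores rank trials almost identically, which is the property that matters for threshold-based classification.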
Fig.1 shows the relationship between the LLR scores obtained from the traditional decision function and those from the approximate form using two terms of the Taylor series. We tested on utterances for male and female speakers respectively, and each utterance is scored with both Eq.5 and Eq.11. The relationship between the scores from the two scoring forms is nearly linear, which means that, for the purpose of classification, the effect of the Taylor series approximation can be ignored.

3. Experiments on NIST SRE 2010

In this subsection, we list the results of JFA systems using the frame-by-frame and the Discriminative decision function (DDF) based scoring methods on the three test conditions in NIST SRE 2010.

Table 1 lists the performance of the JFA systems based on the two scoring methods for the telephone-telephone condition. From Table 1, we can see that the proposed scoring method outperforms the conventional frame-by-frame strategy for both male and female speakers. The proposed system achieves a 14.85% relative improvement in EER and a 5.53% relative improvement in minDCF for male speakers, and relative gains of 16.12% in EER and 16.12% in minDCF for female speakers.

Table 1. Comparison of different scoring methods for the telephone-telephone task
                  EER%    minDCF    EER%    minDCF
Frame-by-frame
DDF

The performance of the JFA systems based on our method and on the traditional frame-by-frame one for the interview-interview task is shown in Table 2. As can be seen from Table 2, our method achieves relative improvements of 11.27% in EER and 7.28% in minDCF for male speakers, and 6.21% in EER and 4.22% in minDCF for female speakers.

Table 2. Comparison of different scoring methods for the interview-interview task
                  EER%    minDCF    EER%    minDCF
Frame-by-frame
DDF

Table 3 compares the proposed system with the frame-by-frame one for the interview-telephone condition.
It demonstrates that, except for the EER for male speakers, the performance of the proposed system is comparable to or better than that of the frame-by-frame one. Relative gains of 5.54% in minDCF for male speakers and 8.39% in EER for female speakers are obtained. We note that the results for male speakers on the interview-telephone task are less favorable. This may be due to the fact that the number of interview-telephone trials (both true and false) from NIST SRE 2008 is too small to train the contribution factor $w$ well.

Table 3. Comparison of different scoring methods for the interview-telephone task
                  EER%    minDCF    EER%    minDCF
Frame-by-frame
DDF

Fig. 1. Relationship of scores obtained from traditional decision function and approximate form. (a); (b)

4. Speed

The aim of this experiment is to show the approximate scoring time of the two systems in order to compare their complexity. The time measured includes reading the necessary data connected with the trial and computing the likelihood ratio. Each measurement was repeated 5 times and averaged. Table 4 shows the average scoring time per trial. From Table 4, we can see that the proposed scoring method is faster than the traditional frame-by-frame one.

Table 4. Comparison of average scoring time per trial using frame-by-frame and DDF based scoring methods
Scoring method    Time cost (s)
Frame-by-frame    3.75
DDF               2.01

VI. Conclusion

In this paper, we have introduced a discriminative decision function based scoring method for speaker verification with the JFA system. Experiments show that the proposed method is effective and, on the whole, outperforms the traditional frame-by-frame scoring method. In addition, the scoring time of the proposed method is lower than that of the frame-by-frame method (2.01 s versus 3.75 s per trial).

References

[1] D.A. Reynolds, T.F. Quatieri and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, Vol.10, No.1-3, pp.19-41.
[2] X. Zhang, X. Xiao, H. Wang, J. Zhang and Y. Yan, "Multiclass maximum a posteriori linear regression for speaker verification", Chinese Journal of Electronics, Vol.19, No.4.
[3] M.H. Sanchez, L. Ferrer, E. Shriberg and A. Stolcke, "Constrained cepstral speaker recognition using matched UBM and JFA training", Proc. of Interspeech, Florence, Italy.
[4] P. Kenny, P. Ouellet, N. Dehak, V. Gupta and P. Dumouchel, "A study of inter-speaker variability in speaker verification", IEEE Trans. on Audio, Speech and Language Processing, Vol.16, No.5.
[5] N. Dehak, P. Kenny, R. Dehak, P. Ouellet and P. Dumouchel, "Front-end factor analysis for speaker verification", IEEE Trans. on Audio, Speech and Language Processing, Vol.19, No.4.
[6] N. Brümmer, L. Burget, J. Cernocky, O. Glembek et al., "Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006", IEEE Trans.
on Audio, Speech and Language Processing, Vol.15, No.7.
[7] O. Glembek, L. Burget, N. Dehak, N. Brümmer and P. Kenny, "Comparison of scoring methods used in speaker recognition with joint factor analysis", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan.
[8] P. Kenny and P. Dumouchel, "Experiments in speaker verification using factor analysis likelihood ratios", Proceedings of Odyssey 2004, Toledo, Spain.
[9] R. Duda and P. Hart, Pattern Classification and Scene Analysis, Wiley, New York.
[10] W. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition", Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Orlando, Florida, USA, Vol.1.
[11] The NIST year 2008 speaker recognition evaluation plan.
[12] The NIST year 2010 speaker recognition evaluation plan.

LIANG Chunyan received the B.E. degree in Communication Engineering from Shandong Normal University. She is now an M.S. & Ph.D. candidate in the Key Laboratory of Speech Acoustics and Content Understanding at the Institute of Acoustics, Chinese Academy of Sciences. Her research interests include speaker recognition and language recognition. (Email: liangchunyan@hccl.ioa.ac.cn)

ZHANG Xiang received the B.E. degree in Electronic Information Engineering from Shandong University in 2006 and the Ph.D. degree from the Key Laboratory of Speech Acoustics and Content Understanding at the Institute of Acoustics, Chinese Academy of Sciences. His research interests include speaker recognition, language identification, speaker diarization, and audio watermarking.

YAN Yonghong received the B.E. degree from Tsinghua University in 1990 and the Ph.D. degree from Oregon Graduate Institute (OGI). He worked at OGI as Assistant Professor (1995), Associate Professor (1998) and Associate Director (1997) of the Center for Spoken Language Understanding.
He worked at Intel from 1998 to 2001, where he chaired the Human Computer Interface Research Council and served as Principal Engineer of the Microprocessor Research Laboratory and Director of the Intel China Research Center. He is currently a professor and the director of the Think IT Laboratory. His research interests include speech processing and recognition, language/speaker recognition, and human computer interface. He has published more than 100 papers and holds 40 patents.