Proceedings of the IASTED International Conference on Signal Processing, Pattern Recognition and Applications (SPPRA), Innsbruck, Austria

ITERATIVE CONSTRAINED MLLR APPROACH FOR SPEAKER ADAPTATION

Giorgio Biagetti, Alessandro Curzi, Massimo Mercuri, Claudio Turchetti
DII, Dipartimento di Ingegneria dell'Informazione, Università Politecnica delle Marche, Ancona, Italy
g.biagetti@univpm.it, a.curzi@univpm.it, massimo.mercuri@univpm.it, c.turchetti@univpm.it

ABSTRACT
In this paper an effective technique for speaker adaptation in the feature domain is presented. The technique starts from the well-known maximum-likelihood linear regression (MLLR) auxiliary function and obtains the constrained MLLR transformation in an iterative fashion. The proposed approach is particularly suitable for implementation on the client side of a distributed speech recognition scheme, owing to the reduced number of iterations required to reach convergence. Extensive experimentation with the CMU Sphinx ASR system, using a preliminarily trained speaker-independent acoustic model for the Italian language in a setting designed for large-vocabulary continuous speech recognition, demonstrates the effectiveness of the approach even with small amounts of adaptation data.

KEY WORDS
MLLR, CMLLR, DSR, SI, SD.

1 Introduction

Speaker adaptation techniques have proven to be very effective in modern speech recognition systems [1], especially when there are significant mismatches between the training and decoding conditions. In these techniques one starts with a speaker-independent (SI) model and then tries to accommodate the model to a new speaker, obtaining a speaker-dependent (SD) model from a relatively small amount of speech data from that speaker. The basic idea is to compensate for the mismatch between training and test conditions by modifying the model parameters on the basis of some adaptation data.
Among these techniques, maximum-likelihood linear regression (MLLR) [4] and constrained MLLR (CMLLR) [9], [6] are powerful and widely used methods for speaker adaptation in large-vocabulary continuous speech recognition (LVCSR). MLLR uses the expectation-maximization (EM) criterion to estimate a linear transformation that adapts the Gaussian parameters, i.e. the means and variances, of hidden Markov models (HMMs) [2]. Although the two transformations are estimated separately, the computational complexity is still rather high. An alternative scheme for adapting both mean vectors and covariance matrices is the constrained approach, in which the transformation applied to the covariance matrix is tied to the transformation applied to the mean vector. It can be shown that CMLLR is equivalent to a transformation in the feature domain. This property makes CMLLR particularly suitable for a distributed speech recognition (DSR) scheme, in which the recognition process is split into a front-end on the client side, primarily devoted to feature extraction, and a back-end on the server side, devoted to the recognition itself. The main drawbacks of the CMLLR approach are that the algorithm is more complex than MLLR, and that it is an iterative process which typically requires several iterations to converge and, in some cases, does not converge at all [3]. Thus, as the algorithm implementing CMLLR is more complex than standard MLLR, there is a need for simpler algorithms that can be efficiently implemented on the client side of a DSR scheme. The proposed algorithm meets this requirement and, thanks to its simpler formulation, is able to overcome some of the limitations of CMLLR. It is worth noting that the iterative CMLLR has exactly the same formulation as the MLLR algorithm, while requiring fewer iterations than CMLLR to converge.
The algorithm has been evaluated through extensive experimentation using the CMU Sphinx recognizer in a setting defined for LVCSR, and a performance comparison with the MLLR and CMLLR techniques shows the effectiveness of the approach.

2 MLLR and CMLLR background

Both MLLR and CMLLR use the EM criterion to estimate a linear transformation that adapts the Gaussian parameters of HMMs. Starting from the current set of parameters M, the adapted model parameters M̂ are obtained by maximizing the following auxiliary function:

Q(M, M̂) = K − (1/2) Σ_{m=1}^{M} Σ_{τ=1}^{T} γ_m(τ) [K_m + log|Σ̂_m| + (o(τ) − μ̂_m)^T Σ̂_m^{−1} (o(τ) − μ̂_m)],   (1)

where μ̂_m and Σ̂_m are the adapted mean and variance of component m for the target acoustic condition, while M and
T represent, respectively, the number of components associated with the particular transform and the number of observations. K is a constant dependent only on the transition probabilities, K_m is the normalisation constant associated with Gaussian component m, and

γ_m(τ) = p(q_m(τ) | M, O_T)   (2)

is the posterior occupancy of component m, q_m(τ) being the Gaussian m at time τ and O_T = [o(1), ..., o(T)] the observation sequence.

2.1 Unconstrained transformation

In this adaptation method the mean and variance are transformed independently of each other. The mean μ is transformed as

μ̂ = Aμ + b = Wξ,   (3)

where ξ = [1 μ^T]^T is the extended mean vector and W = [b A] is the extended linear transform. The transform of the covariance matrix Σ is given by

Σ̂ = HΣH^T,   (4)

where H is the matrix to be estimated. Equation (1) is the objective function to be maximized during adaptation to obtain the parameters W and H of the transformations. It was originally proposed to adapt the mean vector [4], and the technique was extended to variance adaptation only later [5]. The mean-based linear transform is referred to as MLLR, while the covariance matrix transform is named variance MLLR.

2.2 Constrained transformation

The mean and the variance MLLR transformations can be simultaneously applied to both mean vectors and covariance matrices. However, as in this case the computational cost is high, a constrained scheme to adapt both mean vectors and covariance matrices can be used [6, 7]. This is referred to as constrained MLLR:

μ̂ = Aμ + b,   (5)
Σ̂ = AΣA^T,   (6)

which is a particular case of the unconstrained transformation with H = A.
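The equivalence between the constrained model-space transformation above and a transformation in the feature domain can be verified numerically. The following sketch is our illustration, not code from the paper: it checks that adapting a Gaussian's parameters with A and b yields the same log-likelihood as evaluating the original Gaussian on ô = A^{−1}(o − b), up to the Jacobian term log|det A|.

```python
# Numerical check (our sketch): the constrained transform mu' = A mu + b,
# Sigma' = A Sigma A^T gives the same likelihood as evaluating the original
# Gaussian on the transformed feature o_hat = A^{-1}(o - b), minus log|det A|.
import numpy as np

def log_gauss(o, mu, sigma):
    """Log-density of a multivariate Gaussian with full covariance."""
    d = o - mu
    _, logdet = np.linalg.slogdet(sigma)
    n = len(mu)
    return -0.5 * (n * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(sigma, d))

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 3 * np.eye(n)   # well-conditioned transform
b = rng.standard_normal(n)
mu = rng.standard_normal(n)
sigma = np.diag(rng.uniform(0.5, 2.0, n))         # diagonal, as in the paper
o = rng.standard_normal(n)

# Model-space adaptation: transform the Gaussian parameters.
lhs = log_gauss(o, A @ mu + b, A @ sigma @ A.T)

# Feature-space adaptation: transform the observation instead.
o_hat = np.linalg.solve(A, o - b)                 # o_hat = A^{-1} o - A^{-1} b
rhs = log_gauss(o_hat, mu, sigma) - np.linalg.slogdet(A)[1]

print(abs(lhs - rhs))  # ~0: the two formulations are equivalent
```

This is precisely the property that lets a DSR client apply the adaptation on its feature stream without touching the server-side models.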
By substituting (5) and (6) into equation (1) and assuming a diagonal covariance matrix Σ, the following auxiliary function to be maximized is obtained:

Q(M, M̂) = K − (1/2) Σ_{m=1}^{M} Σ_{τ=1}^{T} γ_m(τ) [K_m + log|Σ_m| − log(|A_c|²) + (ô(τ) − μ_m)^T Σ_m^{−1} (ô(τ) − μ_m)],   (7)

where

ô(τ) = A^{−1} o(τ) − A^{−1} b = A_c o(τ) + b_c = W_c ζ(τ).   (8)

As usual, W_c = [b_c A_c] represents the extended transformation matrix and ζ(τ) = [1 o(τ)^T]^T is the extended vector of observations. Equation (7) clearly shows that the constrained transformation can be directly applied in the feature domain [9].

3 Iterative CMLLR

The proposed algorithm, referred to as iterative CMLLR (ICMLLR), implements the constrained transformation using the standard auxiliary function (1) for MLLR, instead of maximizing the more complex objective function (7). The transform estimation is an iterative process: a first transformation W is estimated by MLLR given an initial estimate of Σ, then at each iteration a new estimate Σ̂_k is forced to satisfy the constraint

Σ̂_k = A_k Σ̂_{k−1} A_k^T   (9)

until convergence is reached. The algorithm proceeds as follows:

1. Assume an initial estimate Σ_0 of Σ̂.
2. Estimate the mean transformation W_k = [b_k A_k] by maximizing equation (1).
3. Obtain a new estimate of Σ̂ through the constraint (9).
4. If a stop criterion on both μ̂ and Σ̂ is not met, return to step 2.
5. Otherwise, if the stop criterion on both μ̂ and Σ̂ is met, the solution is reached and the transformation (8) in the feature domain can be applied.

Fig. 1 shows a block diagram of the algorithm. An example of the effect of its operation can be found in Figs. 6 and 7. In particular, Fig. 6 shows as a reference the effect of applying the conventional CMLLR algorithm to the 13 MFCCs of one short utterance chosen among those used in our experiments. The graphs also report an aligned phonetic transcription of the selected utterance, with a dark area marking a short silence period within it. Fig. 7 reports analogous results for our algorithm.
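The iteration above can be sketched in a few lines of code. This is our toy reconstruction, not the paper's implementation: the component occupancies, means and observation statistics are synthetic, estimate_W is a simplified diagonal-covariance MLLR mean-transform estimator, and the variance constraint keeps only the diagonal of A_k Σ̂_{k−1} A_k^T since the model covariances are diagonal.

```python
# Sketch of the ICMLLR loop (our illustration): W is re-estimated by a
# diagonal-covariance MLLR row solution, then the variances are re-constrained
# through Sigma_k = A_k Sigma_{k-1} A_k^T (diagonal part only).
import numpy as np

def estimate_W(mus, var, obs, occ):
    """Closed-form MLLR mean-transform estimate, row by row.

    mus: (M, n) component means; var: (M, n) diagonal variances;
    obs: (M, n) per-component observation means; occ: (M,) occupancies.
    """
    M, n = mus.shape
    xi = np.hstack([np.ones((M, 1)), mus])             # extended means [1 mu^T]^T
    W = np.zeros((n, n + 1))
    for i in range(n):
        w_mi = occ / var[:, i]                          # per-component weights
        G = (xi * w_mi[:, None]).T @ xi                 # (n+1)x(n+1) accumulator
        z = (xi * (w_mi * obs[:, i])[:, None]).sum(0)   # i-th row of Z
        W[i] = np.linalg.solve(G, z)
    return W

rng = np.random.default_rng(1)
n, M = 3, 8
A_true = np.eye(n) + 0.2 * rng.standard_normal((n, n))
b_true = rng.standard_normal(n)
mus = rng.standard_normal((M, n))
obs = mus @ A_true.T + b_true          # noiseless synthetic observation means
occ = np.full(M, 10.0)

var = np.ones((M, n))                  # step 1: initial estimate Sigma_0
for k in range(20):
    W = estimate_W(mus, var, obs, occ)             # step 2: MLLR mean transform
    b, A = W[:, 0], W[:, 1:]
    var = var @ (A ** 2).T                         # step 3: diag(A Sigma A^T)
    if k > 0 and np.linalg.norm(W - W_prev) < 1e-8 * np.linalg.norm(W):
        break                                      # steps 4-5: stop criterion
    W_prev = W.copy()

print(np.allclose(A, A_true), np.allclose(b, b_true))  # True True
```

With noiseless synthetic data the loop recovers the generating transform and the stopping criterion fires after the second pass; on real occupancy statistics the number of iterations is of course larger.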
As can be seen, the conventional adaptation algorithm appears unable to substantially adapt the first few (three) components, while the proposed algorithm is able not only to adapt these components, but also to displace the adapted observation vector more. We will see in Section 5 that recognition performance with the proposed adaptation is higher, which leads us to believe that the larger displacement moves the cepstral vectors more effectively towards those of the baseline SI model.

4 Computational cost

The complexity of an ICMLLR iteration is of the same order as that of MLLR. Given an estimate of Σ,
Figure 1: Block diagram of the ICMLLR algorithm.

at each iteration ICMLLR computes the transformation W, which can be obtained by solving

Σ_{m=1}^{M} Σ_{τ=1}^{T} γ_m(τ) Σ_m^{−1} o(τ) ξ_m^T = Σ_{m=1}^{M} Σ_{τ=1}^{T} γ_m(τ) Σ_m^{−1} W ξ_m ξ_m^T.   (10)

For the full covariance matrix case the solution is computationally very expensive; however, for the diagonal case a closed-form solution is computationally feasible [4]. The left-hand side of equation (10) is independent of the transformation matrix and will be referred to as Z, where

Z = Σ_{m=1}^{M} Σ_{τ=1}^{T} γ_m(τ) Σ_m^{−1} o(τ) ξ_m^T.   (11)

A new variable G_i is defined as

G_i = Σ_{m=1}^{M} (1/σ²_{m,i}) ξ_m ξ_m^T Σ_{τ=1}^{T} γ_m(τ)   (12)

and W is calculated row by row using

w_i^T = G_i^{−1} z_i^T,   (13)

where w_i is the i-th row of W and z_i is the i-th row of Z. Solving equation (13) requires the inversion of an (n+1) × (n+1) matrix for each row of W, n being the size of the mean vectors. As a matrix inversion takes O(n³) operations, the estimation of W requires O(n⁴) operations at each iteration. Once the transformation is obtained, O(Mn²) operations are required to achieve a new estimate of Σ̂ through transformation (9). Thus the total computational cost for each iteration is approximately O(n⁴) + O(Mn²) ≈ O(n⁴). Here n also corresponds to the number of Mel-frequency cepstral coefficients (MFCCs) plus their Δ and ΔΔ coefficients, which is how the input vectors are composed.

Table 1: Parameters used in the experiments.
baseline acoustic model:
    training language: Italian
    total audio length: 9 hours
    states per HMM: 3 + final state
    Gaussians per state:
    tied states:
test corpus:
    language: Italian
    book title: I promessi sposi, by Alessandro Manzoni
    book chapter: 1
    source: Liber Liber (http://www.liberliber.it/)
    total number of utterances:
    average number of phones per utterance:
front-end:
    standard: ETSI ES 202 050
    audio: mono
    features: 13 MFCC + Δ + ΔΔ
language model: 3-gram statistical model

In CMLLR, optimising the auxiliary function (7) with respect to W_c leads to the update formulae. It has been shown in [9] that the i-th row of W_c is given by

w_i = (α p_i + k_i) G_i^{−1},   (14)

where p_i is the extended cofactor row vector [0 c_i1 ... c_in] (i.e. c_ij = cof(A_ij)), and

G_i = Σ_{m=1}^{M} (1/σ²_{m,i}) Σ_{τ=1}^{T} γ_m(τ) ζ(τ) ζ(τ)^T,   (15)

k_i = Σ_{m=1}^{M} (μ_{m,i}/σ²_{m,i}) Σ_{τ=1}^{T} γ_m(τ) ζ(τ)^T.   (16)

Given the total occupancy

β = Σ_{m=1}^{M} Σ_{τ=1}^{T} γ_m(τ),   (17)

the coefficient α satisfies the following quadratic expression:

α² p_i G_i^{−1} p_i^T + α p_i G_i^{−1} k_i^T − β = 0.   (18)

This is a simple quadratic in α and may be solved in the usual way. The main cost in estimating W_c is due to the computation of cofactors: every row requires O(n³) operations, so the computational cost is of the order of O(n⁴) per iteration. This neglects the cost of inverting the G_i matrices, which only needs to be performed once. Unfortunately, the constrained case uses an indirect optimisation scheme, so the total cost becomes (I + 1) O(n⁴), where I is the total number of iterations. In reality, of course, when using incremental adaptation the new transform estimate is initialised with the previous one, thus dramatically reducing the required number of iterations. Furthermore, it is not strictly necessary to invert G_i, as an indirect optimisation over each row may be used.
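The CMLLR row update just described can be sketched as follows. This is our illustration of the general scheme, not the paper's code: the accumulators G_i and k_i, the extended cofactor row p_i and the occupancy β are synthetic, and the sketch only shows how one row is updated by solving the quadratic in α and picking the root that maximizes the row objective.

```python
# Sketch of a single CMLLR row update (our illustration of the update
# formulae): w_i = (alpha p_i + k_i) G_i^{-1}, with alpha a root of the
# quadratic  a*alpha^2 + b*alpha - beta = 0.
import numpy as np

def update_row(G, k, p, beta):
    """Return the row maximizing beta*log|w p^T| - (1/2) w G w^T + w k^T."""
    Ginv_p = np.linalg.solve(G, p)          # G is symmetric, so (a p + k)G^{-1}
    Ginv_k = np.linalg.solve(G, k)          # equals a*G^{-1}p + G^{-1}k
    a = p @ Ginv_p                          # alpha^2 coefficient
    b = p @ Ginv_k                          # alpha coefficient
    disc = np.sqrt(b * b + 4 * a * beta)
    cands = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    rows = [alpha * Ginv_p + Ginv_k for alpha in cands]

    def obj(w):                             # row objective: pick the best root
        return beta * np.log(abs(w @ p)) - 0.5 * (w @ G @ w) + w @ k
    return max(rows, key=obj)

rng = np.random.default_rng(2)
n = 3
Q = rng.standard_normal((n + 1, n + 1))
G = Q @ Q.T + (n + 1) * np.eye(n + 1)       # synthetic SPD accumulator
k = rng.standard_normal(n + 1)
p = np.concatenate([[0.0], rng.standard_normal(n)])  # extended cofactor row
beta = 50.0

w = update_row(G, k, p, beta)
alpha = beta / (w @ p)                      # recover alpha from stationarity
# alpha must satisfy the quadratic expression:
resid = (p @ np.linalg.solve(G, p)) * alpha**2 \
        + (p @ np.linalg.solve(G, k)) * alpha - beta
print(abs(resid))  # ~0 up to rounding
```

In a full implementation the cofactor row p_i depends on the current A_c, so the rows are swept repeatedly; this indirect, per-row optimisation is the reason for the (I + 1) O(n⁴) cost quoted above.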
Figure 2: Accuracy evaluation as a function of the number of iterations. Adaptation was carried out with the first test corpus utterance, while the remaining material was used for recognition purposes.

Figure 3: ICMLLR relative error and recognition accuracy as a function of the number of iterations. The test setup is the same as for Fig. 2.

5 Experimental results

In order to verify the effectiveness of the algorithm, experiments were conducted using the CMU Sphinx ASR system, together with an advanced ETSI ES 202 050 feature extractor. The setup used in the experiments is reported in Tab. 1. The SI baseline model was generated according to the method described in [10]. All the experiments reported in this section were conducted using the first chapter of a long audiobook in Italian, whose audio and text transcriptions are freely available. In the first experiment, the iteration process was initialized by adapting the model on the first utterance of the test corpus, while the remaining utterances were left available for recognition purposes; each utterance lasted a few seconds on average. After initialization, the transform W_c = [b_c A_c] was estimated according to the iteration process depicted in Fig. 1, and the matrix W_c was then used to transform the features as shown in equation (8). The estimation accuracy of ICMLLR was evaluated by performing several recognition tests and comparing the results with those obtained by MLLR and conventional CMLLR. The behavior reported in Fig. 2 shows that the word error rate (WER), defined as the ratio of wrongly recognized or missing words to the total words in the text, decreases for both constrained algorithms as the number of iterations increases, gradually approaching the reference MLLR accuracy.
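The WER figure of merit can be computed with a standard word-level edit distance. The following helper is our sketch (the paper does not specify its scoring tool); it counts substitutions, deletions and insertions against the reference transcript, and the example words are from the test corpus's opening sentence.

```python
# Word error rate via a word-level edit distance (our sketch, not the
# paper's scoring tool).
def wer(reference, hypothesis):
    """WER: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance, one rolling row
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match, cost 0)
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)

# one substitution (ramo -> remo) and one deletion (di): 2 errors / 6 words
print(wer("quel ramo del lago di como", "quel remo del lago como"))
```

Note that this standard definition also charges insertions, which the looser wording above ("wrongly recognized or missing words") does not mention explicitly.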
It must also be noted that ICMLLR consistently behaves better than CMLLR. In addition, as a means to evaluate convergence, the behavior of the relative error as a function of the number of iterations can be derived as well. The relative error for the iterative estimation of matrix W at step k is defined as

e_k [%] = 100 · ‖W_k − W_{k−1}‖ / ‖W_k‖   (19)

and can be used to define a stopping criterion. Fig. 3 shows that the relative error decreases remarkably as the number of iterations increases, following the error-rate trend.

A second experiment was performed to assess whether the recognition accuracy improved when an increasing number of utterances was used as adaptation data. All the results were compared with those obtained with conventional CMLLR, setting the number of iterations to the same value for both algorithms. As can be seen from Fig. 4, the ICMLLR approach shows better performance in terms of word error rate, and thus gives a better adaptation. The algorithm was also tested in an incremental online adaptation framework, collecting data during the whole recognition task. This setup is commonly used when the transcription is not available, as in spoken dialog applications: as soon as a new incoming utterance becomes available, a new transformation is applied to the previously adapted model. As with any other online MLLR adaptation approach, and as confirmed by the results in Fig. 5, in this case ICMLLR was not stable and the model needed to be restored periodically, requiring an objective comparison with the baseline [3]. Nevertheless, even if only for a limited number of updates, the accuracy kept improving.

Figure 4: Accuracy comparison between ICMLLR and CMLLR. The number of iterations was fixed to the same value for both algorithms, while letting the adaptation data increase one utterance at a time.

Figure 5: ICMLLR accuracy performance in an incremental adaptation framework. The model was updated each time a new utterance became available. The large square dots depict the WER when a restore operation was performed.

6 Conclusion

This work presents an algorithm performing speaker adaptation in the feature domain, particularly suitable to be employed on the client side of a DSR scheme. Starting from the general MLLR auxiliary function, the proposed technique is able to implement the constrained transformation on an iterative basis. Several adaptation tests on a preliminarily trained SI baseline model led to notable recognition performance already after very few iterations, thus demonstrating the effectiveness of the approach. Comparisons with the widely used conventional constrained MLLR showed that ICMLLR improves the convergence rate, and thus the overall computational complexity, while maintaining a slight benefit in terms of word error rate even with a small amount of adaptation data. The proposed method was also tested in an on-line adaptation scenario, giving promising though quite preliminary results. Further work needs to be done in this area in order to improve the long-term stability of the system.

References

[1] D. Povey and K. Yao, A basis method for robust estimation of constrained MLLR, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), Prague, Czech Republic, May 2011.

[2] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, 77(2), 1989, 257-286.

[3] Y. Li, H. Erdogan, Y. Gao, and E. Marcheret, Incremental on-line feature space MLLR adaptation for telephony speech recognition, Proc. 7th International Conference on Spoken Language Processing (ICSLP 2002 - Interspeech), Denver, Colorado, 2002.

[4] C. Leggetter and P. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, 9(2), 1995, 171-185.

[5] M. Gales and P. Woodland, Mean and variance adaptation within the MLLR framework, Computer Speech and Language, 10(4), 1996, 249-264.

[6] V. Digalakis, D. Rtischev, and L. Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures, IEEE Transactions on Speech and Audio Processing, 3(5), 1995, 357-366.

[7] M. Ferras, C. C. Leung, C. Barras, and J.-L. Gauvain, Constrained MLLR for speaker recognition, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), 2007.

[8] M. Ferras, C.-C. Leung, C. Barras, and J.-L. Gauvain, Comparison of speaker adaptation methods as feature extraction for SVM-based speaker recognition, IEEE Transactions on Audio, Speech, and Language Processing, 18(6), 2010.

[9] M. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, 12(2), 1998, 75-98.

[10] M. Alessandrini, G. Biagetti, A. Curzi, and C. Turchetti, Semi-automatic acoustic model generation from large unsynchronized audio and text chunks, Proc. 12th Annual Conference of the International Speech Communication Association (Interspeech 2011), Florence, Italy, 2011.
Figure 6: Comparison of the MFCC features and the same features after adaptation with the conventional CMLLR algorithm. Above the graphs is shown the aligned phonetic transcription of the utterance under consideration.

Figure 7: Comparison of the MFCC features and the same features after adaptation with the proposed ICMLLR algorithm. Above the graphs is shown the aligned phonetic transcription of the utterance under consideration.