Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition

1 Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition
Wolfgang Macherey, Lars Haferkamp, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik VI, Computer Science Department, RWTH Aachen University, Germany
[…], Lisbon
Error Minimizing Training Criteria 1

2 Contents
1. Introduction
2. Overview of Discriminative Training Criteria
3. Minimum Classification Error Training for LVCSR
4. Comparative Experiments
5. Conclusions

3 Introduction
- Discriminative training criteria such as Maximum Mutual Information (MMI) are by now established for large-scale speech recognition tasks [Woodland 2002].
- Recently: special interest in error-minimizing training criteria.
- Types of criteria in general:
  - Criteria aimed at optimal distribution estimation (e.g. ML, MMI).
  - Criteria representing the expectation of the error rate taken on the training data, e.g. the Minimum Word Error (MWE) criterion and the Minimum Phone Error (MPE) criterion. These significantly outperform MMI on many tasks [Povey 2002, Povey 2004].
  - Criteria representing the (smoothed) empirical error rate on the training data, e.g. Minimum Classification Error (MCE). MCE aims at minimizing the smoothed sentence error on the training data and gives consistently better results than MMI on small-vocabulary tasks.

4 Discriminative Criteria
Maximum Mutual Information (MMI) criterion:
$$F_{\mathrm{MMI}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p_\theta(X_r \mid W_r)\, p(W_r)}{\sum_{W} p_\theta(X_r \mid W)\, p(W)}$$
- competing model includes the correct class
- sensitive to outliers
Minimum Classification Error (MCE) criterion:
$$F_{\mathrm{MCE}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{1 + \left[ \dfrac{p_\theta^{\alpha}(X_r \mid W_r)\, p^{\alpha}(W_r)}{\sum_{W \neq W_r} p_\theta^{\alpha}(X_r \mid W)\, p^{\alpha}(W)} \right]^{2\varrho}}$$
- competing model excludes the correct class
- approximates the sentence error rate on the training data
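
To make the difference between the two competing models concrete, the following is a minimal sketch that evaluates the per-utterance MMI and MCE terms on a toy N-best list. It assumes the usual sigmoid form of the MCE term, 1 / (1 + ratio^(2ϱ)); the hypothesis strings, the log scores and the function names are hypothetical and not taken from the slides.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mmi_term(joint_log_scores, ref):
    """log p(W_r | X_r): the competing sum includes the reference W_r."""
    return joint_log_scores[ref] - log_sum_exp(list(joint_log_scores.values()))


def mce_term(joint_log_scores, ref, alpha=1.0, rho=0.5):
    """Smoothed sentence error: the competing sum excludes the reference W_r."""
    competitors = [alpha * s for w, s in joint_log_scores.items() if w != ref]
    z = alpha * joint_log_scores[ref] - log_sum_exp(competitors)
    return 1.0 / (1.0 + math.exp(2.0 * rho * z))


# Hypothetical joint log scores log p(X_r | W) + log p(W) for three hypotheses.
scores = {"a b c": -10.0, "a b b": -11.0, "a d c": -13.0}
ref = "a b c"

posterior = math.exp(mmi_term(scores, ref))
error = mce_term(scores, ref, alpha=1.0, rho=0.5)
```

With α = 1 and ϱ = 1/2 the MCE term reduces to one minus the sentence posterior, which is a quick sanity check on both functions. For very large score differences `math.exp` can overflow, so a production version would clamp the exponent.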

5 Discriminative Criteria
Minimum Word Error (MWE) criterion:
$$F_{\mathrm{MWE}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \frac{\sum_{W} p_\theta(X_r \mid W)\, p(W)\, A(W, W_r)}{\sum_{W} p_\theta(X_r \mid W)\, p(W)}$$
- criterion: expectation of an approximation of the word accuracy on the training data
- A(W, W_r) approximates the raw accuracy of hypothesis W
- instead of a Levenshtein alignment, A(W, W_r) is defined locally on the word level via the time overlap t(w, w_r):
$$A(w, w_r) = \begin{cases} t(w, w_r) & \text{if } w = w_r \\ -1 + t(w, w_r) & \text{if } w \neq w_r \end{cases}$$
[Figure: reference word sequence "a b c" and hypothesis words a, b, b, d aligned over time, annotated with the time overlap t]
Minimum Phone Error (MPE) criterion: replace the word accuracy by a phone accuracy measure.
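
The local accuracy and the resulting MWE term can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the overlap t(w, w_r) is taken relative to the duration of the reference word, the score of a hypothesis word is the maximum over all reference words (as in Povey's MPE formulation), and the wrong-word case is taken to be -1 + t. All names and time stamps are hypothetical.

```python
def overlap_fraction(hyp_span, ref_span):
    """Fraction of the reference word's duration covered by the hypothesis word."""
    (h_start, h_end), (r_start, r_end) = hyp_span, ref_span
    intersection = max(0.0, min(h_end, r_end) - max(h_start, r_start))
    return intersection / (r_end - r_start)


def local_accuracy(hyp_word, ref_words):
    """A(w, w_r) for one hypothesis word, maximised over reference words (assumption)."""
    label, start, end = hyp_word
    best = -1.0  # lowest possible value: wrong word with zero overlap
    for ref_label, r_start, r_end in ref_words:
        t = overlap_fraction((start, end), (r_start, r_end))
        score = t if label == ref_label else -1.0 + t
        best = max(best, score)
    return best


def sentence_accuracy(hyp_words, ref_words):
    """A(W, W_r) as the sum of local word accuracies."""
    return sum(local_accuracy(w, ref_words) for w in hyp_words)


def mwe_term(weighted_hyps, ref_words):
    """Expected accuracy: sum_W p(X,W) A(W,W_r) / sum_W p(X,W)."""
    total = sum(p for p, _ in weighted_hyps)
    return sum(p * sentence_accuracy(ws, ref_words) for p, ws in weighted_hyps) / total


# Hypothetical (word, start, end) triples.
reference = [("a", 0.0, 3.0), ("b", 3.0, 6.0), ("c", 6.0, 9.0)]
hyp_correct = [("a", 0.0, 3.0), ("b", 3.0, 6.0), ("c", 6.0, 9.0)]
hyp_subst = [("a", 0.0, 3.0), ("b", 3.0, 6.0), ("b", 6.0, 9.0)]

expected = mwe_term([(0.6, hyp_correct), (0.4, hyp_subst)], reference)
```

In this toy case the fully overlapping substitution scores 0, so the second hypothesis has accuracy 2 and the expected accuracy lies between 2 and 3. Because each score depends only on one hypothesis word, the same quantity can be accumulated arc by arc in a lattice.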

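6 Extended Unifying Approach
$$F\bigl(\theta; f, \alpha, G, \{M_r\}\bigr) = \frac{1}{R} \sum_{r=1}^{R} f\left( \log \left[ \frac{\sum_{W} p_\theta^{\alpha}(X_r \mid W)\, p^{\alpha}(W)\, G(W, W_r)}{\sum_{W \in M_r} p_\theta^{\alpha}(X_r \mid W)\, p^{\alpha}(W)} \right]^{1/\alpha} \right)$$
| criterion | smoothing function f(z) | alternative word sequences M_r | exponent α | gain function G(W, W_r) |
| ML | $z$ | - | | $\delta(W, W_r)$ |
| MMI | $z$ | all (recognized) | 1 | $\delta(W, W_r)$ |
| CT | $z$ | best (recognized) | | $\delta(W, W_r)$ |
| MCE | $-\frac{1}{1 + e^{2\varrho z}}$ | all without $W_r$ | free | $\delta(W, W_r)$ |
| FT | $-\frac{1}{1 + e^{2\varrho z}}$ | best (recognized) without $W_r$ | | $\delta(W, W_r)$ |
| Diversity index | $-\frac{1}{\beta}\left(1 - e^{\beta z}\right)$ | all (recognized) | free | $\delta(W, W_r)$ |
| Jeffreys | [illegible] | all (recognized) | 1 | $\delta(W, W_r)$ |
| MWE/MPE | $\exp(z)$ | all (recognized) | 1 | $A(W, W_r)$ |
All criteria in this table are written for maximization, so the smoothing functions of MCE, FT and the diversity index carry a minus sign.
Properties of the diversity index:
- equals MCE with ϱ = 1/2 for β = 1
- equals MMI for β → 0
In the case of MPE, A(W, W_r) gives the phone accuracy.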
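
The two stated properties of the diversity index can be checked numerically. The sketch below assumes the maximization convention, i.e. f(z) = -1 / (1 + e^(2ϱz)) for MCE and f(z) = -(1/β)(1 - e^(βz)) for the diversity index, and works on joint probabilities rather than log scores for readability. The function names and the toy probabilities are hypothetical.

```python
import math


def z_value(joint, ref, m_r, alpha=1.0, gain=None):
    """Argument of the smoothing function: (1/alpha) * log of the gain-weighted ratio."""
    if gain is None:
        # Kronecker delta gain: only the spoken word sequence counts.
        gain = lambda w: 1.0 if w == ref else 0.0
    numerator = sum((joint[w] ** alpha) * gain(w) for w in joint)
    denominator = sum(joint[w] ** alpha for w in m_r)
    return math.log(numerator / denominator) / alpha


def f_mmi(z):
    return z


def f_mce(z, rho):
    return -1.0 / (1.0 + math.exp(2.0 * rho * z))


def f_diversity(z, beta):
    # -(1/beta) * (1 - exp(beta * z)), written with expm1 for small beta.
    return math.expm1(beta * z) / beta


# Hypothetical joint probabilities p(X_r | W) p(W) for one utterance.
joint = {"ref": 0.6, "hyp1": 0.3, "hyp2": 0.1}

z_all = z_value(joint, "ref", list(joint))          # M_r = all sequences
z_excl = z_value(joint, "ref", ["hyp1", "hyp2"])    # M_r = all without W_r

mce_half = f_mce(z_excl, rho=0.5)
div_one = f_diversity(z_all, beta=1.0)
div_small = f_diversity(z_all, beta=1e-8)
```

Here `mce_half` and `div_one` both evaluate to the negative of one minus the sentence posterior, and `div_small` approaches the MMI value `f_mmi(z_all)`, in line with the two properties listed on the slide.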

7 MCE Criterion
- Few publications investigate the use of MCE on large vocabulary tasks.
- Reason: MCE requires the exclusion of the correct class from the set of competing classes.
- ASR: exclusion of the spoken word sequence from the set of all possible word sequences.
- Difficult if the set of competing word sequences is encoded as a word lattice:
  - The lattice may contain multiple alignments and pronunciation variants of the spoken utterance.
  - Arcs may not be uniquely assignable to correct or competing sentences without changing the lattice structure.
- Remedies:
  - Use N-best lists. Problem: coverage is considerably reduced.
  - Use finite state machines to restructure the training lattices. Problem: in general, the lattice density increases.
  - Exclusion after computing statistics: this work. Advantage: all statistics needed can be extracted from the original training lattices.

8 MCE Optimization
- Optimization of the MCE criterion leads to expressions containing word-posterior-like weights.
- But: the corresponding summations exclude the spoken word sequence:
$$q_{[t_b, t_e]}(w \mid X_r) = \frac{\sum_{W \in M_r:\; W \neq W_r,\; w_{[t_b, t_e]} \in W} p_\lambda^{\alpha}(X_r, W)}{\sum_{V \in M_r:\; V \neq W_r} p_\lambda^{\alpha}(X_r, V)}$$
- Goal: efficient lattice-based computation of the MCE weights $q_{[t_b, t_e]}(w \mid X_r)$.
- Idea: exclude the partial sum over the spoken word sequence numerically.

9 Efficient Lattice-Based MCE Algorithm
Efficient computation of MCE weights for training:
$$q_{[t_b, t_e]}(w \mid X_r) = \frac{\sum_{W \in M_r:\; W \neq W_r,\; w_{[t_b, t_e]} \in W} p_\lambda^{\alpha}(X_r, W)}{\sum_{V \in M_r:\; V \neq W_r} p_\lambda^{\alpha}(X_r, V)} = \frac{\sum_{W \in M_r:\; w_{[t_b, t_e]} \in W} p_\lambda^{\alpha}(X_r, W) \;-\; \sum_{W \in M_r:\; W = W_r,\; w_{[t_b, t_e]} \in W} p_\lambda^{\alpha}(X_r, W)}{\sum_{V \in M_r} p_\lambda^{\alpha}(X_r, V) \;-\; \sum_{V \in M_r:\; V = W_r} p_\lambda^{\alpha}(X_r, V)}$$
Algorithm:
1. Label all alignments of the spoken word sequence in the denominator lattice. The corresponding sublattice is equivalent to the numerator lattice.
2. Compute arc posteriors in the numerator and denominator lattices using the forward-backward algorithm (similar to MMI).
3. For the labeled arcs, subtract the corresponding numerator arc posteriors from the arc posteriors of the denominator lattice.
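
The three steps can be illustrated on a lattice small enough to enumerate every path. The sketch below is an assumption-laden toy, not the authors' implementation: brute-force enumeration stands in for the forward-backward algorithm, and because numerator and denominator posteriors are normalised by different totals, the numerator posterior is rescaled by the posterior mass of the spoken word sequence before subtracting. Paths, arcs, scores and function names are hypothetical.

```python
from collections import defaultdict


def arc_posteriors(paths, alpha=1.0):
    """Brute-force stand-in for forward-backward: arc posteriors over the given paths."""
    total = sum(score ** alpha for _, _, score in paths)
    post = defaultdict(float)
    for _, arcs, score in paths:
        for arc in set(arcs):
            post[arc] += (score ** alpha) / total
    return dict(post), total


def mce_weights(paths, ref, alpha=1.0):
    """MCE weights via subtraction; requires at least one competing path."""
    den_post, den_total = arc_posteriors(paths, alpha)
    # Step 1: label all alignments of the spoken word sequence (numerator sublattice).
    num_paths = [p for p in paths if p[0] == ref]
    # Step 2: posteriors on the numerator sublattice.
    num_post, num_total = arc_posteriors(num_paths, alpha)
    ref_mass = num_total / den_total  # posterior of the spoken word sequence
    # Step 3: subtract the rescaled numerator posteriors and renormalise.
    return {
        arc: (den_post[arc] - ref_mass * num_post.get(arc, 0.0)) / (1.0 - ref_mass)
        for arc in den_post
    }


def mce_weights_direct(paths, ref, alpha=1.0):
    """Reference computation: posteriors over competing paths only."""
    competitors = [p for p in paths if p[0] != ref]
    post, _ = arc_posteriors(competitors, alpha)
    return post


# Hypothetical paths: (word sequence, arcs as (word, t_b, t_e), joint score).
paths = [
    ("a b", (("a", 0, 4), ("b", 4, 10)), 0.4),  # first alignment of the reference
    ("a b", (("a", 0, 5), ("b", 5, 10)), 0.2),  # second alignment of the reference
    ("a c", (("a", 0, 4), ("c", 4, 10)), 0.3),
    ("d b", (("d", 0, 5), ("b", 5, 10)), 0.1),
]

weights = mce_weights(paths, "a b")
direct = mce_weights_direct(paths, "a b")
```

Arcs that occur only in alignments of the spoken word sequence end up with weight zero, while arcs shared with competitors, such as `("a", 0, 4)`, keep only the competitors' share. No lattice restructuring is needed, which is the point made on the slide.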

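10 Experiments on Wall Street Journal (WSJ0) Task
Initial ML-trained acoustic models for WSJ0:
- 16 cepstral coefficients
- […] ms frame shift
- LDA (±2 frames, […])
- 2000 generalized triphone states + 1 silence state
- 6-state HMM
- within-word triphone models
- gender-independent Gaussian mixtures
- 1 pooled variance, 149k Gaussian densities
Corpus statistics (WSJ0):
| | train | dev | eval |
| acoustic data [h] | 15:17 | 0:46 | 0:40 |
| # speakers | […] | […] | […] |
| # sentences | […] | […] | […] |
| # running words | […] | […] | […] |
| # lexicon words | […] | […] | […] |
Recognition results (ARPA WSJ0 Nov. '92):
| criterion | dev WER | dev SER | eval WER | eval SER |
| ML | […] | […] | […] | […] |
| MMI | […] | […] | […] | […] |
| MCE | […] | […] | […] | […] |
| MWE | […] | […] | […] | […] |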

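11 Experiments on North American Business (NAB) Task
Initial ML-trained acoustic models for WSJ0+1:
- 16 cepstral coefficients
- […] ms frame shift
- LDA (±1 frames, 99 → 32)
- 7000 generalized triphone states + 1 silence state
- 6-state HMM
- across-word triphone models
- gender-independent Gaussian mixtures
- pooled variance, 412k Gaussian densities
Corpus statistics:
| | WSJ0+1 train | NAB Nov. '94 dev | NAB Nov. '94 eval |
| acoustic data [h] | 81:23 | 0:48 | 0:53 |
| # speakers | […] | […] | […] |
| # sentences | […] | […] | […] |
| # running words | […] | […] | […] |
| # lexicon words | […] | […] | […] |
Recognition results (NAB Nov. '94):
| criterion | NAB-20k dev WER | NAB-20k dev SER | NAB-20k eval WER | NAB-20k eval SER | NAB-65k dev WER | NAB-65k dev SER | NAB-65k eval WER | NAB-65k eval SER |
| ML | […] | […] | […] | […] | […] | […] | […] | […] |
| MMI | […] | […] | […] | […] | […] | […] | […] | […] |
| MCE | […] | […] | […] | […] | […] | […] | […] | […] |
| MWE | […] | […] | […] | […] | […] | […] | […] | […] |
| MPE | […] | […] | […] | […] | […] | […] | […] | […] |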

12 Conclusions
- Minimum Classification Error (MCE) training applied to large vocabulary speech recognition.
- Efficient lattice-based computation of MCE training statistics.
- No need for N-best lists or restructuring of word lattices.
- Common representation and performance comparison of:
  - Maximum Mutual Information (MMI) criterion,
  - Minimum Classification Error (MCE) criterion,
  - Minimum Word/Phone Error (MWE/MPE) criterion.
- MCE showed the same performance gains as MWE.
Acknowledgments
This work was partially funded by the European Union under the integrated project TC-STAR (Technology and Corpora for Speech to Speech Translation), IST-2002-FP[…].

13 References
D. Povey, P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Orlando, FL, May 2002, pp. […].
D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Dept. of Eng., Cambridge Univ., Cambridge, August […].
B.-H. Juang, S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, December 1992, pp. […].
W. Chou, C.-H. Lee, B.-H. Juang, "Minimum error rate training based on N-best string models," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, Minneapolis, MN, USA, April 1993, pp. […].
R. Schlüter, W. Macherey, "Comparison of discriminative training criteria," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Seattle, WA, May 1998, pp. […].
R. Schlüter, W. Macherey, B. Müller, H. Ney, "Comparison of discriminative training criteria and optimization methods for speech recognition," Speech Communication, vol. 34, no. 1, pp. […], May […].
E. McDermott, T. J. Hazen, "Minimum classification error training of landmark models for real-time continuous speech recognition," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Montreal, Canada, May 2004, pp. […].

14 References (cont'd)
E. McDermott, S. Katagiri, "Minimum classification error for large scale speech recognition tasks using weighted finite state transducers," in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Philadelphia, PA, March 2005, pp. […].
K. K. Paliwal, M. Bacchiani, Y. Sagisaka, "Minimum classification error training algorithm for feature extractor and pattern classifier in speech recognition," in Europ. Conf. on Speech Communication and Technology, vol. 1, Madrid, Spain, September 1995, pp. […].
W. Macherey, "Implementation and comparison of discriminative training methods for automatic speech recognition," Diploma Thesis, Lehrstuhl für Informatik VI, RWTH Aachen University, Aachen, November […].
D. S. Pallett, J. G. Fiscus, W. M. Fisher, J. S. Garofolo, B. A. Lund, M. A. Przybocki, "1994 Benchmark test for the ARPA spoken language program," in ARPA Human Language Technology Workshop, Austin, TX, January 1995, pp. […].
F. Kubala, "Design of the 1994 CSR benchmark tests," in ARPA Human Language Technology Workshop, Austin, TX, January 1995, pp. […].
W. Macherey, R. Schlüter, H. Ney, "Discriminative training with tied covariance matrices," in 8th Int. Conf. on Spoken Language Processing, vol. 1, Jeju Island, Korea, October 2004, pp. […].