Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition




Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition

Wolfgang Macherey, Lars Haferkamp, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik VI,
Computer Science Department, RWTH Aachen University, Germany

Interspeech 2005, Lisbon

Contents

1. Introduction
2. Overview of Discriminative Training Criteria
3. Minimum Classification Error Training for LVCSR
4. Comparative Experiments
5. Conclusions

Introduction

Discriminative training criteria like Maximum Mutual Information (MMI) are by now established for large-scale speech recognition tasks [Woodland 2002].
Recently: special interest in error minimizing training criteria.

Types of criteria in general:
- Criteria aimed at optimal distribution estimation, e.g. ML and MMI.
- Criteria representing the expectation of the error rate on the training data, e.g. the Minimum Word Error (MWE) and Minimum Phone Error (MPE) criteria; these significantly outperform MMI on many tasks [Povey 2002, Povey 2004].
- Criteria representing the (smoothed) empirical error rate on the training data, e.g. Minimum Classification Error (MCE), which aims at minimizing the smoothed sentence error on the training data; it gives consistently better results than MMI on small vocabulary tasks.

Discriminative Criteria

Maximum Mutual Information (MMI) criterion:

F_{\mathrm{MMI}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r)
                         = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p_\theta(X_r \mid W_r)\, p(W_r)}{\sum_{W} p_\theta(X_r \mid W)\, p(W)}

- competing model includes the correct class
- sensitive to outliers

Minimum Classification Error (MCE) criterion:

F_{\mathrm{MCE}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \left[ 1 + \left( \frac{p_\theta^\alpha(X_r \mid W_r)\, p^\alpha(W_r)}{\sum_{W \neq W_r} p_\theta^\alpha(X_r \mid W)\, p^\alpha(W)} \right)^{2\varrho} \right]^{-1}

- competing model excludes the correct class
- approximates the sentence error rate on the training data
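The two criteria differ only in whether the sum over competing hypotheses includes the spoken word sequence W_r. The following minimal sketch (not the authors' implementation; function names and toy scores are illustrative) evaluates both terms for a single utterance on an N-best list; the full criteria average these terms over all R training utterances.

```python
# Minimal sketch (not the authors' code): MMI and MCE terms of one training
# utterance r on an N-best list. `hyps` maps a word sequence W to its joint
# log score log[ p_theta(X_r|W) p(W) ]; `ref` is the spoken sequence W_r.
import math

def mmi_term(hyps, ref):
    """log p_theta(W_r | X_r): the competing sum includes the correct class."""
    log_denom = math.log(sum(math.exp(s) for s in hyps.values()))
    return hyps[ref] - log_denom

def mce_term(hyps, ref, alpha=1.0, rho=1.0):
    """Smoothed 0/1 sentence error: the competing sum excludes the correct
    class (use log-sum-exp in practice for numerical stability)."""
    num = math.exp(alpha * hyps[ref])
    den = sum(math.exp(alpha * s) for W, s in hyps.items() if W != ref)
    return 1.0 / (1.0 + (num / den) ** (2.0 * rho))

# Toy N-best list: the correct sequence scores best, so the smoothed
# sentence error is close to 0 and the log posterior is close to 0.
hyps = {("a", "b"): -10.0, ("a", "c"): -14.0, ("b", "b"): -15.0}
print(mmi_term(hyps, ("a", "b")), mce_term(hyps, ("a", "b")))
```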

Discriminative Criteria

Minimum Word Error (MWE) criterion:

F_{\mathrm{MWE}}(\theta) = \frac{1}{R} \sum_{r=1}^{R} \frac{\sum_{W} p_\theta(X_r \mid W)\, p(W)\, A(W, W_r)}{\sum_{W'} p_\theta(X_r \mid W')\, p(W')}

- criterion: expectation of an approximation of the word accuracy on the training data
- A(W, W_r) approximates the raw accuracy of hypothesis W
- instead of a Levenshtein alignment, A(W, W_r) is defined locally on the word level via the relative time overlap t(w, w_r) between a hypothesis word w and a reference word w_r:

A(w, w_r) = \begin{cases} -1 + 2\, t(w, w_r) & \text{if } w = w_r \\ -1 + t(w, w_r) & \text{if } w \neq w_r \end{cases}

[Figure: reference "a b c" aligned against hypothesis "a b b d" on a time axis, with per-word time overlaps t of 1.0, 0.8, 0.2, 1/3, and 2/3.]

Minimum Phone Error (MPE) criterion: replace the word accuracy by a phone accuracy measure.
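Because A(W, W_r) decomposes over hypothesis words, it needs no alignment search. Below is a small sketch of the local accuracy, following Povey's approximation of scoring each hypothesis word against its best-matching reference word; the interval representation and helper names are assumptions made for illustration.

```python
# Sketch of the local accuracy A(w, w_r) used by MWE (a phone-level variant
# gives MPE). The interval format (word, start, end) is an assumption.

def overlap(h_start, h_end, r_start, r_end):
    """Relative time overlap t: the fraction of the reference word's
    duration that is covered by the hypothesis word."""
    joint = max(0.0, min(h_end, r_end) - max(h_start, r_start))
    return joint / (r_end - r_start)

def local_accuracy(hyp_word, h_start, h_end, reference):
    """A(w, w_r): best score of the hypothesis word against any
    overlapping reference word (word, start, end)."""
    best = float("-inf")
    for ref_word, r_start, r_end in reference:
        t = overlap(h_start, h_end, r_start, r_end)
        score = (-1.0 + 2.0 * t) if hyp_word == ref_word else (-1.0 + t)
        best = max(best, score)
    return best

reference = [("a", 0.0, 1.0), ("b", 1.0, 2.0), ("c", 2.0, 3.0)]
print(local_accuracy("b", 0.8, 2.2, reference))  # 1.0: right word, full overlap
```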

Extended Unifying Approach

F(\theta; f, \alpha, G, \{M_r\}) = \frac{1}{R} \sum_{r=1}^{R} f\!\left( \frac{1}{\alpha} \log \frac{\sum_{W} p_\theta^\alpha(X_r \mid W)\, p^\alpha(W)\, G(W, W_r)}{\sum_{W \in M_r} p_\theta^\alpha(X_r \mid W)\, p^\alpha(W)} \right)

with smoothing function f, set of alternative word sequences M_r, exponent α, and gain function G.

criterion  | f(z)                | M_r                      | α    | G(W, W_r)
ML         | z                   | –                        | –    | –
MMI        | z                   | all (recognized)         | 1    | δ(W, W_r)
CT         | z                   | best (recognized)        | 1    | δ(W, W_r)
MCE        | [1 + e^{2ϱz}]^{-1}  | all without W_r          | free | δ(W, W_r)
FT         | [1 + e^{2ϱz}]^{-1}  | best (recognized) ≠ W_r  | free | δ(W, W_r)
Diversity  | β^{-1}(1 − e^{βz})  | all (recognized)         | free | δ(W, W_r)
Jeffreys   | (1 − e^z) z         | all (recognized)         | 1    | δ(W, W_r)
MWE/MPE    | exp(z)              | all (recognized)         | 1    | A(W, W_r)

Properties of the diversity index:
- equals MCE with ϱ = 1/2 for β = 1
- equals MMI for β → 0
In the case of MPE, A(W, W_r) gives the phone accuracy.
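Read as a recipe, the table says that fixing f, α, G, and M_r instantiates a concrete criterion. The sketch below (assumed helper names, N-best approximation of the lattice M_r) illustrates this plug-in structure for MMI and MCE; MWE follows by choosing f(z) = exp(z) and G(W, W_r) = A(W, W_r).

```python
# Sketch of the extended unifying approach on an N-best approximation of M_r.
# `hyps` maps a word sequence W to its joint log score; `ref` is W_r.
import math

def unified_term(hyps, ref, f, alpha, gain, competing=lambda W, ref: True):
    num = sum(math.exp(alpha * s) * gain(W, ref) for W, s in hyps.items())
    den = sum(math.exp(alpha * s) for W, s in hyps.items() if competing(W, ref))
    return f(math.log(num / den) / alpha)

delta = lambda W, ref: 1.0 if W == ref else 0.0   # gain of MMI, CT, MCE, FT

def mmi(hyps, ref):
    # f(z) = z, alpha = 1, M_r = all hypotheses, G = delta
    return unified_term(hyps, ref, f=lambda z: z, alpha=1.0, gain=delta)

def mce(hyps, ref, alpha=1.0, rho=1.0):
    # competing set excludes W_r; the factor alpha in f makes this agree
    # with the MCE formula given earlier
    f = lambda z: 1.0 / (1.0 + math.exp(2.0 * rho * alpha * z))
    return unified_term(hyps, ref, f, alpha, gain=delta,
                        competing=lambda W, ref: W != ref)
```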

MCE Criterion

Few publications investigate the use of MCE on large vocabulary tasks.
Reason: MCE requires the exclusion of the correct class from the set of competing classes; in ASR, this means excluding the spoken word sequence from the set of all possible word sequences.
This is difficult if the set of competing word sequences is encoded as a word lattice:
- The lattice may contain multiple alignments and pronunciation variants of the spoken utterance.
- Arcs cannot be uniquely assigned to correct or competing sentences without changing the lattice structure.
Remedies:
- Use N-best lists. Problem: the coverage is considerably reduced.
- Use finite state machines to restructure the training lattices. Problem: in general, the lattice density increases.
- Exclude the correct class after computing the statistics: this work. Advantage: all required statistics can be extracted from the original training lattices.

MCE Optimization

Optimizing the MCE criterion leads to expressions containing word-posterior-like weights, but the corresponding summations exclude the spoken word sequence:

q_{[t_b, t_e]}(w \mid X_r) = \frac{\sum_{W \in M_r:\, W \neq W_r,\, w_{[t_b, t_e]} \in W} p_\lambda^\alpha(X_r, W)}{\sum_{V \in M_r:\, V \neq W_r} p_\lambda^\alpha(X_r, V)}

Goal: an efficient lattice-based computation of the MCE weights q_{[t_b, t_e]}(w | X_r).
Idea: exclude the partial sum over the spoken word sequence numerically.

Efficient Lattice-Based MCE Algorithm

Efficient computation of the MCE weights for training. The weights can be decomposed into full-lattice sums minus the contributions of the spoken word sequence:

q_{[t_b, t_e]}(w \mid X_r) = \frac{\sum_{W \in M_r:\, W \neq W_r,\, w_{[t_b, t_e]} \in W} p_\lambda^\alpha(X_r, W)}{\sum_{V \in M_r:\, V \neq W_r} p_\lambda^\alpha(X_r, V)}
= \frac{\sum_{W \in M_r:\, w_{[t_b, t_e]} \in W} p_\lambda^\alpha(X_r, W) \;-\; \sum_{W \in M_r:\, W = W_r,\, w_{[t_b, t_e]} \in W} p_\lambda^\alpha(X_r, W)}{\sum_{V \in M_r} p_\lambda^\alpha(X_r, V) \;-\; \sum_{V \in M_r:\, V = W_r} p_\lambda^\alpha(X_r, V)}

Algorithm:
1. Label all alignments of the spoken word sequence in the denominator lattice. The corresponding sublattice is equivalent to the numerator lattice.
2. Compute the arc posteriors in the numerator and denominator lattices using the forward-backward algorithm (as for MMI).
3. Subtract the corresponding numerator arc posteriors from the posteriors of the labeled arcs in the denominator lattice.
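A compact sketch of steps 1 to 3 (assumed lattice representation, not the RWTH implementation): since the labeled correct-sequence arcs form the numerator sublattice, one forward-backward pass over each lattice followed by a subtraction and a renormalization by the competing-path mass yields the MCE weights, exactly as in the decomposition above.

```python
# Assumed data layout: an acyclic word lattice with topologically ordered
# integer nodes; an arc is a unique tuple (from_node, to_node, log_score).
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logadd(a, b):
    if a == NEG_INF: return b
    if b == NEG_INF: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def arc_path_logmass(arcs, start, final):
    """Forward-backward over the lattice: log mass of all paths through
    each arc, and the log mass of all start->final paths."""
    fwd = defaultdict(lambda: NEG_INF); fwd[start] = 0.0
    bwd = defaultdict(lambda: NEG_INF); bwd[final] = 0.0
    for u, v, s in sorted(arcs, key=lambda a: a[0]):                # forward
        fwd[v] = logadd(fwd[v], fwd[u] + s)
    for u, v, s in sorted(arcs, key=lambda a: a[1], reverse=True):  # backward
        bwd[u] = logadd(bwd[u], bwd[v] + s)
    return {(u, v, s): fwd[u] + s + bwd[v] for u, v, s in arcs}, fwd[final]

def mce_arc_weights(arcs, correct_arcs, start, final):
    """correct_arcs: the arcs labeled as alignments of the spoken word
    sequence (step 1); they form the numerator sublattice."""
    den_mass, log_z = arc_path_logmass(arcs, start, final)          # step 2
    num_mass, log_z_num = arc_path_logmass(correct_arcs, start, final)
    p_correct = math.exp(log_z_num - log_z)  # posterior mass of W_r paths
    weights = {}
    for arc, mass in den_mass.items():                              # step 3
        post = math.exp(mass - log_z)
        post -= math.exp(num_mass.get(arc, NEG_INF) - log_z)
        weights[arc] = post / (1.0 - p_correct)
    return weights
```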

Experiments on the Wall Street Journal (WSJ0) Task

Initial ML-trained acoustic models for WSJ0:
- 16 cepstral coefficients + 16 first derivatives + 1 second derivative, 10 ms frame shift
- LDA (±2 frames, 165 → 33)
- 2 000 generalized triphone states + 1 silence state
- 6-state HMMs, within-word triphone models, gender independent
- Gaussian mixtures with 1 pooled variance, 149k Gaussian densities

corpus WSJ0          train    dev    eval
acoustic data [h]    15:17    0:46   0:40
# speakers           84       10     8
# sentences          7 240    410    330
# running words      130 976  6 784  5 353
# lexicon words      10 133   5 007 (dev and eval)

Results on ARPA WSJ0 Nov. '92:

criterion   dev WER [%]   dev SER [%]   eval WER [%]   eval SER [%]
ML          4.48          40.2          3.72           34.9
MMI         4.16          38.8          3.46           33.3
MCE         3.98          37.1          3.44           33.0
MWE         4.05          37.3          3.36           31.2

Experiments on the North American Business (NAB) Task

Initial ML-trained acoustic models for WSJ0+1:
- 16 cepstral coefficients + 16 first derivatives + 1 second derivative, 10 ms frame shift
- LDA (±1 frame, 99 → 32)
- 7 000 generalized triphone states + 1 silence state
- 6-state HMMs, across-word triphone models, gender independent
- Gaussian mixtures with 1 pooled variance, 412k Gaussian densities

corpus               WSJ0+1 train   NAB Nov. '94 dev   NAB Nov. '94 eval
acoustic data [h]    81:23          0:48               0:53
# speakers           284            20                 20
# sentences          37 474         310                316
# running words      642 074       7 387              8 193
# lexicon words      15 013         19 978 (20k)       64 735 (65k)

Results on NAB Nov. '94 (WER and SER in %):

            NAB-20k dev     NAB-20k eval    NAB-65k dev    NAB-65k eval
criterion   WER    SER      WER    SER      WER    SER     WER    SER
ML          11.48  73.9     11.46  76.3     9.21   67.1    9.35   71.2
MMI         11.18  73.6     11.02  74.1     8.93   67.7    8.97   69.0
MCE         11.07  73.2     11.00  75.0     8.76   66.8    8.97   68.3
MWE         11.13  72.9     10.98  74.4     8.62   65.5    9.03   71.2
MPE         11.24  73.2     11.29  75.0     9.00   66.8    9.30   71.2

Conclusions

- Minimum Classification Error (MCE) training applied to large vocabulary speech recognition.
- Efficient lattice-based computation of the MCE training statistics: no need for N-best lists or restructuring of word lattices.
- Common representation and performance comparison of:
  - the Maximum Mutual Information (MMI) criterion,
  - the Minimum Classification Error (MCE) criterion,
  - the Minimum Word/Phone Error (MWE/MPE) criteria.
- MCE showed the same performance gains as MWE.

Acknowledgments
This work was partially funded by the European Union under the integrated project TC-STAR (Technology and Corpora for Speech to Speech Translation, IST-2002-FP6-506738, http://www.tc-star.org).

References

D. Povey, P. C. Woodland, "Minimum phone error and I-smoothing for improved discriminative training," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Orlando, FL, May 2002, pp. 105–108.

D. Povey, "Discriminative training for large vocabulary speech recognition," Ph.D. dissertation, Dept. of Engineering, Cambridge University, Cambridge, UK, August 2004.

B.-H. Juang, S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, pp. 3043–3054, December 1992.

W. Chou, C.-H. Lee, B.-H. Juang, "Minimum error rate training based on N-best string models," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, Minneapolis, MN, April 1993, pp. 652–655.

R. Schlüter, W. Macherey, "Comparison of discriminative training criteria," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Seattle, WA, May 1998, pp. 493–496.

R. Schlüter, W. Macherey, B. Müller, H. Ney, "Comparison of discriminative training criteria and optimization methods for speech recognition," Speech Communication, vol. 34, no. 1, pp. 287–310, May 2001.

E. McDermott, T. J. Hazen, "Minimum classification error training of landmark models for real-time continuous speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Montreal, Canada, May 2004, pp. 937–940.

References (cont'd)

E. McDermott, S. Katagiri, "Minimum classification error for large scale speech recognition tasks using weighted finite state transducers," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, Philadelphia, PA, March 2005, pp. 113–116.

K. K. Paliwal, M. Bacchiani, Y. Sagisaka, "Minimum classification error training algorithm for feature extractor and pattern classifier in speech recognition," in Proc. Europ. Conf. on Speech Communication and Technology, vol. 1, Madrid, Spain, September 1995, pp. 541–544.

W. Macherey, "Implementation and comparison of discriminative training methods for automatic speech recognition," Diploma thesis, Lehrstuhl für Informatik VI, RWTH Aachen University, Aachen, Germany, November 1998.

D. S. Pallett, J. G. Fiscus, W. M. Fisher, J. S. Garofolo, B. A. Lund, M. A. Przybocki, "1994 benchmark tests for the ARPA spoken language program," in Proc. ARPA Human Language Technology Workshop, Austin, TX, January 1995, pp. 5–36.

F. Kubala, "Design of the 1994 CSR benchmark tests," in Proc. ARPA Human Language Technology Workshop, Austin, TX, January 1995, pp. 41–46.

W. Macherey, R. Schlüter, H. Ney, "Discriminative training with tied covariance matrices," in Proc. 8th Int. Conf. on Spoken Language Processing, vol. 1, Jeju Island, Korea, October 2004, pp. 681–684.