HMM Speech Recognition
Words: Pronunciations and Language Models

Steve Renals
Automatic Speech Recognition (ASR Lecture 9), January-March 2012

[Figure: ASR system overview. Recorded speech is converted to acoustic features and decoded to text (the transcription); the search space is defined by the acoustic model, the lexicon and the language model, all estimated from training data.]

Pronunciation dictionary

Words and their pronunciations provide the link between sub-word HMMs and language models
Written by human experts; typically based on phones
Constructing a dictionary involves:
  1 Selection of the words in the dictionary: want to ensure high coverage of the words in the test data
  2 Representation of the pronunciation(s) of each word: explicit modelling of pronunciation variation

Out-of-vocabulary (OOV) rate

OOV rate: the percentage of word tokens in the test data that are not contained in the ASR system dictionary
The training vocabulary requires pronunciations for all words in the training data (since training requires an HMM to be constructed for each training utterance)
Select the recognition vocabulary to minimize the OOV rate (by testing on development data); the recognition vocabulary may differ from the training vocabulary
Empirical result: each OOV word results in 1.5-2 extra errors (>1 due to the loss of contextual information)
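The OOV rate defined above is easy to measure given a candidate vocabulary and a tokenized development set. A minimal sketch in Python (the toy corpora and the frequency-based vocabulary selection are illustrative assumptions, not part of the lecture):

```python
from collections import Counter

def oov_rate(vocabulary, test_tokens):
    """Percentage of word tokens in the test data that are
    not contained in the ASR system dictionary."""
    vocab = set(vocabulary)
    oov = sum(1 for w in test_tokens if w not in vocab)
    return 100.0 * oov / len(test_tokens)

# Hypothetical example: choose the recognition vocabulary on
# development data by keeping the k most frequent training words.
train_tokens = "the cat sat on the mat the dog sat".split()
dev_tokens = "the dog sat on the rug".split()

counts = Counter(train_tokens)
for k in (3, 4, 5):
    vocab = [w for w, _ in counts.most_common(k)]
    print(k, round(oov_rate(vocab, dev_tokens), 1))
```

Sweeping the vocabulary size on development data like this is one simple way to trade dictionary size against OOV rate, as the slide suggests.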

Multilingual aspects

Many languages are morphologically richer than English: this has a major effect on vocabulary construction and language modelling
Compounding (eg German): decompose compound words into constituent parts, and carry out pronunciation and language modelling on the decomposed parts
Highly inflected languages (eg Arabic, Slavic languages): specific components for modelling inflection (eg factored language models)
Inflecting and compounding languages (eg Finnish)
All approaches aim to reduce ASR errors by reducing the OOV rate through modelling at the morph level; this also addresses data sparsity

Vocabulary size for different languages

[Figure: unique words [million words] against corpus size [million words], for corpora of up to 44 million words. The vocabulary grows fastest for Finnish and slowest for English, with Estonian and Turkish in between. From M. Creutz et al., "Morph-based speech recognition and modeling of out-of-vocabulary words across languages", ACM Trans. Speech and Language Processing, 5(1), art. 3, http://doi.acm.org/10.1145/1322391.1322394]

OOV rate for different languages

[Figure: new words in the test set [%] against training corpus size [million words], for the same four languages. The OOV rate is highest for Finnish and lowest for English. From Creutz et al., as above.]

Single and multiple pronunciations

Words may have multiple pronunciations:
  1 Accent, dialect: tomato, zebra (global changes to the dictionary based on consistent pronunciation variations)
  2 Phonological phenomena: handbag / [h ae m b ae g], I can't stay / [ah k ae n s t ay]
  3 Part of speech: project, excuse
This seems to imply many pronunciations per word, including:
  1 A global transform based on speaker characteristics
  2 Context-dependent pronunciation models, encoding phonological phenomena
BUT state-of-the-art large vocabulary systems average about 1.1 pronunciations per word: most words have a single pronunciation
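As a concrete picture of the "about 1.1 pronunciations per word" statistic, a pronunciation dictionary can be sketched as a mapping from words to lists of phone strings. Everything below (the words, phones and variants) is illustrative, not the lecture's actual dictionary:

```python
# Hypothetical pronunciation dictionary: word -> list of pronunciations,
# each pronunciation a string of phones (loosely ARPAbet-style).
lexicon = {
    "tomato":  ["t ax m aa t ow", "t ax m ey t ow"],   # dialect variants
    "handbag": ["hh ae n d b ae g", "hh ae m b ae g"], # phonological assimilation
    "ticket":  ["t ih k ax t"],
    "the":     ["dh ax", "dh iy"],
    "zebra":   ["z eh b r ax", "z iy b r ax"],
    "stay":    ["s t ey"],
}

# This toy lexicon over-represents variants for illustration; real
# large-vocabulary dictionaries average about 1.1 pronunciations per word.
average = sum(len(prons) for prons in lexicon.values()) / len(lexicon)
print(f"{average:.2f} pronunciations per word")
```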

Consistency vs fidelity

Empirical finding: adding pronunciation variants can result in reduced accuracy
Adding pronunciations gives more flexibility to word models and increases the number of potential ambiguities: more possible state sequences to match the observed acoustics
Speech recognition uses a consistent rather than a faithful representation of pronunciations
A consistent representation requires only that the same word has the same phonemic representation (possibly with alternates): the training data need only be transcribed at the word level
A faithful phonemic representation requires a detailed phonetic transcription of the training speech (much too expensive for large training data sets)

Modelling pronunciation variability

State-of-the-art systems absorb variations in pronunciation into the acoustic models
Context-dependent acoustic models may be thought of as giving a broad-class representation of word context
Cross-word context-dependent models can implicitly represent cross-word phonological phenomena
Hain (2002): a carefully constructed single-pronunciation dictionary (using the most common alignments) can result in a more accurate system than a multiple-pronunciation dictionary

Mathematical framework

HMM framework for speech recognition: let $\mathcal{W}$ be the universe of possible utterances and $X$ the observed acoustics; then we want to find

$$\hat{W} = \arg\max_{W \in \mathcal{W}} P(W \mid X) = \arg\max_W \frac{p(X \mid W)\,P(W)}{p(X)} = \arg\max_W p(X \mid W)\,P(W)$$

Words are composed of a sequence of HMM states $Q$ (assuming the acoustics depend on the words only through the states):

$$\hat{W} = \arg\max_W \sum_Q p(X \mid Q, W)\,P(Q, W) = \arg\max_W \sum_Q p(X \mid Q)\,P(Q \mid W)\,P(W) \approx \arg\max_W \max_Q p(X \mid Q)\,P(Q \mid W)\,P(W)$$

Three levels of model

Acoustic model $p(X \mid Q)$: probability of the acoustics given the phone states; context-dependent HMMs using state clustering, phonetic decision trees, etc.
Pronunciation model $P(Q \mid W)$: probability of the phone states given the words; may be as simple as a dictionary of pronunciations, or a more complex model
Language model $P(W)$: probability of a sequence of words; typically an n-gram
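In practice decoders combine the three model levels in the log domain, summing log acoustic, pronunciation and language model scores per hypothesis. A toy sketch (the scores, the language model scale and both hypotheses are hypothetical placeholders, not from the lecture; real systems obtain these from the decoder):

```python
import math

# Toy illustration of combining the three model levels:
#   log p(X|Q) + log P(Q|W) + (LM scale) * log P(W)
def score_hypothesis(log_acoustic, log_pronunciation, log_lm, lm_scale=12.0):
    # The language model is conventionally scaled because acoustic
    # log-likelihoods are densities with a much wider dynamic range.
    return log_acoustic + log_pronunciation + lm_scale * log_lm

hyps = {
    "never mind the nudist play": (-4301.2, math.log(0.9), math.log(1e-9)),
    "never mind the new display": (-4302.5, math.log(0.9), math.log(2e-7)),
}
best = max(hyps, key=lambda w: score_hypothesis(*hyps[w]))
print(best)  # the language model resolves the acoustic ambiguity
```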

Language modelling

Basic idea: the language model is the prior probability of the word sequence, $P(W)$
Use a language model to disambiguate between similar acoustics when combining linguistic and acoustic evidence: "never mind the nudist play" / "never mind the new display"
Use hand-constructed networks in limited domains
Statistical language models: cover ungrammatical utterances, are computationally efficient, are trainable from huge amounts of data, and can assign a probability to a sentence fragment as well as to a whole sentence

Finite-state network

[Figure: a hand-written finite-state word network for a ticket-booking domain, accepting phrases such as "one/two/three ticket(s) to Edinburgh/London/Leeds".]
Typically hand-written; does not have wide coverage or robustness

Statistical language models

For use in speech recognition a language model must be statistical, have wide coverage, and be compatible with left-to-right search algorithms
Only a few grammar-based models have met these requirements (eg Chelba and Jelinek, 2000), and they do not yet scale as well as simple statistical models
n-grams are (still) the state-of-the-art language model for ASR
  Unsophisticated and linguistically implausible: a short, finite context, modelling solely at the shallow word level
  But: wide coverage, able to deal with ungrammatical strings, statistical and scalable
The probability of a word depends only on the identity of that word and of the preceding n-1 words; these short sequences of n words are called n-grams
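Collecting n-gram statistics from word-level text is all an n-gram model needs. A minimal sketch (the tiny corpus reuses the slide's example sentences; the <s>/</s> boundary markers are an illustrative convention):

```python
from collections import Counter

def ngrams(tokens, n):
    """All length-n word sequences, padded with sentence boundary markers."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

corpus = [
    "never mind the new display".split(),
    "never mind the nudist play".split(),
]
bigram_counts = Counter(bg for sent in corpus for bg in ngrams(sent, 2))
print(bigram_counts[("never", "mind")])  # 2
print(bigram_counts[("mind", "the")])    # 2
```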

Bigram language model

Word sequence $W = w_1, w_2, \ldots, w_M$:

$$P(W) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_M \mid w_1, w_2, \ldots, w_{M-1})$$

Bigram approximation: consider only one word of context:

$$P(W) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_M \mid w_{M-1})$$

The parameters of a bigram model are the conditional probabilities $P(w_i \mid w_j)$
Maximum likelihood estimates by counting:

$$P(w_i \mid w_j) \approx \frac{c(w_j, w_i)}{c(w_j)}$$

where $c(w_j, w_i)$ is the number of observations of $w_j$ followed by $w_i$, and $c(w_j)$ is the number of observations of $w_j$ (irrespective of what follows)

Bigram network

[Figure: a probabilistic finite-state network over words, with arcs labelled P(one | start of sentence), P(ticket | one), P(Edinburgh | one), P(end of sentence | Edinburgh), and so on.]
n-grams can be represented as probabilistic finite-state networks
Only some arcs (and nodes) are shown for clarity: in a full model there is an arc from every word to every word
Note the special start and end of sentence probabilities

The zero probability problem

Maximum likelihood estimation is based on counts of words in the training data
If an n-gram is not observed, it will have a count of 0 and the maximum likelihood probability estimate will be 0
The zero probability problem: just because something does not occur in the training data does not mean that it will not occur
As n grows larger, the data grow sparser and there will be more zero counts
Solution: smooth the probability estimates so that unobserved events do not have a zero probability
Since probabilities sum to 1, this means that some probability is redistributed from observed to unobserved n-grams

Smoothing language models

What is the probability of an unseen n-gram?
Add-one smoothing: add one to all counts and renormalize; discounts non-zero counts and redistributes to zero counts
  Since most n-grams are unseen (for large n, more types than tokens!) this gives too much probability to unseen n-grams (discussed in Manning and Schütze)
Absolute discounting: subtract a constant from the observed (non-zero count) n-grams, and redistribute the subtracted probability over the unseen n-grams (zero counts)
Kneser-Ney smoothing: a family of smoothing methods based on absolute discounting that are at the state of the art (Goodman, 2001)
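A minimal sketch contrasting the maximum likelihood estimate with add-one smoothing on a toy corpus (all counts are illustrative; real systems use far better smoothing, such as Kneser-Ney):

```python
from collections import Counter

corpus = [
    "one ticket to Edinburgh".split(),
    "two tickets to Leeds".split(),
    "one ticket to London".split(),
]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
V = len(unigrams)  # vocabulary size

def p_ml(w_i, w_j):
    """Maximum likelihood: c(w_j, w_i) / c(w_j); zero for unseen bigrams."""
    return bigrams[(w_j, w_i)] / unigrams[w_j]

def p_add_one(w_i, w_j):
    """Add-one smoothing: (c(w_j, w_i) + 1) / (c(w_j) + V)."""
    return (bigrams[(w_j, w_i)] + 1) / (unigrams[w_j] + V)

print(round(p_ml("Leeds", "to"), 3), p_ml("one", "to"))  # 0.333 0.0
print(round(p_add_one("one", "to"), 3))                  # 0.091: unseen, but non-zero
```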

Backing off

How is the probability distributed over unseen events?
Basic idea: estimate the probability of an unseen n-gram using the (n-1)-gram estimate
Use successively less context: trigram, then bigram, then unigram
Back-off models redistribute the probability freed by discounting the n-gram counts
For a bigram (a worked sketch follows the references):

$$P(w_i \mid w_j) = \begin{cases} \dfrac{c(w_j, w_i) - D}{c(w_j)} & \text{if } c(w_j, w_i) > c \\[1ex] P(w_i)\, b_{w_j} & \text{otherwise} \end{cases}$$

where $c$ is the count threshold and $D$ is the discount; $b_{w_j}$ is the backoff weight required for normalization

References

Fosler-Lussier (2003) - pronunciation modelling tutorial
Hain (2002) - implicit pronunciation modelling by context-dependent acoustic models
Gotoh and Renals (2003) - language modelling tutorial (and see refs within)
Manning and Schütze (1999) - good coverage of n-gram models
Jelinek (1991) - review of early attempts to go beyond n-grams
Chelba and Jelinek (2000) - example of a probabilistic grammar-based language model
Goodman (2001) - state-of-the-art smoothing for n-grams
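As promised above, a minimal sketch of the backed-off bigram with absolute discounting (the counts, the discount D and the count threshold are illustrative assumptions; the backoff weight is computed so that each context's distribution sums to 1):

```python
from collections import Counter

# Illustrative counts (kept consistent: the bigram counts for a context
# sum to that context's unigram count). Real counts come from a corpus.
bigrams = Counter({("to", "Edinburgh"): 3, ("to", "Leeds"): 2})
unigrams = Counter({"to": 5, "Edinburgh": 3, "Leeds": 2, "London": 1})
total_words = sum(unigrams.values())
D, count_threshold = 0.5, 0  # discount and count threshold

def p_unigram(w):
    return unigrams[w] / total_words

def backoff_weight(w_j):
    """b_{w_j}: chosen so that P(. | w_j) sums to 1 over the vocabulary."""
    seen = [w for (u, w) in bigrams
            if u == w_j and bigrams[(u, w)] > count_threshold]
    left_over = 1.0 - sum((bigrams[(w_j, w)] - D) / unigrams[w_j] for w in seen)
    return left_over / (1.0 - sum(p_unigram(w) for w in seen))

def p_backoff(w_i, w_j):
    """Discounted bigram if observed, otherwise backed-off unigram."""
    if bigrams[(w_j, w_i)] > count_threshold:
        return (bigrams[(w_j, w_i)] - D) / unigrams[w_j]
    return p_unigram(w_i) * backoff_weight(w_j)

# The distribution over the vocabulary normalizes for each context:
print(round(sum(p_backoff(w, "to") for w in unigrams), 6))  # 1.0
```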