Low Cost Lexicon. Nagendra Kumar Goel, Samuel Thomas, Pinar Akyazi


Speech Recognition
Speech -> Feature Extraction -> Features x
Lexicon or Pronunciation Dictionary: "cat ran" -> k a t R a n
Acoustic Model: $p(x \mid W)$
Language Model: $P(w_n \mid w_{n-1}, w_{n-2})$
Search: $\arg\max_W p(W)\, p(x \mid W)$
$\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(W)\, P(X \mid W)}{P(X)}$
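The decision rule above can be illustrated with a minimal Python sketch. It assumes the candidate word sequences and their scores are already available as log probabilities; the function name `decode` and all toy numbers are hypothetical and not from the slides. Since P(X) is the same for every hypothesis, it is dropped from the maximization.

```python
import math

def decode(hypotheses, lm_logprob, acoustic_logprob):
    """Return the hypothesis W that maximizes log P(W) + log P(X | W).

    P(X) is constant across hypotheses, so it does not affect the arg max.
    """
    return max(hypotheses, key=lambda w: lm_logprob[w] + acoustic_logprob[w])

# Toy, made-up scores for two competing word sequences.
lm_logprob = {"cat ran": math.log(0.02), "cat rang": math.log(0.001)}
acoustic_logprob = {"cat ran": -120.0, "cat rang": -119.5}

best = decode(list(lm_logprob), lm_logprob, acoustic_logprob)
print(best)
```

In this toy case the acoustic model slightly prefers "cat rang", but the language model prior outweighs that difference.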

Cost of Developing a New Language
Transcribed audio data
  Subspace acoustic models (UBMs) need less data
Text data for language modeling
  Obtain from the web if possible
Pronunciation lexicon
  Qualified phoneticians are expensive
  Phoneticians may make mistakes
  Conversational (Callhome) English has a 4.6% OOV rate for a 5K lexicon and 0.4% for a 62K lexicon
  Try to guess pronunciations given a limited lexicon and audio
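As a small illustration of the OOV figures quoted above, the following sketch computes a token-level out-of-vocabulary rate. It assumes the rate is measured over running words of a transcript (the slides do not say whether tokens or types were counted); `oov_rate` and the toy data are hypothetical.

```python
def oov_rate(tokens, lexicon):
    """Percentage of running words that have no entry in the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in lexicon)
    return 100.0 * missing / len(tokens)

# Toy example: a three-word lexicon against a six-word transcript.
lexicon = {"the", "cat", "ran"}
tokens = "the cat ran to the zoo".split()
rate = oov_rate(tokens, lexicon)
print(f"{rate:.1f}% OOV")
```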

Estimating Pronunciations
The ideal situation would be to estimate, for each word, the pronunciations that maximize the likelihood given the audio:
$\widehat{Prn} = \arg\max_{Prn} P(X \mid Prn)$
There are words for which no spoken audio is available, but they still need to exist in the recognizer.
Multiple pronunciations have not yet significantly improved performance.
This objective function needs a lot of regularization.

Estimating Pronunciation from Graphemes
One way is to guess the pronunciation from the orthography of the word (e.g. Bisani & Ney):
$\widehat{Prn} = \arg\max_{Prn} P(W, Prn)$
Iterative process based on grapheme/phoneme alignment:
  Start with an initial set of graphone probabilities.
  Use the probabilities to realign graphones with phones on training data.
  Re-estimate graphone probabilities from the alignments.
Example graphone alignment (graphemes : phonemes): a : x, t t : t, e : e, n : n, t i o n : S x n

Training a Pronunciation Dictionary
Training: Initial pronunciation dictionary -> G2P training -> Model for predicting P from G
Prediction: Out-of-vocabulary words -> G2P -> Predicted phoneme sequence

[Figure: G2P plot for English. x-axis: model context size (1 to 6); y-axis: % error (0 to 100); series: % string errors, % symbol errors.]
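The two error measures in the plot can be computed as sketched below. The slides do not define them, so this assumes the usual conventions: a string error is a word whose predicted pronunciation differs from the reference in any way, and the symbol error rate is the phoneme-level edit distance divided by the total number of reference phonemes. The function names and toy data are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def g2p_error_rates(pairs):
    """Return (% string errors, % symbol errors) for (reference, hypothesis) pairs."""
    string_errors = sum(1 for ref, hyp in pairs if ref != hyp)
    symbol_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total_symbols = sum(len(ref) for ref, _ in pairs)
    return (100.0 * string_errors / len(pairs),
            100.0 * symbol_errors / total_symbols)

# Toy data: one correct prediction and one with a single substituted phoneme.
pairs = [
    (["k", "a", "t"], ["k", "a", "t"]),
    (["r", "a", "n"], ["r", "e", "n"]),
]
string_err, symbol_err = g2p_error_rates(pairs)
print(string_err, symbol_err)
```

String errors are always at least as harsh as symbol errors, because one wrong phoneme makes the whole word count as wrong.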

Estimating Pronunciations
If the audio recording is also available, it can be used to augment the estimates:
$\widehat{Prn} = \arg\max_{Prn} P(X \mid Prn)\, P(Prn \mid W)$
We use an approximation to the above:
$\widehat{Prn} = \arg\max_{Prn \in \{\text{Top 5 } Prn\}} P(X \mid Prn)$
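The approximation above can be sketched as a two-stage selection: the g2p model proposes a short list, and the acoustic likelihood picks within it. In a real system the acoustic score would come from forced alignment of the audio; here a plain lookup table stands in for it. `select_pronunciation`, the candidate strings, and all scores are hypothetical.

```python
def select_pronunciation(candidates, acoustic_loglik, top_n=5):
    """Pick the pronunciation with the highest acoustic log-likelihood
    among the top_n candidates ranked by g2p score.

    candidates: list of (pronunciation, g2p_logprob) pairs.
    acoustic_loglik: function mapping a pronunciation to log P(X | Prn).
    """
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_n]
    return max(shortlist, key=lambda c: acoustic_loglik(c[0]))[0]

# Toy g2p candidates and made-up acoustic scores.
candidates = [("k a t", -0.1), ("k ae t", -0.5), ("s a t", -3.0)]
acoustic_scores = {"k a t": -50.0, "k ae t": -42.0, "s a t": -60.0}

best_prn = select_pronunciation(candidates, acoustic_scores.get)
print(best_prn)
```

With `top_n=1` the choice falls back to the g2p model's first guess, which is what happens for words with no audio.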

Estimating Pronunciations
Start with a handmade phone set and a dictionary
Train g2p with the dictionary
Train acoustic models
Force align the training data with multiple pronunciations
Create a new dictionary with the selected pronunciations

Estimating Pronunciations
Main loop:
  Start with a handmade phone set and a dictionary
  Train g2p with the dictionary
  Train acoustic models
  Force align the training data with multiple pronunciations
  Create a new dictionary with the selected pronunciations
Phonetic recognition path:
  Free phonetic recognition
  Match with word-level transcripts
  Pick pronunciations

Estimating Pronunciations
Main loop:
  Start with a handmade phone set and a dictionary
  Train g2p with the dictionary
  Train acoustic models
  Force align the training data with multiple pronunciations
  Create a new dictionary with the selected pronunciations
Phonetic recognition path:
  Free phonetic recognition
  Match with word-level transcripts
  Pick pronunciations
Unsupervised learning path:
  Create lattices on similar acoustic datasets
  Pick words with high confidence
  Force align and create pronunciations
  Introduce new pronunciations from unsupervised learning

Training Procedure - Bootstrapping
1000 most frequently occurring words of training data -> Train g2p -> Trained g2p model
Remaining training data
Words used for building the LM, which covers the test data
Multiple pronunciation dictionary
Dictionary

Callhome training lexicon size: 5K
LM vocabulary size: 62K
Training acoustic data without partial words: 6 hrs
Complete training data: 15 hrs
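The seed selection for bootstrapping can be sketched as follows. The sketch assumes that pronunciations are taken from an existing lexicon only for the N most frequent training words; `bootstrap_lexicon`, the toy transcripts, and the toy pronunciations are hypothetical. If a frequent word has no entry in the source lexicon, it is skipped, so the seed may contain fewer than N words.

```python
from collections import Counter

def bootstrap_lexicon(transcripts, full_lexicon, n=1000):
    """Keep pronunciations only for the n most frequent words in the transcripts."""
    counts = Counter(word for line in transcripts for word in line.split())
    seed_words = [word for word, _ in counts.most_common(n) if word in full_lexicon]
    return {word: full_lexicon[word] for word in seed_words}

# Toy data with n=2 instead of 1000.
transcripts = ["the cat ran", "the cat sat", "the dog"]
full_lexicon = {"the": "dh ax", "cat": "k a t", "ran": "r a n", "dog": "d o g"}

seed = bootstrap_lexicon(transcripts, full_lexicon, n=2)
print(seed)
```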

Training Procedure - Bootstrapping
1000 most frequently occurring words -> Train g2p -> Trained g2p model
Training data
Words used for building the LM, which covers the test data
Multiple pronunciation dictionary
Dictionary
Train acoustic model -> Trained acoustic model
Test data -> Recognition

Training Procedure - Building Up
Inputs: multiple pronunciation dictionary, train data, acoustic models from previous iteration
Force alignment -> Best pronunciation for training words

Training Procedure - Building Up
Inputs: multiple pronunciation dictionary, train data, acoustic models from previous iteration
Force alignment -> Best pronunciation for training words
Words used for building the LM, which covers the test data
Train g2p -> Dictionary
Test data, acoustic model -> Recognition

Training Procedure - Building Up
Start with a handmade phone set and a dictionary
Train g2p with the dictionary
Train acoustic models
Force align the training data with multiple pronunciations
Create a new dictionary with the selected pronunciations

Results (% accuracy)
With full dictionary available: 44.35
If 5K manual lexicon is available: 40.53
With 1000 words available: 37.58
After retraining acoustic models: 39.37
2nd iteration of g2p & acoustic re-train: 41.60
3rd iteration of g2p & acoustic re-train: 42.11
After increasing the amount of data to 15 hrs: 43.56

Unsupervised Learning
Main loop:
  Start with a handmade phone set and a dictionary
  Train g2p with the dictionary
  Train acoustic models
  Force align the training data with multiple pronunciations
  Create a new dictionary with the selected pronunciations
Unsupervised learning path:
  Create lattices on similar acoustic datasets
  Pick words with high confidence
  Force align and create pronunciations
  Introduce new pronunciations from unsupervised learning

Unsupervised Lexicon Learning Results (% accuracy)
                              6 hrs of training data   15 hrs of training data
Baseline accuracy             42.11                    43.56
After unsupervised learning   42.33                    43.44

WER dilemma for Spanish
Callhome Spanish pronunciation is very graphemic
Accuracy for Spanish is about 31.13% (about 13% lower than Callhome English)
Phone recognition accuracy is better than for Callhome English
  English: 45.13%
  Spanish: 53.77%
LM perplexity is not too bad: 127
Can learning alternate pronunciations of reduced words help?

Possible lexicon training paths
Main loop:
  Start with a handmade phone set and a dictionary
  Train g2p with the dictionary
  Train acoustic models
  Force align the training data with multiple pronunciations
  Create a new dictionary with the selected pronunciations
Phonetic recognition path:
  Free phonetic recognition
  Match with word-level transcripts
  Pick pronunciations
Unsupervised learning path:
  Create lattices on similar acoustic datasets
  Pick words with high confidence
  Force align and create pronunciations
  Introduce new pronunciations from unsupervised learning

[Figure: Lexicon Enhancement for Spanish. G2P accuracies after augmenting with phone recognition based pronunciations. x-axis: iteration (model) # (1 to 4); y-axis: % string errors (0 to 60); series: Dev, Eval.]

[Figure: Lexicon Enhancement for Spanish. G2P plot for Spanish. x-axis: iteration (model) # (2 to 4); y-axis: % string errors (0 to 4.5); series: Dev, Eval.]

English and Spanish results with the unconstrained phonetic recognition approach (% accuracy)
          Baseline   After adding pronunciations
Spanish   31.13      30.71
English   43.54      42.71
Log likelihood of training data increases with the new lexicon.

Lexicon Enhancement
Keep the manual lexicon but augment it with the most likely pronunciation in the training data
Affected about 250 pronunciations
Accuracy improved from 44.33% to 45.01%
Multiple pronunciations had no significant impact: 45.02%

Summary
The g2p-based lexicon retraining method helps achieve accuracies close to those of handmade lexicons.
It can also help improve an existing lexicon.
The unsupervised and phonetic recognition based lexicon learning approaches hold promise and need to be explored with a wider variety of smoothing and pronunciation extraction scenarios.

Training Procedure
Train g2p to generate pronunciations using your best baseline lexicon
Generate multiple pronunciations using the g2p
Use the training data to select the best pronunciation out of these multiple choices
Retrain the acoustic models and iterate over the above process
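The procedure above amounts to a simple loop, sketched here in Python. This only shows the control flow under stated assumptions: `retrain_lexicon` and the three callables passed to it (`train_g2p`, `train_acoustic`, `score`) are hypothetical placeholders for real g2p training, acoustic model training, and forced-alignment scoring, and the toy stand-ins below exist only to make the sketch runnable.

```python
def retrain_lexicon(seed_lexicon, words, train_g2p, train_acoustic, score,
                    iterations=3, n_best=5):
    """Iterate: train g2p, generate candidates, select by acoustic score, retrain."""
    lexicon = dict(seed_lexicon)
    for _ in range(iterations):
        g2p = train_g2p(lexicon)            # returns: word -> ranked candidate list
        acoustic = train_acoustic(lexicon)  # acoustic models for this iteration
        lexicon = {
            word: max(g2p(word)[:n_best],
                      key=lambda prn: score(acoustic, word, prn))
            for word in words
        }
    return lexicon

# Toy stand-ins (not real models): fixed candidates and an oracle score.
CANDIDATES = {"cat": ["s a t", "k a t"], "ran": ["r e n", "r a n"]}
TRUTH = {"cat": "k a t", "ran": "r a n"}

def toy_train_g2p(lexicon):
    return lambda word: CANDIDATES[word]

def toy_train_acoustic(lexicon):
    return None  # placeholder; a real system would train on aligned audio

def toy_score(acoustic, word, prn):
    return 1.0 if prn == TRUTH[word] else 0.0

final_lexicon = retrain_lexicon({"cat": "s a t"}, ["cat", "ran"],
                                toy_train_g2p, toy_train_acoustic, toy_score,
                                iterations=2)
print(final_lexicon)
```

Passing the training steps in as callables keeps the loop independent of any particular g2p or acoustic modeling toolkit.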