Using Voice Transformations to Create Additional Training Talkers for Word Spotting

Eric I. Chang and Richard P. Lippmann
MIT Lincoln Laboratory
Lexington, MA 02173-0073, USA
eichang@sst.ll.mit.edu and rpl@sst.ll.mit.edu

Abstract

Speech recognizers provide good performance for most users, but the error rate often increases dramatically for a small percentage of talkers who are "different" from those used for training. One expensive solution to this problem is to gather more training data in an attempt to sample these outlier users. A second solution, explored in this paper, is to artificially enlarge the number of training talkers by transforming the speech of existing training talkers. This approach is similar to enlarging the training set for OCR digit recognition by warping the training digit images, but is more difficult because continuous speech varies across a much larger number of dimensions (e.g. linguistic, phonetic, style, temporal, spectral) that differ across talkers. We explored the use of simple linear spectral warping to enlarge a 48-talker training database used for word spotting. The average detection rate was increased by 2.9 percentage points (from 68.3% to 71.2%) for male speakers and 2.5 percentage points (from 64.8% to 67.3%) for female speakers. This increase is small but similar to that obtained by doubling the amount of training data.

1 INTRODUCTION

Speech recognizers, optical character recognizers, and other types of pattern classifiers used for human interface applications often provide good performance for most users. Performance is often, however, low and unacceptable for a small percentage of "outlier" users who are presumably not well represented in the training data. One expensive solution to this problem is to obtain more training data in the hope of including users from these outlier classes. Other approaches already used for speech recognition are to use input features and distance metrics that are relatively invariant to linguistically unimportant differences between talkers, and to adapt a recognizer to individual talkers. Talker adaptation is difficult for word spotting and with poor outlier users because the recognition error rate is high and talkers often cannot be prompted to recite standard phrases that can be used for adaptation.

An alternative approach, which has not been fully explored for speech recognition, is to artificially expand the number of training talkers using voice transformations. Transforming the speech of one talker to make it sound like that of another is difficult because speech varies across many difficult-to-measure dimensions including linguistic content, phonetics, duration, spectra, style, and accent. The transformation task is thus more difficult than in optical character recognition, where a small set of warping functions can be successfully applied to character images to enlarge the number of training images (Drucker, 1993). This paper demonstrates how a transformation accomplished by warping the spectra of training talkers to create more training data can improve the performance of a whole-word word spotter on a large spontaneous-speech database.

2 BASELINE WORD SPOTTER

A hybrid radial basis function (RBF) - hidden Markov model (HMM) keyword spotter has been developed over the past few years that provides state-of-the-art performance for a whole-word word spotter on the large spontaneous-speech credit-card speech corpus. This system spots 20 target keywords, includes one general filler class, and uses a Viterbi decoding backtrace as described in (Chang, 1994) to backpropagate errors over a sequence of input speech frames. This neural network word spotter is trained on target and background classes, normalizes target outputs using the background output, and thresholds the resulting score to generate putative hits, as shown in Figure 1. Putative hits in this figure are input patterns which generate normalized scores above a threshold. The performance of this, and other, spotting systems is analyzed by plotting a detection versus false alarm rate curve. This curve is generated by adjusting the classifier output threshold to allow few or many putative hits. The figure of merit (FOM) is defined as the average keyword detection rate as the false alarm rate ranges from 1 to 10 false alarms per keyword per hour. The previous best FOM for this word spotter is 67.8% when trained using 24 male talkers and tested on 11 male talkers, and 65.9% when trained using 24 female talkers and tested on 11 female talkers. The overall FOM for all talkers is 66.3%.

Figure 1: Block diagram of neural network word spotter.
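The FOM computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code; the putative-hit representation and the choice to sample the curve at integer false-alarm rates from 1 to 10 per keyword per hour are assumptions about a reasonable implementation.

```python
import numpy as np

def figure_of_merit(scores, labels, total_occurrences, num_keywords, hours):
    """Average keyword detection rate as the false alarm rate sweeps
    from 1 to 10 false alarms per keyword per hour.

    scores: putative-hit scores (higher = more confident)
    labels: 1 if a putative hit is a true keyword occurrence, else 0
    total_occurrences: number of true keyword occurrences in the test data
    """
    order = np.argsort(scores)[::-1]           # lower the threshold stepwise
    labels = np.asarray(labels, dtype=float)[order]
    detections = np.cumsum(labels)             # true hits admitted so far
    false_alarms = np.cumsum(1.0 - labels)     # false hits admitted so far
    fa_rate = false_alarms / (num_keywords * hours)
    det_rate = detections / total_occurrences
    # Sample the detection rate at 1, 2, ..., 10 false alarms per keyword
    # per hour and average the samples.
    idx = np.searchsorted(fa_rate, np.arange(1, 11))
    idx = idx[idx < len(det_rate)]
    return float(det_rate[idx].mean()) if len(idx) else 0.0
```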

3 TALKER VARIABILITY

FOM scores of test talkers vary over a wide range. When training on 48 talkers and then testing on 22 talkers from the 70 conversations in the NIST Switchboard credit card database, the FOM of the test talkers varies from 16.7% to 100%. Most talkers perform well above 50%, but there are two female talkers with FOMs of 16.7% and 21.4%. The low FOM for these individual speakers indicates a lack of training data with voice qualities similar to those of these test speakers.

4 CREATING MORE TRAINING DATA USING VOICE TRANSFORMATIONS

Talker adaptation is difficult for word spotting because error rates are high and talkers often cannot be prompted to verify adaptation phrases. Our approach to increasing performance across talkers uses voice transformation techniques to generate more varied training examples of keywords, as shown in Figure 2. Other researchers have used talker transformation techniques to produce more natural synthesized speech (Iwahashi, 1994; Mizuno, 1994), but using talker transformation techniques to generate more training data is novel.

We have implemented a new voice transformation technique which utilizes the Sinusoidal Transform Analysis/Synthesis System (STS) described in (Quatieri, 1992). This technique attempts to transform one talker's speech pattern into that of a different talker. The STS generates a 512-point spectral envelope of the input speech 100 times a second and also separates pitch and voicing information. Separation of vocal tract characteristics and pitch information has allowed the implementation of pitch and time transformations in previous work (Quatieri, 1992). The system has been modified to generate and accept a spectral envelope file from an input speech sample.

Figure 2: Generating more training data by artificially transforming original speech training data.

We informally explored different techniques to transform the spectral envelope to generate more varied training examples by listening to the transformed speech. This resulted in the following algorithm, which transforms a talker's voice by scaling the spectral envelope of training talkers:

1. Training conversations are upsampled from 8000 Hz to 10,000 Hz to be compatible with existing STS coding software.

2. The STS system processes the upsampled files and generates a 512-point spectral envelope of the input speech waveform at a frame rate of 100 frames a second and with a window length of approximately 2.5 times the length of each pitch period.

3. A new spectral envelope is generated by linearly expanding or compressing the spectral axis. Each spectral point is identified by its index, ranging from 0 to 511. To transform a spectral profile by 2, the new spectral value at frequency f is generated by averaging the spectral values around the original spectral profile at a frequency of 0.5f. The transformation process is illustrated in Figure 3. In this figure, an original spectral envelope is being expanded by two: the spectral value at index 150 is transformed to spectral index 300 in the new envelope, and the original spectral information at high frequencies is lost.

4. The transformed spectral envelope is used to resynthesize a speech waveform using the vocal tract excitation information extracted from the original file.

Figure 3: An example of the spectral transformation algorithm where the original spectral envelope frequency scale is expanded by 2.

Voice transformation with the STS coder allows listening to transformed speech but requires long computation. We therefore simplified our approach to one of modifying the spectral scale directly in the spectral domain within a mel-scale filterbank analysis program. The incoming speech sample is processed with an FFT to calculate spectral magnitudes. The spectral magnitudes are then linearly transformed. Lastly, mel-scale filtering is performed with 10 linearly spaced filters up to 1000 Hz and logarithmically spaced filters from 1000 Hz up. A cosine transform is then used to generate mel-scaled cepstral values that are used by the wordspotter. Much faster processing can be achieved by applying the spectral transformation as part of the filterbank analysis: while the spectral transformation using the STS algorithm takes up to approximately 10 times real time, the spectral transformation within the mel-scale filterbank program can be accomplished within 1/10 of real time on a Sparc 10 workstation. The rapid processing rate allows on-line spectral transformation.
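Step 3 above is easy to state in code. The sketch below uses linear interpolation where the paper averages neighboring spectral values, and the same operation applies whether the input is a 512-point STS envelope or the FFT magnitudes inside the mel-scale filterbank front end; treat it as an illustration of the idea rather than the Lincoln Laboratory implementation.

```python
import numpy as np

def warp_spectrum(envelope, ratio):
    """Linearly expand (ratio > 1) or compress (ratio < 1) the frequency
    axis of a magnitude spectrum.

    The new value at index i is read from the original envelope near
    index i / ratio, so with ratio = 2 the content at index 150 moves
    to index 300 and the original high-frequency half is discarded.
    """
    n = len(envelope)                # e.g. 512 spectral points
    src = np.arange(n) / ratio       # source position for each output bin
    # Linear interpolation stands in for the paper's local averaging;
    # source positions past the end of the envelope (ratio < 1) are
    # clamped to the last value by np.interp.
    return np.interp(src, np.arange(n), envelope)

env = np.abs(np.fft.rfft(np.random.randn(1022)))  # dummy 512-point spectrum
expanded = warp_spectrum(env, 2.0)                # Figure 3's example
compressed = warp_spectrum(env, 0.95)             # a mild, natural ratio
```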

5 WORD SPOTTING EXPERIMENTS

Linear warping in the spectral domain, which is used in the above algorithm, is correct when the vocal tract is modelled as a series of lossless acoustic tubes and the excitation source is at one end of the vocal tract (Wakita, 1977). Wakita showed that if the vocal tract is modelled as a series of equal-length, lossless, and concatenated acoustic tubes, then the ratio of the areas between the tubes determines the relative resonant frequencies of the vocal tract, while the overall length of the vocal tract linearly scales formant frequencies. Preliminary research was conducted using linear scaling with spectral ratios ranging from 0.6 to 1.8 to alter test utterances. After listening to the STS-transformed speech and also observing spectrograms of the transformed speech, it was found that speech transformed using ratios between 0.9 and 1.1 is reasonably natural and can represent speech without introducing artifacts.

Using discriminative training techniques such as FOM training carries the risk of overtraining the wordspotter on the training set and obtaining results that are poor on the testing set. To delay the onset of overtraining, we artificially transformed each training set conversation during each epoch using a different random linear transformation ratio. The transformation ratio used for each conversation is calculated using the following formula:

ratio = a + N(0, 0.06),

where a is the transformation ratio that matches each training speaker to the average of the training set speakers, and N is a normally distributed random variable with a mean of 0.0 and a standard deviation of 0.06. For each training conversation, the long-term averages of the frequencies of formants 1, 2, and 3 are calculated. A least squares estimation is then performed to match the formant frequencies of each training set conversation to the group average formant frequencies.
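The per-conversation ratio draw can be sketched as follows. The exact least-squares estimator is not reproduced in this copy, so the closed-form `a` below is the standard single-parameter least-squares fit of a conversation's long-term formant averages to the group averages; that form, and the function names, are assumptions.

```python
import numpy as np

def matching_ratio(speaker_formants, group_formants):
    """Scale a minimizing sum_i (Fbar_i - a * F_i)^2 over the long-term
    averages of formants 1-3 (assumed form of the least-squares fit)."""
    F = np.asarray(speaker_formants, dtype=float)    # speaker's F1, F2, F3
    Fbar = np.asarray(group_formants, dtype=float)   # training-set averages
    return float(np.dot(F, Fbar) / np.dot(F, F))

def epoch_ratio(speaker_formants, group_formants, rng):
    """ratio = a + N(0, 0.06): center on the formant-matching ratio and
    add fresh zero-mean Gaussian jitter each training epoch."""
    a = matching_ratio(speaker_formants, group_formants)
    return a + rng.normal(0.0, 0.06)

rng = np.random.default_rng(0)
# A speaker with slightly high formants gets a centering ratio below 1.
ratio = epoch_ratio([520.0, 1560.0, 2590.0], [500.0, 1500.0, 2500.0], rng)
```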

Figure 4: Average detection accuracy (FOM) for the male training and testing set versus the number of epochs of FOM training.

The transform ratio for each individual conversation is calculated to improve the naturalness of the transformed speech. In preliminary experiments, each conversation was transformed with fixed ratios of 0.9, 0.95, 1.05, and 1.1. However, for a speaker with already high formant frequencies, pushing the formant frequencies higher may make the transformed speech sound unnatural. By incorporating the individual formant matching ratio into the transformation ratio, speakers with high formant frequencies are not transformed to even higher frequencies, and speakers with low formant frequencies are not transformed to even lower formant frequency ranges.

Male and female conversations from the NIST credit card database were used separately to train separate word spotters. Both the male and the female partitions of the data used 24 conversations for training and 11 conversations for testing. Keyword occurrences were extracted from each training conversation and used as the data for initialization of the neural network word spotter. Also, each training conversation was broken up into sentence-length segments to be used for embedded reestimation, in which the keyword models are joined with the filler models and the parameters of all the models are jointly estimated. After embedded reestimation, figure of merit training as described in (Lippmann, 1994) was performed for up to 10 epochs. During each epoch, each training conversation is transformed using a transform ratio randomly generated as described above. The performance of the word spotter after each epoch of training is evaluated on both the training set and the testing set.
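The overall per-epoch procedure then looks like the loop below. Everything here is a hypothetical skeleton of the pipeline the text describes; the centering ratios are stand-ins for the formant-matched values from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in centering ratios for the 24 training conversations; in the
# real pipeline these come from the formant-matching fit sketched above.
a = rng.uniform(0.95, 1.05, size=24)

for epoch in range(10):                            # up to 10 FOM epochs
    ratios = a + rng.normal(0.0, 0.06, size=24)    # fresh draw per epoch
    for conv, ratio in enumerate(ratios):
        # Warp conversation `conv` by `ratio` inside the mel-scale
        # filterbank analysis, then run one FOM training pass on it.
        pass
    # After each epoch the FOM is evaluated on both the training and
    # testing sets, producing the curves of Figures 4 and 5.
```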

6 WORD SPOTTING RESULTS

Training and testing set FOM scores for the male speakers and the female speakers are shown in Figure 4 and Figure 5, respectively. The x axis plots the number of epochs of FOM training, where each epoch represents presenting all 24 training conversations once. The FOM for word spotters trained with the normal training conversations and for word spotters trained with artificially expanded training conversations is shown in each plot.

Figure 5: Average detection accuracy (FOM) for the female training and testing set versus the number of epochs of FOM training.

After the first epoch, the FOM improves significantly. With only the original training conversations (normal), the testing set FOM rapidly levels off while the training set FOM keeps improving. When the training conversations are artificially expanded, the training set FOM is below that of the normal training set because the training data are more difficult. However, the testing set FOM continues to improve as more epochs of FOM training are performed. Comparing the FOM of wordspotters trained on the two sets of data after ten epochs of training, the FOM for the expanded set was 2.9 percentage points above the normal FOM for male speakers and 2.5 percentage points above the normal FOM for female speakers. For comparison, Carlson has reported that for a high performance word spotter on this database, doubling the amount of training data typically increases the FOM by 2 to 4 percentage points (Carlson, 1994).

7 SUMMARY

Lack of training data has always been a constraint in training speech recognizers. This research presents a voice transformation technique which increases the variety among training talkers. The resulting more varied training set provided up to 2.9 percentage points of improvement in the figure of merit (average detection rate) of a high performance word spotter. This improvement is similar to the increase in performance provided by doubling the amount of training data (Carlson, 1994). This technique can also be applied to other speech recognition systems such as continuous speech recognition, talker identification, and isolated word recognition.

ACKNOWLEDGEMENT

This work was sponsored by the Advanced Research Projects Agency. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. We wish to thank Tom Quatieri for providing his sinusoidal transform analysis/synthesis system.

BIBLIOGRAPHY

B. Carlson and D. Seward. (1994) Diagnostic Evaluation and Analysis of Insufficient and Task-Independent Training Data on Speech Recognition. In Proceedings Speech Research Symposium XIV, Johns Hopkins University.

E. Chang and R. Lippmann. (1994) Figure of Merit Training for Detection and Spotting. In Neural Information Processing Systems 6, G. Tesauro, J. Cohen, and J. Alspector (Eds.), Morgan Kaufmann: San Mateo, CA.

H. Drucker, R. Schapire, and P. Simard. (1993) Improving Performance in Neural Networks Using a Boosting Algorithm. In Neural Information Processing Systems 5, S. Hanson, J. Cowan, and C. L. Giles (Eds.), Morgan Kaufmann: San Mateo, CA.

N. Iwahashi and Y. Sagisaka. (1994) Speech Spectrum Transformation by Speaker Interpolation. In Proceedings International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 461-464.

R. Lippmann, E. Chang, and C. Jankowski. (1994) Wordspotter Training Using Figure-of-Merit Back Propagation. In Proceedings International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 389-392.

H. Mizuno and M. Abe. (1994) Voice Conversion Based on Piecewise Linear Conversion Rules of Formant Frequency and Spectrum Tilt. In Proceedings International Conference on Acoustics, Speech and Signal Processing, Vol. 1, 469-472.

T. Quatieri and R. McAulay. (1992) Shape Invariant Time-Scale and Pitch Modification of Speech. IEEE Trans. Signal Processing, Vol. 40, No. 3, pp. 497-510.

H. Wakita. (1977) Normalization of Vowels by Vocal-Tract Length and Its Application to Vowel Identification. IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-25, No. 2, pp. 183-192.