Tues/Fri, Nov min/person in my office Be prepared to give an update on progress on your (part of the) project

Clementine Riley
7 years ago
1 Project updates: Tues/Fri, Nov min/person in my office. Be prepared to give an update on progress on your (part of the) project: brief characterization of the project (task, language, code, modules); what you've done so far; what's left to be done; evaluation. Roadblocks? Questions?

2 LANGUAGE RECOGNITION (SPOKEN LANGUAGE IDENTIFICATION)

4 Problem What language is being spoken?

5 Problem What language is being spoken? 1. Tamil 2. Spanish 3. Mandarin 4. Korean 5. Japanese 6. Hindi

6 Applications: skip the "Para español, oprime 2" step; call centers (e.g., 911); signals intelligence; first step in a multilingual voice UI or translator.

7 Baseline Always guess the most common language.
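The baseline above can be sketched in a few lines. This is a minimal illustration, not code from the project; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Build a classifier that always predicts the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    # The utterance itself is ignored: that is the whole point of the baseline.
    return lambda _utterance: most_common_label


def accuracy(predict, utterances, gold_labels):
    """Fraction of utterances whose predicted label equals the gold label."""
    correct = sum(predict(u) == g for u, g in zip(utterances, gold_labels))
    return correct / len(gold_labels)


# Hypothetical toy labels, not data from the slides.
train = ["Spanish", "Tamil", "Spanish", "Korean", "Spanish"]
predict = majority_baseline(train)
```

Any real system should beat the accuracy this predictor reaches on the same test split.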

8 Two Main Solutions. Acoustic analysis only: train classifiers based on spectral information. Linguistic information: phonotactics (most successful); broad-class phonotactics; phone duration; silence and filled pauses; prosody (difficult and less effective).

9 Most Successful Approach

10 Most Successful System

12 BYU's Solution (diagram): Phone Call, Sphinx 4, Time slices, Praat, Feature Definitions, Feature Transformer, Maximum Entropy Classifier, Language

13 10,000-foot View (diagram): Phone Call, Sphinx 4, Praat, Feature Transformer, Maximum Entropy Classifier, Language

14 Sphinx-4: Phoneme Recognizer ah <s> b... </s> z

15 Sphinx-4: Phonetic Class Models. Phonetic classes are language-independent sets of related sounds, based on manner of articulation; e.g., VOC consists of vowels and FRIC consists of fricatives. Create a language model based on the classes to constrain the recognizer: <s> CLOS VOC CLOS </s> <s> CLOS FRIC CLOS CLOS FRIC PRVS CLOS FRIC VOC
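Collapsing phones into broad classes might look like the sketch below. The slides only say that VOC holds vowels and FRIC holds fricatives, so the phone symbols (ARPAbet-style) and the table entries are assumptions made for illustration; the membership of CLOS and PRVS is not given and is left out.

```python
# Hypothetical phone-to-class table. Only VOC (vowels) and FRIC (fricatives)
# are defined on the slides; the specific phones listed here are illustrative.
PHONE_TO_CLASS = {
    "aa": "VOC", "iy": "VOC", "uw": "VOC",
    "s": "FRIC", "f": "FRIC", "z": "FRIC",
}


def to_broad_classes(phones, unknown="UNK"):
    """Map each phone to its broad class, using `unknown` for unlisted phones."""
    return [PHONE_TO_CLASS.get(phone, unknown) for phone in phones]
```

For example, `to_broad_classes(["s", "iy"])` yields a FRIC label followed by a VOC label, which is the kind of sequence the class-based language model is built over.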

16 Advantages. Simplicity: only 1 acoustic model; only 1 Maximum Entropy model per language. Rich feature set. Flexibility: no phonetically-labeled data is needed (though we use it where possible).

17 Speech Recognizer. Three components to our speech recognizer: acoustic model (1); phonotactic language model (N); pronunciation dictionary.

18 Phonetic Class Language Models. Group similar phones together, based on manner of articulation; e.g., VOC consists of vowels and FRIC consists of fricatives. Probability that n phone classes occur in order: <s> CLOS VOC CLOS </s> <s> CLOS FRIC CLOS CLOS FRIC PRVS CLOS FRIC VOC
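As a rough illustration of "probability that n phone classes occur in order", the sketch below estimates a bigram (n = 2) model over class labels by maximum likelihood. It is an assumption-laden toy: the slides do not say which n-gram order, smoothing, or toolkit the project used, and the training sequences here are made up.

```python
from collections import Counter


def train_bigram_lm(sequences):
    """Maximum-likelihood estimates of P(next | prev) over class labels."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            bigrams[(prev, nxt)] += 1
            contexts[prev] += 1
    return {pair: count / contexts[pair[0]] for pair, count in bigrams.items()}


def sequence_probability(lm, seq):
    """Product of bigram probabilities; unseen bigrams get 0 (no smoothing)."""
    padded = ["<s>"] + list(seq) + ["</s>"]
    prob = 1.0
    for pair in zip(padded, padded[1:]):
        prob *= lm.get(pair, 0.0)
    return prob


# Hypothetical training sequences of broad phone classes.
training = [["CLOS", "VOC", "CLOS"], ["CLOS", "FRIC", "VOC"]]
lm = train_bigram_lm(training)
```

A practical model would add smoothing so that an unseen class pair does not force the whole sequence probability to zero.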

19 Speech Recognizer (diagram): Audio File; Acoustic Model + English LM: English-like phonemes; Acoustic Model + Mandarin LM: Mandarin-like phonemes; ...; Acoustic Model + Tamil LM: Tamil-like phonemes

20 Praat

21 Recognizer and Praat Output
<seglolafile length="3" Xlanguage="sp">
  <filefeatures ingcount="6" f1average="2000"/>
  <timeslices>
    <slice starttime="0ms" duration="0.111ms">
      <seglola> CLOS </seglola>
      <avgf1> </avgf1>
      <avgf2> </avgf2>
      <avgf3> </avgf3>
      <avgf4> </avgf4>
      <avgf5> </avgf5>
      <f0beg> 2.00 </f0beg>
      <f0end> </f0end>
    </slice>
  </timeslices>
</seglolafile>
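A downstream step has to read this output back in. The sketch below parses a trimmed copy of the sample with Python's standard `xml.etree.ElementTree`; the function name and the choice of returned fields are hypothetical, and empty elements are treated as missing values since the sample leaves several measurements blank.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample output shown on the slide.
SAMPLE = """<seglolafile length="3" Xlanguage="sp">
  <filefeatures ingcount="6" f1average="2000"/>
  <timeslices>
    <slice starttime="0ms" duration="0.111ms">
      <seglola> CLOS </seglola>
      <f0beg> 2.00 </f0beg>
      <f0end> </f0end>
    </slice>
  </timeslices>
</seglolafile>"""


def read_slices(xml_text):
    """Return the language attribute and one dict per <slice> element."""
    root = ET.fromstring(xml_text)
    slices = []
    for node in root.iter("slice"):
        f0beg = (node.findtext("f0beg") or "").strip()
        slices.append({
            "label": (node.findtext("seglola") or "").strip(),
            "duration": node.get("duration"),
            # Blank measurements become None rather than a guessed number.
            "f0beg": float(f0beg) if f0beg else None,
        })
    return root.get("Xlanguage"), slices


language, slices = read_slices(SAMPLE)
```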

22 Feature Definition File. Linguists identify relevant acoustic-phonetic features; no need to estimate relative impact. Examples: statistical phonotactics (n-grams); average phoneme duration; pitch contour; rising or falling tone.

23 Maximum Entropy Classifier. Binary decision. Makes no assumptions beyond what is observed in the data. Features provide constraints (Berger et al. 1996).
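For a binary decision, a maximum entropy model has the same form as logistic regression, so a toy version can be sketched as below. This is not the project's classifier: the training loop uses plain stochastic gradient ascent as a stand-in (Berger et al. describe iterative scaling), and the feature names, learning rate, and data are hypothetical.

```python
import math


def predict_probability(weights, feats):
    """P(y = 1 | x) = sigmoid(w . x) for a sparse feature dict."""
    score = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))


def train_binary_maxent(examples, labels, epochs=200, rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    examples: list of dicts mapping feature name to value; labels: 0 or 1.
    """
    weights = {}
    for _ in range(epochs):
        for feats, y in zip(examples, labels):
            p = predict_probability(weights, feats)
            for name, value in feats.items():
                # Gradient of the log-likelihood for one example is (y - p) * x.
                weights[name] = weights.get(name, 0.0) + rate * (y - p) * value
    return weights


# Hypothetical toy data: 1 = "is the target language", 0 = "is not".
examples = [{"CLOS VOC": 1.0}, {"FRIC VOC": 1.0}, {"CLOS VOC": 1.0}]
labels = [1, 0, 1]
weights = train_binary_maxent(examples, labels)
```

With one such binary model per language, an utterance could be assigned to the language whose model gives the highest probability; the slides do not spell out how the per-language decisions are combined.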

24 Evaluation. Training set: OGI-TS corpus; hand-segmented LOLA-format phone class labels (no recognizer); 6 languages, 338 files, 4.6 secs average length. Features: unigram, bigram, trigram, 4-gram, and 5-gram features (broad phone class). 80/20 train/test split.
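The unigram-through-5-gram features over broad phone classes could be extracted roughly as follows. This is a sketch under assumptions: the slides do not say whether sentence-boundary markers are included or how features are named, so the `<s>`/`</s>` padding and the space-joined feature keys are illustrative choices.

```python
from collections import Counter


def ngram_features(classes, max_order=5):
    """Count every n-gram of order 1..max_order in a class-label sequence."""
    padded = ["<s>"] + list(classes) + ["</s>"]
    counts = Counter()
    for n in range(1, max_order + 1):
        for start in range(len(padded) - n + 1):
            counts[" ".join(padded[start:start + n])] += 1
    return counts


# Hypothetical utterance, already reduced to broad phone classes.
feats = ngram_features(["CLOS", "VOC", "CLOS"], max_order=2)
```

Raising `max_order` from 1 to 5 reproduces the nested feature sets (1-gram, 1/2-gram, and so on) that the evaluation compares.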

25 Evaluation. Metric: NIST 2005 LRE defined detection cost, a weighted average of false negatives and false positives. Perfect system: cost = 0.

$$C_{\mathrm{Detection}}(i) = \frac{1}{2}\,P(\mathrm{Miss}(i) \mid \mathrm{Target}(i)) + \frac{1}{2(N-1)} \sum_{j \neq i} P(\mathrm{FalsePositive}(i) \mid \mathrm{NonTarget}(j))$$
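A small worked version of the detection cost may help. It assumes the cost for target language i is half the miss rate plus the false-positive rates against the other N - 1 languages, averaged and halved, which makes the weights sum to 1 and gives a perfect system a cost of 0. The function name and the example rates are hypothetical.

```python
def detection_cost(p_miss, p_false_alarms):
    """Detection cost for one target language.

    p_miss: P(Miss | target language i).
    p_false_alarms: P(FalsePositive for i | non-target j), one value for
        each of the N - 1 other languages.
    """
    n_minus_1 = len(p_false_alarms)
    return 0.5 * p_miss + sum(p_false_alarms) / (2 * n_minus_1)


# Hypothetical rates for a 6-language task (5 non-target languages).
example_cost = detection_cost(0.2, [0.1] * 5)
```

Here the miss term contributes 0.10 and the false-positive term 0.05, so the example cost is 0.15.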

26 Results (table; values not recoverable). Columns: Hits, Misses, False Alarms, Cost. Rows: Spanish, English, Mandarin, Japanese, Tamil, Korean.

27 Results (chart): Average Cost. Y-axis: NIST Cost Score. X-axis: Feature Set (1-gram, 1/2-gram, 1/2/3-gram, 1/2/3/4-gram, 1/2/3/4/5-gram).

28 Results (chart): Improvement with Increasing n-gram Order. Y-axis: Count. X-axis: Max. n-gram Order. Series: Average Misses, Average Hits, Average Falses.

29 Future Work. Experiments from speech (rather than from true transcripts); optimize Sphinx's parameters using Powell's algorithm; train a new acoustic model; train better language models; define better (and more) linguistic features; rewrite the feature transformer using a fixed definition of the feature definition language; more experiments on much more data; participation in NIST evaluation.
