Speech Processing for Audio Indexing

Transcription

1 Speech Processing for Audio Indexing Lori Lamel with J.L. Gauvain and colleagues from Spoken Language Processing Group CNRS-LIMSI LIMSI GoTAL Gothenburg, Sweden, August 25-27,

2 Introduction Huge amounts of audio data produced (TV, radio, Internet, telephone) Automatic structuring of multimedia, multilingual data Much accessible content in the audio and text streams Speech and language processing technologies are key components for indexing Applications: digital multimedia libraries, media monitoring, News on Demand, Internet watch services, speech analytics,... LIMSI GoTAL Gothenburg, Sweden, August 25-27,

3 Processing Audiovisual Data Language recognition Speaker recognition signal Audio Segmentation Speech transcription Emotions Named entities,... XML Acoustic phonetic modeling Corpus based studies Perceptual tests Translation Q&A Audio Analysis Dialog Generation Synthesis signal LIMSI GoTAL Gothenburg, Sweden, August 25-27,

4 Some Recent Projects Conversational interfaces & intelligent assistants: Computers in the Human Interaction Loop (chil.server.de) Human-Human communication mediated by the machine multimodality (speech, image, video) who & where speaker identification and tracking what linguistic content (far-field ASR) why & how style, attitude, emotion Speech translation: Technology and Corpora for Speech to Speech Translation ( Global Autonomous Language Exploitation (DARPA GALE) Developing multimedia and multilingual indexing and management tools QUAERO ( Speech analytics: LIMSI GoTAL Gothenburg, Sweden, August 25-27,

5 Why Is Speech Recognition Difficult? Automatic conversion of spoken words in text Text: I do not know why speech recognition is so difficult Continuous: Idonotknowwhyspeechrecognitionissodifficult Spontaneous: Idunnowhyspeechrecnitionsodifficult Pronunciation: YdonatnowYspiCrEkxgnISxnIzsodIfIk lt YdonowYspiCrEknISNsodIfxk l YdontnowYspiCrEkxnISNsodIfIk lt YdxnowYspiCrEknISNsodIfxk lt Important variability factors: Speaker Acoustic environment physical characteristics (gender, background noise (cocktail party,...) age,...), accent, emotional state, room acoustic, signal capture situation (lecture, conversation, (microphone, channel,...) meeting,...) [after lecture Speech Recognition and Understanding by Tanja Schultz and John Kominek] LIMSI GoTAL Gothenburg, Sweden, August 25-27,

6 Audio Samples (1) WSJ Read stop PREVIOUSLY HE WAS PRESIDENT OF CIGNA S PROPERTY CASUAL GROUP FOR FIVE YEARS AND SERVED AS CHIEF FINANCIAL OFFICER WSJ Spontaneous ACCORDING TO KOSTAS STOUPIS AN ECONOMIST WITH THE GREEK GOV- ERNMENT IT IS IT IS NOT YET POSSIBLE FOR GREECE TO TO MAINTAIN AN EXPORT LEVEL THAT IS EQUIVALENT TO EVEN THE THE SMALLEST LEVEL OF THE OTHER WESTERN NATIONS Broadcast News ON WORLD NEWS TONIGHT THIS WEDNESDAY TERRY NICHOLS AVOIDS THE DEATH SENTENCE FOR HIS ROLE IN THE OKLAHOMA CITY BOMBING. ONE OF THE WORLD S MOST POPULAR AIRPLANES THE GOVERNMENT ORDERS AN IMMEDIATE SAFE TY INSPECTION. AND OUR REPORT ON THE CLONING CONTROVERSY BECAUSE ONE AMERICAN SCIENTIST SAYS HE S READY TO GO. LIMSI GoTAL Gothenburg, Sweden, August 25-27,

7 Conversational Telephone Speech Audio Samples (2) stop Hello. Yes, This is Bill. Hi Bill. This is Hillary. So this is my first call so I haven t a clue as to - - what we re supposed to do yeah. It s pretty funny that we re Bill and Hillary talking on the phone though isn t it? Ah I didn t even think about that. {laughter} uhm we re supposed to talk about did you get the topic. Yes. So we re supposed to talk about what changes if any you have made since... CHIL, spontaneous, non native I just give you a brief overview of [noise] what s going on in uh audio and why we bother with all these microphones in the the materials of the walls, the carpet and all that... LIMSI GoTAL Gothenburg, Sweden, August 25-27,

8 Audio Samples (3) EPPS stop Mister President, in these questions we see the European social model in flashing lights. The European social model is a mishmash which pleases no one: a bit of the free market here and a bit of welfare state there, mixed with a little green posturing. EPPS, spontaneous Thank you very much, Mister Chairman. Hum When I came to this room I I have seen and heard homophobia but sort of vaguely ; through T. V. and the rest of it. But I mean listening to some of the Polish colleagues speak here today, especially Roszkowski, Pek, Giertych and Krupa,... EPPS, non native My main my main problem with the environmental proposals of the Commission is that they do not correspond to the objectives of the Sixth Environmental Action Plan. As far as the traffic is concerned, the sixth E. A. P. stressed the decoupling of transport growth and G. D. P. growth. LIMSI GoTAL Gothenburg, Sweden, August 25-27,

9 Indicative ASR Performance Task Condition Word Error Dictation read speech, close-talking mic. 3-4% (humans 1%) read speech, noisy (SNR 15dB) 10% read speech, telephone 20% spontaneous dictation 14% read speech, non-native 20% Found audio TV & radio news broadcasts 10-15% (humans 4%) TV documentaries 20-30% Telephone conversations 20-30% (humans 4%) Lectures (close mic) 20% Lectures (distant mic) 50% EPPS 8% LIMSI GoTAL Gothenburg, Sweden, August 25-27,

10 t 10 t Sw itc hboar d ht Me Eng lis Me ting Rea Me ting Sw itc hboar di in CTSAra bicdar itc hboar dce lu lar Broah ting= IHM News Man dar in10ara bic10 X Var ie(non\ English) hones ATraveK h her (UL) News Eng lis h1 X News Eng lis h10 X News Eng lis hun lim ite d No isy )%( R E W k % 4 % 2 % 10 from NIST hmar ktes His tory NISTSTBenc SDM MDM ing Spec (UL)CTSMan (UL)0MeXNewsCTSFis Conversa iona lspec h (Non= h) Sw dcas tspec dmicrop % dspec h k20 ir lp lan ing ios kspec k5 % LIMSI GoTAL Gothenburg, Sweden, August 25-27,

11 Some Research Topics Development of generic technology Speaker independent, task independent (large vocabulary > 200k words) Robust (noise, microphone,...) Improving model accuracy Processing found audio: varied speaking styles, uncontrolled conditions Multilinguality: low e-resourced languages Developing systems with less supervision Model adaptation: keeping models up-to-date Annotation of metadata (speaker, language, topic, emotion, style...) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

12 Internet Languages Internet users in Millions per language Increase (%) in Internet users p internetworldstats.com, Nov 30, 2007 language from 2000 to 2007 LIMSI GoTAL Gothenburg, Sweden, August 25-27,

13 ASR Systems Text Corpus Speech Corpus Training Normalization Manual Transcription Feature Extraction N gram Estimation Training Lexicon HMM Training Decoding Language Model Recognizer Lexicon Acoustic Models P(W) P(H W) f(x H) Speech Sample Y Acoustic Front end X Decoder W * Speech Transcription LIMSI GoTAL Gothenburg, Sweden, August 25-27,

14 System Components Acouctic models: HMMs (10-20M parameters), allophones (triphones), discriminative features, discriminative training Pronunciation dictionary with probabilities Language models: statistical N-gram models (10-20M N-grams), model interpolation, connectionist LMs, text normalization Decoder: multipass decoding, unsupervised acoustic model adaptation, system combination (Rover, cross-system adaptation) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

15 Multi Layer Perceptron Front-End Features incorporate longer temporal information than cepstral features MLP NN projects the high-dimensional space onto phone state targets LP-TRAP features (500ms), 4 layer bottle-neck architecture (HMM state targets, 3rd layer size is 39), PCA [Hermansky & Sharma, ICSLP 98; Fousek, 2007] Studied various configurations & ways to combine with cepstral features Main results: bnat06 WER (%) Features PLP MLP wlp PLP + MLP wlp No adaptation SAT+CMLLR+MLLR Signifiant WER reduction without acoustic model adaptation Best combination via feature vector concatenation (78 features) Acoustic model adaptation is less effective for MLP Rover (PLP,PLP+MLP) further reduces the WER LIMSI GoTAL Gothenburg, Sweden, August 25-27,

16 Pronunciation Modeling Pronunciation dictionary is an important system component Usually represented with phones + non-speech units (silence, breath, filler) Well-suited to recognition of read and prepared speech Modeling variants particularly important for spontaneous speech multiwords I don t know Explore (semi-)automatic methods to generate pronunciation variants Explicit use of phonetic knowledge to hypothesize rules anti, multi: /i/ - /ay/ variation inter: /t/ vs nasal flap, retroflex schwa vs schwa stop deletion/insertion y-insertion (coupon, news, super) Data driven: confront pronunciations with very large amounts of data Estimate probabilities of pronunciation variants (very important for Arabic) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

17 Example: coupon kyupan kupan LIMSI GoTAL Gothenburg, Sweden, August 25-27,

18 HMM a 11 a 22 a 33 1 a 12 a f(. s 1) f(. s 2) f(. s ) 3 x 1 x 2 x 3 x 4 x 5 x 6 x 7 x8 LIMSI GoTAL Gothenburg, Sweden, August 25-27,

19 Example: interest InXIst IntrIst LIMSI GoTAL Gothenburg, Sweden, August 25-27,

20 Pronunciation Variants - BN vs CTS 100 hours of data Word Pronunciation BN CTS Word Pronunciation BN CTS interest IntrIst interests IntrIss IntXIst 3 33 IntrIsts InXIst 0 11 IntXIsts 3 2 IntXIss 3 1 interested IntrIstxd interesting IntrIst G IntXIstxd 3 80 IntXIst G InXIstxd InXIst G Similar but less frequent words: interestingly, disinterest, disinterested, intercom, interconnected, interfere... Estimate pronunciation probabilities, but no generalization LIMSI GoTAL Gothenburg, Sweden, August 25-27,

21 Morphological Decomposition Morphological decomposition proposed to address problem of lexical variety (Arabic, German, Turkish, Finnish, Amharic,...) Rule-based vs data-driven approaches Reduces out-of-vocabulary rate Improve n-gram estimation of infrequent words Tendency to create errors due to acoustic confusions of small units Some proposed solutions: - inhibit decomposition of frequent words - incorporate linguistic knowledge (recognizer confusions, assimilation) Recognition performace of optimized systems is typically comparable to and can be combined with word based systems LIMSI GoTAL Gothenburg, Sweden, August 25-27,

22 Continuous Space Language Models Project words onto a continuous space (about 300) Estimate projection and n-gram probabilities with neural network Better generalization for unknown n-grams (compared to general back-off methods) Efficient training with a resampling algorithm Only model the N most common words (10-12k) Interpolated with a standard 4-gram backoff LM 20% perplexity reduction, % reduction in WER across a variety of languages and tasks [Schwenk, 2007] LIMSI GoTAL Gothenburg, Sweden, August 25-27,

23 Reducing Training Costs - Motivation State-of-the-art speech recognizers trained on hundreds to thousands hours of transcribed audio data hundreds of million to several billions of words of texts large recognition lexicons Less e-represented languages Over 6000 languages, about 800 written Poor representation in accessible form Lack of economic and/or political incentives PhD theses: Vietnamese, Khmer [Le, 2006], Somali [Nimaan, 2007], Amharic, Turkish [Pellegrini, 2008] SPICE: Afrikaans, Bulgarian, Vietnamese, Hindi, Konkani, Telugu, Turkish, but also English, German, French [Schultz, 2007] LIMSI GoTAL Gothenburg, Sweden, August 25-27,

24 Questions How much data is needed to train the models? What performance can be expected with a given amount of resources? Are audio or textual training data more important for ASR in less e-represented languages (languages with texts available)? Assess the impact on WER of the quantity of Speech material (acoustic modeling) Texts (transcriptions) Texts (newspapers, newswires available on the Web) Some studies on Amharic (low e-resourced language) [Pellegrini, 2008] LIMSI GoTAL Gothenburg, Sweden, August 25-27,

25 Amharic, 2 hour development set OOV and WER Rates 50 2h 10h 35h h 5h 10h 35h OOV(%) 25 log WER(%) k 100k 500k 1M 4.6M 10k Text corpus size (# words) Audio transcriptions closer to development data Even with 4.6M texts, more audio transcripts help 100k 500k 1M Text corpus size (# words) 4.6M LIMSI GoTAL Gothenburg, Sweden, August 25-27,

26 Amharic, 2 hour development set Influence of Transcriptions in LM h 10h 35h MA2h ML10h MA2h ML35h 50 10h MA10h ML35h 35h log WER(%) 50 log WER(%) k 100k 500k Text corpus size (# words) 1M 4.6M 2h of audio training data is not sufficient Good operating point: 10h audio, 1M word texts 10k 100k 500k Test corpus size (# words) Collecting texts is more efficient than transcribing audio at this operating point 1M 4.6M LIMSI GoTAL Gothenburg, Sweden, August 25-27,

27 Reducing Training Supervision Use of quick transcriptions (less detailed, less costly to produce) low-cost approximate annotations (commercial transcripts, closed-captions) raw data with closely related contemporary texts raw data with complementary texts (only newspaper, newswire,...) Lightly supervised and unsupervised training using speech recognizer [Zavaliagkos & Colthurst 1998; Kemp & Waibel 1999; Jang & Hauptmann, 1999, Lamel, Gauvain & Adda 2000] Standard MLE training assumes that transcriptions are perfect, and are aligned with audio Flexible alignment using full N-gram decoder and garbage model Can skip over unknown words, transcription errors Implicit modeling of short vowels in Arabic LIMSI GoTAL Gothenburg, Sweden, August 25-27,

28 Reducing Training Supervision - Non-speech Use existing models (trained on manual annotations) to automatically detect fillers and breath noises Add these to reference transcriptions Small reduction in word error rate breath filler Manual (50h) 8.1% 0.6% BN (967h) 4.9% 3.4% BC (162h) 5.4% 6.8% LIMSI GoTAL Gothenburg, Sweden, August 25-27,

29 Reducing Supervision: Pronunciation Generation Pronunciation generation in Arabic is relatively straightforward from a vocalized word form Most sites working on Arabic use the Buckwalter morphological analyzer to generate all possible vocalizations But many words are not handled Introduced generic vowel (@) model as a placeholder Developed rules to automatically generate pronunciations with a generic vowel from non-vocalized word form Enables training with partial information ktb k@t@b@ k@t@b@n kt@b@ kt@b@n k@tb@ k@tb@n k@t@b Reduces OOV of recognition lexicon LIMSI GoTAL Gothenburg, Sweden, August 25-27,

30 Alignment Example: li AntilyAs n l y A s LBC NEWS PLAY LIMSI GoTAL Gothenburg, Sweden, August 25-27,

31 Towards Applications Transcription for Translation (TC-STAR) Using the WEB as a Source of Training Data Indexing Internet audio-video data: Audiosurf Open Domain Question/Answering qast/2008 Open domain Interactive Information Retrieval by Telephone (RITel) ritel.limsi.fr Speech Analytics (Infom@gic/Callsurf) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

32 ASR Transcripts Versus Texts Spoken language, disfluencies (hesitations, fillers,...) No document/sentence boundaries Lack of capitalization, punctuation Transcription errors (substitutions, deletions, insertions) Strong correlation between WER and IE metrics (NIST) Slightly more errors on NE tagged words Vocabulary limitation essentially affects person names + Known vocabulary, i.e. no orthographic errors, no need for normalization + Time codes + Confidence measures LIMSI GoTAL Gothenburg, Sweden, August 25-27,

33 Transcription for Translation (TC-STAR) Three languages (EN, ES, MAN), 3 tasks (European Parliament, Spanish Parliament, BN) Improving modeling and decoding techniques audio segmentation, discriminative features, duration models, automatic learning, discriminative training, pronunciation models, NN language model, decoding strategies,... Model adaptation methods speaker adaptive trainining, acoustic and language model adaptation, cross-system adaptation, adaptation to noisy environments,... Integration of ASR and Translation transcription errors, confidence scores, word graph interface punctuation for MT transcription normalization (98 vs. ninety eight, hesitations,...) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

34 TC-Star ASR Progress 30 Best TCStar LIMSI OCT04-BN More task-specific data More complex models Improved modeling and decoding techniques Automatic learning, discriminative training, pronunciation modeling Demo LIMSI GoTAL Gothenburg, Sweden, August 25-27,

35 Using the WEB as a Source of Training Data (Work done in collaboration with Vecsys Research) How to build a system without speaking the language (e.g Russian) and without transcribed data? Main steps Get texts from the WEB, and basic linguistic knowledge for letter-tosound rules Get BN audio data with incomplete and approximate transcripts Normalize texts/transcripts and build n-gram LM Start with seed AMs from other languages Align incomplete transcripts and audio, discard badly aligned data Train AMs Iterate last 2 steps to minimize the overestimated WER LIMSI GoTAL Gothenburg, Sweden, August 25-27,

36 Results: Approximate WER Russian WER Russian WER select Estimated WER using approximate transcripts Approximate WER reduced from 30% to 10% Similar experiments underway with Spanish and Portuguese LIMSI GoTAL Gothenburg, Sweden, August 25-27,

37 Indexing Internet audio/video Processed data 259 sources 80h/day, 33 languages Transcribing and indexing 203 sources 6 languages AUDIOSURF System English French German Arabic Spanish Mandarin All h 10735h 964h 1756h 60h h h 14950h 1639h 3383h 1369h h h 18675h 1970h 4634h 2337h h h 19943h 1972h 4736h 2401h h LIMSI GoTAL Gothenburg, Sweden, August 25-27,

38 LIMSI GoTAL Gothenburg, Sweden, August 25-27,

39 Open Domain Question/Answering in Audio data Automatic structurization of audio data Detect semantically meaningful units Specific entity taggers (named, task, linguistic entities) - rewrite rules (work like local grammars with specific dictionaries) - named and linguistic entity tagging are language dependent but task independent - task entity tagging is language independent but task dependent Projects: RITel (ritel.limsi.fr), Chil (chil.server.de), Quaero ( Participation in CLEF track QAst07, QAst08 ( qast/ 2008) [Rosset et al, ASRU 2007] Compare QA performance on manual and ASR transcripts of varied audio Best accuracy about , generally degradation with increasing WER LIMSI GoTAL Gothenburg, Sweden, August 25-27,

40 RITel System Example stop LIMSI GoTAL Gothenburg, Sweden, August 25-27,

41 RITel System Information Retrieval by Telephone [Rosset et al, 2005,2006] ritel.limsi.fr Example stop Fully integrated open domain Interactive Information Retrieval System Speech detection and recognition Non-Contextual analysis (for texts and speech transcripts) General Information Retrieval IR on databases and dialogic interactions Management of dialog history Natural Language Generation Data collection platform (natural, real-time) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

44 Question: What is the Vlaams Blok? Manual transcript: the Belgian Supreme Court has upheld a previous ruling that declares the Vlaams Blok a criminal organization and effectively bans it. Answer: criminal organisation Extracted portion of an automatic transcript (CTM file format): (...) EN SAT Vlaams EN SAT Blok EN SAT a EN SAT criminal EN SAT organisation EN SAT and (...) Answer: LIMSI GoTAL Gothenburg, Sweden, August 25-27,

45 Speech Analytics Statistical analyses of huge speech corpora Client satisfaction, service efficiency,... Extraction of metadata (content, topics, quality of interaction...) Confidentiality and privacy issues Anonymized transcriptions LIMSI GoTAL Gothenburg, Sweden, August 25-27,

46 CallSurf stop Speech Analytics Sample Extract A: oui bonjour Monsieur, EDF pro à l appareil. C: oui. ah oui oui oui. A: ouais je je vous appelle Monsieur, je vous avais eu il y a quelques jours au téléphone, C: tout à fait. A: et je vous avais transféré vers euh des conseillers euh nouvelle offre qui m ont renvoyé un mail ce matin, en me disant de vous recontacter puisque la demande de mise en service par le premier conseiller que vous aviez eu à l origine n avait pas été faite. C: ah bon A: ouais. donc euh bon c est pas coupé sur place. c est pas le problème... LIMSI GoTAL Gothenburg, Sweden, August 25-27,

47 Detailed Transcription edfp h5 fre spk <o,f0,male> allò oui edfp h5 fre spk <o,f0,female> {fw} Monsieur Dupont edfp h5 fre spk <o,f0,male> oui edfp h5 fre spk <o,f0,female> oui Ma(demoiselle) Mademoiselle Brun EDF pro edfp h5 fre spk <o,f0,male> oui edfp h5 fre spk <o,f0,female> {breath} {fw} je vous rappelle je viens d avoir le service technique {fw} concernant {fw} notre discussion {fw} de ce matin edfp h5 fre spk <o,f0,female> donc {fw} après vérification en fait le prédécesseur qui était donc un client {fw} particulier en référence particulier {breath} a été... LIMSI GoTAL Gothenburg, Sweden, August 25-27,

48 Quick Anonymized Transcription A/ Sylviane XX EDF Pro bonjour. C/ allò bonjour A/ oui C/ je vous donne ma référence client? A/ s il vous plait C/ rrrrr A/ rr rrr C/ rrr A/ oui C/ rrr tiret r r LIMSI GoTAL Gothenburg, Sweden, August 25-27,

49 Training with Reduced Information Transcriptions assumed to be imperfect Alignment allowing all reference words to be optionally skipped Filler words and breath can be inserted between words Anonymized forms absorbed by a generic model Training uses a word lattice generated by the decoder Word error rate currently about 30% LIMSI GoTAL Gothenburg, Sweden, August 25-27,

50 Some Perspectives State-of-the-art transcription systems have WER ranging from 10-30% depending upon the tasks Comparable results obtained on several languages (English, French, German, Mandarin, Spanish, Dutch...) Performance is sufficient for some tasks (indexing huge amounts of audio/video, spoken document retrieval, speech-to-speech translation) Requires substantial training resources but recent research has demonstrated the potential of semi- and unsupervised methods Processing of more varied data (podcasts, voice mail, meetings) Dealing with different speaker accents (regional and non-native accents) Keeping models up-to-date Developing systems for oral languages (without a written form) LIMSI GoTAL Gothenburg, Sweden, August 25-27,

51 Over 25 Years of Progress Voice commands single speaker 2 30 words Controled dialog speaker indep words Conversational Telephone Speech speaker indep. Speech Analytics Audio mining Isolated words single speaker 2 30 words Isolated word dictation single speaker 20k words Transcription for indexation TV & radio Unlimited Domain Transcription Internet audio Connected words speaker indep 10 words Continuous dictation single speaker 60k words Unlimited Domain Speech to speech Translation LIMSI GoTAL Gothenburg, Sweden, August 25-27,