Practical Applications of Speech Signal Processing

Vishu R Viswanathan
TI Fellow, Director, Speech Technologies Lab
DSP Solutions R&D Center
Texas Instruments, Dallas, Texas
v-viswanathan@ti.com
March 2004

Lecture Outline
- Goals of the Lecture
- Speech Coding
- Speech Synthesis
- Speech Recognition & Understanding
- Speaker Recognition
- Speech Enhancement
- Speech Modification

Goals of the Lecture
- Introduce and discuss each of a number of speech signal processing areas
- List examples of practical applications
- Discuss some selected topics in each area
- High-level presentation only

Speech Coding
Goal
- Reduce speech signal data rate
- Maintain high speech quality
General principle: take advantage of
- Redundancies in the speech signal
- Properties of speech production and perception
Applications
- Digital cellular telephony, voice over IP, IP phone, audio/video conferencing, PSTN trunking, secure voice communication, digital answering machines, voice mail, voice response systems, talking products

Components of a Speech Coding System
[Figure: block diagram]
- Transmit side: sampled speech s(n) -> Analyzer -> x(n) -> Encoder -> y(n) -> Channel or Medium
- Receive side: ŷ(n) -> Decoder -> x̂(n) -> Synthesizer -> ŝ(n)
Goal: Minimize the data rate of y(n) while maximizing the speech quality of ŝ(n)

Types of Speech Coders
Waveform Coders
- Goal: Reproduce speech on a sample-by-sample basis
- High data rates, high speech quality
- Examples: 64 kb/s PCM (G.711), 32 kb/s ADPCM (G.726)
Parametric Coders
- Speech production characterized by parametric models
- Low data rates, good speech intelligibility, communications/synthetic speech quality
- Examples: 2.4 kb/s LPC (FS 1015), 2.4 kb/s MELP (recent NATO standard)
Analysis-by-Synthesis Coders
- Hybrid between waveform and parametric coders, with medium data rates
- Parametric models used, with the excitation signal computed by minimizing the error between synthesized speech and input speech
- Examples: 16 kb/s G.728, 8 kb/s G.729
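As a rough illustration of the waveform-coder idea, the sketch below applies continuous mu-law companding (mu = 255) followed by uniform 8-bit quantization to a single sample. It is not the bit-exact segmented G.711 encoder, and all function names are hypothetical. It also shows where the 64 kb/s figure comes from: 8000 samples per second times 8 bits per sample.

```python
import math

MU = 255.0  # companding constant commonly used for mu-law PCM


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode(x: float, bits: int = 8) -> int:
    """Compand, then quantize uniformly to an integer code (illustrative only)."""
    levels = 2 ** bits
    y = mulaw_compress(max(-1.0, min(1.0, x)))
    # Map [-1, 1] onto the code range [0, levels - 1].
    return min(levels - 1, int((y + 1.0) / 2.0 * levels))


def decode(code: int, bits: int = 8) -> float:
    """Reconstruct a sample from the mid-point of its quantization cell."""
    levels = 2 ** bits
    y = (code + 0.5) / levels * 2.0 - 1.0
    return mulaw_expand(y)


def bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Data rate in bits per second for a sample-by-sample coder."""
    return sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    print(bit_rate(8000, 8))    # data rate of 8-bit PCM at 8 kHz sampling
    print(decode(encode(0.5)))  # close to, but not exactly, 0.5
```

Companding spends more quantization levels on small amplitudes, which is one way a coder exploits properties of speech perception.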

Speech Quality
Terms Used
- Toll quality: high-grade wireline telephone
- High quality
- Good quality
- Communications quality
- Transparent quality
Formal Subjective Testing Methods
- Expensive, time consuming
- Mean opinion score (MOS): used in all industry standards bodies
- Diagnostic acceptability measure (DAM): used by the US Dept. of Defense
Informal and Semi-Formal Subjective Tests
- Pairwise or A/B comparisons
- Rating tests
Objective Methods
- Signal-to-noise ratio, ITU-T P.862 (PESQ)
- Automatic, repeatable, useful in coder development and optimization
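The two simplest measures named above can be sketched in a few lines. The code assumes MOS is the plain average of listener ratings on a 1-to-5 scale and that SNR compares the original samples with the decoded samples. The function names and the sample data are made up for illustration, and nothing here approximates PESQ.

```python
import math
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings, each on a 1 (bad) to 5 (excellent) scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


def snr_db(reference, decoded):
    """Signal-to-noise ratio in dB between original and decoded samples."""
    signal_energy = sum(s * s for s in reference)
    noise_energy = sum((s - d) ** 2 for s, d in zip(reference, decoded))
    if signal_energy == 0:
        raise ValueError("reference signal has zero energy")
    if noise_energy == 0:
        return math.inf  # decoded signal is identical to the reference
    return 10.0 * math.log10(signal_energy / noise_energy)


if __name__ == "__main__":
    print(mean_opinion_score([4, 5, 3, 4]))
    print(snr_db([1.0, -1.0, 1.0, -1.0], [0.9, -0.9, 0.9, -0.9]))
```

SNR is meaningful mainly for waveform coders; parametric coders do not try to match the waveform sample by sample, which is one reason perceptual measures and subjective tests are used.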

Speech Coder Attributes
[Figure: attribute scales, each running between two extremes]
- Bit rate: low to high (1200, 2400, 4800, 8000, 16000, 32000, 64000 bits/second)
- Quality: low to high (mean opinion score 2.5, 3.0, 3.5, 4.0)
- Input condition: clean speech to noisy speech (handheld, hands-free)
- Delay: low to high (10, 50, 100, 200 milliseconds)
- Complexity: low to high (MIPS, memory)
- Signal type: human speech, sound effects, music

Speech Coding Standards: ITU Standards

Coder     Rate (kb/s)   Approach
G.711     64            Mu/A-law
G.726     16-40         ADPCM
G.728     16            LD-CELP
G.729     8             CS-ACELP
G.723.1   5.3/6.3       MP/ACELP

- ITU standards are targeted for telephone network applications
- Also used in Voice over IP applications
- All produce toll quality speech

Speech Coding Standards: Digital Cellular Standards

Region          Coder       Rate (kb/s)   Channel rate (kb/s)   Approach    Date
Europe          GSM FR      13            22.8                  RPE-LTP     1987
Europe          GSM HR      5.6           11.4                  VSELP       1994
Europe          GSM EFR     12.2          22.8                  ACELP       1995
Europe          GSM AMR     4.75-12.2     11.4-22.8             ACELP       1998
North America   TIA IS54    7.95          13                    VSELP       1989
North America   TIA IS95    0.8-8.55      -                     QCELP       1993
North America   TIA Q13     0.8-13.3      -                     QCELP       1995
North America   TIA IS641   7.4           13                    ACELP       1996
North America   TIA EVRC    0.8-8.55      -                     R-ACELP     1996
North America   TIA SMV     0.8-8.5       -                     R-ACELP     2001
Japan           PDC FR      6.7           11.2                  VSELP       1990
Japan           PDC HR      3.45          5.6                   PSI-CELP    1993
Japan           PDC EFR     8             11.2                  ACELP       1999
Japan           PDC EFR     6.7           11.2                  ACELP       2000

Speech Coding Standards: Wideband Standards

Coder     Rate (kb/s)   Approach
G.722     48, 56, 64    SB-ADPCM
G.722.1   24, 32        Transform
ITU WB    16, 24        ACELP
AMR WB    6.60-23.85    ACELP
VMR WB    1.0-13.3      ACELP

Wideband: 50 Hz to 7 kHz (versus narrowband telephone, 300-3200 Hz)

Speech Synthesis
Human Speech Based Systems
- Suitable for known material
- Speech coding based: talking toys, talking books, voice prompts, voice response systems
- Concatenation of pre-recorded voice data: information retrieval (stock quotes, airline schedules, banking)
Text-to-Speech Systems
- Suitable for unknown or arbitrary text
- Applications: e-mail/fax reading, phone access to web-based services, spoken telephone directory, car navigation, location-based services, customer service, help desk, reading machines for the blind

Components of a TTS System
[Figure: Text -> Text Analysis -> Letter-to-Sound -> Synthesizer -> Speech, supported by Dictionary and Rules. Courtesy of Larry Rabiner]
Text Analysis
- Numerical expansion (dates, times, money)
- Abbreviations, acronyms
- Proper name identification
- Example: "Dr. Smith lives at 23 Lakeshore Dr."
Letter-to-Sound
- Phonemes
- Pitch
- Duration
- Pauses
- Loudness/amplitude
Synthesizer
- Choice of units: words, phones, diphones, dyads, syllables
- Choice of parameters: LPC, formants, waveform templates, articulatory parameters, sinusoidal parameters
- Method of computation: rules, concatenation
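The text-analysis stage can be illustrated with the slide's own example sentence, in which "Dr." means "Doctor" in one place and "Drive" in another. The sketch below is a toy normalizer, not a real TTS front end: it assumes "Dr." is a title when the next word is capitalized and a street suffix otherwise, and it expands only the numbers 0 to 99. All names are hypothetical.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99 (toy numerical expansion)."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Expand small numbers and disambiguate 'Dr.' with a crude context rule."""
    tokens = text.split()
    out = []
    for i, token in enumerate(tokens):
        next_token = tokens[i + 1] if i + 1 < len(tokens) else ""
        if token == "Dr.":
            # Assumed heuristic: a following capitalized word suggests a title.
            out.append("Doctor" if next_token[:1].isupper() else "Drive")
        elif token.isdigit() and int(token) <= 99:
            out.append(number_to_words(int(token)))
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(normalize("Dr. Smith lives at 23 Lakeshore Dr."))
```

A one-rule heuristic like this breaks easily (for example, when a street suffix is followed by a capitalized word), which is why the text-analysis stage relies on dictionaries and richer rules.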

Speech Recognition & Understanding
Problem
- Recognition: automatic recognition of human speech by machine
- Understanding: interpret the meaning of recognized speech and map it to actions to be taken
Applications
- Voice dialing (name or number dialing) in telephone, cellphone, PDA, smartphone (safety laws against handheld cellphone use while driving)
- Voice command & control in telematics, cellphone, PDA, smartphone, PC, toys
- Voice-enabled web browsing, information retrieval (stock quotes, weather forecast, airline flight information, banking), navigation, e-mail, SMS, dictation
- Automated customer service and help desks
Benefits
- Hands-free, eyes-free use; no need to use the keypad; faster task completion; ease of use; part of a multi-modal interface; cost savings

Components of a Speech Recognizer
[Figure: block diagram]
- Signal flow: speech signal -> Feature Extraction -> Acoustic Scoring -> Decoding -> word string
- Front end: Feature Extraction
- Back end: Acoustic Scoring and Decoding, using Acoustic Models and Language Models

Speech Recognizer Attributes
[Figure: attribute scales, each running between two extremes]
- Speaker mode: speaker dependent, speaker adaptive, speaker independent
- Vocabulary size: small to large (10, 100, 1000, 10000 words)
- Speaking style: isolated words, continuous speech, conversational speech
- Task: recognition, syntax, semantics, understanding
- Input condition: clean speech to noisy speech (handheld, hands-free)
- Complexity: low to high (MIPS, memory)
- Architecture: server based, distributed, client based

Performance & Robustness
Performance
- Recognition accuracy: word error rate (WER) or task completion rate
- High enough performance required for user acceptance
Robustness Issues
- Training versus operational condition differences
- Background noise: extent of noise, its variability (usually additive)
- Channel variability: different microphones, different telephone circuits, handheld, hands-free, handheld-hands-free (usually convolutive)
- Recognizer must have means to compensate for noise and channel variabilities
- Out-of-vocabulary rejection capability
- Speaker dialect and accent variability (handled by speaker adaptation)
User Interface: very important for the success of an application
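Word error rate is usually computed as the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch using a standard edit-distance recurrence follows; the function name and the example utterances are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (john -> jon) and one deletion (at) over five words.
    print(word_error_rate("call john smith at home", "call jon smith home"))
```

Because insertions count as errors, WER can exceed 1.0 when the recognizer outputs many more words than were spoken.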

Recognition in Multiple Languages
Speaker-Dependent Recognition
- Language independent (user can enroll names for voice dialing in multiple languages!)
Some Observations for Speaker-Independent Recognition
- Same recognition engine, but different data (models, dictionary) needed
- Recognition grammar to handle language-specific usage differences (e.g., the French speak telephone numbers in pairs; natural number dialing needed)
- Training requires speech databases and a dictionary in the new language
- Automatic training tools to minimize the time to develop recognition in a new language

Speaker Recognition
Speaker Verification / Authentication
- Problem: use voice input to verify the user's claimed identity
- Applications: secure access to premises, information (banking), services (voice dialing), etc.
- Issues
  - True user acceptance traded off with impostor acceptance
  - Total voice verification
  - Fixed text versus free text
Speaker Identification
- Problem: use voice to identify the speaker from a closed or open set of speakers
- Applications: legal and forensic use, intelligence, security
- Issues: uncooperative user, often relatively short-duration speech, noisy and/or distorted speech
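The trade-off between accepting true users and accepting impostors can be shown with a decision threshold on a verification score. The sketch below assumes a hypothetical system that outputs a similarity score per attempt and accepts when the score reaches the threshold; the scores are invented for illustration.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false accept rate, false reject rate) at a given threshold."""
    false_rejects = sum(score < threshold for score in genuine_scores)
    false_accepts = sum(score >= threshold for score in impostor_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


if __name__ == "__main__":
    genuine = [0.90, 0.80, 0.75, 0.60]   # made-up scores from the true user
    impostor = [0.70, 0.40, 0.30, 0.20]  # made-up scores from impostors
    for threshold in (0.50, 0.72):
        far, frr = error_rates(genuine, impostor, threshold)
        print(f"threshold={threshold:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```

Raising the threshold lowers impostor acceptance at the cost of rejecting more true users, so the operating point depends on the application (for example, banking versus voice dialing).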

Speech Enhancement
- Noise Suppression
- Playback Enhancement
- Acoustic Echo Cancellation

Noise Suppression
Problem
- Remove acoustic noise from a noisy speech signal for better listenability or for improved performance of speech processing devices
- Requirements: no speech signal distortion, no loss of speech intelligibility, no artifacts like musical noise, natural-sounding residual noise
Methods
- Single-microphone approach: spectral subtraction family of methods
- Multi-microphone approach: adaptive noise cancellation, microphone-array-based fixed or adaptive beamforming, blind signal separation
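A minimal sketch of the basic magnitude spectral subtraction idea on a single frame is given below. It assumes a noise magnitude spectrum is already available (in practice it is typically estimated during speech pauses), reuses the noisy phase, and keeps a small spectral floor to limit musical-noise artifacts. The naive DFT, the function names, and the floor value are illustrative choices only; a practical system would process overlapping windowed frames with an FFT.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (fine for short illustrative frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtract(noisy_frame, noise_magnitude, floor=0.02):
    """Subtract a noise magnitude estimate from each bin of one frame."""
    cleaned = []
    for value, noise_mag in zip(dft(noisy_frame), noise_magnitude):
        magnitude = abs(value)
        # Never go below a small fraction of the noisy magnitude.
        new_magnitude = max(magnitude - noise_mag, floor * magnitude)
        # Keep the noisy phase; only the magnitude is modified.
        cleaned.append(cmath.rect(new_magnitude, cmath.phase(value)))
    return idft(cleaned)


if __name__ == "__main__":
    n = 16
    speech = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
    noise = [0.1 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
    noisy = [s + v for s, v in zip(speech, noise)]
    estimate = [abs(v) for v in dft(noise)]  # idealized noise estimate
    enhanced = spectral_subtract(noisy, estimate)
    print(sum((a - b) ** 2 for a, b in zip(enhanced, speech)))
```

The example uses a perfectly known, stationary "noise" tone, which is far easier than real background noise; with an imperfect estimate the subtraction leaves residual noise or distorts the speech.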

Playback Enhancement
Problem
- Enhanced playback of speech to the listener
Methods
- Spectrally shape the speech signal prior to playback, for improved intelligibility when the listener is in a noisy environment (PA systems in aircraft, airports, sports arenas)
- Active noise cancellation to cancel noise acoustically in the listener's ears (ANC headsets)
- Narrowband-to-wideband speech extension to provide wideband speech perception

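Acoustic Echo Cancellation
[Figure: block diagram of an acoustic echo canceller (AEC). The downlink (far-end) signal drives the loudspeaker and passes through the echo channel H(z) to the microphone. The AEC's adaptive filter Ĥ(z) produces the echo estimate ŷ(n), which is subtracted from the microphone signal v(n) = u(n) + y(n) + n0(n) to form the error signal e(n), sent as the uplink (near-end) signal. Other signal labels in the figure: r(n), s(n), x(n).]
Goal: Cancel feedback from the loudspeaker into the microphone using an adaptive linear filter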
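One common choice of adaptive linear filter for this task is normalized LMS (NLMS); the slide does not say which algorithm is intended, so the sketch below is only one possibility. The function and parameter names, the tap count, and the step size are assumptions. The filter predicts the echo from recent far-end samples, subtracts the prediction from the microphone signal, and uses the error to update its weights.

```python
import random


def nlms_echo_canceller(far_end, microphone, num_taps=8, step=0.5, eps=1e-8):
    """Return (error signal, final filter weights) for an NLMS echo canceller."""
    weights = [0.0] * num_taps   # current estimate of the echo path
    history = [0.0] * num_taps   # recent far-end samples, newest first
    errors = []
    for far_sample, mic_sample in zip(far_end, microphone):
        history = [far_sample] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = mic_sample - echo_estimate  # echo-cancelled output
        energy = sum(h * h for h in history) + eps
        # Normalized LMS update: step scaled by the far-end signal energy.
        weights = [w + step * error * h / energy
                   for w, h in zip(weights, history)]
        errors.append(error)
    return errors, weights


if __name__ == "__main__":
    rng = random.Random(0)
    far = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
    echo_path = [0.5, -0.3, 0.1]  # made-up loudspeaker-to-microphone response
    mic = [sum(echo_path[k] * far[n - k]
               for k in range(len(echo_path)) if n - k >= 0)
           for n in range(len(far))]
    errors, weights = nlms_echo_canceller(far, mic)
    print([round(w, 3) for w in weights])
```

The example has no near-end speech or noise, so the filter can converge to the echo path; when the near-end talker is active, practical systems usually slow or freeze adaptation (double-talk handling), which this sketch omits.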

Speech Modification
Voice Conversion
- Convert one voice to sound like another
- Example: a female voice converted to sound like a low-pitched male voice (security)
Time-Scale or Rate Modification
- Speed up or slow down speech, while preserving naturalness
- Applications: talking books, pre-recorded lectures, language learning
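The simplest form of time-scale modification is plain overlap-add (OLA): windowed frames are read from the input at one hop size and written to the output at another. The sketch below shows only this basic mechanism, with hypothetical names and frame sizes. Plain OLA does not align successive frames, so it produces audible artifacts on voiced speech; methods that preserve naturalness add a waveform-alignment step (as in SOLA- or WSOLA-style methods), which is not shown here.

```python
import math


def time_scale_ola(samples, rate, frame_len=256, synth_hop=64):
    """Change duration by about 1/rate using windowed overlap-add.

    rate > 1 speeds speech up; rate < 1 slows it down.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    analysis_hop = max(1, round(synth_hop * rate))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]  # Hann window
    num_frames = max(1, (len(samples) - frame_len) // analysis_hop + 1)
    out_len = (num_frames - 1) * synth_hop + frame_len
    output = [0.0] * out_len
    weight = [0.0] * out_len
    for m in range(num_frames):
        read_pos = m * analysis_hop   # where the frame is taken from
        write_pos = m * synth_hop     # where the frame is placed
        for i in range(frame_len):
            if read_pos + i < len(samples):
                output[write_pos + i] += window[i] * samples[read_pos + i]
                weight[write_pos + i] += window[i]
    # Normalize by the summed window so overlapping frames keep unit gain.
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(output, weight)]


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(4096)]
    faster = time_scale_ola(tone, rate=2.0)
    print(len(tone), len(faster))
```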