Mobile Speech Processing

1 Mobile Speech Processing. David Huggins-Daines, Language Technologies Institute, Carnegie Mellon University. September 19, 2008

2 Outline: Mobile Devices (What are they? What would we like to do with them?); Mobile Speech Applications; Mobile Speech Technologies; Current Research

3 Mobile Devices: What is a mobile device? A hammer is a device, and you can carry it around with you! But no, that's not what we mean here.

4 Mobile Devices: What is a mobile device? A device that goes everywhere with you... which provides some or all of the functions of a computer... and some that a computer doesn't, such as those of a cell phone or GPS.

5 Speech on Mobile Devices. Why do we care about speech processing on these devices? Because they are the future of computers, and because speech is actually a useful way to interact with them, unlike full-sized computers. What kind of speech processing do we care about? Speech coding to improve voice quality for cellular and VoIP; speech recognition for hands-free input to apps; speech synthesis for eyes-free output from apps. In some cases, speech is a natural and convenient modality. In other cases, it is a necessity (e.g. in-car navigation).

6 Speech on Mobiles vs. Mobile Speech. None of this necessarily implies doing actual speech processing (aside from coding) on the device itself. Telephone dialog systems are mobile by any definition: Let's Go (bus scheduling information); HealthLine (medical information for rural health workers). But all synthesis and recognition is done on a server. This can be a good thing, especially in the latter case: you can't run a speech recognizer on a Motofone or a Nokia 1010. Speech processing on the device is useful for: multimodal applications; disconnected applications; access to local data.

7 Some Mobile Speech Applications. GPS navigation: older systems used a small number of recorded prompts ("turn left", "100 metres", etc.); more recently, TTS has been used to speak street names; even more recently, ASR is used for input. Voice dialing: old systems used DTW and required training; newer ones build models from your address book; Cactus for iPhone uses CMU Flite and Sphinx. Voice-driven search (local, web, etc.): Nuance, Vlingo, TellMe, Microsoft are all doing this. Voice-to-text: typically server-based, requires a data connection; on-line, ASR-based: Vlingo, Nuance; off-line, human-assisted: SpinVox, Jott, ReQall. Speech-to-speech translation.

8 Mobile Speech Technologies. Speech coding: efficient digital representation of speech signals; fundamental for 2G and 3G cell networks and VoIP. Speech synthesis: speech output for commands, directions; text-to-speech for messages, books, other content. Speech recognition: command and control ("voice control"); dictation (speech-to-text for , SMS); search input (questions, keywords); dialogue.

9 Speech Coding. A fairly mature technology (started in the 1960s). Early versions were mostly for military applications; digital cell phone networks changed this dramatically. Almost universally based on linear prediction and the source-filter model: each sample is predicted as a weighted sum of the P previous samples. The weights are linear prediction coefficients (LPCs), and are calculated to minimize mean squared error. Conveniently enough, this is actually a good model of the frequency response of the vocal tract (given enough LPCs). An excitation function models the glottal source. Everything else is just tweaking: better excitation functions (CELP); variable bit rates (AMR); compression tricks (VAD + comfort noise).
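
The linear prediction step on this slide can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to find the weights that minimize mean squared prediction error; the slide does not say which method is meant. The function names `autocorrelation` and `lpc` are hypothetical, and a real codec would also window the signal, quantize the coefficients, and model the excitation.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame of samples."""
    n = len(signal)
    return [
        sum(signal[t] * signal[t - lag] for t in range(lag, n))
        for lag in range(max_lag + 1)
    ]


def lpc(signal, order):
    """Estimate linear prediction coefficients (assumed method: Levinson-Durbin).

    Returns (coeffs, error), where the prediction of sample t is
    sum(coeffs[k] * signal[t - 1 - k] for k in range(order))
    and error is the residual energy left after prediction.
    """
    r = autocorrelation(signal, order)
    coeffs = [0.0] * order
    error = r[0]
    for i in range(order):
        if error <= 0.0:
            break  # silent or perfectly predicted frame: nothing left to model
        # Reflection coefficient for extending the predictor to order i + 1.
        acc = r[i + 1] - sum(coeffs[j] * r[i - j] for j in range(i))
        k = acc / error
        updated = coeffs[:]
        updated[i] = k
        for j in range(i):
            updated[j] = coeffs[j] - k * coeffs[i - 1 - j]
        coeffs = updated
        error *= 1.0 - k * k
    return coeffs, error


def predict(signal, coeffs, t):
    """Predict sample t as a weighted sum of the previous len(coeffs) samples."""
    return sum(c * signal[t - 1 - k] for k, c in enumerate(coeffs))
```

For a signal that decays geometrically, such as `[0.9 ** t for t in range(200)]`, a first-order predictor recovers a weight very close to 0.9, since each sample is 0.9 times the previous one.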

10 Mobile Speech Synthesis. Two traditional categories, one new one: synthesis by rule, e.g. formant synthesis; concatenative synthesis, e.g. diphone, unit selection; statistical-parametric synthesis ("HMM synthesis"). We have had very efficient (often hardware-based) implementations of TTS for decades. They sound terrible (but are often quite intelligible). The challenges for mobile devices are: achieving natural-sounding speech; dealing with very large, irregular vocabularies; dealing with raw and diverse input text.

11 Mobile Speech Synthesis. Unit selection currently gives the most natural output, but it is very ill-suited to mobile implementations: the best systems use gigabytes of speech data. But, you say... I have an 8GB microSD card in my phone! Storage space is not the only cost. Search time: finding the right units of speech. Access time: loading them from the storage medium. Signal generation can also be time-consuming if not efficiently implemented. Some ways to improve efficiency: compress the speech database; prune the speech database by discarding units that are infrequently or never used; use approximate search algorithms (much like ASR).

12 Mobile Speech Synthesis. Statistical-parametric synthesis is quite promising: models are quite small (1-2MB), and the search problem is nonexistent. Parameter and waveform generation are currently the most time-consuming parts: it requires higher-dimensionality parameterizations than concatenative synthesis; output parameters are smoothed using an iterative algorithm (similar to EM); waveform generation from mel-cepstral coefficients (mcep) is much slower than from LPC. Dictionary compression and text normalization: the dictionary can be compressed by building letter-to-sound models and listing only the exceptions; efficient finite-state transducer representations can be created for pronunciation and text processing rules.
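
The dictionary compression idea above (keep a letter-to-sound model and store only the words it gets wrong) can be sketched as follows. Everything here is a toy assumption for illustration: `TOY_RULES`, `toy_letter_to_sound`, and the three-word lexicon are hypothetical, and a real system would use a trained letter-to-sound model rather than a one-letter-to-one-phone table.

```python
# Hypothetical one-letter-to-one-phone rules standing in for a trained model.
TOY_RULES = {
    "c": "k", "a": "ae", "t": "t", "d": "d",
    "o": "ao", "g": "g", "n": "n", "e": "eh",
}


def toy_letter_to_sound(word):
    """Predict a pronunciation letter by letter (toy stand-in for a real model)."""
    return [TOY_RULES.get(ch, ch) for ch in word]


def compress_lexicon(lexicon, letter_to_sound):
    """Keep only the entries that the letter-to-sound model gets wrong."""
    return {
        word: phones
        for word, phones in lexicon.items()
        if letter_to_sound(word) != phones
    }


def lookup(word, exceptions, letter_to_sound):
    """Use the stored exception if there is one, otherwise fall back to the model."""
    if word in exceptions:
        return exceptions[word]
    return letter_to_sound(word)


# Hypothetical full lexicon; only "one" is irregular under the toy rules.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "one": ["w", "ah", "n"],
}

EXCEPTIONS = compress_lexicon(LEXICON, toy_letter_to_sound)
```

With these toy rules only the entry for "one" needs to be stored; "cat" and "dog" are regenerated from the rules at lookup time. The better the letter-to-sound model, the shorter the exception list.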

13 Mobile Speech Recognition. Challenges for mobile devices are: variable and noisy acoustic environments; large vocabularies; open-domain dictation input. As with speech synthesis, simple ASR is not very resource intensive, although it has not been as widely implemented. Even with large vocabularies, ASR can be done efficiently. The most important factor is the complexity of the grammar. Commercial systems achieve impressive performance based on very constrained grammars. Systems tend to be extensively tuned for a given application.

14 Mobile Speech Recognition: Acoustic Issues. How do you talk to a device? This depends on the application, user, and environment. Acoustic feature vectors can look very different. Microphones may not be optimized for all positions. Noisy environments: mobile devices are more likely to be used in noisy environments. Worse, they are more likely to be used in difficult ones: non-stationary noise, crosstalk, human babble. Array processing is not well suited to handheld devices. On the bright side: usually a mobile device has only one user; speaker adaptation can improve acoustic modeling; speaker identification can be used to filter out babble and crosstalk.

15 Mobile Speech Recognition: Computational Issues. Acoustic feature extraction: efficient, as long as it is implemented properly (fixed-point arithmetic, data-parallel processing). Most processing time is consumed, in roughly equal amounts, by: acoustic model evaluation; search (hypothesis generation and evaluation). These can be made computationally efficient but must also be made memory efficient, search in particular. This necessarily involves tuning heuristics because a complete solution is intractable.
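
As a rough illustration of what fixed-point arithmetic means for feature extraction, the sketch below represents fractional values as integers with 15 fractional bits (a Q15-style format). The choice of 15 bits and the helper names are assumptions for illustration; the slide does not specify a format, and production code would be written in C or assembly with explicit overflow and saturation handling.

```python
Q = 15  # assumed number of fractional bits (Q15-style format)
ONE = 1 << Q


def to_fixed(x):
    """Convert a float to its nearest fixed-point integer representation."""
    return int(round(x * ONE))


def to_float(a):
    """Convert a fixed-point integer back to a float."""
    return a / ONE


def fixed_mul(a, b):
    """Multiply two fixed-point values using integer operations only.

    The raw product has 2 * Q fractional bits, so shift right by Q to
    return to Q fractional bits (the low bits are truncated).
    """
    return (a * b) >> Q


def fixed_dot(xs, ys):
    """Dot product in fixed point, e.g. one filter tap sum or one DCT output."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y  # accumulate at full precision
    return acc >> Q  # rescale once at the end to limit truncation error
```

For example, `to_float(fixed_mul(to_fixed(0.5), to_fixed(0.25)))` gives 0.125 using only an integer multiply and a shift, which is the point on devices without a floating-point unit.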

16 Mobile Speech Recognition: Acoustic Modeling. Exact acoustic model evaluation is intractable:

$$P(o \mid s_i, \lambda) = \sum_{k=1}^{K} w_{ik} \frac{1}{\sqrt{(2\pi)^D |\Sigma_{ik}|}} \exp\left( -\sum_{d=1}^{D} \frac{(o_d - \mu_{ikd})^2}{2\sigma_{ikd}^2} \right)$$

Typical continuous-density acoustic model: 5000 tied states, each with 32 Gaussian densities, of 39 dimensions. Complete evaluation of all log-likelihoods for one 10ms frame: log-additions, subtractions, multiplications. That's 2500 million operations per second! Your new MacBook Pro can do that, but just barely (yes, its video card can do it easily).
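
The mixture likelihood and the operation count on this slide can be checked with a short sketch. Two things are assumptions rather than statements from the slide: the covariances are taken to be diagonal (which the per-dimension variances suggest), and each dimension is counted as roughly four arithmetic operations (one subtraction, two multiplications, one accumulation), which is one accounting that reproduces the quoted figure. The function names are hypothetical.

```python
import math


def log_add(a, b):
    """Return log(exp(a) + exp(b)) without overflow."""
    if a < b:
        a, b = b, a
    return a + math.log1p(math.exp(b - a))


def gmm_log_likelihood(obs, log_weights, means, variances):
    """Log of the mixture likelihood for one state (diagonal covariances assumed)."""
    total = None
    for log_w, mu, var in zip(log_weights, means, variances):
        # Log of the normalizer 1 / sqrt((2*pi)^D * |Sigma|) for a diagonal Sigma.
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        # Per dimension: one subtraction, a square, and a scale by 1 / (2 * variance).
        dist = sum((o - m) ** 2 / (2.0 * v) for o, m, v in zip(obs, mu, var))
        log_density = log_w + log_norm - dist
        total = log_density if total is None else log_add(total, log_density)
    return total


def ops_per_second(states=5000, densities=32, dims=39,
                   frames_per_second=100, ops_per_dim=4):
    """Rough operation count; ops_per_dim=4 is an assumed accounting."""
    return states * densities * dims * ops_per_dim * frames_per_second
```

With the slide's model size and one frame every 10ms (100 frames per second), `ops_per_second()` returns 2,496,000,000, in line with the quoted figure of about 2500 million operations per second.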

17 Mobile Speech Recognition: Acoustic Modeling. How do we make this fast enough? Only evaluate densities for active phones in search. Predict which densities will score highly using a smaller, approximate model set, and only evaluate those. Use fewer densities and share them between all HMM states (semi-continuous HMM) or all the states for some phonetic class (phonetically-tied HMM). Make density computation faster by quantizing acoustic features and parameters. Skip some frames in the input, either by blindly computing densities only for every Nth frame (N is usually 2 or 3), or by detecting interesting regions in the input and only computing densities there (landmark detection). Every ASR system in existence uses some combination of these. However, too many approximations can make the system slower.
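
Of the speedups listed above, blind frame skipping is the simplest to sketch. The version below assumes that skipped frames simply reuse the scores from the most recently computed frame; this is one plausible reading, not necessarily what any particular recognizer does. `score_frame` is a hypothetical callback standing in for the full acoustic model evaluation.

```python
def score_with_frame_skipping(frames, score_frame, n=2):
    """Evaluate the acoustic model only on every n-th frame.

    frames: sequence of feature vectors.
    score_frame: function mapping one feature vector to its acoustic scores
                 (hypothetical stand-in for the expensive model evaluation).
    n: compute scores on frames 0, n, 2n, ...; other frames reuse the
       most recent computed scores (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    scores = []
    last = None
    for t, frame in enumerate(frames):
        if t % n == 0:
            last = score_frame(frame)  # the expensive step
        scores.append(last)
    return scores
```

With `n=3`, seven input frames trigger only three model evaluations (frames 0, 3, and 6), cutting acoustic model cost to roughly a third at the price of coarser scores.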

18 Mobile Speech Recognition: Search. Search is not arithmetically intensive: it largely consists of adding up scores and comparing them to other scores. However, it is very memory intensive. The search module in an ASR system touches: acoustic scores; language model scores; dictionary entries; Viterbi path scores and backpointers; backpointer table entries. In other words, pretty much every piece of memory except the acoustic model parameters. Worse yet, there are sequential dependencies between all these memory accesses.

19 Mobile Speech Recognition: Search. Fundamentally, the speed of the recognizer is proportional to the number of different hypotheses it considers at once. Optimizing search is entirely devoted to reducing this number without significantly affecting accuracy. This includes: careful tuning of various thresholds (beams) for word transitions, phone transitions, etc.; absolute pruning (hard limits on words per frame); phonetic lookahead; language model lookahead (factorization / weight pushing). Finite-state transducer systems can be very fast: dictionary, grammar, and (part of) acoustic model are composed into a single decoding network. Determinization: allows exact language model search. Minimization: merges common subpaths. Weight pushing: a more general kind of LM lookahead.
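
The two pruning ideas named above, beam thresholds and absolute limits, can be sketched on a set of active hypotheses for one frame. This is a simplified illustration: the hypothesis representation (a dict from a state label to a log score) and the parameter names `beam` and `max_active` are assumptions, and a real decoder applies separate beams at word and phone transitions.

```python
def prune(hypotheses, beam, max_active):
    """Prune the active hypotheses for one frame.

    hypotheses: dict mapping a hypothesis (e.g. a state label) to its log score.
    beam: keep only hypotheses scoring within `beam` of the best one.
    max_active: absolute pruning, a hard cap on how many survive.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    # Beam pruning: drop anything too far below the current best score.
    kept = {h: s for h, s in hypotheses.items() if s >= best - beam}
    # Absolute pruning: if too many remain, keep only the top scorers.
    if len(kept) > max_active:
        ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
        kept = dict(ranked[:max_active])
    return kept
```

For example, with scores `{"a": -1.0, "b": -2.0, "c": -10.0, "d": -3.0}`, a beam of 5.0 removes "c", and a cap of 2 then removes "d". Widening the beam trades speed for accuracy, which is why the slide stresses careful tuning.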

20 Common Problems for Mobile Speech Processing. Moore's Law works differently for mobile devices: instead of getting faster, they get smaller and cheaper. Storage gets bigger, RAM doesn't. Memory doesn't get much faster: memory bandwidth is a major bottleneck; making things smaller almost always makes them faster; memory allocations can be very expensive (depending on the operating system). Audio input quality is often much lower: typically 8kHz or 11kHz maximum sampling rate; dubious microphones.

21 Current and Future Research: incorporating user feedback in multimodal (speech + touch) applications; presenting information efficiently using speech synthesis; very low bitrate speech coding using ASR and TTS; distributed processing for mobile speech recognition; acoustic robustness for handheld mobile devices; voice and multimodal user interface design.