Artificial speech in communications research

ASHA 2011. Brad Story, Speech, Language, and Hearing Sciences, University of Arizona. Research supported by NIH R01-04789.

What is artificial speech? Speech produced by mechanical, electronic, or digital means that emulates the human speech production system and/or the acoustic characteristics of the speech signal. [Figure: production, speech signal, perception. Example: MITalk (1979).]

The Historical Challenge of Artificial Speech. "Without doubt it would be one of the most important discoveries to construct a machine that could properly express all sounds and tones of our speech with all articulations." Leonhard Euler (1761), Letters to a German Princess. Our ability to produce convincing artificial speech is a measure of the degree to which we understand human speech production.

What purpose does artificial speech serve? As an augmentative device to aid those whose speech production system is impaired. L. Euler (1761): "The preachers and orators whose voices were not strong or attractive enough could then play their sermons and discourses on such a[n artificial speech] machine, in the way that the organ players perform their pieces of music."

What purpose does artificial speech serve? As an augmentative device to aid those whose speech production system is impaired; as an educational tool; as an entertainment device; as a text reading system; as a research tool that facilitates investigations of speech production and speech perception.

Development of artificial speech. [Timeline: pre-1600, 1600, 1700, 1800, 1900, 2000.]

Pre-1600: "Talking heads" (deceptions). Albertus Magnus (1198-1280); Roger Bacon (1214-1294).

C. G. Kratzenstein (1780)*, German physiologist/physicist. Used pressure-driven reeds to excite air cavities corresponding to five vowels. (Won a prize offered by the Imperial Academy of St. Petersburg under the guidance of Leonhard Euler.) This device didn't actually "talk"; it only produced static sounds that replicated the acoustic characteristics of the five vowels. *May have been an inspiration for Mary Shelley's Dr. Frankenstein.

Wolfgang von Kempelen (1791)*, Hungarian engineer/industrialist. Spent 20 years developing a talking machine that, to a large degree, simulated the human speech production system. A bellows was used to drive the vibration of a metal reed. A pliable leather tube was used as the vocal tract. The original purpose was for the machine to become an augmentative device. *His credibility was compromised by his construction of a chess automaton, which truly was a deception.

Joseph Faber (1844-), German anatomist/mechanic. "The Amazing Talking Machine": perhaps the most well-designed and functional mechanical talking machine. It represented the sound-generating parts of the speech production system (simulation).

What was Faber's purpose in developing a talking machine? It seems to have simply been a desire to create a machine that speaks like a human (to simulate human speech production). Some observers noted that the machine had a strong German accent but, in general, spoke better English than Faber himself. Faber's own speaking patterns were imposed on the hand and foot motions used to produce speech with the machine!

1928-1940: "In October, 1928, Homer Dudley of Bell Telephone Laboratories sketched in his technical notebook a device which subsequently became known as vocoder, a term derived from the words VOice and CODER." (Schroeder, 1966, IEEE.) This was the beginning of electronic speech coding and electronic speech synthesis.

Vocoder. The idea: send speech signals over the transatlantic telegraph cable(s). The problem: cable bandwidth was 100 Hz, whereas telephone-quality speech bandwidth is about 3000 Hz. But Dudley knew that the speech articulators move rather slowly and, as a result, produce slowly varying spectral characteristics; i.e., the wide (3000 Hz) speech bandwidth results from the comparatively high frequency of the voice excitation, not from the movement of the articulators. Solution: transmit only the slowly varying spectral characteristics and supply the high-frequency excitation locally.

Decompose, send, and remake (synthesize) the speech: (1) filter the speech into discrete frequency bands; (2) extract the amplitude envelope in each band; (3) send only the envelopes; (4) unpack the envelopes at the receiving end; (5) use them to remake the speech signal. The high-frequency excitation (carrier) signal is provided locally. [Diagram: Filter 1 to Filter n, each with an envelope stage (Env.) and a modulator (Mod.), driven by a local excitation (buzzer).]
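The decompose-send-remake steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Dudley's circuit: the two-pole band-pass filters, the 25 Hz envelope smoothing, the 100 Hz buzz, and all function names (bandpass, envelope, buzz, analyze, synthesize) are choices made for this example only.

```python
import math

def bandpass(signal, center_hz, bandwidth_hz, fs):
    """Crude band-pass: a two-pole resonator centered on center_hz (assumed design)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    theta = 2.0 * math.pi * center_hz / fs
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    gain = 1.0 - r  # rough level normalization
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def envelope(signal, cutoff_hz, fs):
    """Amplitude envelope: rectify, then smooth with a one-pole low-pass."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    state, env = 0.0, []
    for x in signal:
        state += alpha * (abs(x) - state)
        env.append(state)
    return env

def buzz(n_samples, f0, fs):
    """Locally supplied excitation: an impulse train at f0."""
    period = int(round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def analyze(speech, centers, bandwidth_hz, fs, env_cutoff_hz=25.0):
    """Sender side: one slowly varying envelope per frequency band."""
    return [envelope(bandpass(speech, fc, bandwidth_hz, fs), env_cutoff_hz, fs)
            for fc in centers]

def synthesize(envelopes, centers, bandwidth_hz, fs, f0=100.0):
    """Receiver side: filter the local buzz into bands, modulate each by its envelope, sum."""
    n = len(envelopes[0])
    carrier = buzz(n, f0, fs)
    out = [0.0] * n
    for env, fc in zip(envelopes, centers):
        band = bandpass(carrier, fc, bandwidth_hz, fs)
        for i in range(n):
            out[i] += env[i] * band[i]
    return out
```

Only the envelopes returned by analyze would need to cross the cable; each one varies far more slowly than the speech waveform itself, which is the point of the scheme.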

Dudley's next step: artificial speech as a means of understanding human speech production/perception. "After one believes he has a good understanding of the physical nature of speech, there comes the acid test of whether he understands the construction of speech well enough to fashion it from suitably chosen elements." (Dudley, 1940.) Relating the vocoder to human speech production.

VODER = Voice Operation DEmonstratoR (1939). Ten finger keys control the amplitude modulation (envelope) of each frequency band. Essentially the same as the vocoder, except that a human operator generates the input with finger, wrist, and foot controls. Operators required at least one year of training to become fluent, intelligible "speakers."

The VODER was designed to be an exhibition at the 1939 San Francisco and New York World's Fairs.

Similarities of a human talker and the Voder. In both cases: 1. The message originates in the brain of the sender. 2. Control signals are transmitted by the talker's nervous system to the appropriate muscles. 3. Muscles produce displacements of body parts, formulating speech information as mechanical (syllabic) message waves (human: vocal tract movements; Voder: fingers, wrist, and foot). 4. Slow modulations (message waves) are superimposed on a high-frequency carrier (voice or noise) to make the signal audible.

Common Thread: the operators of these synthesizers developed and internalized a set of rules or principles by which they coaxed the device into talking. In other words, they learned to "play" the machine.

The Sound Spectrograph (1946). Interest in artificial speech now shifted to what could be seen in a spectrogram (i.e., formants).

Playback Synthesis: Pattern Playback (F. Cooper, 1950; 1952). To synthesize speech, the user would literally paint the formants on a transparent film. [Figure: painted pattern, frequency (Hz) vs. time; audio sample.]

Pattern Playback was used to generate stimuli for many types of speech perception experiments. Q: What is significant in the spectrographic pattern and what is not?

Formant Synthesis: electrical resonators tuned to the formant frequencies observed in a spectrogram. [Diagram: the input (source, F0) passes through a transfer function (filter) built from resonators R1 to R5, tuned to F1 to F5, to produce the output.]
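The source-filter arrangement in this slide can be sketched digitally. The example below is only an illustration: it assumes a cascade of second-order digital resonators (the form commonly used in software formant synthesizers), an impulse-train source, three resonators rather than five, and illustrative formant and bandwidth values. The function names are hypothetical and level_at exists only to inspect the result.

```python
import math

def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order digital resonator with unity gain at 0 Hz."""
    c = -math.exp(-2.0 * math.pi * bw_hz / fs)
    b = 2.0 * math.exp(-math.pi * bw_hz / fs) * math.cos(2.0 * math.pi * freq_hz / fs)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(n_samples, f0, fs):
    """Simplified voice source: one impulse per fundamental period."""
    period = int(round(fs / f0))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def synthesize_vowel(formants, bandwidths, f0, n_samples, fs):
    """Pass the source through one resonator per formant, in cascade."""
    signal = impulse_train(n_samples, f0, fs)
    for freq, bw in zip(formants, bandwidths):
        signal = resonator(signal, freq, bw, fs)
    return signal

def level_at(signal, freq_hz, fs):
    """Magnitude of a single DFT bin (diagnostic helper, not part of the synthesizer)."""
    re = sum(x * math.cos(2.0 * math.pi * freq_hz * i / fs) for i, x in enumerate(signal))
    im = sum(x * math.sin(2.0 * math.pi * freq_hz * i / fs) for i, x in enumerate(signal))
    return math.hypot(re, im) / len(signal)
```

Calling synthesize_vowel([500.0, 1500.0, 2500.0], [60.0, 90.0, 120.0], 100.0, 5000, 10000) yields a steady neutral-sounding vowel; changing the formant list over time is what the rule systems of later text-to-speech synthesizers had to supply.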

1950-1970: Formant synthesis examples. Gunnar Fant, OVE, 1953; Walter Lawrence, PAT, 1953; Gunnar Fant, OVE II, 1962; Walter Lawrence, PAT, 1962; Klatt, Stevens, Holmes, Rosen, ... [Figure: F1 and F2 tracks over time; audio samples.]

Note: Formant synthesizers were a starting point for text-to-speech systems, which need a set of rules to transform orthographic representations into phonetic and, finally, acoustic ones.

Along came something new (or old)... Articulatory Synthesis: mathematical replication of the physics and physiology of the speech production system; a computational form of the speaking machines of previous centuries. Allows control of the positions and physical characteristics of the tongue, velum, jaw, lips, and larynx/vocal folds. (Coker, 1968; 1976; Mermelstein, 1973; Rubin et al., 1981.)

Computational Models of Articulatory Structures: synthesis, simulation. Tongue FEM and velum FEM: Wilhelms-Tricarico (1995); Baker (2008); Dang and Honda (2004); Perrier et al. (2003); Berry et al. (1999). Vocal folds: Alipour and Titze; Thomson, Mongeau & Frankel (2005).

"The traditional motivation for research in speech synthesis has been simply to explain how humans use their vocal tracts to produce connected speech." I. Mattingly (1974), Speech synthesis for phonetic and phonological models. [Figure: representations ranging from complex to simple, labeled trachea, glottis, vocal tract, pharynx, oral cavity, lips.]

Complexity vs. Simplicity: "indeed, the purpose of a model is to substitute simple structures for complex ones." F. Cooper (1961), Speech synthesizers.

TubeTalker: Airway Modulation Model*. Components: modulation of vocal tract shape (gestures); 1-D wave propagation with losses; voice source (glottal flow based on nonlinear interaction with vocal tract and tracheal pressures); energy source. *Story (2005), JASA, 117, 3231-3254.

Modulation of the glottis by vocal fold vibration; modulation of the vocal tract shape. [Figure: vocal folds; Titze (2006).]

Build a phrase: Happy Birthday

A. Starting point: neutral vocal tract shape, constant voice source. Even though the neutral vowel is "neutral," it does carry information about the speaker's identity.

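Vocal tract modulations: a continuous flow of vowel transitions interrupted by consonant constrictions. [Figure: a consonant constriction (C) imposed on a vowel-to-vowel transition (V V) gives V C V; example phrase: "Happy Birthday."]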

B: Vowel transitions

B. Second step: modulate vocal tract shape for vowel transitions, constant voice source

C. Impose consonants (constrictions) on the flow of vowel transitions. [Figure labels: d; p, b.]
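Steps B and C can be pictured as two layers acting on the vocal tract area function: a slow vowel-to-vowel blend, and a localized constriction scaled on top of it. The sketch below is not the TubeTalker implementation; the linear blend, the Gaussian-shaped constriction, the sin-squared timing, the section indexing (0 at the glottis, last index at the lips), and all function names are assumptions made for illustration.

```python
import math

def blend_vowels(area_v1, area_v2, weight):
    """Step B: vowel transition as a weighted blend of two area functions (cm^2)."""
    return [(1.0 - weight) * a + weight * b for a, b in zip(area_v1, area_v2)]

def constriction_scale(n_sections, location, width, magnitude):
    """Step C: per-section scaling, 1.0 far from the constriction,
    (1 - magnitude) at its center; magnitude 1.0 means full closure."""
    return [1.0 - magnitude * math.exp(-0.5 * ((i - location) / width) ** 2)
            for i in range(n_sections)]

def vcv_frames(area_v1, area_v2, location, width, n_frames):
    """Vowel-consonant-vowel sequence: the constriction grows from zero,
    reaches closure at the midpoint, and releases into the second vowel."""
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)
        vowel = blend_vowels(area_v1, area_v2, t)
        magnitude = math.sin(math.pi * t) ** 2  # 0 at both ends, 1 in the middle
        scale = constriction_scale(len(vowel), location, width, magnitude)
        frames.append([a * s for a, s in zip(vowel, scale)])
    return frames
```

With eight hypothetical sections, placing the constriction at index 7 closes the tract at the lips (as for p or b), while a location a little farther back would stand in for d. The area values passed in are placeholders, not measured vocal tract data.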


D. Not there yet: need to modify the voice source. Must change the fundamental frequency, F0 (vibrational frequency of the vocal folds). Abduct the vocal folds for voiceless consonants (and respiration), and adduct them for voiced consonants and vowels.
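The two controls named in step D, a time-varying F0 and adduction/abduction of the vocal folds, can be illustrated with a very simplified source. This is not the model's voice source (which is described as arising from nonlinear interaction with tract and tracheal pressures); the raised-cosine pulse, the 60% open phase, the use of adduction as a plain amplitude gate, and the function name are assumptions for this sketch.

```python
import math

def glottal_source(f0_contour, adduction, fs):
    """Pulse train whose rate follows f0_contour (Hz, one value per sample).
    adduction is 1.0 when the folds are adducted (voiced) and 0.0 when
    they are abducted (voiceless); values in between scale the pulses."""
    open_fraction = 0.6  # assumed share of each cycle with glottal flow
    phase = 0.0
    out = []
    for f0, add in zip(f0_contour, adduction):
        # Accumulate phase so that F0 can change smoothly from sample to sample.
        phase = (phase + f0 / fs) % 1.0
        if phase < open_fraction:
            pulse = 0.5 * (1.0 - math.cos(2.0 * math.pi * phase / open_fraction))
        else:
            pulse = 0.0
        out.append(add * pulse)
    return out
```

A rising list in f0_contour raises the pitch over the phrase, and setting adduction to 0.0 over the span of a voiceless consonant such as p silences the voicing there.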

Example: vocal fold vibration with adduction and abduction.

Finally, all of it together.

Modifications to the system for research purposes: scaling of the vocal tract and vocal folds; hypo-adduction and vocal fold asymmetry (Robin Samlan); vocal tremor (Rosemary Lester); insufficient closure of the nasal port (Co-PI: Kate Bunton); centralized vowels; alternate timing of vocal tract movement.

TubeTalker model scaled for age. [Figure: vocal tract and trachea for adult, 6 yr, and 2 yr.] The objective is to develop a model of sound production in children for the purpose of understanding how a child uses this system to generate speech.
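A first-order feel for why scaling matters comes from the textbook idealization of the neutral vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1) c / (4 L). The sketch below uses that idealization only; the model in these slides works with full area functions and wave propagation. The speed of sound (35,000 cm/s) and the tract lengths in the usage note are assumed, illustrative values, not figures from the talk.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed round value for warm, humid air

def uniform_tube_formants(length_cm, n_formants=3):
    """Quarter-wavelength resonances of a uniform tube closed at one end."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]
```

For a hypothetical 17.5 cm tract, uniform_tube_formants(17.5) gives resonances near 500, 1500, and 2500 Hz; any shorter tract, such as a child's, shifts all of them upward in inverse proportion to length.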

Synthesis of singing (soprano): nonlinear interaction of source and filter. Q: What is significant about the movement patterns of the vocal tract and vocal folds and what is not?

"The essential point here, as in all science, is that we must simplify nature if we are to understand nature. The great virtue of speech synthesizers is that they help us make such simplifications." F. Cooper (1961), Speech synthesizers.

The End
