Generation of Affect in Synthesized Speech

Janet E. Cahn
MIT Media Technology Laboratory
20 Ames Street
Cambridge, MA 02139
phone: 617-253-0666

1 Introduction

When compared to human speech, synthesized speech is distinguished by insufficient intelligibility, inappropriate prosody and inadequate expressiveness. These are serious drawbacks for conversational computer systems. Intelligibility is basic: intelligible phonemes are necessary for word recognition. Prosody, that is, intonation (melody) and rhythm, clarifies syntax and semantics and aids in discourse flow control. Expressiveness, or affect, provides information about the speaker's mental state and intent beyond that revealed by word content.

My work explores improvements to the affective component of synthesized speech. It is embodied in the Affect Editor program, which is intended to show that variations in affect can be generated in synthetic speech and to point the way towards improving the recognizability and naturalness of the affect [2]. Its success in generating recognizable affect was confirmed by an experiment in which the intended affect was perceived for the majority of presentations.

Affect should be a concern in speech synthesis for theoretical and practical reasons. Its role in human speech is to provide the context in which utterances should be interpreted and to signal speaker intentions. It should play the same role in synthesized speech. More practically, the addition of controllable affect will allow synthesized speech to be used in any application in which expressiveness is appropriate, for example, tools for the presentation of dramatic material, information-giving systems and synthesizers used by the speech-handicapped.

2 Prerequisites

To incorporate affect into synthesized speech, one must identify the acoustic correlates of emotions and choose a representation that allows for systematic control of these correlates. Both are discussed in this section.

2.1 Speech correlates of emotion

Speech correlates of emotion have been identified by studies of the speech signal [4, 5, 9] and of perceptual responses to emotional speech [3, 8]. The primary conveyers of affect are fundamental frequency (F0) and duration. Much of the associated speech phenomena are explained by the effect of emotion on physiology. With the arousal of the sympathetic nervous system, as for fear, anger or joy, heart rate and blood pressure increase, the mouth becomes dry and there are occasional muscle tremors. Speech is correspondingly loud, fast and enunciated, and has much high frequency energy. With the arousal of the parasympathetic nervous system, as for boredom or sadness, heart rate and blood pressure decrease and salivation increases. Speech is slow and low-pitched, and high frequency energy is weak [10].

2.2 Representation of the effect of emotion on speech

The representation of the speech correlates of emotion can proceed from a speaker model or an acoustic model. The first is the more generative approach: the effects of emotion on physiology, and hence on speech, are derived from the representation of the speaker's mental state and intentions. The second describes primarily what the listener hears. Of the two, the acoustic model is the simpler and requires less knowledge overall. Moreover, since perceptual parameters are explicit in the model, they can be easily manipulated to test perceptual responses. This is the model that is incorporated into the Affect Editor.

The parameters of the acoustic model are grouped into four categories: pitch, timing, voice quality and articulation. The pitch parameters describe features of F0. The timing parameters control speed and rhythm (the combination of word stress and silence). Often pitch and timing parameters describe features of words or phrases. In contrast, the voice quality parameters affect features of the speech signal as a whole. Articulation parameters control variations in enunciation, from slurred to precise.

The pitch parameters are:

Accent shape: the steepness of the intonation contour at the site of a pitch accented (1) syllable.

Average pitch: the average F0 for the utterance.

Contour slope: the slope (increasing, level or decreasing) of the F0 contour.

Final lowering: the steepness of the F0 decrease at the end of falling contours, or of the rise at the end of rising contours (as for yes-no questions).

Pitch range: the distance between the speaker's lowest and highest F0 for the utterance.

Reference line: a term borrowed from work on generative intonation [1]. It specifies the F0 that is returned to after a high or low pitch excursion.

(1) A word stressed with noticeably high or low F0 is a pitch accented word. This significant pitch occurs within the syllable carrying the greatest lexical stress [7].

The timing parameters are:

Exaggeration: the degree to which pitch accented words receive exaggerated duration as a means of further emphasis. (The exaggeration is currently unimplemented because of synthesizer-introduced side effects.)

Fluent pauses: the frequency of pausing between syntactic and semantic units.

Hesitation pauses: the frequency of pausing within a syntactic or semantic unit.

Speech rate: the rate of speech. The speech rate affects the number of words spoken per minute and the pause lengths.

Stress frequency: the percent of stressed words to stressable words for an utterance.

The voice quality parameters are:

Breathiness: the amount of frication noise that may be co-present with non-fricative phonemes (vowels, for example).

Brilliance: the ratio of low to high frequency energy.

Laryngealization: vocal fold vibration unaccompanied by air flow across the vocal cords. The speech of older speakers is often laryngealized.

Loudness: the overall loudness of the speech.

Pause discontinuity: the smoothness or abruptness of a pause onset.

Pitch discontinuity: the distance between successive F0 values in the intonation contour.

Tremor: regularities between successive glottal pulses. It is not a Dectalk3 parameter, so currently has no effect on the output.

The sole articulation parameter is:

Precision: the degree of slurring or enunciation. Perhaps precision should be expanded so that it is specified separately for phonemes distinguished by means of production or place of articulation.

The acoustical model has two significant features. First, the parameter values are quantified on an abstract scale that is centered around 0 and ranges from -10 to 10. This allows simple comparisons between the effects of different emotions. New affective colorations can be created by changing the parameter values. Additionally, perceptual thresholds may be tested via incremental changes. Note that parameter values are arranged such that the mid-range value (0) quantifies a neutral affect, while the extreme values of 10 and -10 represent a parameter's maximum and minimum influence, respectively. This makes the implementation of affect quantifiers straightforward. Moving the values away from their neutral settings corresponds to more of an affective coloration, while moving the values closer corresponds to less.
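To make the parameter organization concrete, the following sketch shows one way such a model could be held in code. It is illustrative only, under assumed names and ranges; AffectSettings, to_synthesizer_value and the example words-per-minute range are inventions for this sketch, not the Affect Editor's actual implementation or the Dectalk3 parameter set.

    # Illustrative sketch only; names, ranges and defaults are assumptions,
    # not the Affect Editor's actual code or the Dectalk3 parameter set.
    from dataclasses import dataclass, field

    NEUTRAL = 0  # mid-range value on the abstract scale: neutral affect

    @dataclass
    class AffectSettings:
        # pitch parameters, each on the abstract -10..+10 scale
        pitch: dict = field(default_factory=lambda: {
            "accent_shape": NEUTRAL, "average_pitch": NEUTRAL,
            "contour_slope": NEUTRAL, "final_lowering": NEUTRAL,
            "pitch_range": NEUTRAL, "reference_line": NEUTRAL})
        # timing parameters
        timing: dict = field(default_factory=lambda: {
            "exaggeration": NEUTRAL, "fluent_pauses": NEUTRAL,
            "hesitation_pauses": NEUTRAL, "speech_rate": NEUTRAL,
            "stress_frequency": NEUTRAL})
        # voice quality parameters
        voice_quality: dict = field(default_factory=lambda: {
            "breathiness": NEUTRAL, "brilliance": NEUTRAL,
            "laryngealization": NEUTRAL, "loudness": NEUTRAL,
            "pause_discontinuity": NEUTRAL, "pitch_discontinuity": NEUTRAL,
            "tremor": NEUTRAL})
        # articulation parameter
        articulation: dict = field(default_factory=lambda: {"precision": NEUTRAL})

    def to_synthesizer_value(scale_value, minimum, neutral, maximum):
        """Map an abstract -10..+10 value onto a synthesizer setting range,
        with 0 landing on an assumed neutral default (hypothetical ranges)."""
        if scale_value >= 0:
            return neutral + (scale_value / 10.0) * (maximum - neutral)
        return neutral + (scale_value / 10.0) * (neutral - minimum)

    # Example: a sadness-like coloration lowers speech rate and average pitch.
    sad = AffectSettings()
    sad.timing["speech_rate"] = -6
    sad.pitch["average_pitch"] = -4
    # Map the abstract rate onto an assumed words-per-minute range (slower than neutral).
    print(to_synthesizer_value(sad.timing["speech_rate"], 120, 180, 250))  # -> 144.0

Under this arrangement, a new affective coloration is simply a different assignment of scale values, and incremental changes to a single parameter can be used to probe perceptual thresholds, as described above.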

2.2.1 Input

The input to the Affect Editor is an annotated utterance, meant to represent the plausible output of a text generation program. The phrases in the utterance are grouped according to their roles, as per case frame descriptions. For example:

[[SUBJ I thought] [OBJ you [ACTION really meant it]]]

Additionally, each word is annotated with its part of speech and the probability of its receiving significant stress. The Affect Editor assigns word stress and enunciation from word annotation, and determines pause locations and the shape of the terminal contour from phrase annotation.

2.3 Discussion

The Affect Editor implements a transfer function from an acoustical description of emotional speech to synthesized speech. It is a tool with which the effects of different emotions on speech can be quantified and correlations between acoustic parameters can be investigated.

3 Experiment

A simple experiment was performed in which five affectively neutral sentences were synthesized with six different affects and presented in random order to twenty-eight subjects. The subjects heard these sentences:

I'm almost finished.
I saw your name in the paper.
I thought you really meant it.
I'm going to the city.
Look at that picture.

presented for angry, fearful, disgusted, surprised, sad and glad affective coloring. The hypothesis had two parts: 1) that the affect conveyed by the voice would be recognized despite the sentence semantics, and 2) that semantically and acoustically similar emotions would be confused.

In fact, sentence semantics did color the judgments. For example, the sentence "I thought you really meant it." was rarely perceived as glad, though often, instead, as surprised. Of all the emotions, sadness was the most consistently recognized. This makes sense because its characteristics (soft, slow, halting speech with minimal high frequency energy) are the most distinct. The other emotions were recognized in a little over 50% of the presentations, and were mistaken for similar emotions (anger and disgust, gladness and surprise) for an additional 20%. The number of correct recognitions was significantly above chance (chance being 17% correct). Errors were not random, but followed the pattern of errors made when identifying affect in human speech. The experiment answered broadly, in the affirmative, the question of whether recognizable affect could be generated in synthesized speech.
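To make the chance-level comparison concrete, a recognition rate can be checked against the 1-in-6 guessing baseline with a one-sided binomial tail probability. The sketch below is illustrative only; the trial and success counts are hypothetical placeholders, not the experiment's actual tallies.

    # Illustrative only: the counts below are hypothetical, not the reported data.
    from math import comb

    def binomial_tail(successes, trials, p_chance):
        """P(X >= successes) for X ~ Binomial(trials, p_chance)."""
        return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
                   for k in range(successes, trials + 1))

    p_chance = 1 / 6     # six affects, so a 1-in-6 chance of guessing correctly
    trials = 140         # hypothetical: e.g. 28 subjects x 5 presentations
    successes = 70       # hypothetical: roughly 50% correct recognition

    print(binomial_tail(successes, trials, p_chance))  # far below 0.05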

3.1 Synthesizer considerations and effects

Because the acoustic representation is synthesizer independent, its parameters must be interpreted for each synthesizer it drives. The mapping for the Dectalk3, the only synthesizer used so far, involves both one-to-many and many-to-one mappings from the acoustic parameters to the synthesizer settings. The parameters not represented in the Dectalk's parameter set are instead implemented in software. Thus, a contour slope up or down is approximated by assigning a high F0 to the word at either end of the utterance, pauses are added by inserting a silence character, the smoothness or abruptness of the pause onset is effected by inserting phonemes prior to the silence, and precision of articulation is achieved by phoneme substitutions or additions.

The Dectalk3 was chosen because it allows control over aspects of pitch, timing, voice quality and articulation. However, some of its limitations make it unclear whether an emotion is poorly specified or correctly specified but poorly reproduced. The limitations are of two kinds: side effects and limited capabilities. A side effect of expressing a word in phonemic form is that it often has a lower F0 than when it is specified with text. Word stress markings should cause primarily local perturbations, but instead sometimes affect the intonation contour before and after the word such that subsequent word stresses are unrealized. Finally, as the average pitch is raised, the voice begins to sound like a different speaker, instead of a speaker raising her voice. This side effect occurs because the pitch range and average pitch values are interdependent but should instead be independent.

Many of these side effects can be remedied by implementing an intonational description system with primarily local effects (such as the two tone annotation developed by Pierrehumbert and colleagues [7, 6, 1]). Affect synthesis would be made easier by the addition of the features implemented in software in the Affect Editor, particularly the ability to specify the precision of articulation and the overall pitch contour slope.

4 Conclusions

This work shows that recognizable and even natural-sounding affect can be produced by imitating in synthesized speech the effects of emotion on human speech. More importantly, the Affect Editor is a tool for exploring what is needed in an affect generating system. Its effectiveness would be increased with synthesizers that more accurately reproduce the Affect Editor specifications.
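The software workarounds described in Section 3.1 amount to rewriting the utterance text before it reaches the synthesizer. The sketch below illustrates the general idea with a hypothetical markup; the <SILENCE> token, the vowel-reduction substitution and the function annotate_utterance are inventions for this example, not the Dectalk3 command syntax or the Affect Editor's code.

    # Hypothetical sketch: the markup tokens below stand in for whatever
    # silence and phoneme conventions a real synthesizer actually expects.
    def annotate_utterance(words, hesitation_pauses=0, precision=0):
        """Insert pause markers and crude 'slurring' substitutions according to
        abstract -10..+10 settings (illustrative logic only)."""
        out = []
        for i, word in enumerate(words):
            token = word
            # Low precision: approximate slurring with a vowel-reduction substitution.
            if precision < 0:
                token = token.replace("ing", "in'")
            out.append(token)
            # High hesitation: insert a within-phrase silence marker after some words.
            if hesitation_pauses > 5 and i % 2 == 0 and i < len(words) - 1:
                out.append("<SILENCE 200ms>")
        return " ".join(out)

    print(annotate_utterance(
        "I'm going to the city".split(),
        hesitation_pauses=8, precision=-5))
    # -> I'm <SILENCE 200ms> goin' to <SILENCE 200ms> the city

A real mapping would of course operate on phonemic transcriptions and synthesizer-specific control codes rather than on raw orthography, but the pattern of pre-processing the text to realize parameters the synthesizer does not expose directly is the same.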

References

[1] Mark Anderson, Janet Pierrehumbert, and Mark Liberman. Synthesis by Rule of English Intonation Patterns. In Proceedings of the Conference on Acoustics, Speech, and Signal Processing, pages 2.8.1-2.8.4, 1984.

[2] Janet E. Cahn. Generating Expression in Synthesized Speech. Master's thesis, Massachusetts Institute of Technology, May 1989. Unpublished.

[3] Joel Davitz. The Communication of Emotional Meaning. McGraw-Hill, 1964.

[4] G. Fairbanks. Recent experimental investigations of vocal pitch in speech. Journal of the Acoustical Society of America, 11:457-466, 1940.

[5] G. Fairbanks and W. Pronovost. An experimental study of the pitch characteristics of the voice during the expression of emotions. Speech Monographs, 6:87-104, 1939.

[6] Mark Liberman and Alan Prince. On Stress and Linguistic Rhythm. Linguistic Inquiry, 8(2):249-336, 1977.

[7] Janet B. Pierrehumbert. The Phonology and Phonetics of English Intonation. PhD thesis, Massachusetts Institute of Technology, 1980.

[8] Klaus Scherer. Speech and emotional states. In John K. Darby, editor, Speech Evaluation in Psychiatry, pages 189-220. Grune and Stratton, Inc., 1981.

[9] Carl E. Williams and Kenneth N. Stevens. On determining the emotional state of pilots during flight: An exploratory study. Aerospace Medicine, 40(12):1369-1372, December 1969.

[10] Carl E. Williams and Kenneth N. Stevens. Emotions and Speech: Some Acoustical Correlates. Journal of the Acoustical Society of America, 52(4, Part 2):1238-1250, 1972.