Emotional speech synthesis: Technologies and research approaches


Emotional speech synthesis: Technologies and research approaches
Marc Schröder, DFKI
with contributions from Olivier Rosec, France Télécom R&D, and Felix Burkhardt, T-Systems
HUMAINE WP6 workshop, Paris, 10 March 2005

Overview

Speech synthesis technologies:
- formant synthesis
- HMM synthesis
- diphone synthesis
- unit selection synthesis
- voice conversion

Research on emotional speech synthesis:
- the straightforward approach (and why not to do it)
- systematic parameter variation: Burkhardt (2001)
- non-extreme emotions: Schröder (2004)


Speech synthesis

Input: either plain text or speech synthesis markup (an SSML document).
Text analysis (natural language processing techniques) produces the prosodic parameters:
- phonetic transcription
- intonation specification
- pausing & speech timing
Audio generation (signal processing techniques) renders these as the output: wave or mp3.
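To make the pipeline concrete, here is a minimal Python sketch of the two stages (text analysis producing prosodic parameters, then audio generation). All names and "rules" in it are illustrative placeholders, not the actual API of any system mentioned in this talk.

```python
# Minimal pipeline sketch: text analysis -> prosodic parameters -> audio.
from dataclasses import dataclass

@dataclass
class ProsodyTarget:
    phonemes: list        # phonetic transcription
    f0_hz: list           # intonation specification, per segment
    durations_ms: list    # pausing & speech timing

def text_analysis(text: str) -> ProsodyTarget:
    """NLP stage: map text to symbolic and prosodic parameters (toy rules)."""
    phonemes = list(text.lower().replace(" ", "_"))  # placeholder transcription
    return ProsodyTarget(phonemes,
                         [120.0] * len(phonemes),    # flat 120 Hz contour
                         [80.0] * len(phonemes))     # 80 ms per segment

def audio_generation(target: ProsodyTarget, rate: int = 16000) -> bytes:
    """Signal-processing stage, stubbed: returns silence of the right length."""
    n_samples = int(sum(target.durations_ms) / 1000.0 * rate)
    return bytes(n_samples)

wave = audio_generation(text_analysis("Hello world"))
print(len(wave), "samples")
```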

Speech synthesis technologies

Within the same pipeline (text or speech synthesis markup -> text analysis -> prosodic parameters -> audio generation), the audio generation step can be realised by:
- formant synthesis
- HMM-based synthesis
- diphone synthesis
- unit selection synthesis
- voice conversion

Speech synthesis technologies: Formant synthesis

Acoustic modelling of speech:
- many degrees of freedom, can potentially reproduce speech perfectly
- rule-based formant synthesis: imperfect rules for the acoustic realisation of articulation => robot-like sound

Examples (audio):
- Janet Cahn (1990): angry, happy, sad
- Felix Burkhardt (2001): neutral, angry, happy, sad, fearful
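The "acoustic modelling" core of formant synthesis can be sketched as a source-filter model: an impulse-train source filtered by second-order resonators at the formant frequencies. The formant values below describe a rough /a/-like vowel and are assumptions for illustration only, not rules from this talk.

```python
# Source-filter sketch: impulse-train source through formant resonators.
import numpy as np
from scipy.signal import lfilter

fs = 16000
source = np.zeros(fs)              # one second of "glottal" source
source[::fs // 120] = 1.0          # impulse train at F0 = 120 Hz

def resonator(x, freq, bw):
    """One second-order IIR resonator at a formant frequency."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2.0 * r * np.cos(theta), r * r]   # pole pair at the formant
    return lfilter([sum(a)], a, x)               # numerator gives unity DC gain

speech = source
for f, bw in [(700, 80), (1200, 90), (2600, 120)]:   # rough /a/-like formants
    speech = resonator(speech, f, bw)
```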

Speech synthesis technologies: HMM synthesis

- Hidden Markov Models trained from speech database(s)
- synthesis using an acoustic model (MLSA) => robot-like sound

Example: Miyanaga et al. (2004) parametrise HMM output parameters using a style control vector.
Audio examples: trained from corpus: neutral, joyful; interpolated: 0.5 joyful, 1.5 joyful
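The style-control idea can be sketched as interpolation/extrapolation between a neutral parameter set and a style-specific one. The vectors below are invented stand-ins for HMM output parameters; the real method operates on full model parameters.

```python
# Style-control sketch: blend neutral and "joyful" parameter vectors.
import numpy as np

neutral = np.array([120.0, 0.0, 1.0])   # e.g. F0 (Hz), tilt, relative rate
joyful  = np.array([160.0, 2.5, 1.2])   # invented style parameters

def apply_style(weight: float) -> np.ndarray:
    """weight 0 = neutral, 1 = full style, 0.5 mild, 1.5 exaggerated."""
    return neutral + weight * (joyful - neutral)

for w in (0.0, 0.5, 1.0, 1.5):
    print(w, apply_style(w))
```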

Speech synthesis technologies: Diphone synthesis

- Diphones = small units of recorded speech, from the middle of one sound to the middle of the next
  e.g. [greit] = _-g, g-r, r-ei, ei-t, t-_
- Signal manipulation forces pitch (F0) and duration onto a target contour
- Can control prosody, but not voice quality

Examples (audio):
- Marc Schröder (1999): neutral, angry, happy, sad
- Ignasi Iriondo (2004): angry, happy, sad, fearful
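The diphone decomposition shown in the [greit] example is easy to sketch in code:

```python
# Diphone decomposition: units run from mid-sound to mid-sound, "_" = silence.
def diphones(phones):
    padded = ["_"] + phones + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(diphones(["g", "r", "ei", "t"]))  # ['_-g', 'g-r', 'r-ei', 'ei-t', 't-_']
```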

Speech synthesis technologies: Diphone synthesis

Is voice quality indispensable? There is an interesting diversity of opinions in the literature.
Tentative conclusion: it depends!
- ...on the emotion (Montero et al., 1999): prosody conveys surprise and sadness; voice quality conveys anger and joy
- ...on speaker strategies (Schröder, 1999): audio examples angry1 vs. orig_angry1, angry2 vs. orig_angry2

Speech synthesis technologies: Diphone synthesis

Partial remedy: record several voice qualities
- Schröder & Grice (2003): diphone databases with three levels of vocal effort (male: loud, modal, soft; female: loud, modal, soft)
- Voice quality interpolation (Turk et al., in prep.), female voice: loud, two intermediate steps, modal, two intermediate steps, soft
- Not yet successful: smiling voice (audio examples: modal1, smile1, modal2, smile2)
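As a rough illustration of interpolating between recorded effort levels, one can blend spectral representations of time-aligned soft and loud frames. Real systems interpolate proper spectral envelopes, so the direct FFT blend below is only a simplified sketch under that assumption.

```python
# Simplified effort interpolation: blend magnitude spectra of aligned frames.
import numpy as np

def interpolate_effort(soft_frame, loud_frame, alpha):
    """alpha 0 = soft, 1 = loud; frames must be equal-length and time-aligned."""
    S, L = np.fft.rfft(soft_frame), np.fft.rfft(loud_frame)
    mag = (1 - alpha) * np.abs(S) + alpha * np.abs(L)     # blended magnitude
    return np.fft.irfft(mag * np.exp(1j * np.angle(S)),   # keep soft frame's phase
                        n=len(soft_frame))

soft = np.random.randn(512)
loud = 3.0 * np.random.randn(512)
half_effort = interpolate_effort(soft, loud, 0.5)
```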

Speech synthesis technologies: Unit selection synthesis

- Select small speech units from a very large speech corpus (e.g., 5 hours of speech)
- Avoid signal manipulation, to keep the natural prosody of the units
- Cannot control prosody or voice quality
- Very good "playback" quality with emotional recordings

Examples (audio):
- Akemi Iida (2000): angry, happy, sad
- Ellen Eide (IBM, 2004): good news, bad news
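The selection step itself is typically a dynamic-programming search minimising target costs (how well a candidate unit matches the specification) plus join costs (how smoothly neighbouring units concatenate). A toy sketch, with invented units and cost functions:

```python
# Dynamic-programming unit selection over toy (pitch, duration) candidates.
def select_units(candidates, target_cost, join_cost):
    """candidates[i] lists the corpus units available for target position i."""
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for pos in range(1, len(candidates)):
        best = [
            min((c + join_cost(p[-1], u) + target_cost(pos, u), p + [u])
                for c, p in best)
            for u in candidates[pos]
        ]
    return min(best)   # (total cost, chosen unit sequence)

cands = [[(100, 80), (120, 70)], [(110, 75), (150, 60)], [(105, 90), (140, 65)]]
tcost = lambda pos, u: abs(u[0] - 110) / 100   # want pitch near 110 Hz
jcost = lambda a, b: abs(a[0] - b[0]) / 100    # want smooth pitch at joins
print(select_units(cands, tcost, jcost))
```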

Speech synthesis technologies: Voice conversion

How to learn a new voice?
- analyse source speech and target speech into parameters, and align them
- classification / regression learns a transformation function
- learning data needed: about 5 minutes
- transformed parameters: timbre and F0
- conversion techniques: VQ, GMM, ...

Potential application to emotion: source = neutral speech, target = emotional speech
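Of the techniques listed, the GMM approach can be sketched as fitting a joint density over aligned source/target features and mapping each new source vector to the expected target vector under that model. The 1-D synthetic features below are stand-ins for real aligned spectral parameters, so this is a sketch of the idea rather than any system's implementation.

```python
# Joint-GMM conversion sketch with synthetic 1-D source/target features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, (500, 1))                      # aligned source features
tgt = 2.0 * src + 1.0 + rng.normal(0.0, 0.1, (500, 1))    # aligned target features

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([src, tgt]))                            # joint density over (x, y)

def convert(x):
    """E[y | x]: posterior-weighted per-component linear regressions."""
    y, den = 0.0, 0.0
    for w, mu, S in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        sxx, sxy = S[0, 0], S[0, 1]
        lk = w * np.exp(-0.5 * (x - mu[0]) ** 2 / sxx) / np.sqrt(2 * np.pi * sxx)
        y += lk * (mu[1] + sxy / sxx * (x - mu[0]))       # component regression
        den += lk
    return y / den

print(convert(0.5))   # should be close to 2 * 0.5 + 1 = 2.0
```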

Speech synthesis technologies: Voice conversion

Transformation step: source speech is analysed into parameters, the transformation function converts them, and synthesis (plus residual information) produces the converted speech.
- analysis / synthesis: LPC, formant or HNM

Output quality of the converted speech:
- can be fairly good in terms of speaker (/emotion?) identification
- degradation of naturalness

Example for speaker transformation (France Télécom speech synthesis team): source, target, conversion (audio examples)

Speech synthesis technologies: Summary

Current choice:
- Explicit modelling approaches: low naturalness; high flexibility and high control over acoustic parameters; explicit models of emotional prosody
- Playback approaches: high naturalness; no flexibility, no control over acoustic parameters; emotional prosody implicit in the recordings

Technical challenge for the next years: combine the best of both worlds!

Overview (recap)

Speech synthesis technologies: formant synthesis, HMM synthesis, diphone synthesis, unit selection synthesis, voice conversion

Research on emotional speech synthesis: the straightforward approach (and why not to do it); systematic parameter variation: Burkhardt (2001); non-extreme emotions: Schröder (2004)

Research on emotional speech synthesis: The straightforward approach (and why not to do it)

The straightforward approach:
1. record one actor with four emotions: anger, fear, sadness, joy (+ neutral)
2. measure acoustic correlates: overall pitch level + range, tempo, intensity
3. copy synthesis or prosody rules, then synthesise
4. run a forced-choice perception test with neutral text, report overall recognition rates

...and then? "There has been neither continuity nor cumulativeness in the area of the vocal communication of emotion" (Scherer, 1986, p. 143)

Research on emotional speech synthesis: The straightforward approach (and why not to do it)

Each step of the straightforward approach has problems:
- record one actor with four emotions (anger, fear, sadness, joy + neutral): may not be representative; quality control needed (e.g., expert rating); why these four? applications don't need basic emotions; emotion words are too ambiguous (use frame stories when recording)
- measure acoustic correlates (overall pitch level + range, tempo, intensity): loses local effects; loses the interaction with linguistic structure; more or different parameters needed: voice quality!
- copy synthesis or prosody rules, synthesise: unexpected percepts?
- forced-choice perception test with neutral text, overall recognition rates: untypical for applications; applications need suitability, not recognition; how bad are errors? a measure of semantic similarity of states is needed

...and then? "There has been neither continuity nor cumulativeness in the area of the vocal communication of emotion" (Scherer, 1986, p. 143)


Emotional speech synthesis research: Listener-centred orientation

Emotional speech synthesis research: Listener-centred approach: Burkhardt (2001)

Stimuli: systematically varied selected acoustic features using formant synthesis
- pitch height (3 variants)
- pitch range (3 variants)
- phonation (5 variants)
- segment durations (4 variants)
- vowel quality (3 variants)
- one semantically neutral sentence

A complete factorial design would be >2000 stimuli, so three groups of parameter combinations were tested instead: pitch/phonation (45 stimuli), pitch/segmental (108 stimuli), phonation/segmental (60 stimuli); see the counting sketch below.
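The pairwise group sizes can be checked with a simple cross-product count, assuming "segmental" covers segment durations and vowel quality (an assumption, but one that reproduces all three counts from the slide exactly):

```python
# Cross-product counts for the tested parameter groups.
from itertools import product

levels = {"pitch_height": 3, "pitch_range": 3, "phonation": 5,
          "segment_durations": 4, "vowel_quality": 3}

def count(*features):
    return len(list(product(*(range(levels[f]) for f in features))))

print(count("pitch_height", "pitch_range", "phonation"))                           # 45
print(count("pitch_height", "pitch_range", "segment_durations", "vowel_quality"))  # 108
print(count("phonation", "segment_durations", "vowel_quality"))                    # 60
```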

Emotional speech synthesis research: Listener-centred approach: Burkhardt (2001)

Forced-choice perception test: neutral, fear, anger, joy, sadness, boredom
=> perceptually optimal values for each category

Second step: varied additional acoustic parameters for further differentiation into subcategories: hot/cold anger, joy/happiness, despair/sorrow

Emotional speech synthesis research: Dimensional approach: Schröder (2004)

Goals:
- model many, gradual states on a continuum
- allow for gradual changes over time
- model many acoustic parameters, including voice quality

Success criterion: the voice fits with the text

Emotional speech synthesis research: Dimensional approach: Description system

Representation of emotional states in a 2-dimensional activation-evaluation space:
- captures essential emotional properties in listeners' minds
- continuous

(Diagram: axes run from VERY PASSIVE to VERY ACTIVE and from VERY NEGATIVE to VERY POSITIVE; angry and afraid sit in the active-negative quadrant; excited, interested, happy and pleased in the active-positive quadrant; sad and bored in the passive-negative quadrant; relaxed and content in the passive-positive quadrant.)

Emotional speech synthesis research: Dimensional approach: Emotional prosody rules

Database analysis:
- Belfast Naturalistic Emotion Database: 124 speakers, spontaneous emotions
- search for correlations between emotion dimensions and acoustic parameters
- Activation: numerous, robust correlations
- Evaluation and Power: fewer, weaker correlations
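A minimal sketch of such a correlation search, with random stand-ins for the database's per-utterance ratings and acoustic measurements:

```python
# Correlation search sketch: emotion-dimension ratings vs. acoustic features.
import numpy as np

rng = np.random.default_rng(1)
activation = rng.uniform(-100, 100, 200)                        # rating per clip
mean_f0 = 120 + 0.3 * activation + rng.normal(0, 10, 200)       # fake acoustics
speech_rate = 4.0 + 0.01 * activation + rng.normal(0, 0.5, 200)

for name, feature in [("mean F0", mean_f0), ("speech rate", speech_rate)]:
    r = np.corrcoef(activation, feature)[0, 1]
    print(f"activation vs {name}: r = {r:.2f}")
```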

Emotional speech synthesis research: Dimensional approach: Synthesis method

(Same activation-evaluation diagram as before: angry, afraid, excited, interested, happy, pleased, sad, bored, relaxed, content.)

- Rules map each point in emotion space onto its acoustic correlates (see the sketch below)
- Flexibility: gradual build-up of emotions, non-extreme emotional states
- Emotions are not fully specified through the voice; complementary information is required: verbal content, visual channel, situational context
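A sketch of what such mapping rules could look like. The linear coefficients below are invented for illustration (the real rules were derived from the database analysis above), chosen so that the example point (activation 67, evaluation 42) used later in the talk yields plausible values:

```python
# Invented linear rules from (activation, evaluation) to global prosody.
def prosody_for(activation: float, evaluation: float) -> dict:
    """Inputs in [-100, 100]; returns global prosody settings."""
    return {
        "pitch_hz": 120 + 0.3 * activation,        # higher pitch when active
        "range_semitones": 4 + 0.04 * activation,  # wider range when active
        "rate_percent": 100 + 0.4 * activation,    # faster when active
        "volume": 60 + 0.15 * activation + 0.05 * evaluation,
    }

print(prosody_for(67, 42))   # the <emotion> example later in the talk
```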

Emotional speech synthesis research: Dimensional approach: Realisation in the system

written text -> text analysis (phonetic transcription, intonation, rhythm, ...) -> rules for intonation, speech rate, ... -> audio generation (diphone synthesis with three voice qualities)

Emotional speech synthesis research: MARY, DFKI's speech synthesis system (http://mary.dfki.de)

- developed in cooperation with the Institute of Phonetics, Saarland University
- languages: German, English
- transparent and flexible; modular
- internal MaryXML format
- input/output possible at all intermediate processing steps, which allows for fine-grained control

Emotional speech synthesis research: Dimensional approach: Technical realisation

Input with emotion markup:

<emotion activation="67" evaluation="42">
  Hurra, wir haben es geschafft!
</emotion>

("Hooray, we did it!")

The emotional prosody rules (an XSLT stylesheet) turn this into MaryXML prosody settings:

<maryxml>
  <prosody accent-prominence="+13%" accent-slope="+46%"
           fricative-duration="+21%" liquid-duration="+13%"
           nasal-duration="+13%" number-of-pauses="+47%"
           pause-duration="-13%" pitch="134" pitch-dynamics="+5%"
           plosive-duration="+21%" preferred-accent-shape="alternating"
           preferred-boundary-type="high" range="52" range-dynamics="+40%"
           rate="+42%" volume="72" vowel-duration="+13%">
    Hurra, wir haben es geschafft!
  </prosody>
</maryxml>

Emotional speech synthesis research: Dimensional approach: Listening test

- eight emotion-specific texts
- prosodic parameters predicted for each of the eight emotional states
- factorise text x prosody => 64 stimuli
- listeners evaluate each stimulus on a scale: "How well does the sound of the voice fit to the text spoken?"

Emotional speech synthesis research: Dimensional approach: Listening test results

- the activation dimension is successfully conveyed / perceived as intended
- the evaluation dimension is less successful
- acceptability is gradual: neighbouring states are more acceptable than distant states

(Result charts shown for "Prosody B, all texts" and "Prosody F, all texts".)


Emotional speech synthesis research: Dimensional approach: Summary

- flexible framework
- successful in expressing the degree of activation
- failure to express evaluation: the sound of the smile?! or specialised modalities (text => evaluation, voice => activation)?
- emotional prosody rules not fine-tuned; only global evaluation so far

Summary

- Speech synthesis technology: data-driven or flexible
- Research on emotional prosody rules: a listener-centred task; database analyses still need to be validated perceptually

Outlook: Speech synthesis research in HUMAINE

Capability 5.2: Speech expressivity
- address the dilemma of data-driven vs. flexible
- investigate suitable measures for prosody and voice quality in controlled recordings
- attempt copy synthesis using different technologies
- attempt voice conversion
- evaluate the success of the different methods