Vocal Emotion Recognition
Vocal Emotion Recognition: State-of-the-Art in Classification of Real-Life Emotions (1 / 49)
October 26, 2010
Stefan Steidl, International Computer Science Institute (ICSI) at Berkeley, CA

Overview 2 / 49
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
Overview 3 / 49
1 Different Perspectives on Emotion Recognition
  - Psychology of Emotion
  - Computer Science
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge

Facial Expressions of Emotion 4 / 49
Universal Basic Emotions 5 / 49
- Paul Ekman postulates the existence of 6 basic emotions: anger, fear, disgust, surprise, joy, sadness
- other emotions are mixed or blended emotions
- universal facial expressions

Terminology 6 / 49
Different affective states [1], compared along seven design features: intensity, duration, synchronization, event focus, appraisal elicitation, rapidity of change, and behavioral impact:
- emotion
- mood
- interpersonal stances
- attitudes
- personality traits
[table rating each affective state from low to very high on each feature; the rating symbols are lost]
[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, 2003
Terminology (cont.) 7 / 49
Definition of Emotion (Scherer): episodes of coordinated changes in several components, including at least neurophysiological activation, motor expression, and subjective feeling, but possibly also action tendencies and cognitive processes, in response to external or internal events of major significance to the organism.

Vocal Expression of Emotion 8 / 49
Results from studies in the Psychology of Emotion:
[table of changes in intensity, F0 floor/mean, F0 variability, F0 range, sentence contour, high-frequency energy, and speech/articulation rate for anger/rage, fear/panic, sadness, joy/elation, boredom, and stress; the direction markers are lost]
1 Banse and Scherer found a decrease in F0 range
2 inconclusive evidence
Goal: classification of the subject's actual emotional state (some sort of "lie detector" for emotions)
Human-Computer Interaction (HCI) 9 / 49
Emotion-Related User States:
- naturally occurring states of users in human-machine communication
- emotions in a broader sense: coordinated changes in several components NOT required
- classification of the perceived emotional state, not necessarily the actual emotion of the speaker

Pattern Recognition 10 / 49
Pattern Recognition Point of View:
- classification task: choose 1 of n given classes
- discrimination of classes rather than classification
- definition of good features
- machine classification
Actually not needed:
- definition of the term "emotion"
- information on how specific features change
Emotional Speech Corpora 11 / 49
Acted data:
- based on the Basic Emotions theory; suited for studying prototypical emotions
- corpora easy to create (inexpensive, no labeling process)
- high audio quality
- balanced classes
- neutral linguistic content (focus on acoustics only)
- high recognition results

Emotional Speech Corpora (cont.) 12 / 49
Popular corpora:
- Emotional Prosody Speech and Transcript corpus (LDC): 15 classes
- Berlin Emotional Speech Database (EmoDB): 7 classes; 89.9 % accuracy (speaker-independent LOSO evaluation, speaker adaptation, feature selection) [2]
- Danish Emotional Speech Corpus: 5 classes; 74.5 % accuracy (10-fold SCV, feature selection) [3]
[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007
[3] B. Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006
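The speaker-independent LOSO (leave-one-speaker-out) evaluation cited above never lets the same speaker appear in both training and test data. A minimal sketch of such a split, with a toy data layout that is purely illustrative (the tuple format and speaker IDs are assumptions, not the original setup):

```python
# Leave-one-speaker-out splitting: one fold per speaker, with the
# held-out speaker's samples forming the test set of that fold.

def loso_splits(samples):
    """samples: list of (speaker_id, features, label) tuples.
    Yields (held_out_speaker, train_list, test_list) per fold."""
    speakers = sorted({spk for spk, _, _ in samples})
    for held_out in speakers:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Toy data: 3 speakers, 2 samples each (features omitted for brevity).
data = [("spk1", None, "A"), ("spk1", None, "N"),
        ("spk2", None, "E"), ("spk2", None, "N"),
        ("spk3", None, "M"), ("spk3", None, "A")]

for spk, train, test in loso_splits(data):
    # No speaker overlap between train and test in any fold.
    assert all(s[0] != spk for s in train)
    print(spk, len(train), len(test))
```

With 51 speakers, as in the FAU Aibo corpus, this scheme yields the 51-fold speaker-independent cross-validation used later in the talk.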
Emotional Speech Corpora (cont.) 13 / 49
Naturally occurring emotions:
- states that actually appear in HCI (real applications)
- difficult to create (appropriate scenario needed, ethical concerns, need to label data)
- low emotional intensity; in general 80 % neutral
- low audio quality (reverberation, noise, far-distance microphones); needed for machine classification, because conditions between training and test must not differ too much
- research on both acoustic and linguistic features possible
- new research questions: optimal emotion unit
- almost no corpora large enough for machine classification available (do not exist or are not available for research)

Overview 14 / 49
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
  - Scenario
  - Labeling of User States
  - Data-driven Dimensions of Emotion
  - Units of Analysis
  - Sparse Data Problem
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
The FAU Aibo Emotion Corpus 15 / 49
- 51 children (30 f, 21 m) at the age of 10 to 13
- 9.2 hours of spontaneous speech (mainly short commands)
- 48,401 words in 13,642 audio files

FAU Aibo Emotion Corpus (cont.) 16 / 49
- database for CEICES and the INTERSPEECH 2009 Emotion Challenge
- available for scientific, non-commercial use
[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Logos Verlag, Berlin, 2009; available online
Emotion-Related User States 17 / 49
11 categories (prior inspection of the data before labeling): joyful, surprised, motherese, neutral, bored, emphatic, helpless, touchy/irritated, reprimanding, angry, other
- motherese: the way mothers/parents address their babies, either because Aibo is well-behaving or because the child wants Aibo to obey; positive equivalent to reprimanding
- emphatic: pronounced, accentuated, sometimes hyper-articulated way, but without showing any emotion
- reprimanding: the child is reproachful, reprimanding, "wags the finger"

Labeling of User States 18 / 49
Labeling: 5 students of linguistics, holistic labeling on the word level, majority vote
[table of words per emotion category: angry (A), touchy (T), reprimanding (R), emphatic (E, 2,xxx words), neutral (N, 39,xxx words), motherese (M, 1,xxx words), joyful (J); all: 48,401 words; the exact counts and percentages are lost]
Labeling of User States (cont.) 19 / 49
[confusion matrix of the majority vote vs. the individual labels for angry (A), touchy (T), reprimanding (R), emphatic (E), neutral (N), motherese (M), joyful (J); the cell values are lost]

Data-driven Dimensions of Emotions 20 / 49
Non-metric dimensional scaling: arranging the emotion categories in the 2-dimensional space; states that are often confused are close to each other.
[2-D plot with axes valence (negative to positive) and interaction (+/-): angry, touchy, and reprimanding on the negative side; motherese and joyful on the positive side; neutral and emphatic in between]
Units of Analysis 21 / 49
Example turns (Ohm_18_342, Ohm_18_343): "stopp Aibo geradeaus fein machst du das stopp sitz" ("stop Aibo straight ahead you are doing that well stop sit"), segmented at the word level, the chunk level, and the turn level.
Advantages/disadvantages of larger units:
+ more information
- less emotional homogeneity

Sparse Data Problem 22 / 49
Super classes:
- Anger: angry, touchy/irritated, reprimanding
- Emphatic
- Neutral
- Motherese
[2-D plot of the categories and the four super classes Anger, Emphatic, Neutral, Motherese obtained by non-metric dimensional scaling; S = 0.32, RSQ = 0.73]
Sparse Data Problem (cont.) 23 / 49
Data subsets (word, chunk, and turn sets taken from the Aibo corpus):

  data set         # words   # chunks   # turns
  Aibo corpus       48,401     18,216    13,642
  Aibo word set      6,070      4,543     3,996
  Aibo chunk set    13,217      4,543     3,996
  Aibo turn set     17,618      6,413     3,996

Overview 24 / 49
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
  - Results for different Units of Analysis
  - Machine vs. Human
  - Feature Types and their Relevance
4 INTERSPEECH 2009 Emotion Challenge
Most Appropriate Unit of Analysis 25 / 49
Classification: complete set of features, Linear Discriminant Analysis (LDA), 51-fold speaker-independent cross-validation

  unit of analysis   # features   # samples       average recall
  word level             265      6,070 words         67.2 %
  chunk level            700      4,543 chunks        68.9 %
  turn level             700      3,996 turns         63.2 %

Chunks are the best compromise between the length of the segment and the homogeneity of the emotional state within the segment.

Machine Classifier vs. Human Labeler 26 / 49
Entropy-based measure: the machine classifier is treated as an additional labeler; for each word, its decision is pooled with the human labels (classes A, E, N, M) and the entropy of the resulting label distribution (e.g. H_dec = 1.41) is computed. This yields an implicit weighting of classification errors depending on the word that is classified.
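The entropy-based measure can be illustrated with a small sketch: the machine decision is appended to the human labels, and the entropy of the pooled distribution shows how much the decision increases the disagreement. The label sets below are hypothetical examples, not the actual H_dec = 1.41 case from the slide:

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy (in bits) of a multiset of emotion labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

human = ["A", "A", "A", "E", "N"]            # five human labelers
print(round(label_entropy(human), 2))         # → 1.37 (humans alone)

# A machine decision agreeing with the majority lowers the entropy;
# an implausible decision (here "M") raises it, so errors on words
# the humans disagree about are penalized less.
print(round(label_entropy(human + ["A"]), 2))  # → 1.25
print(round(label_entropy(human + ["M"]), 2))  # → 1.79
```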
Machine Classifier vs. Human Labeler (cont.) 27 / 49
Classification: Aibo word set
[histogram of relative frequency [%] over entropy, comparing the average human labeler with the machine classifier]
[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann: "Of All Things the Measure is Man": Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005

Evaluation of Different Types of Features 28 / 49
Types of features:
- acoustic features: prosodic features, spectral features, voice quality features
- linguistic features
Evaluation:
- Artificial Neural Networks (ANN)
- 51-fold speaker-independent cross-validation
- combination by early or late fusion
Acoustic Features: Prosody 29 / 49
Prosody: suprasegmental characteristics such as
- pitch contour
- energy contour
- temporal shortening/lengthening of words
- duration of pauses between words

Acoustic Features: Prosody (cont.) 30 / 49
Classification results: Aibo chunk set
[bar chart of average recall [%] for pauses (16 features), duration (37), energy (25), F0 (29), and all prosodic features combined]
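Prosodic features of this kind reduce a variable-length contour to a fixed-length vector of global statistics. A simplified, hypothetical sketch for an F0 contour (the real system used 107 prosodic features; the frame shift and feature names here are assumptions):

```python
# Turn an F0 contour (one value per frame, 0 = unvoiced) into a few
# global prosodic features of the kind used for emotion classification.

def f0_features(contour, frame_shift=0.01):
    voiced = [f for f in contour if f > 0]
    mean = sum(voiced) / len(voiced)
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return {
        "f0_mean": mean,                        # F0 floor/mean
        "f0_std": var ** 0.5,                   # F0 variability
        "f0_range": max(voiced) - min(voiced),  # F0 range
        "voiced_ratio": len(voiced) / len(contour),
        "duration": len(contour) * frame_shift,  # seconds
    }

contour = [0, 0, 210, 220, 235, 250, 240, 0, 0, 190]  # Hz per frame
feats = f0_features(contour)
print(feats["f0_range"])      # → 60
print(feats["voiced_ratio"])  # → 0.6
```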
Acoustic Features: Spectral Characteristics 31 / 49
Classification results: Aibo chunk set
[bar chart of average recall [%] for MFCC (24 features) and formants (16), compared with prosody (107) and the best combination]

Acoustic Features: Voice Quality 32 / 49
Classification results: Aibo chunk set
[bar chart of average recall [%] for HNR (2 features), jitter/shimmer (4), and TEO (64), compared with prosody (107), MFCC (24), formants (16), and the best combination]
Acoustic Features: Combination 33 / 49
Classification results: Aibo chunk set
[bar chart of average recall [%] for prosody (107 features), MFCC (24), formants (16), jitter/shimmer (4), HNR (2), TEO (64), and the best combination]

Linguistic Features 34 / 49
Types of linguistic features:
- word characteristics: average word length (number of letters, phonemes, syllables), proportion of word fragments, average number of repetitions
- part-of-speech features
- unigram models
- bag-of-words
Linguistic Features (cont.) 35 / 49
Part-of-Speech (POS) Features: only 6 coarse POS categories can be annotated without considering context:
- nouns, proper names
- inflected adjectives
- not inflected adjectives
- present/past participles
- (other) verbs, infinitives, auxiliaries
- articles, pronouns, particles, interjections
[bar chart of the POS distribution (% of total) for Anger, Joyful, Neutral, Emphatic, Motherese, Other]

Linguistic Features (cont.) 36 / 49
Unigram Models: u(w, e) = log10( P(e|w) / P(e) )

  Anger                 P(A|w)    Emphatic        P(E|w)
  böser (bad)           29.2 %    stopp (stop)    30.5 %
  stehenbleiben (stop)  18.9 %    halt (halt)     29.3 %
  nein (no)             17.0 %    links (left)    20.5 %
  aufstehen (get up)    12.3 %    rechts (right)  18.9 %
  Aibo (Aibo)           10.1 %    nein (no)       17.6 %

  Neutral               P(N|w)    Motherese       P(M|w)
  okay (okay)           98.6 %    fein (fine)     57.5 %
  und (and)             98.5 %    ganz (very)     41.9 %
  Stück (bit)           98.5 %    braver (good)   36.0 %
  in (in)               98.2 %    sehr (very)     23.5 %
  noch (still)          96.2 %    brav (good)     21.7 %
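The unigram salience u(w, e) = log10(P(e|w) / P(e)) is positive when word w makes emotion e more likely than its prior. A minimal sketch with a tiny, invented corpus of (word, emotion) pairs (the words echo the table above, but the counts are hypothetical):

```python
from collections import Counter
from math import log10

# Toy corpus of (word, emotion) pairs; A = Anger, E = Emphatic,
# N = Neutral, M = Motherese.
corpus = [("stopp", "E"), ("stopp", "E"), ("stopp", "N"),
          ("fein", "M"), ("fein", "M"), ("okay", "N"),
          ("nein", "A"), ("nein", "E"), ("okay", "N")]

n = len(corpus)
emotions = Counter(e for _, e in corpus)   # counts for P(e)
words = Counter(w for w, _ in corpus)      # counts for P(w)
pairs = Counter(corpus)                    # counts for P(w, e)

def u(w, e):
    """u(w, e) = log10( P(e|w) / P(e) ), estimated by relative frequencies."""
    p_e_given_w = pairs[(w, e)] / words[w]
    return log10(p_e_given_w / (emotions[e] / n))

print(round(u("stopp", "E"), 2))  # → 0.3  ("stopp" is salient for Emphatic)
print(round(u("okay", "N"), 2))   # → 0.48 ("okay" is salient for Neutral)
```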
Linguistic Features (cont.) 37 / 49
Bag-of-Words:
- utterance: "Aibo, geh nach links!" (Aibo, move to the left!)
- each utterance is represented by a count vector over the vocabulary (Aibo, allen, geh, nach, links, Aibolein, ...)
- representation of the linguistic content; word order is lost
- various dimensionality reduction techniques

Linguistic Features (cont.) 38 / 49
Classification results: Aibo chunk set
[bar chart of average recall [%] for word statistics (6 features), POS (6), unigram models (16), BOW (254 reduced to 50), and the best combination]
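The bag-of-words mapping can be sketched in a few lines: build the vocabulary, then count word occurrences per utterance, discarding word order. This toy version skips the dimensionality reduction (from 254 to 50 dimensions) that the real system applied:

```python
def bag_of_words(utterances):
    """Map each utterance to a count vector over the corpus vocabulary."""
    vocab = sorted({w for u in utterances for w in u.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for u in utterances:
        vec = [0] * len(vocab)
        for w in u.split():
            vec[index[w]] += 1
        vectors.append(vec)
    return vocab, vectors

vocab, vecs = bag_of_words(["Aibo geh nach links", "Aibo stopp", "geh geh"])
print(vocab)    # → ['Aibo', 'geh', 'links', 'nach', 'stopp']
print(vecs[2])  # → [0, 2, 0, 0, 0]: only counts remain, word order is lost
```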
Combination of Acoustic and Linguistic Features 39 / 49
Classification results: Aibo chunk set
[bar chart of average recall [%] for the best acoustic features (late fusion, ANN), the best linguistic features, and their combination by late fusion (ANN) and by early fusion (LDA)]

Similar Results within CEICES 40 / 49
CEICES: Combining Efforts for Improving Automatic Classification of Emotional User States
- collaboration of various research groups within the European Network of Excellence HUMAINE
- state-of-the-art feature set with 4,000 features
- SVM (linear kernel), 3-fold speaker-independent cross-validation
- selection of 150 features (SFFS): which feature types survive?
- only chunk-based features, no information outside the Aibo chunk set
[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir: Whodunnit: Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech, Computer Speech and Language, Vol. 25, Issue 1 (January 2011), pp. 4-28
Similar Results within CEICES (cont.) 41 / 49
[table over feature-type groups (duration, energy, F0, spectrum, cepstrum, voice quality, wavelets, all acoustic; BOW, POS, higher semantics, varia, all linguistic; all) listing the total number of features, the number selected by SFFS, F-measure, share, and portion; the numeric entries are lost]

Overview 42 / 49
1 Different Perspectives on Emotion Recognition
2 FAU Aibo Emotion Corpus
3 Own Results on Emotion Classification
4 INTERSPEECH 2009 Emotion Challenge
INTERSPEECH 2009 Emotion Challenge 43 / 49
New goals:
- challenge with standardized test conditions
- open microphone: using the complete corpus
- highly unbalanced classes
- including all observed emotional categories
- including chunks with low inter-labeler agreement

INTERSPEECH 2009 Emotion Challenge (cont.) 44 / 49
Speaker-independent training and test sets:
- 2-class problem: NEGative vs. IDLe
  [table with the number of NEG and IDL chunks in the training and test sets; figures lost]
- 5-class problem: Anger, Emphatic, Neutral, Positive, Rest
  [table with the number of chunks per class in the training and test sets; figures lost]
INTERSPEECH 2009 Emotion Challenge (cont.) 45 / 49
Sub-Challenges:
1 Feature Sub-Challenge: optimisation of feature extraction/selection; classifier settings fixed
2 Classifier Sub-Challenge: optimisation of classification techniques; feature set given
3 Open Performance Sub-Challenge: optimisation of both feature extraction/selection and classification techniques

INTERSPEECH 2009 Emotion Challenge (cont.) 46 / 49
Participants:
[table with the number of participants per sub-challenge (Open Performance, Classifier, Feature) for the 2-class and 5-class problems; figures lost]
[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi: Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge, Speech Communication, Special Issue "Sensing Emotion and Affect: Facing Realism in Speech Processing", to appear
INTERSPEECH 2009 Emotion Challenge (cont.) 47 / 49
2-class problem: NEGative vs. IDLe
[bar chart of unweighted and weighted average recall [%] (y-axis 60 to 74) for the baseline, the systems of Barra-Chicote et al., Vogt et al., Bozkurt et al., Polzehl et al., Luengo et al., Dumouchel et al., Vlasenko et al., Kockmann et al., and their majority voting]

INTERSPEECH 2009 Emotion Challenge (cont.) 48 / 49
5-class problem: Anger, Emphatic, Neutral, Positive, Rest
[bar chart of unweighted and weighted average recall [%] (y-axis 35 to 55) for the baseline, the systems of Vogt et al., Barra-Chicote et al., Kockmann et al., Bozkurt et al., Lee et al., Vlasenko et al., Luengo et al., Planet et al., Dumouchel et al., and their majority voting]
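The challenge scores unweighted average recall (UAR, the mean of the per-class recalls) rather than weighted average recall (WAR, i.e. overall accuracy), because with highly unbalanced classes a trivial majority-class predictor gets a high WAR. A small sketch of the difference:

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    hit, tot = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        tot[t] += 1
        hit[t] += (t == p)
    return {c: hit[c] / tot[c] for c in tot}

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    r = per_class_recall(y_true, y_pred)
    return sum(r.values()) / len(r)

def war(y_true, y_pred):
    """Weighted average recall: equals overall accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Unbalanced 2-class data: always predicting the majority class "IDL"
# scores 80 % WAR but only chance-level 50 % UAR.
y_true = ["IDL"] * 8 + ["NEG"] * 2
y_pred = ["IDL"] * 10
print(uar(y_true, y_pred), war(y_true, y_pred))  # → 0.5 0.8
```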
State-of-the-Art: Summary 49 / 49
Berlin Emotional Speech Database:
- 7-class problem: hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral
- balanced classes
- 90 % accuracy
FAU Aibo Emotion Corpus:
- 4-class problem (Anger, Emphatic, Neutral, Motherese), subset with roughly balanced classes (Aibo chunk set): 69 % unweighted average recall
- 5-class problem (Anger, Emphatic, Neutral, Positive, Rest), highly unbalanced classes, complete corpus: 44 % unweighted average recall
- 2-class problem (NEGative vs. IDLe), highly unbalanced classes, complete corpus: 71 % unweighted average recall
More informationKNOWLEDGE-BASED IN MEDICAL DECISION SUPPORT SYSTEM BASED ON SUBJECTIVE INTELLIGENCE
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 22/2013, ISSN 1642-6037 medical diagnosis, ontology, subjective intelligence, reasoning, fuzzy rules Hamido FUJITA 1 KNOWLEDGE-BASED IN MEDICAL DECISION
More informationModule Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that
More informationAnalysis of SMO and BPNN Model for Speech Emotion Recognition System
International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Issue-4 E-ISSN: 2347-2693 Analysis of SMO and BPNN Model for Speech Emotion Recognition System Rohit katyal
More informationA Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA tafvizi@csail.mit.edu Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA mpacula@csail.mit.edu Abstract
More informationII. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
More informationApplications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories
Applications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories Contents. 1. Motivation 2. Scenarios 2.1 Voice box / call-back 2.2 Quality management 3. Technology
More informationEricsson T18s Voice Dialing Simulator
Ericsson T18s Voice Dialing Simulator Mauricio Aracena Kovacevic, Anna Dehlbom, Jakob Ekeberg, Guillaume Gariazzo, Eric Lästh and Vanessa Troncoso Dept. of Signals Sensors and Systems Royal Institute of
More informationMicroblog Sentiment Analysis with Emoticon Space Model
Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory
More informationIntroduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics
More informationTurkish Radiology Dictation System
Turkish Radiology Dictation System Ebru Arısoy, Levent M. Arslan Boaziçi University, Electrical and Electronic Engineering Department, 34342, Bebek, stanbul, Turkey arisoyeb@boun.edu.tr, arslanle@boun.edu.tr
More informationIEEE TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 6, NO. X, XXXXX 2015 1. Sentiment Analysis: From Opinion Mining to Human-Agent Interaction
TRANSACTIONS ON AFFECTIVE COMPUTING, VOL. 6, NO. X, XXXXX 2015 1 Sentiment Analysis: From Opinion Mining to Human-Agent Interaction Chloe Clavel and Zoraida Callejas Abstract The opinion mining and human-agent
More informationTibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
More informationChapter 8. Final Results on Dutch Senseval-2 Test Data
Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised
More informationAudio Engineering Society. Convention Paper. Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA
Audio Engineering Society Convention Paper Presented at the 129th Convention 2010 November 4 7 San Francisco, CA, USA The papers at this Convention have been selected on the basis of a submitted abstract
More informationOverview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
More informationAUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS
AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection
More informationAnnotated bibliographies for presentations in MUMT 611, Winter 2006
Stephen Sinclair Music Technology Area, McGill University. Montreal, Canada Annotated bibliographies for presentations in MUMT 611, Winter 2006 Presentation 4: Musical Genre Similarity Aucouturier, J.-J.
More information62 Hearing Impaired MI-SG-FLD062-02
62 Hearing Impaired MI-SG-FLD062-02 TABLE OF CONTENTS PART 1: General Information About the MTTC Program and Test Preparation OVERVIEW OF THE TESTING PROGRAM... 1-1 Contact Information Test Development
More informationMODELING DYNAMIC PATTERNS FOR EMOTIONAL CONTENT IN MUSIC
12th International Society for Music Information Retrieval Conference (ISMIR 2011) MODELING DYNAMIC PATTERNS FOR EMOTIONAL CONTENT IN MUSIC Yonatan Vaizman Edmond & Lily Safra Center for Brain Sciences,
More informationTechnologies for Voice Portal Platform
Technologies for Voice Portal Platform V Yasushi Yamazaki V Hitoshi Iwamida V Kazuhiro Watanabe (Manuscript received November 28, 2003) The voice user interface is an important tool for realizing natural,
More informationSocial Media Analytics Summit April 17-18, 2012 Hotel Kabuki, San Francisco WELCOME TO THE SOCIAL MEDIA ANALYTICS SUMMIT #SMAS12
Social Media Analytics Summit April 17-18, 2012 Hotel Kabuki, San Francisco WELCOME TO THE SOCIAL MEDIA ANALYTICS SUMMIT #SMAS12 www.textanalyticsnews.com www.usefulsocialmedia.com New Directions in Social
More informationThe Minor Third Communicates Sadness in Speech, Mirroring Its Use in Music
Emotion 2010 American Psychological Association 2010, Vol. 10, No. 3, 335 348 1528-3542/10/$12.00 DOI: 10.1037/a0017928 The Minor Third Communicates Sadness in Speech, Mirroring Its Use in Music Meagan
More informationMaster of Arts in Linguistics Syllabus
Master of Arts in Linguistics Syllabus Applicants shall hold a Bachelor s degree with Honours of this University or another qualification of equivalent standard from this University or from another university
More informationAnalysis and Synthesis of Hypo and Hyperarticulated Speech
Analysis and Synthesis of and articulated Speech Benjamin Picart, Thomas Drugman, Thierry Dutoit TCTS Lab, Faculté Polytechnique (FPMs), University of Mons (UMons), Belgium {benjamin.picart,thomas.drugman,thierry.dutoit}@umons.ac.be
More informationMeasuring and synthesising expressivity: Some tools to analyse and simulate phonostyle
Measuring and synthesising expressivity: Some tools to analyse and simulate phonostyle J.-Ph. Goldman - University of Geneva EMUS Workshop 05.05.2008 Outline 1. Expressivity What is, how to characterize
More informationOn Intuitive Dialogue-based Communication and Instinctive Dialogue Initiative
On Intuitive Dialogue-based Communication and Instinctive Dialogue Initiative Daniel Sonntag German Research Center for Artificial Intelligence 66123 Saarbrücken, Germany sonntag@dfki.de Introduction AI
More informationVisualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk
More informationOpen-Source, Cross-Platform Java Tools Working Together on a Dialogue System
Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System Oana NICOLAE Faculty of Mathematics and Computer Science, Department of Computer Science, University of Craiova, Romania oananicolae1981@yahoo.com
More informationLecture 1-10: Spectrograms
Lecture 1-10: Spectrograms Overview 1. Spectra of dynamic signals: like many real world signals, speech changes in quality with time. But so far the only spectral analysis we have performed has assumed
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationSentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015
More informationFeature Subset Selection in E-mail Spam Detection
Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature
More informationStrand: Reading Literature Topics Standard I can statements Vocabulary Key Ideas and Details
Strand: Reading Literature Key Ideas and Details Craft and Structure RL.3.1 Ask and answer questions to demonstrate understanding of a text, referring explicitly to the text as the basis for the answers.
More informationSchool Class Monitoring System Based on Audio Signal Processing
C. R. Rashmi 1,,C.P.Shantala 2 andt.r.yashavanth 3 1 Department of CSE, PG Student, CIT, Gubbi, Tumkur, Karnataka, India. 2 Department of CSE, Vice Principal & HOD, CIT, Gubbi, Tumkur, Karnataka, India.
More informationObjective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification
Objective Intelligibility Assessment of Text-to-Speech Systems Through Utterance Verification Raphael Ullmann 1,2, Ramya Rasipuram 1, Mathew Magimai.-Doss 1, and Hervé Bourlard 1,2 1 Idiap Research Institute,
More informationSOME ASPECTS OF ASR TRANSCRIPTION BASED UNSUPERVISED SPEAKER ADAPTATION FOR HMM SPEECH SYNTHESIS
SOME ASPECTS OF ASR TRANSCRIPTION BASED UNSUPERVISED SPEAKER ADAPTATION FOR HMM SPEECH SYNTHESIS Bálint Tóth, Tibor Fegyó, Géza Németh Department of Telecommunications and Media Informatics Budapest University
More information