Vocal Emotion Recognition


Vocal Emotion Recognition: State-of-the-Art in Classification of Real-Life Emotions
October 26, 2010
Stefan Steidl, International Computer Science Institute (ICSI), Berkeley, CA

Overview
1. Different Perspectives on Emotion Recognition
2. FAU Aibo Emotion Corpus
3. Own Results on Emotion Classification
4. INTERSPEECH 2009 Emotion Challenge

1. Different Perspectives on Emotion Recognition (Psychology of Emotion, Computer Science)

Facial Expressions of Emotion

Universal Basic Emotions
Paul Ekman postulates the existence of 6 basic emotions: anger, fear, disgust, surprise, joy, and sadness; other emotions are mixed or blended emotions; the basic emotions have universal facial expressions.

Terminology
Different affective states [1] are distinguished by design features such as intensity, duration, synchronization, event focus, appraisal elicitation, rapidity of change, and behavioral impact. Emotions, moods, interpersonal stances, attitudes, and personality traits each cover a characteristic range from low to very high on these dimensions.

[1] K. R. Scherer: Vocal communication of emotion: A review of research paradigms, Speech Communication, Vol. 40, pp. 227-256, 2003

Terminology (cont.)
Definition of Emotion (Scherer): episodes of coordinated changes in several components, including at least neurophysiological activation, motor expression, and subjective feeling, but possibly also action tendencies and cognitive processes, in response to external or internal events of major significance to the organism.

Vocal Expression of Emotion
Results from studies in the psychology of emotion relate acoustic parameters (intensity, F0 floor/mean, F0 variability, F0 range, sentence contour, high-frequency energy, speech and articulation rate) to emotions such as anger/rage, fear/panic, sadness, joy/elation, boredom, and stress; the table indicates for each emotion whether a parameter tends to increase or decrease. Two caveats from the original table: Banse and Scherer found a decrease in F0 range, and the evidence for high-frequency energy and for speech and articulation rate is partly inconclusive.

Goal: classification of the subject's actual emotional state (some sort of lie detector for emotions).
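To make these correlates concrete, the following minimal sketch extracts a few F0 and energy statistics from a single recording; it assumes the librosa library, a placeholder file name, and a rough F0 search range, none of which come from the studies cited above.

import numpy as np
import librosa

def prosodic_features(path="utterance.wav"):   # "utterance.wav" is a placeholder file name
    y, sr = librosa.load(path, sr=None)
    # F0 contour via pYIN; the search range is a rough choice, not taken from any study above
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    f0 = f0[~np.isnan(f0)]                     # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]          # short-time energy contour
    return {
        "f0_mean": float(np.mean(f0)),         # F0 floor/mean correlate
        "f0_std": float(np.std(f0)),           # F0 variability
        "f0_range": float(np.ptp(f0)),         # F0 range
        "energy_mean": float(np.mean(rms)),
        "energy_std": float(np.std(rms)),
    }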

Human-Computer Interaction (HCI)
Emotion-Related User States: naturally occurring states of users in human-machine communication; emotions in a broader sense; coordinated changes in several components are NOT required; classification of the perceived emotional state, not necessarily the actual emotion of the speaker.

Pattern Recognition Point of View
Classification task: choose 1 of n given classes; discrimination of classes rather than classification in the strict sense; definition of good features; machine classification. Actually not needed: a definition of the term "emotion" or information on how specific features change.

Emotional Speech Corpora
Acted data: based on the Basic Emotions theory; suited for studying prototypical emotions; corpora are easy to create (inexpensive, no labeling process); high audio quality; balanced classes; neutral linguistic content (focus on acoustics only); high recognition results.

Emotional Speech Corpora (cont.)
Popular corpora:
Emotional Prosody Speech and Transcript corpus (LDC): 15 classes
Berlin Emotional Speech Database (EmoDB): 7 classes; 89.9 % accuracy (speaker-independent LOSO evaluation, speaker adaptation, feature selection) [2]
Danish Emotional Speech Corpus: 5 classes; 74.5 % accuracy (10-fold SCV, feature selection) [3]

[2] B. Vlasenko et al.: Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech, INTERSPEECH 2007
[3] B. Schuller et al.: Emotion Recognition in the Noise Applying Large Acoustic Feature Sets, Speech Prosody 2006
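The speaker-independent leave-one-speaker-out (LOSO) protocol mentioned for EmoDB can be set up with scikit-learn's LeaveOneGroupOut splitter, using the speaker ID as the group; the data and the classifier below are random stand-ins, not the actual setups of [2] or [3].

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # stand-in acoustic feature vectors
y = rng.integers(0, 7, size=200)          # e.g. the 7 EmoDB classes
speakers = rng.integers(0, 10, size=200)  # speaker ID per sample; one fold per held-out speaker
scores = cross_val_score(SVC(), X, y, groups=speakers, cv=LeaveOneGroupOut())
print("speaker-independent accuracy: %.3f" % scores.mean())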

Emotional Speech Corpora (cont.)
Naturally occurring emotions: states that actually appear in HCI (real applications); difficult to create (appropriate scenario needed, ethical concerns, need to label the data); low emotional intensity in general (around 80 % neutral); low audio quality (reverberation, noise, far-distance microphones), which is nevertheless needed for machine classification because conditions between training and test must not differ too much; research on both acoustic and linguistic features is possible; new research questions arise, e.g. the optimal emotion unit; almost no corpora large enough for machine classification are available (they do not exist or are not available for research).

2. FAU Aibo Emotion Corpus (Scenario, Labeling of User States, Data-driven Dimensions of Emotion, Units of Analysis, Sparse Data Problem)

The FAU Aibo Emotion Corpus
51 children (30 female, 21 male) at the age of 10 to 13; 8.9 hours of spontaneous speech (mainly short commands); 48,401 words in 13,642 audio files.

FAU Aibo Emotion Corpus (cont.)
Database for CEICES and the INTERSPEECH 2009 Emotion Challenge; available for scientific, non-commercial use: http://www5.cs.fau.de/fauaiboemotioncorpus

[4] S. Steidl: Automatic Classification of Emotion-Related User States in Spontaneous Children's Speech, Logos Verlag, Berlin; available online: http://www5.cs.fau.de/en/our-team/steidl-stefan/dissertation/

Emotion-Related User States
11 categories, chosen after prior inspection of the data before labeling: joyful, surprised, motherese, neutral, bored, emphatic, helpless, touchy/irritated, reprimanding, angry, other.
motherese: the way mothers/parents address their babies, either because Aibo is well-behaving or because the child wants Aibo to obey; the positive equivalent to reprimanding.
emphatic: pronounced, accentuated, sometimes hyper-articulated way of speaking, but without showing any emotion.
reprimanding: the child is reproachful, reprimanding, wags the finger.

Labeling of User States
Labeling: 5 students of linguistics, holistic labeling on the word level, majority vote.

emotion category       # words        %
angry (A)                  134      0.3
touchy (T)                 419      0.9
reprimanding (R)           463      1.0
emphatic (E)             2,807      5.8
neutral (N)             39,975     82.6
motherese (M)            1,311      2.7
joyful (J)                 109      0.2
...
all                     48,401    100.0
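A small sketch of the word-level majority vote (the labels and the agreement threshold here are illustrative; the exact tie-breaking rules used for the corpus may differ):

from collections import Counter

def majority_label(votes, min_agreement=3):
    # Return the label chosen by most labelers, or None if fewer than
    # min_agreement of the 5 labelers agree (illustrative threshold).
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None

print(majority_label(["E", "E", "N", "E", "N"]))   # -> E
print(majority_label(["A", "T", "R", "N", "M"]))   # -> None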

Labeling of User States (cont.)
Confusion matrix (rows: majority-vote category; entries in %):

                     A      T      R      E      N      M      J
angry (A)         43.3   13.0   12.9   12.1   18.1    0.1    0.0
touchy (T)         4.5   42.9   11.7   13.7   23.5    1.0    0.1
reprimanding (R)   3.8   15.7   45.8   14.0   18.2    1.3    0.1
emphatic (E)       1.3    5.8    6.7   53.6   29.9    1.2    0.5
neutral (N)        0.4    2.2    1.5   13.9   77.8    2.7    0.5
motherese (M)      0.0    0.8    1.4    4.9   30.4   61.1    0.9
joyful (J)         0.1    0.6    1.1    7.3   32.4    2.0   54.2

Data-driven Dimensions of Emotions
Non-metric dimensional scaling arranges the emotion categories in the 2-dimensional space such that states that are often confused lie close to each other. The resulting configuration can be read in terms of a valence axis (negative: angry, touchy, reprimanding; positive: motherese, joyful; emphatic and neutral in between) and an interaction axis (motherese and reprimanding at the +interaction end).
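Such a map can be produced with non-metric multidimensional scaling on a dissimilarity matrix derived from confusion frequencies; the sketch below uses scikit-learn and a made-up confusion matrix rather than the real one, so the coordinates are only illustrative.

import numpy as np
from sklearn.manifold import MDS

labels = ["A", "T", "R", "E", "N", "M", "J"]
# Stand-in confusion frequencies (rows: majority-vote class, columns: assigned label)
C = np.random.default_rng(1).random((7, 7)) + 3 * np.eye(7)
P = C / C.sum(axis=1, keepdims=True)     # row-normalize to confusion probabilities
D = 1.0 - 0.5 * (P + P.T)                # symmetrize; frequent confusion = small distance
np.fill_diagonal(D, 0.0)
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
for name, (x, y) in zip(labels, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")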

Units of Analysis
The example turn sequence "stopp Aibo geradeaus fein machst du das stopp sitz" (stop, Aibo, straight ahead, well done, stop, sit; turns Ohm_18_342 and Ohm_18_343) can be analyzed on the word level, the chunk level, or the turn level. Advantages/disadvantages of larger units: more information, but less emotional homogeneity.

Sparse Data Problem
The rare categories are mapped onto four super classes:
Anger: angry, touchy/irritated, reprimanding
Emphatic
Neutral
Motherese
(The slide shows the MDS configurations before and after this mapping: S = 0.32, RSQ = 0.73 for the full category set; S = 0.19, RSQ = 0.90 for the four super classes.)

Sparse Data Problem (cont.)
Data subsets taken from the Aibo corpus:

data set           # words    # chunks    # turns
Aibo corpus         48,401      18,216     13,642
Aibo word set        6,070       4,543      3,996
Aibo chunk set      13,217       4,543      3,996
Aibo turn set       17,618       6,413      3,996

3. Own Results on Emotion Classification (Results for Different Units of Analysis, Machine vs. Human Labeler, Feature Types and their Relevance)

Most Appropriate Unit of Analysis
Classification with the complete set of features, Linear Discriminant Analysis (LDA), and 51-fold speaker-independent cross-validation:

unit of analysis    # features    # samples       average recall
word level                 265    6,070 words           67.2 %
chunk level                700    4,543 chunks          68.9 %
turn level                 700    3,996 turns           63.2 %

Chunks are the best compromise between the length of the segment and the homogeneity of the emotional state within the segment.

Machine Classifier vs. Human Labeler
Entropy-based measure: the human labels define a reference distribution over the classes, and the classifier's decision is added to it. Example for one word with the classes A, E, N, M: the labelers vote A, E, A, A, giving the reference distribution (0.75, 0.25, 0.0, 0.0); the decoder decides M, i.e. (0.0, 0.0, 0.0, 1.0); the combined distribution (0.375, 0.125, 0.0, 0.5) has entropy H_dec = 1.41. This yields an implicit weighting of classification errors depending on the word that is classified.
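The example above can be reproduced in a few lines; the 50/50 weighting of the labelers' reference distribution and the decoder's decision mirrors the numbers on the slide and is an assumption about the general measure:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# classes A, E, N, M; human labels A, E, A, A; decoder decision M
labeler = np.array([3, 1, 0, 0]) / 4.0          # -> (0.75, 0.25, 0.0, 0.0)
decoder = np.array([0.0, 0.0, 0.0, 1.0])
combined = 0.5 * labeler + 0.5 * decoder        # -> (0.375, 0.125, 0.0, 0.5)
print(round(entropy(combined), 2))              # 1.41, as on the slide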

Machine Classifier vs. Human Labeler (cont.)
Classification on the Aibo word set: the slide plots the relative frequency of entropy values (0 to 1.6) obtained when an average human labeler is added to the reference, next to the distribution obtained when the machine classifier is added. [5]

[5] S. Steidl, M. Levit, A. Batliner, E. Nöth, H. Niemann: "Of All Things the Measure is Man" - Classification of Emotions and Inter-Labeler Consistency, ICASSP 2005

Evaluation of Different Types of Features
Types of features: acoustic features (prosodic, spectral, and voice quality features) and linguistic features.
Evaluation: Artificial Neural Networks (ANN), 51-fold speaker-independent cross-validation, combination by early or late fusion.
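Combination by early or late fusion can be sketched as follows: late fusion averages the class posteriors of separately trained classifiers, early fusion concatenates the feature vectors before training a single model. The MLPs and random stand-in data below are only illustrative, not the ANNs or feature sets of the actual experiments.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_acoustic = rng.normal(size=(300, 20))      # stand-in acoustic features
X_linguistic = rng.normal(size=(300, 15))    # stand-in linguistic features
y = rng.integers(0, 4, size=300)             # e.g. Anger, Emphatic, Neutral, Motherese

clf_a = MLPClassifier(max_iter=500).fit(X_acoustic, y)
clf_l = MLPClassifier(max_iter=500).fit(X_linguistic, y)
# Late fusion: average the two posterior distributions, then take the argmax
posterior = 0.5 * clf_a.predict_proba(X_acoustic) + 0.5 * clf_l.predict_proba(X_linguistic)
prediction = posterior.argmax(axis=1)
# Early fusion would instead train one model on the concatenated features:
clf_early = MLPClassifier(max_iter=500).fit(np.hstack([X_acoustic, X_linguistic]), y)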

Acoustic Features: Prosody
Prosody covers suprasegmental characteristics such as the pitch contour, the energy contour, temporal shortening/lengthening of words, and the duration of pauses between words.

Acoustic Features: Prosody (cont.)
Classification results on the Aibo chunk set: the individual prosodic feature groups, pauses (16 features), duration (37), energy (25), and F0 (29), reach average recalls between 42.0 % and 58.5 %; all 107 prosodic features together reach 59.0 %.

Acoustic Features: Spectral Characteristics
Classification results on the Aibo chunk set: prosody (107 features) reaches 59.0 % average recall, MFCC (24 features) 58.9 %, and formants (16 features) 48.2 %.

Acoustic Features: Voice Quality
Classification results on the Aibo chunk set: TEO features (64) reach 52.3 %, while the two small voice quality groups, HNR (2 features) and jitter/shimmer (4 features), fall well below that (47.0 % and 32.5 %).
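For the spectral feature group, segment-level MFCC functionals can be computed roughly as follows; librosa is assumed again, and the 12 coefficients with mean/standard-deviation functionals are an arbitrary choice that happens to give 24 values, not necessarily the 24 MFCC features used above.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)       # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)    # 12 coefficients per frame
# Collapse the frame-level contours into one fixed-length descriptor per segment
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # 24 values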

Acoustic Features: Combination
Classification results on the Aibo chunk set: the best combination of acoustic features reaches 65.4 % average recall, compared to 59.0 % for prosody alone and 58.9 % for MFCC alone.

Linguistic Features
Types of linguistic features:
word characteristics: average word length (number of letters, phonemes, syllables), proportion of word fragments, average number of repetitions
part-of-speech features
unigram models
bag-of-words

Linguistic Features (cont.)
Part-of-Speech (POS) Features: only 6 coarse POS categories can be annotated without considering context: nouns/proper names, inflected adjectives, not inflected adjectives, present/past participles, (other) verbs/infinitives, auxiliaries, articles/pronouns/particles/interjections. The slide shows the share of each POS category (% of total) separately for Anger, Joyful, Neutral, Emphatic, Motherese, and Other.

Linguistic Features (cont.)
Unigram Models: u(w, e) = log10( P(e | w) / P(e) )

Words with the highest class posteriors:

Anger P(A | w)                        Emphatic P(E | w)
böser (bad)            29.2 %         stopp (stop)        30.5 %
stehenbleiben (stop)   18.9 %         halt (halt)         29.3 %
nein (no)              17.0 %         links (left)        20.5 %
aufstehen (get up)     12.3 %         rechts (right)      18.9 %
Aibo (Aibo)            10.1 %         nein (no)           17.6 %

Neutral P(N | w)                      Motherese P(M | w)
okay (okay)            98.6 %         fein (fine)         57.5 %
und (and)              98.5 %         ganz (very)         41.9 %
Stück (bit)            98.5 %         braver (good)       36.0 %
in (in)                98.2 %         sehr (very)         23.5 %
noch (still)           96.2 %         brav (good)         21.7 %
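The unigram scores can be estimated from the labeled words along the following lines; the word/emotion pairs are invented and there is no smoothing, so the printed value is only illustrative:

import math
from collections import Counter, defaultdict

# (word, emotion) pairs as produced by the word-level labeling (toy data)
data = [("stopp", "E"), ("stopp", "E"), ("stopp", "N"),
        ("fein", "M"), ("fein", "M"), ("nein", "A"), ("nein", "E")]

emotion_count = Counter(e for _, e in data)
word_emotion = defaultdict(Counter)
for w, e in data:
    word_emotion[w][e] += 1

def unigram_score(word, emotion):
    # u(w, e) = log10( P(e | w) / P(e) ); positive if the word favors the emotion
    p_e = emotion_count[emotion] / len(data)
    p_e_given_w = word_emotion[word][emotion] / sum(word_emotion[word].values())
    return math.log10(p_e_given_w / p_e) if p_e_given_w > 0 else float("-inf")

print(round(unigram_score("stopp", "E"), 2))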

Linguistic Features (cont.)
Bag-of-Words: the utterance "Aibo, geh nach links!" (Aibo, move to the left!) is represented by a vector of word counts over the vocabulary (Aibo, allen, geh, nach, links, Aibolein, ...). This captures the linguistic content, but the word order is lost; various dimensionality reduction techniques can be applied.

Linguistic Features (cont.)
Classification results on the Aibo chunk set: the individual linguistic feature groups, word statistics (6 features), POS (6), unigram models (16), and bag-of-words (254 reduced to 50), reach average recalls between 54.3 % and 61.9 %; the best combination reaches 62.2 %.
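Bag-of-words vectors of this kind can be built with a plain term-count vectorizer; the sketch uses scikit-learn with toy utterances, and the truncated SVD step only hints at the kind of dimensionality reduction (254 down to 50) mentioned above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

utterances = ["Aibo geh nach links", "Aibo stopp", "fein machst du das"]
vectorizer = CountVectorizer()              # word order is discarded
X = vectorizer.fit_transform(utterances)    # rows: utterances, columns: vocabulary
print(vectorizer.get_feature_names_out())
print(X.toarray())
svd = TruncatedSVD(n_components=2, random_state=0)   # toy stand-in for 254 -> 50
X_reduced = svd.fit_transform(X)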

Combination of Acoustic and Linguistic Features
Classification results on the Aibo chunk set: the best combination of acoustic features reaches 65.4 % average recall, the best combination of linguistic features 62.2 %; combining both by late fusion (ANN) yields 67.1 %, by early fusion (LDA) 68.9 %.

Similar Results within CEICES
CEICES (Combining Efforts for Improving Automatic Classification of Emotional User States): a collaboration of various research groups within the European Network of Excellence HUMAINE (2004-2007); state-of-the-art feature set with about 4,000 features; SVM (linear kernel), 3-fold speaker-independent cross-validation; selection of 150 features (SFFS) to see which feature types survive; only chunk-based features, no information outside the Aibo chunk set.

[6] A. Batliner, S. Steidl, B. Schuller, D. Seppi, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, V. Aharonson, L. Kessous, N. Amir: Whodunnit - Searching for the Most Important Feature Types Signalling Emotion-Related User States in Speech, Computer, Speech, and Language, Vol. 25, Issue 1 (January 2011), pp. 4-28

Similar Results within CEICES (cont.)
The feature selection table lists, for each feature type (duration, energy, F0, spectrum, cepstrum, voice quality, wavelets, bag-of-words, POS, higher semantics, varia), the total number of features and how many survive the SFFS selection, together with F-measures, shares, and portions. Out of 4,244 features in total (3,713 acoustic, 531 linguistic), 150 are selected when all features compete: 101 acoustic (10 duration, 32 energy, 16 F0, 15 spectrum, 16 cepstrum, 7 voice quality, 5 wavelets) and 49 linguistic (25 bag-of-words, 7 POS, 17 higher semantics, 0 varia). A second selection of 150 features within the acoustic and linguistic sets separately shows a similar distribution.

4. INTERSPEECH 2009 Emotion Challenge

INTERSPEECH 2009 Emotion Challenge
New goals: a challenge with standardized test conditions; "open microphone", i.e. using the complete corpus; highly unbalanced classes; including all observed emotional categories; including chunks with low inter-labeler agreement.

INTERSPEECH 2009 Emotion Challenge (cont.)
Speaker-independent training and test sets.

2-class problem: NEGative vs. IDLe
            NEG        IDL       total
train     3,358      6,601       9,959
test      2,465      5,792       8,257
all       5,823     12,393      18,216

5-class problem: Anger, Emphatic, Neutral, Positive, Rest
              A        E        N       P       R      total
train       881    2,093    5,590     674     721      9,959
test        611    1,508    5,377     215     546      8,257
all       1,492    3,601   10,967     889   1,267     18,216

INTERSPEECH 2009 Emotion Challenge (cont.)
Sub-Challenges:
1. Feature Sub-Challenge: optimisation of feature extraction/selection; classifier settings fixed
2. Classifier Sub-Challenge: optimisation of classification techniques; feature set given
3. Open Performance Sub-Challenge: optimisation of both feature extraction/selection and classification techniques

Participants:
                                  2 classes    5 classes
Open Performance Sub-Challenge            7            2
Classifier Sub-Challenge                  2            1
Feature Sub-Challenge                     1            1

[7] B. Schuller, A. Batliner, S. Steidl, D. Seppi: Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge, Speech Communication, Special Issue "Sensing Emotion and Affect - Facing Realism in Speech Processing", to appear

INTERSPEECH 2009 Emotion Challenge (cont.)
2-class problem (NEGative vs. IDLe): the baseline and the participating systems (Barra-Chicote et al., Vogt et al., Bozkurt et al., Polzehl et al., Luengo et al., Dumouchel et al., Vlasenko et al., Kockmann et al.) reach unweighted average recalls between 66.4 % and 70.3 %; majority voting over the best systems reaches 71.2 %.

5-class problem (Anger, Emphatic, Neutral, Positive, Rest): the baseline and the participating systems (Lee et al., Vlasenko et al., Luengo et al., Planet et al., Dumouchel et al., Vogt et al., Barra-Chicote et al., Kockmann et al., Bozkurt et al.) reach unweighted average recalls between 38.2 % and 41.7 %; majority voting reaches 44.0 %.
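Because the class distribution is highly skewed, the challenge ranks systems by unweighted average recall (UAR), the mean of the per-class recalls, rather than by overall accuracy; with scikit-learn this is a one-liner (the labels below are invented):

from sklearn.metrics import recall_score

y_true = ["IDL", "IDL", "IDL", "NEG", "NEG", "IDL", "NEG", "IDL"]
y_pred = ["IDL", "IDL", "NEG", "NEG", "IDL", "IDL", "NEG", "IDL"]
uar = recall_score(y_true, y_pred, average="macro")      # unweighted average recall
war = recall_score(y_true, y_pred, average="weighted")   # weighted average recall
print(f"UAR = {uar:.3f}, WAR = {war:.3f}")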

State-of-the-Art: Summary
Berlin Emotional Speech Database, 7-class problem (hot anger, disgust, fear/panic, happiness, sadness/sorrow, boredom, neutral), balanced classes: 90 % accuracy.
FAU Aibo Emotion Corpus, 4-class problem (Anger, Emphatic, Neutral, Motherese), subset with roughly balanced classes (Aibo chunk set): 69 % unweighted average recall.
FAU Aibo Emotion Corpus, 5-class problem (Anger, Emphatic, Neutral, Positive, Rest), highly unbalanced classes, complete corpus: 44 % unweighted average recall.
FAU Aibo Emotion Corpus, 2-class problem (NEGative vs. IDLe), highly unbalanced classes, complete corpus: 71 % unweighted average recall.