Emotion in Speech: towards an integration of linguistic, paralinguistic and psychological analysis

S-E. Fotinea(1), S. Bakamidis(1), T. Athanaselis(1), I. Dologlou(1), G. Carayannis(1), R. Cowie(2), E. Douglas-Cowie(2), N. Fragopanagos(3), J.G. Taylor(3)

(1) Institute for Language and Speech Processing (ILSP), evita@ilsp.gr
(2) Department of Psychology, Queen's University, Belfast, UK, r.cowie@qub.ac.uk
(3) Department of Mathematics, King's College, London, UK, john.g.taylor@kcl.ac.uk

Abstract. If speech analysis is to detect a speaker's emotional state, it needs to derive information from both linguistic information, i.e. the qualitative targets that the speaker has attained (or approximated) in conformance with the rules of the language, and paralinguistic information, i.e. the permitted variation in the way those qualitative linguistic targets are realised. It also needs an appropriate representation of emotional states. The ERMIS project addresses the integration problem that these requirements pose. It comprises mainly a paralinguistic analysis module and a robust speech recognition module. Descriptions of emotionality are derived from these modules following psychological and linguistic research that indicates the information likely to be available. We argue that progress in registering emotional states depends on establishing an overall framework of at least this level of complexity.

1 Introduction

Speech recognition is a technically sophisticated field, with numerous commercial systems already available for transforming speech to text. However, these systems ignore a large part of the information that humans extract from speech signals, namely information about the emotional state of the speaker. There are various specific applications for the detection of emotional and emotion-related states [1], but probably the most important reason for addressing the issue is completely generic. In this paper we describe progress towards a system capable of recovering the emotional content of speech signals.

Our general case is that understanding the emotional dimension of speech communication is a thoroughly interdisciplinary problem. Learning algorithms in general, and neural networks in particular, have an indispensable part to play. However, they need to be applied within a framework that makes use of other computational techniques, and of knowledge derived from several traditions within linguistics and psychology.

In humans, there are at least two separate systems involved in the processing of information about emotion from speech.

One derives information from the words that are spoken; the other derives information from the way they are spoken, particularly from the patterns of rise and fall in pitch and intensity known as prosody and from the changes in fine structure known as voice quality. There are indications that these distinctions may be associated with different cortical processing streams [2].

Following this bipartite division of emotion processing in humans, our work distinguishes two basic components of the emotional speech analysis system. The first is a linguistic analysis system, which derives information from a word string extracted as text from the signal; a post-processor stage then provides an interpretation of the emotion associated with the speaker. The second component is a paralinguistic analysis system, which uses different components of the raw acoustic signal to infer the underlying emotional state of the speaker.

The structure of the emotion recognition process depends critically on the definition of emotion-related states. There is a large body of psychological research in that area, but it is not well known in the IT communities that have expertise in the basic extraction processes. We highlight a well-established parameterisation of emotional states (into activation and valence levels) that is soft in its state delineation. Using that representation makes it possible to avoid some of the problems of binary state representation (with too much dependence on a linguistic definition of emotional states). Ideas that are less well established, but much more useful than uninformed intuitions, are relevant to the extraction of information from specifically verbal sources.

In the next section we describe the system that we have developed for prosodic analysis. Various emotionally important components, such as the F0 and intensity plots, are extracted and then used to give a separate indication of the speaker's emotional state. Section 3 describes the linguistic analyser, with subsections devoted to the explicit text recognition process and to the post-processing emotional state look-up. We conclude the paper with a discussion of the issues facing research in the immediate future.

2 Paralinguistic Analysis of Speech

The paralinguistic module extracts information about emotion that resides in the way words are spoken. The first target in this module is the extraction of phonetic structures, such as pitch and intensity contours, spectral profiles, and feature boundaries. From these are derived measures such as average pitch and energy, and parameters of timing. These are measured across sections of an utterance marked by natural endpoints. The module derives from a system called ASSESS (standing for Automatic Statistical Summary and Extraction of Speech Segments), which we have shown captures information relevant to speakers' emotional states [3]. Hence, we call the new system ASSESS MU (for modular unit).

2.1 Overall organization

For several reasons, it is desirable to apply paralinguistic processing to units of speech which correspond roughly to sentences or phrases, lasting on the order of a second or more and bounded by substantial breaks in speech. Some of the features that are most often associated with emotion are only defined relative to that kind of unit. An example is declination, i.e. a pattern in which pitch shows an overall tendency to fall from the beginning of a phrase to the end. Hence a good deal of processing must be held back until such a break occurs. The linguistic analyser needs to work continuously, and so it will provide the signal that a break has occurred; at that point, ASSESS MU will be triggered and will analyse the file, operating in three main stages, described below.
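The notion of pause-bounded units can be pictured with a small sketch. In ERMIS the break signal comes from the linguistic analyser, so the energy-threshold segmenter below is purely illustrative; the function name, frame size and thresholds are assumptions rather than part of the system, and only the 150 ms minimum pause echoes the pause definition given later in Section 2.3.

    # Illustrative sketch only: ERMIS lets the linguistic analyser signal the break,
    # but a simple energy-based segmenter conveys the idea of pause-bounded units.
    # All names and thresholds here are hypothetical, not taken from the paper.
    import numpy as np

    def pause_bounded_units(signal, sr, frame_ms=20, pause_db=-40.0, min_pause_s=0.15):
        """Split a waveform into units separated by sustained low-energy stretches."""
        frame = int(sr * frame_ms / 1000)
        n = len(signal) // frame
        frames = signal[:n * frame].reshape(n, frame)
        rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
        level_db = 20 * np.log10(rms / (np.max(rms) + 1e-12) + 1e-12)
        silent = level_db < pause_db                      # frames judged to be pause
        units, start, run = [], None, 0
        for i, s in enumerate(silent):
            if not s:
                if start is None:
                    start = i
                run = 0
            elif start is not None:
                run += 1
                if run * frame / sr >= min_pause_s:       # pause long enough to close a unit
                    units.append((start * frame, (i - run + 1) * frame))
                    start, run = None, 0
        if start is not None:
            units.append((start * frame, n * frame))
        return units                                      # list of (begin_sample, end_sample)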

2.2 Stage 1

Stage 1 will take the plot of voltage against time specified by a pause-defined file, and output descriptions of three basic types: overall signal energy, signal spectrum, and vocal cord openings. Voltage is sampled at 22.5 kHz. Overall energy and spectral properties will be described in terms of slices, that is, portions of the signal which span 512 points in the voltage plot. The overall energy measure will be the RMS of the voltage measurements within a slice, and the basic spectral description of each slice will give the signal intensity within each of 18 bands, which are generally 1/3 octave but wider (for practical reasons) at the top and bottom of the range. From these will be derived descriptions of the energy in four broad bands: three associated with measures used in [5] to capture qualities of voice such as breathiness and tension, plus one lower band which other work (including our own) has shown to be emotion-sensitive. The bands are: #1 0-500 Hz, #2 0-2 kHz, #3 2-5 kHz, #4 5-8 kHz.

Vocal cord openings form the basis on which the pitch contour (F0) will be estimated. They will be identified using an algorithm which picks up rapid upswings in the voltage/time curve. In the context of emotion detection, that approach is more appropriate than standard cepstral techniques, because it has the potential to detect the local irregularities which underlie emotionally significant qualities of vocalization, such as creak. Detecting vocal cord openings reliably is a non-trivial problem. There are standard algorithms which give rough solutions, but we believe that neural net techniques may give more precise identification.

2.3 Stage 2

The core of Stage 2 will be the description of two contours, one representing the rise and fall of intensity and the other the rise and fall of pitch (or, strictly speaking, F0).

Two main operations are applied to the intensity contour. It is smoothed to filter out events that last much less than a syllable. A more complex problem is setting a reference constant for the dB scale. A histogram-based technique is currently used to give reasonable estimates of intensity given a calibration sample of normal speech. A more sophisticated approach is to use evidence indicative of vocal effort (the energy in our third spectral band is reported to correlate with perceived vocal effort). Finding appropriate functions is another task where neural net techniques are probably appropriate.
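The slice-based energy description of Stage 1 and the referenced intensity contour of Stage 2 can be sketched as follows. The 512-point slices follow the text; the smoothing width and the particular percentile used as a histogram-based dB reference are assumptions made for illustration, not the ASSESS MU settings.

    # Sketch of the Stage 1 slice energies and the Stage 2 intensity contour.
    import numpy as np

    SLICE = 512

    def slice_rms(signal):
        """RMS of each 512-point slice of the voltage plot."""
        n = len(signal) // SLICE
        slices = signal[:n * SLICE].reshape(n, SLICE)
        return np.sqrt(np.mean(slices ** 2, axis=1) + 1e-12)

    def intensity_contour(signal, calibration_signal, smooth_slices=5):
        """Smoothed intensity contour in dB, referenced to a calibration sample."""
        rms = slice_rms(signal)
        # Smooth away events much shorter than a syllable (moving average over slices).
        kernel = np.ones(smooth_slices) / smooth_slices
        rms = np.convolve(rms, kernel, mode="same")
        # Histogram-style reference: a fixed percentile of the calibration sample's RMS
        # (the percentile itself is an assumption, not the published method).
        ref = np.percentile(slice_rms(calibration_signal), 75)
        return 20 * np.log10(rms / ref)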

Constructing a pitch contour is complex because (a) samples usually contain time periods where there is no pitch contour, most obviously pauses; and (b) Stage 1 outputs may lack direct information about pitch during time periods where there is a pitch contour, or contain misleading information about pitch during time periods when there is none. Our response to these problems is based on a flexible string that is (so to speak) stretched across the sample from the first slice that contains good pitch information to the last. Each point in the string is pulled towards the data points on one hand, and towards its neighbours on the other. An iterative process finds a balance between the two, giving a robust estimate of the pitch contour.

After contour extraction, the speech signal is divided into significant units before quantitative descriptions are formed. The main units to be considered are tunes, roughly phrase-like units, and pauses, i.e. silences which form the outer boundary of a tune (these must last for more than 150 ms). Shorter intervals when no speech is detected are called silences.

2.4 Stage 3

Stage 3 takes the general descriptions provided by Stage 2 and recovers parameters that are expected to correlate with emotional states. In general, the relevant parameters come from straightforward statistical summary of the data derived in Stage 2. That strategy yields both parameters that are generally regarded as basic (for instance, mean, range, and standard deviation of intensity or pitch) and others that are at a higher level, for instance parameters related to the durations of chunks, tunes and silences. A few key descriptors involve more specific operations. These involve particular properties of tunes, which we have considered under the heading of tune shape, and some spectral properties. In the spectral domain, various measures which have been correlated with perceptual qualities will also be generated from the basic Stage 2 outputs, notably:

- Energy in the 0-500 Hz region relative to total energy (see [4]).
- Measures from [5] based on peak energy in selected spectral bands:
  peak energy in band 2 relative to band 3 (correlates with perceived coarseness of voice);
  peak energy in band 3 relative to band 4 (correlates with perceived stability of voice);
  peak energy in band 2 relative to band 4 (correlates with perceived use of head register vs chest register).

The approach described up to this point defines a wide range of parameters that could in principle be passed to the emotion recognition subsystem. We have reported elsewhere on the relationships between these parameters and speakers' emotional states, using a range of learning algorithms to identify the parameters with the most predictive value.
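To make the last two stages concrete, the sketch below shows one way the flexible-string pitch smoothing of Stage 2 and a Stage 3 style per-tune summary might be realised. The relaxation weights, iteration count and the particular statistics returned are illustrative assumptions; they are not the settings used in ASSESS MU.

    # Sketch of the "flexible string" pitch smoothing followed by a per-tune summary.
    import numpy as np

    def flexible_string_f0(raw_f0, voiced, data_weight=0.5, iters=200):
        """Iteratively balance a pull towards observed F0 against a pull towards neighbours."""
        f0 = np.where(voiced, raw_f0, np.nan)
        idx = np.arange(len(f0))
        good = ~np.isnan(f0)
        # Initialise the string by interpolating between slices with good pitch information.
        est = np.interp(idx, idx[good], f0[good])
        for _ in range(iters):
            neighbours = (np.roll(est, 1) + np.roll(est, -1)) / 2.0
            neighbours[0], neighbours[-1] = est[1], est[-2]
            # Pull towards the data where it exists, and towards neighbours everywhere.
            pulled = np.where(good, data_weight * f0 + (1 - data_weight) * neighbours, neighbours)
            est = 0.5 * est + 0.5 * pulled
        return est

    def tune_summary(f0_contour, intensity_db):
        """Basic Stage 3 style statistics for one tune."""
        return {
            "f0_mean": float(np.mean(f0_contour)),
            "f0_range": float(np.ptp(f0_contour)),
            "f0_sd": float(np.std(f0_contour)),
            "intensity_mean": float(np.mean(intensity_db)),
            "intensity_range": float(np.ptp(intensity_db)),
        }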

3 Linguistic Analysis of Speech

The Linguistic Analyser processes the speech signal and provides the linguistic parameters used to deduce the user's emotion from the speech signal. It consists of a Signal Enhancement/Adaptation module, which produces an enhanced speech signal from the original speech input, and a robust Speech Recognition module, which outputs a text string representing what the speaker has uttered. This text serves as input to the Text Post-Processing module, which converts text to emotion.

3.1 Recognising Speech

To guarantee the best possible quality of speech recognition for emotionally coloured speech, the Linguistic Analyser should use uncompressed speech signals before any enhancement or recognition algorithm is applied. The modules that need to be combined in order to recognise speech are presented briefly below.

3.1.1 The Signal Enhancement/Adaptation Module

Signal enhancement: The uncompressed speech signal is fed to the Signal Enhancement/Adaptation module and processed in order to enhance the signal and remove noise prior to recognition. Two methods are currently implemented. The first is the well-known non-linear spectral subtraction [6]; the second is a noise reduction technique presented in [7], based on the Singular Value Decomposition (SVD) approach. Validation tests are being conducted to evaluate the speech enhancement algorithms with respect to word error rate, and initial comparative results are reported in [8].

Speaker adaptation: An important source of variability in speech is the difference between speakers, e.g. male/female or adult/child. Performance may be improved considerably if the input speech is normalised against speaker variability. The selected strategy adapts the features extracted for the current speaker to the acoustic models, instead of adapting the models to the input.

3.1.2 The Speech Recognition Module

This module processes the speech signal and extracts features by converting each speech frame into a set of cepstral coefficients. Acoustic phoneme models then provide estimates of the probability of the features given a sequence of words, and language modelling provides a mechanism for estimating the probability of a word in an utterance given its preceding words. The output of this process is a text representing what the speaker has uttered. The Speech Recognition module we have developed was inspired by the work presented in [9].

Parameter extraction: The prime function of the parameter extraction module is to divide the input speech into blocks and then, for each block, to derive a smoothed spectral estimate. (The spacing between blocks is typically 10 ms, and blocks are normally overlapped to give a longer analysis window, typically 25 ms.) In almost all such processing it is usual to apply a tapered window function (e.g. Hamming) to each block. Mel-Frequency Cepstral Coefficients (MFCCs) are used to model the spectral characteristics of each block.
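A minimal sketch of this parameter-extraction step, using the 25 ms Hamming-windowed blocks spaced every 10 ms described above, is given below. The use of the librosa library is an assumption for illustration; the paper does not prescribe a particular implementation.

    # Sketch of the parameter-extraction step: 25 ms Hamming-windowed blocks every
    # 10 ms, each represented by MFCCs. Library choice is an assumption.
    import librosa

    def extract_mfccs(path, n_mfcc=13):
        signal, sr = librosa.load(path, sr=None)          # keep the native sampling rate
        win = int(0.025 * sr)                             # 25 ms analysis window
        hop = int(0.010 * sr)                             # 10 ms spacing between blocks
        mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                     n_fft=win, hop_length=hop, window="hamming")
        return mfccs.T                                    # one row of coefficients per block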

Acoustic modelling: The purpose of the acoustic models is to provide a method of calculating the likelihood of any vector sequence Y given a word w. In principle, the required probability distribution could be found by collecting many examples of each w and gathering the statistics of the corresponding vector sequences. However, this is impractical for LVR systems; instead, word sequences are decomposed into basic sounds called phones, and each individual phone is represented by a hidden Markov model (HMM). Contextual effects cause large variations in the way that different sounds are produced. Hence, to achieve good phonetic discrimination, different HMMs have to be trained for each different context, instead of one HMM per phone. Our approach uses triphones, where every phone has a distinct HMM model for every unique pair of left and right neighbours. Moreover, state-tying techniques with continuous-density HMMs are used.

Language modelling: An effective way of estimating the probability of a word given its preceding words is to use N-grams, which simultaneously encode syntax, semantics and pragmatics. They concentrate on local dependencies, which makes them very effective for languages where word order is important and the strongest contextual effects tend to come from near neighbours. We have also chosen N-grams because N-gram probability distributions can be computed directly from text data, so there is no requirement for explicit linguistic rules (e.g. formal grammars).
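As the paragraph above notes, N-gram distributions can be estimated directly from text. The bigram sketch below illustrates this; the add-one smoothing is an assumption made only to keep the example self-contained.

    # Sketch of estimating bigram probabilities directly from text data.
    from collections import Counter

    def train_bigram(sentences):
        unigrams, bigrams = Counter(), Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            unigrams.update(words)
            bigrams.update(zip(words[:-1], words[1:]))
        vocab = len(unigrams)
        def prob(word, previous):
            """P(word | previous) with add-one smoothing (an assumption)."""
            return (bigrams[(previous, word)] + 1) / (unigrams[previous] + vocab)
        return prob

    p = train_bigram(["i am very angry", "i am delighted"])
    print(p("angry", "very"))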
Search engine: The basic recognition problem is to find the most probable sequence of words given the observed acoustic signal (based on Bayes' rule for decomposition). In our system we use a breadth-first approach, specifically beam search with Viterbi decoding (which exploits Bellman's optimality principle). This dynamic search engine is capable of exploiting complex language models and HMM phone models that depend on both the previous and the succeeding acoustic context, capturing effects such as coarticulation. Moreover, it can do this in a single pass, in contrast to most other Viterbi-based systems, which use multiple passes.

3.2 Emotion-related information from text

Converting speech to text is the outcome of the Speech Recognition procedure described above. Extracting the speaker's emotional state, however, requires a further conversion from text to emotion. This module of the Linguistic Analyser, being the last in terms of sequential execution, is called the Text Post-Processing Module. The simplest way to proceed is to assume that Text Post-Processing comprises text retrieval techniques, such as word spotting, in order to classify the user's emotion from the linguistic characteristics of the user's utterance. The possible use of emotional lexicons is being investigated. Such lexicons exist for English, and appropriate adaptation to Greek seems necessary if we foresee emotion recognition for Greek as well.

The basic process we start with is to use a look-up table to describe the speaker's emotional state from the recognised words. We base this on the original dictionary of Whissell, extended more recently to 8700 words, with a 90% matching rate for most documents [10]. The dictionary uses two dimensions, activation and evaluation. The first is the degree of arousal associated with emotion-relevant words, with, for example, a low value of 2.2 for "bashful" and a value of over 6 for "surprised". The second dimension is the degree of pleasantness, with a low value of 1.1 for "guilty" and a high value of 6.4 for "delighted". We use the two-dimensional look-up table to produce an activation-evaluation coding for each word in a text. As a string of words is successively processed, this transformation yields a dynamic trajectory through the two-dimensional emotion space. This trajectory is to be related to the FEELTRACE-style trajectory obtained by using the ASSESS system to analyse the prosodic components of the speech input.

An important question is how these two trajectories are to be fused to produce a suitable coding of the emotional content of the speech. This is discussed in Section 4.
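Before turning to fusion, the look-up step itself can be sketched very simply. In the toy dictionary below, only the single coordinates quoted above (activation 2.2 for "bashful", evaluation 1.1 for "guilty" and 6.4 for "delighted") come from the text; every other number, and the choice to skip unknown words, is a placeholder assumption.

    # Sketch of the text post-processing look-up: each recognised word is mapped to
    # (activation, evaluation) coordinates and the sequence forms a trajectory.
    AFFECT_DICT = {
        # Values quoted in the text are used where available; the rest are placeholders.
        "bashful":   (2.2, 3.0),
        "surprised": (6.0, 4.0),
        "guilty":    (3.5, 1.1),
        "delighted": (5.5, 6.4),
    }

    def emotion_trajectory(text):
        """Map a recognised word string to a trajectory in activation-evaluation space."""
        trajectory = []
        for word in text.lower().split():
            if word in AFFECT_DICT:                # words absent from the dictionary are skipped
                trajectory.append(AFFECT_DICT[word])
        return trajectory

    print(emotion_trajectory("he was guilty but now he is delighted"))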

4 Fusion of the two streams

A plethora of issues arises from the effort to fuse the emotional features extracted by the linguistic analysis with those extracted by the paralinguistic (prosodic) analysis. These issues concern both technical intricacies and the generic complexity of the fusion task. On the technical level, there must be an effective method of synchronising and combining the two streams of data that represent the emotional state as detected by the two types of analysis. To this end, we need to specify what the unit of analysis is and which characteristic(s) of the speech stream should trigger the different modules.

On a more generic level, there is a question of harmonising the two emotion inference procedures by balancing the authority of the linguistic and paralinguistic analyses with respect to which is apposite for deducing the emotional state at each instant. This is particularly important in cases where the two methods report incompatible emotional states. For instance, we know that the same phrase spoken with a different tone (different prosodic features) can have quite a different emotional effect. Thus prosody can often be more informative than the actual words spoken, as when a speaker uses sarcasm or when the semantic content of the phrase is neutral but the tone is highly emotional (e.g. in frustration).

The aptness of the two individual analyses matters before a merge occurs, as both linguistic and paralinguistic analysis are susceptible to emotion detection errors. In the case of the text post-processing, we have to improve our approach by removing emotion assignments that are contradicted by further context indicating that the speaker is not themselves experiencing the emotional state spotted. Thus, in the example "He said to me that he was very angry", it is clear that the speaker is not angry. Items of reported speech containing emotion words should therefore be treated as a separate category. They may carry implications of an emotional state that needs to be taken notice of, but that needs to be treated differently from recognising and responding appropriately to the emotional state of the speaker. Thus any presence of reported-speech markers ("that p", "X felt", or equivalents) must be treated separately. In the case of emotion detection by prosody, it has been reported that different emotional states can correspond to the same or similar prosodic patterns. Special care should therefore be taken when classifying emotion based on these patterns, using cross-referencing between the two feature streams to resolve the ambiguity.

One solution to this fusion problem is a neural network, trained on a suitable training set of speech streams with known emotional state tagging. Such a set of data is being developed as part of ERMIS, with the associated emotional activation-evaluation trajectories forming part of the developing FEELTRACE database. An initial version of this was already used in an earlier project (PHYSTA), using a variety of techniques; our present approach is more principled, as well as involving a larger and better-defined FEELTRACE database. We will also take seriously the suggestion of the accompanying paper to take lessons from the human brain. More specifically, we propose to build a feedback system, essentially mimicking the ventral route (amygdala and prefrontal cortices) to emotion recognition in the human brain.
This will allow attention to be directed to subsets of the overall speech features being analysed; in that way we will obtain speed-up as well as improved accuracy.
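For orientation, the sketch below shows the simplest form such a trained fusion network could take: a feed-forward mapping from concatenated paralinguistic parameters and text-derived features to an activation-evaluation estimate. The layer sizes and the plain feed-forward architecture are illustrative assumptions only; they are not the ERMIS design, which, as just described, envisages a feedback system with attention directed to feature subsets.

    # Minimal sketch of neural-network fusion of the two feature streams into an
    # activation-evaluation estimate (illustrative; not the ERMIS architecture).
    import torch
    import torch.nn as nn

    class FusionNet(nn.Module):
        def __init__(self, n_prosodic=32, n_linguistic=2, hidden=16):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_prosodic + n_linguistic, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),          # outputs: activation, evaluation
            )

        def forward(self, prosodic, linguistic):
            return self.net(torch.cat([prosodic, linguistic], dim=-1))

    # Training would use utterances tagged with FEELTRACE-style activation-evaluation
    # targets, e.g. minimising the error between predictions and the rated trajectories.
    model = FusionNet()
    prediction = model(torch.randn(4, 32), torch.randn(4, 2))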

5 Conclusions

We have presented a description of ongoing work in the ERMIS project to marry prosodic and linguistic analyses of speech so as to create an emotion recognition system. The problems of doing so are not trivial, as has been noted elsewhere in some detail [11, 12]. However, we consider that we have built up expertise in both the fundamentals of emotion recognition and word recognition. The addition of feedback may help give the edge needed to obviate the difficulties noted in [11, 12].

Acknowledgements: This work has been partially supported by the European Commission under the ERMIS Project Grant (IST ).

References

1. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.: Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine 18(1) (2001)
2. Taylor, J.G. et al.: The Emotional Recognition Architecture in the Human Brain. Submitted to ICONIP/ICANN 2003, Istanbul, Turkey (2003)
3. McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk and Stroeve, S.: Automatic recognition of emotion from speech: a rough benchmark. In: Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Belfast: Textflow (2000)
4. Klasmeyer, G.: An automatic description tool for time-contours and long-term average voice features in large emotional speech databases. In: Proceedings of the ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Belfast: Textflow (2000)
5. Hammarberg, B., Fritzell, B., Gauffin, J., Sundberg, J., Wedin, L.: Perceptual and acoustic correlates of voice qualities. Acta Otolaryngologica 90 (1980)
6. Pellom, B.L., Hansen, J.H.L.: Voice Analysis in Adverse Conditions: The Centennial Olympic Park Bombing 911 Call. In: Proceedings of the IEEE Midwest Symposium on Circuits & Systems, August (1997)
7. Doclo, S., Dologlou, I., Moonen, M.: A novel iterative signal enhancement algorithm for noise reduction in speech. In: Proceedings of ICSLP, Sydney, Australia (1998)
8. Athanaselis, T., Fotinea, S-E., Bakamidis, S., Dologlou, I., Giannopoulos, G.: Signal Enhancement for Continuous Speech Recognition. Submitted to ICONIP/ICANN 2003, Istanbul, Turkey (2003)
9. Young, S.J.: Large Vocabulary Continuous Speech Recognition. IEEE Signal Processing Magazine 13(5) (1996)
10. Whissell, C.M.: The dictionary of affect in language. In: Plutchik, R., Kellerman, H. (eds.): Emotion: Theory, Research and Experience, vol. 4: The Measurement of Emotions. New York: Academic Press (1989)
11. Russell, J.A. et al.: Facial and Vocal Expressions of Emotion. Annual Review of Psychology 54 (2003)
12. McNeely, H.E., Parlow, S.E.: Complementarity of Linguistic and Prosodic Processes in the Intact Brain. Brain & Language 79 (2001)
