EMOTION DETECTION IN EMAIL CUSTOMER CARE


Computational Intelligence, Volume 59, Number 000, 2010

NARENDRA GUPTA, MAZIN GILBERT, AND GIUSEPPE DI FABBRIZIO
AT&T Labs - Research, Inc., Florham Park, NJ, USA
{ngupta,mazin,pino}@research.att.com

Prompt and knowledgeable responses to customers' emails are critical in maximizing customer satisfaction. Such email messages often contain complaints about unfair treatment due to negligence, incompetence, rigid protocols, unfriendly systems, and unresponsive personnel. In this paper, we refer to these messages as emotional emails. They provide valuable feedback to improve contact center efficiency and the quality of the overall customer care experience, which in turn results in increased customer retention. We describe a method that uses salient features to identify emotional emails in the customer care domain. Salient features in customer care related emails are expressions of customer frustration, dissatisfaction with the business, and threats to leave, take legal action, and/or report to authorities. Compared to a baseline system using word unigram features, our proposed approach significantly improves emotional email detection performance.

Key words: Boosting; Customer Service; Emotional Emails; Salient Features; Support Vector Machines; Text Classification.

1. INTRODUCTION

Email is becoming the preferred communication channel for customer service. For customers, it is a way to avoid long hold times on call center phone calls and to keep a record of information exchanges with the business. For businesses, it offers an opportunity to best utilize customer service representatives by evenly distributing the work load over time, and for representatives, it allows time to research issues and respond to customers in a manner consistent with business policies. Businesses can further exploit the offline nature of this communication channel by automatically routing emails involving critical issues to specialized representatives. Besides direct concerns related to products and services, businesses can ensure that messages complaining about unfair treatment due to negligence, incompetence, rigid protocols, and unfriendly systems are always handled with care. Such emails, referred to as emotional emails, are critical to reduce churn, i.e., to retain customers who would otherwise have taken their business elsewhere, and, at the same time, they are a valuable source of information for improving business processes together with the quality of the overall customer care experience, which in turn results in increased customer retention.

In service-oriented businesses, a large number of customer email messages may contain routine complaints. While such complaints are important and are routinely addressed by customer service representatives, this work aims to identify a critical subset of such messages called emotional emails. We define emotional emails as customer messages to a provider complaining about specific aspects of a product or service, where the severity of the complaints and the customer dissatisfaction could lead to loss of business or brand value, or to litigation and penalties. Emotional emails may contain abusive and probably emotionally charged language, but we are mainly interested in identifying messages where, in addition to flames (Spertus, 1997), the customer included a description of the problem they experienced with the company providing the service. In the context of customer service, customers express their concerns in many ways.
Sometimes they communicate their negative emotions or opinions using phrases like "disgusted" or "you suck", and using orthographic devices like all-uppercase words and repeated exclamation marks. In other cases, there is minimal emotional involvement and they use factual phrases such as "you overcharged" or "take my business elsewhere". In many cases, emotional, opinion, and factual components are all present. In this work, we have identified eight different ways customers express themselves through phrases in emotional emails. Throughout this paper, these will be referred to as salient features. We have also identified three different orthographic means customers use to express their emotions; these will be referred to as orthographic features. We treat identification of emotional emails as a text classification problem, and show that the salient and orthographic features significantly improve the

identification performance. Compared to a baseline system using only the word unigram features, our proposed approach resulted in a 42% relative improvement in F-measure in detecting emotional emails.

The rest of the paper is organized as follows. In Section 2, we provide a summary of previous work and its relationship to our contribution. In Section 3, we explain why we view identification of emotional emails as a text classification problem. Section 4 describes the rationale behind our choice of classification techniques. In Section 5, we illustrate our method for detecting emotional emails using salient and orthographic features. In Section 6, we present a series of experiments demonstrating improvements in classification performance over a word unigram baseline. In Section 7, we conclude by highlighting the contributions of this work.

2. PREVIOUS WORK

Extensive research has been done on emotion detection. In the context of human-computer dialogs, although richer features including acoustic and intonation features are available, there is a general consensus (Litman and Forbes-Riley, 2004b; Lee and Narayanan, 2005) that lexical evidence can significantly improve the accuracy of emotion detection. Research has also been done on predicting basic emotions (also referred to as affects) within text (Alm et al., 2005; Liu et al., 2003). In Alm et al. (2005), to render speech with a prosodic contour conveying the emotional content of the text, one of six types of human emotions (angry, disgusted, fearful, happy, sad, and surprised) is identified for each sentence in the running text. Deducing such emotions from lexical constructs is a hard problem, as evidenced by little agreement among annotators; a low Kappa value¹ (Cohen, 1960) was reported in Alm et al. (2005). Liu et al. (2003) have argued that the absence of affect-laden surface features (i.e., keywords) from the text does not imply the absence of emotions; therefore, they have relied more on commonsense knowledge. In our work as well, the absence of emotion- or opinion-laden keywords or phrases does not automatically make a message non-emotional: the presence of other salient features could still mark the message as emotional. Furthermore, instead of deducing the types of emotions and opinions in each sentence, we are interested in knowing whether the email as a whole conveys emotional content or not. In addition, we are also interested in the intensity and cause of those emotions and opinions.

There is also a body of work on creating Semantic Orientation dictionaries (Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2003; Esuli and Sebastiani, 2005) and on their use in identifying sentiment²-laden sentences and the polarity of those sentiments, i.e., whether those sentiments are positive or negative (Yu and Hatzivassiloglou, 2003; Kim and Hovy, 2004; Hu and Liu, 2004). While such dictionaries provide a useful starting point, their use alone does not yield satisfactory results. In Wilson et al. (2005), classification of phrases containing positive, negative, or neutral sentiments is discussed. For this problem, they show high agreement among human annotators (Kappa of 0.84). They also show that labeling phrases with positive, negative, or neutral sentiments only on the basis of the presence of keywords from such dictionaries yields a classification accuracy of 48%. An obvious reason for this poor performance is that word semantic orientations are context dependent. In addition to Wilson et al. (2005), Pang et al. (2002) and Dave et al.
(2003) have attempted to mitigate this problem by using supervised classification methods that explore the role of the syntactic and semantic context in which the opinion words are expressed. They report classification results using a number of different sets of features, including word unigrams. Wilson et al. (2005) report an improvement (from 63% to 65.7% accuracy) in performance by adding a host of features extracted from syntactic dependencies. Similarly, Gamon (2004) demonstrates that the use of deep semantic features along with word unigrams improves performance. Pang et al. (2002) and Dave et al. (2003), on the other hand, demonstrated that word unigrams provide the best classification results. This is in line with our experience as well and could be related to the sparseness of the data. We also used supervised methods to detect emotional emails. To train predictive models we used word unigrams and a number of binary features indicating the presence of words/phrases from specific dictionaries.

There is also a body of work on classifying email messages according to their speech acts (Searle, 1969). For example, Lampert et al. (2010) discussed a text classification approach for detecting email messages containing Requests for Action.

¹ Kappa, usually denoted by κ, is a statistical measure of inter-annotator agreement whose computation accounts for chance agreement. κ = 1 when there is complete agreement among the annotators, and κ = 0 when the agreement is no better than chance; Kappa is negative only when the annotators agree less often than expected by chance.
² In the opinion mining literature, the word sentiment is often used to refer to opinions. Such opinions can be expressed directly through phrases like "is not good" or indirectly through phrases like "it is depressing".

Like our work, they also use supervised learning to train classification models, but they report a relatively poor inter-annotator agreement (Kappa of 0.681). For this reason, Lampert et al. (2010) used only 505 unanimously agreed-upon training instances. Similar to our work, they also show that, even with a small data set, highly discriminating features significantly improve text classification performance. While we use salient features, Lampert et al. (2010) use different regions of email messages, called zones, to improve classification performance.

In the context of email-based customer care, Busemann et al. (2000) propose a traditional classification approach that classifies customer messages into one of 47 categories defined for a specific domain. Although they address classification in a customer care setting, their objective is not to detect emotional emails, but to identify the nature of customer requests in order to assist a representative in quickly composing the response. In their work, Busemann et al. (2000) use features extracted from text by shallow parsing and orthographic analysis. They also explored several machine learning algorithms and show that Support Vector Machines (SVMs) (Vapnik, 1998) perform best for their task.

Spertus (1997) presents a system called Smokey which recognizes hostile messages and is quite similar to our work. While Smokey is interested in identifying messages that contain abusive or insulting language, i.e., flames, our research on emotional emails also looks for the reasons behind such flames. Besides word unigrams, Smokey uses rules to derive additional features for classification. These features are intended to capture different manifestations of flames, for example noun appositions like "you bozos", imperatives like "get lost", and insults like "sick" and "idiotic". In Smokey, these features are extracted by applying rules to syntactic parses of the sentences. In our work we also used rules; however, because of inaccurate grammatical structures, typos, and misused punctuation in emails, we used table lookups to derive the salient features of emotional emails.

3. CLASSIFICATION VS. REGRESSION

We have defined emotional emails as customer messages expressing relatively severe complaints and relatively high levels of dissatisfaction. Therefore, identification of emotional emails requires ranking the messages with respect to the severity of the complaints and the level of customer dissatisfaction, and subsequently selecting the top-ranking messages using some optimal thresholding criterion. Ordering by severity may suggest adopting an ordinal regression model trained using emails labeled, for instance, on a severity scale from 1 to 5, where higher ranks indicate greater gravity of complaints and level of customer dissatisfaction. We did not pursue this approach for the following reasons. (1) It is hard to get consistently labeled data to train regression models. As demonstrated in Pang and Lee (2005), although humans can discern more than one grade of positive or negative judgment, labeler agreement degrades when rank differences are small. (2) In our experience with a similar problem in the domain of restaurant rating prediction³ discussed in Gupta et al. (2010a), classification performs as well as or better than regression. We believe the reason for this is that such ranking tasks are inherently classification tasks: labelers first make a classification decision between low and high ranks, and then pick the rank from the lower or upper half of the rank range.
In this process, the annotators' agreement is quite uniform when making the initial classification decision, but diverges around close ranking values.

4. CLASSIFICATION ALGORITHMS

We experimented with boosting (Schapire, 1999) and SVMs. We used BoosTexter (Schapire and Singer, 2000), an implementation of AdaBoost (Schapire and Singer, 1999), and LLAMA, which, among other machine learning algorithms, includes an implementation of SVMs. LLAMA is a highly scalable machine learning toolkit developed at AT&T. Similar to LIBSVM (Chang and Lin, 2001), SVMs in LLAMA are implemented using Platt's Sequential Minimal Optimization algorithm (Platt, 1999). Haffner (2006a) has shown that the classification performances obtained on several applications by using LLAMA are identical to those obtained by using LIBSVM.

³ Like the emotional email problem, where the severity of complaints and the level of dissatisfaction have to be ranked: in the restaurant rating prediction problem, customers' opinions about restaurant features are ranked on a scale of 1 to 5 using the customer-provided textual review.

Nevertheless, LLAMA was preferred because it is AT&T internally approved software for building deployable, production-grade applications. To augment the results presented in Gupta et al. (2010b) using boosting, we experimented with SVMs, since these algorithms represent two broad categories of large margin classifiers (Haffner, 2006b). A detailed discussion of these algorithms can be found in the references cited above. Here we summarize the properties that we found useful for the analysis of our experimental results.

4.1. Boosting

AdaBoost is a feature-based learning method: only a few features are considered necessary to express the classification model. To minimize overfitting, AdaBoost uses the l1-norm for regularization during training. As a consequence, many features are assigned zero weights and are ignored in the learned models. BoosTexter uses word n-grams as features. It is therefore particularly suited for text classification and has been successfully applied to numerous text classification applications at AT&T (Gupta et al., 2005).

Boosting builds a highly accurate classifier by combining many weak base classifiers, each of which may only be moderately accurate. In BoosTexter, weak classifiers are essentially tests on a feature value, e.g., testing for the presence of a word n-gram. Boosting constructs the collection of base classifiers iteratively. On each iteration t, the boosting algorithm supplies the base learner with weighted training data, and the base learner generates a base classifier h_t. A set of non-negative weights w_t encodes how important it is that h_t correctly classify each training example. Generally, training examples that were most often misclassified by the preceding base classifiers are given the most weight, to force the base learner to focus on the hardest examples. BoosTexter uses confidence-rated base classifiers (Schapire and Singer, 1999) that, for every example x (in our case, the customer email message), output a real number h(x) whose sign (-1 or +1) is interpreted as a prediction (+1 indicates an emotional message) and whose magnitude |h(x)| is a measure of confidence. The output of the final classifier is f(x) = \sum_{t=1}^{T} h_t(x), i.e., the sum of the confidences of all base classifiers h_t built iteratively over T iterations. We map the real-valued predictions of the final classifier onto a confidence value between 0 and 1 by a logistic function: conf(x = emotional message) = 1 / (1 + e^{-f(x)}). This provides a convenient mechanism to set the rejection threshold for the detection of emotional emails. To minimize overfitting in BoosTexter, the number of iterations, i.e., the number of base classifiers generated, must be selected using validation data. This is what makes BoosTexter a feature-based learning method: in the final model, only the features used in the base classifiers are considered useful.
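As a minimal illustration (this is not the BoosTexter implementation, and the weak-learner outputs are assumed to be given), the following Python sketch combines the confidence-rated base classifier outputs, maps their sum to a [0, 1] confidence with a logistic function, and applies a rejection threshold.

```python
import math
from typing import List

def combined_score(weak_outputs: List[float]) -> float:
    """f(x) = sum_t h_t(x): sum of the confidence-rated weak classifier outputs."""
    return sum(weak_outputs)

def emotional_confidence(weak_outputs: List[float]) -> float:
    """Map the real-valued score onto [0, 1] with a logistic function."""
    return 1.0 / (1.0 + math.exp(-combined_score(weak_outputs)))

def classify(weak_outputs: List[float], threshold: float = 0.5) -> bool:
    """Flag the email as emotional only if the confidence clears the rejection threshold."""
    return emotional_confidence(weak_outputs) >= threshold

# Example: three hypothetical weak classifiers voting on one email.
print(classify([0.8, -0.1, 0.4], threshold=0.6))
```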
4.2. Support Vector Machines

SVMs (Support Vector Machines) are an example-based learning method in which weights for only a few examples, called support vectors, are considered necessary to express the classification model. SVMs use the l2-norm for regularization of the models during training. As a consequence, all input features are assigned non-zero weights and are present in the final model.

Given a set of M training input vectors X = {x_1, ..., x_M} and their corresponding target labels Y = {y_1, ..., y_M}, where each y_i ∈ {-1, +1}, SVMs, in their simplest form, learn linear classifiers characterized by the hyperplane w^T x = 0 in the input feature space (to include a bias term, the input vector x is extended with an extra element of value 1). A test example is classified as negative or positive depending on the side of this hyperplane on which it lies, i.e., w^T x < 0 or w^T x > 0, respectively. SVMs look for the separating hyperplane (w^T x = 0) that maximizes the separation margin between the two hyperplanes bounding the positive examples, i.e., w^T x = 1, and the negative examples, i.e., w^T x = -1. The margin is the minimum distance between these two hyperplanes in the direction of the weight vector w, and its value is 2 / \sqrt{\sum_{k=1}^{N} w_k^2}. Training examples falling on the positive and negative hyperplanes are called support vectors. So far we have described hard margin SVMs, where classification errors on the training set are not permitted. SVMs are generalized as soft margin SVMs, which allow some errors, i.e., vectors that are inside the margin or on the wrong side of it. In such cases the classification error can be quantified as \max(0, 1 - y_i w^T x_i), and the learning process entails minimizing the following loss function:

L = \sum_{i=1}^{M} \max(0, 1 - y_i w^T x_i) + \frac{1}{C} \sum_{k=1}^{N} w_k^2     (1)

Notice that the second term in this equation, which is responsible for maximizing the margin, is the regularization term; it expresses a preference for simpler models and avoids overtraining. This constrained optimization problem can be solved in its dual form, where the weight vector is expressed as a combination of the training support vectors:

w = \sum_{i=1}^{M} \alpha_i y_i x_i     (2)

The \alpha_i are Lagrange multipliers. They parameterize the weight vector in dual form and can be used to compute the classifier output as w^T x = \sum_{i=1}^{M} \alpha_i y_i x_i^T x. More generally, in the dual representation only computations of inner products between input vectors are needed, without ever requiring an explicit representation of the weight vector w. This property of the dual form of SVMs also makes learning higher order models tractable. It also enables, under some conditions, expansion of the feature space by using relations among the original features. We used this property of SVMs to experiment with polynomial kernels.

Another feature of the SVM implementation in LLAMA, which is important for our work, is the ability to assign different weights to different sets of features. For example, for two sets of features x_1 and x_2, if one assigns weights a_1 and a_2 respectively, then a linear classification model corresponding to the following relationship is learned:

y = \mathrm{sign}( a_1 (w_1^T x_1 / \|x_1\|) + a_2 (w_2^T x_2 / \|x_2\|) )     (3)

Here the individual feature vectors are normalized to fall on the unit sphere. As shown in Herbrich and Graepel (2002), for linear classifiers this results in optimal SVMs with respect to much tighter bounds on the margins. We also used this feature of LLAMA to explore and report on the nature of the feature space. For test examples, SVMs output their distance from the dividing hyperplane. This distance reflects the confidence in the classifier output. In our experiments, we used this confidence value to establish acceptance/rejection thresholds on the classifier output.
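LLAMA is an internal AT&T toolkit, so as a rough stand-in the sketch below uses scikit-learn's LinearSVC to illustrate the idea behind equation (3): each feature group is normalized onto the unit sphere and scaled by its own weight before the groups are concatenated and a linear SVM is trained. The data and the weight values a1 = 2.0 and a2 = 1.0 are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def weight_and_stack(x1: np.ndarray, x2: np.ndarray, a1: float, a2: float) -> np.ndarray:
    """Normalize each feature group onto the unit sphere, scale by its weight, concatenate."""
    def normalize(block: np.ndarray) -> np.ndarray:
        norms = np.linalg.norm(block, axis=1, keepdims=True)
        norms[norms == 0.0] = 1.0          # avoid division by zero for all-zero rows
        return block / norms
    return np.hstack([a1 * normalize(x1), a2 * normalize(x2)])

# Hypothetical toy data: 4 emails, 3 salient features and 5 unigram features each.
salient = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]], dtype=float)
unigrams = np.random.RandomState(0).rand(4, 5)
labels = np.array([1, -1, 1, -1])          # +1 = emotional email

X = weight_and_stack(salient, unigrams, a1=2.0, a2=1.0)   # salient features get twice the weight
clf = LinearSVC(C=1.0).fit(X, labels)

# The signed distance from the hyperplane serves as the confidence for thresholding.
print(clf.decision_function(weight_and_stack(salient, unigrams, a1=2.0, a2=1.0)))
```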
5. EMOTIONAL EMAIL DETECTION USING SALIENT FEATURES

An emotional email is a reaction to a perceived excessive loss of time and/or money by the customer. Expressions of such reactions in emails are salient features of emotional emails. For emails in the context of customer service we have identified the eight features listed below. While many of these features are of a general nature and can be present in most customer service related emotional emails, in this paper we make no claims about the completeness of this list.

(1) Negative Emotions: Customers express their negative emotions explicitly with phrases like "it upsets me" or "I am frustrated". Customers also express such emotions using the orthographic features listed in Section 5.1.
(2) Negative Opinions: Customers express their negative opinions about the company with evaluative expressions like "dishonest dealings" or "disrespectful". These can also be insulting expressions like "stink", "suck", and "idiots". Sometimes a clear distinction between expressions of Negative Opinions and Negative Emotions is difficult; for example, in the sentence "That is insulting to me", "insulting" can be interpreted both as an opinion and as an emotional expression.
(3) Business Threats: Customers threaten to take their business elsewhere with expressions like "business elsewhere" or "look for another provider". These expressions are neither emotional nor evaluative opinions, but they explicitly refer to cancelling a service or to a major change in the customer's relationship with the business.
(4) Report Threats: Customers threaten to report the company to authorities using phrases like "federal agencies" or "consumer protection". These are domain-dependent names of official agencies; the mere presence of such names implies a customer threat.
(5) Legal Threats: Customers threaten to take legal action by using phrases like "seek retribution" or "lawsuit". These expressions, too, need not be emotional or evaluative in nature.

(6) No Justification: Customers try to argue that the company has no justification for treating them the way it did. A common way of doing this is by saying things like "I am a long time customer" or "a loyal customer". The semantic orientations of most phrases used to express this feature are positive. Therefore, simply looking for negative opinions to label an email message as emotional is not sufficient, unless more sophisticated analysis is carried out to identify the opinion holders and the targets of the opinions (see Wiebe et al. (2005)).
(7) Disassociation: When customers are not happy about a service or think that they have been wronged, they tend to disassociate themselves from the company. They express this disassociation with phrases like "you people" or "your service representative". These are analogous to the rule classes Noun Appositions and Second-Person, respectively, described in Spertus (1997).
(8) Causes: There is always a reason for an emotional message. Customers typically use verb phrases like "grossly overcharged" or "on hold for hours" to describe what was done wrong to them. Many such phrases contain customer opinions. We distinguish them from Negative Opinions using the following criteria: a) generally, Causes are verb phrases while Negative Opinions are adjectives or noun phrases; b) targets of Negative Opinions are objects or concepts of a general nature, while those of Causes are more specific. For example, "dishonest dealings" is an expression of Negative Opinions and "threatening collection calls" of Causes. When such a distinction is not possible, we consider both features to be present.

(1) You are making this very difficult for me. I was assured that my <SERVICE> would remain at <CURRENCY> per month. But you raised it to <CURRENCY> per month. If I had known you were going to go back on your word, I would have looked for another Internet provider. Present bill is <CURRENCY>, including <CURRENCY> for <SERVICE>.

(2) I cannot figure out my current charges. I have called several times to straighten out a problem with my service for <PHONENO1> and <PHONENO2>. I am tired of being put on hold. I cannot get the information from the automated phone service.

FIGURE 1. Email samples: 1) emotional; 2) neutral

While labeling the training data (see Section 6 for details), labelers look for salient features within the email message and evaluate the severity of the complaints and the level of customer dissatisfaction. For example, the first email fragment in Fig. 1 contains several salient features: a) "difficult for me", expressing Negative Emotions (difficulties cause sadness); b) "you raised it", expressing Causes; c) "looked for another Internet provider", expressing Business Threats. It is labeled as emotional because the customer complaint is severe to the point that the customer may cancel the service. The second email fragment in Fig. 1 also contains several salient features: a) "tired", expressing Negative Emotions; b) "cannot get information" and "problem with my service", expressing Causes. It is not emotional because the customer's perceived loss is not severe to the point of service cancellation; this customer would be satisfied if she/he received the requested information in a timely fashion.

In addition to the word unigrams, we also use the eight binary-valued salient features of emotional emails for training/testing the emotional email classifier. To extract the salient features from an email, eight separate lists of the phrases customers use to express each of the salient features were manually created.
If any phrase from a list is present in an email message, then the corresponding salient feature is set to 1; otherwise it is set to 0. These lists were extracted from the training data and can be considered basic rules for identifying emotional emails. In the labeling guide for emotional emails, labelers were instructed to look for salient features in the emails and keep a list of the phrases they encountered. Excerpts of these lists are shown in Table 1. We manually enriched these lists by: a) using general knowledge of English to add variations to existing phrases, and b) searching a large body of email messages (different from those used for testing) for phrases in which seed words from known phrases participated. For example, from the known phrase "lied to" we used the seed word "lied" and found the phrase "blatantly lied". Using these lists, we extracted eight binary salient features for each email message, indicating the presence or absence of phrases from the corresponding list in the email.
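As an illustration, this table lookup could be sketched as follows; the phrase lists here are tiny hypothetical excerpts in the spirit of Table 1, not the full lists described above.

```python
from typing import Dict, List

# Hypothetical excerpts of the manually created phrase lists (see Table 1).
SALIENT_PHRASES: Dict[str, List[str]] = {
    "negative_emotions": ["frustrated", "sick and tired", "outraged"],
    "negative_opinions": ["ridiculous", "worst service", "rude"],
    "business_threats": ["business elsewhere", "close my account"],
    "report_threats": ["better business bureau", "consumer protection"],
    "legal_threats": ["lawsuit", "attorney"],
    "no_justification": ["long time customer", "loyal customer"],
    "disassociation": ["you people", "your service representative"],
    "causes": ["overcharged", "on hold for hours", "lied"],
}

def salient_features(email_text: str) -> Dict[str, int]:
    """Set each of the eight binary features to 1 if any phrase from its list occurs."""
    text = email_text.lower()
    return {name: int(any(phrase in text for phrase in phrases))
            for name, phrases in SALIENT_PHRASES.items()}

print(salient_features("I was grossly overcharged and I am sick and tired of calling you people."))
```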

Salient feature (total phrase count): example phrases

Negative Emotions (61): emotional distress, hassle, hate, powerless, unhappy, upset, sick and tired, pissed off, infuriating, aggravation, ashamed, disgusted, frustrated, feel strongly, drive me nuts, nightmare, outraged, pain, makes me angry, last straw, insulting
Negative Opinions (158): abuse, arrogant, atrocious, bullshit, clueless, not helpful, worst way possible, worst service, worse, wrongful, unreasonable, unhelpful, unfriendly, unbelievable, stupid, stinks, ridiculous, rude, predatory pricing, outrageous, liar, negligent, moron, insulting, not competent, absolutely disgusting, fraudulent
Business Threats (44): another company, business elsewhere, move my account, reconsidering my provider, discontinue, cancelling everything, changing my phone company, leave you, close my account
Report Threats (22): bbb, better business bureau, chamber of commerce, utility board, federal agencies
Legal Threats (13): attorney, illegal, suing, retribution, litigation, lawsuit, lawyer
No Justification (12): long time customer, loyal customers, faithful, good customer, have your service since
Disassociation (14): you folks, your customer service, your service representatives, your representative, your company, your people, your service, you people, you bozo, your kind, your so-called
Causes (77): bait and switch, corporate greed, double talk, duped, empty promises, excuses, extortion, false advertising, false promises, go back on your word, not honor your word, on hold for hours, never showed up, lied, overcharged, playing games, punish, rake me over the coals, rectify, rigamarole, rip us off, run around, screwed up, shortchanged, swindled, threaten, violated, threatening collection calls, paying more

TABLE 1. Example phrases for the different salient features

In Table 2 we show the distribution of each salient feature in email messages labeled as emotional in the training data. As can be seen, although salient features have a higher tendency to occur in emotional emails, they frequently also occur in non-emotional emails.

TABLE 2. Distribution of salient features in emotional emails in the training data: for each salient feature, the number of occurrences and the percentage occurring in emails labeled as emotional.

5.1. Orthographic Features

Besides direct expressions of their emotional state in words, users also use other orthographic means to express their emotions. In this work, we have used three such orthographic features of email messages.

(1) Repeated punctuation: Users tend to use repeated question marks or exclamation marks to express their frustration and anxiety. An example is shown in Fig. 2. In our implementation we consider this feature present only if the email message contains more than one instance of such repeated punctuation.
(2) Uppercase words: The use of some words in upper case in a message is typically interpreted as shouting. An example is shown in Fig. 3. For this work we consider this feature present if the email message contains more than three uppercase words, but uppercase words make up less than 50% of the message.
(3) Email length: In general, we observed that emotional email messages are not short. Users with severe complaints and a higher degree of dissatisfaction tend to write longer messages.
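A minimal sketch of these three checks, assuming a simplified whitespace tokenization, might look like this.

```python
import re
from typing import Dict

def orthographic_features(email_text: str) -> Dict[str, int]:
    """Extract the three orthographic features described above."""
    words = email_text.split()

    # (1) More than one run of repeated '?' or '!' characters.
    repeated_runs = re.findall(r"[?!]{2,}", email_text)
    repeated_punctuation = int(len(repeated_runs) > 1)

    # (2) More than three all-uppercase words, but fewer than 50% of all words.
    tokens = [w.strip(".,;:!?\"'()") for w in words]
    upper = [t for t in tokens if len(t) > 1 and t.isalpha() and t.isupper()]
    uppercase_words = int(3 < len(upper) < 0.5 * max(len(words), 1))

    # (3) Email length in words (emotional emails tend to be longer).
    return {"repeated_punctuation": repeated_punctuation,
            "uppercase_words": uppercase_words,
            "length": len(words)}

print(orthographic_features("WHY is my service STILL DISCONNECTED????? I have called FIVE TIMES!!!!"))
```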

I was just wondering how many emails and telephone calls does it take to actually get my voicemail activated????? Do I have to switch to <COMPANY1> or <COMPANY2> to get this resolved???? I have spent about 3 hours of my time either on hold, emailing, or explaining to my associates why they can't leave a message... Just wondering and VERY frustrated!!!!!

FIGURE 2. Email sample: using repeated punctuation to express emotions

Sir: Today, I woke up to turn on the TV and there was NOTHING of the channels I had chosen AVAILABLE. My home phone has been DISCONNECTED! My correspondence to the receipt of a Disconnection Notice was NOT ADDRESSED by you AT ALL. So, a few minutes ago I made payment of <CURRENCY> ONLY. I am REQUESTING that my <SERVICE> REMAIN DISCONNECTED as I don't need it. I'm ALSO REQUESTING that the <SERVICE> BE TERMINATED, the box be removed from my home, and the antenna be removed and replaced with my <COMPANY> ANTENNA. This was handled badly and I SURELY DON'T WANT IT REPEATED. Thanks for nothing

FIGURE 3. Email sample: using uppercase words to express emotions

In our data we did not find many emoticons; therefore, we did not use them as features. This is not surprising, because intuitively people are not inclined to put funny symbols in highly charged, argumentative messages.⁶

6. EXPERIMENTS AND EVALUATION

We performed several experiments to compare the performance of our emotional email classifier. For these experiments, we labeled 620 email messages as training examples and 457 email messages as test examples. These examples were randomly selected from a corpus of over two million customer emails. Training examples were labeled independently by two different labelers⁷ with a relatively high degree of agreement between them; the observed Kappa value was considerably higher than the Kappa values reported for emotion labeling tasks (Alm and Sproat, 2005; Litman and Forbes-Riley, 2004a). Because of the relatively high agreement between these labelers, we did not feel the need to check the agreement among more than two labelers. Table 3 shows that emotional email messages make up about 12-13% of the population. Due to the limited size of the training data, we used a leave-one-out cross-validation technique on the test set to evaluate the outcomes of the different experiments. In this round-robin approach, each example from the test set is classified using a model trained on all remaining 1,076 (620 plus 456) examples. Test results over all 457 test examples are averaged.

⁶ Noted by one of the anonymous reviewers of this paper.
⁷ One of the labelers was one of the authors of this paper; the other also had a linguistics background.
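A sketch of this leave-one-out protocol, together with the precision/recall/F-measure computation used to score the experiments in this section, might look as follows; train_fn is a stand-in for a BoosTexter or LLAMA training call (neither is publicly available), and predict_one is a hypothetical single-example prediction method on the returned model.

```python
from typing import Callable, List, Sequence, Tuple

def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Balanced F1 for the positive (emotional email) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def leave_one_out(train_X: List, train_y: List[int], test_X: List, test_y: List[int],
                  train_fn: Callable) -> Tuple[float, float, float]:
    """Classify each test example with a model trained on all remaining examples."""
    predictions = []
    for i, x in enumerate(test_X):
        rest_X = train_X + test_X[:i] + test_X[i + 1:]
        rest_y = train_y + test_y[:i] + test_y[i + 1:]
        model = train_fn(rest_X, rest_y)          # e.g. a boosting or SVM trainer
        predictions.append(model.predict_one(x))  # hypothetical single-example predict
    return precision_recall_f1(test_y, predictions)
```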

TABLE 3. Distribution of emotional emails in the training and test sets (number of examples and percentage labeled as emotional in each set).

Throughout all of our experiments, we computed the classification performance of detecting emotional emails using precision, recall, and F-measure. Specifically, we used the balanced F1 score, giving equal weight to recall and precision. One could also use other evaluation criteria, depending on the desired recall/precision trade-off. Notice that for our test data a majority-vote classifier has a classification accuracy of 87%, but since none of the emotional emails are identified, recall and F-measure are both zero. On the other hand, a classifier that generates many more false positives for each true positive will have a lower classification accuracy but a higher (non-zero) F-measure than the majority-vote classifier.

Fig. 4 shows precision/recall curves for the different experiments using boosting. The black circles represent the operating point corresponding to the best F-measure for each curve. The actual values of these points are provided in Table 4. As a baseline experiment, we used word unigram features to train a classifier model. Word unigram features were extracted by selecting only those words in the emails that appeared at least 3 times in the training data; word matching was done in a case-insensitive manner. The curve labeled Unigram Features in Fig. 4 shows the performance of this classifier. The best F-measure in this case is low; this performance can be attributed to the small training set and the large feature space formed by the word unigrams.

TABLE 4. Recall, precision, and F-measure at the best F-measure operating point using boosting, for the following feature sets: Word Unigram Features; rule-based thresholding on salient feature counts (thresholds 2, 3, and 4); Salient Features; Word Unigram & Salient Features; Word Unigram, Salient & Orthographic Features; Word Unigram & Random Features.

6.1. Salient Features

The baseline system was compared with a similar system using salient features. First, we used a simple classification rule that we formulated by looking at the training data: if an email message contains 3 or more salient features, it is classified as an emotional email. We classified the test data using this rule and obtained a substantially higher F-measure (see the row labeled 3 in Table 4). Since no confidence thresholding can be applied to a deterministic rule, its performance is indicated by the single point marked with a gray circle in Fig. 4. This result clearly demonstrates the high utility of our salient features. To verify that the salient feature threshold count of 3 used in our simple classification rule is the best, we also evaluated the performance of the rule for salient feature threshold counts of 2 and 4 (rows labeled 2 and 4 in Table 4).

In our next set of experiments, we trained classifier models using the salient features with and without word unigrams. The corresponding cross-validation results on the test data are annotated in Table 4 and in Fig. 4 as Salient Features and Word Unigrams & Salient Features, respectively.
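For illustration, the unigram vocabulary selection (words occurring at least 3 times in the training data, matched case-insensitively) and the simple count-based rule can be sketched as follows, building on the salient_features function from the earlier sketch; the threshold of 3 is the one used in the rule above.

```python
from collections import Counter
from typing import Dict, List

def build_unigram_vocabulary(training_emails: List[str], min_count: int = 3) -> List[str]:
    """Keep only words that appear at least min_count times in the training data (case-insensitive)."""
    counts = Counter(word for email in training_emails for word in email.lower().split())
    return sorted(word for word, c in counts.items() if c >= min_count)

def unigram_features(email_text: str, vocabulary: List[str]) -> Dict[str, int]:
    """Binary presence feature for each vocabulary word."""
    words = set(email_text.lower().split())
    return {w: int(w in words) for w in vocabulary}

def rule_based_emotional(email_text: str, threshold: int = 3) -> bool:
    """Classify as emotional if at least `threshold` of the eight salient features fire.

    Relies on salient_features() defined in the earlier phrase-lookup sketch.
    """
    return sum(salient_features(email_text).values()) >= threshold
```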

FIGURE 4. Precision/recall curves for the different experiments using boosting. Large black circles indicate the operating point with the best F-measure.

The incremental improvement in the best F-measure clearly shows that: a) BoosTexter is able to learn better rules than the simple rule of identifying three or more salient features; and b) even though salient features provide a significant improvement in performance, there is still discriminative information in the unigram features. A direct consequence of the second observation is that detection performance can be further improved by extending or refining the phrase lists and/or by using more labeled data to exploit the discriminative information in the word unigram features.

To test the contribution of the orthographic features of emails, we also added them to the feature set. However, as shown in the row annotated Word Unigram, Salient and Orthographic Features in Table 4, detection performance deteriorated. With boosting this was a counterintuitive result; we discuss possible reasons for it at the end of this section.

The salient features of emotional emails are a consequence of our knowledge of how customers react to an excessive perceived loss. To empirically demonstrate that the eight different salient features used in the identification of emotional emails do provide complementary evidence, we randomly redistributed the phrases over eight lists. We then used these lists to extract eight binary features in the same manner as before. The best F-measure for this experiment is shown in the last row of Table 4, labeled Word Unigram & Random Features. The degradation in performance (more than 10% absolute in F-measure) clearly demonstrates that the salient features we use provide complementary, not redundant, information.

6.2. Results using SVMs

To compare the results obtained from a feature-based learning algorithm⁸ with those from an example-based learning algorithm, we repeated our experiments using SVMs. Table 5 shows the results obtained with linear SVMs, and Table 6 shows the results obtained with second order polynomial SVMs.

⁸ We also experimented with a MaxEnt classifier, which, like boosting, is a feature-based learning algorithm; it provided results similar to boosting.

Results from the second order polynomials show a similar trend, but are not as good as those of the linear SVMs. We believe this may be due to the data sparseness problem.

TABLE 5. Recall, precision, and F-measure at the best F-measure operating point using linear SVMs, for the feature sets: Word Unigram Features; Salient Features; Word Unigram & Salient Features; Word Unigram, Salient & Orthographic Features.

TABLE 6. Recall, precision, and F-measure at the best F-measure operating point using second order polynomial SVMs, for the same feature sets.

From these tables one can observe that the performance of both the linear and the second order polynomial SVMs is slightly inferior to that of boosting. However, adding the orthographic features improves performance for both the linear and the second order polynomial SVMs. Our results show that salient features are more discriminative of emotional email messages than the unigram words in the text. This implies that, when modeling, more weight must be given to the salient features than to the word unigram features. Recall that boosting is feature-based and automatically gives more weight to more discriminating features. Furthermore, since boosting uses the l1-norm for regularization, the number of features considered by BoosTexter equals the number of boosting iterations; other features are completely ignored. To avoid overfitting, the results of the boosting experiments reported in Table 4 were obtained by tuning the number of iterations on development data. SVMs, on the other hand, are example-based and use the l2-norm for regularization: all features of the support vector examples are used (see equation (2)). This could be one reason why the detection performance of SVMs in our experiments is inferior to boosting and why the addition of orthographic features is helpful; in the former case many noisy features are considered by the SVMs, while in the latter case many discriminative features are ignored by boosting.

Similar to boosting, where the number of iterations needs to be optimized, the weight assignments to the different feature sets must also be tuned in SVMs (see equation (3)) to obtain the best performance. We carried out experiments with different weight settings (parameters a_1 and a_2 in equation (3)) for the salient features and the unigram word features using linear SVMs. Fig. 5 shows the best F-measure obtained using linear SVMs for different ratios of the weights on salient and unigram word features. The best performance is obtained when the weight on the salient features is more than twice that on the unigram word features. The best detection performance (F-measure of 0.746) is obtained by using both salient and orthographic features along with the unigram word features. This amounts to a 42% improvement over the baseline F-measure obtained by using only the unigram word features.

7. CONCLUSIONS

Customer email messages complaining about unfair treatment are often emotional and are critical for businesses. They provide valuable feedback for improving business processes and coaching agents.

FIGURE 5. Best F-measure obtained using linear SVMs for different weight settings for the salient features and the unigram word features.

Furthermore, careful handling of such emails helps to improve customer retention. In this paper, we presented a method for emotional email identification. We analyzed email messages in the context of customer service and identified eight salient features. We demonstrated higher agreement between two labelers labeling emotional emails using the presence/absence of salient features in the messages. We also demonstrated that extracting salient features and certain orthographic features from email messages and using them to train classifier models results in significantly better models, even when very little training data is used. Compared to a baseline classifier that uses only the word unigram features, the addition of the salient and orthographic features substantially improved the F-measure (a 42% relative improvement). We compared and analyzed classifier performance using two different categories of large margin classifiers, i.e., feature-based (boosting) and example-based (SVMs), and demonstrated that SVMs perform better than boosting when one set of features (the salient and orthographic features) is more discriminating than another (the unigram word features).

Labeling the large amounts of training and test data necessary to build accurate models for identifying emotional email messages is very expensive. This work demonstrates that the model building process can be bootstrapped using domain knowledge (the lists of phrases that make up the salient features) obtained by analyzing a small amount of data. This seed knowledge can then be used in an unsupervised or semi-supervised manner to automatically extract more such knowledge from a large corpus of customer emails. In our current work we are focusing on this problem by leveraging publicly available semantic orientation dictionaries and by syntactic pattern matching.

8. ACKNOWLEDGMENT

The authors thank Patrick Haffner for his help with text classification, particularly his guidance on how to leverage features in LLAMA to obtain the best classification models. The authors also thank Amanda Stent for reviewing this paper and providing valuable suggestions and corrections.

REFERENCES

Alm, C. and Sproat, R. (2005). Emotional sequencing and development in fairy tales. In Proceedings of the First International Conference on Affective Computing and Intelligent Interaction.
Alm, C. O., Roth, D., and Sproat, R. (2005). Emotions from text: machine learning for text-based emotion prediction. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Morristown, NJ, USA. Association for Computational Linguistics.
Busemann, S., Schmeier, S., and Arens, R. G. (2000). Message classification in the call center. In Proceedings of the Sixth Conference on Applied Natural Language Processing. Morgan Kaufmann.
Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at cjlin/libsvm.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1).
Dave, K., Lawrence, S., and Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of WWW.
Esuli, A. and Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM-05), Bremen, DE.
Gamon, M. (2004). Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of COLING 2004, Geneva, Switzerland.
Gupta, N., Tur, G., Hakkani-Tür, D., Bangalore, S., Riccardi, G., and Rahim, M. (2005). The AT&T Spoken Language Understanding System. IEEE Transactions on Speech and Audio Processing, 14(1).
Gupta, N., Di Fabbrizio, G., and Haffner, P. (2010a). Capturing the stars: Predicting ratings for service and product reviews. In Proceedings of the Workshop on Semantic Search at the North American Chapter of the Association for Computational Linguistics (NAACL-2010), Los Angeles, CA.
Gupta, N., Gilbert, M., and Di Fabbrizio, G. (2010b). Emotion detection in email customer care. In Proceedings of the HLT-NAACL Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, CA, USA.
Haffner, P. (2006a). Fast transpose methods for kernel learning on sparse data. In Proceedings of the 23rd International Conference on Machine Learning. ACM, New York, NY, USA.
Haffner, P. (2006b). Scaling large margin classifiers for spoken language understanding. Speech Communication, 48(3-4).
Hatzivassiloglou, V. and McKeown, K. (1997). Predicting the semantic orientation of adjectives. In Proceedings of the Joint Association for Computational Linguistics and European Chapter of the Association for Computational Linguistics Conference (ACL/EACL).
Herbrich, R. and Graepel, T. (2002). A PAC-Bayesian margin bound for linear classifiers. IEEE Transactions on Information Theory, 48(12).
Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Kim, S.-M. and Hovy, E. (2004). Determining the sentiment of opinions. In Proceedings of the International Conference on Computational Linguistics (COLING).
Lampert, A., Dale, R., and Paris, C. (2010). Detecting emails containing requests for action. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, Morristown, NJ, USA. Association for Computational Linguistics.
Lee, C. M. and Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2).
Litman, D. and Forbes-Riley, K. (2004a). Annotating student emotional states in spoken tutoring dialogues. In Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue (SIGdial), Boston, MA.
Litman, D. and Forbes-Riley, K. (2004b). Predicting student emotions in computer-human tutoring dialogues. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Barcelona, Spain.
Liu, H., Lieberman, H., and Selker, T. (2003). A model of textual affect sensing using real-world knowledge. In IUI '03: Proceedings of the 8th International Conference on Intelligent User Interfaces, Miami, Florida, USA. ACM Press.
Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics (ACL).
Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86, Philadelphia, Pennsylvania.
Platt, J. C. (1999). Using analytic QP and sparseness to speed training of support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, Cambridge, MA. MIT Press.
Schapire, R. (1999). A brief introduction to boosting. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
Schapire, R. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3).
Schapire, R. and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2).
Searle, J. R. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press.
Spertus, E. (1997). Smokey: Automatic recognition of hostile messages. In Proceedings of Innovative Applications of Artificial Intelligence.
Turney, P. and Littman, M. (2003). Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems, 21(4).
Vapnik, V. N. (1998). Statistical Learning Theory. John Wiley & Sons.
Wiebe, J., Wilson, T., and Cardie, C. (2005). Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39.
Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Morristown, NJ, USA. Association for Computational Linguistics.
Yu, H. and Hatzivassiloglou, V. (2003). Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).


More information

Sentiment-Oriented Contextual Advertising

Sentiment-Oriented Contextual Advertising Sentiment-Oriented Contextual Advertising Teng-Kai Fan, Chia-Hui Chang Department of Computer Science and Information Engineering, National Central University, Chung-Li, Taiwan 320, ROC tengkaifan@gmail.com,

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis

CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

More information

Sentiment Classification. in a Nutshell. Cem Akkaya, Xiaonan Zhang

Sentiment Classification. in a Nutshell. Cem Akkaya, Xiaonan Zhang Sentiment Classification in a Nutshell Cem Akkaya, Xiaonan Zhang Outline Problem Definition Level of Classification Evaluation Mainstream Method Conclusion Problem Definition Sentiment is the overall emotion,

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

RRSS - Rating Reviews Support System purpose built for movies recommendation

RRSS - Rating Reviews Support System purpose built for movies recommendation RRSS - Rating Reviews Support System purpose built for movies recommendation Grzegorz Dziczkowski 1,2 and Katarzyna Wegrzyn-Wolska 1 1 Ecole Superieur d Ingenieurs en Informatique et Genie des Telecommunicatiom

More information

Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

More information

Capturing the stars: predicting ratings for service and product reviews

Capturing the stars: predicting ratings for service and product reviews Capturing the stars: predicting ratings for service and product reviews Narendra Gupta, Giuseppe Di Fabbrizio and Patrick Haffner AT&T Labs - Research, Inc. Florham Park, NJ 07932 - USA {ngupta,pino,haffner}@research.att.com

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

Fine-grained German Sentiment Analysis on Social Media

Fine-grained German Sentiment Analysis on Social Media Fine-grained German Sentiment Analysis on Social Media Saeedeh Momtazi Information Systems Hasso-Plattner-Institut Potsdam University, Germany Saeedeh.momtazi@hpi.uni-potsdam.de Abstract Expressing opinions

More information

Sentiment Analysis for Movie Reviews

Sentiment Analysis for Movie Reviews Sentiment Analysis for Movie Reviews Ankit Goyal, a3goyal@ucsd.edu Amey Parulekar, aparulek@ucsd.edu Introduction: Movie reviews are an important way to gauge the performance of a movie. While providing

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis

Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH

SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH 1 SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH Y, HONG, N. GAUTAM, S. R. T. KUMARA, A. SURANA, H. GUPTA, S. LEE, V. NARAYANAN, H. THADAKAMALLA The Dept. of Industrial Engineering,

More information

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED 17 19 June 2013 Monday 17 June Salón de Actos, Facultad de Psicología, UNED 15.00-16.30: Invited talk Eneko Agirre (Euskal Herriko

More information

Classification of Virtual Investing-Related Community Postings

Classification of Virtual Investing-Related Community Postings Classification of Virtual Investing-Related Community Postings Balaji Rajagopalan * Oakland University rajagopa@oakland.edu Matthew Wimble Oakland University mwwimble@oakland.edu Prabhudev Konana * University

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

WHITE PAPER Speech Analytics for Identifying Agent Skill Gaps and Trainings

WHITE PAPER Speech Analytics for Identifying Agent Skill Gaps and Trainings WHITE PAPER Speech Analytics for Identifying Agent Skill Gaps and Trainings Uniphore Software Systems Contact: info@uniphore.com Website: www.uniphore.com 1 Table of Contents Introduction... 3 Problems...

More information

Sentiment analysis on tweets in a financial domain

Sentiment analysis on tweets in a financial domain Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International

More information

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

TOOL OF THE INTELLIGENCE ECONOMIC: RECOGNITION FUNCTION OF REVIEWS CRITICS. Extraction and linguistic analysis of sentiments

TOOL OF THE INTELLIGENCE ECONOMIC: RECOGNITION FUNCTION OF REVIEWS CRITICS. Extraction and linguistic analysis of sentiments TOOL OF THE INTELLIGENCE ECONOMIC: RECOGNITION FUNCTION OF REVIEWS CRITICS. Extraction and linguistic analysis of sentiments Grzegorz Dziczkowski, Katarzyna Wegrzyn-Wolska Ecole Superieur d Ingenieurs

More information

Predicting the Stock Market with News Articles

Predicting the Stock Market with News Articles Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is

More information

WE DEFINE spam as an e-mail message that is unwanted basically

WE DEFINE spam as an e-mail message that is unwanted basically 1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir

More information

Microblog Sentiment Analysis with Emoticon Space Model

Microblog Sentiment Analysis with Emoticon Space Model Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems

More information

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

More information

Semantic Data Mining of Short Utterances

Semantic Data Mining of Short Utterances IEEE Trans. On Speech & Audio Proc.: Special Issue on Data Mining of Speech, Audio and Dialog July 1, 2005 1 Semantic Data Mining of Short Utterances Lee Begeja 1, Harris Drucker 2, David Gibbon 1, Patrick

More information

Getting Even More Out of Ensemble Selection

Getting Even More Out of Ensemble Selection Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand qs12@cs.waikato.ac.nz ABSTRACT Ensemble Selection uses forward stepwise

More information

A Survey on Product Aspect Ranking Techniques

A Survey on Product Aspect Ranking Techniques A Survey on Product Aspect Ranking Techniques Ancy. J. S, Nisha. J.R P.G. Scholar, Dept. of C.S.E., Marian Engineering College, Kerala University, Trivandrum, India. Asst. Professor, Dept. of C.S.E., Marian

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Mining Opinion Features in Customer Reviews

Mining Opinion Features in Customer Reviews Mining Opinion Features in Customer Reviews Minqing Hu and Bing Liu Department of Computer Science University of Illinois at Chicago 851 South Morgan Street Chicago, IL 60607-7053 {mhu1, liub}@cs.uic.edu

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

Designing Novel Review Ranking Systems: Predicting Usefulness and Impact of Reviews

Designing Novel Review Ranking Systems: Predicting Usefulness and Impact of Reviews Designing Novel Review Ranking Systems: Predicting Usefulness and Impact of Reviews Anindya Ghose aghose@stern.nyu.edu Panagiotis G. Ipeirotis panos@nyu.edu Department of Information, Operations, and Management

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Analyzing survey text: a brief overview

Analyzing survey text: a brief overview IBM SPSS Text Analytics for Surveys Analyzing survey text: a brief overview Learn how gives you greater insight Contents 1 Introduction 2 The role of text in survey research 2 Approaches to text mining

More information

Trading Strategies and the Cat Tournament Protocol

Trading Strategies and the Cat Tournament Protocol M A C H I N E L E A R N I N G P R O J E C T F I N A L R E P O R T F A L L 2 7 C S 6 8 9 CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS MARK GRUMAN MANJUNATH NARAYANA Abstract In the CAT Tournament,

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Facilitating Business Process Discovery using Email Analysis

Facilitating Business Process Discovery using Email Analysis Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process

More information

CSE 473: Artificial Intelligence Autumn 2010

CSE 473: Artificial Intelligence Autumn 2010 CSE 473: Artificial Intelligence Autumn 2010 Machine Learning: Naive Bayes and Perceptron Luke Zettlemoyer Many slides over the course adapted from Dan Klein. 1 Outline Learning: Naive Bayes and Perceptron

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

How To Learn From The Revolution

How To Learn From The Revolution The Revolution Learning from : Text, Feelings and Machine Learning IT Management, CBS Supply Chain Leaders Forum 3 September 2015 The Revolution Learning from : Text, Feelings and Machine Learning Outline

More information

Mimicking human fake review detection on Trustpilot

Mimicking human fake review detection on Trustpilot Mimicking human fake review detection on Trustpilot [DTU Compute, special course, 2015] Ulf Aslak Jensen Master student, DTU Copenhagen, Denmark Ole Winther Associate professor, DTU Copenhagen, Denmark

More information

A Method for Automatic De-identification of Medical Records

A Method for Automatic De-identification of Medical Records A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA tafvizi@csail.mit.edu Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA mpacula@csail.mit.edu Abstract

More information

The Optimality of Naive Bayes

The Optimality of Naive Bayes The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

More information

Semantic Sentiment Analysis of Twitter

Semantic Sentiment Analysis of Twitter Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He & Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 11 th International Semantic Web Conference

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

Semantic parsing with Structured SVM Ensemble Classification Models

Semantic parsing with Structured SVM Ensemble Classification Models Semantic parsing with Structured SVM Ensemble Classification Models Le-Minh Nguyen, Akira Shimazu, and Xuan-Hieu Phan Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa,

More information

A fast multi-class SVM learning method for huge databases

A fast multi-class SVM learning method for huge databases www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

Comparative Experiments on Sentiment Classification for Online Product Reviews

Comparative Experiments on Sentiment Classification for Online Product Reviews Comparative Experiments on Sentiment Classification for Online Product Reviews Hang Cui Department of Computer Science School of Computing National University of Singapore cuihang@comp.nus.edu.sg Vibhu

More information

A Game Theoretical Framework for Adversarial Learning

A Game Theoretical Framework for Adversarial Learning A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,

More information