Integration of Negative Emotion Detection into a VoIP Call Center System



Integration of Negative Emotion Detection into a VoIP Call Center System
Tsang-Long Pao, Chia-Feng Chang, and Ren-Chi Tsao
Department of Computer Science and Engineering, Tatung University, Taipei, Taiwan

Abstract - The speech signal carries not only the semantics of the spoken words but also the emotional state of the speaker. By analyzing the voice signal to recognize the emotion hidden in it, it is possible to identify the emotional state of the speaker. By integrating a speech emotion recognition system into a VoIP call center system, we can continuously monitor the emotional state of the service representatives and the customers. In this paper, we propose a framework that integrates speech emotion recognition into a VoIP call center system. With this setup, we can detect in real time the emotion in the conversation between service representatives and customers. The system displays the emotional states of the conversation on a monitoring console and, whenever a negative emotion is detected, issues an alert to the service manager, who can then react promptly to the situation.

Keywords: Speech Emotion Recognition, Call Center, Negative Emotion Detection, WD-KNN Classifier

1 Introduction

The voice signal in a conversation conveys both the semantics of the spoken words and the emotional state of the speaker. In current call center systems, if a dispute arises between a service representative and a customer, there is no way to notify the service manager to take action immediately. Since customer service plays an important role in an enterprise, customer satisfaction is very important, and managing the customer service department to improve satisfaction is a key issue. Traditional customer service lacks the ability to issue real-time alerts for conversations with negative emotion. For example, when a customer disagrees with the service representative, a dispute may arise.
A traditional call center handles a large number of calls every day and usually records all calls for later analysis to check whether any conversation was improper. However, such setups cannot handle a dispute in a timely manner. A considerable number of studies on speech emotion recognition have been made over the past decades [1-16]. By integrating a speech emotion recognition system into the VoIP call center system, we can continuously monitor the emotional state of the service representatives and the customers. In this paper, we propose a mechanism to integrate a negative emotion detection engine into a VoIP call center system. A parallel processing architecture is implemented to meet the performance requirements of the system. We also record the emotional states of all calls in a database. An alert is issued to the service manager whenever a negative emotion, such as anger, is detected, giving the manager a chance to intervene between the two quarreling parties, pacify the customer, and resolve the problem immediately. This mechanism can enhance the level of customer satisfaction.

The organization of this paper is as follows. In Section 2, the background and related research on speech emotion recognition and voice over Internet Protocol (VoIP) are reviewed. In Section 3, the architecture of a multi-line negative emotion detection system in a VoIP call center is described. In Section 4, the experimental setup is presented and the results are discussed. Conclusions are given in Section 5.

2 Background

2.1 Speech Emotion Recognition

In the past, many researchers studied human emotion and tried to define what an emotion is, but emotional categories are hard to define because there is no single universally agreed definition. The emotion categories defined by Ortony and Turner are commonly accepted [10].
In recent years, psychological studies have tended to divide emotions into basic emotions and complex emotions, where a complex emotion is derived from basic emotions. In addition to the semantics of the spoken words, the speech signal also carries information about the emotional state of the speaker: inside the speech signal there are features related to the emotional state at the time the speech was produced. By analyzing these features, it is possible to classify the emotion with a suitable classifier.

Features related to speaking rate, signal amplitude, frequency, jitter, and formants have been studied in speech emotion research [1-4]. Previous studies of speech emotion have addressed several aspects. Some tried to find the acoustic features most relevant to the emotion inside the speech signal [13-16]. Searching for the most suitable machine learning algorithm for the classifier has also attracted considerable attention [7][9-12]. Most previous studies used short speech corpora in their experiments. In this research, however, we must deal with continuous speech, which is long in nature. Therefore, we need a way to properly segment the speech signal and categorize the emotion so that bursts of misclassification will not affect the accuracy of the judgment.

A corpus is a large collection of utterance segments and plays an important role in emotion research. In this paper, the D80 corpus built in previous studies is used to train our emotion recognition engine. The corpus was collected from 18 males and 16 females who were given 20 scripts and asked to speak each of them in five emotions: anger, boredom, happiness, neutral, and sadness. A subjective test was performed and only utterances with over 80% agreement were kept, leaving 570 utterances. The number of utterances in each emotion category of the D80 corpus is 151 for anger, 83 for boredom, 96 for happiness, 116 for neutral, and 124 for sadness.

A considerable number of previous studies have addressed feature selection to improve recognition accuracy. The core of feature selection is to reduce the dimension of the feature set to the smallest feature combination that yields the highest accuracy.
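The feature selection idea described above can be illustrated with a greedy sequential forward search, in the spirit of the sequential floating forward selection studied in [15]. The sketch below is purely illustrative and is not the authors' procedure: the feature names and the scoring table stand in for a real cross-validated recognition accuracy.

```python
from typing import Callable, List, Set

def forward_select(features: List[str],
                   evaluate: Callable[[Set[str]], float]) -> Set[str]:
    """Greedy sequential forward selection: repeatedly add the single
    feature that most improves the evaluation score, stopping when no
    addition helps."""
    selected: Set[str] = set()
    best_score = evaluate(selected)
    improved = True
    while improved:
        improved = False
        for f in set(features) - selected:
            score = evaluate(selected | {f})
            if score > best_score:
                best_score, best_feature = score, f
                improved = True
        if improved:
            selected.add(best_feature)
    return selected

# Illustrative scorer: pretend MFCC alone does well and MFCC + jitter
# does slightly better; any unlisted combination scores zero.
scores = {frozenset(): 0.2,
          frozenset({"mfcc"}): 0.7,
          frozenset({"jitter"}): 0.4,
          frozenset({"mfcc", "jitter"}): 0.75}
evaluate = lambda s: scores.get(frozenset(s), 0.0)
print(forward_select(["mfcc", "jitter", "zcr"], evaluate))
```

With a real scorer, `evaluate` would train the classifier on the candidate feature subset and return its validation accuracy; the greedy search then finds a small subset without trying all 2^N combinations.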
The speech features commonly used in previous emotion research include formants (F1, F2 and F3), shimmer, jitter, Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), the first derivative of MFCC (dMFCC), the second derivative of MFCC (ddMFCC), Log Frequency Power Coefficients (LFPC), Perceptual Linear Prediction (PLP), and Zero-Crossing Rate (ZCR). According to previous studies, MFCC is the most commonly used feature for emotion recognition [13]. Therefore, we choose the MFCC feature as the acoustic input to the classifier in this research.

In machine learning, the purpose of a classifier is to group objects with similar characteristics into the same class. Classification algorithms can be divided into two types: supervised (e.g., KNN) and unsupervised (e.g., k-means). In this research, we use the Weighted D-KNN (WD-KNN) classifier, a variant of KNN proposed in [11]. KNN assigns a test sample to a class based on the distances between the test sample and its k nearest training samples. WD-KNN extends KNN by comparing weighted distance sums to maximize classification accuracy; the weighting assigns higher weights to neighbors that provide more reliable information.

For an M-class classification using KNN, let N_k(y) be the k neighbours nearest to an unknown test sample y, and let c(z) be the class label of training sample z. The subset of N_k(y) belonging to class j, for j = 1, ..., M, is

    N_k^j(y) = { z in N_k(y) : c(z) = j }    (1)

If we denote the cardinality (the number of elements) of the set N_k^j(y) by |N_k^j(y)|, then KNN classifies y into the class j* with the majority vote:

    j* = arg max_j |N_k^j(y)|    (2)

For WD-KNN classification, we instead select the k nearest neighbors from each class. Let d_i^j denote the Euclidean distance between the i-th nearest neighbor in class j and the unknown sample y, with the distances in ascending order, d_i^j <= d_{i+1}^j. The weighted distance sum from sample y to its k nearest neighbors in class j is

    D^j = sum_{i=1..k} w_i * d_i^j    (3)

where w_i >= w_{i+1} for all i. As discussed in [11], the best recognition rate is obtained by weighting with the reverse-ordered Fibonacci sequence:

    w_i = w_{i+1} + w_{i+2},  w_{k-1} = w_k = 1    (4)

WD-KNN then classifies y into the class j* with the shortest weighted distance sum:

    j* = arg min_j D^j    (5)

In summary, a speech emotion recognition system takes the voice signal as input and recognizes the emotion of the speaker at the time the speech was produced: features are extracted from the signal, and the most likely emotion is judged by a classification algorithm. The block diagram of a speech emotion recognition system is shown in Figure 1. In this system, the

selected features are extracted and sent to the classifier to determine the most probable emotion of each segment.

Figure 1. Block diagram of the speech emotion recognition system (emotional speech feature database, preprocessing, feature extraction, classification, and evaluation result).

2.2 VoIP Telephony and Packet Capture Techniques

VoIP, also known as IP telephony, is a technology that uses the internet for telephone communication. In the past, IP phones were used mainly inside enterprises; owing to the rapid growth of the internet, however, IP telephony is now widely adopted and is gradually replacing the traditional telephone system. Currently, the most commonly used VoIP signaling protocol is the Session Initiation Protocol (SIP). SIP is a protocol developed by the IETF MMUSIC working group for the establishment, modification, and termination of interactive call sessions. Possible applications include voice and video communication, instant messaging, online games, virtual reality, and other multimedia applications. SIP defines the format and the control of the packets transmitted over the internet.

The packet capture tool used in this study is WinPcap. WinPcap is a tool for link-layer network access in Windows environments. It includes a kernel-level packet filter, a dynamic link library (packet.dll), and a high-level, system-independent library, and it can be used on the Win32 platform to capture network packets. WinPcap consists of a driver through which the operating system accesses the low-level network, and a library that application programs can use to access the low-level network layers easily. In addition, the WinPcap kernel provides packet filtering: through the filter settings, the driver discards unwanted packets directly in the driver layer. The performance of the packet filter is good, so WinPcap is widely used in software such as Wireshark.
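The WD-KNN decision rule of Section 2.1, Eqs. (1)-(5), can be sketched compactly in code. The following is a minimal Python illustration assuming plain Euclidean distances over feature vectors; it is not the authors' MATLAB engine, and the training data shown is hypothetical.

```python
import math
from collections import defaultdict

def reverse_fibonacci_weights(k):
    """Eq. (4): w_k = w_{k-1} = 1 and w_i = w_{i+1} + w_{i+2}, so the
    nearest neighbours carry the largest weights."""
    w = [1.0] * k
    for i in range(k - 3, -1, -1):
        w[i] = w[i + 1] + w[i + 2]
    return w

def wd_knn(train, labels, y, k=3):
    """Classify y as the class with the smallest weighted sum of
    distances to its k nearest neighbours in that class (Eqs. 3 and 5)."""
    # Group Euclidean distances to y by class label.
    dists = defaultdict(list)
    for x, c in zip(train, labels):
        dists[c].append(math.dist(x, y))
    w = reverse_fibonacci_weights(k)
    best_class, best_sum = None, float("inf")
    for c, ds in dists.items():
        ds.sort()                      # ascending: d_1 <= d_2 <= ...
        D = sum(wi * di for wi, di in zip(w, ds[:k]))  # Eq. (3)
        if D < best_sum:               # Eq. (5): arg min over classes
            best_class, best_sum = c, D
    return best_class
```

For k = 5 the weight vector is [5, 3, 2, 1, 1], the Fibonacci sequence in reverse order, which is how closer neighbours get proportionally more say in the decision.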
3 System Architecture

In this paper, we propose a framework that integrates a speech emotion recognition system into a VoIP call center system. The components of the system are shown in Figure 2. The customer and the service representative communicate with each other through VoIP phones. We define a session as the conversation between a customer and a service representative. We use a layer-2 switch to mirror the packets going in and out of the IP PBX into our packet capture agent. We then identify the session and, if it is new, assign it a Session ID (SID). After the session is established, we extract the voice signal from the captured RTP packets and regroup it into proper segments. The speech emotion recognition system classifies the emotion of each segment, and the result is stored in a database. The system issues an alert whenever a negative emotion, mainly anger, is detected.

SIP is a point-to-point protocol. Using a distributed architecture, SIP transmits text-based information and names addresses by URL. SIP uses a syntax similar to other internet protocols such as HTTP and SMTP, with messages consisting of headers and a message body. There are two types of SIP operation for connecting clients: connecting the two clients directly in a point-to-point manner, or connecting the parties through a proxy server. In this research, we adopt the proxy server configuration to simplify the packet capture process. The IP PBX (IP Private Branch eXchange) works as the proxy server and relays the packets. We capture the packets going in and out of the IP PBX by port mirroring on the switch the server is connected to.

Figure 2. Components of the proposed system

The operation steps of the system are shown in Figure 3. First, the system captures the packets and filters out the SIP and RTP packets.
The Speech Emotion Recognition System (SERS) then assigns a Session ID (SID) according to the phone numbers of the service representative and

customers. The speech segments are sent to the emotion recognition engine to classify the emotional state. Finally, the classification results are stored in the database. The service manager can review all the results through a web page interface, and the system issues an alert to the service manager when a negative emotion is detected.

Figure 3. Operations of the speech emotion recognition system (packet capture, session identification, voice data reconstruction, negative emotion detection, database, alert)

3.1 Packet Capture and Session Identification

The packet is the smallest unit in network communication. Packets are transmitted across the internet through a series of switches and routers. A packet consists of a header and a payload. The header carries the control information for the transmission of the packet, including the source and destination IP addresses, source and destination ports, etc. In our experimental environment, all VoIP packets pass through the SIP proxy server, so we can use the port mirror mechanism of the layer-2 switch to mirror the packets going in and out of the SIP server. We can then capture the packets and analyze their content easily.

In the call center, each phone line is independent, but once a call is established, the phone number will not change during the call. Session identification is therefore necessary for storing the captured RTP packets with the correct pair of conversation partners. We build a session object to manage each session; it includes the source IP, setup time, speech coding, and the buffers used. We create a session object when a session starts and remove it when the phone call ends. To implement multi-line packet capture, we first identify the source and destination IP addresses of each packet. Based on this information, we assign a session ID to the session and then allocate storage for the session ID and related information.
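The session bookkeeping just described can be sketched as a small table keyed by the pair of endpoint addresses. This is a minimal illustration only: the field names, the choice of an unordered IP pair as the session key, and the integer SIDs are assumptions for the sketch, not the system's actual data layout.

```python
import itertools
import time
from dataclasses import dataclass, field

@dataclass
class Session:
    """Illustrative session object: source IP, setup time, and buffers."""
    sid: int
    src_ip: str
    dst_ip: str
    setup_time: float = field(default_factory=time.time)
    packets: list = field(default_factory=list)

class SessionTable:
    """Assign one Session ID (SID) per customer/representative pair and
    route captured RTP payloads to the right conversation."""
    def __init__(self):
        self._next = itertools.count(1)
        self._by_pair = {}

    def sid_for(self, src_ip, dst_ip):
        # Both directions of the same conversation share one session.
        key = frozenset((src_ip, dst_ip))
        if key not in self._by_pair:
            self._by_pair[key] = Session(next(self._next), src_ip, dst_ip)
        return self._by_pair[key].sid

    def store(self, src_ip, dst_ip, payload):
        sid = self.sid_for(src_ip, dst_ip)
        self._by_pair[frozenset((src_ip, dst_ip))].packets.append(payload)
        return sid

    def close(self, src_ip, dst_ip):
        # Remove the session object when the phone call ends.
        self._by_pair.pop(frozenset((src_ip, dst_ip)), None)
```

A real implementation would key on the full RTP 5-tuple and the SIP Call-ID rather than bare IP addresses, but the lookup-or-create pattern is the same.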
3.2 Emotion Recognition Engine

Integrating the emotion recognition function into a VoIP call center system requires more effort than speech emotion recognition alone. Some scholars have attempted to combine multiple speech features to increase the accuracy of speech emotion recognition. However, to process multiple lines of voice segments simultaneously in real time, the engine needs to reduce its computational complexity. Previous studies have shown MFCC to be one of the most robust features for emotion recognition [12], so we use only the MFCC feature, with WD-KNN as our classifier.

To process multiple lines of voice segments, we may need more than one computer for the speech emotion recognition, so we design a mechanism to share the load among several recognition engines. A virtual machine architecture is used to fully utilize the computing power of a high-performance server: voice segments are distributed to each virtual machine in turn. This parallel processing mechanism resolves the bottleneck caused by the high computational cost of the recognition engine.

3.3 Negative Emotion Detection

The D80 corpus contains five basic emotion categories, which we further divide into a positive and a negative class. The positive class includes happiness and neutral; the negative class includes anger, sadness, and boredom. In this paper, we focus on detecting the anger incurred in conversations between service representatives and customers. The system issues an alert whenever anger is detected during a conversation. To decrease the false alarm rate, a score board judgment mechanism is implemented. The score board is zero at the beginning of a session. When the recognized emotion is anger, the score increases by 4 points.
When no anger is detected in a voice segment, the score decreases by 1 point if it is positive. The system confirms the anger emotion when the score exceeds a threshold: an alert flag is written to the database whenever the score exceeds the threshold, and an alert message is sent to the service manager through a web page interface whenever the flag is set.

4 Experiments and Result Discussion

4.1 Experimental Setup

The proposed system consists of three subsystems, each described in detail in this section. The framework of the proposed system is shown in Figure 4.

Figure 4. Framework of the proposed system (packet capture and session identification, file system, recognition engines 1 to N, negative emotion detection, alert, database)

The Asterisk IP PBX server is installed on a Linux platform. In this research, we assign SIP IDs 1xx as the phone numbers of customers, 4xx for the service manager, and 6xx for the service representatives. The SIP phones are configured in proxy mode, so that all SIP and RTP packets between the customer and the service representative pass through the IP PBX. With this configuration, we can capture all the SIP and RTP packets by mirroring the traffic going into the server on the layer-2 switch to which the server is attached.

The session information is stored in a table structure. When a new SIP or RTP packet arrives, the packet capture module captures it, analyzes the header, and compares it with the contents of the session table. If it belongs to a new session, the system creates a new session entry and assigns it a new session ID; otherwise, the existing session ID is retrieved.

4.2 Recognition Engine

We capture the voice signal from the conversation between the customer and the service representative. The captured voice is regrouped into WAV sound files of 1 second each. Each segmented voice file is sent to the speech emotion recognition engine, which outputs the recognition result. To increase the performance of the emotion recognition engine, we use a parallel architecture: we set up several machines and install the engine, written in MATLAB, on each of them. This structure avoids the bottleneck of the speech emotion recognition system. The output of each engine is stored in a database, and the results are analyzed using the score board algorithm described above to determine whether anger exists in the conversation.
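The regrouping of the decoded voice stream into 1-second WAV files can be sketched as follows. The sample rate and sample width below are assumptions (8 kHz mono 16-bit linear PCM, typical for decoded narrowband telephony audio), not parameters stated in the paper, and the segments are built in memory rather than on disk.

```python
import io
import wave

RATE = 8000          # assumed 8 kHz narrowband telephony audio
SAMPLE_WIDTH = 2     # assumed 16-bit linear PCM after codec decoding

def one_second_segments(pcm: bytes):
    """Split a decoded mono PCM stream into 1-second WAV segments
    (in-memory); a trailing partial second is dropped."""
    step = RATE * SAMPLE_WIDTH          # bytes per second of audio
    for off in range(0, len(pcm) - step + 1, step):
        buf = io.BytesIO()
        with wave.open(buf, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(SAMPLE_WIDTH)
            w.setframerate(RATE)
            w.writeframes(pcm[off:off + step])
        yield buf.getvalue()

# 2 seconds plus 100 extra samples of silence -> two full segments;
# the 100-sample remainder is discarded.
segments = list(one_second_segments(b"\x00\x00" * (RATE * 2 + 100)))
print(len(segments))
```

Each yielded value is a complete WAV file, so it can be handed to a recognition engine exactly as a file on disk would be.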
4.3 Testing Samples

A corpus of six scripts was recorded by inviting volunteers to speak the scripts with whatever emotion they liked. Each recording was tagged subjectively by human judges, and only results with more than 80% agreement were kept. Human emotion seems to change gradually, so we use the score board concept to determine whether anger occurred in the conversation between the customer and the service representative. In our environment, the score is set to 0 at the beginning. If a negative emotion is detected, the system adds 4 points to the corresponding session; otherwise, it subtracts 1 point from that session if the score is positive. To reduce the false alarm rate, a threshold must be set. We tested several threshold values, including 4, 8, 12, 16, and 20, and compared the detection results with the human judges. The results of negative emotion detection are listed in Table 1.

Table 1. Number of detected negative emotions with different thresholds T.

          Human Judge   T=4   T=8   T=12   T=16   T=20
Script1   2             59    36    18     1      0
Script2   0             2     0     0      0      0
Script3   0             48    33    16     5      0
Script4   1             21    1     1      1      0
Script5   7             97    85    61     29     10
Script6   10            54    39    23     12     7

In Table 1, the negative emotion detection results are close to the human judges when the threshold is 16, except for scripts 3 and 5. The content of script 3 contains many happy sentences, and it is hard for a speech emotion recognition system to differentiate happiness from anger since both are high-activation emotions. The content of script 5 consists of continuous negative emotion; due to the score board architecture, it is hard to count the disputes correctly. The advantage of the score board framework is that it reduces judgment errors, but it cannot count the number of disputes exactly.
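The score board rule used above (+4 per anger segment, -1 per non-anger segment while the score is positive, alert once the score exceeds the threshold) can be sketched directly. One detail is an assumption of this sketch: the alert fires only once per session, whereas repeated alerting is a policy choice left open in the text.

```python
class ScoreBoard:
    """Accumulate evidence of anger over successive 1-second segments
    and raise an alert when the score crosses the threshold."""
    def __init__(self, threshold=16):
        self.threshold = threshold
        self.score = 0
        self.alerted = False

    def update(self, emotion: str) -> bool:
        """Feed one segment's recognition result; return True when the
        alert should be raised for this session."""
        if emotion == "anger":
            self.score += 4
        elif self.score > 0:
            self.score -= 1
        if self.score > self.threshold and not self.alerted:
            self.alerted = True   # would also write the alert flag to the DB
            return True
        return False

# With threshold 16, five consecutive anger segments (score 20) trigger
# the alert; the following neutral segment only decays the score.
board = ScoreBoard(threshold=16)
alerts = [board.update(e) for e in ["anger"] * 5 + ["neutral"]]
print(alerts)
```

The +4/-1 asymmetry means isolated misclassifications decay away, while sustained anger accumulates quickly, which is exactly the false-alarm suppression the threshold experiment in Table 1 measures.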
For negative emotion detection as described above, the most important thing is to issue an alert when a dispute happens; the exact frequency of disputes is not an important issue in this research.

5 Conclusions

The call center plays an important role in an enterprise. Collecting opinions from customers, and pacifying a customer who is too agitated, are obviously important for the enterprise. To improve the quality of service of the call center, we can integrate a negative emotion detection mechanism into the call center system. When a service representative faces too many angry customers, the call distribution system can reduce the number of calls routed to that representative to keep the representative from becoming angry as well. In this paper, we propose and implement a framework that can handle multi-line phone calls with negative emotion detection capability. We modularize the system components to make the system more flexible, developing each subsystem individually: packet capture and analysis, emotion recognition, result recording, and negative emotion detection. For emotion recognition, we adopt a parallel architecture to avoid a possible bottleneck. By combining these subsystems, we build a system that can detect negative emotion in a conversation and issue an alert to the service manager accordingly. Consequently, this system can improve the service quality of a call center, because the service manager has a chance to intervene and resolve problems immediately.

Acknowledgement

The authors would like to thank the National Science Council (NSC) for financial support of this research under NSC project No. NSC 100-2221-E-036-043.

6 References

[1] H. Altun and G. Polat, "Boosting selection of speech related features to improve performance of multi-class SVMs in emotion detection," Expert Systems with Applications, 36, 8197-8203, 2009.
[2] C. Busso and S. S. Narayanan, "Interrelation Between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 8, pp. 2331-2347, 2007.
[3] C. Busso, S. Lee, and S. Narayanan, "Analysis of Emotionally Salient Aspects of Fundamental Frequency for Emotion Detection," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 4, pp. 582-596, May 2009.
[4] M. Cernak and C. Wellekens, "Emotional aspects of intrinsic speech variabilities in automatic speech recognition," International Conference on Speech and Computer, 405-408, 2006.
[5] Z. J. Chuang and C. H. Wu, "Multi-Modal Emotion Recognition from Speech and Text," International Journal of Computational Linguistics and Chinese Language Processing, Vol. 9, No. 2, pp. 1-18, 2004.
[6] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion Recognition in Human-Computer Interaction," IEEE Signal Processing Magazine, Vol. 18, No. 1, pp. 32-80, 2001.
[7] X. Jin and Z. Wang, "An Emotion Space Model for Recognition of Emotions in Spoken Chinese," First International Conference on Affective Computing and Intelligent Interaction, pp. 397-402, 2005.
[8] M. Lugger and B. Yang, "Extracting voice quality contours using discrete hidden Markov models," Proceedings of Speech Prosody, 2008.
[9] T. Nwe, S. Foo, and L. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, 41(4), 603-623, 2003.
[10] A. Ortony and T. J. Turner, "What's Basic about Basic Emotions?," Psychological Review, pp. 315-331, 1990.
[11] T. L. Pao, Y. M. Cheng, Y. T. Chen, and J. H. Yeh, "Performance Evaluation of Different Weighting Schemes on KNN-Based Emotion Recognition in Mandarin Speech," International Journal of Information Acquisition, Vol. 4, No. 4, pp. 339-346, Dec. 2007.
[12] T. L. Pao, Y. T. Chen, and J. H. Yeh, "Comparison of classification methods for detecting emotion from Mandarin speech," IEICE Transactions on Information and Systems, Vol. E91-D, No. 4, pp. 1074-1081, April 2008.
[13] T. L. Pao, Y. T. Chen, and J. H. Yeh, "Emotion Recognition and Evaluation from Mandarin Speech Signals," International Journal of Innovative Computing, Information and Control (IJICIC), Vol. 4, No. 7, pp. 1695-1709, July 2008.
[14] J. Rong, G. Li, and Y. P. Chen, "Acoustic feature selection for automatic emotion recognition from speech," Information Processing and Management, 45, 315-328, 2009.
[15] D. Ververidis and C. Kotropoulos, "Fast and accurate sequential floating forward feature selection with the Bayes classifier applied to speech emotion recognition," Signal Processing, 88, 2956-2970, 2008.
[16] B. Yang and M. Lugger, "Emotion recognition from speech signals using new Harmony features," Signal Processing, 90, 1415-1423, 2010.