Expert Systems with Applications 36 (2009)

Boosting selection of speech related features to improve performance of multi-class SVMs in emotion detection

Halis Altun *, Gökhan Polat
Nigde University, Faculty of Engineering, Department of Electrical and Electronics, Kampus, Nigde, Turkey

* Corresponding author. E-mail addresses: halisaltun@nigde.edu.tr, haltun@nigde.edu.tr (H. Altun), gpolat51@yahoo.com (G. Polat).

Keywords: Emotion detection; Feature selection; Speech analysis; Machine learning

Abstract

This paper deals with strategies for feature selection and multi-class classification in the emotion detection problem. The aim is two-fold: to increase the effectiveness of four feature selection algorithms and to improve the accuracy of multi-class classifiers for the emotion detection problem under different frameworks and strategies. Although a large amount of research has been conducted to determine the most informative features in emotion detection, it is still an open problem to identify reliably discriminating features. As highly informative features are believed to be a more critical factor than the classifier itself, recent studies have focused on identifying the features that contribute most to the classification problem. In this paper, in order to improve the performance of multi-class SVMs in emotion detection, 58 features extracted from recorded speech samples are processed in two new frameworks to boost the feature selection algorithms. Evaluation of the final feature sets validates that the frameworks are able to select a more informative subset of the features in terms of class separability. It is also found that, among the four feature selection algorithms, a recently proposed one, LSBOUND, significantly outperforms the others. The accuracy rate obtained in the proposed framework is the highest reported so far in the literature for the same dataset.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Emotion detection is currently a very active research field. It is considered an indispensable capability of future automated systems that are expected to interact with human beings. In this respect, emotion recognition from speech has attracted considerable interest from the research community (Altun, Shawe-Taylor, & Polat, 2007; Altun & Polat, 2007; Cowie et al., 2001; Fragopanagos & Taylor, 2005; Pantic, Sebe, Cohn, & Huang, 2005; Polat & Altun, 2007a; Reynolds, Ishikawa, & Tsujino, 2006; Shami & Verhelst, 2007; Ververidis & Kotropoulos, 2004, 2005; Wang & Guan, 2004). As it is considered a pattern classification problem, an automatic emotion detection system is typically composed of at least three main components: feature extraction, feature selection and classification. The final aim of emotion detection research is to build a system able to detect emotion in spoken speech. Although considerable effort towards identifying the best classifiers for emotion detection is witnessed in the literature (Polat & Altun, 2007a; Shami & Verhelst, 2007; Yacoub, Simske, Lin, & Burns, 2003), a recent tendency in emotion recognition is to seek a subset of speech-related features, from a high-dimensional feature space, that would give better accuracy in classifying emotional states (Fernandez & Picard, 2005; Polat & Altun, 2007b; Schuller, Arsic, Wallhoff, & Rigoll, 2006; Xiao, Dellandrea, Dou, & Chen, 2005).
However, despite a large amount of research, it is still an open problem to identify reliably informative features for this task (Juslin & Scherer, 2005). Xiao et al. (2005) employed Fisher's discriminant ratio to select the more informative features from 50 speech-related features. Their results indicate that the energy distribution in the frequency domain and the speech rate contribute significantly to emotion detection. In a study by Yacoub et al. (2003), 37 prosodic features related to pitch, loudness and segments were extracted. Out of the 37 features, 19 were selected using the standard Forward Selection algorithm, which ranks the features using 10-fold cross validation. Then a comparison between Neural Network (NN), Support Vector Machine (SVM), k-nearest neighbour and decision tree classifiers was carried out. They reported that, in classifying multiple emotions, prosodically close emotion classes should be grouped together and separate features and classifiers should be employed across groups. In a more recent study, Ververidis and Kotropoulos (2005) performed emotion detection using a Gaussian Mixture Model (GMM). A more sophisticated feature selection algorithm, Sequential Floating Forward Selection (SFFS), was employed to find the best 10 features among 65 global statistical features. Results showed that SFFS improved the accuracy of the Naïve Bayes classifier by 3%, compared to the results of their previous work, where SFS feature selection was used in the same setting.

Fernandez and Picard (2005) highlighted results from an extensive investigation developing new features and comparing them with classical features, using machine learning techniques to recognize five emotional states. SFFS with the leave-one-out (LOO) generalization error of a k-nearest neighbour (k-NN) classifier was used in the feature selection phase to rank 87 features. The discriminative ability of the selected feature set was evaluated by SVMs, with their generalization error estimated through 15-fold cross validation. Feature selection for emotion detection in noisy speech has been discussed by Schuller et al. (2006), who employed the Information Gain Ratio to select the best features from a larger set. The selected features were given to an SVM classifier in order to evaluate their impact on the performance of the classifier.

In the works reported above, a single wrapper or filter type feature selection algorithm was employed to find the most informative features. However, the quality of the selected features is highly dependent on the algorithm employed: the subset of selected best features is completely dependent on the ability of the algorithm used to rank the set of features, and each feature selection algorithm will end up with a different subset of features as the best feature subset. Therefore, there is an obvious need to define a framework within which it is more likely to obtain a reliable subset of features. In this paper, two recently proposed frameworks (Polat & Altun, 2007a) and three multi-class classifiers are employed to evaluate the ability of the frameworks to determine a more informative feature set. The underlying property of the frameworks is to decompose a multi-class classification problem into binary classification problems, in either a one-vs-rest or a one-vs-one manner. Although some variable selection methods, such as the Sequential Forward Selection (SFS) method, treat the multi-class case directly rather than decomposing it into several two-class problems, it will be shown that decomposing the problem into binary classifications and reconstructing a final feature subset from a set of candidate feature subsets results in improved classifier accuracy. The feature selection algorithms are chosen to be one wrapper type, one filter type and two recently proposed embedded feature selection algorithms. The SFS algorithm with the LOO Cross Validation (LOOCV) error of a k-NN classifier is chosen as the wrapper type algorithm. The embedded algorithms are two state-of-the-art feature selection algorithms, based on the Least Squares SVM Bound (LSBOUND) (Zhou & Mao, 2005) and on the R2W2 concept (Weston et al., 2000), respectively. Finally, a filter approach based on Mutual Information (MUTINF) is used (Zaffalon & Hutter, 2002).

2. Feature extraction

The data used in the experiments is extracted from the Berlin Emotional Speech Database (EmoDB) (Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005). The utterances in the EmoDB are expressed by 10 professional actors (5 male and 5 female) in seven emotional styles: anger, boredom, disgust, fear, happiness, sadness, and neutral.
The total number of speech samples is 493; the set comprises 286 female and 207 male voices. In our experiments, 338 samples corresponding to 4 emotional classes, namely anger, happiness, neutral and sadness, have been used. Fifty-eight features have been extracted from the speech samples, as explained in Dogan (2006) and given in Table 1. There are four groups of features. Seventeen of them are prosodic features based on statistical properties of the fundamental frequency F0; among them, four new features have been defined. These four features are based on the concept of the maximum and minimum regions of F0. An F0 region is defined as a maximum F0 region if the fundamental frequency satisfies the criterion $F0 \geq F0_m$, where $F0_m$ is the mean of the pitch frequency given by

$F0_m = \frac{1}{n} \sum_{i=1}^{n} F0_i$

where $F0_i$ is the estimated pitch frequency in the current speech frame of 20 ms. Otherwise, the region is defined as a minimum F0 region, where F0 satisfies the criterion $F0 < F0_m$. Another five features are formed from the sub-band energies of the utterances, using sixth order elliptic filters with center frequencies of 400, 800, 1200, 1600 and 3200 Hz, respectively. Furthermore, 20 Mel-Frequency Cepstrum Coefficients (MFCC) and 16th order Linear Predictive Coding (LPC) parameters have been extracted from the speech samples as feature vectors.

Table 1. The set of speech related features (# indicates the number of features in each group).

  PROSODIC (17): Maximum, minimum, mean and standard deviation of F0; mean and standard deviation of F0 in the maximum region; mean and standard deviation of F0 in the minimum region; max, min, mean and standard deviation of the positive slope in F0; max, min, mean and standard deviation of the negative slope in F0; voiced/unvoiced ratio.
  SUBBAND (5): Sub-band energies.
  MFCC (20): Mel-frequency cepstrum coefficients.
  LPC (16): 16th order LPC coefficients.
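To make the maximum and minimum F0 region features concrete, the following is a minimal sketch that computes the basic F0 statistics and the region means and deviations. It assumes an F0 contour is already available as one pitch estimate per 20 ms voiced frame (the pitch tracker itself is not described here); the function name and input format are illustrative only.

    import numpy as np

    def f0_region_features(f0):
        # f0: 1-D array of per-frame pitch estimates in Hz (voiced frames only)
        f0_m = f0.mean()                 # F0_m = (1/n) * sum_i F0_i
        max_region = f0[f0 >= f0_m]      # frames where F0 >= F0_m
        min_region = f0[f0 < f0_m]       # frames where F0 < F0_m
        return {
            "f0_max": f0.max(), "f0_min": f0.min(),
            "f0_mean": f0_m, "f0_std": f0.std(),
            "max_region_mean": max_region.mean(),
            "max_region_std": max_region.std(),
            "min_region_mean": min_region.mean(),
            "min_region_std": min_region.std(),
        }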
3. Feature selection algorithms

As is well known, the presence of irrelevant features may reduce the accuracy of classifiers. In this sense, feature selection is typically used to achieve three objectives: to reduce the size of the feature set in order to improve the prediction performance of a classifier, to provide a faster and more computationally efficient classifier, and to provide a better understanding of the underlying process that generated the data (Guyon & Elisseeff, 2003). Two common methods are employed in feature selection algorithms: the filter and the wrapper methods (Jain & Zongker, 1997). The main difference between the two arises from the evaluation criterion. Feature subset selection in the wrapper method relies on the performance of the classifier, while the filter method employs intrinsic properties of the data, such as mutual information or a Mahalanobis class separability measure, as the criterion for feature subset evaluation. In most pattern recognition applications, the wrapper method is reported to outperform the filter method; however, wrapper methods are computationally expensive compared to filter methods. Mladenić, Brank, Grobelnik, and Milic-Frayling (2004) show that feature scoring and selection based on the normal vector obtained from a linear SVM combines very well with all the classifiers considered in their study. As the linear SVM classifier has an output prediction of the form $F(x) = \mathrm{sgn}(w^T x + b)$, where $F(x)$ is the prediction function, a feature with a weight close to zero has a small effect on the prediction, so the feature can be removed to obtain a more predictive subset of features. Similar approaches have been proposed in the literature for feature selection (Bennett, Embrechts, Breneman, & Song, 2003; Guyon & Elisseeff, 2003; Abe & Kudo, 2006). The idea in these approaches is to employ a relatively simple linear predictor in the feature selection algorithm and then to train a more complex non-linear predictor on the selected subset of features.
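As a hedged illustration of this weight-based pruning idea (a generic sketch, not the exact procedure of any of the cited works), the snippet below trains a linear SVM with scikit-learn and ranks features by the magnitude of their weights; X and y are placeholder training data.

    import numpy as np
    from sklearn.svm import LinearSVC

    def rank_features_by_svm_weight(X, y):
        # For a binary problem, coef_ holds the weight vector w of F(x) = sgn(w^T x + b)
        clf = LinearSVC(C=1.0).fit(X, y)
        w = np.abs(clf.coef_).ravel()
        return np.argsort(w)[::-1]  # feature indices from largest |w_j| to smallest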

In this study, we have employed four feature selection algorithms, which are detailed below: a wrapper, a filter and two embedded feature selection algorithms, respectively.

3.1. Sequential Forward Selection (SFS)

In SFS, features that have not already been selected are considered for selection on the basis of their impact on the objective function. Let $J(\cdot)$ be the objective function. Given a feature set $X = \{x_i \mid i = 1, 2, \ldots, m\}$, the aim of the SFS algorithm is to find a subset $Y_k \subseteq X$, where $k < m$, starting from the empty subset $Y_0 = \{\}$. A feature $x^+$ that has not been selected before becomes a candidate to be added to the subset based on the condition

$x^+ = \arg\max_{x \notin Y_k} \left[ J(Y_k + x) \right]$

On each iteration, exactly one feature is added to the feature subset. The process stops after an iteration in which no feature addition results in an improvement in accuracy, or when a pre-specified number of features has been reached. A more sophisticated version of SFS is Sequential Floating Forward Selection (SFFS); however, as shown by Reunanen (2003), contrary to common belief, intensive search techniques like SFFS do not necessarily outperform simpler and faster methods like SFS. SFS is suitable for performing feature selection in a multi-class classification problem directly. We refer to the algorithm as SFS (classic) if it is employed directly, without the proposed frameworks, and as SFS (proposed) if it is employed within the proposed frameworks. We have used SFS with the LOOCV error of a k-NN classifier as the objective function in our experiments.
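The following is a minimal sketch of SFS with the k-NN leave-one-out objective described above, using scikit-learn as a stand-in for the authors' implementation; for simplicity it stops only at a target subset size k.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score, LeaveOneOut

    def loo_accuracy(X, y, subset):
        # Objective J: LOOCV accuracy of a k-NN classifier on the candidate subset
        knn = KNeighborsClassifier(n_neighbors=1)
        return cross_val_score(knn, X[:, subset], y, cv=LeaveOneOut()).mean()

    def sfs(X, y, k):
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < k:
            # x+ = argmax over unselected features of J(Y_k + x)
            best_j = max(remaining, key=lambda j: loo_accuracy(X, y, selected + [j]))
            selected.append(best_j)
            remaining.remove(best_j)
        return selected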
3.2. Least Square Bound Feature Selection (LSBOUND)

Recently, a new feature selection algorithm has been proposed (Zhou & Mao, 2005). It is based on a new filter-like evaluation criterion, called the Least Square (LS) Bound measure, and combines the advantages of both filter and wrapper methods. The criterion is derived from the leave-one-out cross validation (LOOCV) procedure of the least squares support vector machine (LS-SVM) and is closely related to an upper bound on the LOOCV classification results. As estimating this bound implicitly involves training the classifier only once, without repeated cross validation, the computational complexity of the algorithm is significantly reduced compared with the classical wrapper method. It has been proved that, when an LS-SVM classifier is trained on the entire training set, if the corresponding Lagrange multiplier $\alpha_p^0$ of the training sample $x_p$ is positive, the following inequality holds in the leave-one-out procedure (Zhou & Mao, 2005):

$y_p f_p(x_p) \geq 1 - \alpha_p^0 \left[ (D_p^{\min})^2 + 2/\gamma \right]$    (1)

where $\alpha_p^0$ is the Lagrange multiplier of $x_p$ when the LS-SVM classifier is trained on the entire training set, $D_p^{\min}$ is the distance between $x_p$ and its nearest neighbour, and $\gamma$ is a positive value which penalizes errors. Here $f_p(x_p)$ is the validation result for the sample $x_p$ in the leave-one-out procedure: if $y_p f_p(x_p)$ is negative, the sample $x_p$ is counted as a leave-one-out error, and if $y_p f_p(x_p)$ is positive, $x_p$ is correctly classified. The bound $\alpha_p^0 [(D_p^{\min})^2 + 2/\gamma] - 1$ is related to both the corresponding training result and the nearest neighbour. A negative value of the bound indicates that the sample must be correctly classified in the leave-one-out procedure; a positive value indicates, to some extent, the probability of misclassification, and hence can be used to evaluate the goodness of the feature set. Combining the bounds for all training samples, the following measure, called the LS Bound measure, is proposed as the evaluation criterion for feature selection:

$M = \sum_{p=1}^{l} \left( \alpha_p^0 \left[ (D_p^{\min})^2 + 2/\gamma \right] - 1 \right)_+$    (2)

where $(x)_+ = \max(0, x)$. A simple greedy heuristic search algorithm such as SFS, or a more complex search algorithm such as SFFS, can be used with the LS Bound measure $M$ in Eq. (2) to form a new feature selection algorithm. In our study we use this approach with a linear kernel.

3.3. Mutual Information Based Feature Selection (MUTINF)

Mutual information (MUTINF) is a widely used information-theoretic measure of the stochastic dependency of random variables. In the context of a filter approach, one may employ mutual information to discard irrelevant features, retaining a small subset of features and dropping those with low mutual information. This approach relies on empirical estimates of the mutual information between each variable and the target:

$I(i) = \int_{x_i} \int_{y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)} \, dx\, dy$    (3)

where $p(x_i)$ and $p(y)$ are the probability densities of $x_i$ and $y$, and $p(x_i, y)$ is their joint density. In the case of discrete or nominal variables, the integral becomes a sum:

$I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\, P(Y = y)}$    (4)

where the probabilities are estimated from frequency counts of the input variables and the class distribution. In our study we have used the implementation of mutual information based feature selection explained in Zaffalon and Hutter (2002).

3.4. R2W2

R2W2 is a state-of-the-art feature selection algorithm especially designed for binary classification tasks using an SVM classifier (Weston et al., 2000). It can be considered a wrapper approach and indirectly exploits the maximal margin principle for feature selection. The idea is to find a weight vector over the features that minimizes the objective function of an SVM. For a training set of size $m$ whose samples lie within a sphere of radius $R$ and are separable with margin $M$, a bound on the expected generalization error of the SVM is given by

$E P_{\mathrm{err}} \leq \frac{1}{m} E\left\{ \frac{R^2}{M^2} \right\} = \frac{1}{m} E\left\{ R^2 W^2(\alpha^0) \right\}$    (5)

As the optimization of the objective function is accomplished using gradient descent, a new SVM optimization problem has to be constructed after each gradient step, making the procedure computationally expensive for large datasets. In our study, R2W2 is used together with an RBF kernel, defined as

$K(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2\sigma^2} \right)$    (6)

as implemented in the Spider package (Weston, Elisseeff, Bakır, & Sinz, 2006).
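As a concrete illustration of the LS Bound measure of Eq. (2), the sketch below evaluates M for a candidate feature subspace. It assumes the multipliers alpha come from an LS-SVM trained once on the full data (the training step is not shown) and that gamma is the error penalty; a smaller M suggests fewer potential leave-one-out errors.

    import numpy as np
    from scipy.spatial.distance import cdist

    def ls_bound_measure(X, alpha, gamma):
        # D_p_min: distance from each sample to its nearest neighbour
        d = cdist(X, X)
        np.fill_diagonal(d, np.inf)          # exclude the sample itself
        d_min = d.min(axis=1)
        bound = alpha * (d_min ** 2 + 2.0 / gamma) - 1.0
        return np.maximum(bound, 0.0).sum()  # (x)_+ = max(0, x), summed over samples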

4. Feature selection frameworks in the emotion detection problem

The feature selection approach used in this work is based on two recently proposed frameworks, whose effectiveness has been evaluated in the literature (Altun & Polat, 2007; Altun et al., 2007). Feature selection for emotion detection is performed as depicted in Fig. 1: the underlying idea of the frameworks is to decompose the multi-class classification problem into binary classification problems and then to perform feature selection for each sub-problem using one of the four feature selection algorithms. Two feature construction operators, namely intersection and unification, are then defined to construct the final feature set from the resulting subsets. At the final stage, multi-class classifiers are employed to determine the emotional state using the final feature set.

Fig. 1. Framework for feature selection in emotion detection.

In the first framework, called FRM1, the emotion detection problem is treated in a one-vs-rest manner: the problem is to discriminate one class of emotion from the rest, and class-specific features are expected to be selected by the feature selection algorithms. As there are M classes of emotional states, M subsets of features will be selected by each of the feature selection algorithms. In the second framework, labeled FRM2, the problem is organized in a one-vs-one manner. In this approach the feature selection algorithms are expected to select highly class-specific features which are informative in discriminating one class of emotion from another. The number of subsets produced in FRM2 is M(M - 1)/2.

In FRM1, each feature selection algorithm produces four subsets of features $S_i$, each corresponding to one of the four decomposed binary classification problems. In the feature construction stage, these subsets are processed to obtain the best feature set, following two strategies. In the first strategy, the intersection operator given in (7) is applied to the subsets $S_i$: every feature that occurs in more than one subset is selected to form the final set, labeled SET1. The operation is defined as

$\mathrm{SET1} = \bigcup_{\substack{i,j=1 \\ i \neq j}}^{N} \left( S_i \cap S_j \right)$    (7)

where N = M is the total number of subsets $S_i$. In the second strategy, the subsets $S_i$ are simply combined to form the final set, labeled SET2. This corresponds to performing a unification operation on the subsets:

$\mathrm{SET2} = \bigcup_{i=1}^{N} S_i$    (8)

where N = M again equals the total number of subsets. Due to the intersection operation, the number of selected features in SET1 is expected to be smaller than in SET2.

In FRM2, as emotion detection is organized as one-vs-one binary classification problems, each feature selection algorithm produces six subsets of features. The same steps are then followed as in FRM1: two final sets are formed from the subsets $S_i$ by performing the intersection and unification operations, respectively, and the final feature sets are labeled SET1 and SET2, following the convention employed in FRM1.
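The two construction operators are easy to state in code. The sketch below builds SET1 (Eq. (7): features occurring in at least two subsets) and SET2 (Eq. (8): the plain union) from the per-problem subsets S_i, here represented as Python sets of feature indices.

    from itertools import combinations

    def build_set1(subsets):
        # Union of pairwise intersections: features selected in more than one subset
        return set().union(*(a & b for a, b in combinations(subsets, 2)))

    def build_set2(subsets):
        # Plain unification of all subsets
        return set().union(*subsets)

    # Example with M = 4 one-vs-rest subsets:
    S = [{1, 2, 3}, {2, 3, 4}, {5}, {3, 6}]
    print(build_set1(S))  # {2, 3}
    print(build_set2(S))  # {1, 2, 3, 4, 5, 6}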
The final stage of the proposed frameworks is to detect the emotional states using multi-class classifiers, as illustrated in Fig. 1. Support Vector Machines (SVM) have been chosen as classifiers for this task. A classic way of performing multi-class classification with a binary SVM is to cast the multi-class problem into binary ones, either as one-vs-one classification (referred to here as SVM1) or as one-vs-rest classification (referred to as SVM2). In addition, we have also used a recently proposed multi-class classification approach, the maximal margin robot (MMR), proposed by Szedmak, Saunders, Shawe-Taylor, and Rousu (2005). For all SVM classifiers, the radial basis kernel with the same hyper-parameters has been used, and the five-fold Cross Validation (CV) error is calculated to evaluate the accuracy of the classifiers. The four feature selection algorithms produce four SET1s and four SET2s in each framework; as a result, eight final feature sets are employed to train a multi-class classifier in FRM1 and FRM2. The total number of classification tasks therefore amounts to 48, which is enough to carry out a fair comparison.
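A hedged sketch of this evaluation stage using scikit-learn is given below: both decompositions (SVM1: one-vs-one; SVM2: one-vs-rest) wrap an RBF-kernel SVM, and five-fold cross-validation accuracy is reported for a given final feature set. The hyper-parameter choices here are placeholders, not the settings used in the experiments.

    from sklearn.svm import SVC
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate_feature_set(X, y, feature_set):
        Xs = X[:, sorted(feature_set)]
        svm1 = OneVsOneClassifier(SVC(kernel="rbf"))   # one-vs-one decomposition
        svm2 = OneVsRestClassifier(SVC(kernel="rbf"))  # one-vs-rest decomposition
        acc1 = cross_val_score(svm1, Xs, y, cv=5).mean()
        acc2 = cross_val_score(svm2, Xs, y, cv=5).mean()
        return acc1, acc2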

Figs. 2 and 3 illustrate the average percentage of features selected from each feature group, namely the prosodic, sub-band energy, MFCC and LPC groups, in FRM1 and FRM2, respectively. Also, in order to illustrate the effectiveness of the proposed frameworks, the behavior of SFS (classic) and SFS (proposed) is compared in Table 2; for a fair comparison, the number of features selected by SFS (proposed) and by SFS (classic) is set to be identical. The figures are an indication of the informative power assigned to each feature group by the feature selection algorithms.

Fig. 2. Normalized percentage of features selected by the feature selection algorithms in FRM1.

Fig. 3. Normalized percentage of features selected by the feature selection algorithms in FRM2.

Table 2. Comparison of the SFS (classic) and SFS (proposed) algorithms: number of features and average accuracy (%) of SVM1 and SVM2 for each feature set. (SET1: the intersection operator is used to obtain the final feature set. SET2: the unification operator is used to obtain the final feature set. FRM1: feature selection is performed in the one-vs-rest framework. FRM2: feature selection is performed in the one-vs-one framework.)

A close inspection of Figs. 2 and 3 reveals that SFS (proposed) tends to select more features from the prosodic and sub-band energy feature groups than SFS (classic), while the number of features selected from the LPC group is decreased. As shown in Table 2, this tendency of the algorithm considerably reduces the CV error in general. This behavior is also consistent with findings in the literature that prosodic and energy related features are more informative (Cowie et al., 2001; Xiao et al., 2005).

The accuracy of the classifiers, in terms of the five-fold CV error of the SFS algorithm in the proposed and classic implementations, is illustrated in Table 2 using two multi-class classifiers (SVM1 and SVM2). In the table, the first row illustrates the case where no feature selection has been performed; the total number of features in this case is 58, and the baseline accuracy is 79.9% and 78.1% for SVM1 and SVM2, respectively. As shown in Table 2, SFS (classic) successfully reduces the number of features, but in most cases the accuracy of the classifiers deteriorates, producing a higher CV error than in the no-feature-selection case. On the other hand, the classifiers produce outperforming results when the features selected by SFS (proposed) are employed: the accuracy of the SVM1 and SVM2 classifiers is improved by up to 17.4% and 17.3%, respectively. As indicated above, the SFS algorithm in the proposed frameworks tends to emphasize the prosodic and sub-band energy related features, which improves the performance of the classifiers. These results indicate that the SFS algorithm is able to select more informative features in the proposed frameworks.

Table 3a. Average accuracy of the classifiers (MMR, SVM1, SVM2) associated with the SFS algorithm, for the feature sets NONE, SET1-FRM1, SET1-FRM2, SET2-FRM1 and SET2-FRM2.

Table 3b. Average accuracy of the classifiers (MMR, SVM1, SVM2) associated with the LSBOUND algorithm, for the feature sets SET1-FRM1, SET1-FRM2, SET2-FRM1 and SET2-FRM2.

Figs. 2 and 3 also reveal that the LSBOUND algorithm in the proposed frameworks tends to select more MFCC features, both in FRM1 and in FRM2, than the rest of the algorithms, while the number of LPC parameters selected by LSBOUND is in general decreased. As shown in Tables 3b and 6, this behaviour of the algorithm is the main contributor to its success: LSBOUND is the most successful algorithm in further reducing the average CV error. This tendency of the LSBOUND algorithm seems highly plausible, as MFCC related features are expected to distinguish clearly between characteristics of speech related to the voice source and those related to the vocal tract, as a result of the log operation in the frequency domain.

Fig. 4 highlights the performance of the proposed frameworks in terms of the average number of features selected from each distinct feature group. The figure reveals that the feature selection algorithms in FRM1 generally place more emphasis on the MFCC and sub-band energy feature groups, while the number of features selected from the LPC feature group is reduced in FRM1.

Fig. 4. Comparison of FRM1 and FRM2 in terms of the average number of selected features.

Table 3c. Average accuracy of the classifiers (MMR, SVM1, SVM2) associated with the R2W2 algorithm, for the feature sets SET1-FRM1, SET1-FRM2, SET2-FRM1 and SET2-FRM2.

Table 3d. Average accuracy of the classifiers (MMR, SVM1, SVM2) associated with the MUTINF algorithm, for the feature sets SET1-FRM1, SET1-FRM2, SET2-FRM1 and SET2-FRM2.

Table 4. Average accuracy of the classifiers (MMR, SVM1, SVM2) associated with the feature selection frameworks FRM1 and FRM2.

This behavior of the algorithms in each framework explains the higher success rate of the classifiers in FRM1 compared to FRM2, which is summarized in Table 4.

5. Emotion detection using multi-class SVM classifiers

After the successful application of the proposed frameworks with the SFS algorithm, a comprehensive comparison is carried out for all three multi-class classifiers using the feature sets selected by the LSBOUND, R2W2 and MUTINF algorithms, respectively. The accuracy of the classifiers associated with each feature selection algorithm is illustrated in Table 3. As seen from Table 3, the accuracy of the classifiers is improved considerably; the LSBOUND algorithm in particular seems to be the most successful in reducing the CV error of the classifiers (see Table 3b). In order to clearly illustrate the effects of the proposed frameworks, the feature construction strategies and the feature selection algorithms, Table 3 is summarized in Tables 4, 5 and 6, respectively.

As seen in Table 4, the multi-class classifiers in FRM1 are more successful in reducing the average CV error. Among them, SVM1 produces the lowest average CV error (82.5% accuracy). The success of SVM1 is also apparent in the second framework (81.3% accuracy). The classifiers MMR and SVM2 are comparable to each other in terms of average CV error.

In terms of the unification and intersection operators, the classifier SVM1 clearly produces less error than the other multi-class classifiers, as seen in Table 5; this is especially remarkable when SVM1 is trained using the final feature sets obtained by the unification approach. There is no clear evidence, however, for a decisive conclusion on which feature construction strategy is preferable: although, in general, the unification operation reduces the average CV error further than the intersection operation, the SET1s contain fewer features on average than the SET2s (27.75 features on average), a finding which favors the intersection operator. It is therefore difficult to prefer one approach over the other.

Table 5. Average accuracy (%) of the classifiers (MMR, SVM1, SVM2) associated with the feature construction strategies SET1 (intersection) and SET2 (unification), together with the average number of selected features.

Lastly, a comparison between the feature selection algorithms with respect to the two feature selection frameworks is illustrated in Table 6. In both frameworks, the newly proposed LSBOUND based feature selection algorithm clearly outperforms the rest of the algorithms, generating significantly lower average CV errors. The best accuracy, 85.5%, is obtained by SVM1 when the LSBOUND based feature selection algorithm is employed (see Table 3b).

Table 6. Average accuracy (%) of the classifiers associated with particular feature selection algorithms (SFS, LSBOUND, MUTINF, R2W2) in Framework 1 and Framework 2.

Table 7. Average computation time of the SFS, LSBOUND and R2W2 feature selection algorithms (in s).

The obtained accuracy rate is slightly lower than the human classification accuracy of 87.4% (Burkhardt et al., 2005). Furthermore, our result is the most successful achievement reported for the same number of emotional classes of the Berlin Emotional Speech Database, compared to a recent study by Shami and Verhelst (2007). These results show that the feature selection and multi-class classification strategy is very effective in the multi-class emotion detection problem. A successful application of the newly proposed LSBOUND feature selection algorithm has also been demonstrated: the lowest cross validation error (with a standard deviation of ±0.02) is achieved when the LSBOUND algorithm is used in the frameworks. Furthermore, compared to the wrapper type algorithm SFS and the state-of-the-art algorithm R2W2, the computational cost of LSBOUND in the frameworks is not heavy, as seen from Table 7.

6. Conclusion

In this paper, an evaluation of the newly proposed feature selection frameworks and construction approaches is carried out using four feature selection algorithms and three different multi-class classifiers in the emotion detection problem. Results show that the first framework, in which features are selected to distinguish one emotional state from the rest, is more successful in achieving higher accuracy. It has also been shown that, among the four feature selection algorithms, the recently proposed LSBOUND based feature selection is superior in terms of reducing the average CV error. Furthermore, the results have shown that SVM1 (the one-vs-one approach to multi-class classification) outperforms the other multi-class classifiers investigated and is consistently able to give higher accuracy in emotion detection. It is found that, among all feature groups, the prosodic and sub-band energy features are the ones most frequently selected by all the algorithms in each framework. The success of the LSBOUND algorithm also suggests that MFCC features are more informative than LPC features in terms of reducing the CV error of multi-class classifiers. The classification accuracy achieved validates that the proposed frameworks boost the selection of more informative features from the speech related feature groups.

Acknowledgments

This work has been sponsored by the TUBITAK Project under contract 104E179. The corresponding author would also like to thank Prof. Dr. J. Shawe-Taylor for his hospitality and guidance during the academic visit, granted by TUBITAK, at the University of Southampton and University College London in the summer of 2006. We would also like to thank Dr. S. Szedmak for providing the MMR algorithm.

References

Abe, N., & Kudo, M. (2006). Non-parametric classifier-independent feature selection. Pattern Recognition, 39.
Altun, H., & Polat, G. (2007). New frameworks to boost feature selection algorithms in emotion detection for improved human computer interaction. In Brain, vision and artificial intelligence, Lecture Notes in Computer Science (Vol. 4729). Berlin, Heidelberg: Springer-Verlag.
Altun, H., Shawe-Taylor, J., & Polat, G. (2007). New feature selection frameworks in emotion recognition to evaluate the informative power of speech related features. In ISSPA07, ninth international symposium on signal processing and its applications.
Bennett, B. J., Embrechts, K., Breneman, C. M., & Song, M. (2003). Dimensionality reduction via sparse support vector machines. Journal of Machine Learning Research, 3.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Proceedings of Interspeech.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., & Taylor, J. G. (2001). Emotion recognition in human computer interaction. IEEE Signal Processing Magazine, 18(1).
Dogan, G. (2006). Emotion detection using neural networks. M.Sc. thesis, Nigde University.
Fernandez, R., & Picard, R. W. (2005). Classical and novel discriminant features for affect recognition from speech. In Interspeech 2005, Eurospeech 9th European conference on speech communication and technology.
Fragopanagos, N., & Taylor, J. G. (2005). Emotion recognition in human computer interaction. In International symposium on neural networks.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3.
Jain, A., & Zongker, D. (1997). Feature selection: Evaluation, application and small sample performance. IEEE Transactions on PAMI, 19(2).
Juslin, P. N., & Scherer, K. R. (Eds.). (2005). The new handbook of methods in nonverbal behavior research. Oxford, UK: Oxford University Press.
Mladenić, D., Brank, J., Grobelnik, M., & Milic-Frayling, N. (2004). Feature selection using linear classifier weights: Interaction with classification models. In Proceedings of the 27th annual international ACM SIGIR conference on research and development.
Pantic, M., Sebe, N., Cohn, J. F., & Huang, T. (2005). Affective multimodal human computer interaction. In Proceedings of the ACM international conference on multimedia.
Polat, G., & Altun, H. (2007a). Evaluation of performance of KNN, MLP and RBF classifiers in emotion detection problem. In IEEE 15th signal processing and communications applications. doi: /siu
Polat, G., & Altun, H. (2007b). Determining efficiency of speech feature groups in emotion detection. In IEEE 15th signal processing and communications applications. doi: /siu
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3.
Reynolds, C., Ishikawa, M., & Tsujino, H. (2006). Realizing affect in speech classification in real-time. In Aurally informed performance: Integrating machine listening and auditory presentation in robotic systems, in conjunction with the AAAI Fall Symposia.
Schuller, B., Arsic, D., Wallhoff, F., & Rigoll, G. (2006). Emotion recognition in the noise applying large acoustic feature sets. In Speech Prosody, Dresden.
Shami, M., & Verhelst, W. (2007). An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech. Speech Communication, 49.
Szedmak, S., Saunders, C. J., Shawe-Taylor, J., & Rousu, J. (2005). Learning hierarchies at two-class complexity. In Workshop on kernel methods and structured domains.
Ververidis, D., & Kotropoulos, D. (2004). Automatic speech classification to five emotional states based on gender information. In Proceedings of the 12th European signal processing conference (EUSIPCO).
Ververidis, D., & Kotropoulos, D. (2005). Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm. In IEEE international conference on multimedia and expo (ICME).
Wang, Y., & Guan, L. (2004). An investigation of speech based human emotion recognition. In IEEE sixth workshop on multimedia signal processing.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., & Vapnik, V. (2000). Feature selection for SVMs. In Advances in neural information processing systems (NIPS).
Weston, J., Elisseeff, A., Bakır, G., & Sinz, F. (2006). The Spider software package.
Xiao, Z., Dellandrea, E., Dou, W., & Chen, L. (2005). Features extraction and selection for emotional speech classification. In Proceedings of the IEEE conference on advanced video and signal based surveillance (AVSS).
Yacoub, S., Simske, S., Lin, X., & Burns, J. (2003). Recognition of emotions in interactive voice response systems. In Eighth European conference on speech communication and technology.
Zaffalon, M., & Hutter, M. (2002). Robust feature selection by mutual information distributions. In A. Darwiche & N. Friedman (Eds.), UAI-2002: Proceedings of the 18th conference on uncertainty in artificial intelligence. San Francisco, USA: Morgan Kaufmann.
Zhou, X., & Mao, K. Z. (2005). LS bound based gene selection for DNA microarray data. Bioinformatics, 21(8).


More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Classifying Manipulation Primitives from Visual Data

Classifying Manipulation Primitives from Visual Data Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if

More information

Using artificial intelligence for data reduction in mechanical engineering

Using artificial intelligence for data reduction in mechanical engineering Using artificial intelligence for data reduction in mechanical engineering L. Mdlazi 1, C.J. Stander 1, P.S. Heyns 1, T. Marwala 2 1 Dynamic Systems Group Department of Mechanical and Aeronautical Engineering,

More information

Music Mood Classification

Music Mood Classification Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may

More information

Decompose Error Rate into components, some of which can be measured on unlabeled data

Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

More information

How to Improve the Sound Quality of Your Microphone

How to Improve the Sound Quality of Your Microphone An Extension to the Sammon Mapping for the Robust Visualization of Speaker Dependencies Andreas Maier, Julian Exner, Stefan Steidl, Anton Batliner, Tino Haderlein, and Elmar Nöth Universität Erlangen-Nürnberg,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Annotated bibliographies for presentations in MUMT 611, Winter 2006

Annotated bibliographies for presentations in MUMT 611, Winter 2006 Stephen Sinclair Music Technology Area, McGill University. Montreal, Canada Annotated bibliographies for presentations in MUMT 611, Winter 2006 Presentation 4: Musical Genre Similarity Aucouturier, J.-J.

More information

SUPPORT VECTOR MACHINE (SVM) is the optimal

SUPPORT VECTOR MACHINE (SVM) is the optimal 130 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Multiclass Posterior Probability Support Vector Machines Mehmet Gönen, Ayşe Gönül Tanuğur, and Ethem Alpaydın, Senior Member, IEEE

More information

Psychological Motivated Multi-Stage Emotion Classification Exploiting Voice Quality Features

Psychological Motivated Multi-Stage Emotion Classification Exploiting Voice Quality Features 22 Psychological Motivated Multi-Stage Emotion Classification Exploiting Voice Quality Features Marko Lugger and Bin Yang University of Stuttgart Germany Open Access Database www.intechweb.org 1. Introduction

More information

Chapter 7. Feature Selection. 7.1 Introduction

Chapter 7. Feature Selection. 7.1 Introduction Chapter 7 Feature Selection Feature selection is not used in the system classification experiments, which will be discussed in Chapter 8 and 9. However, as an autonomous system, OMEGA includes feature

More information

Data Mining: A Preprocessing Engine

Data Mining: A Preprocessing Engine Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,

More information

How To Solve The Kd Cup 2010 Challenge

How To Solve The Kd Cup 2010 Challenge A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn

More information

Cross-Validation. Synonyms Rotation estimation

Cross-Validation. Synonyms Rotation estimation Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical

More information

Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

More information

Feature Selection vs. Extraction

Feature Selection vs. Extraction Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small

More information

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine. Tutorial. (and Statistical Learning Theory) Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

More information

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition IWNEST PUBLISHER Journal of Industrial Engineering Research (ISSN: 2077-4559) Journal home page: http://www.iwnest.com/aace/ Adaptive sequence of Key Pose Detection for Human Action Recognition 1 T. Sindhu

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Choosing Multiple Parameters for Support Vector Machines

Choosing Multiple Parameters for Support Vector Machines Machine Learning, 46, 131 159, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. Choosing Multiple Parameters for Support Vector Machines OLIVIER CHAPELLE LIP6, Paris, France olivier.chapelle@lip6.fr

More information

Ensemble Approach for the Classification of Imbalanced Data

Ensemble Approach for the Classification of Imbalanced Data Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Predicting Student Performance by Using Data Mining Methods for Classification

Predicting Student Performance by Using Data Mining Methods for Classification BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

More information

Classification by Pairwise Coupling

Classification by Pairwise Coupling Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

Machine Learning in Spam Filtering

Machine Learning in Spam Filtering Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

More information

Model Trees for Classification of Hybrid Data Types

Model Trees for Classification of Hybrid Data Types Model Trees for Classification of Hybrid Data Types Hsing-Kuo Pao, Shou-Chih Chang, and Yuh-Jye Lee Dept. of Computer Science & Information Engineering, National Taiwan University of Science & Technology,

More information

Hardware Implementation of Probabilistic State Machine for Word Recognition

Hardware Implementation of Probabilistic State Machine for Word Recognition IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2

More information

Comparison of machine learning methods for intelligent tutoring systems

Comparison of machine learning methods for intelligent tutoring systems Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

1 Maximum likelihood estimation

1 Maximum likelihood estimation COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

More information

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION

HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

Local features and matching. Image classification & object localization

Local features and matching. Image classification & object localization Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to

More information

Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

More information

MHI3000 Big Data Analytics for Health Care Final Project Report

MHI3000 Big Data Analytics for Health Care Final Project Report MHI3000 Big Data Analytics for Health Care Final Project Report Zhongtian Fred Qiu (1002274530) http://gallery.azureml.net/details/81ddb2ab137046d4925584b5095ec7aa 1. Data pre-processing The data given

More information