Sample subset optimization for classifying imbalanced biological data


Pengyi Yang 1,2,3, Zili Zhang 4,5, Bing B. Zhou 1,3 and Albert Y. Zomaya 1,3
1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia
3 Centre for Distributed and High Performance Computing, University of Sydney, NSW 2006, Australia
4 Faculty of Computer and Information Science, Southwest University, Chongqing, China
5 School of Information Technology, Deakin University, VIC 3217, Australia
yangpy@it.usyd.edu.au; zhangzl@swu.edu.cn

Abstract. Data in many biological problems are often compounded by imbalanced class distribution; that is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms, such as the support vector machine (SVM), are sensitive to data with imbalanced class distribution and produce suboptimal classification results. It is therefore desirable to compensate for the imbalance during model training so as to obtain more accurate classification. In this study, we propose a sample subset optimization technique for classifying biological data with moderately and extremely imbalanced class distributions. By using this optimization technique with an ensemble of SVMs, we build multiple roughly balanced SVM base classifiers, each trained on an optimized sample subset. The experimental results demonstrate that the ensemble of SVMs created by our sample subset optimization technique achieves higher area under the ROC curve (AUC) values than popular sampling approaches such as random over-/under-sampling and SMOTE, and than widely used ensemble approaches such as bagging and boosting.

1 Introduction

Modern molecular biology is being rapidly advanced by the increasing use of computational techniques. For tasks such as RNA gene prediction [1], promoter recognition [2], splice site identification [3], and the classification of protein localization sites [4], it is often necessary to address the problem of imbalanced class distribution, because the datasets extracted from these biological systems are likely to contain a large number of negative examples (referred to as the majority class) and a small number of positive examples (referred to as the minority class). Many popular classification algorithms such as the SVM have been applied to a large variety of bioinformatics problems, including those mentioned above (e.g., refs. [1, 3, 4]). However, most of these algorithms are sensitive to the imbalanced class distribution and may not perform well when directly applied to imbalanced data [5, 6].

Sampling is a popular approach to addressing imbalanced class distribution [7]. Simple methods such as random under-sampling and random over-sampling are routinely applied in many bioinformatics studies [8]. With random under-sampling, the size of the majority class is reduced to compensate for the imbalance, whereas with random over-sampling, the size of the minority class is increased (a minimal sketch of these two baselines is given at the end of this section). Although straightforward and computationally efficient, these two methods are prone either to removing informative samples or to increasing noise through duplicated samples [9]. A more sophisticated approach, known as SMOTE, synthesizes new samples from the original samples in the dataset [10]. However, many bioinformatics problems present several thousand samples with a highly imbalanced class distribution, and applying SMOTE introduces a large number of synthetic samples, which may substantially increase the noise in the data. Alternatively, a cost metric can be specified to force the classifier to pay more attention to the minority class [11]. This requires choosing a correct cost metric, which is often unknown a priori.

Several recent studies found that ensemble learning can improve the performance of a single classifier in imbalanced data classification [6, 12]. In this study, we explore along this direction. In particular, we introduce a sample subset optimization technique for intelligent under-sampling in imbalanced data classification. Using this technique, we designed an ensemble of SVMs specifically for learning from imbalanced biological datasets. This system has several advantages over conventional ones:

- It creates each base classifier using a roughly balanced training subset with built-in intelligent under-sampling. This is important in learning from imbalanced data because it reduces the risk of biasing towards one class while neglecting the other.
- The system embraces an ensemble framework in which multiple roughly balanced training subsets are created to train an ensemble of classifiers. It thus reduces the risk of removing informative samples from the majority class, which may occur when a simple under-sampling technique is applied.
- As opposed to random sampling, the sample subset optimization technique is applied to identify optimal sample subsets. This may improve the quality of the base classifiers and result in a more accurate ensemble.
- The aforementioned biological problems often present several thousand training samples. The proposed technique is essentially an under-sampling approach: it avoids introducing data noise, and the generated data subsets may be more efficient for classifier training.

The rest of the paper discusses the details of the proposed sample subset optimization technique and the associated ensemble learning system. Section 2 presents the ensemble learning system. Section 3 describes the main idea of sample subset optimization. The base classifier and fitness function of the ensemble system are described in Section 4. Comparisons with typical sampling and ensemble methods are given in Section 5. Section 6 concludes the paper.
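The following is a minimal sketch of the two baseline samplers referenced above, assuming numpy arrays X (features) and y (labels, with the minority class coded as 1); the helper names are illustrative and not the implementations evaluated in this paper.

import numpy as np

def random_under_sample(X, y, rng):
    minor = np.where(y == 1)[0]
    major = np.where(y == 0)[0]
    # shrink the majority class to the size of the minority class
    keep = rng.choice(major, size=len(minor), replace=False)
    idx = np.concatenate([minor, keep])
    return X[idx], y[idx]

def random_over_sample(X, y, rng):
    minor = np.where(y == 1)[0]
    major = np.where(y == 0)[0]
    # duplicate minority samples until both classes are the same size
    extra = rng.choice(minor, size=len(major) - len(minor), replace=True)
    idx = np.concatenate([major, minor, extra])
    return X[idx], y[idx]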

2 Ensemble system

Ensemble learning is an effective approach for improving the prediction accuracy of a single classification algorithm. Such an improvement is commonly achieved by using multiple classifiers (known as base classifiers), each trained on a subset of samples created by random sampling, as in bagging [13], or by cost-sensitive sampling, as in boosting [14]. The base classifiers are typically combined using an integration function such as averaging [15] or majority voting [16].

Fig. 1. A schematic representation of the proposed ensemble system. Optimized training subsets of the majority class (n_1, n_2, ..., n_L) are combined with the minority class (m) to train the base classifiers (c_1, c_2, ..., c_L), whose predictions on the test set are combined by majority voting and evaluated by the AUC value.

We propose an ensemble learning system specifically designed for imbalanced biological data classification; a schematic representation is shown in Figure 1. It has three main components: sample subset optimization, the base classifier, and the fitness function. The key to this ensemble system is the application of the sample subset optimization technique (described in Section 3). Suppose that a highly imbalanced dataset contains n samples from the majority class and m samples from the minority class, where n >> m. The system creates each sample subset by including all m minority samples and selecting a subset of the n majority samples according to an internal optimization procedure. This procedure generates multiple optimized sample subsets, each a roughly balanced subset containing the m minority samples and n_i carefully selected majority samples, where n_i << n (i = 1, ..., L) and L is the total number of optimized sample subsets. Using those optimized sample subsets, we obtain a group of base classifiers c_i (i = 1, ..., L), each trained on its corresponding sample subset {m + n_i}.
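To make the pipeline of Figure 1 concrete, the sketch below builds L roughly balanced subsets and trains one linear SVM per subset, combining them by the majority voting described next. It assumes numpy and scikit-learn, and uses plain random under-sampling as a stand-in for the optimization procedure of Section 3.

import numpy as np
from sklearn.svm import LinearSVC

def build_ensemble(X, y, L=10, rng=None):
    rng = rng or np.random.default_rng(0)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    classifiers = []
    for _ in range(L):
        # one roughly balanced subset: all minority samples plus a
        # randomly chosen, equally sized slice of the majority class
        picked = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, picked])
        clf = LinearSVC()  # soft-margin linear SVM base classifier
        clf.fit(X[idx], y[idx])
        classifiers.append(clf)
    return classifiers

def majority_vote(classifiers, X):
    votes = np.stack([clf.predict(X) for clf in classifiers])
    # predict the class chosen by at least half of the base classifiers
    return (votes.mean(axis=0) >= 0.5).astype(int)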

The base classifiers are then combined using majority voting to form an ensemble of classifiers. Algorithm 1 summarizes the procedure; a line starting with // is a comment for the line that follows it.

Algorithm 1 sampleSubsetOptimization
Input: imbalanced dataset D_I
Output: roughly balanced dataset D_B
1: cvSize = 2;
2: cvSets = crossValidate(D_I, cvSize);
3: for i = 1 to cvSize do
4:   // obtain the internal training samples
5:   D_T^i = getTrain(cvSets, i);
6:   // obtain the internal test samples
7:   D_t^i = getTest(cvSets, i);
8:   // obtain the samples of the minority class
9:   D_minor^i = getMinoritySample(D_T^i);
10:  // obtain the samples of the majority class
11:  D_major^i = getMajoritySample(D_T^i);
12:  // select a subset of samples from the majority class
13:  D'_major^i = optimizeMajoritySample(D_major^i, D_minor^i, D_t^i);
14:  D_B = D_B ∪ (D_minor^i ∪ D'_major^i);
15: end for
16: return D_B;

3 Sample subset optimization

The key function in Algorithm 1 is the optimization procedure applied to select a subset of samples from the majority class (Algorithm 1, line 13). The principal idea of sample subset optimization is to apply a cross-validation procedure to form a subset in which each sample is selected according to the internal classification accuracy. In this section, we describe its formulation using a particle swarm optimization (PSO) algorithm [17] and analyze its behavior on a synthetic dataset. The base classifier and the fitness function used for optimization are discussed in Section 4.

3.1 Formulation of sample subset optimization

We formulate the sample subset optimization using a particle swarm optimization algorithm. In particular, each sample from the majority class is assigned a dimension in the particle space. That is, for n majority samples, a particle is coded as an indicator function set p = {I_{x_1}, I_{x_2}, ..., I_{x_n}}. For each dimension, the indicator function I_{x_j} takes the value 1 when the corresponding jth sample x_j is included in training a classifier, and 0 when the sample is excluded from training.
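As a small illustration of this encoding, assuming numpy, a particle is simply a binary vector over the majority samples, and decoding it recovers the indices selected for training:

import numpy as np

n_major = 25                                  # illustrative value only
rng = np.random.default_rng(0)
particle = rng.integers(0, 2, size=n_major)   # p = {I_x1, ..., I_xn}
selected = np.where(particle == 1)[0]         # majority samples kept for training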

By optimizing a population of L particles p_i (i = 1, ..., L), the velocity v_{i,j}(t) of the ith particle and its position s_{i,j}(t) in the jth dimension of the solution space are updated in each iteration t as follows:

v_{i,j}(t+1) = w v_{i,j}(t) + c_1 r_1 (pbest_{i,j} - s_{i,j}(t)) + c_2 r_2 (gbest_{i,j} - s_{i,j}(t))    (1)

s_{i,j}(t+1) = 0 if random() >= S(v_{i,j}(t+1)), and 1 if random() < S(v_{i,j}(t+1))    (2)

S(v_{i,j}(t+1)) = 1 / (1 + e^{-v_{i,j}(t+1)})    (3)

where pbest_{i,j} and gbest_{i,j} are the previous best position and the best position found by informants, respectively; w is the inertia weight; c_1 and c_2 are the cognitive and social learning rates; r_1 and r_2 are random numbers; and random() generates a random number uniformly distributed in [0, 1]. Representing this optimization procedure in pseudocode, we obtain Algorithm 2. Note that the PSO algorithm produces multiple optimized sample subsets in parallel: by specifying the popSize parameter, we can obtain any number of optimized sample subsets with a single execution of the algorithm.

Algorithm 2 optimizeMajoritySamples
Input: majority samples D_major, minority samples D_minor, internal test samples D_t
Output: optimized sample subsets D_major^{p_i} (i = 1, ..., L)
1: popSize = L;
2: initiateParticles(D_major, popSize);
3: for t = 1 to termination do
4:   // go through each particle in the population
5:   for i = 1 to popSize do
6:     // extract the samples according to the indicator function set
7:     D_major^{p_i} = extractSelectedSamples(p_i, D_major);
8:     D_train^{p_i} = D_major^{p_i} ∪ D_minor;
9:     // train a classifier using the selected majority samples and all minority samples
10:    h_i = trainClassifier(D_train^{p_i});
11:    // calculate the fitness of the trained classifier using the internal test samples
12:    fitness = calculateFitness(h_i, D_t);
13:    // update the velocity (Eq. (1)) and position (Eq. (2)) according to the fitness value
14:    v_{i,j}(t) = updateVelocity(v_{i,j}(t), fitness);
15:    s_{i,j}(t) = updatePosition(s_{i,j}(t), fitness);
16:  end for
17: end for
18: return D_major^{p_i} (i = 1, ..., L)
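The core update of Eqs. (1)-(3) can be sketched in a few lines, assuming numpy; the parameter values for w, c_1, and c_2 are illustrative assumptions and are not taken from the paper.

import numpy as np

def pso_step(v, s, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    # v, s, pbest, gbest: arrays of shape (popSize, n) over the particle space
    rng = rng or np.random.default_rng(0)
    r1, r2 = rng.random(v.shape), rng.random(v.shape)
    v = w * v + c1 * r1 * (pbest - s) + c2 * r2 * (gbest - s)   # Eq. (1)
    prob = 1.0 / (1.0 + np.exp(-v))                             # Eq. (3)
    s = (rng.random(v.shape) < prob).astype(int)                # Eq. (2)
    return v, s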

3.2 Analysis of behavior

We analyze the behavior of sample subset optimization using an imbalanced synthetic dataset. Each sample has two features, both generated from the same distribution. Specifically, 20 samples of the majority class are generated from a normal distribution N(5, 1) and 10 samples of the minority class from a normal distribution N(7, 1). In addition, 5 outlier samples are introduced into the dataset: they are labeled as majority class but are generated from the normal distribution of the minority class. The class ratio of the dataset is thus 25:10 (a sketch of this setup follows below). Figure 2(a) shows the original dataset and the resulting classification boundary of a linear SVM, and Figure 2(b) shows a dataset after applying sample subset optimization and the resulting classification boundary of a linear SVM. Note that this is one of the optimized datasets, used to train a single base classifier; our ensemble is the aggregation of multiple base classifiers trained on multiple optimized datasets. It is evident that the class ratio is more balanced after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples are removed, and 7 redundant majority samples, which have limited effect on the decision boundary of the linear SVM classifier, are removed to correct the imbalanced class distribution.

Fig. 2. The green lines are the classification boundaries created using a linear SVM with (a) the original dataset and (b) the dataset after optimization.
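The following sketch, assuming numpy and scikit-learn, reproduces the synthetic setup described above and fits the linear SVM of Figure 2(a); the random seed is an arbitrary assumption.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)  # illustrative seed
X = np.vstack([
    rng.normal(5, 1, size=(20, 2)),  # majority class, N(5, 1)
    rng.normal(7, 1, size=(10, 2)),  # minority class, N(7, 1)
    rng.normal(7, 1, size=(5, 2)),   # outliers: minority distribution, majority label
])
y = np.array([0] * 20 + [1] * 10 + [0] * 5)
clf = LinearSVC().fit(X, y)          # decision boundary as in Fig. 2(a)
print(clf.coef_, clf.intercept_)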

4 Base classifier and fitness function

We select the SVM as the base classifier for building the ensemble system, as it is routinely applied to many challenging bioinformatics problems. The design of the fitness function is another important facet of sample subset optimization: it determines the quality of the base classifiers and thus the performance of the ensemble. The following subsections describe these two components in detail.

4.1 Base classifier of support vector machine

The SVM is a popular classification algorithm that has been widely used in many bioinformatics problems. Among the different kernel choices, the linear SVM with a soft margin is robust for large-scale and high-dimensional dataset classification [18]. Let us denote each sample in the dataset as a vector x_i (i = 1, ..., M), where M is the total number of samples and y_i is the class label of sample x_i. Each component of x_i is a feature x_{ij} (j = 1, ..., N), interpreted as the jth feature of the ith sample, where N is the dimension of the feature space. In our case, features could be GC content, dinucleotide values, or other biological markers used to characterize each sample. A linear SVM with a soft margin is trained by solving the following optimization problem:

min_{w,b,ξ}  (1/2)||w||^2 + C Σ_{i=1}^{M} ξ_i
subject to:  y_i(⟨w, x_i⟩ + b) >= 1 - ξ_i,  ξ_i >= 0

where w is the weight vector, ξ_i are the slack variables, and b is the bias. The constant C determines the trade-off between maximizing the margin and minimizing the amount of slack. In this study, we utilize the implementation proposed by Hsieh et al. [19], a fast, large-scale linear SVM that is especially suited as a base classifier for ensemble learning due to its computational efficiency.

Notice that classifiers are trained both for sample subset optimization and for composing the ensemble. However, these two procedures are independent of each other, and the classifiers trained for sample subset optimization are therefore not the classifiers used in the ensemble. The purpose of the classifiers trained in the sample subset optimization procedure is to provide fitness feedback on the selected samples, whereas the classifiers used for composing the ensemble are trained on the optimized sample subsets and serve as the base classifiers of the ensemble. To maximize the specificity of the feedback, the same classification algorithm, namely the linear SVM, is used in both procedures.
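For illustration, a soft-margin linear SVM base classifier of this kind can be trained as follows; one widely available implementation of the dual coordinate descent solver of Hsieh et al. [19] is LIBLINEAR, which scikit-learn's LinearSVC wraps. The toy data here are assumptions.

import numpy as np
from sklearn.svm import LinearSVC

X = np.random.default_rng(0).normal(size=(100, 16))  # e.g., 16 dinucleotide features
y = (X[:, 0] + X[:, 1] > 0).astype(int)              # toy labels for illustration
clf = LinearSVC(C=1.0)  # C trades margin width against slack, as in Sec. 4.1
clf.fit(X, y)
scores = clf.decision_function(X)  # signed distances, usable for AUC later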

4.2 Fitness function

For building a classifier, a subset of samples from the majority class is selected according to an indicator function set p_i (see Section 3.1) and combined with the samples from the minority class to form a training set D_train^{p_i}. The goodness of an indicator function set can be assessed by the performance of the classifier trained with the samples it specifies. For imbalanced data, one effective way to evaluate the performance of a classifier is the area under the ROC curve (AUC) metric [20]. Hence, we devise AUC(h_i(D_train^{p_i}, D_test)) as one component of the fitness function, where D_train^{p_i} denotes the training set generated using p_i and D_test denotes the test data. The function AUC() calculates the AUC value of a classification model h_i(D_a, D_b), which is trained on D_a and evaluated on D_b. Moreover, the size of the subset is also important, because a small training set is likely to result in a poorly trained model with poor generalization. Therefore, the fitness function is constructed by combining the two components:

fitness(p_i) = w_1 · AUC(h_i(D_train^{p_i}, D_test)) + w_2 · Size(p_i)    (4)

where Size() determines the size of the subset specified by p_i. The coefficients w_1 and w_2 are empirical constants that can be adjusted to alter the relative importance of each fitness component. The default values are w_1 = 0.8 and w_2 = 0.2, which work well across a range of datasets.
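A minimal sketch of Eq. (4), assuming scikit-learn and the default weights w_1 = 0.8 and w_2 = 0.2; normalizing the size term to [0, 1] is an assumption made for illustration, as the paper does not specify the scaling of Size().

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

def fitness(p, X_major, X_minor, X_test, y_test, w1=0.8, w2=0.2):
    # p: binary indicator vector over the majority samples (one per dimension)
    X_sel = X_major[p.astype(bool)]
    X = np.vstack([X_sel, X_minor])
    y = np.array([0] * len(X_sel) + [1] * len(X_minor))
    clf = LinearSVC().fit(X, y)
    auc = roc_auc_score(y_test, clf.decision_function(X_test))
    return w1 * auc + w2 * (p.sum() / len(p))  # assumed normalized Size() term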

5 Experimental results

In this section, we first describe the four imbalanced biological datasets used in our experiments. They are generated from several important and diverse biological problems and represent different degrees of imbalanced class distribution. We then present the performance of our ensemble algorithm compared with six other algorithms on those datasets.

5.1 Datasets

We evaluated the different algorithms using datasets generated for miRNA identification, classification of protein localization sites, and promoter prediction (Drosophila and human). Specifically, the miRNA identification dataset contains 691 positive samples and 9248 negative samples, each described by 21 features [21]. The protein localization dataset is generated from the study described in [22]; we attempted to differentiate membrane proteins (258) from the rest (1226). The human promoter dataset contains 471 promoter sequences and 5131 coding (CDS) and intron sequences. Compared to the human promoter dataset, the Drosophila promoter dataset has a relatively balanced class distribution, with 1936 promoter sequences and 2722 CDS and intron sequences. For the promoter datasets, we calculated 16 dinucleotide features according to [23]. The datasets are summarized and organized according to class ratio in Table 1.

Table 1. Summary of the biological datasets used for evaluation.

Dataset (short name)            # Samples   # Features   Minority vs. Majority
Drosophila promoter (DroProm)   4658        16           ~1:1.4
protein localization (ProtLoc)  1484        8            ~1:5
human promoter (HuProm)         5602        16           ~1:10
miRNA identification (miRNA)    9939        21           ~1:13

5.2 Performance comparison

The performance of a single SVM classifier was used as the baseline for all datasets. We compared the single-classifier approaches, including random under-sampling with SVM (RUS-SVM), random over-sampling with SVM (ROS-SVM), and SMOTE sampling with SVM (SMOTE-SVM), and the ensemble approaches, including boosting with SVM base classifiers (Boost-SVMs), bagging with SVM base classifiers (Bag-SVMs), and our sample subset optimization technique with SVM (SSO-SVMs).

Fig. 3. The comparison of the different algorithms on (a) Drosophila promoter, (b) protein localization, (c) human promoter, and (d) miRNA identification. The x-axis denotes the ensemble size and the y-axis the AUC value. For the algorithms that use a single classifier, the same AUC value is plotted across ensemble sizes for comparison.

For the ensemble methods, we tested ensemble sizes from 10 to 100 in steps of 10. A 5-fold cross-validation procedure was applied to partition the datasets for training and testing, and each algorithm was tested on the same partitions to reduce evaluation variance. Among the six tested algorithms, four employ a randomization procedure: RUS-SVM, ROS-SVM, Bag-SVMs, and SSO-SVMs (the Boost-SVMs algorithm uses the reweighting implementation and is deterministic). For those with a randomization procedure, we repeated the test 10 times, each time with a different random seed.
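The evaluation protocol can be sketched as follows, assuming scikit-learn; make_model is a hypothetical factory that builds one of the compared classifiers from a random seed, and the fixed fold seed keeps the partitions identical across algorithms, as described above.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def evaluate(make_model, X, y, seeds=range(10)):
    aucs = []
    for seed in seeds:
        # fixed random_state so every algorithm sees the same partitions
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for tr, te in skf.split(X, y):
            model = make_model(seed).fit(X[tr], y[tr])
            score = model.decision_function(X[te])  # assumes an SVM-style model
            aucs.append(roc_auc_score(y[te], score))
    return float(np.mean(aucs)), float(np.std(aucs))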

Figure 3 shows the comparison of the results. It can be seen that in most cases the ensemble approaches give higher AUC values than the single-classifier approaches. For the single-classifier approaches, random under-sampling, random over-sampling, and SMOTE sampling do improve the classification results when the analyzed dataset has a highly imbalanced class distribution, as in Figure 3(b)(c)(d). However, the improvements become less significant when the imbalance is moderate (the Drosophila promoter dataset in Figure 3(a)). SMOTE sampling performs better than random under-sampling and over-sampling in the case of protein localization (Figure 3(b)), but the performance gain is marginal on the other three datasets (Figure 3(a)(c)(d)). We do not observe a significant performance difference between random under-sampling and random over-sampling, except for miRNA identification (Figure 3(d)), where random over-sampling is relatively better.

For the ensemble approaches, Boost-SVMs surprisingly performs worse than the other two approaches in most cases, and its performance fluctuates across ensemble sizes. This may be caused by its training process: the boosting algorithm assigns increasingly large classification weights to the most difficult samples in each iteration, yet those difficult samples may be outliers, with a deleterious effect when the classifiers pay too much attention to classifying them while ignoring other, more representative samples. In this regard, Bag-SVMs and SSO-SVMs appear to be the better approaches. Moreover, SSO-SVMs almost always performs best and exhibits a much smaller performance variance when different random seeds are used. It is likely that SSO-SVMs captures the most representative samples from the training set, which gives better generalization when classifying unseen data. We also observe that the improvement is more significant when the dataset has a highly imbalanced class distribution (Figure 3(b)(c)(d)).

Table 2. The comparison of the different algorithms according to AUC value, reporting Single-SVM, RUS-SVM, ROS-SVM, SMOTE-SVM, Boost-SVMs, Bag-SVMs, and SSO-SVMs on DroProm, ProtLoc, HuProm, and miRNA; the values for the ensemble approaches are averaged across the different ensemble sizes.

Table 3. P-values from a one-tailed Student's t-test comparing SSO-SVMs with each of the other methods (Single-SVM, RUS-SVM, ROS-SVM, SMOTE-SVM, Boost-SVMs, and Bag-SVMs) on DroProm, ProtLoc, HuProm, and miRNA.

Table 2 shows the AUC values of both the single-classifier and ensemble approaches. For the ensemble approaches, the AUC value is the average of those obtained with ensemble sizes from 10 to 100. The proposed SSO-SVMs performs best on all four tested datasets; compared with the baseline of a single SVM, this amounts to a 10%-20% improvement. To confirm that the improvements are statistically significant, we applied a one-tailed Student's t-test comparing SSO-SVMs with the other six methods. Table 3 shows the p-values of these comparisons. On all four datasets, the performance of SSO-SVMs is significantly better than that of the other six methods, confirming the effectiveness of the proposed ensemble approach.
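A sketch of the significance test used for Table 3, assuming scipy (version 1.6 or later for the alternative keyword) and arrays of per-run AUC values for the two methods being compared:

from scipy import stats

def one_tailed_ttest(auc_sso, auc_other):
    # H1: the mean AUC of SSO-SVMs exceeds that of the competing method
    t, p = stats.ttest_ind(auc_sso, auc_other, alternative="greater")
    return p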

6 Conclusion

In this paper we introduced a sample subset optimization technique for selecting optimal sample subsets from training data. We integrated this technique into an ensemble learning framework and created an ensemble of SVMs specifically for imbalanced biological data classification. The proposed algorithm was applied to several bioinformatics tasks with moderately and highly imbalanced class distributions. According to our experimental results, (1) approaches based on data sampling for a single SVM are generally less effective than the ensemble approaches; and (2) the proposed sample subset optimization technique appears to be very effective, and the ensemble optimized by this technique produced the best classification results in terms of AUC value on all evaluation datasets.

References

1. Meyer, I.: A practical guide to the art of RNA gene prediction. Briefings in Bioinformatics 8(6) (2007)
2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Briefings in Bioinformatics 10(5) (2009)
3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice site prediction using support vector machines. BMC Bioinformatics 8(Suppl 10) (2007) S7
4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17(8) (2001)
5. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Proceedings of the 15th European Conference on Machine Learning (2004)
6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets with SVM ensembles. In: Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (2006)
7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6(5) (2002)
8. Batuwita, R., Palade, V.: A new performance measure for class imbalance learning. Application to bioinformatics problems. In: 2009 International Conference on Machine Learning and Applications, IEEE (2009)
9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6 (2004)
10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16(1) (2002)
11. Weiss, G.: Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter 6(1) (2004)
12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining 2(5-6) (2009)
13. Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996)
14. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5) (1998)
15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33(9) (2000)
16. Lam, L., Suen, S.: Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 27(5) (1997)
17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelligence 1(1) (2007)
18. Ben-Hur, A., Ong, C., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vector machines and kernels for computational biology. PLoS Computational Biology 4(10) (2008)
19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th International Conference on Machine Learning, ACM (2008)
20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8) (2006)
21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25(8) (2009)
22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellular localization sites of proteins. In: Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology, AAAI Press (1996)
23. Rani, T., Bhavani, S., Bapi, R.: Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 23(5) (2007)


HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Standardization and Its Effects on K-Means Clustering Algorithm

Standardization and Its Effects on K-Means Clustering Algorithm Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

More information

Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters

Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters Yasser Ganjisaffar University of California, Irvine Irvine, CA, USA yganjisa@ics.uci.edu Rich Caruana Microsoft Research Redmond,

More information

Introduction to Logistic Regression

Introduction to Logistic Regression OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Final Project Report

Final Project Report CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes

More information

Active Learning with Boosting for Spam Detection

Active Learning with Boosting for Spam Detection Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters

More information