Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets

Yetian Chen
Department of Computer Science
Iowa State University
yetianc@cs.iastate.edu

Abstract

In this report, I present my results for the tasks of the 2008 UC San Diego Data Mining Contest. The contest consists of two classification tasks based on data from a scientific experiment. The first is a binary classification task whose goal is to maximize classification accuracy on an evenly distributed test set, given a fully labeled but imbalanced training set. The second is also a binary classification task, but the goal is to maximize the F1 score on a test set, given a partially labeled training set. For task 1, I investigated several re-sampling techniques for improving learning from the imbalanced data: SMOTE (Synthetic Minority Over-sampling Technique), over-sampling by duplicating minority examples, and random under-sampling. These techniques were used to create new, balanced training sets. Three standard classifiers (Decision Tree, Naïve Bayes, Neural Network) were then trained on the rebalanced training sets and used to classify the test set. The results show that the re-sampling techniques substantially improve the accuracy on the test set, except for the Naïve Bayes classifier. For task 2, I implemented a two-step strategy to learn a classifier from only positive and unlabeled data. In step 1, I used the Spy technique to extract reliable negative (RN) examples. In step 2, I used the labeled positive examples and the reliable negative examples as the training set for a standard Naïve Bayes classifier. The results show that the two-step algorithm improves the F1 score compared with learning that simply treats the unlabeled examples as negative.

1. Introduction

The 2008 UC San Diego Data Mining Contest (see footnote 1) consists of two tasks, both of which are binary classification tasks based on data from a scientific experiment.
The first task is a standard classification task: maximize classification accuracy on a test set, given a fully labeled training set. It is a binary classification task involving 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, with roughly ten times as many negative examples as positive ones. Thus, it is a typical class imbalance problem.

The second task is a positive-only semi-supervised learning task, which aims to maximize the F1 score on a test set, given a partially labeled training set. This is also a binary classification problem, but most of the training examples are unlabeled. In fact, only a few of the positive examples have labels. The unlabeled examples include both positives and negatives, with several times as many negatives as positives. As in the standard classification task, there are 20 real-valued features, but they are not the same features. The goal is to classify the test set examples as accurately as possible, as measured by the F1 score. We call this PU-learning.

Footnote 1: http://mill.ucsd.edu/index.php?page=main

1.1 Learning from imbalanced data

The class imbalance problem is prevalent in many applications, including fraud/intrusion detection, risk management, text classification, and medical diagnosis/monitoring [7]. It typically occurs when a classification problem has many more instances of some classes than of others. In such cases, standard classifiers tend to be overwhelmed by the large classes and to ignore the small ones. In particular, they tend to produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class. A number of solutions to the class imbalance problem have been proposed at both the data level and the algorithmic level.
At the data level, these solutions include many different forms of re-sampling, such as over-sampling and under-sampling. These techniques modify the prior probabilities of the majority and minority classes in the training set to obtain a more balanced number of instances in each class. Under-sampling extracts a smaller set of majority instances while preserving all the minority instances. It is suitable for large-scale applications where the number of majority samples is very large, since reducing the number of training instances shortens training time and makes the learning problem more tractable. In contrast, over-sampling increases the number of minority instances, either by replicating existing ones or by generating new ones. At the algorithmic level, solutions include adjusting the costs of the various classes so as to counter the class
imbalance during training, adjusting the decision threshold, etc.

In Section 2, I investigate techniques at the data level, i.e. re-sampling methods. I employed three re-sampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), over-sampling by duplicating minority examples, and random under-sampling. Using the rebalanced data, I trained three different classifiers, Decision Tree (C4.5), Naïve Bayes and Neural Network (with one hidden layer), and used them to classify the test set.

1.2 Learning from only positive and unlabeled data

Consider a binary classification problem: given a set P that is an incomplete set of positive instances and a set U of unlabeled instances that contains both positive and negative instances, we want to build a classifier that labels the instances in U, or in a new test set, as positive or negative. This problem is called learning from only positive and unlabeled data, or PU-learning. PU-learning has many real-life applications. For example, there are over 1000 specialized molecular biology databases, each of which defines a set of positive examples (genes/proteins related to a certain disease or function) but has no information about examples that should not be included (and it is unnatural to build such a set). Traditional classification techniques are not directly applicable, since they require both labeled positive and labeled negative examples to build a classifier. Recently, a few algorithms have been proposed to solve this problem [2][3][4]. One class of algorithms, which includes S-EM, PEBL, and Roc-SVM, is based on a two-step strategy.

Step 1: Identify a set of reliable negative examples (RN) from the unlabeled set. In this step, S-EM uses the Spy technique, PEBL uses a technique called 1-DNF, and Roc-SVM uses the Rocchio algorithm.

Step 2: Build a set of classifiers by iteratively applying a classification algorithm, and then select a good classifier.
In this step, S-EM uses the Expectation-Maximization (EM) algorithm with an NB classifier, while PEBL and Roc-SVM use SVM.

In this report, I implemented a two-step algorithm in which step 1 uses the Spy technique. After identifying a set of reliable negative examples (RN), I use P (the labeled positive examples) and RN to build a Naïve Bayes classifier. Section 3 gives the implementation details and the results.

2. Task 1: Learning Classifiers from Imbalanced Data Sets

2.1 Datasets

The training set consists of 40,000 examples, each with 20 real-valued features. Of these, 3636 are labeled 1 (positive examples) and 36364 are labeled -1 (negative examples). There are no missing values. The test set consists of 10,000 examples, in which the two classes are evenly distributed. More information about the two data sets is available at [1]. For the experiments, all data sets were converted to .arff format for use with Weka.

2.2 Re-sampling Techniques

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling approach proposed in [8]. It generates synthetic examples in a less application-specific manner, by operating in feature space rather than data space. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending on the amount of over-sampling required, neighbors are randomly chosen from the k nearest neighbors. The implementation in [8] uses five nearest neighbors. For instance, if the amount of over-sampling needed is 200%, only two of the five nearest neighbors are chosen and one sample is generated in the direction of each. Synthetic samples are generated as follows: take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This selects a random point along the line segment between the two feature vectors (Fig 1). The approach effectively forces the decision region of the minority class to become more general.

Fig 1. Over-sampling with SMOTE. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples (blue circle) along the line segments joining any/all of the k (default = 5) minority class nearest neighbors (red circles).
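The interpolation rule described for SMOTE can be sketched in a few lines of Python. This is only an illustrative sketch, not the Weka filter used in the experiments: the function name smote, its parameters, and the toy data are hypothetical, and neighbors are drawn with replacement so that more synthetic samples than neighbors can be requested (as a 900% rate with k = 5 would require).

```python
import math
import random


def smote(minority, n_per_sample, k=5, seed=0):
    """Generate n_per_sample synthetic points for every minority example."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x (Euclidean), excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)   # assumption: chosen with replacement
            gap = rng.random()            # random number in [0, 1)
            # Point on the segment from x towards the chosen neighbour.
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Toy 2-D minority class; 9 synthetic points per example mimics a 900% rate.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
synthetic = smote(minority, n_per_sample=9)
print(len(synthetic))  # 54
```

Because every synthetic point is a convex combination of two minority examples, it always lies inside the region already spanned by the minority class.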
In the experimental design, the positive examples were over-sampled by 900%, so that the size of the positive class is 36360, roughly equal to the size of the negative class (36364). SMOTE is available in the Weka package as weka.filters.supervised.instance.smote.

Over-sampling by duplicating the minority examples

As a contrast to SMOTE, I implemented a simple over-sampling approach that over-samples the minority class by duplicating its examples. In this experiment, each positive example was duplicated 9 times to make the size of the positive class roughly equal to that of the negative class.

Random under-sampling

As mentioned previously, under-sampling extracts a smaller set of majority instances while preserving all the minority instances. In this experiment, I implemented an under-sampling approach that randomly selects a subset of examples from the majority class. For this data set, 3720 negative examples were randomly selected from the 36364 negative examples, and all 3636 positive examples were preserved in the new training set.

2.3 Building Standard Classifiers

Using the new training sets, I then trained three different classifiers: Decision Tree, Naïve Bayes, and Neural Network (with one hidden layer).

Decision Tree

Decision Tree classifiers were trained on each of the three rebalanced training sets. I used weka.classifiers.trees.j48.j48 in the Weka package. When building the tree, I selected the default pruning option.

Naïve Bayes

Similarly, Naïve Bayes classifiers were trained on the three new training sets using the weka.classifiers.bayes.naivebayes class. 5-fold cross-validation was selected.

Neural Network

Three-layer feed-forward neural networks (one hidden layer) were trained on the new data sets. I experimented with different numbers of hidden units and selected the one with the best accuracy. I used the default learning rate of 0.3 and momentum of 0.2. The training algorithm I used is weka.classifiers.functions.neural.neuralnetwork.

2.4 Results

Table 1 summarizes the accuracies of the classifiers trained on the different training sets when classifying the test set. When training the Neural Network classifier, I experimented with different numbers of hidden units: 5, 11, 15, 20. The accuracy reaches a plateau after 11 hidden units, so the table only gives the accuracy for the Neural Network with 11 hidden units.

Table 1. Effect of re-sampling techniques on the classification accuracy on the test set

      no resampling   US      OSbD    OS_SMOTE
DT    0.791           0.828   0.788   0.875
NB    0.834           0.827   0.827   0.838
NN    0.835           0.909   0.904   0.91

Notation: DT (Decision Tree), NB (Naïve Bayes), NN (Neural Network); no resampling (no re-sampling technique applied), US (random under-sampling), OSbD (over-sampling by duplication), OS_SMOTE (over-sampling with SMOTE). For NN, the number of hidden units is 11.

The results in Table 1 are plotted in Fig 1.

Fig 1. Effect of resampling techniques on imbalanced data. Bar chart of accuracy (vertical axis, 0.4 to 1.0) for no resampling, US, OSbD and OS_SMOTE, with one series each for Decision Tree, Naive Bayes and Neural Network (hidden neurons = 11).

For the Naïve Bayes classifier, none of the three re-sampling approaches appreciably improves the predictive accuracy on the test set. The accuracies are all around 0.83, roughly the same as for the NB classifier trained on the original imbalanced data set. For the Decision Tree classifier, random under-sampling and over-sampling with SMOTE improve the accuracy substantially, whereas over-sampling by duplication does not (0.788 versus 0.791). Over-sampling with SMOTE gives the best accuracy (0.875), an improvement of about 8 percentage points over the DT classifier trained directly on the imbalanced data (0.791). For the Neural Network, all three re-sampling techniques improve the predictive accuracy on the test set, from 0.835 to between 0.904 and 0.91. The Neural Network classifier with SMOTE over-sampling gives the best accuracy among all classifiers and re-sampling techniques. Thus, my best accuracy is 0.91, which ranked 52nd among all 199 teams. The best accuracy in the ranking is 0.928.
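For reference, the two simpler re-sampling schemes of Section 2.2 can be sketched as follows. This is a hedged illustration rather than the code used for the experiments: the helper names are hypothetical, the examples are dummy tuples that only reproduce the class sizes of Section 2.1, and "duplicated 9 times" is read here as nine extra copies of each positive example (ten occurrences in total).

```python
import random


def oversample_by_duplication(pos, copies):
    """Return each positive example followed by `copies` extra duplicates."""
    return [x for x in pos for _ in range(copies + 1)]


def random_undersample(neg, n_keep, seed=0):
    """Return n_keep negatives drawn uniformly without replacement."""
    return random.Random(seed).sample(neg, n_keep)


# Stand-in data with the class sizes reported in Section 2.1.
pos = [("pos", i) for i in range(3636)]
neg = [("neg", i) for i in range(36364)]

osbd_train = oversample_by_duplication(pos, copies=9) + neg   # OSbD
us_train = pos + random_undersample(neg, n_keep=3720)         # US

print(len(osbd_train), len(us_train))  # 72724 7356
```

Duplication keeps every original example but adds no new information, while under-sampling discards roughly 90% of the negatives, which is the trade-off discussed in Section 4.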
3. Task 2: Learning Classifiers from Only Positive and Unlabeled Data Sets

3.1 Data Set

The training set consists of 68,560 examples, each of which again has 20 real-valued features. Only 60 of them are labeled 1 (positive examples); the others are unlabeled. There are again no missing values. The test set consists of 11,427 examples. More information about the two data sets is available at [1]. For the experiments, all data sets were converted to .arff format for use with Weka.

3.2 Two-step Strategy

Theoretically, the PU-learning problem (learning from only positive and unlabeled data) is learnable [3][5]. A number of solutions have been proposed, among them a class of algorithms based on a two-step strategy. In step 1, these algorithms use various techniques to extract a set of reliable negative examples from the unlabeled examples. In step 2, different classifiers can then be trained using the reliable negative examples obtained in step 1. In my project, I employed the Spy technique [3] in step 1 to extract reliable negative examples. I then built a Naïve Bayes classifier using the labeled positive examples and the reliable negative examples as the training set. The two-step strategy is illustrated in Fig 2.

Fig 2. Illustration of the two-step strategy for PU-learning. Step 1 splits U into reliable negatives (RN) and the remainder Q = U - RN. Step 2 either uses P, RN and Q to build the final classifier iteratively, or uses only P and RN to build a classifier.

Step 1: The Spy technique to find reliable negative instances

The algorithm for the Spy technique is given in Fig 3. It first randomly selects a set S of positive examples from P and puts them in U (lines 2 and 3). The default value for s% is 15%. The examples in S act as spies sent from the positive set into the unlabeled set U. The spies behave similarly to the unknown positive instances in U.
Hence, they allow the algorithm to infer the behavior of the unknown positive instances in U. The algorithm then runs I-EM using the set $P - S$ as positive and the set $U \cup S$ as negative (lines 3-7). I-EM basically runs NB twice (see the EM algorithm below). After I-EM completes, the probabilities that the resulting classifier assigns to the examples in S are used to decide a probability threshold t, which identifies likely negative examples in U and so produces the set RN. See [3] for details.

One thing to mention: since the 20 features are real-valued, I first discretized each of them into 10 bins of width $(x_{i,\max} - x_{i,\min})/10$ for feature i. I then computed the posterior probabilities using the sufficient statistics.

In the listing, M denotes the original unlabeled set (called U in the text above), N collects the reliable negatives (RN), and U is reused for the remaining unlabeled examples.

Algorithm Step-1
1. $N = U = \emptyset$
2. $S = \mathrm{sample}(P, s\%)$
3. $MS = M \cup S$
4. $P = P - S$
5. Assign every example in P the class label 1
6. Assign every example in MS the class label -1
7. Run I-EM(MS, P)
8. Classify each example in MS
9. Determine the probability threshold t using S
10. for each example e in M
11.     if its probability $\Pr[1 \mid e] < t$
12.         $N = N \cup \{e\}$
13.     else $U = U \cup \{e\}$

I-EM(M, P)
1. Build an initial naïve Bayesian classifier NB-C using P as positive examples and M as negative examples
2. Loop while classifier parameters change
3.     for each example $e \in M$
4.         compute $\Pr[1 \mid e]$
5.     update $\Pr[x_i \mid 1]$ and $\Pr[1]$ given the probabilistically assigned $\Pr[1 \mid e]$ and P

Fig 3. Algorithm for the Spy technique

Step 2: Building a Standard Naïve Bayes Classifier using P and RN

After step 1, we have a set of examples that we believe are most likely negative (RN). I then used the labeled positive examples (P) and RN to train a Naïve Bayes classifier. As in Task 1, I used the class weka.classifiers.bayes.naivebayes in Weka. 5-fold cross-validation was selected.

3.3 Results

I ran two experiments. In the first experiment, I simply regarded all unlabeled examples (U) as negative examples, so the training set combines P and U. A Naïve Bayes classifier was trained on this training set.
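Before the second experiment, it may help to see step 1 in code. The following is a minimal Python sketch of the Spy step under several assumptions that are not taken from this report: a single Naïve Bayes pass stands in for I-EM, Laplace smoothing is used for the bin counts, and the threshold t is simply the smallest posterior among the spies (the rule in [3] allows for a noise level). All function names are hypothetical.

```python
import math
import random


def discretize(rows, n_bins=10):
    """Equal-width binning per feature: width = (max - min) / n_bins."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    width = [(max(c) - l) / n_bins or 1.0 for c, l in zip(cols, lo)]
    return [[min(int((v - l) / w), n_bins - 1) for v, l, w in zip(r, lo, width)]
            for r in rows]


def train_nb(pos, neg, n_bins=10):
    """Naive Bayes on binned features; stores log prior and log likelihoods."""
    model, total = {}, len(pos) + len(neg)
    for label, rows in ((1, pos), (-1, neg)):
        counts = [[1.0] * n_bins for _ in rows[0]]   # Laplace smoothing (assumption)
        for r in rows:
            for f, b in enumerate(r):
                counts[f][b] += 1
        log_lik = [[math.log(c / (len(rows) + n_bins)) for c in col] for col in counts]
        model[label] = (math.log(len(rows) / total), log_lik)
    return model


def posterior_pos(model, row):
    """Pr[1 | row], computed from log scores for numerical stability."""
    score = {lab: lp + sum(ll[f][b] for f, b in enumerate(row))
             for lab, (lp, ll) in model.items()}
    m = max(score.values())
    return math.exp(score[1] - m) / sum(math.exp(s - m) for s in score.values())


def spy_step(P, U, spy_frac=0.15, seed=0):
    """Return the indices of U that are taken as reliable negatives (RN)."""
    rng = random.Random(seed)
    spy_idx = set(rng.sample(range(len(P)), max(1, round(spy_frac * len(P)))))
    spies = [P[i] for i in spy_idx]
    rest = [P[i] for i in range(len(P)) if i not in spy_idx]
    model = train_nb(rest, U + spies)                  # P - S positive, U + S negative
    t = min(posterior_pos(model, s) for s in spies)    # threshold rule (assumption)
    return [i for i, u in enumerate(U) if posterior_pos(model, u) < t]


# Toy data: 20 labeled positives; U holds 50 negatives and 10 hidden positives.
P_raw = [[1.0, 1.0]] * 20
U_raw = [[0.0, 0.0]] * 50 + [[1.0, 1.0]] * 10
binned = discretize(P_raw + U_raw)
rn = spy_step(binned[:20], binned[20:])
print(len(rn))  # 50
```

On this toy data the hidden positives in U receive exactly the same posterior as the spies, so they stay out of RN, while the 50 true negatives fall below the threshold.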
The second experiment follows the two-step strategy above. The
performances were evaluated using the F1 score obtained when classifying the test set.

Table 2. F1 score of PU-learning

            P, N=U   P, N=RN
F1 score    0.545    0.651

Notation: $F_1 = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} / (\mathrm{Precision} + \mathrm{Recall})$

When trained with P versus U (U as the negative examples), the F1 score is 0.545. The two-step strategy gives F1 = 0.651, a clear improvement. The best score among all teams in the contest is 0.721, which means I still have a long way to go. Nevertheless, the results show that the two-step strategy does improve predictive power.

4. Conclusion and Discussion

In this report, I studied two challenging data mining tasks presented in the 2008 UC San Diego Data Mining Contest. The first task is to improve learning from imbalanced data sets. For this problem, I investigated a set of re-sampling approaches (random under-sampling, over-sampling by duplicating the minority class, and SMOTE, the Synthetic Minority Over-sampling Technique) for improving learning from imbalanced data sets. I then built three classifiers on the rebalanced data sets. For Naïve Bayes, the three re-sampling techniques do not appreciably improve the classification accuracy on the test set. For Decision Tree classifiers, random under-sampling and SMOTE improve the accuracy substantially. For the Neural Network, all three re-sampling techniques improve the accuracy substantially. The Neural Network classifier with SMOTE gives the best accuracy among all classifiers and re-sampling techniques. Although in this case under-sampling achieves predictive accuracy on par with SMOTE, a problem with it is that informative instances may be lost among the discarded ones.

The competition uses accuracy as the criterion for evaluating a classifier. This may not be a good idea, since higher accuracy does not necessarily imply better performance on the target task. AUC (area under the ROC curve) is often regarded as a better performance measure than accuracy for imbalanced problems.
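A small worked example makes the point about accuracy concrete. The sketch below is illustrative only: it uses the class sizes of the Task 1 training set (not the balanced contest test set) and a trivial classifier that predicts every example as negative, and it reports the F1 score from Table 2 in place of AUC because F1 needs no ranking scores. The function name is hypothetical.

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (F1 of the positive class, accuracy) for labels in {1, -1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 = 2 * Precision * Recall / (Precision + Recall), taken as 0 if undefined.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, correct / len(pairs)


# Class sizes of the imbalanced training set; the classifier always says "negative".
y_true = [1] * 3636 + [-1] * 36364
y_pred = [-1] * len(y_true)

f1, acc = f1_and_accuracy(y_true, y_pred)
print(round(acc, 4), f1)  # 0.9091 0.0
```

On data with this class ratio, a classifier that never finds a single positive example still reaches about 91% accuracy, whereas its F1 score is 0.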
The second task is a problem of learning from a partially labeled data set. Only a subset of the positive examples is labeled; the others are unlabeled, and there are no labeled negative examples. For this task, I investigated a two-step strategy for learning from this type of data set. In step 1, I used the Spy technique to extract reliable negative examples. In step 2, I combined these reliable negative examples with the labeled positive examples to learn a Naïve Bayes classifier. This two-step strategy gives a better F1 score (0.651) than simply treating all unlabeled examples as negative (0.545). However, the best score among all teams in the contest is 0.721, far better than mine, which means I still have a long way to go. There are alternative two-step strategies. For example, in step 1 there are various algorithms for identifying reliable negative examples, such as 1-DNF and the Rocchio algorithm. In step 2, SVM may be a better choice for building the final classifier. Thus, future work includes trying these different two-step strategies, as well as other, non-two-step approaches. Class imbalance and PU-learning are problems that are still worth studying.

References

[1] http://mill.ucsd.edu/index.php?page=datasets
[2] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179-188, 2003.
[3] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially Supervised Classification of Text Documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML-2002), July 8-12, 2002, Sydney, Australia.
[4] W. S. Lee and B. Liu. Learning with Positive and Unlabeled Examples using Weighted Logistic Regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), August 21-24, 2003, Washington, DC, USA.
[5] C. Elkan and K. Noto. Learning Classifiers from Only Positive and Unlabeled Data. In Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining (KDD'08).
[6] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. ICPR 2008: 1-4.
[7] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1): 1-6, 2004.
[8] N. V. Chawla et al. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.