Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets

Yetian Chen
Department of Computer Science, Iowa State University
yetianc@cs.iastate.edu

Abstract

In this report, I present my results for the tasks of the 2008 UC San Diego Data Mining Contest. The contest consists of two classification tasks based on data from a scientific experiment. The first task is a binary classification task whose goal is to maximize classification accuracy on an evenly distributed test set, given a fully labeled but imbalanced training set. The second task is also a binary classification task, but the goal is to maximize the F1-score on a test set, given a partially labeled training set. For task 1, I investigated several re-sampling techniques for improving learning from imbalanced data: SMOTE (Synthetic Minority Over-sampling Technique), over-sampling by duplicating minority examples, and random under-sampling. These techniques were used to create new, balanced training sets. Three standard classifiers (Decision Tree, Naïve Bayes, Neural Network) were then trained on the rebalanced training sets and used to classify the test set. The results show that the re-sampling techniques significantly improve accuracy on the test set for all classifiers except Naïve Bayes. For task 2, I implemented a two-step algorithm to learn a classifier from only positive and unlabeled data. In step 1, I used the Spy technique to extract reliable negative (RN) examples. In step 2, I used the labeled positive examples and the reliable negative examples as the training set to learn a standard Naïve Bayes classifier. The results show that the two-step algorithm significantly improves the F1 score compared to learning that simply regards all unlabeled examples as negative.

1. Introduction

The 2008 UC San Diego Data Mining Contest (http://mill.ucsd.edu/index.php?page=main) consists of two tasks, both of which are binary classification tasks based on data from a scientific experiment. The first task is a standard classification task: maximize classification accuracy on a test set, given a fully labeled training set. It involves 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, but there are roughly ten times as many negative examples as positive ones, so this is a typical class-imbalance problem. The second task is a positive-only semi-supervised learning task, which aims to maximize the F1-score on a test set given a partially labeled training set. This is also a binary classification problem, but most of the training examples are unlabeled; in fact, only a few of the positive examples have labels. The unlabeled examples contain both positive and negative instances, with several times as many negatives as positives. As in the standard classification task, there are 20 real-valued features, but they are not the same features. The goal is to classify the test-set examples as accurately as possible, evaluated using the F1 score. We call this setting PU-learning.

1.1 Learning from imbalanced data

The class-imbalance problem is prevalent in many applications, including fraud/intrusion detection, risk management, text classification, and medical diagnosis/monitoring [7]. It typically occurs when, in a classification problem, there are many more instances of some classes than of others.
In such cases, standard classifiers tend to be overwhelmed by the large classes and to ignore the small ones. In particular, they tend to produce high predictive accuracy on the majority class but poor predictive accuracy on the minority class. A number of solutions to the class-imbalance problem have been proposed, at both the data level and the algorithmic level. At the data level, the solutions include many different forms of re-sampling, such as over-sampling and under-sampling. These techniques modify the prior probabilities of the majority and minority classes in the training set to obtain a more balanced number of instances in each class. Under-sampling extracts a smaller set of majority instances while preserving all the minority instances; it is suitable for large-scale applications where the number of majority samples is enormous, since reducing the number of training instances shortens training time and makes the learning problem more tractable. In contrast, over-sampling increases the number of minority instances by sampling them additionally. At the algorithmic level, the solutions include adjusting the costs of the various classes so as to counter the class imbalance during training, adjusting the decision threshold, and so on.
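To make the two algorithmic-level options just mentioned concrete, here is a minimal sketch in Python/scikit-learn under my own assumptions; the report itself does not use these tools, and the cost ratio and threshold shown are illustrative values only.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # A hypothetical imbalanced problem (roughly 10:1), standing in for the contest data.
    X, y = make_classification(n_samples=4000, n_features=20, weights=[0.9, 0.1], random_state=0)

    # Algorithmic-level option 1: adjust class costs so errors on the minority class weigh more.
    clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}).fit(X, y)

    # Algorithmic-level option 2: keep the model, but move the decision threshold below 0.5.
    probs = clf.predict_proba(X)[:, 1]
    pred = (probs > 0.3).astype(int)   # predict the minority class more readily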

In Section 2, I investigate the techniques at the data level, i.e., re-sampling methods. I employed three re-sampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), over-sampling by duplicating minority examples, and random under-sampling. Using the rebalanced data, I trained three different classifiers, Decision Tree (C4.5), Naïve Bayes, and Neural Network (with one hidden layer), and used them to classify the test set.

1.2 Learning from only positive and unlabeled data

Consider a binary classification problem in which we are given a set P that is an incomplete set of positive instances and a set U of unlabeled instances containing both positive and negative instances, and we want to build a classifier that classifies the instances in U, or in a new test set, as positive or negative. This problem is called learning from only positive and unlabeled data, or PU-learning. PU-learning has many real-life applications. For example, there are over 1,000 specialized molecular biology databases, each of which defines a set of positive examples (genes/proteins related to a certain disease or function) but has no information about examples that should not be included (and it is unnatural to build such a set). Traditional classification techniques are inapplicable here, since they all require both labeled positive and labeled negative examples to build a classifier. Recently, a few algorithms have been proposed to solve this problem [2][3][4]. One class of algorithms is based on a two-step strategy; these algorithms include S-EM, PEBL, and Roc-SVM.

Step 1: Identify a set of reliable negative (RN) examples from the unlabeled set. In this step, S-EM uses the Spy technique, PEBL uses a technique called 1-DNF, and Roc-SVM uses the Rocchio algorithm.

Step 2: Build a set of classifiers by iteratively applying a classification algorithm, and then select a good classifier. In this step, S-EM uses the Expectation-Maximization (EM) algorithm with an NB classifier, while PEBL and Roc-SVM use SVM.

In this report, I implemented a two-step algorithm in which step 1 uses the Spy technique. After identifying a set of reliable negative examples (RN), I use P (the labeled positive examples) and RN to build a Naïve Bayes classifier. Section 3 gives the implementation details and the results.

2. Task 1: Learning Classifiers from Imbalanced Data Sets

2.1 Datasets

The training data set consists of 40,000 examples, each of which has 20 real-valued features; 3,636 of them are labeled 1 (positive examples) and 36,364 are labeled -1 (negative examples). There are no missing values in the data set. The test set consists of 10,000 examples, in which the two classes are evenly distributed. More information about these two data sets can be found at [1]. In the experiments, all data sets were converted to .arff format for use with Weka.

2.2 Re-sampling Techniques

SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling approach proposed in [8]. It generates synthetic examples in a less application-specific manner by operating in feature space rather than data space. The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples along the line segments joining it to any or all of its k nearest minority-class neighbors. Depending on the amount of over-sampling required, neighbors are chosen at random from the k nearest neighbors; the original implementation uses five nearest neighbors. For instance, if the amount of over-sampling needed is 200%, only two of the five nearest neighbors are chosen, and one sample is generated in the direction of each.
Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and one of its nearest neighbors; multiply this difference by a random number between 0 and 1; and add the result to the feature vector under consideration. This selects a random point along the line segment between the two specific samples (Fig 1) and effectively forces the decision region of the minority class to become more general.

Fig 1. Over-sampling with SMOTE. The minority class is over-sampled by taking each minority-class sample and introducing synthetic examples (blue circles) along the line segments joining it to any/all of its k (default = 5) nearest minority-class neighbors (red circles).
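The generation rule just described amounts to a few lines of code. Below is a minimal illustrative sketch, not the Weka filter used in the report; the function name smote and the use of NumPy/scikit-learn are my own assumptions.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote(X_min, n_synthetic, k=5, rng=np.random.default_rng(0)):
        """Generate n_synthetic samples along segments joining minority points
        to their k nearest minority-class neighbors (SMOTE-style)."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbor
        _, idx = nn.kneighbors(X_min)
        synthetic = []
        for _ in range(n_synthetic):
            i = rng.integers(len(X_min))                       # pick a minority sample
            j = idx[i][rng.integers(1, k + 1)]                 # pick one of its k neighbors
            gap = rng.random()                                 # random point on the segment
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.vstack(synthetic)

For the 900% over-sampling used in the experiments, n_synthetic would be 9 * 3,636 = 32,724 synthetic examples, bringing the positive class to 36,360 examples, as described next.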

In the experiment design, the positive examples were over-sampled by 900%, so that the size of the positive class becomes 36,360, roughly equal to the size of the negative class (36,364). The SMOTE technique is available in the Weka package as weka.filters.supervised.instance.SMOTE.

Over-sampling by duplicating the minority examples
To provide a contrast to SMOTE, I implemented a simple over-sampling approach that over-samples the minority class by simply duplicating the minority examples. In this experiment, each positive example was duplicated 9 times to make the size of the positive class roughly equal to the size of the negative class.

Random under-sampling
As mentioned previously, under-sampling extracts a smaller set of majority instances while preserving all the minority instances. In this experiment, I implemented an under-sampling approach that randomly selects a subset of examples from the majority class. For this data set, 3,720 negative examples were randomly selected from all 36,364 negative examples, and all 3,636 positive examples were preserved in the new training set.

2.3 Building Standard Classifiers

Using the new training data sets, I then trained three different classifiers: Decision Tree, Naïve Bayes, and Neural Network (with one hidden layer).

Decision Tree
Decision Tree classifiers were trained using each of the three rebalanced training sets. I used weka.classifiers.trees.j48.J48 from the Weka package; when building the tree, I selected the default pruning option.

Naïve Bayes
Similarly, Naïve Bayes classifiers were trained on the three new training sets using the weka.classifiers.bayes.NaiveBayes class. 5-fold cross-validation is selected.

Neural Network
Three-layer feed-forward neural networks (one hidden layer) were trained using the new data sets. I experimented with different numbers of hidden units and selected the one with the best accuracy. I used the default learning rate of 0.3 and momentum of 0.2. The training algorithm used is weka.classifiers.functions.neural.NeuralNetwork.

2.4 Results

Table 1 summarizes the accuracies on the test set of the classifiers trained using the different training sets. When training the Neural Network classifier, I experimented with different numbers of hidden units: 5, 11, 15, and 20. The accuracy reaches a plateau after 11 hidden units, so the table only gives the accuracy of the 11-hidden-unit Neural Network.

    Table 1. Effect of re-sampling techniques on classification accuracy on the test set

          no resampling   US      OSbD    OS_SMOTE
    DT    0.791           0.828   0.788   0.875
    NB    0.834           0.827   0.827   0.838
    NN    0.835           0.909   0.904   0.910

Notation: DT (Decision Tree), NB (Naïve Bayes), NN (Neural Network); no resampling (no re-sampling technique applied), US (random under-sampling), OSbD (over-sampling by duplication), OS_SMOTE (over-sampling with SMOTE). For NN, the number of hidden units is 11.

The results in Table 1 are plotted in Fig 2.

Fig 2. Effect of re-sampling techniques on imbalanced data: test-set accuracy of the Decision Tree, Naïve Bayes, and Neural Network (11 hidden units) classifiers under no resampling, US, OSbD, and OS_SMOTE.

For the Naïve Bayes classifier, none of the three re-sampling approaches significantly improves predictive accuracy on the test set; the accuracies are all around 0.83, roughly the same as the NB classifier trained on the original imbalanced data set. For the Decision Tree classifier, random under-sampling and over-sampling with SMOTE significantly improve the accuracy.
Over-sampling with SMOTE gives the best accuracy (0.875), an improvement of about 8 percentage points over the DT classifier trained directly on the imbalanced data. For the Neural Network, all three re-sampling techniques significantly improve predictive accuracy on the test set, and the Neural Network with SMOTE over-sampling gives the best accuracy of all classifier/re-sampling combinations. Thus, my best accuracy achieved is 0.91, which ranked 52nd among all 199 teams; the best accuracy in the ranking is 0.928.
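For illustration, the re-sampling and training pipeline of Sections 2.2-2.3 could be sketched as follows. This is a minimal Python/scikit-learn rendition under my own assumptions: the report itself used Weka, the synthetic data below merely stand in for the contest data, and MLPClassifier stands in for the Weka neural network.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neural_network import MLPClassifier

    # Hypothetical stand-in for the contest data: 20 features, roughly 10:1 class imbalance.
    X, y = make_classification(n_samples=40000, n_features=20, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
    rng = np.random.default_rng(0)

    def oversample_by_duplication(X, y, minority=1, times=9):
        """OSbD: append `times` extra copies of every minority example."""
        idx = np.where(y == minority)[0]
        dup = np.repeat(idx, times)
        return np.vstack([X, X[dup]]), np.concatenate([y, y[dup]])

    def random_undersample(X, y, majority=0, n_keep=3720):
        """US: keep all minority examples plus a random subset of the majority class."""
        keep = np.concatenate([np.where(y != majority)[0],
                               rng.choice(np.where(y == majority)[0], n_keep, replace=False)])
        return X[keep], y[keep]

    for tag, (Xb, yb) in (("OSbD", oversample_by_duplication(X_train, y_train)),
                          ("US", random_undersample(X_train, y_train))):
        for clf in (DecisionTreeClassifier(), GaussianNB(), MLPClassifier(hidden_layer_sizes=(11,))):
            clf.fit(Xb, yb)
            print(tag, type(clf).__name__, "accuracy:", round(clf.score(X_test, y_test), 3))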

3. Task 2: Learning Classifiers from Only Positive and Unlabeled Data Sets

3.1 Data Set

The training data set consists of 68,560 examples, each of which also has 20 real-valued features. Only 60 of them are labeled 1 (positive examples); the rest are unlabeled. There are again no missing values in the data set. The test set consists of 11,427 examples. More information about these two data sets can be found at [1]. In the experiments, all data sets were converted to .arff format for use with Weka.

3.2 Two-step Strategy

Theoretically, the PU-learning problem (learning from only positive and unlabeled data) is learnable [3][5]. A number of solutions have been proposed, among which is a class of algorithms based on a two-step strategy. In step 1, these algorithms use various techniques to extract a set of reliable negative examples from the unlabeled examples. In step 2, a classifier is then trained using the reliable negative examples obtained in step 1. In my project, I employed the Spy technique [3] in step 1 to extract reliable negative examples, and then built a Naïve Bayes classifier using the labeled positive examples and the reliable negative examples as the training set. The two-step strategy is illustrated in Fig 3.

Fig 3. Illustration of the two-step strategy for PU-learning. Step 1 extracts a set of reliable negative examples (RN) from the unlabeled set U; Step 2 builds the final classifier either iteratively, using P, RN and Q = U - RN, or directly, using only P and RN.

Step 1: The Spy technique to find reliable negative instances
The algorithm for the Spy technique is given in Fig 4. It first randomly selects a set S of positive examples from P and puts them in U (lines 2 and 3); the default value for s% is 15%. The examples in S act as spies sent from the positive set into the unlabeled set U. Because the spies behave similarly to the unknown positive instances in U, they allow the algorithm to infer the behavior of those unknown positives. The algorithm then runs I-EM using the set P - S as positive and the set U ∪ S as negative (lines 3-7); I-EM basically runs NB twice (see the I-EM procedure below). After I-EM completes, the resulting classifier uses the probabilities assigned to the examples in S to decide a probability threshold t, which is then used to identify possible negative examples in U and produce the set RN. See [3] for details. One thing to mention: since the 20 features are real-valued, I first discretized each feature i into 10 equal-width bins of width (x_i,max - x_i,min)/10, and then computed the posterior probabilities from the sufficient statistics.

Algorithm Step-1 (M denotes the unlabeled set; N will hold the reliable negatives and U the remaining unlabeled examples)
1.  N = ∅; U = ∅
2.  S = sample(P, s%)
3.  MS = M ∪ S
4.  P = P - S
5.  Assign every example in P the class label 1
6.  Assign every example in MS the class label -1
7.  Run I-EM(MS, P)
8.  Classify each example in MS
9.  Determine the probability threshold t using S
10. for each example e in M
11.     if its probability Pr[1|e] < t
12.         N = N ∪ {e}
13.     else U = U ∪ {e}

I-EM(M, P)
1.  Build an initial naïve Bayesian classifier NB-C using P as positive examples and M as negative examples
2.  Loop while the classifier parameters change:
3.      for each example e in M
4.          compute Pr[1|e]
5.      update Pr[x_i|1] and Pr[1] given the probabilistic labels Pr[1|e] and P

Fig 4. Algorithm for the Spy technique
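To make the spy step concrete, here is a minimal Python sketch under my own assumptions: it replaces the discretized Naïve Bayes and the full I-EM loop with a single GaussianNB fit from scikit-learn, and it chooses the threshold t as the smallest posterior assigned to any spy (one simple choice; [3] discusses threshold selection in detail). The function name spy_step and all array names are hypothetical, not part of the report.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def spy_step(X_pos, X_unlabeled, spy_frac=0.15, rng=np.random.default_rng(0)):
        """Step 1 of the two-step strategy: return indices of reliable negatives in X_unlabeled."""
        n_pos = len(X_pos)
        spy_idx = rng.choice(n_pos, int(spy_frac * n_pos), replace=False)   # S = sample(P, s%)
        spies = X_pos[spy_idx]
        P_minus_S = np.delete(X_pos, spy_idx, axis=0)

        # Treat (U ∪ S) as negative and (P - S) as positive, then fit an NB classifier
        # (a single fit stands in for the I-EM loop of the report).
        X = np.vstack([P_minus_S, X_unlabeled, spies])
        y = np.concatenate([np.ones(len(P_minus_S)),
                            np.zeros(len(X_unlabeled) + len(spies))])
        nb = GaussianNB().fit(X, y)

        # Threshold t: the lowest posterior Pr(1|x) received by any spy.
        t = nb.predict_proba(spies)[:, 1].min()
        pr_unlabeled = nb.predict_proba(X_unlabeled)[:, 1]
        return np.where(pr_unlabeled < t)[0]          # reliable negatives: Pr(1|x) < t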
Step 2: Building a Standard Naïve Bayes Classifier using P and RN
After step 1, we have obtained a set RN of examples that we believe are most likely negative. I then used the labeled positive examples (P) and RN to train a Naïve Bayes classifier. As in task 1, I used the weka.classifiers.bayes.NaiveBayes class in Weka. 5-fold cross-validation is selected.

3.3 Results

I ran two experiments. In the first experiment, I simply regarded all unlabeled examples (U) as negative examples; the training set then combines P and U, and a Naïve Bayes classifier was trained on it. The second experiment follows the two-step strategy described above.
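Continuing the hypothetical sketch above (it reuses the assumed spy_step function and scikit-learn's GaussianNB; the toy data generated here merely stand in for the contest data), the two experiments can be compared as follows.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import f1_score
    from sklearn.naive_bayes import GaussianNB

    # Toy PU setting standing in for the contest data: hide most positive labels.
    X, y = make_classification(n_samples=20000, n_features=20, weights=[0.8, 0.2], random_state=1)
    X_tr, y_tr, X_test, y_test = X[:15000], y[:15000], X[15000:], y[15000:]
    labeled_pos = np.where(y_tr == 1)[0][:60]          # only 60 labeled positives, as in task 2
    X_pos = X_tr[labeled_pos]
    X_unlabeled = np.delete(X_tr, labeled_pos, axis=0)

    # Experiment 1: treat every unlabeled example as negative.
    nb1 = GaussianNB().fit(np.vstack([X_pos, X_unlabeled]),
                           np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unlabeled))]))

    # Experiment 2: keep only the reliable negatives found by the spy step (sketch above).
    rn = X_unlabeled[spy_step(X_pos, X_unlabeled)]
    nb2 = GaussianNB().fit(np.vstack([X_pos, rn]),
                           np.concatenate([np.ones(len(X_pos)), np.zeros(len(rn))]))

    for name, nb in (("P vs U", nb1), ("P vs RN", nb2)):
        print(name, "F1 =", round(f1_score(y_test, nb.predict(X_test)), 3))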

The performance of each experiment was evaluated using the F1 score on the test set.

    Table 2. F1 score of the PU-learning classifiers

                 P, N=U   P, N=RN
      F1 score   0.545    0.651

Notation: F1 = 2 * Precision * Recall / (Precision + Recall).

When trained with P vs. U (that is, with U taken as the negative examples), the F1 score is 0.545. The two-step strategy gives F1 = 0.651, a significant improvement. The best score among all teams in the contest is 0.721, which means I still have a long way to go; nevertheless, the results show that the two-step strategy does improve the predictive power.

4. Conclusion and Discussion

In this report, I studied two challenging data mining tasks from the 2008 UC San Diego Data Mining Contest. The first task is to improve learning from imbalanced data sets. For this problem, I investigated a set of re-sampling approaches for improving learning from imbalanced data: random under-sampling, over-sampling by duplicating the minority class, and SMOTE (Synthetic Minority Over-sampling Technique). I then built three classifiers using the rebalanced data sets. The results show that, for Naïve Bayes, the three re-sampling techniques do not significantly improve classification accuracy on the test set. For the Decision Tree classifiers, random under-sampling and SMOTE significantly improve the accuracy. For the Neural Network, all three re-sampling techniques significantly improve the accuracy, and the Neural Network classifier with SMOTE gives the best accuracy of all classifier/re-sampling combinations. Although under-sampling achieves predictive accuracy comparable to SMOTE in this case, a problem associated with it is that we may lose informative instances among those discarded. The competition uses accuracy as the criterion for evaluating a classifier. This may not be a good choice, since higher accuracy does not necessarily imply better performance on the target task; AUC (area under the ROC curve) is generally a better performance measure than accuracy.

The second task is a problem of learning from a partially labeled data set: only a subset of the positive examples are labeled, the rest are unlabeled, and there are no labeled negative examples. For this task, I investigated a two-step strategy for learning from this type of data. In step 1, I used the Spy technique to extract reliable negative examples. In step 2, I used these reliable negative examples together with the labeled positive examples to learn a Naïve Bayes classifier. This two-step strategy gives a significantly better F1 score than simply treating all unlabeled examples as negative. Still, the best score among all teams in the contest is 0.721, far better than mine (0.651), which means I still have a long way to go. There are alternative two-step strategies: in step 1, other algorithms such as 1-DNF and the Rocchio algorithm can be used to identify reliable negative examples, and in step 2, SVM has been shown to be a better choice for building the final classifier. Future work is therefore to try these different two-step strategies, as well as non-two-step approaches. Class imbalance and PU-learning are problems that are still well worth studying.

References

[1] http://mill.ucsd.edu/index.php?page=datasets
[2] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179-188, 2003.
[3] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), July 8-12, 2002, Sydney, Australia.
[4] W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), August 21-24, 2003, Washington, DC, USA.
[5] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008).
[6] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. ICPR 2008: 1-4.
[7] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations 6(1): 1-6, 2004.
[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, Vol. 16, pp. 321-357, 2002.